xinabse is not a big search engine

Templates

Blocks
Tags
Result tags
Form values

Development and using the API
Known Bugs
License

xinabse is not a big search engine

Requirements

Perl: versions 5.6.1 and 5.8.0 were tested
libwww-perl: Get it from CPAN at http://search.cpan.org/search if it is not included in your distribution.
HTML-Parser: Get it from CPAN at http://search.cpan.org/search if it is not included in your distribution.
PHP: Version 4.x
MySQL: any recent version (3.2x) should work

Installation

create_tables.sql

xinabse_

spider/xinabse.conf

frontend/xinabse.php

frontend/

Using the spider

When using the spider, be sure to be inside the xinabse/spider/ directory or it will not find its config files.

For a short overview of the spider's available options, run

 ./xinabse-spider.pl --help

All options may be abbreviated (--he is the same as --help).

Adding a new site

To add a site, just run xinabse-spider.pl with the --add option and give the site's URL as parameter. When giving a top-level address as start URL, be sure to add a trailing slash.

The spider will immediately start to spider the given URL using the default recursion depth given in the config file. Alternatively, you can give the level for that URL with the --level option.

 Examples:
 
   add splitbrain.org with the default depth
   
    ./xinabse-spider.pl --add 'http://www.splitbrain.org/'

   add splitbrain.org with a spider depth of 2
   
    ./xinabse-spider.pl --level 2 --add 'http://www.splitbrain.org/'
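
If you have several sites to add, the calls above can be wrapped in a small loop. The following is only a sketch: the function name add_sites, the SPIDER variable and the idea of keeping the URLs in a text file (one per line) are assumptions made for illustration and are not part of xinabse. Only the --add and --level options are taken from the description above.

```shell
#!/bin/sh
# Sketch: add every URL listed in a file, one spider call per URL.
# SPIDER is a hypothetical override; by default the spider in the
# current directory is used, so run this from xinabse/spider/.
SPIDER="${SPIDER:-./xinabse-spider.pl}"

add_sites() {
    list="$1"    # file with one start URL per line
    level="$2"   # optional recursion depth; empty = default from the config file
    while IFS= read -r url || [ -n "$url" ]; do
        # skip blank lines and lines starting with '#'
        case "$url" in ''|'#'*) continue ;; esac
        if [ -n "$level" ]; then
            "$SPIDER" --level "$level" --add "$url" </dev/null
        else
            "$SPIDER" --add "$url" </dev/null
        fi
    done < "$list"
}

# usage (file name is an example): add_sites sites.txt 2
```

The spider's standard input is redirected from /dev/null so that it cannot swallow the remaining lines of the URL list.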

To revisit all sites that are older than the time given in the configuration file, just run xinabse-spider.pl without any arguments. Optionally, you can specify the timespan (in hours) on the command line with the --reindex-after option.

Maintaining the database

You can delete a site and all its subpages by using the --delete option:
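
Since the spider has to be started from inside its own directory, a small wrapper can be handy when reindexing from cron or a similar scheduler. This is a minimal sketch, not something shipped with xinabse: the function name reindex and the variables XINABSE_DIR and SPIDER are assumptions. Only the --reindex-after option and the spider/ directory come from this document.

```shell
#!/bin/sh
# Sketch: change into the spider directory, then revisit outdated sites.
# XINABSE_DIR (hypothetical) must point to the xinabse installation.
reindex() {
    hours="$1"   # optional timespan in hours; empty = value from the config file
    (
        # a subshell keeps the caller's working directory unchanged
        cd "${XINABSE_DIR:?set XINABSE_DIR to the xinabse directory}/spider" || exit 1
        if [ -n "$hours" ]; then
            "${SPIDER:-./xinabse-spider.pl}" --reindex-after "$hours"
        else
            "${SPIDER:-./xinabse-spider.pl}"
        fi
    )
}

# usage: XINABSE_DIR=/path/to/xinabse reindex 48
```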

  Example:

    delete splitbrain.org from the database

      ./xinabse-spider.pl --delete 'http://www.splitbrain.org/'

You may even clear the complete database by using the --delete-all option. You will be asked to confirm this by entering yes.

  Example:

      ./xinabse-spider.pl --delete-all

Templates

You can use the included xinabse.php frontend to query the search index. It uses a template to customize the page design. Please note that you have much more freedom when using the API as described in the next chapter.

When designing templates for xinabse you have to deal with some placeholders, which will be replaced by the values your search returns.

There are two kinds of placeholders in xinabse templates: blocks and tags. All placeholders are defined by curly brackets.

You should have a look at the default template xinabse.tpl for example usage.

Blocks

Let's have a look at the blocks first. Blocks are used to define areas in your
template. They start with a BEGIN tag and end with an END tag; everything
between these tags is the content of the block. Each block may occur only
once in your template! The following blocks are available:

{BEGIN RESULT} {END RESULT}: This block is only used if a search returned at least one result; otherwise its content is removed from the output.
{BEGIN RESULTROW} {END RESULTROW}: Inside this block you can define how a single result should look (see result tags below). It will be repeated for each result. This block has to be inside the RESULT block!
{BEGIN NORESULT} {END NORESULT}: This block is only shown if a search returned no results.

Please note that all these blocks are removed from the output when no query was given.
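
To illustrate how the three blocks nest, the following sketch writes a minimal template skeleton to a file. The function name write_template and the surrounding HTML are assumptions chosen for the example; the shipped xinabse.tpl remains the authoritative reference. Putting each block marker on its own line is likewise just a layout choice of this sketch. The {URL} and {TITLE} tags used inside RESULTROW are described in the next section.

```shell
#!/bin/sh
# Sketch: write a minimal template skeleton to the file given as $1.
# The HTML around the placeholders is an arbitrary example.
write_template() {
    cat > "$1" <<'EOF'
<html>
<body>
{BEGIN RESULT}
<ol>
{BEGIN RESULTROW}
  <li><a href="{URL}">{TITLE}</a></li>
{END RESULTROW}
</ol>
{END RESULT}
{BEGIN NORESULT}
<p>Nothing found.</p>
{END NORESULT}
</body>
</html>
EOF
}

# usage (file name is an example): write_template my.tpl
```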

Result tags

The following tags may only be used inside the RESULTROW block. They are simply replaced by the values of the current result.

{NUMBER}: The number of the result
{TITLE}: The title tag of the page if there was one, otherwise the URL. Special chars are HTML quoted.
{URL}: The URL of the site.
{EXTRACT}: If your spider was set to save extracts (see the spider configuration), this contains the meta description of the page if there was one, otherwise the first few lines of the page. Special chars are HTML quoted.
{LASTMODIFIED}: The last modified date of the page. This is a string as returned by the database.
{WEIGHT}: The weight of this site for the current query. Results are sorted by descending weight.
{RELEVANCE}: The relevance of this site in percent. This is calculated from the weight. The first hit always has a relevance of 100 percent; all others have a relevance proportional to the first hit's weight.
{DOMAIN}: Only the domain part of the URL.
{PATH}: Only the path part of the URL.
{PAGE}: Only the page part of the URL.
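
The relation between {WEIGHT} and {RELEVANCE} can be shown with a small calculation. This is an illustrative sketch only: the function name relevance and the weights in the usage line are made up, and rounding to the nearest whole percent is an assumption, since this document does not say how xinabse rounds.

```shell
#!/bin/sh
# Sketch: relevance = weight / weight of the first hit * 100.
# Arguments are weights in descending order (first hit first).
relevance() {
    printf '%s\n' "$@" | awk '
        NR == 1 { top = $1 }    # remember the weight of the first hit
        {
            # round to the nearest whole percent (an assumption)
            r = (top > 0) ? int($1 / top * 100 + 0.5) : 0
            print r
        }'
}

# usage: relevance 8 6 2    prints 100, 75 and 25 on separate lines
```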

Form values

When writing a template for xinabse you can use the following parameter names for your FORM tag.

tpl: The path to your template. For security reasons neither absolute pathnames nor relative paths above the path of xinabse.php are allowed. However, you may define a default template by setting a $tpl variable at the top of the xinabse.php file. This parameter defaults to xinabse.tpl.
start: This defines the offset from where the results start. This defaults to 0 and should not be set to anything other than 0 in your form.
limit: Defines how many results are shown per page. Defaults to 20.
q: This is the actual search query and has to be in your form definition!
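
The interplay of start and limit is simple arithmetic: result page N begins at offset (N - 1) * limit. The sketch below builds the matching parameter string. The function name page_query is hypothetical, the query is assumed to be URL-encoded already, and whether xinabse.php accepts these values as GET parameters is an assumption you should verify against your installation.

```shell
#!/bin/sh
# Sketch: build the parameter string for result page N (1-based).
# Parameter names q, start and limit are the form values listed above.
page_query() {
    q="$1"              # search query, assumed to be URL-encoded already
    page="${2:-1}"      # 1-based page number
    limit="${3:-20}"    # results per page; 20 is the documented default
    start=$(( (page - 1) * limit ))
    printf 'q=%s&start=%s&limit=%s\n' "$q" "$start" "$limit"
}

# usage: page_query perl 3    prints q=perl&start=40&limit=20
```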

Development and using the API

This software is GPL - see License!

If you want to adjust the spider to your needs, have a look at the perldoc comments in the common.pl file. All functionality is in this file.

If you don't want to use xinabse's template engine, you can include the xinabse-api.php file in your PHP code and use the xinabse function to run search queries.

Here's a short version of it:

  function xinabse($query,&$resultarray,$all=false,$offset=0,$limit=20,$domain=""){
    global $XINABSE_SERVER;
    global $XINABSE_DATABASE;
    global $XINABSE_USER;
    global $XINABSE_PASSWD;
    //query database
    //place mysql-result into $resultarray
    //return count of all results
  }

As you can see, it uses some globals to connect to the database and stores the results in a given array (which has to be passed by reference). Only the results from offset to offset+limit are placed in the array, but the function returns the number of all results.

The resultarray contains objects with the following properties: title, lastmodified, url, extract, weight, domain, path, page, number and relevance. The placeholder descriptions in the template section of this document should give you a hint about what they contain.

Here is a list of the accepted parameters and their meaning.

$query: String. Contains the search query as given by the user.
$resultarray: Array. Results will be written back to this array.
$all: Boolean. Optional - defaults to false. If true, all given keywords must exist on a page for it to be a valid result (AND search). If false, one keyword is enough (OR search).
$offset: Integer. Optional - defaults to 0. Results beginning from result number offset are returned. This is used for pagination of search results.
$limit: Integer. Optional - defaults to 20. Maximum number of results that are returned.
$domain: String. Optional - defaults to an empty string. Only sites from the given domain name are searched.

Known Bugs

Due to some limitations of the MySQL database engine there is a slim chance of sites being inserted more than once when indexing with more than one process. However, these duplicates should be detected and removed by xinabse's cleanup function, which is executed after each run.

Furthermore, some "lost database connection" errors may occur. I'm not sure what causes them. Indexing with fewer processes may help.

Only the latin1 (ISO-8859-1) charset is supported by xinabse. While Perl 5.8 supports Unicode natively, there is currently no such support in MySQL. To make xinabse compatible with older Perl versions and the current MySQL releases, all characters other than latin1 alphanumeric characters are replaced with spaces. However, a skilled programmer should find it easy to adjust xinabse to work with any other ISO-8859 charset.

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING for details.