WDT Search

WDT: Creating SEARCH friendly pages

The WDT is developing a new search engine to replace the current one. The new search engine is an in-house project, code-named "MUDIE" (Multi Domain Indexing Engine), developed by Jose L. Cuevas.

The current release is 1.0.0 alpha, release 8.

Prior solutions were limited to physical access to a file. This approach did not allow users with pages in other domains to take part in the university search, and because it read raw files it raised a security concern. Moreover, the solution was neither portable nor user friendly, since it required configuration files to provide limited security protection.

MUDIE is a controlled spider. It is controlled in two ways. First, it only crawls pages as directed. Second, the body of a page is not indexed, so the index will not reveal sensitive information or pages. The spider (indexer) uses regular HTTP and can therefore index any page in the uprm.edu domain regardless of the server. It only requires that every page contain a keywords meta tag and a description meta tag.

In order to provide a useful search service, it's important that your pages meet certain requirements. The WDT reserves the right to determine which sites are searched. If you want people to be able to search your site you must:

Ensure that your content is up-to-date
Avoid frames: only frameless pages are reachable from a search result
Use a keyword meta tag like this:
<meta name="keywords" content="uprm,sici4044,proyectos">
This tag will indicate the keywords that will match this page.
Use a description meta tag, like this:
<meta name="description" content="Recursos para el curso SICI4044">
This is the actual description that users will see when your page matches a search.

Once your pages meet these requirements, send an email to wmaster@uprm.edu with the URLs of the pages that you wish to get indexed. We can index an entire directory or domain, but only linked pages will be indexed.
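Before sending your URLs, it may help to verify that a page actually carries both meta tags. The following is a minimal sketch in Python, not part of MUDIE; the names MetaTagChecker and missing_meta_tags are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class MetaTagChecker(HTMLParser):
    """Collects the content of every <meta name="..."> tag in a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").strip()
        if name:
            self.meta[name] = content


def missing_meta_tags(html_text):
    """Return the required meta tag names that are absent or empty."""
    checker = MetaTagChecker()
    checker.feed(html_text)
    required = ("keywords", "description")
    return [name for name in required if not checker.meta.get(name)]


# Example page that has keywords but no description.
page = """<html><head>
<meta name="keywords" content="uprm,sici4044,proyectos">
</head><body>Hola</body></html>"""
print(missing_meta_tags(page))
```

An empty list means the page has both tags with non-empty content.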

Current Version of MUDIE:

The current version of MUDIE is 1.0a8. It is still an alpha release, so expect many changes. We are still defining the first feature set for the final release of 1.0. The main efforts are on reducing redundancies in the index tables and on providing a default set of search options like those seen in many search engines.

We plan to release MUDIE as an open source project. It was hard for us to find a practical search engine to use on our site. None of the freely available ones met our requirements, so we decided to roll our own. We are confident that this tool can be of great help to others, including other universities in the UPR system.

MUDIE Indexing Rules:

Absolute and relative URLs can be indexed.
A URL can point to a directory.
URLs with CGI parameters will be indexed.
Links produced in dynamic pages will be indexed regardless of their lifespan.
A page without keywords will not be indexed.
Links to files (e.g. PDFs, JPGs, DOCs, etc.) will be indexed by their name and file type.
Anchors will not be indexed.
Only sites in uprm.edu domain are indexed.
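Two of these rules (anchors are not indexed, and only uprm.edu sites are indexed) can be sketched as a small filter. This is an illustration in Python under our own assumptions, not MUDIE's code; the function name indexable_url is hypothetical.

```python
from urllib.parse import urldefrag, urlsplit

ALLOWED_DOMAIN = "uprm.edu"


def indexable_url(url):
    """Return the URL without its anchor, or None if it is outside the domain."""
    # Drop the "#anchor" part; CGI parameters in the query string are kept.
    without_anchor, _fragment = urldefrag(url)
    host = (urlsplit(without_anchor).hostname or "").lower()
    # Accept uprm.edu itself and any host below it (www.uprm.edu, ac.uprm.edu).
    if host == ALLOWED_DOMAIN or host.endswith("." + ALLOWED_DOMAIN):
        return without_anchor
    return None


print(indexable_url("http://ac.uprm.edu/cursos/index.php?id=4#top"))
print(indexable_url("http://www.example.com/"))
```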

MUDIE Indexing Limitations:

Relative URLs that use ../ or ./ generate an invalid URL and are therefore ignored.
File names of the form name.morename.ext will cause the indexer to add an improper entry.
The crawler algorithm still needs improvement to detect conditions in which the indexer would crawl a path it has already visited when the URL itself does not reveal this.
Code optimization is required to avoid reaching the 30 s execution limit imposed on the PHP code under certain circumstances.
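For reference, the ../ and ./ forms are ordinary relative references, and standard URL resolution turns them into valid absolute URLs. The sketch below uses Python's urllib.parse.urljoin to show the expected result; it is not MUDIE's ckURL code, and resolve_link is a hypothetical name.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


base = "http://www.uprm.edu/biology/courses/index.html"
print(resolve_link(base, "./labs.html"))
# http://www.uprm.edu/biology/courses/labs.html
print(resolve_link(base, "../faculty/index.html"))
# http://www.uprm.edu/biology/faculty/index.html
```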

Using MUDIE:

To use MUDIE, just link to "http://www.uprm.edu/search".
You can also add your own search box to your pages: just copy the form section of the page found at "http://www.uprm.edu/search". Notice that this form has a field called "site", which accepts two kinds of values. The value "-empty" tells the search engine to search pages everywhere in the domains indexed by MUDIE. Alternatively, you can set "site" to a valid domain or relative URL to restrict the search to that location. For example, to search only in "ac.uprm.edu", set "site" to "ac.uprm.edu". To search only the biology department, set "site" to something like "www.uprm.edu/biology"; now only pages in biology will be matched. When you create your own search form, you must still allow users to search the whole university.

An example form would be like this:

<form action="http://www.uprm.edu/search/index.php" method="post">
    <table border="0" cellpadding="0" cellspacing="2" width="400">
        <tr><td>
            <input type="text" name="q" value="" size="56" maxlength="255">
        </td></tr>
        <tr><td align="center" valign="top">
            <font size="1" face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular">
            <input type="radio" value="ac.uprm.edu" name="site"> ac.uprm.edu  
            <input type="radio" value="-empty" name="site"> all uprm  
            <input type="submit" value="Search">
            </font>
        </td></tr>
    </table>
</form>
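
The same form submission can be prepared from a script. The sketch below, in Python, only builds the POST request for the "q" and "site" fields shown in the form above; it does not send it. The function name build_search_request is hypothetical, and whether the search page accepts scripted requests is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "http://www.uprm.edu/search/index.php"


def build_search_request(query, site="-empty"):
    """Build (but do not send) a POST request equivalent to the form."""
    # "-empty" searches every indexed domain; any other value limits the search.
    body = urlencode({"q": query, "site": site}).encode("ascii")
    return Request(SEARCH_URL, data=body)


request = build_search_request("sici4044 proyectos", site="www.uprm.edu/biology")
print(request.get_method())
print(request.data.decode("ascii"))
```

Passing the request to urllib.request.urlopen would submit it.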

Version History:

1.0.8a:

Redesigned indexer. Classes were created to handle URLs (ckURL).
The ckURL class now implements most of the code to handle links in pages.
Rewrote the code that parses URLs.
Added methods to compare different URLs that yield the same result, thereby avoiding loops.
The ckURL class has new methods to normalize relative URLs. Still have to add support for "../", "./".
Added a history of indexed links to avoid loops. Still have to work on paths crawled to avoid unnecessary crawling (code started).
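
The idea of a link history combined with URL comparison can be sketched as follows. This is a Python illustration, not the ckURL implementation; canonical_key and CrawlHistory are hypothetical names. Treating "URL/" and "URL" as the same entry follows the 1.0.5a note below, and the lowercasing is our own assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url):
    """Reduce equivalent URL spellings to one key (anchor dropped)."""
    parts = urlsplit(url)
    # "URL/" and "URL" are treated as the same entry.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class CrawlHistory:
    """Remembers which links were already indexed to avoid loops."""

    def __init__(self):
        self._seen = set()

    def should_visit(self, url):
        key = canonical_key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


history = CrawlHistory()
print(history.should_visit("http://www.uprm.edu/biology/"))  # True
print(history.should_visit("http://WWW.uprm.edu/biology"))   # False
```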

1.0.5a:

Added GNU GENERAL PUBLIC LICENSE
URL/ and URL will no longer be treated as different entries
Queries are parsed to allow special syntax in the query string, using the new class ckQuery. Added Google-like features.
Query may now include phrases using the "phrase" syntax.
Query may include a partial URL to limit the search locations using the site:url/ syntax.
Query will remove ambiguous keywords like single characters, web page names, file extensions, and common pronouns.
Indexer will recognize different forms of URLs to the same target.
[WIP] Testing possible implementation of soundex values in keyword to allow misspelled words to be matched.
[WIP] Adding support for special meta tags:
- rumrelated meta tag. This tag will have a series of URLs that are related in content to this page.
- rumbldg meta tag. This is a free string that indicates the location of a department or office. For example <meta name='rumbldg' content='Monzón 101'>.
- rumtype meta tag. This is a free string that indicates the type of resource for the matched page. Possible values are "dept","office","employee","course","program", "stdasoc", "misc". The type will determine what other special tags will be used.
- rumname meta tag. This is the name of the department or office, name of professor or other employee, name of a program, a course name or name of a student asoc.
- rumtel meta tag. This is the telephone number where you may contact the entity represented by this page. DONE
- rumemail meta tag. This is the email address where you may contact the entity represented by this page. DONE
- meta tag. The value of this tag is a string which has the full name of the entity as it appears in the Virtual Directory of the university. When this tag is included an option will be added to allow users to download the vcard file of the entity.
- rumicon meta tag. The value of this tag is an absolute URL with the path to a (16*16 pixels) gif or png image that represents your department, office, program, or student asoc. The background must be transparent or white. DONE
- rumhome meta tag. The value of this tag is the absolute URL of the home page or parent category for the particular page. DONE
- rumcategory meta tag. Not defined.
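
The query syntax introduced in this release ("phrase" and site:url/) can be illustrated with a small parser. This is a Python sketch under our own assumptions, not the ckQuery class; parse_query is a hypothetical name, and of the ambiguous keywords listed above it drops only single characters.

```python
import re

# A token is either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query):
    """Split a query into phrases, plain keywords and an optional site filter."""
    phrases, keywords, site = [], [], None
    for phrase, word in TOKEN.findall(query):
        if phrase:
            phrases.append(phrase)
        elif word.startswith("site:") and len(word) > len("site:"):
            site = word[len("site:"):]
        elif len(word) > 1:  # single characters are too ambiguous to match
            keywords.append(word.lower())
    return {"phrases": phrases, "keywords": keywords, "site": site}


print(parse_query('"proyecto final" sici4044 a site:ac.uprm.edu/'))
```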

1.0.1b:

Basic functionality.
Different forms of URLs that point to the same resource will create double entries in the index table. Examples of these are www.uprm.edu/index.php, www.uprm.edu/ or www.uprm.edu. This will be fixed in a future version.
No mechanism to validate URLs. This is very important when URLs in the index table age or are moved or renamed.
Indexer may create a heavy load on a server while crawling; no sleep or crawl balancing is supported.
MUDIE will fail to detect a URL to a resource that requires authentication.

GNU General Public License:

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

NOTE: Source code is not yet available to the general public. We are in the process of
setting up a CVS server.

Download GPL

Copyright 2002, Jose L. Cuevas, jose@uprm.edu

Contact the WDT for more information.

Last Revision: Jan. 18, WDT, wmaster@uprm.edu