Michael's Document Search Engine

Amazon In a distributed system course I took we decided to do a project with Google API.

Our vision to create a document search engine so that the user can find a document on the Internet that has a high degree of similarity to the given document. We call this system 'GoogleDoc'.

Possible applications include checking for plagiarism in the academia, learning more about a big topic, and checking to see whether an idea is already taken.

We use JICOS to do distributed and parallel computing so that computation of results of a query can be returned in the shortest amount of time possible.

One thing worth noting is that Google's rank is based on hundreds of factors, many of which are related to the popularity of the web page. Therefore the highest ranked web page does not even have to contain similarity the query document. The goal of a document search engine is that if there exists such a similar document on the Web, we should find it and it should be ranked at the very top.

[2012/12/25] Later Google did come out with a service called Google Doc which was renamed Google Drive later.

Therefore we came up with our own ranking mechanism that assigns each returned URL a score based on purely the similarity and relevancy. The user can choose to display the results according to Google rank, Our rank, or alphabetical order.

Also Google API imposes several constraints such as the number of queries one can make and so on. Since our underlying search system relies on Google API completely, we have no control over that.

Our system is written in 100% pure Java. We use JSP technology so that our system can be made accessible to anyone who has an Internet-enabled browser. Here are a couple of snapshots of GoogleDoc's user interface and results' display:

Document Search Engine Result Page

As you can see Ice and Fire coexist in 'GoogleDoc', meaning that it is able to perform well under situations of both extremes.

Enjoy the following random pages..

This website gives you tips on technical subjects and general topics!

This is a visual search engine acquired by Google in 2010.

This is a voice-driven calculator program written in VB.

This is a random maze generating program written in C.

Post your comment below.
Anything is okay.
I am serious.