How does Google Desktop figure out what documents to index? Why does it only work for Microsoft documents and not Adobe PDF? Why support only AOL Instant Messenger? To get answers, we start from the “Getting Started Guide”, which lists what Google Desktop can search:
- Your Outlook and Outlook Express email.
- The web pages in your Internet Explorer web history.
- Your AIM chats.
- The files and folders on your hard drive (the ones you actually look at, not the system files only your computer uses).
What’s really interesting is how it goes about finding these documents. The Google Desktop indexer actually makes use of the “secret” index.dat files found in Microsoft’s operating systems. The “index.dat” file, discovered several years ago, records the URLs you visit, the email you send and receive, your AIM chats, your cookies and apparently everything you open with the file manager. You can discover which apps use “index.dat” with the Sysinternals Handle tool. Pretty devious, don’t you think?
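If you want to see where these files live on your own machine, a minimal Python sketch like the one below walks a directory tree and lists every file named index.dat. The function name `find_index_dat` and the choice of the user profile folder as a starting point are my own assumptions for illustration; nothing here comes from Google Desktop or Windows itself.

```python
import os


def find_index_dat(root):
    """Return a sorted list of paths to files named index.dat under root.

    The name comparison is case-insensitive, since Windows file names are.
    """
    matches = []
    # os.walk silently skips directories it cannot read.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower() == "index.dat":
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Assumption: the user profile folder is a sensible place to start looking.
    for path in find_index_dat(os.path.expanduser("~")):
        print(path)
```

This only tells you where the files are, not which programs have them open; for that you still need something like the Handle tool.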
Another curious aspect of Google Desktop is its inability to index documents other than Microsoft-specific ones (e.g. Outlook, Word, Excel, PowerPoint). Why doesn’t it work for an equally popular format like PDF? Could there be something built into Windows that allows easy access to the text found in these documents? “Index.dat” perhaps?
It’s all so convenient, and that is what makes it scary: a lot of spyware already exploits this. Google Desktop could make it even easier to exploit, allowing for easier discovery of confidential data on your computers! Now you have to make a call. Do you trade convenience for a potentially massive privacy hole? The problem is compounded when you have no option but to use Internet Explorer, the primary entry point for most spyware programs.
This brings up another big question. Why didn’t Google allow the user to choose the directories to be indexed? Was this a strategy to prevent its use as a general-purpose search engine like Google Appliance? I understand the convenience of not having to specify what needs to be searched, but sometimes you just need more fine-grained control. This taking away of control reminds me of Google AdSense.
In addition, how do you choose where Google Desktop places its indices? Right now it’s in “Documents and Settings”\User\”Application Data”. Is there a way of controlling its size, maybe limiting it? If we were to follow the Gmail paradigm, we would change IE’s settings to never get rid of its history. Clearly a boon for hard drive manufacturers.
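Until there is a setting for this, you can at least keep an eye on how big the index gets. Here is a rough Python sketch that adds up the size of every file under a folder. The function name `directory_size_bytes` is my own, and you would have to point it at wherever the index actually lives on your machine; I am not assuming any particular folder name inside Application Data.

```python
import os
import sys


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file vanished or is locked by another program; skip it.
                pass
    return total


if __name__ == "__main__":
    # Usage: python dirsize.py <folder>   (defaults to the current folder)
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    size_mb = directory_size_bytes(folder) / (1024 * 1024)
    print(f"{folder}: {size_mb:.1f} MB")
```

Run it every few days against the index folder and you will see how quickly it grows.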
[update] Here’s an even scarier thought. What if a state passed a law that required all computers to use a variant of this program? A program that would open a port accessible only by the state. Look at the Google Hacking Database (GHDB) to get a glimpse of what can be exploited. Combine this with the capability inside Microsoft’s OS and Google Desktop. You are looking at the beginning of something truly evil.