Over the last several days, I have been making great progress with an HTML page index (cross reference) facility. The key effort here is parsing: separating each run of symbols into meaningful words. My first attempt simply applied the same set of rules to the entire HTML. This was surprisingly good. The parser was then rewritten to be context sensitive. It applies different rules inside the angle brackets than to the visible text. The main object of this is to follow links to script and CSS files so that they can be indexed as well. This is working rather well.

I have written a first attempt parser for CSS files and need to extend this to be context sensitive. For now, I am expecting to cross reference inside HTML comments because it is common to "hide" tags inside of comments. For example, script content is often surrounded by HTML comments so that browsers which do not support script tags will ignore the script code. CSS comments appear to serve only as comments and have no effect on the HTML code that the CSS is embedded in. So, I will be skipping over CSS comments. The indexing code can merge and display two files as one.

In a few weeks, I think I will have a deployable indexing capability. My idea for deployment is to collect URL and email information via a web site, perform the indexing offline, and email the report when complete. Perhaps this idea will evolve, but it is clear that I want a web site for this new program.

I always thought of what I am doing as cross referencing. However, consulting a dictionary indicates that the correct term is indexing (as in the index at the back of a book). Google webmaster tools show that "indexing" is a much more common search term. Perhaps those searches are people looking for information about the index.html file. If so, the overlap cannot really be helped.

My new website is http://www.HtmlPageIndex.com [update: site taken down, link removed] and is presently under construction.
I registered the domain name and put the site up now so that the search engines will start indexing it. The graphic is a screenshot of the page. It is not beautiful, but it is only a quick and dirty first effort.
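
For readers who want a concrete picture of the context-sensitive parsing described earlier, here is a minimal Python sketch. It is not the actual code behind the program. The function names (index_html, index_css, merge_indexes), the word patterns, and the rule of switching context at every angle bracket are all assumptions made for illustration. It uses one word rule for visible text and another inside angle brackets, collects links to script and CSS files, skips CSS comments, and merges the indexes of two files into one. Text inside HTML comments is still indexed, in keeping with the approach above. A real parser would also need to cope with angle brackets inside script code and attribute values, which this sketch does not.

```python
import re
from collections import defaultdict

# Visible text: plain words only.
TEXT_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Inside angle brackets: keep dots, slashes, colons and hyphens so that
# attribute values such as file names stay in one piece.
TAG_WORD = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_./:-]*")
# src/href attributes that point at a script or CSS file.
LINK_ATTR = re.compile(
    r"""(?:src|href)\s*=\s*["']([^"']+\.(?:js|css))["']""", re.IGNORECASE
)
CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
CSS_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def index_html(html):
    """Return (index, links): word -> sorted line numbers, and linked files."""
    index = defaultdict(set)
    links = []
    line, pos, in_tag = 1, 0, False
    while pos < len(html):
        # The current context decides which bracket ends this chunk.
        end = html.find(">" if in_tag else "<", pos)
        if end == -1:
            end = len(html)
        chunk = html[pos:end]
        pattern = TAG_WORD if in_tag else TEXT_WORD
        for match in pattern.finditer(chunk):
            word_line = line + chunk.count("\n", 0, match.start())
            index[match.group().lower()].add(word_line)
        if in_tag:
            links.extend(LINK_ATTR.findall(chunk))
        line += chunk.count("\n")
        pos = end + 1
        in_tag = not in_tag  # Simplification: every bracket flips the context.
    return {word: sorted(lines) for word, lines in index.items()}, links


def index_css(css):
    """Index a CSS file, skipping comments but keeping line numbers intact."""
    stripped = CSS_COMMENT.sub(lambda m: "\n" * m.group().count("\n"), css)
    index = defaultdict(set)
    for number, text in enumerate(stripped.split("\n"), start=1):
        for match in CSS_WORD.finditer(text):
            index[match.group().lower()].add(number)
    return {word: sorted(lines) for word, lines in index.items()}


def merge_indexes(named_indexes):
    """Merge several per-file indexes into word -> [(file name, line), ...]."""
    merged = defaultdict(list)
    for name, index in named_indexes.items():
        for word, lines in index.items():
            merged[word].extend((name, number) for number in lines)
    return dict(merged)
```

A typical use would be to call index_html on a page, fetch each file named in the returned links, run index_css on the CSS ones, and pass everything to merge_indexes so the report shows the page and its style sheet as one index.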