Web indexing: extending the functionality of HTML Indexer
HTML Indexer (www.html-indexer.com) is the only commercial stand-alone indexing tool that is designed solely for the indexing of web sites. For a review of the software, see Heather Hedden's article in the previous issue of The Indexer (www.hedden-information.com/Indexer_Apr_06_Hedden.pdf).
This article arises from issues that we had to overcome at TechScribe when creating the index on www.techscribe.co.uk/techw/a-z-index.htm. It shows how to extend the functionality of HTML Indexer by including special codes in the entries, then post-processing the generated HTML to obtain final HTML. (To avoid long-winded sentences, we use the term generated HTML to mean the output from HTML Indexer and the term final HTML to mean the HTML code that is used in the index page itself.)
Design goals for the index
The design goals for the index are:
- Follow best-practice guidelines for web indexing. The American Society of Indexers has a web indexing special interest group that lists guidelines (www.web-indexing.org/practices.htm).
- Match the existing visual design of the TechScribe web site, which includes having different outputs for the screen version and the printed version of the index (and doing that without using a special 'print-friendly' page).
- Conform to the World Wide Web Consortium (W3C) specifications and pass all validation tests (http://validator.w3.org).
- Avoid the manual editing of generated HTML (the TechScribe web site is updated every few weeks, so, for efficiency, all changes to the code generated by HTML Indexer must be done programmatically).
Figure 1 shows an example from the screen version of the finished index.
Figure 1. Part of the finished index
Limitations of HTML Indexer
The specific limitations of HTML Indexer with respect to the design goals are:
- It does not permit the indexer to distinguish links to non-HTML pages (such as PDF files, Word files, and video files).
- If there are subheadings for a heading, creating a hyperlinked heading is cumbersome.
- A see also cross-reference appears as a separate entry.
- It is not possible to create generated HTML that conforms to the site requirements for heading letters and 'top of page' links.
- The generated HTML is not fully compliant with W3C standards.
The next section shows how to resolve these limitations.
Solutions
HTML Indexer generates HTML code that is consistent (unlike some help authoring tools). That means it is straightforward to manipulate the generated HTML programmatically.
There are three basic approaches:
- Manipulate the generated HTML. For example, this is done to create the final HTML for the heading letters and 'top of page' links and to tidy the generated HTML to make it conform to W3C guidelines.
- Add arbitrary text to an entry and then manipulate the generated HTML. For example, +j in an index entry is converted to HTML code that marks the start of a citation (<cite>) and j+ is converted to HTML code that marks the end of a citation (</cite>).
- Use HTML Indexer in a non-standard way. This method is used for creating hyperlinked main headings where there are also subheadings, and to create see also cross-references.
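The second approach (codes in an entry that are converted during post-processing) can be illustrated with a short Python sketch. It is not the Word macro that TechScribe uses. It only shows the idea, using the +j and j+ codes described above. The function name expand_codes is hypothetical, and the sketch assumes that the codes never occur as ordinary text in the index.

```python
import re

# Codes from the article: "+j" opens a citation and "j+" closes it.
CODE_TO_TAG = {"+j": "<cite>", "j+": "</cite>"}

# One pattern that matches either code literally.
_CODE_PATTERN = re.compile("|".join(re.escape(code) for code in CODE_TO_TAG))


def expand_codes(generated_html: str) -> str:
    """Replace each entry code in the generated HTML with its HTML tag.

    Assumption: the codes do not appear as normal text in any entry.
    """
    return _CODE_PATTERN.sub(lambda m: CODE_TO_TAG[m.group(0)], generated_html)


if __name__ == "__main__":
    sample = '<a href="x.htm">+jThe Indexerj+ review</a>'
    print(expand_codes(sample))
```

Keeping the codes in one dictionary means that a new code needs only a new dictionary entry, with no change to the replacement logic.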
Figure 2 shows the index entries in HTML Indexer, and Figure 3 shows the generated HTML in a web browser.
Figure 2. Entries in HTML Indexer
Figure 3. Part of the index from the generated HTML
To create the icons in the final index (Figure 1), the post-processing macro identifies the file extension of each link target, and then creates the relevant HTML code automatically. (Originally, we used codes in the index entries. For example, +p was converted into code that displays an image representing a PDF file.)
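The extension-based approach might look like the following Python sketch. It is an illustration only, not the TechScribe macro. The icon paths, the alt text, the ICONS table, and the function name add_icons are all hypothetical, and the sketch assumes that each link in the generated HTML has the form <a href="...">...</a>.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical mapping from file extension to (icon path, alt text).
ICONS = {
    ".pdf": ("images/pdf.gif", "PDF file"),
    ".doc": ("images/word.gif", "Word file"),
}

# Assumed link shape in the generated HTML: <a href="...">text</a>
_LINK = re.compile(r'<a\s+href="([^"]+)"[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)


def add_icons(generated_html: str) -> str:
    """Append an icon after each link whose target has a known extension."""

    def decorate(match: re.Match) -> str:
        # Ignore any #fragment or ?query when reading the extension.
        path = urlsplit(match.group(1)).path
        icon = ICONS.get(PurePosixPath(path).suffix.lower())
        if icon is None:
            return match.group(0)  # ordinary HTML page: leave unchanged
        src, alt = icon
        return f'{match.group(0)} <img src="{src}" alt="{alt}">'

    return _LINK.sub(decorate, generated_html)


if __name__ == "__main__":
    print(add_icons('<a href="docs/guide.pdf">Guide</a>'))
```

Because the extension is read from the link itself, the indexer does not need to type a code for each non-HTML entry. If the site uses XHTML, the img element would need to be written as a self-closing tag to pass validation.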
Hyperlinked main headings with subheadings
By default, HTML Indexer does not create a hyperlinked main heading if there are subheadings. You can force HTML Indexer to create a hyperlinked main heading by including HTML code in the text for the heading. (The section 'Create hyperlinked common headings' on the HTML Indexer Tips and Techniques web page shows how to do this. However, the method is cumbersome, error-prone, and not recommended by the developers of HTML Indexer.)
One solution is to create the main heading in the normal manner. The generated HTML will contain a link to the web page. For each subheading, create an entry where the heading contains some additional text that indicates the entry should be deleted during post-processing, as shown in Figure 2.
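The deletion step can be sketched in Python as follows. The marker text and the layout of the generated HTML are shown only in Figure 2, so both are assumptions here: the sketch uses a hypothetical marker, +x, and assumes that each marked heading occupies its own line in the generated HTML. The function name drop_marked_lines is also hypothetical.

```python
# Hypothetical marker; the article does not state the text it uses.
DELETE_MARKER = "+x"


def drop_marked_lines(generated_html: str, marker: str = DELETE_MARKER) -> str:
    """Remove every line of the generated HTML that contains the marker.

    Assumption: a marked heading is on a line of its own, so deleting the
    line leaves the subheadings directly under the hyperlinked heading.
    """
    kept = [line for line in generated_html.splitlines() if marker not in line]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "\n".join([
        '<a href="audience.htm">audience analysis</a><br>',
        'audience analysis+x<br>',
        '&nbsp;&nbsp;<a href="audience.htm#defined">defined</a><br>',
    ])
    print(drop_marked_lines(sample))
```

In this sketch, the marked heading sorts immediately after the hyperlinked heading, so removing it leaves the subheadings in the correct position.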
See also cross-references
Generally, a see also cross-reference should be part of a single entry. One entry for a heading and another entry for a cross-reference from that heading is not standard indexing practice. HTML Indexer creates a separate entry for a cross-reference, as shown here:

The solution is to create the see also text as a subheading, as shown in Figure 4.
Figure 4. See also cross-reference as a subheading
By default, the 'Sort as' entry field contains the same content as the 'X-ref heading' field, and this does not need to be changed. The <i> is HTML code that causes the text that comes after it to be displayed in italics in a web browser. Conveniently, the filing order of the angle bracket will cause the subheading to be at the top of the list. (To have the cross-reference on the same line as the heading would require a simple change to the post-processing macros.)
The 'Reference Text' field cannot be empty. The neatest solution is to include the HTML code (</i>) that ends the instruction to produce italic text.
An alternative to using the <i> and </i> markup would be to use codes, and convert these during post-processing. This could save a few keystrokes, and it allows for conversion to semantic markup (the strictly correct option), rather than hard-coding the tags for the italic text.
Summary
The methodology is not overly complex. You need to define a few easy-to-remember codes and, of course, you need to create macros to manipulate the generated HTML. (At TechScribe, we use Microsoft Word, but there are text editors that offer macro functionality.) After you update the index in HTML Indexer, you must copy the generated HTML to the editing tool, run the macro, and then copy the HTML to the final index.
From a commercial perspective, visual appearance and consistency throughout a web site are both important. Conforming to best practice indicates that you value your index and the people it serves.



