Indexing PostScript and PDF documents containing mathematical equations
Users with large personal document collections sooner or later want to make
them easily searchable. On UNIX, this can be done with free tools such as
ht://Dig, SWISH-E and glimpse. If some of the documents are in PostScript or
PDF format, they must first be converted to plain text using a tool such as
pstotext(1).
For mathematical documents, conversion with pstotext(1) produces text
interspersed with many lines of random characters, because displayed
equations aren't handled properly. In these cases, dbacl(1) can act as a
filter that removes the noise lines by passing through only those lines
which appear to be mostly English text. This reduces, but does not entirely
eliminate, the noise that would otherwise pollute the list of indexed terms.
The following shell command converts a PostScript or PDF document and filters out the noisy lines:
% pstotext Diestel-Graph_Theory.pdf | dbacl -c shake -Rf shake > output.txt
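To process a whole collection rather than a single file, the same pipeline can be wrapped in a loop. The following is only a minimal sketch and is not part of dbacl: the function names filter_doc and filter_dir, the directory arguments and the .txt output naming are assumptions made for illustration, and the shake category is assumed to exist already.

```shell
#!/bin/sh
# Hypothetical helper: convert one PostScript/PDF file and pass the text
# through the same dbacl filter as the single-file command.
filter_doc() {
    pstotext "$1" | dbacl -c shake -Rf shake
}

# Hypothetical helper: filter every .ps and .pdf file found in directory $1
# and write each result as a .txt file in directory $2.
filter_dir() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for doc in "$src"/*.ps "$src"/*.pdf; do
        [ -f "$doc" ] || continue      # skip patterns that matched nothing
        name=$(basename "$doc")
        filter_doc "$doc" > "$dst/${name%.*}.txt"
    done
}
```

Once these functions are defined in the current shell, a call such as filter_dir papers text would leave one filtered text file per document in the text directory, ready to be handed to the indexer. Note that a.ps and a.pdf would both map to a.txt in this sketch, so same-named documents need separate handling.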
For the conversion command to work, first check that you have both
pstotext(1) and dbacl(1) in your path, and create an "English text" category
if necessary as follows:
% zcat Shakespeare-Complete_Works.txt.gz | dbacl -l shake
The sample English text can be freely downloaded, e.g. from Project Gutenberg.
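The two preparation steps described above can be sketched as a pair of shell functions. This is an illustration only: the names check_tools and make_shake are hypothetical, and the sketch assumes that dbacl writes the learned category to a file named shake in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: report every named tool that is not on the PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical helper: learn the "shake" category from the gzipped sample
# text $1, unless it has already been created.
# Assumption: the category file is ./shake in the current directory.
make_shake() {
    if [ -f shake ]; then
        return 0
    fi
    zcat "$1" | dbacl -l shake
}
```

With these defined, running check_tools pstotext dbacl followed by make_shake Shakespeare-Complete_Works.txt.gz would perform the checks once, and later runs would skip the learning step.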