I research the history of chainsaws. A great deal of material is now available on the Internet as scanned newspapers, books, brochures and other printed publications. I go through enormous volumes of it, in different languages, to find mentions of the brands and mechanisms that interest me.
Reading all of it myself is impossible, because I am usually looking for a single paragraph in a book that is hundreds of pages long.
Therefore, I developed the following strategy:
- The file I am reviewing is sent to the Universal Document Converter (UDC) virtual printer
- UDC unpacks the file "Article.pdf" into the files "Article_0001.tif", "Article_0002.tif" and so on, and puts them into the folder "Article".
The save paths, the folder and file names, and the final image quality are configured so that the images can later be fed to a text recognition program (in my case, Cuneiform).
The result of all this is:
- A set of standardized graphics files, one per page, all in one format
- A set of text files
All that is left is to search the text files and, if a hit looks interesting, to open the corresponding graphics file. Besides making the work easier, this eliminates the need to open the source file, which can be several hundred megabytes in size.
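The search step can also be scripted. Below is a minimal sketch in Python, assuming the text recognition output is saved as ".txt" files that share a base name with the page images (for example "Article_0001.txt" next to "Article_0001.tif"). The function name find_pages and that file layout are illustrative assumptions, not something UDC or Cuneiform prescribes.

```python
from pathlib import Path


def find_pages(text_dir, phrase, image_ext=".tif"):
    """Return (text file, expected image path, matching line) for each page
    whose recognized text contains the phrase (case-insensitive)."""
    hits = []
    needle = phrase.casefold()
    for txt in sorted(Path(text_dir).rglob("*.txt")):
        # Recognized text may contain odd bytes, so decode leniently.
        content = txt.read_text(encoding="utf-8", errors="replace")
        for line in content.splitlines():
            if needle in line.casefold():
                # Assumed layout: the page image sits next to the text file
                # and differs only in its extension.
                hits.append((txt, txt.with_suffix(image_ext), line.strip()))
                break  # one hit is enough to flag the page
    return hits


if __name__ == "__main__":
    # "Article" is the folder from the example above; the phrase is arbitrary.
    for txt, image, line in find_pages("Article", "chain saw"):
        print(f"{image}: {line}")
```

Each result names the page image to open, so the large source file never has to be touched. Recognition errors can split or distort words, so searching for short, distinctive fragments tends to find more than searching for long phrases.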
That's how I created my own Google Books library!
A short time ago I was looking at several thousand European patents, each a PDF file 3-5 pages long with the text stored as images. Opening every one of them and manually sending it to print was impractical. I tried selecting several files and right-clicking to send them to print. But Windows Vista refused to process more than 15 files at a time!
The solution turned out to be ingeniously simple:
- Open the Print Settings window
- Open a folder window containing the thousand PDF files
- Use the mouse to drag all the files onto the Universal Document Converter icon
Some time later, I received numbered files that were neatly arranged in folders.
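With thousands of files in one batch, it is worth checking that nothing was skipped. A minimal sketch in Python, assuming the layout described earlier ("Article.pdf" becomes a folder "Article" containing "Article_0001.tif" and so on); the function name missing_conversions and the example folder names are hypothetical.

```python
from pathlib import Path


def missing_conversions(pdf_dir, out_dir):
    """List the PDF files that have no numbered .tif pages in out_dir."""
    missing = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumed layout: out_dir/<name>/<name>_0001.tif, <name>_0002.tif, ...
        folder = Path(out_dir) / pdf.stem
        # Note: names containing glob characters such as "[" would need escaping.
        pages = list(folder.glob(f"{pdf.stem}_*.tif")) if folder.is_dir() else []
        if not pages:
            missing.append(pdf.name)
    return missing


if __name__ == "__main__":
    # Both folder names are placeholders for the real locations.
    for name in missing_conversions("patents", "converted"):
        print("not converted:", name)
```

Any file the script lists can simply be dragged onto the printer icon again.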