OCR PDF(s) are unrecognized by X1

Posted: Mon Jul 24, 2006 12:09 pm
by pdf dependent
OCR PDFs are unrecognized by X1. These PDFs are text-searchable in Acrobat. OCR is done by ABBYY software in a package from Fujitsu. (Don't confuse file name search with text search; it is text search that does not work.)

Posted: Mon Jul 24, 2006 12:40 pm
by BillChapman
I have no problem finding text in my OCR'd PDF files with X1.

Which software do you use for your OCR and what version of

Posted: Mon Jul 24, 2006 12:54 pm
by pdf dependent
Adobe do you use?

Posted: Mon Jul 24, 2006 1:11 pm
by BillChapman
Yes, Acrobat Professional 7.08.

Posted: Tue Jul 25, 2006 8:32 am
by Kenward
Silly question, but you have, of course, run the Adobe OCR on the files you want to index? And checked that they really do have the text attached?

Posted: Tue Jul 25, 2006 9:30 am
by BillChapman
Yes, the OCR'd files are indexed, and I can find text in them with X1.

do they really have text attached

Posted: Tue Jul 25, 2006 9:45 am
by pdf dependent
Frankly, I'm not sure how to determine that other than to run an Acrobat text search, which I have done and which does find text.
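
One rough way to check for attached text without Acrobat is to look inside the PDF for text-drawing operators. The sketch below is a heuristic, not a full PDF parser: it uses only the Python standard library, handles only uncompressed or Flate-compressed streams, and can give a false answer on encrypted or unusual files. The function name `has_text_layer` is made up for illustration and has nothing to do with X1 or Acrobat.

```python
import re
import zlib

# Raw bytes between each "stream" keyword and the following "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A text object in a page content stream is delimited by BT ... ET.
TEXT_OBJECT_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Tj / TJ draw a string, hex string, or array of strings.
SHOW_TEXT_RE = re.compile(rb"[)>\]]\s*T[jJ]\b")


def inflate(raw):
    """Return the Flate-decoded stream, or the raw bytes if that fails."""
    try:
        # decompressobj tolerates the end-of-line bytes left after the data.
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return raw  # uncompressed, or a filter this sketch does not handle


def has_text_layer(pdf_bytes):
    """Heuristic: True if any stream contains a text-showing operator."""
    for match in STREAM_RE.finditer(pdf_bytes):
        payload = inflate(match.group(1))
        for text_object in TEXT_OBJECT_RE.finditer(payload):
            if SHOW_TEXT_RE.search(text_object.group(1)):
                return True
    return False


# Example use (the file name is a placeholder):
#     with open("scanned.pdf", "rb") as handle:
#         print(has_text_layer(handle.read()))
```

A result of False suggests an image-only scan. A result of True only means some text is stored in the file; it does not guarantee that a particular indexer can read it.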

Posted: Tue Jul 25, 2006 11:29 am
by BillChapman
Another thing you might want to do is to add the flags column to your Files list display. The contents of that column will tell you the status of each file (indexed, skipped, etc.). My OCR'd PDF files are identified as indexed, and indeed I am able to search for and find text within them. Those in which X1 is unable to find text are marked as skipped.

Updated the ABBYY software (ScanSnap) and it is now working

Posted: Tue Jul 25, 2006 1:40 pm
by pdf dependent
However, it will not highlight my search terms. Search terms are highlighted in other PDFs.

Re: Updated the ABBYY software (ScanSnap) and it is now worki

Posted: Tue Jul 25, 2006 2:12 pm
by Kenward
pdf dependent wrote: However, it will not highlight my search terms. Search terms are highlighted in other PDFs.


This is probably something to do with the way in which the PDF files store text and image. The viewer (which I'd guess is a bought-in part of X1) has to match up the two.

In a regular PDF file, there is no image/overlay issue.

Not even sure that Acrobat is that good at handling OCR'd files.
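
To illustrate the matching step, here is a minimal sketch of what a viewer needs in order to highlight a hit on a scanned page: each recognized word paired with its position on the page image. The `OcrWord` type, the `highlight_boxes` function, and the pixel coordinates are all hypothetical; this is not how X1 or Acrobat is known to work internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    text: str
    # Position on the page image in pixels: (left, top, right, bottom).
    box: tuple


def highlight_boxes(words, term):
    """Return the boxes of all OCR words that contain the search term."""
    needle = term.casefold()
    if not needle:
        return []  # an empty search term highlights nothing
    return [word.box for word in words if needle in word.text.casefold()]


# Hypothetical OCR output for one scanned page.
page_words = [
    OcrWord("Invoice", (10, 10, 80, 30)),
    OcrWord("total", (90, 10, 130, 30)),
    OcrWord("invoice.", (10, 40, 85, 60)),
]
print(highlight_boxes(page_words, "invoice"))
```

If the viewer cannot read the word positions a particular OCR package writes, it can still find the document through the indexed text but has nowhere to draw the highlight, which would match the behaviour described above.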

Posted: Tue Jul 25, 2006 3:50 pm
by BillChapman
I see now what you are saying. X1 finds and displays the OCR'd PDFs, but doesn't highlight the text being searched for. That is the way it works on my system too. The files are indexed, X1 does search them and does find and display the ones with the text I'm searching for, but it doesn't highlight that text as it does in PDFs created directly from web pages, Word files, etc.

Posted: Wed Jul 26, 2006 1:49 am
by Kenward
BillChapman wrote: I see now what you are saying. X1 finds and displays the OCR'd PDFs, but doesn't highlight the text being searched for. That is the way it works on my system too. The files are indexed, X1 does search them and does find and display the ones with the text I'm searching for, but it doesn't highlight that text as it does in PDFs created directly from web pages, Word files, etc.



PDF files are a world unto themselves. They come in various flavours. Not all of them are compatible. While Adobe "owns" the standard, it is open in that you do not have to pay them royalties if you produce software that works on PDF files. There is a good explanation here:

http://en.wikipedia.org/wiki/.pdf

I am a regular user, and beta tester, of Nuance PaperPort software. This scans, creates and manages PDF files. The variations between different files cause constant anguish. In particular, people get very confused by the difference between indexed and searchable PDF files.

An indexed file is one that PaperPort can find, but that it cannot search within. A searchable file means that you can find where words appear within a document.

To get searchable files in PaperPort, you also have to have third party OCR software, such as OmniPage, which also comes from Nuance.

The use of third party search tools, such as X1, is a regular topic in the PaperPort community. A lot of people use X1 and its variants, so there is plenty of experience of getting the two to work together.

Posted: Tue Aug 22, 2006 11:21 pm
by askwong
There are scanner hardware packages that come bundled with both OCR software and the X1 program.

Dynamite Desktop Document Scanners
08.23.06

www.pcmag.com/article2/0,1895,2006864,00.asp
www.pcmag.com/article2/0,1895,1992804,00.asp

...
And what a nice collection of software it is: ScanSoft PaperPort 10 for document management, ScanSoft OmniPage Pro 14 for industrial-strength optical character recognition (OCR), X1 Enterprise Client 5.2 for indexing files and retrieving them by searching for any text in the file, NewSoft Presto! Bizcard 5 for business cards, and a Twain driver so you can scan from virtually any program with a scan command. Rounding the assortment off is Arcsoft Scrapbook Suite, which includes a photo-editing program, but I can't recommend scanning photos on a sheet-fed scanner; the rollers tend to leave marks on the originals.

Some of these programs are a generation behind the latest and greatest versions, but even at one generation behind, PaperPort, OmniPage, and (the current) X1 are a terrific trio for small-office document management. Between them, you can scan, OCR scan, organize, and index your files, then find them as quickly as you can type the text you're looking for. And the Visioneer One Touch software, which sits in the system tray, lets you bring up a menu from which you can easily pick where to scan to—e-mail, a fax program, a printer, a searchable PDF file, and more.
...

Posted: Wed Aug 23, 2006 1:30 am
by Kenward
Some of these programs are a generation behind the latest and greatest versions, but even at one generation behind, PaperPort, OmniPage, and (the current) X1 are a terrific trio for small-office document management.


And if the only thing you want to do is to create searchable PDF files, you can save money by staying "behind the latest and greatest versions". For example, I have it on good authority – that is, someone in the company – that old versions of OmniPage work with PaperPort just fine to create searchable PDFs.

That could save quite a lot of dollars as the earlier versions are out there at knock-down prices.