need X1 to be able to index the text on image only files

Postby HarryS » Sun Mar 20, 2005 9:04 am

More than anything, I need X1 to be able to index the text in scanned, image-only PDF files, such as the ScanSoft plug-in for Google Desktop Search does.

http://www.scansoft.com/news/pressrelea ... 07_pdf.asp

ScanSoft Adds Indexing of PDF, Fax and Scanned Documents to Google Desktop Search
Available from the Google Desktop Search Web Site, the OmniPage Search Indexer Leverages ScanSoft's World-leading OCR and PDF Conversion Technologies
PEABODY, Mass., March 7, 2005 - ScanSoft, Inc. (Nasdaq: SSFT), a global leader of speech and imaging solutions, today announced the ScanSoft® OmniPage® Search Indexer for Google Desktop Search. The beta release of the plug-in, which is available free on the Google Web site, automatically creates text-index information from PDF files and faxes, as well as scanned books and documents - making them visible to Google Desktop Search.

The OmniPage Search Indexer uses ScanSoft's highly accurate and fast optical character recognition (OCR) and PDF conversion technology to recognize the text within image-based content, creating the index information needed by the search application. ScanSoft is the OCR behind the world's largest book scanning projects, and has been selected by nearly 100% of commercial vendors delivering imaging solutions, including AnyDocs, Autodesk, Avision, Brother, Canon, Captiva, Corex, Dell, FileNET, HP, Kofax, Konica, Kyocera, Lexmark, NSI, Omtool, Verity, Visioneer and Xerox.
HarryS
 
Posts: 21
Joined: Sat May 15, 2004 9:52 am

Postby ba763 » Sun Mar 20, 2005 8:11 pm

Add my vote to this. I have quite a few image-only PDF files, and now that the free competition can do this, X1 really needs to add this feature too if it doesn't want to fall behind.
ba763
 
Posts: 11
Joined: Sun Nov 21, 2004 1:14 am

Postby gww » Mon Mar 21, 2005 4:09 am

ba763 wrote: I have quite a few image-only PDF files, and now that the free competition can do this, X1 really needs to add this feature too


The Google plug-in is a free beta, but once it is out of beta (it does expire), you will have to buy it.

The only way the competition does this is by converting the image to text using OCR software. Then they can read the information... X1 can read this information too.
Regards, grant

Windows 7 PRO 64-bit
Outlook 2007
gww
X1 Super User
 
Posts: 284
Joined: Sun Aug 22, 2004 1:53 pm

Postby Kenward » Mon Mar 21, 2005 4:37 am

I use X1 to access PDF files scanned into PaperPort all the time.

The key is in the scanning. It has to happen with software that can add an OCR layer.
MK
X1 Search 8.6.1 - Build 6003fa (64-bit)
Windows 10 Pro 64-bit | Windows 10 Home 32-bit
No, I have nothing to do with X1, just a user since 2004.
Kenward
X1 Guru
 
Posts: 4149
Joined: Tue Apr 20, 2004 2:35 am
Location: UK

no, X1 does not yet index text on image documents

Postby HarryS » Mon Mar 21, 2005 7:26 am

X1 does NOT yet do what I am asking for.

It is possible to scan a document with PaperPort or OmniPage and then perform OCR and then save the document as a searchable PDF. X1 is able to index those searchable PDF files.

X1 is not yet able to index and search the text in PDF files that are not searchable.

It is very, very time-consuming to create a searchable PDF, especially when scanning magazine articles with colored backgrounds, photos, columns, etc.

About 10 years ago, Xerox introduced a scanning program named Pagis that indexed the images of text in non-searchable XIF, PDF, JPG and GIF files. Unfortunately, Xerox sold Pagis to ScanSoft, and ScanSoft's programmers were unable to get Pagis running on Windows 2000 and XP, so it became orphaned.

ScanSoft bought PaperPort and has continued to develop it. Now they are using this background search index capability as a Google add-on. My argument is that, to remain competitive, X1, which I paid $100 for, should be able to do the same thing.
HarryS
 
Posts: 21
Joined: Sat May 15, 2004 9:52 am

Re: no, X1 does not yet index text on image documents

Postby gww » Mon Mar 21, 2005 8:13 am

HarryS wrote:X1 does NOT yet do what I am asking for.
X1 is not yet able to index and search the text in PDF files that are not searchable.


Neither does Google, PaperPort or anybody else unless they convert the image to text using OCR.

Here is a quote from the link you provided: "The OmniPage Search Indexer uses ScanSoft's highly accurate and fast optical character recognition (OCR) and PDF conversion technology to recognize the text within image-based content, creating the index information needed by the search application."

Kenward (above) also uses PaperPort's OCR to convert images for their text. I use OmniPage Office.

What you want is optical character recognition (OCR). X1 will handle the searching as you want it to.
Regards, grant

Windows 7 PRO 64-bit
Outlook 2007
gww
X1 Super User
 
Posts: 284
Joined: Sun Aug 22, 2004 1:53 pm

Postby HarryS » Mon Mar 21, 2005 8:26 am

I'm telling you that Pagis DID have text search capability to locate text in an image PDF and other image-only formats. Pagis did this without altering the image-only, non-searchable PDF. I used Pagis from the very beginning and had a lot of input into early improvements. I still have it on a Windows 98SE machine and can assure you that it does this.

ScanSoft bought Pagis and they now claim that their Google add-on has this capability.
HarryS
 
Posts: 21
Joined: Sat May 15, 2004 9:52 am

Postby Kenward » Mon Mar 21, 2005 8:49 am

The function that you ascribe to Pagis is also built into PaperPort. Pagis could not search and find text without the aid of OCR.

Pagis was, as you said, taken over by ScanSoft. It was one of the forerunners of PaperPort. Go visit the ScanSoft web site and read what it says about using Pagis with OmniPage.

OmniPage is OCR software that beefs up the rudimentary stuff built into PaperPort.

The Pagis manual spells it out: "Built into Pagis Pro is ScanSoft’s award-winning TextBridge Pro OCR (optical character recognition) program."
MK
X1 Search 8.6.1 - Build 6003fa (64-bit)
Windows 10 Pro 64-bit | Windows 10 Home 32-bit
No, I have nothing to do with X1, just a user since 2004.
Kenward
X1 Guru
 
Posts: 4149
Joined: Tue Apr 20, 2004 2:35 am
Location: UK

Postby HarryS » Mon Mar 21, 2005 9:08 am

Pagis was NOT a forerunner of PaperPort.

ScanSoft bought Pagis from Xerox and ScanSoft bought PaperPort from another company - Carre or something like that. They both came out at about the same time.

Pagis was much more sophisticated and ScanSoft programmers were incapable of developing it to run with newer operating systems so ScanSoft orphaned it. That forced me to use PaperPort for my computers with new operating systems. PaperPort 9.3 is much less sophisticated than Pagis was five years ago.

PaperPort can scan and create an image-only PDF, or give you the option of performing OCR on an image scan and saving the document as a searchable PDF. That is a very time-consuming job. Unlike Pagis, PaperPort simple search is incapable of indexing non-searchable PDFs.

What we are talking about here is something totally different.

ScanSoft is now announcing an add-on for Google Desktop Search that has the Pagis capability of indexing the pictures of text that are on image only - non searchable PDF files.

I am asking X1 to remain competitive and get this same capability.
HarryS
 
Posts: 21
Joined: Sat May 15, 2004 9:52 am

Postby gww » Mon Mar 21, 2005 9:56 am

HarryS wrote:ScanSoft is now announcing an add-on for Google Desktop Search that has the Pagis capability of indexing the pictures of text that are on image only - non searchable PDF files.

I am asking X1 to remain competitive and get this same capability.

Let's clear up one thing. ScanSoft's Google plug-in only works because it has OCR capability. It cannot index an image without OCR. Please do not take my word for it; read it for yourself here

http://desktop.google.com/plugins/omnipagesearch.html and here
http://www.scansoft.com/OmniPage/Search/ and here
http://www.scansoft.com/omnipage/search/faq.asp

Also, the OmniPage Search Indexer is free for personal use during the beta release period, which ends April 30, 2005. The final version will be available within 60 days. Pricing for the final release version has not yet been determined. Corporate, academic and government organizations, as well as OEM vendors, should contact ScanSoft for license opportunities.

Google will not have this indexing capability for free after April 30; after that, you will have to pay for it.
Regards, grant

Windows 7 PRO 64-bit
Outlook 2007
gww
X1 Super User
 
Posts: 284
Joined: Sun Aug 22, 2004 1:53 pm

Postby HarryS » Mon Mar 21, 2005 1:35 pm

Yes, it uses OCR to create an index of non-searchable image files.

It does not modify the image files and save them as searchable.
HarryS
 
Posts: 21
Joined: Sat May 15, 2004 9:52 am

Postby Kenward » Mon Mar 21, 2005 3:58 pm

This continuing saga runs the risk of getting lost in total confusion.

No search engine can find text in a graphic image unless that image has been subjected to OCR of some sort.

The question is how that OCR is applied.

Pagis also creates a separate OCR layer, as well as layers for images, text colour and other stuff. Four in all.

Acrobat does something similar.

The trick comes in how the software matches these layers and uses words to land on the appropriate bit of the image.

Until recently, scanning and filing software often kept the two things -- image and text -- separate, and connected internally somehow. Pagis was not the only software to adopt this approach. So did PaperMaster. It kept TIF files separated from small text files, joined up by pointers, probably in an internal database.
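
As a rough illustration of the "image and text kept separate, joined up by pointers" approach, here is a minimal Python sketch. It assumes the OCR step has already run and handed back plain text for each file. The file names, the sample text and the function names build_sidecar_index and search are all made up for the example, and this is not a claim about how Pagis, PaperMaster or the ScanSoft plug-in actually work internally.

```python
import re
from collections import defaultdict


def build_sidecar_index(ocr_text_by_file):
    """Map each lowercased word to the set of image files whose OCR text contains it.

    The image files themselves are never opened or modified here; the index
    only stores their paths as pointers.
    """
    index = defaultdict(set)
    for path, text in ocr_text_by_file.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Return the sorted list of files whose OCR text contains every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical OCR output for two image-only files (assumed, not real data).
ocr_output = {
    "scan_001.pdf": "Invoice from Acme Corp",
    "scan_002.tif": "Acme meeting notes",
}
index = build_sidecar_index(ocr_output)
print(search(index, "acme"))
print(search(index, "acme invoice"))
```

The point of the sketch is that the search side only ever sees text. Whether that text lives in a separate index like this or in a layer inside the PDF, something still has to run OCR first to produce it.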

More recently, Adobe's PDF has become the de facto image format. Because this does allow a text overlay on the image, people like ScanSoft have tried to move towards the industry standard, often falling over in their attempts to do so.

The bottom line is that a standalone product, which is what X1 is, rather than an add-on, which is what you get from ScanSoft, needs text to look for. As others have pointed out, none of the free text search or standalone search engines can find text in graphics without the intervention of OCR in some form or another.

The more you want the searcher to do, such as finding a word in an image, the more added oomph (that's a technical term) it needs.

Anyone who lusts after some hand crafted software that they had way back in the golden days should just go back and live with it. Just don't expect to be able to do all the other stuff that you do with software like X1.
MK
X1 Search 8.6.1 - Build 6003fa (64-bit)
Windows 10 Pro 64-bit | Windows 10 Home 32-bit
No, I have nothing to do with X1, just a user since 2004.
Kenward
X1 Guru
 
Posts: 4149
Joined: Tue Apr 20, 2004 2:35 am
Location: UK

