x1 should index the text portion of mdi and tif files

Postby x1uncle » Thu Mar 27, 2008 1:06 pm

x1 does a great job indexing text in PDF files.
But, it does not read text in MDI files or TIF files.

My company uses MDI heavily, and this is a big problem.

As a workaround, we can convert the MDI files to PDF. x1 indexes the PDF file just fine. But, this is far from ideal.

I googled for ifilter and mdi/modi, and it appears that the only ifilter available is the standard one that Microsoft Office Document Imaging installs itself. But I already have MODI installed, so that is a dead end.

I hope someday x1 will handle Mdi and tif automatically.

And, just to be clear, I am NOT asking for OCR. The MDI/tif files I am using were generated directly from xls and word documents, so their text is already there. No graphics involved.
x1uncle
X1 Power User
Posts: 70
Joined: Mon Apr 23, 2007 3:06 pm

Re: x1 should index the text portion of mdi and tif files

Postby Kenward » Thu Mar 27, 2008 2:34 pm

x1uncle wrote:And, just to be clear, I am NOT asking for OCR. The MDI/tif files I am using were generated directly from xls and word documents, so their text is already there. No graphics involved.

MDI is just a different flavour of TIFF.

It does not follow that creating MDI or TIF from xls or Word means that the files contain text. Here's the giveaway from Microsoft:

Microsoft Office Document Imaging uses Microsoft Document Imaging Format (MDI), a file format based on the Tagged Image File Format (TIFF: a high-resolution, tag-based graphics format used for the universal interchange of digital graphics) that is designed to store images by page layout. In Office Document Imaging, you can open and save files in the MDI format as well as the TIFF format. Both formats are capable of storing text recognized by optical character recognition (OCR: translates images of text, such as scanned documents, into actual text characters; also known as text recognition) along with images.

It may be that you have created an image file from xls and/or Word. TIF is, after all, an image format.

Word does not offer to save as tiff, so can you tell us how you created those files? If you printed to tiff, which is the routine suggested by Microsoft, then that is an image file. It contains no text. It needs OCR. That's what Microsoft is telling you.

Here's how to test. Open the tiff file. Try to copy text from it. If you can't then it needs OCR.
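
For anyone who would rather script that check than copy and paste by hand, here is a minimal Python sketch. It walks the tag directories of a classic TIFF and reports whether tag 37679 is present. As far as I know that is the private tag ExifTool lists as the Microsoft document text tag, but treat it as an assumption: nothing in this thread confirms that MODI stores its text there. The names `tiff_tags` and `has_modi_text` are made up for illustration, and the sketch only looks at top-level directories of .tif/.tiff files, not .mdi files.

```python
import struct

# Assumption: MODI keeps recognised text in this private TIFF tag (0x932F).
MODI_TEXT_TAG = 37679


def tiff_tags(data: bytes) -> set:
    """Collect the tag numbers from every top-level IFD of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("no TIFF byte-order mark")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")

    tags = set()
    visited = set()    # guards against a looping chain of directories
    while offset and offset not in visited:
        visited.add(offset)
        (count,) = struct.unpack_from(endian + "H", data, offset)
        for i in range(count):
            # Each directory entry is 12 bytes; the tag is the first 2.
            (tag,) = struct.unpack_from(endian + "H", data, offset + 2 + 12 * i)
            tags.add(tag)
        # The offset of the next directory follows the entries (0 = last).
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * count)
    return tags


def has_modi_text(data: bytes) -> bool:
    """True if the assumed MODI text tag appears in the file."""
    return MODI_TEXT_TAG in tiff_tags(data)
```

You would call it with the bytes of a file, for example `has_modi_text(open(path, "rb").read())`, where `path` is whichever TIFF you want to inspect. A `False` result only means the tag is missing; it says nothing about whether the page image shows readable words.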

A way around your problem would be to create PDF files. X1 handles them just fine.
MK
X1 Search 8.6.1 - Build 6003fa (64-bit)
Windows 10 Pro 64-bit | Windows 10 Home 32-bit
No, I have nothing to do with X1, just a user since 2004.
Kenward
X1 Guru
Posts: 4149
Joined: Tue Apr 20, 2004 2:35 am
Location: UK

Postby w0qj » Thu Mar 27, 2008 9:07 pm

Just in case, did you try manually asking X1 to index *.MDI files for their body contents?

Tools>>Options>>Files>>[More_Indexing_Options]>>[Specify_File_Types]

Good luck!
Rgds / Mr. Wong

"It is what X1 can do with the information found that is important."
w0qj
X1 Guru
Posts: 1183
Joined: Wed Jun 16, 2004 3:53 am
Location: Hong Kong

Postby x1uncle » Fri Mar 28, 2008 9:20 pm

Does anybody have any other suggestions?

I added mdi, tif and tiff to the file types, but it did not seem to help.

I attached the .mdi file, and a .pdf version of the file to an email and mailed it to myself. Then I waited for x1 real time indexing to get the files. The pdf file was indexed, and the mdi file was not.

And I guarantee (100%, no doubt, money where my mouth is guarantee) these particular files are NOT graphics. If I open the mdi file, I can select any word, copy it to the clipboard, then paste the text into MS Word.

Plus, when I print them to my pdf995, the resulting pdf gets indexed properly. Trust me, pdf995 is not performing OCR on the files.

Postby x1uncle » Fri Mar 28, 2008 9:24 pm

and, I create the files by printing to "microsoft office document imager" which is automatically installed when you do a typical install of office 2003 pro.

I believe it is available in all versions of office. If you have office, but MODI does not appear on your printer list, go to control panel > add remove programs > office > change and install it.

Postby Kenward » Sat Mar 29, 2008 2:37 am

x1uncle wrote:Plus, when I print them to my pdf995, the resulting pdf gets indexed properly. Trust me, pdf995 is not performing ocr on the files.


When you print to PDF you do not need OCR. The printing process transfers the text.

Printing to PDF is not printing as an image file.


Postby Kenward » Sat Mar 29, 2008 3:08 am

x1uncle wrote:and, I create the files by printing to "microsoft office document imager" which is automatically installed when you do a typical install of office 2003 pro.

I believe it is available in all versions of office. If you have office, but MODI does not appear on your printer list, go to control panel > add remove programs > office > change and install it.


I have just printed a Word file to microsoft office document imager. The result is a TIFF file. It contains no text.

You have to OCR to get text. You can do that with MDI if you have that bit installed.

It is the MDI format that contains text. If this is essential to you, and you cannot find an iFilter that can handle the job in X1, I suggest that you try Microsoft Desktop Search (MDS).

It does not seem to be a straightforward task, although the new version of MDS may have made it easier. (See separate message.) But if you visit the various Microsoft forums and talking shops you will find that they are grappling with the same issue.

It seems that Microsoft, in its wisdom, removed MDI indexing in Office 2007.

Once you have got MDI files indexed in MDS, it might even work in X1.

Failing that, as you have a PDF writer, why not save files in that format? Is there any compelling reason to use MDI? PDF is, after all, widely used as the industry standard for sharing formatted files.

Postby x1uncle » Sat Mar 29, 2008 9:09 am

Perhaps we are talking semantics. Are you saying “Modi puts out a tif file, all tif files are graphic THEREFORE modi puts out graphics”?

But, semantics aside, on my computer the tif file DOES contain text. And I bet it does on your computer.
Try dragging your tif file to an empty folder then use explorer search (ctrl f) and look for “a word or phrase in the file”. Explorer should find your tif file.

Also, open your tif with MODI. In the left hand thumbnail window, you will see an “eyeball” in the lower right hand side which means the page contains text. In the right hand preview window you can draw a box around a single word and use ctrl c to put it to the clipboard, which can then be pasted in MS Word as text.

If your computer works differently, perhaps you and I installed Office Pro 2003 differently? But I have installed Office 2003 over 2 dozen times, and never noticed any option that would be related to text versus OCR.

As to why I would want to use modi versus pdf:
I have used pdf as my “final report” tool for about 10 years, but I have recently started to prefer modi:

MODI allows me to easily move pages around
Allows me to easily annotate pages
Allows me to easily copy a page from one report to another
Allows me to easily ocr a single page (sometimes my reports contain faxes which ARE graphic).


If I could find a single PDF tool that does the above, I would use it instead.
But I have tried pdffactory, bullzip pdf, foxit, adobe pro 8 ($800!!) and various others. Using all of them together, I can get the 4 features in a very cumbersome manner, but none of them puts all these features together as nicely as MODI.

But I see that other users are having problems with MODI on Vista/Office 2007. I hope Microsoft does not abandon it.

Postby Kenward » Sun Mar 30, 2008 3:49 am

x1uncle wrote:
MODI allows me to easily move pages around
Allows me to easily annotate pages
Allows me to easily copy a page from one report to another
Allows me to easily ocr a single page (sometimes my reports contain faxes which ARE graphic).


If I could find a single PDF tool that does the above, I would use it instead.
But I have tried pdffactory, bullzip pdf, foxit, adobe pro 8 ($800!!) and various others. Using all of them together, I can get the 4 features in a very cumbersome manner, but none of them puts all these features together as nicely as MODI.


I do all of these things with the "viewer" that is built into PaperPort Pro, especially if you also have an OCR package, such as OmniPage, which beefs up that aspect of PaperPort.

PaperPort also comes with PDF Create, which may also handle most, if not all, of those tasks. I haven't tried it as a standalone feature.

Acrobat also achieves all of those functions.

I use PaperPort to organise files, especially scanned newspaper cuttings. It creates PDF files that X1 can index and find.

Postby Kenward » Mon Mar 31, 2008 2:03 am

I have just discovered that X1 will, indeed, index the text of .MDI files.

The original statement that X1 "does not read text in MDI files or TIF files" is not complete. Had it said, X1 "does not read text in MDI files or TIF files by default," then it would have been more accurate. You just have to do a few things to make it work.

To make this happen, you need to install the OCR component of "Microsoft Document Imaging" (MDI). This is there in Microsoft Office 2003. This feature may not be there in Office 2007.

If OCR is not already installed on your PC, then open MDI, show it a document, and then ask it to "Recognize text using OCR". Don't worry if it says there is text there already. The idea is to ensure that the OCR feature is installed.

If OCR is not installed, it will offer to install it for you. You will need the MS Office installation disk.

As a part of this process, it will add an iFilter that lets X1 index the file. You probably need to add "mdi" files to the extensions you want indexed. Add ".tiff" and ".tif" at the same time while you are at it.

You can see this iFilter in action if you search for files with the "mdi" extension. If the "Indexing Status" column is visible, it will say "IFilter OK". (It will also say, "file indexed" in the status bar.) If the filter is there, but the extension isn't included in your indexing set, it will say "excluded extension" in the status bar.
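
If you want to confirm the registration outside X1, the sketch below follows the usual Windows registry chain for an iFilter: the extension's PersistentHandler entry, then that handler's PersistentAddinsRegistered entry for the IFilter interface. Whether X1 reads exactly this chain is an assumption on my part, and the sketch covers only the direct case where the extension key itself names a handler. `find_ifilter_clsid` and `read_hkcr_default` are hypothetical helper names; the registry reader is passed in as a parameter so the lookup logic can be tried without touching a real registry.

```python
from typing import Callable, Optional

# Interface ID of IFilter, used as a key name under PersistentAddinsRegistered.
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"


def find_ifilter_clsid(
    ext: str, read_default: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the CLSID of the iFilter registered for ext (e.g. ".mdi"), if any.

    read_default takes a key path under HKEY_CLASSES_ROOT and returns that
    key's default value, or None when the key does not exist.
    """
    handler = read_default(f"{ext}\\PersistentHandler")
    if not handler:
        return None
    return read_default(
        f"CLSID\\{handler}\\PersistentAddinsRegistered\\{IFILTER_IID}"
    )


def read_hkcr_default(path: str) -> Optional[str]:
    """Registry-backed reader; works on Windows only."""
    import winreg  # imported here so the module still loads elsewhere

    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, path) as key:
            value, _type = winreg.QueryValueEx(key, "")
            return value
    except OSError:
        return None
```

On a Windows PC you would run something like `find_ifilter_clsid(".mdi", read_hkcr_default)`. A result of `None` suggests no iFilter is registered directly against that extension, which would fit the OCR component not being installed yet.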

The only problem with these files is that the viewer technology, another area where X1 calls on third party support, does not recognise them. So X1's preview pane looks terrible. It will, though, show you the text you are seeking. You can copy text from the file, or open it for a more attractive view.

Postby x1uncle » Mon Mar 31, 2008 10:51 am

Very interesting.

I have 200 MDI files. Index status is as follows
50% have blank
25% show “stellent OK” and content search does not work.
25% (about 40 files) show "Stellent NOFILTER IFilter OK", and CONTENT SEARCH DOES WORK! This is progress!


I have 3000 tif files
50% have blank status
49.5% have “Stellent OK”. These cannot be searched for content.
0.3% have “Ifilter OK”. These CAN be searched for content.

So, overall only about 40 of 3200 files can be searched, which is terrible. But I only recently added the tif/mdi/tiff to the “specify file types” option, so perhaps that is the problem. I am reindexing now, which will take 4 or 5 hours, and will post back when I have results.

Postby Kenward » Mon Mar 31, 2008 11:32 am

Not all tiff files will contain text.

Yes, it can take time to index a load of new stuff.

At least we now know that X1 is up to the task. Now all you need is to get Oracle to write a proper viewer. Don't hold your breath.

Postby x1uncle » Mon Mar 31, 2008 11:37 am

You say the paperport viewer does all of those things. So, perhaps you have a more recent version of paperport, or perhaps I just haven't figured out how to get my version to work.

I bought the whole scansoft/nuance suite 2 years ago for about $400 (omnipage/pdfcreate/paperport/pdf converter). I couldn't get any of their utilities to deliver on the "Easy" part of rearranging PDF pages, and I remember paperport was particularly irritating.

So, I just now revisited Paperport viewer 9.0 thinking I must have missed something.

I still can't seem to figure out how to copy page 1 of DocumentA between pages 5 and 6 of DocumentB.

This rearrangement is the most important missing feature, but there are other missing features.

Paperport PDF viewer's annotation tool has a "Note" text tool which has a frame. But I can't figure out how to get the note text to have a font, nor can I figure out how to make the frame semi-transparent. I much prefer MODI's text frame for my purposes.

On the other hand Paperport viewer has a plain text tool which does allow font, but it lacks a frame of any kind which limits its usefulness.

So, I still continue to whine that I have not found a PDF tool that offers the 4 features that I like about MODI.

Postby Kenward » Mon Mar 31, 2008 1:53 pm

I have PaperPort Pro 11.

If I open two PDF files in PageViewer and enable "Page Thumbnails" then I can just drag individual pages between the two files. I can put them anywhere I like in the destination file.

I can't think of anything simpler or quicker.

It is the same as in Acrobat.

As far as I can recall, all this worked just fine in PaperPort 9.

Postby x1uncle » Tue Apr 01, 2008 6:15 am

Nothing is easy in life! Indexing has run for 20 hours and it is still not done optimizing the indexes.

But, the good news is that all of my MDI and TIF file objects show IFilter OK, and can be searched properly. This is great progress.

I have not reindexed emails yet, so the MDIs on email attachments still have Index Status = blank. That will take another 20 hours of indexing, then I will probably be done.

MK, I am very grateful for your suggestion, it looks like my request for mdi/tif indexing has been granted. Perhaps this whole discussion should be moved from "features and gripes" and put into "frequently asked questions". I'll leave that decision up to the forum moderators.

By the way, Paperport 9 does not support thumbnails, so I downloaded Paperport 11 ($100, but I used the free trial). You are right, it rearranges PDF pages fairly well. So maybe I will switch back to using PDF for my "Final Reports Tool". On the other hand, I may stick with MODI since it has better annotation and undo features, and can be indexed just as well as PDF. Besides, 5 copies of Paperport will be $500 versus MODI, which is free with Office. But that decision is beyond the scope of the x1 forum.

