24 February 2013 @ 07:23 pm
A Question for Web Geeks  

I now have copies of a bunch of RUNEs—courtesy of Lynn Anderson—that I am carefully decollating and scanning to PDFs; once done, I'm planning to post/host them on my web page(s). What would be nice: if someone googlebingyahoo!s, say, "Don D'Ammassa", they'd find a reference to the PDF of RUNE 44, which contains his article, "The Magical Journeys of Robert M. Green, Jr."

The Question: What do I do so the web spider crawler engines will notice and capture that information?

  • Is there sufficient space in PDF metadata to add full index text there?
  • Do I stuff a bunch of text into the html link tag for the PDF?
  • Or what?
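On the "stuff text into the HTML link tag" option: what crawlers most reliably pick up is plain descriptive anchor text and a short summary sitting next to each PDF link on the hosting page, since search engines associate nearby text with the linked file. A minimal sketch of generating such an index page (the filenames, titles, and descriptions here are illustrative examples, not a real RUNE index):

```python
# Generate a plain HTML index page whose link text and surrounding
# summaries give crawlers something to associate with each PDF.
# The issue data below is illustrative, not an actual RUNE index.
from html import escape

issues = [
    ("rune44.pdf", "RUNE 44",
     'Don D\'Ammassa, "The Magical Journeys of Robert M. Green, Jr."'),
]

def make_index(issues):
    items = "\n".join(
        f'  <li><a href="{escape(fname)}">{escape(title)}</a>: '
        f"{escape(desc)}</li>"
        for fname, title, desc in issues
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head><title>RUNE archive</title></head>\n"
        f"<body>\n<ul>\n{items}\n</ul>\n</body>\n</html>\n"
    )

print(make_index(issues))
```

The point is simply that the title and author live in indexable HTML next to the link, rather than (or in addition to) inside the PDF itself.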

Simply running Acrobat's OCR Text Recognition function and leaving the resulting text embedded in the PDF wouldn't work, since there would be too many errors for the text to even vaguely reflect the contents or be findable.


dd-b (dd_b) on February 25th, 2013 06:03 am (UTC)
With ABBYY FineReader, you can do enough quality control on the OCR to make sure that things like article titles and author names are right, without fixing ALL the OCR errors; dunno about the tool you mention, but that general approach is the best way I know to get what you want.

I'm inclined to think that taking the time to get it all right is worth it, myself. I'll help.

Fred A Levy Haskell (fredcritter) on February 25th, 2013 06:39 pm (UTC)

Thanks, David. It's been a couple of years since I did an investigation of OCR software for my previous employer, so I expect the technology has improved and my memory might be a bit off, but IIRC all the tools (except the one I mentioned, which is native to Acrobat Pro) extract the text rather than leaving it embedded in the PDF. While getting accurate OCR output is certainly very important, my question here is focused on how to get the text back into the PDF (presumably in a text layer beneath the image layer) once it's been extracted and cleaned up.

As far as I know, there are definite limits to how much clean up / editing can be done on embedded text within the PDF (which is why I mentioned Acrobat's native OCR and its shortcomings).

Do you know something I don't about this? (Not that far-fetched a proposition, actually.) Or were you simply looking at the extraction part of the process?

Either way, I'll be happy to enlist your help once we get the process figured out. Thanks for the thoughts and the offer!

edit: I'll email you a sample RUNE PDF for you to look at and mess around with.

Edited at 2013-02-25 06:41 pm (UTC)
dd-b (dd_b) on February 25th, 2013 07:25 pm (UTC)
I have PDFs I made using FineReader that are of the "rough OCR on top of scan images" sort I think we're discussing here -- mostly you see the images, but the text can be brought to the front and is visible to web indexing tools. That makes it ideal for quick-and-dirty web presentation: things become at least somewhat indexable without the OCR errors being forced in front of human readers.

(And I see I have received the PDF you sent, thanks! Will look.)

(A complication is that the copy of FineReader I used then is obsolete, and it was a trial which has expired. If I get involved at the level where that's an issue, I'll deal with it somehow.)
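For what it's worth, the "rough OCR on top of scan images" structure described above is, at the PDF level, ordinary text drawn with text rendering mode 3 (invisible): the glyphs are in the file and searchable, just never painted. Here's a hand-rolled sketch of a minimal one-page PDF containing only such an invisible line -- uncompressed, with no image, so it shows the mechanism rather than FineReader's or Acrobat's actual output (a real scanned page would draw the scan image over the text):

```python
def invisible_text_pdf(text: str) -> bytes:
    """Build a minimal one-page PDF whose only content is an invisible
    text line: "3 Tr" selects text rendering mode 3 (no visible paint),
    the same mechanism OCR tools use to hide a text layer under a scan."""
    # Escape characters that are special inside PDF literal strings.
    esc = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = b"BT /F1 4 Tf 3 Tr 72 720 Td (%s) Tj ET" % esc.encode("latin-1")
    objs = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
         b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []           # byte offset of each object, for the xref table
    for i, body in enumerate(objs, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (i, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objs) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objs) + 1, xref_pos))
    return bytes(out)

pdf = invisible_text_pdf("Minn-stf")
```

Open the result in a viewer and the page looks blank, but searching for "Minn-stf" finds (and highlights the location of) the hidden text.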
Fred A Levy Haskell (fredcritter) on March 4th, 2013 05:21 am (UTC)

Aha! Figured it out! In fact, I had thought about it before but decided it was just too obvious and went looking for something else. Basically, since I'm scanning to TIFF, cleaning up the scans in Photoshop, and saving each page to PDF from within Photoshop (then assembling/consolidating them and, finally, optimizing them in Acrobat Pro), I can simply add the text in Photoshop using the text tool and then put the text layer behind the image layer before saving. It would be a b*tch to try to align the text with the image to make the PDF internally searchable, but I don't have to do that, since my intent is to make them findable by web search. So I've simply pasted the entire text of, for example, RUNE 40 on the layer underneath the blank image of the inside front cover, in 4 pt Arial.

The threads in the Twiltone paper in some of the later issues are causing some OCR problems, so I'm having to do more editing. If you'd like to give it a whack, I can send you a PDF of one of those along with a text copy of the initial extract so you can compare your results with mine.

dd-b (dd_b) on March 4th, 2013 04:54 pm (UTC)
If that's a rendered text layer, it won't be searchable, though; it's just an image. Or am I not understanding? You say "in Photoshop".
Fred A Levy Haskell (fredcritter) on March 4th, 2013 09:07 pm (UTC)

When you use the text tool in Photoshop, a text layer is automatically created in which the text is stored as text. You can choose to render the text as a bitmap by flattening the image (and, I think, one or two other ways that I can't recall off the top of my head), but otherwise it stays text.

When you Save As in Photoshop, one of the options is "Photoshop PDF." When you do that, the text layer stays text, even if it's hidden behind another layer; unlike what would happen if you used Print to PDF or sent it to Distiller. It remains searchable; I know this because if I open the resultant PDF and search for a word I know to be in the text, Acrobat will find and highlight the word (or, actually, will highlight the word's location; you won't see the word itself because it's hidden behind the image layer). Download the copy of RUNE 40 I uploaded and try it! Search for "Minn-stf" perhaps…
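That "search the PDF for a known word" check can also be scripted instead of done by hand in Acrobat. A rough sketch that scans a PDF's streams for a term, inflating any Flate-compressed ones along the way (it only handles straightforwardly encoded text, so it's a sanity check rather than a guarantee; the sample bytes below are a toy stand-in for a real file):

```python
import re
import zlib

def pdf_contains(data: bytes, needle: bytes) -> bool:
    """Crudely check whether `needle` occurs in a PDF's raw body or in
    any of its Flate-compressed streams.  Text stored with unusual
    font encodings won't be found this way."""
    if needle in data:
        return True
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        body = m.group(1).rstrip(b"\r\n")
        try:
            if needle in zlib.decompress(body):
                return True
        except zlib.error:
            pass  # not a Flate stream we can inflate; skip it
    return False

# A toy stand-in for a PDF fragment: one Flate-compressed content stream.
sample = (b"%PDF-1.4\n5 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
          + zlib.compress(b"BT /F1 4 Tf (Minn-stf) Tj ET")
          + b"\nendstream\nendobj\n")
```

Run `pdf_contains(open("rune40.pdf", "rb").read(), b"Minn-stf")` against a real file; if it comes back True, web crawlers that parse PDF text streams have something to find too.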