24 February 2013 @ 07:23 pm
A Question for Web Geeks  

I now have copies of a bunch of RUNEs—courtesy of Lynn Anderson—that I am carefully decollating and scanning to PDFs; once done, I’m planning to post/host them on my web page(s). What would be nice is if someone googlebingyahoo!s, say, “Don D’Ammassa”, they’ll find a reference to the PDF of RUNE 44 which contains his article, “The Magical Journeys of Robert M. Green, Jr.”

The Question: What do I do so the web spider crawler engines will notice and capture that information?

  • Is there sufficient space in PDF metadata to add full index text there?
  • Do I stuff a bunch of text into the html link tag for the PDF?
  • Or what?

Simply running the Acrobat OCR Text Recognition function and leaving the resulting text embedded in the PDF wouldn't work since there would be too many errors for it to even vaguely reflect the contents and/or be findable.

Thanks!

beamjockey on February 25th, 2013 01:51 am (UTC)
I don't know, but I too will be interested to read what the Web Geeks have to say about this. And I would be very pleased to see some Runes online.
dd_b on February 25th, 2013 06:03 am (UTC)
With ABBYY FineReader, you can do enough quality-control on the OCR to make sure that things like article titles and author names are right, without fixing ALL the OCR bugs; dunno about the tool you mention, but that general approach is the best way I know to get what you want.

I'm inclined to think that taking the time to get it all right is worth it, myself. I'll help.
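
A minimal sketch of that kind of spot check, assuming the OCR text has already been exported to a plain string. It only reports which author names and article titles fail to appear verbatim, so those are the spots to fix by hand. The function names and the sample OCR text (with a deliberate "Joumeys" misread) are made up for illustration; this is not part of FineReader or Acrobat.

```python
import re

def normalize(text):
    """Lowercase, straighten curly quotes, and collapse whitespace."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()

def missing_terms(ocr_text, expected_terms):
    """Return the expected names/titles that do not appear in the OCR text."""
    haystack = normalize(ocr_text)
    return [term for term in expected_terms if normalize(term) not in haystack]

# Hypothetical OCR output: "rn" has been misread as "m" in the title.
sample_ocr = """The Magical Joumeys of Robert M. Green, Jr.
by Don D\u2019Ammassa"""

expected = [
    "Don D'Ammassa",
    "The Magical Journeys of Robert M. Green, Jr.",
]

# Anything printed here still needs a manual correction in the OCR text.
print(missing_terms(sample_ocr, expected))
```

An exact-match check like this is deliberately strict: a single wrong letter in a name is enough to keep a web search from finding it.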

fredcritter on February 25th, 2013 06:39 pm (UTC)

Thanks, David. It's been a couple of years since I did an investigation of OCR software for my previous employer, so I expect the technology has improved and my memory might be a bit off, but IIRC all the tools (except the one I mentioned, which is native to Acrobat Pro) extract the text rather than leaving it embedded in the PDF. While getting accurate OCR output is certainly very important, my question here is focused on how to get the text back into the PDF (presumably in a text layer beneath the image layer) once it's been extracted and cleaned up.

As far as I know, there are definite limits to how much clean up / editing can be done on embedded text within the PDF (which is why I mentioned Acrobat's native OCR and its shortcomings).

Do you know something I don't about this? (Not that far-fetched a proposition, actually.) Or were you simply looking at the extraction part of the process?

Either way, I'll be happy to enlist your help once we get the process figured out. Thanks for the thoughts and the offer!

edit: I'll email you a sample RUNE PDF for you to look at and mess around with.



Edited at 2013-02-25 06:41 pm (UTC)
dd_b on February 25th, 2013 07:25 pm (UTC)
I have PDFs I made using Finereader that are of the "rough OCR on top of scan images" sort that I think we're discussing here -- mostly you see the images, but the text can be brought to the front, and is visible to web indexing tools, so it's ideal for quick-and-dirty web presentation of things in that it makes them at least somewhat indexable while not forcing the errors in front of human readers.

(And I see I have received the PDF you sent, thanks! Will look.)

(A complexity is that the copy of FineReader I used then is obsolete and was a trial which has expired. If I get involved at the level where that's an issue I'll deal with it somehow.)
fredcritter on March 4th, 2013 05:21 am (UTC)

Aha! Figured it out! In fact, I had thought about it before but decided it was just too obvious and went looking for something else. Basically, since I'm scanning to TIFF, cleaning up the scans in Photoshop, and saving each page to PDF format from within Photoshop (then assembling/consolidating them and, finally, optimizing them in Acrobat Pro), I can simply add the text in Photoshop using the text tool and then put the text layer behind the image layer before saving. It would be a b*tch to try to align the text with the image to make the PDF internally searchable, but I don't have to do that since my intent is to make them findable by web search. So I've simply pasted the entire text of, for example, RUNE40 on the layer underneath the blank image of the inside front cover in 4pt. Arial.

The threads in the twilltone in some of the later issues are causing some OCR problems so I'm having to do more editing. If you'd like to give it a whack, I can send you a pdf of one of those along with a text copy of the initial extract so you can compare your results with mine.

dd_b on March 4th, 2013 04:54 pm (UTC)
If that's a rendered text layer, it won't be searchable, though; it's just an image. Or am I not understanding? You say "in Photoshop".
fredcritter on March 4th, 2013 09:07 pm (UTC)

When you use the text tool in Photoshop, a text layer is automatically created in which the text is stored as text. You can choose to render the text as bitmap by flattening the image (and I think one or two other ways that I can't recall off the top of my head), but otherwise it stays text.

When you Save As in Photoshop, one of the options is "Photoshop PDF document." When you do that, the text layer stays text, even if it's hidden behind another layer; unlike what would happen if you used Print to PDF or sent it to Distiller. It remains searchable—I know this, because if I open the resultant PDF and search for a word I know to be in the text, Acrobat will find and highlight the word (or, actually, will highlight the word's location—you won't see the word because it's hidden behind the image layer). Download the copy of RUNE40 I uploaded and try it! Search for "Minn-stf" perhaps…

apostle_of_eris on February 26th, 2013 03:39 am (UTC)
So how about a ToC or index page of minimal HTML/CSS for the spiders?
fredcritter on March 4th, 2013 09:20 pm (UTC)

In a way, I ended up using your suggestion too, Mr. eris; in that I've included the ToC in the text of the fanzines page on my site, next to the clickable RUNE-cover-image-link. I've even worked ahead—the page now contains the ToCs of a bunch of the RUNE I have and will be scanning and posting.
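
For anyone wanting to try the same thing, here is a rough sketch of generating that kind of page fragment: a plain link to the PDF followed by its table of contents as ordinary HTML text, which is the part a crawler can read without opening the PDF. The function name and the `rune44.pdf` file name are assumptions for illustration only; the single ToC entry is the one mentioned in the original post.

```python
from html import escape

def toc_fragment(issue_title, pdf_url, entries):
    """Build an HTML fragment: a link to the PDF plus its ToC as plain text."""
    lines = [
        f'<h2><a href="{escape(pdf_url, quote=True)}">'
        f"{escape(issue_title, quote=False)}</a></h2>",
        "<ul>",
    ]
    for author, title in entries:
        # quote=False leaves apostrophes alone; &, < and > are still escaped.
        lines.append(
            f"  <li>{escape(author, quote=False)}: {escape(title, quote=False)}</li>"
        )
    lines.append("</ul>")
    return "\n".join(lines)

# Hypothetical file name; the entry itself comes from the post above.
fragment = toc_fragment(
    "RUNE 44",
    "rune44.pdf",
    [("Don D'Ammassa", "The Magical Journeys of Robert M. Green, Jr.")],
)
print(fragment)
```

Keeping the author and title as visible text right next to the link means a search for the name lands on the page that points at the right issue, whether or not the text inside the PDF gets indexed.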

apostle_of_eris on March 5th, 2013 03:36 am (UTC)
“Mr. eris”?
Either plain old "eris" or "Pope eris".
quadong on March 4th, 2013 04:27 am (UTC)
I don't have any answers to your questions better than what's already been posted. However, I'm very happy to hear about this effort! As Minn-stf archivist, I have been very slowly working on building up an online Rune archive at http://mnstf.org/Rune . So far there are issues 5, 6, 7, 8, 10, and 37 1/2. I assume that I can copy your scans (and tables of contents) over to that page as you make them?

I'm happy to see that you have Runes 42 and 43 in particular, as I have not managed to get (complete) copies of those yet.

I take it from your page that you haven't heard of Rune 87, published this last October. It's at http://mnstf.org/Rune as a PDF, and I'd be happy to mail you a paper copy if you like. I'm hoping to get #88 out this late spring/early summer.
fredcritter on March 4th, 2013 05:05 am (UTC)
I was going to email you directly in reply but I see I have two different domains for you: startraders and speakeasy. Is one out of date? Which do you prefer? (I believe you have my email--feel free to use it instead of replying here if you wish.)

You are absolutely welcome to download the RUNE pdfs (and ToCs) from my site and make them available on the Minn-stf site. Please be sure to mention that they are courtesy of Lynn Anderson.

Are you also in charge of the physical archives or does Kay still hold them? Once I'm finished scanning these RUNEs (and EINBLATTs and Minicon progress reports) Lynn wants them to go to the Minn-Stf archives.
quadong on March 5th, 2013 03:41 am (UTC)
Both e-mail addresses work, but I prefer the startraders one. (The speakeasy one forwards to it, so it only matters a little bit.)

I'm in charge of the physical archives, but they are at Marian Turner's house, while I am in Chicago. This is a little awkward, but we're making do. I'm up in Minneapolis 6-8 times a year and am hoping to have a several-week stay this summer that could include some significant archive organizing work.