What tools can I use for extracting tabular data from PDFs?

asked 2013-05-06 04:36:06 -0500

Rufus Pollock

I want to extract tabular data from PDFs. What tools can I use? I have a preference for free and open-source tools and those that are easy to use.

9 Answers

answered 2013-05-06 04:41:56 -0500

Rufus Pollock

updated 2013-06-23 05:07:26 -0500

Some items I know about:

Comments

I like pdftoxml; however, you need to write scripts around it.

mihi ( 2013-05-06 04:44:11 -0500 )

@mihi could you post a link to an example of scripting it (e.g. a gist)?

Rufus Pollock ( 2013-05-06 04:47:13 -0500 )

https://scraperwiki.com/scrapers/pres_election_results_ghana_2004/edit/ is a ScraperWiki scraper I wrote using it.

mihi ( 2013-05-06 04:53:54 -0500 )

These seem like a good sampling of some recent tools. Here is another guide: http://thomaslevine.com/!/parsing-pdfs .

thadk ( 2013-05-06 22:11:11 -0500 )

To extract images from PDFs, I sometimes use cloudconvert.org, which also has Drive and Dropbox integration for saving the converted format(s). Just upload the PDF, select the desired format, and let it do the work. It works great for PDFs with hundreds of pages, i.e. hundreds of images-to-be.

jalbertbowdenii ( 2014-04-24 20:26:48 -0500 )

answered 2013-05-06 04:43:31 -0500

mihi

updated 2013-06-23 16:40:12 -0500

Good question without a clear answer. Tabula looks pretty promising but is still challenging to set up and I didn't have success using it.

If you are into programming: ScraperWiki's pdftoxml helps turn PDFs into something more parseable and makes it easier to extract well-structured tables. However, it takes some effort each time.
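
To give a flavour of what "scripts around it" means, here is a rough sketch (not mihi's actual scraper, and assuming the ScraperWiki Python library's pdftoxml helper plus lxml); the file name and the row-grouping by the "top" coordinate are assumptions for illustration:

import lxml.etree
import scraperwiki

# Convert the PDF to pdftohtml-style XML ("report.pdf" is a placeholder name).
with open("report.pdf", "rb") as f:
    xml = scraperwiki.pdftoxml(f.read())

root = lxml.etree.fromstring(xml.encode("utf-8"))

# Each <text> element carries page coordinates; grouping elements that share
# the same "top" value reassembles the rows of a table on that page.
for page in root.iter("page"):
    rows = {}
    for el in page.iter("text"):
        content = "".join(el.itertext()).strip()
        if content:
            rows.setdefault(int(el.get("top")), []).append((int(el.get("left")), content))
    for top in sorted(rows):
        print([text for _, text in sorted(rows[top])])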

I do think this is the single biggest gap: there is a large need and no good solutions yet.

EDIT: Tabula is now super easy to install: http://jazzido.github.io/tabula

answered 2013-07-19 18:44:52 -0500

gaba

I have to add that the people from Tabula have been working really hard to make it easier. They have installers for Mac and Windows now. I tried the one for Mac and it was quite simple to install and ready to use. Very good job!

answered 2013-05-06 05:17:41 -0500

PAC

updated 2013-05-06 05:38:44 -0500

The command-line tool pdftotext is very easy to use and has several options. It is easy to install on Debian/Ubuntu Linux distributions:

$ sudo apt-get install poppler-utils

and easy to use:

$ pdftotext -layout myfile.pdf

It is also possible to use it on Mac OS and Windows (see here).

Comments

I've never used pdftotext. Isn't it hard to turn the poorly formatted text into well-formatted text after extracting it?

mihi ( 2013-05-06 06:46:21 -0500 )

There are some formatting options (see http://linux.die.net/man/1/pdftotext). I generally use the -layout option and then use regular expressions to extract the relevant data from the text file. It's not that difficult.

PAC ( 2013-05-06 08:43:13 -0500 )
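
For what it's worth, a small sketch of the workflow PAC describes (pdftotext -layout, then regular expressions); the file name and the pattern, which expects a label followed by two numeric columns, are made up for illustration and would need adapting to the PDF at hand:

import re
import subprocess

# Dump the PDF to text while preserving the visual layout ("myfile.pdf" is a placeholder).
subprocess.run(["pdftotext", "-layout", "myfile.pdf", "myfile.txt"], check=True)

# A row here is a text label, then two numeric columns separated by runs of spaces.
row = re.compile(r"^(?P<label>\S.*?)\s{2,}(?P<col1>[\d.,]+)\s{2,}(?P<col2>[\d.,]+)\s*$")

with open("myfile.txt") as f:
    for line in f:
        m = row.match(line)
        if m:
            print(m.group("label"), m.group("col1"), m.group("col2"))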

answered 2013-05-07 02:39:10 -0500

Andrew Duffy

Personally, I'd probably opt to convert to XML and write a scraper from there. If you upload the PDFs somewhere, we might be able to give you a hand with writing a small script.

answered 2013-05-06 09:47:07 -0500

JerryVermanen

There are a few free web solutions.

For most quick and easy conversion, these tools will be enough.

answered 2013-06-25 04:06:59 -0500

Tony Hirst

updated 2013-07-19 10:29:35 -0500

Extracting tabular data from PDF documents is still a problematic area, although tools like Tabula make a good go of it.

If you are happy getting your coding hands dirty, there is a tutorial post on the School Of Data blog that works through what's involved in writing a Python scraper to extract a table from a simple PDF document: Get Started With Scraping – Extracting Simple Tables from PDF Documents
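
The post walks through its own code, but to give a taste of what such a scraper can look like, here is a hedged sketch using pdfminer.six (not necessarily the library the tutorial uses); it buckets text lines by their vertical position to rebuild table rows, and "simple_table.pdf" is a placeholder:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine

# Walk the layout tree page by page and bucket text lines by their y coordinate,
# so cells that sit on the same visual row come out together.
for page in extract_pages("simple_table.pdf"):
    rows = {}
    for element in page:
        if isinstance(element, LTTextContainer):
            for line in element:
                if isinstance(line, LTTextLine):
                    rows.setdefault(round(line.y0), []).append((line.x0, line.get_text().strip()))
    # PDF coordinates grow upwards, so sort from the top of the page down.
    for y in sorted(rows, reverse=True):
        print([text for _, text in sorted(rows[y])])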

answered 2013-11-25 12:47:09 -0500

joffemd

I have posted a list of PDF conversion resources at http://pdfliberation.wordpress.com

answered 2015-02-19 23:37:17 -0500

Personally, I'd probably opt to convert to XML and write a scraper from there. Most of the time that is what I try to do as well. If you have any writing-related queries, you can check http://ukresearchpaperreviews.com/ (a research paper blog), where you can read guest articles related to this.

Stats

Asked: 2013-05-06 04:36:06 -0500

Seen: 29,637 times

Last updated: Feb 19 '15