3

Extracting data from Wikipedia tables

asked 2013-06-12 07:44:16 -0500

PAC

updated 2013-06-12 07:45:35 -0500

I would like to extract data from a series of Wikipedia tables. I've found a nice Firefox add-on (ExportToCSV), but unfortunately it doesn't export data that contains internal links. For instance, if you try it on this table: http://en.wikipedia.org/wiki/PremierLeague2007-08#Personnelandkits, you will not get the names of the manager and the captain. Does anyone know a better tool? I'd like something very easy to use.


4 Answers

0

answered 2013-06-13 01:27:47 -0500

PAC

I've found a solution to my problem: the html2table plugin for Chrome.

3

answered 2013-06-13 22:39:47 -0500

Andrew Duffy

The "Scraper" plugin in Chrome also works on that table: https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en


Comments

I'm a big fan of the Scraper extension and usually teach it in workshops. I will have a look at html2table though, as suggested by @PAC. One advantage of the Scraper extension: you can even scrape more complex websites.

mihi ( 2013-06-17 07:08:06 -0500 )
3

answered 2013-06-25 04:00:45 -0500

Tony Hirst

updated 2013-06-25 04:30:52 -0500

Google Sheets (aka Google Spreadsheets) has a handy formula called =importHtml() that can import a table or HTML list from a web page given its web location/URL:

  • =importHtml(URL, "table", N)
  • =importHtml(URL, "list", M)

where N is the Nth table in the webpage at URL, and M is the Mth list in the page.

You can find an example of how to use this formula here: http://mashe.hawksey.info/2012/10/feeding-google-spreadsheets-exercises-in-import/ (Feeding Google Spreadsheets: Exercises in using importHTML, importFeed, importXML, importRange and importData (with some QUERY too))

Unfortunately, it doesn't cope with the links either...

For a "simple" tool that will extract the links into a separate column, try http://www.outwit.com/products/hub/ (Outwit hub).


Comments

Very nice tool!

PAC ( 2013-06-25 04:20:43 -0500 )
1

answered 2013-06-24 06:27:12 -0500

phillchill

Also, if you'd like to get into more customizable scraping, check out https://scraperwiki.com/. You can write your own custom scraper in Python that saves results to their database, and then download the results as an SQLite database, a .csv table or a JSON file.

For inspiration, there are quite a number of Wikipedia scrapers: https://scraperwiki.com/tags/wikipedia
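
To give an idea of what such a custom Python scraper might look like, here is a minimal sketch that keeps the link target of each cell next to its text, which is the part the simpler tools in this thread drop. It uses only the standard library and is not ScraperWiki's own API. The class name `LinkTableParser`, the helper `table_to_csv` and the sample HTML are all made up for illustration. It assumes flat, well-formed tables (no nested tables) and keeps only the first link in each cell.

```python
import csv
import io
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Collect every <table> as rows of (text, first_link_href) cells.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []    # finished tables, in document order
        self._rows = None   # rows of the table being read
        self._row = None    # cells of the row being read
        self._text = None   # text fragments of the cell being read
        self._href = None   # first link target seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None and self._row is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self._row.append((text, self._href or ""))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._text = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def table_to_csv(rows):
    """Write each cell as two CSV columns: its text, then its link target."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        flat = []
        for text, href in row:
            flat.extend([text, href])
        writer.writerow(flat)
    return out.getvalue()


# Made-up sample standing in for a downloaded Wikipedia page.
SAMPLE = """
<table>
  <tr><th>Team</th><th>Manager</th></tr>
  <tr><td><a href="/wiki/Team_A">Team A</a></td>
      <td><a href="/wiki/Manager_A">Manager A</a></td></tr>
</table>
"""

parser = LinkTableParser()
parser.feed(SAMPLE)
print(table_to_csv(parser.tables[0]))
```

For a real page you would fetch the HTML first (for example with `urllib.request`) and pick the table you want from `parser.tables` by position, much like the N in `=importHtml(URL, "table", N)`.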


