I'd like to tokenise Wikipedia pages of interest with a Python library or libraries. I'm most interested in tables and listings. I want to be able to then import this data into Postgres or Neo4j.
For example, here are three data sets that I'd be interested in:
How many points each country awarded one another in the Eurovision Song Contest 2008:
http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final
List of currencies and the countries in which they circulate (a many-to-many relationship):
http://en.wikipedia.org/wiki/List_of_circulating_currencies
Lists of solar plants around the world: http://en.wikipedia.org/wiki/List_of_solar_thermal_power_stations
The source of each of these is written in Wikipedia's own markup, from which the pages are rendered. The raw data contains many Wikipedia-specific tags and syntax. The HTML might almost be the easier solution, as I can just use BeautifulSoup.
Does anyone know of a better way of tokenizing? I feel that I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, if I could find a way to output these pages in XML, the table data might not be tokenized enough and would require further processing.
Since Wikipedia is built on MediaWiki, there is an API you can exploit. There is also Special:Export, which you can use.
Once you have the raw data, then you can run it through mwlib to parse it.
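As a rough sketch of the API route, the following builds a MediaWiki API request for a page's raw wikitext and pulls the first table out of it. It does not use mwlib: parse_first_wikitable is a hypothetical, deliberately simplified helper that ignores cell attributes, nested tables and templates, so mwlib remains the better choice for real pages. The User-Agent string and the function names are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_wikitext_url(page_title):
    """Build a MediaWiki API URL that returns a page's raw wikitext as JSON."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_wikitext(page_title):
    """Download the wikitext. Needs network access, so it is never called here."""
    request = urllib.request.Request(
        build_wikitext_url(page_title),
        headers={"User-Agent": "wikitable-sketch/0.1 (example)"},  # assumed value
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["wikitext"]


def parse_first_wikitable(wikitext):
    """Simplified parser: return the first {| ... |} table as rows of cell strings."""
    rows, current, inside = [], [], False
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("{|"):
            inside = True
        elif not inside:
            continue
        elif line.startswith("|}"):
            break  # end of the first table
        elif line.startswith("|-"):
            if current:
                rows.append(current)
            current = []
        elif line.startswith("|+"):
            continue  # table caption, skipped
        elif line.startswith("!") or line.startswith("|"):
            separator = "!!" if line.startswith("!") else "||"
            current.extend(cell.strip() for cell in line[1:].split(separator))
    if current:
        rows.append(current)
    return rows
```

Calling parse_first_wikitable(fetch_wikitext("List_of_circulating_currencies")) would then give a list of lists that can be inserted into a database, subject to the limitations above.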
This goes more in the semantic web direction, but DBpedia (a community conversion effort) allows querying parts of Wikipedia's data with SPARQL. This makes it theoretically straightforward to extract the needed data; however, dealing with RDF triples might be cumbersome.
Furthermore, I don't know whether DBpedia yet contains any data that is of interest to you.
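To give an idea of what the SPARQL route looks like, here is a minimal sketch that builds a query URL for the public DBpedia endpoint and flattens the standard SPARQL JSON results format into plain dictionaries. The query itself is an assumption: it guesses that countries and currencies are linked through dbo:Country and dbo:currency, which should be checked against the actual data before relying on it.

```python
import urllib.parse

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Assumed query: the class and property names are guesses to verify on DBpedia.
EXAMPLE_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?currency WHERE {
  ?country a dbo:Country ;
           dbo:currency ?currency .
} LIMIT 20
"""


def build_sparql_url(query):
    """Return a GET URL asking the endpoint for SPARQL JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urllib.parse.urlencode(params)


def rows_from_sparql_json(result):
    """Flatten the SPARQL JSON results format into a list of dicts."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append(
            {v: (binding[v]["value"] if v in binding else None) for v in variables}
        )
    return rows
```

The flattened rows are much easier to load into Postgres or Neo4j than raw triples, which takes some of the sting out of the RDF handling.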
Related
Can anyone help me with how to convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json format. Are there any libraries other than PDFMiner? PyPDF2, tabula-py, PDFQuery, Camelot, PyMuPDF, pdf to docx, pandas: none of these other libraries/utilities is suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
The other package which--obviously--does a good job is the Adobe PDF Extract API (see here). It's of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to implement in Python, well documented, and a good way of experimenting and getting a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
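Once GROBID has produced TEI XML, turning it into JSON can be done with the standard library alone. The following is only a sketch under assumptions: it expects the usual TEI namespace, takes the first title in titleStmt, and collects every paragraph under the text element. The function name tei_to_dict is hypothetical, and figures, tables and formulas are not handled.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_dict(tei_xml):
    """Extract the title and body paragraphs from a TEI XML string."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    paragraphs = [
        # Join nested inline elements and normalise whitespace.
        " ".join("".join(p.itertext()).split())
        for p in root.findall(".//tei:text//tei:p", TEI_NS)
    ]
    return {
        "title": title.text.strip() if title is not None and title.text else None,
        "paragraphs": paragraphs,
    }


def tei_to_json(tei_xml):
    """Serialise the extracted fields as a JSON string."""
    return json.dumps(tei_to_dict(tei_xml), ensure_ascii=False, indent=2)
```

Checking the paragraphs list against the original PDF is a quick way to spot the merged footnotes and headers mentioned above.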
I need to extract the data from tables in a wiki dump in a somewhat convenient form, e.g. a list of lists. However, due to the format of the dump, it looks sort of tricky. I am aware of WikiExtractor, which is useful for getting clean text from a dump, but it drops tables altogether. Is there a parser that would get me conveniently readable tables in the same way?
I did not manage to find a good way to parse Wikipedia tables from XML dumps. However, there seem to be some ways to do so using HTML parsers, e.g. wikitables parser. This would require a lot of scraping unless you need to analyze only the tables from specific pages. However, it seems possible to do it offline as it seems HTML Wiki dumps are about to resume (dumps, phabricator task)
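For the HTML route, a list-of-lists extraction can be sketched with the standard library's html.parser. This is not the wikitables parser mentioned above, just a minimal stand-in: the TableExtractor class is hypothetical, and it ignores nested tables, rowspan and colspan.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the chunks and normalise whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html):
    """Return all tables in an HTML string as lists of lists."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```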
I wanted to do this before for some websites but didn't know where to start. This time, however, I am adamant. I am talking about scripts that crawl a website and extract the data we require. My target is this: basically, I have to appear for job interviews in December. There is this site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ & http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of what I have done and what I have not. So I want to extract the questions from each of these pages and put them in a PDF with the title. How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also, point me to any such scripts you know of. I want to do this on my own; I just need a starting point and some guidance. Thanks.
If you want just a starting point, try Scrapy, a screen-scraping library for Python. I would recommend that you use the requests library for making requests. It's by far the simplest option (with no loss of power).
Also, don't try to parse HTML or XML with a regex. Just don't. Use one of the fine libraries available (BeautifulSoup or lxml, or lxml with a BeautifulSoup backend, are the most popular, but there are others).
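As a starting point, here is a sketch of the first step: collecting the links to the "set" pages from an index page. To stay self-contained it uses the standard library's html.parser instead of requests plus BeautifulSoup or lxml, which would be the more comfortable choice in practice. Note that the regex is applied only to the already extracted href values, not to the HTML itself. The URL pattern is an assumption based on the two example links in the question.

```python
import re
from html.parser import HTMLParser

# Assumed pattern, based on URLs such as .../amazon-interview-set-42-on-campus/
SET_URL = re.compile(r"-set-(\d+)")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def interview_set_links(html):
    """Return (url, set_number) pairs for links that look like interview sets."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    found = []
    for href in collector.links:
        match = SET_URL.search(href)
        if match:
            found.append((href, int(match.group(1))))
    return found
```

Keeping the set number alongside each URL makes it easy to record which sets have already been processed.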
I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python based solution.
While it is easy to parse a specific website using, say, BeautifulSoup and determine where the bulk of the content is, it requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitrary tag, but gets more complicated when determining where the rest of the data (not enclosed in well-defined markers) is.
If I understand correctly, robots used by the likes of Google (GoogleBot?) are able to extract data from any website to determine the keyword density. My scenario is similar: obtain the info related to the words that define what the website is about (i.e. after removing JS, links and fillers).
My question is, are there any libraries or web APIs that would allow me to get statistics of meaningful words from any given page?
There are no APIs, but there may be a few libraries that you can use as tools.
You should count the meaningful words and record the counts over time.
You can also start from something like this:
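// Requires: using System.Net; using System.Text.RegularExpressions;
string link = "http://www.website.com/news/Default.asp";
string itemToSearch = "Word";
// Download the page first; matching against the URL string itself would only search the address.
string html = new WebClient().DownloadString(link);
int count = Regex.Matches(html, Regex.Escape(itemToSearch)).Count;
MessageBox.Show(count.ToString());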
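Since the question asks for a Python-based solution, the same idea can be sketched in Python with the standard library. This is an assumption-laden sketch, not a library recommendation: it treats every run of ASCII letters as a word, skips script and style content, does no stop-word filtering, and the WordStats class name is hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

WORD = re.compile(r"[a-z]+")  # crude tokeniser: ASCII letters only


class WordStats(HTMLParser):
    """Count words in visible text and, separately, words inside <a> tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.all_words = Counter()
        self.link_words = Counter()
        self._skip_depth = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore JavaScript and CSS
        words = WORD.findall(data.lower())
        self.all_words.update(words)
        if self._link_depth:
            self.link_words.update(words)


def word_statistics(html):
    """Return {word: {"count", "density", "in_links"}} for an HTML string."""
    parser = WordStats()
    parser.feed(html)
    parser.close()
    total = sum(parser.all_words.values())
    return {
        word: {"count": n, "density": n / total, "in_links": parser.link_words[word]}
        for word, n in parser.all_words.items()
    }
```

Density here is the word's count divided by the total number of words on the page; removing filler words would need a stop-word list on top of this.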
There are multiple libraries that deal with more advanced processing of web articles; this question should be a duplicate of this one.
I am planning to develop a web-based application which could crawl Wikipedia to find relations and store them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, pulling out various pieces of information from the page and storing them in a database. The information may include his date of birth, his company and a few other things. But I need to know if there is any way to find these specific pieces of data on the page, so that I can store them in a database. Any specific books or algorithms would be greatly appreciated. Mentions of good open-source libraries would also be helpful.
Thank You
If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
You might also leverage some of the information in Metaweb's Freebase (which overlaps and I believe may even integrate the info from DBpedia.) They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; they were acquired by Google and eventually folded into the Google Knowledge Graph. There is an API but I don't think they have anything like the formal sync'ing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
You mention Python and open source, so I would investigate NLTK (the Natural Language Toolkit). Text mining and natural language processing are among those things where you can do a lot with a dumb algorithm (e.g. pattern matching), but if you want to go a step further and do something more sophisticated, i.e. trying to extract information that is stored in a flexible manner, or trying to find information that might be interesting but is not known a priori, then natural language processing should be investigated.
NLTK is intended for teaching, so it is a toolkit. This approach suits Python very well. There are a couple of books for it as well. The O'Reilly book is also published online with an open license. See NLTK.org
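To illustrate the "dumb algorithm" end of that spectrum, here is a small pattern-matching sketch that pulls a birth date out of a sentence. It uses only the re module, not NLTK, and it assumes the date is written in the form "born Month D, YYYY"; any other phrasing is missed, which is exactly where proper natural language processing starts to pay off.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)

# Assumed phrasing: "born October 28, 1955" or "born on October 28, 1955".
BORN = re.compile(
    r"\bborn\s+(?:on\s+)?"
    r"(?P<month>" + MONTHS + r")\s+(?P<day>\d{1,2}),\s+(?P<year>\d{4})"
)


def extract_birth_date(text):
    """Return (month, day, year) if the assumed pattern is found, else None."""
    match = BORN.search(text)
    if match is None:
        return None
    return match.group("month"), int(match.group("day")), int(match.group("year"))
```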
Jvc, there are existing python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can locate and retrieve information on any webpage using a number of identifiers (id, XPath, etc).
However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code, meaning the identifiers can change with each subsequent reload of the page. But this problem can be solved by adding regular expressions, i.e. (.*), to your code. Check out this YouTube video, http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL types, and also mainstream databases like Microsoft Access.
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.
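The storing step could look roughly like the sketch below. It uses sqlite3 from the standard library as a stand-in so that it runs without a database server; pyodbc follows the same Python DB-API and the same "?" placeholder style, so the two functions should carry over to a pyodbc connection with little change. The table layout and function names are assumptions for illustration, and CREATE TABLE IF NOT EXISTS is not available in every database.

```python
import sqlite3


def store_facts(connection, facts):
    """Insert (subject, attribute, value) tuples using parameterised SQL."""
    cursor = connection.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS facts "
        "(subject TEXT, attribute TEXT, value TEXT)"
    )
    cursor.executemany(
        "INSERT INTO facts (subject, attribute, value) VALUES (?, ?, ?)", facts
    )
    connection.commit()


def lookup(connection, subject):
    """Return the stored (attribute, value) pairs for one subject."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT attribute, value FROM facts WHERE subject = ? ORDER BY attribute",
        (subject,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    store_facts(conn, [("Bill Gates", "company", "Microsoft")])
    print(lookup(conn, "Bill Gates"))
```

Passing values as parameters rather than formatting them into the SQL string avoids quoting problems with scraped text.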