I need to extract the data from tables in a wiki dump in a reasonably convenient form, e.g. a list of lists. However, the format of the dump makes this look tricky. I am aware of WikiExtractor, which is useful for getting clean text from a dump, but it drops tables altogether. Is there a parser that would get me conveniently readable tables in a similar way?
I did not manage to find a good way to parse Wikipedia tables directly from the XML dumps. There do, however, seem to be ways to do it from the rendered HTML, e.g. the wikitables parser. That requires a lot of scraping unless you only need the tables from specific pages, but it may also become possible offline, since HTML Wiki dumps appear to be about to resume (dumps, phabricator task).
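If you go the HTML route, a minimal sketch looks something like this (assuming the requests and BeautifulSoup packages; the page URL is one of the examples below and "wikitable" is the usual class on rendered Wikipedia tables):

import requests
from bs4 import BeautifulSoup

# Fetch the rendered HTML of one page and pull every wikitable out as a list of lists.
url = "https://en.wikipedia.org/wiki/List_of_circulating_currencies"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

tables = []
for table in soup.find_all("table", class_="wikitable"):
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    tables.append(rows)  # each table ends up as a list of row lists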
I am working on a project for which I have to parse and query a relatively large xml file in python. I am using a dataset with data about scientific articles. The dataset can be found via this link (https://dblp.uni-trier.de/xml/dblp.xml.gz). There are 7 types of entries in the dataset: article, inproceedings, proceedings, book, incollection, phdthesis and masterthesis. An entry has the following attributes: author, title, year and either journal or booktitle.
I am looking for the best way to parse this and consequently perform queries on the dataset. Examples of queries that I would like to perform are:
retrieve articles that have a certain author
retrieve articles if the title contains a certain word
retrieve articles to which author x and author y both contributed.
...
Here is a snapshot of an entry in the XML file:
<article mdate="2020-06-25" key="tr/meltdown/s18" publtype="informal">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<author>Daniel Gruss</author>
<author>Werner Haas 0004</author>
<author>Mike Hamburg</author>
<author>Moritz Lipp</author>
<author>Stefan Mangard</author>
<author>Thomas Prescher 0002</author>
<author>Michael Schwarz 0001</author>
<author>Yuval Yarom</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
<ee type="oa">https://spectreattack.com/spectre.pdf</ee>
</article>
Does anybody have an idea on how to do this efficiently?
I have experimented with using the ElementTree. However, when parsing the file I get the following error:
xml.etree.ElementTree.ParseError: undefined entity &Ouml;: line 90, column 17
Additionally, I am not sure whether ElementTree would be the most efficient way to query this XML file.
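For illustration, a minimal iterparse attempt of this kind (over the decompressed dblp.xml) fails in exactly that way, because the dump uses named entities such as &Ouml; that are defined in the external dblp.dtd, which ElementTree's expat parser does not load:

import xml.etree.ElementTree as ET

count = 0
for event, elem in ET.iterparse("dblp.xml", events=("end",)):
    if elem.tag == "article":
        count += 1
    elem.clear()  # free processed elements to keep memory use flat
# raises xml.etree.ElementTree.ParseError: undefined entity &Ouml;: ...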
If the file is large and you want to perform multiple queries, then you don't want to be parsing the file and building a tree in memory every time you run a query. You also don't want to be writing the queries in low-level Python; you need a proper query language.
You should be loading the data into an XML database such as BaseX or ExistDB. You can then query it using XQuery. This will be a bit more effort to set up, but will make your life a lot easier in the long run.
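As a rough sketch of that setup, assuming BaseX is running as a server, the dump has been loaded into a database called dblp, and BaseX's bundled Python client (BaseXClient) is importable (the exact client API may vary a little between BaseX versions):

from BaseXClient import BaseXClient

session = BaseXClient.Session("localhost", 1984, "admin", "admin")  # default credentials, adjust as needed
try:
    session.execute("OPEN dblp")
    # XQuery: titles of articles co-authored by two given people.
    query = session.query(
        'for $a in //article '
        'where $a/author = "Paul Kocher" and $a/author = "Yuval Yarom" '
        'return string($a/title)'
    )
    for typecode, item in query.iter():
        print(item)
    query.close()
finally:
    session.close()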
I'm scraping lots of pages and storing the source in a Postgres database, then picking out the bits of information I need and parsing them either with Postgres' own regex functions, which are super fast, or with Python and BeautifulSoup, row by row, which is perhaps more "proper" but much, much slower.
I'm wondering if I should convert the source to JSON and store in a JSONB field. Seems faster because all the JSON can be indexed...am I wrong? Or maybe switch to MongoDB? I just feel that there must be a faster way. Let's assume, for purposes of argument, that I cannot determine in advance all data that needs to be parsed. Suggestions?
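To make the idea concrete, the JSONB route I have in mind would look roughly like this (using psycopg2; the table, column, and field names are only illustrative):

import psycopg2
from psycopg2.extras import Json
from bs4 import BeautifulSoup

conn = psycopg2.connect("dbname=scrape")  # connection string is illustrative
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, source FROM pages WHERE parsed IS NULL")
    for page_id, source in cur.fetchall():
        soup = BeautifulSoup(source, "html.parser")
        data = {
            "title": soup.title.string if soup.title else None,
            "links": [a.get("href") for a in soup.find_all("a", href=True)],
        }
        cur.execute("UPDATE pages SET parsed = %s WHERE id = %s", (Json(data), page_id))

A GIN index on the parsed column (CREATE INDEX ... USING gin (parsed)) would then make containment queries on the extracted JSON fast.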
I have a dilemma.
I need to read very large XML files from all kinds of sources, so the files are often invalid or malformed XML. I still must be able to read the files and extract some info from them. Since I do need tag information, I need an XML parser.
Is it possible to use Beautiful Soup to read the data as a stream instead of the whole file into memory?
I tried to use ElementTree, but I cannot because it chokes on any malformed XML.
If Python is not the best language to use for this project please add your recommendations.
Beautiful Soup has no streaming API that I know of. You have, however, alternatives.
The classic approach to parsing large XML streams is an event-oriented parser, namely SAX; in Python that's the xml.sax module (built on xml.sax.xmlreader). Because it processes the document incrementally, you can work around the erroneous portions of the file and still extract information from the rest.
SAX, however, is low-level and a bit rough around the edges. In Python it feels particularly clunky.
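A minimal SAX sketch of that approach (the <title> tag is just an example; the incremental feed loop is what lets you stop at a bad region and keep what you already collected):

import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    # Collects the text of every <title> element seen so far.
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True

    def characters(self, content):
        if self._in_title:
            self.titles.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False

handler = TitleHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
with open("huge.xml", "rb") as f:
    try:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            parser.feed(chunk)  # incremental parsing, chunk by chunk
    except xml.sax.SAXParseException:
        pass  # malformed region reached; keep what was collected so far
print(len(handler.titles))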
The xml.etree.cElementTree implementation, on the other hand, has a much nicer interface, is pretty fast, and can handle streaming through the iterparse() method.
ElementTree is superior, if you can find a way to manage the errors.
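For comparison, a rough iterparse sketch (again with an illustrative tag name; a malformed document will still raise a ParseError, but you can catch it and keep what you collected up to that point):

import xml.etree.ElementTree as ET

titles = []
try:
    for event, elem in ET.iterparse("huge.xml", events=("end",)):
        if elem.tag == "title":
            titles.append(elem.text)
        elem.clear()  # discard processed elements to keep memory flat
except ET.ParseError:
    pass  # stop at the first badly formed region
print(len(titles))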
I'm fetching webpages with a bunch of JavaScript on them, and I'm interested in parsing through the JavaScript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script',{'type' : 'text/javascript'})
which yields a list of script tags, which I can loop over to search for the text I'd like:
import re
pattern_i_want = re.compile(r"...")  # placeholder for the actual regex I'm after
for script in scriptResults:
    for block in script:
        match = pattern_i_want.search(str(block))
        if match:
            value = match.group(0)  # pull out the piece of the line I need
(pattern_i_want and its regex are placeholders, of course.)
I was wondering if there was a better way for me to just use regex to find the pattern in the soup itself, searching only through the scripts themselves? My implementation works, but it just seems really clunky so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
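A rough sketch of that idea (the pageData variable name is just an assumption about what the page's JavaScript might look like):

import json
import re

for script in scriptResults:
    text = script.string or ""
    # Suppose one script block contains something like: var pageData = {...};
    match = re.search(r"var pageData\s*=\s*(\{.*?\});", text, re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # now an ordinary Python dict
        break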
Another option is to check in your browser what sort of AJAX requests the application makes. Quite often these also return structured data in JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve page SEO and to integrate better with social network sharing.
I'd like to tokenise Wikipedia pages of interest with a Python library or libraries. I'm most interested in tables and listings. I then want to import this data into Postgres or Neo4j.
For example, here are three data sets that I'd be interested in:
How many points each country awarded one another in the Eurovision Song contest of 2008:
http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final
List of currencies and the countries in which they circulate (a many-to-many relationship):
http://en.wikipedia.org/wiki/List_of_circulating_currencies
Lists of solar plants around the world: http://en.wikipedia.org/wiki/List_of_solar_thermal_power_stations
The source of each of these is written in Wikipedia's own markup, which is used to render the pages. The raw form uses a lot of wiki-specific tags and syntax. The rendered HTML might almost be the easier option, as I could just use BeautifulSoup.
Anyone know of a better way of tokenizing? I feel I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, even if I could output these pages as XML, the table data might not be tokenized enough and would require further processing.
Since Wikipedia is built on MediaWiki, there is an API you can use. There is also Special:Export.
Once you have the raw data, you can run it through mwlib to parse it.
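For example, fetching the raw wikitext of one of the pages above through the API might look roughly like this (using requests; parsing the markup with mwlib or another wikitext parser is then a separate step):

import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "parse",
        "page": "List_of_circulating_currencies",
        "prop": "wikitext",
        "format": "json",
        "formatversion": 2,
    },
)
wikitext = resp.json()["parse"]["wikitext"]  # raw wiki markup, tables included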
This goes more in the semantic web direction, but DBpedia allows querying parts of Wikipedia's data (a community conversion effort) with SPARQL. That makes it theoretically straightforward to extract the data you need, though dealing with RDF triples can be cumbersome.
Furthermore, I don't know whether DBpedia yet contains the data you're interested in.
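Still, a minimal sketch of querying the public DBpedia endpoint from Python (assuming the SPARQLWrapper package; the class and property names in the query are illustrative and would need checking against the actual DBpedia ontology):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?station ?country WHERE {
        ?station a dbo:PowerStation ;
                 dbo:country ?country .
    } LIMIT 50
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["station"]["value"], row["country"]["value"])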