Parse and Query Large XML File Using Python

I am working on a project for which I have to parse and query a relatively large XML file in Python. I am using a dataset with data about scientific articles. The dataset can be found via this link (https://dblp.uni-trier.de/xml/dblp.xml.gz). There are 7 types of entries in the dataset: article, inproceedings, proceedings, book, incollection, phdthesis and mastersthesis. An entry has the following attributes: author, title, year and either journal or booktitle.
I am looking for the best way to parse this and consequently perform queries on the dataset. Examples of queries that I would like to perform are:
retrieve articles that have a certain author
retrieve articles if the title contains a certain word
retrieve articles to which author x and author y both contributed.
...
Here is a snapshot of an entry in the XML file:
<article mdate="2020-06-25" key="tr/meltdown/s18" publtype="informal">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<author>Daniel Gruss</author>
<author>Werner Haas 0004</author>
<author>Mike Hamburg</author>
<author>Moritz Lipp</author>
<author>Stefan Mangard</author>
<author>Thomas Prescher 0002</author>
<author>Michael Schwarz 0001</author>
<author>Yuval Yarom</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
<ee type="oa">https://spectreattack.com/spectre.pdf</ee>
</article>
Does anybody have an idea of how to do this efficiently?
I have experimented with ElementTree. However, when parsing the file I get the following error:
xml.etree.ElementTree.ParseError: undefined entity Ö: line 90, column 17
Additionally, I am not sure whether ElementTree will be the most efficient way to query this XML file.
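The entity error itself comes from the fact that the DBLP dump declares character entities such as &Ouml; in its external DTD (dblp.dtd), which ElementTree never loads. A streaming parser that does read the DTD gets past it; here is a minimal sketch with lxml, assuming dblp.xml and dblp.dtd have both been downloaded into the working directory:
# Sketch: stream through the DBLP dump with lxml instead of ElementTree.
# Assumes dblp.xml and dblp.dtd (which defines the special entities) are
# both in the working directory; record/field names follow the DBLP format.
from lxml import etree

RECORD_TAGS = {"article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis"}

def iter_records(path="dblp.xml"):
    # load_dtd=True makes lxml read the DTD referenced in the DOCTYPE,
    # which resolves the "undefined entity" error ElementTree raises.
    for _, elem in etree.iterparse(path, events=("end",), load_dtd=True):
        if elem.tag in RECORD_TAGS:
            yield {
                "type": elem.tag,
                "key": elem.get("key"),
                "authors": [a.text for a in elem.findall("author")],
                "title": elem.findtext("title"),
                "year": elem.findtext("year"),
            }
            # Discard what has already been seen to keep memory flat.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

# Example query: articles co-authored by two specific people.
# hits = [r for r in iter_records() if r["type"] == "article"
#         and {"Daniel Gruss", "Moritz Lipp"} <= set(r["authors"])]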

If the file is large and you want to perform multiple queries, then you don't want to parse the file and build a tree in memory every time you run a query. You also don't want to write the queries in low-level Python; you need a proper query language.
You should load the data into an XML database such as BaseX or eXist-db. You can then query it using XQuery. This is a bit more effort to set up, but it will make your life a lot easier in the long run.
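If you go the BaseX route, here is a minimal sketch using the BaseXClient module that ships with BaseX; the host, port, credentials, file path and database name are all placeholders, and the BaseX server has to be running:
# Sketch: load the DBLP dump into BaseX and query it with XQuery.
# Assumes a BaseX server on localhost:1984 and the bundled BaseXClient
# module; user, password, path and database name are placeholders.
from BaseXClient import BaseXClient

session = BaseXClient.Session("localhost", 1984, "admin", "admin")
try:
    # One-time import; BaseX builds its indexes while loading.
    session.execute("CREATE DB dblp /path/to/dblp.xml")

    # Articles with a given author, returning their titles.
    query = session.query(
        'for $a in db:open("dblp")//article'
        '[author = "Paul Kocher"]'
        ' return $a/title/text()'
    )
    for _, title in query.iter():
        print(title)
    query.close()
finally:
    session.close()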

Related

Searchable database of files

I have 1000s of scanned field books as PDFs. Each has a unique filename. In a spreadsheet I have metadata for each, where each row has:
index number, filename, info1, info2, info3, info4, etc.
filename is the exact file name of the PDF. info1 is just an example of a metadata field, such as 'Year' or whatever. There are only about 8 fields or so, and not every PDF is relevant to all of them.
I assume there should be a reasonable way to create a database, MySQL or other, by reading the spreadsheet (which I can just save as .csv or .txt or something). This part I am sure I can handle.
I want to be able to look up/search for a PDF file by entering various search terms against the metadata, and get a list of results. In a web interface, or a custom window, and be able to click on the results and open the file. Basically a typical search window with predefined fields you can fill in and get results - like at an old school library terminal.
I have decent coding skills in Python, mostly math, but some file skills as well. Looking for guidance on what tools and approach I should take to this. My short term goal is to be able to query and find files and open whatever results. Long term I want to be able to share this with the public so they can search and find stuff.
After trying to figure out what to search for online, I am obviously at a loss. How do you suggest I do this, and what tools or libraries should I use? I cannot find an example of this online. Not sure how to word it.
The actual data handling could be done with Pandas:
read the Excel/CSV file into Pandas
perform the search on the Pandas DataFrame, e.g. using df.query() (see the sketch below)
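A minimal sketch of those two steps, assuming the spreadsheet has been exported to a CSV called metadata.csv with a filename column plus the info fields (all names here are placeholders):
# Sketch: load the metadata sheet and filter it with Pandas.
# "metadata.csv" and the column names are placeholders.
import pandas as pd

df = pd.read_csv("metadata.csv")

# Exact match on one field, e.g. a numeric year column.
hits = df.query("year == 1997")

# Substring match on another field, case-insensitive.
hits = df[df["info1"].str.contains("creek", case=False, na=False)]

# The matching PDFs to open or to list in a results page.
print(hits["filename"].tolist())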
But this does not give you a GUI. For that you could go for a web app, using the Flask or Django framework. That, however, is not something one masters overnight :)
This is a good course to learn that kind of stuff: https://www.edx.org/course/cs50s-web-programming-with-python-and-javascript

How to convert PDF to XML/JSON using Python code

Can anyone help me with how to convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json format. Are there any libraries other than PDFMiner? PyPDF2, tabula-py, PDFQuery, Camelot, PyMuPDF, pdf2docx, pandas - none of these libraries/utilities were suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this tool from the Allen Institute (it is based on GROBID and has a handy function for converting TEI XML to JSON).
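To drive GROBID from Python rather than the demo page, here is a minimal sketch against its REST service, assuming a GROBID instance running locally on its default port 8070 and a PDF named paper.pdf (both placeholders):
# Sketch: send a PDF to a locally running GROBID service and get TEI XML back.
# Assumes GROBID is running on localhost:8070 (its default port).
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("paper.pdf", "rb") as fh:
    response = requests.post(GROBID_URL, files={"input": fh}, timeout=120)

response.raise_for_status()
tei_xml = response.text  # the full-text extraction as TEI XML
with open("paper.tei.xml", "w", encoding="utf-8") as out:
    out.write(tei_xml)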
The other option, which unsurprisingly does a good job, is the Adobe PDF Extract API (see here). It is of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to use from Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDFs.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.

Is there an existing library to parse Wikipedia tables from a dump?

I need to extract the data from tables in a wiki dump in a somewhat convenient form, e.g. a list of lists. However, due to the format of the dump this looks rather tricky. I am aware of WikiExtractor, which is useful for getting clean text out of a dump, but it drops tables altogether. Is there a parser that would give me conveniently readable tables in the same way?
I did not manage to find a good way to parse Wikipedia tables from the XML dumps. However, there seem to be some ways to do so using HTML parsers, e.g. the wikitables parser. This would require a lot of scraping unless you only need to analyze tables from specific pages. It does, however, seem possible to do this offline, as HTML Wiki dumps appear to be about to resume (dumps, phabricator task).
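One concrete way to take the HTML-parsing route (not the wikitables package mentioned above, but the same idea) is pandas.read_html, which turns every table on a rendered page into a DataFrame; a minimal sketch, using a live Wikipedia page purely as an example:
# Sketch: pull the rendered tables of one page into DataFrames.
# This works on live pages (or on pages from a saved HTML dump),
# not on the wikitext XML dump itself; the URL is just an example.
import io
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_circulating_currencies"
html = requests.get(url, headers={"User-Agent": "table-extractor/0.1"}, timeout=30).text

tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> on the page
print(len(tables))
print(tables[0].head())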

Python libraries that can tokenize wikipedia pages

I'd like to tokenize Wikipedia pages of interest with a Python library or libraries. I'm most interested in tables and listings. I then want to be able to import this data into Postgres or Neo4j.
For example, here are three data sets that I'd be interested in:
How many points each country awarded one another in the Eurovision Song Contest of 2008:
http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final
List of currencies and the countries in which they circulate (a many-to-many relationship):
http://en.wikipedia.org/wiki/List_of_circulating_currencies
Lists of solar plants around the world: http://en.wikipedia.org/wiki/List_of_solar_thermal_power_stations
The source of each of these is written in Wikipedia's brand of markup, which is used to render the pages. There are many Wikipedia-specific tags and bits of syntax in the raw data form. The HTML might almost be the easier route, as I could just use BeautifulSoup.
Does anyone know of a better way of tokenizing? I feel that I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, if I could find a way to output these pages as XML, the table data might not be tokenized enough and would require further processing.
Since Wikipedia is built on MediaWiki, there is an API you can use. There is also Special:Export.
Once you have the raw data, you can run it through mwlib to parse it.
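A minimal sketch of the API route, fetching the raw wikitext through the standard MediaWiki Action API (the page title below is just one of the examples from the question):
# Sketch: fetch the raw wikitext of a page through the MediaWiki Action API;
# this is the text that mwlib (or any other wikitext parser) can then work on.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "page": "List_of_circulating_currencies",
    "prop": "wikitext",
    "format": "json",
    "formatversion": 2,
}
resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()
wikitext = resp.json()["parse"]["wikitext"]
print(wikitext[:500])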
This goes more in the Semantic Web direction, but DBpedia allows querying parts of the Wikipedia data (a community conversion effort) with SPARQL. This makes it theoretically straightforward to extract the needed data, although dealing with RDF triples might be cumbersome.
Furthermore, I don't know whether DBpedia contains the data that is of interest to you.
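If you want to try the DBpedia route, here is a minimal sketch with the SPARQLWrapper package against the public endpoint; the dbo:currency property used here is only an illustration, so check the DBpedia ontology for the properties your data actually uses:
# Sketch: query DBpedia's public SPARQL endpoint with SPARQLWrapper.
# The predicate (dbo:currency) is an assumption for illustration only.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?country ?currency WHERE {
        ?country dbo:currency ?currency .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["country"]["value"], row["currency"]["value"])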

Write SQL results to XML using Python

Using SQL Server 2005+.
For over a year now, I've been running a query that returns all users in the database for whom I'd like to generate a report. At the end of the query, I've added
for xml auto
The result of this is a link to a file where the XML lives (one row, one column). I then take this file, pass it through some XSLT, and am left with the final XML that a Java program I created nearly two years ago can parse.
What I'd like to do is skip the whole middle part and simply automate all of this with Python:
run the query, put the results into XML format, and then either use the XSLT or simply format the returned XML into the shape that the application needs.
I've searched far and wide on the internet but I'm afraid I'm not sure enough of the technologies available to find just what I'm looking for.
Any help in directing me to an answer would be greatly appreciated.
PyODBC to talk to MS SQL Server, libxml2 to handle the XSLT, and subprocess to run your Java program.
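A minimal sketch of that pipeline; the connection string, query, column names, stylesheet and jar name are all placeholders, and lxml (which wraps libxml2/libxslt) is used here for the XSLT step:
# Sketch: query -> XML -> XSLT -> hand off to the existing Java program.
# Connection string, query, column names, stylesheet and jar are placeholders.
import subprocess
import pyodbc
from lxml import etree

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
# Adding ROOT(...) gives a single root element, which the XML parser needs.
cursor.execute("SELECT u.name, u.email FROM users AS u FOR XML AUTO, ROOT('users')")

# FOR XML output can be split across several rows; stitch it back together.
xml_text = "".join(row[0] for row in cursor.fetchall() if row[0])
conn.close()

# Apply the same stylesheet you have been using by hand.
transform = etree.XSLT(etree.parse("report.xslt"))
result = transform(etree.fromstring(xml_text))
with open("report.xml", "wb") as out:
    out.write(etree.tostring(result, pretty_print=True))

# Kick off the Java program on the generated file.
subprocess.run(["java", "-jar", "report-tool.jar", "report.xml"], check=True)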
