Optimize Python Script to parse xml

I'm parsing US patent XML files (downloaded from the Google patent dumps) using Python and BeautifulSoup; the parsed data is exported to a MySQL database.
Each year's data contains roughly 200-300K patents, which means parsing 200-300K XML files.
The server running the Python script is pretty powerful (16 cores, 160 GB of RAM, etc.), but it still takes close to 3 days to parse one year's worth of data.
I've been learning and using Python for 2 years, so I can get stuff done but do not know how to do it in the most efficient manner. I'm reading up on it.
How can I optimize the script below to make it efficient?
Any guidance would be greatly appreciated.
Below is the code:
from bs4 import BeautifulSoup
import pandas as pd
from pandas.core.frame import DataFrame
import MySQLdb as db
import os
cnxn = db.connect('xx.xx.xx.xx','xxxxx','xxxxx','xxxx',charset='utf8',use_unicode=True)
def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()
def get_data(soup):
    df = pd.DataFrame(columns = ['doc_id','patcit_num','patcit_document_id_country', 'patcit_document_id_doc_number','patcit_document_id_kind','patcit_document_id_name','patcit_document_id_date','category'])
    if soup.findAll('us-citation'):
        cit = soup.findAll('us-citation')
    else:
        cit = soup.findAll('citation')
    doc_id = soup.findAll('publication-reference')[0].find('doc-number').text
    for x in cit:
        try:
            patcit_num = x.find('patcit')['num']
        except:
            patcit_num = None
        try:
            patcit_document_id_country = x.find('country').text
        except:
            patcit_document_id_country = None
        try:
            patcit_document_id_doc_number = x.find('doc-number').text
        except:
            patcit_document_id_doc_number = None
        try:
            patcit_document_id_kind = x.find('kind').text
        except:
            patcit_document_id_kind = None
        try:
            patcit_document_id_name = x.find('name').text
        except:
            patcit_document_id_name = None
        try:
            patcit_document_id_date = x.find('date').text
        except:
            patcit_document_id_date = None
        try:
            category = x.find('category').text
        except:
            category = None
        print doc_id
        val = {'doc_id':doc_id,'patcit_num':patcit_num, 'patcit_document_id_country':patcit_document_id_country,'patcit_document_id_doc_number':patcit_document_id_doc_number, 'patcit_document_id_kind':patcit_document_id_kind,'patcit_document_id_name':patcit_document_id_name,'patcit_document_id_date':patcit_document_id_date,'category':category}
        df = df.append(val, ignore_index=True)
    df.to_sql(name = 'table_name', con = cnxn, flavor='mysql', if_exists='append')
    print '1 doc exported'
i=0
l = os.listdir('/path/')
for item in l:
    f = '/path/'+item
    print 'Currently parsing - ',item
    for xml_string in separated_xml(f):
        soup = BeautifulSoup(xml_string,'xml')
        if soup.find('us-patent-grant'):
            print item, i, xml_string[177:204]
            get_data(soup)
        else:
            print item, i, xml_string[177:204],'***********************************soup not found********************************************'
        i+=1
print 'DONE!!!'

Read a tutorial on multi-threading, because currently that code will run on 1 thread, 1 core.
Remove all the try/except statements and handle missing values explicitly. Exceptions are expensive.
Run a profiler to find the chokepoints, then parallelize those or find a way to do them fewer times.
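As a rough illustration of the two points above, the sketch below drops the try/except blocks in favour of lookups that return None when an element is missing, and runs the extraction under cProfile to see where the time goes. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup; the sample document, the reduced set of fields, and the helper names `citation_rows` and `profile_rows` are assumptions made up for this example, not part of the original script.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified patent document (illustration only).
SAMPLE = """<us-patent-grant>
  <publication-reference><document-id><doc-number>D0000001</doc-number></document-id></publication-reference>
  <us-citation>
    <patcit num="00001"><document-id><country>US</country><doc-number>1234567</doc-number></document-id></patcit>
    <category>cited by examiner</category>
  </us-citation>
  <us-citation>
    <patcit num="00002"><document-id><country>US</country></document-id></patcit>
  </us-citation>
</us-patent-grant>"""


def citation_rows(xml_text):
    """Return one dict per citation; missing fields become None without try/except."""
    root = ET.fromstring(xml_text)
    pub = root.find(".//publication-reference")
    doc_id = pub.findtext(".//doc-number") if pub is not None else None
    # Fall back to <citation> when no <us-citation> elements exist.
    citations = root.findall(".//us-citation") or root.findall(".//citation")
    rows = []
    for cit in citations:
        patcit = cit.find("patcit")
        rows.append({
            "doc_id": doc_id,
            "patcit_num": patcit.get("num") if patcit is not None else None,
            # findtext() returns None when the element is absent.
            "country": cit.findtext(".//country"),
            "doc_number": cit.findtext(".//doc-number"),
            "category": cit.findtext("category"),
        })
    return rows


def profile_rows(xml_text):
    """Run citation_rows under cProfile and return (rows, report text)."""
    profiler = cProfile.Profile()
    rows = profiler.runcall(citation_rows, xml_text)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return rows, out.getvalue()


rows, report = profile_rows(SAMPLE)
```

Collecting plain dicts in a list and writing them to the database in one batch per file would also avoid growing a DataFrame row by row, which is likely one of the slower parts of the original loop.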

So, you're doing two things wrong. First, you're using BeautifulSoup, which is slow, and second, you're using a "find" call, which is also slow.
As a first cut, look at lxml's ability to pre-compile XPath queries (look at the heading "The XPath class"). That will give you a huge speed boost.
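A minimal sketch of what pre-compiling could look like with lxml's `etree.XPath` class is shown here. The sample document is invented and only two of the question's fields are extracted; the point is that each expression is compiled once at import time and then reused for every citation.

```python
from lxml import etree

# Compile each XPath expression once, outside the per-document loop.
FIND_CITATIONS = etree.XPath(".//us-citation | .//citation")
FIND_COUNTRY = etree.XPath("string(.//country)")
FIND_DOC_NUMBER = etree.XPath("string(.//doc-number)")

# Hypothetical, simplified document (illustration only).
SAMPLE = b"""<us-patent-grant>
  <us-citation>
    <patcit num="00001"><document-id><country>US</country><doc-number>1234567</doc-number></document-id></patcit>
  </us-citation>
  <us-citation>
    <patcit num="00002"><document-id><country>US</country></document-id></patcit>
  </us-citation>
</us-patent-grant>"""


def citation_rows(xml_bytes):
    """Extract country and doc-number for every citation in one document."""
    root = etree.fromstring(xml_bytes)
    rows = []
    for cit in FIND_CITATIONS(root):
        rows.append({
            # string(...) yields "" for a missing element; map that to None.
            "country": FIND_COUNTRY(cit) or None,
            "doc_number": FIND_DOC_NUMBER(cit) or None,
        })
    return rows


rows = citation_rows(SAMPLE)
```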
Alternatively, I've been working on a library called yankee that does this kind of parsing declaratively, using best practices for lxml speed, including precompiled XPath.
Yankee on PyPI | Yankee on GitHub
You could do the same thing with yankee like this:
from yankee.xml import Schema, fields as f
# Create a schema for citations
class Citation(Schema):
    num = f.Str(".//patcit")
    country = f.Str(".//country")
    # ... and so forth for the rest of your fields
# Then create a "wrapper" to get all the citations
class Patent(Schema):
    citations = f.List(".//us-citation|.//citation")
# Then just feed the Schema your lxml.etrees for each patent:
import lxml.etree as ET
schema = Patent()
for _, doc in ET.iterparse(xml_string, "xml"):
    result = schema.load(doc)
The result will look like this:
{
    "citations": [
        {
            "num": "<some value>",
            "country": "<some value>",
        },
        {
            "num": "<some value>",
            "country": "<some value>",
        },
    ]
}
I would also check out Dask to help you multithread it more efficiently. Pretty much all my projects use it.
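For comparison, here is a standard-library sketch of spreading the files over worker processes with `concurrent.futures.ProcessPoolExecutor`; this is not Dask, just a swapped-in alternative that needs no extra dependency. Processes rather than threads are used because CPU-bound parsing in CPython is generally limited by the global interpreter lock. The function names are made up for this example, and `count_documents` is only a stand-in for the real per-file parsing work.

```python
from concurrent.futures import ProcessPoolExecutor


def split_documents(path):
    """Yield each XML document from a file of concatenated documents."""
    buffer = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            # A new XML declaration marks the start of the next document.
            if line.startswith("<?xml ") and buffer:
                yield "".join(buffer)
                buffer = []
            buffer.append(line)
    if buffer:
        yield "".join(buffer)


def count_documents(path):
    """Worker: placeholder for 'parse every document in this file'."""
    return path, sum(1 for _ in split_documents(path))


def process_all(paths, workers=None):
    """Fan the files out over worker processes, one file per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(count_documents, paths))
```

`process_all(list_of_paths)` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Each worker would normally open its own database connection rather than share one.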

Related

Export tensorboard (with pytorch) data into csv with python

I have TensorBoard data and want to download all of the CSV files behind it, but I could not find anything in the official documentation. On StackOverflow I found only this question, which is 7 years old and is about TensorFlow, while I am using PyTorch.
We can do this manually; as the screenshot shows, there is an option for it in the UI. I wonder whether it can be done via code, as I have a lot of data to process.

With the help of this script, below is the shortest working code. It loads all of the data into a DataFrame, which you can then process further.
import traceback
import pandas as pd
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
# Extraction function
def tflog2pandas(path):
    runlog_data = pd.DataFrame({"metric": [], "value": [], "step": []})
    try:
        event_acc = EventAccumulator(path)
        event_acc.Reload()
        tags = event_acc.Tags()["scalars"]
        for tag in tags:
            event_list = event_acc.Scalars(tag)
            values = list(map(lambda x: x.value, event_list))
            step = list(map(lambda x: x.step, event_list))
            r = {"metric": [tag] * len(step), "value": values, "step": step}
            r = pd.DataFrame(r)
            runlog_data = pd.concat([runlog_data, r])
    # Dirty catch of DataLossError
    except Exception:
        print("Event file possibly corrupt: {}".format(path))
        traceback.print_exc()
    return runlog_data
path="Run1" #folderpath
df=tflog2pandas(path)
#df=df[(df.metric != 'params/lr')&(df.metric != 'params/mm')&(df.metric != 'train/loss')] #delete the mentioned rows
df.to_csv("output.csv")

Using jsonpath in api.ai in case we have already our json file to query in?

import json
def makeWebhookResult(req):
    if req.get("result").get("action") != "Phdapp":
        return {}
    result = req.get("result")
    parameters = result.get("parameters")
    Progr = parameters.get("PhDsubjects")
    time = parameters.get("PhdTime")
    Levp = parameters.get("PhDDegLevp")
    with open('Sheet1.json') as f:
        data = f.read()
        jsondata = json.loads(data)
    match = jsonpath.jsonpath(jsondata,
        '$.features[[?(@.ProgramName == Progr && @.Level == Levp && @.StartDate == time)]].UniversityName')
    speech = "This is the universities you were looking for " + match
This is the part of my Python code that has errors I can't figure out. I have an intent with the action "Phdapp" and three parameters whose values I need to use in a jsonpath query against the "Sheet1.json" file, which is stored in JSON format in the same GitHub repository. But I can neither get the data from my intent nor access my JSON file for querying. Is it because api.ai is not compatible with jsonpath, or is the problem in my code? If there is a better, easier approach, I'd be glad to know it. Thanks
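Two things stand out in the snippet: `Progr`, `Levp` and `time` sit inside the string literal, so the query contains those names as text rather than their values, and `match` is joined to a string with `+` even though a query result is not a string. A minimal sketch that sidesteps jsonpath with a plain list comprehension follows. It assumes Sheet1.json has a top-level `features` list whose items carry the `ProgramName`, `Level`, `StartDate` and `UniversityName` keys named in the query; the real file layout is not shown in the question, and the sample data and function names here are invented.

```python
import json

# Hypothetical stand-in for the contents of Sheet1.json.
SAMPLE_JSON = """{"features": [
  {"ProgramName": "Physics", "Level": "PhD", "StartDate": "2018", "UniversityName": "Uni A"},
  {"ProgramName": "Physics", "Level": "MSc", "StartDate": "2018", "UniversityName": "Uni B"},
  {"ProgramName": "Physics", "Level": "PhD", "StartDate": "2018", "UniversityName": "Uni C"}
]}"""


def find_universities(data, program, level, start):
    """Return the names of all entries matching the three intent parameters."""
    return [
        item["UniversityName"]
        for item in data.get("features", [])
        if item.get("ProgramName") == program
        and item.get("Level") == level
        and item.get("StartDate") == start
    ]


def make_speech(names):
    """Build the reply text, joining the matched names into one string."""
    if not names:
        return "No matching universities were found."
    return "These are the universities you were looking for: " + ", ".join(names)


data = json.loads(SAMPLE_JSON)
speech = make_speech(find_universities(data, "Physics", "PhD", "2018"))
```

Inside the webhook, the three arguments would come from `parameters.get(...)` as in the original function, and the function would still need to return a response containing `speech`.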

How to parse a single-column text file into a table using python?

I'm new to StackOverflow, but I have found a LOT of answers on this site. I'm also a programming newbie, so I figured I'd join and finally become part of this community, starting with a question about a problem that's been plaguing me for hours.
I log in to a website and scrape a big body of text within the b tag, to be converted into a proper table. The layout of the resulting Output.txt looks like this:
BIN STATUS
8FHA9D8H 82HG9F RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
93827548 096DBR RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
There are a bunch of pages with the exact same blocks, but I need them combined into an ACTUAL table that looks like this:
BIN INV CODE STATUS
HA8DHW2HHD0138 FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU00A123 FPBC-*SOUP CANS LENTILS #2956- INVALID STOCK COUPON CODE (MISSING).
93827548096DBR FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8FHA9D8H82HG9F SSXR-98-20LM NM CORN CREAM RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
Essentially, all separate text blocks in this example would become part of this table, with the inv code repeating with its Bin values. I would post my attempts at parsing this data (I have tried Pandas/bs/openpyxl/csv writer), but I'll admit they are a little embarrassing, as I cannot find any information on this specific problem. Is there any benevolent soul out there that can help me out? :)
(Also, I am using Python 2.7)

A simple custom parser like the following should do the trick.
from __future__ import print_function
def parse_body(s):
    line_sep = '\n'
    getting_bins = False
    inv_code = ''
    for l in s.split(line_sep):
        if l.startswith('INVENTORY CODE:') and not getting_bins:
            inv_data = l.split()
            inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
        elif l.startswith('INVENTORY CODE:') and getting_bins:
            print("unexpected inventory code while reading bins:", l)
        # header line of a bin block ("BIN STATUS" in the sample above)
        elif l.startswith('BIN') and l.endswith('STATUS'):
            getting_bins = True
        elif getting_bins and l:
            bin_data = l.split()
            # need to add exception handling here to make sure:
            # 1) we have an inv_code
            # 2) bin_data is at least 3 items big (assuming two for
            #    bin_id and at least one for message)
            # 3) maybe some constraint checking to ensure that we have
            #    a valid instance of an inventory code and bin id
            bin_id = ''.join(bin_data[0:2])
            message = ' '.join(bin_data[2:])
            # we now have a bin, an inv_code, and a message to add to our table
            print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
        elif getting_bins and not l:
            # done getting bins for current inventory code
            getting_bins = False
            inv_code = ''

A rather complex one, but this might get you started:
import re, pandas as pd
from pandas import DataFrame
rx = re.compile(r'''
    (?:INVENTORY\ CODE:)\s*
    (?P<inv>.+\S)
    [\s\S]+?
    ^BIN.+[\n\r]
    (?P<bin_msg>(?:(?!^\ ).+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)
string = your_string_here
# set up the dataframe
df = DataFrame(columns = ['BIN', 'INV', 'MESSAGE'])
for match in rx.finditer(string):
    inv = match.group('inv')
    bin_msg_raw = match.group('bin_msg').split("\n")
    rxbinmsg = re.compile(r'^(?P<bin>(?:(?!\ {2}).)+)\s+(?P<message>.+\S)\s*$', re.MULTILINE)
    for item in bin_msg_raw:
        for m in rxbinmsg.finditer(item):
            # append it to the dataframe
            df.loc[len(df.index)] = [m.group('bin'), inv, m.group('message')]
print(df)
Explanation
It looks for INVENTORY CODE and sets up the groups (inv and bin_msg) for further processing in the loop (note: it would be easier if you had only one line of bin/msg, as the group has to be split afterwards).
Afterwards, it splits the bin and msg parts and appends everything to the df object.

I wrote some code for website scraping which may help you.
Basically, what you need to do is right-click on the web page, view the HTML, find the tag for the table you are looking for, and extract the information using a parsing module (I am using Beautiful Soup). I am creating JSON because I need to store it in MongoDB; you can create a table instead.
#! /usr/bin/python
import sys
import requests
import re
from BeautifulSoup import BeautifulSoup
import pymongo
def req_and_parsing():
    url2 = 'http://businfo.dimts.in/businfo/Bus_info/EtaByRoute.aspx?ID='
    list1 = ['534UP','534DOWN']
    # fetch and parse every route, one dict per route
    outdict = [parsing_file( requests.get(url2+Route).text,Route) for Route in list1 ]
    print outdict
    conn = f_connection()
    for i in range(len(outdict)):
        insert_records(conn,outdict[i])
def parsing_file(txt,Route):
    soup = BeautifulSoup(txt)
    table = soup.findAll("table",{"id" : "ctl00_ContentPlaceHolder1_GridView2"})
    trtddict = {}
    # the page reports problems in this span; stop if it carries a known error
    divtags = soup.findAll("span",{"id":"ctl00_ContentPlaceHolder1_ErrorLabel"})
    for divtag in divtags:
        print "div tag - " , divtag.text
        # compare against both messages ("x == a or b" would always be true)
        if divtag.text in ("Currently no bus is running on this route", "This is not a cluster (orange bus) route"):
            print "Page not displayed. Errored with below message for Route-", Route," , " , divtag.text
            sys.exit()
    # each two-cell row becomes one key/value pair
    trtags = table[0].findAll('tr')
    for trtag in trtags:
        tdtags = trtag.findAll('td')
        if len(tdtags) == 2:
            trtddict[tdtags[0].text] = sub_colon(tdtags[1].text)
    return trtddict
def sub_colon(tag_str):
    return re.sub(';',',',tag_str)
def f_connection():
    try:
        conn=pymongo.MongoClient()
        print "Connected successfully!!!"
    except pymongo.errors.ConnectionFailure, e:
        print "Could not connect to MongoDB: %s" % e
        sys.exit(1)
    return conn
def insert_records(conn,stop_dict):
    db = conn.test
    print db.collection_names()
    mycoll = db.stopsETA
    mycoll.insert(stop_dict)
if __name__ == "__main__":
    req_and_parsing()

Why are the videos on the most_recent standard feed so out of date?

I'm trying to grab the most recently uploaded videos. There's a standard feed for that - it's called most_recent. I don't have any problems grabbing the feed, but when I look at the entries inside, they're all half a year old, which is hardly recent.
Here's the code I'm using:
import requests
import os.path as P
import sys
from lxml import etree
import datetime
namespaces = {"a": "http://www.w3.org/2005/Atom", "yt": "http://gdata.youtube.com/schemas/2007"}
fmt = "%Y-%m-%dT%H:%M:%S.000Z"
class VideoEntry:
    """Data holder for the video."""
    def __init__(self, node):
        self.entry_id = node.find("./a:id", namespaces=namespaces).text
        published = node.find("./a:published", namespaces=namespaces).text
        self.published = datetime.datetime.strptime(published, fmt)
    def __str__(self):
        return "VideoEntry[id='%s']" % self.entry_id
def paginate(xml):
    root = etree.fromstring(xml)
    next_page = root.find("./a:link[@rel='next']", namespaces=namespaces)
    if next_page is None:
        next_link = None
    else:
        next_link = next_page.get("href")
    entries = [VideoEntry(e) for e in root.xpath("/a:feed/a:entry", namespaces=namespaces)]
    return entries, next_link
prefix = "https://gdata.youtube.com/feeds/api/standardfeeds/"
standard_feeds = set("top_rated top_favorites most_shared most_popular most_recent most_discussed most_responded recently_featured on_the_web most_viewed".split(" "))
feed_name = sys.argv[1]
assert feed_name in standard_feeds
feed_url = prefix + feed_name
all_video_ids = []
while feed_url is not None:
    r = requests.get(feed_url)
    if r.status_code != 200:
        break
    text = r.text.encode("utf-8")
    video_ids, feed_url = paginate(text)
    all_video_ids += video_ids
all_upload_times = [e.published for e in all_video_ids]
print min(all_upload_times), max(all_upload_times)
As you can see, it prints the min and max timestamps for the entire feed.
misha@misha-antec$ python get_standard_feed.py most_recent
2013-02-02 14:40:02 2013-02-02 14:54:00
misha@misha-antec$ python get_standard_feed.py top_rated
2006-04-06 21:30:53 2013-07-28 22:22:38
I've glanced through the downloaded XML and it appears to match the output. Am I doing something wrong?
Also, on an unrelated note, the feeds I'm getting are all about 100 entries (I'm paginating through them 25 at a time). Is this normal? I expected the feeds to be a bit bigger.

Regarding the "most recent" feed: there is a ticket for this one here. Unfortunately, the YouTube API team has not responded to or solved the problem so far.
Regarding the number of entries: that depends on the type of standard feed, but for the most_recent feed it's usually around 100.
Note: you could try the "orderby=published" parameter to get recent videos, although I don't know how "recent" they are.
https://gdata.youtube.com/feeds/api/videos?orderby=published&prettyprint=True
You can combine this query with the "category" parameter or other ones (region-specific queries, like those for the standard feeds, are not possible, afaik).

How can I talk to UniProt over HTTP in Python?

I'm trying to get some results from UniProt, which is a protein database (details are not important). I'm trying to use some script that translates from one kind of ID to another. I was able to do this manually on the browser, but could not do it in Python.
In http://www.uniprot.org/faq/28 there are some sample scripts. I tried the Perl one and it seems to work, so the problem is my Python attempts. The (working) script is:
## tool_example.pl ##
use strict;
use warnings;
use LWP::UserAgent;
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';
my $params = {
    from => 'ACC', to => 'P_REFSEQ_AC', format => 'tab',
    query => 'P13368 P20806 Q9UM73 P97793 Q17192'
};
my $agent = LWP::UserAgent->new;
push @{$agent->requests_redirectable}, 'POST';
print STDERR "Submitting...\n";
my $response = $agent->post("$base/$tool/", $params);
while (my $wait = $response->header('Retry-After')) {
    print STDERR "Waiting ($wait)...\n";
    sleep $wait;
    print STDERR "Checking...\n";
    $response = $agent->get($response->base);
}
$response->is_success ?
    print $response->content :
    die 'Failed, got ' . $response->status_line .
        ' for ' . $response->request->uri . "\n";
My questions are:
1) How would you do that in Python?
2) Will I be able to massively "scale" that (i.e., use a lot of entries in the query field)?

Question #1:
This can be done using Python's urllib and urllib2:
import urllib, urllib2
import time
import sys
# skip the script name in sys.argv[0]
query = ' '.join(sys.argv[1:])
# encode params as a list of 2-tuples
params = ( ('from','ACC'), ('to', 'P_REFSEQ_AC'), ('format','tab'), ('query', query))
# url encode them
data = urllib.urlencode(params)
url = 'http://www.uniprot.org/mapping/'
# fetch the data
try:
    foo = urllib2.urlopen(url, data)
except urllib2.HTTPError, e:
    if e.code == 503:
        # blah blah get the value of the header...
        wait_time = int(e.hdrs.get('Retry-after', 0))
        print 'Sleeping %i seconds...' % (wait_time,)
        time.sleep(wait_time)
        foo = urllib2.urlopen(url, data)
    else:
        raise
# foo is a file-like object, do with it what you will.
foo.read()
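The snippet above targets Python 2. A rough Python 3 equivalent using `urllib.request` is sketched below. The URL and parameter names are copied from the answer and may no longer match the live UniProt service; `encode_params` and `fetch_mapping` are names made up for this example, and the retry logic assumes `Retry-After` holds a number of seconds.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

# Endpoint as given in the answer above; it may be outdated.
MAPPING_URL = "http://www.uniprot.org/mapping/"


def encode_params(ids, source="ACC", target="P_REFSEQ_AC"):
    """URL-encode the form fields and return them as bytes for a POST body."""
    params = [
        ("from", source),
        ("to", target),
        ("format", "tab"),
        ("query", " ".join(ids)),
    ]
    return urllib.parse.urlencode(params).encode("ascii")


def fetch_mapping(ids, retries=3):
    """POST the query, sleeping and retrying when the server answers 503."""
    data = encode_params(ids)
    for _ in range(retries):
        try:
            with urllib.request.urlopen(MAPPING_URL, data) as response:
                return response.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise
            # Assumes Retry-After is given in seconds, not as an HTTP date.
            time.sleep(int(err.headers.get("Retry-After", 1)))
    raise RuntimeError("service still unavailable after %d attempts" % retries)
```

Calling `fetch_mapping(["P13368", "P20806"])` would perform the network request; nothing is sent until that call is made.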

You're probably better off using the Protein Identifier Cross Reference service from the EBI to convert one set of IDs to another. It has a very good REST interface.
http://www.ebi.ac.uk/Tools/picr/
I should also mention that UniProt has very good web services available, though if you are tied to simple HTTP requests for some reason, they are probably not useful to you.

Let's assume that you are using Python 2.5.
We can use httplib to directly call the web site:
import httplib, urllib
querystring = {}
#Build the query string here from the following keys (query, format, columns, compress, limit, offset)
querystring["query"] = ""
querystring["format"] = "" # one of html | tab | fasta | gff | txt | xml | rdf | rss | list
querystring["columns"] = "" # the columns you want, comma separated
querystring["compress"] = "" # yes or no
## These may be optional
querystring["limit"] = "" # I guess if you only want a few rows
querystring["offset"] = "" # bring on paging
##From the examples - query=organism:9606+AND+antigen&format=xml&compress=no
##Delete the following and replace with your query
querystring = {}
querystring["query"] = "organism:9606 AND antigen"
querystring["format"] = "xml" #make it human readable
querystring["compress"] = "no" #I don't want to have to unzip
conn = httplib.HTTPConnection("www.uniprot.org")
conn.request("GET", "/uniprot/?"+ urllib.urlencode(querystring))
r1 = conn.getresponse()
if r1.status == 200:
    data1 = r1.read()
    print data1 #or do something with it
You could then make a function around creating the query string and you should be away.
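One possible shape for such a function is sketched here in Python 3 syntax (`urllib.parse` replaces Python 2's `urllib.urlencode`). The name `build_query_path` and its defaults are illustrative assumptions; the `/uniprot/?` path and the parameter names are taken from the code above.

```python
from urllib.parse import urlencode


def build_query_path(query, fmt="xml", compress="no", **extra):
    """Return the request path for a query, skipping optional keys set to None."""
    params = {"query": query, "format": fmt, "compress": compress}
    # Optional keys such as limit, offset or columns are added only when given.
    params.update({key: value for key, value in extra.items() if value is not None})
    return "/uniprot/?" + urlencode(params)


path = build_query_path("organism:9606 AND antigen", limit=10)
```

The returned path can be passed straight to `conn.request("GET", path)`.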

Check out bioservices; it provides Python interfaces to a lot of databases.
https://pythonhosted.org/bioservices/_modules/bioservices/uniprot.html
conda install bioservices --yes

To complement O.rka's answer:
Question 1:
from bioservices import UniProt
u = UniProt()
res = u.get_df("P13368 P20806 Q9UM73 P97793 Q17192".split())
This returns a dataframe with all information about each entry.
Question 2: same answer. This should scale up.
Disclaimer: I'm the author of bioservices

There is a Python package on PyPI that does exactly what you want:
pip install uniprot-mapper
