I am actively running some Python code in Jupyter on a df consisting of about 84k rows, and I'm estimating this is going to take somewhere in the neighborhood of 9 hours at this rate. My code is below. I have read that ideally one would vectorize for max speed, but being sort of new to Python and coding in general, I'm not sure how I could go about changing the code below to vectorize it. The goal is to take the value in the first column of the dataframe, add it to the end of a URL, fetch the first line of the response, and compare it to some predetermined values to find out if there is a match. Any advice would be greatly appreciated!
#Python 3
import pandas as pd
import urllib.request

no_res = "Item Not Found"
error = "Does Not Exist"

for i in df1.index:
    path = 'http://xxxx/xxx/xxx.pl?part=' + str(df1['ITEM_ID'][i])
    parsed_path = path.replace(' ', '%20')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = lines[0]
    if r == no_res:
        sap = 'NO'
    elif r == error:
        sap = 'ERROR'
    else:
        sap = 'YES'
    df1["item exists"][i] = sap
    df1["Path"][i] = path
    df1["URL return value"][i] = r
Edit: adding test code below
import concurrent.futures
import pandas as pd
import urllib.request
import numpy as np

def my_func(df_row):
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ', '%20')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = df_row['0']
    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"
    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r

n = 1000
my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
my_df['4'] = ""
my_df['5'] = ""
my_df['6'] = ""
my_df = my_df.apply(lambda col: col.astype('category'))

executor = concurrent.futures.ProcessPoolExecutor(8)
futures = [executor.submit(my_func, row) for _, row in my_df.iterrows()]
concurrent.futures.wait(futures)
This is throwing the following error (shortened):
DoneAndNotDoneFutures(done={<Future at 0x1cfe4938040 state=finished raised BrokenProcessPool>, <Future at 0x1cfe48b8040 state=finished raised BrokenProcessPool>,
Since you are doing an outside operation over a URL, I do not think vectorization is a solution here (or even possible).
The bottleneck of your operation is the following line
f = urllib.request.urlopen(parsed_path)
This line waits for the response and blocks; as mentioned, your operation is I/O bound. The CPU could be doing other work while waiting for the response, and the way to exploit that is concurrency.
Edit: My original answer used Python's built-in multithreading, which was problematic. The best way to do multiprocessing/threading with a pandas dataframe is to use the "dask" library.
The following code was tested with the dummy data set on my PC and on average speeds up the naive for loop by ~12 times.
#%%
import time
import urllib.request
import pandas as pd
import numpy as np
import dask.dataframe as dd

def my_func(df_row):
    df_row = df_row.copy()
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ', '%20')
    f = urllib.request.urlopen(parsed_path)
    # I had to change the encoding on my machine.
    raw = str(f.read().decode("windows-1252"))
    lines = raw.split('\n')
    r = df_row[0]
    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"
    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r
    return df_row

def run():
    print("started.")
    n = 1000
    my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
    my_df = my_df.apply(lambda col: col.astype('category'))
    my_df['4'] = ""
    my_df['5'] = ""
    my_df['6'] = ""
    # dask literally partitions the original dataframe into
    # npartitions chunks and uses them in the apply function
    # in parallel.
    my_ddf = dd.from_pandas(my_df, npartitions=15)

    start = time.time()
    q = my_ddf.apply(my_func, axis=1, meta=my_ddf)
    # num_workers is the number of threads used
    print(q.compute(num_workers=50))
    time_end = time.time()
    print(f"Elapsed: {time_end - start:10.2f}")

if __name__ == "__main__":
    run()
dask provides many other tools and options to facilitate concurrent processing, and it would be a good idea to take a look at its documentation to investigate them.
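For example, another option from that toolbox is dask.delayed, which builds one lazy task per row and runs them on a thread pool. A minimal sketch, where fetch_row stands in for a per-row function like my_func above:

import dask
from dask import delayed

# fetch_row is assumed here: a function like my_func above that takes one row
tasks = [delayed(fetch_row)(row) for _, row in my_df.iterrows()]
# run all tasks on a thread pool; results come back as a tuple
results = dask.compute(*tasks, scheduler="threads", num_workers=50)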
P.S.: if you run the above code too many times against Google you will receive "HTTP Error 429: Too Many Requests". Servers do this to prevent something like a DDoS attack. So if your real job queries a public website, you may end up receiving the same 429 response if you fire 84K queries in a short time.
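If you do run into 429s, a common mitigation is to retry with an increasing delay. A rough sketch with plain urllib, not tied to any particular endpoint:

import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, retries=5, base_delay=1.0):
    """Retry a GET with exponential backoff when the server answers 429."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as f:
                return f.read().decode("utf-8")
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still rate limited after %d retries" % retries)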
Related
I am studying bitcoin.
https://en.bitcoin.it/wiki/Protocol_documentation#Merkle_Trees
I read the page above and implemented the Merkle root calculation in Python.
Using the API below, I collected all transactions in block 641150 and calculated the Merkle Root.
https://www.blockchain.com/explorer/api/blockchain_api
The following is the expected value
67a637b1c49d95165b3dd3177033adbbbc880f6da3620498d451ee0976d7b1f4
(https://www.blockchain.com/btc/block/641150 )
The value I calculated was as follows:
f2a2207a1e8360b75729fd2f23659b1b79b14940b6e4982a985cf6aa6f941ad7
What is wrong?
My Python code is:
from hashlib import sha256
import requests, json

base_url = 'https://blockchain.info/rawblock/'
block_hash = '000000000000000000042cef688cf40b4a70ac814e4222e6646bd6bb79d18168'
end_point = base_url + block_hash

def reverse(hex_be):
    bytes_be = bytes.fromhex(hex_be)
    bytes_le = bytes_be[::-1]
    hex_le = bytes_le.hex()
    return hex_le

def dhash(hash):
    return sha256(sha256(hash.encode('utf-8')).hexdigest().encode('utf-8')).hexdigest()

def culculate_merkle(hash_list):
    if len(hash_list) == 1:
        return dhash(hash_list[0])
    hashed_list = list(map(dhash, hash_list))
    if len(hashed_list) % 2 == 1:
        hashed_list.append(hashed_list[-1])
    parent_hash_list = []
    it = iter(hashed_list)
    for hash1, hash2 in zip(it, it):
        parent_hash_list.append(hash1 + hash2)
    hashed_list = list(map(dhash, hash_list))
    return culculate_merkle(parent_hash_list)

data = requests.get(end_point)
jsondata = json.loads(data.text)
tx_list = list(map(lambda tx_object: tx_object['hash'], jsondata['tx']))

markleroot = '67a637b1c49d95165b3dd3177033adbbbc880f6da3620498d451ee0976d7b1f4'

tx_list = list(map(reverse, tx_list))
output = culculate_merkle(tx_list)
output = reverse(output)
print(output)
result
$ python merkleTree.py
f2a2207a1e8360b75729fd2f23659b1b79b14940b6e4982a985cf6aa6f941ad7
I expect the following output as a result
67a637b1c49d95165b3dd3177033adbbbc880f6da3620498d451ee0976d7b1f4
Self-solved.
First: there was a mistake in the dhash calculation method (it hashed hex text instead of raw bytes).
Second: the hashes obtained from the API have already had dhash applied once (they are txids).
⇒ My code applied dhash to the leaves once more without taking this into account.
The complete source code is summarized in the following blog post (in Japanese):
https://tec.tecotec.co.jp/entry/2022/12/25/000000
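For reference, a minimal sketch of the corrected calculation reflecting both points: dhash works on raw bytes rather than hex text, and the leaves are not hashed again because the txids returned by the API are already double-SHA256 hashes.

from hashlib import sha256

def dhash(b):
    # double SHA-256 over raw bytes, not over a hex string
    return sha256(sha256(b).digest()).digest()

def merkle_root(txids_be):
    # txids from the API are big-endian hex; Bitcoin hashes internally as little-endian bytes
    level = [bytes.fromhex(h)[::-1] for h in txids_be]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        # the leaves are already double-SHA256 txids, so only parent nodes are hashed
        level = [dhash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0][::-1].hex()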
I'm a hobby coder who started with AHK, then some Java, and now I'm trying to learn Python. I have searched and found some tips, but I have not yet been able to implement them in my own code.
Hopefully someone here can help me, it's a very short program.
I'm using a .txt CSV database with ";" as a separator.
DATABASE EXAMPLE:
Which color is normally a cat?;Black
How tall was the longest man on earth?;272 cm
Is the earth round?;Yes
The database now consists of 20,000 lines, which makes the program too slow; it only uses 25% CPU (1 core).
If I can make it use all 4 cores (100%) I guess it would perform the task a lot faster. The task is basically to compare the CLIPBOARD with the database and, if there is a match, return the answer. Perhaps I could also separate the database into 4 pieces?
The code right now looks like this! Not more than 65 lines and it's doing its job (but too slowly). Advice on how I can make this process multi-core is needed.
import time
import pyperclip as pp
import pandas as pd
import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy

ratio_threshold = 90
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'

def load_db():
    while True:
        try:
            # Read and create database
            db = pd.read_csv(db_file_path, sep=db_separator, encoding=db_encoding)
            db = db.drop_duplicates()
            return db
        except:
            print("Error in load_db(). Will sleep for %i seconds..." % fall_back_time)
            time.sleep(fall_back_time)

def top_answers(db, question):
    db['ratio'] = db['question'].apply(lambda q: fuzz.ratio(q, question))
    db_sorted = db.sort_values(by='ratio', ascending=False)
    db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
    return db_sorted

def write_txt(top):
    result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
    result = '\n'.join(result)
    fileHandle = open("svar.txt", "w")
    fileHandle.write(result)
    fileHandle.close()
    pp.copy("")

def main():
    try:
        db = load_db()
        last_db_reload = time.time()
        while True:
            # Get contents of clipboard
            question = pp.paste()
            # Rank answer
            top = top_answers(db, question)
            # If answer was found, show results
            if len(top) > 0:
                write_txt(top)
            time.sleep(fall_back_time)
    except:
        print("Error in main(). Will sleep for %i seconds..." % fall_back_time)
        time.sleep(fall_back_time)

if __name__ == '__main__':
    main()
If you could divide the db into four equally large parts, you could process them in parallel like this:
import time
import pyperclip as pp
import pandas as pd
import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy
import threading

ratio_threshold = 90
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'

def worker(thread_id, question):
    thread_id = str(thread_id)
    db = pd.read_csv(db_file_path + thread_id, sep=db_separator, encoding=db_encoding)
    db = db.drop_duplicates()
    db['ratio'] = db['question'].apply(lambda q: fuzz.ratio(q, question))
    db_sorted = db.sort_values(by='ratio', ascending=False)
    db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
    top = db_sorted
    result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
    result = '\n'.join(result)
    fileHandle = open("svar" + thread_id + ".txt", "w")
    fileHandle.write(result)
    fileHandle.close()
    pp.copy("")
    return

def main():
    question = pp.paste()
    threads = []
    # start all four workers first, then wait for them,
    # so they actually run concurrently
    for i in range(1, 5):
        t = threading.Thread(target=worker, args=(i, question))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
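For completeness, a rough sketch of producing the four chunk files the worker above expects (database.txt1 .. database.txt4), assuming database.txt has a question;answer header row as load_db() in the question implies:

import numpy as np
import pandas as pd

db = pd.read_csv('database.txt', sep=';', encoding='latin-1')
db = db.drop_duplicates()
# write four roughly equal chunks, keeping the header so the workers can read them back
for i, chunk in enumerate(np.array_split(db, 4), start=1):
    chunk.to_csv('database.txt' + str(i), sep=';', index=False, encoding='latin-1')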
The solution with multiprocessing:
import time
import pyperclip as pp
import pandas as pd
#import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy as np
# pathos uses a better pickle to transfer more complicated objects
from pathos.multiprocessing import Pool
from functools import reduce
import sys
import os
from contextlib import closing

ratio_threshold = 70
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'
chunked_db = []
NUM_PROCESSES = os.cpu_count()

def load_db():
    while True:
        try:
            # Read and create database
            db = pd.read_csv(db_file_path, sep=db_separator, encoding=db_encoding)
            db.columns = ['question', 'answer']
            #db = db.drop_duplicates()  # dropped for the experiment
            break
        except:
            print("Error in load_db(). Will sleep for %i seconds..." % fall_back_time)
            time.sleep(fall_back_time)
    # split the database into equal chunks
    # (if you have a lot of RAM; otherwise you
    # need to compute ranges in db, something like
    # chunk_size = len(db)//NUM_PROCESSES
    # ranges[i] = (i*chunk_size, (i+1)*chunk_size)
    # and pass ranges of the original db to the processes)
    chunked_db = np.array_split(db, NUM_PROCESSES)
    return chunked_db

def top_answers_multiprocessed(question, chunked_db):
    # On unix, python uses 'fork' mode by default,
    # so a process has copy-on-write access to all global variables,
    # i.e. if a process changes something in db, that data gets copied into it
    # with a lot of overhead.
    # Unfortunately, I've heard that on Windows only 'spawn' mode with a full
    # copy of everything is used.
    # The process pipeline uses pickle, which is quite slow,
    # so on a small database you may not benefit from multiprocessing.
    # If you are going to transfer big objects in or out, look
    # in the direction of multiprocessing.Array.
    # This solution is not fully efficient,
    # as the pool is recreated each time.
    # You could create daemon processes which monitor a
    # Queue for incoming questions, but that is harder to implement.
    def top_answers(idx):
        # question is in the scope of the parent function
        chunked_db[idx]['ratio'] = chunked_db[idx]['question'].apply(lambda q: fuzz.ratio(q, question))
        db_sorted = chunked_db[idx].sort_values(by='ratio', ascending=False)
        db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
        return db_sorted

    with closing(Pool(processes=NUM_PROCESSES)) as pool:
        # chunked_db is a list of dataframes;
        # they are in global scope, so we send only the index, because
        # otherwise the whole data set would be pickled
        num_chunks = len(chunked_db)
        # apply function top_answers across the generator range(num_chunks)
        res = pool.imap_unordered(top_answers, range(num_chunks))
        res = list(res)
    # now res is a list of dataframes; stack them into one
    res_final = pd.concat(res).sort_values(by='ratio', ascending=False)
    return res_final

def write_txt(top):
    result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
    result = '\n'.join(result)
    fileHandle = open("svar.txt", "w")
    fileHandle.write(result)
    fileHandle.close()
    pp.copy("")

def mainfunc():
    global chunked_db
    chunked_db = load_db()
    last_db_reload = time.time()
    print('db loaded')
    last_clip = ""
    while True:
        # Get contents of clipboard
        try:
            new_clip = pp.paste()
        except:
            continue
        if (new_clip != last_clip) and (len(new_clip) > 0):
            print(new_clip)
            last_clip = new_clip
            question = new_clip.strip()
        else:
            continue
        # Rank answer
        top = top_answers_multiprocessed(question, chunked_db)
        # If answer was found, show results
        if len(top) > 0:
            #write_txt(top)
            print(top)

if __name__ == '__main__':
    mainfunc()
I'm trying to build a list of parent/comment pairs from the publicly available Reddit data set.
I have a CSV file which I load into a pandas dataframe; it contains rows of comments with the parent and child IDs, as well as the child comment. The data is loaded using the following block of code:
import os
import time
import multiprocessing as mp
import numpy as np
import pandas as pd

sourcePATH = 'C:\\'
workingFILE = r'\output-pt1.csv'
# filepaths
input_file = sourcePATH + workingFILE
data_df = pd.read_csv(input_file, header=None,
                      names=['PostIDX', 'ParentIDX', 'Comment', 'Score', 'Controversiality'])
The aim is to scan through each row in the dataframe and use the parent id to search through the rest of the dataframe to see if there is a parent comment present. If there is, I then store the child and parent comments in a tuple with some other information, which is added to a list and written out to a csv file at the end. To do this I use the following code:
def checkChildParent(ParentIDX_curr, ChildIDX_curr, ChildComment_curr, ChildScore_curr, ChildCont_curr):
    idx = data_df.loc[data_df['PostIDX'] == ParentIDX_curr]
    if idx.empty is False:
        ParentComment = idx.iloc[0,2]
        ParentScore = idx.iloc[0,3]
        ParentCont = idx.iloc[0,4]
        outPut.put([ParentIDX_curr[0], ParentComment, ParentScore, ParentCont, ChildIDX_curr[0], ChildComment_curr[0], ChildScore_curr[0], ChildCont_curr[0]])

if __name__ == '__main__':
    print('Process started')
    t_start_init = time.time()
    t_start = time.time()

    noCores = 1
    #pool = mp.Pool(processes=noCores)
    update_freq = 100
    n = 1000
    #n = round(len(data_df)/8)
    flag_create = 0
    flag_run = 0
    i = 0
    outPut = mp.Queue()

    #parent_child_df = pd.DataFrame()
    #parent_child_df.coumns = ['PostIDX','ParentIDX']

    while i < n:
        #print(i)
        procs = []
        ParentIDX = []
        ParentComment = []
        ParentScore = []
        ParentCont = []
        ChildIDX = []
        ChildComment = []
        ChildScore = []
        ChildCont = []

        for worker in range(0, noCores):
            ParentIDX.append(data_df.iloc[i,1])
            ChildIDX.append(data_df.iloc[i,0])
            ChildComment.append(data_df.iloc[i,2])
            ChildScore.append(data_df.iloc[i,3])
            ChildCont.append(data_df.iloc[i,4])
            i = i + 1

        #when I call the function this way it returns the expected matches
        #checkChildParent(ParentIDX,ChildIDX,ChildComment,
        #                 ChildScore,ChildCont)

        #when I call the function with the Process function nothing appears to be happening
        for proc in range(0, noCores):
            p = mp.Process(target=checkChildParent, args=(ParentIDX[proc], ChildIDX[proc], ChildComment[proc], ChildScore[proc], ChildCont[proc]))
            procs.append(p)
            p.start()

        #for p in procs:
        #    p.join()

        if outPut.empty() is False:
            print(outPut.get())
At the top of the file is a function which scans the dataframe for a given row and returns a tuple of the matched parent and child comments if a match was found. If I call this function normally it works fine, however when I call it using the Process function it doesn't match anything! I'm guessing it's the form of the arguments being passed to the function that is causing the issue, but I have been trying to debug this all afternoon and have failed so far. If anyone has any suggestions then please let me know!
Thanks!
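As a side note on the pairing logic described in the question, the parent lookup can also be expressed as a single pandas merge instead of a row-by-row scan. A minimal sketch, using the column names from the read_csv call above (the output filename is just an example):

# join each child row to the row whose PostIDX matches its ParentIDX
pairs = data_df.merge(data_df, left_on='ParentIDX', right_on='PostIDX',
                      suffixes=('_child', '_parent'))
# each row now carries a child comment alongside its parent comment
pairs = pairs[['ParentIDX_child', 'Comment_parent', 'Score_parent', 'Controversiality_parent',
               'PostIDX_child', 'Comment_child', 'Score_child', 'Controversiality_child']]
pairs.to_csv('parent_child_pairs.csv', index=False)  # example output path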
I'm new here to StackOverflow, but I have found a LOT of answers on this site. I'm also a programming newbie, so I figured I'd join and finally become part of this community, starting with a question about a problem that's been plaguing me for hours.
I log in to a website and scrape a big body of text within the <b> tag to be converted into a proper table. The layout of the resulting Output.txt looks like this:
BIN STATUS
8FHA9D8H 82HG9F RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
93827548 096DBR RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
There are a bunch of pages with the exact same blocks, but I need them to be combined into an ACTUAL table that looks like this:
BIN INV CODE STATUS
HA8DHW2HHD0138 FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU00A123 FPBC-*SOUP CANS LENTILS #2956- INVALID STOCK COUPON CODE (MISSING).
93827548096DBR FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8FHA9D8H82HG9F SSXR-98-20LM NM CORN CREAM RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
Essentially, all separate text blocks in this example would become part of this table, with the inventory code repeated alongside its BIN values. I would post my attempts at parsing this data (I have tried pandas/bs4/openpyxl/csv writer), but I'll admit they are a little embarrassing, as I cannot find any information on this specific problem. Is there any benevolent soul out there who can help me out? :)
(Also, I am using Python 2.7.)
A simple custom parser like the following should do the trick.
from __future__ import print_function

def parse_body(s):
    line_sep = '\n'
    getting_bins = False
    inv_code = ''
    for l in s.split(line_sep):
        if l.startswith('INVENTORY CODE:') and not getting_bins:
            inv_data = l.split()
            inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
        elif l.startswith('INVENTORY CODE:') and getting_bins:
            print("unexpected inventory code while reading bins:", l)
        elif l.startswith('BIN') and l.endswith('STATUS'):
            # header line of a block ("BIN ... STATUS" in the sample data)
            getting_bins = True
        elif getting_bins == True and l:
            bin_data = l.split()
            # need to add exception handling here to make sure:
            # 1) we have an inv_code
            # 2) bin_data is at least 3 items big (assuming two for
            #    bin_id and at least one for message)
            # 3) maybe some constraint checking to ensure that we have
            #    a valid instance of an inventory code and bin id
            bin_id = ''.join(bin_data[0:2])
            message = ' '.join(bin_data[2:])
            # we now have a bin, an inv_code, and a message to add to our table
            print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
        elif getting_bins == True and not l:
            # done getting bins for the current inventory code
            getting_bins = False
            inv_code = ''
A rather complex one, but this might get you started:
import re, pandas as pd
from pandas import DataFrame

rx = re.compile(r'''
    (?:INVENTORY\ CODE:)\s*
    (?P<inv>.+\S)
    [\s\S]+?
    ^BIN.+[\n\r]
    (?P<bin_msg>(?:(?!^\ ).+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)

string = your_string_here

# set up the dataframe
df = DataFrame(columns = ['BIN', 'INV', 'MESSAGE'])

for match in rx.finditer(string):
    inv = match.group('inv')
    bin_msg_raw = match.group('bin_msg').split("\n")
    rxbinmsg = re.compile(r'^(?P<bin>(?:(?!\ {2}).)+)\s+(?P<message>.+\S)\s*$', re.MULTILINE)
    for item in bin_msg_raw:
        for m in rxbinmsg.finditer(item):
            # append it to the dataframe
            df.loc[len(df.index)] = [m.group('bin'), inv, m.group('message')]

print(df)
Explanation
It looks for INVENTORY CODE and sets up the groups (inv and bin_msg) for further processing afterwards (note: it would be easier if you had only one bin/msg line, as you need to split the group afterwards).
It then splits the bin and msg parts and appends everything to the df object.
I had some code written for website scraping which may help you.
Basically, what you need to do is right-click on the web page, go to the HTML, and find the tag for the table you are looking for, then extract the information using a parsing module (I am using Beautiful Soup). I am creating a dict because I need to store it in MongoDB; you can create a table instead.
#! /usr/bin/python
import sys
import requests
import re
from BeautifulSoup import BeautifulSoup
import pymongo

def req_and_parsing():
    url2 = 'http://businfo.dimts.in/businfo/Bus_info/EtaByRoute.aspx?ID='

    list1 = ['534UP','534DOWN']

    for Route in list1:
        final_url = url2 + Route
        #r = requests.get(final_url)
        #parsing_file(r.text,Route)

    outdict = []
    outdict = [parsing_file( requests.get(url2+Route).text,Route) for Route in list1 ]
    print outdict
    conn = f_connection()
    for i in range(len(outdict)):
        insert_records(conn,outdict[i])

def parsing_file(txt,Route):
    soup = BeautifulSoup(txt)
    table = soup.findAll("table",{"id" : "ctl00_ContentPlaceHolder1_GridView2"})
    #trtags = table[0].findAll('tr')
    tdlist = []
    trtddict = {}
    """
    for trtag in trtags:
        print 'print trtag- ' , trtag.text
        tdtags = trtag.findAll('td')
        for tdtag in tdtags:
            print tdtag.text
    """
    divtags = soup.findAll("span",{"id":"ctl00_ContentPlaceHolder1_ErrorLabel"})
    for divtag in divtags:
        print "div tag - " , divtag.text
        if divtag.text in ("Currently no bus is running on this route",
                           "This is not a cluster (orange bus) route"):
            print "Page not displayed. Errored with below message for Route-", Route, " , ", divtag.text
            sys.exit()

    trtags = table[0].findAll('tr')
    for trtag in trtags:
        tdtags = trtag.findAll('td')
        if len(tdtags) == 2:
            trtddict[tdtags[0].text] = sub_colon(tdtags[1].text)
    return trtddict

def sub_colon(tag_str):
    return re.sub(';',',',tag_str)

def f_connection():
    try:
        conn=pymongo.MongoClient()
        print "Connected successfully!!!"
    except pymongo.errors.ConnectionFailure, e:
        print "Could not connect to MongoDB: %s" % e
    return conn

def insert_records(conn,stop_dict):
    db = conn.test
    print db.collection_names()
    mycoll = db.stopsETA
    mycoll.insert(stop_dict)

if __name__ == "__main__":
    req_and_parsing()
I'm parsing the US Patent XML files (downloaded from the Google patent dumps) using Python and BeautifulSoup; the parsed data is exported to a MySQL database.
Each year's data contains close to 200-300K patents, which means parsing 200-300K XML files.
The server on which I'm running the Python script is pretty powerful (16 cores, 160 GB of RAM, etc.), but it is still taking close to 3 days to parse one year's worth of data.
I've been learning and using Python for 2 years, so I can get stuff done but do not know how to get it done in the most efficient manner. I'm reading up on it.
How can I optimize the below script to make it efficient?
Any guidance would be greatly appreciated.
Below is the code:
from bs4 import BeautifulSoup
import pandas as pd
from pandas.core.frame import DataFrame
import MySQLdb as db
import os

cnxn = db.connect('xx.xx.xx.xx','xxxxx','xxxxx','xxxx',charset='utf8',use_unicode=True)

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

def get_data(soup):
    df = pd.DataFrame(columns = ['doc_id','patcit_num','patcit_document_id_country', 'patcit_document_id_doc_number','patcit_document_id_kind','patcit_document_id_name','patcit_document_id_date','category'])

    if soup.findAll('us-citation'):
        cit = soup.findAll('us-citation')
    else:
        cit = soup.findAll('citation')

    doc_id = soup.findAll('publication-reference')[0].find('doc-number').text

    for x in cit:
        try:
            patcit_num = x.find('patcit')['num']
        except:
            patcit_num = None
        try:
            patcit_document_id_country = x.find('country').text
        except:
            patcit_document_id_country = None
        try:
            patcit_document_id_doc_number = x.find('doc-number').text
        except:
            patcit_document_id_doc_number = None
        try:
            patcit_document_id_kind = x.find('kind').text
        except:
            patcit_document_id_kind = None
        try:
            patcit_document_id_name = x.find('name').text
        except:
            patcit_document_id_name = None
        try:
            patcit_document_id_date = x.find('date').text
        except:
            patcit_document_id_date = None
        try:
            category = x.find('category').text
        except:
            category = None

        print doc_id
        val = {'doc_id':doc_id,'patcit_num':patcit_num, 'patcit_document_id_country':patcit_document_id_country,'patcit_document_id_doc_number':patcit_document_id_doc_number, 'patcit_document_id_kind':patcit_document_id_kind,'patcit_document_id_name':patcit_document_id_name,'patcit_document_id_date':patcit_document_id_date,'category':category}
        df = df.append(val, ignore_index=True)

    df.to_sql(name = 'table_name', con = cnxn, flavor='mysql', if_exists='append')
    print '1 doc exported'

i = 0
l = os.listdir('/path/')
for item in l:
    f = '/path/'+item
    print 'Currently parsing - ',item
    for xml_string in separated_xml(f):
        soup = BeautifulSoup(xml_string,'xml')
        if soup.find('us-patent-grant'):
            print item, i, xml_string[177:204]
            get_data(soup)
        else:
            print item, i, xml_string[177:204],'***********************************soup not found********************************************'
        i+=1

print 'DONE!!!'
Look into a tutorial on multi-threading, because currently that code runs on 1 thread, on 1 core.
Remove all the try/except statements and handle the code properly; exceptions are expensive.
Run a profiler to find the chokepoints, then multi-thread those parts or find a way to run them fewer times.
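For the profiling step, a minimal sketch with the standard library's cProfile, assuming the parsing loop is wrapped in a function (here hypothetically called parse_all):

import cProfile
import pstats

# profile the whole run and dump the stats to a file
cProfile.run('parse_all()', 'parse_stats')
# print the 20 most expensive functions by cumulative time
stats = pstats.Stats('parse_stats')
stats.sort_stats('cumulative').print_stats(20)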
So, you're doing two things wrong. First, you're using BeautifulSoup, which is slow, and second, you're using a "find" call, which is also slow.
As a first cut, look at lxml's ability to pre-compile XPath queries (see the heading "The XPath class" in its docs). That will give you a huge speed boost.
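For illustration, a minimal sketch of precompiled XPath with lxml; the tag names are taken from the question's code, and the extract helper is just an example:

from lxml import etree

# compile the XPath expressions once, then reuse them for every document
get_doc_number = etree.XPath('.//publication-reference//doc-number/text()')
get_citations = etree.XPath('.//us-citation | .//citation')

def extract(xml_bytes):
    # parse one patent document and apply the precompiled expressions
    root = etree.fromstring(xml_bytes)
    doc_id = get_doc_number(root)
    citations = get_citations(root)
    return doc_id, len(citations)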
Alternatively, I've been working on a library, called yankee, that does this kind of parsing declaratively, using best practices for lxml speed, including precompiled XPath.
Yankee on PyPI | Yankee on GitHub
You could do the same thing with yankee like this:
from yankee.xml import Schema, fields as f

# Create a schema for citations
class Citation(Schema):
    num = f.Str(".//patcit")
    country = f.Str(".//country")
    # ... and so forth for the rest of your fields

# Then create a "wrapper" to get all the citations
class Patent(Schema):
    citations = f.List(".//us-citation|.//citation")

# Then just feed the Schema your lxml.etrees for each patent:
import lxml.etree as ET

schema = Patent()

# iterparse expects a file path or file-like object rather than a string
for _, doc in ET.iterparse(xml_file, events=("end",), tag="us-patent-grant"):
    result = schema.load(doc)
The result will look like this:
{
    "citations": [
        {
            "num": "<some value>",
            "country": "<some value>",
        },
        {
            "num": "<some value>",
            "country": "<some value>",
        },
    ]
}
I would also check out Dask to help you multithread it more efficiently. Pretty much all my projects use it.
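For example, a rough dask.bag sketch, assuming a parse_one(path) helper that parses a single XML file and returns a list of row dicts:

import os
import dask.bag as db

# parse_one is assumed: it takes a file path and returns a list of row dicts
paths = ['/path/' + name for name in os.listdir('/path/')]
rows = db.from_sequence(paths).map(parse_one).flatten()
# run the parsing across multiple processes and collect the results
records = rows.compute(scheduler='processes')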