I am extracting data from a large PDF file with regex, using Python in Databricks. The extracted data comes back as one long string, and I use the string split function to turn it into a pandas DataFrame because I want the final data as a CSV file. The split/concat step takes about five hours to run, and I am looking for ways to optimize it. I am new to Python and not sure which part of the code I should look at to reduce this running time.
for pdf in os.listdir(data_directory):
    # create a file object
    file = open(data_directory + pdf, 'rb')
    # create a PDF reader object
    fileReader = PyPDF2.PdfFileReader(file)
    num_pages = fileReader.numPages
    #print("total pages = " + str(num_pages))
    extracted_string = "start of file"
    current_page = 0
    while current_page < num_pages:
        #print("adding page " + str(current_page) + " to the file")
        extracted_string += (fileReader.getPage(current_page).extract_text())
        current_page = current_page + 1
    regex_date = r"\d{2}\/\d{2}\/\d{4}[^\n]*"
    table_lines = re.findall(regex_date, extracted_string)
The code above extracts the data from the PDF.
#create dataframe out of extracted string and load into a single dataframe
for line in table_lines:
    df = pd.DataFrame([x.split(' ') for x in line.split('\n')])
    df.rename(columns={0: 'date_of_import', 1: 'entry_num', 2: 'warehouse_code_num', 3: 'declarant_ref_num', 4: 'declarant_EORI_num', 5: 'VAT_due'}, inplace=True)
    table = pd.concat([table, df], sort=False)
This part of the code is what takes up most of the time. I have tried different ways to get a DataFrame out of this data, but the above has worked best for me. I am looking for a faster way to run this code.
PDF file for reference: https://drive.google.com/file/d/1ew3Fw1IjeToBA-KMbTTD_hIINiQm0Bkg/view?usp=share_link
There are two immediate optimization steps in your code.
Pre-compile regexes if they are used many times. It may or may not be relevant here, because I could not guess how many times table_lines = re.findall(regex_date, extracted_string) is executed, but this is often more efficient:
# before any loop
regex_date = re.compile(r"\d{2}\/\d{2}\/\d{4}[^\n]*")
...
# inside the loop
table_lines = regex_date.findall(extracted_string)
Do not repeatedly append to a DataFrame. A DataFrame is a rather complex container, and appending rows is a costly operation. It is generally much more efficient to build a plain Python container (list or dict) first and then convert it to a DataFrame as a whole:
data = [line.split(' ') for line in table_lines]
table = pd.DataFrame(data, columns=['date_of_import', 'entry_num',
                                    'warehouse_code_num', 'declarant_ref_num',
                                    'declarant_EORI_num', 'VAT_due'])
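Putting the two together, here is a minimal sketch of the whole extraction (assuming data_directory is already defined and that each matched line splits into exactly the six fields above):
import os
import re
import PyPDF2
import pandas as pd

# compiled once, outside any loop
regex_date = re.compile(r"\d{2}\/\d{2}\/\d{4}[^\n]*")
columns = ['date_of_import', 'entry_num', 'warehouse_code_num',
           'declarant_ref_num', 'declarant_EORI_num', 'VAT_due']

rows = []
for pdf in os.listdir(data_directory):
    with open(os.path.join(data_directory, pdf), 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        # join the page texts once instead of += on a growing string
        text = "\n".join(reader.getPage(p).extract_text()
                         for p in range(reader.numPages))
    # one split per matched line, collected in a plain Python list
    rows.extend(line.split(' ') for line in regex_date.findall(text))

table = pd.DataFrame(rows, columns=columns)  # single DataFrame construction
table.to_csv('output.csv', index=False)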
Related
I am trying to analyze a large dataset from Yelp. The data is in JSON format, but it is too large, so the script crashes when it tries to read all of the data at once. So I decided to read it line by line and concatenate the lines into a DataFrame to get a proper sample of the data.
f = open('./yelp_academic_dataset_review.json', encoding='utf-8')
I tried it without encoding='utf-8', but that raises an error.
I created a function that reads the file line by line and builds a pandas DataFrame up to a given number of lines.
Some lines turn out to be lists, so the script iterates over each list as well and adds the items to the DataFrame.
def json_parser(file, max_chunk):
    f = open(file)
    df = pd.DataFrame([])
    for i in range(2, max_chunk + 2):
        try:
            type(f.readlines(i)) == list
            for j in range(len(f.readlines(i))):
                part = json.loads(f.readlines(i)[j])
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
        except:
            f = open(file, encoding="utf-8")
            for j in range(len(f.readlines(i))):
                try:
                    part = json.loads(f.readlines(i)[j-1])
                except:
                    print(i, j)
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
    df2.reset_index(inplace=True, drop=True)
    return df2
But I am still getting a "list index out of range" error. (Yes, I used print to debug.)
So I looked more closely at the lines that cause this error.
Interestingly, when I try to look at those lines, the script gives me a different list each time.
Here is what I mean: I ran the cells repeatedly and got a different length for the list each time.
So I looked at the lists themselves: they seem to be completely different lists. Each run returns a different list even though the line number is the same, and the readlines documentation is not helping. What am I missing?
Thanks in advance.
You are using the expression f.readlines(i) several times as if it was referring to the same set of lines each time.
But as a side effect of evaluating that expression, more lines are actually read from the file. At some point you are basing the indices j on more lines than are actually available, because they came from a different invocation of f.readlines.
You should use f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
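A minimal sketch of that idea, going one step further and simply iterating over the file so that every line is read exactly once (assuming one JSON object per line, as in the Yelp dump):
import json
import pandas as pd

def json_parser(file, max_lines):
    records = []
    with open(file, encoding='utf-8') as f:
        for i, line in enumerate(f):          # each line is read exactly once
            if i >= max_lines:
                break
            records.append(json.loads(line))  # one dict per line
    return pd.DataFrame(records)              # build the DataFrame once, at the end

df = json_parser('./yelp_academic_dataset_review.json', 10000)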
I have been trying to build a VLOOKUP-like function in Python. I have two files: data, created in Python, which has more than 2000 rows, and comp_data, a CSV file loaded into the system, which has 35 rows. I have to match the Date column of data against comp_data and load the corresponding Exp_date. The current code fails with an error once the counter reaches 35. I am not able to understand the problem.
Here is the code:
data['Exp_date'] = datetime.date(2020, 3, 30)
z = 0
for i in range(len(data)):
    if data['Date'][i] == comp_data['Date'][z]:
        data['Exp_date'][i] = comp_data['Exp_date'][z]
    else:
        z = z + 1
One option would be to put your comp_data in a dictionary with your date/exp_date as key/value pairs and let Python do the lookup for you.
data = {"date":["a","b","c","d","e","f"],"exp_date":[0,0,0,0,0,0]}
comp = {"a":10,"d":13}
for i in range(len(data['date'])):
if data['date'][i] in comp:
data['exp_date'][i] = comp[data['date'][i]]
print(data)
There's probably a one-liner way of doing this with iterators!
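If data and comp_data are pandas DataFrames as in the question, one way to avoid the explicit loop is to turn comp_data into a dictionary and use map; a sketch, reusing the column names and default date from the question:
import datetime

lookup = dict(zip(comp_data['Date'], comp_data['Exp_date']))
data['Exp_date'] = data['Date'].map(lookup).fillna(datetime.date(2020, 3, 30))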
I need to read a 300 GB xlsx file with roughly 10^9 rows. The file has 8 columns and I need to get the values from one of them. I want to do it as fast as possible.
from openpyxl import load_workbook
import datetime

wb = load_workbook(filename="C:\Users\Predator\Downloads\logs_sample.xlsx",
                   read_only=True)
ws = wb.worksheets[0]

count = 0
emails = []
p = datetime.datetime.today()
for row in ws.rows:
    count += 1
    val = row[8].value
    if count >= 200000: break
    emails.append(val)
q = datetime.datetime.today()
res = (q - p).total_seconds()
print "time: {} seconds".format(res)
emails = emails[1:]
Right now the loop needs about 16 seconds to read 200,000 rows, and the time complexity is O(n). So 10^6 rows would take nearly 1.5 minutes. But we have 10^9 rows, and for that we would have to wait 10^3 * 1.5 = 1500 minutes = 25 hours. That is far too long...
Help me, please, to solve this problem.
I've just had a very similar issue. I had a bunch of xlsx files containing a single worksheet with between 2 and 4 million rows.
First, I went about extracting the relevant xml files (using bash script):
f='<xlsx_filename>'
unzip -p $f xl/worksheets/sheet1.xml > ${f%%.*}.xml
unzip -p $f xl/sharedStrings.xml > ${f%%.*}_strings.xml
This leads to all the xml files being placed in the working directory. Then, I used Python to convert the xml to csv. This code makes use of the ElementTree.iterparse() method. However, it can only work if every element gets cleared after it has been processed (see also here):
import pandas as pd
import numpy as np
import os
import xml.etree.ElementTree as et

base_directory = '<path/to/files>'
file = '<xml_filename>'
os.chdir(base_directory)

def read_file(base_directory, file):
    ns = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

    print('Working on strings file.')
    string_it = et.parse(base_directory + '/' + file[:-4] + '_strings.xml').getroot()
    strings = []
    for st in string_it:
        strings.append(st[0].text)

    print('Working on data file.')
    iterate_file = et.iterparse(base_directory + '/' + file, events=['start', 'end'])
    print('Iterator created.')

    rows = []
    curr_column = ''
    curr_column_elem = None
    curr_row_elem = None
    count = 0

    for event, element in iterate_file:
        if event == 'start' and element.tag == ns + 'row':
            count += 1
            print(' ', end='\r')
            print(str(count) + ' rows done', end='\r')
            if curr_row_elem is not None:
                rows.append(curr_row_elem)
            curr_row_elem = []
            element.clear()
        if curr_row_elem is not None:
            ### Column element started
            if event == 'start' and element.tag == ns + 'c':
                curr_column_elem = element
                curr_column = ''
            ### Column element ended
            if event == 'end' and element.tag == ns + 'c':
                curr_row_elem.append(curr_column)
                element.clear()
                curr_column_elem.clear()
            ### Value element ended
            if event == 'end' and element.tag == ns + 'v':
                ### Replace string if necessary
                if curr_column_elem.get('t') == 's':
                    curr_column = strings[int(element.text)]
                else:
                    curr_column = element.text

    ### Append the last row, which is never followed by another 'row' start event
    if curr_row_elem is not None:
        rows.append(curr_row_elem)

    df = pd.DataFrame(rows).replace('', np.nan)
    df.columns = df.iloc[0]
    df = df.drop(index=0)

    ### Export
    df.to_csv(file[:-4] + '.csv', index=False)

read_file(base_directory, file)
Maybe this helps you or anyone else running into this issue. It is still relatively slow, but it worked a lot better than a basic parse.
One possible option would be to read the .xml data inside the .xlsx directly.
.xlsx is actually a zipfile, containing multiple xml files.
All the distinct emails could be in xl/sharedStrings.xml, so you could try to extract them there.
To test (with a smaller file): add '.zip' to the name of your file and view the contents.
Of course, unzipping the whole 300GB file is not an option, so you would have to stream the compressed data (of that single file inside the zip), uncompress parts in memory and extract the data you want.
I don't know Python, so I can't help with a code example.
Also: emails.append(val) will create an array/list with 1 billion items. It might be better to directly write those values to a file instead of storing them in an array (which will have to grow and reallocate memory each time).
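For reference, a minimal Python sketch of that approach, streaming xl/sharedStrings.xml out of the archive without unzipping the whole file and writing the values straight to disk (this assumes the wanted values are stored as shared strings):
import zipfile
import xml.etree.ElementTree as et

ns = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

with zipfile.ZipFile('logs_sample.xlsx') as z, \
        z.open('xl/sharedStrings.xml') as strings, \
        open('emails.txt', 'w', encoding='utf-8') as out:
    # iterparse consumes the compressed stream incrementally
    for event, elem in et.iterparse(strings):
        if elem.tag == ns + 't' and elem.text:
            out.write(elem.text + '\n')  # write out instead of keeping a billion-item list
        elem.clear()                     # free memory as we go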
To run such a task efficiently you need to use a database. SQLite can help you here.
Using pandas from http://pandas.pydata.org/ and sqlite from http://sqlite.org/.
You can install pandas with pip, or with conda from Continuum.
import pandas as pd
import sqlite3 as sql

# create a connection/db
con = sql.connect('logs_sample.db')

# read your file
df = pd.read_excel("C:\\Users\\Predator\\Downloads\\logs_sample.xlsx")

# send it to the db
df.to_sql('logs_sample', con, if_exists='replace')
See more, http://pandas.pydata.org
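Once the data is in SQLite, the single column can be queried back out without loading everything into memory; a sketch, assuming the column is called email:
# pull just the needed column back out of the database
emails = pd.read_sql_query('SELECT email FROM logs_sample', con)

# or stream it in chunks to keep memory low
for chunk in pd.read_sql_query('SELECT email FROM logs_sample', con, chunksize=100000):
    chunk.to_csv('emails.csv', mode='a', header=False, index=False)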
I am looking to make the following code parallel. It reads in data in one large 9 GB proprietary format and produces 30 individual csv files based on the 30 columns of data. It currently takes 9 minutes per csv written for a 30-minute data set. The solution space of parallel libraries in Python is a bit overwhelming. Can you point me to any good tutorials or sample code? I couldn't find anything very informative.
for i in range(0, NumColumns):
    aa = datetime.datetime.now()
    allData = [TimeStamp]
    ColumnData = allColumns[i].data  # Get the data within this one Column
    Samples = ColumnData.size  # Find the number of elements in Column data
    print('Formatting Column {0}'.format(i+1))
    truncColumnData = []  # Initialize truncColumnData array each time for loop runs
    if ColumnScale[i+1] == 'Scale: ' + tempScaleName:  # If it's temperature, format every value to 5 characters
        for j in range(Samples):
            truncValue = '{:.1f}'.format((ColumnData[j]))
            truncColumnData.append(truncValue)  # Appends formatted value to truncColumnData array
    allData.append(truncColumnData)  # append the formatted Column data to the all data array

    zipObject = zip(*allData)
    zipList = list(zipObject)

    csvFileColumn = 'Column_' + str('{0:02d}'.format(i+1)) + '.csv'
    # Write the information to .csv file
    with open(csvFileColumn, 'wb') as csvFile:
        print('Writing to .csv file')
        writer = csv.writer(csvFile)
        counter = 0
        for z in zipList:
            counter = counter + 1
            timeString = '{:.26},'.format(z[0])
            zList = list(z)
            columnVals = zList[1:]
            columnValStrs = list(map(str, columnVals))
            formattedStr = ','.join(columnValStrs)
            csvFile.write(timeString + formattedStr + '\n')  # Writes the time stamps and channel data by columns
One possible solution may be to use Dask: http://dask.pydata.org/en/latest/
A coworker recently recommended it to me, which is why I thought of it.
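A minimal sketch of how Dask could wrap the existing per-column work, assuming the formatting and csv-writing body of the loop above is moved into a function (write_column here is hypothetical):
from dask import delayed, compute

def write_column(i):
    # hypothetical: the body of the original 'for i in range(0, NumColumns)' loop,
    # formatting column i and writing Column_<i>.csv
    ...

tasks = [delayed(write_column)(i) for i in range(NumColumns)]
compute(*tasks, scheduler='processes')  # run the 30 columns in parallel worker processes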
I am trying my hand at modifying a Python script to download a bunch of data from a website. Given the large amount of data involved, I have decided to convert the script to pandas. I have this code so far.
snames = ['Index', 'Node_ID', 'Node', 'Id', 'Name', 'Tag', 'Datatype', 'Engine']
sensorinfo = pd.read_csv(sensorpath, header=None, names=snames, index_col=['Node', 'Index'])

for j in sensorinfo['Node']:
    for z in sensorinfo['Index']:
        # create a string for the url of the data
        data_url = "http://www.mywebsite.com/emoncms/feed/data.json?id=" + sensorinfo['Id'] + "&apikey1f8&start=&end=&dp=600"
        print data_url
        # read in the data from emoncms
        sock = urllib.urlopen(data_url)
        data_str = sock.read()
        sock.close()
        # data is output as a string so we convert it to a list of lists
        data_list = eval(data_str)
        myfile = open(feed_list['Name'][k] + ".csv", 'wb')
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
The first part of the code gives me a very nice table, which means I am opening my csv data file and importing the information. My question is this:
So I am trying to do this in pseudo code:
For node is nodes (4 nodes so far)
    For index in indexes
        data_url = websiteinfo + Id + sampleinformation
        smalldata.read.csv(data_url)
        merge(bigdata, smalldata.no_time_column)
This is my first post here, I tried to keep it short but still supply the relevant data. Let me know if I need to clarify anything.
In your pseudocode, you can do this:
dfs = []
For node is nodes (4 nodes so far)
    For index in indexes
        data_url = websiteinfo + Id + sampleinformation
        df = smalldata.read.csv(data_url)
        dfs.append(df)
df = pd.concat(dfs)
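As a slightly more concrete sketch (the URL pieces, the node_index_ids iterable and the assumption that each feed comes back as JSON, like data.json above, are all hypothetical):
import pandas as pd

dfs = []
for node, index, feed_id in node_index_ids:  # hypothetical iterable of (node, index, id)
    data_url = ("http://www.mywebsite.com/emoncms/feed/data.json?id="
                + str(feed_id) + "&apikey1f8&start=&end=&dp=600")
    small = pd.read_json(data_url)           # one small frame per feed
    dfs.append(small)

bigdata = pd.concat(dfs, ignore_index=True)  # one concat at the end
bigdata.to_csv('bigdata.csv', index=False)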