I need to read a 300 GB xlsx file with roughly 10^9 rows. The file has 8 columns, and I need the values from just one of them. I want to do this as fast as possible.
from openpyxl import load_workbook
import datetime

wb = load_workbook(filename=r"C:\Users\Predator\Downloads\logs_sample.xlsx",
                   read_only=True)
ws = wb.worksheets[0]

count = 0
emails = []
p = datetime.datetime.today()
for row in ws.rows:
    count += 1
    val = row[8].value  # the column I need (0-based cell index within the row)
    if count >= 200000:
        break
    emails.append(val)
q = datetime.datetime.today()
res = (q - p).total_seconds()
print("time: {} seconds".format(res))

emails = emails[1:]  # drop the header value
Right now the loop needs ~16 seconds to read 200,000 rows, and the time complexity is O(n). So 10^6 rows would take roughly 1.5 minutes, but we have 10^9. That means 10^3 * 1.5 = 1500 minutes = 25 hours of waiting. That's far too slow...
Please help me solve this problem.
I've just had a very similar issue. I had a bunch of xlsx files containing a single worksheet with between 2 and 4 million rows.
First, I extracted the relevant xml files (using a bash script):
f='<xlsx_filename>'
unzip -p $f xl/worksheets/sheet1.xml > ${f%%.*}.xml
unzip -p $f xl/sharedStrings.xml > ${f%%.*}_strings.xml
This places all the xml files in the working directory. Then I used Python to convert the xml to csv. The code below uses the ElementTree.iterparse() method; however, it only works if every element is cleared after it has been processed (see also here):
import pandas as pd
import numpy as np
import os
import xml.etree.ElementTree as et

base_directory = '<path/to/files>'
file = '<xml_filename>'

os.chdir(base_directory)

def read_file(base_directory, file):
    ns = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

    print('Working on strings file.')
    string_it = et.parse(base_directory + '/' + file[:-4] + '_strings.xml').getroot()
    strings = []
    for st in string_it:
        strings.append(st[0].text)

    print('Working on data file.')
    iterate_file = et.iterparse(base_directory + '/' + file, events=['start', 'end'])
    print('Iterator created.')

    rows = []
    curr_column = ''
    curr_column_elem = None
    curr_row_elem = None
    count = 0

    for event, element in iterate_file:
        if event == 'start' and element.tag == ns + 'row':
            count += 1
            print(' ', end='\r')
            print(str(count) + ' rows done', end='\r')
            if curr_row_elem is not None:
                rows.append(curr_row_elem)
            curr_row_elem = []
            element.clear()

        if curr_row_elem is not None:
            ### Column element started
            if event == 'start' and element.tag == ns + 'c':
                curr_column_elem = element
                curr_column = ''
            ### Column element ended
            if event == 'end' and element.tag == ns + 'c':
                curr_row_elem.append(curr_column)
                element.clear()
                curr_column_elem.clear()
            ### Value element ended
            if event == 'end' and element.tag == ns + 'v':
                ### Replace shared-string index with the actual string if necessary
                if curr_column_elem.get('t') == 's':
                    curr_column = strings[int(element.text)]
                else:
                    curr_column = element.text

    df = pd.DataFrame(rows).replace('', np.nan)
    df.columns = df.iloc[0]
    df = df.drop(index=0)

    ### Export
    df.to_csv(file[:-4] + '.csv', index=False)

read_file(base_directory, file)
Maybe this helps you or anyone else running into this issue. It is still relatively slow, but it worked a lot better than a plain parse of the whole file.
One possible option would be to read the .xml data inside the .xlsx directly.
An .xlsx file is actually a zip file containing multiple xml files.
All the distinct emails could be in xl/sharedStrings.xml, so you could try to extract them there.
To test (with a smaller file): add '.zip' to the name of your file and view the contents.
Of course, unzipping the whole 300GB file is not an option, so you would have to stream the compressed data (of that single file inside the zip), uncompress parts in memory and extract the data you want.
I don't know Python, so I can't help with a code example.
Also: emails.append(val) will create a list with a billion items. It might be better to write those values directly to a file instead of storing them in a list (which has to keep growing and reallocating memory as it fills).
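A minimal Python sketch of that streaming idea, assuming all the email values really do end up in xl/sharedStrings.xml, and writing them straight to a file rather than a list:

import zipfile
import xml.etree.ElementTree as et

ns = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

# ZipFile.open() streams and decompresses the member on the fly,
# so the 300 GB archive is never unpacked to disk.
with zipfile.ZipFile(r"C:\Users\Predator\Downloads\logs_sample.xlsx") as zf, \
        zf.open('xl/sharedStrings.xml') as strings_xml, \
        open('emails.txt', 'w', encoding='utf-8') as out:
    for event, element in et.iterparse(strings_xml):
        # each shared string sits in a <t> element inside an <si>
        if element.tag == ns + 't' and element.text:
            out.write(element.text + '\n')
        element.clear()  # free memory as we go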
To run such a task efficiently you need to use a database. SQLite can help you here.
Use pandas (http://pandas.pydata.org/) together with SQLite (http://sqlite.org/).
You can install pandas with pip, or with conda from Continuum.
import pandas as pd
import sqlite3 as sql

# create a connection/db
con = sql.connect('logs_sample.db')

# read your file
df = pd.read_excel("C:\\Users\\Predator\\Downloads\\logs_sample.xlsx")

# send it to the db (to_sql is a DataFrame method)
df.to_sql('logs_sample', con, if_exists='replace')
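Once the table exists, the single column you need can be pulled back with a plain SQL query; a small sketch, assuming the column is actually named email:

# continuing from the block above: pull the one column back out of SQLite
emails = pd.read_sql_query("SELECT email FROM logs_sample", con)
emails.to_csv("emails.csv", index=False)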
See more at http://pandas.pydata.org
I am extracting data from a large pdf file with regex, using Python in Databricks. The extracted data is one long string, and I am using a string split to turn it into a pandas dataframe, since I want the final data as a csv file. But the line.split step takes about 5 hours to run, and I am looking for ways to optimize it. I am new to Python and not sure which part of the code I should look at to reduce this running time.
for pdf in os.listdir(data_directory):
    # creating an object
    file = open(data_directory + pdf, 'rb')

    # creating file reader object
    fileReader = PyPDF2.PdfFileReader(file)
    num_pages = fileReader.numPages
    #print("total pages = " + str(num_pages))

    extracted_string = "start of file"
    current_page = 0
    while current_page < num_pages:
        #print("adding page " + str(current_page) + " to the file")
        extracted_string += fileReader.getPage(current_page).extract_text()
        current_page = current_page + 1

    regex_date = r"\d{2}/\d{2}/\d{4}[^\n]*"
    table_lines = re.findall(regex_date, extracted_string)
The code above gets the data out of the PDF.
#create dataframe out of extracted string and load into a single dataframe
for line in table_lines:
    df = pd.DataFrame([x.split(' ') for x in line.split('\n')])
    df.rename(columns={0: 'date_of_import', 1: 'entry_num', 2: 'warehouse_code_num',
                       3: 'declarant_ref_num', 4: 'declarant_EORI_num', 5: 'VAT_due'},
              inplace=True)
    table = pd.concat([table, df], sort=False)
This part of the code is what takes most of the time. I have tried different ways to build a dataframe out of this data, but the above has worked best for me. I am looking for a faster way to run this code.
PDF file for reference: https://drive.google.com/file/d/1ew3Fw1IjeToBA-KMbTTD_hIINiQm0Bkg/view?usp=share_link
There are two immediate optimization steps in your code.
Pre-compile a regex if it is used many times. It may or may not be relevant here, because I cannot tell how many times table_lines = re.findall(regex_date, extracted_string) is executed, but this is often more efficient:
# before any loop
regex_date = re.compile(r"\d{2}/\d{2}/\d{4}[^\n]*")
...
# inside the loop
table_lines = regex_date.findall(extracted_string)
Do not repeatedly append to a dataframe. A dataframe is a rather complex container, and appending rows is a costly operation: each concat copies everything accumulated so far, so growing a dataframe inside a loop is effectively quadratic. It is generally much more efficient to build a plain Python container (list or dict) first and then convert it to a dataframe in one go:
data = [x.split(' ') for line in table_lines for x in line.split('\n')]
table = pd.DataFrame(data, columns=['date_of_import', 'entry_num',
                                    'warehouse_code_num', 'declarant_ref_num',
                                    'declarant_EORI_num', 'VAT_due'])
So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file; I want to append to the end of the .xlsx file instead. The biggest catch is that I can only use Excel 2010, I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active

for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)

wb.save(outfile_path)  # the appended rows are only written out on save
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entire program below so I can hopefully explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice actually looks like.
import os
import pandas as pd
from openpyxl import load_workbook

#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

#establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

#open the file path to store the directory in files
files = os.listdir(in_path)

#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

#searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

#read in the files to a dataframe, main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)

    #get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index

    #lists to store where headers are found in the dataframe
    row_list = []
    column_list = []
    header_list = []

    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                #main compare, if str and matches search params, then do...
                #(insensitive_compare is a helper function defined elsewhere in the project)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name]) #store data headers
                        row_list.append(number)                 #store row number where it is in that dataframe
                        column_list.append(name)                #store column name where it is in that dataframe
                    else:
                        continue
                else:
                    continue

    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)

    #turns the dataframe into a set of booleans where it's True if
    #there's something there
    na_finder = df.notna()

    #create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)

    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            #I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                #store the actual dataframe values into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1

    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to be written to Excel one after the other.
The comment from Ashish really helped me. All of the dataframes had different column titles, so my 100+ dataframes eventually concatenated into a single dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1

main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])

to_xlsx_df.to_excel(outfile_path)
The output to Excel ended up looking just the way I wanted.
Hopefully this can help someone else out too.
I have a csv file containing around 8 million records, but it is taking more than an hour to complete the process, so could you please help me with this?
Note: There is no issue with the Python code itself; it runs without errors. The only problem is that it takes too much time to load and process the 8M records.
Here is the code:
import pandas as pd
import numpy as np
import ipaddress
from pathlib import Path
import shutil
import os
from time import time
start = time()
inc_path = 'C:/Users/phani/OneDrive/Desktop/pandas/inc'
arc_path = 'C:/Users/phani/OneDrive/Desktop/pandas/arc'
dropZone_path = 'C:/Users/phani/OneDrive/Desktop/pandas/dropZone'
for src_file in Path(dropZone_path).glob('XYZ*.csv*'):
    process_file = shutil.copy(os.path.join(dropZone_path, src_file), arc_path)

for sem_file in Path(dropZone_path).glob('XYZ*.sem'):
    semaphore_file = shutil.copy(os.path.join(dropZone_path, sem_file), inc_path)

# rename the original file
for file in os.listdir(dropZone_path):
    file_path = os.path.join(dropZone_path, file)
    shutil.copy(file_path, os.path.join(arc_path, "Original_" + file))

for sema_file in Path(arc_path).glob('Original_XYZ*.sem*'):
    os.remove(sema_file)

## Read CSV file from TEMP folder
df = pd.read_csv(process_file)
df = df.sort_values(["START_IP_ADDRESS"], ascending=True)  # assign the result; sort_values is not in-place by default

i = 0
while i < len(df) - 1:
    i += 1
    line = df.iloc[i:i + 1].copy(deep=True)
    curr_START_IP_NUMBER = line.START_IP_NUMBER.values[0]
    curr_END_IP_NUMBER = line.END_IP_NUMBER
    prev_START_IP_NUMBER = df.loc[i - 1, 'START_IP_NUMBER']
    prev_END_IP_NUMBER = df.loc[i - 1, 'END_IP_NUMBER']
    # if no gap - continue
    if curr_START_IP_NUMBER == (prev_END_IP_NUMBER + 1):
        continue
    # else fill the gap
    # new line start ip number
    line.START_IP_NUMBER = prev_END_IP_NUMBER + 1
    line.START_IP_ADDRESS = ipaddress.ip_address(int(line.START_IP_NUMBER))
    # new line end ip number
    line.END_IP_NUMBER = curr_START_IP_NUMBER - 1
    line.END_IP_ADDRESS = ipaddress.ip_address(int(line.END_IP_NUMBER))
    line.COUNTRY_CODE = ''
    line.LATITUDE_COORDINATE = ''
    line.LONGITUDE_COORDINATE = ''
    line.ISP_NAME = ''
    line.AREA_CODE = ''
    line.CITY_NAME = ''
    line.METRO_CODE = ''
    line.ORGANIZATION_NAME = ''
    line.ZIP_CODE = ''
    line.REGION_CODE = ''
    # insert the line between the current index and the previous index
    df = pd.concat([df.iloc[:i], line, df.iloc[i:]]).reset_index(drop=True)

df.to_csv(process_file, index=False)

for process_file in Path(arc_path).glob('XYZ*.csv*'):
    EREFile_CSV = shutil.copy(os.path.join(arc_path, process_file), inc_path)
You can either read the .csv file in chunks using pandas and process each chunk separately, or concatenate all the chunks into a single dataframe (if you have enough RAM to hold all the data):
#read data in chunks of 1 million rows at a time
chunks = pd.read_csv(process_file, chunksize=1000000)

# Process each chunk
for chunk in chunks:
    # Process the chunk
    print(len(chunk))
# or concat the chunks in a single dataframe
#pd_df = pd.concat(chunks)
Alternatively, you can use the Dask library, which handles large datasets by internally chunking the dataframe and processing it in parallel:
from dask import dataframe as dd
dask_df = dd.read_csv(process_file)
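Note that Dask reads lazily, so nothing is actually loaded until compute() is called; a small example using one of the columns from the question:

# the read and the aggregation only run when compute() is called
max_ip = dask_df['START_IP_NUMBER'].max().compute()
print(max_ip)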
If the csv file doesn't change very often and you need to repeat the analysis (for example, during development), the best approach is to save it as a pickle (.pkl) file.
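A minimal sketch of that pickle round-trip (the .pkl filename here is just an example):

import pandas as pd

# one-time: parse the big csv and cache it as a pickle
df = pd.read_csv(process_file)      # process_file as defined in the question
df.to_pickle('logs_cache.pkl')      # example cache filename

# on later runs: load the cached dataframe instead of re-parsing the csv
df = pd.read_pickle('logs_cache.pkl')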
I am working on a task where we have many xlsx files, each with about 100 rows, and I would like to merge them into one new big xlsx file with xlsxwriter.
Is it possible to do this with one loop that reads and writes simultaneously?
I can read the files, and I can create a new one. On the first run I could write all cells into the new file, but when I checked it, the actual values were being overwritten by the last file read, so I only got the part where the number of rows differed from the previous file.
Here is the code I wrote:
#!/usr/bin/env python3
import os
import time
import xlrd
import xlsxwriter
from datetime import datetime
from datetime import date

def print_values_to():
    loc = ("dir/")

    wr_workbook = xlsxwriter.Workbook('All_Year_All_Values.xlsx')
    wr_worksheet = wr_workbook.add_worksheet('Test')
    # --------------------------------------------------------

    all_rows = 0

    for file in os.listdir(loc):
        print(loc + file)

        workbook = xlrd.open_workbook(loc + file)
        sheet = workbook.sheet_by_index(0)

        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols
        all_rows = all_rows + number_of_rows

        dropped_numbers = []

        for i in range(number_of_rows):  # -------- number / number_of_rows
            if i == 0:
                all_rows = all_rows - 1
                continue

            for x in range(number_of_columns):
                type_value = sheet.cell_value(i, x)
                if isinstance(type_value, float):
                    changed_to_integer = int(sheet.cell_value(i, x))  # ----
                    values = changed_to_integer  # -----
                elif isinstance(type_value, str):
                    new_date = datetime.strptime(type_value, "%d %B %Y")
                    right_format = new_date.strftime("%Y-%m-%d")
                    values = right_format

                # write into new excel file
                wr_worksheet.write(i, x, values)

                # list of all values
                dropped_numbers.append(values)

            # print them on the console
            print(dropped_numbers)

            # Writing into new excel
            # wr_worksheet.write(i, x, values)

            # clear list of values for another run
            dropped_numbers = []

        print("Number of all rows: ", number_of_rows)
        print("\n")

    wr_workbook.close()
I went through the xlsxwriter documentation, but it doesn't explicitly say that this is impossible.
So I'm still hoping I can arrange it somehow.
Many thanks for any ideas.
Me again, but now with an answer. The solution turned out to be embarrassingly simple.
One extra variable increment did the trick. Right after the first loop I added p = p + 1 and voilà, all the data ends up in one xlsx file.
So at the top:
for i in range(number_of_rows):  # -------- number / number_of_rows
    p = p + 1
and for the writer I just changed the row counter:
wr_worksheet.write(p, x, values)
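Putting the pieces together, a minimal sketch of that running-counter idea (reading with openpyxl here, since newer xlrd versions no longer open .xlsx files, and skipping the per-cell formatting to keep it short):

import os
import xlsxwriter
from openpyxl import load_workbook

loc = "dir/"
wr_workbook = xlsxwriter.Workbook('All_Year_All_Values.xlsx')
wr_worksheet = wr_workbook.add_worksheet('Test')

p = 0  # output row counter, shared across all input files
for file in os.listdir(loc):
    sheet = load_workbook(loc + file, read_only=True).worksheets[0]
    for i, row in enumerate(sheet.iter_rows(values_only=True)):
        if i == 0:
            continue  # skip each file's header row
        for x, value in enumerate(row):
            wr_worksheet.write(p, x, value)
        p += 1

wr_workbook.close()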
aaaaaaaaaaah...
Many thanks.
Looking to make the following code parallel. It reads data in one large 9 GB proprietary format and produces 30 individual csv files based on the 30 columns of data. It currently takes 9 minutes per csv written for a 30-minute data set. The solution space of parallel libraries in Python is a bit overwhelming. Can you direct me to any good tutorials or sample code? I couldn't find anything very informative.
for i in range(0, NumColumns):
    aa = datetime.datetime.now()
    allData = [TimeStamp]

    ColumnData = allColumns[i].data  # Get the data within this one Column
    Samples = ColumnData.size        # Find the number of elements in Column data
    print('Formatting Column {0}'.format(i + 1))

    truncColumnData = []  # Initialize truncColumnData array each time for loop runs
    if ColumnScale[i + 1] == 'Scale: ' + tempScaleName:  # If it's temperature, format every value to 5 characters
        for j in range(Samples):
            truncValue = '{:.1f}'.format(ColumnData[j])
            truncColumnData.append(truncValue)  # Appends formatted value to truncColumnData array

    allData.append(truncColumnData)  # append the formatted Column data to the all data array

    zipObject = zip(*allData)
    zipList = list(zipObject)

    csvFileColumn = 'Column_' + str('{0:02d}'.format(i + 1)) + '.csv'

    # Write the information to .csv file
    with open(csvFileColumn, 'wb') as csvFile:
        print('Writing to .csv file')
        writer = csv.writer(csvFile)
        counter = 0
        for z in zipList:
            counter = counter + 1
            timeString = '{:.26},'.format(z[0])
            zList = list(z)
            columnVals = zList[1:]
            columnValStrs = list(map(str, columnVals))
            formattedStr = ','.join(columnValStrs)
            csvFile.write(timeString + formattedStr + '\n')  # Writes the time stamps and channel data by columns
One possible solution may be to use Dask: http://dask.pydata.org/en/latest/
A coworker recently recommended it to me, which is why I thought of it.
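A rough, self-contained sketch of that fan-out with dask.delayed, using toy data in place of TimeStamp and allColumns[i].data from the proprietary reader; write_column stands in for the formatting-and-writing body of the loop above:

import csv
import dask

@dask.delayed
def write_column(i, timestamps, column_values):
    # format one column and write it out, analogous to one pass of the loop above
    filename = 'Column_{0:02d}.csv'.format(i + 1)
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        for t, v in zip(timestamps, column_values):
            writer.writerow([t, '{:.1f}'.format(v)])
    return filename

# toy stand-ins for TimeStamp and the per-column data
timestamps = ['2024-01-01 00:00:00', '2024-01-01 00:00:01']
columns = [[20.15, 20.30], [18.02, 18.11]]

tasks = [write_column(i, timestamps, col) for i, col in enumerate(columns)]
print(dask.compute(*tasks))  # runs the column writes in parallel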