I am using pd.read_excel to read large Excel files. How do I add a GUI progress bar to show users the progress while the file is being read?
I came across tqdm and PySimpleGUI as the most commonly used progress bars. PySimpleGUI's OneLineProgressMeter seems to be the solution to my need, but I'm too much of a novice to work pd.read_excel into it. Can anyone teach me how to work this out? Thank you
import PySimpleGUI as sg

for i in range(1000):  # this is your "work loop" that you want to monitor
    sg.OneLineProgressMeter('One Line Meter Example', i + 1, 1000, 'key')
It's not really easy to do with read_excel because the function lacks any means of tracking progress, unlike read_csv, where you can use the chunksize parameter to get an iterator and then update your progress based on how many chunks you've loaded out of the total.
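For comparison, here is a minimal sketch of the CSV approach (the file name data.csv and the total row count are placeholders; the count would have to be known or estimated up front):

import pandas as pd
import PySimpleGUI as sg

total_rows = 100_000               # assumed to be known or estimated in advance
chunksize = 1000
total_chunks = total_rows // chunksize + 1

chunk_list = []
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=chunksize)):
    chunk_list.append(chunk)
    sg.OneLineProgressMeter('Loading CSV file', i + 1, total_chunks, 'csv_key')

df = pd.concat(chunk_list, ignore_index=True)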
This won't work for Excel files, so your only option is to use a combination of nrows and skiprows. To do this we first need to find out how many rows the file has. There are some neat ways of doing this with little overhead (see answers here), but those won't work for an Excel file, so what you do instead is load one entire column of the sheet and find out the length that way:
import pandas as pd
import PySimpleGUI as sg

# Get the total number of rows by loading a single column
df_temp = pd.read_excel('your_file.xlsx', sheet_name='blabla', usecols=[0])
rows = df_temp.shape[0]

# Now load the file in chunks of 1000 rows at a time
chunks = rows // 1000 + 1
chunk_list = []
for i in range(chunks):
    # Keep the header row and skip the data rows that were already read
    tmp = pd.read_excel('your_file.xlsx', sheet_name='blabla', nrows=1000,
                        skiprows=range(1, i * 1000 + 1))
    chunk_list.append(tmp)
    # Update progress
    sg.OneLineProgressMeter('Loading excel file', i + 1, chunks, 'key')

df = pd.concat(chunk_list, axis=0, ignore_index=True)
This should work, but it comes with significant overhead, since every chunk has to re-open and re-parse the workbook. Unfortunately, there's no easy way to do this for Excel files, and you're better off using some other data format.
I am working with very big Excel files, which take a long time to load with pandas in Python. Before processing the data, the user has to select quite a few options related to the data, which only require the names of each column in each dataset. It is very inconvenient for the user to have to wait, sometimes for minutes, until the data is loaded just to be able to select the necessary options, and then let the program do the actual processing for another few minutes.
So, my question is: is there a way to load only the data header from an Excel file with Python? In a way I think of it as an alternate version of the "skiprows" parameter in the read_excel pandas function, where instead of skipping rows at the beginning of the data, I would like to skip rows at the end of the data. I want to emphasize that my goal is to reduce the time Python takes to load the files. I also know there are ways to do this with CSV files, but unfortunately that didn't help me.
Thank you for the help!
You can try to use the sxl module (https://pypi.org/project/sxl/). Here is the code I tried for a large excel file (around 75,000 rows) and the timing results:
from datetime import datetime
import pandas as pd
import sxl

startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx')
print("Time taken to load whole data with pandas read excel is {}".format(datetime.now() - startTime))

startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx', nrows=5)
print("Time taken with top 5 rows with pandas read excel is {}".format(datetime.now() - startTime))

startTime = datetime.now()
wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]
data = ws.head(5)
print("Time taken to load top 5 rows using sxl is {}".format(datetime.now() - startTime))
Pandas read_excel loads the whole data into memory, so there is not much of a difference in timing between the first two runs. Here are the outputs from the above:
Time taken to load whole data with pandas read excel is 0:00:49.174538
Time taken with top 5 rows with pandas read excel is 0:00:44.478523
Time taken to load top 5 rows using sxl is 0:00:00.671717
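If all you actually need are the column names, something like the following should work, assuming the first row of the sheet is the header and that sxl's head() returns rows as plain lists (a sketch, not tested against your file):

import sxl

wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]
header = ws.head(1)[0]   # first row of the sheet as a list of cell values
print(header)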
I hope this helps!!
You can use the 'skipfooter' parameter or the 'nrows' parameter with both .xlsx and .csv files. However, the two cannot be used together.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, skipfooter = 99999)
This skips the last 99999 rows, counting from the bottom of the file, and loads the remaining records starting from the header.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, nrows= 5)
This loads only the first 5 rows of data, along with the header.
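If the goal is just the column names, a minimal sketch is to request zero data rows (assuming your pandas version accepts nrows=0; note the timings in the answer above suggest this will not be much faster, since pandas still parses the whole sheet):

import pandas as pd

path = r'c:\users\abc\def\stack.xlsx'
# an empty DataFrame is returned, but .columns holds the header names
header_only = pd.read_excel(path, nrows=0)
print(list(header_only.columns))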
Also refer to this Stack Overflow question.
from dask import dataframe as dd

df = dd.read_csv("filename")
Trust me, it's fast; I am reading an 800 MB file.
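One caveat worth adding: dask DataFrames are lazy, so read_csv returns almost immediately and the actual reading only happens when you materialize a result. A small sketch of what that usually looks like:

from dask import dataframe as dd

df = dd.read_csv("filename")
preview = df.head()      # reads only enough partitions to show a few rows
full_df = df.compute()   # materializes everything as a regular pandas DataFrame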
I am having trouble with reading and writing moderately sized Excel files in pandas. I have 5 files, each around 300 MB. I need to combine these files into one, do some processing and then save it (as Excel, preferably):
import pandas as pd
f1 = pd.read_excel('File_1.xlsx')
f2 = pd.read_excel('File_2.xlsx')
f3 = pd.read_excel('File_3.xlsx')
f4 = pd.read_excel('File_4.xlsx')
f5 = pd.read_excel('File_5.xlsx')
FULL = pd.concat([f1,f2,f3,f4,f5], axis=0, ignore_index=True, sort=False)
FULL.to_excel('filename.xlsx', index=False)
But unfortunately reading takes way too much time (around 15 minutes or so), and writing used up 100% of memory (on my 16 GB RAM PC) and was taking so long that I was forced to interrupt the program.
Is there any way I could accelerate both read/write?
This post defines a nice function, append_df_to_excel().
You can use that function to read the files one by one and append their content to the final excel file. This will save you RAM since you are not going to keep all the files in memory at once.
files = ['File_1.xlsx', 'File_2.xlsx', ...]
for file in files:
    df = pd.read_excel(file)
    append_df_to_excel('filename.xlsx', df)
Depending on your input files, you may need to pass some extra arguments to the function. Check the linked post for extra info.
Note that you could use df.to_csv() with mode='a' to append to a CSV file. Most of the time you can swap Excel files for CSV easily. If this is also your case, I would suggest that method instead of the custom function.
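A minimal sketch of the CSV variant, assuming an output file named combined.csv that does not exist yet:

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'File_4.xlsx', 'File_5.xlsx']
for i, file in enumerate(files):
    df = pd.read_excel(file)
    # write the header only for the first file, then append without repeating it
    df.to_csv('combined.csv', mode='a', index=False, header=(i == 0))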
Not ideal (and dependent on your use case), but I've always found it much quicker to load the XLSX in Excel and save it as a CSV file. I tend to do multiple reads on the data, and in the long run the time spent waiting for the XLSX to load outweighs the time it takes to convert the file.
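If you want to automate that convert-once-then-reuse idea, a rough sketch (the helper name and paths are my own placeholders):

import os
import pandas as pd

def load_cached(xlsx_path):
    # Convert the workbook to CSV the first time, then read the CSV on later runs
    csv_path = os.path.splitext(xlsx_path)[0] + '.csv'
    if not os.path.exists(csv_path):
        pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
    return pd.read_csv(csv_path)

df = load_cached('File_1.xlsx')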
I'm trying to read a rather large CSV (2 GB) with pandas to do some datatype manipulation and joining with other dataframes I have already loaded before. As I want to be a little careful with memory, I decided to read it in chunks. For the purpose of the question, here is an extract of my CSV layout with dummy data (I can't really share the real data, sorry!):
institution_id,person_id,first_name,last_name,confidence,institution_name
1141414141,4141414141,JOHN,SMITH,0.7,TEMP PLACE TOWN
10123131114,4141414141,JOHN,SMITH,0.7,TEMP PLACE CITY
1003131313188,4141414141,JOHN,SMITH,0.7,"TEMP PLACE,TOWN"
18613131314,1473131313,JOHN,SMITH,0.7,OTHER TEMP PLACE
192213131313152,1234242383,JANE,SMITH,0.7,"OTHER TEMP INC, LLC"
My pandas code to read the files:
inst_map = pd.read_csv("data/hugefile.csv",
                       engine="python",
                       chunksize=1000000,
                       index_col=False)

print("processing institution chunks")
chunk_list = []  # append each chunk df here

for chunk in inst_map:
    # perform data filtering
    chunk['person_id'] = chunk['person_id'].progress_apply(zip_check)
    chunk['institution_id'] = chunk['institution_id'].progress_apply(zip_check)
    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk)

ins_processed = pd.concat(chunk_list)
The zip check function that I'm applying is basically performing some datatype checks and then converting the value that it gets into an integer.
Whenever I read the CSV, it will only ever read the institution_id column and generate an index. The other columns in the CSV are just silently dropped.
When I don't use index_col=False as an option, it will just set 1141414141/4141414141/JOHN/SMITH/0.7 (basically the first 5 values in the row) as the index, with only institution_id as the header, while only reading institution_name into the dataframe as a value.
I honestly have no clue what is going on here, and after 2 hours of SO/Google searching I decided to just ask this as a question. Hope someone can help me, thanks!
The issue turned out to be that something went wrong while transferring the large CSV file to my remote processing server (which has sufficient RAM to handle in-memory editing). Processing the chunks on my local computer seems to work.
After reuploading the file it worked fine on the remote server.
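For anyone hitting the same thing, a quick way to confirm the file survived the transfer is to compare checksums on both machines; a small sketch (the file path is the one from the question):

import hashlib

def md5sum(path, block_size=2**20):
    # Hash the file in blocks so a 2 GB file never has to fit in memory at once
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            md5.update(block)
    return md5.hexdigest()

# Run this locally and on the remote server; the two hashes should match
print(md5sum('data/hugefile.csv'))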
Let's say I have an Excel file with 100k rows. My code reads it row by row and, for each row, does some computation (including benchmarking how long each row takes). My code then produces an array of results with 100k rows. I wrote Python code for this, but it is not efficient: it is taking me several days, and the benchmark results get worse over time, I guess due to high memory consumption. Please see my attempt and let me know how to improve it.
My code saves everything to results=[] and only writes it out at the end. Also, at the start I store the whole Excel file in worksheet. I think this will cause memory issues, since my Excel file has very large text in its cells (not only numbers).
import xlrd
import pandas as pd

ExcelFileName = 'Data.xlsx'
workbook = xlrd.open_workbook(ExcelFileName)
worksheet = workbook.sheet_by_name("Sheet1")  # We need to read the data
num_rows = worksheet.nrows  # Number of rows
num_cols = worksheet.ncols  # Number of columns

results = []
for curr_row in range(1, num_rows, 1):
    row_data = []
    for curr_col in range(0, num_cols, 1):
        data = worksheet.cell_value(curr_row, curr_col)  # Read the data in the current cell
        row_data.append(data)
    #### do computation here ####
    ## save results like results += []

### save results array in a dataframe and then print it to excel
df = pd.DataFrame(results)
writer = pd.ExcelWriter("XX.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name='results')
writer.save()
What I would like is to read the first row from Excel and store it in memory, do the calculation, get the result and save it into Excel, then go on to the second row, and so on, without keeping memory so busy. That way I would not have a results array containing 100k rows, since I erase it on each loop.
To solve the issue of loading the input file into memory, I would look into using a generator. A generator works by iterating over any iterable but only yielding the next element, instead of holding the entire iterable in memory. In your case, this would return only the next row from your .xlsx file, instead of keeping the entire file in memory (see the sketch below).
However, this will not solve the issue of having a very large "results" array. Unfortunately, updating a .csv or .xlsx file as you go will take a very long time, significantly longer than updating the object in memory. There is a trade-off here: you can either use up lots of memory by updating your "results" array and then writing it all to a file at the end, or you can very slowly update a file in the file system with the results as you go, at the cost of much slower execution.
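A minimal sketch of the generator idea, reusing the xlrd objects from your question:

import xlrd

workbook = xlrd.open_workbook('Data.xlsx')
worksheet = workbook.sheet_by_name("Sheet1")

def iter_rows(ws):
    # Yield one row at a time instead of building the whole sheet as a big list
    for r in range(1, ws.nrows):
        yield [ws.cell_value(r, c) for c in range(ws.ncols)]

for row_data in iter_rows(worksheet):
    pass  # do the per-row computation here, writing results out as you go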
For this kind of operation you are probably better off loading the CSV directly into a DataFrame; there are several methods for dealing with large files in pandas, detailed here: How to read a 6 GB csv file with pandas. Which method you choose depends a lot on the type of computation you need to do; since you seem to be processing one row at a time, using chunks will probably be the way to go.
Pandas has a lot of built-in optimization for operations on large sets of data, so the majority of the time you will see better performance working with data inside a DataFrame or Series than with pure Python. For the best performance, consider vectorizing your function or looping with the apply method, which lets pandas apply the function to all rows in the most efficient way possible.
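For illustration, a rough sketch of the apply-based version; compute() is a hypothetical stand-in for your per-row computation, and the file and sheet names are taken from your question:

import pandas as pd

df = pd.read_excel('Data.xlsx', sheet_name='Sheet1')

def compute(row):
    # hypothetical stand-in for the real per-row computation
    return row.astype(str).str.len().sum()  # e.g. total text length in the row

# apply calls the function once per row and collects the results
results = df.apply(compute, axis=1)
pd.DataFrame(results).to_excel('XX.xlsx', sheet_name='results')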
It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CVS/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, not transposed data file has 9000+ rows and 20ish cols. I'm using Excel 2003 which supports up to 256 columns.
EDIT: Figured out a solution that works for me. It's a lot simpler than I expected. I did end up using CSV instead of Excel (I found no serious difference in terms of my project). Here it is for whoever may have the same problem:
import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select
df = pd.read_csv(your_csv_file, index_col=0).T

selection = {}
for item in selectionList:
    selection[item] = df[item]

selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can provide a long list (0-indexed) of rows not to import, e.g. [0, 5, 6, 10]. That might end up being a huge list, though. If you provide a single integer, it will skip that number of rows and start reading from there. Set nrows to however many rows you want to read from the point where it starts.
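skiprows also accepts a callable in reasonably recent pandas versions, which avoids building the huge list. A sketch, assuming a hypothetical file name and a 0-indexed selection of data rows (not counting the header):

import pandas as pd

wanted = {2, 43, 792, 4760}   # data rows to keep, 0-indexed, header not counted
df = pd.read_csv('your_file.csv',
                 skiprows=lambda i: i != 0 and (i - 1) not in wanted)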
If I've misunderstood the issue, let me know.