Pandas ignoring headers while reading a large (2 GB) CSV - Python

I'm trying to read a rather large CSV (2 GB) with pandas to do some datatype manipulation and joining with other dataframes I have already loaded. Since I want to be a little careful with memory, I decided to read it in chunks. For the purposes of the question, here is an extract of my CSV layout with dummy data (I can't really share the real data, sorry!):
institution_id,person_id,first_name,last_name,confidence,institution_name
1141414141,4141414141,JOHN,SMITH,0.7,TEMP PLACE TOWN
10123131114,4141414141,JOHN,SMITH,0.7,TEMP PLACE CITY
1003131313188,4141414141,JOHN,SMITH,0.7,"TEMP PLACE,TOWN"
18613131314,1473131313,JOHN,SMITH,0.7,OTHER TEMP PLACE
192213131313152,1234242383,JANE,SMITH,0.7,"OTHER TEMP INC, LLC"
My pandas code to read the file:
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # progress_apply below comes from tqdm and needs this registration

inst_map = pd.read_csv("data/hugefile.csv",
                       engine="python",
                       chunksize=1000000,
                       index_col=False)

print("processing institution chunks")
chunk_list = []  # append each chunk df here
for chunk in inst_map:
    # perform data filtering
    chunk['person_id'] = chunk['person_id'].progress_apply(zip_check)
    chunk['institution_id'] = chunk['institution_id'].progress_apply(zip_check)
    # once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk)

ins_processed = pd.concat(chunk_list)
The zip_check function I'm applying basically performs some datatype checks and then converts the value it gets into an integer.
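The post doesn't include zip_check itself; purely as an illustration, a check-and-convert helper of the kind described might look like this (the exact checks are an assumption on my part):

def zip_check(value):
    # coerce an id-like value to int, leaving anything unparseable as None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None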
Whenever I read the CSV, it only ever reads the institution_id column and generates an index; the other columns in the CSV are silently dropped.
When I don't use index_col=False as an option, it sets 1141414141/4141414141/JOHN/SMITH/0.7 (basically the first five values in the row) as the index, keeps only institution_id as the header, and reads only institution_name into the dataframe as a value.
I honestly have no clue what is going on here, and after 2 hours of SO / Google searching I decided to just ask this as a question. Hope someone can help me, thanks!

It turned out that something had gone wrong while transferring the large CSV file to my remote processing server (which has sufficient RAM to handle in-memory editing); processing the chunks on my local computer worked as expected.
After re-uploading the file, it worked fine on the remote server as well.
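For anyone hitting something similar: since the root cause was a corrupted transfer, a cheap safeguard is to checksum the file on both machines before processing. A small stdlib-only sketch (the file path is the one from the question):

import hashlib

def md5sum(path, block_size=1 << 20):
    # hash the file in 1 MB blocks so the 2 GB file never sits in memory
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

print(md5sum("data/hugefile.csv"))  # run on both ends and compare the digests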

Related

How to efficiently read data from a large Excel file, do the computation, and then store the results back in Python?

Let's say I have an Excel file with 100k rows. My code tries to read it row by row and, for each row, do some computation (including a benchmark of how long each row takes). My code then produces an array of results with 100k rows. My Python code works, but it is not efficient: it is taking me several days, and the benchmark results get worse over time, I guess due to high memory consumption. Please see my attempt and let me know how to improve it.
My code saves into results=[] and only writes it out at the end. Also, at the start I store the whole Excel file in worksheet. I think this will cause memory issues, since my Excel file has very large text in its cells (not only numbers).
import xlrd
import pandas as pd

ExcelFileName = 'Data.xlsx'
workbook = xlrd.open_workbook(ExcelFileName)
worksheet = workbook.sheet_by_name("Sheet1")  # we need to read the data
num_rows = worksheet.nrows  # number of rows
num_cols = worksheet.ncols  # number of columns

results = []
for curr_row in range(1, num_rows):  # start at 1 to skip the header row
    row_data = []
    for curr_col in range(num_cols):
        data = worksheet.cell_value(curr_row, curr_col)  # read the data in the current cell
        row_data.append(data)
    #### do computation here ####
    ## save results like results += [...]

### save results array in a dataframe and then print it to excel
df = pd.DataFrame(results)
writer = pd.ExcelWriter("XX.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name='results')
writer.save()
What I would like is to read the first row from Excel and store it in memory, do the calculation, get the result and save it into Excel, then go on to the second row, and so on, without keeping memory so busy. That way I would not have a results array containing 100k rows, since I erase it on each loop.
To solve the issue of loading the input file into memory, I would look into using a generator. A generator iterates over any iterable, but returns only the next element instead of the entire iterable. In your case, this would return only the next row from your .xlsx file, instead of keeping the entire file in memory.
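As a sketch of that idea: as far as I know xlrd's open_workbook parses the whole .xlsx up front, but openpyxl's read-only mode genuinely yields one row at a time (an assumption here: openpyxl is available; the file and sheet names are the ones from the question):

from openpyxl import load_workbook

def iter_excel_rows(path, sheet_name="Sheet1"):
    # read_only=True streams rows instead of loading the workbook into memory
    wb = load_workbook(path, read_only=True)
    ws = wb[sheet_name]
    for row in ws.iter_rows(min_row=2, values_only=True):  # min_row=2 skips the header
        yield row

for row_data in iter_excel_rows("Data.xlsx"):
    pass  # do the per-row computation here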
However, this will not solve the issue of having a very large "results" array. Unfortunately, updating a .csv or .xlsx file as you go takes a very long time, significantly longer than updating an object in memory. There is a trade-off here: you can either use up lots of memory by updating your "results" array and writing it all to a file at the end, or you can slowly update a file on disk with the results as you go at the cost of much slower execution.
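If you accept the slower option, appending each result to a CSV as you go keeps memory flat. A sketch building on the generator above (compute() is a made-up stand-in for your per-row computation):

import csv

with open("results.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for row_data in iter_excel_rows("Data.xlsx"):
        writer.writerow(compute(row_data))  # one row written per iteration, nothing accumulates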
For this kind of operation you are probably better off loading the csv directly into a DataFrame; there are several methods for dealing with large files in pandas that are detailed here: How to read a 6 GB csv file with pandas. Which method you choose will have a lot to do with the type of computation you need to do. Since you seem to be processing one row at a time, using chunks will probably be the way to go.
Pandas has a lot of built-in optimization for operations on large sets of data, so the majority of the time you will see better performance working with data in a DataFrame or Series than in pure Python. For the best performance, consider vectorizing your function or looping with the apply method, which lets pandas apply the function to all rows in the most efficient way possible.
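To make that last point concrete, here is the same toy computation written three ways (the column names are invented for the example); the vectorized form is usually the fastest by a wide margin:

import pandas as pd

df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})

# pure-Python loop: slowest
out = [row.a * 2 + row.b for row in df.itertuples()]

# apply: pandas drives the iteration
out = df.apply(lambda r: r["a"] * 2 + r["b"], axis=1)

# vectorized: whole-column arithmetic, no Python-level loop
out = df["a"] * 2 + df["b"]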

Pandas: How to read rows from CSV or Excel file?

It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, non-transposed data file has 9000+ rows and 20-ish columns. I'm using Excel 2003, which supports up to 256 columns.
EDIT: I figured out a solution that works for me. It's a lot simpler than I expected. I ended up using CSV instead of Excel (I found no serious difference in terms of my project). Here it is for whoever may have the same problem:
import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select
df = pd.read_csv(your_csv_file, index_col=0).T

selection = {}
for item in selectionList:
    selection[item] = df[item]

selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments of pandas.read_csv to pick out individual rows to read in.
With skiprows, you can provide a long (0-indexed) list of rows not to import, e.g. [0, 5, 6, 10]; that might end up being a huge list, though. If you provide it a single integer instead, it will skip that number of rows and start reading from there. Set nrows to however many rows you want to pick up from the point where it starts reading; see the sketch below.
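A small sketch of both variants (the file name is a stand-in):

import pandas as pd

# contiguous block: keep the header (row 0), skip the next 1000 rows, read 50
block = pd.read_csv("your_csv_file", skiprows=range(1, 1001), nrows=50)

# scattered rows: skiprows also accepts a callable, so you can keep the
# header plus a set of row positions without building a huge skip list
wanted = {2, 43, 792, 4760}
picked = pd.read_csv("your_csv_file",
                     skiprows=lambda i: i != 0 and i not in wanted)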
If I've misunderstood the issue, let me know.

Streaming a large file into BigQuery

I am trying to tidy up a large (8 GB) .csv file in Python and then stream it into BigQuery. My code below starts off okay, as the table is created and the first 1000 rows go in, but then I get the error:
InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
Is this perhaps related to the streaming buffer? My issue is that I will need to remove the table before I run the code again, otherwise the first 1000 entries will be duplicated because of the 'append' method.
import pandas as pd

destination_table = 'product_data.FS_orders'
project_id = '##'
pkey = '##'
# the original names list was missing the comma after 'UnitpriceGBP', silently fusing two column names
names = ['Orderdate', 'Weborderno', 'Productcode', 'Quantitysold', 'Paymentmethod',
         'ProductGender', 'DeviceType', 'Brand', 'ProductDescription', 'OrderType',
         'ProductCategory', 'UnitpriceGBP', 'Webtype1', 'CostPrice', 'Webtype2',
         'Webtype3', 'Variant', 'Orderlinetax']
chunks = []
for chunk in pd.read_csv('Historic_orders.csv', chunksize=1000,
                         encoding='windows-1252', names=names):
    chunk = chunk.replace(r' *!', 'Null', regex=True)  # replace() returns a new frame; the original discarded the result
    chunk.to_gbq(destination_table, project_id, if_exists='append', private_key=pkey)
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
print(df.head(5))
df.to_csv('Historic_orders_cleaned.csv')  # to_csv is a DataFrame method, not pd.to_csv
Question:
- Why streaming and not simply loading? That way you can upload batches of 1 GB instead of 1000 rows. Streaming is usually for cases where you have continuous data that needs to be appended as it happens. If you have a break of a day between collecting the data and the load job, it's usually safer to just load it; see here.
Apart from that, I've had my share of issues loading tables into BigQuery from csv files, and most of the time it was either 1) the encoding (I see you have non-UTF-8 encoding) or 2) invalid characters, e.g. some comma lost in the middle of the file that broke a line.
To validate that, what if you insert the rows backwards? Do you get the same error?
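If you do go the load-job route, a minimal sketch using the google-cloud-bigquery client library (an assumption on my part: that library and credentials are set up; df, project_id and the table name are taken from the question):

from google.cloud import bigquery

client = bigquery.Client(project=project_id)
# a load job uploads the whole frame in one batch instead of streaming rows
job = client.load_table_from_dataframe(df, 'product_data.FS_orders')
job.result()  # block until the load job finishes

On the duplication worry: to_gbq also accepts if_exists='replace', so you could replace on the first chunk and append afterwards instead of dropping the table by hand.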

OverflowError with Pandas to_hdf

Python newbie here.
I am trying to save a large data frame into an HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, pandas 0.20.2.
I get the error “OverflowError: Python int too large to convert to C long”.
None of the machine's resources are close to their limits (RAM, CPU, swap usage).
Previous posts discuss dtype issues, but the following example shows that there is some other problem, potentially related to size:
import numpy as np
import pandas as pd

# sample dataframe to be saved, pardon my French
n = 500*1000*1000
df = pd.DataFrame({'col1': [999999999999999999]*n,
                   'col2': ['aaaaaaaaaaaaaaaaa']*n,
                   'col3': [999999999999999999]*n,
                   'col4': ['aaaaaaaaaaaaaaaaa']*n,
                   'col5': [999999999999999999]*n,
                   'col6': ['aaaaaaaaaaaaaaaaa']*n})

# works fine
lim = 200*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')

# works fine
lim = 300*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')

# Error
lim = 400*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue, and it seems that it is indeed connected to the size of the data frame rather than to the dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As suggested in the pandas documentation: mode {'a', 'w', 'r+'}, default 'a': 'a' means append; an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
As @Giovanni Maria Strampelli pointed out, the answer of @Artem Snorkovenko only saves the last batch. The pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of @Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each pandas Series object under a different key, where each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file were successfully loaded
srl = []  # list of Series objects
while dfdone == False:
    # print(i)  # uncomment to see whether the code is working properly
    try:  # check whether the current i value exists among the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to build the dataframe at the end
        i += 1  # increment i after loading the Series object
    except KeyError:  # the current i value exceeds the number of keys, so all keys are loaded
        dfdone = True  # terminate the while loop
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file; if the length is known, a for loop can be used instead.
Note that I am not saving dataframes in chunks here, so the loading procedure in its current form is not suitable for saving in chunks, where the data type would be DataFrame for each chunk. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to work for saving in chunks and generating a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
In addition to @tetrisforjeff's answer:
If the df contains object dtypes, the reading could lead to errors. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).

Looping through .xlsx files using pandas, only does first file

My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.
I thought the code below would suffice, but it only does the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So where am I going wrong? To be clear, there is no error returned; it just does not do the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.
I'm doing this with Anaconda on Windows 8.
import pandas as pd
import glob

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()  # creates my DataFrame
for f in f_list:  # basic for loop to go through the file list
    df = pd.read_excel(f)  # reads the .xlsx file
    all_data = all_data.append(df)  # appends file contents to the DataFrame
all_data.to_excel("output.xlsx")  # creates a new .xlsx
Edit with new information:
After trying some of the suggested changes, I noticed the output claiming the files are empty, except for one of them, which is slightly larger than the others. If I put them into the DataFrame, it claims the DataFrame is empty. If I put them into the dict, it claims there are no values associated. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.
I strongly recommend reading the DataFrames into a dict:
sheets = {f: pd.read_excel(f) for f in f_list}
For one thing this is very easy to debug: just inspect the dict in the REPL.
Another is that you can then concat these into one DataFrame efficiently in one pass:
pd.concat(sheets.values())
Note: this is significantly faster than append, which has to allocate a temporary DataFrame on each call.
An alternative issue is that your glob may not be picking up all the files; you should check that it is by printing f_list.
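Putting both suggestions together, a minimal version of the whole script might look like this (same glob pattern and output name as in the question):

import glob
import pandas as pd

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")
print(f_list)  # confirm the glob really finds every file

sheets = {f: pd.read_excel(f) for f in f_list}  # inspect this dict to debug
all_data = pd.concat(sheets.values())           # one pass, no repeated append
all_data.to_excel("output.xlsx")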
