I am stuck at the second-to-last statement below and have no clue why. The error is: numpy.core._exceptions.MemoryError: Unable to allocate 58.1 GiB for an array with shape (7791676634,) and data type int64
My thinking was that merging a data frame of ~12 million records with another data frame that just adds 3-4 columns should not be a big deal.
Please help me out. Totally stuck here. Thanks
Select_Emp_df has around 900k records, and Big_df has around 12 million records and 9 columns. I just need to merge the two DFs on a key column, like a VLOOKUP in Excel.
import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory=False)

# Append data into one data frame from three csv files of 3 years' transactions
df2019 = pd.read_csv('U21_02767G - Customer Trade Info2019.csv', low_memory=False)
df2021 = pd.read_csv('U21_02767G - Customer Trade Info2021(TillSep).csv', low_memory=False)
df2020 = pd.read_csv('Newdf2020.csv', low_memory=False)

Big_df = pd.concat([df2019, df2020, df2021], ignore_index=True)
Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']]
Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')
Big_df.info()
Just before Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY'), try deleting the dataframes you no longer need, like this:
del df2019
del df2020
del df2021
This should save some memory.

Also try

Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']].drop_duplicates(subset=['CUSTKEY'])

Duplicate CUSTKEY values in Select_Emp_df would turn the merge into a many-to-many join; that is how a ~12 million row frame can blow up to the ~7.8 billion rows implied by your error message.
When I was younger, I used machines where the available RAM per process was 32k to 640k, and I used to process huge datasets on them (err... several MB, but much larger than memory). The key was to keep in memory only what was required.
Here you concat 3 large dataframes only to merge the result with another one later. If you have memory issues, just reverse the concat and the merge: merge each individual file with Emp_df and immediately write the merged file to disk, throwing everything out of your memory between steps. If you use csv files, you can even directly build the concatenated csv file by appending the 2nd and 3rd merged files to the first one (use mode='a', header=False in the to_csv method).
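A minimal sketch of that approach, reusing the file and column names from the question (the output name 'Big_df.csv' and the usecols/drop_duplicates details are my own additions):

import pandas as pd

emp_cols = ['CUSTKEY', 'GCIF_GENDER_DSC', 'SEX']
Select_Emp_df = pd.read_csv('New_Employee_df.csv', usecols=emp_cols).drop_duplicates(subset=['CUSTKEY'])

trade_files = ['U21_02767G - Customer Trade Info2019.csv',
               'Newdf2020.csv',
               'U21_02767G - Customer Trade Info2021(TillSep).csv']

for i, path in enumerate(trade_files):
    trades = pd.read_csv(path, low_memory=False)
    merged = trades.merge(Select_Emp_df, on='CUSTKEY', how='left')
    # the first file creates the output with a header, later files are appended without one
    merged.to_csv('Big_df.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)
    del trades, merged  # free memory before the next file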
Using the suggestions provided by the community here and a bit of my own research, I edited my code, and here is what worked for me:

Select_Emp_df['LSTTRDDT'] = pd.to_datetime(Select_Emp_df['LSTTRDDT'], errors='coerce')
Select_Emp_df = Select_Emp_df.sort_values(by='LSTTRDDT', ascending=True)
Select_Emp_df = Select_Emp_df.drop_duplicates(subset='CUSTKEY', keep='last')

I just sorted the values on the last transaction date and dropped duplicates (on CUSTKEY) in the Select_Emp_df data frame, so each CUSTKEY keeps only its most recent row.
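As an extra safeguard (not part of the original post), pandas.merge can also validate that the lookup side really is unique, so a duplicate-key blow-up fails fast with a clear error instead of exhausting memory:

Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY', how='left', validate='many_to_one')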
Related
I have several data sets consisting of 2 columns: ID and ActivityDate. Within a dataset, an ID+ActivityDate pair is unique. Each dataset is about 35 million records long. There are 60+ datasets.
My desired output is basically ID, FirstActivityDate and LastActivityDate. This is basically the reduce part of a map/reduce job.
My first try was basically to read the first dataset to establish the baseline, and then, as I read each subsequent dataset, do a foreach that compares and updates the LastActivityDate. Although the memory used was very acceptable (spiking at 2 GB but consistently under 1.25 GB), this took too long. I did the calculation: the result set should be around 1.5 GB, so it's manageable in local memory.
for x in files:
    parsedData = parseFile(x)
    dt = parsedData[0]
    cards = parsedData[1]
    for card in cards:
        #num = int(card[:16])
        if card in result:
            result[card].lastRecharged = dt
        else:
            result[card] = CreditCard(dt)
Commenting out that line, #num = int(card[:16]), made the loop execution drop to 30 seconds per file (originally around 150 seconds), but now the memory is out of control. The file parsing is basically a file read, which takes less than 1 second.
My second try was using pandas, but I couldn't merge the datasets the way I want. I must say I'm not proficient in pandas.
Is there a third option?
I ended up getting pretty close to my objective.
First, I made the reading and parsing into memory concurrent and batched using a pool (multiprocessing), pushing each result onto a queue. Each pool handled 3 consecutive files, cycling through 4 pools. In a 5th pool I pre-merged the dictionaries, dropping non-recurrent keys (cards). Then in a 6th pool I did the final merge.
The whole process takes about 75 seconds on my local machine. The whole thing ends up consuming over 4 GB of RAM, which is not ideal, but manageable.
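A much-simplified sketch of that shape (parse the files in parallel, then a single reduce step instead of the staged pools described above; parse_file, the data/*.csv location and the 'card,ISO-date' line layout are all assumptions):

import glob
from multiprocessing.dummy import Pool  # thread pool, as in the original approach

def parse_file(path):
    # return {card: (first_seen, last_seen)} for one file;
    # assumes 'card,ISO-date' lines, so plain string comparison orders dates correctly
    seen = {}
    with open(path) as f:
        next(f, None)  # skip the header line
        for line in f:
            card, dt = line.rstrip('\n').split(',')
            first, last = seen.get(card, (dt, dt))
            seen[card] = (min(first, dt), max(last, dt))
    return seen

def merge_results(dicts):
    # reduce the per-file dictionaries into one {ID: (FirstActivityDate, LastActivityDate)}
    result = {}
    for d in dicts:
        for card, (first, last) in d.items():
            if card in result:
                f0, l0 = result[card]
                result[card] = (min(f0, first), max(l0, last))
            else:
                result[card] = (first, last)
    return result

files = glob.glob('data/*.csv')   # assumed location of the 60+ datasets
with Pool(4) as pool:             # 4 workers, mirroring the 4 pools above
    partials = pool.map(parse_file, files)
result = merge_results(partials)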
IIUC you are interested in the first and the last ActivityDate for each ID.
If this is the case, you could use dask. Let's assume all your files are csv and they are stored inside a folder called data.
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv("data/*.csv")
# convert ActivityDate to datetime
df["ActivityDate"] = df["ActivityDate"].astype("M8[us]")
# use aggregate
out = df.groupby("ID")\
        .agg({"ActivityDate": ["min", "max"]})\
        .compute()

out.columns = ["_".join(col) for col in out.columns]
out = out.reset_index()
out = out.rename(columns={"ActivityDate_min": "FirstActivityDate",
                          "ActivityDate_max": "LastActivityDate"})
I've got about 40 tsv files, with the size of any given tsv ranging from 250 MB to 3 GB. I'm looking to pull data from the tsvs where rows contain certain values.
My current approach is far from efficient:
nums_to_look = ['23462346', '35641264', ... , '35169331'] # being about 40k values I'm interested in
all_tsv_files = glob.glob(PATH_TO_FILES + '*.tsv')
all_dfs = []
for file in all_tsv_files:
    df = pd.read_csv(file, sep='\t')
    # Extract rows which match values in nums_to_look
    df = df[df['col_of_interest'].isin(nums_to_look)].reset_index(drop=True)
    all_dfs.append(df)
Surely there's a much more efficient way to do this that doesn't require reading every single file in fully and scanning all of its rows?
Any thoughts / insights would be much appreciated.
Thanks!
I have a csv file "qwi_ak_se_fa_gc_ns_op_u.csv" which contains a lot of observations of 80 variables. One of them is geography, which is the county. Every county belongs to something called a Commuting Zone (CZ). Using a matching table given in "czmatcher.csv" I can assign a CZ to every county given in geography.
The code below shows my approach. It simply goes through every row and finds its CZ by scanning the whole "czmatcher.csv" row by row until it hits the right one. Then I just copy the values using .loc. The problem is that this took over 10 hours to run on a 0.5 GB file (2.5 million rows), which isn't that much, and my intuition says this should be faster.
The original post includes a picture of a file example illustrating what the csv files look like. The idea is to construct the "Wanted result (CZ)" column, name it CZ and add it to the dataframe.
import pandas as pd

data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")

sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[j, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[j, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']
Is there a faster way of doing this?
The best way to do this is a left merge on your dataframes,
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
Since the key column is named differently in the two dataframes (geography in data, FIPS in czm), merge with left_on/right_on,

data_final = data.merge(czm, how='left', left_on='geography', right_on='FIPS')

or rename the column in one of them first so you can merge on a common name,

czm = czm.rename(columns={'FIPS': 'geography'})
read the doc for further information https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
To make it faster without reworking your whole solution, I would recommend using Dask DataFrames. Put simply, Dask reads your csv in chunks and processes them in parallel. After reading the csv you can call the .compute method to get a pandas df instead of a Dask df.
This will look like this:
import pandas as pd
import dask.dataframe as dd  # import Dask DataFrames

# use dd.read_csv instead of pd.read_csv
data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
data = data.compute().reset_index(drop=True)  # reset_index so .loc sees a unique index
czm = dd.read_csv("czmatcher.csv")
czm = czm.compute().reset_index(drop=True)

sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[j, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[j, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']
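Since the merge in the other answer removes the slow Python loop entirely, you can also combine the two ideas and stay in Dask the whole way (a sketch, assuming the same file and column names):

import dask.dataframe as dd

data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = dd.read_csv("czmatcher.csv")

# left-join the CZ lookup onto every row, then materialize a pandas dataframe
merged = data.merge(czm[['FIPS', 'CZID']], how='left',
                    left_on='geography', right_on='FIPS')
data = merged.rename(columns={'CZID': 'CZ'}).drop('FIPS', axis=1).compute()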
I have a csv file with over 400K rows and several hundred columns that I have decided to read in with chunks because it does not fit into memory and gives me MemoryError.
I have managed to read it in in chunks like this:
x = pd.read_csv('Training.csv', chunksize=10000)
and afterwards I can get each of the chunks by doing this:
a = x.get_chunk()
b = x.get_chunk()
etc., and keep doing this over 40 times, which is obviously slow and bad programming practice.
When I try doing the following in an attempt to create a loop that can save each chunk into a dataframe and somehow concatenate them:
for x in pd.read_csv('Training.csv', chunksize=500):
    x.get_chunk()
I get:
AttributeError: 'DataFrame' object has no attribute 'get_chunk'
What is the easiest way I can read in my file and concatenate all my chunks during the import?
Also, how do I do further manipulation on my dataset to avoid memory issues (particularly when imputing null values, standardizing/normalizing the dataframe, and then running machine learning models on it using scikit-learn)?
When you specify chunksize in a call to pandas.read_csv you get back a pandas.io.parsers.TextFileReader object rather than a DataFrame. Try this to go through the chunks:
reader = pd.read_csv('Training.csv', chunksize=500)
for chunk in reader:
    print(type(chunk))  # chunk is a dataframe
Or grab all the chunks (which probably won't solve your problem!):
reader = pd.read_csv('Training.csv', chunksize=500)
chunks = [chunk for chunk in reader]  # list of DataFrames
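If the combined data actually fits in memory, you can then concatenate that list into a single dataframe, which is what the question asks for:

df = pd.concat(chunks, ignore_index=True)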
Depending on what is in your dataset a great way of reducing memory use is to identify columns that can be converted to categorical data. Any column where the number of distinct values is much lower than the number of rows is a candidate for this. Suppose a column contains some sort of status with limited values (e.g. 'Open','Closed','On hold') do this:
chunk = chunk.assign(Status=lambda x: pd.Categorical(x['Status']))
This will now store just an integer for each row, and the DataFrame will hold a mapping (e.g. 0 = 'Open', 1 = 'Closed', etc.).
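A quick illustrative check of the saving (made-up column; the exact numbers depend on your data):

s = pd.Series(['Open', 'Closed', 'On hold'] * 100000)
print(s.memory_usage(deep=True))                     # object dtype: every string stored as a separate Python object
print(s.astype('category').memory_usage(deep=True))  # category: one small integer code per row plus a tiny lookup table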
You should also look at whether or not any of your data columns are redundant (they effectively contain the same information) - if any are then delete them. I've seen spreadsheets containing dates where people have generated columns for year, week, day as they find it easier to work with. Get rid of them!
I have an empty data frame with about 120 columns that I want to fill using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row in the data frame and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow; I assume the fault is in the append.
What is the correct approach to speed things up? Preallocating the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (this saved about 10% of the time)
I replaced df = df.append(df.iloc[-1]) with df.iloc[i] = df.iloc[i-1], where i is the current iteration in the loop (this saved about another 10% of the time)
I did some profiling; even with the append removed, the main issue is copying the previous row, i.e. df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []

for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)

# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
The fastest way would be to load into a dataframe directly via pd.read_csv().
Try separating the logic: first clean the unstructured data into structured data, then use pd.read_csv to load the dataframe.
If you share a sample unstructured line and the extraction logic, that might give more insight into the problem.
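A rough sketch of that split, with a hypothetical extract_fields() standing in for your line-parsing logic and a made-up input file name:

import io
import pandas as pd

def extract_fields(line):
    # hypothetical: pull the values you need out of one unstructured line
    parts = line.split()
    return [parts[0], parts[1], parts[-1]]

buf = io.StringIO()
buf.write('var1,var2,var3\n')                      # header for the structured version
with open('my_unstructured_file.txt') as f:
    for line in f:
        buf.write(','.join(extract_fields(line)) + '\n')

buf.seek(0)
df = pd.read_csv(buf)   # one fast vectorized parse instead of 1.8M appends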
Where you use append you end up copying the dataframe, which is inefficient. Try the whole thing again but avoid this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up.

import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

def YourCleaningFunction(line):
    # placeholder: extract what you need from one line and
    # return it as a list of column values
    # (or use the kind of function jpp just provided)
    return line.split(',')

pool = ThreadPool(8)  # your number of cores
lines = open('your_big_csv.csv').read().split('\n')  # your csv as a list of lines
rows = pool.map(YourCleaningFunction, lines)
df = pd.DataFrame(rows)
pool.close()
pool.join()