Efficient merging of 2.1 billion records across datasets - python

I have several datasets consisting of 2 columns: ID and ActivityDate. Within a dataset, an ID+ActivityDate pair is unique. Each dataset is about 35 million records long, and there are 60+ datasets.
My desired output is basically ID, FirstActivityDate and LastActivityDate. This is essentially the reduce part of a map/reduce job.
My first try was basically to read the first dataset to establish the baseline, and then, as I read each subsequent dataset, loop over its records comparing and updating the LastActivityDate. Although the memory use was very acceptable (spiking at 2 GB but consistently under 1.25 GB), this took too long. By my calculation, the result set should be around 1.5 GB, so it's manageable in local memory.
for x in files:
    parsedData = parseFile(x)
    dt = parsedData[0]
    cards = parsedData[1]

    for card in cards:
        #num = int(card[:16])
        if card in result:
            result[card].lastRecharged = dt
        else:
            result[card] = CreditCard(dt)
Commenting out the line #num = int(card[:16]) made the loop execution drop to 30 seconds per file (originally around 150 seconds), but now the memory is out of control. The file parsing is basically a file read, which takes less than 1 second.
My second try was using pandas, but I couldn't merge the datasets the way I wanted. I must say I'm not proficient in pandas.
Is there a third option?

I ended up getting pretty close to my objective.
First I made the reading and parsing into memory concurrent and in batches using multithreading.pool; each result was pushed into a queue. Each pool handled 3 consecutive files, cycling through 4 pools. In a 5th pool, I pre-merge the dictionaries, dropping non-recurrent keys (cards). Then, in a 6th pool, I do the final merge.
The whole process takes about 75 seconds on my local machine. It ends up consuming over 4 GB of RAM, which is not ideal, but manageable.
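A rough sketch of that pipeline, reusing parseFile and CreditCard from the question; the batch size, pool size, and output file handling are illustrative assumptions, and the merge step shown simply keeps the latest lastRecharged seen, glossing over the first-date handling:

# Illustrative sketch only: parseFile and CreditCard come from the question,
# the batch/pool sizes are made up, and merge() only tracks the latest date.
from multiprocessing.pool import ThreadPool

def parse_batch(batch):
    # parse a batch of files into one partial {card: CreditCard} dict
    partial = {}
    for path in batch:
        dt, cards = parseFile(path)
        for card in cards:
            if card in partial:
                partial[card].lastRecharged = dt
            else:
                partial[card] = CreditCard(dt)
    return partial

def merge(dicts):
    # merge the partial dicts, keeping the latest lastRecharged per card
    result = {}
    for d in dicts:
        for card, cc in d.items():
            if card in result:
                result[card].lastRecharged = max(result[card].lastRecharged,
                                                 cc.lastRecharged)
            else:
                result[card] = cc
    return result

batches = [files[i:i + 3] for i in range(0, len(files), 3)]
with ThreadPool(4) as pool:
    partials = pool.map(parse_batch, batches)
result = merge(partials)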

IIUC you are interested in the first and last ActivityDate for each ID.
If this is the case you could use dask. Let's assume all your files are CSVs stored inside a folder called data.
import dask.dataframe as dd
import pandas as pd

df = dd.read_csv("data/*.csv")

# convert ActivityDate to datetime
df["ActivityDate"] = df["ActivityDate"].astype("M8[us]")

# use aggregate
out = df.groupby("ID")\
        .agg({"ActivityDate": ["min", "max"]})\
        .compute()

# flatten the MultiIndex columns and rename
out.columns = ["_".join(col) for col in out.columns]
out = out.reset_index()
out = out.rename(columns={"ActivityDate_min": "FirstActivityDate",
                          "ActivityDate_max": "LastActivityDate"})


memory issue when merging two data frames

I am stuck at the second-to-last statement, clueless. The error is: numpy.core._exceptions.MemoryError: Unable to allocate 58.1 GiB for an array with shape (7791676634,) and data type int64
My thinking was that merging a data frame of ~12 million records with another data frame that adds 3-4 more columns should not be a big deal.
Please help me out. Totally stuck here. Thanks.
Select_Emp_df has around 900k records, and Big_df has around 12 million records and 9 columns. I just need to merge the two DFs on a key column, like we do a VLOOKUP in Excel.
import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory=False)

# Append data into one data frame from three csv files of 3 years' transactions
df2019 = pd.read_csv('U21_02767G - Customer Trade Info2019.csv', low_memory=False)
df2021 = pd.read_csv('U21_02767G - Customer Trade Info2021(TillSep).csv', low_memory=False)
df2020 = pd.read_csv('Newdf2020.csv', low_memory=False)
Big_df = pd.concat([df2019, df2020, df2021], ignore_index=True)

Select_Emp_df = Emp_df[['CUSTKEY', 'GCIF_GENDER_DSC', 'SEX']]
Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')
Big_df.info()  # .info() prints its summary itself
Just before Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY'), try to delete the previous dataframes, like this:
del df2019
del df2020
del df2021
This should save some memory. Also try:
Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']].drop_duplicates(subset=['CUSTKEY'])
When I was younger, I used machines where the available RAM per process was 32k to 640k, and I used to process huge datasets on them (err... several MB, but much larger than memory). The key was to only keep in memory what was required.
Here you concat 3 large dataframes and later merge the result with another one. If you have memory issues, just reverse the concat and the merge: merge each individual file with Emp_df and immediately write the merged file to disk, then throw everything out of your memory between each step. If you use csv files, you can even directly build the concatenated csv file by appending the 2nd and 3rd merged files to the first one (use mode='a', header=False in the to_csv method).
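A minimal sketch of that merge-then-append approach, reusing the file and column names from the question; the output file name Big_df_merged.csv is just an example:

import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory=False)
Select_Emp_df = Emp_df[['CUSTKEY', 'GCIF_GENDER_DSC', 'SEX']].drop_duplicates(subset=['CUSTKEY'])

files = ['U21_02767G - Customer Trade Info2019.csv',
         'Newdf2020.csv',
         'U21_02767G - Customer Trade Info2021(TillSep).csv']

for i, path in enumerate(files):
    trades = pd.read_csv(path, low_memory=False)
    merged = pd.merge(trades, Select_Emp_df, on='CUSTKEY')
    # write the first merge with a header, append the rest without one
    merged.to_csv('Big_df_merged.csv', mode='w' if i == 0 else 'a',
                  header=(i == 0), index=False)
    del trades, merged  # drop everything before loading the next file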
Using suggestions provided by the community here and a bit of my own research, I edited my code, and here is what worked for me:
Select_Emp_df['LSTTRDDT'] = pd.to_datetime(Select_Emp_df['LSTTRDDT'], errors='coerce')
Select_Emp_df = Select_Emp_df.sort_values(by='LSTTRDDT', ascending=True)
Select_Emp_df = Select_Emp_df.drop_duplicates(subset='CUSTKEY', keep='last')
I just sorted the values on the last transaction date and dropped duplicates (on CUSTKEY) in the Select_Emp_df data frame.

Best way to find out number of rows in csv without loading the full thing

I've been dealing with a lot of 4-5 GB csv files at work over the last few days, and so that I can see how far they have progressed through reading/writing, I wrote a couple of wrapper functions on top of pandas' methods. It all seems to work very well; there is a bit of overhead, but the convenience outweighs most issues.
At the same time, when reading a csv, for the progress bar to display the correct percentage I need to know the number of rows in advance, since that determines how many chunks there will be. The simplest solution I came up with is to load only the 0th column of the csv before starting to load the rest, and get its size. But this does take a bit of time when you have files millions of rows long.
Also, reading a single column takes an unreasonably high proportion of the total time: reading a single column of a csv with 125 columns and a few million rows took ~24 seconds, while reading the whole file takes 63 seconds.
And this is a function I've been using to read csvs:
import sys

import pandas as pd
from tqdm import tqdm


def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # count the rows up front by loading only the first column
    length = pd.read_csv(filename, usecols=[0])
    length = length.values.shape[0]
    total = length // chunksize

    chunk_list = []
    chunks = pd.read_csv(filename, chunksize=chunksize)

    with tqdm(total=total, file=sys.stdout) as pbar:
        for chunk in chunks:
            chunk_list.append(chunk)
            pbar.set_description('Reading source csv file')
            pbar.update(1)

    df = pd.concat(chunk_list, axis=0)
    return df
Any way to get the number of rows in a csv faster that using my flawed method?
Assuming there are no quoted strings (with newlines in them) or other shenanigans in your CSV file, an accurate (but hacky) solution is to not even parse the file but simply count the number of newlines in it:
import numpy as np

chunk = 1024 * 1024  # process 1 MB at a time
f = np.memmap("test.csv")  # raw byte view of the file, no parsing
num_newlines = sum(np.sum(f[i:i + chunk] == ord('\n'))
                   for i in range(0, len(f), chunk))
del f
I was dealing with the same problem, but the solutions proposed didn't work for me. Dealing with csv files over 20 GB in size, the processing time was still too long for me. Consulting with a co-worker, I found an almost instant solution using subprocess. It goes like:
import subprocess
num_lines = int(subprocess.check_output("wc -l test.csv", shell=True).split()[0]) - 1
subprocess.check_output returns the output of wc -l (the number of lines, including the header, followed by the path to the file), split()[0] takes the line count as a str, int converts it to an integer, and finally we subtract 1 to account for the header.

Fastest way to insert data(5000 rows) in dataframe in python

I have 5000 json data points which I am iterating over and holding in a dataframe. Initially I add the data to a list of Series, and then append that list to the dataframe, using the code below:
1. (5000 times) pd.Series([trading_symbol, instrument_token], index=stock_instrument_token_df.columns)
2. (once) stock_instrument_token_df.append(listOfSeries, ignore_index=True)
The time taken executing step 1 is around 700-800 ms and step 2 around 200-300 ms, so overall this process takes around 1 second.
Before this I iterate through another set of 50,000 json data points and add them to a python dict. That takes around 300 ms.
Is there any faster way to do the insertion into the data frame?
Is there something wrong with the way I am looping through the data or inserting into the data frame? Is there any faster way to get the work done in a dataframe?
Complete code as requested, in case it helps:
Complete code as requested, if it helps
stock_instrument_token_df = pd.DataFrame(columns=['first', 'second'])
listOfSeries = []

for data in api_response:
    trading_symbol = data[Constants.tradingsymbol]
    instrument_token = data[Constants.instrument_token]
    listOfSeries.append(
        pd.Series([trading_symbol, instrument_token],
                  index=stock_instrument_token_df.columns))

stock_instrument_token_df = stock_instrument_token_df.append(listOfSeries, ignore_index=True)
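For reference, a commonly faster pattern is to collect plain Python lists and build the frame in one go; a minimal sketch assuming the same api_response and Constants names as above:

# collect plain lists instead of Series objects, then build the frame once
rows = []
for data in api_response:
    rows.append([data[Constants.tradingsymbol], data[Constants.instrument_token]])

stock_instrument_token_df = pd.DataFrame(rows, columns=['first', 'second'])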

the replacement of converted columns after downcasting doesn't end

I'm working on my first correlation analysis. I received the data as an excel file, imported it as a DataFrame (I had to pivot it), and now I have a set of almost 3000 rows and 25000 columns. I can't choose a subset from it, as every column is important for this project, and I also don't know what information every column stores, so I can't choose the most interesting ones, because everything is encoded as integer numbers (it is a university project). It is like a big questionnaire, where every person has his/her own row and the answers to every question are stored in a different column.
I really need to solve this issue, because later I'll have to replace the many NaNs with the medians of the columns and then start the correlation analysis. I tried that part first, and it didn't work because of the size, which is why I tried downcasting first.
The dataset is about 600 MB, and the downcasting instruction for the floats saved 300 MB, but when I try to put the converted columns back into a copy of my dataset, it runs for 30 minutes and doesn't do anything. No warning, no error until I interrupt the kernel, and even then no hint why it doesn't work.
I can't drop the NaNs first, because there are so many that it would erase almost everything.
# i've got this code from https://www.dataquest.io/blog/pandas-big-data/
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_float = myset.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')

print(mem_usage(gl_float))         # almost 600
print(mem_usage(converted_float))  # almost 300

optimized_gl = myset.copy()
optimized_gl[converted_float.columns] = converted_float  # this doesn't end
After the replacement works, I want to use the Imputer function for the NaN replacement and print the correlation result for my dataset.
in the end I've decided to use this:
column1 = myset.iloc[:,0]
converted_float.insert(loc=0, column='ids', value=column1)
instead of the lines with optimized_gl, and it solved the problem, but only because every column changed except for the first one, so I just had to add the first column back to the others.
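For the NaN step mentioned above, a minimal sketch of median imputation in pandas (sklearn's SimpleImputer would work similarly); this assumes converted_float as built earlier and the corr_matrix name is just illustrative:

# fill remaining NaNs in the downcasted float columns with their column
# medians, then compute the correlation matrix on those columns
float_cols = converted_float.select_dtypes(include=['float']).columns
converted_float[float_cols] = converted_float[float_cols].fillna(
    converted_float[float_cols].median())
corr_matrix = converted_float[float_cols].corr()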

How to add rows to pandas dataframe with reasonable performance

I have an empty data frame with about 120 columns that I want to fill using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row of the data frame and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't change.
Change several values in the last row according to the data I've extracted df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow; I assume the fault is in the append.
What is the correct approach to speed things up? Preallocate the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (this saved about 10% of the time).
I replaced df = df.append(df.iloc[-1]) with df.iloc[i] = df.iloc[i-1], where i is the current iteration of the loop (this saved about another 10% of the time).
I did some profiling, and even though I removed the append, the main issue is copying the previous line: df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []

for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)

# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
The fastest way would be to load into the dataframe directly via pd.read_csv().
Try separating the logic: first clean the unstructured data into structured data, then use pd.read_csv to load the dataframe.
You could share a sample unstructured line and the logic used to extract the structured data, which might give some more insight.
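A sketch of that two-step idea, with a hypothetical extract_fields() standing in for your line-parsing logic and illustrative file/column names:

import csv
import pandas as pd

# pass 1: turn the unstructured lines into a plain CSV
with open('unstructured.txt') as src, open('structured.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['var1', 'var2', 'var3'])  # illustrative column names
    for line in src:
        writer.writerow(extract_fields(line))  # your extraction logic (hypothetical helper)

# pass 2: let pandas do the fast, vectorized load
df = pd.read_csv('structured.csv')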
Where you use append, you end up copying the dataframe, which is inefficient. Try this whole thing again but avoiding this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

def YourCleaningFunction(line):
    # for each line, do your extraction here and return the structured fields
    # (or use the kind of function jpp just provided)
    return line.split(',')  # placeholder cleaning step

pool = ThreadPool(8)  # your number of cores
lines = open('your_big_csv.csv').read().split('\n')  # your csv as a list of lines
rows = pool.map(YourCleaningFunction, lines)
df = pd.DataFrame(rows)
pool.close()
pool.join()
