Creating a Cartesian product DataFrame without maxing out memory - python

I have several dataframes from which I'm creating a cartesian product (on purpose!).
After this, I'm exporting the result to disk.
I believe the size of the resulting dataframe could exceed my available memory, so I'm wondering: is there a way to chunk this so that the whole dataframe never needs to be in memory at the same time?
Example Code:
import pandas as pd

def create_list_from_range(r1, r2):
    if r1 == r2:
        return r1
    else:
        res = []
        while r1 < r2 + 1:
            res.append(r1)
            r1 += 1
        return res

# make a list of options
color_opt = ['red', 'blue', 'green', 'orange']
dow_opt = create_list_from_range(1, 7)
hod_opt = create_list_from_range(0, 23)

# turn each list into a dataframe
df_color = pd.DataFrame({'color': color_opt})
df_day = pd.DataFrame({'day_of_week': dow_opt})
df_hour = pd.DataFrame({'hour_of_day': hod_opt})

# add a dummy column to everything so I can easily do a cartesian product
df_color['dummy'] = 1
df_day['dummy'] = 1
df_hour['dummy'] = 1

# now cartesian product... cascading
merge1 = pd.merge(df_day, df_hour, on='dummy')
FINAL = pd.merge(merge1, df_color, on='dummy')
FINAL.to_csv('FINAL_OUTPUT.csv', index=False)

You could try building up individual rows using itertools.product. In your example, you could do this as follows:
from itertools import product
prod = product(color_opt, dow_opt, hod_opt)
You can then take the rows in batches, build a small DataFrame from each batch, and append it to an existing csv file using
df.to_csv("file", mode="a")
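For example, a minimal sketch of the chunked export (the chunk_size of 100,000 and the output filename are assumptions you would adapt):

import itertools
import pandas as pd

chunk_size = 100_000  # assumed batch size; tune to your memory budget
prod = itertools.product(dow_opt, hod_opt, color_opt)
columns = ['day_of_week', 'hour_of_day', 'color']

first = True
while True:
    # pull the next chunk_size rows off the iterator without materializing the full product
    chunk = list(itertools.islice(prod, chunk_size))
    if not chunk:
        break
    pd.DataFrame(chunk, columns=columns).to_csv(
        'FINAL_OUTPUT.csv', mode='w' if first else 'a',
        header=first, index=False)
    first = False

Only one chunk is ever held in memory, and the header is written just once.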

Related

Grouping a dataframe and performing operations on the resulting matrix in a parallelized manner using Python/Dask/multiprocessing?

I am working on a project where I need to group molecules in a database by their ID and perform operations on the resulting matrix. I am using Python and I want to improve performance by parallelizing the process.
I am currently loading the molecules from an SDF file and storing them in a Pandas dataframe. Each molecule has an ID, a unique Pose ID, and a unique Structure. My goal is to group the dataframe by ID and create a matrix for each ID group. The rows and columns of the matrix would correspond to the unique Pose IDs of the molecules in that ID group. Then, I can calculate values for each cell in the matrix, such as the similarity between the molecules that define that cell. However, the specific operations on the molecules are not important for this question. I am primarily asking for advice on how to set up such a system for parallelized computing using Dask or Multiprocessing, or if there are other better options.
Here is a gist of the version without any parallelisation (note that I have heavily modified it to make my question clearer; the code below produces the desired output, but I want to calculate the cells from the molecules, not the Pose IDs): https://gist.github.com/Tonylac77/abfd54b1ceef39f0d161fb6b21950edb
# Generate sample dataframe
import pandas as pd

df = pd.DataFrame(columns=['ID', 'Pose ID'])
ids = ['ID' + str(i) for i in range(1, 6)]
pose_ids = ['Pose ' + str(i) for i in range(1, 11)]
# For each ID, add 10 rows to the dataframe with the corresponding Pose IDs
df_list = []
for i in ids:
    temp_df = pd.DataFrame({'ID': [i] * 10, 'Pose ID': pose_ids})
    df_list.append(temp_df)
df = pd.concat(df_list)
print(df)
################
from tqdm import tqdm
import itertools
import functools
import numpy as np
from IPython.display import display

def full_matrix_calculation(df):
    # Here I am using just string concatenation as an example calculation,
    # in reality I am calling external functions
    def matrix_calculation(df, id_list):
        matrices = {}
        calc_dataframes = []
        for id in tqdm(id_list):
            df_name = df[df['ID'] == id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=[df_name['Pose ID']], columns=df_name['Pose ID'])
            for subset in itertools.combinations(df_name['Pose ID'], 2):
                result = subset[0] + subset[1]
                matrix.iloc[df_name[df_name['Pose ID'] == subset[0]].index.values, df_name[df_name['Pose ID'] == subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID'] == subset[1]].index.values, df_name[df_name['Pose ID'] == subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs

calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)
I have tried using multiprocessing, however, my implementation seems to be slower than the non-parallelized version : https://gist.github.com/Tonylac77/b4bbada97ee2bab7c37d4a29079af574
import multiprocessing

def function(tuple):
    return tuple[0] + tuple[1]

def full_matrix_calculation(df):
    # Here I am using just string concatenation as an example calculation,
    # in reality I am calling external functions
    def matrix_calculation(df, id_list):
        matrices = {}
        calc_dataframes = []
        for id in tqdm(id_list):
            df_name = df[df['ID'] == id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=[df_name['Pose ID']], columns=df_name['Pose ID'])
            with multiprocessing.Pool() as p:
                try:
                    results = p.map(function, itertools.combinations(df_name['Pose ID'], 2))
                except KeyError:
                    print('Incorrect clustering method selected')
                    return
            results_list = list(zip(itertools.combinations(df_name['Pose ID'], 2), results))
            for subset, result in results_list:
                matrix.iloc[df_name[df_name['Pose ID'] == subset[0]].index.values, df_name[df_name['Pose ID'] == subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID'] == subset[1]].index.values, df_name[df_name['Pose ID'] == subset[0]].index.values] = result
            matrices[id] = matrix
            # note: this second, serial pass recomputes and overwrites the same cells
            # that were just filled from the pool results above
            for subset in itertools.combinations(df_name['Pose ID'], 2):
                result = subset[0] + subset[1]
                matrix.iloc[df_name[df_name['Pose ID'] == subset[0]].index.values, df_name[df_name['Pose ID'] == subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID'] == subset[1]].index.values, df_name[df_name['Pose ID'] == subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs

calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)
I have also started playing around with Dask; however, the main issue I'm facing is that I need all of the values of one ID to be in the same Dask partition, otherwise I will have incomplete matrices (if I understand correctly, at least). I have tried to find a solution to this (like chunking into x partitions, etc.) but so far to no avail. I will update this thread if something changes.
Any advice to speed these calculations up is welcome. For reference, the actual datasets I'm working with contain ~10000 unique IDs and ~300000 Pose IDs. With the calculations I'm running on the molecules, some of these are taking 40h to complete.
This should be pretty straightforward using Dask DataFrame and groupby:

ddf = your_dataframe_as_dask

def matrix_calculation(df):
    matrix = pd.DataFrame(0.0, index=[df['Pose ID']], columns=df['Pose ID'])
    for subset in itertools.combinations(df['Pose ID'], 2):
        result = subset[0] + subset[1]
        matrix.iloc[df[df['Pose ID'] == subset[0]].index.values, df[df['Pose ID'] == subset[1]].index.values] = result
        matrix.iloc[df[df['Pose ID'] == subset[1]].index.values, df[df['Pose ID'] == subset[0]].index.values] = result
    return matrix

ddf.groupby('ID').apply(matrix_calculation).compute()

See https://examples.dask.org/dataframes/02-groupby.html#Groupby-Apply.
This will parallelize the work for each ID.
You might then want to look at https://docs.dask.org/en/stable/scheduling.html to choose the scheduler that suits your needs (the default with DataFrame is threads, which might not be efficient depending on your code).
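An alternative sketch, assuming the sample df from the question, keeps the grouping in pandas and parallelizes per-ID with dask.delayed instead of dask.dataframe; this guarantees every ID's rows are processed together and sidesteps the partition concern entirely. The string-concatenation placeholder and the scheduler choice are assumptions to be replaced with your real molecule calculation and setup:

import itertools
import pandas as pd
from dask import delayed, compute

def matrix_calculation(group):
    # work on a plain pandas group; reset the index so it starts at 0
    group = group.reset_index(drop=True)
    pose_ids = group['Pose ID']
    # placeholder calculation: in reality, call the external molecule functions here
    matrix = pd.DataFrame('', index=pose_ids, columns=pose_ids)
    for a, b in itertools.combinations(pose_ids, 2):
        matrix.loc[a, b] = a + b
        matrix.loc[b, a] = a + b
    return matrix

# build one lazy task per ID, then run them all in parallel
tasks = {id_: delayed(matrix_calculation)(group) for id_, group in df.groupby('ID')}
(results,) = compute(tasks, scheduler='processes')  # scheduler choice is an assumption

results is then a dict mapping each ID to its matrix, matching the matrices dict in the original code.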

How to create a new dataframe by sorted data

I would like to find the rows which meet the condition RSI < 25.
However, the result comes back as a single data frame. Is it possible to create a separate dataframe for each matching row?
Thanks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as wb
stock='TSLA'
ck_df = wb.DataReader(stock,data_source='yahoo',start='2015-01-01')
rsi_period = 14
chg = ck_df['Close'].diff(1)
gain = chg.mask(chg<0,0)
ck_df['Gain'] = gain
loss = chg.mask(chg>0,0)
ck_df['Loss'] = loss
avg_gain = gain.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
avg_loss = loss.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
ck_df['Avg Gain'] = avg_gain
ck_df['Avg Loss'] = avg_loss
rs = abs(avg_gain/avg_loss)
rsi = 100-(100/(1+rs))
ck_df['RSI'] = rsi
RSIFactor = ck_df['RSI'] <25
ck_df[RSIFactor]
If you want to know at what index the RSI < 25 then just use:
ck_df[ck_df['RSI'] <25].index
The result will also be a dataframe. If you insist on making a new one then:
new_df = ck_df[ck_df['RSI'] <25].copy()
To split the rows found by #Omkar's solution into separate dataframes, you might use this function taken from Pandas: split dataframe into multiple dataframes by number of rows:
def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []
    while True:
        if count > df_len - 1:
            break
        start = count
        count += n
        dfs.append(df.iloc[start:count])
    return dfs
With this you get a list of dataframes.
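For the specific case in the question (one dataframe per matching row), a small usage sketch assuming the ck_df and the split function above:

# keep only the rows where RSI < 25, then split them one row per dataframe
filtered = ck_df[ck_df['RSI'] < 25]
single_row_dfs = split_dataframe_to_chunks(filtered, 1)

for small_df in single_row_dfs:
    print(small_df)  # each element of the list is a one-row DataFrame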

Get a coordinate distance matrix using Pandas without loops

I am currently getting a distance matrix of coordinates from two data frames (ref_df and comp_df) using a nested for-loop over rows in both data frames, as shown below.
import geopy.distance
import pandas as pd
ref_df = pd.DataFrame({
    "grp_id": ['M-00353', 'M-00353', 'M-00353', 'M-00538', 'M-00538', 'M-00160', 'M-00160', 'M-00160',
               'M-00509', 'M-00509', 'M-00509', 'M-00509'],
    "name": ['B1', 'IIS', 'IISB I', 'BK', 'MM - BK', 'H(SL)', 'H(PKS SL)', 'PTH',
             'ASSM 1', 'PKS SSM', 'SSM', 'Sukajadi Sawit Mekar 1'],
    "lat": [0.43462, 0.43462, 0.43462, 1.74887222, 1.74887222, -2.6081,
            -2.6081, -2.6081, -2.378258, -2.378258, -2.378258, -2.378258],
    "long": [101.822603, 101.822603, 101.822603, 101.3710944, 101.3710944,
             104.12525, 104.12525, 104.12525, 112.542356, 112.542356, 112.542356, 112.542356]})
comp_df = pd.DataFrame({
    "uml_id": ['PO1000000021', 'PO1000000054', 'PO1000000058', 'PO1000000106'],
    "mill_name": ['PT IIS-BI', 'PT MM-BK', 'HL', 'PT SSM'],
    "Latitude": [0.4344444, 0.077043, -2.6081, -2.381111],
    "Longitude": [101.825, 102.030838, 104.12525, 112.539722]})
matched_coords = []
for row in ref_df.index:
    mill_id = ref_df.get_value(row, "grp_id")
    mill_lat = ref_df.get_value(row, "lat")
    mill_long = ref_df.get_value(row, "long")
    for columns in comp_df.index:
        gm_id = comp_df.get_value(columns, "uml_id")
        gm_lat = comp_df.get_value(columns, "Latitude")
        gm_long = comp_df.get_value(columns, "Longitude")
        dist = geopy.distance.distance(
            (mill_lat, mill_long),
            (gm_lat, gm_long)).km
        matched_coords.append([
            mill_id, mill_lat, mill_long,
            gm_id, gm_lat, gm_long, dist
        ])

# Convert to data frame
mc_df = pd.DataFrame(matched_coords)
mc_df.columns = [
    'grp_id', 'grp_lat', 'grp_long',
    'match_id', 'match_lat', 'match_long', 'dist'
]

# Pivot to create wide data frame (matrix of distances)
mc_wide_df = mc_df.pivot_table(
    values="dist",
    index=["grp_id", "grp_lat", "grp_long"],
    columns="match_id").reset_index()
However, I'd like to simplify the process and the code by creating a helper function and using an apply on the data frames. My attempt below is not working. Is anybody able to help me figure out what's going wrong here?
# Test apply!
def get_coords_dist(x):
    dist = geopy.distance.distance((x['lat'], x['long']), (comp_df['Latitude'], comp_df['Longitude'])).km
    return pd.Series({comp_df.iloc[i[2]]['uml_id']: i for i in dist})

mc_df = ref_df.merge(ref_df.sort_values('grp_id').apply(get_coords_dist, axis=1), left_index=True, right_index=True)
You're looking to perform a cross join between the two data frames ref_df and comp_df. One way to do this is to pd.merge on a dummy column.
def distance_km(x, y):
    return geopy.distance.distance(x, y).km

# it looks like your coordinates depend only on grp_id
ref_df_dd = ref_df.drop_duplicates(['grp_id', 'lat', 'long'])

# assign a dummy "_" column in both data frames, merge, and drop the dummy
# column afterwards
merged_df = pd.merge(
    ref_df_dd.assign(_=1),
    comp_df.assign(_=1),
).drop('_', axis=1)

# apply your distance function on (lat, long) tuples in the Cartesian product
merged_df['distance'] = list(
    map(distance_km,
        merged_df[['lat', 'long']].apply(tuple, 1),
        merged_df[['Latitude', 'Longitude']].apply(tuple, 1)))

# pivot table
merged_df.set_index(['grp_id', 'uml_id']).distance.unstack()
At this point the pivoted result looks like
uml_id PO1000000021 PO1000000054 PO1000000058 PO1000000106
grp_id
M-00160 422.745678 377.461999 0.000000 936.147322
M-00353 0.267531 45.832819 422.922708 1232.700696
M-00509 1232.642382 1200.904305 936.449658 0.430525
M-00538 153.871840 198.911938 571.009484 1324.234511
which is pretty close to what you want.
Another solution (which is more transparent and 2x faster than the approach above) makes use of itertools.product.
from itertools import product

# create a data frame by iterating over row pairs in the Cartesian product
merged_df = pd.DataFrame([{
    'grp_id': r.grp_id,
    'uml_id': c.uml_id,
    'distance': distance_km((r.lat, r.long), (c.Latitude, c.Longitude))
} for r, c in product(ref_df_dd.itertuples(), comp_df.itertuples())])

# pivot table
merged_df.set_index(['grp_id', 'uml_id']).distance.unstack()
This gives the same merged_df as above.
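If geodesic precision is not critical, a further option (not part of the original answer) is a fully vectorized great-circle distance with NumPy broadcasting; the haversine formula and the 6371 km Earth radius are the assumptions here:

import numpy as np
import pandas as pd

def haversine_matrix(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # pairwise great-circle distance between every point in the first set and every point in the second
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2[None, :] - lat1[:, None]
    dlon = lon2[None, :] - lon1[:, None]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

dists = haversine_matrix(ref_df_dd['lat'].to_numpy(), ref_df_dd['long'].to_numpy(),
                         comp_df['Latitude'].to_numpy(), comp_df['Longitude'].to_numpy())
wide = pd.DataFrame(dists, index=ref_df_dd['grp_id'], columns=comp_df['uml_id'])

The results differ slightly from geopy's geodesic distances (haversine assumes a spherical Earth), so it trades a little accuracy for a large speedup.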

Python Pandas Merge Causing Memory Overflow

I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?
The data I load is around 85 MB or so, but I often watch my Python session run up close to 10 GB of memory usage and then give a memory error.
I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.
Here's what I've done:
Importing the Main data
import requests, zipfile, StringIO
import numpy as np
import pandas as pd
STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'
r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STAR2013=pd.read_csv(z.open(STAR2013fileName))
Importing some Cross-Referencing Tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))
STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))
STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Renaming a Column ID to Allow for Merge
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
Successful Merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the Memory Overflow Happens
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in ipython notebook, but don't think that changes anything.
Although this is an old question, I recently came across the same problem.
In my instance, duplicate keys are required in both dataframes, and I needed a method which could tell if a merge will fit into memory ahead of computation, and if not, change the computation method.
The method I came up with is as follows:
Calculate merge size:
def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # rows whose key is NaN never compare equal to themselves, so count them separately
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]

    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.
If you are using a list of labels for the group_by, the final row size is:
min([merge_size(df1, df2, label, how) for label in group_by])
Check if this fits in memory
The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.
By multiplying this with the count of columns from both dataframes, then multiplying by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see if your system can calculate the full merge.
import numpy as np
import psutil

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize
    return required_memory <= psutil.virtual_memory().available
The merge_size method has been proposed as an extension of pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068.
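Applied to the merges in the question, a hedged usage sketch (the column name comes from the question; the chunked fallback and output filename are assumptions about what you might do when the merge does not fit):

import os
import pandas as pd

output_path = 'merged_output.csv'  # hypothetical output file

if mem_fit(merged, STARentityList2013, 'School Code', how='inner'):
    merged = pd.merge(merged, STARentityList2013, on='School Code')
else:
    # assumed fallback: merge one school at a time and stream each piece to disk
    for school_code, chunk in merged.groupby('School Code'):
        piece = pd.merge(chunk, STARentityList2013, on='School Code')
        piece.to_csv(output_path, mode='a', index=False,
                     header=not os.path.exists(output_path))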

How to create a pivot table on extremely large dataframes in Pandas

I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3 instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that the final sum() looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format', 'table')

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid'] = [1, 2, 3, 1, 2, 3, 1, 2, 3]  # evenly distributing shipmentid values for testing purposes
    frame['qty'] = np.random.randint(1, 5, 9)           # random quantity is ok for this test
    frame['catid'] = np.random.randint(1, 5, 9)         # random category is ok for this test
    return frame

def pivotSegment(segmentNumber, passedFrame):
    segmentSize = 3  # take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)]  # slice the input DF
    # ensure that all chunks are identically formatted after the pivot by
    # appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1, 5+1)
    span['shipmentid'] = 1
    span['qty'] = 0
    frame = frame.append(span)
    return frame.pivot_table(['qty'], index=['shipmentid'], columns='catid',
                             aggfunc='sum', fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin, segMax):
    segment = pivotSegment(i, frame)
    store.append('data', frame[(i*3):(i*3 + 3)])
    store.append('pivotedData', segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])
print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData', chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid', level=0).sum() for df in store.select('pivotedData', chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])
store.close()
os.remove('testdata.h5')
print('closed')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2), instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In Python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor / if there are a large number of new groups then it may be preferable to start the res as zero of the correct size (by getting the unique group keys e.g. by looping through the chunks), and then add in place.
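A sketch of that last suggestion, assuming the 'df' key in the store and a grouping column named someKey as in the question (the two-pass structure and float result dtype are assumptions):

import pandas as pd

# first pass: collect the full set of group keys without holding all rows at once
keys = set()
for chunk in store.select('df', chunksize=50000):
    keys.update(chunk['someKey'].unique())

# peek at a single row to learn the value columns
value_cols = [c for c in store.select('df', start=0, stop=1).columns if c != 'someKey']

# second pass: pre-allocate the result and add each chunk's partial sums in place
res = pd.DataFrame(0.0, index=sorted(keys), columns=value_cols)
for chunk in store.select('df', chunksize=50000):
    partial = chunk.groupby('someKey')[value_cols].sum()
    res.loc[partial.index, value_cols] += partial

This adds each chunk's partial sums into a pre-sized result instead of realigning a growing frame on every add, at the cost of reading the store twice.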
