I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially wanted only 2 of the other columns from that transportation file, so after I downloaded the file I removed every column except those 2 and the census tract column.
This is the code I used:
import pandas as pd

df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index=False)
This worked, but then I wanted to add the other columns from the transportation file, so I went back to my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file was too big, so I tried merging individual columns (other than the 2 I was initially able to merge), but this again resulted in all of the correct columns and only 4 rows.
Any help would be much appreciated.
Edits:
Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index = False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transport file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why is it that I am able to merge those initial 2 columns from the transportation file and keep all of the rows from my file but when I try to merge the entire transportation file I only get 4 rows of output?
I recommend specifying the merge type and the merge column(s).
When you use pd.merge(), the default is an inner merge on all columns that share the same name. Specify both explicitly:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', left_on=[COLUMN], right_on=[COLUMN])
It is possible that one of the columns you previously removed from "transportation_data.xlsx" has the same name as a column in "my_data.xlsx", so the default inner merge joins on it as well and drops the unmatched rows.
A 'left' merge attaches the columns you need from "transportation_data.xlsx" to "my_data.xlsx" wherever there is a match and leaves NaN otherwise, so the merged DataFrame keeps the same number of rows that "my_data.xlsx" currently has.
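If an inner merge is silently dropping rows, a quick diagnostic (a sketch using the file and column names from the question) is a left merge with indicator=True, plus a check that both key columns have the same dtype:
import pandas as pd

df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')

# Keep every row of my_data and flag whether it found a match in the transportation file
check = pd.merge(df_my_data, df_transportation_data,
                 on='census_tract', how='left', indicator=True)
print(check['_merge'].value_counts())  # 'left_only' rows are the ones an inner merge drops

# A common cause of silent mismatches: the key columns have different dtypes (e.g. int vs. str)
print(df_my_data['census_tract'].dtype, df_transportation_data['census_tract'].dtype)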
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.
Looking for some help with the below. I have 2 big csv files and need to get the data based on a few conditions. Here is my sample data file:
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
2,15,12.99,0.0,34.33,0
2,169,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
2,148,12.99,0.0,34.33,0
4,154,12.99,0.0,34.33,0
And here is the second file, which maps each a1 value to a v* column name:
a1,k1
1,v1
2,v2
3,v3
4,v4
The values under a1 and k1 are to be matched, and if the v* column an a1 value maps to is zero, those rows are to be dropped from the final csv file. The expected output:
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
The values of v2 and v4 are zeroes, so the rows with a1 values 2 and 4 are dropped.
Thanks in advance.
IIUC:
# Find v* columns where all (any?) values are 0
vx_idx = df1.filter(regex=r'^v\d+').eq(0).all().loc[lambda x: x].index
# Find a1 values that map to those v* columns
a1_val = df2.loc[df2['k1'].isin(vx_idx), 'a1'].tolist()
# Filter out your final dataframe
out = df1[~df1['a1'].isin(a1_val)]
Output:
>>> out
a1 a2 v1 v2 v3 v4
0 1 12 12.99 0.0 34.33 0
1 1 13 12.99 0.0 34.33 0
2 1 145 12.99 0.0 34.33 0
5 3 164 12.99 0.0 34.33 0
6 3 147 12.99 0.0 34.33 0
7 1 174 12.99 0.0 34.33 0
>>> print(out.to_csv(index=False))
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
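For completeness, a self-contained sketch of the same approach that reads the two csv files and writes the filtered result (the file names here are placeholders, not from the question):
import pandas as pd

df1 = pd.read_csv('values.csv')   # columns: a1, a2, v1..v4
df2 = pd.read_csv('mapping.csv')  # columns: a1, k1

# v* columns that are entirely zero
zero_cols = df1.filter(regex=r'^v\d+').eq(0).all().loc[lambda s: s].index

# a1 values whose mapped k1 column is all zeros
drop_a1 = df2.loc[df2['k1'].isin(zero_cols), 'a1']

# Keep everything else and write the final csv
df1[~df1['a1'].isin(drop_a1)].to_csv('filtered.csv', index=False)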
I am trying to expand a column of a pandas dataframe (see the column segments in the example below). I am able to break it out into the components separated by ;. However, as you can see, some of the rows do not have all the elements. So what is happening is that data which should go into the Geo column ends up going into the BusSeg column when there is no Geo entry, or data that should be in the ProdServ column ends up in the Geo column.
Ideally I would like to have only the data, correctly placed, and not the indicator in each cell. So in the Geo column it should say 'NonUs', not 'Geo=NonUs'. That is, after separating correctly, I would like to remove the text up to and including the '=' sign in each cell. How can I do this?
Code below:
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
print("\ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()
Your Data
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
df:
company clv date line segments
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3 Rev 400 20181231 1 Subseg=Tr1
4 Rev 10 20181231 3 BusSeg=Pharma
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4;
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs;
Comment out this line df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True) in your code, and add these two lines:
d = pd.DataFrame(df1['segments'].str.split(';').apply(lambda x:{i.split("=")[0] : i.split("=")[1] for i in x if i}).to_dict()).T
df = pd.concat([df1, d], axis=1)
df:
company clv date line segments BusSeg Geo Prd Subseg
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1 Pharma NonUs Alpha Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1 Dev NaN Alpha Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2 Pharma US Alpha Tr2
3 Rev 400 20181231 1 Subseg=Tr1 NaN NaN NaN Tr1
4 Rev 10 20181231 3 BusSeg=Pharma Pharma NaN NaN NaN
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4; NaN China Alpha Tr4
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1 NaN NaN Beta Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1; Pharma US Delta Tr1
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs; Pharma NonUs NaN NaN
I suggest filling the columns one by one instead of using split, with something like the following code:
col = ['BusSeg', 'Geo', 'ProdServ', 'Sub']  # Column names.
var = ['BusSeg', 'Geo', 'Prd', 'Subseg']    # Variable names in the 'segments' column.
for c, v in zip(col, var):
    df1[c] = df1['segments'].str.extract(rf'{v}=([^;]*)', expand=False)
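A related one-shot variant (a sketch, assuming each key appears at most once per row) pulls every key=value pair with str.extractall and pivots the keys into columns:
# Extract all key=value pairs, then pivot the keys into columns
pairs = df1['segments'].str.extractall(r'(?P<key>[^;=]+)=(?P<val>[^;]*)')
wide = pairs.droplevel('match').pivot(columns='key', values='val')
result = df1.join(wide)  # adds BusSeg, Geo, Prd, Subseg, with NaN where a key is absent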
Here's a suggestion:
df1.segments = (df1.segments.str.split(';')
                .apply(lambda s:
                       dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
                    for col in set().union(*df1.segments)},
                   index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')
Result:
company clv date line Subseg Geo BusSeg Prd
0 Rev 500 20191231 1 Tr1 NonUs Pharma Alpha
1 Rev 200 20191231 3 Tr1 Dev Alpha
2 Rev 3000 20191231 2 Tr2 US Pharma Alpha
3 Rev 400 20181231 1 Tr1
4 Rev 10 20181231 3 Pharma
5 Rev 300 20181231 2 Tr4 China Alpha
6 Rev 560 20171231 1 Tr1 Beta
7 Rev 500 20171231 3 Tr1 US Pharma Delta
8 Rev 600 20171231 2 NonUs Pharma
I have 2 dataframes. rdf is the reference dataframe: it defines the intervals (Top and Bottom) I want to average over, with multiple intervals per ID number. ldf contains the actual values at each depth, so the calculation has to be run against ldf: for each interval in rdf, average all of the depths that fall between its Top and Bottom.
rdf is formatted as such:
ID Top Bottom
1 2010 3000
1 4300 4500
1 4550 5000
1 7100 7700
2 3200 4100
2 4120 4180
2 4300 5300
2 5500 5520
3 2300 2380
3 3200 4500
ldf is formatted as such:
ID Depth(ft) Value1 Value2 Value3
1 2000 45 .32 423
1 2000.5 43 .33 500
1 2001 40 .12 643
1 2001.5 28 .10 20
1 2002 40 .10 34
1 2002.5 23 .11 60
1 2003 34 .08 900
1 2003.5 54 .04 1002
2 2000 40 .28 560
2 2000 38 .25 654
...
3 2000 43 .30 343
I want to use rdf to define the top and bottom of each interval and calculate the average for Value1, Value2, and Value3 within it. I would also like a count of the rows used (not all of the depths between an interval's Top and Bottom necessarily exist, so it could be less than Bottom - Top). This will then modify rdf to make a new file:
new_rdf is formatted as such:
ID Top Bottom avgValue1 avgValue2 avgValue3 ThicknessCount(ft)
1 2010 3000 54 .14 456 74
1 4300 4500 23 .18 632 124
1 4550 5000 34 .24 780 111
1 7100 7700 54 .19 932 322
2 3200 4100 52 .32 134 532
2 4120 4180 16 .11 111 32
2 4300 5300 63 .29 872 873
2 5500 5520 33 .27 1111 9
3 2300 2380 63 .13 1442 32
3 3200 4500 37 .14 1839 87
I've been going back and forth on the best way to do this. I tried mimicking this time series example: Sum set of values from pandas dataframe within certain time frame
but it doesn't seem translatable:
import pandas as pd
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
I get "TypeError: Invalid comparison between dtype=float64 and str"
It works if I use the samples they made in the post, but it doesn't work with my data. I'm also hoping there's a simpler way to do this.
Edit # 2A:
Note: the sample DataFrames below are not exactly the same as those posted in the question.
Posting new code here that uses Top and Bottom from rdf to check DEPTH in ldf and calculate .mean() for each group using a for-loop. A range_key is created in rdf that is unique to each row, assuming that rdf does not have any duplicate rows.
# Import libraries
import pandas as pd
import numpy as np
# Create DataFrame
rdf = pd.DataFrame({
'ID': [1,1,1,1,2,2,2,2,3,3],
'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
'ID': [1,1,1,1,1,1,1,1,2,2,3],
'DEPTH': [2000,2000.5,2001,2001.5,4002,4002.5,5003,5003.5,2000,2000,2000],
'Value1':[45,43,40,28,40,23,34,54,40,38,43],
'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
# Create a key for merge later
ldf['range_key'] = np.nan
rdf['range_key'] = np.linspace(1,rdf.shape[0],rdf.shape[0]).astype(int).astype(str)
# Flag each row for a range
for i in range(ldf.shape[0]):
    for j in range(rdf.shape[0]):
        d = ldf['DEPTH'][i]
        if (d >= rdf['Top'][j]) & (d <= rdf['Bottom'][j]):
            rkey = rdf['range_key'][j]
            ldf.loc[i, 'range_key'] = rkey
            break
ldf['range_key'] = ldf['range_key'].astype(int).astype(str) # Convert to string
# Calculate mean for groups
ldf_mean = ldf.groupby(['ID','range_key']).mean().reset_index()
ldf_mean = ldf_mean.drop(['DEPTH'], axis=1)
# Merge into 'rdf'
new_rdf = rdf.merge(ldf_mean, on=['ID','range_key'], how='left')
new_rdf = new_rdf.drop(['range_key'], axis=1)
new_rdf
Output:
ID Top Bottom Value1 Value2 Value3
0 1 2000 2500 39.0 0.2175 396.5
1 1 4300 4500 NaN NaN NaN
2 1 4500 5000 NaN NaN NaN
3 1 7100 7700 NaN NaN NaN
4 2 3200 4100 NaN NaN NaN
5 2 4120 4180 NaN NaN NaN
6 2 4300 5300 NaN NaN NaN
7 2 5500 5520 NaN NaN NaN
8 3 2300 2380 NaN NaN NaN
9 3 3200 4500 NaN NaN NaN
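The nested loop above is fine for the small sample but scales poorly. A vectorized sketch of the same idea (column names follow the sample frames above, and it assumes the ID/Top/Bottom combinations in rdf are unique): merge ldf with rdf on ID, keep only the depths inside each interval, then aggregate.
# Pair every depth reading with every interval of the same ID, then filter and aggregate
merged = ldf.merge(rdf, on='ID')
in_range = merged[(merged['DEPTH'] >= merged['Top']) & (merged['DEPTH'] <= merged['Bottom'])]
stats = (in_range.groupby(['ID', 'Top', 'Bottom'])[['Value1', 'Value2', 'Value3']]
         .agg(['mean', 'count']))
stats.columns = ['_'.join(c) for c in stats.columns]  # e.g. Value1_mean, Value1_count
new_rdf = rdf.merge(stats.reset_index(), on=['ID', 'Top', 'Bottom'], how='left')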
Edit # 1:
Code below seems to work. Added an if-statement to the return from the code posted in the question above. Not sure if this is what you were looking to get. It calculates the .sum(). The first value in rdf is changed to a lower range to match the data in ldf.
# Import libraries
import pandas as pd
# Create DataFrame
rdf = pd.DataFrame({
'ID': [1,1,1,1,2,2,2,2,3,3],
'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
'ID': [1,1,1,1,1,1,1,1,2,2,3],
'DEPTH': [2000,2000.5,2001,2001.5,2002,2002.5,2003,2003.5,2000,2000,2000],
'Value1':[45,43,40,28,40,23,34,54,40,38,43],
'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
##### Code from the question (copy-pasted here)
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    if n.shape[0] > 0:
        return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
Output:
test
top bottom ID Value1
0 2000 2500 1.0 14014.0
1 4300 4500 NaN NaN
2 4500 5000 NaN NaN
3 7100 7700 NaN NaN
4 3200 4100 NaN NaN
5 4120 4180 NaN NaN
6 4300 5300 NaN NaN
7 5500 5520 NaN NaN
8 2300 2380 NaN NaN
9 3200 4500 NaN NaN
Sample data and imports
import pandas as pd
import numpy as np
import random
# dfr
rdata = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
'Top': [2010, 4300, 4550, 7100, 3200, 4120, 4300, 5500, 2300, 3200],
'Bottom': [3000, 4500, 5000, 7700, 4100, 4180, 5300, 5520, 2380, 4500]}
dfr = pd.DataFrame(rdata)
# display(dfr.head())
ID Top Bottom
0 1 2010 3000
1 1 4300 4500
2 1 4550 5000
3 1 7100 7700
4 2 3200 4100
# df
np.random.seed(365)
random.seed(365)
rows = 10000
data = {'id': [random.choice([1, 2, 3]) for _ in range(rows)],
'depth': [np.random.randint(2000, 8000) for _ in range(rows)],
'v1': [np.random.randint(40, 50) for _ in range(rows)],
'v2': np.random.rand(rows),
'v3': [np.random.randint(20, 1000) for _ in range(rows)]}
df = pd.DataFrame(data)
df.sort_values(['id', 'depth'], inplace=True)
df.reset_index(drop=True, inplace=True)
# display(df.head())
id depth v1 v2 v3
0 1 2004 48 0.517014 292
1 1 2004 41 0.997347 859
2 1 2006 42 0.278217 851
3 1 2006 49 0.570363 32
4 1 2009 43 0.462985 409
Use each row of dfr to filter and extract stats from df
There are plenty of answers on SO dealing with "TypeError: Invalid comparison between dtype=float64 and str". The numeric columns need to be cleaned of any value that can't be converted to a numeric type.
This code deals with using one dataframe to filter and return metrics for another dataframe.
For each row in dfr:
- Filter df
- Aggregate the mean and count for v1, v2 and v3
- Use .T to transpose the mean and count rows to columns
- Convert to a numpy array
- Slice the array for the 3 means and append the array to v_mean
- Slice the array for the max count and append the value to counts (the 3 counts could all be the same, if there are no NaNs in the data)
- Convert the list of arrays, v_mean, to a dataframe, and join it to dfr_new
- Add counts as a column in dfr_new
v_mean = list()
counts = list()
for idx, (i, t, b) in dfr.iterrows():  # iterate through each row of dfr
    data = df[['v1', 'v2', 'v3']][(df.id == i) & (df.depth >= t) & (df.depth <= b)].agg(['mean', 'count']).T.to_numpy()  # apply filters and get stats
    v_mean.append(data[:, 0])  # get the 3 means
    counts.append(data[:, 1].max())  # get the max of the 3 counts; each column has a count, and the counts could differ if there are NaNs in the data
# copy dfr to dfr_new
dfr_new = dfr.copy()
# add stats values
dfr_new = dfr_new.join(pd.DataFrame(v_mean, columns=['v1_m', 'v2_m', 'v3_m']))
dfr_new['counts'] = counts
# display(dfr_new)
ID Top Bottom v1_m v2_m v3_m counts
0 1 2010 3000 44.577491 0.496768 502.068266 542.0
1 1 4300 4500 44.555556 0.518066 530.968254 126.0
2 1 4550 5000 44.446281 0.538855 482.818182 242.0
3 1 7100 7700 44.348083 0.489983 506.681416 339.0
4 2 3200 4100 44.804040 0.487011 528.707071 495.0
5 2 4120 4180 45.096774 0.526687 520.967742 31.0
6 2 4300 5300 44.476980 0.529476 523.095764 543.0
7 2 5500 5520 46.000000 0.608876 430.500000 12.0
8 3 2300 2380 44.512195 0.456632 443.195122 41.0
9 3 3200 4500 44.554755 0.516616 501.841499 694.0
If I have df1:
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n1 ~ 2000
and df2
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n2 = 10000+
And have to perform an operation like:
df12 =
df1[0,A]-df2[0,A] df1[0,B]-df2[0,B] df1[0,C]-df2[0,C]....
df1[0,A]-df2[1,A] df1[0,B]-df2[1,B] df1[0,C]-df2[1,C]
...
df1[0,A]-df2[n2,A] df1[0,B]-df2[n2,B] df1[0,C]-df2[n2,C]
...
df1[1,A]-df2[0,A] df1[1,B]-df2[0,B] df1[1,C]-df2[0,C]....
df1[1,A]-df2[1,A] df1[1,B]-df2[1,B] df1[1,C]-df2[1,C]
...
df1[1,A]-df2[n2,A] df1[1,B]-df2[n2,B] df1[1,C]-df2[n2,C]
...
df1[n1,A]-df2[0,A] df1[n1,B]-df2[0,B] df1[n1,C]-df2[0,C]....
df1[n1,A]-df2[1,A] df1[n1,B]-df2[1,B] df1[n1,C]-df2[1,C]
...
df1[n1,A]-df2[n2,A] df1[n1,B]-df2[n2,B] df1[n1,C]-df2[n2,C]
Where every row in df1 is compared against every row in df2 producing a score.
What would be the best way to perform this operation using either pandas or vaex/equivalent?
Thanks in advance!
Broadcasting is the way to go:
pd.DataFrame((df1.to_numpy()[:, None] - df2.to_numpy()[None, ...]).reshape(-1, df1.shape[1]),
             columns=df2.columns,
             index=pd.MultiIndex.from_product((df1.index, df2.index)))
Output (for df1 the three first rows, df2 the two first rows):
A B C D
0 0 0.00 0.000 0.0 0.0
1 1.39 2.768 2.0 0.0
1 0 -1.39 -2.768 -2.0 0.0
1 0.00 0.000 0.0 0.0
2 0 2.47 1.201 3.9 -1.0
1 3.86 3.969 5.9 -1.0
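A self-contained version of the sketch (using the first three rows from the question's df1, with its first two rows standing in for df2) reproduces the output above:
import pandas as pd

df1 = pd.DataFrame({'A': [4.51, 3.12, 6.98],
                    'B': [6.212, 3.444, 7.413],
                    'C': [3.12, 1.12, 7.02],
                    'D': [1, 1, 0]})
df2 = df1.iloc[:2].copy()  # stand-in for the second, larger frame

# Broadcast (n1, 1, k) - (1, n2, k) -> (n1, n2, k), then flatten to (n1*n2, k)
diff = df1.to_numpy()[:, None] - df2.to_numpy()[None, :]
out = pd.DataFrame(diff.reshape(-1, df1.shape[1]),
                   columns=df1.columns,
                   index=pd.MultiIndex.from_product((df1.index, df2.index)))
print(out)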
I would use openpyxl. A loop like this would read the values for df1 and df2 cell by cell (minr, maxr, starting_col and finishing_col are placeholders for your own row and column ranges):
df1_values = []
for row in sheet.iter_rows(min_row=minr, min_col=starting_col, max_col=finishing_col, max_row=maxr):
    for cell in row:
        df1_values.append(cell.value)
df2_values = []
for row in sheet.iter_rows(min_row=minr, min_col=starting_col, max_col=finishing_col, max_row=maxr):
    for cell in row:
        df2_values.append(cell.value)
From here, what do you want to do: create new values? Put them where? The code above points at them.
An interesting question, which vaex can actually solve quite memory-efficiently (although we should be able to require practically no memory in the future).
Let's start by creating the vaex dataframes, and increase the numbers a bit, to 2,000 and 200,000 rows.
import vaex
import numpy as np
names = "ABCD"
N = 2000
M = N * 100
print(f'{N*M:,} rows')
400,000,000 rows
df1 = vaex.from_dict({name + '1': np.random.random(N) * 6 for name in names})
# We add a virtual range column for joining (requires no memory)
df1['i1'] = vaex.vrange(0, N, dtype=np.int32)
print(df1)
# A1 B1 C1 D1 i1
0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0
1 5.873731485927979 5.669031702051764 5.696571067838359 1.0310578585207142 1
2 4.513310303419997 4.466469647700519 5.047406986222205 3.4417402924374407 2
3 0.43402400660624174 1.157476656465433 2.179139262842482 1.1666706679131253 3
4 3.3698854360766526 2.203558794966768 0.39649910973621827 2.5576740079630502 4
... ... ... ... ... ...
1,995 4.836227485536714 4.093067389612236 5.992282902119859 1.3549691660861871 1995
1,996 1.1157617217838995 1.1606619796004967 3.2771620798090533 4.249631266421745 1996
1,997 4.628846984287445 4.019449674317169 3.7307713985954947 3.7702606362049362 1997
1,998 1.3196727531762933 2.6758762345410565 3.249315566523623 2.6501467681546123 1998
1,999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1999
df2 = vaex.from_dict({name + '2': np.random.random(M) * 6 for name in names})
df2['i2'] = vaex.vrange(0, M, dtype=np.int32)
print(df2)
# A2 B2 C2 D2 i2
0 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814 0
1 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813 1
2 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523 2
3 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354 3
4 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291 4
... ... ... ... ... ...
199,995 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905 199995
199,996 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974 199996
199,997 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913 199997
199,998 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983 199998
199,999 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579 199999
Now we create our 'master' vaex dataframe, which requires no memory at all; it's made of a virtual column and two expressions (stored as virtual columns):
df = vaex.from_arrays(i=vaex.vrange(0, N*M, dtype=np.int64))
df['i1'] = df.i // M # index to df1
df['i2'] = df.i % M # index to df2
print(df)
# i i1 i2
0 0 0 0
1 1 0 1
2 2 0 2
3 3 0 3
4 4 0 4
... ... ... ...
399,999,995 399999995 1999 199995
399,999,996 399999996 1999 199996
399,999,997 399999997 1999 199997
399,999,998 399999998 1999 199998
399,999,999 399999999 1999 199999
Unfortunately vaex cannot use these integer indices as lookups for joining directly; it has to go through a hashmap. So there is room for improvement here for vaex. If it could do this, we could scale this idea up to trillions of rows.
print(f"The next two joins require ~{len(df)*8*2//1024**2:,} MB of RAM")
The next two joins require ~6,103 MB of RAM
df_big = df.join(df1, on='i1')
df_big = df_big.join(df2, on='i2')
print(df_big)
# i i1 i2 A1 B1 C1 D1 A2 B2 C2 D2
0 0 0 0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814
1 1 0 1 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813
2 2 0 2 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523
3 3 0 3 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354
4 4 0 4 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291
... ... ... ... ... ... ... ... ... ... ... ...
399,999,995 399999995 1999 199995 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905
399,999,996 399999996 1999 199996 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974
399,999,997 399999997 1999 199997 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913
399,999,998 399999998 1999 199998 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983
399,999,999 399999999 1999 199999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579
Now we have our big dataframe, and we only need to do the computation, which uses virtual columns and thus requires no extra memory.
# Add virtual columns (which require no memory)
for name in names:
    df_big[name] = df_big[name + '1'] - df_big[name + '2']
print(df_big[['A', 'B', 'C', 'D']])
# A B C D
0 3.17088337394884 0.13557709240744487 -2.6561095686690526 -3.2102593816457916
1 1.038958480528846 0.40281056887283384 -0.20455723594495767 -0.9066638216305232
2 1.5286139180996834 2.8940641279631096 -2.344316689352858 -0.2881587515492332
3 5.821449101859425 0.8309275842903854 -0.2716504605722432 -2.3881997446052456
4 3.1143842126546923 -0.18179974540784372 0.7378207553268026 -4.87247823234962
... ... ... ... ...
399,999,995 -0.30815443055055525 -4.249263637083556 -1.244858612724894 -3.257308187087283
399,999,996 4.906435209862181 -4.032405386526832 0.2462168895612611 -1.2802665943876756
399,999,997 -0.7271795402745553 -3.51391801469319 -3.765878629706985 -1.930689509247291
399,999,998 -0.8255006637202813 -2.5585697698855805 -3.629680556902816 -0.41595813284337657
399,999,999 3.318493079331818 -1.7330868224329334 -3.4258806652175418 -2.220727961485957
If we had to do this all in memory, how much RAM would that have required?
print(f"This would otherwise require {len(df_big) * (4*3*8)//1023**2:,} MB of RAM")
This would otherwise require 36,692 MB of RAM
So, quite efficient I would say, and in the future it would be interesting to see if we can do the join more efficiently, and require practically zero RAM for this problem.