How to combine/merge dataframes by approximate values of a column? - python

This is an example from a bigger dataset. Imagine I have two dataframes like these:
import pandas as pd
import numpy as np

np.random.seed(42)
df1 = pd.DataFrame({'Depth': np.arange(0.5, 4.5, 0.5),
                    'Feat1': np.random.randint(20, 70, 8)})
df2 = pd.DataFrame({'Depth': [0.4, 1.1, 1.5, 2.2, 2.8],
                    'Rock': ['Sand', 'Sand', 'Clay', 'Clay', 'Marl']})
They have different sizes, and I would like to put the information from the 'Rock' column of df2 onto df1 as a new column. This combination should be done based on the 'Depth' columns of the two dataframes, but they have different sampling rates: df1 follows a constant step of 0.5, while the spacing in df2 varies.
So I would like to merge this information based on approximate values of 'Depth'. For example: if a sample of df2 has a 'Depth' of 2.2, look for the nearest 'Depth' value in df1, which is 2.0, and add the 'Rock' information ('Clay') to that sample. It is important to say that 'Rock' values can be repeated in the new column, to avoid missing data inside a segment. Could anyone help me?
I have already tried some pandas methods such as 'merge' and 'combine_first', but I couldn't get the result I wanted. It should be something like this:

Use merge_asof:
df3 = pd.merge_asof(df1, df2, on='Depth', tolerance=0.5, direction='nearest')
df3:
Depth Feat1 Rock
0 0.5 58 Sand
1 1.0 48 Sand
2 1.5 34 Clay
3 2.0 62 Clay
4 2.5 27 Clay
5 3.0 40 Marl
6 3.5 58 NaN
7 4.0 38 NaN
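One caveat worth noting: merge_asof requires both frames to be sorted on the merge key (they already are in this example). If yours are not, sort first:
# merge_asof raises a ValueError on unsorted keys, so sort defensively
df1 = df1.sort_values('Depth')
df2 = df2.sort_values('Depth')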
Complete Working Example:
import numpy as np
import pandas as pd

np.random.seed(42)
df1 = pd.DataFrame({
    'Depth': np.arange(0.5, 4.5, 0.5),
    'Feat1': np.random.randint(20, 70, 8)
})
df2 = pd.DataFrame({
    'Depth': [0.4, 1.1, 1.5, 2.2, 2.8],
    'Rock': ['Sand', 'Sand', 'Clay', 'Clay', 'Marl']
})
df3 = pd.merge_asof(df1, df2, on='Depth', tolerance=0.5, direction='nearest')
print(df3)
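Since the question says 'Rock' values may repeat to avoid gaps, the trailing NaNs at depths 3.5 and 4.0 could be filled from the last observed rock type; whether that is wanted is an assumption about the desired output:
# optionally carry the last known rock type into the trailing gaps
df3['Rock'] = df3['Rock'].ffill()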

Related

Pandas: Multiply row value by groupby of another column as a new column

I have a DataFrame that looks like this. I am trying to add a new column df['new_sales'] where I multiply df['rate'] by the sum of df['sales'] grouped by ['state', 'store'].
import pandas as pd

data = [['california', 'a', 11, 0.6],
        ['california', 'a', 12, 0.4],
        ['california', 'b', 32, 0.7]]
df = pd.DataFrame(data, columns=['state', 'store', 'sales', 'rate'])
I was trying something like this but couldn't get it to work.
df['new_sales'] = df.groupby(['state','store'])['sales'].apply(lambda x: x.sum()*df['rate'])
The output would look like this.
Use transform to align the values with the length of the original dataframe; it should be faster than an apply, and avoids the anonymous function:
df['NewSales'] = df.groupby(['state', 'store']).sales.transform('sum') * df.rate
print(df)
state store sales rate NewSales
0 california a 11 0.6 13.8
1 california a 12 0.4 9.2
2 california b 32 0.7 22.4
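For intuition on why transform works here: it returns a Series aligned to the original index, one value per row, which is what makes the elementwise multiplication by df.rate line up. A quick check:
# transform emits the group total once per original row
sums = df.groupby(['state', 'store'])['sales'].transform('sum')
print(sums.tolist())  # [23, 23, 32]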
1. Without groupby:
df = df.groupby(['state', 'store'])['sales'].apply(lambda x: x.sum() * df['rate'])
2. With groupby:
def doCalculation(df):
    sales = df['sales'].sum()
    rate = df['rate']
    return sales * rate

df = df.groupby(['state', 'store']).apply(doCalculation)
Then assign the values back to the original dataframe:
newdf['NewSales'] = df.values
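Putting the groupby version together as a runnable sketch (newdf is taken here to be a copy of the original frame, an assumption since the answer reassigns df along the way):
import pandas as pd

data = [['california', 'a', 11, 0.6],
        ['california', 'a', 12, 0.4],
        ['california', 'b', 32, 0.7]]
df = pd.DataFrame(data, columns=['state', 'store', 'sales', 'rate'])

def doCalculation(group):
    # each row's rate times the group's total sales
    return group['sales'].sum() * group['rate']

newdf = df.copy()  # assumption: newdf is a copy of the original frame
result = df.groupby(['state', 'store']).apply(doCalculation)
# positional assignment; relies on group order matching the original row order
newdf['NewSales'] = result.values
print(newdf)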

Combine NaN value in rows - Pandas

I would like to know if it's possible to combine rows when specific columns contain NaN values. The order can change. I thought of combining the rows where Name is duplicated.
import pandas as pd
import numpy as np
d = {'Name': ['Jacque','Paul', 'Jacque'], 'City': [np.nan, '4', '10'], 'Birthday' : ['1','2',np.nan]}
df = pd.DataFrame(data=d)
df
And I would like to have this output:
Check with sorted, pushing the NaN values to the bottom of each column, then drop the rows that still contain NaN:
out = df.apply(lambda x: sorted(x, key=pd.isnull)).dropna()
Name City Birthday
0 Jacque 4.0 1.0
1 Paul 10.0 2.0
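If the goal is really to merge rows that share a Name, as the question hints, groupby(...).first() takes the first non-null value per column; note this keeps each person's own values, which differs from the column-wise sort above:
# first() skips NaN, so each Name keeps its own non-null City and Birthday
out2 = df.groupby('Name', as_index=False).first()
print(out2)
#      Name City Birthday
# 0  Jacque   10        1
# 1    Paul    4        2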

How to concatenate or merge series to dataframe without duplicating column names

I am trying to concatenate a Series returned from a function to a Dataframe, but I don't want the columns to be duplicated. How can I accomplish this? The full dataset is ~100k rows, and there are about 100 subsets (defined in a loop with masks), so hopefully, there is a computationally fast solution. Using Python 3.7
Example
import pandas as pd
def myfcn(row, data, val):
    z1 = row['y'] + val
    z2 = row['x'] * row['y']
    return pd.Series(
        {'fancy_column_name1': z1,
         'fancy_column_name2': z2 / val},
        name=row.name
    )
col1 = [1, 1.5, 3.1, 3.4, 2, -1]
col2 = [1, -3, 2, 8, 2.5, -1.3]
df = pd.DataFrame(list(zip(col1, col2)), columns=['x', 'y'])
display(df)
### In the real case, this is all in a loop with many subsets that
### are created with masks & specific criteria; this is
### simplified here
df_subset = df.iloc[[0,2,3]]
#display(df_subset)
out = df_subset.apply(myfcn, axis=1, args=(df_subset, 100))
df = pd.concat([df, out], axis=1)
df_subset2 = df.iloc[[5]]
out = df_subset2.apply(myfcn, axis=1, args=(df_subset2, 250))
df = pd.concat([df, out], axis=1)
display(df)
Here is the parent dataframe "df"
Here is the current output
Here is the wanted output
How can I remove the duplicated column names, collapsing the data into the same column? I want to retain the numbers, not the NaNs. There will never be an instance where there is more than one number to retain in a row, but there may be an instance where there are no numbers (so then, retain NaN).
pandas.DataFrame.combine_first: Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
Just replace the df = pd.concat([df, out], axis=1) with -
df = df.combine_first(out)
More details are in the pandas documentation for combine_first.
The reason your column order is not retained is that out has only two columns; when the two frames are combined, those columns end up first. You can insert blank x and y columns ahead of out's existing ones to work around this:
out.insert(0, 'x', 0)
out.insert(1, 'y', 0)
df = df.combine_first(out)
Add this to the loop and let me know if your column order is now fixed.
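An alternative to the placeholder columns, if the aim is only to restore the original column order, is to reselect the columns after combining (the names here match the example above):
# combine first, then put the columns back in the intended order
df = df.combine_first(out)
df = df[['x', 'y', 'fancy_column_name1', 'fancy_column_name2']]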
Do your subsetting calculations together, then append the out frames together, and then merge the result into your main dataframe. I modified your code a bit:
def myfcn(row, data, val):
    z1 = row['y'] + val
    z2 = row['x'] * row['y']
    return pd.Series(
        {'fancy_column_name1': z1,
         'fancy_column_name2': z2 / val},
        name=row.name
    )
col1 = [1, 1.5, 3.1, 3.4, 2, -1]
col2 = [1, -3, 2, 8, 2.5, -1.3]
df = pd.DataFrame(list(zip(col1, col2)), columns=['x', 'y'])
df_subset = df.iloc[[0,2,3]]
#display(df_subset)
out1 = df_subset.apply(myfcn, axis=1, args=(df_subset, 100))
df_subset2 = df.iloc[[5]]
out2 = df_subset2.apply(myfcn, axis=1, args=(df_subset2, 250))
out = pd.concat([out1, out2])  # DataFrame.append was removed in pandas 2.0
df = pd.merge(df, out, left_index=True, right_index=True, how="left")
print(df)
output:
x y fancy_column_name1 fancy_column_name2
0 1.0 1.0 101.0 0.0100
1 1.5 -3.0 NaN NaN
2 3.1 2.0 102.0 0.0620
3 3.4 8.0 108.0 0.2720
4 2.0 2.5 NaN NaN
5 -1.0 -1.3 248.7 0.0052

How to aggregate, combining dataframes, with pandas groupby

I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:
Original dataframe:
name table
Bob Pandas df1
Joe Pandas df2
Bob Pandas df3
Bob Pandas df4
Emily Pandas df5
After groupby:
name table
Bob Pandas df containing the appended df1, df3, and df4
Joe Pandas df2
Emily Pandas df5
I found this code snippet to do a groupby and lambda for strings in a dataframe, but haven't been able to figure out how to append entire dataframes in a groupby.
df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x))
I've also tried df['table'] = df.groupby(['name'])['HTML'].apply(list), but that gives me a df['table'] of all NaN.
Thanks for your help!!
Given 3 dataframes
import pandas as pd
dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
Given another dataframe, with dataframes in the columns
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})
# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>
Each group of dataframes can be combined into a single dataframe by using .groupby, aggregating a list for each group, and combining the dataframes in the list with pd.concat:
# if there is only one column, or if there are multiple columns of dataframes to aggregate
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
# display(dfg.loc['Bob', 'table'])
a
0 1
1 2
2 3
3 a
4 b
5 c
6 pie
7 steak
8 milk
# to specify a single column, or specify multiple columns, from many columns
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
Not a duplicate
Originally, I had marked this question as a duplicate of How to group dataframe rows into list in pandas groupby, thinking the dataframes could be aggregated into a list, and then combined with pd.concat.
df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))
However, these all result in a StopIteration error, when there are dataframes to aggregate.
Here let's create a dataframe with dataframes as columns:
First, I start with three dataframes:
import pandas as pd

# creating dataframes that we will assign to Bob and Joe; notice the b's and j's:
df1 = pd.DataFrame({'var1': [12, 34, -4, None], 'letter': ['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1': [1, 23, 44, 0], 'letter': ['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1': [22, -3, 7, 78], 'letter': ['b5', 'b6', 'b7', 'b8']})

# let's make a list of dictionaries:
list_of_dfs = [
    {'name': 'Bob', 'table': df1},
    {'name': 'Joe', 'table': df2},
    {'name': 'Bob', 'table': df3}
]

# construct the main dataframe:
original_df = pd.DataFrame(list_of_dfs)
print(original_df)
original_df.shape  # shows (3, 2)
Now that we have the original dataframe as input, we can produce the resulting new dataframe. In doing so, we use groupby(), agg(), and pd.concat(), and we also reset the index.
new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)
# check that Bob's table is now a concatenated table of df1 and df3:
new_df[new_df['name'] == 'Bob']['table'][0]
The output to the last line of code is:
var1 letter
0 12.0 b1
1 34.0 b2
2 -4.0 b3
3 NaN b4
0 22.0 b5
1 -3.0 b6
2 7.0 b7
3 78.0 b8
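If a clean RangeIndex is wanted on the combined table (the output above keeps the original 0-3 indices twice), one optional extra step is a reset:
# reset the duplicated row labels on Bob's combined table
bobs_table = new_df.loc[new_df['name'] == 'Bob', 'table'].iloc[0]
print(bobs_table.reset_index(drop=True))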

Python Pandas DataFrame: Matching column names to row index

I have a DataFrame containing my raw data:
Var1 Var2 Var3
0 3090.032408 18.0 1545.016204
1 3048.781680 18.0 1524.390840
2 3090.032408 18.0 1545.016204
3 3112.086341 18.0 1556.043170
4 3075.100780 16.0 1537.550390
And a DataFrame containing values relating to the variables in my first DataFrame:
minVal maxVal
Var1 3045 4000
Var2 15 19
Var3 1500 1583
For every column in DF1, I need to find the relating row in DF2 in order to apply standardisation where I'm subtracting the minVal and dividing by the range. Column1 in DF1 may not relate to row1 in DF2 - there are more rows in DF2 than columns in DF1.
How do I loop through my columns and apply standardisation in an efficient way?
Many thanks
Thanks to Pandas' automatic index alignment, expressing this computation is remarkably easy:
(DF1-DF2['minVal'])/(DF2['maxVal']-DF2['minVal'])
import pandas as pd

DF1 = pd.DataFrame({
    'Var1': [3090.032408, 3048.78168, 3090.032408, 3112.086341, 3075.10078],
    'Var2': [18.0, 18.0, 18.0, 18.0, 16.0],
    'Var3': [1545.016204, 1524.39084, 1545.016204, 1556.04317, 1537.55039]})
DF2 = pd.DataFrame({'maxVal': [4000, 19, 1583, 10], 'minVal': [3045, 15, 1500, 11],
                    'A': [1, 2, 3, 12], 'B': [5, 6, 7, 13]},
                   index=['Var1', 'Var2', 'Var3', 'Var4'])

# restrict DF2 to the rows matching DF1's columns
DF3 = DF2.loc[DF1.columns, :]
result = (DF1 - DF3['minVal']) / (DF3['maxVal'] - DF3['minVal'])
print(result)
yields
Var1 Var2 Var3
0 0.047154 0.75 0.542364
1 0.003960 0.75 0.293866
2 0.047154 0.75 0.542364
3 0.070247 0.75 0.675219
4 0.031519 0.25 0.452414
Here's a simple way to get something similar, computing the min, max, and range of each column from the data itself rather than from DF2:
df2 = (df - df.min()) / (df.max() - df.min())
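For completeness, the question's literal "loop through my columns" phrasing can also be answered with an explicit loop over DF2's stored bounds; this is a slower but equivalent sketch using the names from the example above:
# explicit per-column standardisation using DF2's stored min/max
result = DF1.copy()
for col in DF1.columns:
    lo = DF2.loc[col, 'minVal']
    hi = DF2.loc[col, 'maxVal']
    result[col] = (DF1[col] - lo) / (hi - lo)
print(result)  # matches the vectorised version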
