Let's say I have two dataframes: df with columns ('a', 'b', 'c') and tf with columns ('a', 'b'). I group df by the two common columns and sum the rest:
grouped_sum = df.groupby(['a', 'b']).sum()  # a list of keys; a tuple is read as one single key
How can I "add" the column c to tf according to grouped_sum, i.e.
tf[i]['c'] = grouped_sum[tf[i]['a'], tf[i]['b']]
for all rows i of the second data frame? For a groupby with a single level it works simply by indexing the group with the corresponding column of tf.
If you groupby with as_index=False you can merge with tf:
In [11]: tf = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list('abc'))
In [13]: grouped_sum = df.groupby(['a', 'b'], as_index=False).sum()
In [14]: grouped_sum
Out[14]:
a b c
0 1 2 7
1 3 4 5
In [15]: tf.merge(grouped_sum) # this won't always be the same as grouped_sum!
Out[15]:
a b c
0 1 2 7
1 3 4 5
Another option is to set a and b as the index of tf.
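A minimal sketch of an index-based variant, assuming the same tf and df as in the session above. Here the grouped result keeps a and b as its index (the default as_index=True) and tf is matched against that index through the on= argument of DataFrame.join. This is one reading of the suggestion, not necessarily the exact code the answer had in mind.

```python
import pandas as pd

tf = pd.DataFrame([[1, 2], [3, 4]], columns=list("ab"))
df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list("abc"))

# With the default as_index=True, a and b become a MultiIndex on the result.
grouped_sum = df.groupby(["a", "b"]).sum()

# join looks up tf's columns a and b in grouped_sum's index (left join by default),
# so every row of tf is kept and gets the matching sum in column c.
result = tf.join(grouped_sum, on=["a", "b"])
print(result)
```

Rows of tf whose (a, b) pair does not occur in df would get NaN in c, because the join keeps all rows of tf.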
I would like to add a new column whose values come from another column, depending on a condition.
In pandas, I do it like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is
a  b  c
1  3  1
2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise:
import polars as pl
df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
pl.when(pl.col("b") == 4).then(pl.col('b')).otherwise(pl.col('a')).alias("c")
)
This works:
import pandas as pd
data = [["aa", 1, 2], ["bb", 3, 4]]
df = pd.DataFrame(data, columns=['id', 'a', 'b'])
df = df.set_index('id')
print(df)
"""
a b
id
aa 1 2
bb 3 4
"""
but is it possible in just one call of pd.DataFrame(...) directly with a parameter, without using set_index after?
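Convert the values to a 2D array and slice it (note that NumPy stores this mixed list as strings, so the numbers become strings too, as shown under Details):
import numpy as np
data = [["aa", 1, 2], ["bb", 3, 4]]
arr = np.array(data)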
df = pd.DataFrame(arr[:, 1:], columns=['a', 'b'], index=arr[:, 0])
print (df)
a b
aa 1 2
bb 3 4
Details:
print (arr)
[['aa' '1' '2']
['bb' '3' '4']]
Another solution:
data = [["aa", 1, 2], ["bb", 3, 4], ["cc", 30, 40]]
cols = ['a','b']
L = list(zip(*data))
print (L)
[('aa', 'bb', 'cc'), (1, 3, 30), (2, 4, 40)]
df = pd.DataFrame(dict(zip(cols, L[1:])), index=L[0])
print (df)
a b
aa 1 2
bb 3 4
cc 30 40
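For a single call that avoids set_index afterwards, one further possibility is DataFrame.from_records, which takes an index argument naming the field to use as the index. This is a sketch under the assumption that the first element of each row is the id, as in the question; the column name 'id' is taken from the question's own set_index example.

```python
import pandas as pd

data = [["aa", 1, 2], ["bb", 3, 4]]

# Name all three fields, then tell from_records which one becomes the index.
df = pd.DataFrame.from_records(data, columns=["id", "a", "b"], index="id")
print(df)
```

Unlike the NumPy route, the rows never pass through a string array, so the numeric values are not turned into strings.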
I am trying to create a new column in a dataframe and populate it with values from a column of another dataframe, matching rows on a column that both dataframes have in common.
DF1      DF2
A  B     W  B
----     ----
Y  2     X  2
N  4     F  4
Y  5     T  5
I thought the following could do the trick.
df2['new_col'] = df1['A'] if df1['B'] == df2['B'] else "Not found"
So result should be:
DF2
W B new_col
X 2 Y -> because DF1['B'] == 2 and the value in the same row is Y
F 4 N
T 5 Y
but I get the error below. I believe that is because the dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects")
Can you help me understand what I am doing wrong and what the best way is to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien's solution, I still get the error below:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
.drop(columns='index').rename(columns={'One': 'new_col'})
UPDATE 2
Here is the second option, but it does not seem to add columns in df2.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
.reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
b a c new_col Three
0 2 1 3 NaN NaN
1 5 4 6 NaN NaN
2 8 7 9 NaN NaN
Why is the code above not working?
Your question is not clear: why is F associated with N and T with Y? Why not F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
W B A
0 X 2 Y
1 F 4 N # What you want
2 F 4 Y # Another solution
3 T 4 N # What you want
4 T 4 Y # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
.drop(columns='index').rename(columns={'A': 'new_col'})
W B new_col
0 X 2 Y
1 F 4 N
2 T 4 Y
In fact you can consider the column B as an additional index of each dataframe.
Using join
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
.reset_index('B').rename(columns={'A': 'new_col'})
B W new_col
0 2 X Y
1 4 F N
2 4 T Y
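If the values in B are unique in df1, as in the tables currently shown in the question (2, 4, 5), a plain lookup is enough and also gives the "Not found" fallback the question asks for. The sketch below rests on that uniqueness assumption; with duplicated B values in df1 the lookup is ambiguous and the index-based merge or join above is the safer route. The name lookup is only illustrative.

```python
import pandas as pd

# Data as shown in the question's tables.
df1 = pd.DataFrame({"A": ["Y", "N", "Y"], "B": [2, 4, 5]})
df2 = pd.DataFrame({"W": ["X", "F", "T"], "B": [2, 4, 5]})

# Build a B -> A lookup Series (assumes B is unique in df1).
lookup = df1.set_index("B")["A"]

# Map each B in df2 through the lookup; unmatched values become NaN,
# which fillna then replaces with the fallback text.
df2["new_col"] = df2["B"].map(lookup).fillna("Not found")
print(df2)
```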
Setup:
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
columns=['a', 'b', 'c'])
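The reason this setup differs from the one in the question's updates is the dtype of column b. A NumPy array has a single dtype, so np.array on rows that mix 'x' with numbers stores everything as strings, and the key column no longer matches the integer b in df2. The sketch below shows the difference and a merge on b alone; the names rows, cols, from_array and from_list are only illustrative.

```python
import numpy as np
import pandas as pd

rows = [["x", 2, 3], ["y", 5, 6], ["z", 8, 9]]
cols = ["One", "b", "Three"]

# Going through np.array turns every value, including the numbers, into a string.
from_array = pd.DataFrame(np.array(rows), columns=cols)
# Passing the nested list directly lets pandas infer a dtype per column.
from_list = pd.DataFrame(rows, columns=cols)

print(from_array["b"].tolist())  # string keys
print(from_list["b"].tolist())   # integer keys

df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# With matching key dtypes, a left merge on b fills the new column.
merged = (
    df2.merge(from_list[["One", "b"]], on="b", how="left")
       .rename(columns={"One": "new_col"})
)
print(merged)
```

If the data already lives in a string-typed frame, converting the key with astype(int) before merging should have the same effect.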
How can I sum the values in column 'Two' based on the items in column 'One' in a pandas dataframe:
df = pd.DataFrame({'One': ['A', 'B', 'A', 'B'], 'Two': [1, 5, 3, 4]})
out[1]:
One Two
0 A 1
1 B 5
2 A 3
3 B 4
Expected output should be:
A 4
B 9
You need to group by the first column and sum on the second.
df.groupby('One', as_index=False).sum()
One Two
0 A 4
1 B 9
The trick is to use the pandas built-in .groupby(COLUMN_NAME) and then call .sum() on the resulting groupby object:
import pandas as pd
df = pd.DataFrame({'One': ['A', 'B', 'A', 'B'], 'Two': [1, 5, 3, 4]})
groups = df.groupby('One').sum()
print(groups.head())
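To get something shaped like the expected output in the question (one label per line followed by its total), selecting the column before summing returns a Series indexed by the group labels. A small sketch using the question's data; the name totals is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"One": ["A", "B", "A", "B"], "Two": [1, 5, 3, 4]})

# Select the column to aggregate, then sum within each group of 'One'.
totals = df.groupby("One")["Two"].sum()
print(totals)
```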
I'm trying to do multiple aggregations over a pandas dataframe. The problem is that I want to keep the column I group by as a regular column.
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg('sum')
X Y
0 A 4
1 B 6
That's good but what I want is multiple aggregations like this
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg(['sum', 'mean'])
It gives me
Y
sum mean
X
A 4 2
B 6 3
But I want this
X Y
sum mean
0 A 4 2
1 B 6 3
To move X from the index to a column use reset_index:
In [4]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
In [5]: df3.groupby('X', as_index=False).agg(['sum', 'mean']).reset_index()
Out[5]:
X Y
sum mean
0 A 4 2
1 B 6 3
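How as_index=False interacts with a list of aggregations appears to have changed between pandas versions, so the combination with reset_index may not behave the same everywhere. A version-independent alternative is named aggregation, which yields flat column names and keeps X as a regular column. This is a sketch, and the output names Y_sum and Y_mean are chosen here for illustration:

```python
import pandas as pd

df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# Each keyword is an output column: (input column, aggregation function).
out = df3.groupby("X", as_index=False).agg(
    Y_sum=("Y", "sum"),
    Y_mean=("Y", "mean"),
)
print(out)
```

The trade-off is that the result has a single level of column names instead of the two-level Y / sum, mean header shown above.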