Filling column of dataframe based on 'groups' of values of another column - python

I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import pandas as pd
import numpy as np

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
And I would like to fill column B but only if the value of column A equals 4. Hence, all rows that have the same value as another in column A should have the same value in column B (by filling this).
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but it gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'] = df['B'].fillna(method="ffill")
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?

Try this:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
   A    B
0  4    a
1  4    a
2  5  NaN
3  6    d
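If the known value within a group can come after the missing rows instead of before them, a forward fill alone will miss it. A minimal sketch that fills in both directions inside each group (assuming each group carries at most one distinct non-null value in B):
df['B'] = df.groupby('A')['B'].transform(lambda s: s.ffill().bfill())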

Related

How to fill column with condition in polars

I would like to add a new column using another column's value with a condition.
In pandas, I do this like below:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b'] == 4, 'c'] = df['b']
The result is
a  b  c
1  3  1
2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise:
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
    pl.when(pl.col('b') == 4).then(pl.col('b')).otherwise(pl.col('a')).alias('c')
)
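Note that polars expressions never modify the frame in place: with_columns returns a new DataFrame, so reassign it if you want to keep the column. For example:
df = df.with_columns(
    pl.when(pl.col('b') == 4).then(pl.col('b')).otherwise(pl.col('a')).alias('c')
)
print(df)  # c is [1, 4], matching the pandas result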

pandas fill missing values with mean of other columns grouped by a value

I have this dataset with NaN values in column 'a'. I want to compute the mean of column 'c' grouped by 'user_id' and fill the NaN values in 'a' with that per-group mean. How can I do it?
This is the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I should have:
df = pd.DataFrame({'a': [0, 7, 7], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I've tried:
df['a'].fillna(df.groupby('user_id')['a'].transform('mean'), inplace=True)
print(df)
After printing the df I still see NaN in column 'a'.
Note: since I have a huge dataset, I need to do it in place.
I think you need to process column c instead of column a:
df['a'].fillna(df.groupby('user_id')['c'].transform('mean'), inplace=True)
Or, without inplace:
df['a'] = df['a'].fillna(df.groupby('user_id')['c'].transform('mean'))
print(df)
     a  user_id  c
0  0.0        1  3
1  7.0        2  7
2  7.0        2  7
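One further caveat, assuming a recent pandas (2.x with copy-on-write): fillna(..., inplace=True) called on the selection df['a'] can act on a temporary copy and leave the parent frame untouched, so the plain assignment form above is the safer choice. A minimal sketch:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
# Per-user mean of 'c', broadcast back to the original row order
group_mean = df.groupby('user_id')['c'].transform('mean')
# Plain assignment avoids the chained-assignment pitfall
df['a'] = df['a'].fillna(group_mean)
print(df)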

How to convert Pandas data frame to dict with values in a list

I have a huge Pandas data frame whose structure follows the example below:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'C', 'C'], 'col2': [1, 2, 5, 2, 4, 6]})
df
  col1  col2
0    A     1
1    A     2
2    B     5
3    C     2
4    C     4
5    C     6
The task is to build a dictionary with elements in col1 as keys and corresponding elements in col2 as values. For the example above the output should be:
A -> [1, 2]
B -> [5]
C -> [2, 4, 6]
Although I wrote a solution as
from collections import defaultdict

dd = defaultdict(list)
for row in df.itertuples():
    dd[row.col1].append(row.col2)
I wonder if somebody is aware of a more "Python-native" solution, using built-in pandas functions.
Without apply, we can do it with a dict comprehension over the groups:
{x: y.tolist() for x, y in df.col2.groupby(df.col1)}
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
Use GroupBy.apply with list to get a Series of lists, and then Series.to_dict:
d = df.groupby('col1')['col2'].apply(list).to_dict()
print(d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
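Equivalently, GroupBy.agg accepts list directly, which states the intent a bit more plainly:
d = df.groupby('col1')['col2'].agg(list).to_dict()
print(d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}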

Sum the values in a pandas column based on the items in another column

How can I sum the values in column 'Two' based on the items in column 'One' in a pandas dataframe?
df = pd.DataFrame({'One': ['A', 'B', 'A', 'B'], 'Two': [1, 5, 3, 4]})
out[1]:
  One  Two
0   A    1
1   B    5
2   A    3
3   B    4
Expected output should be:
A 4
B 9
You need to group by the first column and sum on the second.
df.groupby('One', as_index=False).sum()
  One  Two
0   A    4
1   B    9
The trick is to use the pandas built-in functions .groupby(COLUMN_NAME) and then .sum() on the resulting object:
import pandas as pd
df = pd.DataFrame({'One': ['A', 'B', 'A', 'B'], 'Two': [1, 5, 3, 4]})
groups = df.groupby('One').sum()
print(groups.head())
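If you want the output shaped exactly like the expected listing above (group labels as the index), select the column before summing to get a Series:
sums = df.groupby('One')['Two'].sum()
print(sums)
One
A    4
B    9
Name: Two, dtype: int64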

Assign new column using a set of sub-columns

I have a dataframe with a column 'name' of the form ['A', 'B', 'C', 'A', 'B', 'B', ...] and a set of arrays: one corresponding to 'A', say array_A = [0, 1, 2, ...], plus array_B = [3, 1, 0, ...], array_C, etc.
I want to create a new column 'value' by assigning array_A where the row name in the dataframe is 'A', and similarly for 'B' and 'C'.
The expression df['value'] = np.where(df['name']=='A', array_A, df['value']) won't do it: np.where needs array_A to match the full length of the dataframe, so it either raises a dimensionality error or overwrites the values for other names.
For example:
arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})
Desired output:
  name  value
0    A      0
1    B      3
2    A      1
3    A      2
4    B      1
You can use a for loop over a dictionary, assigning each array to the rows whose name matches its key:
import pandas as pd
import numpy as np

arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})
for k, v in arrays.items():
    df.loc[df['name'] == k, 'value'] = v
df['value'] = df['value'].astype(int)
print(df)
  name  value
0    A      0
1    B      3
2    A      1
3    A      2
4    B      1
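An alternative sketch that avoids looping over the dictionary: GroupBy.cumcount gives each row's position within its group, which can be used to index into the matching array (this assumes, as above, that every array is exactly as long as its group):
pos = df.groupby('name').cumcount()  # 0, 0, 1, 2, 1 for this example
df['value'] = [arrays[n][i] for n, i in zip(df['name'], pos)]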
