How to fill column with condition in polars - python

I would like to add a new column whose values come from another column, subject to a condition.
In pandas, I do it like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is
   a  b  c
0  1  3  1
1  2  4  4
Could you teach me how to do this with polars?

Use when/then/otherwise:
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df = df.with_columns(
    pl.when(pl.col('b') == 4).then(pl.col('b')).otherwise(pl.col('a')).alias('c')
)
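Note that with_columns returns a new DataFrame rather than mutating in place, hence the assignment back to df. Printing df should then show something like:

shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 1   │
│ 2   ┆ 4   ┆ 4   │
└─────┴─────┴─────┘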

Related

pandas fill missing values with mean of other columns grouped by a value

I have a dataset with NaN values in column 'a'. I want to group rows by 'user_id', compute the mean of column 'c' within each group, and fill the NaN values in 'a' with that mean. How can I do it?
This is the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I should have:
df = pd.DataFrame({'a': [0, 7, 7], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I've tried:
df['a'].fillna(df.groupby('user_id')['a'].transform('mean'), inplace=True)
print(df)
After printing the df I still see NaN in column 'a'.
Note: since I have a huge dataset, I need to do it in place.
I think you need to use column 'c': the group mean of 'a' itself is NaN for user_id 2, so filling with it changes nothing.
df['a'].fillna(df.groupby('user_id')['c'].transform('mean'), inplace=True)
Or, without inplace:
df['a'] = df['a'].fillna(df.groupby('user_id')['c'].transform('mean'))
print(df)
     a  user_id  c
0  0.0        1  3
1  7.0        2  7
2  7.0        2  7
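One caveat, assuming pandas with copy-on-write enabled (opt-in in 2.x, the planned default for 3.0): inplace=True applied through df['a'] operates on an intermediate object and can leave df unchanged, which matches the symptom described. Assigning the result back is the safer pattern; a minimal sketch:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
# broadcast each user's mean of 'c' back to its rows, then fill the gaps in 'a'
group_mean = df.groupby('user_id')['c'].transform('mean')
df['a'] = df['a'].fillna(group_mean)
print(df)  # 'a' is now [0.0, 7.0, 7.0]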

matching 2 dataframes in pandas

If I have 2 dataframes in pandas like below, but the 2 dataframes don't have the same columns; only a few columns are shared.
df1
no  datas1  datas2  datas3  datas4
0   a       b       a       a
1   b       c       b       b
2   d       b       c       a
df2
no  datas1  datas2  datas3  data4  data5  data6
0   c       a       a       a      a      b
1   a       c       b       b      b      b
2   a       b       c       b      c      c
I'd like to know how much each shared column matches, based on the "no" field, using pandas functions.
The result would be like below:
data3 is 100% matched
data4 is 66% matched
or
data3 has 3 matches
data4 has 2 matches
What's the best way to do that?
You can do this: first run the equals method, and if it returns True, print that the dataframes match; otherwise use the compare method and calculate the percentage of rows that matched between the dfs:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [2, 2, 3], 'b': [4, 5, 7]})

if df1.equals(df2):
    print('df1 matched df2')
else:
    comp = df1.compare(df2)
    match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]
    print(f'{match_perc * 100: .4f} match')  # Out: 33.3333 match
You can simplify by just using compare; if the dataframes match perfectly, then you print that they matched:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

comp = df1.compare(df3)
match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]

if match_perc == 1:
    print('dfs matched')
else:
    print(f'{match_perc * 100: .4f} match')
# Out: dfs matched
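Note that compare requires identically labeled frames, so it doesn't directly answer the per-column question when only some columns are shared. A minimal sketch for the per-column counts, aligning on the "no" column (the toy frames and shared column names here are hypothetical):

import pandas as pd

df1 = pd.DataFrame({'no': [0, 1, 2],
                    'datas3': ['a', 'b', 'c'],
                    'datas4': ['a', 'b', 'a']})
df2 = pd.DataFrame({'no': [0, 1, 2],
                    'datas3': ['a', 'b', 'c'],
                    'datas4': ['a', 'b', 'b']})

# align both frames on the shared key, then compare column by column
left = df1.set_index('no')
right = df2.set_index('no')

for col in left.columns.intersection(right.columns):
    matched = (left[col] == right[col]).sum()
    print(f'{col} is {matched} matched ({matched / len(left):.0%} match)')
# datas3 is 3 matched (100% match)
# datas4 is 2 matched (67% match)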

Filling column of dataframe based on 'groups' of values of another column

I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import pandas as pd
import numpy as np

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
I would like to fill column B, but only where the value of column A equals 4; that is, all rows sharing the same value in column A should end up with the same value in column B (by filling it down).
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but this gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'] = df['B'].fillna(method="ffill")
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?
Try this. Grouping by A first confines the forward fill to rows that share the same A value:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
   A    B
0  4    a
1  4    a
2  5  NaN
3  6    d
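If the intent is that every row of an A group shares a single B value (not only the rows after the first non-null one), a transform('first') variant also works; a minimal sketch on the same toy frame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [4, 4, 5, 6], 'B': ['a', np.nan, np.nan, 'd']})
# broadcast each A group's first non-null B to every row of that group;
# the A=5 group has no non-null B, so it stays NaN
df['B'] = df.groupby('A')['B'].transform('first')
print(df)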

Can I use pd.drop in method chaining to drop specific rows?

I was wondering if I can use the pandas .drop method to drop rows when chaining methods to construct a data frame.
Dropping rows is straightforward once the data frame exists:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
print(df1)
# drop the entries where column A equals 2
df1 = df1[df1['A'] != 2]
print(df1)
However, I would like to do this while I am creating the data frame:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # .drop(lambda x: x['A'] != 2)
       )
print(df2)
The commented line does not work, but maybe there is a correct way of doing this. Grateful for any input.
Use DataFrame.loc with a callable:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .loc[lambda x: x['AA'] != 2]
       )
Or DataFrame.query:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != 2")
       )
print(df2)

   AA  B
0   1  5
2   3  3
You can use DataFrame.apply with DataFrame.dropna (note this upcasts the result to float):
import numpy as np

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .apply(lambda x: x if x['AA'] != 2 else np.nan, axis=1)
       .dropna()
       )
print(df2)

    AA    B
0  1.0  5.0
2  3.0  3.0
Alternatively, you can try appending a .loc selection at the end of your chained definition, as in the first answer.
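Another chain-friendly option, if you prefer plain boolean indexing over a callable, is DataFrame.pipe; a minimal sketch (not from the answers above):

import pandas as pd

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # pipe hands the intermediate frame to any function,
       # so ordinary boolean indexing works mid-chain
       .pipe(lambda d: d[d['AA'] != 2])
       )
print(df2)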

Pandas Row-Indexing without looping

I'm attempting to index rows using pandas indexing, but there doesn't seem to be an appropriate way to pass a list for this. This is the solution I'm trying to use, without loops:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)

interest = [3, 4]
# results = df['a'].eq(interest)  # raises: lengths must match
# results = df[(df['a'] == 3) & (df['a'] == 4)]  # empty: 'a' can't equal both
df[results]
# print(df[df['b'] == 3])  # index 0 1 2
With loops, I'm able to get my desired result.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)
lst = [3,4]
print('index values are : {}'.format(lst))
results = pd.DataFrame()
for itr in lst:
    if results.empty:
        results = df[df['a'] == itr]
    else:
        # note: DataFrame.append was removed in pandas 2.0
        results = results.append(df[df['a'] == itr])
print('result : \n{}'.format(results))
I've searched, but most documentation indexes both columns 'a' and 'b' and/or uses only one value at a time for indexing, rather than a list. Let me know if I wasn't clear.
IIUC you want .isin?
>>> df[df.a.isin([3, 4])]
   a  b
2  3  4
3  4  5
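If you prefer keeping the filter values in a named variable, DataFrame.query with an @-reference is an equivalent one-liner; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
interest = [3, 4]
# '@interest' refers to the local Python variable inside the query string
print(df.query('a in @interest'))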
