I would like to add a new column using another column's value, with a condition.
In pandas, I do it like below:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is:
   a  b  c
0  1  3  1
1  2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise:
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
    pl.when(pl.col("b") == 4).then(pl.col("b")).otherwise(pl.col("a")).alias("c")
)
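A note on extending this: when/then branches can be chained before the final otherwise, with the first matching branch winning. A minimal sketch (the second condition and the * 10 value are made up for illustration):
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
    pl.when(pl.col("b") == 4).then(pl.col("b"))
      .when(pl.col("b") == 3).then(pl.col("b") * 10)  # hypothetical extra branch
      .otherwise(pl.col("a"))
      .alias("c")
)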
I have a dataset with NaN values in column 'a'. I want to group rows by 'user_id', compute the mean of column 'c' per 'user_id', and fill the NaN values in 'a' with this mean. How can I do it?
This is the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I should have:
df = pd.DataFrame({'a': [0, 7, 7], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I've tried:
df['a'].fillna(df.groupby('user_id')['a'].transform('mean'), inplace=True)
print(df)
After printing the df I still see NaN in column 'a'.
Note: since I have a huge dataset I need to do it inplace.
I think you need to process column c (the group mean of 'a' can't help here):
df['a'].fillna(df.groupby('user_id')['c'].transform('mean'), inplace=True)
Or, without inplace (chained inplace calls may not update the original frame under newer pandas' copy-on-write):
df['a'] = df['a'].fillna(df.groupby('user_id')['c'].transform('mean'))
print(df)
     a  user_id  c
0  0.0        1  3
1  7.0        2  7
2  7.0        2  7
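For context on why the original attempt kept printing NaN: within user_id 2 every value of 'a' is NaN, so the group mean of 'a' is itself NaN and fillna fills NaN with NaN. A quick check, rebuilding the example frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
# the mean of an all-NaN group is NaN, which is why 'a' stayed unfilled
print(df.groupby('user_id')['a'].transform('mean'))
# 0    0.0
# 1    NaN
# 2    NaN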
I have 2 dataframes in pandas like below, but the 2 dataframes don't have the same columns; only a few columns are shared.
df1
no  datas1  datas2  datas3  datas4
0   a       b       a       a
1   b       c       b       b
2   d       b       c       a
df2
no  datas1  datas2  datas3  data4  data5  data6
0   c       a       a       a      a      b
1   a       c       b       b      b      b
2   a       b       c       b      c      c
I'd like to know, using pandas functions, how much each shared column matches, based on the "no" field.
The result would be something like:
data3 is 100% match
data4 is 66% match
or
data3 is 3 matched
data4 is 2 matched
What's the best way to do that?
You can do this: first run the equals method, and if it returns True, print that the dataframes match; otherwise use the compare method and calculate the percentage of rows that matched between the dfs:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [2, 2, 3], 'b': [4, 5, 7]})

if df1.equals(df2):
    print('df1 matched df2')
else:
    # compare() keeps only the rows that differ
    comp = df1.compare(df2)
    match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]
    print(f'{match_perc * 100: .4f} match')  # Out: 33.3333 match
You can simplify by just using compare; if the dataframes match perfectly, you print that they matched:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

comp = df1.compare(df3)
match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]
if match_perc == 1:
    print('dfs matched')
else:
    print(f'{match_perc * 100: .4f} match')
# Out: dfs matched
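Coming back to the per-column counts asked about originally: a minimal sketch, under the assumption that both frames share a key column 'no' (the frames and single-letter values here are made up), is to merge on the key and compare only the columns the frames have in common:
import pandas as pd

df1 = pd.DataFrame({'no': [0, 1, 2], 'x': ['a', 'b', 'c'], 'y': ['a', 'b', 'a']})
df2 = pd.DataFrame({'no': [0, 1, 2], 'x': ['a', 'b', 'd'], 'z': ['c', 'c', 'c']})

# align rows on the shared key; overlapping columns get _1/_2 suffixes
merged = df1.merge(df2, on='no', suffixes=('_1', '_2'))
shared = df1.columns.intersection(df2.columns).drop('no')
for col in shared:
    matches = (merged[f'{col}_1'] == merged[f'{col}_2']).sum()
    print(f'{col}: {matches} matched ({matches / len(merged) * 100:.0f}%)')
# Out: x: 2 matched (67%)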
I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import pandas as pd
import numpy as np

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
And I would like to fill column B, but only where the value in column A equals 4. Hence, all rows that share a value in column A should end up with the same value in column B (by filling it in).
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but this gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'] = df['B'].fillna(method="ffill")
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?
Try this:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
   A    B
0  4    a
1  4    a
2  5  NaN
3  6    d
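One caveat: ffill only propagates earlier values forward, so a NaN that appears before the first known value of its group stays NaN. If that can happen in your data, a hedged variant fills in both directions within each group:
# fill forward, then backward, within each 'A' group
df['B'] = df.groupby('A')['B'].transform(lambda s: s.ffill().bfill())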
I was wondering if I can use the pandas .drop method to drop rows when chaining methods to construct a data frame.
Dropping rows is straightforward once the data frame exists:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
print(df1)
# drop the entries that match "2"
df1 = df1[df1['A'] != 2]
print(df1)
However, I would like to do this while I am creating the data frame:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
.rename(columns={'A': 'AA'})
# .drop(lambda x: x['A']!=2)
)
print(df2)
The commented line does not work, but maybe there is a correct way of doing this. Grateful for any input.
Use DataFrame.loc with a callable:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .loc[lambda x: x['AA'] != 2]
       )
Or DataFrame.query:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != 2")
       )
print(df2)
   AA  B
0   1  5
2   3  3
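As a side note, query can also reference Python variables with @, which keeps the chain readable when the value isn't a literal. A minimal sketch (drop_value is a made-up name):
drop_value = 2
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != @drop_value")
       )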
You can use DataFrame.apply with DataFrame.dropna (note that the NaN detour converts both columns to float, as the output shows):
import numpy as np

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .apply(lambda x: x if x['AA'] != 2 else np.nan, axis=1)
       .dropna()
       )
print(df2)
    AA    B
0  1.0  5.0
2  3.0  3.0
Maybe you can try a selection with .loc at the end of your chained definition.
I'm attempting to index rows using pandas indexing, but it seems there isn't an appropriate way to pass a list for this. This is the approach I'm trying, without loops:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)

interest = [3, 4]
# results = df['a'].eq(interest)                 # fails: lengths don't match
# results = df[(df['a'] == 3) & (df['a'] == 4)]  # empty: 'a' can't equal both
# df[results]                                    # 'results' never gets defined
# print(df[df['b'] == 3])  # index 0 1
With loops, I'm able to get my desired result.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)

lst = [3, 4]
print('index values are : {}'.format(lst))
results = pd.DataFrame()
for itr in lst:
    if results.empty:
        results = df[df['a'] == itr]
    else:
        # note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        results = results.append(df[df['a'] == itr])
print('result : \n{}'.format(results))
I've searched, but most documentation indexes both columns 'a' and 'b' and/or only uses one value at a time for indexing, rather than a list. Let me know if I wasn't clear.
IIUC you want .isin?
>>> df[df.a.isin([3, 4])]
   a  b
2  3  4
3  4  5
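And if you ever need the complement (rows whose 'a' is not in the list), invert the boolean mask with ~:
>>> df[~df.a.isin([3, 4])]
   a    b
0  1    3
1  2    3
4  9  100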