I was wondering if I can use the pandas .drop method to drop rows while chaining methods to construct a DataFrame.
Dropping rows is straightforward once the DataFrame exists:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
print(df1)
# drop the entries that match "2"
df1 = df1[df1['A'] != 2]
print(df1)
However, I would like to do this while I am creating the data frame:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # .drop(lambda x: x['A'] != 2)
       )
print(df2)
The commented line does not work, but maybe there is a correct way of doing this. Grateful for any input.
Use DataFrame.loc with a callable:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .loc[lambda x: x['AA'] != 2]
       )
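The callable receives the DataFrame built so far, which is why this works in the middle of a chain where the intermediate frame has no variable name yet. DataFrame.pipe can express the same filter; a minimal sketch:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # pipe hands the intermediate frame to the lambda, same idea as .loc with a callable
       .pipe(lambda x: x[x['AA'] != 2])
       )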
Or DataFrame.query:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != 2")
       )
print(df2)
   AA  B
0   1  5
2   3  3
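If the cutoff lives in a Python variable, query can reference it with @; the variable name threshold below is just for illustration:
# filter against a local variable instead of a literal
threshold = 2
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != @threshold")
       )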
You can use DataFrame.apply with DataFrame.dropna (note this needs numpy for np.nan):
import numpy as np

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .apply(lambda x: x if x['AA'] != 2 else np.nan, axis=1)
       .dropna()
       )
print(df2)
    AA    B
0  1.0  5.0
2  3.0  3.0
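Note the integer columns were upcast to float because NaN is a float; if the original dtypes matter, they can be restored afterwards, e.g. (a sketch, column names taken from the example above):
# cast back to integers once the NaN rows are gone
df2 = df2.astype({'AA': 'int64', 'B': 'int64'})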
Maybe you can try a selection with .loc at the end of your definition, as shown above.
I would like to add a new column using another column's value, with a condition.
In pandas, I do it like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is:
a  b  c
1  3  1
2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise:
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
    pl.when(pl.col("b") == 4).then(pl.col('b')).otherwise(pl.col('a')).alias("c")
)
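Note that with_columns returns a new DataFrame rather than modifying df, so assign the result if you want to keep it. Printing it should show something like:
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 1   │
│ 2   ┆ 4   ┆ 4   │
└─────┴─────┴─────┘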
I have a dataset with NaN values in column 'a'. I want to group rows by 'user_id', compute the mean of column 'c' per 'user_id', and fill the NaN values in 'a' with this mean. How can I do it?
This is the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I should have:
df = pd.DataFrame({'a': [0, 7, 7], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I've tried:
df['a'].fillna(df.groupby('user_id')['a'].transform('mean'), inplace=True)
print(df)
After printing the df I still see NaN in column 'a'.
Note: since I have a huge dataset, I need to do it in place.
I think you need to process column c:
df['a'].fillna(df.groupby('user_id')['c'].transform('mean'), inplace=True)
# or, without inplace:
df['a'] = df['a'].fillna(df.groupby('user_id')['c'].transform('mean'))
print(df)
     a  user_id  c
0  0.0        1  3
1  7.0        2  7
2  7.0        2  7
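To see why the original attempt filled nothing, run the transform on the unfilled data (df0 below is just a fresh copy of the original frame): for user_id == 2 every value of 'a' is NaN, so the group mean of 'a' is itself NaN and fillna has nothing to fill with.
df0 = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df0.groupby('user_id')['a'].transform('mean'))
# 0    0.0
# 1    NaN
# 2    NaN
# Name: a, dtype: float64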
I have a data frame. I want to change values in column C to null values based on whether conditions in columns A and B are met. To do this, I think I need to iterate over the rows of the dataframe, but I can't figure out how:
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
I tried something like this:
for row in df.iterrows()
    if df['A'] > 2 and df['B'] == 3:
        df['C'] == np.nan
but I just keep getting errors. Could someone please show me how to do this?
Yours is not a DataFrame, it's a dictionary. This is a DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
It is usually faster to use pandas/numpy arithmetic instead of regular Python loops.
df.loc[(df['A'].values > 2) & (df['B'].values == 3), 'C'] = np.nan
Or if you insist on your way of coding, the code (besides converting df to a real DataFrame) can be updated to:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
for i, row in df.iterrows():
    if row.loc['A'] > 2 and row.loc['B'] == 3:
        df.loc[i, 'C'] = np.nan
or
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
for i, row in df.iterrows():
    if df.loc[i, 'A'] > 2 and df.loc[i, 'B'] == 3:
        df.loc[i, 'C'] = np.nan
You can try
df.loc[(df["A"].values > 2) & (df["B"].values==3), "C"] = None
Using pandas and numpy is way easier for you :D
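As one more vectorized option (a sketch, not from the answers above), Series.mask sets values to NaN wherever the condition holds; note that assigning None into a numeric column stores NaN anyway:
# mask replaces entries where the condition is True; the default fill is NaN
df['C'] = df['C'].mask((df['A'] > 2) & (df['B'] == 3))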
Please see this pandas df:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3],
                   'pay_date': ['Jul1', 'Jul2', 'Jul8', 'Aug5', 'Aug7', 'Aug22'],
                   'id_ind': [1, 2, 1, 2, 3, 1]})
I am trying to groupby 'id' and 'pay_date'. I only want to keep df['id_ind'].nlargest(2) in the dataframe after grouping by 'id' and 'pay_date'. Here is my code:
df = pd.DataFrame(df.groupby(['id', 'pay_date'])['id_ind'].apply(
    lambda x: x.nlargest(2)).reset_index())
This does not work, as the new df returns all the records. If it worked, 'id'==2 would only appear twice in the df, as there are 3 records and I only want the 2 largest by 'id_ind'.
My desired output:
pd.DataFrame({'id': [1, 1, 2, 2, 3],
              'pay_date': ['Jul1', 'Jul2', 'Aug5', 'Aug7', 'Aug22'],
              'id_ind': [1, 2, 2, 3, 1]})
Sort on id_ind and do groupby.tail:
df_final = (df.sort_values('id_ind').groupby('id').tail(2)
              .sort_index()
              .reset_index(drop=True))
Out[29]:
   id  id_ind pay_date
0   1       1     Jul1
1   1       2     Jul2
2   2       2     Aug5
3   2       3     Aug7
4   3       1    Aug22
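Equivalently, you can sort descending and take the head of each group; a sketch of the same idea:
df_final = (df.sort_values('id_ind', ascending=False)
              .groupby('id')
              .head(2)              # first two rows per id == two largest id_ind
              .sort_index()
              .reset_index(drop=True))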
I'm attempting to index rows using pandas indexing, but it seems that there isn't an appropriate way to input a list for this. This is the solution I'm trying to use, without loops:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)
interest = [3, 4]
# results = df['a'].eq(interest)
# results = df[(df['a'] == 3) & (df['a'] == 4)]
df(results)
# print(df[df['b'] == 3]) # index 0 1 2
With loops, I'm able to get my desired result.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)
lst = [3,4]
print('index values are : {}'.format(lst))
results = pd.DataFrame()
for itr in lst:
    if results.empty:
        results = df[df['a'] == itr]
    else:
        results = results.append(df[df['a'] == itr])
print('result : \n{}'.format(results))
I've searched, but most documentation indexes both columns 'a' and 'b' and/or only uses one value at a time for indexing, rather than a list. Let me know if I wasn't clear.
IIUC you want .isin?
>>> df[df.a.isin([3,4])]
   a  b
2  3  4
3  4  5
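To drop those values instead of keeping them, negate the mask with ~:
>>> df[~df.a.isin([3, 4])]
   a    b
0  1    3
1  2    3
4  9  100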