I was wondering if I can use the pandas .drop method to drop rows while chaining methods to construct a DataFrame.
Dropping rows is straightforward once the DataFrame exists:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
print(df1)
# drop the entries that match "2"
df1 = df1[df1['A'] != 2]
print(df1)
However, I would like to do this while I am creating the data frame:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # .drop(lambda x: x['A'] != 2)
       )
print(df2)
The commented line does not work, but maybe there is a correct way of doing this. Grateful for any input.
Use DataFrame.loc with a callable:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .loc[lambda x: x['AA'] != 2]
       )
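The callable receives the DataFrame built so far, which is why this works in the middle of a chain where the intermediate frame has no variable name yet. DataFrame.pipe can express the same filter; a minimal sketch:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # pipe hands the intermediate frame to the lambda, same idea as .loc with a callable
       .pipe(lambda x: x[x['AA'] != 2])
       )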
Or DataFrame.query:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != 2")
       )
print(df2)
   AA  B
0   1  5
2   3  3
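If the cutoff lives in a Python variable, query can reference it with @; the variable name threshold below is just for illustration:
# filter against a local variable instead of a literal
threshold = 2
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != @threshold")
       )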
You can use DataFrame.apply with DataFrame.dropna (note this needs numpy for np.nan):
import numpy as np

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .apply(lambda x: x if x['AA'] != 2 else np.nan, axis=1)
       .dropna()
       )
print(df2)
    AA    B
0  1.0  5.0
2  3.0  3.0
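Note the integer columns were upcast to float because NaN is a float; if the original dtypes matter, they can be restored afterwards, e.g. (a sketch, column names taken from the example above):
# cast back to integers once the NaN rows are gone
df2 = df2.astype({'AA': 'int64', 'B': 'int64'})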
Maybe you can try a selection with .loc at the end of your definition, as shown above.
I would like to add a new column using another column's value, with a condition.
In pandas, I do it like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is:
a  b  c
1  3  1
2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise:
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
    pl.when(pl.col("b") == 4).then(pl.col('b')).otherwise(pl.col('a')).alias("c")
)
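Note that with_columns returns a new DataFrame rather than modifying df, so assign the result if you want to keep it. Printing it should show something like:
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 1   │
│ 2   ┆ 4   ┆ 4   │
└─────┴─────┴─────┘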
I have a dataset with NaN values in column 'a'. I want to group rows by 'user_id', compute the mean of column 'c' per 'user_id', and fill the NaN values in 'a' with this mean. How can I do it?
This is the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I should have:
df = pd.DataFrame({'a': [0, 7, 7], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df)
What I've tried:
df['a'].fillna(df.groupby('user_id')['a'].transform('mean'), inplace=True)
print(df)
After printing the df I still see NaN in column 'a'.
Note: since I have a huge dataset, I need to do it in place.
I think you need to process column c:
df['a'].fillna(df.groupby('user_id')['c'].transform('mean'), inplace=True)
# or, without inplace:
df['a'] = df['a'].fillna(df.groupby('user_id')['c'].transform('mean'))
print(df)
     a  user_id  c
0  0.0        1  3
1  7.0        2  7
2  7.0        2  7
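To see why the original attempt filled nothing, run the transform on the unfilled data (df0 below is just a fresh copy of the original frame): for user_id == 2 every value of 'a' is NaN, so the group mean of 'a' is itself NaN and fillna has nothing to fill with.
df0 = pd.DataFrame({'a': [0, np.nan, np.nan], 'user_id': [1, 2, 2], 'c': [3, 7, 7]})
print(df0.groupby('user_id')['a'].transform('mean'))
# 0    0.0
# 1    NaN
# 2    NaN
# Name: a, dtype: float64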
I have a data frame. I want to change values in column C to null values based on whether conditions in columns A and B are met. To do this, I think I need to iterate over the rows of the dataframe, but I can't figure out how:
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
I tried something like this:
for row in df.iterrows()
    if df['A'] > 2 and df['B'] == 3:
        df['C'] == np.nan
but I just keep getting errors. Could someone please show me how to do this?
Yours is not a DataFrame, it's a dictionary. This is a DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
It is usually faster to use pandas/numpy arithmetic instead of regular Python loops.
df.loc[(df['A'].values > 2) & (df['B'].values == 3), 'C'] = np.nan
Or if you insist on your way of coding, the code (besides converting df to a real DataFrame) can be updated to:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
for i, row in df.iterrows():
    if row.loc['A'] > 2 and row.loc['B'] == 3:
        df.loc[i, 'C'] = np.nan
or
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
for i, row in df.iterrows():
    if df.loc[i, 'A'] > 2 and df.loc[i, 'B'] == 3:
        df.loc[i, 'C'] = np.nan
You can try
df.loc[(df["A"].values > 2) & (df["B"].values==3), "C"] = None
Using pandas and numpy is way easier for you :D
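As one more vectorized option (a sketch, not from the answers above), Series.mask sets values to NaN wherever the condition holds; note that assigning None into a numeric column stores NaN anyway:
# mask replaces entries where the condition is True; the default fill is NaN
df['C'] = df['C'].mask((df['A'] > 2) & (df['B'] == 3))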
Please see this pandas df:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3],
                   'pay_date': ['Jul1', 'Jul2', 'Jul8', 'Aug5', 'Aug7', 'Aug22'],
                   'id_ind': [1, 2, 1, 2, 3, 1]})
I am trying to groupby 'id' and 'pay_date'. I only want to keep df['id_ind'].nlargest(2) in the dataframe after grouping by 'id' and 'pay_date'. Here is my code:
df = pd.DataFrame(df.groupby(['id', 'pay_date'])['id_ind'].apply(
    lambda x: x.nlargest(2)).reset_index())
This does not work, as the new df returns all the records. If it worked, 'id'==2 would only appear twice in the df, as there are 3 records and I only want the 2 largest by 'id_ind'.
My desired output:
pd.DataFrame({'id': [1, 1, 2, 2, 3],
              'pay_date': ['Jul1', 'Jul2', 'Aug5', 'Aug7', 'Aug22'],
              'id_ind': [1, 2, 2, 3, 1]})
Sort on id_ind and do groupby.tail:
df_final = (df.sort_values('id_ind').groupby('id').tail(2)
              .sort_index()
              .reset_index(drop=True))
Out[29]:
   id  id_ind pay_date
0   1       1     Jul1
1   1       2     Jul2
2   2       2     Aug5
3   2       3     Aug7
4   3       1    Aug22
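Equivalently, you can sort descending and take the head of each group; a sketch of the same idea:
df_final = (df.sort_values('id_ind', ascending=False)
              .groupby('id')
              .head(2)              # first two rows per id == two largest id_ind
              .sort_index()
              .reset_index(drop=True))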
I'm attempting to index rows using pandas indexing, but it seems that there isn't an appropriate way to input a list for this. This is the solution I'm trying to use, without loops:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)
interest = [3, 4]
# results = df['a'].eq(interest)
# results = df[(df['a'] == 3) & (df['a'] == 4)]
df(results)
# print(df[df['b'] == 3]) # index 0 1 2
With loops, I'm able to get my desired result.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 9], 'b': [3, 3, 4, 5, 100]})
print(df)
lst = [3,4]
print('index values are : {}'.format(lst))
results = pd.DataFrame()
for itr in lst:
    if results.empty:
        results = df[df['a'] == itr]
    else:
        results = results.append(df[df['a'] == itr])
print('result : \n{}'.format(results))
I've searched, but most documentation indexes both columns 'a' and 'b' and/or only uses one value at a time for indexing, rather than a list. Let me know if I wasn't clear.
IIUC you want .isin?
>>> df[df.a.isin([3,4])]
   a  b
2  3  4
3  4  5
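To drop those values instead of keeping them, negate the mask with ~:
>>> df[~df.a.isin([3, 4])]
   a    b
0  1    3
1  2    3
4  9  100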