This question already has answers here:
How to delete rows from a pandas DataFrame based on a conditional expression [duplicate]
(6 answers)
How do you filter pandas dataframes by multiple columns?
(10 answers)
Closed 4 years ago.
I want to drop rows with zero value in specific columns
>>> df
   salary  age  gender
0   10000   23       1
1   15000   34       0
2   23000   21       1
3       0   20       0
4   28500    0       1
5   35000   37       1
Some values in the salary and age columns are missing and are recorded as 0.
The third column, gender, is a binary variable where 1 means male and 0 means female, so a 0 there is not missing data.
I want to drop every row in which either salary or age is missing,
so that I get:
>>> df
   salary  age  gender
0   10000   23       1
1   15000   34       0
2   23000   21       1
3   35000   37       1
Option 1
You can filter your dataframe using pd.DataFrame.loc:
df = df.loc[~((df['salary'] == 0) | (df['age'] == 0))]
Option 2
Or a smarter way to implement your logic:
df = df.loc[df['salary'] * df['age'] != 0]
This works because if either salary or age is 0, their product is also 0.
Option 3
The following method can be easily extended to several columns:
df = df.loc[(df[['salary', 'age']] != 0).all(axis=1)]
Explanation
In all 3 cases, Boolean arrays are generated which are used to index your dataframe.
All these methods can be further optimised by using numpy representation, e.g. df['salary'].values.
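As a quick check, the three options can be run side by side on the data from the question. This is only a sketch: the names opt1, opt2, opt3, cols and result are illustrative, and the final reset_index(drop=True) is an addition that renumbers the rows 0 to 3 as in the desired output.

```python
import pandas as pd

# Sample data copied from the question; a 0 in salary or age marks a missing value.
df = pd.DataFrame({
    'salary': [10000, 15000, 23000, 0, 28500, 35000],
    'age': [23, 34, 21, 20, 0, 37],
    'gender': [1, 0, 1, 0, 1, 1],
})

# Option 1: explicit comparison on each column
opt1 = df.loc[~((df['salary'] == 0) | (df['age'] == 0))]

# Option 2: the product is zero whenever either factor is zero
opt2 = df.loc[df['salary'] * df['age'] != 0]

# Option 3: keep rows where every listed column is non-zero
cols = ['salary', 'age']
opt3 = df.loc[(df[cols] != 0).all(axis=1)]

# Renumber the surviving rows (illustrative extra step, not part of the answer)
result = opt3.reset_index(drop=True)
print(result)
```

The gender column is never tested, so rows with gender 0 are kept.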
This question already has answers here:
How to make an order column when grouping by another column
(2 answers)
Add column to dataframe with the order parameter of a groupby
(1 answer)
Closed 5 months ago.
I have a dataframe characterized by two essential columns: name and timestamp.
df = pd.DataFrame({'name': ['tom', 'tom', 'tom', 'bert', 'bert', 'sam'],
                   'timestamp': [15, 13, 14, 23, 22, 14]})
I would like to create a third column chronology that checks the timestamp for each name and gives me the chronological order per name such that the final product looks like this:
df_final = pd.DataFrame({'name': ['tom', 'tom', 'tom', 'bert', 'bert', 'sam'],
                         'timestamp': [15, 13, 14, 23, 22, 14],
                         'chronology': [3, 1, 2, 2, 1, 1]})
I understand that I can go df = df.sort_values(['name', 'timestamp']) but how do I create the chronology column?
You can use groupby().cumcount() if timestamps are unlikely to repeat within a name:
df['chronology']= df.sort_values('timestamp').groupby('name').cumcount().add(1)
or groupby().rank():
df['chronology'] = df.groupby('name')['timestamp'].rank().astype(int)
Output:
   name  timestamp  chronology
0   tom         15           3
1   tom         13           1
2   tom         14           2
3  bert         23           2
4  bert         22           1
5   sam         14           1
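A self-contained sketch of both approaches on the question's data follows. The variable names by_cumcount and by_rank are illustrative. The cumcount result is computed on a sorted copy, but assigning it back aligns on the original index, so the row order of df is unchanged.

```python
import pandas as pd

df = pd.DataFrame({'name': ['tom', 'tom', 'tom', 'bert', 'bert', 'sam'],
                   'timestamp': [15, 13, 14, 23, 22, 14]})

# Sort by time, then number the rows 0, 1, 2, ... within each name and shift to 1-based
by_cumcount = df.sort_values('timestamp').groupby('name').cumcount().add(1)

# Rank the timestamps within each name (floats by default, hence the cast)
by_rank = df.groupby('name')['timestamp'].rank().astype(int)

# Assignment aligns on the index, not on position
df['chronology'] = by_cumcount
print(df)
```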
The function GroupBy.rank() does exactly what you need. From the documentation:
GroupBy.rank(method='average', ascending=True, na_option='keep', pct=False, axis=0)
Provide the rank of values within each group.
Try this code:
df['chronology'] = df.groupby(by=['name']).timestamp.rank().astype(int)
Result:
name timestamp chronology
tom 15 3
tom 13 1
tom 14 2
bert 23 2
bert 22 1
sam 14 1
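One caveat worth checking is repeated timestamps within a name. With the default method='average', tied rows share a fractional rank, and astype(int) then truncates it. The sketch below uses hypothetical data (not from the question) to compare the default with method='first' and method='dense', both of which are documented options of rank.

```python
import pandas as pd

# Hypothetical data: 'tom' has the timestamp 14 twice
df = pd.DataFrame({'name': ['tom', 'tom', 'tom'],
                   'timestamp': [14, 13, 14]})

grouped = df.groupby('name')['timestamp']

# Default: tied rows share the mean of the ranks they occupy (2 and 3 -> 2.5)
avg = grouped.rank()

# 'first': ties are ranked in order of appearance, so ranks stay unique
first = grouped.rank(method='first').astype(int)

# 'dense': ties share a rank and the next rank follows without a gap
dense = grouped.rank(method='dense').astype(int)

print(pd.DataFrame({'avg': avg, 'first': first, 'dense': dense}))
```

If every row needs a distinct position, method='first' is the closest match to the cumcount approach.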
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 7 months ago.
I have a dataframe like this:
   customer_id product_id  sum
0            1          A   20
1            2          A   10
2            2          C   30
3            1          B   40
I want to transform it into a sparse dataframe with one row per customer and one column per product, with the product columns in order. Something like this:
   customer_id  Product_A_Sum  Product_B_Sum  Product_C_Sum
0            1             20             40              0
1            2             10              0             30
I have a brute force solution - something like this:
df_new = pd.DataFrame()
df_new['customer_id'] = df.customer_id.unique()
# list_products is assumed to hold the distinct product_id values
for product in list_products:
    # total per customer for this one product
    temp = (df[df['product_id'] == product]
            .groupby('customer_id')['sum'].sum()
            .reset_index()
            .rename(columns={'sum': 'sum' + product}))
    df_new = df_new.merge(temp[['customer_id', 'sum' + product]], how='left', on='customer_id').fillna(0)
This code works, but my list of products is large, so it will not scale at all. Are there any pandas tricks that would let me do this more easily?
Thank you in advance!
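Try this:
# aggfunc='sum' totals repeated (customer_id, product_id) pairs; the default is the mean
df = df.pivot_table(index='customer_id', columns='product_id', values='sum', aggfunc='sum', fill_value=0)
df.columns = [f"Product_sum_{x}" for x in df.columns]
df = df.reset_index()
Output: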
   customer_id  Product_sum_A  Product_sum_B  Product_sum_C
0            1             20             40              0
1            2             10              0             30
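You can also use pivot, provided each (customer_id, product_id) pair appears at most once. Combinations that are absent come out as NaN, so fill them with 0:
df.pivot(index='customer_id', columns='product_id', values='sum').fillna(0)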
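For reference, here is a runnable sketch of the pivot_table approach on the question's data. The column pattern Product_{p}_Sum is chosen here to match the layout the question asks for, and aggfunc='sum' is passed explicitly on the assumption that repeated customer/product pairs should be totalled.

```python
import pandas as pd

df = pd.DataFrame({'customer_id': [1, 2, 2, 1],
                   'product_id': ['A', 'A', 'C', 'B'],
                   'sum': [20, 10, 30, 40]})

# One row per customer, one column per product; absent combinations become 0
wide = df.pivot_table(index='customer_id', columns='product_id',
                      values='sum', aggfunc='sum', fill_value=0)

# Rename the product columns to the pattern used in the question
wide.columns = [f"Product_{p}_Sum" for p in wide.columns]

# Turn customer_id back into an ordinary column
wide = wide.reset_index()
print(wide)
```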
This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 1 year ago.
When using groupby(), how can I create a DataFrame with a new column containing an increasing index within each group? For example, if I have
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2]})
df
a
0 1
1 1
2 1
3 2
4 2
5 2
How can I get a DataFrame where the index resets for each new group in the column? The association between a and the index is not important; each occurrence of a value in a just needs a unique index starting from 1.
a idx
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
The answer from the comments:
df['idx'] = df.groupby('a').cumcount() + 1
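A minimal runnable version of that one-liner is sketched below. The second frame, df2, is a hypothetical addition showing that the counter also works when the groups are interleaved rather than contiguous.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2]})

# cumcount numbers the rows of each group 0, 1, 2, ...; add 1 to start from 1
df['idx'] = df.groupby('a').cumcount() + 1
print(df)

# Hypothetical interleaved groups: the count still follows each group separately
df2 = pd.DataFrame({'a': [1, 2, 1, 2]})
df2['idx'] = df2.groupby('a').cumcount() + 1
print(df2)
```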
This question already has answers here:
Pandas - Duplicate Row based on condition
(3 answers)
Closed 2 years ago.
I need to replicate some rows in a pandas DataFrame like this:
name times
A 2
B 1
C 3
D 20
...
What I need is to replicate each row as many times as its times value says, but only when that value is less than 20.
What I'm doing now is:
for t in df["times"]:
    if t < 20:
        df = df.loc[df.index.repeat(t)]
But the script keeps running and I have to stop it (I've been waiting a long time...).
Is there a way to improve this, or another way to do it?
Use:
# condition: lt means "less than"
mask = df['times'].lt(20)
# select the matching rows by boolean indexing
df1 = df[mask].copy()
# repeat each selected row `times` times
df1 = df1.loc[df1.index.repeat(df1['times'])]
# add back the rows with times >= 20, sort by the original index, and create a default index
df = pd.concat([df1, df[~mask]]).sort_index().reset_index(drop=True)
print (df)
name times
0 A 2
1 A 2
2 B 1
3 C 3
4 C 3
5 C 3
6 D 20
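A possible single-pass variant, which is not part of the answer above, builds one repeat count per row instead of splitting and concatenating. The names repeats and out are illustrative, and the sample frame covers only the four rows shown in the question.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'times': [2, 1, 3, 20]})

# Repeat count per row: `times` when it is below 20, otherwise 1 (keep the row once)
repeats = df['times'].where(df['times'] < 20, 1)

# Repeat the index labels, select those rows, then create a default index
out = df.loc[df.index.repeat(repeats)].reset_index(drop=True)
print(out)
```

Because the rows are repeated in place, no sort is needed afterwards. Note that a row with times equal to 0 would be dropped, as it would be in the two-step version.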
This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I have a dataframe where, say, one column holds dates and a second column holds ages. I want to add a third column that looks at the AGE column and multiplies the value by 2 if it is below 20, and otherwise just copies the age into that row. The lambda function below multiplies every age by 2.
def fun(df):
    change = df.loc[:, "AGE"].apply(lambda x: x * 2 if x < 20 else x)
    df.insert(2, "NEW_AGE", change)
    return df
Use pandas.Series.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(15, 25), columns=['AGE'])
df['AGE'].where(df['AGE'] >= 20, df['AGE'] * 2)
Output:
0 30
1 32
2 34
3 36
4 38
5 20
6 21
7 22
8 23
9 24
Name: AGE, dtype: int64
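To store the result as a third column, as the question asks, the same expression can be assigned directly. The sketch below also shows numpy.where as an equivalent alternative. The column name NEW_AGE comes from the question, while NEW_AGE_NP is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AGE': np.arange(15, 25)})

# Series.where keeps values where the condition is True and uses the
# second argument elsewhere, so the condition is "age is already >= 20"
df['NEW_AGE'] = df['AGE'].where(df['AGE'] >= 20, df['AGE'] * 2)

# numpy.where reads the other way round: condition, value if True, value if False
df['NEW_AGE_NP'] = np.where(df['AGE'] < 20, df['AGE'] * 2, df['AGE'])
print(df)
```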