How to keep certain rows based on a condition in python pandas

I have the following df. Below are the two fields that pertain to my question:
name tardy
max 0
max 1
ben 0
amy 0
amy 1
sue 1
tyler 0
tyler 1
I would like to keep only the names that have both tardy==0 and tardy==1. Thus, my desired output is the following:
name tardy
max 0
max 1
amy 0
amy 1
tyler 0
tyler 1
Getting rid of name==sue and name==ben means the only names that show up are those with both a 0 and a 1 value for tardy.
I tried a boolean filter
df[(df.tardy==0) & (df.tardy==1)]
but this doesn't take the grouping by name into account (and a single row can never satisfy both conditions, so it returns an empty frame).
Any help is appreciated. Thanks!

The most general solution, working for any data: compare each group's values, converted to a set, with the required set, and to avoid matching data like 0,1,0 also compare the lengths:
vals = {0, 1}
m = df.groupby('name')['tardy'].transform(lambda x: set(x) == vals and len(x) == len(vals))
df = df[m]
print (df)
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
Or a solution with pandas functions: check that each group's unique values match the set, that the lengths match, and that the values are the expected 0 and 1:
vals = [0, 1]
g = df.groupby('name')['tardy']
df = df[g.transform('nunique').eq(2) & g.transform('size').eq(2) & df['tardy'].isin(vals)]
print (df)
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1

You can use groupby().nunique():
df[df.groupby('name')['tardy'].transform('nunique')==2]
Output:
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
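Note that nunique == 2 only checks that two distinct values exist; if tardy could ever contain values other than 0 and 1 (say 0 and 2), such a group would also pass. A sketch of an exact-set check for that case (it tolerates repeats like 0,1,0, unlike the length-checked solution above):
required = {0, 1}
df[df.groupby('name')['tardy'].transform(lambda s: set(s) == required)]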

The easiest way is to use df.groupby().filter, which filters the dataframe's groups based on a condition.
tardy_vals = {0, 1}
df.groupby('name').filter(lambda g: tardy_vals.issubset(g['tardy']))
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
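For reference, a self-contained sketch; the DataFrame construction is an assumption reconstructed from the printed frame in the question:
import pandas as pd

df = pd.DataFrame({
    'name': ['max', 'max', 'ben', 'amy', 'amy', 'sue', 'tyler', 'tyler'],
    'tardy': [0, 1, 0, 0, 1, 1, 0, 1],
})

# keep only the names whose group contains both tardy values
tardy_vals = {0, 1}
print(df.groupby('name').filter(lambda g: tardy_vals.issubset(g['tardy'])))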

Related

Pandas - dense rank but keep current group numbers

I'm dealing with a pandas dataframe like:
data = {
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)
----------------------
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 0
4 Mary 0
5 Andrew 0
6 Michael 2
I'm trying to write code to group values by the "name" column. However, I want to keep the current group numbers.
If the value is 0, it means that there is no assignment yet.
For the example above, assign a value of 3 to each occurrence of Andrew and a value of 1 to each occurrence of James. For Mary, there is no assignment, so assign the next unused number.
The expected output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
I've spent time already trying to figure this out. I managed to get to something like this:
df.loc[df["id"].eq(0), "id"] = df['name'].rank(method='dense').astype(int)
The issue with the above is that it ignores the records equal to 0, so the numbers are incorrect. I removed that part (values equal to 0), but then the existing numbering is not preserved.
Can you please help?
Replace 0 values with missing values, then use GroupBy.transform with 'first' to spread the existing id within each group, and finally fill the remaining missing values with Series.rank, offset by the maximal existing id, converting back to integers:
import numpy as np

df = df.replace({'id': {0: np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')
s = df.loc[df["id"].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print (df)
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
IIUC you can first fill in the non-zero IDs with groupby.transform('max') to get the max existing ID per name, then assign the names without an ID the next available numbers on the masked data (you can use factorize or rank as you wish):
# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)
# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0] + df['id'].max() + 1
# or rank, although factorize is more appropriate for non numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense') + df['id'].max()
# optional, if you want integers
df['id'] = df['id'].convert_dtypes()
output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2

Group by column and Spread values of another Column into other Columns

I have the following dataframe and I'm trying to group by Name, spread the values of Weight into columns, and count how often each occurs. Thanks!
df = pd.DataFrame({'Name': ['John', 'Paul', 'Darren', 'John', 'Darren'],
                   'Weight': ['Average', 'Below Average', 'Above Average', 'Average', 'Above Average']})
Desired output:
Weight Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1
Try pandas crosstab:
pd.crosstab(df.Name, df.Weight)
Weight Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1
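As a side note, crosstab can also append row and column totals if you ever need them; a small optional sketch, not part of the asked-for output:
pd.crosstab(df.Name, df.Weight, margins=True, margins_name='Total')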
Use groupby and unstack:
df = pd.DataFrame({'Name': ['John', 'Paul', 'Darren', 'John', 'Darren'],
                   'Weight': ['Average', 'Below Average', 'Above Average', 'Average', 'Above Average']})
df = df.groupby(['Name', 'Weight'])['Weight'].count().unstack(1).fillna(0).astype(int).reset_index()
df = df.rename_axis('', axis=1).set_index('Name')
df
Out[1]:
Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1
Use get_dummies to achieve what you need here:
pd.get_dummies(df.set_index('Name'), dummy_na=False, prefix=[None]).groupby('Name').sum()
Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1

Split Columns in pandas with str.split and keep values

So I am stuck with a problem here:
I have a pandas dataframe which looks like the following:
ID Name Value
0 Peter 21,2
1 Frank 24
2 Tom 23,21/23,60
3 Ismael 21,2/ 21,54
4 Joe 23,1
and so on...
What I am trying to do is split the "Value" column on the forward slash (/) but keep all the values that do not contain this pattern.
Like here:
ID Name Value
0 Peter 21,2
1 Frank 24
2 Tom 23,21
3 Ismael 21,2
4 Joe 23,1
How can I achieve this? I tried the str.split method but it's not giving me the solution I want. Instead, it returns NaN, as can be seen in the following.
My code: df['Value'] = df['Value'].str.split('/', expand=True)[0]
Returns:
ID Name Value
0 Peter NaN
1 Frank NaN
2 Tom 23,21
3 Ismael 21,2
4 Joe NaN
All I need is the first value before the '/'.
Appreciate any kind of help!
Remove expand=True so the split returns lists, and add str[0] to select the first value:
df['Value'] = df['Value'].str.split('/').str[0]
print (df)
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
If performance is important use list comprehension:
df['Value'] = [x.split('/')[0] for x in df['Value']]
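Note the list comprehension assumes every value is a string; a missing value (NaN is a float) would raise AttributeError. A guarded sketch if the column may contain NaN:
df['Value'] = [x.split('/')[0] if isinstance(x, str) else x for x in df['Value']]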
pandas.Series.str.replace with regex (pass regex=True explicitly, since pandas 2.0 treats the pattern as a literal string by default):
df.assign(Value=df.Value.str.replace('/.*', '', regex=True))
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
Optionally, you can assign the result directly back to the dataframe:
df['Value'] = df.Value.str.replace('/.*', '', regex=True)

Label encoding across multiple columns with same attributes in scikit-learn

If I have two columns as below:
Origin Destination
China USA
China Turkey
USA China
USA Turkey
USA Russia
Russia China
How would I perform label encoding while ensuring the label for the Origin column matches the one in the Destination column, i.e.
Origin Destination
0 1
0 3
1 0
1 3
1 2
2 0
If I do the encoding for each column separately, then the algorithm will see the China in column 1 as different from the China in column 2, which is not the case.
stack
df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
factorize with reshape
pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
np.unique and reshape
pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
Disgusting Option
I couldn't stop trying stuff... sorry!
import itertools

df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)
Origin Destination
0 0 1
1 0 3
2 1 0
3 1 3
4 1 2
5 2 0
As pointed out by cᴏʟᴅsᴘᴇᴇᴅ, you can shorten this by assigning back to the dataframe:
df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)
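If you also need to decode later, pd.factorize returns the array of unique labels alongside the codes, so you can keep it around; a sketch (run before overwriting df):
codes, uniques = pd.factorize(df.values.ravel())
df[:] = codes.reshape(df.shape)
# decoding is an index lookup: uniques[code] gives back the original label
original = uniques[codes].reshape(df.shape)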
pandas Method
You could create a dictionary of {country: value} pairs and map the dataframe to that:
country_map = {country:i for i, country in enumerate(df.stack().unique())}
df['Origin'] = df['Origin'].map(country_map)
df['Destination'] = df['Destination'].map(country_map)
>>> df
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
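To decode later with this approach, the dictionary can simply be inverted; a sketch:
inv_map = {i: country for country, i in country_map.items()}
df['Origin'] = df['Origin'].map(inv_map)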
sklearn method
Since you tagged sklearn, you could use LabelEncoder():
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.stack().unique())
df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])
>>> df
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
To get the original labels back:
>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
You can use replace:
u = np.unique(df.values)
df.replace(dict(zip(u, range(len(u)))))
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
Succinct and nice answer from Pir:
df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))
And
df.replace(dict(zip(np.unique(df), itertools.count())))
Edit: just found out about the return_inverse option to np.unique. No need to search and substitute!
df.values[:] = np.unique(df, return_inverse=True)[1].reshape(-1,2)
You could leverage the vectorized version of np.searchsorted with
df.values[:] = np.searchsorted(np.sort(np.unique(df)), df)
Or you could create an array of one-hot encodings and recover indices with argmax. Probably not a great idea if there are many countries.
df.values[:] = (df.values[...,None] == np.unique(df)).argmax(-1)
Using LabelEncoder from sklearn, you can also fit it once on all values and then transform each column with the shared encoding:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.values.flatten())
df = df.apply(le.transform)
print(df)
Result:
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
If you have more columns and only want to apply the encoding to selected columns of the dataframe, you can try:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# columns to select for encoding
selected_col = ['Origin','Destination']
le.fit(df[selected_col].values.flatten())
df[selected_col] = df[selected_col].apply(le.transform)
print(df)

pivot_table with group and without value field

I have a pandas dataframe url like
location dom_category
3 'edu'
3 'gov'
3 'edu'
4 'org'
4 'others'
4 'org'
and I want this dataframe to be like
location edu gov org others
3 2 1 0 0
4 0 0 2 1
The edu, gov, org and others columns contain the counts for each location.
I have written the code below, but I know it's not optimal:
url['val'] = 1
url_final = url.pivot_table(index=['location'], values='val',
                            columns=['dom_category'], aggfunc=np.sum)
First, if necessary, remove the ' characters with str.strip.
Then use groupby, aggregate with size and reshape with unstack:
df['dom_category'] = df['dom_category'].str.strip("\'")
df = df.groupby(['location','dom_category']).size().unstack(fill_value=0)
print (df)
dom_category edu gov org others
location
3 2 1 0 0
4 0 0 2 1
Or use pivot_table:
df['dom_category'] = df['dom_category'].str.strip("\'")
df=df.pivot_table(index='location',columns='dom_category',aggfunc='size', fill_value=0)
print (df)
dom_category edu gov org others
location
3 2 1 0 0
4 0 0 2 1
Finally, you can convert the index to a column and remove the dom_category columns name with reset_index + rename_axis:
df = df.reset_index().rename_axis(None, axis=1)
print (df)
location edu gov org others
0 3 2 1 0 0
1 4 0 0 2 1
Using groupby and value_counts
Housekeeping: get rid of the '
df.dom_category = df.dom_category.str.strip("'")
Rest of Solution
df.groupby('location').dom_category.value_counts().unstack(fill_value=0)
dom_category edu gov org others
location
3 2 1 0 0
4 0 0 2 1
To get the formatting just right
df.groupby('location').dom_category.value_counts().unstack(fill_value=0) \
.reset_index().rename_axis(None, axis=1)
location edu gov org others
0 3 2 1 0 0
1 4 0 0 2 1
Let's use str.strip, get_dummies and groupby:
df['dom_category'] = df.dom_category.str.strip("\'")
df.assign(**df.dom_category.str.get_dummies()).groupby('location').sum(numeric_only=True).reset_index()
Output:
location edu gov org others
0 3 2 1 0 0
1 4 0 0 2 1
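For completeness, pd.crosstab computes the same counts without adding any helper column; a sketch, after the same str.strip cleanup:
pd.crosstab(df['location'], df['dom_category']).reset_index().rename_axis(None, axis=1)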
