Group by a column in a DataFrame and create another dataframe with the grouped data - python

I have a dataframe like the following:
data:
items status
0 jet fail
1 car fail
2 car pass
3 bike fail
4 car fail
5 jet fail
6 bike pass
7 jet fail
8 jet fail
9 bike pass
I want to group the data by items and create a new dataframe with the counts of each value.
Expected output:
df:
unique count pass fail
0 jet 4 0 4
1 car 3 1 2
2 bike 3 2 1
Now one method would be to get a list of unique items, loop over it to find the count, pass and fail values, and then combine those lists into a dataframe (see the sketch below).
But how can I do that efficiently?
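For reference, a minimal reproducible setup together with the loop approach the question describes; the DataFrame construction is an assumption based on the data shown above:
import pandas as pd

df = pd.DataFrame({
    'items': ['jet', 'car', 'car', 'bike', 'car', 'jet', 'bike', 'jet', 'jet', 'bike'],
    'status': ['fail', 'fail', 'pass', 'fail', 'fail', 'fail', 'pass', 'fail', 'fail', 'pass'],
})

# naive approach: one filtering pass per unique item, three counts each
rows = []
for item in df['items'].unique():
    sub = df[df['items'] == item]
    rows.append({'unique': item,
                 'count': len(sub),
                 'pass': (sub['status'] == 'pass').sum(),
                 'fail': (sub['status'] == 'fail').sum()})
expected = pd.DataFrame(rows)
This works, but it scans the whole frame once per unique item; the answers below do it in a single vectorized pass.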

Use crosstab with DataFrame.rename_axis to set the new index name, then insert a new column at position 0 with DataFrame.insert, and last convert the index back to a column with DataFrame.reset_index:
df = pd.crosstab(df['items'], df['status']).rename_axis(columns=None, index='unique')
df.insert(0, 'count', df.sum(axis=1))
df = df.reset_index()
print (df)
unique count fail pass
0 bike 3 1 2
1 car 3 2 1
2 jet 4 4 0
If count should be the last column, it is possible to use the margins parameter (with margins_name for the label) and remove the extra totals row with .iloc[:-1]:
df = (pd.crosstab(df['items'], df['status'],
                  margins=True,
                  margins_name='count')
        .rename_axis(columns=None, index='unique')
        .iloc[:-1]
        .reset_index())
print (df)
unique fail pass count
0 bike 1 2 3
1 car 2 1 3
2 jet 4 0 4

You could get the values separately and combine them with pd.concat:
A = df.groupby("items").size().rename("count")
A
items
bike 3
car 3
jet 4
Name: count, dtype: int64
B = (
    df.groupby(["items", "status"])
      .size()
      .unstack(fill_value=0)
      .rename_axis(columns=None)
)
B
fail pass
items
bike 1 2
car 2 1
jet 4 0
pd.concat((A, B), axis=1).reset_index()
items count fail pass
0 bike 3 1 2
1 car 3 2 1
2 jet 4 4 0
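If the exact layout of the expected output matters (columns ordered unique, count, pass, fail and rows in order of first appearance), a small follow-up recomputed from the original frame; pd.unique preserves first-appearance order:
out = (pd.crosstab(df['items'], df['status'])
         .rename_axis(columns=None, index='unique'))
out.insert(0, 'count', out.sum(axis=1))
out = (out.reindex(pd.unique(df['items']))  # rows in first-appearance order: jet, car, bike
          [['count', 'pass', 'fail']]
          .reset_index())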

Related

Python Get all activity before an activity according to a condition

I hope you can help me: I want to get all the activity rows according to a condition.
I have a dataframe like this:
ID  Number  Activity
1   1       Get Up
1   2       Wash
1   3       Dress Up
2   1       Get Up
2   2       Dress Up
2   3       Eat
2   4       Work
I have Dress Up as the target activity, so I should look up the Number of the target activity and remove all rows whose Number comes after it. The output:
ID  Number  Activity
1   1       Get Up
1   2       Wash
1   3       Dress Up
2   1       Get Up
2   2       Dress Up
I have tried to use the function where, but it removes all rows except the one with the target activity:
df = pd.read_csv('data.csv')
End_act = 'Dress Up'
cond = df['Activity'] == End_act
df = df[df['Number'] <= df['Number'].where(cond)]
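For reference, a minimal reproducible construction of the example frame (an assumption standing in for data.csv):
import pandas as pd

df = pd.DataFrame({
    'ID':       [1, 1, 1, 2, 2, 2, 2],
    'Number':   [1, 2, 3, 1, 2, 3, 4],
    'Activity': ['Get Up', 'Wash', 'Dress Up',
                 'Get Up', 'Dress Up', 'Eat', 'Work'],
})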
Use GroupBy.cummax: compare the Activity column with End_act, reverse the row order with DataFrame.iloc so cummax sets True for the target row and every row before it, then restore the original order and filter by boolean indexing:
End_act = 'Dress Up'
m = (df.iloc[::-1]
       .assign(new = lambda x: x['Activity'].eq(End_act))
       .groupby('ID')['new']
       .cummax())
df = df[m.iloc[::-1]]
print (df)
ID Number Activity
0 1 1 Get Up
1 1 2 Wash
2 1 3 Dress Up
3 2 1 Get Up
4 2 2 Dress Up
Your solution can be fixed with GroupBy.transform and 'idxmax': with Number as the index, the idxmax of the boolean column is the Number of the target activity per ID:
End_act = 'Dress Up'
s = (df.set_index('Number')
       .assign(new = lambda x: x['Activity'].eq(End_act))
       .groupby('ID')['new']
       .transform('idxmax'))
df = df[df['Number'].le(s.to_numpy())]
print (df)
ID Number Activity
0 1 1 Get Up
1 1 2 Wash
2 1 3 Dress Up
3 2 1 Get Up
4 2 2 Dress Up
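A third sketch of the same idea, hedged on the assumption that every ID contains the target activity exactly once: look up the target Number per ID explicitly and filter with a merge.
# Number of the target activity per ID
target = (df.loc[df['Activity'].eq(End_act), ['ID', 'Number']]
            .rename(columns={'Number': 'target'}))
# keep only rows up to and including that Number
out = df.merge(target, on='ID')
out = out[out['Number'] <= out['target']].drop(columns='target')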

transform data by conditional addition

I need help transforming the data as follows:
From a dataset in this version (df1)
ID apples oranges pears apples_pears oranges_pears
0 1 1 0 0 1 0
1 2 0 1 0 1 0
2 3 0 1 1 0 1
to a data set like the following (df2):
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
What I'm trying to accomplish is to get the total value of apples from all the columns in which the word "apple" appears in the column name. E.g. in df1 there are 2 column names in which the word "apple" appears. If you sum up all the apples from the first row, there is a total of 2. I want a single column for apples in the new dataset (df2). Note that a 1 for apples_pears is a 1 for EACH of apples and pears.
The idea is to split the DataFrame into 2 new ones: for the first, rename the columns to the value before _; for the second, filter the columns containing _ with DataFrame.filter and rename them to the value after _; last, join both together with concat and sum per column name:
df1 = df.set_index('ID')
df2 = df1.filter(like='_')                       # only the combined columns
df1.columns = df1.columns.str.split('_').str[0]  # combined columns count toward the fruit before _
df2.columns = df2.columns.str.split('_').str[1]  # ... and, via df2, toward the fruit after _
# sum duplicate column names together; in pandas >= 2.0, sum(level=0, axis=1)
# is gone -- use .T.groupby(level=0).sum().T instead
df = pd.concat([df1, df2], axis=1).sum(level=0, axis=1).reset_index()
print (df)
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
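The question's own phrasing ("all the columns in which the word appears") also maps directly onto DataFrame.filter. A minimal sketch, starting again from the original frame df and assuming the three base fruit names are known up front:
fruits = ['apples', 'oranges', 'pears']
df2 = pd.DataFrame({f: df.filter(like=f).sum(axis=1) for f in fruits})
df2.insert(0, 'ID', df['ID'])
Each df.filter(like=f) selects every column whose name contains the fruit, so apples_pears is counted toward both apples and pears, matching the note in the question.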

How to count the most popular value from multiple value pandas column

I have such a problem:
I have a pandas dataframe with shop IDs and shop categories, looking something like this:
id cats
0 10002718 182,45001,83079
1 10004056 9798
2 10009726 17,45528
3 10009752 64324,17
4 1001107 44607,83520,76557
... ... ...
24922 9992184 45716
24923 9997866 77063
24924 9998461 45001,44605,3238,72627,83785
24925 9998954 69908,78574,77890
24926 9999728 45653,44605,83648,85023,84481,68822
So the problem is that each shop can have multiple categories, and the task is to count the frequency of each category. What's the easiest way to do it?
In conclusion I need to have a dataframe with the columns:
cats count
0 1 133
1 2 1
2 3 15
3 4 12
Use Series.str.split with Series.explode and Series.value_counts:
df1 = (df['cats'].str.split(',')
                 .explode()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
Or add expand=True to split into a DataFrame and reshape with DataFrame.stack:
df1 = (df['cats'].str.split(',', expand=True)
                 .stack()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
print (df1.head(10))
cats count
0 17 2
1 44605 2
2 45001 2
3 83520 1
4 64324 1
5 44607 1
6 45653 1
7 69908 1
8 83785 1
9 83079 1
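If you are on a pandas version without Series.explode (added in 0.25), a plain-Python fallback with collections.Counter produces the same counts; a minimal sketch:
from collections import Counter

counts = Counter(cat for row in df['cats'] for cat in row.split(','))
df1 = (pd.DataFrame(list(counts.items()), columns=['cats', 'count'])
         .sort_values('count', ascending=False)
         .reset_index(drop=True))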

Joining two dataframes on unique ID, but using another value if id doesn't exist

I have two dataframes as such:
UID mainColumn .... (other columns of data)
1 apple
2 orange
3 apple
4 orange
5 berry
....
UID2 mainColumn2
1 truck
3 car
4 boat
5 plane
...
I need to join the second dataframe onto the first based on UID; however, if df2 does not contain a UID, then the mainColumn value is the one I'd like to use. In the above example, UID2 does not contain the value 2, so the final table would look something like
UID mainColumn ....
1 truck
2 orange
3 car
4 boat
5 plane
...
Now I'm aware we can do something of the form
df1 = df1.merge(df2, left_on='UID', right_on='UID2')
But the issue I have is handling the missing values and making sure those rows are still included. Thanks!
You can use combine_first() after renaming the columns of df2 to match df1 (e.g. UID2 to UID):
df2.columns = df1.columns  # careful: this blanket rename only works if the columns line up
final_df = df2.set_index('UID').combine_first(df1.set_index('UID')).reset_index()
UID mainColumn
0 1 truck
1 2 orange
2 3 car
3 4 boat
4 5 plane
We can first use merge, then fillna the missing values and finally drop the extra columns:
final = df1.merge(df2, left_on='UID', right_on='UID2', how='left').drop('UID2', axis=1)
final['mainColumn'] = final['mainColumn2'].fillna(final['mainColumn'])
final.drop('mainColumn2', axis=1, inplace=True)
UID mainColumn
0 1 truck
1 2 orange
2 3 car
3 4 boat
4 5 plane
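A third option that avoids the merge bookkeeping: build a Series keyed by UID2 and use Series.map with fillna. A minimal sketch, assuming the UID2 values are unique:
s = df2.set_index('UID2')['mainColumn2']
df1['mainColumn'] = df1['UID'].map(s).fillna(df1['mainColumn'])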

How to return a dataframe of number of duplicates based on filtering the two columns in python

ID group categories
1 0 red
2 1 blue
3 1 green
4 1 green
1 0 blue
1 0 blue
2 1 red
3 0 red
4 0 red
4 1 red
Hi, I am new to Python. I am trying to get the count of duplicates in the ID column based on multiple conditions on the other 2 columns. So I am filtering for red and group 0 and then I want the IDs that repeat more than once.
df1 = df[(df['categories']=='red')& (df['group'] == 0)]
df1['ID'].value_counts()[df1['ID'].value_counts()>1]
There are almost 10 categories in the categories column, so I was wondering if there is a way to write a function or for loop instead of repeating the same steps (see the sketch after the expected output). The final goal is to see how many duplicate IDs there are in each group given the category is 'red'/'blue'/'green'. Thanks in advance.
P.S.: the group values don't change; it is a binary variable.
Expected output:
ID count
1 3
2 2
3 2
4 3
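For the "function instead of repeating the same steps" part, a minimal sketch of a parametrised helper (the names dup_counts, cat and grp are hypothetical):
def dup_counts(df, category, group):
    # IDs appearing more than once for this category/group combination
    sub = df[(df['categories'] == category) & (df['group'] == group)]
    counts = sub['ID'].value_counts()
    return counts[counts > 1]

for cat in ['red', 'blue', 'green']:
    for grp in (0, 1):
        print(cat, grp, dup_counts(df, cat, grp).to_dict())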
I think you can use groupby with SeriesGroupBy.value_counts:
s = df.groupby(['ID','group'])['categories'].value_counts()
print (s)
ID group categories
1 0 blue 2
red 1
2 1 blue 1
red 1
3 0 red 1
1 green 1
4 0 red 1
1 green 1
red 1
Name: categories, dtype: int64
out = s[s > 1].reset_index(name='count')
print (out)
ID group categories count
0 1 0 blue 2
Another solution is to get the duplicates first by filtering with duplicated and then count:
df = df[df.duplicated(['ID','group','categories'], keep=False)]
print (df)
ID group categories
4 1 0 blue
5 1 0 blue
df1 = df.groupby(['ID','group'])['categories'].value_counts().reset_index(name='count')
print (df1)
ID group categories count
0 1 0 blue 2
EDIT: To count categories (all rows) per ID, use GroupBy.size:
df1 = df.groupby('ID').size().reset_index(name='count')
print (df1)
ID count
0 1 3
1 2 2
2 3 2
3 4 3
