I have the following problem:
I have a pandas DataFrame with shop IDs and shop categories, looking something like this:
id cats
0 10002718 182,45001,83079
1 10004056 9798
2 10009726 17,45528
3 10009752 64324,17
4 1001107 44607,83520,76557
... ... ...
24922 9992184 45716
24923 9997866 77063
24924 9998461 45001,44605,3238,72627,83785
24925 9998954 69908,78574,77890
24926 9999728 45653,44605,83648,85023,84481,68822
So the problem is that each shop can have multiple categories, and the task is to count the frequency of each category. What's the easiest way to do it?
In the end I need a DataFrame with the columns
cats count
0 1 133
1 2 1
2 3 15
3 4 12
Use Series.str.split with Series.explode and Series.value_counts:
df1 = (df['cats'].str.split(',')
                 .explode()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
Or add expand=True to split to DataFrame and DataFrame.stack:
df1 = (df['cats'].str.split(',', expand=True)
                 .stack()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
print(df1.head(10))
cats count
0 17 2
1 44605 2
2 45001 2
3 83520 1
4 64324 1
5 44607 1
6 45653 1
7 69908 1
8 83785 1
9 83079 1
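For reference, a minimal self-contained sketch of the explode approach, assuming a toy frame in the same id/cats format as the question (the small sample below is made up):
import pandas as pd

# hypothetical toy frame mirroring the question's id/cats layout
df = pd.DataFrame({'id': [10002718, 10004056, 10009726],
                   'cats': ['182,45001,83079', '9798', '17,45528']})

counts = (df['cats'].str.split(',')   # comma-separated string -> list per row
                    .explode()        # one row per (shop, category) pair
                    .value_counts()   # frequency of each category
                    .rename_axis('cats')
                    .reset_index(name='count'))
print(counts)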
Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the summation of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code should do the work:
import pandas as pd

df = pd.DataFrame(list(zip([1, 1, 1, 3, 4, 1, 2, 1, 6],
                           [3, 4, 1, 7, 4, 1, 2, 1, 6],
                           [0, 0, 0, 2, 2, 4, 4, 4, 5])),
                  columns=['a', 'b', 'ngroup'])

grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a', 'b']].agg('sum').reset_index()
# note: on pandas >= 2.0, value_counts().reset_index() already yields
# columns ['ngroup', 'count'], so the rename below would need adjusting
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', ascending=True, inplace=True, na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index': 'ngroup', 'ngroup': 'nrow_same_group'}, inplace=True)
df = pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether there exists a built-in pandas method that achieves something similar in a single line.
You can do it using only groupby + agg.
import pandas as pd

df = pd.DataFrame(list(zip([1, 1, 1, 3, 4, 1, 2, 1, 6],
                           [3, 4, 1, 7, 4, 1, 2, 1, 6],
                           [0, 0, 0, 2, 2, 4, 4, 4, 5])),
                  columns=['a', 'b', 'ngroup'])

res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=('a', 'sum'), b=('b', 'sum'),
           nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
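Equivalently, a sketch using pd.NamedAgg, which spells out the same column/function pairs more explicitly:
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=pd.NamedAgg(column='a', aggfunc='sum'),
           b=pd.NamedAgg(column='b', aggfunc='sum'),
           nrow_same_group=pd.NamedAgg(column='a', aggfunc='size'))
)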
First aggregate a and b with sum, then compute the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1
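If ngroup should come back as a regular column, as in the expected output, a small follow-up:
g = df.groupby('ngroup')
out = g.sum().assign(nrow_same_group=g.size()).reset_index()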
I have a DataFrame that looks like the one below
Index Category Class
0 1 A
1 1 A
2 1 B
3 2 A
4 3 B
5 3 B
And I would like to get an output DataFrame that groups by category and has one column per class, counting the occurrences of that class in each category, such as the one below:
Index Category A B
0 1 2 1
1 2 1 0
2 3 0 2
So far I've tried various combinations of the groupby and agg methods, but I still can't get what I want. I've also tried df.pivot_table(index='Category', columns='Class', aggfunc='count'), but that returns a DataFrame without columns. Any ideas of what could work in this case?
You can use aggfunc="size" to achieve your desired result:
>>> df.pivot_table(index='Category', columns='Class', aggfunc='size', fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Alternatively, you can use .groupby(...).size() to get the counts, and then unstack to reshape your data as well:
>>> df.groupby(["Category", "Class"]).size().unstack(fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
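To match the asker's exact layout, with Category as a regular column and no leftover columns label, a small follow-up on the groupby version:
out = (df.groupby(['Category', 'Class']).size()
         .unstack(fill_value=0)
         .reset_index()
         .rename_axis(None, axis=1))  # drop the 'Class' columns label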
Assign a dummy value to count:
out = df.assign(val=1).pivot_table('val', 'Category', 'Class',
                                   aggfunc='count', fill_value=0).reset_index()
print(out)
# Output
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 2, 3, 4, 5],
                   'Category': [1, 1, 1, 2, 3, 3],
                   'Class': ['A', 'A', 'B', 'A', 'B', 'B']})

df = df.groupby(['Category', 'Class']).count()
df = df.pivot_table(index='Category', columns='Class')
print(df)
output:
Index
Class A B
Category
1 2.0 1.0
2 1.0 NaN
3 NaN 2.0
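The NaN and float artifacts come from Category/Class combinations that never occur; a hedged tweak is to pass fill_value=0 to pivot_table and cast back to integers:
df = df.pivot_table(index='Category', columns='Class', fill_value=0).astype(int)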
Use crosstab:
pd.crosstab(df['Category'], df['Class']).reset_index()
output:
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
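If row and column totals are also useful, crosstab can attach them via margins:
pd.crosstab(df['Category'], df['Class'], margins=True)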
I need help transforming the data as follows:
From a dataset in this version (df1)
ID apples oranges pears apples_pears oranges_pears
0 1 1 0 0 1 0
1 2 0 1 0 1 0
2 3 0 1 1 0 1
to a data set like the following (df2):
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
What I'm trying to accomplish is to get the total values of apples from all the columns whose name contains the word "apples". E.g. in df1 there are 2 column names containing the word "apples". If you sum up all the apples from the first row, there is a total of 2. I want a single column for apples in the new dataset (df2). Note that a 1 for apples_pears counts as a 1 for EACH of apples and pears.
The idea is to split the DataFrame in two: for the first, rename the columns to the part before the _; for the second, select only the columns containing _ with DataFrame.filter and rename them to the part after the _; finally join both back together with concat and sum per column label:
df1 = df.set_index('ID')
df2 = df1.filter(like='_')                       # the combined columns, e.g. apples_pears
df1.columns = df1.columns.str.split('_').str[0]  # apples_pears -> apples
df2.columns = df2.columns.str.split('_').str[1]  # apples_pears -> pears
# sum the columns that now share the same label
df = pd.concat([df1, df2], axis=1).sum(level=0, axis=1).reset_index()
print(df)
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
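On recent pandas (2.0+), DataFrame.sum(level=...) has been removed; a sketch of the same idea that sums duplicate column labels via a transpose and groupby instead:
merged = pd.concat([df1, df2], axis=1)
# duplicate labels become index entries after transposing; group and sum, then transpose back
df = merged.T.groupby(level=0).sum().T.reset_index()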
I'm translating the Excel formula COUNTIFS(pos!$D:$D,$A3,pos!$N:$N,$E3) into pandas. I have two DataFrames, df1 and df2, and I need to count the rows in df1 that match each row of df2 on two columns, then write that count into df2. How do I check for the second condition in my solution below?
df1:
id member seq
0 48299 Koif 1
1 48299 Iki 1
2 48299 Juju 2
3 48299 PNik 3
4 48865 Lok 1
5 48865 Mkoj 2
6 48865 Kino 1
7 64865 Boni 1
8 64865 Afriya 2
9 50774 Amah 2
df2:
group_id group_name seq count
0 48299 e_sys 1
1 50774 Y3N 2
2 64865 nana 1
3 48865 juzti 1
Using the answer of a related question:
df2['count'] = df2['group_id'].map(df1.groupby('id')['id'].count())
The count for the first (groupby) condition works; to add the second condition, I've tried a few solutions below:
soln1:
df2['count'] = df2['seq'].map(df1.groupby(['seq'])['id'].count())
soln2:
df2['count'] = df2['seq'].map(df1[df1['seq']==df2['seq']].groupby(['seq'])['id'].count())
But I don't seem to get correct counts for df2.
Expected results:
group_id group_name seq count
0 48299 e_sys 1 2
1 50774 Y3N 2 1
2 64865 nana 1 1
3 48865 juzti 1 2
I suppose you can merge, groupby and then map:
counts = (pd.merge(df2, df1, left_on=['group_id', 'seq'], right_on=['id', 'seq'])
            .groupby('id')['id'].count())
df2['count'] = df2['group_id'].map(counts)
group_id group_name seq count
0 48299 e_sys 1 2
1 50774 Y3N 2 1
2 64865 nana 1 1
3 48865 juzti 1 2
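An alternative sketch without the merge: build a (id, seq) size lookup and reindex it with df2's key pairs (same frames assumed; pairs absent from df1 fall back to 0):
counts = df1.groupby(['id', 'seq']).size()
key = pd.MultiIndex.from_frame(df2[['group_id', 'seq']])
df2['count'] = counts.reindex(key, fill_value=0).to_numpy()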
Here is the first DataFrame:
In: df.head()
Out:
avg_lmp avg_load
read_year read_month trading_block
2017 3 0 24.606666 0.018033
1 32.090800 0.023771
4 0 25.136200 0.017487
1 33.487529 0.023570
5 0 24.085170 0.018008
And here is the second DataFrame that I want to join to the first one based on month, even if it has to repeat values: whether read_year is 2018 or 2019, read_month 3 should get the same value.
In: df2.head()
Out:
fpc
read_month trading_block
1 0 37.501837
1 45.750000
2 0 35.531818
1 41.550000
3 0 28.348427
1 35.900000
4 0 26.250870
1 34.150000
5 0 23.599388
1 34.550000
6 0 25.617027
1 38.670000
7 0 27.531765
1 42.050000
8 0 26.628298
1 40.400000
9 0 25.201923
1 36.500000
10 0 25.299149
1 35.250000
11 0 25.349091
1 34.300000
12 0 28.249623
1 35.500000
Is it clear what I'm asking for?
You seem to have common indexes. Set them, then join:
df = df.reset_index().set_index(['read_month', 'trading_block']).join(df2)
and if you wish:
df.reset_index().set_index(['read_year', 'read_month', 'trading_block'])
Not sure if that is what you're after:
index avg_lmp avg_load fpc
read_year read_month trading_block
2017 3 0 0 24.606666 0.018033 28.348427
1 1 32.090800 0.023771 35.900000
4 0 2 25.136200 0.017487 26.250870
1 3 33.487529 0.023570 34.150000
5 0 4 24.085170 0.018008 23.599388
Maybe try this: just merge the two frames on their shared keys with an outer (i.e., full outer) join:
df.reset_index().merge(df2.reset_index(), on=['read_month', 'trading_block'], how='outer')
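A minimal end-to-end sketch of that merge, assuming both frames carry their keys in the index as shown above; a left join keeps every row of df and repeats the fpc values across years:
out = (df.reset_index()                      # read_year, read_month, trading_block back as columns
         .merge(df2.reset_index(),           # fpc keyed by read_month and trading_block
                on=['read_month', 'trading_block'], how='left')
         .set_index(['read_year', 'read_month', 'trading_block']))
print(out.head())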