I'm learning Python/Pandas with a DataFrame having the following structure:
df1 = pd.DataFrame({'unique_id' : [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
'brand' : ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})
print(df1)
unique_id brand
0 1 A
1 1 B
2 2 A
3 2 C
4 2 X
5 3 A
6 3 C
7 3 X
8 3 X
9 3 X
My goal is to make some calculations on the above DataFrame.
Specifically, for each unique_id, I want to:
Count the number of brands without taking brand X into account;
Count only how many times brand 'X' appears.
Visually, using the above example, the resulting DataFrame I'm looking for should look like this:
unique_id count_brands_not_x count_brand_x
0 1 2 0
1 2 2 1
2 3 2 3
I have used the groupby method on simple examples in the past but I don't know how to specify conditions in a groupby to solve this new problem I have. Any help would be appreciated.
You can use GroupBy and merge:
maskx = df1['brand'].eq('X')
d1 = df1[~maskx].groupby('unique_id')['brand'].size().reset_index()
d2 = df1[maskx].groupby('unique_id')['brand'].size().reset_index()
df = d1.merge(d2, on='unique_id', how='outer', suffixes=['_not_x', '_x']).fillna(0)
unique_id brand_not_x brand_x
0 1 2 0.00
1 2 2 1.00
2 3 2 3.00
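If you want the exact column names and integer counts from the question, a small follow-up sketch (assuming the merged frame above is still named df):
df = df.rename(columns={'brand_not_x': 'count_brands_not_x', 'brand_x': 'count_brand_x'})
df['count_brand_x'] = df['count_brand_x'].astype(int)  # fillna(0) left this column as float
print(df)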
You can use pd.crosstab on the True/False mask obtained by comparing brand against the value 'X':
s = df1.brand.eq('X')
df_final = (pd.crosstab(df1.unique_id, s)
.rename({False: 'count_brands_not_x' , True: 'count_brand_x'}, axis=1))
Out[134]:
brand count_brands_not_x count_brand_x
unique_id
1 2 0
2 2 1
3 2 3
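If you prefer a flat DataFrame like the one in the question, a possible follow-up (a sketch, assuming df_final from above):
df_final.columns.name = None        # drop the 'brand' label left over from the crosstab
df_final = df_final.reset_index()   # turn unique_id back into a regular column
print(df_final)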
You can subset the original DataFrame and use the appropriate groupby operations for each calculation. concat joins the results.
import pandas as pd
s = df1.brand.eq('X')
res = (pd.concat([df1[~s].groupby('unique_id').brand.nunique().rename('unique_not_X'),
df1[s].groupby('unique_id').size().rename('count_X')],
axis=1)
.fillna(0))
# unique_not_X count_X
#unique_id
#1 2 0.0
#2 2 1.0
#3 2 3.0
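Note that fillna(0) leaves count_X as floats; if integers are preferred, a possible follow-up on res above (a sketch):
res = res.astype(int).reset_index()  # cast the counts back to int and make unique_id a column
print(res)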
If, instead of the number of unique brands, you just want the number of rows with brands that are not "X", then we can perform a single groupby and unstack the result.
(df1.groupby(['unique_id', df1.brand.eq('X').map({True: 'count_X', False: 'count_not_X'})])
.size().unstack(-1).fillna(0))
#brand count_X count_not_X
#unique_id
#1 0.0 2.0
#2 1.0 2.0
#3 3.0 2.0
I would first create groups and later count the elements in each group.
But maybe there is a better function for counting items in agg().
import pandas as pd
df1 = pd.DataFrame({'unique_id' : [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
'brand' : ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})
g = df1.groupby('unique_id')
df = pd.DataFrame()
df['count_brand_x'] = g['brand'].agg(lambda data: sum(data == 'X'))
df['count_brands_not_x'] = g['brand'].agg(lambda data: sum(data != 'X'))
df = df.reset_index()
print(df)
EDIT: If I already have df['count_brand_x'], then the other count can be derived from the group sizes:
df['count_brands_not_x'] = g['brand'].count() - df['count_brand_x']
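A vectorized variant of the same idea, summing a boolean mask per group instead of calling a Python lambda (a sketch; column names chosen to match the question):
mask = df1['brand'].eq('X')
gx = df1.assign(is_x=mask).groupby('unique_id')['is_x']
df = pd.DataFrame({'count_brand_x': gx.sum(),
                   'count_brands_not_x': gx.count() - gx.sum()}).reset_index()
print(df)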
Related
Suppose there is the following dataframe:
import pandas as pd
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'], 'Value': [1, 2, 3, 4, 5, 6]})
I would like to subtract the values of group A from those of groups B and C and make a new column with the difference. That is, I would like to do something like this:
df[df['Group'] == 'B']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
df[df['Group'] == 'C']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
and place the result in a new column. Is there a way of doing it without a for loop?
Assuming you want to subtract the first A from the first B/C, the second A from the second B/C, etc., the easiest might be to reshape:
df2 = (df
.assign(cnt=df.groupby('Group').cumcount())
.pivot(index='cnt', columns='Group', values='Value')
)
# Group A B C
# cnt
# 0 1 3 5
# 1 2 4 6
df['new_col'] = df2.sub(df2['A'], axis=0).melt()['value']
variant:
df['new_col'] = (df
.assign(cnt=df.groupby('Group').cumcount())
.groupby('cnt', group_keys=False)
.apply(lambda d: d['Value'].sub(d.loc[d['Group'].eq('A'), 'Value'].iloc[0]))
)
output:
Group Value new_col
0 A 1 0
1 A 2 0
2 B 3 2
3 B 4 2
4 C 5 4
5 C 6 4
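Another possible variant that skips the reshape: compute the position of each row within its group and subtract the matching A value by position (a sketch, assuming rows within each group are ordered consistently and no group is longer than group A):
cnt = df.groupby('Group').cumcount()                       # position within each group
a_vals = df.loc[df['Group'].eq('A'), 'Value'].to_numpy()   # A's values in order
df['new_col'] = df['Value'] - a_vals[cnt]                  # subtract the positionally matching A value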
I know this is on SO somewhere but I can't seem to find it. I want to subset a df on a specific value and also include the rows for the subsequent unique value. Using the code below, I can return values equal to A, but I'm hoping to also return the next unique value, which is B.
Note: the subsequent unique value may not be B and may have a varying number of rows, so I need a function that finds and returns all subsequent unique values.
import pandas as pd
df = pd.DataFrame({
'Time' : [1,1,1,1,1,1,2,2,2,2,2,2],
'ID' : ['A','A','B','B','C','C','A','A','B','B','C','C'],
'Val' : [2.0,5.0,2.5,2.0,2.0,1.0,1.0,6.0,4.0,2.0,5.0,1.0],
})
df = df[df['ID'] == 'A']
intended output:
Time ID Val
0 1 A 2.0
1 1 A 5.0
2 1 B 2.5
3 1 B 2.0
4 2 A 1.0
5 2 A 6.0
6 2 B 4.0
7 2 B 2.0
OK OP, let me do this again: you want to find all the rows which are "A" (the base condition) and all the rows which follow an "A" row at some point, right?
Then,
is_A = df["ID"] == "A"
not_A_follows_from_A = (df["ID"] != "A") & (df["ID"].shift() == "A")
candidates = df["ID"].loc[is_A | not_A_follows_from_A].unique()
df.loc[df["ID"].isin(candidates)]
Should work as intended.
Edit: example
df = pd.DataFrame({
'Time': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
'ID': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'A', 'E', 'E', 'E', 'A', 'F'],
'Val': [7, 2, 7, 5, 1, 6, 7, 3, 2, 4, 7, 8, 2]})
is_A = df["ID"] == "A"
not_A_follows_from_A = (df["ID"] != "A") & (df["ID"].shift() == "A")
candidates = df["ID"].loc[is_A | not_A_follows_from_A].unique()
df.loc[df["ID"].isin(candidates)]
outputs this:
Time ID Val
0 1 A 7
1 1 A 2
2 1 B 7
3 0 B 5
7 1 A 3
8 0 E 2
9 0 E 4
10 1 E 7
11 1 A 8
12 1 F 2
Let us try drop_duplicates, then a groupby where we use head to select the number of unique IDs we would like to keep, and finally merge back:
out = df.merge(df[['Time','ID']].drop_duplicates().groupby('Time').head(2))
Time ID Val
0 1 A 2.0
1 1 A 5.0
2 1 B 2.5
3 1 B 2.0
4 2 A 1.0
5 2 A 6.0
6 2 B 4.0
7 2 B 2.0
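If the number of subsequent unique IDs to keep varies, the same pattern works by changing the head argument; a hedged sketch keeping the first n unique IDs per Time (n is an illustrative parameter, and df is the original frame from the question, before the == 'A' subset):
n = 2  # illustrative: how many unique IDs to keep per Time group
keys = df[['Time', 'ID']].drop_duplicates().groupby('Time').head(n)
out = df.merge(keys)
print(out)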
Say I have two different columns within a large transportation dataset, one with a trip id and another with a user id. How can I count the number of times two people have ridden on the same trip together, i.e. different user_id but same trip_id?
df = pd.DataFrame([[1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5], ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'B', 'C', 'D', 'D','A']]).T
df.columns = ['trip_id', 'user_id']
print(df)
trip_id user_id
0 1 A
1 1 B
2 1 C
3 2 A
4 2 B
5 3 A
6 3 B
7 4 B
8 4 C
9 4 D
10 5 D
11 5 A
The ideal output would be a sort of aggregated pivot table or crosstab that displays each user_id and its count of trips with every other user_id, so as to see who has the highest counts of trips together.
I tried something like this:
df5 = pd.crosstab(index=df['trip_id'], columns=df['user_id'])
df5['sum'] = df5[df5.columns].sum(axis=1)
df5
user_id A B C D sum
trip_id
1 1 1 1 0 3
2 1 1 0 0 2
3 1 1 0 0 2
4 0 1 1 1 3
5 1 0 0 1 2
which I can use to get the average users per trip, but not the frequency of unique user_ids riding together on a trip.
I also tried some variations with this:
df.trip_id = df.trip_id+'_'+df.groupby(['user_id','trip_id']).cumcount().add(1).astype(str)
df.pivot('trip_id','user_id')
but I'm not getting what I want. I'm not sure if I need to approach this by iterating with a for loop or if I'll need to stack the DataFrame from a crosstab to get those aggregate values. Also, I'm trying to avoid having trip_id and user_id in the original data treated as numerical datatypes, since they should be handled as strings rather than ints.
Thank you for any insight you may be able to provide!
Here is an example dataset
import pandas as pd
df = pd.DataFrame([[1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3], ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B']]).T
df.columns = ['trip_id', 'user_id']
print(df)
Gives:
trip_id user_id
0 1 A
1 1 B
2 1 C
3 2 A
4 2 B
5 2 C
6 3 A
7 3 B
8 3 C
9 3 A
10 3 B
I think what you're asking for is:
df.groupby(['trip_id', 'user_id']).size()
trip_id user_id
1 A 1
B 1
C 1
2 A 1
B 1
C 1
3 A 2
B 2
C 1
dtype: int64
Am I correct?
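If the goal is the pairwise count of shared trips rather than per-trip group sizes, one hedged sketch multiplies the trip-by-user indicator table (the crosstab idea from the question) by its transpose; variable names here are illustrative:
m = pd.crosstab(df['trip_id'], df['user_id']).clip(upper=1)  # 1 if the user was on the trip, else 0
pairs = m.T.dot(m)                                           # users x users: number of shared trips
# the diagonal holds each user's own trip count; off-diagonal cells count trips shared by a pair
print(pairs)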
I want to reshape the data by Date in Python as a DataFrame.
Required:
Is there any Pandas function for this?
Create an additional key by using cumcount, then pivot (data taken from jpp's answer):
df.assign(key=df.groupby('Col1').cumcount()).pivot(index='key', columns='Col1', values='Col2')
Out[29]:
Col1 A B C
key
0 1.0 4.0 6.0
1 2.0 5.0 7.0
2 3.0 NaN 8.0
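A possible variant of the same reshape that builds a MultiIndex from the cumcount and then unstacks (a sketch on the same Col1/Col2 data):
out = df.set_index([df.groupby('Col1').cumcount(), 'Col1'])['Col2'].unstack()
print(out)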
One way is to use pandas.concat on series derived from unique values in your key column.
Here is a minimal example.
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
'Col2': [1, 2, 3, 4, 5, 6, 7, 8]})
res = pd.concat({k: df.loc[df['Col1']==k, 'Col2'].reset_index(drop=True)
for k in df['Col1'].unique()}, axis=1)
print(res)
A B C
0 1 4.0 6
1 2 5.0 7
2 3 NaN 8
I have a data frame of the following format
user category
1 A
1 B
1 A
2 B
3 B
2 B
Now I am trying to count how many unique users are in each category and how many are in both categories. So for the above table the result would be A = 1, B = 3 and A&B = 1.
The following code gives me counts per category:
df.groupby('category').count()
But this is not exactly what I am looking for. Any help or clue would be appreciated.
Use groupby + size and unstack for pivoting, then use count to get the number of users per category; for the size of the intersection, add dropna and take the length:
df1 = df.groupby(['user','category']).size().unstack()
print (df1)
category A B
user
1 2.0 1.0
2 NaN 2.0
3 NaN 1.0
print (df1.count())
A 1
B 3
dtype: int64
print (len(df1.dropna()))
Or:
print (df1.notnull().all(axis=1).sum())
1
If you need the list of users present in all categories:
print (df1.dropna().index.tolist())
[1]
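Putting the pieces together into one small summary dict (a sketch based on the unstacked df1 above):
summary = df1.count().to_dict()                         # users per category
summary['A&B'] = int(df1.notnull().all(axis=1).sum())   # users present in every category
print(summary)   # {'A': 1, 'B': 3, 'A&B': 1}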
Here is one way. The output is in dictionary format. Intersections are denoted by a tuple key.
import pandas as pd
import itertools
df = pd.DataFrame([[1, 'A'], [1, 'B'], [1, 'A'], [2, 'B'], [3, 'B'], [2, 'B']], columns=['user', 'category'])
result = df.groupby('category')['user'].agg(lambda x: set(x)).to_dict()
for i, j in itertools.combinations(result, 2):
    result[(i, j)] = result[i] & result[j]
result = {k: len(v) for k, v in result.items()}
print(result)
# output
# {'A': 1, 'B': 3, ('A', 'B'): 1}
Without groupby, by using crosstab:
pd.crosstab(df.user,df.category)
Out[604]:
category A B
user
1 2 1
2 0 2
3 0 1
import numpy as np
pd.crosstab(df.user,df.category).replace(0,np.nan).count()
Out[612]:
category
A 1
B 3
dtype: int64
pd.crosstab(df.user,df.category).replace(0,np.nan).count().min()
Out[613]: 1
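A small variant of the same idea that avoids the numpy import, counting nonzero cells directly (a sketch):
ct = pd.crosstab(df.user, df.category)
print(ct.ne(0).sum())               # users per category
print(ct.ne(0).all(axis=1).sum())   # users present in every category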