I have a DF as follows:
Date Bought | Fruit
2018-01     | Apple
2018-02     | Orange
2018-02     | Orange
2018-02     | Lemon
I wish to group the data by 'Date Bought' & 'Fruit' and count the occurrences.
Expected result:
Date Bought | Fruit  | Count
2018-01     | Apple  | 1
2018-02     | Orange | 2
2018-02     | Lemon  | 1
What I get:
Date Bought | Fruit  | Count
2018-01     | Apple  | 1
2018-02     | Orange | 2
            | Lemon  | 1
Code used:
Initial attempt:
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count')
#2
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count').reset_index()
ERROR: Cannot insert Fruit, already exists
#3
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count').reset_index(inplace=True)
ERROR: TypeError: Cannot reset_index inplace on a Series to create a DataFrame
Documentation shows that the groupby function returns a 'groupby object' not a standard DF. How can I group the data as mentioned above and retain the DF format?
The problem here is that resetting the index would leave you with two columns of the same name, Fruit. Because you are working with a Series, you can instead set the name parameter in Series.reset_index:
df1 = (df.groupby(['Date Bought','Fruit'], sort=False)['Fruit']
.agg('count')
.reset_index(name='Count'))
print (df1)
Date Bought Fruit Count
0 2018-01 Apple 1
1 2018-02 Orange 2
2 2018-02 Lemon 1
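For completeness, GroupBy.size gives the same counts without the column-name collision (a sketch, not from the original answer):

import pandas as pd

df = pd.DataFrame({'Date Bought': ['2018-01', '2018-02', '2018-02', '2018-02'],
                   'Fruit': ['Apple', 'Orange', 'Orange', 'Lemon']})

# size() counts rows per group, so nothing clashes with the index names on reset
df1 = (df.groupby(['Date Bought', 'Fruit'], sort=False)
         .size()
         .reset_index(name='Count'))
print (df1)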
Input Data:
sn | fruits | Quality | Date
1  | Apple  | A       | 2022-09-01
2  | Apple  | A       | 2022-08-15
3  | Apple  | A       | 2022-07-15
4  | Apple  | B       | 2022-06-01
5  | Apple  | A       | 2022-05-15
6  | Apple  | A       | 2022-04-15
7  | Banana | A       | 2022-08-15
8  | Orange | A       | 2022-08-15
Get the average date diff for each type of fruit, but only over quality A rows that form consecutive pairs with quality A.
If there are three consecutive rows of quality A, only the first two make a valid pair; the third is not part of a valid pair because the 4th record has quality B.
So in the data above we have 2 valid pairs for Apple: the 1st pair = (1,2) with a 15-day date diff, and the 2nd pair = (5,6) with a 15-day diff, so the average for Apple is 15 days.
Expected output:
fruits | avg time diff
Apple  | 15 days
Banana | null
Orange | null
How can I do this without any looping over the pandas DataFrame?
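One loop-free sketch of the pairing rule described above (an illustration, not a definitive answer: it assumes the rows are already ordered by sn within each fruit, and the day counts fall out of whatever the sample dates actually are):

import pandas as pd

df = pd.DataFrame({
    'sn': range(1, 9),
    'fruits': ['Apple'] * 6 + ['Banana', 'Orange'],
    'Quality': ['A', 'A', 'A', 'B', 'A', 'A', 'A', 'A'],
    'Date': pd.to_datetime(['2022-09-01', '2022-08-15', '2022-07-15',
                            '2022-06-01', '2022-05-15', '2022-04-15',
                            '2022-08-15', '2022-08-15']),
})

is_a = df['Quality'].eq('A')
# a new run id starts whenever Quality changes, so consecutive A rows share one id
run = is_a.ne(is_a.shift()).cumsum()
a_rows = df[is_a]
keys = [a_rows['fruits'], run[is_a]]
# position inside each consecutive-A block, and date diff to the previous row in it
pos = a_rows.groupby(keys).cumcount()
diff = a_rows.groupby(keys)['Date'].diff().abs()
# rows at odd positions close a non-overlapping pair: (0,1), (2,3), ...
pairs = diff[pos % 2 == 1]
out = (pairs.groupby(a_rows['fruits']).mean()
            .reindex(df['fruits'].unique())
            .rename('avg time diff'))
print (out)

Fruits with no valid pair (Banana, Orange) come out as NaT after the reindex, which plays the role of the null in the expected output.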
I have a dataframe with the below example
Type   | date
Apple  | 01/01/2021
Apple  | 10/02/2021
Orange | 05/01/2021
Orange | 20/20/2020
Is there an easy way to transform the data as below?
Type   | Date
Apple  | 01/01/2021 | 10/02/2021
Orange | 05/01/2021 | 20/20/2020
The stack function does not match my requirement.
You could group by "type", collect the "date" values and make a new dataframe.
df = pd.DataFrame({'type': ['Apple', 'Apple', 'Orange', 'Orange'],
                   'date': ['01/01/2021', '10/02/2021', '05/01/2021', '20/20/2020']})
d = {}
for fruit, group in df.groupby('type'):
    d[fruit] = group.date.values

pd.DataFrame(d).T

                 0           1
Apple   01/01/2021  10/02/2021
Orange  05/01/2021  20/20/2020
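If you would rather avoid the explicit loop, the same reshape can be written in one chain (a sketch reusing the df built above):

# number the dates within each type, then pivot those positions into columns
out = (df.assign(n=df.groupby('type').cumcount())
         .pivot(index='type', columns='n', values='date'))
print (out)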
This is part 2 question of the original problem.
Is there a way to find IDs that have both Apple and Strawberry and count them, as well as IDs that have only Apple and IDs that have only Strawberry, broken down by location?
df:
ID Fruit Location
0 ABC Apple NY <-ABC has Apple and Strawberry
1 ABC Strawberry NY <-ABC has Apple and Strawberry
2 EFG Apple LA <-EFG has Apple only
3 XYZ Apple HOUSTON <-XYZ has Apple and Strawberry
4 XYZ Strawberry HOUSTON <-XYZ has Apple and Strawberry
5 CDF Strawberry BOSTON <-CDF has Strawberry
6 AAA Apple CHICAGO <-AAA has Apple only
Desired output:
IDs that have Apple and Strawberry:
NY         1
HOUSTON    1
IDs that have Apple only:
LA         1
CHICAGO    1
IDs that have Strawberry only:
BOSTON     1
The previous code was:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2
I tried the following, but it did not work and gave me the same result:
v = ['Apple','Strawberry']
out = df.groupby('ID', 'LOCATION')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2
Thanks!
An inefficient solution using groupby and agg:
x = df.groupby('ID').agg({'Fruit': lambda x: tuple(x), 'Location': 'first'})
y = x.groupby('Fruit')['Location'].value_counts()
y:
Fruit Location
(Apple,) CHICAGO 1
LA 1
(Apple, Strawberry) HOUSTON 1
NY 1
(Strawberry,) BOSTON 1
Name: Location, dtype: int64
for index in set(y.index.get_level_values(0)):
    if len(index) == 2:
        print(f"IDs that have {index[0]} and {index[1]}:")
        print(y.loc[index].to_string())
    else:
        print(f"IDs that have {index[0]} only:")
        print(y.loc[index].to_string())
IDs that have Apple only:
Location
CHICAGO    1
LA         1
IDs that have Apple and Strawberry:
Location
HOUSTON    1
NY         1
IDs that have Strawberry only:
Location
BOSTON     1
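A slightly more direct variant (a sketch; the combination label is just a joined string rather than the exact headings above):

# one row per ID with its fruit set and its (single) location
per_id = df.groupby('ID').agg(Fruit=('Fruit', lambda s: frozenset(s)),
                              Location=('Location', 'first'))
combo = per_id['Fruit'].map(lambda s: ' and '.join(sorted(s)))
print(per_id.groupby(combo)['Location'].value_counts())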
df
type content
1 task buy xbox
2 task buy fruit from supermarket
3 note orange with squash\buy if cheap
4 note apple
5 task buy sunglasses
Each note refers to the task directly above it. How could I manipulate the df to get the following df?
Expected Output:
   task                        comment1                         comment2
1  buy xbox
2  buy fruit from supermarket  orange with squash\buy if cheap  apple
3  buy sunglasses
...
Use a helper Series to identify the group for each task by comparing the value and taking the cumulative sum, get a counter with GroupBy.cumcount, and reshape with DataFrame.set_index and Series.unstack:
s = df['type'].eq('task').cumsum()
g = df.groupby(s).cumcount()
df1 = (df.set_index([s, g])['content']
.unstack(fill_value='')
.add_prefix('comment')
.rename(columns={'comment0':'task'})
.reset_index(drop=True))
print (df1)
                         task                         comment1 comment2
0                    buy xbox
1  buy fruit from supermarket  orange with squash\buy if cheap    apple
2              buy sunglasses
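It may help to look at the two helper series side by side (a quick illustration reusing s and g from the code above):

# s marks which task each row belongs to; g numbers the rows within that task
print(pd.concat([df['type'], s.rename('s'), g.rename('g')], axis=1))

   type  s  g
1  task  1  0
2  task  2  0
3  note  2  1
4  note  2  2
5  task  3  0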
In a pandas DataFrame df I have columns like this:
NAME KEYWORD AMOUNT INFO
0 orange fruit 13 from italy
1 potato veggie 7 from germany
2 potato veggie 9 from germany
3 orange fruit 8 from italy
4 potato veggie 6 from germany
With a groupby on KEYWORD I want to sum the AMOUNT values per group and keep the first value from each of the other columns, so that the result reads:
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
I tried
df.groupby('KEYWORD').sum()
but this "summarises" over all columns, i.e. I get
NAME KEYWORD AMOUNT INFO
0 orangeorange fruit 21 from italyfrom italy
1 potatopotatopotato veggie 22 from germanyfrom germanyfrom germany
Then I tried to use different functions for different columns:
df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ....})
with
def first(f_arg, *args):
    return f_arg
But this unfortunately gives me a "ValueError: function does not reduce" error.
So I am a bit at a loss. How can I apply sum only to the AMOUNT column, while keeping the others?
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = sum
df = df.groupby('KEYWORD', as_index=False).agg(f)
df
KEYWORD NAME AMOUNT INFO
0 fruit orange 21 from italy
1 veggie potato 22 from germany
dict.fromkeys gives me a nice way of generalising this to any number of columns. If column order matters, add a reindex operation at the end:
df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
Use drop_duplicates on column KEYWORD and then assign the aggregated values:
df = (df.drop_duplicates('KEYWORD')
        .assign(AMOUNT=df.groupby('KEYWORD')['AMOUNT'].sum().values))
print (df)
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
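For completeness, starting from the original df, the same result can also be written with named aggregation (a sketch; assumes pandas 0.25 or later):

df1 = (df.groupby('KEYWORD', as_index=False)
         .agg(NAME=('NAME', 'first'),
              AMOUNT=('AMOUNT', 'sum'),
              INFO=('INFO', 'first'))
         [['NAME', 'KEYWORD', 'AMOUNT', 'INFO']])  # restore the original column order
print (df1)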