Group By Distinct in Pandas - python

I have a script like this in pandas:
dfmi['Time'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S')
dfmi['hours'] = dfmi['Time'].dt.hour
sum_dh = dfmi.groupby(['Date','hours']).agg({'Amount': 'sum', 'Price':'sum'})
dfdhsum = pd.DataFrame(sum_dh)
dfdhsum.columns = ['Amount', 'Gas Sales']
dfdhsum
And the output:
I want a sum-distinct group-by, so that the final result looks like this:
What is the pandas code to do this?

I don't understand exactly what you want, but this will sum hours, Amount and Gas Sales for each Date:
dfmi.groupby("Date").agg({'hours': 'sum', 'Amount': 'sum', 'Gas Sales': 'sum'})
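For reference, here is a minimal runnable sketch with made-up sample data (the column names are assumptions taken from the question) showing what the corrected call produces:
import pandas as pd

# Hypothetical sample data mirroring the question's columns
dfmi = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'hours': [8, 9, 8],
    'Amount': [10, 20, 30],
    'Gas Sales': [100, 200, 300],
})
print(dfmi.groupby('Date').agg({'hours': 'sum', 'Amount': 'sum', 'Gas Sales': 'sum'}))
#             hours  Amount  Gas Sales
# Date
# 2021-01-01     17      30        300
# 2021-01-02      8      30        300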


Different questions about pandas pivot tables

Here's my df:
df = pd.DataFrame(
    {
        'Color': ['red', 'blue', 'red', 'red', 'green', 'red', 'yellow'],
        'Type': ['Oil', 'Aluminium', 'Oil', 'Oil', 'Cement Paint', 'Synthetic Rubber', 'Emulsion'],
        'Finish': ['Satin', 'Matte', 'Matte', 'Satin', 'Semi-gloss', 'Satin', 'Satin'],
        'Use': ['Interior', 'Exterior', 'Interior', 'Interior', 'Exterior', 'Exterior', 'Exterior'],
        'Price': [55, 75, 60, 60, 55, 75, 50],
    }
)
I want to create a pivot table that outputs 'Color', the color count, the percentage (weight) of each color count, and finally a total row showing the total color count next to 100%. Additionally, I'd like to add a header with today's date in the following format (02 - Nov).
Here is my current pivot with my approximate attempt:
import datetime

today = datetime.date.today()
today_format = today.strftime("%d-%b")
pivot_table = pd.pivot_table(
    data=df,
    index='Color',
    aggfunc={'Color': 'count'}
)
df['Color'].value_counts(
    normalize=True
).mul(100).round(1).astype(str) + '%'
Is there a way to add more information to the pivot as a header, a total row, and an extra column? Or should I just convert the pivot back to a DataFrame and edit it from there?
The main difficulty I'm finding is that since I'm handling string data, aggfunc='sum' actually concatenates the strings. And if I try to add margins=True, margins_name='Total count', I get the following error:
if isinstance(aggfunc[k], str):
KeyError: 'Type'
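As a quick illustration of that string behaviour (a minimal sketch): with object dtype, 'sum' falls back to Python's + operator, which concatenates strings instead of adding numbers:
import pandas as pd

# Summing an object (string) column concatenates rather than adds
s = pd.Series(['Oil', 'Aluminium', 'Oil'])
print(s.sum())  # OilAluminiumOil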
The desired table output would look something like this:
Updated Answer
Thanks to a great suggestion by Rabinzel, we can have today's date as a column header as well:
import numpy as np
import pandas as pd
from datetime import datetime

df = (df['Color'].value_counts().reset_index()
        .pivot_table(index=['index'], aggfunc=np.sum, margins=True, margins_name='Total')
        .assign(perc=lambda x: x['Color'] / x.iloc[:-1]['Color'].sum() * 100)
        .rename(columns={'Color': 'Color Count',
                         'perc': '%'}))
new_cols = pd.MultiIndex.from_product([[datetime.today().strftime('%#d-%b')], df.columns])
df.columns = new_cols
df
             2-Nov
       Color Count           %
index
blue             1   14.285714
green            1   14.285714
red              4   57.142857
yellow           1   14.285714
Total            7  100.000000
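One caveat worth flagging: the %#d directive (day without a leading zero) is Windows-only; glibc/BSD platforms use %-d instead. A portable sketch that strips the zero explicitly:
from datetime import datetime

# '%#d' works on Windows and '%-d' on Linux/macOS; lstrip('0') works everywhere
header = datetime.today().strftime('%d-%b').lstrip('0')
print(header)  # e.g. 2-Nov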

Pandas: count identical values in columns but from different index

I have a data frame representing customers' ratings of restaurants. rating_year is the year the rating was made, first_year is the year the restaurant opened, and last_year is the restaurant's last business year.
What I want to do is calculate, for each restaurant, the number of restaurants that opened in the same year, i.e. with the same first_year.
The problem with what I did below is that I group by restaurant_id and first_year and count, but I don't exclude the other rows with the same ids. I don't know the syntax to do this.
Can anyone help?
data = {'rating_id': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
        'user_id': ['56', '13', '56', '99', '99', '13', '12', '88', '45'],
        'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'zzz', 'zzz', 'eee', 'eee'],
        'star_rating': ['2.3', '3.7', '1.2', '5.0', '1.0', '3.2', '1.0', '2.2', '0.2'],
        'rating_year': ['2012', '2012', '2020', '2001', '2020', '2015', '2000', '2003', '2004'],
        'first_year': ['2012', '2012', '2001', '2001', '2012', '2000', '2000', '2001', '2001'],
        'last_year': ['2020', '2020', '2020', '2020', '2020', '2015', '2015', '2020', '2020'],
        }
df = pd.DataFrame(data, columns=['rating_id', 'user_id', 'restaurant_id', 'star_rating',
                                 'rating_year', 'first_year', 'last_year'])
df['star_rating'] = df['star_rating'].astype(float)
df['nb_rating'] = (
df.groupby('restaurant_id')['rating_id'].transform('count')
)
#here
df['nb_opened_sameYear'] = (
df.groupby('restaurant_id')['first_year']
.transform('count')
)
df.head(10)
IIUC, you want to group by first_year and transform with nunique on the restaurant_id column. Try:
df['nb_opened_sameYear'] = (
df.groupby('first_year')['restaurant_id']
.transform('nunique')
)
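On the sample data above, each row then carries the number of distinct restaurants sharing its first_year. A quick check of what the transform produces:
print(df[['restaurant_id', 'first_year', 'nb_opened_sameYear']].drop_duplicates())
#   restaurant_id first_year  nb_opened_sameYear
# 0           xxx       2012                   1
# 2           yyy       2001                   2
# 5           zzz       2000                   1
# 7           eee       2001                   2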

I'm trying to create a new dataframe based on a different dataframe using an if statement

I want to know how I can apply an if statement to a row of a dataframe. All columns consist of strings, like so:
d = {'emp': {'Breakdown': '11/12/2019', 'expl': '123'},
     'emp2': {'Breakdown': '11/03/2020', 'expl': '123'},
     'emp3': {'Breakdown': '31/12/2019', 'expl': '123'},
     'emp4': {'Breakdown': '31/12/2020', 'expl': '123'}}
d1 = pd.DataFrame(d)
So I made it into a dataframe d1, and I want to try to make a new dataframe from the entries that contain '2020'.
I tried this:
df = {}
for t in d:
df = d1[t]
if '2020' in df.get('Breakdown'):
...
I also tried df.loc[: 'Breakdown']. This gives me two values 11/03/2020 and 31/12/2020. So from here I don't really know what to do. I want it to look like this:
new_d = {'emp2' : {'Breakdown' : '11/03/2020', 'expl' : '123'}, 'emp4' : {'Breakdown' : '31/12/2020', 'expl' : '123'}}
new_df = pd.DataFrame(new_d)
Maybe this is a bit above my level of programming but I like to experiment with dataframes. I'm still playing around with the code so if I come up with a solution I'll obviously edit it here.
Thanks in advance.
Use Series.str.contains to create a boolean mask, then use this mask with DataFrame.loc to filter the corresponding columns:
df = d1.loc[:, d1.loc['Breakdown'].str.contains('2020')]
Result:
# print(df)
                 emp2        emp4
Breakdown  11/03/2020  31/12/2020
expl              123         123
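For comparison, if the frame were oriented the more common way (one row per employee), the same filter would be plain boolean row indexing. A small sketch using the transposed frame:
# Transpose so each employee becomes a row, then keep rows whose Breakdown contains '2020'
df_rows = d1.T
print(df_rows[df_rows['Breakdown'].str.contains('2020')])
#        Breakdown expl
# emp2  11/03/2020  123
# emp4  31/12/2020  123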

Pandas MultiIndex with an unrecognised time format - how to convert time and apply calculation

EDIT: Thanks to Scott Boston for advising me on how to post correctly.
I have a dataframe containing clock in/out date and times from work for all employees. Sample df input is below, but the real data set has a year of data for many employees.
Question:
What I would like to do is to calculate the time spent in work for each employee over the year.
df = pd.DataFrame({'name': ['Joe Bloggs', 'Joe Bloggs', 'Joe Bloggs',
                            'Joe Bloggs', 'Jane Doe', 'Jane Doe', 'Jane Doe',
                            'Jane Doe'],
                   'Date': ['2020-06-19', '2020-06-19', '2020-06-18', '2020-06-18', '2020-06-19',
                            '2020-06-19', '2020-06-18', '2020-06-18'],
                   'Time': ["17:30:06", "09:00:00", "17:44:00", "08:34:02", "16:30:06",
                            "10:00:02", "15:45:33", "09:30:33"],
                   'type': ["Logout", "Login", "Logout",
                            "Login", "Logout", "Login",
                            "Logout", "Login"]})
You can do it this way:
#Create a datetime column combining both date and time also create year column
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%Y-%m-%d %H:%M:%S')
df['year'] = df['datetime'].dt.year
#Sort the dataframe by datetime
df = df.sort_values('datetime')
#Number work "sessions" by cumulatively counting Login records per employee
session = (df['type'] == 'Login').groupby(df['name']).cumsum().rename('Session_No')
#Reshape the dataframe to get the login and logout for a session on one row,
#then use diff to calculate the time worked during that session
df_time = df.set_index(['name', 'year', session, 'type'])['datetime']\
.unstack().diff(axis=1).dropna(axis=1, how='all')\
.rename(columns={'Logout':'TimeLoggedIn'})
#Sum on Name and Year
df_time.sum(level=[0,1])
Output:
         name  year TimeLoggedIn
0    Jane Doe  2020     12:45:04
1  Joe Bloggs  2020     17:40:04
Note: @warped's solution works, and works well; however, if you had an employee who worked overnight, I think that code breaks down. This answer should capture cases where an employee works past midnight.
df['Time'] = pd.to_timedelta(df['Time'])
df['Date'] = pd.to_datetime(df['Date'])
df['time_complete'] = df['Time'] + df['Date']
df.groupby(['name', 'Date']).apply(lambda x: (x.sort_values('type', ascending=True)['time_complete'].diff().dropna()))
How it works:
Convert the dates to datetime, to allow grouping.
Convert the times to timedelta, to allow subtraction.
Create a complete timestamp, to incorporate potential night shifts (as spotted by @ScottBoston).
Then, group by date and employee to isolate those.
Each group now corresponds to one employee on a specific date.
The individual groups have three columns: 'type', 'Time' and 'time_complete'.
Sorting each group by 'type' (ascending) puts the Login row before the Logout row.
Then, we take the row-wise difference of 'time_complete' within each sorted group, which gives the time spent between login and logout.
Finally, we drop the null value that arises because the first row of each group has nothing before it to diff against.
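As a quick sanity check on a single group (Jane Doe on 2020-06-18 in the sample df: Login 09:30:33, Logout 15:45:33), the diff should come out to 6 hours 15 minutes:
import pandas as pd

login = pd.Timestamp('2020-06-18 09:30:33')
logout = pd.Timestamp('2020-06-18 15:45:33')
print(logout - login)  # 0 days 06:15:00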

Operations within DataFrameGroupBy

I am trying to understand how to apply a function within a groupby, i.e. to each of the groups of a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Stock': ['apple', 'ford', 'google', 'samsung', 'walmart', 'kroger'],
                   'Sector': ['tech', 'auto', 'tech', 'tech', 'retail', 'retail'],
                   'Price': np.random.randn(6),
                   'Signal': np.random.randn(6)}, columns=['Stock', 'Sector', 'Price', 'Signal'])
dfg = df.groupby(['Sector'], as_index=False)
type(dfg)
pandas.core.groupby.DataFrameGroupBy
I want to get the sum of Price * (1/Signal), grouped by 'Sector'.
I.e. the resulting output should look like this:
Sector |        Value
auto   |     0.744944
retail | -0.572164053
tech   |    -1.454632
I can get the results by creating separate data frames, but I was looking for a way to operate within each of the grouped (Sector) frames.
I can find the mean or sum of Price:
dfg.agg({'Price': [np.mean, np.sum]}).head(2)
but not the sum of Price * (1/Signal), which is what I need.
Thanks,
You provided random data, so there is no way we can get the exact numbers that you got. But based on what you described, I think the following will do:
In [121]:
(df.Price/df.Signal).groupby(df.Sector).sum()
Out[121]:
Sector
auto -1.693373
retail -5.137694
tech -0.984826
dtype: float64
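An equivalent, arguably more readable spelling is to build the ratio as a named column first; same numbers, just a different idiom:
# Build the Price/Signal ratio as a column, then sum it per Sector
(df.assign(Value=df['Price'] / df['Signal'])
   .groupby('Sector', as_index=False)['Value']
   .sum())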
