I want to group my dataframe by UserId, Date, and category, computing the frequency of use per day, the max duration per category, and the part of the day when each category is most used, and finally store the result in a .csv file.
name duration UserId category part_of_day Date
Settings 3.436 1 System tool evening 2020-09-10
Calendar 2.167 1 Calendar night 2020-09-11
Calendar 5.705 1 Calendar night 2020-09-11
Messages 7.907 1 Phone_and_SMS night 2020-09-11
Instagram 50.285 9 Social night 2020-09-28
Drive 30.260 9 Productivity night 2020-09-28
df.groupby(["UserId", "Date","category"])["category"].count()
My code's result is:
UserId Date category
1 2020-09-10 System tool 1
2020-09-11 Calendar 8
Clock 2
Communication 86
Health & Fitness 5
But I want this result:
UserId Date category count(category) max-duration
1 2020-09-10 System tool 1 3
2020-09-11 Calendar 2 5
2 2020-09-28 Social 1 50
Productivity 1 30
How can I do that? I cannot find a solution that gives the wanted result.
Use agg with a dict mapping each column to its aggregation (the question asks for the category count and the maximum duration):
df.groupby(["UserId", "Date", "category"]).agg({'category': 'count', 'duration': 'max'})
You can also replace 'max' with lambda x: x.max() if you prefer an explicit function.
Data
import pandas as pd

df = pd.DataFrame({'name ': {0: 'Settings', 1: 'Calendar', 2: 'Calendar', 3: 'Messages', 4: 'Instagram', 5: 'Drive'},
                   ' duration': {0: 3.4360000000000004, 1: 2.167, 2: 5.705, 3: 7.907, 4: 50.285, 5: 30.26},
                   ' UserId': {0: 1, 1: 1, 2: 1, 3: 1, 4: 9, 5: 9},
                   ' category': {0: ' System tool', 1: ' Calendar', 2: ' Calendar', 3: ' Phone_and_SMS', 4: ' Social', 5: ' Productivity'},
                   ' part_of_day': {0: ' evening', 1: ' night ', 2: ' night ', 3: 'night ', 4: ' night ', 5: ' night '},
                   ' Date': {0: ' 2020-09-10', 1: ' 2020-09-11', 2: ' 2020-09-11', 3: ' 2020-09-11', 4: ' 2020-09-28', 5: ' 2020-09-28'}})
# the raw column names carry stray whitespace, so strip them first
df.columns = df.columns.str.strip()
df:
name duration UserId category part_of_day Date
0 Settings 3.436 1 System tool evening 2020-09-10
1 Calendar 2.167 1 Calendar night 2020-09-11
2 Calendar 5.705 1 Calendar night 2020-09-11
3 Messages 7.907 1 Phone_and_SMS night 2020-09-11
4 Instagram 50.285 9 Social night 2020-09-28
5 Drive 30.260 9 Productivity night 2020-09-28
grouping = (df.groupby(["UserId", "Date", "category"])
              .agg({"category": 'count', 'duration': 'max'})
              .rename(columns={"duration": "max-duration"}))
grouping:
category max-duration
UserId Date category
1 2020-09-10 System tool 1 3.436
2020-09-11 Calendar 2 5.705
Phone_and_SMS 1 7.907
9 2020-09-28 Productivity 1 30.260
Social 1 50.285
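The question also asks for the part of the day with the most use and for the result to land in a .csv file, which neither snippet above covers. A minimal sketch extending the same groupby (pandas >= 0.25 named aggregation; the modal part_of_day and the filename "app_usage_summary.csv" are my assumptions, not taken from the original answers):
result = (df.groupby(["UserId", "Date", "category"])
            .agg(count_category=("category", "count"),
                 max_duration=("duration", "max"),
                 top_part_of_day=("part_of_day", lambda s: s.mode().iat[0]))  # most frequent slot per group
            .reset_index())
result.to_csv("app_usage_summary.csv", index=False)  # hypothetical filename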
You can take advantage of pandas.DataFrame.groupby and pandas.DataFrame.aggregate in the following format to generate your desired output in one line:
code:
import pandas as pd
df = pd.DataFrame({'name': ['Settings','Calendar','Calendar', 'Messages', 'Instagram', 'Drive'],
'duration': [3.436, 2.167, 5.7050, 7.907, 50.285, 30.260],
'UserId': [1, 1, 1, 1, 2, 2],
'category' : ['System_tool', 'Calendar', 'Calendar', 'Phone_and_SMS', 'Social', 'Productivity'],
'part_of_day' : ['evening', 'night','night','night','night','night' ],
'Date' : ['2020-09-10', '2020-09-11', '2020-09-11', '2020-09-11', '2020-09-28', '2020-09-28'] })
df.groupby(['UserId', 'Date', 'category']).aggregate(count_cat=('category', 'count'), max_duration=('duration', 'max'))
out:
                                 count_cat  max_duration
UserId Date       category
1      2020-09-10 System_tool            1         3.436
       2020-09-11 Calendar               2         5.705
                  Phone_and_SMS          1         7.907
2      2020-09-28 Productivity           1        30.260
                  Social                 1        50.285
Say I have a dataframe
data_dict = {'Number': {0: 1, 1: 2, 2: 3}, 'mw link': {0: 'SAM3703_2SAM3944 2', 1: 'SAM3720_2SAM4115 2', 2: 'SAM3729_2SAM4121_ 2'}, 'site_a': {0: 'SAM3703', 1: 'SAM3720', 2: 'SAM3729'}, 'name_a': {0: 'Chelak', 1: 'KattakurganATC', 2: 'Payariq'}, 'site_b': {0: 'SAM3944', 1: 'SAM4115', 2: 'SAM4121'}, 'name_b': {0: 'Turkibolo', 1: 'Kattagurgon Sement Zavod', 2: 'Payariq Dehgonobod'}, 'distance km': {0: 3.618, 1: 7.507, 2: 9.478}, 'manufacture': {0: 'ZTE NR 8150/8250', 1: 'ZTE NR 8150/8250', 2: 'ZTE NR 8150/8250'}}
df = pd.DataFrame(data_dict)
There are these two columns, site_a and site_b, which I want to melt into rows, but applying a simple melt gives the output in series; I want them in an alternating fashion.
Expected output:
Number mw link distance km manufacture variable value
0 1 SAM3703_2SAM3944 2 3.618 ZTE NR 8150/8250 site_a SAM3703
1 1 SAM3703_2SAM3944 2 3.618 ZTE NR 8150/8250 site_b SAM3944
2 2 SAM3720_2SAM4115 2 7.507 ZTE NR 8150/8250 site_a SAM3720
3 2 SAM3720_2SAM4115 2 7.507 ZTE NR 8150/8250 site_b SAM4115
4 3 SAM3729_2SAM4121_ 2 9.478 ZTE NR 8150/8250 site_a SAM3729
5 3 SAM3729_2SAM4121_ 2 9.478 ZTE NR 8150/8250 site_b SAM4121
My Solution:
This is what I have tried:
df1 = pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b'])
which gives me all the site_a rows first and then all the site_b rows:
   Number              mw link  distance km       manufacture variable    value
0       1   SAM3703_2SAM3944 2        3.618  ZTE NR 8150/8250   site_a  SAM3703
1       2   SAM3720_2SAM4115 2        7.507  ZTE NR 8150/8250   site_a  SAM3720
2       3  SAM3729_2SAM4121_ 2        9.478  ZTE NR 8150/8250   site_a  SAM3729
3       1   SAM3703_2SAM3944 2        3.618  ZTE NR 8150/8250   site_b  SAM3944
4       2   SAM3720_2SAM4115 2        7.507  ZTE NR 8150/8250   site_b  SAM4115
5       3  SAM3729_2SAM4121_ 2        9.478  ZTE NR 8150/8250   site_b  SAM4121
You just add sort_values(['Number', 'variable']):
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['Number', 'variable'])
Alternatives:
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['mw link', 'variable'])
Or:
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['distance km', 'variable'])
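An equivalent route via stack, sketched here as an alternative: setting the id columns as the index and stacking site_a/site_b yields the rows already interleaved per Number, so no sort is needed (the rename just mirrors melt's default variable/value names; 'level_4' is the auto-generated name for the stacked level):
out = (df.set_index(['Number', 'mw link', 'distance km', 'manufacture'])[['site_a', 'site_b']]
         .stack()
         .reset_index()
         .rename(columns={'level_4': 'variable', 0: 'value'}))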
I was wondering if someone could help me find a more efficient way to run my code.
I have a dataset that contains 7 columns: country, sector, year, month, week, weekday, value.
The year column has only 3 values: 2019, 2020, 2021.
What I have to do here is subtract every value in 2020 and 2021 from the matching 2019 value.
It is more complicated in that I also need to match the weekday column.
For example, I need to take the year 2020, month 1, week 1, weekday 0 (Monday) value and subtract the year 2019, month 1, week 1, weekday 0 (Monday) value; if no match can be found, it is skipped. In other words, the weekday (Monday, Tuesday, ...) must match.
Here is my code. It runs, but it took me hours :(
for i in itertools.product(year_list, country_list, sector_list, month_list, week_list, weekday_list):
    try:
        data_2 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == i[0])
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        data_1 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == 2019)
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        co2.append(data_2 - data_1)
        country.append(i[1])
        sector.append(i[2])
        year.append(i[0])
        month.append(i[3])
        week.append(i[4])
        weekday.append(i[5])
    except:
        pass
I changed the for loops to itertools, but it is still not fast enough. Any other ideas?
Many thanks :)
##############################
Here is the sample dataset:
country co2 sector date week weekday year month
Brazil 108.767782 Power 2019-01-01 0 1 2019 1
China 14251.044482 Power 2019-01-01 0 1 2019 1
EU27 & UK 1886.493814 Power 2019-01-01 0 1 2019 1
France 53.856398 Power 2019-01-01 0 1 2019 1
Germany 378.323440 Power 2019-01-01 0 1 2019 1
Japan 21.898788 IA 2021-11-30 48 1 2021 11
Russia 19.773822 IA 2021-11-30 48 1 2021 11
Spain 42.293944 IA 2021-11-30 48 1 2021 11
UK 56.425121 IA 2021-11-30 48 1 2021 11
US 166.425000 IA 2021-11-30 48 1 2021 11
Or this:
import pandas as pd
pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['Brazil', 'Brazil', 'Brazil'],
'sector': ['power', 'power', 'power'],
'month': [1, 1, 1],
'week': [0,0,0],
'weekday': [0,0,0]
})
pandas can subtract two DataFrames index-by-index, so the idea is to separate your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday'] as their indices, subtract them, and remove the rows (with dropna) where no match in year 2019 is found.
df_carbon = pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['ab', 'ab', 'bc']
})
index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']
df_2019 = df_carbon[df_carbon['year']==2019].set_index(index)
df_rest = df_carbon[df_carbon['year']!=2019].set_index(index)
ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] += 2019
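For the toy df_carbon above, this sketch produces the following; year comes back as a float because the unmatched 'bc' row passes through the subtraction as NaN before dropna removes it:
print(ans)
#   country    year  co2
# 0      ab  2020.0  1.0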
Two additional points:
The subtraction also covers the year column, which is why I add 2019 back at the end.
I created a small example of df_carbon to test my code. If you had provided a more realistic version in text form, I would have tested my code against your data.
dataset example:
experience
5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year
I have a column with applicants' experience in years. It is very messy, so I went through it and created the sample above. The entries are numbers followed by month/months or year/years.
There are many nan entries and they should be ignored.
The goal is to create a column experience in months:
if nan:
    copy nan to the corresponding column
if the row has month or months:
    copy the number to the corresponding column
if the row has year or years and the number < 55:
    multiply the number by 12 and copy it to the corresponding column
else:
    copy nan to the corresponding column
How to achieve this?
import numpy as np
import pandas as pd

my_dict = {'Experience': ['5 month', 'nan', '1 months', '8 month', '17 months', '8 year',
                          '11 years', '1.7 year', '3.1 years', '15.7 months', '18 year',
                          '2017.2 years', '98.3 years', '68 year']}
df = pd.DataFrame(my_dict)
# Create filter for month/months
month_filt = df['Experience'].str.contains('month')
# Keep the number for rows that contain month/months
# (str.strip removes any of the listed characters from both ends, which peels off the unit here)
df['Months'] = df.loc[month_filt, 'Experience'].str.strip('month|months')
# Create filter for year/years
year_filt = df['Experience'].str.contains('year')
# Keep the number for rows that contain year/years
df['Years'] = df.loc[year_filt, 'Experience'].str.strip('year|years')
# Convert Years to months where Months is still missing
df.loc[df['Months'].isna(), 'Months'] = df['Years'].astype('float') * 12
# Set years of 55 or more to NaN, per the requirements
df.loc[df['Years'].astype('float') >= 55, 'Months'] = np.nan
      Experience Months   Years
0        5 month      5     NaN
1            nan    NaN     NaN
2       1 months      1     NaN
3        8 month      8     NaN
4      17 months     17     NaN
5         8 year     96       8
6       11 years    132      11
7       1.7 year   20.4     1.7
8      3.1 years   37.2     3.1
9    15.7 months   15.7     NaN
10       18 year    216      18
11  2017.2 years    NaN  2017.2
12    98.3 years    NaN    98.3
13       68 year    NaN      68
A simple solution using regular expressions, keeping the workings for transparency.
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""experience
5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year"""))
df = df.assign(unit=lambda dfa: dfa["experience"].str.extract(r"([a-z]+)", expand=False),
               val=lambda dfa: dfa["experience"].str.extract(r"([0-9.]+)", expand=False).astype(float),
               months=lambda dfa: np.where(dfa["unit"].isin(["month", "months"]), dfa["val"],
                                           np.where(dfa["unit"].isin(["year", "years"])
                                                    & dfa["val"].lt(55), dfa["val"] * 12, np.nan)))
print(df.to_string(index=False))
output
experience unit val months
5 month month 5.0 5.0
NaN NaN NaN NaN
1 months months 1.0 1.0
8 month month 8.0 8.0
17 months months 17.0 17.0
8 year year 8.0 96.0
11 years years 11.0 132.0
1.7 year year 1.7 20.4
3.1 years years 3.1 37.2
15.7 months months 15.7 15.7
18 year year 18.0 216.0
2017.2 years years 2017.2 NaN
98.3 years years 98.3 NaN
68 year year 68.0 NaN
This assumes the formatting is consistent (value, space, time period). You can use split to get the two parts.
import numpy as np
import pandas as pd

df = pd.DataFrame({'experience': ['5 month', np.nan, '1 months', '8 month', '17 months', '8 year', '11 years']})

def get_values(x):
    if pd.notnull(x):
        # parse as float so fractional values like '1.7 year' also work
        val = float(x.split(' ')[0])
        prd = x.split(' ')[1]
        if prd in ['month', 'months']:
            return val
        elif prd in ['year', 'years'] and val < 55:
            return val * 12
        # years >= 55 (or an unknown unit) become NaN, per the requirements
        return np.nan
    return x

df['months'] = df.apply(lambda x: get_values(x.experience), axis=1)
Output:
experience months
0 5 month 5.0
1 NaN NaN
2 1 months 1.0
3 8 month 8.0
4 17 months 17.0
5 8 year 96.0
6 11 years 132.0
If a high percentage of the rows are NaN, you can filter first before running the lambda function:
df[df.experience.notnull()].apply(lambda x: get_values(x.experience), axis=1)
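Note that the filtered apply returns a Series indexed only by the surviving rows; a sketch of assigning it back, which works because pandas aligns on the index and fills the skipped rows with NaN:
df['months'] = df[df.experience.notnull()].apply(lambda x: get_values(x.experience), axis=1)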
There's probably a nicer way to do this using pandas DataFrames, but is this what you are trying to achieve? You can probably reuse the regexes if nothing else. I've not added the condition for < 55 years, but I'm sure you can work that out.
import re
applicants = []
applicant1 = {'name': 'Lisa', 'experience': 'nan'}
applicant2 = {'name': 'Bill', 'experience': '3.1 months'}
applicant3 = {'name': 'Mandy', 'experience': '1 month'}
applicant4 = {'name': 'Geoff', 'experience': '6.7 years'}
applicant5 = {'name': 'Patricia', 'experience': '1 year'}
applicant6 = {'name': 'Kirsty', 'experience': '2017.2 years'}
applicants.append(applicant1)
applicants.append(applicant2)
applicants.append(applicant3)
applicants.append(applicant4)
applicants.append(applicant5)
applicants.append(applicant6)
print(applicants)
month_pattern = r'^([\d]+[\.\d]*) month(s*)'
year_pattern = r'^([\d]+[\.\d]*) year(s*)'
applicant_output = []
for applicant in applicants:
    if applicant['experience'] == 'nan':
        applicant_output.append(applicant)
    else:
        month = re.search(month_pattern, applicant['experience'])
        if month is not None:
            applicant_output.append(
                {
                    'name': applicant['name'],
                    'experience_months': month.group(1)
                })
        else:
            year = re.search(year_pattern, applicant['experience'])
            if year is not None:
                months = str(float(year.group(1)) * 12)
                applicant_output.append(
                    {
                        'name': applicant['name'],
                        'experience_months': months
                    })
print(applicant_output)
This gives the output:
[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'experience': '3.1 months'}, {'name': 'Mandy', 'experience': '1 month'}, {'name': 'Geoff', 'experience': '6.7 years'}, {'name': 'Patricia', 'experience': '1 year'}, {'name': 'Kirsty', 'experience': '2017.2 years'}]
with the result:
[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'experience_months': '3.1'}, {'name': 'Mandy', 'experience_months': '1'}, {'name': 'Geoff', 'experience_months': '80.4'}, {'name': 'Patricia', 'experience_months': '12.0'}, {'name': 'Kirsty', 'experience_months': '24206.4'}]
Use a temp_df to separate out the month/year part:
import numpy as np
import pandas as pd

temp_df = df['experience'].str.split('([A-Za-z]+)', expand=True)
temp_df = temp_df.loc[:, ~(temp_df == "").any(axis=0)]  # delete the extra empty column produced by the split
temp_df[0] = temp_df[0].astype(float)
temp_df
Get the multiplier for the experience value:
multiplier = pd.Series([1] * len(temp_df), index=temp_df.index)
year_rows = temp_df[1].str.contains('year', case=False).fillna(False) # getting the rows which has year
temp_df.loc[(year_rows) & (temp_df[0]>=55), 0] = np.nan # converting exp value to nan where value is >= 55 and unit is year
multiplier[year_rows] = 12
df['experience_in_months'] = temp_df[0] * multiplier
df
I have a dataframe that has a list of One Piece manga, which currently looks like this:
             #                                      Title  Pages
Date
1997-07-19 1 Romance Dawn - The Dawn of the Adventure 53
1997-07-28 2 That Guy, "Straw Hat Luffy" 23
1997-08-04 3 Introducing "Pirate Hunter Zoro" 21
1997-08-11 4 Marine Captain "Axe-Hand Morgan" 19
1997-08-25 5 Pirate King and Master Swordsman 19
1997-09-01 6 The First Crew Member 23
1997-09-08 7 Friends 20
1997-09-13 8 Introducing Nami 19
Although every episode is supposed to be issued weekly, sometimes they are delayed or on break, resulting in irregular intervals between the dates. What I would like to do is add the missing dates. For example, between 1997-08-11 and 1997-08-25 there should be a 1997-08-18 row (7 days after 1997-08-11) where no episode was issued. Could you help me work out how to do this?
Thank you.
You should use the pandas shift method to see the gap between issues.
df['day_between'] = df['Date'].shift(-1) - df['Date']
output of print(df[['Date', 'day_between']]) is then:
Date day_between
0 1997-07-19 9 days
1 1997-07-28 7 days
2 1997-08-04 7 days
3 1997-08-11 14 days
4 1997-08-25 7 days
5 1997-09-01 7 days
6 1997-09-08 5 days
7 1997-09-13 NaT
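shift only surfaces the gaps; a minimal sketch that also inserts the missing week, assuming Date has been made a regular datetime column as in the output above (each gap of 14+ days gets one placeholder row 7 days after the previous issue; longer hiatuses would need a loop):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
gap = df['Date'].shift(-1) - df['Date']  # time until the next issue
missing = df.loc[gap >= pd.Timedelta(days=14), 'Date'] + pd.Timedelta(days=7)
df = (pd.concat([df, pd.DataFrame({'Date': missing})])
        .sort_values('Date')
        .reset_index(drop=True))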
I used relativedelta and a list comprehension to get a 14-day interval per row, and .shift(1) with np.where() to compare against the previous row, returning 1 where we want to insert a row before. Then I looped through the DataFrame and appended the relevant rows to another DataFrame. Finally, I used pd.concat to bring the two DataFrames together, sorted by date, deleted the helper columns, and reset the index.
There may be some gaps, as others have mentioned, such as 22+ days, but this should get you in the right direction. Perhaps you could turn it into a function and run it multiple times, which is why I added .reset_index(drop=True) at the end. Obviously, you could make this more advanced, but I hope this helps.
from dateutil.relativedelta import relativedelta
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': {0: '1997-07-19',
1: '1997-07-28',
2: '1997-08-04',
3: '1997-08-11',
4: '1997-08-25',
5: '1997-09-01',
6: '1997-09-08',
7: '1997-09-13'},
'#': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'Title': {0: 'Romance Dawn - The Dawn of the Adventure',
1: 'That Guy, "Straw Hat Luffy"',
2: 'Introducing "Pirate Hunter Zoro"',
3: 'Marine Captain "Axe-Hand Morgan"',
4: 'Pirate King and Master Swordsman',
5: 'The First Crew Member',
6: 'Friends',
7: 'Introducing Nami'},
'Pages': {0: 53, 1: 23, 2: 21, 3: 19, 4: 19, 5: 23, 6: 20, 7: 19}})
df['Date'] = pd.to_datetime(df['Date'])
df['Date2'] = [d + relativedelta(days=14) for d in df['Date']]  # Date plus 14 days
df['Date3'] = np.where(df['Date'] >= df['Date2'].shift(1), 1, 0)
df1 = pd.DataFrame({})
n = 0
for j in df['Date3']:
    n += 1
    if j == 1:
        new_row = pd.DataFrame({"Date": df['Date'][n-1] - relativedelta(days=7)}, index=[n])
        df1 = pd.concat([df1, new_row])  # DataFrame.append was removed in pandas 2.0
df = pd.concat([df, df1]).sort_values('Date').drop(['Date2', 'Date3'], axis=1).reset_index(drop=True)
df
Output:
Date # Title Pages
0 1997-07-19 1.0 Romance Dawn - The Dawn of the Adventure 53.0
1 1997-07-28 2.0 That Guy, "Straw Hat Luffy" 23.0
2 1997-08-04 3.0 Introducing "Pirate Hunter Zoro" 21.0
3 1997-08-11 4.0 Marine Captain "Axe-Hand Morgan" 19.0
4 1997-08-18 NaN NaN NaN
5 1997-08-25 5.0 Pirate King and Master Swordsman 19.0
6 1997-09-01 6.0 The First Crew Member 23.0
7 1997-09-08 7.0 Friends 20.0
8 1997-09-13 8.0 Introducing Nami 19.0
I have a large dataframe of data in Pandas (let's say of courses at a university) looking like:
ID name credits enrolled ugrad/grad year semester
1 Math 4 62 ugrad 2016 Fall
2 History 3 15 ugrad 2016 Spring
3 Adv Math 3 8 grad 2017 Fall
...
and I want to group it by year and semester, and then get a bunch of different aggregate data on it, all at one time if I can. For example, I want a total count of courses, a count of only the undergraduate courses, and the sum of enrollment for a given semester. I can do each of these individually using value_counts, but I'd like output such as:
year semester count count_ugrad total_enroll
2016 Fall # # #
Spring # # #
2017 Fall # # #
Spring # # #
...
Is this possible?
Here I added a new subject, Python, and provided the data as a dict to load into a DataFrame.
The solution is a combination of the agg() method on a groupby, where the aggregations are provided in a dictionary, and the use of a custom aggregation function for your ugrad requirement:
import pandas as pd

def my_custom_ugrad_aggregator(arr):
    return sum(arr == 'ugrad')

data = {'name': {0: 'Math', 1: 'History', 2: 'Adv Math', 3: 'Python'}, 'year': {0: 2016, 1: 2016, 2: 2017, 3: 2017}, 'credits': {0: 4, 1: 3, 2: 3, 3: 4}, 'semester': {0: 'Fall', 1: 'Spring', 2: 'Fall', 3: 'Spring'}, 'ugrad/grad': {0: 'ugrad', 1: 'ugrad', 2: 'grad', 3: 'ugrad'}, 'enrolled': {0: 62, 1: 15, 2: 8, 3: 8}, 'ID': {0: 1, 1: 2, 2: 3, 3: 4}}
df = pd.DataFrame(data)
ID credits enrolled name semester ugrad/grad year
0 1 4 62 Math Fall ugrad 2016
1 2 3 15 History Spring ugrad 2016
2 3 3 8 Adv Math Fall grad 2017
3 4 4 8 Python Spring ugrad 2017
print(df.groupby(['year', 'semester']).agg({'name': ['count'], 'enrolled': ['sum'], 'ugrad/grad': my_custom_ugrad_aggregator}))
gives:
name ugrad/grad enrolled
count my_custom_ugrad_aggregator sum
year semester
2016 Fall 1 1 62
Spring 1 1 15
2017 Fall 1 0 8
Spring 1 1 8
Use agg with a dictionary describing how to roll up each column:
df_out = (df.groupby(['year', 'semester'])[['enrolled', 'ugrad/grad']]
            .agg({'ugrad/grad': lambda x: (x == 'ugrad').sum(),
                  'enrolled': ['sum', 'size']})
            .set_axis(['Ugrad Count', 'Total Enrolled', 'Count Courses'], axis=1))
df_out
Output:
Ugrad Count Total Enrolled Count Courses
year semester
2016 Fall 1 62 1
Spring 1 15 1
2017 Fall 0 8 1
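On pandas 0.25+ the same rollup reads more directly with named aggregation, sketched here with the column names taken from the desired output in the question:
df_out = (df.groupby(['year', 'semester'])
            .agg(count=('ID', 'size'),  # total courses per group
                 count_ugrad=('ugrad/grad', lambda s: (s == 'ugrad').sum()),
                 total_enroll=('enrolled', 'sum')))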