groupby two columns based on third column values [duplicate] - python

I have a dataset like this:
import pandas as pd

df = pd.DataFrame({'time': ['13:30', '9:20', '18:12', '19:00', '11:20', '13:30', '15:20', '17:12', '16:00', '8:20'],
                   'item': ['coffee', 'bread', 'pizza', 'rice', 'soup', 'coffee', 'bread', 'pizza', 'rice', 'soup']})
I want to produce this output.
First I need to split the time into 3 categories, based on these intervals:
(6, 11] for breakfast, (11, 15] for lunch and (15, 20] for dinner.
This is my code:
df['hour'] = df['time'].apply(lambda x: int(x.split(':')[0]))

def time_period(hour):
    if 6 < hour <= 11:
        return 'breakfast'
    elif 11 < hour <= 15:
        return 'lunch'
    else:
        return 'dinner'

df['meal'] = df['hour'].apply(time_period)
but I don't know how to do the groupby part.

You can simply group by "meal" and create a list of items for each meal. Just add the following line to do this:
df.groupby(['meal'])["item"].apply(list)
You can apply count and similar aggregations on top of this to achieve the result you want.
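For example, counting how many of each item lands in each meal could look like the sketch below; the exact aggregation depends on the output you actually need.
df.groupby('meal')['item'].value_counts()        # counts per (meal, item) pair
df.groupby('meal')['item'].agg([list, 'count'])  # the lists from above plus their lengths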

Related

Update a dataframe iteratively

I have a dataframe:
QID URL Questions Answers Section QType Theme Topics Answer0 Answer1 Answer2 Answer3 Answer4 Answer5 Answer6
1113 1096 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing To what extent are the following factors considerations in your choice of flight? ['Very important consideration', 'Important consideration', 'Neutral', 'Not an important consideration', 'Do not consider'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['extent', 'follow', 'factor', 'consider', 'choic', 'flight'] Very important consideration Important consideration Neutral Not an important consideration Do not consider NaN NaN
1116 1097 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing How far in advance do you typically book your tickets? ['0-2 months in advance', '2-4 months in advance', '4-6 months in advance', '6-8 months in advance', '8-10 months in advance', '10-12 months in advance', '12+ months in advance'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['advanc', 'typic', 'book', 'ticket'] 0-2 months in advance 2-4 months in advance 4-6 months in advance 6-8 months in advance 8-10 months in advance 10-12 months in advance 12+ months in advance
In it I want to replace a few lines, which are actually QuestionGrid titles, with new lines that also represent the answers. I have another file, a pickle, which contains the information to build the lines that will update the old ones. Each old line will be transformed into several new lines (I point this out because I do not know how to do that).
These lines are just the grid titles of questions like the following one:
Expected dataframe
I would like to insert them into the original dataframe, in place of the lines they match in the 'Questions' column, as in the following dataframe:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1096_S01 'The airline/company you fly with'
1096_S02 'The departure airport'
1096_S03 'Duration of flight/route'
1096_S04 'Baggage policy'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
1097_S01 ...
...
What I tried
import pickle
import pandas as pd
from functools import reduce

qa = pd.read_pickle(r'Python/interns.p')
df = pd.read_csv("QuestionBank.csv")

def isGrid(dic, df):
    '''Check if a row is related to a Google Forms grid;
    if it is the case, update this row.'''
    d_answers = dic['answers']
    try:
        answers = d_answers[2]
        if len(answers) > 1:
            # find the line in df and replace the line where it matches with the new lines
            update_lines(dic, df)
        return df
    except TypeError:
        return df
def update_lines(dic, df):
    '''Find the lines in df that match the question in dic
    and replace them with the new lines.'''
    lines_to_replace = df.index[df['Questions'] == dic['question']].tolist()  # might be several rows, and maybe not all of them should be replaced
    # check that there is at least one row to replace
    if lines_to_replace:
        # replace every row where the question matches
        for line_to_replace in lines_to_replace:
            # replace this row with the dataframe built below
            questions = reduce(lambda a, b: a + b, [dic['answers'][2][x][3] for x in range(len(dic['answers'][2]))])
            ind_answers = dic['answers'][2][0][1]
            answers = []
            # gather all the potential answers
            for i in range(len(ind_answers)):
                answers.append(reduce(lambda a, b: a + b, [ind_answers[i] for x in range(len(questions))]))  # duplicated, since many lines of a grid share the same answers; maybe I should have used a set
            answers = {f"Answer{i}": answers[i] for i in range(0, len(answers))}  # dynamically allocated so they land in the right columns
            dict_replacing = {'Questions': questions, **answers}  # dictionary used to create the new lines
            df1 = pd.DataFrame(dict_replacing)
            df1.index = df1.index / 10 + line_to_replace
            df = df1.combine_first(df)
    return df
I made a Colaboratory notebook, if needed.
What I obtain
But the dataframe is the same size before and after this runs. Indeed, I get:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
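For what it's worth, the fractional-index plus combine_first idea the code relies on does grow a frame when its result is kept, as this toy sketch (unrelated to the notebook's data) shows:
import pandas as pd

base = pd.DataFrame({'Questions': ['q0', 'grid title', 'q2']})   # rows 0, 1, 2
subs = pd.DataFrame({'Questions': ['sub a', 'sub b', 'sub c']})  # replacement lines
subs.index = subs.index / 10 + 1                                 # 1.0, 1.1, 1.2 -> slot in at row 1
expanded = subs.combine_first(base)                              # 5 rows: 0, 1.0, 1.1, 1.2, 2
So one likely reason the size does not change here is that isGrid calls update_lines(dic, df) without keeping its return value, and the df = df1.combine_first(df) inside update_lines only rebinds the local name, leaving the caller's dataframe untouched.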

How to create a function for a dataframe with pandas

I have this data frame of clients' purchases, and I would like to create a function that gives me the total purchases for a given input of month and year.
I have a dataframe (df) with lots of columns, but I'm going to use only 3 ("year", "month", "value").
This is what I'm trying, but it is not working:
def total_purchases():
    y = input('Which year do you want to consult?')
    m = int(input('Which month do you want to consult?'))
    sum = []
    if df[df['year'] == y] & df[df['month'] == m]:
        for i in df:
            sum = sum + df[df['value']]
    return sum
You're close, but you need to ditch the if statement and the for loop.
Additionally, when dealing with multiple logical operators in pandas, you need to use parentheses to separate the conditions.
def total_purchases(df):
    y = input('Which year do you want to consult? ')
    m = int(input('Which month do you want to consult? '))
    return df[(df['year'].eq(y)) & (df['month'].eq(m))]['value'].sum()
setup
df_p = pd.DataFrame({'year' : ['2011','2011','2012','2013'],
'month' : [1,2,1,2],
'value' : [200,500,700,900]})
Test
total_purchases(df_p)
Which year do you want to consult? 2011
Which month do you want to consult? 2
500
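One small caveat, kept out of the answer's code: input() returns a string, which matches here only because the setup stores year as strings. If the year column held integers instead, the year input would need converting as well, for example:
y = int(input('Which year do you want to consult? '))  # only if df['year'] is numeric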

Pandas Advanced: How to get results for customer who has bought at least twice within 5 days of period?

I have been attempting to solve a problem for hours and am stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer (ISLAT) ordered again within a 5-day period, and he did it twice.
I would like to get the output in the following format:
Required Output
customerid initial_order_id initial_order_date nextorderid nextorderdate daysbetween
ISLAT 10315 1996-09-26 10318 1996-10-01 5
ISLAT 10318 1996-10-01 10321 1996-10-03 2
First, to be able to count the difference in days, convert the orderdate column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
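If you also need the exact pair-shaped output from the question (initial order, next order, and the day gap), a sketch along these lines with groupby and shift should get there; the renamed columns simply follow the question's required output:
df = df.sort_values(['customerid', 'orderdate'])
grp = df.groupby('customerid')
pairs = df.assign(nextorderid=grp['orderid'].shift(-1),
                  nextorderdate=grp['orderdate'].shift(-1))
pairs['daysbetween'] = (pairs['nextorderdate'] - pairs['orderdate']).dt.days
result = (pairs[pairs['daysbetween'] <= 5]
          .rename(columns={'orderid': 'initial_order_id',
                           'orderdate': 'initial_order_date'}))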
It is a bit tricky because there can be any number of purchase pairs within a 5-day window. It is a good use case for merge_asof, which lets you do approximate-but-not-exact matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self join on the date, but not exact.
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'), allow_exact_matches=False)
    # Compute the difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort from oldest to newest purchase (merge_asof needs ascending keys; groupby will keep this order)
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days<=5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
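If all you ultimately need is which customers qualify, rather than the pairs themselves, one more line on top of this result would do (a sketch):
qualifying_customers = result['customerid_first'].unique()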
You can create the column 'daysbetween' with sort_values and diff. Then, to line up each order with the next one, join df with a per-customer shift of itself. Finally, query the rows where 'daysbetween_next' meets the condition:
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days
df_final = df.join(df.groupby('customerid').shift(-1),
lsuffix='_initial', rsuffix='_next')\
.drop('daysbetween_initial', axis=1)\
.query('daysbetween_next <= 5 and daysbetween_next >=0')
It's quite simple. Let's write down the requirements one at a time and build on them.
First, I guess that the customer has a unique id, since it's not specified. We'll use that id to identify customers.
Second, I assume it does not matter whether the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution could also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the x-th row has the same user id as the (x-1)-th row (i.e. the previous row).
Now, let's search for purchases within the 5 days by adding a condition to the previous piece of code:
new_df = df[(df["ID"] == df["ID"].shift(1)) & ((df["Date"] - df["Date"].shift(1)) <= pd.Timedelta(days=5))]
This should do the job. I cannot test it right now, so some fixes may be needed; I'll try to test it as soon as I can.
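For reference, here is a sketch of the same shift-based filter written against the question's actual column names; it assumes orderdate has already been converted with pd.to_datetime and that the rows are sorted by customer and date, which shift(1) relies on:
df = df.sort_values(['customerid', 'orderdate'])
same_customer = df['customerid'] == df['customerid'].shift(1)
within_5_days = (df['orderdate'] - df['orderdate'].shift(1)) <= pd.Timedelta(days=5)
new_df = df[same_customer & within_5_days]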

Group Months in string type to quarters in string type and add values on pandas df column

I want to sum the values of every quarter and rename the columns to the 2000q1 format. This returns a value, but I am not sure it is the right one. To test it, I am summing the original columns and checking the values. What do you think about the method below?
homes = homes[['State', 'RegionName', '2000-01', '2000-02', '2000-03', '2000-04', '2000-05', "2000-06", "2000-07", "2000-08", "2000-09", "2000-10", "2000-11", "2000-12", "2001-01", "2001-02", "2001-03","2001-04","2001-05","2001-06","2001-07","2001-08","2001-09","2001-10","2001-11","2001-12","2002-01","2002-02","2002-03","2002-04","2002-05","2002-06","2002-07","2002-08","2002-09","2002-10","2002-11","2002-12","2003-01","2003-02","2003-03","2003-04","2003-05","2003-06","2003-07","2003-08","2003-09","2003-10","2003-11","2003-12","2004-01","2004-02","2004-03","2004-04","2004-05","2004-06","2004-07","2004-08","2004-09","2004-10","2004-11","2004-12","2005-01","2005-02","2005-03","2005-04","2005-05","2005-06","2005-07","2005-08","2005-09","2005-10","2005-11","2005-12","2006-01","2006-02","2006-03","2006-04","2006-05","2006-06","2006-07","2006-08","2006-09","2006-10","2006-11","2006-12","2007-01","2007-02","2007-03","2007-04","2007-05","2007-06","2007-07","2007-08","2007-09","2007-10","2007-11","2007-12","2008-01","2008-02","2008-03","2008-04","2008-05","2008-06","2008-07","2008-08","2008-09","2008-10","2008-11","2008-12","2009-01","2009-02","2009-03","2009-04","2009-05","2009-06","2009-07","2009-08","2009-09","2009-10","2009-11","2009-12","2010-01","2010-02","2010-03","2010-04","2010-05","2010-06","2010-07","2010-08","2010-09","2010-10","2010-11","2010-12","2011-01","2011-02","2011-03","2011-04","2011-05","2011-06","2011-07","2011-08","2011-09","2011-10","2011-11","2011-12","2012-01","2012-02","2012-03","2012-04","2012-05","2012-06","2012-07","2012-08","2012-09","2012-10","2012-11","2012-12","2013-01","2013-02","2013-03","2013-04","2013-05","2013-06","2013-07","2013-08","2013-09","2013-10","2013-11","2013-12","2014-01","2014-02","2014-03","2014-04","2014-05","2014-06","2014-07","2014-08","2014-09","2014-10","2014-11","2014-12","2015-01","2015-02","2015-03","2015-04","2015-05","2015-06","2015-07","2015-08","2015-09","2015-10","2015-11","2015-12","2016-01","2016-02","2016-03","2016-04","2016-05","2016-06","2016-07","2016-08"]]
years = homes.iloc[:, 3:]
years = years.groupby(pd.to_datetime(years.columns).to_period('Q'), axis=1).sum() # GROUP ALL MONTHS TO QUARTERS and sum their values !
years.columns = years.columns.strftime('%Y'+'q'+'%q')
#print years
homes = homes[['State', 'RegionName']]
homes = pd.merge(homes, years, how='outer', left_index=True, right_index=True)
homes['State'] = homes['State'].apply(state_names_transformed)
#homes.columns = homes.columns.str.replace('Q', 'q')
homes = homes.set_index(['State', 'RegionName'])
return homes
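A quick way to sanity-check one quarter, say 2000q1, is to sum the corresponding monthly columns directly and compare, right after the columns are renamed and before homes is cut down to the two key columns; a sketch, assuming homes still holds the monthly data at that point:
manual_2000q1 = homes[['2000-01', '2000-02', '2000-03']].sum(axis=1)
print((years['2000q1'] == manual_2000q1).all())  # expect True if the quarterly grouping is right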

new column based on specific string info from two different columns Python Pandas

I have 2 columns. I need to take specific string information from each column and create a new column with new strings based on it.
In the column "Name" I have well names. I need to look at the last 4 characters of each well name, and if they contain "H", call that well "HZ" in a new column.
I need to do the same thing if the column "WELLTYPE" contains specific words.
Using the data analysis program Spotfire, I can do this all in one simple equation (see below).
case
When right([UWI],4)~="H" Then "HZ"
When [WELLTYPE]~="Horizontal" Then "HZ"
When [WELLTYPE]~="Deviated" Then "D"
When [WELLTYPE]~="Multilateral" Then "ML"
else "V"
End
What would be the best way to do this in Python pandas?
Is there a simple, clean way to do this all at once, like in the Spotfire equation above?
Here is the data table with the two columns and my hoped-for outcome column (it did not copy very well into this). I also provide the code for the table below.
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V
Here is the code to create the dataframe
# Dataframe with hopeful outcome
raw_data = {'Name': ['HH-001HST2', 'HH-001HST', 'HB-002H', 'HB-002', 'HB-002','HB-004','HB-005','BB-007'],
'WELLTYPE':['Oil Horizontal', 'Oil_Horizontal', 'Oil', 'Water_Deviated', 'Oil_Multilateral','Oil','Source','Water'],
'What I Want': ['HZ', 'HZ', 'HZ', 'D', 'ML','V','V','V']}
df = pd.DataFrame(raw_data, columns = ['Name','WELLTYPE','What I Want'])
df
Nested 'where' variant:
import numpy as np

df['What I Want'] = np.where(df.Name.str[-4:].str.contains('H'), 'HZ',
                    np.where(df.WELLTYPE.str.contains('Horizontal'), 'HZ',
                    np.where(df.WELLTYPE.str.contains('Deviated'), 'D',
                    np.where(df.WELLTYPE.str.contains('Multilateral'), 'ML',
                    'V'))))
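An equivalent, flatter way to express the same cascade of conditions (a sketch, not part of the original answer, reusing the np import above) is np.select, which picks the first condition that matches:
conditions = [df.Name.str[-4:].str.contains('H'),
              df.WELLTYPE.str.contains('Horizontal'),
              df.WELLTYPE.str.contains('Deviated'),
              df.WELLTYPE.str.contains('Multilateral')]
choices = ['HZ', 'HZ', 'D', 'ML']
df['What I Want'] = np.select(conditions, choices, default='V')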
Using apply by row:
def criteria(row):
    # use >= 0: str.find() returns 0 when the match is at the very start of the string
    if row.Name[-4:].find('H') >= 0:
        return 'HZ'
    elif row.WELLTYPE.find('Horizontal') >= 0:
        return 'HZ'
    elif row.WELLTYPE.find('Deviated') >= 0:
        return 'D'
    elif row.WELLTYPE.find('Multilateral') >= 0:
        return 'ML'
    else:
        return 'V'

df['want'] = df.apply(criteria, axis=1)
This feels more natural to me; obviously that's subjective.
from_name = df.Name.str[-4:].str.contains('H').map({True: 'HZ'})
regex = '(Horizontal|Deviated|Multilateral)'
m = dict(Horizontal='HZ', Deviated='D', Multilateral='ML')
from_well = df.WELLTYPE.str.extract(regex, expand=False).map(m)
df['What I Want'] = from_name.fillna(from_well).fillna('V')
print(df)
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V
