Can I pass a list to a pandas Series as an index?
I have the following dataframe:
d = {'no': ['1','2','3','4','5','6','7','8','9'], 'buyer_code': ['Buy1', 'Buy2', 'Buy3', 'Buy1', 'Buy2', 'Buy2', 'Buy2', 'Buy1', 'Buy3'], 'dollar_amount': ['200.25', '350.00', '120.00', '400.50', '1231.25', '700.00', '350.00', '200.25', '2340.00'], 'date': ['22-01-2010','14-03-2010','17-06-2010','13-04-2011','17-05-2011','28-01-2012','23-07-2012','25-10-2012','25-12-2012']}
df = pd.DataFrame(data=d)
df
buyer_code date dollar_amount no
0 Buy1 22-01-2010 200.25 1
1 Buy2 14-03-2010 350.00 2
2 Buy3 17-06-2010 120.00 3
3 Buy1 13-04-2011 400.50 4
4 Buy2 17-05-2011 1231.25 5
5 Buy2 28-01-2012 700.00 6
6 Buy2 23-07-2012 350.00 7
7 Buy1 25-10-2012 200.25 8
8 Buy3 25-12-2012 2340.00 9
Converting the amounts to float for aggregation:
pd.options.display.float_format = '{:,.4f}'.format
df['dollar_amount'] = df['dollar_amount'].astype(float)
Getting the most important Buyers by frequency and dollars:
NOTE: Here I am getting just the top 2 buyers; in the real example I might have to get up to 40 buyers.
xx = df.groupby('buyer_code').agg({'dollar_amount' : 'mean', 'no' : 'size'})
xx['frqAmnt'] = xx['no'].values * xx['dollar_amount'].values
xx = xx['frqAmnt'].nlargest(2)
xx
buyer_code
Buy2 2,631.2500
Buy3 2,460.0000
Name: frqAmnt, dtype: float64
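A side note sketch: since the mean multiplied by the count is just the grouped sum, the same ranking can presumably be obtained directly from a grouped sum (only the Series name differs), assuming the same frame:
xx = df.groupby('buyer_code')['dollar_amount'].sum().nlargest(2)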
Grouping buyers and their purchase dates:
zz = df.groupby(['buyer_code'])['date'].value_counts()  # .head(all) was a bug: all is the Python builtin, and value_counts already returns every row
zz
buyer_code date
Buy1        13-04-2011    1
            22-01-2010    1
            25-10-2012    1
Buy2        14-03-2010    1
            17-05-2011    1
            23-07-2012    1
            28-01-2012    1
Buy3        17-06-2010    1
            25-12-2012    1
Name: date, dtype: int64
Now I want to pass my top buyer_codes to my zz series to get only the transactional data corresponding to those buyers.
How can I do it? I might be on the wrong path here, but kindly help me out.
I think you need:
a = zz[zz.index.get_level_values(0).isin(xx.index)]
print (a)
buyer_code date
Buy2 14-03-2010 1
17-05-2011 1
23-07-2012 1
28-01-2012 1
Buy3 17-06-2010 1
25-12-2012 1
Name: date, dtype: int64
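An alternative sketch is plain label-based selection with loc, assuming every code in xx.index exists in the first level of zz (otherwise this raises a KeyError):
a = zz.loc[xx.index.tolist()]
This also returns the groups in the order of xx.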
To keep the order of xx, a reindex is needed:
a = zz[zz.index.get_level_values(0).isin(xx.index)].reindex(xx.index, level=0)
And to collect all dates per buyer_code into lists:
b = a.reset_index(name='a').groupby('buyer_code')['date'].apply(list).reset_index()
print (b)
buyer_code date
0 Buy2 [14-03-2010, 17-05-2011, 23-07-2012, 28-01-2012]
1 Buy3 [17-06-2010, 25-12-2012]
Existing dataframe:
df_1
Id dates time(sec)_1 time(sec)_2
1 02/02/2022 15 20
1 04/02/2022 20 30
1 03/02/2022 30 40
1 06/02/2022 50 40
2 10/02/2022 10 10
2 11/02/2022 15 20
df_2
Id min_date action_date
1 02/02/2022 04/02/2022
2 06/02/2022 10/02/2022
Expected dataframe:
df_2
Id min_date action_date count_of_dates avg_time_1 avg_time_2
1 02/02/2022 04/02/2022 3 21.67 30
2 06/02/2022 10/02/2022 1 10 10
count_of_dates, avg_time_1 and avg_time_2 are to be created from df_1.
count_of_dates is calculated considering min_date and action_date, i.e. the number of dates from df_1 falling between min_date and action_date.
avg_time_1 and avg_time_2 are averaged over those same dates.
I am stuck with applying the condition on the dates :-( any leads?
If the data is small, it is possible to filter per row with a custom function:
df_1['dates'] = pd.to_datetime(df_1['dates'], dayfirst=True)
df_2[['min_date','action_date']] = df_2[['min_date','action_date']].apply(pd.to_datetime, dayfirst=True)

def f(x):
    m = df_1['Id'].eq(x['Id']) & df_1['dates'].between(x['min_date'], x['action_date'])
    s = df_1.loc[m, ['time(sec)_1','time(sec)_2']].mean()
    return pd.Series([m.sum()] + s.to_list(), index=['count_of_dates'] + s.index.tolist())
df = df_2.join(df_2.apply(f, axis=1))
print (df)
   Id   min_date action_date  count_of_dates  time(sec)_1  time(sec)_2
0   1 2022-02-02  2022-02-04             3.0    21.666667         30.0
1   2 2022-02-06  2022-02-10             1.0    10.000000         10.0
If Id in df_2 is unique, it is possible to improve performance by merging with df_1 and aggregating size and mean:
df = df_2.merge(df_1, on='Id')
d = {'count_of_dates': ('Id', 'size'),
     'time(sec)_1': ('time(sec)_1', 'mean'),
     'time(sec)_2': ('time(sec)_2', 'mean')}
df = df_2.join(df[df['dates'].between(df['min_date'], df['action_date'])]
                 .groupby('Id').agg(**d), on='Id')
print (df)
   Id   min_date action_date  count_of_dates  time(sec)_1  time(sec)_2
0   1 2022-02-02  2022-02-04               3    21.666667           30
1   2 2022-02-06  2022-02-10               1    10.000000           10
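A possible variant if Id in df_2 is not unique: join on a per-row key instead of Id. The row_id column below is a hypothetical helper, not part of the original data:
df_2 = df_2.reset_index().rename(columns={'index': 'row_id'})  # hypothetical per-row key
df = df_2.merge(df_1, on='Id')
mask = df['dates'].between(df['min_date'], df['action_date'])
df = df_2.join(df[mask].groupby('row_id').agg(**d), on='row_id').drop(columns='row_id')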
I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date        Value  PercentDifference  ValueDifference
10/01/2020  1
10/02/2020  2      100                1
10/03/2020  5      150                3
10/04/2020  8      60                 3
This is what I am doing:
import pandas as pd

df = pd.read_csv('df.csv')
df['Date'] = pd.to_datetime(df['Date'])  # Date must be datetime for the timedelta arithmetic below
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
               on='Date')
        .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
        [['Date','Value']]
     )
df['PercentDifference'] = [f'{x:.2%}' for x in (df['Value'].div(df['Value'].shift(1)) - 1).fillna(0)]
A member has helped me with the code above; I am also trying to incorporate the value difference as shown in my desired output.
Note: is there a way to incorporate a 'period', say, checking the percent difference and value difference over a 7-day period, a 30-day period, and so on?
Any suggestion is appreciated
Use Series.pct_change and Series.diff:
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or use DataFrame.assign:
df.assign(
    PercentageDiff=df["Value"].pct_change().mul(100),
    ValueDiff=df["Value"].diff()
)
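For the 'period' note in the question, both methods accept a periods argument; a sketch assuming one row per day, so 7 rows correspond to 7 days:
df['PercentageDiff7'] = df['Value'].pct_change(periods=7).mul(100)
df['ValueDiff7'] = df['Value'].diff(periods=7)
The first 7 rows come out as NaN, since there is no value 7 days earlier to compare against.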
In a pandas data frame I would like to find the mean values of a column, grouped by a 'customized' year.
An example would be to compute the mean values of school marks for a school year (e.g. Sep/YYYY to Aug/YYYY+1).
The pandas docs give some information on offsets, business years etc., but I can't really make sense of that to get a working example.
Here is a minimal example where mean values of school marks are computed per calendar year (Jan-Dec), which is what I do not want.
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randint(low=1, high=5, size=36),
                  index=pd.date_range('2001-09-01', freq='M', periods=36),
                  columns=['marks'])
df_yearly = df.groupby(pd.Grouper(freq="A")).mean()
This could yield e.g.:
print(df):
marks
2001-09-30 1
2001-10-31 4
2001-11-30 2
2001-12-31 1
2002-01-31 4
2002-02-28 1
2002-03-31 2
2002-04-30 1
2002-05-31 3
2002-06-30 3
2002-07-31 3
2002-08-31 3
2002-09-30 4
2002-10-31 1
...
2003-11-30 4
2003-12-31 2
2004-01-31 1
2004-02-29 2
2004-03-31 1
2004-04-30 3
2004-05-31 4
2004-06-30 2
2004-07-31 2
2004-08-31 4
print(df_yearly):
marks
2001-12-31 2.000000
2002-12-31 2.583333
2003-12-31 2.666667
2004-12-31 2.375000
My desired output would correspond to something like:
2001-09/2002-08 mean_value
2002-09/2003-08 mean_value
2003-09/2004-08 mean_value
Many thanks!
We can manually compute the school years:
# if month>=9 we move it to the next year
school_years = df.index.year + (df.index.month>8).astype(int)
Another option is to use fiscal year starting from September:
school_years = df.index.to_period('Q-AUG').qyear
And we can groupby:
df.groupby(school_years).mean()
Output:
marks
2002 2.333333
2003 2.500000
2004 2.500000
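The yearly offsets the question mentions can also be used directly; a sketch with pandas' A-AUG anniversary alias (years ending in August), where the group labels are period end dates such as 2002-08-31 rather than plain years:
df.groupby(pd.Grouper(freq='A-AUG')).mean()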
One more approach:
a = (df.index.month == 9).cumsum()  # group id that increments at every September
val = df.groupby(a, sort=False)['marks'].mean().reset_index()
dates = df.index.to_series().groupby(a, sort=False).agg(['first', 'last']).reset_index()
dates.merge(val, on='index')
Output
index first last marks
0 1 2001-09-30 2002-08-31 2.750000
1 2 2002-09-30 2003-08-31 2.333333
2 3 2003-09-30 2004-08-31 2.083333
Please suggest a more suitable title for this question.
I have a two-level indexed DF (created via groupby):
clicks yield
country report_date
AD 2016-08-06 1 31
2016-12-01 1 0
AE 2016-10-11 1 0
2016-10-13 2 0
I need to sequentially take the data country by country, process it, and put it back:
for country in set(DF.get_level_values(0)):
    DF_country = process(DF.loc[country])
    DF[country] = DF_country
where process adds new rows to DF_country.
The problem is in the last line:
ValueError: Wrong number of items passed 2, placement implies 1
I just modified your code, changing process to add. Based on my understanding, process is a self-defined function, right?
for country in set(DF.index.get_level_values(0)):  # change here
    DF_country = DF.loc[country].add(1)
    DF.loc[country] = DF_country.values  # and here
DF
Out[886]:
clicks yield
country report_date
AD 2016-08-06 2 32
2016-12-01 2 1
AE 2016-10-11 2 1
2016-10-13 3 1
EDIT :
l = []
for country in set(DF.index.get_level_values(0)):
    DF1 = DF.loc[country]
    DF1.loc['2016-01-01'] = [1, 2]  # adding row here
    l.append(DF1)
pd.concat(l, axis=0, keys=set(DF.index.get_level_values(0)))
Out[923]:
clicks yield
report_date
AE 2016-10-11 1 0
2016-10-13 2 0
2016-01-01 1 2
AD 2016-08-06 1 31
2016-12-01 1 0
2016-01-01 1 2
I am learning Python and at the moment I am playing with some sales data. The data is in csv format and shows weekly sales.
I have the below columns with some sample data:
store# dept# dates weeklysales
1 1 01/01/2005 50000
1 1 08/01/2005 120000
1 1 15/01/2005 75000
1 1 22/01/2005 25000
1 1 29/01/2005 18000
1 2 01/01/2005 15000
1 2 08/01/2005 12000
1 2 15/01/2005 75000
1 2 22/01/2005 35000
1 2 29/01/2005 28000
1 1 01/02/2005 50000
1 1 08/02/2005 120000
1 1 15/02/2005 75000
1 1 22/03/2005 25000
1 1 29/03/2005 18000
I want to sum the weeklysales to a monthly basis for each department and display the records.
I have tried to use the groupby function in Pandas from the below links:
how to convert monthly data to quarterly in pandas
Pandas group by and sum two columns
Pandas group-by and sum
But what happens with the above is that I get a sum of all the columns, producing the following output where the store and dept numbers are added up as well:
store# dept# dates weeklysales
4 3 01/2005 288000
4 1 01/2005 165000
4 3 02/2005 245000
4 3 03/2005 43000
I do not want to add store and dept numbers but want to just add the weeklysales figure by each month and want the display like:
store# dept# dates weeklysales
1 1 01/2005 288000
1 2 01/2005 165000
1 1 02/2005 245000
1 1 03/2005 43000
Will be grateful if I can get a solution for that.
Cheers,
Is this what you are after?
Convert dates to month/year format and then group and sum sales. This assumes dates is already a datetime column; from a csv you may first need pd.to_datetime(df['dates'], dayfirst=True).
(df.assign(dates=df.dates.dt.strftime('%m/%Y'))
   .groupby(['store#','dept#','dates'])
   .sum()
   .reset_index()
)
Out[243]:
store# dept# dates weeklysales
0 1 1 01/2005 288000
1 1 1 02/2005 245000
2 1 1 03/2005 43000
3 1 2 01/2005 165000
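An alternative sketch using a monthly PeriodIndex instead of formatted strings, assuming dates is already datetime; monthly periods sort chronologically, which the '%m/%Y' strings do not:
(df.assign(dates=df['dates'].dt.to_period('M'))
   .groupby(['store#', 'dept#', 'dates'], as_index=False)['weeklysales']
   .sum())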