Aggregates in python - python

I have a txt file with the following content:
ID1;ID2;TIME;VALUE
1000;100;012021;12
1000;100;022021;4129
1000;100;032021;128
1000;100;042021;412
1000;100;052021;12818
1000;120;022021;4129
1000;100;062021;546
1000;100;072021;86
1000;120;052021;12818
1000;100;082021;754
1000;100;092021;2633
1000;100;102021;571
1000;120;092021;2633
1000;100;112021;2233
1000;100;122021;571
1000;120;012021;12
1000;120;032021;128
1000;120;042021;412
1000;120;062021;546
1000;120;072021;86
1000;120;082021;754
1000;120;102021;571
1000;120;112021;2233
1000;120;122021;571
1000;100;012022;12
1000;100;022022;4129
1000;120;022022;4129
1000;120;032022;128
1000;120;042022;412
1000;100;032022;128
1000;100;042022;412
1000;100;052022;12818
1000;100;062022;546
1000;100;072022;86
1000;100;082022;754
1000;120;072022;86
1000;120;082022;754
1000;120;092022;2633
1000;120;102022;571
1000;100;092022;2633
1000;100;102022;571
1000;100;112022;2233
1000;100;122022;571
1000;120;012022;12
1000;120;052022;12818
1000;120;062022;546
1000;120;112022;2233
1000;120;122022;571
I need to make aggregates of time (half year, total year), using the items from column time, which have the same ID1, ID2 and sum up the values.
The output should look like this:
I would appreciate your help! This is what I have so far for half year:
#already sorted by time
data=open("file.txt").readlines()
count = 0
for line in data:
count += 1
for n in range(count - 1, len(data), 6):
subList = [data[n:n + 6]]
break

I'm far from being a Python expert but how about something like:
dd = defaultdict(lambda: [])
rows = [elems for elems in [line.strip().split(';') for line in data[1:]]]
for row in rows:
mm = row[2][:2]
yy = row[2][2:]
vv = int(row[3])
key = (row[0], row[1], yy)
dd[key].append([mm, yy, vv])
# print("Total of all values", sum(int(row[3]) for row in rows))
for k, v in dd.items():
h1 = sum(c[2] for c in v if c[0] <= '06')
h2 = sum(c[2] for c in v if c[0] > '06')
tt = sum(c[2] for c in v)
# or, much more simply, tt = h1 + h2
# print(k[0], k[1], k[2], "H1:", h1, "H2:", h2, "TT:", tt)
print(f"{k[0]};{k[1]};HY1{k[2]};{h1}")
print(f"{k[0]};{k[1]};HY2{k[2]};{h2}")
print(f"{k[0]};{k[1]};TY{k[2]};{tt}")
Seems to give correct results for the data supplied. Might not be efficient if you have huge amounts of data. YMMV.

This is a task that is very well suited for the pandas library, which is designed to work with tabular data. A way to do this in pandas would be something like this:
import pandas as pd
# Read the data from the textfile (here I saved it as data.csv).
# Make sure the TIME column is not read as an integer by declaring
# dtype={'TIME': object}, because this would omit the leading zeroes, which
# we need for conversion to datetime object
df = pd.read_csv('data.csv', delimiter=';', dtype={'TIME': object})
# Convert TIME column to datetime
df['TIME'] = pd.to_datetime(df['TIME'], format='%m%Y')
# Create new column with year
df['Y'] = df['TIME'].dt.year
# Create new column with halfyear (1 = first halfyear, 2 = second halfyear)
df['HY'] = df['TIME'].dt.month.floordiv(7).add(1)
After this, your table looks like this:
df.head() # Just show the first couple of rows
ID1 ID2 TIME VALUE Y HY
0 1000 100 2021-01-20 12 2021 1
1 1000 100 2021-02-20 4129 2021 1
2 1000 100 2021-03-20 128 2021 1
3 1000 100 2021-04-20 412 2021 1
4 1000 100 2021-05-20 12818 2021 1
Getting the table into the desired format takes a bit of work, but grouping and aggregating then becomes really easy. You can then also perform other grouping and aggregating operations as you please without having to code it all from hand.
To group by year and calculate the sum:
df.groupby(['ID1', 'ID2', 'Y']).sum()
ID1 ID2 Y VALUE
1000 100 2021 24893
1000 100 2022 24893
1000 120 2021 24893
1000 120 2022 24893
To group by halfyear and calculate the sum:
df.groupby(['ID1', 'ID2', 'Y', 'HY']).sum()
ID1 ID2 Y HY VALUE
1000 100 2021 1 18045
1000 100 2021 2 6848
1000 100 2022 1 18045
1000 100 2022 2 6848
1000 120 2021 1 18045
1000 120 2021 2 6848
1000 120 2022 1 18045
1000 120 2022 2 6848
Edit: Added a datetime format specifier to correctly read the date as MMYYYY instead of MMDDYY. Thanks to shunty for mentioning it! The results shown in this post will of course be different now than the actual results.

Related

Pandas dataframe vectorized bucketing/aggregation?

The Task
I have a dataframe that looks like this:
date
money_spent ($)
meals_eaten
weight
2021-01-01 10:00:00
350
5
140
2021-01-02 18:00:00
250
2
170
2021-01-03 12:10:00
200
3
160
2021-01-04 19:40:00
100
1
150
I want to discretize this so that it "cuts" the rows every $X. I want to know some statistics on how much is being done for every $X i spend.
So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:
first date of the cut
average meals_eaten
minimum weight
maximum weight
So the final table would be two rows like this:
date
cumulative_spent ($)
meals_eaten
min_weight
max_weight
2021-01-01 10:00:00
600
3.5
140
170
2021-01-03 12:10:00
300
2
150
160
My Approach:
My first instinct is to calculate the cumsum() of the money_spent (assume the data is sorted by date), then I use pd.cut() to basically make a new column, we call it spent_bin, that determines each row's bin.
Note: In this toy example, spent_bin would basically be: [0,500] for the first two rows and (500-1000] for the last two.
Then it's fairly simple, I do a groupby spent_bin then aggregate as follows:
.agg({
'date':'first',
'meals_eaten':'mean',
'returns': ['min', 'max']
})
What I've Tried
import pandas as pd
rows = [
{"date":"2021-01-01 10:00:00","money_spent":350, "meals_eaten":5, "weight":140},
{"date":"2021-01-02 18:00:00","money_spent":250, "meals_eaten":2, "weight":170},
{"date":"2021-01-03 12:10:00","money_spent":200, "meals_eaten":3, "weight":160},
{"date":"2021-01-05 22:07:00","money_spent":100, "meals_eaten":1, "weight":150}]
df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()
print(df)
print(pd.cut(df.cum_spent, 500))
For some reason, I can't get the cut step to work. Here is my toy code from above. The labels are not cleanly [0-500], (500,1000] for some reason. Honestly I'd settle for [350,500],(500-800] (this is what the actual cum sum values are at the edges of the cuts), but I can't even get that to work even though I'm doing the exact same as the documentation example. Any help with this?
Caveats and Difficulties:
It's pretty easy to write this in a for loop of course, just do a while cum_spent < 500:. The problem is I have millions of rows in my actual dataset, it currently takes me 20 minutes to process a single df this way.
There's also a minor issue that sometimes rows will break the interval. When that happens, I want that last row included. This problem is in the toy example where row #2 actually ends at $600 not $500. But it is the first row that ends at or surpasses $500, so I'm including it in the first bin.
The customized function to achieve the cumsum with reset limitation
df['new'] = cumli(df['money_spent ($)'].values,500)
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
date = ('date','first'),
meals_eaten = ('meals_eaten','mean'),
min_weight = ('weight','min'),
max_weight = ('weight','max')).sort_index(ascending=False)
Out[81]:
date meals_eaten min_weight max_weight
new
1 2021-01-01 3.5 140 170
0 2021-01-03 2.0 150 160
from numba import njit
#njit
def cumli(x, lim):
total = 0
result = []
for i, y in enumerate(x):
check = 0
total += y
if total >= lim:
total = 0
check = 1
result.append(check)
return result

Replacing a for loop with something more efficient when comparing dates to a list

Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a data frame df which contains, for the purposes of now, contains one column named Event Date which contains the date that an event occured:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
for date in date_list:
event_rows = df.apply(lambda x: True if x['Event Date'] > date else False , axis=1)
event_count = len(event_rows[event_rows == True].index)
temp = [date,event_count]
pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreaciate it might be odd asking for help when I have working code but I'm trying to cut down my reliance of loops as they are somewhat of a crutch for me at the moment. Also I have multiple different events to track in the full data and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then if there's dates in your date_list file that don't exist in your dataframe, convert the date_list into a dataframe and merge the previous results. The fillna(method='ffill') will fill gaps in the middle of the data, whille the last fillna(0) incase there's gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.fillna(method='ffill').fillna(0)
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03

How to convert this to a for-loop with an output to CSV

I'm trying to put together a generic piece of code that would:
Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date 4. close decile
2017-01-03 1158.2 0
2017-01-04 1166.5 1
2017-01-05 1181.4 2
2017-01-06 1175.7 1
... ...
2018-04-23 1326.0 7
2018-04-24 1333.2 8
2018-04-25 1327.2 7
[374 rows x 2 columns]
Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
Take an average of those 30-day price returns for each decile
level1 = gold.ix[datelist]
level2 = gold.ix[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1,level2, how='inner', left_index=True, right_index=True)
def ret(one, two):
return (two - one)/one
pricereturns = result.apply(lambda x :ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3 but only for a single decile, but I'm struggling to expand this to a looped-code for all 10 deciles at once with a clean CSV output
First append the close price at t + 1 month as a new column on the whole dataframe.
gold2_close = gold.loc[gold.index + pd.DateOffset(months=1), 'close']
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
However practically relevant should be the number of trading days, i.e. you won't have prices for the weekend or holidays. So I'd suggest you shift by number of rows, not by daterange, i.e. the next 20 trading days
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1. and 2. directly so you don't need the additional column (only if you shift by number of rows, not by fixed daterange due to reindexing)
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data you get something like the dataframe below. close and ret are the averages for each decile. You can create a csv from a dataframe via pandas.DataFrame.to_csv
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938

Need to add values in a column corresponding to the week (Python/Pandas)

I have a dataframe containing dates and prices. I need to add all prices belonging to the week of ex: 17/12 to 23/12 and put it infront of a new label corresponding to that week.
Date Price
12/17/2015 10
12/18/2015 20
12/19/2015 30
12/21/2015 40
12/24/2015 50
I want the output to be the following
week total
17/12-23/12 100
24/12-30/12 50
I tried using different datetime functions and groupby functions but was not able to get the o/p. Please help
what about this approach?
In [19]: df.groupby(df.Date.dt.weekofyear)['Price'].sum().rename_axis('week_no').reset_index(name='total')
Out[19]:
week_no total
0 51 60
1 52 90
UPDATE:
In [49]: df.resample(on='Date', rule='7D', base='4D').sum().rename_axis('week_from') \
.reset_index('total')
Out[49]:
week_from Price
0 2015-12-17 100
1 2015-12-24 50
UPDATE2:
x = (df.resample(on='Date', rule='7D', base='4D')
.sum()
.reset_index()
.rename(columns={'Price':'total'}))
x = x.assign(week=x['Date'].dt.strftime('%d/%m')
+'-'
+(x.pop('Date')+pd.DateOffset(days=7)).dt.strftime('%d/%m'))
In [127]: x
Out[127]:
total week
0 100 17/12-24/12
1 50 24/12-31/12
Using resample
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(df.Date, inplace = True)
df = df.resample('W').sum()
Price
Date
2015-12-20 60
2015-12-27 90

Speeding up past-60-day mean in pandas

I use data from a past kaggle challenge based on panel data across a number of stores and a period spanning 2.5 years. Each observation includes the number of customers for a given store-date. For each store-date, my objective is to compute the average number of customers that visited this store during the past 60 days.
Below is code that does exactly what I need. However, it lasts forever - it would take a night to process the c.800k rows. I am looking for a clever way to achieve the same objective faster.
I have included 5 observations of the initial dataset with the relevant variables: store id (Store), Date and number of customers ("Customers").
Note:
For each row in the iteration, I end up writing the results using .loc instead of e.g. row["Lagged No of customers"] because "row" does not write anything in the cells. I wonder why that's the case.
I normally populate new columns using "apply, axis = 1" so I would really appreciate any solution based on that. I found that "apply" works fine when for each row, computation is done across columns using values at the same row level. However, I don't know how an "apply" function can involve different rows, which is what this problem requires. the only exception I have seen so far is "diff", which is not useful here.
Thanks.
Sample data:
pd.DataFrame({
'Store': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'Customers': {0: 668, 1: 578, 2: 619, 3: 635, 4: 785},
'Date': {
0: pd.Timestamp('2013-01-02 00:00:00'),
1: pd.Timestamp('2013-01-03 00:00:00'),
2: pd.Timestamp('2013-01-04 00:00:00'),
3: pd.Timestamp('2013-01-05 00:00:00'),
4: pd.Timestamp('2013-01-07 00:00:00')
}
})
Code that works but is incredibly slow:
import pandas as pd
import numpy as np
data = pd.read_csv("Rossman - no of cust/dataset.csv")
data.Date = pd.to_datetime(data.Date)
data.Customers = data.Customers.astype(int)
for index, row in data.iterrows():
d = row["Date"]
store = row["Store"]
time_condition = (d - data["Date"]<np.timedelta64(60, 'D')) & (d > data["Date"])
sub_df = data.loc[ time_condition & (data["Store"] == store), :]
data.loc[ (data["Date"]==d) & (data["Store"] == store), "Lagged No customers"] = sub_df["Customers"].sum()
data.loc[ (data["Date"]==d) & (data["Store"] == store), "No of days"] = len(sub_df["Customers"])
if len(sub_df["Customers"]) > 0:
data.loc[ (data["Date"]==d) & (data["Store"] == store), "Av No of customers"] = int(sub_df["Customers"].sum()/len(sub_df["Customers"]))
Given your small sample data, I used a two day rolling average instead of 60 days.
>>> (pd.rolling_mean(data.pivot(columns='Store', index='Date', values='Customers'), window=2)
.stack('Store'))
Date Store
2013-01-03 1 623.0
2013-01-04 1 598.5
2013-01-05 1 627.0
2013-01-07 1 710.0
dtype: float64
By taking a pivot of the data with dates as your index and stores as your columns, you can simply take a rolling average. You then need to stack the stores to get the data back into the correct shape.
Here is some sample output of the original data prior to the final stack:
Store 1 2 3
Date
2015-07-29 541.5 686.5 767.0
2015-07-30 534.5 664.0 769.5
2015-07-31 550.5 613.0 822.0
After .stack('Store'), this becomes:
Date Store
2015-07-29 1 541.5
2 686.5
3 767.0
2015-07-30 1 534.5
2 664.0
3 769.5
2015-07-31 1 550.5
2 613.0
3 822.0
dtype: float64
Assuming the above is named df, you can then merge it back into your original data as follows:
data.merge(df.reset_index(),
how='left',
on=['Date', 'Store'])
EDIT:
There is a clear seasonal pattern in the data for which you may want to make adjustments. In any case, you probably want your rolling average to be in multiples of seven to represent even weeks. I've used a time window of 63 days in the example below (9 weeks).
In order to avoid losing data on stores that just open (and those at the start of the time period), you can specify min_periods=1 in the rolling mean function. This will give you the average value over all available observations for your given time window
df = data.loc[data.Customers > 0, ['Date', 'Store', 'Customers']]
result = (pd.rolling_mean(df.pivot(columns='Store', index='Date', values='Customers'),
window=63, min_periods=1)
.stack('Store'))
result.name = 'Customers_63d_mvg_avg'
df = df.merge(result.reset_index(), on=['Store', 'Date'], how='left')
>>> df.sort_values(['Store', 'Date']).head(8)
Date Store Customers Customers_63d_mvg_avg
843212 2013-01-02 1 668 668.000000
842103 2013-01-03 1 578 623.000000
840995 2013-01-04 1 619 621.666667
839888 2013-01-05 1 635 625.000000
838763 2013-01-07 1 785 657.000000
837658 2013-01-08 1 654 656.500000
836553 2013-01-09 1 626 652.142857
835448 2013-01-10 1 615 647.500000
To more clearly see what is going on, here is a toy example:
s = pd.Series([1,2,3,4,5] + [np.NaN] * 2 + [6])
>>> pd.concat([s, pd.rolling_mean(s, window=4, min_periods=1)], axis=1)
0 1
0 1 1.0
1 2 1.5
2 3 2.0
3 4 2.5
4 5 3.5
5 NaN 4.0
6 NaN 4.5
7 6 5.5
The window is four observations, but note that the final value of 5.5 equals (5 + 6) / 2. The 4.0 and 4.5 values are (3 + 4 + 5) / 3 and (4 + 5) / 2, respectively.
In our example, the NaN rows of the pivot table do not get merged back into df because we did a left join and all the rows in df have one or more Customers.
You can view a chart of the rolling data as follows:
df.set_index(['Date', 'Store']).unstack('Store').plot(legend=False)

Categories

Resources