I have a dataframe, df, and I wish to take the delta of every 7-day period.
df:
Date Value
10/15/2020 75
10/14/2020 70
10/13/2020 65
10/12/2020 60
10/11/2020 55
10/10/2020 50
10/9/2020 45
10/8/2020 40
10/7/2020 35
10/6/2020 30
10/5/2020 25
10/4/2020 20
10/3/2020 15
10/2/2020 10
10/1/2020 5
Desired Output:
Date Value
10/9/2020 30
10/2/2020 30
This is what I am doing, thanks to the help of someone on this platform:
df.Date = pd.to_datetime(df.Date)
s = df.set_index('Date')['Value']
df['New'] = s.shift(freq='-6D').reindex(s.index).values
df['Delta'] = df['New'] - df['Value']
df[['Date','Delta']].dropna()
However, this gives me a running delta; I wish to have the delta displayed for every 7-day period, as shown in the Desired Output.
Any suggestion is appreciated.
I think the way you have done it is basically right; modifying it a bit will give you the desired result. Try this:
df.Date = pd.to_datetime(df.Date)
s = df.set_index('Date')['Value']
df['New'] = s.shift(freq='-6D').reindex(s.index).values
df['Delta'] = df['New'] - df['Value']
df_new = df[['Date','Delta']].dropna()
df_new.iloc[::7, :]   # keep every 7th remaining row
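For what it's worth, if the data is daily with no gaps, a resample-based sketch (my own variant, not the approach above) bins the sorted series into consecutive 7-day windows and takes last minus first in each. Note the labels are the window starts anchored at the earliest date, so they differ from the labels in the desired output, and the trailing one-day window yields 0:

df.Date = pd.to_datetime(df.Date)
s = df.set_index('Date')['Value'].sort_index()
# 7-day bins anchored at the first date; delta = last value - first value per bin
s.resample('7D').agg(lambda x: x.iloc[-1] - x.iloc[0])

Date
2020-10-01    30
2020-10-08    30
2020-10-15     0
Freq: 7D, Name: Value, dtype: int64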
I have a txt file with the following content:
ID1;ID2;TIME;VALUE
1000;100;012021;12
1000;100;022021;4129
1000;100;032021;128
1000;100;042021;412
1000;100;052021;12818
1000;120;022021;4129
1000;100;062021;546
1000;100;072021;86
1000;120;052021;12818
1000;100;082021;754
1000;100;092021;2633
1000;100;102021;571
1000;120;092021;2633
1000;100;112021;2233
1000;100;122021;571
1000;120;012021;12
1000;120;032021;128
1000;120;042021;412
1000;120;062021;546
1000;120;072021;86
1000;120;082021;754
1000;120;102021;571
1000;120;112021;2233
1000;120;122021;571
1000;100;012022;12
1000;100;022022;4129
1000;120;022022;4129
1000;120;032022;128
1000;120;042022;412
1000;100;032022;128
1000;100;042022;412
1000;100;052022;12818
1000;100;062022;546
1000;100;072022;86
1000;100;082022;754
1000;120;072022;86
1000;120;082022;754
1000;120;092022;2633
1000;120;102022;571
1000;100;092022;2633
1000;100;102022;571
1000;100;112022;2233
1000;100;122022;571
1000;120;012022;12
1000;120;052022;12818
1000;120;062022;546
1000;120;112022;2233
1000;120;122022;571
I need to build time aggregates (half year and total year): for rows sharing the same ID1 and ID2, sum the values within each half year and within each year, based on the TIME column.
The output should look like this:
I would appreciate your help! This is what I have so far for half year:
# already sorted by time
data = open("file.txt").readlines()
count = 0
for line in data:
    count += 1
    for n in range(count - 1, len(data), 6):
        subList = [data[n:n + 6]]
    break
I'm far from being a Python expert but how about something like:
from collections import defaultdict

dd = defaultdict(list)
rows = [line.strip().split(';') for line in data[1:]]  # skip the header row
for row in rows:
    mm = row[2][:2]   # month
    yy = row[2][2:]   # year
    vv = int(row[3])
    key = (row[0], row[1], yy)
    dd[key].append([mm, yy, vv])
# print("Total of all values", sum(int(row[3]) for row in rows))
for k, v in dd.items():
    h1 = sum(c[2] for c in v if c[0] <= '06')  # Jan-Jun
    h2 = sum(c[2] for c in v if c[0] > '06')   # Jul-Dec
    tt = sum(c[2] for c in v)
    # or, much more simply, tt = h1 + h2
    # print(k[0], k[1], k[2], "H1:", h1, "H2:", h2, "TT:", tt)
    print(f"{k[0]};{k[1]};HY1{k[2]};{h1}")
    print(f"{k[0]};{k[1]};HY2{k[2]};{h2}")
    print(f"{k[0]};{k[1]};TY{k[2]};{tt}")
Seems to give correct results for the data supplied. Might not be efficient if you have huge amounts of data. YMMV.
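For the sample data above, the first group printed by this loop works out, by direct addition of the 1000;100 rows for 2021, to:

1000;100;HY12021;18045
1000;100;HY22021;6848
1000;100;TY2021;24893

which lines up with the pandas results further down.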
This is a task that is very well suited for the pandas library, which is designed to work with tabular data. A way to do this in pandas would be something like this:
import pandas as pd
# Read the data from the text file (here I saved it as data.csv).
# Make sure the TIME column is not read as an integer by declaring
# dtype={'TIME': object}; integers would drop the leading zeroes, which
# we need for the conversion to datetime.
df = pd.read_csv('data.csv', delimiter=';', dtype={'TIME': object})
# Convert TIME column to datetime
df['TIME'] = pd.to_datetime(df['TIME'], format='%m%Y')
# Create new column with year
df['Y'] = df['TIME'].dt.year
# Create new column with halfyear (1 = first halfyear, 2 = second halfyear)
df['HY'] = df['TIME'].dt.month.floordiv(7).add(1)
After this, your table looks like this:
df.head() # Just show the first couple of rows
    ID1  ID2       TIME  VALUE     Y  HY
0  1000  100 2021-01-01     12  2021   1
1  1000  100 2021-02-01   4129  2021   1
2  1000  100 2021-03-01    128  2021   1
3  1000  100 2021-04-01    412  2021   1
4  1000  100 2021-05-01  12818  2021   1
Getting the table into the desired format takes a bit of work, but grouping and aggregating then becomes really easy. You can then also perform other grouping and aggregating operations as you please without having to code it all by hand.
To group by year and calculate the sum:
df.groupby(['ID1', 'ID2', 'Y'])[['VALUE']].sum()  # select VALUE so the TIME column is not summed
ID1 ID2 Y VALUE
1000 100 2021 24893
1000 100 2022 24893
1000 120 2021 24893
1000 120 2022 24893
To group by halfyear and calculate the sum:
df.groupby(['ID1', 'ID2', 'Y', 'HY'])[['VALUE']].sum()
ID1 ID2 Y HY VALUE
1000 100 2021 1 18045
1000 100 2021 2 6848
1000 100 2022 1 18045
1000 100 2022 2 6848
1000 120 2021 1 18045
1000 120 2021 2 6848
1000 120 2022 1 18045
1000 120 2022 2 6848
Edit: Added a datetime format specifier to correctly read the date as MMYYYY instead of MMDDYY. Thanks to shunty for mentioning it! (Only the displayed TIME values change; the grouped sums are the same either way, since month and year parse identically under both formats.)
I have a dataframe containing two columns of dates: start date and end date. I need to set up a dataframe where all months of the year are separate columns, based on the start and end date intervals, so that I can sum values from another column for each of the months per name.
To illustrate:
Original df:
Start Date End Date Name Value
10/22/20 01/25/21 John 100
10/12/20 04/30/21 John 50
02/25/21 None John 20
Desired df:
Name Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 Jul_21 Aug_21 ...
John 150 150 150 150 70 70 70 20 20 20 20 ...
Any suggestions or pointers on how I could achieve that result would be greatly appreciated!
First convert the values to datetimes, coercing non-dates to missing values and filling those with a chosen end date. Then, in a list comprehension, expand each row into one entry per month and concatenate them into a Series, which is used for pivoting by DataFrame.pivot_table:
end = pd.Timestamp('2021-12-31')  # a Timestamp, so fillna keeps the datetime dtype
df['Start'] = pd.to_datetime(df['Start Date'])
df['End'] = pd.to_datetime(df['End Date'], errors='coerce').fillna(end)
s = pd.concat([pd.Series(r.Index, pd.date_range(r.Start, r.End, freq='M'))
               for r in df.itertuples()])
df1 = pd.DataFrame({'Date': s.index}, s).join(df)
df2 = df1.pivot_table(index='Name',
columns='Date',
values='Value',
aggfunc='sum',
fill_value=0)
df2.columns = df2.columns.strftime('%b_%y')
print (df2)
Date Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 \
Name
John 150 150 150 50 70 70 70 20 20
Date Jul_21 Aug_21 Sep_21 Oct_21 Nov_21 Dec_21
Name
John 20 20 20 20 20 20
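Note that freq='M' generates month-end stamps, so a row whose End falls before the last day of a month is not counted for that month; that is why Jan_21 shows 50 here rather than the 150 in the desired output (John's first interval ends 01/25/21). If every month touched by an interval should count, one option (a sketch, keeping the rest of the pipeline unchanged) is to snap each Start to the first of its month and generate month starts instead:

s = pd.concat([pd.Series(r.Index,
                         pd.date_range(r.Start.to_period('M').to_timestamp(),
                                       r.End, freq='MS'))
               for r in df.itertuples()])

With that change Jan_21 becomes 150, matching the desired df.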
I have example dataframe in yearly granularity:
df = pd.DataFrame({
"date": ["2020-01-01", "2021-01-01", "2022-01-01"],
"cost": [100, 1000, 150],
"person": ["Tom","Jerry","Brian"]
})
I want to create a dataframe with monthly granularity without any estimation methods (just repeat each row 12 times, once per month of its year). So as a result, from this 3-row dataframe I would like to get exactly 36 rows, like:
2020-01-01 / 100 / Tom
2020-02-01 / 100 / Tom
2020-03-01 / 100 / Tom
2020-04-01 / 100 / Tom
2020-05-01 / 100 / Tom
[...]
2022-10-01 / 150 / Brian
2022-11-01 / 150 / Brian
2022-12-01 / 150 / Brian
I tried
df.resample('M', on = 'date').apply(lambda x:x)
but I can't seem to get it working...
I'm a beginner, so forgive my ignorance.
Thanks for help in advance!
Here is a way to do that.
count = len(df)
for var in df[['date','cost','person']].values:
    # months 2..12: splice the month number into the yyyy-mm-dd string
    for i in range(2, 13):
        df.loc[count] = [var[0][0:5] + "{:02d}".format(i) + var[0][7:], var[1], var[2]]
        count += 1
df = df.sort_values('date')
The following should also work:
# Typecasting
df['date'] = pd.to_datetime(df['date'])
# Making a new dataframe with one row per month, spanning the full range
op = pd.DataFrame(pd.date_range(start=df['date'].min(),
                                end=df['date'].max() + pd.offsets.DateOffset(months=11),
                                freq='MS'),
                  columns=['date'])
# Merging both on year (outer join), so every month picks up its year's row
res = pd.merge(df, op,
               left_on=df['date'].apply(lambda x: x.year),
               right_on=op['date'].apply(lambda x: x.year),
               how='outer')
# Dropping the merge key and the original yearly date column
res.drop(['key_0', 'date_x'], axis=1, inplace=True)
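For what it's worth, a shorter route (a sketch, assuming each input row represents exactly one calendar year) is to repeat every row 12 times and bump the month on each copy via monthly periods:

import numpy as np
import pandas as pd

df['date'] = pd.to_datetime(df['date'])        # no-op if already converted
out = df.loc[df.index.repeat(12)].reset_index(drop=True)
offsets = np.tile(np.arange(12), len(df))      # 0..11 for each original row
out['date'] = (out['date'].dt.to_period('M') + offsets).dt.to_timestamp()

This yields the 36 rows in order: Tom's twelve months of 2020, then Jerry's of 2021, then Brian's of 2022.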
I have a dataframe containing dates and prices. I need to add up all prices belonging to a given week, e.g. 17/12 to 23/12, and put the total in front of a new label corresponding to that week.
Date Price
12/17/2015 10
12/18/2015 20
12/19/2015 30
12/21/2015 40
12/24/2015 50
I want the output to be the following
week total
17/12-23/12 100
24/12-30/12 50
I tried using different datetime functions and groupby functions but was not able to get the output. Please help.
What about this approach?
In [19]: df.groupby(df.Date.dt.weekofyear)['Price'].sum().rename_axis('week_no').reset_index(name='total')
Out[19]:
week_no total
0 51 60
1 52 90
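A note for recent pandas: Series.dt.weekofyear was deprecated and later removed; on current versions the ISO week number comes from .dt.isocalendar(), so the equivalent should be:

df.groupby(df.Date.dt.isocalendar().week)['Price'].sum().rename_axis('week_no').reset_index(name='total')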
UPDATE:
In [49]: df.resample(on='Date', rule='7D', base='4D').sum().rename_axis('week_from') \
             .reset_index()
Out[49]:
week_from Price
0 2015-12-17 100
1 2015-12-24 50
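(Aside: the base argument was removed in pandas 2.0 in favour of origin/offset. Since 1970-01-01 fell on a Thursday, anchoring the 7-day bins at the epoch should reproduce these Thursday-start buckets on recent versions; treat this as a hedged equivalent rather than a tested drop-in:)

df.resample(on='Date', rule='7D', origin='epoch').sum()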
UPDATE2:
x = (df.resample(on='Date', rule='7D', base='4D')
       .sum()
       .reset_index()
       .rename(columns={'Price': 'total'}))

# label each bucket as "start-end", where end = start + 6 days,
# so the bucket reads 17/12-23/12 as in the question
x = x.assign(week=x['Date'].dt.strftime('%d/%m')
             + '-'
             + (x.pop('Date') + pd.DateOffset(days=6)).dt.strftime('%d/%m'))
In [127]: x
Out[127]:
   total         week
0    100  17/12-23/12
1     50  24/12-30/12
Using resample
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(df.Date, inplace = True)
df = df.resample('W').sum()
Price
Date
2015-12-20 60
2015-12-27 90
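Since the buckets in the question run Thursday through Wednesday, it is worth noting that the weekly alias can be anchored on the closing weekday; plain 'W' ends weeks on Sunday, which is why the sums above are 60 and 90 rather than the 100 and 50 asked for. Using the same date-indexed frame:

df.resample('W-WED').sum()

            Price
Date
2015-12-23    100
2015-12-30     50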
I would like to shift the index of a pandas.dataframe by one quarter. The dataframe looks like:
ID Nowcast Forecast
1991-01-01 35 4144.70 4137.40
1991-01-01 40 4114.00 4105.00
1991-01-01 60 4135.00 4130.00
....
So far, I count the number of occurrences of the first timestamp 1991-01-01 and shift the dataframe accordingly. The code is:
Stamps = df.index.unique()
zero = 0
for val in df.index:
    if val == Stamps[0]:
        zero = zero + 1
df = df.shift(zero)
The operation results in the following dataframe:
ID Nowcast Forecast
1991-04-01 35.0 4144.70 4137.40
1991-04-01 40.0 4114.00 4105.00
1991-04-01 60.0 4135.00 4130.00
The way I'm doing this strikes me as inefficient and error-prone. Is there a better way?
You can use pd.DateOffset():
In [110]: df
Out[110]:
ID Nowcast Forecast
1991-01-01 35 4144.7 4137.4
1991-01-01 40 4114.0 4105.0
1991-01-01 60 4135.0 4130.0
In [111]: df.index += pd.DateOffset(months=3)
In [112]: df
Out[112]:
ID Nowcast Forecast
1991-04-01 35 4144.7 4137.4
1991-04-01 40 4114.0 4105.0
1991-04-01 60 4135.0 4130.0
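If the stamps always sit on quarter starts, another option (a minimal sketch, not from the answer above) is to shift the index by one quarter-start period:

# moves 1991-01-01 to 1991-04-01; assumes index values lie on 'QS' anchors
df.index = df.index.shift(1, freq='QS')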