Create timeseries data - Pandas - python

I have a multi-index dataframe of timeseries data which looks like the following:
A B C
1 1 21 32 4
2 4 2 23
3 12 9 10
4 1 56 37
.
.
.
.
30 63 1 27
31 32 2 32
.
.
.
12 1 2 3 23
2 23 1 12
3 32 3 23
.
.
.
31 23 2 32
It is essentially a multi-index of month and day with three columns.
I need to turn this into daily data: a dataframe with a single date index, where each value in the above dataframe corresponds to its respective date, repeated over 10 years.
For example, the desired output is:
A B C
01/01/2017 21 32 4
.
.
31/12/2017 23 2 32
.
.
01/01/2022 21 32 4
.
.
31/12/2022 23 2 32
I hope this is clear! It's essentially turning daily/monthly data into daily/monthly/yearly data.

You can use:
df.index = pd.to_datetime(df.index.rename(['month', 'day']).to_frame().assign(year=2022))
Output:
A B C
2022-01-01 21 32 4
2022-01-02 4 2 23
2022-01-03 12 9 10
2022-01-04 1 56 37
2022-01-30 63 1 27
2022-01-31 32 2 32
2022-12-01 2 3 23
2022-12-02 23 1 12
2022-12-03 32 3 23
2022-12-31 23 2 32
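If the same month/day pattern has to be repeated over several years, as the question asks, one possible sketch is to build one dated copy per year and concatenate them. The year range and the February rows below are assumptions made for illustration, and errors="coerce" is used so that 29 February becomes NaT (and is dropped) in non-leap years:

```python
import pandas as pd

# Toy (month, day) frame; the two February rows are hypothetical.
idx = pd.MultiIndex.from_tuples([(1, 1), (2, 28), (2, 29), (12, 31)])
df = pd.DataFrame(
    {"A": [21, 4, 12, 23], "B": [32, 2, 9, 2], "C": [4, 23, 10, 32]},
    index=idx,
)

month_day = df.index.rename(["month", "day"]).to_frame(index=False)

parts = []
for year in range(2016, 2018):  # assumed range; widen it to cover 10 years
    # Invalid dates such as 2017-02-29 become NaT instead of raising.
    dates = pd.to_datetime(month_day.assign(year=year), errors="coerce")
    part = df.copy()
    part.index = pd.DatetimeIndex(dates)
    parts.append(part[part.index.notna()])

daily = pd.concat(parts)
print(daily)
```

Each yearly copy keeps the original row order, so the concatenated result is sorted by date as long as the input is sorted by month and day.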
Spanning several years
There is no absolutely foolproof way to handle years if they are missing. What we can do is infer a year change whenever a date goes back in time, and add 1 year in that case:
# let's assume the starting year is 2017
date = pd.to_datetime(df.index.rename(['month', 'day']).to_frame().assign(year=2017))
df.index = date + date.diff().lt('0').cumsum().mul(pd.DateOffset(years=1))
output:
A B C
2017-01-01 21 32 4
2017-01-02 4 2 23
2017-06-03 12 9 10
2017-06-04 1 56 37
2018-01-30 63 1 27 # added 1 year
2018-01-31 32 2 32
2018-12-01 2 3 23
2018-12-02 23 1 12
2018-12-03 32 3 23
2018-12-31 23 2 32
Input used:
A B C
1 1 21 32 4
2 4 2 23
6 3 12 9 10
4 1 56 37
1 30 63 1 27 # here we go back from month 1 after month 6
31 32 2 32
12 1 2 3 23
2 23 1 12
3 32 3 23
31 23 2 32
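The same year inference can also be sketched with plain integers instead of date offsets. This is a variant, not the code above: it encodes each (month, day) pair as month * 100 + day and bumps the year whenever that key decreases. The starting year 2017 is the same assumption as before:

```python
import pandas as pd

# (month, day) pairs from the input above
idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (6, 3), (6, 4), (1, 30),
     (1, 31), (12, 1), (12, 2), (12, 3), (12, 31)],
    names=["month", "day"],
)
md = idx.to_frame(index=False)

# A decreasing key means the calendar wrapped around into a new year.
key = md["month"] * 100 + md["day"]
year = 2017 + key.diff().lt(0).cumsum()

dates = pd.to_datetime(md.assign(year=year))
print(dates)
```

Like the original, this cannot detect a gap of more than one year between consecutive rows.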

Related

Iterate over rows and calculate values

I have the following pandas dataframe:
temp stage issue_datetime
20 1 2022/11/30 19:20
21 1 2022/11/30 19:21
20 1 None
25 1 2022/11/30 20:10
30 2 None
22 2 2022/12/01 10:00
22 2 2022/12/01 10:01
31 3 2022/12/02 11:00
32 3 2022/12/02 11:01
19 1 None
20 1 None
I want to get the following result:
temp stage num_issues
20 1 3
21 1 3
20 1 3
25 1 3
30 2 2
22 2 2
22 2 2
31 3 2
32 3 2
19 1 0
20 1 0
Basically, I need to calculate the number of non-None values within each run of consecutive identical stage values and store it in a new column called num_issues.
How can I do it?
You can label the blocks of consecutive values with a cumsum over the points where stage changes, then group by those blocks and count the non-null values:
blocks = df['stage'].ne(df['stage'].shift()).cumsum()
df['num_issues'] = df['issue_datetime'].notna().groupby(blocks).transform('sum')
# or
# df['num_issues'] = df['issue_datetime'].groupby(blocks).transform('count')
Output:
temp stage issue_datetime num_issues
0 20 1 2022/11/30 19:20 3
1 21 1 2022/11/30 19:21 3
2 20 1 None 3
3 25 1 2022/11/30 20:10 3
4 30 2 None 2
5 22 2 2022/12/01 10:00 2
6 22 2 2022/12/01 10:01 2
7 31 3 2022/12/02 11:00 2
8 32 3 2022/12/02 11:01 2
9 19 1 None 0
10 20 1 None 0
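For reference, a self-contained version of the above that rebuilds the question's data (assuming the missing timestamps are real None values rather than the string "None") and also shows the intermediate block labels:

```python
import pandas as pd

df = pd.DataFrame({
    "temp": [20, 21, 20, 25, 30, 22, 22, 31, 32, 19, 20],
    "stage": [1, 1, 1, 1, 2, 2, 2, 3, 3, 1, 1],
    "issue_datetime": [
        "2022/11/30 19:20", "2022/11/30 19:21", None, "2022/11/30 20:10",
        None, "2022/12/01 10:00", "2022/12/01 10:01",
        "2022/12/02 11:00", "2022/12/02 11:01", None, None,
    ],
})

# A new block starts wherever stage differs from the previous row.
blocks = df["stage"].ne(df["stage"].shift()).cumsum()
print(blocks.tolist())  # the second run of stage 1 gets its own label

df["num_issues"] = df["issue_datetime"].notna().groupby(blocks).transform("sum")
print(df)
```

Grouping by stage directly would merge the two separate runs of stage 1, which is why the block labels are needed.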

Categorise hour into four different slots of 15 mins

I am working on a dataframe and I want to group the data for each hour into 4 different slots of 15 mins:
0-15 - 1st slot
15-30 - 2nd slot
30-45 - 3rd slot
45-00(or 60) - 4th slot
I can't work out how to go forward with this.
I tried extracting hours, minutes and seconds from the time, but what should I do next?
Use integer division by 15 and then add 1:
df = pd.DataFrame({'M': range(60)})
df['slot'] = df['M'] // 15 + 1
print(df)
M slot
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 3
31 31 3
32 32 3
33 33 3
34 34 3
35 35 3
36 36 3
37 37 3
38 38 3
39 39 3
40 40 3
41 41 3
42 42 3
43 43 3
44 44 3
45 45 4
46 46 4
47 47 4
48 48 4
49 49 4
50 50 4
51 51 4
52 52 4
53 53 4
54 54 4
55 55 4
56 56 4
57 57 4
58 58 4
59 59 4
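If the data holds full timestamps rather than a ready-made minutes column, the same integer division can be applied to the minute component. A small sketch, with made-up sample times:

```python
import pandas as pd

# Hypothetical timestamps, chosen to hit the slot boundaries
times = pd.to_datetime(pd.Series([
    "2023-01-01 10:00:00",
    "2023-01-01 10:14:59",
    "2023-01-01 10:15:00",
    "2023-01-01 10:44:00",
    "2023-01-01 10:59:59",
]))

# Minutes 0-14 -> 1, 15-29 -> 2, 30-44 -> 3, 45-59 -> 4
slots = times.dt.minute // 15 + 1
print(slots.tolist())
```

To group per hour and slot, the hour (times.dt.hour) and the slot can be passed together to groupby.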

Merge dataframes including extreme values

I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge dataframes but at the same time including the first and/or last value of the set in column A. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only keeps the rows that the two dataframes have in common. Does anyone have an idea how to deal with this? Thanks!
Here's one way to do it using merge with indicator, groupby, and rolling:
# group_keys=False keeps the original row index so the mask aligns with df1
df1[df1.merge(df2, on='B', how='left', indicator='Ind').eval('Found=Ind == "both"')
.groupby('A', group_keys=False)['Found']
.apply(lambda x: x.rolling(3, center=True, min_periods=2).max()).astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
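The idea behind the rolling window is "keep every matched row plus its immediate neighbours within the same group of A". A minimal sketch of that idea using groupby shifts instead of rolling (a swapped-in technique, not the code above), with the question's data rebuilt inline:

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": [1] * 6 + [2] * 2 + [3] * 5 + [4] * 4,
    "B": [11, 2, 32, 42, 54, 66, 16, 23, 13, 24, 35, 46, 51, 12, 28, 39, 49],
})
df2 = pd.DataFrame({"B": [32, 42, 13, 24, 35, 39, 49]})

# 1 where B also appears in df2, else 0
found = df1["B"].isin(df2["B"]).astype(int)

# Within each group of A, look one row back and one row ahead.
after_match = found.groupby(df1["A"]).shift(1, fill_value=0)
before_match = found.groupby(df1["A"]).shift(-1, fill_value=0)

df3 = df1[(found + after_match + before_match) > 0]
print(df3)
```

Because the shifts are done per group, a match at the start of one group never pulls in the last row of the previous group.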
Alternatively, concatenate the minimum and maximum of each group in A with the merged rows:
pd.concat([df1.groupby('A').min().reset_index(), pd.merge(df1,df2, on="B"), df1.groupby('A').max().reset_index()]).reset_index(drop=True).drop_duplicates().sort_values(['A','B'])
A B
0 1 2
4 1 32
5 1 42
1 2 16
2 3 13
7 3 24
8 3 35
3 4 12
9 4 39
10 4 49
Breaking down each part
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the Index and drop duplicated rows since there may be similarities between the Merge and Min/Max. Sort values by 'A' then by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])

Pandas DataFrame Return Value from Column Index

I have a dataframe whose values are column numbers of another dataframe. Is there a way to return the corresponding value from the other dataframe instead of just having the column index?
I basically want to match up the index between the Push and df dataframes. The values in the Push dataframe contain what column I want to return from the df dataframe.
Push dataframe:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
df dataframe:
0 1 2 3 4
0 10 11 22 33 44
1 10 11 22 33 44
2 10 11 22 33 44
3 10 11 22 33 44
4 10 11 22 33 44
return:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
You can do it with np.take. However, this function works on the flattened array, so push must first be shifted by each row's offset, like this:
In [285]: push1 = push.values+np.arange(0,25,5)[:,None]
In [229]: pd.DataFrame(df.values.take(push1))
EDIT
Actually, that just reinvents np.choose:
In [24]: df
Out[24]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 20 21 22 23 24
3 30 31 32 33 34
4 40 41 42 43 44
In [25]: push
Out[25]:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
In [27]: np.choose(push.T,df).T
Out[27]:
0 1
0 1 2
1 10 13
2 20 23
3 31 33
4 40 42
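On NumPy 1.15 or later, the same row-wise lookup can probably be written more directly with np.take_along_axis, which picks, for each row, the columns listed in the matching row of push. This is a sketch of an alternative, not the np.choose call above; it works on the underlying arrays and so assumes the values in push are positional column numbers:

```python
import numpy as np
import pandas as pd

# Same df and push as in the np.choose example
df = pd.DataFrame(np.arange(5)[None, :] + 10 * np.arange(5)[:, None])
push = pd.DataFrame([[1, 2], [0, 3], [0, 3], [1, 3], [0, 2]])

# result[i, j] = df.iloc[i, push.iloc[i, j]]
picked = np.take_along_axis(df.to_numpy(), push.to_numpy(), axis=1)
result = pd.DataFrame(picked, index=push.index, columns=push.columns)
print(result)
```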
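Use melt and then replace (here df1 is your push and df2 is your df). Note that this relies on every row of df holding the same values, so that each column label maps to a single value: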
df1.astype(str).replace(df2.melt().drop_duplicates().set_index('variable').value.to_dict())
Out[31]:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22

performing differences between rows in pandas based on columns values

I have this dataframe, I'm trying to create a new column where I want to store the difference of products sold based on code and date.
for example this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1=full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values(['code'], ascending=[False])
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
that gives me back this output for a single date 20150609 :
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better, more pythonic way to get the same result?
I would like to create a new column [difference] and store the data there, so that the result has 4 columns [date, code, sold, difference].
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
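Applied to data shaped like the question's, the only extra step is sorting by date and by descending code first, so that the differences come out in the same order as in the single-date attempt. A sketch using a few rows copied from the question (three codes for 20150521 and all nine for 20150609):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20150521"] * 3 + ["20150609"] * 9,
    "code": [0, 4, 8, 0, 4, 8, 12, 16, 20, 24, 28, 32],
    "sold": [47, 43, 43, 50, 49, 49, 46, 42, 39, 37, 36, 33],
})

# Sort within each date by descending code, as in the question's attempt.
df = df.sort_values(["date", "code"], ascending=[True, False]).reset_index(drop=True)

# diff() restarts at every new date, so the first row of each date is NaN.
df["difference"] = df.groupby("date")["sold"].diff()
print(df)
```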
