Transpose and Reorder - python

This is a sample of my df, consisting of temperature and rain (mm) per city:
Datetime             Berlin_temperature  Dublin_temperature  London_temperature  Paris_temperature  Berlin_rain  Dublin_rain  London_rain  Paris_rain
2022-01-01 10:00:00  24                  24                  24                  24                 10           10           10           10
2022-01-01 11:00:00  24                  24                  24                  24                 10           10           10           10
2022-01-01 12:00:00  24                  24                  24                  24                 10           10           10           10
2022-01-01 13:00:00  24                  24                  24                  24                 10           10           10           10
I want to achieve the following output as a dataframe:
Datetime             City    Temperature  Rainfall
2022-01-01 10:00:00  Berlin  24           10
2022-01-01 10:00:00  Dublin  24           10
2022-01-01 10:00:00  London  24           10
2022-01-01 10:00:00  Paris   24           10
2022-01-01 11:00:00  Berlin  24           10
2022-01-01 11:00:00  Dublin  24           10
2022-01-01 11:00:00  London  24           10
2022-01-01 11:00:00  Paris   24           10
2022-01-01 12:00:00  ...     ...          ...
At the moment I don't know how to achieve this by transposing or something similar. How would this be possible?

Use DataFrame.stack with a MultiIndex created by splitting the column names on _ - but first convert Datetime to the index with DataFrame.set_index:
# move Datetime to the index so only the city_measure columns are reshaped
df1 = df.set_index('Datetime')
# split 'Berlin_temperature' -> ('Berlin', 'temperature') into a column MultiIndex
df1.columns = df1.columns.str.split('_', expand=True)
# stack the first level (city) into rows, then restore a flat index
df1 = df1.stack(0).rename_axis(['Datetime', 'City']).reset_index()
print(df1)
Datetime City rain temperature
0 2022-01-01 10:00:00 Berlin 10 24
1 2022-01-01 10:00:00 Dublin 10 24
2 2022-01-01 10:00:00 London 10 24
3 2022-01-01 10:00:00 Paris 10 24
4 2022-01-01 11:00:00 Berlin 10 24
5 2022-01-01 11:00:00 Dublin 10 24
6 2022-01-01 11:00:00 London 10 24
7 2022-01-01 11:00:00 Paris 10 24
8 2022-01-01 12:00:00 Berlin 10 24
9 2022-01-01 12:00:00 Dublin 10 24
10 2022-01-01 12:00:00 London 10 24
11 2022-01-01 12:00:00 Paris 10 24
12 2022-01-01 13:00:00 Berlin 10 24
13 2022-01-01 13:00:00 Dublin 10 24
14 2022-01-01 13:00:00 London 10 24
15 2022-01-01 13:00:00 Paris 10 24
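If you also want the column names and order shown in the desired output, a small follow-up rename could be added (names taken from the question):
df1 = df1.rename(columns={'temperature': 'Temperature', 'rain': 'Rainfall'})
df1 = df1[['Datetime', 'City', 'Temperature', 'Rainfall']]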

Using janitor.pivot_longer:
# pip install pyjanitor
import janitor

out = df.pivot_longer(index='Datetime', names_to=('City', '.value'),
                      names_sep='_', sort_by_appearance=True)
Output:
Datetime City temperature rain
0 2022-01-01 10:00:00 Berlin 24 10
1 2022-01-01 10:00:00 Dublin 24 10
2 2022-01-01 10:00:00 London 24 10
3 2022-01-01 10:00:00 Paris 24 10
4 2022-01-01 11:00:00 Berlin 24 10
5 2022-01-01 11:00:00 Dublin 24 10
6 2022-01-01 11:00:00 London 24 10
7 2022-01-01 11:00:00 Paris 24 10
8 2022-01-01 12:00:00 Berlin 24 10
9 2022-01-01 12:00:00 Dublin 24 10
10 2022-01-01 12:00:00 London 24 10
11 2022-01-01 12:00:00 Paris 24 10
12 2022-01-01 13:00:00 Berlin 24 10
13 2022-01-01 13:00:00 Dublin 24 10
14 2022-01-01 13:00:00 London 24 10
15 2022-01-01 13:00:00 Paris 24 10


Remove row comparing two datetime [duplicate]

I have a dataset of New York taxi rides. There are some wrong values in the pickup_datetime and dropoff_datetime columns, because the dropoff is before the pickup. How can I compare these two values and drop those rows?
You can do it like this:
import pandas as pd
import numpy as np

df = pd.read_excel("Taxi.xlsx")

# convert to datetime
df["Pickup"] = pd.to_datetime(df["Pickup"])
df["Drop"] = pd.to_datetime(df["Drop"])

# flag the rows to remove (pickup after drop)
df["ToRemove"] = np.where(df["Pickup"] > df["Drop"], 1, 0)
print(df)

# keep only the rows that are not flagged
dfClean = df[df["ToRemove"] == 0]
print(dfClean)
Result:
df:
Pickup Drop Price
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3
3 2022-01-01 11:00:00 2022-01-01 10:05:00 10
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5
5 2022-01-01 11:40:00 2022-01-01 08:45:00 8
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3
dfClean:
Pickup Drop Price ToRemove
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5 0
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8 0
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3 0
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5 0
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3 0
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10 0
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5 0
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8 0
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3 0
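If you don't need the helper column, a more concise version (same column names assumed) is a plain boolean mask:
# keep only rows where the pickup is not after the drop
dfClean = df[df["Pickup"] <= df["Drop"]]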

Is there a method that replaces NaN values with the values of the previous 24 hours?

I would like to replace the NaN value of day 20211021 - HOUR 1 with the value of day 20211020 - HOUR 1, the value of day 20211021 - HOUR 2 with the value of day 20211020 - HOUR 2, and so on.
timestamp            data      year  month  day  hour  solar_total
2021-10-20 00:00:00  20211020  2021  10     20   1     0.0
2021-10-20 01:00:00  20211020  2021  10     20   2     0.0
2021-10-20 02:00:00  20211020  2021  10     20   3     0.0
2021-10-20 03:00:00  20211020  2021  10     20   4     0.0
2021-10-20 04:00:00  20211020  2021  10     20   5     0.0
2021-10-20 05:00:00  20211020  2021  10     20   6     0.0
2021-10-20 06:00:00  20211020  2021  10     20   7     0.0
2021-10-20 07:00:00  20211020  2021  10     20   8     65.0
2021-10-20 08:00:00  20211020  2021  10     20   9     1498.0
2021-10-20 09:00:00  20211020  2021  10     20   10    4034.0
2021-10-20 10:00:00  20211020  2021  10     20   11    6120.0
2021-10-20 11:00:00  20211020  2021  10     20   12    7450.0
2021-10-20 12:00:00  20211020  2021  10     20   13    7943.0
2021-10-20 13:00:00  20211020  2021  10     20   14    7821.0
2021-10-20 14:00:00  20211020  2021  10     20   15    7058.0
2021-10-20 16:00:00  20211020  2021  10     20   17    3664.0
2021-10-20 17:00:00  20211020  2021  10     20   18    1375.0
2021-10-20 18:00:00  20211020  2021  10     20   19    11.0
2021-10-20 19:00:00  20211020  2021  10     20   20    0.0
2021-10-20 20:00:00  20211020  2021  10     20   21    0.0
2021-10-20 21:00:00  20211020  2021  10     20   22    0.0
2021-10-20 22:00:00  20211020  2021  10     20   23    0.0
2021-10-20 23:00:00  20211020  2021  10     20   24    0.0
2021-10-21 00:00:00  20211021  2021  10     21   1     NaN
2021-10-21 01:00:00  20211021  2021  10     21   2     NaN
2021-10-21 02:00:00  20211021  2021  10     21   3     NaN
2021-10-21 03:00:00  20211021  2021  10     21   4     NaN
2021-10-21 04:00:00  20211021  2021  10     21   5     NaN
2021-10-21 05:00:00  20211021  2021  10     21   6     NaN
2021-10-21 06:00:00  20211021  2021  10     21   7     NaN
2021-10-21 07:00:00  20211021  2021  10     21   8     NaN
2021-10-21 08:00:00  20211021  2021  10     21   9     NaN
2021-10-21 09:00:00  20211021  2021  10     21   10    NaN
2021-10-21 10:00:00  20211021  2021  10     21   11    NaN
2021-10-21 11:00:00  20211021  2021  10     21   12    NaN
2021-10-21 12:00:00  20211021  2021  10     21   13    NaN
2021-10-21 13:00:00  20211021  2021  10     21   14    NaN
2021-10-21 14:00:00  20211021  2021  10     21   15    NaN
2021-10-21 15:00:00  20211021  2021  10     21   16    NaN
2021-10-21 16:00:00  20211021  2021  10     21   17    NaN
2021-10-21 17:00:00  20211021  2021  10     21   18    NaN
2021-10-21 18:00:00  20211021  2021  10     21   19    NaN
2021-10-21 19:00:00  20211021  2021  10     21   20    NaN
2021-10-21 20:00:00  20211021  2021  10     21   21    NaN
2021-10-21 21:00:00  20211021  2021  10     21   22    NaN
2021-10-21 22:00:00  20211021  2021  10     21   23    NaN
2021-10-21 23:00:00  20211021  2021  10     21   24    NaN
If your timestamps were complete you would shift by 24; here one hour (2021-10-20 15:00:00) is missing from the first day, so shift your rows by 23:
df['solar_total'] = df['solar_total'].fillna(df['solar_total'].shift(23))
timestamp            data      year  month  day  hour  solar_total
2021-10-20 00:00:00  20211020  2021  10     20   1     0
2021-10-20 01:00:00  20211020  2021  10     20   2     0
2021-10-20 02:00:00  20211020  2021  10     20   3     0
2021-10-20 03:00:00  20211020  2021  10     20   4     0
2021-10-20 04:00:00  20211020  2021  10     20   5     0
2021-10-20 05:00:00  20211020  2021  10     20   6     0
2021-10-20 06:00:00  20211020  2021  10     20   7     0
2021-10-20 07:00:00  20211020  2021  10     20   8     65
2021-10-20 08:00:00  20211020  2021  10     20   9     1498
2021-10-20 09:00:00  20211020  2021  10     20   10    4034
2021-10-20 10:00:00  20211020  2021  10     20   11    6120
2021-10-20 11:00:00  20211020  2021  10     20   12    7450
2021-10-20 12:00:00  20211020  2021  10     20   13    7943
2021-10-20 13:00:00  20211020  2021  10     20   14    7821
2021-10-20 14:00:00  20211020  2021  10     20   15    7058
2021-10-20 16:00:00  20211020  2021  10     20   17    3664
2021-10-20 17:00:00  20211020  2021  10     20   18    1375
2021-10-20 18:00:00  20211020  2021  10     20   19    11
2021-10-20 19:00:00  20211020  2021  10     20   20    0
2021-10-20 20:00:00  20211020  2021  10     20   21    0
2021-10-20 21:00:00  20211020  2021  10     20   22    0
2021-10-20 22:00:00  20211020  2021  10     20   23    0
2021-10-20 23:00:00  20211020  2021  10     20   24    0
2021-10-21 00:00:00  20211021  2021  10     21   1     0
2021-10-21 01:00:00  20211021  2021  10     21   2     0
2021-10-21 02:00:00  20211021  2021  10     21   3     0
2021-10-21 03:00:00  20211021  2021  10     21   4     0
2021-10-21 04:00:00  20211021  2021  10     21   5     0
2021-10-21 05:00:00  20211021  2021  10     21   6     0
2021-10-21 06:00:00  20211021  2021  10     21   7     0
2021-10-21 07:00:00  20211021  2021  10     21   8     65
2021-10-21 08:00:00  20211021  2021  10     21   9     1498
2021-10-21 09:00:00  20211021  2021  10     21   10    4034
2021-10-21 10:00:00  20211021  2021  10     21   11    6120
2021-10-21 11:00:00  20211021  2021  10     21   12    7450
2021-10-21 12:00:00  20211021  2021  10     21   13    7943
2021-10-21 13:00:00  20211021  2021  10     21   14    7821
2021-10-21 14:00:00  20211021  2021  10     21   15    7058
2021-10-21 15:00:00  20211021  2021  10     21   16    3664
2021-10-21 16:00:00  20211021  2021  10     21   17    1375
2021-10-21 17:00:00  20211021  2021  10     21   18    11
2021-10-21 18:00:00  20211021  2021  10     21   19    0
2021-10-21 19:00:00  20211021  2021  10     21   20    0
2021-10-21 20:00:00  20211021  2021  10     21   21    0
2021-10-21 21:00:00  20211021  2021  10     21   22    0
2021-10-21 22:00:00  20211021  2021  10     21   23    0
2021-10-21 23:00:00  20211021  2021  10     21   24    nan
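If timestamps may be missing (as 2021-10-20 15:00:00 is here), a time-based shift is more robust than a fixed row offset. A minimal sketch, assuming timestamp is (or can be parsed to) a datetime column without duplicates:
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
s = df.set_index('timestamp')['solar_total']

# take the value exactly 24 hours earlier for each timestamp; hours missing
# from the previous day simply stay NaN
df['solar_total'] = s.fillna(s.shift(freq='24h')).to_numpy()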

Percent of difference between consecutive and fixed values

A B C D
0 2002-01-12 10:00:00 John 19
1 2002-01-12 11:00:00 Africa 15
2 2002-01-12 12:00:00 Mary 30
3 2002-01-13 09:00:00 Billy 5
4 2002-01-13 11:00:00 Mira 6
5 2002-01-13 12:00:00 Hillary 50
6 2002-01-13 12:00:00 Romina 50
7 2002-01-14 10:00:00 George 30
8 2002-01-14 11:00:00 Denzel 12
9 2002-01-14 11:00:00 Michael 12
10 2002-01-14 12:00:00 Bisc 25
11 2002-01-16 10:00:00 Virgin 16
12 2002-01-16 11:00:00 Antonio 10
13 2002-01-16 12:00:00 Sito 5
I want to create two new columns, df['E'] and df['F'], knowing that the same A and B values always correspond to the same D value:
df['E']: percentage change of the D value with respect to the previous D value.
df['F']: percentage change between D and the previous date's D value at 12:00:00.
Output should be:
A B C D E F
0 2002-01-12 10:00:00 John 19 0 0
1 2002-01-12 11:00:00 Africa 15 -21.05 0
2 2002-01-12 12:00:00 Mary 30 100.00 0
3 2002-01-13 09:00:00 Billy 5 -83.33 -83.33
4 2002-01-13 11:00:00 Mira 6 20.00 -80.00
5 2002-01-13 12:00:00 Hillary 50 733.33 66.66
6 2002-01-13 12:00:00 Romina 50 733.33 66.66
7 2002-01-14 10:00:00 George 30 -40.00 -40.00
8 2002-01-14 11:00:00 Denzel 12 -60.00 -76.00
9 2002-01-14 11:00:00 Michael 12 -60.00 -76.00
10 2002-01-14 12:00:00 Bisc 25 108.33 -50.00
11 2002-01-16 10:00:00 Virgin 16 -36.00 -36.00
12 2002-01-16 11:00:00 Antonio 10 -37.50 -60.00
13 2002-01-16 12:00:00 Sito 5 -50.00 -80.00
Would it be possible to use map to get it?
I've tried:
x = df[df['B'].eq(time(12))].drop_duplicates(subset=['A']).set_index('A')['D'](100 * (df.D - df.D.shift(1)) / df.D.shift(1)).fillna(0)
df['F'] = df['A'].map(x)
Use:
import numpy as np
from datetime import time

df['E'] = df['D'].pct_change().mul(100).replace(0, np.nan).ffill().fillna(0).round(2)
s = df[df['B'].eq(time(12))].drop_duplicates(subset=['A']).set_index('A')['D']
df['F'] = df['D'].div(df['A'].map(s.shift())).sub(1).mul(100).round(2).fillna(0)
print (df)
A B C D E F
0 2002-01-12 10:00:00 John 19 0.00 0.00
1 2002-01-12 11:00:00 Africa 15 -21.05 0.00
2 2002-01-12 12:00:00 Mary 30 100.00 0.00
3 2002-01-13 09:00:00 Billy 5 -83.33 -83.33
4 2002-01-13 11:00:00 Mira 6 20.00 -80.00
5 2002-01-13 12:00:00 Hillary 50 733.33 66.67
6 2002-01-13 12:00:00 Romina 50 733.33 66.67
7 2002-01-14 10:00:00 George 30 -40.00 -40.00
8 2002-01-14 11:00:00 Denzel 12 -60.00 -76.00
9 2002-01-14 11:00:00 Michael 12 -60.00 -76.00
10 2002-01-14 12:00:00 Bisc 25 108.33 -50.00
11 2002-01-16 10:00:00 Virgin 16 -36.00 -36.00
12 2002-01-16 11:00:00 Antonio 10 -37.50 -60.00
13 2002-01-16 12:00:00 Sito 5 -50.00 -80.00
Explanation:
For column E, pct_change is used, then zeros are replaced with NaN and forward filled, so duplicated rows keep the change of their first occurrence.
For column F, D is divided by the previous date's 12:00:00 value of D, obtained by mapping column A against the rows whose B is 12:00:00. For example, in row 3, F = (5 - 30) / 30 * 100 = -83.33, where 30 is the D value at 12:00:00 on the previous date.

New column with "3 period - Simple Moving Average"

A B C
0 2002-01-13 10:00:00 Jack 10
1 2002-01-13 10:00:00 Hellen 10
2 2002-01-13 12:00:00 Sibl 14
3 2002-01-13 12:00:00 Steve 14
4 2002-01-18 10:00:00 Ridley 38
5 2002-01-18 10:00:00 Scott 38
6 2002-01-18 12:00:00 Rambo 52
7 2002-01-18 12:00:00 Peter 52
8 2002-02-09 08:00:00 Brad 90
9 2002-02-09 08:00:00 Victoria 90
10 2002-02-09 14:00:00 Caroline 8
11 2002-02-09 14:00:00 Andrea 8
I want to create a new column df['D'] containing a "3 period - Simple Moving Average" of C, grouped by the A datetime column. If possible, using convolve.
Output should be:
A B C D
0 2002-01-13 10:00:00 Jack 10
1 2002-01-13 10:00:00 Hellen 10
2 2002-01-13 12:00:00 Sibl 14
3 2002-01-13 12:00:00 Steve 14
4 2002-01-18 10:00:00 Ridley 38 20.66
5 2002-01-18 10:00:00 Scott 38 20.66
6 2002-01-18 12:00:00 Rambo 52 34.66
7 2002-01-18 12:00:00 Peter 52 34.66
8 2002-02-09 08:00:00 Brad 90 60.00
9 2002-02-09 08:00:00 Victoria 90 60.00
10 2002-02-09 14:00:00 Caroline 8 50.00
11 2002-02-09 14:00:00 Andrea 8 50.00
Let's try:
# mean of C per timestamp, 3-period rolling mean over those means, mapped back to each row
df['D'] = df.A.map(df.groupby('A')['C'].mean().rolling(3).mean())
Output:
A B C D
0 2002-01-13 10:00:00 Jack 10 NaN
1 2002-01-13 10:00:00 Hellen 10 NaN
2 2002-01-13 12:00:00 Sibl 14 NaN
3 2002-01-13 12:00:00 Steve 14 NaN
4 2002-01-18 10:00:00 Ridley 38 20.666667
5 2002-01-18 10:00:00 Scott 38 20.666667
6 2002-01-18 12:00:00 Rambo 52 34.666667
7 2002-01-18 12:00:00 Peter 52 34.666667
8 2002-02-09 08:00:00 Brad 90 60.000000
9 2002-02-09 08:00:00 Victoria 90 60.000000
10 2002-02-09 14:00:00 Caroline 8 50.000000
11 2002-02-09 14:00:00 Andrea 8 50.000000
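Since the question asks about convolve, here is a sketch of the same 3-period SMA using numpy.convolve on the per-timestamp means (same grouping idea as above; the first two periods have no full window, so they stay NaN):
import numpy as np
import pandas as pd

means = df.groupby('A')['C'].mean()
# valid-mode convolution with a length-3 averaging kernel
sma = np.convolve(means.to_numpy(), np.ones(3) / 3, mode='valid')
# pad the first two periods with NaN so the result aligns with the timestamps
mapping = pd.Series(np.concatenate(([np.nan, np.nan], sma)), index=means.index)
df['D'] = df['A'].map(mapping)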

Resample by some timeframe

I have a dataframe like:
Timestamp Sold
10.01.2017 10:00:20 10
10.01.2017 10:01:55 20
10.01.2017 11:02:11 15
11.01.2017 11:04:30 10
11.01.2017 11:15:35 35
12.01.2017 10:02:01 22
How can I resample it by hour? An ordinary resample covers all hours from the first row to the last, but what I need is to define a timeframe (10:00-11:00) and resample only within that timeframe.
The resulting df should look like this:
Timestamp Sold
10.01.2017 10:00:00 30
10.01.2017 11:00:00 15
11.01.2017 10:00:00 NAN
11.01.2017 11:00:00 45
12.01.2017 10:00:00 22
12.01.2017 11:00:00 NAN
You could do something like this:
# Timestamp must already be a datetime dtype
df_out = df.groupby(df.Timestamp.dt.floor('H')).sum()
df_out = df_out.reset_index()
Output:
Timestamp Sold
0 2017-10-01 10:00:00 30
1 2017-10-01 11:00:00 15
2 2017-11-01 11:00:00 45
3 2017-12-01 10:00:00 22
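The groupby above drops the hours with no data, so the NaN rows from the desired output are missing. A sketch that fills in the fixed 10:00/11:00 timeframe for every date (assuming Timestamp is parsed with dayfirst=True for the dd.mm.yyyy format) could be:
import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)
hourly = df.groupby(df.Timestamp.dt.floor('H'))['Sold'].sum()

# build the full 10:00/11:00 grid for every date, then reindex so missing hours become NaN
dates = hourly.index.normalize().unique()
grid = pd.DatetimeIndex([d + pd.Timedelta(hours=h) for d in dates for h in (10, 11)])
df_out = hourly.reindex(grid).rename_axis('Timestamp').reset_index(name='Sold')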
