I am new to python and pandas and I am trying to solve this problem:
I have a dataset that looks something like this:
timestamp par_1 par_2
1486873206867 0 0
1486873207039 NaN 0
1486873207185 0 NaN
1486873207506 1 0
1486873207518 NaN NaN
1486873207831 1 0
1486873208148 0 NaN
1486873208469 0 1
1486873208479 1 NaN
1486873208793 1 NaN
1486873208959 NaN 1
1486873209111 1 NaN
1486873209918 NaN 0
1486873210075 0 NaN
I want to know the total duration of the event "1" for each parameter. (Parameters can only be NaN, 1 or 0)
I have already tried
df['duration_par_1'] = df.groupby(['par_1'])['timestamp'].apply(lambda x: x.max() - x.min())
but for further processing I only need the duration of the event "1" in new columns, with that duration repeated in every row of the new column, so that it looks like this:
timestamp par_1 par_2 duration_par_1 duration_par2
1486873206867 0 0 2238 1449
1486873207039 NaN 0 2238 1449
1486873207185 0 NaN 2238 1449
1486873207506 1 0 2238 1449
1486873207518 NaN NaN 2238 1449
1486873207831 1 0 2238 1449
1486873208148 0 NaN 2238 1449
1486873208469 0 1 2238 1449
1486873208479 1 NaN 2238 1449
1486873208793 1 NaN 2238 1449
1486873208959 NaN 1 2238 1449
1486873209111 1 NaN 2238 1449
1486873209918 NaN 0 2238 1449
1486873210075 0 NaN 2238 1449
Thanks in advance!
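For reference, the sample frame can be rebuilt like this (a minimal sketch; the values are the ones shown in the question, with NaN written as np.nan):

import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'timestamp': [1486873206867, 1486873207039, 1486873207185, 1486873207506,
                  1486873207518, 1486873207831, 1486873208148, 1486873208469,
                  1486873208479, 1486873208793, 1486873208959, 1486873209111,
                  1486873209918, 1486873210075],
    'par_1': [0, np.nan, 0, 1, np.nan, 1, 0, 0, 1, 1, np.nan, 1, np.nan, 0],
    'par_2': [0, 0, np.nan, 0, np.nan, 0, np.nan, 1, np.nan, np.nan, 1, np.nan, 0, np.nan],
})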
I believe you need to multiply the values of the par columns by the difference of consecutive timestamps, because no values other than 0, 1 and NaN exist in the data:
d = df['timestamp'].diff()
df1 = df.filter(like='par')
# if you need the duration of another value, e.g. 0:
#df1 = df.filter(like='par').eq(0).astype(int)
s = df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_')
df = df.assign(**s)
print (df)
timestamp par_1 par_2 duration_par_1 duration_par_2
0 1486873206867 0.0 0.0 1110 487
1 1486873207039 NaN 0.0 1110 487
2 1486873207185 0.0 NaN 1110 487
3 1486873207506 1.0 0.0 1110 487
4 1486873207518 NaN NaN 1110 487
5 1486873207831 1.0 0.0 1110 487
6 1486873208148 0.0 NaN 1110 487
7 1486873208469 0.0 1.0 1110 487
8 1486873208479 1.0 NaN 1110 487
9 1486873208793 1.0 NaN 1110 487
10 1486873208959 NaN 1.0 1110 487
11 1486873209111 1.0 NaN 1110 487
12 1486873209918 NaN 0.0 1110 487
13 1486873210075 0.0 NaN 1110 487
Explanation:
First get the difference of the timestamp column:
print (df['timestamp'].diff())
0 NaN
1 172.0
2 146.0
3 321.0
4 12.0
5 313.0
6 317.0
7 321.0
8 10.0
9 314.0
10 166.0
11 152.0
12 807.0
13 157.0
Name: timestamp, dtype: float64
Select all columns containing the string par with filter:
print (df.filter(like='par'))
par_1 par_2
0 0.0 0.0
1 NaN 0.0
2 0.0 NaN
3 1.0 0.0
4 NaN NaN
5 1.0 0.0
6 0.0 NaN
7 0.0 1.0
8 1.0 NaN
9 1.0 NaN
10 NaN 1.0
11 1.0 NaN
12 NaN 0.0
13 0.0 NaN
Multiply the filtered columns by d with mul:
print (df1.mul(d, axis=0))
par_1 par_2
0 NaN NaN
1 0.0 0.0
2 0.0 0.0
3 321.0 0.0
4 0.0 0.0
5 313.0 0.0
6 0.0 0.0
7 0.0 321.0
8 10.0 0.0
9 314.0 0.0
10 0.0 166.0
11 152.0 0.0
12 0.0 0.0
13 0.0 0.0
And sum values:
print (df1.mul(d, axis=0).sum())
par_1 1110.0
par_2 487.0
dtype: float64
Convert to integers and prefix the index with add_prefix:
print (df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_'))
duration_par_1 1110
duration_par_2 487
dtype: int32
Finally, create the new columns with assign.
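The same pattern also generalises to any other value: compare with eq first, so the mask is 1 exactly where the parameter equals that value (a sketch; value is a name chosen here for illustration):

value = 0  # value whose total duration you want
d = df['timestamp'].diff()
mask = df.filter(like='par').eq(value).astype(int)   # 1 where parameter == value, else 0
durations = mask.mul(d, axis=0).sum().astype(int).add_prefix(f'duration_{value}_')
df = df.assign(**durations)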
I have a csv file like the table below:
depth  x1   x2   x3
1000   NaN  NaN  NaN
1001   NaN  NaN  NaN
1002   NaN  NaN  NaN
1003   NaN  10   NaN
1004   NaN  NaN  NaN
1005   NaN  NaN  10
1006   NaN  NaN  NaN
1007   10   NaN  NaN
1008   11   NaN  NaN
1009   12   NaN  NaN
1010   13   NaN  NaN
1011   14   NaN  15
1012   15   20   NaN
1013   NaN  NaN  NaN
1014   NaN  NaN  NaN
1015   18   NaN  NaN
1016   19   NaN  NaN
1017   20   NaN  NaN
1018   21   NaN  20
1019   22   NaN  NaN
1020   23   NaN  NaN
1021   24   25   NaN
1022   25   NaN  NaN
1023   26   NaN  NaN
1024   27   NaN  25
1025   28   15   NaN
1026   NaN  NaN  NaN
1027   NaN  NaN  NaN
1028   NaN  NaN  NaN
I want to interpolate between the first and last valid values and then fill the remaining NaN values with zeros.
The result should look like this:
depth  x1   x2           x3
1000   0    0            0
1001   0    0            0
1002   0    0            0
1003   0    10           0
1004   0    11.11111111  0
1005   0    12.22222222  10
1006   0    13.33333333  10.83333333
1007   10   14.44444444  11.66666667
1008   11   15.55555556  12.5
1009   12   16.66666667  13.33333333
1010   13   17.77777778  14.16666667
1011   14   18.88888889  15
1012   15   20           15.71428571
1013   16   20.55555556  16.42857143
1014   17   21.11111111  17.14285714
1015   18   21.66666667  17.85714286
1016   19   22.22222222  18.57142857
1017   20   22.77777778  19.28571429
1018   21   23.33333333  20
1019   22   23.88888889  20.83333333
1020   23   24.44444444  21.66666667
1021   24   25           22.5
1022   25   22.5         23.33333333
1023   26   20           24.16666667
1024   27   17.5         25
1025   28   15           25
1026   28   0            0
1027   28   0            0
1028   28   0            0
I have tried the below code but it does not give the correct result
import pandas as pd
df = pd.read_csv(r"C:\Users\mohamed\OneDrive\Desktop\test_interpolate.csv")
df = df.interpolate()
df = df.fillna(0)
print (df)
df.to_csv(r"C:\Users\mohamed\OneDrive\Desktop\result.csv")
You can limit the interpolation to the inner NaN using limit_area='inside' (this is well covered in the documentation of interpolate):
df = df.interpolate(limit_area='inside').fillna(0)
output:
depth x1 x2 x3
0 1000 0.0 0.000000 0.000000
1 1001 0.0 0.000000 0.000000
2 1002 0.0 0.000000 0.000000
3 1003 0.0 10.000000 0.000000
4 1004 0.0 11.111111 0.000000
5 1005 0.0 12.222222 10.000000
6 1006 0.0 13.333333 10.833333
7 1007 10.0 14.444444 11.666667
8 1008 11.0 15.555556 12.500000
9 1009 12.0 16.666667 13.333333
10 1010 13.0 17.777778 14.166667
11 1011 14.0 18.888889 15.000000
12 1012 15.0 20.000000 15.714286
13 1013 16.0 20.555556 16.428571
14 1014 17.0 21.111111 17.142857
15 1015 18.0 21.666667 17.857143
16 1016 19.0 22.222222 18.571429
17 1017 20.0 22.777778 19.285714
18 1018 21.0 23.333333 20.000000
19 1019 22.0 23.888889 20.833333
20 1020 23.0 24.444444 21.666667
21 1021 24.0 25.000000 22.500000
22 1022 25.0 22.500000 23.333333
23 1023 26.0 20.000000 24.166667
24 1024 27.0 17.500000 25.000000
25 1025 28.0 15.000000 0.000000
26 1026 0.0 0.000000 0.000000
27 1027 0.0 0.000000 0.000000
28 1028 0.0 0.000000 0.000000
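If you also want to read and write the CSV as in your attempt, the full script could look like this (a sketch using the paths from the question; index=False is added so the row index is not written to the output file, drop it if you want the index):

import pandas as pd

df = pd.read_csv(r"C:\Users\mohamed\OneDrive\Desktop\test_interpolate.csv")
# interpolate only between the first and last valid value of each column,
# then fill the remaining (leading/trailing) NaNs with 0
df = df.interpolate(limit_area='inside').fillna(0)
print(df)
df.to_csv(r"C:\Users\mohamed\OneDrive\Desktop\result.csv", index=False)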
I have a list of multiple dataframes dfs.
The dataframes come from files that have dates in its name. Eg. FilenameYYYYMMDD.xlsx
files = [str(file) for file in Path("/dir").iterdir()]
dfs = [pd.read_excel(file, header=1) for file in files]
I can extract the date from the file names:
date_extract = re.search('[0-9]{8}',files[0...20])
date = datetime.datetime.strptime(date_extract[0...20], '%Y%m%d').date()
But how can I assign to each df its respective date (by adding a column called 'Date')?
If you're using pathlib, we can use a dictionary to hold your dataframes and a quick regex to extract the date; when we concat the dataframes, the index will be set to the date.
import re
from pathlib import Path
dfs = {
    re.search(r'(\d{4}.*)\.xlsx', f.name).group(1): pd.read_excel(f, header=1)
    for f in Path("/dir").glob("*.xlsx")
}
print(pd.concat(dfs))
Unnamed: 0 e f c d
20200610 0 0 0.0 0.0 NaN NaN
1 1 0.0 0.0 NaN NaN
2 2 0.0 0.0 NaN NaN
3 3 0.0 0.0 NaN NaN
4 4 1.0 0.0 NaN NaN
5 5 0.0 1.0 NaN NaN
6 6 0.0 0.0 NaN NaN
7 7 0.0 0.0 NaN NaN
8 8 0.0 0.0 NaN NaN
9 9 0.0 0.0 NaN NaN
10 10 0.0 0.0 NaN NaN
11 11 0.0 0.0 NaN NaN
12 12 0.0 0.0 NaN NaN
13 13 0.0 0.0 NaN NaN
14 14 0.0 0.0 NaN NaN
15 15 0.0 0.0 NaN NaN
16 16 0.0 0.0 NaN NaN
17 17 0.0 0.0 NaN NaN
18 18 0.0 0.0 NaN NaN
19 19 0.0 0.0 NaN NaN
20 20 0.0 0.0 NaN NaN
21 21 0.0 0.0 NaN NaN
22 22 0.0 0.0 NaN NaN
23 23 0.0 0.0 NaN NaN
24 24 0.0 0.0 NaN NaN
25 25 0.0 0.0 NaN NaN
20201012 0 0 NaN NaN 0.0 0.0
1 1 NaN NaN 0.0 0.0
2 2 NaN NaN 1.0 0.0
3 3 NaN NaN 0.0 1.0
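If you specifically want a Date column on each dataframe instead of the date in the concatenated index, a variant of the same idea could be (a sketch; it keeps the /dir placeholder and assumes the YYYYMMDD pattern from your file names):

import datetime
import re
from pathlib import Path

import pandas as pd

dfs = []
for f in Path("/dir").glob("*.xlsx"):
    df = pd.read_excel(f, header=1)
    # pull YYYYMMDD out of the file name and attach it as a column
    date_str = re.search(r'\d{8}', f.name).group(0)
    df['Date'] = datetime.datetime.strptime(date_str, '%Y%m%d').date()
    dfs.append(df)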
I am trying to aggregate the dataframe in order to have one date per row (for each group).
Cod1 Cod2 Date E A S
327 100013.0 001 2019-02-01 0.0 0.0 511.0
323 100013.0 001 2019-02-01 0.0 -14.0 NaN
336 100013.0 001 2019-02-02 0.0 -28.0 NaN
341 100013.0 001 2019-02-03 0.0 -6.0 NaN
350 100013.0 001 2019-02-03 0.0 -3.0 NaN
373 100013.0 001 2019-02-07 0.0 -15.0 0
377 100013.0 001 2019-02-07 0.0 -9.0 NaN
Using the following:
df = df.groupby(['Date', 'Cod1', 'Cod2'])['E','A', 'S'].sum()
I got the following output:
2019-02-01 100013.0 001 0.0 -14.0 511.0
2019-02-02 100013.0 001 0.0 -28.0 0.0
2019-02-03 100013.0 001 0.0 -9.0 0.0
2019-02-07 100013.0 001 0.0 -24.0 0.0
My questions is:
Is there some way to aggregate preserving NaN ?
There will be 3 scenarios:
1 -) Two rows on same date, last column having NaN and a not null number:
327 100013.0 001 2019-02-01 0.0 0.0 511.0
323 100013.0 001 2019-02-01 0.0 -14.0 NaN
I would like that in this situation always keep the number.
2-) Two rows on same date, last column having 2 NaNs rows
341 100013.0 001 2019-02-03 0.0 -6.0 NaN
350 100013.0 001 2019-02-03 0.0 -3.0 NaN
I would like that in this situation always keep the NaN.
3-) Two rows on same date, last column having one zero value column and one NaN column
373 100013.0 001 2019-02-07 0.0 -15.0 0
377 100013.0 001 2019-02-07 0.0 -9.0 NaN
I would like that in this situation always keep the 0.
So my expected out should be this one:
2019-02-01 100013.0 001 0.0 -14.0 511.0
2019-02-02 100013.0 001 0.0 -28.0 NaN
2019-02-03 100013.0 001 0.0 -9.0 NaN
2019-02-07 100013.0 001 0.0 -24.0 0.0
Check the min_count parameter of sum:
df.groupby(['Date', 'Cod1', 'Cod2'])[['E', 'A', 'S']].sum(min_count=1)
Out[260]:
E A S
Date Cod1 Cod2
2019-02-01 100013.0 1 0.0 -14.0 511.0
2019-02-02 100013.0 1 0.0 -28.0 NaN
2019-02-03 100013.0 1 0.0 -9.0 NaN
2019-02-07 100013.0 1 0.0 -24.0 0.0
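min_count=1 makes sum return NaN for a group unless it contains at least one non-NA value, which is exactly the behaviour you describe. If you want a flat frame like your expected output, just add reset_index (a sketch):

out = (df.groupby(['Date', 'Cod1', 'Cod2'])[['E', 'A', 'S']]
         .sum(min_count=1)
         .reset_index())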
I guess a custom function can do it:
(df.groupby(['Date', 'Cod1', 'Cod2'])
   [['E', 'A', 'S']]
   .agg(lambda x: np.nan if x.isna().all() else x.sum())
)
Output:
E A S
Date Cod1 Cod2
2019-02-01 100013.0 1 0.0 -14.0 511.0
2019-02-02 100013.0 1 0.0 -28.0 NaN
2019-02-03 100013.0 1 0.0 -9.0 NaN
2019-02-07 100013.0 1 0.0 -24.0 0.0
I have a dataframe with a date+time and a label, which I want to reshape into date (/month) columns with label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
Currently I can sort of divide the data with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
m_labels = df2[m]
m_labels = m_labels.rename(m)
m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?
You just need unstack here
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0,fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output (stored as s):
s.unstack(0,fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
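A crosstab alternative gives the same counts in one call, if you prefer (a sketch; it assumes date_time is already datetime dtype, and the columns come out as monthly periods rather than month-end timestamps):

out = pd.crosstab(df['label'], df['date_time'].dt.to_period('M'))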
Here is my pandas DataFrame:
import pandas as pd
import numpy as np
data = {"column1": [338, 519, 871, 1731, 2693, 2963, 3379, 3789, 3910, 4109, 4307, 4800, 4912, 5111, 5341, 5820, 6003, ...],
"column2": [NaN, 1, 1, 1, 1, NaN, NaN, 2, 2, NaN, NaN, 3, 3, 3, 3, 3, NaN, NaN], ...}
df = pd.DataFrame(data)
df
>>> column1 column2
0 338 NaN
1 519 1.0
2 871 1.0
3 1731 1.0
4 2693 1.0
5 2963 NaN
6 3379 NaN
7 3789 2.0
8 3910 2.0
9 4109 NaN
10 4307 NaN
11 4800 3.0
12 4912 3.0
13 5111 3.0
14 5341 3.0
15 5820 3.0
16 6003 NaN
17 .... ....
The integers in column2 denote "groups" in column1, e.g. rows 1-4 is group "1", rows 7-8 is group "2", rows 11-15 is group "3", etc.
I would like to calculate the difference between the first row and last row in each group. The resulting dataframe would look like this:
df
>>> column1 column2 column3
0 338 NaN NaN
1 519 1.0 2174
2 871 1.0 2174
3 1731 1.0 2174
4 2693 1.0 2174
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121
8 3910 2.0 121
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020
12 4912 3.0 1020
13 5111 3.0 1020
14 5341 3.0 1020
15 5820 3.0 1020
16 6003 NaN NaN
17 .... .... ...
because:
2693-519 = 2174
3910-3789 = 121
5820-4800 = 1020
What is the "pandas way" to calculate column3? Somehow, one must iterate through column3, looking for consecutive groups of values such that df.column2 != "NaN".
EDIT: I realized my example may lead readers to assume the values in column1 are only increasing. Actually, the data is split into intervals, given by an interval column:
df = pd.DataFrame(data)
df
>>> interval column1 column2
0 interval1 338 NaN
1 interval1 519 1.0
2 interval1 871 1.0
3 interval1 1731 1.0
4 interval1 2693 1.0
5 interval1 2963 NaN
6 interval1 3379 NaN
7 interval1 3789 2.0
8 interval1 3910 2.0
9 interval1 4109 NaN
10 interval1 4307 NaN
11 interval1 4800 3.0
12 interval1 4912 3.0
13 interval1 5111 3.0
14 interval1 5341 3.0
15 interval1 5820 3.0
16 interval1 6003 NaN
17 .... ....
18 interval2 12 13
19 interval2 115 13
20 interval2 275 NaN
....
You can filter first and then take the difference between the first and last value in transform:
df['col3'] = (df[df.column2.notnull()]
                .groupby('column2')['column1']
                .transform(lambda x: x.iat[-1] - x.iat[0]))
print (df)
column1 column2 col3
0 338 NaN NaN
1 519 1.0 2174.0
2 871 1.0 2174.0
3 1731 1.0 2174.0
4 2693 1.0 2174.0
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121.0
8 3910 2.0 121.0
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020.0
12 4912 3.0 1020.0
13 5111 3.0 1020.0
14 5341 3.0 1020.0
15 5820 3.0 1020.0
16 6003 NaN NaN
EDIT1, with your new data:
df['col3'] = (df[df.column2.notnull()]
                .groupby('column2')['column1']
                .transform(lambda x: x.iat[-1] - x.iat[0]))
print (df)
interval column1 column2 col3
0 interval1 338 NaN NaN
1 interval1 519 1.0 2174.0
2 interval1 871 1.0 2174.0
3 interval1 1731 1.0 2174.0
4 interval1 2693 1.0 2174.0
5 interval1 2963 NaN NaN
6 interval1 3379 NaN NaN
7 interval1 3789 2.0 121.0
8 interval1 3910 2.0 121.0
9 interval1 4109 NaN NaN
10 interval1 4307 NaN NaN
11 interval1 4800 3.0 1020.0
12 interval1 4912 3.0 1020.0
13 interval1 5111 3.0 1020.0
14 interval1 5341 3.0 1020.0
15 interval1 5820 3.0 1020.0
16 interval1 6003 NaN NaN
18 interval2 12 13.0 103.0
19 interval2 115 13.0 103.0
20 interval2 275 NaN NaN
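A variant of the same idea without a lambda, using two transforms with the built-in 'first'/'last' aggregations:

g = df[df.column2.notnull()].groupby('column2')['column1']
df['col3'] = g.transform('last') - g.transform('first')

Note that this (like the lambda version) groups only by column2, so it assumes group numbers are not reused across intervals; if they were, you would group by ['interval', 'column2'] instead.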