My dataset (yearly data) looks like this:
CODE         Date        PRCP  TAVG  TMAX  TMIN
AE000041196  01-01-2020  0     21.1  NaN   NaN
AE000041196  02-01-2020  0     21.4  NaN   NaN
AE000041196  03-01-2020  0     21.2  NaN   15.4
AE000041196  04-01-2020  0     21.9  NaN   14.9
AE000041196  05-01-2020  0     23.7  NaN   16.5
AE000041196  06-01-2020  0.5   20.7  NaN   NaN
AE000041196  07-01-2020  0     18.1  NaN   11.5
AE000041196  08-01-2020  0     19.6  NaN   10.3
AE000041196  09-01-2020  0.3   20.6  NaN   13.8
I am trying to find the longest run of consecutive missing values (the maximum count of consecutive NaN per 'CODE') in columns TMAX and TMIN, for each value in CODE. E.g., from the limited dataset above:
Max consecutive missing value for TMAX would be 9, and for TMIN would be 2
The code I am using:
df['TMAX_nullccount'] = df.TMAX.isnull().astype(int).groupby(df['TMAX'].notnull().astype(int).cumsum()).cumsum()
This produces wrong results, e.g. when the CODE changes:
CODE         Date        PRCP  TAVG  TMAX  TMIN  TMAX_nullccount
CA1AB000014  10-03-2021  2.3   NaN   NaN   NaN   297
CA1AB000014  11-03-2021  0     NaN   NaN   NaN   298
CA1AB000014  12-03-2021  0     NaN   NaN   NaN   299
CA1AB000014  13-03-2021  0     NaN   NaN   NaN   300
CA1AB000014  14-03-2021  0     NaN   NaN   NaN   301
CA1AB000015  01-01-2021  0     NaN   NaN   NaN   302
CA1AB000015  02-01-2021  0     NaN   NaN   NaN   303
CA1AB000015  03-01-2021  0     NaN   NaN   NaN   304
CA1AB000015  04-01-2021  0     NaN   NaN   NaN   305
In theory, the count (TMAX_nullccount) should have restarted from 0 when CODE changed from CA1AB000014 to CA1AB000015. Also, the value in column TMAX_nullccount cannot exceed 365 (yearly dataset), but my code gives values far larger than that.
Expected output (values are made up):
CODE TMAX_maxcnullcount TMIN_maxcnullcount TAVG_maxcnullcount
AE000041196 2 2 0
AEM00041194 1 1 0
AEM00041217 3 1 0
AEM00041218 1 2 45
AFM00040938 65 65 0
AFM00040948 132 132 0
AG000060390 155 141 0
How can I fix this? Thanks in advance
You can use:
First, test for missing values:
print (df.isna())
CODE Date PRCP TAVG TMAX TMIN
0 False False False False True True
1 False False False False True True
2 False False False False True False
3 False False False False True False
4 False False False False True False
5 False False False False True True
6 False False False False True False
7 False False False False True False
8 False False False False True False
# columns to test for missing values
cols = ['TMAX','TMIN','TAVG']
# CODE to index, filter columns and reshape to one Series
m = df.set_index('CODE')[cols].isna().unstack()
# create consecutive groups and count them, taking the maximal count per column and CODE
df = (m.ne(m.shift()).cumsum()
        .where(m)
        .groupby(level=[0,1]).value_counts()
        .max(level=[0,1])
        .unstack(0)
        .add_suffix('_maxcnullcount'))
print (df)
TMAX_maxcnullcount TMIN_maxcnullcount
CODE
AE000041196 9 2
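Note: Series.max(level=...) used above is deprecated and was removed in pandas 2.0. A minimal sketch of the same idea with an explicit per-column groupby, assuming the original df and the cols list from above:
def max_nan_run(s):
    # label blocks of consecutive equal is-missing flags,
    # then return the size of the largest all-NaN block
    blocks = s.ne(s.shift()).cumsum()
    return s.groupby(blocks).sum().max()

out = (df.set_index('CODE')[cols]
         .isna()
         .groupby(level='CODE')
         .agg(max_nan_run)
         .add_suffix('_maxcnullcount'))
print (out)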
You can try something like this:
df.groupby(['CODE', df['PRCP'].ne(df['PRCP'].shift()).cumsum()]).size().max()
Group by CODE and by the blocks of consecutive equal values, then compute each block's size.
Your groupby result (aggregated with size) will be:
CODE PRCP
AE000041196 1 5
2 1
3 2
4 1
Now you can find max and min.
So your final solution will look like this:
df1 = df.fillna(0)
df1.groupby(['CODE', df1['TMAX'].ne(df1['TMAX'].shift()).cumsum()]).size().max()
9
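Note that after fillna(0) this counts runs of any repeated value, not only the former NaNs, so a long stretch of identical real readings would inflate the result. A sketch that restricts the count to missing values and also resets at CODE boundaries:
isna = df['TMAX'].isna()
blocks = isna.ne(isna.shift()).cumsum()
# size of each all-NaN block, then the largest block per CODE
result = isna.groupby([df['CODE'], blocks]).sum().groupby(level=0).max()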
Let's say I have the following pandas.dataframe:
data
series time_idx value
0 0 0 -0.000000
1 0 1 0.018844
2 0 2 0.028694
3 0 3 0.050784
4 0 4 0.067037
... ... ... ...
3995 9 395 0.973978
3996 9 396 0.944002
3997 9 397 1.001089
3998 9 398 1.132001
3999 9 399 1.169244
4000 rows × 3 columns
I want to test whether, for each series (0..9), the time indexes increase by 1 from row to row, and if not, where the differences are.
I thought about sorting the dataframe by series and by time_idx and then comparing to the index mod 400, but it's not a nice solution. Any suggestions?
Thanks
The following is based on what I understand from your question; see if it answers it. I have to use the string 'True' instead of Boolean True because the dataframe converts the Boolean to the numeric 1.0.
import numpy as np

df['IncOne'] = (df.series == df.series.shift())
df['IncOne'] = np.where(
    df.IncOne,
    np.where(df.time_idx.eq(df.time_idx.shift() + 1),
             'True', df.time_idx - df.time_idx.shift()),
    '')
    series  time_idx     value  IncOne
0        0         0         0
1        0         1  0.018844    True
2        0         2  0.028694    True
3        0         3  0.050784    True
4        0         4  0.067037    True
5        0         6         0     2.0
6        0         7  0.018844    True
7        0         8  0.028694    True
8        0         9  0.050784    True
9        0        12  0.067037     3.0
10       0        13         1    True
11       9       395  0.973978
12       9       396  0.944002    True
13       9       397   1.00109    True
14       9       398     1.132    True
15       9       399   1.16924    True
Assuming that the dataframe is df you can try this:
df["diff"] = df.groupby(by="series")["time_idx"].diff().fillna(1) != 1
It will create a new column "diff" with boolean values. A True value indicates that the difference between the time_idx value in the current row and the row preceding it is different than one. Only differences between rows corresponding to the same series can give a True value.
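To see where the gaps are rather than just flag them, a small usage sketch building on the same diff:
# show the offending rows together with the actual step size
steps = df.groupby(by="series")["time_idx"].diff().fillna(1)
print(df[steps.ne(1)].assign(step=steps[steps.ne(1)]))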
This may be a little confusing, but I have the following dataframe:
exporter assets liabilities
False 5 1
True 10 8
False 3 1
False 24 20
False 40 2
True 12 11
I want to calculate a ratio with this formula: (df['liabilities'].sum() / df['assets'].sum()) * 100
And I expect to create a new column where the values are the ratio but calculated for each boolean value, like this:
exporter assets liabilities ratio
False 5 1 33.3
True 10 8 86.3
False 3 1 33.3
False 24 20 33.3
False 40 2 33.3
True 12 11 86.3
Use DataFrame.groupby on column exporter and transform the dataframe using sum, then use Series.div to divide liabilities by assets and Series.mul to multiply by 100:
d = df.groupby('exporter').transform('sum')
df['ratio'] = d['liabilities'].div(d['assets']).mul(100).round(2)
Result:
print(df)
exporter assets liabilities ratio
0 False 5 1 33.33
1 True 10 8 86.36
2 False 3 1 33.33
3 False 24 20 33.33
4 False 40 2 33.33
5 True 12 11 86.36
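For quick verification, a self-contained version of the above with the example data typed in:
import pandas as pd

df = pd.DataFrame({'exporter': [False, True, False, False, False, True],
                   'assets': [5, 10, 3, 24, 40, 12],
                   'liabilities': [1, 8, 1, 20, 2, 11]})

# per-group sums broadcast back to the original rows
d = df.groupby('exporter').transform('sum')
df['ratio'] = d['liabilities'].div(d['assets']).mul(100).round(2)
print(df)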
I would like to select a specified number of rows after a condition is met.
Here is my dataframe:
I would like to select three rows after the entry is equal to 1, so for the first occurrence I would obtain something like this:
I don't know what the most appropriate output is if I want to study every occurrence, maybe a groupby?
First, remove the 0 rows before the first 1:
df = df[df['entry'].eq(1).cumsum().ne(0)]
df = df.groupby(df['entry'].cumsum()).head(4)
Timestamp entry
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
7 11.8 1
8 11.9 0
9 12.0 0
10 12.1 0
Details & explanation:
For a general solution that removes all values before the first match, compare with Series.eq, then take the cumulative sum with Series.cumsum and compare with Series.ne, filtering out all 0 values left after the cumsum operation:
print (df.assign(comp1 = df['entry'].eq(1),
cumsum =df['entry'].eq(1).cumsum(),
mask = df['entry'].eq(1).cumsum().ne(0)))
Timestamp entry comp1 cumsum mask
0 11.1 0 False 0 False
1 11.2 1 True 1 True
2 11.3 0 False 1 True
3 11.4 0 False 1 True
4 11.5 0 False 1 True
5 11.6 0 False 1 True
6 11.7 0 False 1 True
7 11.8 1 True 2 True
8 11.9 0 False 2 True
9 12.0 0 False 2 True
10 12.1 0 False 2 True
After filtering by boolean indexing, create a helper Series with a cumulative sum for the groups:
print (df['entry'].cumsum())
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
Name: entry, dtype: int64
So for the final solution, use GroupBy.head with 4 to get each row containing 1 plus the next 3 rows:
df = df.groupby(df['entry'].cumsum()).head(4)
print (df)
Timestamp entry
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
7 11.8 1
8 11.9 0
9 12.0 0
10 12.1 0
To loop over the groups, use:
for i, g in df.groupby(df['entry'].cumsum()):
    print (g.head(4))
If you want a list of DataFrames as output:
L = [g.head(4) for i, g in df.groupby(df['entry'].cumsum())]
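Wrapped into a small helper for reuse (the function name and the n parameter are illustrative, not part of the original answer); call it on the original, unfiltered frame:
def rows_after_trigger(frame, col='entry', n=3):
    # drop rows before the first 1, then keep each trigger row
    # plus the next n rows per occurrence
    grp = frame[col].eq(1).cumsum()
    m = grp.ne(0)
    return frame[m].groupby(grp[m], group_keys=False).head(n + 1)

print(rows_after_trigger(df))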
I was looking into this post which almost solved my problem. However, in my case, I want to work based on the 2nd level of the df, but trying not to specify my 1st level column names explicitly.
Borrowing the original dataframe:
df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
('A','b'): [0,1,2,3,-1],
('B','a'): [-20,-10,0,10,20],
('B','b'): [-200,-100,0,100,200]})
##df
A B
a b a b
0 -1 0 -20 -200
1 -1 1 -10 -100
2 0 2 0 0
3 10 3 10 100
4 12 -1 20 200
I want to assign NA to all columns a and b where b < 0. I was selecting them based on df.xs('b',axis=1,level=1) < 0, but then I cannot actually perform the replacement. Moreover, I have varying 1st-level names, so the indexing there cannot be based on A and B explicitly, but possibly through df.columns.values?
The desired output would be
##df
A B
a b a b
0 -1 0 NA NA
1 -1 1 NA NA
2 0 2 0 0
3 10 3 10 100
4 NA NA 20 200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with a mask broadcast by reindex to the same index and column names as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
A B
0 False True
1 False True
2 False False
3 False False
4 True False
print (mask.reindex(columns = df.columns, level=0))
A B
a b a b
0 False False True True
1 False False True True
2 False False False False
3 False False False False
4 True True False False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
A B
a b a b
0 -1.0 0.0 NaN NaN
1 -1.0 1.0 NaN NaN
2 0.0 2.0 0.0 0.0
3 10.0 3.0 10.0 100.0
4 NaN NaN 20.0 200.0
Edit by OP: I had asked in comments how to consider multiple conditions (e.g. df.xs('b',axis=1,level=1) < 0 OR df.xs('b',axis=1,level=1) being an NA). #Jezrael kindly indicated that if I wanted to do this, I should consider
mask = (df.xs('b',axis=1,level=1) < 0) | df.xs('b',axis=1,level=1).isnull()
(the parentheses around the comparison are required because | binds more tightly than <)
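Put together, a runnable sketch of the combined condition:
b = df.xs('b', axis=1, level=1)
mask = (b < 0) | b.isnull()
df = df.mask(mask.reindex(columns=df.columns, level=0))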
In Pandas, I'm trying to figure out how to generate a column that is the difference between the time of the current row and time of the last row in which the value of another column is True:
So given the dataframe:
df = pd.DataFrame({'Time':[5,10,15,20,25,30,35,40,45,50],
'Event_Occured': [True,False,False,True,True,False,False,True,False,False]})
print(df)
Event_Occured Time
0 True 5
1 False 10
2 False 15
3 True 20
4 True 25
5 False 30
6 False 35
7 True 40
8 False 45
9 False 50
I'm trying to generate a column that would look like this:
Event_Occured Time Time_since_last
0 True 5 0
1 False 10 5
2 False 15 10
3 True 20 0
4 True 25 0
5 False 30 5
6 False 35 10
7 True 40 0
8 False 45 5
9 False 50 10
Thanks very much!
Using df.Event_Occured.cumsum() gives you distinct groups to groupby. Then applying a function per group that subtracts the first member's value from every member gets you what you want.
df['Time_since_last'] = \
    df.groupby(df.Event_Occured.cumsum()).Time.apply(lambda x: x - x.iloc[0])
df
Here's an alternative that fills the values corresponding to Falses with the last valid observation:
df['Time'] - df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
Out:
0 0.0
1 5.0
2 10.0
3 0.0
4 0.0
5 5.0
6 10.0
7 0.0
8 5.0
9 10.0
Name: Time, dtype: float64
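To attach this as the new column, cast back to int if desired, since the reindex/ffill intermediate is float (this assumes the first row is an event, as in the example, so there are no leading NaN):
df['Time_since_last'] = (df['Time']
    - df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
).astype(int)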