I'm searching for an elegant way to match datetimes within a pandas DataFrame.
The original data looks like this:
point_id            datetime  value1  value2
       1  2017-05-2017 00:00       1     1.1
       2  2017-05-2017 00:00       2     2.2
       3  2017-05-2017 00:00       3     3.3
       2  2017-05-2017 01:00       4     4.4
What the result should look like:
          datetime  value1  value_cal1  value2  value_cal2  value3  value_cal3
2017-05-2017 00:00       1         1.1       2         2.2       3         3.3
2017-05-2017 01:00     NaN         NaN       4         4.4     NaN         NaN
In the end there should be one row for each datetime, with missing data points marked as NaN.
In [180]: x = (df.drop('point_id', axis=1)
...: .rename(columns={'value1':'value','value2':'value_cal'})
...: .assign(n=df.groupby('datetime')['value1'].cumcount()+1)
...: .pivot_table(index='datetime', columns='n', values=['value','value_cal'])
...: .sort_index(axis=1, level=1)
...: )
...:
In [181]: x
Out[181]:
                    value  value_cal  value  value_cal  value  value_cal
n                       1          1      2          2      3          3
datetime
2017-05-2017 00:00    1.0        1.1    2.0        2.2    3.0        3.3
2017-05-2017 01:00    4.0        4.4    NaN        NaN    NaN        NaN
Now we can "fix" the column names:
In [182]: x.columns = ['{0[0]}{0[1]}'.format(c) for c in x.columns]
In [183]: x
Out[183]:
                    value1  value_cal1  value2  value_cal2  value3  value_cal3
datetime
2017-05-2017 00:00     1.0         1.1     2.0         2.2     3.0         3.3
2017-05-2017 01:00     4.0         4.4     NaN         NaN     NaN         NaN
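For reference, here is a minimal, self-contained sketch of the same chain, assuming the sample data from the question (datetime strings copied verbatim) and flattening the columns with an f-string instead of str.format:
import pandas as pd

# Rebuild the sample frame shown in the question (datetime kept as plain strings).
df = pd.DataFrame({
    'point_id': [1, 2, 3, 2],
    'datetime': ['2017-05-2017 00:00', '2017-05-2017 00:00',
                 '2017-05-2017 00:00', '2017-05-2017 01:00'],
    'value1': [1, 2, 3, 4],
    'value2': [1.1, 2.2, 3.3, 4.4],
})

x = (df.drop('point_id', axis=1)
       .rename(columns={'value1': 'value', 'value2': 'value_cal'})
       .assign(n=df.groupby('datetime')['value1'].cumcount() + 1)
       .pivot_table(index='datetime', columns='n', values=['value', 'value_cal'])
       .sort_index(axis=1, level=1))

# Flatten the (name, n) MultiIndex columns into value1, value_cal1, value2, ...
x.columns = [f'{name}{n}' for name, n in x.columns]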
I have data with date, time, and values, and want to calculate a forward-looking rolling maximum for each date:
Date Time Value Output
01/01/2022 01:00 1.3 1.4
01/01/2022 02:00 1.4 1.2
01/01/2022 03:00 0.9 1.2
01/01/2022 04:00 1.2 NaN
01/02/2022 01:00 5 4
01/02/2022 02:00 4 3
01/02/2022 03:00 2 3
01/02/2022 04:00 3 NaN
I have tried this:
df = df.sort_values(by=['Date','Time'], ascending=True)
df['rollingmax'] = df.groupby(['Date'])['Value'].rolling(window=4,min_periods=0).max()
df = df.sort_values(by=['Date','Time'], ascending=False)
but that doesn't seem to work...
It looks like you want a shifted reverse rolling max:
n = 4
df['Output'] = (df[::-1]
.groupby('Date')['Value']
.apply(lambda g: g.rolling(n-1, min_periods=1).max().shift())
)
Output:
Date Time Value Output
0 01/01/2022 01:00 1.3 1.4
1 01/01/2022 02:00 1.4 1.2
2 01/01/2022 03:00 0.9 1.2
3 01/01/2022 04:00 1.2 NaN
4 01/02/2022 01:00 5.0 4.0
5 01/02/2022 02:00 4.0 3.0
6 01/02/2022 03:00 2.0 3.0
7 01/02/2022 04:00 3.0 NaN
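If you prefer not to reverse the frame, a possible alternative is a forward-looking window on the shifted values (a sketch, assuming df is sorted by Date and Time and a pandas version that provides pd.api.indexers.FixedForwardWindowIndexer):
import pandas as pd

# Forward-looking max over the next n-1 rows within each Date;
# shift(-1) excludes the current row from its own window.
n = 4
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=n - 1)
df['Output'] = df.groupby('Date')['Value'].transform(
    lambda g: g.shift(-1).rolling(indexer, min_periods=1).max()
)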
I want to drop all rows for a specific CODE if there is at least one NaN value in PPTOT for that CODE.
This is my df:
CODE MONTH_DAY PPTOT
0 113250 01-01 8.4
1 113250 01-02 9.3
2 113250 01-03 NaN
3 113250 01-04 12.7
4 113250 01-05 7.7
... ... ...
16975 47E94706 12-27 5.0
16976 47E94706 12-28 10.2
16977 47E94706 12-29 0.2
16978 47E94706 12-30 0.3
16979 47E94706 12-31 2.0
There is one NaN value in PPTOT for CODE 113250, so all PPTOT values for CODE 113250 must be converted to NaN.
Expected result:
CODE MONTH_DAY PPTOT
0 113250 01-01 NaN
1 113250 01-02 NaN
2 113250 01-03 NaN
3 113250 01-04 NaN
4 113250 01-05 NaN
... ... ...
16975 47E94706 12-27 5.0
16976 47E94706 12-28 10.2
16977 47E94706 12-29 0.2
16978 47E94706 12-30 0.3
16979 47E94706 12-31 2.0
So I tried this code:
notnan = pd.DataFrame()
for code, data in df.groupby('CODE'):
    data.dropna(subset=['PPTOT'], how='any')
    notnan = notnan.append(data)
But in notnan I'm still getting rows with NaN values, and I don't understand why.
Would you mind helping me?
Thanks in advance.
Try:
>>> df.loc[df['PPTOT'].notnull().groupby(df['CODE']).transform('all')]
CODE MONTH_DAY PPTOT
16975 47E94706 12-27 5.0
16976 47E94706 12-28 10.2
16977 47E94706 12-29 0.2
16978 47E94706 12-30 0.3
16979 47E94706 12-31 2.0
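If you prefer to keep the per-group logic explicit, a roughly equivalent sketch uses groupby.filter (out is just a placeholder name):
# Keep only the CODE groups whose PPTOT column has no missing values.
out = df.groupby('CODE').filter(lambda g: g['PPTOT'].notna().all())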
Given a toy data frame like so:
>>> df = pd.DataFrame({'group': [1, 1, 2, 2], 'value': [np.nan, 'A', 'B', 'B']})
>>> df
group value
0 1 NaN
1 1 A
2 2 B
3 2 B
Within each group, test whether any value is NaN. If any is, substitute NaN everywhere; otherwise, keep the existing values.
>>> df.groupby('group').transform(lambda s: np.where(s.isnull().any(), np.nan, s))
value
0 NaN
1 NaN
2 B
3 B
Reassign with overwrite to complete.
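On the toy frame, that overwrite step could look like this (a sketch; assumes numpy is imported as np):
# Within each group: if any value is missing, blank the whole group with NaN,
# otherwise keep the values as they are.
df['value'] = df.groupby('group')['value'].transform(
    lambda s: np.where(s.isnull().any(), np.nan, s)
)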
Based on your labels and df:
df['PPTOT'] = df.groupby('CODE')['PPTOT'].transform(lambda x: np.nan if x.isnull().any() else x)
I have a DataFrame of users and their ratings for movies:
        movie1  movie2  movie3  movie4  movie5  movie6
userId
0          4.1     NaN     1.0     NaN     2.1     NaN
1          3.1     1.1     3.4     1.4     NaN     NaN
2          2.8     NaN     1.7     NaN     3.0     NaN
3          NaN     5.0     NaN     2.3     NaN     2.1
4          NaN     NaN     NaN     NaN     NaN     NaN
5          2.3     NaN     2.0     4.0     NaN     NaN
There isn't actually a userId column in the DataFrame; it's just being used as the index.
From this DataFrame, I'm trying to make another DataFrame that only contains movies that have been rated by a specific user. For example, if I wanted to make a new DataFrame of the movies rated by the user with userId == 0, the output would be a DataFrame with:
        movie1  movie3  movie5
userId
0          4.1     1.0     2.1
1          3.1     3.4     NaN
2          2.8     1.7     3.0
3          NaN     NaN     NaN
4          NaN     NaN     NaN
5          2.3     2.0     NaN
I know how to iterate over the columns, but I don't know how to select the columns I want by checking a row value.
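For reference, a minimal sketch that rebuilds the example frame above (values copied from the table; userId is the index, not a column):
import numpy as np
import pandas as pd

# Example ratings table from the question, with userId as the index.
df = pd.DataFrame(
    {
        'movie1': [4.1, 3.1, 2.8, np.nan, np.nan, 2.3],
        'movie2': [np.nan, 1.1, np.nan, 5.0, np.nan, np.nan],
        'movie3': [1.0, 3.4, 1.7, np.nan, np.nan, 2.0],
        'movie4': [np.nan, 1.4, np.nan, 2.3, np.nan, 4.0],
        'movie5': [2.1, np.nan, 3.0, np.nan, np.nan, np.nan],
        'movie6': [np.nan, np.nan, np.nan, 2.1, np.nan, np.nan],
    },
    index=pd.Index(range(6), name='userId'),
)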
You can use the .loc accessor to select the particular userId, then use notna to create a boolean mask marking the columns that do not contain NaN values, and finally use this mask to filter the columns:
userId = 0 # specify the userid here
df_user = df.loc[:, df.loc[userId].notna()]
Details:
>>> df.loc[userId].notna()
movie1 True
movie2 False
movie3 True
movie4 False
movie5 True
movie6 False
Name: 0, dtype: bool
>>> df.loc[:, df.loc[userId].notna()]
        movie1  movie3  movie5
userId
0          4.1     1.0     2.1
1          3.1     3.4     NaN
2          2.8     1.7     3.0
3          NaN     NaN     NaN
4          NaN     NaN     NaN
5          2.3     2.0     NaN
Another approach:
import pandas as pd
user0 = df.iloc[0,:] #select the first row
flags = user0.notna() #flag the non NaN values
flags = flags.tolist() #convert to list instead of series
newdf = df.iloc[:,flags] #return all rows, and the columns where flags are true
Declare the userId of interest and .loc it into a new df, keeping only the relevant columns.
Then pd.concat that new df with the other userIds, keeping only the columns (movies) of the userId you selected:
user = 0 # set your userId
a = df.loc[[user]].dropna(axis=1)
b = pd.concat([a, df.drop(a.index)[a.columns]])
Which prints:
b
        movie1  movie3  movie5
userId
0         4.10    1.00    2.10
1         3.10    3.40     NaN
2         2.80    1.70    3.00
3          NaN     NaN     NaN
4          NaN     NaN     NaN
5         2.30    2.00     NaN
Note that I have set the index to be userId as you specified.
Let's say I have the following data:
import pandas as pd
csv = [
['2019-05-01 00:00', ],
['2019-05-01 01:00', 2],
['2019-05-01 02:00', 4],
['2019-05-01 03:00', ],
['2019-05-01 04:00', 2],
['2019-05-01 05:00', 4],
['2019-05-01 06:00', 6],
['2019-05-01 07:00', ],
['2019-05-01 08:00', ],
['2019-05-01 09:00', 2]]
df = pd.DataFrame(csv, columns=["DateTime", "Value"])
So I am working with a time series with gaps in data:
DateTime Value
0 2019-05-01 00:00 NaN
1 2019-05-01 01:00 2.0
2 2019-05-01 02:00 4.0
3 2019-05-01 03:00 NaN
4 2019-05-01 04:00 2.0
5 2019-05-01 05:00 4.0
6 2019-05-01 06:00 6.0
7 2019-05-01 07:00 NaN
8 2019-05-01 08:00 NaN
9 2019-05-01 09:00 2.0
Now, I want to work one by one with each chunk of existing data. That is, I want to split the series into the contiguous pieces between NaNs. The goal is to iterate over these chunks so I can pass each one individually to another function that can't handle gaps in the data. Then I want to store the result in the original dataframe in its corresponding place. For a trivial example, let's say the function calculates the average value of the chunk. Expected result:
DateTime Value ChunkAverage
0 2019-05-01 00:00 NaN NaN
1 2019-05-01 01:00 2.0 3.0
2 2019-05-01 02:00 4.0 3.0
3 2019-05-01 03:00 NaN NaN
4 2019-05-01 04:00 2.0 4.0
5 2019-05-01 05:00 4.0 4.0
6 2019-05-01 06:00 6.0 4.0
7 2019-05-01 07:00 NaN NaN
8 2019-05-01 08:00 NaN NaN
9 2019-05-01 09:00 2.0 2.0
I know this can be done in a "traditional" way with loops, "if" clauses, slicing by index, etc., but I guess there is something more efficient and safer built into pandas. I just can't figure out how.
You can use df.groupby with pd.Series.isna and pd.Series.cumsum:
g = df.Value.isna().cumsum()
df.assign(chunk = df.Value.groupby(g).transform('mean').mask(df.Value.isna()))
# df['chunk'] = df.Value.groupby(g).transform('mean').mask(df.Value.isna())
# df['chunk'] = df.Value.groupby(g).transform('mean').where(df.Value.notna())
DateTime Value chunk
0 2019-05-01 00:00 NaN NaN
1 2019-05-01 01:00 2.0 3.0
2 2019-05-01 02:00 4.0 3.0
3 2019-05-01 03:00 NaN NaN
4 2019-05-01 04:00 2.0 4.0
5 2019-05-01 05:00 4.0 4.0
6 2019-05-01 06:00 6.0 4.0
7 2019-05-01 07:00 NaN NaN
8 2019-05-01 08:00 NaN NaN
9 2019-05-01 09:00 2.0 2.0
Note:
df.assign(...) returns a new DataFrame.
df['chunk'] = ... mutates the original DataFrame in place.
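To see why grouping by g isolates the gap-free chunks, it helps to look at the grouping key itself (a small sketch on the sample frame):
g = df.Value.isna().cumsum()
print(g.tolist())
# [1, 1, 1, 2, 2, 2, 2, 3, 4, 4]
# Every NaN starts a new label, so each run of non-NaN values after it shares
# one group; the NaN rows themselves are masked back to NaN afterwards.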
One possibility would be to add a separator column, based on the NaNs in Value, and group by that:
df['separator'] = df['Value'].isna().cumsum()
grp = df.groupby('separator').agg(count=pd.NamedAgg(column='Value', aggfunc='count'))
print(grp)
This counts the non-NaN values in each group:
           count
separator
1              2
2              3
3              0
4              1
How you want to fill the NaNs depends a bit on what you want to achieve with the calculation.
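For the average example from the question, one way to finish this approach could look like the sketch below (reusing the separator column and writing the result back as ChunkAverage):
# Per-chunk mean, written back only on the rows that actually have a value.
df['separator'] = df['Value'].isna().cumsum()
chunk_mean = df.groupby('separator')['Value'].transform('mean')
df['ChunkAverage'] = chunk_mean.where(df['Value'].notna())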
Say I have a dataframe as follows:
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=6, freq='M'),
'value': [3, 3.5, -5, 2, 7, 6.8], 'type': ['a', 'a', 'a', 'b', 'b', 'b']})
df['pct'] = df.groupby(['type'])['value'].pct_change()
Output:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b -0.028571
I want to find the pct values that are bigger than 0.2 or smaller than -0.2 and replace them with the groupby type means.
My attempt to solve this: first, replace the "outliers" with the extreme value -999, then replace those by the groupby output. This is what I have done:
df.loc[df['pct'] >= 0.2, 'pct'] = -999
df.loc[df['pct'] <= -0.2, 'pct'] = -999
df["pct"] = df.groupby(['type'])['pct'].transform(lambda x: x.replace(-999, x.mean()))
But obviously this is not the best way to solve the problem, and the results are not correct:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -499.416667
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b -499.514286
5 2013-06-30 6.8 b -0.028571
The expected result should look like this:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -1.130
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.24
What have I done wrong? Thanks again for your kind help.
Instead of your two conditions, you can use Series.between and set the values in pct with GroupBy.transform and mean:
mask = df['pct'].between(-0.2, 0.2)
df.loc[mask, 'pct'] = df.groupby('type')['pct'].transform('mean')
print (df)
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a -1.130952
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.235714
An alternative solution is to use numpy.where:
mask = df['pct'].between(-0.2, 0.2)
df['pct'] = np.where(mask, df.groupby('type')['pct'].transform('mean'), df['pct'])
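A third, roughly equivalent sketch uses Series.mask, which reads close to the problem statement (replace the flagged values, keep the rest):
mask = df['pct'].between(-0.2, 0.2)
# Where the mask is True, take the per-type mean; elsewhere keep the original pct.
df['pct'] = df['pct'].mask(mask, df.groupby('type')['pct'].transform('mean'))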