I want to create a new column in my dataframe with the value from another row.
DataFrame
TimeStamp Event Value
0 1603822620000 1 102.0
1 1603822680000 1 108.0
2 1603822740000 1 107.0
3 1603822800000 2 1
4 1603823040000 1 106.0
5 1603823100000 2 0
6 1603823160000 2 1
7 1603823220000 1 105.0
I would like to add a new column with the previous value where Event == 1.
TimeStamp Event Value PrevValue
0 1603822620000 1 102.0 NaN
1 1603822680000 1 108.0 102.0
2 1603822740000 1 107.0 108.0
3 1603822800000 2 1 107.0
4 1603823040000 1 106.0 107.0
5 1603823100000 2 0 106.0
6 1603823160000 2 1 106.0
7 1603823220000 1 105.0 106.0
So I can't simply use shift(1), and groupby('Event').shift(1) doesn't work either.
Current solution
df["PrevValue"] =df.timestamp.apply(lambda ts: (df[(df.Event == 1) & (df.timestamp < ts)].iloc[-1].value))
But I guess, that's not the best solution.
Is there something like shiftUntilCondition(condition)?
Thanks a lot!
Try with
df['new'] = df['Value'].where(df['Event']==1).ffill().shift()
Out[83]:
0 NaN
1 102.0
2 108.0
3 107.0
4 107.0
5 106.0
6 106.0
7 106.0
Name: Value, dtype: float64
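For completeness, the chain can be verified end to end on the sample data: `where` masks out the rows with Event != 1, `ffill` carries the last Event-1 value forward, and `shift` moves it down one row. A minimal runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "TimeStamp": [1603822620000, 1603822680000, 1603822740000, 1603822800000,
                  1603823040000, 1603823100000, 1603823160000, 1603823220000],
    "Event": [1, 1, 1, 2, 1, 2, 2, 1],
    "Value": [102.0, 108.0, 107.0, 1, 106.0, 0, 1, 105.0],
})

# Keep Value only where Event == 1, carry it forward, then shift one row down
df["PrevValue"] = df["Value"].where(df["Event"] == 1).ffill().shift()
print(df)
```

The first row stays NaN because there is no earlier Event-1 value to shift in.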
I have one DataFrame, df, with the four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number to IDP2Number, using IDP1 to match IDP2. If an IDP2 value also appears in IDP1, I want to replace the existing IDP2Number with the corresponding IDP1Number; otherwise, leave IDP2Number alone.
The error message that appears reads, "Reindexing only valid with uniquely valued Index objects".
The DataFrame below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = df.query("IDP1.notna()")[['IDP1', 'IDP1Number']].set_index('IDP1')['IDP1Number'].to_dict()
# create new column using ifelse condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None) if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps) else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
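A vectorized alternative (not the answer above, just a sketch under the same assumptions) avoids the row-wise apply entirely: map IDP2 through the IDP1 lookup, which yields NaN for any IDP2 with no IDP1 match, then fall back to the existing IDP2Number with fillna:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "IDP1": [1, 3, 5, 7, 9, np.nan, np.nan, np.nan, np.nan, np.nan],
    "IDP1Number": [100, 110, 120, 140, 150] + [np.nan] * 5,
    "IDP2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "IDP2Number": [np.nan, 150, np.nan, 160, 190, 130, np.nan, 200, 90, np.nan],
})

# Build the IDP1 -> IDP1Number lookup, map IDP2 through it, and keep the
# original IDP2Number wherever the lookup produced NaN (no match)
mapping = df.dropna(subset=["IDP1"]).set_index("IDP1")["IDP1Number"].to_dict()
df["IDP2Number"] = df["IDP2"].map(mapping).fillna(df["IDP2Number"])
print(df)
```

This relies on the same precedence as the apply version: a matched IDP1Number always wins, even over an existing IDP2Number.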
I'm having a bit of trouble with this. My dataframe looks like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 nan nan
1 nan nan
2 nan 0
2 50 0
2 20 1
2 nan nan
2 nan nan
So, what I need to do is: after dummy takes the value 1, fill the amount variable with zeroes for each id, like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 0 nan
1 0 nan
2 nan 0
2 50 0
2 20 1
2 0 nan
2 0 nan
I'm guessing I'll need some combination of groupby('id'), fillna(method='ffill'), maybe a .loc or a shift(), but everything I tried either had some problem or was very slow. Any suggestions?
The way I would use:
s = df.groupby('id')['dummy'].ffill().eq(1)
df.loc[s&df.dummy.isna(),'amount']=0
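The approach above can be checked end to end on the sample data: ffill-ing dummy within each id carries the 1 forward, and the mask restricts the zero-fill to rows where dummy itself is NaN. A minimal runnable sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "amount": [130, 120, 110, np.nan, np.nan, np.nan, 50, 20, np.nan, np.nan],
    "dummy": [0, 0, 1, np.nan, np.nan, 0, 0, 1, np.nan, np.nan],
})

# Carry the last seen dummy forward within each id; rows after a 1
# (where dummy itself is NaN) get their amount set to 0
s = df.groupby("id")["dummy"].ffill().eq(1)
df.loc[s & df["dummy"].isna(), "amount"] = 0
print(df)
```

Note that the id-2 row with amount NaN but dummy 0 is left untouched, which is the behavior the question asks for.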
You can do this much more easily:
data.loc[data['dummy'].isna(), 'amount'] = 0
This will select all the rows where dummy is NaN and fill the amount column with 0. (Note that chained indexing, data[data['dummy'].isna()]['amount'] = 0, assigns to a copy and silently does nothing, so .loc is required.)
IIUC, ffill() and mask the still-NaN rows:
s = df.groupby('id')['amount'].ffill().notnull()
df.loc[df['amount'].isna() & s, 'amount'] = 0
Output:
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
Could you please try the following.
df.loc[df['dummy'].isnull(),'amount']=0
df
Output will be as follows.
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
This question already has answers here:
How to fillna by groupby outputs in pandas?
(3 answers)
Closed 4 years ago.
I have a dataset as follow -
alldata.loc[:,["Age","Pclass"]].head(10)
Out[24]:
Age Pclass
0 22.0 3
1 38.0 1
2 26.0 3
3 35.0 1
4 35.0 3
5 NaN 3
6 54.0 1
7 2.0 3
8 27.0 3
9 14.0 2
Now I want to fill all the null values in Age with the mean of all the Age values for that respective Pclass type.
Example -
In the above snippet, for the null value of Age with Pclass = 3, it takes the mean of all the ages belonging to Pclass = 3, therefore replacing the null Age with 22.4.
I tried some solutions using groupby, but they made changes only to a specific Pclass value and converted the rest of the fields to null. How can I achieve 0 null values in this case?
You can use
1] transform and lambda function
In [41]: df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.mean()))
Out[41]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 22.4
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
Or use
2] fillna over mean
In [46]: df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))
Out[46]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 22.4
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
Or use
3] loc to replace null values
In [47]: df.loc[df['Age'].isnull(), 'Age'] = df.groupby('Pclass')['Age'].transform('mean')
In [48]: df
Out[48]:
Age Pclass
0 22.0 3
1 38.0 1
2 26.0 3
3 35.0 1
4 35.0 3
5 22.4 3
6 54.0 1
7 2.0 3
8 27.0 3
9 14.0 2
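All three options can be checked against the sample data. A minimal runnable sketch of option 2], which is the most direct since it never touches rows whose Age is already present:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

# Fill each missing Age with the mean Age of its own Pclass group
df["Age"] = df["Age"].fillna(df.groupby("Pclass")["Age"].transform("mean"))
print(df)
```

The Pclass-3 ages are 22, 26, 35, 2 and 27, whose mean is 22.4, so the single NaN becomes 22.4 as in the question.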
I create the following dataframe:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 NaN
3 2015-01-02 1 4 NaN
4 2015-01-02 2 1 14
5 2015-01-02 2 2 15
6 2015-01-02 2 3 16
7 2015-01-03 1 1 17
8 2015-01-03 1 2 18
9 2015-01-03 1 3 NaN
10 2015-01-03 1 4 21
11 2015-01-03 2 1 20
12 2015-01-03 2 2 21
And then I group the subproducts by products:
df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
and I would like to get the following:
Value
ProductID 1 2
SubProductId 1 2 3 4 1 2 3
Date
2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
But when I print it, every column that starts with NaN values gets pushed to the end:
Value
ProductID 1 2 1
SubProductId 1 2 1 2 3 4 3
Date
2015-01-02 11.0 12.0 14.0 15.0 16.0 NaN NaN
2015-01-03 17.0 18.0 20.0 21.0 NaN 21.0 NaN
How can I have every sub-column grouped under its corresponding column, even the sub-columns that contain NaN?
NB: Versions used:
Python version: 3.6.0
Pandas version: 0.19.2
If you want to have ordered column names, you can use sortlevel with axis=1 to sort the column index:
df1 = df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
# sort in descending order
df1.sortlevel(axis=1, ascending=False)
# Value
#ProductID 2 1
#SubProductId 3 2 1 4 3 2 1
#Date
#2015-01-02 16.0 15.0 14.0 NaN NaN 12.0 11.0
#2015-01-03 NaN 21.0 20.0 21.0 NaN 18.0 17.0
# sort in ascending order
df1.sortlevel(axis=1, ascending=True)
# Value
#ProductID 1 2
#SubProductId 1 2 3 4 1 2 3
#Date
#2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
#2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
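Note that sortlevel existed in pandas 0.19.2, the version in the question, but was removed in later releases. In modern pandas the same column reordering would be done with sort_index(axis=1); a sketch against the sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Date": ["2015-01-02"] * 7 + ["2015-01-03"] * 6,
    "ProductID": [1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2],
    "SubProductId": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2],
    "Value": [11, 12, np.nan, np.nan, 14, 15, 16, 17, 18, np.nan, 21, 20, 21],
})

df1 = df.set_index(["Date", "ProductID", "SubProductId"]).unstack(["ProductID", "SubProductId"])
# sort_index on the column axis replaces the removed sortlevel in modern pandas
df1 = df1.sort_index(axis=1)
print(df1)
```

This keeps every SubProductId under its ProductID in ascending order, NaN columns included.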
Trying to apply the method from here to a multi-index dataframe doesn't seem to work.
Take a data-frame:
import pandas as pd
import numpy as np
dates = pd.date_range('20070101',periods=3200)
df = pd.DataFrame(data=np.random.randint(0,100,(3200,1)), columns=list('A'))
df.loc[[5, 6, 7, 8, 9, 10, 11, 12, 13], 'A'] = np.nan  # add missing data points
df['date'] = dates
df = df[['date','A']]
Apply a season function to the date column
def get_season(row):
    if row['date'].month >= 3 and row['date'].month <= 5:
        return '2'
    elif row['date'].month >= 6 and row['date'].month <= 8:
        return '3'
    elif row['date'].month >= 9 and row['date'].month <= 11:
        return '4'
    else:
        return '1'
Apply the function
df['Season'] = df.apply(get_season, axis=1)
Create a 'Year' column for indexing
df['Year'] = df['date'].dt.year
Multi-index by Year and Season
df = df.set_index(['Year', 'Season'], inplace=False)
Count datapoints in each season
count = df.groupby(level=[0, 1]).count()
Drop the seasons with less than 75 days in them
count = count.drop(count[count.A < 75].index)
Create a variable for seasons with more than 75 days
complete = count[count['A'] >= 75].index
Using the isin function turns up False for everything, while I want it to select all the seasons that have more than 75 days of valid data in 'A'.
df = df.isin(complete)
df
Every value comes up false, and I can't see why.
I hope this is concise enough, I need this to work on a multi-index using seasons so I included it!
EDIT
Another method based on multi-index reindexing (from here) also doesn't work; it produces a blank dataframe:
df3 = df.reset_index().groupby('Year').apply(lambda x: x.set_index('Season').reindex(count,method='pad'))
EDIT 2
Also tried this
seasons = count[count['A'] >= 75].index
df = df[df['A'].isin(seasons)]
Again, blank output
I think you can use Index.isin:
complete = count[count['A'] >= 75].index
idx = df.index.isin(complete)
print(idx)
[ True True True ..., False False False]
print(df[idx])
date A
Year Season
2007 1 2007-01-01 24.0
1 2007-01-02 92.0
1 2007-01-03 54.0
1 2007-01-04 91.0
1 2007-01-05 91.0
1 2007-01-06 NaN
1 2007-01-07 NaN
1 2007-01-08 NaN
1 2007-01-09 NaN
1 2007-01-10 NaN
1 2007-01-11 NaN
1 2007-01-12 NaN
1 2007-01-13 NaN
1 2007-01-14 NaN
1 2007-01-15 18.0
1 2007-01-16 82.0
1 2007-01-17 55.0
1 2007-01-18 64.0
1 2007-01-19 89.0
1 2007-01-20 37.0
1 2007-01-21 45.0
1 2007-01-22 4.0
1 2007-01-23 34.0
1 2007-01-24 35.0
1 2007-01-25 90.0
1 2007-01-26 17.0
1 2007-01-27 29.0
1 2007-01-28 58.0
1 2007-01-29 7.0
1 2007-01-30 57.0
... ... ...
2015 3 2015-08-02 42.0
3 2015-08-03 0.0
3 2015-08-04 31.0
3 2015-08-05 39.0
3 2015-08-06 25.0
3 2015-08-07 1.0
3 2015-08-08 7.0
3 2015-08-09 97.0
3 2015-08-10 38.0
3 2015-08-11 59.0
3 2015-08-12 28.0
3 2015-08-13 84.0
3 2015-08-14 43.0
3 2015-08-15 63.0
3 2015-08-16 68.0
3 2015-08-17 0.0
3 2015-08-18 19.0
3 2015-08-19 61.0
3 2015-08-20 11.0
3 2015-08-21 84.0
3 2015-08-22 75.0
3 2015-08-23 37.0
3 2015-08-24 40.0
3 2015-08-25 66.0
3 2015-08-26 50.0
3 2015-08-27 74.0
3 2015-08-28 37.0
3 2015-08-29 19.0
3 2015-08-30 25.0
3 2015-08-31 15.0
[3106 rows x 2 columns]
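As an alternative to building a separate count frame and going through Index.isin (not the answer above, just a sketch under the same setup), transform('count') broadcasts each group's non-null count back onto its own rows, so undersized seasons can be dropped with a single boolean mask:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
dates = pd.date_range("20070101", periods=3200)
df = pd.DataFrame({"A": np.random.randint(0, 100, 3200).astype(float)})
df.loc[5:13, "A"] = np.nan  # add missing data points
df["date"] = dates
df["Season"] = df["date"].dt.month.map(
    lambda m: "2" if 3 <= m <= 5 else "3" if 6 <= m <= 8 else "4" if 9 <= m <= 11 else "1"
)
df["Year"] = df["date"].dt.year
df = df.set_index(["Year", "Season"])

# transform('count') puts each (Year, Season) group's valid-day count on every
# row of that group, so one mask keeps only seasons with >= 75 valid days
filtered = df[df.groupby(level=[0, 1])["A"].transform("count") >= 75]
print(filtered.shape)
```

The NaN rows inside a retained season survive, matching the output shown above; only whole seasons with fewer than 75 valid days (e.g. the truncated final season) are removed.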