For unbalanced panel data, I find it hard to generate lagged variables, especially when the lag length is 2 or more. For example, I have an unbalanced panel dataset, and the goal is to generate a variable lagged by 2 months.
import pandas as pd
import numpy as np
a=[[1,'1990/1/1',1],
[1,'1990/2/1',2],
[1,'1990/3/1',3],
[2,'1989/12/1',3],
[2,'1990/1/1',3],
[2,'1990/2/1',4],
[2,'1990/3/1',5.5],
[2,'1990/4/1',5],
[2,'1990/6/1',6]]
data=pd.DataFrame(a,columns=['id','date','value'])
data['date']=pd.to_datetime(data['date'])
Currently, my solution is:
data['lag2value'] = np.where(
    (data.groupby('id')['date'].diff(2) / np.timedelta64(1, 'M')).fillna(0).round() == 2,
    data.sort_values(['id', 'date']).groupby('id')['value'].shift(2),
    np.nan)
However, the last observation does have a value from two months earlier: the date 1990-06-01 corresponds to 1990-04-01. My code cannot pick it up.
id date value lag2value
0 1 1990-01-01 1.0 NaN
1 1 1990-02-01 2.0 NaN
2 1 1990-03-01 3.0 1.0
3 2 1989-12-01 3.0 NaN
4 2 1990-01-01 3.0 NaN
5 2 1990-02-01 4.0 3.0
6 2 1990-03-01 5.5 3.0
7 2 1990-04-01 5.0 4.0
8 2 1990-06-01 6.0 NaN
One possible solution is to build a complete date table, i.e. a balanced panel dataset, and merge the current table onto it. However, if the working data is large, it is time-consuming to work with the complete table.
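A rough sketch of what I mean, using the example data above (the names pieces, full_rows, result, and lag2value_full are just for illustration):
pieces = []
for gid, grp in data.groupby('id'):
    # build the complete monthly grid for this id and align the observed rows to it
    full_dates = pd.date_range(grp['date'].min(), grp['date'].max(), freq='MS')
    g = grp.set_index('date').reindex(full_dates).rename_axis('date').reset_index()
    g['id'] = gid
    g['lag2value_full'] = g['value'].shift(2)   # on the balanced grid the lag is a plain 2-row shift
    pieces.append(g)
full_rows = pd.concat(pieces, ignore_index=True)
# keep only the rows that actually exist in the unbalanced data
result = data.merge(full_rows[['id', 'date', 'lag2value_full']], on=['id', 'date'], how='left')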
Is there a more elegant solution to this problem? Thanks in advance.
Use:
val = df.set_index('date').groupby('id').resample('MS').asfreq()['value']
val = val.groupby(level=0).shift(2)
df['lag2val'] = df.set_index(['id', 'date']).index.map(val)
Details:
STEP A: Use DataFrame.groupby on id and groupby.resample to resample each group to monthly-start ('MS') frequency. This fills in the missing months (here 1990-05-01 for id 2) with NaN:
print(val)
id date
1 1990-01-01 1.0
1990-02-01 2.0
1990-03-01 3.0
2 1989-12-01 3.0
1990-01-01 3.0
1990-02-01 4.0
1990-03-01 5.5
1990-04-01 5.0
1990-05-01 NaN
1990-06-01 6.0
Name: value, dtype: float64
STEP B: Use Series.groupby on level=0 (the id level) and shift the series val 2 periods down to create a 2-month-lagged series:
print(val)
id date
1 1990-01-01 NaN
1990-02-01 NaN
1990-03-01 1.0
2 1989-12-01 NaN
1990-01-01 NaN
1990-02-01 3.0
1990-03-01 3.0
1990-04-01 4.0
1990-05-01 5.5
1990-06-01 5.0
Name: value, dtype: float64
STEP C: Finally, use set_index along with Series.map to map the new lagged val series back onto the original dataframe df:
print(df)
id date value lag2val
0 1 1990-01-01 1.0 NaN
1 1 1990-02-01 2.0 NaN
2 1 1990-03-01 3.0 1.0
3 2 1989-12-01 3.0 NaN
4 2 1990-01-01 3.0 NaN
5 2 1990-02-01 4.0 3.0
6 2 1990-03-01 5.5 3.0
7 2 1990-04-01 5.0 4.0
8 2 1990-06-01 6.0 5.0
I have two dataframes shown below:
df_1 =
Lon Lat N
0 2 1 1
1 2 2 3
2 2 3 1
3 3 2 2
and
df_2 =
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 NaN
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 NaN
6 3.0 2.0 NaN
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 NaN
10 3.0 3.0 NaN
11 4.0 3.0 NaN
What I want to do is compare these two dfs and merge them according to Lon and Lat. That is to say, a NaN in df_2 should be replaced by the value from df_1 wherever the corresponding Lon and Lat are identical. The ideal output should be:
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3
6 3.0 2.0 2
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1
10 3.0 3.0 NaN
11 4.0 3.0 NaN
The reason I want to do this is that df_1's coordinates Lat and Lon form a non-rectangular (unstructured) grid, and I need to fill in some NaN values to get a rectangular meshgrid so that contourf is applicable. It would be highly appreciated if you could suggest better ways to make the contour plot.
I have tried df_2.combine_first(df_1), but it doesn't work.
Thanks!
df_2.drop(columns = 'N').merge(df_1, on = ['Lon', 'Lat'], how = 'left')
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1.0
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3.0
6 3.0 2.0 2.0
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1.0
10 3.0 3.0 NaN
11 4.0 3.0 NaN
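If the goal is to fill df_2 in place, the merged column can be assigned back; a minimal sketch, assuming df_2 keeps its default RangeIndex so the rows line up:
df_2['N'] = df_2.drop(columns='N').merge(df_1, on=['Lon', 'Lat'], how='left')['N']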
If you first create df_2 with all the needed coordinate values, you can update it with the other DataFrame by using pandas.DataFrame.update.
For this you need to first set the correct index using pandas.DataFrame.set_index, as sketched below.
Have a look at this Post for more information.
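A minimal sketch of that update approach, assuming df_1 and df_2 as in the example above (df_1's Lon/Lat are cast to float so the MultiIndexes align; filled and result are just illustrative names):
filled = df_2.set_index(['Lon', 'Lat'])
# update() only takes non-NaN values from the other frame, so it fills the NaN slots where the index matches
filled.update(df_1.astype({'Lon': float, 'Lat': float}).set_index(['Lon', 'Lat']))
result = filled.reset_index()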
I have a daily-measured dataset with a datetime index. I am trying to resample it to monthly frequency by taking only the data from the first available day of each month.
dataframe df:
2010-10-04  NaN  4
2010-10-05    3  5
2010-10-06    5  2
I tried using
df.resample("MS").first()
But that ends up giving me
2010-10-01 3 4
instead of
2010-10-04  NaN  4
Can I avoid dropping NaN values? I couldn't find a suitable parameter in the documentation.
IIUC, you need the first row for each month in your dataset. I elaborated on your example by adding more months and a different year.
Try grouping by month and taking head(1) for each group, assuming the data is sorted by date (a sorted variant is sketched after the output below).
# A B
# 2010-10-04 NaN 4.0 <----
# 2010-10-05 3.0 5.0
# 2010-10-06 5.0 2.0
# 2010-09-05 NaN NaN <----
# 2010-09-05 3.0 5.0
# 2010-09-06 5.0 2.0
# 2019-10-04 7.0 7.0 <----
# 2019-10-05 3.0 5.0
# 2019-10-06 5.0 2.0
df.groupby(pd.Grouper(freq="M")).head(1)
A B
2010-09-05 NaN NaN
2010-10-04 NaN 4.0
2019-10-04 7.0 7.0
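If the rows might not already be sorted by date, a hedged variant sorts the index first so that head(1) really returns the earliest day of each month:
df.sort_index().groupby(pd.Grouper(freq="M")).head(1)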
I have a table of daily (time series) rainfall for several cities. How can I use pandas fillna to replace NaN with the negative of the next day's rain for the same city? Thank you.
import pandas as pd
import numpy as np
# Date was not defined in the question; any 5 consecutive daily dates work for this example
Date = list(pd.date_range('2021-01-01', periods=5))
rain_before = pd.DataFrame({'date': Date*2, 'city': list('aaaaabbbbb'),
                            'rain': [6, np.nan, 1, np.nan, np.nan, 4, np.nan, np.nan, 8, np.nan]})
# after fillna, the table should look like this:
rain_after_fillna = pd.DataFrame({'date': Date*2, 'city': list('aaaaabbbbb'),
                                  'rain': [6, -1, 1, np.nan, np.nan, 4, np.nan, -8, 8, np.nan]})
You can use shift and fillna:
rain_before['rain'].fillna(rain_before.groupby('city')['rain']
.transform(lambda x: -x.shift(-1)))
0 6.0
1 -1.0
2 1.0
3 NaN
4 NaN
5 4.0
6 NaN
7 -8.0
8 8.0
9 NaN
Name: rain, dtype: float64
Use the series shift(-1) * -1. There was no sample dataset, so I've synthesized one and not included city. The same approach can be used per city (sketched after the output below); sort order needs to be considered.
import datetime as dt
import random
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range(dt.date(2021, 1, 1), dt.date(2021, 1, 10)),
                   "rainfall": [i*random.randint(0, 1) for i in range(10)]}).replace({0: np.nan})
df["rainfall_nan"] = df["rainfall"].fillna(df["rainfall"].shift(-1)*-1)
output
Date rainfall rainfall_nan
2021-01-01 NaN -1.0
2021-01-02 1.0 1.0
2021-01-03 2.0 2.0
2021-01-04 3.0 3.0
2021-01-05 NaN -5.0
2021-01-06 5.0 5.0
2021-01-07 6.0 6.0
2021-01-08 7.0 7.0
2021-01-09 NaN -9.0
2021-01-10 9.0 9.0
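A sketch of the per-city variant mentioned above, assuming a hypothetical city column alongside Date and rainfall, with the frame sorted by city and date first:
df = df.sort_values(["city", "Date"])
df["rainfall_nan"] = df["rainfall"].fillna(df.groupby("city")["rainfall"].shift(-1) * -1)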
For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
First, I want to calculate the pairwise subtraction between df2 and df1 (on col1). I am using scipy.spatial.distance.cdist with a custom function subtract_:
from scipy.spatial.distance import cdist
import pandas as pd

def subtract_(a, b):
    return abs(a - b)

d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns=d2_s.values.ravel())
print(dist_df)
    11   12   13
0  6.0  7.0  8.0
1  5.0  6.0  7.0
2  4.0  5.0  6.0
3  3.0  4.0  5.0
Now I want to check these new columns, named 11, 12 and 13: are there any values in this new dataframe less than 5? If there are, I want to do further calculations, like this:
For example, in the column named '11' the value 4 (at the row for ID 3) is less than 5. In that case I want to take df1's col2 at that row, which is the value 2, and subtract from it df2's col2 at the row for ID 1 (because the column named '11' came from the value in that row of df2), giving 2 - 9.
My for loop for this is very complex. It would be great if there were an easier way in pandas.
Any help or suggestions would be great.
The expected new dataframe is this
0,1,2
NaN,NaN,NaN
NaN,NaN,NaN
(2-9)=-7,NaN,NaN
(5-9)=-4,(5-7)=-2,NaN
Similar to Ben's answer, but with np.where (the broadcasting step is sketched after the output below):
pd.DataFrame(np.where(dist_df<5, df1.col2.values[:,None] - df2.col2.values, np.nan),
index=dist_df.index,
columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
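For clarity, df1.col2.values[:,None] - df2.col2.values is plain NumPy broadcasting; a minimal sketch of the shapes involved, using the example frames above:
a = df1['col2'].values[:, None]   # shape (4, 1)
b = df2['col2'].values            # shape (3,)
pairwise = a - b                  # shape (4, 3); pairwise[i, j] == df1.col2[i] - df2.col2[j]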
In your case, using mask with numpy broadcasting:
df.mask(df<5,df-(df1.col2.values[:,None]+df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf=(df-(-df1.col2.values[:,None]+df2.col2.values)-df).where(df<5)
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
I need to get rid of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1,np.nan,3],[2,3,7,np.nan],[4,5,np.nan,8],[5,np.nan,4,9],[np.nan,1,2,np.nan]], columns=['A','B','C','D'])
df = df[df['C'].notnull()]
df
This is just proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN