I'm very confused by the output of the pct_change function when data with NaN values are involved. The first several rows of output in the right column are correct - it gives the percentage change in decimal form of the cell to the left in Column A relative to the cell in Column A two rows prior. But as soon as it reaches the NaN values in Column A, the output of the pct_change function makes no sense.
For example:
Row 8: NaN is 50% greater than 2?
Row 9: NaN is 0% greater than 3?
Row 11: 4 is 33% greater than NaN?
Row 12: 2 is 33% less than NaN?
Based on the above math, it seems like pct_change is assigning NaN a value of "3". Is that because pct_change effectively fills forward the last non-NaN value? Could someone please explain the logic here and why this happens?
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [2,1,3,1,4,5,2,3,np.nan,np.nan,np.nan,4,2,1,0,4]})
x = 2
df['pctchg_A'] = df['A'].pct_change(periods = x)
print(df.to_string())
Here's the output:
The behaviour is as expected. You need to carefully read the df.pct_change docs.
As per docs:
fill_method: str, default ‘pad’
How to handle NAs before computing percent changes.
Here, the pad method means it forward-fills each NaN with the nearest preceding non-NaN value before computing the percent change.
So if you ffill (pad) your NaN values yourself, you can see exactly what is happening. Check this out:
In [3201]: df['padded_A'] = df['A'].fillna(method='pad')
In [3203]: df['pctchg_A'] = df['A'].pct_change(periods = x)
In [3204]: df
Out[3204]:
A padded_A pctchg_A
0 2.0 2.0 NaN
1 1.0 1.0 NaN
2 3.0 3.0 0.500000
3 1.0 1.0 0.000000
4 4.0 4.0 0.333333
5 5.0 5.0 4.000000
6 2.0 2.0 -0.500000
7 3.0 3.0 -0.400000
8 NaN 3.0 0.500000
9 NaN 3.0 0.000000
10 NaN 3.0 0.000000
11 4.0 4.0 0.333333
12 2.0 2.0 -0.333333
13 1.0 1.0 -0.750000
14 0.0 0.0 -1.000000
15 4.0 4.0 3.000000
Now you can compare padded_A values with pctchg_A and see that it works as expected.
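If you don't want any padding at all, pct_change also accepts a fill_method argument; passing fill_method=None skips the forward-fill, so any change involving a NaN operand stays NaN (in recent pandas versions the padded default is deprecated in favour of exactly this behaviour). A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 3, 1, 4, 5, 2, 3, np.nan, np.nan, np.nan, 4, 2, 1, 0, 4]})

# No forward-filling before the computation: rows 8 through 12 come out as NaN
# because either the current value or the value two rows earlier is missing.
df['pctchg_A'] = df['A'].pct_change(periods=2, fill_method=None)
print(df.to_string())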
Hello, I have a dataframe as shown below:
daf = pd.DataFrame({'A':[10,np.nan,20,np.nan,30]})
daf['B'] = ''
The above code creates a data frame with column B containing empty strings:
A B
0 10.0
1 NaN
2 20.0
3 NaN
4 30.0
The problem here is that I need to replace a column consisting entirely of empty strings (note: the entire column should be empty) with the value provided as the last argument of numpy.place, here 1.
So I used the following code:
np.place(daf.to_numpy(),((daf[['A','B']] == '').all() & (daf[['A','B']] == '')).to_numpy(),[1])
which did nothing; it gave the same output:
A B
0 10.0
1 NaN
2 20.0
3 NaN
4 30.0
But when I assign daf['B'] = np.nan, the code seems to work fine: it checks whether the entire column is null and then replaces it with 1.
Here is the data frame:
A B
0 10.0 NaN
1 NaN NaN
2 20.0 NaN
3 NaN NaN
4 30.0 NaN
Replacing those NaNs with 1 where the entire column is NaN:
np.place(daf.to_numpy(),(daf[['A','B']].isnull() & daf[['A','B']].isnull().all()).to_numpy(),[1])
which gave the correct output:
A B
0 10.0 1.0
1 NaN 1.0
2 20.0 1.0
3 NaN 1.0
4 30.0 1.0
Can someone tell me how to make the replacement work with empty strings, and explain why it is not working when the input is empty strings?
If I'm understanding your question correctly, you want to replace a column of empty strings with a column of 1s. This can be done with DataFrame.replace():
daf.replace('', 1.0)
A B
0 10.0 1.0
1 NaN 1.0
2 20.0 1.0
3 NaN 1.0
4 30.0 1.0
This function also works with regex if you want to be more granular with the replacement.
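If the replacement should only happen when the entire column is empty (as the question requires), a minimal sketch along these lines may be closer to the intent; it only uses the example frame from the question:

import numpy as np
import pandas as pd

daf = pd.DataFrame({'A': [10, np.nan, 20, np.nan, 30]})
daf['B'] = ''

# Replace column B with 1.0 only if every entry in it is an empty string
if (daf['B'] == '').all():
    daf['B'] = 1.0

print(daf)

As for why the np.place call has no visible effect: np.place modifies the array it is given, and when a DataFrame holds mixed dtypes (floats plus strings), to_numpy() has to allocate a new object array, so the writes never reach the original DataFrame.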
I have a dataframe in the following format:
A B C D
01-01-2021 1.0 NaN NaN NaN
02-01-2021 2.0 1.0 NaN NaN
03-01-2021 2.0 2.0 NaN NaN
04-01-2021 3.0 2.0 NaN 1.0
05-01-2021 2.0 3.0 NaN 3.0
06-01-2021 1.0 2.0 1.0 2.0
07-01-2021 1.0 1.0 3.0 2.0
08-01-2021 2.0 2.0 1.0 3.0
09-01-2021 3.0 2.0 1.0 2.0
I want to create future-looking windows of width N=6 for each cell and, depending on the number of valid (non-NA) values in these windows, either return the first non-NA value in the window, shifted 5 days downwards, or a NaN.
In the example dataframe, column A is fully valid, without any NaN values. We create the first future window of width N=6 for 01-01-2021, which includes the dates from 01-01-2021 to 06-01-2021. There are no NaN values, i.e. the total number of valid values (6) is above the threshold thresh=3. Thus the first value in column A of the resulting dataframe will be 1.0 on 06-01-2021: we simply take the uppermost valid (non-NA) value in the window we created and move it down by 5 days, from 01-01-2021 to 06-01-2021. The rest of the values in this column are analogous, so column A will simply be shifted downwards by 5 days.
Column B has its first value missing (NaN). This way, our first value in the resulting dataframe will still appear on 06-01-2021 but it will be the value corresponding to 02-01-2021 from the original dataframe. Importantly, the second value in the resulting dataframe (on 07-01-2021) will be identical to the value on the day before, i.e. both will be 1.0 in the resulting dataframe.
In column C, the first and second future windows do not have enough non-NA values, therefore the values on 06-01-2021 and 07-01-2021 in the resulting dataframe will be NaN. On 08-01-2021, the value in the resulting dataframe will be 1.0, since that is the first non-NA value in the window that was created 5 days earlier, on 03-01-2021.
Here is what the resulting dataframe should look like:
A B C D
06-01-2021 1.0 1.0 NaN 1.0
07-01-2021 2.0 1.0 NaN 1.0
08-01-2021 2.0 2.0 1.0 1.0
09-01-2021 3.0 2.0 1.0 1.0
10-01-2021 2.0 3.0 1.0 3.0
11-01-2021 1.0 2.0 1.0 2.0
12-01-2021 1.0 1.0 3.0 2.0
13-01-2021 2.0 2.0 1.0 3.0
14-01-2021 3.0 2.0 1.0 2.0
I am aware of pandas's rolling functionality and that it has a min_periods parameter that closely resembles the functionality I am trying to apply here. I also know that groupby has a first method that is also partly what I'd need. However, I am not sure how to connect the dots. My initial idea was to shift the entire dataframe upwards by 6 and use rolling with min_periods; however, rolling has no first method (unlike groupby), and using df.shift(-6) removes the first rows of my dataframe, which are important for determining the values.
It sounds like the future-looking window of width N=6 with thresh=3 is exactly the same as filling each missing value from a value at most 3 days later. DataFrame.fillna was made for this: specify method="bfill" to fill upwards (axis=0) with a maximum of limit=3.
Converting the date column to datetime objects makes it easy to shift the dates 5 days into the future with DateOffset.
dt = pd.read_csv("data.txt", header=0)
dt["Date"] = pd.to_datetime(dt["Date"], dayfirst=True) + pd.DateOffset(days=5)
print(dt.fillna(method="bfill", axis=0, limit=3))
Date A B C D
0 2021-01-06 1.0 1.0 NaN 1.0
1 2021-01-07 2.0 1.0 NaN 1.0
2 2021-01-08 2.0 2.0 1.0 1.0
3 2021-01-09 3.0 2.0 1.0 1.0
4 2021-01-10 2.0 3.0 1.0 3.0
5 2021-01-11 1.0 2.0 1.0 2.0
6 2021-01-12 1.0 1.0 3.0 2.0
7 2021-01-13 2.0 2.0 1.0 3.0
8 2021-01-14 3.0 2.0 1.0 2.0
Data
Date,A,B,C,D
01-01-2021,1.0,NaN,NaN,NaN
02-01-2021,2.0,1.0,NaN,NaN
03-01-2021,2.0,2.0,NaN,NaN
04-01-2021,3.0,2.0,NaN,1.0
05-01-2021,2.0,3.0,NaN,3.0
06-01-2021,1.0,2.0,1.0,2.0
07-01-2021,1.0,1.0,3.0,2.0
08-01-2021,2.0,2.0,1.0,3.0
09-01-2021,3.0,2.0,1.0,2.0
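As a side note (not part of the answer above), fillna(method=...) is deprecated in newer pandas (2.1+); an equivalent sketch using DataFrame.bfill, with Date set as the index so the frame matches the expected output, assuming the same data.txt file:

import pandas as pd

dt = pd.read_csv("data.txt", header=0)
dt["Date"] = pd.to_datetime(dt["Date"], dayfirst=True) + pd.DateOffset(days=5)

# bfill(limit=3) pulls each missing value from a valid value at most 3 rows below it
print(dt.set_index("Date").bfill(limit=3))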
For unbalanced panel data, it's hard for me to generate a lagged variable, especially when the lag length is 2 or more. For example, I have a dataset that is an unbalanced panel. The objective of the task is to generate a variable lagged by 2 months.
import pandas as pd
import numpy as np
a=[[1,'1990/1/1',1],
[1,'1990/2/1',2],
[1,'1990/3/1',3],
[2,'1989/12/1',3],
[2,'1990/1/1',3],
[2,'1990/2/1',4],
[2,'1990/3/1',5.5],
[2,'1990/4/1',5],
[2,'1990/6/1',6]]
data=pd.DataFrame(a,columns=['id','date','value'])
data['date']=pd.to_datetime(data['date'])
Currently, my solution is:
data['lag2value']=np.where((data.groupby('id')['date'].diff(2)/np.timedelta64(1, 'M')).fillna(0).round()==2,
data.sort_values(['id','date']).groupby('id')['value'].shift(2),np.nan)
However, the last observation does have a lagged two-month counterpart: the date 1990-6-1 corresponds to 1990-4-1. My code cannot pick this up.
id date value lag2value
0 1 1990-01-01 1.0 NaN
1 1 1990-02-01 2.0 NaN
2 1 1990-03-01 3.0 1.0
3 2 1989-12-01 3.0 NaN
4 2 1990-01-01 3.0 NaN
5 2 1990-02-01 4.0 3.0
6 2 1990-03-01 5.5 3.0
7 2 1990-04-01 5.0 4.0
8 2 1990-06-01 6.0 NaN
One possible solution is to build a complete date table, i.e. a balanced panel dataset, and merge the current table onto it. However, if the working data is large, it's time-consuming to work on the complete table.
Is there a more elegant solution to the problem? Thanks in advance.
Use:
val = df.set_index('date').groupby('id').resample('MS').asfreq()['value']
val = val.groupby(level=0).shift(2)
df['lag2val'] = df.set_index(['id', 'date']).index.map(val)
Details:
STEP A: Use DataFrame.groupby on id and use groupby.resample to resample the grouped frame using monthly start frequency.
print(val)
id date
1 1990-01-01 1.0
1990-02-01 2.0
1990-03-01 3.0
2 1989-12-01 3.0
1990-01-01 3.0
1990-02-01 4.0
1990-03-01 5.5
1990-04-01 5.0
1990-05-01 NaN
1990-06-01 6.0
Name: value, dtype: float64
STEP B: Use Series.groupby on level=0 to group the series val and shift it 2 periods down, creating a val series lagged by 2 months.
print(val)
id date
1 1990-01-01 NaN
1990-02-01 NaN
1990-03-01 1.0
2 1989-12-01 NaN
1990-01-01 NaN
1990-02-01 3.0
1990-03-01 3.0
1990-04-01 4.0
1990-05-01 5.5
1990-06-01 5.0
Name: value, dtype: float64
STEP C: Finally, use set_index along with Series.map to map the new lagged val series to the original dataframe df.
print(df)
id date value lag2val
0 1 1990-01-01 1.0 NaN
1 1 1990-02-01 2.0 NaN
2 1 1990-03-01 3.0 1.0
3 2 1989-12-01 3.0 NaN
4 2 1990-01-01 3.0 NaN
5 2 1990-02-01 4.0 3.0
6 2 1990-03-01 5.5 3.0
7 2 1990-04-01 5.0 4.0
8 2 1990-06-01 6.0 5.0
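A possible alternative, not from the answer above but sketched here under the assumption (true for the example data) that every date falls on a month start: shift each row's date forward by two months and self-merge on (id, date), so a row only receives a lagged value if an observation existed exactly two months earlier.

import pandas as pd

a = [[1, '1990/1/1', 1], [1, '1990/2/1', 2], [1, '1990/3/1', 3],
     [2, '1989/12/1', 3], [2, '1990/1/1', 3], [2, '1990/2/1', 4],
     [2, '1990/3/1', 5.5], [2, '1990/4/1', 5], [2, '1990/6/1', 6]]
data = pd.DataFrame(a, columns=['id', 'date', 'value'])
data['date'] = pd.to_datetime(data['date'])

# Move every observation's date 2 months forward, then left-merge back on (id, date)
shifted = data.assign(date=data['date'] + pd.DateOffset(months=2))
shifted = shifted.rename(columns={'value': 'lag2value'})[['id', 'date', 'lag2value']]
out = data.merge(shifted, on=['id', 'date'], how='left')
print(out)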
For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
I want to first calculate the pairwise subtraction between df1 and df2. I am using cdist from scipy.spatial.distance with a custom function subtract_:
import pandas as pd
from scipy.spatial.distance import cdist

def subtract_(a, b):
    return abs(a - b)

d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns=d2_s.values.ravel())
print(dist_df)
11 12 13
6.0 7.0 8.0
5.0 6.0 7.0
4.0 5.0 6.0
3.0 4.0 5.0
Now I want to check these new columns, named 11, 12 and 13: is any value in this new dataframe less than 5? If there is, I want to do further calculations, like this:
For example, in the column named '11', a value less than 5 is 4, which is at row 3. In this case, I want to take 'col2' of df1 at row 3, which here is the value 2, and subtract from it col2 of df2 at row 1 (because the column named '11' came from the value at row 1 in df2).
My for loop for this is very complex. It would be great if there were an easier way in pandas.
Any help or suggestions would be appreciated.
The expected new dataframe is this:
0,1,2
NaN,NaN,NaN
NaN,NaN,NaN
(2-9)=-7,NaN,NaN
(5-9)=-4,(5-7)=-2,NaN
Similar to Ben's answer, but with np.where:
pd.DataFrame(np.where(dist_df<5, df1.col2.values[:,None] - df2.col2.values, np.nan),
index=dist_df.index,
columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
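For reference, a small sketch (reusing the question's df1 and df2) of just the second argument to np.where above: the full matrix of col2 differences produced by NumPy broadcasting, from which the values below the threshold are selected.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'col1': [5, 6, 7, 8], 'col2': [9, 3, 2, 5]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'col1': [11, 12, 13], 'col2': [9, 7, 2]})

# A (4, 1) column vector minus a length-3 vector broadcasts to a (4, 3) matrix
diff = df1['col2'].to_numpy()[:, None] - df2['col2'].to_numpy()
print(diff)
# [[ 0  2  7]
#  [-6 -4  1]
#  [-7 -5  0]
#  [-4 -2  3]]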
In your case, using numpy with mask (here df refers to the distance dataframe dist_df from the question):
df.mask(df<5,df-(df1.col2.values[:,None]+df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf=(df-(-df1.col2.values[:,None]+df2.col2.values)-df).where(df<5)
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
I am trying to do a pivot_table on a pandas DataFrame. I am almost getting the expected result, but it seems to be multiplied by two. I could just divide by two and call it a day; however, I want to know whether I am doing something wrong.
Here goes the code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={"IND":[1,2,3,4,5,1,5,5],"DATA":[2,3,4,2,10,4,3,3]})
df_pvt = pd.pivot_table(df, aggfunc=np.size, index=["IND"], columns="DATA")
df_pvt is now:
DATA 2 3 4 10
IND
1 2.0 NaN 2.0 NaN
2 NaN 2.0 NaN NaN
3 NaN NaN 2.0 NaN
4 2.0 NaN NaN NaN
5 NaN 4.0 NaN 2.0
However, instead of 2.0 it should be 1.0! What am I misunderstanding / doing wrong?
Use the string 'size' instead. This will trigger the Pandas interpretation of "size", i.e. the number of elements in a group. The NumPy interpretation of size is the product of the lengths of each dimension.
df = pd.pivot_table(df, aggfunc='size', index=["IND"], columns="DATA")
print(df)
DATA 2 3 4 10
IND
1 1.0 NaN 1.0 NaN
2 NaN 1.0 NaN NaN
3 NaN NaN 1.0 NaN
4 1.0 NaN NaN NaN
5 NaN 2.0 NaN 1.0
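As a cross-check (not part of the answer above), pd.crosstab counts the same IND/DATA combinations directly and returns integer counts, with 0 instead of NaN for missing combinations:

import pandas as pd

df = pd.DataFrame(data={"IND": [1, 2, 3, 4, 5, 1, 5, 5],
                        "DATA": [2, 3, 4, 2, 10, 4, 3, 3]})

# Frequency count of each (IND, DATA) combination
print(pd.crosstab(df["IND"], df["DATA"]))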