I have a daily measured dataset with a datetime index. I am trying to resample it to monthly frequency by taking only the data from the first available day of each month.
dataframe df:
2010-10-04 NaN 4
2010-10-05 3 5
2010-10-06 5 2
I tried using
df.resample("MS").first()
But that ends up giving me
2010-10-01 3 4
instead of
2010-10-04 NaN 4
Can I avoid dropping the NaN values? I couldn't find a suitable parameter in the documentation.
IIUC, you need the first row for each month in your dataset. I elaborated on your example by adding more months from different years.
Try grouping by the month and taking the head(1) of each group, assuming the data is sorted by dates.
# A B
# 2010-10-04 NaN 4.0 <----
# 2010-10-05 3.0 5.0
# 2010-10-06 5.0 2.0
# 2010-09-05 NaN NaN <----
# 2010-09-05 3.0 5.0
# 2010-09-06 5.0 2.0
# 2019-10-04 7.0 7.0 <----
# 2019-10-05 3.0 5.0
# 2019-10-06 5.0 2.0
df.groupby(pd.Grouper(freq="M")).head(1)
A B
2010-09-05 NaN NaN
2010-10-04 NaN 4.0
2019-10-04 7.0 7.0
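For reference, a minimal sketch of how the extended sample above could be rebuilt so the result is reproducible (the values are just the illustrative ones from the comments). Note that head(1) keeps rows in their original order, so sorting first gives the chronological output shown; on recent pandas versions the monthly frequency may need to be spelled "ME" instead of "M".
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2010-10-04", "2010-10-05", "2010-10-06",
    "2010-09-05", "2010-09-05", "2010-09-06",
    "2019-10-04", "2019-10-05", "2019-10-06",
])
df = pd.DataFrame({"A": [np.nan, 3, 5, np.nan, 3, 5, 7, 3, 5],
                   "B": [4, 5, 2, np.nan, 5, 2, 7, 5, 2]}, index=idx)

# first available row of each month, NaNs left untouched
df.sort_index().groupby(pd.Grouper(freq="M")).head(1)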
I have a DataFrame with different columns. Some columns might start with a series of NaN values before the real values start. However, after the first non-NaN value in each column, some NaN values can also appear. For example:
A B C
2021-08-31 1.0 NaN 5.0
2021-09-01 2.0 NaN NaN
2021-09-02 4.0 3.0 NaN
2021-09-03 NaN 7.0 5.0
2021-09-06 2.0 5.0 5.0
2021-09-07 9.0 NaN 5.0
2021-09-08 4.0 5.0 NaN
I would like to remove all the rows where there is a NaN value, but only after the first non-NaN value in that column. Said differently, the NaN values before the first non-NaN value of a column are not taken into account in the removal process.
So the previous example would look something like this:
A B C
2021-08-31 1.0 NaN 5.0
2021-09-06 2.0 5.0 5.0
I started looking for a solution using the list of 'first_valid_date' values and then removing rows with conditions on the index being after the first_valid_date of the column plus the value being NaN, but I have a problem doing the removal with the 2 conditions (NaN and index):
df.drop(df[df.isna().any(axis=1) & df.index > mydateindex].index)
Try using loc with isna, notna, and shift:
>>> df.loc[(~(df.shift().notna() & df.isna() & df.shift(-1).notna())).all(1)]
              A    B    C
2021-08-31  1.0  NaN  5.0
2021-09-01  2.0  NaN  NaN
2021-09-02  4.0  3.0  NaN
2021-09-06  2.0  5.0  5.0
2021-09-08  4.0  5.0  NaN
I think I found the correct way to do it:
df.loc[~(df.fillna(method='ffill').notna() & ~df.notna()).max(axis=1)]
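To unpack what that one-liner does, here is a step-by-step sketch of the same logic (using the newer df.ffill() spelling instead of fillna(method='ffill'); the intermediate names are just for illustration):
seen_value = df.ffill().notna()          # column already had a non-NaN value at or before this row
currently_nan = df.isna()                # the value in this row is NaN
bad_row = (seen_value & currently_nan).max(axis=1)  # True if any column is NaN after its first valid value
result = df.loc[~bad_row]                # keeps only 2021-08-31 and 2021-09-06 in the example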
I'm very confused by the output of the pct_change function when data with NaN values are involved. The first several rows of output in the right column are correct - it gives the percentage change in decimal form of the cell to the left in Column A relative to the cell in Column A two rows prior. But as soon as it reaches the NaN values in Column A, the output of the pct_change function makes no sense.
For example:
Row 8: NaN is 50% greater than 2?
Row 9: NaN is 0% greater than 3?
Row 11: 4 is 33% greater than NaN?
Row 12: 2 is 33% less than NaN?
Based on the above math, it seems like pct_change is assigning NaN a value of "3". Is that because pct_change effectively fills forward the last non-NaN value? Could someone please explain the logic here and why this happens?
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [2,1,3,1,4,5,2,3,np.nan,np.nan,np.nan,4,2,1,0,4]})
x = 2
df['pctchg_A'] = df['A'].pct_change(periods = x)
print(df.to_string())
The behaviour is as expected. You need to carefully read the df.pct_change docs.
As per docs:
fill_method: str, default ‘pad’
How to handle NAs before computing percent changes.
Here, the pad method means it will forward-fill the NaN values with the most recent preceding non-NaN value.
So, if you ffill or pad your NaN values, you will understand exactly what's happening. Check this out:
In [3201]: df['padded_A'] = df['A'].fillna(method='pad')
In [3203]: df['pctchg_A'] = df['A'].pct_change(periods = x)
In [3204]: df
Out[3204]:
A padded_A pctchg_A
0 2.0 2.0 NaN
1 1.0 1.0 NaN
2 3.0 3.0 0.500000
3 1.0 1.0 0.000000
4 4.0 4.0 0.333333
5 5.0 5.0 4.000000
6 2.0 2.0 -0.500000
7 3.0 3.0 -0.400000
8 NaN 3.0 0.500000
9 NaN 3.0 0.000000
10 NaN 3.0 0.000000
11 4.0 4.0 0.333333
12 2.0 2.0 -0.333333
13 1.0 1.0 -0.750000
14 0.0 0.0 -1.000000
15 4.0 4.0 3.000000
Now you can compare padded_A values with pctchg_A and see that it works as expected.
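If you don't want the NaNs to be forward-filled at all, pct_change also accepts fill_method=None; rows where either operand is NaN then simply produce NaN. A small sketch (not part of the original answer):
df['pctchg_A_nofill'] = df['A'].pct_change(periods=x, fill_method=None)
# rows 8-12 now come out as NaN instead of being computed against the padded value 3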
For unbalanced panel data, it's hard for me to generate a lagged variable, especially when the lag length is 2 or more. For example, I have a dataset that is an unbalanced panel. The objective of the task is to generate a lagged 2-month variable.
import pandas as pd
import numpy as np
a=[[1,'1990/1/1',1],
[1,'1990/2/1',2],
[1,'1990/3/1',3],
[2,'1989/12/1',3],
[2,'1990/1/1',3],
[2,'1990/2/1',4],
[2,'1990/3/1',5.5],
[2,'1990/4/1',5],
[2,'1990/6/1',6]]
data=pd.DataFrame(a,columns=['id','date','value'])
data['date']=pd.to_datetime(data['date'])
Currently, my solution is:
data['lag2value']=np.where((data.groupby('id')['date'].diff(2)/np.timedelta64(1, 'M')).fillna(0).round()==2,
data.sort_values(['id','date']).groupby('id')['value'].shift(2),np.nan)
However, the last observation does have a lagged two-month observation; that is to say, the date 1990-6-1 corresponds to 1990-4-1. My code cannot figure that out.
id date value lag2value
0 1 1990-01-01 1.0 NaN
1 1 1990-02-01 2.0 NaN
2 1 1990-03-01 3.0 1.0
3 2 1989-12-01 3.0 NaN
4 2 1990-01-01 3.0 NaN
5 2 1990-02-01 4.0 3.0
6 2 1990-03-01 5.5 3.0
7 2 1990-04-01 5.0 4.0
8 2 1990-06-01 6.0 NaN
One possible solution is to build a complete date table that is a balanced panel dataset and merge the current table into it. However, if the working data is large, it's time-consuming to work on the complete table.
Is there any elegant solution to this problem? Thanks in advance.
Use:
val = df.set_index('date').groupby('id').resample('MS').asfreq()['value']
val = val.groupby(level=0).shift(2)
df['lag2val'] = df.set_index(['id', 'date']).index.map(val)
Details:
STEP A: Use DataFrame.groupby on id and use groupby.resample to resample the grouped frame using monthly start frequency.
print(val)
id date
1 1990-01-01 1.0
1990-02-01 2.0
1990-03-01 3.0
2 1989-12-01 3.0
1990-01-01 3.0
1990-02-01 4.0
1990-03-01 5.5
1990-04-01 5.0
1990-05-01 NaN
1990-06-01 6.0
Name: value, dtype: float64
STEP B: Use Series.groupby on level=0 to group the series val and shift it 2 periods down to create a 2-month-lagged val series.
print(val)
id date
1 1990-01-01 NaN
1990-02-01 NaN
1990-03-01 1.0
2 1989-12-01 NaN
1990-01-01 NaN
1990-02-01 3.0
1990-03-01 3.0
1990-04-01 4.0
1990-05-01 5.5
1990-06-01 5.0
Name: value, dtype: float64
STEP C: Finally, use set_index along with Series.map to map the new lagged val series onto the original dataframe df.
print(df)
id date value lag2val
0 1 1990-01-01 1.0 NaN
1 1 1990-02-01 2.0 NaN
2 1 1990-03-01 3.0 1.0
3 2 1989-12-01 3.0 NaN
4 2 1990-01-01 3.0 NaN
5 2 1990-02-01 4.0 3.0
6 2 1990-03-01 5.5 3.0
7 2 1990-04-01 5.0 4.0
8 2 1990-06-01 6.0 5.0
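A possible alternative (a sketch, not part of the answer above, assuming month-start dates as in the example): shift each observation's date forward two months and merge that back onto the original frame, so a row only picks up a lagged value when an observation exactly two months earlier exists.
lagged = data[['id', 'date', 'value']].copy()
lagged['date'] = lagged['date'] + pd.DateOffset(months=2)   # the value becomes visible 2 months later
out = data.merge(lagged.rename(columns={'value': 'lag2val'}),
                 on=['id', 'date'], how='left')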
For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
First, I want to calculate the pairwise subtraction from df2 to df1. I am using scipy.spatial.distance.cdist with a function subtract_:
from scipy.spatial.distance import cdist

def subtract_(a, b):
    return abs(a - b)

d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns=d2_s.values.ravel())
print(dist_df)
    11   12   13
0  6.0  7.0  8.0
1  5.0  6.0  7.0
2  4.0  5.0  6.0
3  3.0  4.0  5.0
Now I want to work with these new columns, named 11, 12 and 13. I am checking whether there are any values in this new dataframe less than 5. If there are, I want to do further calculations, like this:
For example, for the column named '11', the value less than 5 is 4, which is at row 3. In this case, I want to take 'col2' of df1 at row 3, which here is the value 2, and subtract from it 'col2' of df2 at row 1 (because the column named '11' came from the value at row 1 in df2).
My for loop for this is very complex. It would be great if there were some easier way in pandas.
Any help, suggestions would be great.
The expected new dataframe is this:
0,1,2
NaN,NaN,NaN
NaN,NaN,NaN
(2-9)=-7,NaN,NaN
(5-9)=-4,(5-7)=-2,NaN
Similar to Ben's answer, but with np.where:
pd.DataFrame(np.where(dist_df<5, df1.col2.values[:,None] - df2.col2.values, np.nan),
index=dist_df.index,
columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
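The key piece is broadcasting: df1.col2.values[:,None] has shape (4, 1) and df2.col2.values has shape (3,), so the subtraction produces the full 4x3 matrix of col2 differences in one step, and np.where keeps an entry only where the matching distance is below 5. A small sketch of just that piece:
diff = df1.col2.values[:, None] - df2.col2.values   # shape (4, 3): each df1 row's col2 minus each df2 row's col2
print(diff.shape)                                   # (4, 3), same shape as dist_df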
In your case, using numpy with mask:
df.mask(df<5,df-(df1.col2.values[:,None]+df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf=(df-(-df1.col2.values[:,None]+df2.col2.values)-df).where(df<5)
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
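Note that the Update expression simplifies: df - (-df1.col2.values[:,None] + df2.col2.values) - df is just df1.col2.values[:,None] - df2.col2.values, so (a sketch, assuming df here is the dist_df from the question) the same result can be written as:
pd.DataFrame(df1.col2.values[:, None] - df2.col2.values,
             index=dist_df.index, columns=dist_df.columns).where(dist_df < 5)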
I have a dataset which is only one column. I want to cut the column into multiple dataframes.
I use a for loop to create a list containing the positions at which I want to cut the dataframe.
import pandas as pd
df = pd.read_csv("column.csv", delimiter=";", header=0, index_col=(0))
number_of_pixels = int(len(df.index))
print("You have " + str(number_of_pixels) +" pixels in your file")
number_of_rows = int(input("Enter number of rows you want to create"))
list = []  # this list contains the number of pixels per row
for i in range(0, number_of_rows):  # this loop fills the list with the number of pixels per row
    pixels_per_row = int(input("Enter number of pixels in row " + str(i)))
    list.append(pixels_per_row)
print(list)
After cutting the column into multiple dataframes, I want to transpose each dataframe and concatenate all of them back together using:
df1=df1.reset_index(drop=True)
df1=df1.T
df2=df2.reset_index(drop=True)
df2=df2.T
frames = [df1,df2]
result = pd.concat(frames, axis=0)
print(result)
So I want to create a loop that cuts my dataframe into multiple frames at the positions stored in my list.
Thank you!
This is a problem that is better solved with numpy. I'll start from the point where you have received a list from your user input. The whole point is to use numpy.split to separate the values based on the cumulative number of pixels requested, and then create a new DataFrame.
Setup
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'val': np.random.randint(1,10,50)})
lst = [4,10,2,1,15,8,9,1]
Code
pd.DataFrame(np.split(df.val.values, np.cumsum(lst)[:-1]))
Output
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 3 3.0 7.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4 7.0 2.0 1.0 2.0 1.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN NaN
2 1 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 4.0 3.0 5.0 8.0 3.0 5.0 9.0 1.0 8.0 4.0 5.0 7.0 2.0 6.0
5 7 3.0 2.0 9.0 4.0 6.0 1.0 3.0 NaN NaN NaN NaN NaN NaN NaN
6 7 3.0 5.0 5.0 7.0 4.0 1.0 7.0 5.0 NaN NaN NaN NaN NaN NaN
7 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If your list asks for more pixels than the total number of rows in your initial DataFrame, then you'll get extra all-NaN rows in your output. If your lst sums to less than the total number of pixels, all the leftover values will be added to the last row. Since you didn't specify either of these conditions in your question, I'm not sure how you'd want to handle them.
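If you want to handle those cases explicitly rather than silently, a small guard before splitting could look like this (a sketch, not part of the answer above):
total = sum(lst)
if total != len(df):
    raise ValueError(f"row lengths sum to {total}, but the column has {len(df)} pixels")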