I have two dataframes df1 and df2 of the same size and dimensions. Is there a simple way to copy all the NaN values from 'df1' to 'df2'? The example below demonstrates the output I want from a hypothetical .copynans() method:
In: df1
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 100.0 0.353 0.300 0.326
2012-07-01 00:30:00 101.0 0.522 0.258 0.304
2012-07-01 01:00:00 102.0 0.311 0.369 0.228
2012-07-01 01:30:00 103.0 NaN 0.478 0.247
2012-07-01 02:00:00 101.0 NaN NaN 0.259
2012-07-01 02:30:00 102.0 0.281 NaN 0.239
2012-07-01 03:00:00 125.0 0.320 NaN 0.217
2012-07-01 03:30:00 136.0 0.288 NaN 0.283
In: df2
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 1.0 2.0 3.0 4.0
2012-07-01 00:30:00 1.0 2.0 3.0 4.0
2012-07-01 01:00:00 1.0 2.0 3.0 4.0
2012-07-01 01:30:00 1.0 2.0 3.0 4.0
2012-07-01 02:00:00 1.0 2.0 3.0 4.0
2012-07-01 02:30:00 1.0 2.0 3.0 4.0
2012-07-01 03:00:00 1.0 2.0 3.0 4.0
2012-07-01 03:30:00 1.0 2.0 3.0 4.0
In: df2.copynans(df1)
Out:
10053802 10053856 10053898 10058054
2012-07-01 00:00:00 1.0 2.0 3.0 4.0
2012-07-01 00:30:00 1.0 2.0 3.0 4.0
2012-07-01 01:00:00 1.0 2.0 3.0 4.0
2012-07-01 01:30:00 1.0 NaN 3.0 4.0
2012-07-01 02:00:00 1.0 NaN NaN 4.0
2012-07-01 02:30:00 1.0 2.0 NaN 4.0
2012-07-01 03:00:00 1.0 2.0 NaN 4.0
2012-07-01 03:30:00 1.0 2.0 NaN 4.0
Either
df2.where(df1.notnull())
or
df2.mask(df1.isnull())
# Use the null cells of df1 as a boolean mask to set the corresponding cells of df2 to NaN
df2[df1.isnull()] = np.nan
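For completeness, here is a minimal, self-contained sketch with toy data showing that both approaches agree (copynans itself is only the asker's hypothetical method name, not a real pandas API):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})
df2 = pd.DataFrame({'a': [10.0, 20.0, 30.0], 'b': [40.0, 50.0, 60.0]})

# Keep df2's values only where df1 is not NaN; everything else becomes NaN.
result = df2.where(df1.notnull())   # equivalent: df2.mask(df1.isnull())

# In-place variant via boolean-mask assignment.
df2[df1.isnull()] = np.nan

All of these rely on df1 and df2 sharing the same index and columns, so the boolean mask aligns cell by cell.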
Related
I have a DataFrame looking like this:
year 2015 2016 2017 2018 2019 2015 2016 2017 2018 2019 ... 2015 2016 2017 2018 2019 2015 2016 2017 2018 2019
PATIENTS PATIENTS PATIENTS PATIENTS PATIENTS month month month month month ... diffs_24h diffs_24h diffs_24h diffs_24h diffs_24h diffs_168h diffs_168h diffs_168h diffs_168h diffs_168h
date
2016-01-01 00:00:00 0.0 2.0 1.0 7.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 -4.0 2.0 -2.0 NaN -3.0 -2.0 -3.0 -6.0
2016-01-01 01:00:00 6.0 6.0 7.0 6.0 7.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 0.0 0.0 1.0 NaN 3.0 1.0 2.0 -1.0
2016-01-01 02:00:00 2.0 7.0 6.0 2.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 3.0 -1.0 0.0 NaN 6.0 2.0 -3.0 0.0
2016-01-01 03:00:00 0.0 2.0 2.0 4.0 6.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 0.0 2.0 4.0 NaN -1.0 -2.0 3.0 3.0
2016-01-01 04:00:00 1.0 2.0 5.0 8.0 0.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 5.0 7.0 -1.0 NaN -2.0 3.0 5.0 -2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-12-31 19:00:00 6.0 7.0 6.0 6.0 6.0 12.0 12.0 12.0 12.0 12.0 ... -9.0 -1.0 -7.0 1.0 -2.0 1.0 0.0 -6.0 -4.0 0.0
2016-12-31 20:00:00 2.0 2.0 5.0 5.0 3.0 12.0 12.0 12.0 12.0 12.0 ... -9.0 -7.0 -12.0 -1.0 -10.0 -2.0 -6.0 -2.0 -1.0 -4.0
2016-12-31 21:00:00 4.0 5.0 3.0 3.0 3.0 12.0 12.0 12.0 12.0 12.0 ... -2.0 -3.0 -10.0 -2.0 -11.0 -2.0 -2.0 -2.0 -3.0 -2.0
2016-12-31 22:00:00 5.0 2.0 6.0 6.0 3.0 12.0 12.0 12.0 12.0 12.0 ... 0.0 -6.0 -4.0 5.0 -4.0 2.0 -1.0 0.0 2.0 -3.0
2016-12-31 23:00:00 1.0 3.0 4.0 4.0 6.0 12.0 12.0 12.0 12.0 12.0 ... -6.0 -1.0 -11.0 2.0 -3.0 -4.0 -2.0 -7.0 -2.0 -2.0
and I want to end up with a DataFrame in which the first column level is the year, with each year appearing only once and containing all of its columns. How can I achieve that?
Example:
year 2015 2016 2017 2018 2019
PATIENTS month PATIENTS month PATIENTS month PATIENTS month PATIENTS month ...
date
2016-01-01 00:00:00 0.0 2.0 1.0 7.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 -4.0 2.0 -2.0 NaN -3.0 -2.0 -3.0 -6.0
2016-01-01 01:00:00 6.0 6.0 7.0 6.0 7.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 0.0 0.0 1.0 NaN 3.0 1.0 2.0 -1.0
2016-01-01 02:00:00 2.0 7.0 6.0 2.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 3.0 -1.0 0.0 NaN 6.0 2.0 -3.0 0.0
2016-01-01 03:00:00 0.0 2.0 2.0 4.0 6.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 0.0 2.0 4.0 NaN -1.0 -2.0 3.0 3.0
2016-01-01 04:00:00 1.0 2.0 5.0 8.0 0.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 5.0 7.0 -1.0 NaN -2.0 3.0 5.0 -2.0
... ... ... ... ... ... ... ... ... ... ... .
I think you only need to sort your columns:
new_df = df.sort_index(axis=1, level=0)
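For reference, a minimal sketch with toy two-level columns (the year/metric names are assumed) showing the effect:

import numpy as np
import pandas as pd

# Columns arrive with the years interleaved, as in the question.
cols = pd.MultiIndex.from_tuples(
    [(2015, 'PATIENTS'), (2016, 'PATIENTS'), (2015, 'month'), (2016, 'month')],
    names=['year', None])
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=cols)

# Sorting on the first column level groups each year's columns together:
# (2015, 'PATIENTS'), (2015, 'month'), (2016, 'PATIENTS'), (2016, 'month')
new_df = df.sort_index(axis=1, level=0)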
I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is the dataframe I want to use to fill the NaN values:
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When you call fillna, you usually want to specify a fill strategy, such as filling with the previous/next value or the column mean. For example:
nulls_index = df['rain_gauge_value'].isnull()
df = df.ffill()  # forward-fill as an example; fillna(method='ffill') is deprecated
nulls_after_fill = df[nulls_index]  # inspect the rows that were NaN before the fill
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to inform pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
from datetime import date, time
import numpy as np
import pandas as pd

raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)], temp=[1.,np.nan], rain=[4.,np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
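Applied to the frames from the question, the same idea would look roughly like this (a sketch assuming the date and time columns of df and rain_df hold comparable values, so the index labels align):

# Align both frames on (date, time), fill df's holes from rain_df,
# then restore the original flat columns.
filled = (df.set_index(['date', 'time'])
            .fillna(rain_df.set_index(['date', 'time']))
            .reset_index())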
It is a bit hard to put my problem into words. I have a dataframe with positive and negative values.
2012-01-01 58.0
2012-06-01 8.0
2012-07-01 10.0
2013-01-01 50.0
2013-02-01 -6.0
2013-03-01 -8.0
2013-04-01 20.0
2013-07-01 3.0
2013-12-01 0.0
2014-02-01 88.0
2014-03-01 -40.0
I want to add each negative value to the previous positive row, repeating until no negatives are left.
For example, the final list should be: [58, 8, 10, 50+(-6-8), 20.0, 3.0, 0.0, 88+(-40)]
2012-01-01 58.0
2012-06-01 8.0
2012-07-01 10.0
2013-01-01 36.0
2013-04-01 20.0
2013-07-01 3.0
2013-12-01 0.0
2014-02-01 48.0
The dataframe is huge, so I would really prefer a pandas solution.
You can identify the negative blocks with cumsum, and use that for groupby:
(df.groupby(df['value'].ge(0).cumsum(), as_index=False)
.agg({'date':'first','value':'sum'})
)
Output:
date value
0 2012-01-01 58.0
1 2012-06-01 8.0
2 2012-07-01 10.0
3 2013-01-01 36.0
4 2013-04-01 20.0
5 2013-07-01 3.0
6 2013-12-01 0.0
7 2014-02-01 48.0
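To see why this works, inspect the grouping key on its own (assuming, as in the answer, that the columns are named date and value):

# ge(0) is True for non-negative rows, so cumsum increments the group id
# at every non-negative value; each run of negatives therefore lands in
# the same group as the positive row just before it.
key = df['value'].ge(0).cumsum()

For the sample values [58, 8, 10, 50, -6, -8, 20, 3, 0, 88, -40] the key is [1, 2, 3, 4, 4, 4, 5, 6, 7, 8, 8], which is exactly the grouping the desired output requires.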
As here, I need to calculate the mean of the columns duration and km for the rows with value == 1 and value == 0.
This time I would like that the aggregation period is flexible.
df
Out[20]:
Date duration km value
0 2015-03-28 09:07:00.800001 0 0 0
1 2015-03-28 09:36:01.819998 1 2 1
2 2015-03-30 09:36:06.839997 1 3 1
3 2015-03-30 09:37:27.659997 NaN 5 0
4 2015-04-22 09:51:40.440003 3 7 0
5 2015-04-23 10:15:25.080002 0 NaN 1
For the aggregation period of 1 day I can use the solution suggested before:
ndf = df.pivot_table(values=['duration','km'], columns=['value'], index=df['Date'].dt.date, aggfunc='mean')
ndf.columns = [i[0]+str(i[1]) for i in ndf.columns]
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
However, I do not know how to change the aggregation period when, for example, I want to pass it as an argument of a function...
For this reason, an approach with pd.Grouper(freq=freq_aggregation), where freq_aggregation is 'D' or '60s', would be preferred...
You can pass a Grouper to the index of the pivot table. Hope this is what you are looking for, i.e.:
ndf = df.pivot_table(values=['duration','km'],columns=['value'],index=pd.Grouper(key='Date', freq='60s'),aggfunc='mean')
ndf.columns = [i[0]+str(i[1]) for i in ndf.columns]
Output:
duration0 duration1 km0 km1
Date
2015-03-28 09:07:00 0.0 NaN 0.0 NaN
2015-03-28 09:36:00 NaN 1.0 NaN 2.0
2015-03-30 09:36:00 NaN 1.0 NaN 3.0
2015-03-30 09:37:00 NaN NaN 5.0 NaN
2015-04-22 09:51:00 3.0 NaN 7.0 NaN
2015-04-23 10:15:00 NaN 0.0 NaN NaN
If the frequency is 'D', then:
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
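Since the asker wanted to pass the period as a function argument, the pivot can be wrapped directly; a minimal sketch (the function name is made up):

def aggregate_by_period(df, freq):
    # Mean duration/km per value class, bucketed at the given frequency.
    ndf = df.pivot_table(values=['duration', 'km'], columns=['value'],
                         index=pd.Grouper(key='Date', freq=freq), aggfunc='mean')
    ndf.columns = [name + str(val) for name, val in ndf.columns]
    return ndf

# e.g. aggregate_by_period(df, 'D') or aggregate_by_period(df, '60s')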
Let's use pd.Grouper, unstack, and columns map:
freq_str = '60s'
df_out = df.groupby([pd.Grouper(freq=freq_str, key='Date'), 'value'])[['duration', 'km']].agg('mean').unstack()
df_out.columns = df_out.columns.map('{0[0]}{0[1]}'.format)
df_out
Output:
duration0 duration1 km0 km1
Date
2015-03-28 09:07:00 0.0 NaN 0.0 NaN
2015-03-28 09:36:00 NaN 1.0 NaN 2.0
2015-03-30 09:36:00 NaN 1.0 NaN 3.0
2015-03-30 09:37:00 NaN NaN 5.0 NaN
2015-04-22 09:51:00 3.0 NaN 7.0 NaN
2015-04-23 10:15:00 NaN 0.0 NaN NaN
Now, let's change freq_str to 'D':
freq_str = 'D'
df_out = df.groupby([pd.Grouper(freq=freq_str, key='Date'), 'value'])[['duration', 'km']].agg('mean').unstack()
df_out.columns = df_out.columns.map('{0[0]}{0[1]}'.format)
print(df_out)
Output:
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
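The columns.map call is the flattening step: each two-level column tuple is pushed through the format string. A tiny standalone demo with assumed toy tuples:

cols = pd.MultiIndex.from_tuples([('duration', 0), ('duration', 1), ('km', 0), ('km', 1)])
cols.map('{0[0]}{0[1]}'.format)
# Index(['duration0', 'duration1', 'km0', 'km1'], dtype='object')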
Use groupby:
df = df.set_index('Date')
# pd.TimeGrouper was deprecated and later removed; pd.Grouper(freq=...) is the modern equivalent
df.groupby([pd.Grouper(freq='D'), 'value']).mean()
duration km
Date value
2017-10-11 0 1.500000 4.0
1 0.666667 2.5
df.groupby([pd.Grouper(freq='60s'), 'value']).mean()
duration km
Date value
2017-10-11 09:07:00 0 0.0 0.0
2017-10-11 09:36:00 1 1.0 2.5
2017-10-11 09:37:00 0 NaN 5.0
2017-10-11 09:51:00 0 3.0 7.0
2017-10-11 10:15:00 1 0.0 NaN
If you want it unstacked, then unstack it:
df.groupby([pd.Grouper(freq='D'), 'value']).mean().unstack()
duration km
value 0 1 0 1
Date
2017-10-11 1.50 0.67 4.00 2.50
I have a dataframe which contains a Time Stamp column, and two data columns (data1 and data2).
The data1 column spans the entire Time Stamp, while the data2 column stops about halfway. When I was collecting my data, both data1 and data2 collected data for the same time, except at different frequencies.
I would like the data2 column to be spread across the entire Time Stamp range. I understand that I should be leaning towards the resample or reindex functions, but I am unsure how to do this. My Time Stamp column is an object, while my two data columns are float64 types.
What is the easiest way for me to accomplish this goal?
I have tried to refer to the following question, but I was having trouble implementing it:
PANDAS - Loop over two datetime indexes with different sizes to compare days and values
Here's what I think you're trying to do. My assumption is that your timestamps are aligned by some multiplier. I've used every 2 minutes in my example, since that's what your example appears to be. Here's my sample dataframe:
df
a b
DATE
2017-05-29 06:30:00 0.0 0.0
2017-05-29 06:31:00 9.0 24.0
2017-05-29 06:32:00 10.0 1.0
2017-05-29 06:33:00 10.0 1.0
2017-05-29 06:34:00 0.0 7.0
2017-05-29 06:35:00 3.0 3.0
2017-05-29 06:36:00 0.0 4.0
2017-05-29 06:37:00 0.0 1.0
2017-05-29 06:38:00 0.0 0.0
2017-05-29 06:39:00 0.0 2.0
2017-05-29 06:40:00 0.0 NaN
2017-05-29 06:41:00 0.0 NaN
2017-05-29 06:42:00 0.0 NaN
2017-05-29 06:43:00 0.0 NaN
2017-05-29 06:44:00 0.0 NaN
2017-05-29 06:45:00 2.0 NaN
2017-05-29 06:46:00 4.0 NaN
2017-05-29 06:47:00 0.0 NaN
2017-05-29 06:48:00 4.0 NaN
2017-05-29 06:49:00 8.0 NaN
Extract the misaligned column to its own dataframe and add a counter column, then add the timedelta to the index, replace the old index, and concatenate the data columns.
# Take the populated half of column b and number its rows.
b = df['b'][:10].to_frame()
b.insert(0, 'counter', range(len(b)))
# Shift each timestamp forward by its row number in minutes, turning the
# 1-minute spacing into the intended 2-minute spacing.
b.index = b.index.to_series().apply(lambda x: x + pd.Timedelta(minutes=b.loc[x].counter))
# Recombine with column a; the empty slots become NaN.
pd.concat([df['a'], b['b']], axis=1)
a b
DATE
2017-05-29 06:30:00 0.0 0.0
2017-05-29 06:31:00 9.0 NaN
2017-05-29 06:32:00 10.0 24.0
2017-05-29 06:33:00 10.0 NaN
2017-05-29 06:34:00 0.0 1.0
2017-05-29 06:35:00 3.0 NaN
2017-05-29 06:36:00 0.0 1.0
2017-05-29 06:37:00 0.0 NaN
2017-05-29 06:38:00 0.0 7.0
2017-05-29 06:39:00 0.0 NaN
2017-05-29 06:40:00 0.0 3.0
2017-05-29 06:41:00 0.0 NaN
2017-05-29 06:42:00 0.0 4.0
2017-05-29 06:43:00 0.0 NaN
2017-05-29 06:44:00 0.0 1.0
2017-05-29 06:45:00 2.0 NaN
2017-05-29 06:46:00 4.0 0.0
2017-05-29 06:47:00 0.0 NaN
2017-05-29 06:48:00 4.0 2.0
2017-05-29 06:49:00 8.0 NaN
It probably goes without saying, but it would be much better to apply correct timestamps to each of the columns when you ingest them.
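Since the asker suspected reindex was the right tool, the same reshaping can also be sketched without the counter column, assuming as above that the ten collected b values belong on every second timestamp:

b = df['b'].dropna()            # the collected values, currently packed at the top
b.index = df.index[::2]         # place each value on its intended 2-minute slot
df['b'] = b.reindex(df.index)   # realign to the full index; the gaps become NaN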