Selecting rows based on value closest to - python

I have a dataframe radiosondes which contains a lot of radiosonde data. Hundreds of radiosonde launches were performed, each with a unique timestamp, so the dataframe has a DatetimeIndex. What I want is a time series of the variables (temperature, pressure, etc.) at a certain pressure level. So basically every individual radiosonde should give me the values of the other variables at that pressure level. The problem is that the pressure values aren't on a homogeneous grid and are recorded to two decimals. Every radiosonde also has a different set of pressure values, because measurements were taken every second rather than at fixed pressures. What I did was the following:
x = radiosondes[(radiosondes['Press'] >= 500) & (radiosondes['Press'] <= 501)]
Now this line gave me roughly the right data, but not exactly, as you can see in the results below: some timestamps are included multiple times, because they have multiple measurements where the pressure was between 500 and 501 hPa.
Press GeopHgt Temp RH PO3 GPSHgt O3
datetime
2019-09-21 05:00:00 500.86 5263 237.4 79 NaN 5279.0 NaN
2019-09-21 05:00:00 500.49 5268 237.4 78 NaN 5285.0 NaN
2019-09-21 05:00:00 500.12 5273 237.3 76 NaN 5290.0 NaN
2019-09-22 04:00:00 500.64 5359 243.5 54 NaN 5369.0 NaN
2019-09-22 04:00:00 500.14 5368 243.4 54 NaN 5378.0 NaN
... ... ... ... .. ... ... ..
2020-10-01 11:00:00 500.68 5443 244.6 63 NaN 5460.0 NaN
2020-10-01 11:00:00 500.29 5449 244.6 63 NaN 5466.0 NaN
2020-10-01 14:00:00 500.92 5465 245.1 29 NaN 5485.0 NaN
2020-10-01 14:00:00 500.55 5469 245.1 29 NaN 5490.0 NaN
2020-10-01 14:00:00 500.16 5474 245.1 28 NaN 5496.0 NaN
So what I want is for every radiosonde to be included only once in the new time series. I would like to select the row where the pressure is closest to 500. The result would then be something like:
Press GeopHgt Temp RH PO3 GPSHgt O3
datetime
2019-09-21 05:00:00 500.12 5273 237.3 76 NaN 5290.0 NaN
2019-09-22 04:00:00 500.14 5368 243.4 54 NaN 5378.0 NaN
... ... ... ... .. ... ... ..
2020-10-01 11:00:00 500.29 5449 244.6 63 NaN 5466.0 NaN
2020-10-01 14:00:00 500.16 5474 245.1 28 NaN 5496.0 NaN
Hopefully it is clear what I meant here. Thanks very much in advance!

To achieve this you can do the following, if your dataframe is x.
Since you look for the pressure as close to 500 as possible, and your filter only keeps values between 500 and 501, the closest value is simply the minimum pressure per timestamp. Because the timestamps repeat in the index, reset them into a column first so that the row labels returned by idxmin are unique:
x = x.reset_index()
print(x.loc[x.groupby("datetime")["Press"].idxmin()].set_index("datetime"))
This keeps one row per datetime group, the one with the minimum pressure, i.e. the one closest to 500.
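If you would rather skip the pre-filtering step and pick the row closest to 500 hPa directly, here is a sketch (assuming the index is named datetime, as in the question):
# Work on a copy with a unique integer index, since the timestamps repeat
df = radiosondes.reset_index()
# For each sounding, locate the row whose pressure is nearest to 500 hPa
idx = (df['Press'] - 500).abs().groupby(df['datetime']).idxmin()
closest = df.loc[idx].set_index('datetime')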

After your initial manipulation, do:
x.sort_values('Press').reset_index().drop_duplicates('datetime').set_index('datetime').sort_index()
Sorting by pressure first means drop_duplicates keeps the smallest pressure per timestamp (the closest to 500, given your filter); the timestamps have to be reset into a column because drop_duplicates cannot look at the index, and the final sort_index restores chronological order.
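An equivalent sketch that leaves the DatetimeIndex in place, using Index.duplicated instead of reset_index:
# Keep the first (smallest-pressure) row for each repeated timestamp
x = x.sort_values('Press')
x = x[~x.index.duplicated(keep='first')].sort_index()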

Related

Using Pandas to filter 2 specific days of the year

I have a big CSV dataset and I wish to filter it with Pandas and save the result into a new CSV file.
The aim is to find all the records for the 1st and the 15th day of each month.
When I use the following code it works:
print (df[(df['data___date_time'].dt.day == 1)])
and the result appears as follows:
data___date_time NO2 SO2 PM10
26 2020-07-01 00:00:00 1.591616 0.287604 NaN
27 2020-07-01 01:00:00 1.486401 NaN NaN
28 2020-07-01 02:00:00 1.362056 NaN NaN
29 2020-07-01 03:00:00 1.295101 0.194399 NaN
30 2020-07-01 04:00:00 1.260667 0.362168 NaN
... ... ... ...
17054 2022-07-01 19:00:00 2.894369 2.077140 19.34
17055 2022-07-01 20:00:00 3.644265 1.656386 23.09
17056 2022-07-01 21:00:00 2.907760 1.291555 23.67
17057 2022-07-01 22:00:00 2.974715 1.318185 27.68
17058 2022-07-01 23:00:00 2.858022 1.169057 25.18
However, when I use the following code, nothing comes out:
print (df[(df['data___date_time'].dt.day == 1) & (df['data___date_time'].dt.day == 15)])
this just gave me:
Empty DataFrame
Columns: [data___date_time, NO2, SO2, PM10]
Index: []
Any idea what the problem could be?
There is a logical problem: the same row cannot have day 1 and day 15, so you need | for bitwise OR. If you need to test multiple values, it is simpler to use Series.isin:
df = pd.DataFrame({'data___date_time': pd.date_range('2000-01-01', periods=20)})
print (df[df['data___date_time'].dt.day.isin([1,15])])
data___date_time
0 2000-01-01
14 2000-01-15
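For reference, the bitwise-OR version of the original attempt needs parentheses around each comparison; a sketch that also saves the result to a new CSV file ('filtered.csv' is just a placeholder name):
mask = (df['data___date_time'].dt.day == 1) | (df['data___date_time'].dt.day == 15)
df[mask].to_csv('filtered.csv', index=False)  # writes the filtered records to a new file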

Calculating mean values yearly in a dataframe with a new value daily

I have this dataframe, which contains average temps for all the summer days:
DATE TAVG
0 1955-06-01 NaN
1 1955-06-02 NaN
2 1955-06-03 NaN
3 1955-06-04 NaN
4 1955-06-05 NaN
... ... ...
5805 2020-08-27 2.067854
5806 2020-08-28 3.267854
5807 2020-08-29 3.067854
5808 2020-08-30 1.567854
5809 2020-08-31 4.167854
And I want to calculate the mean value for each year so I can plot it. How could I do that?
If I understand correctly, you can try this:
df['DATE']=pd.to_datetime(df['DATE'])
df.groupby(df['DATE'].dt.year)['TAVG'].mean()
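To plot the yearly means, a minimal follow-up sketch (assuming matplotlib is installed; column names as in the question):
import matplotlib.pyplot as plt

yearly = df.groupby(df['DATE'].dt.year)['TAVG'].mean()
yearly.plot(marker='o')     # one point per year
plt.xlabel('Year')
plt.ylabel('Mean TAVG')
plt.show()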

How to use pd.interpolate to fill only the gaps with a single missing value

I have a time series data for air pollution with several missing gaps like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of several consecutive or non-consecutive NAs, and there are some helpful statistics computed in R, like:
Length of time series: 87648
Number of Missing Values:746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NA series): 1 (occurring 50 times)
Generally, if I use df.interpolate(limit=1),
gaps with more than one missing value get partially interpolated as well.
So I guess a better way to interpolate only the gaps with a single missing value is to identify each gap.
To do so, I grouped the gaps of different sizes using the following code:
cum = df.notna().cumsum()   # the count only increases on valid rows, so NAs repeat the previous value
cum[cum.duplicated()]       # duplicated counts therefore mark the missing rows
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How to get the index of each first missing value in each gap like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the index was lost.
Is there a better solution for this?
Or how can I interpolate case by case?
Can anyone help me?
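No answer is shown here, but one possible approach, sketched for a single column (PM2.5, as in the question), assuming the goal is to fill only the gaps of exactly one NA:
s = df['PM2.5']
# Each valid value starts a new group; the NAs that follow it share its label
group = s.notna().cumsum()
# True only for NA rows that sit in a gap of exactly one missing value
single = s.isna() & s.isna().groupby(group).transform('sum').eq(1)
# Interpolate everywhere, but keep the fill only for the single-NA gaps
df['PM2.5'] = s.mask(single, s.interpolate())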

Start time series data when there are no NaNs

I currently have some time series data to which I applied a rolling mean with a window of 17520.
Before, the head of my data looked like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
And now it looks like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 NaN ... NaN NaN
1 2006/01/01 01:00:00 NaN ... NaN NaN
2 2006/01/01 01:30:00 NaN ... NaN NaN
3 2006/01/01 02:00:00 NaN ... NaN NaN
4 2006/01/01 02:30:00 NaN ... NaN NaN
How can I get my data to begin only where there are no NaNs (while also making sure that the dates still match)?
You can try rolling with min_periods=1:
data['NSW DEMAND'] = data['NSW DEMAND'].rolling(17520, min_periods=1).mean()
Also try using a for loop; you do not need to write out the columns one by one:
yourcols = ['xxx', ..., 'xxx1']
for x in yourcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
Based on your comments:
for x in yourcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
then
data = data.dropna(subset=yourcols, thresh=1)
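Alternatively, if you prefer to keep the leading NaNs from the default rolling call and simply start the frame at the first fully valid row, a sketch (reusing the hypothetical yourcols list from above, and assuming a sorted index):
# Drop everything before the first row where all the rolled columns are valid
start = data[yourcols].dropna().index[0]
data = data.loc[start:]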

Choosing time from 2300-0000 for different days

So I'm having an issue with the 23:00-00:00 times on different days in Python.
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-10 23:00:00 NaN 0.207653 0.205911 0.202886
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
What I'm mainly looking for is to select the 00:00:00 hour, which is why I've applied df = df.reset_index().groupby(df.index.date).first().set_index('times'), but if a day's 00:00:00 doesn't exist, it should use the 23:00:00 of the previous day as the 00:00:00 of the next day. The following is wrong:
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
How do I get it to fall back to the 23:00:00 of the previous day when a day has no 00:00:00 row, to achieve this solution:
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
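No answer is included in this excerpt, but one possible sketch, assuming times is a DatetimeIndex and keeping the original timestamps in the output, as the desired result does:
import pandas as pd

# Candidate "midnight" rows for each calendar day: the day's own 00:00 row
# (priority 0) or the previous day's 23:00 row (priority 1)
zero = df[df.index.hour == 0].copy()
zero['day'] = zero.index.normalize()
zero['prio'] = 0

prev = df[df.index.hour == 23].copy()
prev['day'] = prev.index.normalize() + pd.Timedelta(days=1)  # stands in for the next day
prev['prio'] = 1

out = (pd.concat([zero, prev])
         .sort_values(['day', 'prio'])
         .drop_duplicates('day')       # keep the best candidate per day
         .sort_index()
         .drop(columns=['day', 'prio']))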
