I have a dataframe radiosondes which contains a lot of radiosonde data. There are hundreds of radiosonde launches, each with a unique timestamp, so the dataframe has a DatetimeIndex. What I want is a timeseries of the variables (temperature, pressure etc.) at a certain pressure level. So basically every individual radiosonde should give me the values of the other variables at that pressure level. The problem is that the pressure values aren't on a homogeneous grid and are recorded with 2 decimals. Also, every radiosonde samples different pressures, because measurements were taken every second rather than at fixed pressure levels. What I did was the following:
x = radiosondes[(radiosondes['Press'] >= 500) & (radiosondes['Press'] <= 501)]
This line gave me roughly the right data, but not exactly, as you can see in the results below: some timestamps are included multiple times, because they have multiple measurements where the pressure was between 500 and 501 hPa.
Press GeopHgt Temp RH PO3 GPSHgt O3
datetime
2019-09-21 05:00:00 500.86 5263 237.4 79 NaN 5279.0 NaN
2019-09-21 05:00:00 500.49 5268 237.4 78 NaN 5285.0 NaN
2019-09-21 05:00:00 500.12 5273 237.3 76 NaN 5290.0 NaN
2019-09-22 04:00:00 500.64 5359 243.5 54 NaN 5369.0 NaN
2019-09-22 04:00:00 500.14 5368 243.4 54 NaN 5378.0 NaN
... ... ... ... .. ... ... ..
2020-10-01 11:00:00 500.68 5443 244.6 63 NaN 5460.0 NaN
2020-10-01 11:00:00 500.29 5449 244.6 63 NaN 5466.0 NaN
2020-10-01 14:00:00 500.92 5465 245.1 29 NaN 5485.0 NaN
2020-10-01 14:00:00 500.55 5469 245.1 29 NaN 5490.0 NaN
2020-10-01 14:00:00 500.16 5474 245.1 28 NaN 5496.0 NaN
So what I want is that every radiosonde is included only once in the new timeseries. I would like to select the row where the pressure is closest to 500. The result would then be something like:
Press GeopHgt Temp RH PO3 GPSHgt O3
datetime
2019-09-21 05:00:00 500.12 5273 237.3 76 NaN 5290.0 NaN
2019-09-22 04:00:00 500.14 5368 243.4 54 NaN 5378.0 NaN
... ... ... ... .. ... ... ..
2020-10-01 11:00:00 500.29 5449 244.6 63 NaN 5466.0 NaN
2020-10-01 14:00:00 500.16 5474 245.1 28 NaN 5496.0 NaN
Hopefully it is clear what I meant here. Thanks very much in advance!
To achieve this you can do the following. If your dataframe is x, and given that you want the pressure as close to 500 as possible, that is the minimum pressure in the 500-501 window. Since the timestamps form a (duplicated) index, reset it first so that idxmin returns unique row labels:
x = x.reset_index()
print(x.loc[x.groupby('datetime')['Press'].idxmin()].set_index('datetime'))
This keeps one row per datetime group, the one with the minimum pressure, i.e. the one closest to 500.
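A minimal runnable sketch with made-up soundings (the values echo the question's output; 'datetime' is assumed to be the index name, as in the question):
import pandas as pd

x = pd.DataFrame(
    {'Press': [500.86, 500.49, 500.12, 500.64, 500.14],
     'Temp': [237.4, 237.4, 237.3, 243.5, 243.4]},
    index=pd.DatetimeIndex(['2019-09-21 05:00'] * 3 + ['2019-09-22 04:00'] * 2,
                           name='datetime'),
)
flat = x.reset_index()                             # unique row labels for idxmin
keep = flat.groupby('datetime')['Press'].idxmin()  # one row position per sounding
print(flat.loc[keep].set_index('datetime'))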
After your initial filtering step, sort by pressure and keep the first row per timestamp; since the timestamps live in the index, use index.duplicated rather than drop_duplicates on a column:
x = x.sort_values('Press')
x = x[~x.index.duplicated(keep='first')]
Afterwards, you might want to re-sort your dataframe by timestamp, which is trivial with x.sort_index().
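Applied to the toy frame x from the sketch above, this yields the same one-row-per-sounding result:
dedup = x.sort_values('Press')
dedup = dedup[~dedup.index.duplicated(keep='first')].sort_index()
print(dedup)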
I have a big CSV dataset and I wish to filter it with Pandas and save the result to a new CSV file.
The aim is to find all the records for the 1st and 15th day of each month.
When I used the following code, it worked:
print (df[(df['data___date_time'].dt.day == 1)])
and result appear as follow:
data___date_time NO2 SO2 PM10
26 2020-07-01 00:00:00 1.591616 0.287604 NaN
27 2020-07-01 01:00:00 1.486401 NaN NaN
28 2020-07-01 02:00:00 1.362056 NaN NaN
29 2020-07-01 03:00:00 1.295101 0.194399 NaN
30 2020-07-01 04:00:00 1.260667 0.362168 NaN
... ... ... ...
17054 2022-07-01 19:00:00 2.894369 2.077140 19.34
17055 2022-07-01 20:00:00 3.644265 1.656386 23.09
17056 2022-07-01 21:00:00 2.907760 1.291555 23.67
17057 2022-07-01 22:00:00 2.974715 1.318185 27.68
17058 2022-07-01 23:00:00 2.858022 1.169057 25.18
However, when I used the following code, nothing came out:
print (df[(df['data___date_time'].dt.day == 1) & (df['data___date_time'].dt.day == 15)])
this just gave me:
Empty DataFrame
Columns: [data___date_time, NO2, SO2, PM10]
Index: []
Does anyone have an idea what the problem could be?
There is a logical problem: the same row cannot have day 1 and day 15 at once, so you need | for bitwise OR. If you need to test multiple values, it is simpler to use Series.isin:
df = pd.DataFrame({'data___date_time': pd.date_range('2000-01-01', periods=20)})
print (df[df['data___date_time'].dt.day.isin([1,15])])
data___date_time
0 2000-01-01
14 2000-01-15
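For comparison, the same filter written with bitwise OR, as mentioned above:
print (df[(df['data___date_time'].dt.day == 1) | (df['data___date_time'].dt.day == 15)])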
I have this dataframe, which contains average temps for all the summer days:
DATE TAVG
0 1955-06-01 NaN
1 1955-06-02 NaN
2 1955-06-03 NaN
3 1955-06-04 NaN
4 1955-06-05 NaN
... ... ...
5805 2020-08-27 2.067854
5806 2020-08-28 3.267854
5807 2020-08-29 3.067854
5808 2020-08-30 1.567854
5809 2020-08-31 4.167854
And I want to calculate the yearly mean value so I can plot it. How could I do that?
If I understand correctly, you can try this:
df['DATE'] = pd.to_datetime(df['DATE'])
df.groupby(df['DATE'].dt.year)['TAVG'].mean()
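Since the goal is a plot, here is a self-contained sketch; the frame is made up to mirror the question's shape (the 1955 values in the real data are NaN and are simply ignored by mean):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'DATE': ['2019-06-01', '2019-06-02', '2020-08-30', '2020-08-31'],
    'TAVG': [15.0, 17.0, 1.567854, 4.167854],
})
df['DATE'] = pd.to_datetime(df['DATE'])
yearly = df.groupby(df['DATE'].dt.year)['TAVG'].mean()  # one value per year
yearly.plot(marker='o')
plt.xlabel('Year')
plt.ylabel('Mean TAVG')
plt.show()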
I have a time series data for air pollution with several missing gaps like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To be specific, here's the data link: shorturl.at/blBN1
The gaps consist of runs of one or more consecutive NAs, and there are some helpful statistics computed with R:
Length of time series: 87648
Number of Missing Values:746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NA series): 1 (occurring 50 times)
Generally, if I use df.interpolate(limit=1), gaps with more than one missing value get partially interpolated as well. So I guess a better way to interpolate only the gaps with a single missing value is to first get each gap's id. To do so, I grouped the gaps by size using the following:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How do I get the index of the first missing value in each gap, together with the gap size, like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the timestamp index was lost.
Is there a better solution for this, or a way to interpolate case by case? Can anyone help me?
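One possible approach, sketched on a made-up hourly PM2.5 series (not the linked data): label each run of consecutive NaNs, then keep the first NaN of each run together with the run length.
import numpy as np
import pandas as pd

idx = pd.date_range('2019-01-09 10:00', periods=12, freq='h')
s = pd.Series([1.0, 2, 3, np.nan, 5, np.nan, np.nan, 8, 9, np.nan, 11, 12],
              index=idx, name='PM2.5')

na = s.isna()
gap_id = (na != na.shift()).cumsum()[na]             # one label per run of NaNs
gap_size = gap_id.groupby(gap_id).transform('size')  # gap length at every NaN position
first_in_gap = gap_size[~gap_id.duplicated()]        # first NaN of each gap, index kept
print(first_in_gap)                                  # index = timestamp, value = gap size

# interpolate case by case: fill only the gaps of size 1
mask = na & gap_size.reindex(s.index).eq(1)
print(s.where(~mask, s.interpolate()))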
I currently have some time series data that I applied a rolling mean to with a window of 17520.
Before, the head of my data looked like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
And now it looks like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 NaN ... NaN NaN
1 2006/01/01 01:00:00 NaN ... NaN NaN
2 2006/01/01 01:30:00 NaN ... NaN NaN
3 2006/01/01 02:00:00 NaN ... NaN NaN
4 2006/01/01 02:30:00 NaN ... NaN NaN
How can I get my data to begin only where there is no NaN (while also making sure the dates still match up)?
You can try rolling with min_periods=1; your current call is effectively min_periods=17520 (the default is min_periods=window), which is what produces the leading NaNs:
data['NSW DEMAND'] = data['NSW DEMAND'].rolling(17520, min_periods=1).mean()
Also, try using a for loop; you do not need to write out the columns one by one:
youcols = ['NSW DEMAND']  # list every column you want to smooth
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
Based on your comments:
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
then:
data = data.dropna(subset=youcols, thresh=1)
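To see the difference between the two options, here is a toy illustration with window=3 standing in for 17520 (column name taken from the question, data made up):
import numpy as np
import pandas as pd

data = pd.DataFrame({'NSW DEMAND': np.arange(1.0, 7.0)})
print(data['NSW DEMAND'].rolling(3, min_periods=1).mean())  # no leading NaNs
print(data['NSW DEMAND'].rolling(3).mean().dropna())        # leading NaNs dropped instead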
So I'm having an issue with the 23:00-00:00 times on different days in Python.
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-10 23:00:00 NaN 0.207653 0.205911 0.202886
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
What I'm looking for is mainly to select the 00:00:00 hour, which is why I applied df = df.reset_index().groupby(df.index.date).first().set_index('times'). But if that hour doesn't exist, the 23:00:00 row of the previous day should be used as the 00:00:00 of the next day. The following result is therefore wrong:
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
How do I get it to fall back to the 23:00:00 of the previous day when the 00:00:00 of the next day is missing, to achieve this result:
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
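One way to sketch that fallback, built on the question's sample index and its B column (the logic keeps a 23:00 row only when the following day has no 00:00 row):
import pandas as pd

times = pd.to_datetime([
    '2003-01-08 00:00', '2003-01-08 23:00', '2003-01-09 23:00',
    '2003-01-10 00:00', '2003-01-10 23:00', '2003-01-11 00:00',
])
df = pd.DataFrame({'B': [0.086215, 0.060930, 0.102374,
                         0.078799, 0.207653, 0.203436]},
                  index=pd.Index(times, name='times'))

midnight = df[df.index.hour == 0]
late = df[df.index.hour == 23]

have_midnight = set(midnight.index.normalize())
next_day = late.index.normalize() + pd.Timedelta(days=1)
fallback = late[~next_day.isin(have_midnight)]  # 23:00 rows standing in for a missing midnight

print(pd.concat([midnight, fallback]).sort_index())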