I have this dataframe, which contains average temps for all the summer days:
DATE TAVG
0 1955-06-01 NaN
1 1955-06-02 NaN
2 1955-06-03 NaN
3 1955-06-04 NaN
4 1955-06-05 NaN
... ... ...
5805 2020-08-27 2.067854
5806 2020-08-28 3.267854
5807 2020-08-29 3.067854
5808 2020-08-30 1.567854
5809 2020-08-31 4.167854
I want to calculate the yearly mean value so I can plot it. How could I do that?
If I understand correctly, you can convert the column to datetime and group by year:
df['DATE']=pd.to_datetime(df['DATE'])
df.groupby(df['DATE'].dt.year)['TAVG'].mean()
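Since you want to plot the result, here is a minimal sketch of one way to do it (assuming matplotlib is installed; the conversion is repeated so the snippet is self-contained):
import pandas as pd
import matplotlib.pyplot as plt

df['DATE'] = pd.to_datetime(df['DATE'])
yearly_mean = df.groupby(df['DATE'].dt.year)['TAVG'].mean()  # one value per year; NaN days are skipped

yearly_mean.plot()  # year on the x-axis, mean TAVG on the y-axis
plt.xlabel('Year')
plt.ylabel('Mean TAVG')
plt.show()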
I have a dataframe radiosondes which contains a lot of radiosonde data. Hundreds of soundings have been made, each with a unique timestamp, so the dataframe has a DatetimeIndex. What I want is a time series of the variables (temperature, pressure, etc.) at a certain pressure level, so every individual sounding should give me the values of the other variables at that level. The problem is that the pressure values aren't on a homogeneous grid and are recorded to two decimals; every sounding also has a different set of pressure values, because measurements were taken every second rather than at fixed pressures. What I did was the following:
x = radiosondes[(radiosondes['Press'] >= 500) & (radiosondes['Press'] <= 501)]
This line gave me roughly the right data, but not exactly, as you can see in the results below: some timestamps are included multiple times, because they have multiple measurements where the pressure was between 500 and 501 hPa.
Press GeopHgt Temp RH PO3 GPSHgt O3
datetime
2019-09-21 05:00:00 500.86 5263 237.4 79 NaN 5279.0 NaN
2019-09-21 05:00:00 500.49 5268 237.4 78 NaN 5285.0 NaN
2019-09-21 05:00:00 500.12 5273 237.3 76 NaN 5290.0 NaN
2019-09-22 04:00:00 500.64 5359 243.5 54 NaN 5369.0 NaN
2019-09-22 04:00:00 500.14 5368 243.4 54 NaN 5378.0 NaN
... ... ... ... .. ... ... ..
2020-10-01 11:00:00 500.68 5443 244.6 63 NaN 5460.0 NaN
2020-10-01 11:00:00 500.29 5449 244.6 63 NaN 5466.0 NaN
2020-10-01 14:00:00 500.92 5465 245.1 29 NaN 5485.0 NaN
2020-10-01 14:00:00 500.55 5469 245.1 29 NaN 5490.0 NaN
2020-10-01 14:00:00 500.16 5474 245.1 28 NaN 5496.0 NaN
So what I want is that every sounding is included only once in the new time series; I would like to select the row where the pressure is closest to 500. The result would then be something like:
Press GeopHgt Temp RH PO3 GPSHgt O3
datetime
2019-09-21 05:00:00 500.12 5273 237.3 76 NaN 5290.0 NaN
2019-09-22 04:00:00 500.14 5368 243.4 54 NaN 5378.0 NaN
... ... ... ... .. ... ... ..
2020-10-01 11:00:00 500.29 5449 244.6 63 NaN 5466.0 NaN
2020-10-01 14:00:00 500.16 5474 245.1 28 NaN 5496.0 NaN
Hopefully it is clear what I meant here. Thanks very much in advance!
To achieve this you can do the following. If your dataframe is x, and given that the pressure closest to 500 within the 500-501 window is simply the minimum pressure in that window:
print(x.loc[x.groupby("datetime")["Press"].idxmin()])
This keeps one row per datetime group, the one with the minimum pressure, i.e. the one closest to 500.
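One caveat: since the timestamps here are the index and contain duplicates, .loc with the labels returned by idxmin will pull back every row sharing that label. A sketch that avoids this by resetting the index first (assuming the index is named datetime):
tmp = x.reset_index()  # unique integer row labels
result = (tmp.loc[tmp.groupby('datetime')['Press'].idxmin()]
             .set_index('datetime'))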
After your initial filtering step, do:
x = (x.sort_values('Press')
      .reset_index()
      .drop_duplicates('datetime')
      .set_index('datetime')
      .sort_index())
Sorting by Press first means drop_duplicates keeps, for each timestamp, the row closest to 500; the final sort_index() restores chronological order.
Working with daily [Date, Open, High, Low, Close] stock data, I am trying to understand a good way to structure the logic when backtesting multiple conditions. For example:
#Signal:
Today's Close > Today's Open AND
Yesterday's Close > Yesterday's Open AND
Today's Close >= Today's High - 10%
#Position:
If ALL of the signal conditions above are true, then "Buy" tomorrow at (today's High + 5%) and "Sell" at the Close of that day.
To take the position, I would have to check that the "Buy" condition was satisfied on the 'tomorrow' bar.
#Calculate Return:
If Position taken, calculate profit or loss for the day
I've seen sample algorithms, but many examples are just basic moving-average crossover systems (one condition), which are very simple to do with a vectorized approach.
When you have multiple conditions as above, can someone show me a good way to code this?
Assuming your data is sorted by date and indexed sequentially (0, 1, 2, ...), try this:
# Boolean Series, one entry per row of df
cond1 = df['Close'] > df['Open']                  # today closed up
cond2 = df['Close'].shift() > df['Open'].shift()  # yesterday closed up
cond3 = df['Close'] >= (df['High'] * 0.9)         # close within 10% of the high
signal = df[cond1 & cond2 & cond3]
# Write the orders onto the next row (signal.index + 1)
df.loc[signal.index + 1, 'BuyAt'] = (signal['High'] * 1.05).values
df.loc[signal.index + 1, 'SellAt'] = df.loc[signal.index + 1, 'Close']
df['PnL'] = df['SellAt'] - df['BuyAt']
Result (from MSFT stock price courtesy of Yahoo Finance):
Date Open High Low Close BuyAt SellAt PnL
0 2019-01-02 99.550003 101.750000 98.940002 101.120003 NaN NaN NaN
1 2019-01-03 100.099998 100.190002 97.199997 97.400002 NaN NaN NaN
2 2019-01-04 99.720001 102.510002 98.930000 101.930000 NaN NaN NaN
3 2019-01-07 101.639999 103.269997 100.980003 102.059998 NaN NaN NaN
4 2019-01-08 103.040001 103.970001 101.709999 102.800003 108.433497 102.800003 -5.633494
5 2019-01-09 103.860001 104.879997 103.239998 104.269997 NaN NaN NaN
6 2019-01-10 103.220001 103.750000 102.379997 103.599998 NaN NaN NaN
7 2019-01-11 103.190002 103.440002 101.639999 102.800003 108.937500 102.800003 -6.137497
8 2019-01-14 101.900002 102.870003 101.260002 102.050003 NaN NaN NaN
9 2019-01-15 102.510002 105.050003 101.879997 105.010002 NaN NaN NaN
10 2019-01-16 105.260002 106.260002 104.959999 105.379997 110.302503 105.379997 -4.922506
11 2019-01-17 105.000000 106.629997 104.760002 106.120003 111.573002 106.120003 -5.452999
12 2019-01-18 107.459999 107.900002 105.910004 107.709999 111.961497 107.709999 -4.251498
13 2019-01-22 106.750000 107.099998 104.860001 105.680000 113.295002 105.680000 -7.615002
14 2019-01-23 106.120003 107.040001 105.339996 106.709999 NaN NaN NaN
15 2019-01-24 106.860001 107.000000 105.339996 106.199997 NaN NaN NaN
16 2019-01-25 107.239998 107.879997 106.199997 107.169998 NaN NaN NaN
17 2019-01-28 106.260002 106.480003 104.660004 105.080002 NaN NaN NaN
18 2019-01-29 104.879997 104.970001 102.169998 102.940002 NaN NaN NaN
19 2019-01-30 104.620003 106.379997 104.330002 106.379997 NaN NaN NaN
It seems like a losing strategy to me!
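As the question itself points out, a real backtest should also check that the buy order was actually filled on the 'tomorrow' bar, i.e. that the day's High reached the buy price. A minimal sketch of that check, reusing the columns created above:
# The order only fills if that day's High trades up to BuyAt
filled = df['High'] >= df['BuyAt']  # rows with NaN BuyAt compare as False
df['PnL'] = (df['SellAt'] - df['BuyAt']).where(filled)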
I have calculated a 15-minute moving average from 10-second recorded data. Now I want to merge two time series (the 15-minute averages and the 15-minute moving averages) from different files into a new file, matched on the nearest timestamp.
The 15-minute moving average data is shown below. Because it is a moving average, the first few rows are NaN:
RecTime NO2_RAW NO2 Ox_RAW Ox CO_RAW CO SO2_RAW SO2
2019-06-03 00:00:08 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:18 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:28 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:38 NaN NaN NaN NaN NaN NaN NaN NaN
The 15-minute average data is shown below:
Site Species ReadingDateTime Value Units Provisional or Ratified
0 CR9 NO2 2019-03-06 00:00:00 8.2 ug m-3 P
1 CR9 NO2 2019-03-06 00:15:00 7.6 ug m-3 P
2 CR9 NO2 2019-03-06 00:30:00 5.9 ug m-3 P
3 CR9 NO2 2019-03-06 00:45:00 5.1 ug m-3 P
4 CR9 NO2 2019-03-06 01:00:00 5.2 ug m-3 P
I want a table like this:
ReadingDateTime Value NO2_RAW NO2
2019-06-03 00:00:00
2019-06-03 00:15:00
2019-06-03 00:30:00
2019-06-03 00:45:00
2019-06-03 01:00:00
I tried to match the two dataframes on the nearest time:
df3 = pd.merge_asof(df1, df2, left_on = 'RecTime', right_on = 'ReadingDateTime', tolerance=pd.Timedelta('59s'), allow_exact_matches=False)
I got a new dataframe:
RecTime NO2_RAW NO2 Ox_RAW Ox CO_RAW CO SO2_RAW SO2 Site Species ReadingDateTime Value Units Provisional or Ratified
0 2019-06-03 00:14:58 1.271111 21.557111 65.188889 170.011111 152.944444 294.478000 -124.600000 -50.129444 NaN NaN NaT NaN NaN NaN
1 2019-06-03 00:15:08 1.294444 21.601778 65.161111 169.955667 152.844444 294.361556 -124.595556 -50.117556 NaN NaN NaT NaN NaN NaN
2 2019-06-03 00:15:18 1.318889 21.648556 65.104444 169.842556 152.750000 294.251556 -124.593333 -50.111667 NaN NaN NaT NaN NaN NaN
But the values of df2 became NaN. Can someone please help?
Assuming the minutes are correct, you could remove the seconds from RecTime, and then you would be able to merge:
df1['RecTime'] = df1['RecTime'].map(lambda x: x.replace(second=0))
You could either create a new column or replace the existing one before merging. Also note that the two frames show different dates (2019-06-03 vs 2019-03-06), which looks like a day/month parsing mix-up; check the dayfirst argument when parsing ReadingDateTime.
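A minimal end-to-end sketch of that approach (assuming df1 holds the 10-second moving averages, df2 the 15-minute averages, and both time columns are already datetime64):
import pandas as pd

# Align the 10-second timestamps to whole minutes
df1['RecTime'] = df1['RecTime'].dt.floor('min')

# Nearest-timestamp merge; merge_asof requires both frames sorted on their keys
df3 = pd.merge_asof(df1.sort_values('RecTime'),
                    df2.sort_values('ReadingDateTime'),
                    left_on='RecTime', right_on='ReadingDateTime',
                    direction='nearest',
                    tolerance=pd.Timedelta('59s'))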
I have a problem filtering data on a column, so I have a question about it.
My df looks like this:
TempHigh TempLow City
Date
2017-01-01 25 15 A
2017-01-02 23 14 A
2017-01-03 29 10 A
2017-01-01 22 13 B
2017-01-02 21 12 B
2017-01-03 12 11 B
How can I run df.describe() only for City A (not df['City'].describe())?
How can I make separate plots for City A and City B, and another plot comparing both cities on a single kind='line' plot?
Also, how can I make histogram subplots for City A and a plot for City B?
I tried the code below, but it gives me all columns and I want only one of them. And how can I put the City A and City B histograms in one figure?
df.groupby('City').hist()
Thanks in advance!
You need to read the basic documentation about Indexing and Selecting Data:
>>> df[df['City']=='A'].describe()
TempHigh TempLow
count 3.000000 3.000000
mean 25.666667 13.000000
std 3.055050 2.645751
min 23.000000 10.000000
25% 24.000000 12.000000
50% 25.000000 14.000000
75% 27.000000 14.500000
max 29.000000 15.000000
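For the plotting part of the question, a minimal sketch (assuming matplotlib is installed and df is the frame shown above):
import matplotlib.pyplot as plt

a = df[df['City'] == 'A']
b = df[df['City'] == 'B']

# Separate line plot per city
a[['TempHigh', 'TempLow']].plot(title='City A')
b[['TempHigh', 'TempLow']].plot(title='City B')

# Both cities compared on one line plot
fig, ax = plt.subplots()
a['TempHigh'].plot(ax=ax, kind='line', label='A')
b['TempHigh'].plot(ax=ax, kind='line', label='B')
ax.legend()

# Both cities' histograms in one figure
fig, ax = plt.subplots()
a['TempHigh'].hist(ax=ax, alpha=0.5, label='A')
b['TempHigh'].hist(ax=ax, alpha=0.5, label='B')
ax.legend()
plt.show()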
I have a data frame that looks like this, with monthly data points:
Date Value
1 2010-01-01 18.45
2 2010-02-01 18.13
3 2010-03-01 18.25
4 2010-04-01 17.92
5 2010-05-01 18.85
I want to make it daily data and fill the resulting new dates with the current month's value. For example:
Date Value
1 2010-01-01 18.45
2 2010-01-02 18.45
3 2010-01-03 18.45
4 2010-01-04 18.45
5 2010-01-05 18.45
....
This is the code I'm using to add the interim dates and fill the values:
today = get_datetime('US/Eastern') #.strftime('%Y-%m-%d')
enddate='1881-01-01'
idx = pd.date_range(enddate, today.strftime('%Y-%m-%d'), freq='D')
df = df.reindex(idx)
df = df.fillna(method = 'ffill')
The output is as follows:
Date Value
2010-01-01 00:00:00 NaN NaN
2010-01-02 00:00:00 NaN NaN
2010-01-03 00:00:00 NaN NaN
2010-01-04 00:00:00 NaN NaN
2010-01-05 00:00:00 NaN NaN
The logs show that the NaN values appear just before the .fillna method is invoked. So the forward fill is not the culprit.
Any ideas why this is happening?
option 1
safest approach, very general:
up-sample to daily, then group monthly with a transform.
The reason this is important is that your data point may not fall on the first of the month. If you want to ensure that that day's value gets broadcast to every other day in the month, do this:
df.set_index('Date').asfreq('D') \
    .groupby(pd.Grouper(freq='M')).Value \
    .transform('first').reset_index()
option 2
asfreq
df.set_index('Date').asfreq('D').ffill().reset_index()
option 3
resample
df.set_index('Date').resample('D').first().ffill().reset_index()
For pandas 0.16.1:
df.set_index('Date').resample('D').ffill().reset_index()
All produce the same result over this sample data set.
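For reference, a runnable setup of the sample frame from the question, to try the options against:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-01', '2010-02-01', '2010-03-01',
                            '2010-04-01', '2010-05-01']),
    'Value': [18.45, 18.13, 18.25, 17.92, 18.85],
})

daily = df.set_index('Date').resample('D').first().ffill().reset_index()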
You need to give the original dataframe a DatetimeIndex before calling reindex. Reindexing a frame whose existing index doesn't contain the new date labels produces all-NaN rows, which is why the NaNs appear before fillna is ever reached. For example:
import numpy as np
import pandas as pd

test = pd.DataFrame(np.random.randn(4),
                    index=pd.date_range('2017-01-01', '2017-01-04'),
                    columns=['test'])
test.reindex(pd.date_range('2017-01-01', '2017-01-05'), method='ffill')