I am working on plotting a 20 year climatology and have had issues with averaging.
My data is hourly data since December 1999 in CSV format. I used an API to get the data and currently have it in a pandas data frame. I was able to split up hours, days, etc like this:
dfROVC1['Month'] = dfROVC1['time'].apply(lambda cell: int(cell[5:7]))
dfROVC1['Day'] = dfROVC1['time'].apply(lambda cell: int(cell[8:10]))
dfROVC1['Year'] = dfROVC1['time'].apply(lambda cell: int(cell[0:4]))
dfROVC1['Hour'] = dfROVC1['time'].apply(lambda cell: int(cell[11:13]))
So I averaged all the days using:
z=dfROVC1.groupby([dfROVC1.index.day,dfROVC1.index.month]).mean()
That worked, but I realized I should take the average of the mins and average of the maxes of all my data. I have been having a hard time figuring all of this out.
I want my plot to look like this:
Monthly Average Section
but I can't figure out how to make it work.
I am currently using Jupyter Notebook with Python 3.
Any help would be appreciated.
Is there a reason you didn't just use datetime to convert your time column?
The minimums by month would be:
z=dfROVC1.groupby(['Year','Month']).min()
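A minimal sketch of the "average of the mins and average of the maxes" idea, assuming the time strings are ISO-like and the measured variable lives in a column called `temp` (both the column name and the synthetic data below are assumptions, not the asker's actual data). It parses the timestamps with `pd.to_datetime`, takes the daily min and max, then averages those by calendar month across all years:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dfROVC1: two years of hourly values (assumed column names).
times = pd.date_range("2000-01-01", periods=2 * 365 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    "time": times.strftime("%Y-%m-%d %H:%M:%S"),
    "temp": np.random.default_rng(0).normal(15.0, 5.0, len(times)),
})

# Parse the strings once instead of slicing them character by character.
df["time"] = pd.to_datetime(df["time"])
df = df.set_index("time")

# Daily minimum and maximum, then the mean of each by calendar month over all years.
daily = df["temp"].resample("D").agg(["min", "max"])
climatology = daily.groupby(daily.index.month).mean()
climatology.index.name = "month"
print(climatology)
```

Grouping by `daily.index.dayofyear` instead of `daily.index.month` would give a day-by-day climatology rather than a monthly one.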
I am trying to calculate the 30 year temperature normal (1981-2010 average) for the NARR daily gridded data set linked below.
In the end for each grid point I want an array that contains 365 values, each of which contains the average temperature of that day calculated from the 30 years of data for that day. For example the first value in each grid point's array would be the average Jan 1 temperature calculated from the 30 years (1981-2010) of Jan 1 temperature data for that grid point. My end goal is to be able to use this new 30yrNormal array to calculate daily temperature anomalies from.
So far I have only been able to calculate anomalies from one year's worth of data. The problem with this is that it takes the difference between the daily temperature and the average for the whole year, rather than the difference between the daily temperature and the 30 year average of that daily temperature:
from netCDF4 import Dataset
import numpy as np
file='air.sfc.2018.nc'
ncin = Dataset(file,'r')
#put data into numpy arrays
lons=ncin.variables['lon'][:]
lats=ncin.variables['lat'][:]
lats1=ncin.variables['lat'][:,0]
temp=ncin.variables['air'][:]
ncin.close()
AvgT=np.mean(temp[:,:,:],axis=0)
#compute anomalies by removing time-mean
T_anom=temp-AvgT
Data:
ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/
For the years 1981-2010
This is most easily solved using CDO.
You can use my package, nctoolkit (https://nctoolkit.readthedocs.io/en/latest/ & https://pypi.org/project/nctoolkit/) if you are working with Python on Linux. This uses CDO as a backend.
Assuming that the 30 files are in a list called ff_list, the code below should work.
First you would create the 30 year daily mean climatology.
import nctoolkit as nc
mean_30 = nc.open_data(ff_list)
mean_30.merge_time()
mean_30.drop(month=2,day=29)
mean_30.tmean("day")
mean_30.run()
Then you would subtract this from the daily figures to get the anomalies.
anom_30 = nc.open_data(ff_list)
anom_30.cdo_command("del29feb")
anom_30.subtract(mean_30)
anom_30.run()
This should have the anomalies.
One issue is whether the files include leap years and, if they do, how you want to handle them. CDO has an undocumented command, -del29feb, which I have used above.
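For readers without CDO, the arithmetic behind the climatology and the anomalies can be sketched in plain NumPy. This is not the nctoolkit method above, only an illustration under stated assumptions: the data are synthetic, Feb 29 has already been dropped, and the array is laid out as (year, day of year, lat, lon).

```python
import numpy as np

# Synthetic stand-in for 30 years of daily gridded temperature (assumed layout:
# year, day-of-year, lat, lon), with Feb 29 already removed so every year has 365 days.
n_years, n_days, n_lat, n_lon = 30, 365, 4, 5
rng = np.random.default_rng(0)
temp = 280.0 + rng.normal(0.0, 3.0, size=(n_years, n_days, n_lat, n_lon))

# 30-year normal: average each calendar day over the year axis.
normal_30yr = temp.mean(axis=0)  # shape (365, n_lat, n_lon)

# Daily anomalies: broadcasting subtracts the matching day's normal from every year.
anom = temp - normal_30yr  # shape (30, 365, n_lat, n_lon)
```

With real files, each year's `air` array would be stacked along the first axis after removing Feb 29 from the leap years.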
I have a data set as follows:
[Time of notification], [Station], [Category]
2019-02-04 19.36:22, Location A, Alert
2019-02-04 20.06:35, Location B, Request
2019-02-05 07.04:53, Location A, Incident
Time of notification is in datetime64[ns] format. The time span is one year.
I am trying to get the following line graphs:
One per station
Time on x axis. Preferably: Accumulated for days of the week and hours (e.g. all Mondays, Tuesdays etc together, so that a daily/weekly trend over the whole year becomes visible).
Number of notifications (for that station) on the y axis. Category is irrelevant.
I have tried a lot, but I am new to time series and to visualization, and I am getting nowhere after hours of trying. I have been trying with plt.subplots, value_counts etcetera. Also tried making this graph for one station first, but even that didn't work out.
Can anyone help?
Thank you!
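One possible starting point, sketched with made-up rows: count notifications per station, weekday and hour, then reshape so that each station is a column. The column names follow the sample above; the data values and the names `counts` and `per_station` are assumptions for illustration.

```python
import pandas as pd

# Made-up sample rows in the shape described in the question.
df = pd.DataFrame({
    "Time of notification": pd.to_datetime([
        "2019-02-04 19:36:22",
        "2019-02-04 20:06:35",
        "2019-02-05 07:04:53",
        "2019-02-11 19:10:00",
    ]),
    "Station": ["Location A", "Location B", "Location A", "Location A"],
    "Category": ["Alert", "Request", "Incident", "Alert"],
})

t = df["Time of notification"]

# Count rows per station, weekday (0 = Monday) and hour; Category is ignored.
counts = (
    df.groupby(["Station", t.dt.dayofweek.rename("weekday"), t.dt.hour.rename("hour")])
    .size()
    .rename("notifications")
    .reset_index()
)

# One column per station, indexed by (weekday, hour); missing combinations become 0.
per_station = counts.pivot_table(
    index=["weekday", "hour"],
    columns="Station",
    values="notifications",
    aggfunc="sum",
    fill_value=0,
)
print(per_station)
```

Calling `per_station.plot(subplots=True)` would then draw one line graph per station, provided matplotlib is installed.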
I read several data frames from Kafka topics using PySpark Structured Streaming 2.4.4. I would like to add some new columns to those data frames, mainly based on window calculations over the past N data points (for instance, a moving average over the last 20 data points). As each new data point is delivered, the corresponding value of MA_20 should be calculated immediately.
Data may look like this:
Timestamp | VIX
2020-01-22 10:20:32 | 13.05
2020-01-22 10:25:31 | 14.35
2020-01-23 09:00:20 | 14.12
It is worth mentioning that data will be received from Monday to Friday, over an 8 hour period each day.
Thus a moving average calculated on Monday morning should include data from Friday!
I tried different approaches, but I am still not able to achieve what I want.
windows = df_vix \
.withWatermark("Timestamp", "100 minutes") \
.groupBy(F.window("Timestamp", "100 minute", "5 minute"))
aggregatedDF = windows.agg(F.avg("VIX"))
The preceding code calculates the MA, but it treats data from Friday as late, so it is excluded. Rather than the last 100 minutes, the window should cover the last 20 points (at 5 minute intervals).
I thought that I could use rowsBetween or rangeBetween, but in streaming data frames a window cannot be applied over non-timestamp columns (F.col('Timestamp').cast('long'))
w = Window.orderBy(F.col('Timestamp').cast('long')).rowsBetween(-600, 0)
df = df_vix.withColumn('MA_20', F.avg('VIX').over(w)
)
On the other hand, there is no way to specify an interval within rowsBetween(): using rowsBetween(- minutes(20), 0) throws "minutes are not defined" (there is no such function in sql.functions).
I found another way, but it doesn't work for streaming data frames either. I don't know why the 'Non-time-based windows are not supported on streaming DataFrames' error is raised (df_vix.Timestamp is of timestamp type)
df.createOrReplaceTempView("df_vix")
df_vix.createOrReplaceTempView("df_vix")
aggregatedDF = spark.sql(
"""SELECT *, mean(VIX) OVER (
ORDER BY CAST(df_vix.Timestamp AS timestamp)
RANGE BETWEEN INTERVAL 100 MINUTES PRECEDING AND CURRENT ROW
) AS mean FROM df_vix""")
I have no idea what else I could use to calculate a simple moving average. It looks like it is impossible to achieve that in PySpark... maybe a better solution would be to convert the entire Spark data frame to pandas each time new data arrives and calculate everything in pandas (or append new rows to pandas and calculate the MA)?
I thought that creating new features as new data arrives was the main purpose of Structured Streaming, but as it turned out PySpark is not suited to this, so I am considering giving up PySpark and moving to pandas...
EDIT
The following doesn't work either. Although df_vix.Timestamp is of type 'timestamp', it throws the 'Non-time-based windows are not supported on streaming DataFrames' error anyway.
w = Window.orderBy(df_vix.Timestamp).rowsBetween(-20, -1)
aggregatedDF = df_vix.withColumn("MA", F.avg("VIX").over(w))
Have you looked at window operations on event time? window(timestamp, "10 minutes", "5 minutes") will give you a dataframe of 10 minute windows sliding every 5 minutes that you can then do aggregations on, including moving averages.
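For the pandas fallback mentioned in the question, a row-count window is straightforward. The sketch below uses made-up VIX values and is not a streaming solution. It only shows that `rolling(20)` counts rows rather than minutes, so Monday's first averages include Friday's last points.

```python
import pandas as pd

# Made-up data: 15 points late on Friday, then 15 points on Monday morning,
# all at 5 minute intervals.
five_min = pd.Timedelta(minutes=5)
friday = pd.date_range("2020-01-24 15:00", periods=15, freq=five_min)
monday = pd.date_range("2020-01-27 09:00", periods=15, freq=five_min)

vix = pd.DataFrame({
    "Timestamp": friday.append(monday),
    "VIX": [13.0 + 0.1 * i for i in range(30)],
}).sort_values("Timestamp")

# Moving average over the last 20 rows, regardless of the weekend gap.
# The first 19 rows have no full window and stay NaN.
vix["MA_20"] = vix["VIX"].rolling(window=20, min_periods=20).mean()
print(vix.tail())
```

New rows would have to be appended and the last window recomputed as data arrives, which is the trade-off against a streaming engine.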
So I have sensor-based time series data for a subject measured in second intervals, with the corresponding heart rate at each time point in an Excel format. My goal is to analyze whether there are any trends over time. When I import it into Python, I can see a certain number, but not the time. However, when imported in Excel, I can convert it into time format easily.
This is what it looks like in Python.. (column 1 = timestamp, column 2 = heart rate in bpm)
This is what it should look like though:
This is what I tried to convert it into datetime format in Python:
import datetime
Time = datetime.datetime.now()
"%s:%s.%s" % (Time.minute, Time.second, str(Time.microsecond)[:2])
if isinstance(Time,datetime.datetime):
    print ("Yay!")
df3.set_index('Time', inplace=True)
Time gets recognized as a float64 if I do this, not datetime64 [ns].
Consequently, when I try to plot this timeseries, I get the following:
I even did the Dickey-fuller Test to analyze trends in Python with this dataset. Does my misconfiguration of the time column in Python actually affect my ADF-test? I'm assuming since only trends in the 'heartrate' column are analyzed with this code, it shouldn't matter, right?
Here's the code I used:
#Perform Dickey-Fuller test:
import pandas as pd
from statsmodels.tsa.stattools import adfuller
print("Results of Dickey-Fuller Test:")
dftest=adfuller(df3['HeartRate'], autolag='AIC')
dfoutput=pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
test_stationarity(df3)
Did I do this correctly? I don't have experience in the engineering field and I'm doing this to improve healthcare for older people, so any help will be very much appreciated!
Thanks in advance! :)
It seems that the date format in Excel is expressed as the number of days that have passed since 12/30/1899. To convert the number in the timestamp column to seconds, you only need to multiply it by 24*60*60 = 86400 (the number of seconds in one day).
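If the goal is a real datetime column rather than seconds, pandas can apply the same day-count interpretation directly. A minimal sketch, assuming the column is called `Time` and holds Excel serial day numbers (the sample values are made up):

```python
import pandas as pd

# Made-up sample: Excel serial day numbers one second apart, plus heart rate in bpm.
one_second = 1.0 / 86400.0
df3 = pd.DataFrame({
    "Time": [43500.5, 43500.5 + one_second, 43500.5 + 2 * one_second],
    "HeartRate": [72, 74, 73],
})

# Interpret the floats as days elapsed since 1899-12-30.
df3["Time"] = pd.to_datetime(df3["Time"], unit="D", origin="1899-12-30")
df3 = df3.set_index("Time")
print(df3.index.dtype)
```

After this, the index should be `datetime64[ns]` and plotting will use real timestamps on the x axis.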
I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way the output has a shape of (4,145,192) indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season every year. In other words, output should have something with a shape like (40,145,192) (4 values for each year x 10 years)
I've looked into trying to do this with DataSet.resample as well using time=3M as the frequency, but then it doesn't split the months up correctly. If I have to I can alter the dataset, so it starts in the correct place, but I was hoping there would be an easier way considering there's already a function to group it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
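To see how an anchored quarterly frequency lines up with the DJF/MAM/JJA/SON seasons, here is a small pandas-only sketch. A synthetic daily Series stands in for the xarray Dataset, and it uses `'QS-DEC'`, which places quarter starts in the same months (Dec, Mar, Jun, Sep) as the offset in the answer.

```python
import numpy as np
import pandas as pd

# Synthetic daily "rainfall" covering two full years of seasons, starting 1 December.
days = pd.date_range("2000-12-01", "2002-11-30", freq="D")
rain = pd.Series(np.arange(len(days), dtype=float), index=days)

# Quarters anchored so that they start on 1 Dec, 1 Mar, 1 Jun and 1 Sep.
seasonal_max = rain.resample("QS-DEC").max()
print(seasonal_max)
```

Each output row is labelled with the first day of its season, so two years of data give eight rows rather than four.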