how to combine 4D xarray data - python

I have a 4D xarray DataArray with dimensions time, lev, lat, and lon. The data is for a single day, so the length of the time dimension is 1. My goal is to build a 4D DataArray with the same attributes but covering a month of data, so that the time length will be 30.
I tried to google it but cannot find useful information. I'd appreciate it if anyone could provide some insight.

If you have multiple points in a time series, you can use xr.DataArray.resample to change the frequency of a datetime dimension. Once you have resampled, you'll get a DataArrayResample object, to which you can apply any of the methods listed in the DataArrayResample API docs.
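For example, a minimal sketch (assuming da is a DataArray with sub-daily data along a datetime 'time' dimension) that downsamples to daily means with the keyword-style resample syntax:

daily_means = da.resample(time='1D').mean()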
If you only have a single point in time, you can't resample to a higher frequency. Your best bet is probably to simply select and drop the time dim altogether, then use expand_dims to expand the dimensions again to include the full time dim you want. Just be careful because this overwrites the time dimension's values with whatever you want, regardless of what was in there before:
import pandas as pd

target_dates = pd.date_range('2018-08-01', '2018-08-30', freq='D')
daily = (
    da
    .isel(time=0, drop=True)
    .expand_dims(time=target_dates)
)
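For a self-contained illustration, here is a rough sketch with a made-up 4D array (the sizes and coordinate values are assumptions, not taken from the question):

import numpy as np
import pandas as pd
import xarray as xr

# single-day 4D field: (time, lev, lat, lon)
da = xr.DataArray(
    np.random.rand(1, 3, 4, 5),
    dims=('time', 'lev', 'lat', 'lon'),
    coords={'time': [pd.Timestamp('2018-08-01')]},
)

target_dates = pd.date_range('2018-08-01', '2018-08-30', freq='D')
daily = da.isel(time=0, drop=True).expand_dims(time=target_dates)
# daily now has shape (30, 3, 4, 5); every day holds a copy of the original field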


how to extract values based upon month in xarray

I have an array of dimensions (9131, 101, 191). The first dimension is the days from 1/1/2075 till 31/12/2099. I want to extract all the days which are in the month of July. How can I do this in xarray? I have tried using loops and numpy but am not getting the desired result. Ultimately, I want to extract all the arrays falling in July and find their mean.
Here is the array; its name is initialize_c3 and its shape is (9131, 101, 191).
import pandas as pd
import xarray as xr

arr_c3 = xr.DataArray(
    initialize_c3,
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2075-01-01", periods=9131, freq="D"),
        "lat": list(range(1, 102)),
        "lon": list(range(1, 192)),
    },
)
I have tried to group by month (try is a reserved word in Python, so the result is stored under another name):
grouped = arr_c3.groupby(arr_c3.time.dt.month)
After this the shape of grouped is (755, 1, 1), but I want the dimensions to be (755, 101, 191). What am I doing wrong?
You can use groupby() to calculate the monthly climatology. Then use sel() to select the monthly mean for July:
ds.groupby('time.month').mean().sel(month=7)
Another way that avoids calculating the means for the other months is to first filter all days in July:
ds.sel(time=(ds.time.dt.month == 7)).mean('time')
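As a usage sketch building on arr_c3 from the question (keeping the lat/lon dimensions intact):

# boolean-select the July days along time, then average over time only
july_days = arr_c3.sel(time=arr_c3.time.dt.month == 7)   # (n_july_days, 101, 191)
july_mean = july_days.mean('time')                        # (101, 191)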

Pandas overlapped time intervals to time series

I have a pandas dataframe that includes time intervals that overlap at some points (figure 1). I need a dataframe with a time series that runs from the first start_time to the end of the last end_time (figure 2).
I have to sum up the VIS values over the overlapped time intervals.
I couldn't figure it out. How can I do it?
This problem is easily solved with the python package staircase, which is built on pandas and numpy for the purposes of working with (mathematical) step functions.
Assume your original dataframe is called df and the times you want in your resulting dataframe are an array (or datetime index, or series etc) called times.
import staircase as sc
stepfunction = sc.Stairs(df, start="start_time", end="end_time", value="VIS")
result = stepfunction(times, include_index=True)
That's it: result is a pandas Series indexed by times, with the values you want. You can convert it to a dataframe in the format you want using the reset_index method on the Series.
You can generate your times data like this
import pandas as pd
times = pd.date_range(df["start_time"].min(), df["end_time"].max(), freq="30min")
Why it works
Each row in your dataframe can be thought of as a step function. For example, the first row corresponds to a step function which starts with a value of zero, then at 2002-02-03 04:15:00 increases to a value of 10, then at 2002-02-04 04:45:00 returns to zero. When you sum up the step functions for all rows, you get one step function whose value at any point is the sum of all VIS values active at that point. This is what has been assigned to the stepfunction variable above. The stepfunction variable is callable and returns the values of the step function at the points specified, which is what happens in the last line of the example where the result variable is assigned.
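If you would rather avoid an extra dependency, here is a rough pandas-only sketch of the same idea (this is not the staircase implementation; column names and the times array are taken from the question and the answer above):

import pandas as pd

# +VIS at each start, -VIS at each end; the cumulative sum of the sorted deltas
# is the total VIS active at any instant (the summed step function)
deltas = pd.concat([
    pd.Series(df['VIS'].values, index=pd.to_datetime(df['start_time'])),
    pd.Series(-df['VIS'].values, index=pd.to_datetime(df['end_time'])),
])
step = deltas.groupby(level=0).sum().cumsum()

# sample the step function at the desired times (forward-fill between steps)
result = step.reindex(step.index.union(times)).ffill().reindex(times).fillna(0)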
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
If you paste your data instead of the images, I'd be able to test this. But this is how you may want to think about it. Assume your dataframe is called df.
import numpy as np
import pandas as pd

df['start_time'] = pd.to_datetime(df['start_time'])  # in case it's not datetime already
df.set_index('start_time', inplace=True)
new_dates = pd.date_range(start=min(df.index), end=max(df.end_time), freq='15Min')
new_df = df.reindex(new_dates, fill_value=np.nan)
As long as there are no duplicates in start_time, this should work. If there is, that'd need to be handled in some other way.
Resample is another possibility, but without data, it's tough to say what would work.

how to vectorise Pandas calculation that is based on last x rows of data

I have fairly sophisticated prediction code with over 20 columns and millions of rows per column, using WLS. Currently I use iterrows to loop through dates, and based on those dates and the values on those dates, I extract different-sized slices of data for the calculation. It takes hours to run in production. I have simplified the code to the following:
import pandas as pd
import numpy as np
from datetime import timedelta

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

def calculateC(A, dte):
    if A > 0:  # based on the value, use a different cutoff length for trend prediction
        depth = 10
    else:
        depth = 20
    lastyear = dte - timedelta(days=365)
    df2 = df[df.dte < lastyear].head(depth)  # use last year's data as the basis of prediction
    return df2.B.mean()  # uses WLS in my model, but replaced with mean for simplicity

for index, row in df.iterrows():
    if index > 365:
        df.loc[index, 'C'] = calculateC(row.A, row.dte)
I have read that iterrows is the main cause, because it is not an efficient way to use pandas, and that I should use vectorized methods instead. However, I can't seem to find a way to vectorize based on these conditions (dates, different lengths, and ranges of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
df['result'] = np.where(df['A'] > 0,
                        df.shift(365).rolling(10).B.mean(),
                        df.shift(365).rolling(20).B.mean())
The tough (slow) part of your code is this:
df2=df[df.dte<lastyear].head(depth)
However, as long as your dates are all 365 days apart, you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
And actually, even if your dates aren't completely regular, you can probably resample and get this way to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
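If the dates are irregular, here is a hedged sketch of that index-based variant (shifting by a calendar offset rather than by row count, reusing df, pd and np from the question):

# make the date the index so the shift works on a calendar offset, not rows
dfi = df.set_index('dte')

# each row's value from ~one year earlier, aligned back to the row's own date
prior_year_B = dfi['B'].shift(freq=pd.Timedelta(days=365)).reindex(dfi.index)

# backward-looking means over 10 or 20 of those prior-year values
df['C'] = np.where(df['A'].values > 0,
                   prior_year_B.rolling(10).mean().values,
                   prior_year_B.rolling(20).mean().values)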
I would try pandas.DataFrame.apply(func, axis=1)
def calculateC2(row):
    if row.name > 365:  # row.name is the index of the row
        if row.A > 0:  # based on the value, use a different cutoff length for trend prediction
            depth = 10
        else:
            depth = 20
        lastyear = row.dte - timedelta(days=365)
        df2 = df[df.dte < lastyear].B.head(depth)  # use last year's data as the basis of prediction
        print(row.name, np.mean(df2))  # uses WLS in my model, but replaced with mean for simplicity

df.apply(calculateC2, axis=1)
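To store the result instead of printing it, a small hedged variation of the same idea (reusing df, np and timedelta from the question; rows in the first year get NaN):

def calculateC3(row):
    if row.name <= 365:
        return np.nan
    depth = 10 if row.A > 0 else 20
    lastyear = row.dte - timedelta(days=365)
    return df[df.dte < lastyear].B.head(depth).mean()

df['C'] = df.apply(calculateC3, axis=1)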

Python xarray: Extract first and last time value within each month of a timeseries

EDIT 2016-01-24: This behavior was from a bug in xarray (at the time known as 'xray'). See answer by skc below.
I have an xarray.DataArray comprising daily data spanning multiple years. I want to compute the time tendency of that data for each month in the timeseries. I can get the numerator, i.e. the change in the quantity over each month, using resample. Supposing arr is my xarray.DataArray object, with the time coordinate named 'time':
data_first = arr.resample('1M', 'time', how='first')
data_last = arr.resample('1M', 'time', how='last')
Then data_last - data_first gives me the change in that variable over that month.
However, this doesn't work on the time=arr.time object itself: both 'first' and 'last' kwarg values yield the same value, which is the last day of that month. Also, I can't use the groupby methods, because doing so with time.month groups all the Januaries together, all the Februaries together, etc., when I want the first and last time value within each individual month in the timeseries.
Is there a simple way to do this in xarray? I suspect yes, but I'm new to the package and am failing miserably.
Since 'time' is a coordinate in the DataArray you provided, for the moment it is not possible[1] to perform resample directly upon it. A possible workaround is to create a new DataArray with the time coordinate values as a variable (still linked with the same coordinate 'time').
If arr is the DataArray you are starting from I would suggest something like this:
import xray

time = xray.DataArray(arr.time.values, coords=[arr.time.values], dims=['time'])
time_first = time.resample('1M', 'time', how='first')
time_last = time.resample('1M', 'time', how='last')
time_diff = time_last - time_first
[1] This is not the intended behavior -- see Stephan's comment above.
Update: Pull request 648 has fixed this issue, so there should no longer be a need to use a workaround.
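With current xarray versions the resample API has also changed to the keyword form, so something along these lines should now work directly on the time coordinate (a hedged sketch using the modern syntax):

time_first = arr['time'].resample(time='1M').first()
time_last = arr['time'].resample(time='1M').last()
time_diff = time_last - time_first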

Remove Holidays and Weekends in a very long time-serie, how to model time-series in Python?

Is there a function in Python to handle this? Google Docs has a WEEKDAY operation, so perhaps there is something like that in Python. I am pretty sure someone must have solved this; similar problems occur with sparse data, such as in finance and research. Basically I am trying to organize a huge number of different-sized vectors indexed by days, i.e. time series, and I am not sure how I should handle the days -- mark the first day with 1 and the last day with N, use unix time, or something else? I am also not sure whether the time series should be stored in a matrix so I could model them more easily, to calculate correlation matrices and such things. Is there anything ready-made for this?
Let's try to solve this problem without the "practical" extra clutter:
from itertools import compress, cycle

seq = range(100000)
criteria = cycle([True]*10 + [False]*801)
list(compress(seq, criteria))
Now I have to change these into days, and then change the $\mathbb R$ values into $(\mathbb R, \mathbb R)$ tuples. So a map $V : \mathbb R \mapsto \mathbb R^{2}$ is still missing; investigating.
[Update]
Let's play! The code below solves the subproblem -- it creates some test data to experiment with. Now we need to create arbitrary days and valuations so we can test it on arbitrary time series. If we can create some function $V$, we are very close to solving this problem... it must also handle the holidays and weekends, though, so it may not be easy (not sure).
import itertools as i
import time
import math
import numpy

def createRandomData():
    samples = []
    for x in range(5):
        seq = range(5)
        criteria = i.cycle([True]*x + [False]*3)
        samples += [list(i.compress(seq, criteria))]
    return samples

def createNNtriangularMatrix(data):
    N = len(data)
    return [aa + [0]*(N - len(aa)) for aa in data]

A = createNNtriangularMatrix(createRandomData())
print(numpy.array(A))
print(numpy.corrcoef(A))
I think you should figure out some way to determine the days you want to INCLUDE, and then create a (probably looping) subroutine that uses slicing operations on your big list.
For discontinuous slices, you can take a look at this question:
Discontinuous slice in python list
Or perhaps you could make the days you do not want receive a null value (zero or None).
Try using pandas. You can create a DateOffset for business days and include your data in a DataFrame (see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html) to analyze it.
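A minimal sketch of that idea (the holiday list here is made up; substitute your own calendar):

import pandas as pd

dates = pd.date_range('2012-01-01', '2012-12-31', freq='D')
holidays = pd.to_datetime(['2012-01-01', '2012-12-25'])   # assumed holidays

# business days only: drop weekends, then drop the listed holidays
bdays = dates[(dates.dayofweek < 5) & ~dates.isin(holidays)]

# or let pandas build a custom business-day index directly
bdays_alt = pd.bdate_range('2012-01-01', '2012-12-31',
                           freq='C', holidays=list(holidays))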
I think it depends on the scope of your problem; for a personal calendar, 'day' is good enough for indexing.
A lifetime is at most about 200 years, roughly 73,000 days, so you can simply calculate and record them all, maybe in a dict, e.g.
day = {}
# day[0] = [event_a, event_b, ...]
# or you may want to rewrite the __getitem__ method like this: day['09-05-2012']
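A tiny sketch of that __getitem__ idea (the 'dd-mm-yyyy' format and the epoch date are assumptions):

from datetime import datetime

class DayLog(dict):
    EPOCH = datetime(2012, 1, 1)   # assumed first day of the calendar
    def __getitem__(self, key):
        # accept either a day offset or a 'dd-mm-yyyy' string
        if isinstance(key, str):
            key = (datetime.strptime(key, '%d-%m-%Y') - self.EPOCH).days
        return super().__getitem__(key)

day = DayLog()
day[0] = ['event_a', 'event_b']
print(day['01-01-2012'])   # same as day[0]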
Why would you want to remove the holidays and weekends? Is it because they are outliers or zeroes? If they are zeroes, the model will handle them. You would want to leave the data in the time series and use dummy variables to model the seasonal effects (i.e. monthly dummies), day-of-the-week dummies, and holiday dummies. Clearly, I am dumbfounded. I have seen people who are unable to deal with time series analysis even break the weekdays into one time series and the weekends into another, which completely ignores the lead and lag impacts around holidays.
If it is trading days you want, then you can use the pandas-datareader package to download the S&P 500 historical prices for the U.S. and use its index of dates as a mask for your data.
Answered on mobile, I'll add links and code later.
