I read a netCDF file using xarray. The dataset contains lat/lon population data for the years 1975, 1990, 2000 and 2015.
The dataset looks like the following:
import xarray as xr
ds = xr.open_dataset('borneo_pop_t.nc')
ds
For each pixel, I would like to estimate the population for each year between 1975 and 2015, given the trend of the data points I have. In particular, I would like to generate additional population layers for the missing years.
How can I do that?
You can use xarray's interpolation method, ds.interp(). Using your variable names:
import pandas as pd
# Create a time series of once-a-year data
# (you can use any date range; you can generate pretty much
# any sequence you need with pd.date_range())
dates = pd.date_range('1975-01-01', '2015-01-01', freq='YS')  # 'YS' = year start; '1Y' would give year-end stamps and drop 2015-01-01
# Linear interpolation
ds_interp = ds.interp(time=dates)
Note a few things:
This code just does a simple linear interpolation, though ds.interp() supports everything that scipy.interpolate.interp1d() does - e.g. cubic, polynomial, etc. interpolation - via its method argument. Check the xarray docs for examples.
This of course doesn't create new information; you still only "know" the population at the original points. Be wary of the consequences interpolating the data will have on your understanding of what the population actually was in a given year.
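For example, a minimal sketch of a non-linear variant, assuming the same dates as above (cubic interpolation requires scipy to be installed):
# Cubic interpolation instead of the default linear method
ds_interp_cubic = ds.interp(time=dates, method='cubic')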
I have a dataset of weather data that I want to use to make a prediction.
The data set consists of data from several different locations. The features in the data set are as follows:
datetime
location
rain
snow
temp_min
temp_max
clouds
pressure
humidity
wind_speed
wind_deg
weather_description
The measurements have been made at the same time in all locations, which makes it possible to distinguish between the individual measurements.
I want to use data from all locations as input to get a prediction for a single location.
Is it possible to use several rows as input, or can the input only consist of one row?
The DecisionTreeRegressor from scikit-learn expects a dataframe where each output is generated based on a single row. You can nevertheless move all your measurements into one row (during training and testing), as below:
rain_stn1, rain_stn2, rain_stn3, ..., snow_stn1, snow_stn2, snow_stn3, ...
rain_value#stn1, rain_value#stn2, rain_value#stn3, ...
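A minimal sketch of this reshaping with pandas (the column and station names here are illustrative, not from your dataset):
import pandas as pd

# Hypothetical long-format data: one row per (datetime, location)
df = pd.DataFrame({
    'datetime': ['2023-01-01'] * 3,
    'location': ['stn1', 'stn2', 'stn3'],
    'rain': [0.1, 0.0, 0.3],
    'snow': [0.0, 0.0, 0.1],
})

# Pivot so each timestamp becomes one row with one column per (feature, station)
wide = df.pivot(index='datetime', columns='location', values=['rain', 'snow'])
wide.columns = [f'{feat}_{stn}' for feat, stn in wide.columns]
# Columns are now: rain_stn1, rain_stn2, rain_stn3, snow_stn1, ...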
Of course this means that there needs to be some logical relationship between the stations, such as distance. You could also create aggregate values such as rain_nearby (average of stations at <5 km distance) and rain_far (average of stations at >5 km distance), which is probably more helpful in your case; see the sketch below.
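Building on the sketch above, such aggregates could look like this (the distances are made up):
# Hypothetical distances from each station to the target location, in km
dist = {'stn1': 2.0, 'stn2': 4.5, 'stn3': 12.0}
near = [s for s, d in dist.items() if d < 5]
far = [s for s, d in dist.items() if d >= 5]

# Average rain over nearby vs. distant stations
wide['rain_nearby'] = wide[[f'rain_{s}' for s in near]].mean(axis=1)
wide['rain_far'] = wide[[f'rain_{s}' for s in far]].mean(axis=1)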
To give more specific answers, you would need to provide more details on your use case, what you are trying to achieve, and what the dataset looks like.
I'm attempting to do a regression to fit a function to some data points I have; these are simply (x, y) pairs where x is a date and y is a data value. Seems simple enough.
I'm following along with a how-to, and it comes to the part where you split your data into training/testing; that much I understand. But the input for model.fit() is a 2D array followed by labels.
I think I'm being incredibly dense, but this is what I have for that:
model.fit(input, date_time_training)
My input is an array like [[5, 3], [7, 5], ...], and my "labels" are dates, because that's how I'd want to label my data. But that's not right; they need to be numbers. There are two things they could be, though: my data points (y on my graph) or my x-axis values (dates). I converted my dates into numbers (0, 1, 2, 3, etc.) corresponding to each date.
Is that also what my labels would be?
Also, my input is just [[date_converted_to_int, score], ...], but looking at the documentation it seems that should be [[points, features], ...]. I'm pretty confused; obviously I'm not super experienced with regression either (otherwise I'm guessing this would be clearer).
You are trying to predict (the correct term in this case is forecast) your y over time.
So it is more suitable to use a time series model here, because by definition this is a time series use case.
(Time series: you try to understand the evolution of an attribute's values over time.)
Try some models like:
AR
ARIMA
statsmodels would be a good place to look for documentation.
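A minimal sketch with statsmodels, assuming a hypothetical daily series (your own y values indexed by date would replace the random data):
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily values indexed by date
dates = pd.date_range('2023-01-01', periods=100, freq='D')
y = pd.Series(np.random.randn(100).cumsum(), index=dates)

# Fit a simple ARIMA(1, 1, 1); the (p, d, q) order is just a starting guess
fitted = ARIMA(y, order=(1, 1, 1)).fit()

# Forecast the next 10 days
print(fitted.forecast(steps=10))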
As raw data we have measurements m_{i,j}, measured every 30 seconds (i = 0, 30, 60, 90, ..., 720, ...) for every subject j in the dataset.
I wish to use the tsfresh package to extract time-series features, such that for a point of interest at time i, features are calculated based on a symmetric rolling window.
We wish to calculate the feature vector of time point i,j based on measurements of 3 hours of context before i and 3 hours after i.
Thus, each window of 721 measurements represents a point of interest surrounded by 6 hours of "context", i.e. 360 measurements before and 360 measurements after the point of interest.
For every point of interest, features should be extracted based on 721 measurements of m_{i,j}.
I've tried using the rolling_direction parameter in roll_time_series(), but the only options are to roll either backwards or forwards in "time" - I'm looking for a way to include both "past" and "future" data in the feature calculation.
If I understand your idea correctly, it is even possible to do this with only one-sided rolling. Let's try with one example:
You want to predict for the time 8:00, and for this you need the data from 5:00 until 11:00.
If you roll through the data with a window size of 6 h and a positive rolling direction, you will end up with a dataset which also includes exactly this part of the data (5:00 to 11:00). Normally it would be used to train for the value at 11:00 (or 12:00), but nothing prevents you from using it to predict the value at 8:00.
Basically, it is just a matter of re-indexing.
(The same is actually true for a negative rolling direction.)
A "workaround" solution:
Use the roll_time_series() function twice: once for "backward" rolling (setting rolling_direction=1) and once for "forward" rolling (rolling_direction=-1), then combine the two into one dataframe.
This will provide, for each time point in the original dataset m_{i,j}, a rolled time-series object with 360 values "from the past" and 360 values "from the future" (i.e., the time point is at the center of the window and max_timeshift=360).
Note the use of the pandas functions below - concat(), sort_values(), drop_duplicates() - which are mandatory for this solution to work.
import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series

# activity_data: long-format dataframe with the columns id_column,
# sort_column and 'activity' (the measured value)
rolled_backward = roll_time_series(activity_data,
                                   column_id=id_column,
                                   column_sort=sort_column,
                                   column_kind=None,
                                   rolling_direction=1,
                                   max_timeshift=360)
rolled_forward = roll_time_series(activity_data,
                                  column_id=id_column,
                                  column_sort=sort_column,
                                  column_kind=None,
                                  rolling_direction=-1,
                                  max_timeshift=360)
# merge into one dataframe, with a backward and a forward window for every time point (sample)
df = pd.concat([rolled_backward, rolled_forward])
# important! - sort and drop duplicates
df.sort_values(by=[id_column, sort_column], inplace=True)
df.drop_duplicates(subset=[id_column, sort_column, 'activity'], inplace=True, keep='first')
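From here, features can be extracted from the combined windows; a minimal sketch, assuming the same columns as above (MinimalFCParameters keeps the run fast; swap in EfficientFCParameters or the defaults for a richer feature set):
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# One feature vector per (original id, time point), i.e. per combined window
features = extract_features(df,
                            column_id=id_column,
                            column_sort=sort_column,
                            default_fc_parameters=MinimalFCParameters())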
What are good algorithms to automatically detect a trend or draw a trend line (up trend, down trend, no trend) for time series data? I'd appreciate pointers to any good research papers or good libraries in Python, R, or Matlab.
Ideally, the output from this algorithm will have 4 columns:
from_time
to_time
trend (up/down/no trend/unknown)
probability_of_trend or degree_of_trend
Thank you so much for your time.
I had a similar problem - I wanted to segment a time series into segments with similar trends. For that task you can use the trend-classifier Python library. It is pip installable (pip3 install trend-classifier).
Here is an example that fetches time series data from Yahoo Finance and performs the analysis.
import yfinance as yf
from trend_classifier import Segmenter
# download data from yahoo finance
df = yf.download("AAPL", start="2018-09-15", end="2022-09-05", interval="1d", progress=False)
x_in = list(range(len(df)))
y_in = df["Adj Close"].tolist()
seg = Segmenter(x_in, y_in, n=20)
seg.calculate_segments()
Now, you can plot the time series with trend lines and segment boundaries with:
seg.plot_segments()
You can inspect details about each segment (e.g. a positive slope value indicates an up-trend and a negative one a down-trend). To see info about the segment with index 3:
from devtools import debug
debug(seg.segments[3])
You can get information about all segments in tabular form using the Segmenter.segments.to_dataframe() method, which produces a pandas DataFrame:
seg.segments.to_dataframe()
There is a parameter that controls the "generalization" factor. You can fit trend lines to smaller ranges of the time series and end up with a large number of segments, or go for segments spanning a bigger part of the time series (a more general trend line) and end up with fewer segments. To control that behavior, pass various values for the n parameter when initializing Segmenter() (e.g. Segmenter(x_in, y_in, n=20)). The larger n is, the stronger the generalization (fewer segments).
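For example, to compare granularities (the n values here are illustrative):
seg_fine = Segmenter(x_in, y_in, n=5)     # many short segments
seg_fine.calculate_segments()
seg_coarse = Segmenter(x_in, y_in, n=50)  # fewer, more general segments
seg_coarse.calculate_segments()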
Disclaimer: I'm the author of the trend-classifier package.
I have an array of some arbitrary data x and associated timestamps t that correspond to the data in x (they are the same length N).
I want to downsample my data x to a smaller length M < N, such that the new data is roughly equally spaced in time (by using the timestamp information). This would be instead of simply decimating the data by taking every nth datapoint. Using the closest time-neighbor is fine.
scipy has some resampling code, but it actually tries to interpolate between data points, which I cannot do for my data. Does numpy or scipy have code that does this?
For example, suppose I want to downsample the letters of the alphabet according to some logarithmic time:
import string
import numpy as np
x = string.ascii_lowercase  # string.lowercase in Python 2
t = np.logspace(1, 10, num=26)
y = downsample(x, t, 8)  # hypothetical function I'm looking for
I'd suggest using pandas, specifically the resample function:
"Convenience method for frequency conversion and resampling of regular time-series data."
Note the aggregation step in particular (older pandas versions took a how parameter; newer versions chain an aggregation method such as .mean() or .nearest() after resample()).
You can convert your numpy array to a DataFrame:
import pandas as pd
YourPandasDF = pd.DataFrame(YourNumpyArray)
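A minimal sketch of the nearest-neighbor downsampling itself, assuming the example data from the question (reindex with method='nearest' picks the closest original sample instead of interpolating):
import numpy as np
import pandas as pd

# Example data: 26 values at log-spaced times (interpreted as seconds)
x = np.arange(26)
t = np.logspace(1, 10, num=26)
s = pd.Series(x, index=pd.to_datetime(t, unit='s'))

# M = 8 target times, roughly equally spaced between the first and last sample
target = pd.date_range(s.index[0], s.index[-1], periods=8)

# Take the nearest original sample at each target time - no interpolation
y = s.reindex(target, method='nearest')
print(y)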