Feature engineering for timeseries dataset in Python [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
I have a dataset with n observations, where all the observations have m timesteps. I also have a n*m array which contains the label for each timestep on each given observation.
I am doing feature engineering on this dataset to find meaningful features in the data, according to the labels I have. Is there any Python package out there to facilitate this process?
I came across tsfresh (https://github.com/blue-yonder/tsfresh), although it seems like it's only intended to be used when we have a single label to classify each observation, and not a label to classify each timestep, as is my case.

You could try Featuretools. It is an open-source automated feature engineering library that explicitly deals with time to make sure you don't introduce label leakage.
For your data, you could create two entities: "observations" and "timesteps", and then apply featuretools.dfs (Deep Feature Synthesis) to generate features for each timestep. You can think of an entity as being the same as a table in a relational database.
Of particular usefulness for your problem would be the Cumulative primitives in Featuretools, which are operations that use many instances ordered by time to compute a single value. In your case, if there was a observation with multiple timesteps each with a certain value, you could calculate the mean value of the previous timesteps using the CumMean primitive.
Here is an example:
from featuretools.primitives import Day, Weekend, Percentile, CumMean, CumSum
import featuretools as ft
import pandas as pd
import numpy as np
timesteps = pd.DataFrame({'ts_id': range(12),
'timestamp': pd.DatetimeIndex(start='1/1/2018', freq='1d', periods=12),
'attr1': np.random.random(12),
'obs_id': [1, 2, 3] * 4})
print(timesteps)
attr1 obs_id timestamp ts_id
0 0.663216 1 2018-01-01 0
1 0.455353 2 2018-01-02 1
2 0.800848 3 2018-01-03 2
3 0.938645 1 2018-01-04 3
4 0.442037 2 2018-01-05 4
5 0.724044 3 2018-01-06 5
6 0.304241 1 2018-01-07 6
7 0.134359 2 2018-01-08 7
8 0.275078 3 2018-01-09 8
9 0.499343 1 2018-01-10 9
10 0.608565 2 2018-01-11 10
11 0.340991 3 2018-01-12 11
entityset = ft.EntitySet("timeseries")
entityset.entity_from_dataframe("timesteps",
timesteps,
index='ts_id',
time_index='timestamp')
entityset.normalize_entity(base_entity_id='timesteps',
new_entity_id='observations',
index='obs_id',
make_time_index=True)
# per timestep
cutoffs = timesteps[['ts_id', 'timestamp']]
feature_matrix, feature_list = ft.dfs(entityset=entityset,
target_entity='timesteps',
cutoff_time=cutoffs,
trans_primitives=[Day, Weekend, Percentile, CumMean, CumSum],
agg_primitives=[])
print(feature_matrix.iloc[:, -6:])
CUMMEAN(attr1 by obs_id) CUMSUM(attr1 by obs_id) CUMMEAN(PERCENTILE(attr1) by obs_id) CUMSUM(CUMMEAN(attr1 by obs_id) by obs_id) CUMSUM(PERCENTILE(attr1) by obs_id) observations.DAY(first_timesteps_time)
ts_id
0 0.100711 0.100711 1.000000 0.100711 1.000000 1
1 0.811898 0.811898 1.000000 0.811898 1.000000 2
2 0.989166 0.989166 1.000000 0.989166 1.000000 3
3 0.442035 0.442035 0.500000 0.442035 0.500000 1
4 0.910106 0.910106 0.800000 0.910106 0.800000 2
5 0.427610 0.427610 0.333333 0.427610 0.333333 3
6 0.832516 0.832516 0.714286 0.832516 0.714286 1
7 0.035121 0.035121 0.125000 0.035121 0.125000 2
8 0.178202 0.178202 0.333333 0.178202 0.333333 3
9 0.085608 0.085608 0.200000 0.085608 0.200000 1
10 0.891033 0.891033 0.818182 0.891033 0.818182 2
11 0.044010 0.044010 0.166667 0.044010 0.166667 3
This example also used “cutoff times” to tell the feature computation engine to only use data prior to the specified time for each “ts_id” or “obs_id”. You can read more about cutoff times on this page in the documentation.
Another cool thing that Featuretools lets you do is construct features per observation in the “observations” table, instead of per timestep. To do this, change the “target_entity” parameter. In the following example, we use take the last timestamp per observation to use as the cutoff time, which will make sure no data is used from after that time (e.g. data from obs_id = 2 at 2018-01-11 will not be included in the Percentile() computation for obs_id = 1, with cutoff time 2018-01-10).
# per observation
ocutoffs = timesteps[['obs_id', 'timestamp']].drop_duplicates(['obs_id'], keep='last')
ofeature_matrix, ofeature_list = ft.dfs(entityset=entityset,
target_entity='observations',
cutoff_time=ocutoffs,
trans_primitives=[Day, Weekend, Percentile, CumMean, CumSum])
print(ofeature_matrix.iloc[:, -6:])
PERCENTILE(STD(timesteps.attr1)) PERCENTILE(MAX(timesteps.attr1)) PERCENTILE(SKEW(timesteps.attr1)) PERCENTILE(MIN(timesteps.attr1)) PERCENTILE(MEAN(timesteps.attr1)) PERCENTILE(COUNT(timesteps))
obs_id
1 0.666667 1.000000 0.666667 0.666667 0.666667 1.000000
2 0.333333 0.666667 0.666667 0.666667 0.333333 0.833333
3 1.000000 1.000000 0.333333 0.333333 1.000000 0.666667
Lastly, it is actually possible to use tsfresh in conjunction with Featuretools as a "custom primitive". This is an advanced feature, but I could explain more if you are interested.

I could answer more in detail if you give me more details about the question. However, based on my understanding, you want to predict something with the time-series data that you have.
There is a package called keras in Python for machine learning. What you can do is, you can use LSTMs for training your model. The support for LSTMs is very good in keras.

Related

How can I plot ROC curve and calculate AUC, determine cutoff value?

To evaluate diagnostic performance, I want to plot ROC curve, calculate AUC, and determine cutoff value
I have concentration of some protein, and actual disease diagnosis result (true of false)
I found some references, but I think they are optimized for machine learning.
And I’m not python expert. I can't figure out how to replace the test data with my own.
Here is some references and my sample data.
Could you please help me?
Sample Value Real
1 74.9 T
2 64.22 T
3 45.11 T
4 12.01 F
5 61.43 T
6 96 T
7 74.22 T
8 79.9 T
9 5.18 T
10 60.11 T
11 14.96 F
12 26.01 F
13 26.3 F

How to scale values if the feature values are already between -1 and 1 in Pandas Dataframe

I have a Pandas DataFrame as follows:
Month Sin Month Cos Hour Sin Hour Cos close
0 0.5 0.866025 0.258819 0.965926 430.78
1 0.5 0.866025 0.500000 0.866025 430.62
2 0.5 0.866025 0.707107 0.707107 432.84
3 0.5 0.866025 0.866025 0.500000 436.12
4 0.5 0.866025 0.965926 0.258819 435.99
I want to use the first 4 columns ['Month Sin', 'Month Cos', 'Hour Sin', 'Hour Cos'] as my feature values to predict close value using any Machine Learning algorithm (a regression problem basically).
My features which are the first 4 columns are already between -1 and 1. So, my question is that is it necessary to scale the values using MinMaxScaler or StandardScaler if the feature values are already between -1 and 1?
And do I need to scale my target variable close or not? Thanks.
you can use SVC algorithm for predict close

How to normalize the data in a dataframe in the range [0,1]?

I'm trying to implement a paper where PIMA Indians Diabetes dataset is used. This is the dataset after imputing missing values:
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age Outcome
0 1 148.0 72.000000 35.00000 155.548223 33.600000 0.627 50 1
1 1 85.0 66.000000 29.00000 155.548223 26.600000 0.351 31 0
2 1 183.0 64.000000 29.15342 155.548223 23.300000 0.672 32 1
3 1 89.0 66.000000 23.00000 94.000000 28.100000 0.167 21 0
4 0 137.0 40.000000 35.00000 168.000000 43.100000 2.288 33 1
5 1 116.0 74.000000 29.15342 155.548223 25.600000 0.201 30 0
The description:
df.describe()
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age
count768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean0.855469 121.686763 72.405184 29.153420 155.548223 32.457464 0.471876 33.240885
std 0.351857 30.435949 12.096346 8.790942 85.021108 6.875151 0.331329 11.760232
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000
25% 1.000000 99.750000 64.000000 25.000000 121.500000 27.500000 0.243750 24.000000
50% 1.000000 117.000000 72.202592 29.153420 155.548223 32.400000 0.372500 29.000000
75% 1.000000 140.250000 80.000000 32.000000 155.548223 36.600000 0.626250 41.000000
max 1.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000
The description of normalization from the paper is as follows:
As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0,1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our value set V to obtain a new set of normalized values V’ with the equation below:
V'=V-Y/Z
where V’= New normalized value, V=previous value, Y=mean and Z=standard deviation
z=scipy.stats.zscore(df)
But when I try to run the code above, I'm getting negative values and values greater than one i.e., not in the range [0,1].
There are several points to note here.
Firstly, z-score normalisation will not result in features in the range [0, 1] unless the input data has very specific characteristics.
Secondly, as others have noted, two of the most common ways of normalising data are standardisation and min-max scaling.
Set up data
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')
# For the purposes of this exercise, we'll just use the alphabet as column names
df.columns = list(string.ascii_lowercase)[:len(df.columns)]
$ print(df.head())
a b c d e f g h i
0 1 85 66 29 0 26.6 0.351 31 0
1 8 183 64 0 0 23.3 0.672 32 1
2 1 89 66 23 94 28.1 0.167 21 0
3 0 137 40 35 168 43.1 2.288 33 1
4 5 116 74 0 0 25.6 0.201 30 0
Standardisation
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")
Min: -4.055 Max: 845.307
As you can see, the values are far from being in [0, 1]. Note the range of the resulting data from z-score normalisation will vary depending on the distribution of the input data.
Min-max scaling
min_max = (df - df.values.min()) / (df.values.max() - df.values.min())
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {min_max.min().min():4.3f} Max: {min_max.max().max():4.3f}")
Min: 0.000 Max: 1.000
Here we do indeed get values in [0, 1].
Discussion
These and a number of other scalers exist in the sklearn preprocessing module. I recommend reading the sklearn documentation and using these instead of doing it manually, for various reasons:
There are fewer chances of making a mistake as you have to do less typing.
sklearn will be at least as computationally efficient and often more so.
You should use the same scaling parameters from training on the test data to avoid leakage of test data information. (In most real world uses, this is unlikely to be significant but it is good practice.) By using sklearn you don't need to store the min/max/mean/SD etc. from scaling training data to reuse subsequently on test data. Instead, you can just use scaler.fit_transform(X_train) and scaler.transform(X_test).
If you want to reverse the scaling later on, you can use scaler.inverse_transform(data).
I'm sure there are other reasons, but these are the main ones that come to mind.
Your standardization formula hasn't the aim of putting values in the range [0, 1].
If you want to normalize data to make it in such a range, you can use the following formula :
z = (actual_value - min_value_in_database)/(max_value_in_database - min_value_in_database)
And sir, you're not oblige to do it manually, just use sklearn library, you'll find different standardization and normalization methods in the preprocessing section.
Assuming your original dataframe is df and it has no invalid float values, this should work
df2 = (df - df.values.min()) / (df.values.max()-df.values.min())

How can I build a faster decaying average? comparing a data frame's rows date field to other rows dates

I am clumsy but adequate with python. I have referenced stack often, but this is my first question. I have built a decaying average function to act on a pandas data frame with about 10000 rows, but it takes 40 minutes to run. I would appreciate any thoughts on how to speed it up. Here is a sample of actual data, simplified a bit.
sub = pd.DataFrame({
'user_id':[101,101,101,101,101,102,101],
'class_section':['Modern Biology - B','Spanish Novice 1 - D', 'Modern Biology - B','Spanish Novice 1 - D','Spanish Novice 1 - D','Modern Biology - B','Spanish Novice 1 - D'],
'sub_skill':['A','A','B','B','B','B','B'],
'rating' :[2.0,3.0,3.0,2.0,3.0,2.0,2.0],
'date' :['2019-10-16','2019-09-04','2019-09-04', '2019-09-04','2019-09-13','2019-10-16','2019-09-05']})
For this data frame:
sub
Out[716]:
user_id class_section sub_skill rating date
0 101 Modern Biology - B A 2.0 2019-10-16
1 101 Spanish Novice 1 - D A 3.0 2019-09-04
2 101 Modern Biology - B B 3.0 2019-09-04
3 101 Spanish Novice 1 - D B 2.0 2019-09-04
4 101 Spanish Novice 1 - D B 3.0 2019-09-13
5 102 Modern Biology - B B 2.0 2019-10-16
6 101 Spanish Novice 1 - D B 2.0 2019-09-05
A decaying average weights the most recent event that meets conditions at full weight and weights each previous event with a multiplier less than one. In this case, the multiplier is 0.667. previously weighted events are weighted again.
So the decaying average for user 101's rating in Spanish sub_skill B is:
(2.0*0.667^2 + 2.0*0.667^1 + 3.0*0.667^0)/((0.667^2 + 0.667^1 + 0.667^0) = 2.4735
Here is what I tried, after reading a helpful post on weighted averages
sub['date'] = pd.to_datetime(sub.date_due)
def func(date, user_id, class_section, sub_skill):
return sub.apply(lambda row: row['date'] > date
and row['user_id']==user_id
and row['class_section']== class_section
and row['sub_skill']==sub_skill,axis=1).sum()
# for some reason this next line of code took about 40 minutes to run on 9000 rows:
sub['decay_count']=sub.apply(lambda row: func(row['date'],row['user_id'], row['class_section'], row['sub_skill']), axis=1)
# calculate decay factor:
sub['decay_weight']=sub.apply(lambda row: 0.667**row['decay_count'], axis=1)
# calcuate decay average contributors (still needs to be summed):
g = sub.groupby(['user_id','class_section','sub_skill'])
sub['decay_avg'] = sub.decay_weight / g.decay_weight.transform("sum") * sub.rating
# new dataframe with indicator/course summaries as decaying average (note the sum):
indicator_summary = g.decay_avg.sum().to_frame(name = 'DAvg').reset_index()
I frequently work in pandas and I am used to iterating through large datasets. I would have expected this to take rows-squared time, but it is taking much longer. A more elegant solution or some advice to speed it up would be really appreciated!
Some background on this project: I am trying to automate the conversion from proficiency-based grading into a classic course grade for my school. I have the process of data extraction from our Learning Management System into a spreadsheet that does the decaying average and then posts the information to teachers, but I would like to automate the whole process and extract myself from it. The LMS is slow to implement a proficiency-based system and is reluctant to provide a conversion - for good reason. However, we have to communicate both student proficiencies and our conversion to a traditional grade to parents and colleges since that is a language they speak.
Why not use groupby? The idea here is that you rank the dates within the group in descending order and subtract 1 (because rank starts with 1). That seems to mirror your logic in func above, without having to try to call apply with a nested apply.
sub['decay_count'] = sub.groupby(['user_id', 'class_section', 'sub_skill'])['date'].rank(method='first', ascending=False) - 1
sub['decay_weight'] = sub['decay_count'].apply(lambda x: 0.667 ** x)
Output:
sub.sort_values(['user_id', 'class_section', 'sub_skill', 'decay_count'])
user_id class_section sub_skill rating date decay_count decay_weight
0 101 Modern Biology - B A 2.0 2019-10-16 0.0 1.000000
2 101 Modern Biology - B B 3.0 2019-09-04 0.0 1.000000
1 101 Spanish Novice 1 - D A 3.0 2019-09-04 0.0 1.000000
3 101 Spanish Novice 1 - D B 2.0 2019-09-04 0.0 1.000000
6 101 Spanish Novice 1 - D B 2.0 2019-09-05 1.0 0.667000
4 101 Spanish Novice 1 - D B 3.0 2019-09-13 2.0 0.444889
5 102 Modern Biology - B B 2.0 2019-10-16 0.0 1.000000

Table of differences between observed and expected counts

I have data where I'm modeling a binary dependent variable. There are 5 other categorical predictor variables and I have the chi-square test for independence for each of them, vs. the dependent variable. All came up with very low p-values.
Now, I'd like to create a chart that displays all of the differences between the observed and expected counts. It seems like this should be part of the scipy chi2_contingency function but I can't figure it out.
The only thing I can think of is that the chi2_contingency function will output an array of expected counts, so I guess I need to figure out how to convert my cross tab table of observed counts into an array and then subtract the two.
## Gender & Income: cross-tabulation table and chi-square
ct_sex_income=pd.crosstab(adult_df.sex, adult_df.income, margins=True)
ct_sex_income
## Run Chi-Square test
scipy.stats.chi2_contingency(ct_sex_income)
## try to subtract them
ct_sex_income.observed - chi2_contingency(ct_sex_income)[4]
Error I get is "AttributeError: 'DataFrame' object has no attribute 'observed'"
I'd like just an array that shows the differences.
TIA for any help
I don't know your data and have no clue about how your observed function is defined. I couldn't understand much of your intention, probably something about predicting people's income based on their marital status.
I am posting here one possible solution for your problem.
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency
# some bogus data
data = [['single','30k-35k'],['divorced','40k-45k'],['married','25k-30k'],
['single','25k-30k'],['married','40k-45k'],['divorced','40k-35k'],
['single','30k-35k'],['married','30k-35k'],['divorced','30k-35k'],
['single','30k-35k'],['married','40k-45k'],['divorced','25k-30k'],
['single','40k-45k'],['married','30k-35k'],['divorced','30k-35k'],
]
adult_df = pd.DataFrame(data,columns=['marital','income'])
X = adult_df['marital'] #variable
Y = adult_df['income'] #prediction
dfObserved = pd.crosstab(Y,X)
results = []
#Chi-Statistic, P-Value, Degrees of Freedom and the expected frequencies
results = stats.chi2_contingency(dfObserved.values)
chi2 = results[0]
pv = results[1]
free = results[2]
efreq = results[3]
dfExpected = pd.DataFrame(efreq, columns=dfObserved.columns, index = dfObserved.index)
print(dfExpected)
"""
marital divorced married single
income
25k-30k 1.000000 1.000000 1.000000
30k-35k 2.333333 2.333333 2.333333
40k-35k 0.333333 0.333333 0.333333
40k-45k 1.333333 1.333333 1.333333
"""
print(dfObserved)
"""
marital divorced married single
income
25k-30k 1 1 1
30k-35k 2 2 3
40k-35k 1 0 0
40k-45k 1 2 1
"""
difference = dfObserved - dfExpected
print(difference)
""""
marital divorced married single
income
25k-30k 0.000000 0.000000 0.000000
30k-35k -0.333333 -0.333333 0.666667
40k-35k 0.666667 -0.333333 -0.333333
40k-45k -0.333333 0.666667 -0.333333
"""
I hope it helps

Categories

Resources