Sequential predictions on multiple sequences - python

I need to predict a sequence of responses based on a sequence of predictors using something like an LSTM. There are multiple sequences, and they are structured such that they cannot be stacked and still make sense.
For example, we might have one sequence with the sequential values
Observation  Location 1x  Location 1y  Location 2 (response)
1            3.8          2.5          9.4
2            3.9          2.7          9.7
and another with the values
Observation  Location 1x  Location 1y  Location 2 (response)
1            9.4          4.6          16.8
2            9.2          4.1          16.2
Observation 2 from the first table and observation 1 from the second table do not follow each other. I then need to predict on an unseen sequence like
Location 1x  Location 1y
5.6          8.4
5.6          8.1
which is also not correlated with the first two, except that the first two should give a guideline on how to predict this sequence.
I've looked into multiple sequence prediction and haven't had much luck. Can anyone suggest what sort of strategies I might use for this problem? Thanks.
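For what it's worth, here is a minimal sketch of one common strategy, assuming a Keras-style LSTM: treat every independent sequence as its own sample in a (samples, timesteps, features) array, so the model never sees observations from different sequences as if they were consecutive. The layer size, epoch count, and use of Keras are illustrative assumptions, not something stated in the question.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Each independent sequence is one sample: shape (samples, timesteps, features).
# Features are [Location 1x, Location 1y]; the response is Location 2 per timestep.
X = np.array([[[3.8, 2.5], [3.9, 2.7]],
              [[9.4, 4.6], [9.2, 4.1]]])
y = np.array([[[9.4], [9.7]],
              [[16.8], [16.2]]])

model = Sequential([
    LSTM(16, return_sequences=True, input_shape=(None, 2)),  # None allows variable-length sequences
    Dense(1),                                                 # one response value per timestep
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=200, verbose=0)

# Predict on a new, unrelated sequence with the same feature layout.
X_new = np.array([[[5.6, 8.4], [5.6, 8.1]]])
print(model.predict(X_new))

Because each sequence is a separate sample, the LSTM state is reset between sequences, which matches the "cannot be stacked" property described above; with sequences of different lengths you would pad and mask them, or feed one sequence per batch.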

Related

How do I use a Markov chain to classify numerical data?

My data is X(x1, x2, x3) and Y(y1), where Y contains the class labels. For example:
dataset = [1.2 4.5 10.32 1; 1.7 5.7 10.12 1; 0.9 6.1 9.99 0; ...; 1.9 7.8 6.67 0]
I want to classify my data with a Markov chain, but all the Python code I can find is for textual data and other applications. Does anyone know of a code sample for classifying numerical data?
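No code in the thread addresses this, but here is one possible sketch, under the assumption that you discretize each feature into bins, fit one transition matrix per class, and classify a row by which class's chain makes its binned feature sequence most likely. The bin count and Laplace smoothing are arbitrary choices, not part of the question.

import numpy as np

# The small dataset from the question: three features followed by a class label.
dataset = np.array([[1.2, 4.5, 10.32, 1],
                    [1.7, 5.7, 10.12, 1],
                    [0.9, 6.1,  9.99, 0],
                    [1.9, 7.8,  6.67, 0]])
X, y = dataset[:, :3], dataset[:, 3].astype(int)

n_bins = 4
edges = np.linspace(X.min(), X.max(), n_bins + 1)[1:-1]
states = np.digitize(X, edges)  # each row becomes a short sequence of discrete states

def transition_matrix(seqs, n_states):
    T = np.ones((n_states, n_states))      # Laplace smoothing: no transition gets zero probability
    for seq in seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

# One transition matrix per class.
models = {c: transition_matrix(states[y == c], n_bins) for c in np.unique(y)}

def log_likelihood(seq, T):
    return sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))

def predict(row):
    seq = np.digitize(row, edges)
    return max(models, key=lambda c: log_likelihood(seq, models[c]))

print([predict(row) for row in X])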

Pandas Way of Weighted Average in a Large DataFrame

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to find a way to compute a weighted average over this DataFrame, producing another DataFrame.
Here is what my dataset looks like (a very simplified version):
                   prec  temp
location_id hours
135         1      12.0   4.0
            2      14.0   4.1
            3      14.3   3.5
            4      15.0   4.5
            5      15.0   4.2
            6      15.0   4.7
            7      15.5   5.1
136         1      12.0   4.0
            2      14.0   4.1
            3      14.3   3.5
            4      15.0   4.5
            5      15.0   4.2
            6      15.0   4.7
            7      15.5   5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here; normally there are around 20 columns.
What I want to do is create a new DataFrame that is basically a weighted average of this one. The requirements say that 12 of these location_ids should be averaged with specified weights to form the combined_location_id values.
For example, location_ids 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 with their corresponding weights (separate data coming in from another DataFrame) should be weight-averaged to form the data for combined_location_id CL_1.
That is a lot of data to handle, and I wasn't able to find a fully Pandas way of solving it, so I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(
        df_data.loc[df_data.index.get_level_values(0) == location_id]
        for location_id in mapped_location_ids
    )
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally, but the performance and memory consumption are horrible. It takes over 2 hours on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I can see, you can at least drop the inner loop over mapped_location_ids by using isin:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
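Going further, here is a fully vectorised sketch that avoids the Python loop entirely: flatten the combined-location mapping into its own DataFrame, merge it onto the data, and let a single groupby do the weighted average. The mapping frame below is hypothetical; it stands in for the separate weights DataFrame mentioned in the question.

import pandas as pd

# Hypothetical long-format mapping: one row per (location_id, combined_location_id) pair.
mapping = pd.DataFrame({
    'location_id': [1, 3, 5, 7],
    'combined_location_id': ['CL_1'] * 4,
    'weight': [0.4, 0.3, 0.2, 0.1],
})

value_cols = ['prec', 'temp']            # in practice, all of the numeric columns

df = df_data.reset_index().merge(mapping, on='location_id')
df[value_cols] = df[value_cols].mul(df['weight'], axis=0)   # weight each value
grouped = df.groupby(['combined_location_id', 'hours'])
# Weighted sum divided by the sum of weights per group = weighted average.
df_combined_location_data = grouped[value_cols].sum().div(grouped['weight'].sum(), axis=0)

This does one merge and one groupby over the whole 8-million-row frame instead of thousands of small concatenations, which is usually where the hours go.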

Split a list up to a maximum number of elements

I was wondering if someone could help me with the following problem: I have a text file that I split into rows and columns. The text file contains a variable number of columns, but I would like to split each row into exactly seven columns, no more, no less. To do that, I want to throw everything after the sixth column into a single column.
Example code:
import numpy as np

rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']
for i in range(len(rot)):
    rot[i] = rot[i].split()
Here, the array 'rot' contains 7 entries in the first row (the ! counts as a separate entry) and 8 in the second row. In both cases, everything after and including the ! should be grouped in the same column.
Many thanks!
You are almost there. split takes (as its second argument) the maximum number of splits to do.
https://docs.python.org/3.8/library/stdtypes.html#str.split
rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']
for i in range(len(rot)):
    rot[i] = rot[i].split(maxsplit=6)
Note: You want six splits, which results in seven columns. You'll need to do some extra processing if the text can have fewer than seven columns though.
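If short rows are possible, here is one small sketch of that extra processing; the short row and the empty-string padding value are arbitrary choices for illustration:

rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR',
       '42 0.0 0.0']                                   # hypothetical short row
rows = [r.split(maxsplit=6) for r in rot]
rows = [r + [''] * (7 - len(r)) for r in rows]         # pad every row out to exactly 7 columns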

When using k nearest neighbors, is there a way to retrieve the "neighbors" that are used?

I'd like to find a way to determine which neighbors are actually used in my knn algorithm, so I can dive deeper into the rows of data that are similar to my features.
Here is an example of a dataset which I split into a training set and a test set for the prediction model:
Player           PER   VORP   WS
Fabricio Oberto  11.9   1.0   4.1
Eddie Johnson    16.5   1.7   4.8
Tim Legler       15.9   2.0   6.8
Ersan Ilyasova   14.3   0.7   3.8
Kevin Love       25.4   3.5  10.0
Tim Hardaway     20.6   5.1  11.7
Frank Brickowsk   8.6  -0.2   1.6
etc....
And here is an example of my knn algorithm code:
from sklearn.neighbors import KNeighborsRegressor

features = ['PER', 'VORP']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train[features], train['WS'])
predictions = knn.predict(test[features])
Now, I'm aware that the algorithm will iterate over each row and make each target prediction based on the 5 closest neighbors that come from the target features I've specified.
I'd like to find out WHICH 5 neighbors were actually used in determining each prediction. In this case, which players were actually used in determining the target?
Is there a way to get a list of the 5 neighbors (aka players) which were used in the analysis for each row?
knn.kneighbors will return you an array of the corresponding nearest neighbours.
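For example, a sketch of how that fits the code above, assuming train and test are the pandas DataFrames from the question and still contain the Player column:

# Row positions of the 5 nearest training rows for every test row, plus their distances.
distances, indices = knn.kneighbors(test[features])

# indices are positions within the data passed to fit(), so map them back to train.
for row_pos, neighbour_positions in enumerate(indices):
    neighbours = train.iloc[neighbour_positions]
    print(test.iloc[row_pos]['Player'], '->', list(neighbours['Player']))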

Regex/split strings in list for particular element

I have a list of items that looks like this:
[u'1111 aaaa 20 0 250m 149m 113m S 0.0 2.2 532:09.83 bbbb', u' 5555 cccc 20 0 218m 121m 91m S 0.0 3.3 288:50.20 dddd']
The only things in each item that I am concerned with are 2.2 and 3.3, but every value in each item is variable and changes every time the process is run. The format will always be the same, however.
Is there a way to regex each item in the list and check this value?
If you want to just get the 2.2 and 3.3 values, you can go without regexps:
data = [u'1111 aaaa 20 0 250m 149m 113m S 0.0 2.2 532:09.83 bbbb', u' 5555 cccc 20 0 218m 121m 91m S 0.0 3.3 288:50.20 dddd']
print([item.split()[9] for item in data]) # yields [u'2.2', u'3.3']
By default, split splits on whitespace. Your 2.2 and 3.3 values happen to be the 10th field in each of the blobs, and since Python uses 0-based indexing, the 10th field in human terms becomes index 9.
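If you do want a regex instead, here is one possible pattern, assuming the 'S' status column and the field after it are always present as in the sample data:

import re

data = [u'1111 aaaa 20 0 250m 149m 113m S 0.0 2.2 532:09.83 bbbb',
        u' 5555 cccc 20 0 218m 121m 91m S 0.0 3.3 288:50.20 dddd']
pattern = re.compile(r'\sS\s+\S+\s+(\S+)')   # skip the column right after 'S', capture the next one
print([pattern.search(item).group(1) for item in data])  # ['2.2', '3.3']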
