I need to predict a sequence of responses based on a sequence of predictors using something like an LSTM. There are multiple sequences, and they are structured such that they cannot be stacked and still make sense.
For example, we might have one sequence with the sequential values
Observation  Location 1x  Location 1y  Location 2 (response)
1            3.8          2.5          9.4
2            3.9          2.7          9.7
and another with the values
Observation  Location 1x  Location 1y  Location 2 (response)
1            9.4          4.6          16.8
2            9.2          4.1          16.2
Observation 2 from the first table and observation 1 from the second table do not follow each other. I then need to predict on an unseen sequence like
Location 1x  Location 1y
5.6          8.4
5.6          8.1
which is also independent of the first two, except that the first two should serve as a guide for how to predict the new sequence.
I've looked into multiple sequence prediction and haven't had much luck. Can anyone give a guideline on what sort of strategies I might use for this problem? Thanks.
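For concreteness, this is roughly how I have been imagining the data would need to be arranged for something like a Keras LSTM, with each independent sequence kept as its own sample along the batch axis rather than stacked in time (a minimal sketch only; tensorflow.keras, the layer sizes, and the toy numbers from the tables above are just assumptions for illustration):

# Minimal sketch: each independent sequence is one sample along the batch axis,
# so the sequences are never concatenated in time.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Two training sequences, 2 timesteps each, 2 predictors (Location 1x, Location 1y)
X = np.array([[[3.8, 2.5], [3.9, 2.7]],
              [[9.4, 4.6], [9.2, 4.1]]])   # shape (n_sequences, timesteps, n_features)
y = np.array([[[9.4], [9.7]],
              [[16.8], [16.2]]])           # Location 2 (response) at each timestep

model = Sequential([
    LSTM(16, return_sequences=True, input_shape=(None, 2)),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=200, verbose=0)

# Unseen sequence to predict on
X_new = np.array([[[5.6, 8.4], [5.6, 8.1]]])
print(model.predict(X_new))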
I have a dataset where each sample/row is a unique protein and that protein is quantified across 7 features/columns. This dataset includes thousands of proteins and will be classified by machine learning (Support Vector Machine). To give an example of the data:
Protein    Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6  Feature 7
Protein 1  10.0       8.7        5.4        28.0       7.9        11.3       5.3
Protein 2  6.5        9.3        4.8        2.7        12.3       14.2       0.7
...        ...        ...        ...        ...        ...        ...        ...
Protein N  8.0        6.8        4.9        6.2        10.0       19.3       4.8
In addition to this dataset, I also have 2 more replicates that are structured the exact same and have the same proteins for a total of 3 replicates. Normally if I wanted to visualize one of these datasets, I could transform my 7 features using PCA and plot the first two principal components with each point/protein colored by its classification. However, is there a way that I can take my 3 replicates and get some sort of "consensus" PCA plot for them?
I've seen two possible solutions for handling this:
Solution 1: Average each feature for each protein to get a single dataset with N rows and 7 columns, then PCA transform and plot.
Solution 2: Concatenate the 3 replicates into a single dataset such that each row now has 7x3 columns, then PCA transform and plot.
To clarify what's being said in solution 2, let's call Feature 1 from replicate 1 Feature 1.1, Feature 1 from replicate 2 Feature 1.2, etc.:
Protein    Feature 1.1  ...  Feature 7.1  Feature 1.2  ...  Feature 7.2  Feature 1.3  ...  Feature 7.3
Protein 1  10.0         ...  5.3          8.4          ...  5.9          9.7          ...  5.2
Protein 2  6.5          ...  0.7          6.8          ...  0.8          6.3          ...  0.7
...        ...          ...  ...          ...          ...  ...          ...          ...  ...
Protein N  8.0          ...  4.8          7.9          ...  4.9          8.1          ...  4.7
What I'm looking for is whether there is an accepted solution to this kind of problem, or a solution that is more statistically sound. Thanks in advance!
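To make the two candidate solutions concrete, this is roughly how I would implement each of them (a minimal sketch assuming scikit-learn and matplotlib, where rep1, rep2, rep3 are hypothetical (N, 7) arrays with rows in the same protein order and labels holds the class of each protein):

# Sketch of the two options, assuming the replicates share the same protein order.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Solution 1: average the replicates feature-wise -> (N, 7), then PCA
X_avg = np.mean([rep1, rep2, rep3], axis=0)
pcs_avg = PCA(n_components=2).fit_transform(X_avg)

# Solution 2: concatenate the replicates column-wise -> (N, 21), then PCA
X_cat = np.hstack([rep1, rep2, rep3])
pcs_cat = PCA(n_components=2).fit_transform(X_cat)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, pcs, title in [(axes[0], pcs_avg, 'Averaged replicates'),
                       (axes[1], pcs_cat, 'Concatenated replicates')]:
    ax.scatter(pcs[:, 0], pcs[:, 1], c=labels, s=5)
    ax.set_title(title)
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
plt.show()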
I have a large dataset (around 8 million rows x 25 columns) in Pandas, and I am struggling to find a way to compute a weighted average over this dataframe that produces another dataframe.
Here is what my dataset looks like (a very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new dataframe that is basically a weighted average of this one. The requirements specify that 12 of these location_ids should be averaged, each with a specified weight, to form the values of a combined_location_id.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their respective weights (separate data coming in from another dataframe) should be weighted-averaged to form the combined_location_id CL_1's data.
That is a lot of data to handle, and I wasn't able to find a purely Pandas way of solving it, so I went with a for-loop approach. It is extremely slow and I am sure this is not the right way to do it:
import numpy as np
import pandas as pd

# (in my actual code this lives inside a class, hence self)
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

# aggregation spec: keep the first hours/location_id, weighted-average the value columns
# (the lambdas pick up the current `weights` from the loop below at call time)
f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    # pull out the rows of every location belonging to this combined location
    data_for_this_combined_location = pd.concat(
        df_data.loc[df_data.index.get_level_values(0) == location_id]
        for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally; however, the performance and memory consumption are horrible. It takes over 2 hours on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I can see, you can remove one for loop by selecting all mapped_location_ids at once:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
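Building on that, here is a sketch of a fully vectorised version that avoids the outer loop entirely. It assumes the location-to-combined-location mapping and the weights can be put into one long-format DataFrame, called mapping here (a hypothetical name), with columns location_id, combined_location_id and weight:

# Weighted average per (combined_location_id, hours) without any Python loop:
# sum(weight * value) / sum(weight) within each group.
import pandas as pd

value_cols = ['prec', 'temp']   # in the real data, ~20 numeric columns

df = df_data.reset_index().merge(mapping, on='location_id')

weighted = df[value_cols].multiply(df['weight'], axis=0)
weighted[['combined_location_id', 'hours']] = df[['combined_location_id', 'hours']]
weighted['weight'] = df['weight']

grouped = weighted.groupby(['combined_location_id', 'hours']).sum()
df_combined_location_data = grouped[value_cols].div(grouped['weight'], axis=0)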
I'd like to find a way to determine which neighbors are actually used in my knn algorithm, so I can dive deeper into the rows of data that are similar to my features.
Here is an example of a dataset which I split into a training set and a test set for the prediction model:
Player PER VORP WS
Fabricio Oberto 11.9 1.0 4.1
Eddie Johnson 16.5 1.7 4.8
Tim Legler 15.9 2.0 6.8
Ersan Ilyasova 14.3 0.7 3.8
Kevin Love 25.4 3.5 10.0
Tim Hardaway 20.6 5.1 11.7
Frank Brickowsk 8.6 -0.2 1.6
etc....
And here is an example of my knn algorithm code:
from sklearn.neighbors import KNeighborsRegressor

features = ['PER', 'VORP']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train[features], train['WS'])
predictions = knn.predict(test[features])
Now, I'm aware that the algorithm will iterate over each row and make each target prediction based on the 5 closest neighbors that come from the target features I've specified.
I'd like to find out WHICH 5 neighbors were actually used in determining each target prediction. In this case, which players were actually used in determining the target?
Is there a way to get a list of the 5 neighbors (aka players) which were used in the analysis for each row?
knn.kneighbors will return an array of the corresponding nearest neighbours (their distances and their indices into the training set).
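For example (a quick sketch building on the code in the question; the indices returned are positional indices into the training data, so iloc is used to map them back to rows):

# For each test row, get the distances to and the indices of its 5 nearest
# training rows, then look up which players those indices correspond to.
distances, indices = knn.kneighbors(test[features])

for i, row_indices in enumerate(indices):
    neighbors = train.iloc[row_indices]
    print(test.iloc[i]['Player'], '->', list(neighbors['Player']))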
I have been comparing the relative efficiency of numpy versus Python list comprehensions in multiplying together arrays of random numbers. (Python 3.4/Spyder, Windows and Ubuntu).
As one would expect, for all but the smallest arrays, numpy rapidly outperforms a list comprehension, and for increasing array length you get the expected sigmoid curve for performance. But the sigmoid is far from smooth, which I am struggling to understand.
Obviously there is a certain amount of quantization noise for shorter array lengths, but I am getting unexpectedly noisy results, particularly under Windows. The figures are the mean of 100 runs for each array length, so any transient effects should have been smoothed out (or so I would have thought).
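For reference, the timing code is essentially along these lines (a simplified sketch of what I am running; the real runs cover all the array lengths in the tables below):

# Simplified sketch: ratio of list-comprehension time to numpy time,
# averaged over 100 repeats per array length.
import numpy as np
from datetime import datetime

def time_ratio(n, repeats=100):
    a = np.random.rand(n)
    b = np.random.rand(n)
    la, lb = list(a), list(b)

    start = datetime.now()
    for _ in range(repeats):
        _ = [x * y for x, y in zip(la, lb)]
    t_list = (datetime.now() - start).total_seconds()

    start = datetime.now()
    for _ in range(repeats):
        _ = a * b
    t_numpy = (datetime.now() - start).total_seconds()

    return t_list / t_numpy   # > 1 means numpy is faster

for n in (1, 10, 100, 1000, 10000, 100000):
    print(n, round(time_ratio(n), 1))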
Numpy and Python list performance comparison
The figures below show the ratio of multiplying arrays of differing lengths using numpy against list comprehension.
Array Length Windows Ubuntu
1 0.2 0.4
2 2.0 0.6
5 1.0 0.5
10 3.0 1.0
20 0.3 0.8
50 3.5 1.9
100 3.5 1.9
200 10.0 3.0
500 4.6 6.0
1,000 13.6 6.9
2,000 9.2 8.2
5,000 14.6 10.4
10,000 12.1 11.1
20,000 12.9 11.6
50,000 13.4 11.4
100,000 13.4 12.0
200,000 12.8 12.4
500,000 13.0 12.3
1,000,000 13.3 12.4
2,000,000 13.6 12.0
5,000,000 13.6 11.9
So I guess my question is: can anyone explain why the results, particularly under Windows, are so noisy? I have run the tests multiple times, but the results always come out essentially the same.
UPDATE: At Reblochon Masque's suggestion I have disabled garbage collection, which smooths the Windows performance out somewhat, but the curves are still lumpy.
Numpy and Python list performance comparison
(Updated to remove garbage collection)
Array Length Windows Ubuntu
1 0.1 0.3
2 0.6 0.4
5 0.3 0.4
10 0.5 0.5
20 0.6 0.5
50 0.8 0.7
100 1.6 1.1
200 1.3 1.7
500 3.7 3.2
1,000 3.9 4.8
2,000 6.5 6.6
5,000 11.5 9.2
10,000 10.8 10.7
20,000 12.1 11.4
50,000 13.3 12.4
100,000 13.5 12.6
200,000 12.8 12.6
500,000 12.9 12.3
1,000,000 13.3 12.3
2,000,000 13.6 12.0
5,000,000 13.6 11.8
UPDATE
At @Sid's suggestion, I've restricted it to running on a single core on each machine. The curves are slightly smoother (particularly the Linux one), but still show the inflexions and some noise, particularly under Windows.
(It was actually the inflexions that I was originally going to post about, as they appear consistently in the same places.)
Numpy and Python list performance comparison
(Garbage collection disabled and running on 1 CPU)
Array Length Windows Ubuntu
1 0.3 0.3
2 0.0 0.4
5 0.5 0.4
10 0.6 0.5
20 0.3 0.5
50 0.9 0.7
100 1.0 1.1
200 2.8 1.7
500 3.7 3.3
1,000 3.3 4.7
2,000 6.5 6.7
5,000 11.0 9.6
10,000 11.0 11.1
20,000 12.7 11.8
50,000 12.9 12.8
100,000 14.3 13.0
200,000 12.6 13.1
500,000 12.6 12.6
1,000,000 13.0 12.6
2,000,000 13.4 12.4
5,000,000 13.6 12.2
The garbage collector explains the bulk of it. The rest could be fluctuation based on other programs running on your machine.
How about turning most things off, running the bare minimum, and testing that? Since you are using datetime (which measures elapsed wall-clock time), it will also be picking up any processor context switches.
You could also try running this with the process pinned to a single processor using a Unix call; that might help smooth it out further. On Ubuntu it can be done as described here: https://askubuntu.com/a/483827
For Windows, processor affinity can be set as described here: http://www.addictivetips.com/windows-tips/how-to-set-processor-affinity-to-an-application-in-windows/
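For example, pinning from inside the script itself (a sketch; os.sched_setaffinity is Linux-only, and the third-party psutil package is one option that also works on Windows):

# Pin the current process to CPU core 0 before running the benchmark.
import os, sys

if sys.platform.startswith('linux'):
    os.sched_setaffinity(0, {0})        # pid 0 = current process
else:
    import psutil
    psutil.Process().cpu_affinity([0])  # also works on Windows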
From my comments:
Garbage collection usually explains noise in benchmark performance tests; it is possible to disable it while running the tests, which under some conditions will smooth out the results.
Here is a link explaining how and why to disable the GC: Why disable the garbage collector?
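A minimal sketch of what that looks like wrapped around a timing run (run_benchmark is a placeholder for whatever loop is being measured):

import gc

gc.disable()         # stop the cyclic garbage collector during the measurement
try:
    run_benchmark()  # placeholder for the timing loop
finally:
    gc.enable()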
Beyond the GC, it is always tricky to run benchmarks, as other processes running on your system may affect performance (network connections, system backups, etc. that may be automated and silently running in the background); maybe you could retry after a fresh system boot, with as few other processes running as possible, and see how that goes?