Why does the efficiency of numpy not scale - python

I have been comparing the relative efficiency of numpy versus Python list comprehensions in multiplying together arrays of random numbers. (Python 3.4/Spyder, Windows and Ubuntu).
As one would expect, for all but the smallest arrays numpy rapidly outperforms a list comprehension, and for increasing array length you get the expected sigmoid curve for performance. But the sigmoid is far from smooth, which I am struggling to understand.
Obviously there is a certain amount of quantization noise for the shorter array lengths, but the results are unexpectedly noisy, particularly under Windows. The figures are the mean of 100 runs at each array length, so any transient effects should have been smoothed out (or so I would have thought).
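A simplified sketch of the kind of timing comparison being described (not the exact script used to produce the figures below; the function name, timer and set of lengths are illustrative):

import random
import time
import numpy as np

def speed_ratio(n, runs=100):
    # Ratio of total list-comprehension time to total numpy time for an
    # element-wise multiply of two length-n arrays of random numbers.
    a = [random.random() for _ in range(n)]
    b = [random.random() for _ in range(n)]
    na, nb = np.array(a), np.array(b)

    start = time.perf_counter()
    for _ in range(runs):
        [x * y for x, y in zip(a, b)]
    list_time = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(runs):
        na * nb
    numpy_time = time.perf_counter() - start

    return list_time / numpy_time

for n in (1, 10, 100, 1000, 10000, 100000):
    print(n, round(speed_ratio(n), 1))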
Numpy and Python list performance comparison
The figures below show the speed-up of numpy relative to the list comprehension (list-comprehension time divided by numpy time) for arrays of differing lengths.
Array Length Windows Ubuntu
1 0.2 0.4
2 2.0 0.6
5 1.0 0.5
10 3.0 1.0
20 0.3 0.8
50 3.5 1.9
100 3.5 1.9
200 10.0 3.0
500 4.6 6.0
1,000 13.6 6.9
2,000 9.2 8.2
5,000 14.6 10.4
10,000 12.1 11.1
20,000 12.9 11.6
50,000 13.4 11.4
100,000 13.4 12.0
200,000 12.8 12.4
500,000 13.0 12.3
1,000,000 13.3 12.4
2,000,000 13.6 12.0
5,000,000 13.6 11.9
So I guess my question is: can anyone explain why the results, particularly under Windows, are so noisy? I have run the tests multiple times, yet the results always come out exactly the same.
UPDATE: At Reblochon Masque's suggestion I have disabled garbage collection, which smooths out the Windows performance somewhat, but the curves are still lumpy.
Numpy and Python list performance comparison
(Garbage collection disabled)
Array Length Windows Ubuntu
1 0.1 0.3
2 0.6 0.4
5 0.3 0.4
10 0.5 0.5
20 0.6 0.5
50 0.8 0.7
100 1.6 1.1
200 1.3 1.7
500 3.7 3.2
1,000 3.9 4.8
2,000 6.5 6.6
5,000 11.5 9.2
10,000 10.8 10.7
20,000 12.1 11.4
50,000 13.3 12.4
100,000 13.5 12.6
200,000 12.8 12.6
500,000 12.9 12.3
1,000,000 13.3 12.3
2,000,000 13.6 12.0
5,000,000 13.6 11.8
UPDATE
At Sid's suggestion, I've restricted the test to running on a single core on each machine. The curves are slightly smoother (particularly the Linux one), but they still show the inflexions and some noise, particularly under Windows.
(It was actually the inflexions that I was originally going to post about, as they appear consistently in the same places.)
Numpy and Python list performance comparison
(Garbage collection disabled and running on 1 CPU)
Array Length Windows Ubuntu
1 0.3 0.3
2 0.0 0.4
5 0.5 0.4
10 0.6 0.5
20 0.3 0.5
50 0.9 0.7
100 1.0 1.1
200 2.8 1.7
500 3.7 3.3
1,000 3.3 4.7
2,000 6.5 6.7
5,000 11.0 9.6
10,000 11.0 11.1
20,000 12.7 11.8
50,000 12.9 12.8
100,000 14.3 13.0
200,000 12.6 13.1
500,000 12.6 12.6
1,000,000 13.0 12.6
2,000,000 13.4 12.4
5,000,000 13.6 12.2

The garbage collector explains the bulk of it. The rest could be fluctuation caused by other programs running on your machine.
How about turning most things off, running the bare minimum, and testing it again? Since you are using datetime (which measures wall-clock time), the timings will also include any processor context switches.
You could also try running the benchmark while it is pinned to a single processor; that might help smooth it out further. On Ubuntu it can be done as described here: https://askubuntu.com/a/483827
For Windows, processor affinity can be set as described here: http://www.addictivetips.com/windows-tips/how-to-set-processor-affinity-to-an-application-in-windows/
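As a rough sketch of doing the pinning from inside the script itself (the Windows branch assumes the third-party psutil package is installed):

import os

# Pin the current process to CPU 0 before running the timings.
if hasattr(os, "sched_setaffinity"):     # available on Linux
    os.sched_setaffinity(0, {0})
else:                                    # Windows: assumes psutil is installed
    import psutil
    psutil.Process().cpu_affinity([0])

Switching from datetime to time.perf_counter() (or time.process_time(), which counts CPU time only and therefore ignores time spent switched out) would also give a higher-resolution timer.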

From my comments:
Garbage collection usually explains noise in benchmark performance tests; it can be disabled while the tests run, and under some conditions this will smooth out the results.
Here is a link to how and why to disable the GC: Why disable the garbage collector?
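For reference, a minimal sketch of timing one list-comprehension run with the collector disabled (the array size and the timed expression here are made up):

import gc
import random
import time

a = [random.random() for _ in range(100000)]
b = [random.random() for _ in range(100000)]

gc.disable()                      # keep collector pauses out of the timed region
try:
    start = time.perf_counter()
    c = [x * y for x, y in zip(a, b)]
    elapsed = time.perf_counter() - start
finally:
    gc.enable()                   # restore normal collection afterwards

print(round(elapsed, 6), "seconds")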
Outside of the GC, it is always tricky to run benchmarks, as other processes running on your system may affect performance (network connections, system backups, etc. that may be automated and silently running in the background); maybe you could retry after a fresh system boot, with as few other processes as possible, and see how that goes?

Related

How do I use a Markov chain for classifying numerical data?

My data is X (x1, x2, x3) and Y (y1); Y contains the class labels. For example:
dataset = [1.2 4.5 10.32 1; 1.7 5.7 10.12 1; 0.9 6.1 9.99 0; ...; 1.9 7.8 6.67 0]
I want to classify my data via a Markov chain, but all the Python code I can find is for textual data and other applications. Does anyone know of a code sample for classifying numerical data?

Pandas: df["<tab> finds completions, df.loc["<tab> WON'T find any completion. Is this normal?

I read somewhere that the preferred way of accessing dataframe columns is through .loc, but I have found some drawbacks and I am wondering whether this is normal.
Say that I do the following:
import pandas as pd
df = pd.read_csv("MyFile.csv")
and assume that the dataframe df looks like this:
ColA ColB ColC
Time
0.0 9.2 -3.5 2.0
0.1 10.2 -0.9 1.1
0.2 4.3 2.1 4.2
If I type df[" and then hit TAB, autocompletion kicks in and I can choose the column name from a pop-up list, whereas if I type df.loc[" and then hit TAB, nothing happens, and I am wondering whether this is normal behaviour.
Also, it seems that if the column names are tuples, e.g.
('ColA','X') ('ColB','Y') ('ColC','Z')
Time
0.0 9.2 -3.5 2.0
0.1 10.2 -0.9 1.1
0.2 4.3 2.1 4.2
then I can access them with e.g. df[('ColA','X')] but I cannot with df.loc[('ColA','X')].
I am running IPython 7.2.2 (console) on a Windows 10 machine, if that helps.
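To make this concrete, here is a small made-up frame whose columns are a MultiIndex of those tuples (an assumption on my part; read_csv with header=[0, 1] produces this layout), together with the lookups I'm comparing:

import pandas as pd

cols = pd.MultiIndex.from_tuples([("ColA", "X"), ("ColB", "Y"), ("ColC", "Z")])
df = pd.DataFrame(
    [[9.2, -3.5, 2.0], [10.2, -0.9, 1.1], [4.3, 2.1, 4.2]],
    index=pd.Index([0.0, 0.1, 0.2], name="Time"),
    columns=cols,
)

print(df[("ColA", "X")])          # [] looks the tuple up among the column labels
print(df.loc[:, ("ColA", "X")])   # .loc works once an explicit row slice is given
# df.loc[("ColA", "X")]           # raises KeyError: the tuple is taken as a row label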

How to combine data replicates for PCA visualization

I have a dataset where each sample/row is a unique protein and that protein is quantified across 7 features/columns. This dataset includes thousands of proteins and will be classified by machine learning (Support Vector Machine). To give an example of the data:
Protein     Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Feature 6   Feature 7
Protein 1   10.0        8.7         5.4         28.0        7.9         11.3        5.3
Protein 2   6.5         9.3         4.8         2.7         12.3        14.2        0.7
...         ...         ...         ...         ...         ...         ...         ...
Protein N   8.0         6.8         4.9         6.2         10.0        19.3        4.8
In addition to this dataset, I also have 2 more replicates that are structured exactly the same and contain the same proteins, for a total of 3 replicates. Normally, if I wanted to visualize one of these datasets, I could transform my 7 features using PCA and plot the first two principal components, with each point/protein colored by its classification. However, is there a way that I can take my 3 replicates and get some sort of "consensus" PCA plot for them?
I've seen two possible solutions for handling this:
Average each feature for each protein to get a single dataset with N rows and 7 columns, then PCA transform and plot
Concatenate the 3 replicates into a single dataset such that each row now has 7x3 columns, then PCA transform and plot
To clarify what's being said in solution 2, let's call Feature 1 from replicate 1 Feature 1.1, Feature 1 from replicate 2 Feature 1.2, etc.:
Protein     Feature 1.1   ...   Feature 7.1   Feature 1.2   ...   Feature 7.2   Feature 1.3   ...   Feature 7.3
Protein 1   10.0          ...   5.3           8.4           ...   5.9           9.7           ...   5.2
Protein 2   6.5           ...   0.7           6.8           ...   0.8           6.3           ...   0.7
...         ...           ...   ...           ...           ...   ...           ...           ...   ...
Protein N   8.0           ...   4.8           7.9           ...   4.9           8.1           ...   4.7
What I'm looking for is whether there's an accepted solution to such a problem, or a solution that's more statistically sound. Thanks in advance!
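For concreteness, here is a rough sketch of the two candidate solutions with scikit-learn; the replicate arrays below are random placeholders standing in for the real data:

import numpy as np
from sklearn.decomposition import PCA

# Three hypothetical replicates, each of shape (n_proteins, 7), rows aligned by protein.
rep1, rep2, rep3 = (np.random.rand(1000, 7) for _ in range(3))

# Solution 1: average the replicates, then project to two components.
averaged = np.mean([rep1, rep2, rep3], axis=0)        # shape (n_proteins, 7)
coords_avg = PCA(n_components=2).fit_transform(averaged)

# Solution 2: concatenate the replicates column-wise, then project.
concatenated = np.hstack([rep1, rep2, rep3])          # shape (n_proteins, 21)
coords_cat = PCA(n_components=2).fit_transform(concatenated)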

When using k nearest neighbors, is there a way to retrieve the "neighbors" that are used?

I'd like to find a way to determine which neighbors are actually used in my knn algorithm, so I can dive deeper into the rows of data that are similar to my features.
Here is an example of a dataset which I split into a training set and a test set for the prediction model:
Player PER VORP WS
Fabricio Oberto 11.9 1.0 4.1
Eddie Johnson 16.5 1.7 4.8
Tim Legler 15.9 2.0 6.8
Ersan Ilyasova 14.3 0.7 3.8
Kevin Love 25.4 3.5 10.0
Tim Hardaway 20.6 5.1 11.7
Frank Brickowsk 8.6 -0.2 1.6
etc....
And here is an example of my knn algorithm code:
from sklearn.neighbors import KNeighborsRegressor

features = ['PER', 'VORP']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train[features], train['WS'])
predictions = knn.predict(test[features])
Now, I'm aware that the algorithm will iterate over each row and make each target prediction based on the 5 closest neighbours in the feature space I've specified.
I'd like to find out WHICH 5 neighbours were actually used in determining my target feature. In this case, which players were actually used in determining the target?
Is there a way to get a list of the 5 neighbors (aka players) which were used in the analysis for each row?
knn.kneighbors will return an array of the corresponding nearest neighbours.
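For example, continuing from the snippet in the question (and assuming the Player column is kept in both the train and test frames), something along these lines recovers the neighbouring rows:

# distances and indices both have shape (n_test_rows, 5); the index values
# are positions into the training data that was passed to fit().
distances, indices = knn.kneighbors(test[features])

for i, neighbour_positions in enumerate(indices):
    neighbours = train.iloc[neighbour_positions]
    print(test.iloc[i]['Player'], '->', list(neighbours['Player']))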

Find the average for user-defined window in pandas

I have a pandas dataframe of raw heart rate data, indexed by time (in seconds).
I am trying to bin the data so that I get the average over a user-defined window (e.g. 10 s) - not a rolling average, just the average of each 10 s window, then the following 10 s, and so on.
import pandas as pd
hr_raw = pd.read_csv('hr_data.csv', index_col='time')
print(hr_raw)
heart_rate
time
0.6 164.0
1.0 182.0
1.3 164.0
1.6 150.0
2.0 152.0
2.4 141.0
2.9 163.0
3.2 141.0
3.7 124.0
4.2 116.0
4.7 126.0
5.1 116.0
5.7 107.0
Using the example data above, I would like to be able to set a user-defined window size (let's use 2 seconds) and produce a new dataframe indexed in 2-second increments, averaging the 'heart_rate' values whose times fall within each window (continuing to the end of the dataframe).
For example:
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
I can only seem to find methods that bin the data into a predetermined number of bins (e.g. for a histogram), and these only return the count/frequency.
Thanks.
A groupby should do it.
df.groupby((df.index // 2 + 1) * 2).mean()
heart_rate
time
2.0 165.00
4.0 144.20
6.0 116.25
Note that the reason for the slight difference between our answers is that the upper bound is excluded: a reading taken at exactly 2.0 s is counted towards the 4.0 s interval. This is how it is usually done; a similar solution with the TimeGrouper will yield the same result.
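For completeness, a rough sketch of that TimeGrouper/resample variant, using df as above: convert the float-second index to a TimedeltaIndex and resample; the closed and label arguments control which bucket a boundary reading such as 2.0 s lands in and how each bin is labelled.

df_t = df.copy()
df_t.index = pd.to_timedelta(df_t.index, unit='s')   # float seconds -> TimedeltaIndex
print(df_t.resample('2s', label='right').mean())     # label bins by their upper edge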
As coldspeed pointed out, a reading at 2 s is counted in the 4 s bucket; however, if you need it counted in the 2 s bucket, you can do:
In [1038]: df.groupby(np.ceil(df.index/2)*2).mean()
Out[1038]:
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
