I have a CSV file with 140K rows and I am working with the pandas library.
The problem is that I have to compare each row with every other row, and this is taking too much time.
At the same time, I am creating another column in which I append data for each row based on the comparison. Here I am getting a memory error.
What is the optimal solution, at least for the memory error?
I am working with 12 GB of RAM on Google Colaboratory.
Dataframe sample:
ID x_coordinate y_coordinate
1 2 3
2 3 4
............
X 1 5
Now, I need to find the distance between each row and every other row, and if the distance is within a certain threshold, I assign a new ID to the two rows that are within that distance. So in my case, ID 1 and ID 2 are within a certain distance, so I assign a to both; and ID 2 and ID X are within a certain distance, so I assign b as a new matched ID, like below:
ID x_coordinate y_coordinate Matched ID
1 2 3 [a]
2 3 4 [a, b]
............
X 1 5 [b]
For the distance I am using √((x_i − x_j)² + (y_i − y_j)²).
The threshold can be anything, say m units.
This reads like you are trying to hold the complete square distance matrix in memory, which, as you have noticed, does not scale well.
I'd suggest reading up on how DBSCAN clustering approaches the problem, compared to e.g. hierarchical clustering:
https://en.wikipedia.org/wiki/DBSCAN#Complexity
Instead of computing all the pairwise distances at once, it puts the data into a spatial index (for efficient neighborhood queries with a threshold) and then iterates over the points, identifying the neighbors and the relevant distances on the fly.
Unfortunately, I can't point you to readily available code or pandas functionality to support this.
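For what it's worth, here is a rough sketch of that neighborhood-query idea using scipy.spatial.cKDTree; this is my own illustration, not pandas functionality, and the threshold value and the pair-labelling scheme are assumptions based on the question:

import pandas as pd
from scipy.spatial import cKDTree

# Toy version of the question's DataFrame.
df = pd.DataFrame({'ID': [1, 2, 3],
                   'x_coordinate': [2, 3, 1],
                   'y_coordinate': [3, 4, 5]})
m = 2.0  # distance threshold ("m units"); arbitrary here

# A KD-tree answers "which points lie within m of each other" without ever
# materialising the full 140K x 140K distance matrix.
tree = cKDTree(df[['x_coordinate', 'y_coordinate']].values)
pairs = tree.query_pairs(r=m)  # set of (i, j) row-index pairs within distance m

# Give every matching pair its own new ID and append it to both rows.
matched = [[] for _ in range(len(df))]
for new_id, (i, j) in enumerate(sorted(pairs)):
    matched[i].append(new_id)
    matched[j].append(new_id)

df['Matched ID'] = matched
print(df)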
I have many n-by-2 matrices, like so:
Distance    Value    Distance.1    Value.1    ...
-15         1        -15           2          ...
-14.9       3        -14.995       4          ...
-14.8       4        -14.992       2          ...
...         ...      ...           ...        ...
15          3        8.959         2          ...
                     ...           ...        ...
                     15.048        3          ...
The Distance columns all start at -15 and all finish around +15 ± 0.05.
My goal is to compute the mean of the Value columns and then plot those means as a function of the distance. What I am stuck on is how to 'align' all of the distances, like so:
Distance    Value    Distance.1    Value.1    Mean of Values
-15         1        -15           2          mean value at this distance
NaN         NaN      -14.995       4          mean value at this distance
NaN         NaN      ...           ...        mean value at this distance
-14.8       4        -14.8         2          mean value at this distance
...         ...      ...           ...        mean value at this distance
NaN         NaN      14.995        2          mean value at this distance
15          2        15            3          mean value at this distance
At the moment I am just plotting all the values against the first Distance column, which means that the data points in columns longer than the first matrix are lost.
Here is some code. I know that this could perhaps be solved in another language than Python, and even without pandas, but I am using them in my project to clean and filter the data beforehand.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Two matrices of different lengths covering roughly the same distance range
df0 = pd.DataFrame({'Distance': np.linspace(-15, 14.969, 100),
                    'Value': np.random.rand(100)})
df1 = pd.DataFrame({'Distance.1': np.linspace(-15, 15.034, 500),
                    'Value.1': np.random.rand(500)})

# Concatenate side by side; the shorter columns are padded with NaN at the bottom
df = pd.concat([df0, df1], ignore_index=False, axis=1)

# Row-wise mean of all Value columns (rows are not aligned by distance yet)
df['mean'] = df.filter(regex='Value').mean(axis=1)

df.plot(x='Distance', y='mean')
plt.show()
So with the code above, the distances are not aligned. I see two ways of solving the problem:
1) Somehow add NaNs to fill in the missing values in the shorter columns, while aligning the distances that appear in all the Distance columns (see the first sketch below).
2) Somehow average out the values in the longer columns: say for every distance in the short column there are 3 distances in the long column, take the average of those 3 extra values and replace the 3 entries with that average (see the second sketch further below).
Statistically, I think the second method is better, because with the first method the values that only appear in the longer columns would be overrepresented in the mean compared to the values at distances that appear in all the columns.
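As a rough illustration of the first idea (just a sketch; it assumes an outer merge on the Distance column is acceptable, and the column names follow the code above):

import numpy as np
import pandas as pd

df0 = pd.DataFrame({'Distance': np.linspace(-15, 14.969, 100),
                    'Value': np.random.rand(100)})
df1 = pd.DataFrame({'Distance': np.linspace(-15, 15.034, 500),
                    'Value.1': np.random.rand(500)})

# An outer merge on Distance inserts NaN rows wherever one matrix lacks a distance
# that the other has, so the value columns line up before averaging.
aligned = (df0.merge(df1, on='Distance', how='outer')
              .sort_values('Distance')
              .reset_index(drop=True))

# The row-wise mean skips NaNs, so rows present in only one matrix still contribute.
aligned['mean'] = aligned.filter(regex='Value').mean(axis=1)
print(aligned.head())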
I'm guessing that there is a very clever way to use a lambda expression in combination with pandas' DataFrame.apply, but I'm really not sure how to do it. (How do I get the lambda expression to look at all the columns except the longest, or vice versa? How do I add values to certain columns and not others?)
I've come up with this so far: for each row, compare the value in the Distance column to the distance value of the longest Distance column. If it's the same, go to the next column; otherwise, somehow insert NaN values and shift all of the values in the rows below down, except those in the longest matrix (the longest two columns).
Any help is appreciated. I'm sure that I am not the first person to encounter this problem, but I am really not sure how to phrase it well enough to be able to google it.
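For the second idea, a rough sketch might be to put every (Distance, Value) pair from all matrices into one long frame, bin the distances onto a common grid, and average per bin; the 0.1-unit bin width here is an arbitrary assumption:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df0 = pd.DataFrame({'Distance': np.linspace(-15, 14.969, 100),
                    'Value': np.random.rand(100)})
df1 = pd.DataFrame({'Distance': np.linspace(-15, 15.034, 500),
                    'Value': np.random.rand(500)})

# Stack all matrices so the distances share a single column.
long_form = pd.concat([df0, df1], ignore_index=True)

# Common distance grid: 0.1-wide bins spanning the full range.
bins = np.arange(-15.05, 15.15, 0.1)
long_form['bin'] = pd.cut(long_form['Distance'], bins)

# Mean value per bin; the bin midpoints become the x-axis.
mean_per_bin = long_form.groupby('bin', observed=True)['Value'].mean()
midpoints = [interval.mid for interval in mean_per_bin.index]

plt.plot(midpoints, mean_per_bin.values)
plt.xlabel('Distance')
plt.ylabel('Mean of Values')
plt.show()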
The pandas corr function is really fast. Even for a table with 1000 columns × 200 rows, it doesn't take more than a minute to calculate a cross-correlation matrix.
I was trying to use a for loop to compute the slope between each pair of columns, and it was taking much longer. I went back and tried just computing the correlation in a loop, and that also took a long time.
So I guess I have a two-part question: how does the pandas corr function work so fast, and is there a similar method to compute the pairwise slope between each pair of columns?
Here is an example. For this table:
a  b
0  0
1  1
slope(a, b) = 1, and for this table:
a  b
0  0
1  2
slope(a, b) = 2.
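One possibility, as a hedged sketch of my own rather than a pandas built-in: the OLS slope of column j regressed on column i is cov(i, j) / var(i), so the whole pairwise slope matrix can be derived from the vectorized cov() call, avoiding the Python-level loop (corr() is fast for essentially the same reason, its whole matrix is computed in compiled code):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1], 'b': [0, 2]})

cov = df.cov()                           # pairwise covariance matrix, vectorized
slopes = cov.div(np.diag(cov), axis=0)   # slopes.loc[i, j] = cov(i, j) / var(i)

print(slopes.loc['a', 'b'])  # 2.0, matching slope(a, b) = 2 in the example above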
INTRO
I have a pandas DataFrame that represents a segmented time series of different users (i.e., user1 and user2). I want to train a scikit-learn classifier with this DataFrame, but I can't understand the shape of the scikit-learn dataset that I must create.
Since my series are segmented, my DataFrame has a 'segID' column that contains the IDs of the segments. I'll skip the description of the segmentation since it is provided by an algorithm.
Let's take an example where both user1 and user2 have 2 segments: print(df)
username voltage segID
0 user1 -0.154732 0
1 user1 -0.063169 0
2 user1 0.554732 1
3 user1 -0.641311 1
4 user1 -0.653732 1
5 user2 0.446469 0
6 user2 -0.655732 0
7 user2 0.646769 0
8 user2 -0.646369 1
9 user2 0.257732 1
10 user2 -0.346369 1
QUESTIONS:
The scikit-learn dataset API says to create a dict containing data and target, but how can I shape my data, since they are segments and not just a list?
I can't figure out my segments fitting into the n_samples * n_features structure.
I have two ideas:
1) Every data sample is a list representing a segment; the target, on the other hand, is different for each data entry since the entries are grouped. What about target_names? Could this work?
{
'data': array([
[[-0.154732, -0.063169]],
[[ 0.554732, -0.641311, -0.653732]],
[[ 0.446469, -0.655732, 0.646769]],
[[-0.646369, 0.257732, -0.346369]]
]),
'target':
array([0, 1, 2, 3]),
'target_names': array(['user1seg1', 'user1seg2', 'user2seg1', 'user2seg2'], dtype='|S10')
}
2) data is (simply) the ndarray returned by df.values. target contains the segments' IDs, different for each user... Does that make sense?
{
'data': array([
[-0.154732],
[-0.063169],
[ 0.554732],
[-0.641311],
[-0.653732],
[ 0.446469],
[-0.655732],
[ 0.646769],
[-0.646369],
[ 0.257732],
[-0.346369]
]),
'target':
array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
'target_names': array(['user1seg1', 'user1seg1', 'user1seg2', 'user1seg2', .....], dtype='|S10')
}
I think the main problem is that I can't figure out what to use as labels...
EDIT:
OK, it's clear now... the labels are given by my ground truth; they are just the user names.
elyase's answer is exactly what I was looking for.
In order to better state the problem, I'm going to explain the meaning of segID here.
In time-series pattern recognition, segmenting can be useful in order to isolate meaningful segments.
At testing time I want to recognize segments, not the entire series, because the series is rather long and the segments are supposed to be meaningful in my context.
Have a look at the following example from this implementation based on "An Online Algorithm for Segmenting Time Series".
My segID is just a column representing the id of a chunk.
This is not trivial, and there might be several ways of formulating the problem for consumption by an ML algorithm. You should try them and see which gives the best results.
As you already found, you need two things: a matrix X of shape n_samples × n_features and a column vector y of length n_samples. Let's start with the target y.
Target:
As you want to predict a user from a discrete pool of usernames, you have a classification problem, and your target will be a vector with np.unique(y) == ['user1', 'user2', ...].
Features
Your features are the information that you provide the ML algorithm for each label/user/target. Unfortunately most algorithms require this information to have a fixed length, but variable length time series don't fit well into this description. So if you want to stick to classic algorithms, you need some way to condense the time series information for a user into a fixed length vector. Some possibilities are the mean, min, max, sum, first, last values, histogram, spectral power, etc. You will need to come up with the ones that make sense for your given problem.
So if you ignore the SegID information your X matrix will look like this:
y/features    min   max   ...   sum
user1         0.1   1.2   ...   1.1    # <- first time series for user 1
user1         0.0   1.3   ...   1.1    # <- second time series for user 1
user2         0.3   0.4   ...   13.0   # <- first time series for user 2
As segID is itself a time series, you also need to encode it as fixed-length information, for example a histogram/counts of all possible values, the most frequent value, etc.
In this case you will have:
y/features    min   max   ...   sum    segID_most_freq   segID_min
user1         0.1   1.2   ...   1.1    1                 1
user1         0.0   1.3   ...   1.1    2                 1
user2         0.3   0.4   ...   13.0   5                 3
The algorithm will look at this data and will "think": for user1 the minimum segID is always 1, so if at prediction time I see a user whose time series has a minimum segID of 1, then it should be user1. If it is around 3, it is probably user2, and so on.
Keep in mind that this is only one possible approach. Sometimes it is useful to ask: what information will I have at prediction time that will allow me to identify which user I am seeing, and why will this information lead to the given user?
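A minimal sketch of this approach (my own illustration; the summary statistics and the RandomForestClassifier are assumptions, any fixed-length features and any scikit-learn classifier would do). Here each (username, segID) segment is condensed into fixed-length summary features, with the username as the label:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    'username': ['user1'] * 5 + ['user2'] * 6,
    'voltage': [-0.154732, -0.063169, 0.554732, -0.641311, -0.653732,
                0.446469, -0.655732, 0.646769, -0.646369, 0.257732, -0.346369],
    'segID':   [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# One row per segment: min, max, mean and sum of the voltage values in that segment.
features = (df.groupby(['username', 'segID'])['voltage']
              .agg(['min', 'max', 'mean', 'sum'])
              .reset_index())

X = features[['min', 'max', 'mean', 'sum']]  # n_samples x n_features
y = features['username']                     # target: the user names

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict(X[:2]))  # the first two rows are user1 segments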
Basically, how would I create a pivot table that consolidates data, where one of the columns it represents, say a likelihood percentage (0.0 - 1.0), is calculated by taking the mean, and another, a number ordered, is calculated by summing all of the values?
Right now I can specify values=... to indicate what should make up one of the two, but when I specify aggfunc=... I don't know how the two interoperate.
In my head I'd specify two entries for values=... (likelihood percentage and number ordered) and two for aggfunc=..., but this does not seem to be working.
You could supply to aggfunc a dictionary with column:function (key:value) pairs:
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'a'], 'm': [1, 2, 3], 's': [1, 2, 3]})
print(df)
a m s
0 a 1 1
1 a 2 2
2 a 3 3
df.pivot_table(index='a', values=['m','s'], aggfunc={'m':pd.Series.mean,'s':sum})
m s
a
a 2 6
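Applied to the scenario in the question (the column and index names here are my assumption), the same dictionary pattern would average the likelihood percentage while summing the number ordered:

import pandas as pd

orders = pd.DataFrame({'product': ['x', 'x', 'y'],
                       'likelihood': [0.2, 0.4, 0.9],
                       'number_ordered': [10, 20, 5]})

print(orders.pivot_table(index='product',
                         values=['likelihood', 'number_ordered'],
                         aggfunc={'likelihood': 'mean', 'number_ordered': 'sum'}))
#          likelihood  number_ordered
# product
# x               0.3              30
# y               0.9               5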