I have a CSV file with 140K rows and I am working with the pandas library.
The problem is that I have to compare each row with every other row, and that is taking too much time.
At the same time, I am creating another column in which, for each row, I append data based on the comparison. This is where I get a memory error.
What is the optimal solution, at least for the memory error?
I am working with 12 GB of RAM on Google Colaboratory.
Dataframe sample:
ID x_coordinate y_coordinate
1 2 3
2 3 4
............
X 1 5
Now, I need to find the distance between each row and every other row, and if the distance is within a certain threshold, I assign a new ID to both rows. So, if ID 1 and ID 2 are within that distance, I assign a to both; and if ID 2 and ID X are within that distance, I assign b as the new matched ID, like below:
ID x_coordinate y_coordinate Matched ID
1 2 3 [a]
2 3 4 [a, b]
............
X 1 5 [b]
For the distance I am using √((x_i − x_j)² + (y_i − y_j)²).
The threshold can be anything, say m units.
This reads as if you are attempting to hold the complete square distance matrix in memory, which, as you have noticed, does not scale well.
I'd suggest you read up on how DBSCAN clustering approaches the problem compared to, e.g., hierarchical clustering:
https://en.wikipedia.org/wiki/DBSCAN#Complexity
Instead of computing all the pairwise distances at once, they seem to put the data into a spatial database (for efficient neighborhood queries with a threshold) and then iterate over the points, identifying the neighbors and the relevant distances on the fly.
Unfortunately I can't point you to readily available code or pandas functionality to support this though.
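That said, a rough sketch of the neighborhood-query idea is below. It uses scipy's cKDTree (my assumption, not pandas functionality) to find all pairs within the threshold without ever materializing the full distance matrix:

import pandas as pd
from scipy.spatial import cKDTree  # assumed available; not part of pandas

# toy stand-in for the real 140K-row frame
df = pd.DataFrame({"ID": [1, 2, "X"],
                   "x_coordinate": [2, 3, 1],
                   "y_coordinate": [3, 4, 5]})
m = 2.0  # distance threshold ("m units")

tree = cKDTree(df[["x_coordinate", "y_coordinate"]].to_numpy())
pairs = tree.query_pairs(r=m)  # set of (i, j) row-index pairs within distance m

# give each matched pair its own new id and collect the ids per row
matched = [[] for _ in range(len(df))]
for new_id, (i, j) in enumerate(sorted(pairs)):
    matched[i].append(new_id)
    matched[j].append(new_id)
df["Matched ID"] = matched

Memory then grows with the number of matched pairs rather than with 140K², which is what avoids the quadratic blow-up.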
Suppose I have a dataset that records camera sightings of some object over time, and I groupby date so that each group represents sightings within the same day. I'd then like to break one group into 'subgroups' based on the time between sightings -- if the gap is too large, then I want them to be in different groups.
Consider the following as one group.
Camera  Time
A       6
B       12
C       17
D       21
E       47
F       50
Suppose I had a cutoff matrix that told me how close the next sighting had to be for two adjacent cameras to be in the same group. For example, we might have cutoff_mat[d, e] = 10, which means that since cameras D and E are more than 10 units apart in time, I should break the group into two: after D and before E. I would like to do this in a way that allows for efficient iteration over each of the resulting groups, since my real goal is to compute some other matrix using values within each sub-group, and I may need to break one group into many, not just two. How do I do this? The dataset is large (>100M points), so something fast would be appreciated.
I am thinking I could do this by creating another column in the original dataset that represents time between consecutive sightings on the same day, and somehow groupby both date AND this new column, but I'm not quite sure how that'd work. I also don't think pd.df.cut() works here since I don't have pre-determined bins.
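For what it's worth, the gap-column idea from the last paragraph could look roughly like this (the column names and the constant cutoff are assumptions on my part; with a real cutoff_mat you would look up the cutoff for each consecutive camera pair instead):

import pandas as pd

# toy version of one day's group
g = pd.DataFrame({"camera": list("ABCDEF"),
                  "time":   [6, 12, 17, 21, 47, 50]})
cutoff = 10  # stand-in for cutoff_mat[previous_camera, current_camera]

gap = g["time"].diff()                   # time since the previous sighting
g["subgroup"] = gap.gt(cutoff).cumsum()  # new subgroup id after each large gap

for sub_id, sub in g.groupby("subgroup"):
    pass  # compute the per-subgroup matrix here

On the full dataset you would compute the diff within each date group and then group by [date, "subgroup"]; since everything is a vectorized diff/cumsum, it should remain fast at 100M+ rows.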
The pandas corr function is really fast. Even for a table with 1000 columns × 200 rows it doesn't take more than a minute to calculate a cross-correlation matrix.
I was trying to use a for loop to compute the slope between each pair of columns and it was taking much longer. I went back and tried to just use correlation in a loop, and that was also taking a long time.
So I guess I have a two-part question: how is the pandas corr function working so fast, and is there a similar method to compute the pairwise slope between each column?
Here is an example:

a  b
0  0
1  1

In that case slope(a, b) = 1.

a  b
0  0
1  2

In that case slope(a, b) = 2.
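For what it's worth, df.corr() is fast because it runs as vectorized matrix arithmetic in compiled code rather than a Python loop, and the same trick works for slopes, since slope(a, b) = cov(a, b) / var(a). A sketch on random toy data (the variable names are mine):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(200, 1000))

cov = df.cov()                                        # all pairwise covariances at once
variances = pd.Series(np.diag(cov), index=cov.index)  # variance of each column
slopes = cov.div(variances, axis=0)                   # slopes.loc[a, b] = cov(a, b) / var(a)

Applied to the two tiny tables above, this gives slope(a, b) = 1 and 2 respectively, i.e. the least-squares slope of b regressed on a.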
I have 140 CSV files. Each file has 3 variables and is about 750 GB. The number of observations varies from 60 to 90 million.
I also have another small file, treatment_data, with 138000 rows (one for each unique ID) and 21 columns (1 column for the ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not).
The variables are:
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
So in total, I have 138000 × 138000 (19+ billion) rows, one for every possible bilateral combination of IDs, divided across these 140 files.
Research question: given a distance, how many neighbors (of each treatment type) does each ID have?
So I need help with a system (preferably in pandas) where:
- the researcher will input a distance
- the program will look over all the files and keep only the rows where the DISTANCE between ID_FROM and ID_TO is less than the given distance
- output a single dataframe (DISTANCE can be dropped at this point)
- merge the dataframe with treatment_data by matching ID_TO with ID (ID_TO can be dropped at this point)
- collapse the data by ID_FROM (group by and sum the 1s across the 20 treatment variables)
In the final output dataset, I will have 138000 rows and 21 columns: 1 column for the ID and 20 columns for the different treatment types. So, for example, I will be able to answer the question, "Within 2000 meters, how many neighbors of ID 500 are in the treatment_media category?"
IMPORTANT SIDE NOTE:
The DISTANCE variable ranges from 0 to roughly the radius of an average-sized US state (in meters). The researcher is mostly interested in what happens within 5000 meters, which usually drops 98% of the observations, but sometimes he/she will check longer distances too. So I have to keep all the observations available; otherwise, I could have simply filtered out DISTANCE values above 5000 from the raw input files and made my life easier. The reason I think this is important is that the data are sorted by ID_FROM across the 140 files. If I could somehow rearrange these 19+ billion rows by DISTANCE and associate them with some kind of dictionary system, then the program would not need to go over all 140 files. Most of the time the researcher will be looking at only the bottom 2 percent of the DISTANCE range, so it seems like a colossal waste of time to loop over all 140 files. But this is a secondary thought; please do provide an answer even if you can't use this additional side note.

I tried looping over the 140 files for a particular distance in Stata, and it takes 11+ hours to complete the task, which is not acceptable since the researcher will want to vary the distance within the 0 to 5000 range. Most of the computation time is wasted on reading each dataset into memory (that is how Stata does it). That is why I am seeking help in Python.
Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
CREATE TABLE Distances (
    PrimaryKey INTEGER PRIMARY KEY,
    IdFrom     TEXT,
    IdTo       TEXT,
    Distance   INTEGER
);
CREATE INDEX distances_idfrom_distance ON Distances (IdFrom, Distance);

CREATE TABLE TreatmentData (
    PrimaryKey    INTEGER PRIMARY KEY,
    Id            TEXT,
    TreatmentType TEXT
);
CREATE INDEX treatmentdata_id_type ON TreatmentData (Id, TreatmentType);
-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the 'treatment_media' treatment?
SELECT
    d.IdFrom AS Id,
    td.TreatmentType AS Treatment,
    COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = '500'
  AND d.Distance <= 2000
  AND td.TreatmentType = 'treatment_media'
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.
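If you want to drive this from Python (the question mentions pandas), a minimal sketch with sqlite3 and pandas.read_sql_query might look like the following; the database file name is an assumption, and the distance is passed in as a query parameter so the researcher can vary it:

import sqlite3
import pandas as pd

conn = sqlite3.connect("distances.db")  # assumed: tables loaded as outlined above

query = """
    SELECT d.IdFrom AS Id,
           td.TreatmentType AS Treatment,
           COUNT(*) AS Total
    FROM Distances d
    JOIN TreatmentData td ON d.IdTo = td.Id
    WHERE d.Distance <= ?
    GROUP BY d.IdFrom, td.TreatmentType;
"""
distance = 2000  # researcher-supplied threshold in meters
result = pd.read_sql_query(query, conn, params=(distance,))

Pivoting the result (e.g. result.pivot(index="Id", columns="Treatment", values="Total")) then gives the 138000-row, one-column-per-treatment table described in the question.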
I have a time series (array of values) and I would like to find the starting points where a long drop in values begins (at least X consecutive values going down). For example:
Having a list of values
[1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8]
I would like to find a drop of at least 5 consecutive values. So in this case I would find the segment 5,4,3,2,1.
However, in a real scenario, there is noise in the data, so the actual drop includes a lot of little ups and downs.
I could write an algorithm for this. But I was wondering whether there is an existing library or standard signal processing method for this type of analysis.
You can do this pretty easily with pandas (which I know you have). Convert your list to a series, and then perform a groupby + count to find consecutively declining values:
import pandas as pd
v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])
# group consecutive non-increasing runs, then keep only runs of length >= 5
v[v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)]
10 5
11 4
12 3
13 2
14 1
dtype: int64
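If you only need the starting point of each qualifying drop rather than the whole segment, a small extension (my addition, continuing from the snippet above) is to take the first row of each long run:

grp = v.diff().gt(0).cumsum()                  # id of each non-increasing run
mask = v.groupby(grp).transform('size').ge(5)  # runs with at least 5 values
starts = v[mask].groupby(grp[mask]).head(1)    # first point of each long drop

For the example list this returns index 10 (value 5), where the 5,4,3,2,1 segment begins.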
Presented as an example: two data sets, one collected over a 1-hour period and one collected over a 20-minute period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most (top x many) likely to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
                 __--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross correlation?
I would first convert the string to a numeric representation, so replace your - and _ with 1 and 0.
You can do that using the string's replace method (e.g. signal = signal.replace("-", "1").replace("_", "0")).
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross-correlation value (degree of match) between your two signals.
You might want to take a look at the docs for np.correlate and np.convolve to determine the mode (full, same, or valid) you want to use, as that's determined by the length of your data and what you want to happen if your signals are different lengths.
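Putting the pieces together on the example strings from the question (the offset arithmetic for "full" mode is my addition):

import numpy as np

ds1 = "_-__-_--___----_-__--_-__---__"
ds2 = "__--_-__--"

# encode '-' (event occurring) as 1 and '_' (not occurring) as 0
event1 = [int(c == "-") for c in ds1]
event2 = [int(c == "-") for c in ds2]

xcor = np.correlate(event1, event2, "full")
nR = max(xcor)                 # 5 here: all five DS2 events line up with DS1 events
maxLag = int(np.argmax(xcor))  # 20 here, as in the example output above

# in "full" mode, lag 0 corresponds to DS2 hanging off the left end of DS1,
# so the shift of DS2 into DS1 is maxLag - (len(event2) - 1)
offset = maxLag - (len(event2) - 1)
print(nR, maxLag, offset)

Note that on these particular strings there are two equally good alignments (offsets 11 and 17); argmax reports the first, while the roughly 34-minute shift in the question corresponds to offset 17. Something like nR / sum(event2) could then serve as the matching percentage mentioned in the question.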