Splitting pandas group into subgroups based on relation between consecutive data points - python

Suppose I have a dataset that records camera sightings of some object over time, and I groupby date so that each group represents sightings within the same day. I'd then like to break one group into 'subgroups' based on the time between sightings -- if the gap is too large, then I want them to be in different groups.
Consider the following as one group.
Camera   Time
A        6
B        12
C        17
D        21
E        47
F        50
Suppose I had a cutoff matrix that told me how close the next sighting had to be for two adjacent cameras to stay in the same group. For example, we might have cutoff_mat[d, e] = 10, which means that since cameras D and E are more than 10 time units apart, I should break the group into two: one ending with D and one starting with E. I would like to do this in a way that allows efficient iteration over each of the resulting subgroups, since my real goal is to compute some other matrix from the values within each subgroup, and one group may need to be broken into many pieces, not just two. How do I do this? The dataset is large (>100M points), so something fast would be appreciated.
I am thinking I could do this by creating another column in the original dataset that holds the time between consecutive sightings on the same day, and then somehow grouping by both the date AND this new column, but I'm not quite sure how that would work. I also don't think pd.cut() works here, since I don't have pre-determined bins.
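A minimal sketch of that idea, not from the original question: the column names (date, camera, time), the default cutoff, and the cutoff_for() lookup are all assumptions standing in for the real data and cutoff matrix. Each gap to the previous sighting is compared against its pair-specific cutoff, and a cumulative sum of the "gap too large" flags labels the subgroups, so a plain groupby on date plus that label iterates over them.

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data and cutoff matrix.
df = pd.DataFrame({
    'date':   ['2020-01-01'] * 6,
    'camera': ['A', 'B', 'C', 'D', 'E', 'F'],
    'time':   [6, 12, 17, 21, 47, 50],
})
cutoff_mat = {('D', 'E'): 10}
default_cutoff = 25

def cutoff_for(prev_cam, cam):
    return cutoff_mat.get((prev_cam, cam), default_cutoff)

# Gap to the previous sighting within the same day, and the camera it came from.
gap = df.groupby('date')['time'].diff()
prev_cam = df.groupby('date')['camera'].shift()
cutoffs = pd.Series(
    [cutoff_for(p, c) if pd.notna(p) else np.inf
     for p, c in zip(prev_cam, df['camera'])],
    index=df.index)

# A new subgroup starts whenever the gap exceeds the pair-specific cutoff.
df['subgroup'] = (gap > cutoffs).cumsum()

for (day, sub), block in df.groupby(['date', 'subgroup']):
    ...  # compute the per-subgroup matrix here

With the numbers above this yields two subgroups, {A, B, C, D} and {E, F}, matching the split described in the question.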

Related

randomisation in python within a group for A/B test

I have a python list of size 67 with three unique values, with the following distribution:
A - 20 occurrences randomly spread in the list
B - 36 occurrences randomly spread in the list
C - 11 occurrences randomly spread in the list
I want to randomly select 10% of the values within each group and apply a special treatment to the selected values.
Based on the occurrences in the list shown above, that would mean roughly 2 treatments for group A, 3 for group B, and 1 for group C.
The selection does not need to happen on exactly every 10th occurrence of a value, but the ratio of treated to total values in each group should stay at approximately 10%.
Right now, I have this code
import random

if random.random() <= 0.1:
    ...  # do something
Using this code doesn't give me the required number of treatments at a group level; instead it randomly picks treatments across all groups. I want to control the random selection at a group level. How do I do that?
Also, if this list were dynamic and kept growing at run time, populated with more values of A, B, and C (albeit with different distributions), how could I still maintain the randomisation at a group (unique value in the list) level?
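One rough sketch of group-level selection, not taken from the question: group the positions of each unique value and sample about 10% of each group with random.sample (the rounding rule and the example list are assumptions).

import random
from collections import defaultdict

values = ['A', 'B', 'C', 'A', 'B', 'B', 'C', 'A']  # stand-in for the real 67-item list

# Group the positions of each unique value.
positions = defaultdict(list)
for i, v in enumerate(values):
    positions[v].append(i)

# Sample roughly 10% of each group's positions.
treated = set()
for v, idxs in positions.items():
    k = max(1, int(0.1 * len(idxs)))  # at least one per group; an assumption
    treated.update(random.sample(idxs, k))

for i in sorted(treated):
    ...  # apply the special treatment to values[i]

For the dynamic case, the same grouping could simply be recomputed whenever the list grows, since it only depends on the current positions of A, B, and C.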

How to use Pandas to split timestamped CSV data into multiple CSVs based on values and continuous time periods

I am trying to analyse a ship's AIS data. I have a CSV with ~20,000 rows, with columns for lat / long / speed / time stamp.
I have loaded the data in a pandas data frame, in a Jupyter notebook.
What I want to do is split the CSV into smaller CSVs based on the time stamp and the speed: I want an individual CSV for each period of time the vessel's speed was less than, say, 2 knots. For example, if the vessel transited at 10 knots for 6 hrs, slowed down to 1 knot for 3 hrs, sped back up to 10 knots, then slowed down again to 1 knot for 4 hrs, I would want the output to be two CSVs, one for the 3 hr period and one for the 4 hr period. This is so I can review these periods individually in my mapping software.
I can easily filter the data to show all the rows where the speed is < 1 knot, but I can't break them down into continuous periods and output those as separate CSVs / data frames.
EDIT
Here is an example of the data
I've tried to show more clearly what I want to achieve here
Here is something to maybe get you started.
First, filter out all the values that meet the criterion (for example, speed at or below 2):
import pandas as pd

df = pd.DataFrame({'speed':[2,1,4,5,4,1,1,1,3,4,5,6], 'time':[4,5,6,7,8,9,10,11,12,13,14,15]})
df_below2 = df[df['speed']<=2].reset_index(drop=True)
Now we need to split the frame wherever there is too long a gap between consecutive time values. For example:
threshold = 2
df_below2['not_continuous'] = df_below2['time'].diff() > threshold
Distinguish between the groups using cumsum():
df_below2['group_id'] = df_below2['not_continuous'].cumsum()
From here it should be easy to split the frame based on the group id.
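For instance, one way to finish this off, continuing from the df_below2 frame above (the filename pattern is just an assumption):

# Write each continuous slow-speed period to its own CSV.
for group_id, period in df_below2.groupby('group_id'):
    period.drop(columns=['not_continuous', 'group_id']) \
          .to_csv(f'slow_period_{group_id}.csv', index=False)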

Merging large data in Python in local machine

I have 140 csv files. Each file has 3 variables and is about 750 GB. The number of observations varies from 60 to 90 million.
I also have another small file, treatment_data, with 138,000 rows (one for each unique ID) and 21 columns (1 column for the ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not).
The variables are,
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
So in total I have 138,000 * 138,000 (19+ billion) rows, one for every possible bilateral combination of IDs, divided across these 140 files.
Research question: given a distance, how many neighbors (of each treatment type) does an ID have?
So I need help with a system (preferably in pandas) where:
the researcher inputs a distance;
the program looks over all the files and filters out the rows where the DISTANCE between ID_FROM and ID_TO is less than the given distance;
it outputs a single dataframe (DISTANCE can be dropped at this point);
it merges that dataframe with treatment_data by matching ID_TO with ID (ID_TO can be dropped at this point);
it collapses the data by ID_FROM (group by and sum the 1s across the 20 treatment variables).
In the final output dataset, I will have 138,000 rows and 21 columns: 1 column for the ID and 20 columns, one for each treatment type. So, for example, I will be able to answer the question, "Within 2000 meters, how many neighbors of ID 500 are in the 'treatment_media' category?"
IMPORTANT SIDE NOTE:
The DISTANCE variable ranges from 0 to roughly the radius of an average-sized US state (in meters). The researcher is mostly interested in what happens within 5000 meters, which usually drops 98% of the observations, but sometimes he/she will check longer distances too, so I have to keep all the observations available. Otherwise, I could have simply filtered out DISTANCE greater than 5000 from the raw input files and made my life easier. The reason I think this is important is that the data are sorted on ID_FROM across the 140 files. If I could somehow rearrange these 19+ billion rows based on DISTANCE and associate them with some kind of dictionary or index, then the program would not need to go over all 140 files; most of the time the researcher will only be looking at the bottom 2 percent of the DISTANCE range. It seems like a colossal waste of time to loop over 140 files. But this is a secondary thought; please do provide an answer even if you can't use this additional side note.
I tried looping over the 140 files for a particular distance in Stata; it takes 11+ hours to complete the task, which is not acceptable as the researcher will want to vary the distance within the 0 to 5000 range. Most of the computation time is wasted on reading each dataset into memory (that is how Stata does it). That is why I am seeking help in Python.
Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
CREATE TABLE Distances (
    PrimaryKey INTEGER PRIMARY KEY,
    IdFrom     TEXT,
    IdTo       TEXT,
    Distance   INTEGER
);
CREATE INDEX idx_distances_from_dist ON Distances(IdFrom, Distance);

CREATE TABLE TreatmentData (
    PrimaryKey    INTEGER PRIMARY KEY,
    Id            TEXT,
    TreatmentType TEXT
);
CREATE INDEX idx_treatment_id_type ON TreatmentData(Id, TreatmentType);

-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the "treatment_media" treatment?
SELECT
    d.IdFrom AS Id,
    td.TreatmentType,
    COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = '500'
  AND d.Distance <= 2000
  AND td.TreatmentType = 'treatment_media'
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.
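If staying in Python is a requirement, a rough sketch of the same idea (file paths, table names, and column names such as treatment_media are assumptions based on the question) would be a one-time load of the CSVs into a local SQLite database via pandas, followed by a short query per requested distance:

import glob
import sqlite3
import pandas as pd

con = sqlite3.connect('neighbors.db')

# One-time load: stream each large CSV into the database in chunks.
for path in glob.glob('distance_files/*.csv'):
    for chunk in pd.read_csv(path, chunksize=1_000_000):
        chunk.to_sql('distances', con, if_exists='append', index=False)

pd.read_csv('treatment_data.csv').to_sql('treatments', con, if_exists='replace', index=False)
con.execute('CREATE INDEX IF NOT EXISTS idx_dist ON distances(DISTANCE)')
con.commit()

# Per-request step: for a given distance, count treated neighbors per ID_FROM.
query = """
    SELECT d.ID_FROM, SUM(t.treatment_media) AS n_media_neighbors
    FROM distances d
    JOIN treatments t ON d.ID_TO = t.ID
    WHERE d.DISTANCE <= ?
    GROUP BY d.ID_FROM
"""
result = pd.read_sql_query(query, con, params=(2000,))

The query shown covers one treatment column for brevity; the other 19 columns could be summed in the same statement.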

Any existing methods to find a drop in a noisy time series?

I have a time series (array of values) and I would like to find the starting points where a long drop in values begins (at least X consecutive values going down). For example:
Having a list of values
[1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8]
I would like to find a drop of at least 5 consecutive values. So in this case I would find the segment 5,4,3,2,1.
However, in a real scenario, there is noise in the data, so the actual drop includes a lot of little ups and downs.
I could write an algorithm for this. But I was wondering whether there is an existing library or standard signal processing method for this type of analysis.
You can do this pretty easily with pandas (which I know you have). Convert your list to a series, and then perform a groupby + count to find consecutively declining values:
import pandas as pd

v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])
v[v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)]
10 5
11 4
12 3
13 2
14 1
dtype: int64
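If only the starting points of such drops are needed (the question asks for those), a possible follow-up on the same idea, reusing the v series above, is to keep the first index of each qualifying run:

mask = v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)
drop_starts = v.index[mask & ~mask.shift(fill_value=False)]  # index 10 for this example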

How to find the alignment of two data sets in pandas

Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most (top x many) likely to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross correlation?
I would first convert the string to a numeric representation, so replace your - and _ with 1 and 0.
You can do that using a string's replace method (e.g. signal.replace("-", "1").replace("_", "0"))
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross correlation value (degree of match) between your 2 signals.
You might want to take a look at the docs for np.correlate and np.convolve to determine the mode (full, same, or valid) you want to use, as that's determined by the length of your data and what you want to happen if your signals are different lengths.
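Putting those pieces together on the example strings from the question (the conversion from the 'full'-mode peak index back to a start position in DS1, and the note about ties, are my additions, so treat this as a sketch):

import numpy as np

# '-' = event occurring, '_' = not occurring (strings from the question).
ds1 = "_-__-_--___----_-__--_-__---__"
ds2 = "__--_-__--"

event1 = [int(c) for c in ds1.replace("-", "1").replace("_", "0")]
event2 = [int(c) for c in ds2.replace("-", "1").replace("_", "0")]

xcor = np.correlate(event1, event2, "full")
nR = max(xcor)            # 5: all five '-' marks in DS2 line up with '-' in DS1
maxLag = np.argmax(xcor)  # 20: index of the first peak in the 'full' output

# In 'full' mode the output index is shifted by len(event2) - 1, so the
# position in DS1 where DS2 starts at that peak is:
start = maxLag - (len(event2) - 1)  # 11 here; ties at the maximum are possible
                                    # because mismatched zeros don't lower the score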
