I've got a dataset (a pandas DataFrame) of multiple people who each carry a GPS device and have tracked their location over time. The dataset looks something like this:
person_id | timestamp | latitude | longitude
1 | 2019-05-15 10:01:53.231 | 10.00110 | 5.64321
1 | 2019-05-15 10:02:54.131 | 10.00310 | 5.64322
1 | 2019-05-15 10:03:55.331 | 10.00210 | 5.64325
1 | 2019-05-15 10:05:00.731 | 10.00410 | 5.64421
1 | 2019-05-15 10:06:48.434 | 10.00510 | 5.64121
1 | 2019-05-15 10:07:24.189 | 10.01110 | 5.63321
1 | 2019-05-15 10:08:53.231 | 10.02110 | 5.62821
2 | 2019-05-15 10:02:41.111 | 10.01131 | 5.64320
2 | 2019-05-15 10:03:47.221 | 10.01132 | 5.64322
2 | 2019-05-15 10:05:53.121 | 10.01130 | 5.64321
2 | 2019-05-15 10:07:24.564 | 10.01401 | 5.64331
etc.
So the GPS devices measure their location frequently. Sometimes we miss a few points, but in general the dataset is quite good. The GPS coordinates, however, jump around a bit even when a person is not moving, because of the limited accuracy of the device/GPS.
What I want to do is add a column that indicates whether a person is moving or not. To do so I thought of having a rolling window, calculating the mean position in that window and then calculating the distance (geopy.distance.distance()) of each point to that mean position; if the distance of any of the points in the window is larger than a given threshold (say 15 m), then those points are considered to be "moving".
I've looked around on the internet but can't really find out how to do this (without using inefficient for loops). I'm looking for something like this:
df['moving'] = df.groupby('person_id').rolling(window=10).apply(
    # ... some function here, like:
    lambda window: np.any(
        [distance((lat_mean, lon_mean), (row_lat, row_lon)) > threshold
         for row in window]
    )
)
Ideally we would want the window to be based on time and a minimum number of data points, but this might make it more difficult...
Any suggestions/ideas?
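For what it's worth, here is a minimal sketch of the approach described above (rolling mean position plus a distance threshold), assuming the column names from the sample data, a fixed 10-point window and a 15 m threshold:

import pandas as pd
from geopy.distance import distance

def flag_moving(group, window=10, threshold_m=15):
    # Rolling mean position of the last `window` fixes for this person.
    lat_mean = group['latitude'].rolling(window, min_periods=1).mean()
    lon_mean = group['longitude'].rolling(window, min_periods=1).mean()
    # Distance (in metres) of each fix to its windowed mean position.
    dists = [
        distance((lat, lon), (lam, lom)).meters
        for lat, lon, lam, lom in zip(group['latitude'], group['longitude'],
                                      lat_mean, lon_mean)
    ]
    return pd.Series(dists, index=group.index) > threshold_m

df = df.sort_values(['person_id', 'timestamp'])
df['moving'] = df.groupby('person_id', group_keys=False).apply(flag_moving)

Note this is a simplification: each fix is compared to the rolling mean ending at that fix, rather than flagging every point of a window in which any point exceeds the threshold. The Python loop over geopy calls could be replaced with a vectorised haversine if speed matters, and pandas also accepts time-based windows (e.g. rolling('2min') on a DatetimeIndex), which would cover the "window based on time" idea.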
Related
I have numerous dataframes that look something like this:
| first_column | second_column | third_column |
| ------------ | ------------- | ------------ |
| 25 | 20 | 18 |
| 20 | 21 | 18 |
| 17 | 18 | 16 |
I want to see how well they converge to a corner (i.e., how well the values increase as you move towards a corner of the dataframe). I've tried a couple of different methods, such as using df.diff() < 0 to check whether the values are decreasing as you go down each column, converting the True/False results to 1s and 0s and summing them, then doing the same with the dataframe transposed and adding both sums together, but that doesn't seem to correctly pick out the dataframes that converge best to a corner.
I know this is oddly specific, but is there any way to see whether the values of a dataframe increase as you approach a corner of the dataframe?
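For reference, a minimal sketch of the diff-based idea described above, assuming the corner of interest is the top-left one (so values should decrease going down and to the right) and normalising by the number of comparisons so that dataframes of different shapes can be ranked against each other:

import pandas as pd

def corner_score(df):
    # Fraction of adjacent pairs that decrease when moving away from the
    # top-left corner: down each column and right along each row.
    down = (df.diff(axis=0).iloc[1:] < 0).to_numpy().sum()
    right = (df.diff(axis=1).iloc[:, 1:] < 0).to_numpy().sum()
    n_pairs = (df.size - df.shape[1]) + (df.size - df.shape[0])
    return (down + right) / n_pairs  # 1.0 means every step decreases

example = pd.DataFrame({'first_column': [25, 20, 17],
                        'second_column': [20, 21, 18],
                        'third_column': [18, 18, 16]})
print(corner_score(example))  # 8 of 12 adjacent pairs decrease -> ~0.67

A score like this only checks monotonicity, not how steeply the values rise towards the corner; if the magnitude matters, the mean of the signed differences could be used instead of a count.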
I have a dataset where each row represents the number of occurrences of a certain behaviour. Each column represents a window of a set amount of time. It looks like this:
+----------+----------+----------+----------+-----------+------+
| Episode1 | Episode2 | Episode3 | Episode4 | Episode5 | ... |
+----------+----------+----------+----------+-----------+------+
| 2 | 0 | 1 | 3 | | |
| 1 | 2 | 4 | 2 | 3 | |
| 0 | | | | | |
+----------+----------+----------+----------+-----------+------+
There are over 150 episodes. I want to find a way to represent each row as a trend, i.e. whether the behaviour is occurring more or less often over time.
I have tried first calculating the average/median/sum of every 3/5/10 cells of each row (because each row has a different length and many 0 values), and then correlating these with the (horizontal) time axis; the coefficients of these correlations should indicate the trend (< 0 means downward, > 0 upward). The trends will be used in further analysis.
I'm wondering if there is a better way to do this. Thanks.
If you expect the trend to be linear, you could fit a linear regression to each row separately, using time to predict the number of occurrences of the behaviour, and then store the slopes.
This slope represents the effect of increasing time by one episode on the behaviour. It also naturally accounts for the differing lengths of the time series.
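A minimal sketch of that suggestion, assuming the frame contains only the episode columns and that empty cells are NaN; np.polyfit over the non-missing values gives a per-row slope:

import numpy as np
import pandas as pd

def row_slope(row):
    # OLS slope of occurrences against episode number, ignoring the
    # missing episodes at the end of shorter rows.
    y = row.dropna().to_numpy(dtype=float)
    if len(y) < 2:
        return np.nan
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0]  # first coefficient is the slope

df['trend_slope'] = df.apply(row_slope, axis=1)
# slope > 0: behaviour becoming more frequent; slope < 0: less frequent

If the rows are noisy, the same idea works on the 3/5/10-episode averages mentioned in the question; the slope is then per block of episodes rather than per episode.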
I have a dataframe full of stock transactions. I want to work out how to tell whether a certain account is ordering less than they usually do.
NB: some customers might place a big order (double the size of their normal orders) and then take twice as long to order again, or order half the size at double the normal frequency.
Data layout:
SA_DACCNT = Account number
SA_TRREF = Invoice No
+-----------+------------+----------+------------+------------+
| SA_DACCNT | SA_TRDATE | SA_TRREF | SA_TRVALUE | SA_PRODUCT |
+-----------+------------+----------+------------+------------+
| GO63 | 01/05/2019 | 11587 | 0.98 | R613S/6 |
| GO63 | 01/05/2019 | 11587 | 0.98 | R614S/6 |
| AAA1 | 01/05/2019 | 11587 | 1.96 | R613S/18 |
| GO63 | 01/05/2019 | 11587 | 2.5 | POST3 |
| DOD2 | 01/07/2019 | 11588 | 7.84 | R613S/18 |
+-----------+------------+----------+------------+------------+
So Far:
I've tried grouping by customer, resampling into quarters and analysing the z-score of the last quarter for each customer, but it doesn't always work out correctly, as not everyone orders quarterly, so there are gaps which skew the figures.
I've also tried fitting a linear model to the figures for each customer and using the coefficient as a metric to determine whether spend is down, but it still doesn't look right.
I've been looking at is_monotonic_increasing and other such functions but am still unable to find what I'm looking for. I'm sure there is a statistical technique for this, but it eludes me.
I'm pretty new to all this and am left scratching my head over how best to tackle it.
My goal is to determine WHO has spent less (over their last few orders) than they usually do.
Basically, who do we need to call / chase / send offers to, etc.
Any ideas on the correct way to analyse this would be appreciated. I'm not looking to copy someone's code; I'm sure I can work that bit out myself.
EDIT:
I've also now tried using diff to calculate the average distance between orders and resampling to that value, then calculating the z-score, but it's still not what I expect.
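Not a definitive answer, but one way to sidestep the calendar gaps is to work per order rather than per quarter: total each invoice, then compare each account's last few orders against its own history with a z-score. The column names below come from the table above; the choice of the last 3 orders and a minimum of 5 historical orders are arbitrary assumptions, and SA_TRDATE is assumed to have been parsed with pd.to_datetime:

import pandas as pd

# One row per invoice: the total value of each order per account.
orders = (df.groupby(['SA_DACCNT', 'SA_TRREF'], as_index=False)
            .agg(order_date=('SA_TRDATE', 'first'),
                 order_value=('SA_TRVALUE', 'sum'))
            .sort_values(['SA_DACCNT', 'order_date']))

def recent_vs_history(g, n_recent=3, min_history=5):
    history = g['order_value'].iloc[:-n_recent]
    recent = g['order_value'].iloc[-n_recent:]
    if len(history) < min_history:
        return pd.Series({'z': float('nan')})  # not enough history to judge
    z = (recent.mean() - history.mean()) / history.std()
    return pd.Series({'z': z})

scores = orders.groupby('SA_DACCNT').apply(recent_vs_history)
# Strongly negative z: the account is spending noticeably less than usual.
print(scores.sort_values('z').head(10))

Working on order totals partly absorbs the "bigger order, less often" pattern from the NB above; if frequency matters as much as value, the same comparison could be run on spend per day (order value divided by days since the previous order).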
I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results from this. As a simple example, the new dataframe should contain one row per sample with the average information over that sample's runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in df_samples['sample_id'].unique():
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index='sample_id', columns='run_id',
                                  values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed
    # in the initial df_samples dataframe
    pivots.append(pivot)
# create the new dataframe
result = pd.concat(pivots)
So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of having to call it iteratively? If there is, is it more performant?
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also increases its dimensions, i.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes and performantly constructing new ones or performing queries. I am almost always able to get the result I want using pandas, but my implementations are often not efficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory I don't know about dataframes and tables, etc. A recommendation for a resource would be good. Note that this is just additional helpful information; a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a resource recommendation.
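A hedged sketch of the loop-free alternative: a single groupby over the whole frame produces the per-sample averages, and a per-group function can build the extra features in the same pass. The two feature functions below are placeholders, since f and g are not specified in the question:

import pandas as pd

# Simple case: per-sample averages in one shot, no Python-level loop.
avg = (df_samples.groupby('sample_id')[['x', 'y', 'z']]
                 .mean()
                 .add_prefix('avg_'))

# More general case: several derived features per sample in one pass.
def sample_features(g):
    return pd.Series({
        'avg_x': g['x'].mean(),
        'avg_y': g['y'].mean(),
        'avg_z': g['z'].mean(),
        # placeholder for f(x11, x12, ...): any function of the raw values
        'new_feature_1': (g[['x', 'y', 'z']].max() - g[['x', 'y', 'z']].min()).sum(),
        # placeholder for g(avg_x1, ...): any function of the per-run aggregates
        'new_feature_2': g.groupby('run_id')[['x', 'y', 'z']].mean().mean().mean(),
    })

result = df_samples.groupby('sample_id').apply(sample_features)

Because the split-apply-combine happens inside pandas, a single groupby is usually considerably faster than building the result frame in a Python loop; when the features can be expressed as plain column reductions, groupby(...).agg(...) with named aggregations is faster still than apply.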
I'm sure what I'm trying to do is fairly simple for those with better knowledge of pandas, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea how to achieve this.
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQL Alchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the parameter sort=False is also helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value Total a b c
Date
01/01/2016 3 1 1 1
The difference between size and count is:
size counts NaN values, count does not.
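A tiny toy example (not from the question) illustrating that difference:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1.0, np.nan, 2.0]})
print(toy.groupby('g')['v'].size())   # x: 2, y: 1 -> NaN rows are included
print(toy.groupby('g')['v'].count())  # x: 1, y: 1 -> NaN rows are dropped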