I have a dataset where each row represents the number of occurrences of a certain behaviour. Each column represents a time window of fixed length. It looks like this:
+----------+----------+----------+----------+----------+-----+
| Episode1 | Episode2 | Episode3 | Episode4 | Episode5 | ... |
+----------+----------+----------+----------+----------+-----+
| 2        | 0        | 1        | 3        |          |     |
| 1        | 2        | 4        | 2        | 3        |     |
| 0        |          |          |          |          |     |
+----------+----------+----------+----------+----------+-----+
There are over 150 episodes. I want to find a way to represent each row as a trend, i.e. whether the occurrences are increasing or decreasing over time.
I first tried calculating the average/median/sum of every 3/5/10 cells of each row (because the rows have different lengths and contain many 0 values), and then correlating these values with the time axis. The sign of each correlation coefficient should indicate the trend (<0 means downward, >0 upward). The trends will be used in further analysis.
I'm wondering if there is a better way to do this. Thanks.
If you expect the trend to be linear, you could fit a linear regression to each row separately, using time to predict the number of occurrences of the behavior. Then store the slopes.
Each slope estimates the change in the number of occurrences per additional episode. Because it is a per-episode rate, it also naturally accounts for the different lengths of the time series.
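A minimal sketch of the per-row regression, assuming the data is in a pandas DataFrame with one column per episode and NaN for the empty cells. The helper name `row_slope` and the sample frame are only illustrative; `np.polyfit` with `deg=1` is one of several ways to get an ordinary least-squares slope.

```python
import numpy as np
import pandas as pd

def row_slope(row: pd.Series) -> float:
    """Least-squares slope of counts against episode number, skipping empty cells."""
    y = row.to_numpy(dtype=float)
    x = np.arange(1, len(y) + 1, dtype=float)  # episode number serves as the time axis
    observed = ~np.isnan(y)
    if observed.sum() < 2:
        return np.nan  # a slope is undefined with fewer than two observations
    slope, _intercept = np.polyfit(x[observed], y[observed], deg=1)
    return slope

# Illustrative data mirroring the table in the question (NaN = no observation).
df = pd.DataFrame(
    [[2, 0, 1, 3, np.nan],
     [1, 2, 4, 2, 3],
     [0, np.nan, np.nan, np.nan, np.nan]],
    columns=[f"Episode{i}" for i in range(1, 6)],
)

slopes = df.apply(row_slope, axis=1)  # one slope per row; >0 upward, <0 downward
```

Rows with a single observation get NaN rather than a slope, so they can be filtered out before any further analysis.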
Related
I have numerous dataframes that look something like this:
| first_column | second_column | third_column |
| ------------ | ------------- | ------------ |
| 25 | 20 | 18 |
| 20 | 21 | 18 |
| 17 | 18 | 16 |
I want to see how well they converge to a corner (i.e., how consistently the values increase as you move towards a corner of the dataframe). I've tried a couple of different methods. For example, I used df.diff() < 0 to see whether the values decrease as you go down each column, converted the True/False results to integers (1 or 0) and summed them, then did the same with the dataframe transposed and added both sums together. That doesn't seem to pick out the dataframes that converge to a corner best.
I know this is oddly specific, but is there any way to measure how well the values of a dataframe increase as you move towards a corner?
I have this kind of dataset (it originally comes from Excel, so I pasted it here like this). These are ice hockey statistics:
game_id | home_team_shots | home_team_saves | home_team_goals | away_team_shots | away_team_saves | away_team_goals
123 | 20 | 13 | 3 | 15 | 17 | 2
124 | 23 | 28 | 5 | 30 | 17 | 5
-----
There are about 15-20 rows like this depending on the exact dataset, and I have about 50 of these Excel files.
Now I need to find simple mathematical rules in this data: basically, whether all values in some column are less than X, equal to X, or more than X. Not every column will follow such a rule, which is OK.
For example, for the data above it should identify that away_team_saves == 17 and home_team_shots < 24. Of course not all columns are important, so I should be able to define which columns to use.
I am mainly looking for a solution in Python. I don't need an exact solution, just a push in the right direction. I am quite a beginner in Python and don't know where I should start looking.
Any help is much appreciated.
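One possible starting point, sketched under the assumption that each Excel sheet has been loaded into a pandas DataFrame (for example with `pd.read_excel`): take the minimum and maximum of each column of interest. If they are equal the column is constant; otherwise they give the tightest bounds. The list `columns_of_interest` and the rule strings are illustrative names, not part of the original data.

```python
import pandas as pd

# The two sample rows from the question.
df = pd.DataFrame({
    "game_id": [123, 124],
    "home_team_shots": [20, 23],
    "home_team_saves": [13, 28],
    "home_team_goals": [3, 5],
    "away_team_shots": [15, 30],
    "away_team_saves": [17, 17],
    "away_team_goals": [2, 5],
})

# Hypothetical selection: the user decides which columns matter.
columns_of_interest = ["home_team_shots", "away_team_saves"]

rules = []
for name in columns_of_interest:
    lo, hi = df[name].min(), df[name].max()
    if lo == hi:
        rules.append(f"{name} == {lo}")          # every value is identical
    else:
        rules.append(f"{lo} <= {name} <= {hi}")  # tightest bounds seen in the data

print(rules)
```

For integer counts, `home_team_shots <= 23` says the same thing as `home_team_shots < 24`. With only 15-20 rows per file such bounds always exist, so whether a bound is meaningful is a separate judgment.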
I have a dataframe full of stock transactions. I want to work out whether a certain account is ordering less than it usually does.
NB: some customers might place a big order (double the size of their normal orders) and then take twice as long to order again, or place orders of half the size at double the normal frequency.
Data layout:
SA_DACCNT = Account number
SA_TRREF = Invoice No
+-----------+------------+----------+------------+------------+
| SA_DACCNT | SA_TRDATE | SA_TRREF | SA_TRVALUE | SA_PRODUCT |
+-----------+------------+----------+------------+------------+
| GO63 | 01/05/2019 | 11587 | 0.98 | R613S/6 |
| GO63 | 01/05/2019 | 11587 | 0.98 | R614S/6 |
| AAA1 | 01/05/2019 | 11587 | 1.96 | R613S/18 |
| GO63 | 01/05/2019 | 11587 | 2.5 | POST3 |
| DOD2 | 01/07/2019 | 11588 | 7.84 | R613S/18 |
+-----------+------------+----------+------------+------------+
So Far:
I've tried grouping by customer, resampling into quarters and analysing the z-score of the last quarter for each customer, but it doesn't always work out correctly because not everyone orders quarterly, so there are gaps which skew the figures.
I've also tried fitting a linear model to the figures for each customer and using the coefficient as a metric to determine if spend is down, but it still doesn't look right.
I've been looking at is_monotonic_increasing and other such functions but am still unable to find what I'm looking for. I'm sure there is a statistical technique for it, but it eludes me.
I'm pretty new to all this and am left scratching my head on how best to tackle it.
My goal is to determine WHO has spent less (over their last few orders) than they usually do.
Basically, who do we need to call / chase / send offers to, etc.
Any ideas on the correct way to analyse this would be appreciated. I'm not looking to copy someone's code; I'm sure I can work that bit out myself.
EDIT:
I've also now tried using diff to calculate the average distance between orders, resampling to this value, and then calculating the z-score. But it's still not what I expect.
I've got a dataset (pandas dataframe) of multiple people that have a gps device and have tracked their location over time. This dataset looks something like this:
person_id | timestamp | latitude | longitude
1 | 2019-05-15 10:01:53.231 | 10.00110 | 5.64321
1 | 2019-05-15 10:02:54.131 | 10.00310 | 5.64322
1 | 2019-05-15 10:03:55.331 | 10.00210 | 5.64325
1 | 2019-05-15 10:05:00.731 | 10.00410 | 5.64421
1 | 2019-05-15 10:06:48.434 | 10.00510 | 5.64121
1 | 2019-05-15 10:07:24.189 | 10.01110 | 5.63321
1 | 2019-05-15 10:08:53.231 | 10.02110 | 5.62821
2 | 2019-05-15 10:02:41.111 | 10.01131 | 5.64320
2 | 2019-05-15 10:03:47.221 | 10.01132 | 5.64322
2 | 2019-05-15 10:05:53.121 | 10.01130 | 5.64321
2 | 2019-05-15 10:07:24.564 | 10.01401 | 5.64331
etc.
So the GPS devices measure their location frequently. Sometimes we miss a few points, but in general the dataset is quite good. The GPS coordinates however jump around a bit even if you are not moving due to the accuracy of the device/GPS.
What I want to do is add a column that indicates whether a person is moving or not. To do so, I thought of using a rolling window: calculate the mean position in that window, then calculate the distance (geopy.distance.distance()) from each point to that mean position. If the distance of any point in the window is larger than a given threshold (say 15 m), then those points are considered to be "moving".
I've looked around on the internet but can't really find out how to do this (without using inefficient for loops). I'm looking for something like this:
df['moving'] = df.groupby(['person_id']).rolling(
    window=10).apply(
        # ... some function here, like:
        np.any(np.array([distance(
            lat_mean,
            lon_mean,
            row_lat,
            row_lon
        ) for row in window]) > threshold))
Ideally we would want to have the window based on time and a minimum of datapoints, but this might make it more difficult...
Any suggestions/ideas?
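One loop-free way to approximate this idea, as a rough sketch rather than a definitive answer. It assumes a count-based trailing window per person and swaps geopy.distance.distance() for a plain haversine formula in NumPy so that the whole column can be computed at once. It is also a simplification of the rule above: each point is compared with the mean of its own trailing window, rather than every point in the window being checked. The names `haversine_m` and `flag_moving` and the sample coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between points given in degrees (vectorised)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def flag_moving(df, window=3, threshold_m=15.0):
    """Flag points lying more than threshold_m from their trailing-window mean position."""
    out = df.sort_values(["person_id", "timestamp"]).copy()
    grouped = out.groupby("person_id")

    def trailing_mean(s):
        return s.rolling(window, min_periods=1).mean()

    out["lat_mean"] = grouped["latitude"].transform(trailing_mean)
    out["lon_mean"] = grouped["longitude"].transform(trailing_mean)
    out["dist_m"] = haversine_m(out["latitude"], out["longitude"],
                                out["lat_mean"], out["lon_mean"])
    out["moving"] = out["dist_m"] > threshold_m
    return out

# Illustrative data: person 1 jitters by about 1 m, person 2 walks about 111 m per fix.
sample = pd.DataFrame({
    "person_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "timestamp": pd.to_datetime(["2019-05-15 10:01", "2019-05-15 10:02",
                                 "2019-05-15 10:03", "2019-05-15 10:04"] * 2),
    "latitude": [10.00000, 10.00001, 10.00000, 10.00001,
                 10.000, 10.001, 10.002, 10.003],
    "longitude": [5.0] * 8,
})
result = flag_moving(sample)
```

Note that the first fix of each person is compared only with itself, so it is never flagged. A trailing window is used here because a centred mean would coincide with the point itself during steady straight-line motion.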
I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a dataframe based on results from this. So a simple example would be my new dataframe should contain a row with the average information from a sample run:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
df_sample = df_samples[df_samples['sample_id'] == i]
pivot = df_sample.pivot_table(index=index, columns='run_id', values=[x, y, z], aggfunc='mean')
# Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
pivots.append(pivot)
# create new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of calling it iteratively? If there is, is it more performant?
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe will also gain additional columns? For example, it might look like this:
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes and on performantly constructing new ones or performing queries. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and resemble how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory about dataframes and tables that I do not know. A recommendation for a resource would be good. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a recommendation for a resource.
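For the simple case, a single groupby with named aggregation can replace the loop. The following is a minimal sketch using the sample data above; `x_range` is a hypothetical stand-in for the custom per-sample features f and g, which are not specified in the question.

```python
import pandas as pd

# The sample data from the question.
df_samples = pd.DataFrame({
    "sample_id": [0, 0, 1, 1],
    "run_id": [1, 2, 1, 2],
    "x": [1, 4, 1, 7],
    "y": [2, 5, 2, 8],
    "z": [3, 6, 3, 9],
})

# One pass over all samples: each keyword becomes an output column,
# defined as (source column, aggregation function).
features = df_samples.groupby("sample_id").agg(
    avg_x=("x", "mean"),
    avg_y=("y", "mean"),
    avg_z=("z", "mean"),
    x_range=("x", lambda s: s.max() - s.min()),  # hypothetical extra feature
).reset_index()

# If one column per (variable, run) is needed instead, a single pivot_table
# call covers every sample at once.
wide = df_samples.pivot_table(index="sample_id", columns="run_id",
                              values=["x", "y", "z"], aggfunc="mean")
```

Built-in aggregations passed by name, such as "mean", are generally faster than Python lambdas, so it is worth expressing features through them where possible.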