Below is an example of the df I use (sales data). The df is big: several GB of data, a few thousand brands, data for the past 12 months, and hundreds of territories.
index date brand territory value
0 2019-01-01 A 1 63
1 2019-02-01 A 1 91
2 2019-03-01 A 1 139
3 2019-04-01 A 1 80
4 2019-05-01 A 1 149
I want to find outliers for each individual brand across all territories and all dates.
To find outliers within the whole df I can use
outliers = df[(np.abs(stats.zscore(df['value'])) > 3)]
or stats.zscore(df['value']) just to calculate the z-score.
I would like to add a column df['z-score'],
so I thought about something like this, but apparently it doesn't work:
df['z-score'] = df.groupby('brand', as_index=False)['value'].stats.zscore(df['value'])
Use transform
df['z-score'] = df.groupby('brand')['value'].transform(stats.zscore)
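As a minimal sketch of the transform approach, the snippet below computes a per-brand z-score on toy data and then filters on it. The zscore helper, the brand B values and the threshold are assumptions made for illustration; the helper uses the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore, so passing stats.zscore to transform should give the same numbers.

```python
import numpy as np
import pandas as pd

# Toy data (brand A values come from the question, brand B values are made up)
df = pd.DataFrame({
    "brand": ["A"] * 5 + ["B"] * 5,
    "value": [63, 91, 139, 80, 149, 10, 12, 11, 13, 50],
})

def zscore(x):
    # Hypothetical helper: population z-score of one group (ddof=0)
    return (x - x.mean()) / x.std(ddof=0)

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a new column
df["z-score"] = df.groupby("brand")["value"].transform(zscore)

# Rows whose value is far from the mean of their own brand
threshold = 3  # assumed cut-off, same as in the question
outliers = df[df["z-score"].abs() > threshold]
```

Because the z-score is computed inside each group, a value is only flagged relative to other rows of the same brand, not relative to the whole df.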
I have this MultiIndex pandas dataframe:
chamber_temp
month day
1 1 0.000000
2 0.005977
3 0.001439
4 -0.000119
5 0.000514
...
12 27 0.001799
28 0.002346
29 -0.001815
30 0.001102
31 -0.004189
What I want to get is which month has the highest cumsum().
For each month I want a single value: the sum of the values for all days in that month. That is the part I need help with.
You can use the level parameter of Series.sum when there's a MultiIndex, which avoids a groupby in such cases:
df['chamber_temp'].sum(level=0).idxmax()
Please try:
df.groupby('month')['chamber_temp'].sum().idxmax()
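A small sketch of the groupby variant on a MultiIndex, using a few toy rows (only two months, values borrowed from the question). Grouping by the index level name is used here because the level argument of sum was deprecated in pandas 1.3 and is no longer available in pandas 2.x.

```python
import pandas as pd

# Toy frame with a (month, day) MultiIndex
idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (2, 1), (2, 2)], names=["month", "day"]
)
df = pd.DataFrame(
    {"chamber_temp": [0.000000, 0.005977, 0.001439, -0.000119]}, index=idx
)

# One total per month: sum over all days in that month
monthly_total = df.groupby(level="month")["chamber_temp"].sum()

# Label of the month with the largest total
best_month = monthly_total.idxmax()
```

monthly_total holds one value per month, and idxmax returns the month label rather than the sum itself.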
I have a dataframe like the one below.
I am trying to sum the Week 7 and Week 8 SalesQuantity values for every ProductCode [1-317], write the result into the Week 7 row as the new SalesQuantity, and then delete the Week 8 rows from the dataframe.
The Week column ranges over [7-26], and every week includes all product codes [1-317],
because the data shown is the result of a group by on [Week, ProductCode].
Week ProductCode SalesQuantity
7 1 49.285714
7 2 36.714286
7 3 33.285714
7 4 36.857143
7 5 42.714286
... ... ...
8 3 61.000000
26 314 4.285714
26 315 3.571429
26 316 6.142857
26 317 3.285714
Example result: from the above table, adding the Week 7 and Week 8 SalesQuantity values for ProductCode 3 gives 61.000000 + 33.285714 = 94.285714, which becomes the new SalesQuantity for Week 7 of ProductCode 3.
After that, the Week 8 row for ProductCode 3 needs to be deleted.
How can I automate this for all ProductCodes [1-317]?
Thanks
Use the `groupby()` method:
sumSales = data[['ProductCode', 'SalesQuantity']].groupby('ProductCode').sum()
This creates a new DataFrame with the sum of SalesQuantity, indexed by product code. The data[['ProductCode', 'SalesQuantity']] part selects a subset of the original dataframe's columns; otherwise the Week values would also get summed.
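Note that a plain groupby on ProductCode sums over every week. If only Week 7 and Week 8 should be merged, one possible sketch is to relabel Week 8 as Week 7 first and then group by both columns. The rows below are a handful taken from the question; the variable name merged is an assumption.

```python
import pandas as pd

# A few rows from the question's table
data = pd.DataFrame({
    "Week": [7, 7, 8, 26],
    "ProductCode": [3, 4, 3, 314],
    "SalesQuantity": [33.285714, 36.857143, 61.000000, 4.285714],
})

# Relabel Week 8 as Week 7, then sum per (Week, ProductCode).
# The Week 8 rows disappear because they are folded into Week 7.
merged = (
    data.assign(Week=data["Week"].replace(8, 7))
        .groupby(["Week", "ProductCode"], as_index=False)["SalesQuantity"]
        .sum()
)
```

All other weeks pass through unchanged, since each (Week, ProductCode) pair outside weeks 7 and 8 still has a single row.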
I need help with some big pandas issue.
As a lot of people asked for the real input and the real desired output in order to answer the question, here it goes:
So I have the following dataframe:
Date        user  cumulative_num_exercises  total_exercises  %_exercises  %_exercises_accum
2017-01-01  1     2                         7                28,57        28,57
2017-01-01  2     1                         7                14,28        42,85
2017-01-01  4     3                         7                42,85        85,7
2017-01-01  10    1                         7                14,28        100
2017-02-02  1     2                         14               14,28        14,28
2017-02-02  2     3                         14               21,42        35,7
2017-02-02  4     4                         14               28,57        64,27
2017-02-02  10    5                         14               35,71        100
2017-03-03  1     3                         17               17,64        17,64
2017-03-03  2     3                         17               17,64        35,28
2017-03-03  4     5                         17               29,41        64,69
2017-03-03  10    6                         17               35,29        100
- The column %_exercises is (cumulative_num_exercises / total_exercises) * 100.
- The column %_exercises_accum is the running sum of %_exercises within each month. (Note that at the end of each month it reaches the value 100.)
- With this data, I need to calculate the % of users that contributed to 50%, 80% and 90% of the total exercises during each month.
- To do so, I thought of creating a new column called category, which will later be used to count how many users contributed to each of the 3 percentages (50%, 80% and 90%). The category column takes the following values:
0 if the user did a %_exercises_accum = 0.
1 if the user did a %_exercises_accum < 50 and > 0.
50 if the user did a %_exercises_accum = 50.
80 if the user did a %_exercises_accum = 80.
90 if the user did a %_exercises_accum = 90.
And so on, because there are many cases in order to determine who contributes to which percentage of the total number of exercises on each month.
I have already determined all the cases and all the values that must be taken.
Basically, I traverse the dataframe using a for loop, and with two main ifs:
if (df.iloc[i][date] == df.iloc[i][date].shift()):
    calculations to determine the percentage or percentages to which the user from the second to the last row of the same month group contributes
    (because the same user can contribute to all the percentages, or to more than one)
else:
    calculations to determine to which percentage of exercises the first member of each month group contributes.
The calculations involve:
Looking at the value of the category column in the previous row using shift().
Doing while loops inside the for loop, because when a user suddenly reaches a big percentage, we need to go back over the users in the same month and change their category value to 50, as they have contributed to the 50% but didn't reach it. For instance, in this situation:
Date %_exercises_accum
2017-01-01 1,24
2017-01-01 3,53
2017-01-01 20,25
2017-01-01 55,5
The desired output for the given dataframe at the beginning of the question would include the same columns as before (date, user, cumulative_num_exercises, total_exercises, %_exercises and %_exercises_accum) plus the category column, which is the following:
category
50
50
508090
90
50
50
5080
8090
50
50
5080
8090
Note that rows with the values 508090 or 8090 mean that the user is contributing to:
508090: all of 50%, 80% and 90% of the total exercises in a month.
8090: both 80% and 90% of the exercises in a month.
Does anyone know how I can simplify this for loop by traversing the groups of a groupby object?
Thank you very much!
Given no sense of what calculations you wish to accomplish, this is my best guess at what you're looking for. However, I'd re-iterate Datanovice's point that the best way to get answers is to provide a sample output.
You can slice to each unique date using the following code:
import pandas as pd

dates = ['2017-01-01', '2017-01-01','2017-01-01','2017-01-01','2017-02-02','2017-02-02','2017-02-02','2017-02-02','2017-03-03','2017-03-03','2017-03-03','2017-03-03']
df = pd.DataFrame(
    {'date': pd.to_datetime(dates),
     'user': [1,2,4,10,1,2,4,10,1,2,4,10],
     'cumulative_num_exercises': [2,1,3,1,2,3,4,5,3,3,5,6],
     'total_exercises': [7,7,7,7,14,14,14,14,17,17,17,17]}
)
df = df.set_index('date')
# Loop over the unique dates and slice out the rows for each one
for idx in df.index.unique():
    hold = df.loc[idx]
    ### YOUR CODE GOES HERE ###
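The same traversal can also be written with a groupby object, which is what the question asks about. The following is only a sketch: the column names pct_exercises and pct_exercises_accum stand in for %_exercises and %_exercises_accum, and counting the rows needed to reach 50% is one possible reading of the goal, not the asker's full category logic.

```python
import pandas as pd

dates = ["2017-01-01"] * 4 + ["2017-02-02"] * 4 + ["2017-03-03"] * 4
df = pd.DataFrame({
    "date": pd.to_datetime(dates),
    "user": [1, 2, 4, 10] * 3,
    "cumulative_num_exercises": [2, 1, 3, 1, 2, 3, 4, 5, 3, 3, 5, 6],
    "total_exercises": [7] * 4 + [14] * 4 + [17] * 4,
})

# Share of the month's exercises per user, and its running sum within each date
df["pct_exercises"] = df["cumulative_num_exercises"] / df["total_exercises"] * 100
df["pct_exercises_accum"] = df.groupby("date")["pct_exercises"].cumsum()

# Iterating a groupby yields (key, sub-frame) pairs, one per date
users_to_reach_50 = {}
for date, group in df.groupby("date"):
    # Rows still below 50%, plus the row that crosses it (assumed interpretation)
    below = (group["pct_exercises_accum"] < 50).sum()
    users_to_reach_50[date] = int(below) + 1
```

Inside the loop, group is an ordinary DataFrame holding only the rows of one date, so shift() and comparisons against the previous row never leak across months.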
I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data by your Time data.
You can use a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use it for slicing (as suggested in the questions "python pandas dataframe slicing by date conditions" and "Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval").
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you could adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import numpy as np
import pandas as pd
import pandas.tseries.offsets as pto

# Create some test data: one row per millisecond
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)}, index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define the periods that delimit the groups (ticks); periods must be an integer
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())
# Find the positions in the index that match the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))
# Make a dataframe to hold the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])
# Set the group ids; iloc is used because mini and maxi are integer positions
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# Update the original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
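The ticks above are fixed boundaries, whereas the question describes windows that start at the first transaction after the previous window has closed. A rough sketch of that variant follows; assign_window_ids is a hypothetical helper, the date part of the timestamps is made up, and the input is assumed to be sorted by time. Transactions that end up alone in their window get no ID, as in the question's table.

```python
import pandas as pd

def assign_window_ids(times, window):
    # Hypothetical helper: a new window starts at the first time that lies
    # more than `window` after the start (anchor) of the current window.
    raw_ids = []
    current = 0
    anchor = None
    for t in times:
        if anchor is None or t - anchor > window:
            anchor = t
            current += 1
        raw_ids.append(current)
    raw_ids = pd.Series(raw_ids, index=times.index)
    # Keep only windows with more than one transaction, renumbered 1..n
    sizes = raw_ids.map(raw_ids.value_counts())
    kept = raw_ids[sizes > 1]
    renumbered = kept.rank(method="dense").astype(int)
    # Rows outside any kept window become NaN
    return renumbered.reindex(times.index)

clock = ["08:10:02", "08:10:03", "08:10:50", "08:10:55", "08:11:00",
         "08:11:01", "08:11:02", "08:11:03", "08:11:04", "08:15:00"]
tx = pd.DataFrame({
    "Transaction ID": range(1, 11),
    "Time": pd.to_datetime(["2014-11-03 " + c for c in clock]),
})
tx["Grouped ID"] = assign_window_ids(tx["Time"], pd.Timedelta(seconds=2))
```

For millisecond-level grouping the same call should work with pd.Timedelta(milliseconds=50) as the window. The explicit Python loop is simple but not vectorized, so it may be slow on very large frames.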