Below is an example of the df I use (sales data). The df is big: several GB of data, a few thousand brands, data for the past 12 months, and hundreds of territories.
index date brand territory value
0 2019-01-01 A 1 63
1 2019-02-01 A 1 91
2 2019-03-01 A 1 139
3 2019-04-01 A 1 80
4 2019-05-01 A 1 149
I want to find outliers for each individual brand, across all territories and all dates.
To find outliers within the whole df I can use
outliers = df[(np.abs(stats.zscore(df['value'])) > 3)]
or stats.zscore(df['value']) just to calculate the z-score.
I would like to add a column df['z-score'],
so I thought about something like this, but apparently it doesn't work:
df['z-score'] = df.groupby('brand', as_index=False)['value'].stats.zscore(df['value'])
Use transform
df['z-score'] = df.groupby('brand')['value'].transform(stats.zscore)
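For completeness, here is a minimal self-contained sketch of that approach (the data below is made up and only mirrors the shape of the example; the |z| > 3 threshold is the one used in the question):

import numpy as np
import pandas as pd
from scipy import stats

# Made-up data in the same shape as the example above
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01', '2019-05-01'] * 2),
    'brand': ['A'] * 5 + ['B'] * 5,
    'territory': 1,
    'value': [63, 91, 139, 80, 149, 10, 12, 11, 40, 13],
})

# z-score of 'value' computed separately within each brand
df['z-score'] = df.groupby('brand')['value'].transform(stats.zscore)

# rows whose value is an outlier within its own brand
outliers = df[np.abs(df['z-score']) > 3]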
I've got a dataframe with date which resembles the following:
df =
ID date
124 2022-03-14
34 2022-03-14
66 2022-03-14
2 2022-03-15
91 2022-03-16
20 2022-03-16
I'm trying to add a new column to the df which would state the number of records we have for each day. So the output should look like this:
ID date daily_records
124 2022-03-14 3
34 2022-03-14 3
66 2022-03-14 3
2 2022-03-15 1
91 2022-03-16 2
20 2022-03-16 2
I've tried using:
df['daily_records'] = df.groupby('date').count()
But it just returns NaN values. How could I get my desired outcome?
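A minimal sketch of one common way to get the daily_records column (the frame below mirrors the example; groupby(...).count() comes back indexed by date rather than by the original row labels, so direct assignment fails to align, whereas transform keeps the original row alignment):

import pandas as pd

df = pd.DataFrame({
    'ID': [124, 34, 66, 2, 91, 20],
    'date': ['2022-03-14', '2022-03-14', '2022-03-14', '2022-03-15', '2022-03-16', '2022-03-16'],
})

# 'size' counts the rows in each date group; transform broadcasts that count back to every row
df['daily_records'] = df.groupby('date')['date'].transform('size')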
I have a pandas DataFrame with the following data:
df1[['interval','answer']]
interval answer
0 0 days 06:19:17.767000 no
1 0 days 00:26:35.867000 no
2 0 days 00:29:12.562000 no
3 0 days 01:04:36.362000 no
4 0 days 00:04:28.746000 yes
5 0 days 02:56:56.644000 yes
6 0 days 00:20:13.600000 no
7 0 days 02:31:17.836000 no
8 0 days 02:33:44.575000 no
9 0 days 00:08:08.785000 no
10 0 days 03:48:48.183000 no
11 0 days 00:22:19.327000 no
12 0 days 00:05:05.253000 question
13 0 days 01:08:01.338000 unsubscribe
14 0 days 15:10:30.503000 no
15 0 days 11:09:05.824000 no
16 1 days 12:56:07.526000 no
17 0 days 18:10:13.593000 no
18 0 days 02:25:56.299000 no
19 2 days 03:54:57.715000 no
20 0 days 10:11:28.478000 no
21 0 days 01:04:55.025000 yes
22 0 days 13:59:40.622000 yes
The format of the df is:
id object
datum datetime64[ns]
datum2 datetime64[ns]
answer object
interval timedelta64[ns]
dtype: object
As a result, the boxplot looks like the attached screenshot (image not reproduced here).
Any idea?
Any help is appreciated...
Robert
Seaborn may help you achieve what you want.
First of all, one needs to make sure the columns are of the type one wants.
In order to recreate your problem, I created the same dataframe (and gave it the same name, df1). Here one can see the data types of the columns:
[In]: df1.dtypes
[Out]:
interval object
answer object
dtype: object
For the column "answer", one can use pandas.factorize as follows:
df1['NewAnswer'] = pd.factorize(df1['answer'])[0] + 1
That will create a new column and assign the values 1 to "no", 2 to "yes", 3 to "question", and 4 to "unsubscribe" (in order of first appearance).
With this, one can already create a box plot using sns.boxplot:
ax = sns.boxplot(x="interval", y="NewAnswer", hue="answer", data=df1)
which results in the following boxplot (image not reproduced here).
There are many possible combinations, so I will leave it at these, as the OP didn't specify their requirements nor give an example of the expected output.
Notes:
Make sure you have the required libraries installed.
There may be other visualizations that would work better with this dataframe; the seaborn gallery has examples to browse.
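As a runnable sketch of this general approach (the tiny df1 below is only a stand-in for the question's data; parsing the intervals with pd.to_timedelta and plotting the interval length in hours per answer category are my own assumptions, not necessarily what was done originally):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for the question's df1: a few rows with the same column names
df1 = pd.DataFrame({
    'interval': pd.to_timedelta(['06:19:17', '00:26:35', '01:04:36', '00:04:28',
                                 '02:56:56', '00:05:05', '01:08:01']),
    'answer': ['no', 'no', 'no', 'yes', 'yes', 'question', 'unsubscribe'],
})

# Encode the answers as integers, as in the snippet above
df1['NewAnswer'] = pd.factorize(df1['answer'])[0] + 1

# Timedeltas are awkward to plot directly, so convert to hours first
df1['interval_hours'] = df1['interval'].dt.total_seconds() / 3600

ax = sns.boxplot(x='answer', y='interval_hours', data=df1)
plt.show()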
I have a dataframe/series containing hourly sampled data over a couple of years. I'd like to sum the values for each month, then calculate the mean of those monthly totals over all the years.
I can get a multi-index dataframe/series of the totals using:
df.groupby([df.index.year, df.index.month]).sum()
Date & Time Date & Time
2016 3 220.246292
4 736.204574
5 683.240291
6 566.693919
7 948.116766
8 761.214823
9 735.168033
10 771.210572
11 542.314915
12 434.467037
2017 1 728.983901
2 639.787918
3 709.944521
4 704.610437
5 685.729297
6 760.175060
7 856.928659
But I don't know how to then combine the data to get the means.
I might be totally off on the wrong track too. Also not sure I've labelled the question very well.
I think you need the mean per year - so per the first level:
df.groupby([df.index.year, df.index.month]).sum().mean(level=0)
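Note that DataFrame.mean(level=...) has been deprecated and removed in newer pandas versions; the same per-year mean can be written with an explicit groupby on the first index level:

df.groupby([df.index.year, df.index.month]).sum().groupby(level=0).mean()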
You can use groupby twice, once to get the monthly sum, once to get the mean of monthly sum:
(df.groupby(pd.Grouper(freq='M')).sum()
.groupby(pd.Grouper(freq='Y')).mean()
)
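A self-contained sketch of that two-step version, using made-up hourly data in place of the original series (the 'H', 'M' and 'Y' aliases still work in current pandas, though recent versions prefer 'h', 'ME' and 'YE'):

import numpy as np
import pandas as pd

# Made-up hourly data covering parts of two years
idx = pd.date_range('2016-03-01', '2017-07-31 23:00', freq='H')
s = pd.Series(np.random.default_rng(0).random(len(idx)), index=idx)

monthly_totals = s.groupby(pd.Grouper(freq='M')).sum()              # one total per calendar month
yearly_means = monthly_totals.groupby(pd.Grouper(freq='Y')).mean()  # mean of those monthly totals per year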
I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row, get these rows from the dataframe:
those that match on the group id, and...
those that have a beginning date within the last 3 years of this row's start date, and...
those that have an ending date before this row's beginning date.
Sum up those rows' GAP numbers, add this row's GAP number, and append the result to the list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
                  data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
                        ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
                        ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
                        ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
and here's what I wrote at the top of the script, which may help:
The purpose of this script is to extract gap counts over the last 3-year period. It uses gaps.sql as its source extract. This query returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The Python code then looks back over the previous 3 years (those previous rows that have the same GROUP_ID, whose beginning dates come on or after the current row's THREE_YEAR_AGO, and whose end dates come on or before the current row's beginning date). Those rows' GAP values are added up, the row's own GAP is included, and a new column called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row ID_NBR 11 has a value of 79 for the last 3 years, but ID_NBR 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date of 2014.
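One possible merge-based sketch of this logic (a sketch only, assuming the example frame constructed above; a self-merge on GROUP_ID replaces the Python-level loop, though the per-group cross join can itself get large):

# Sketch: replace the row-by-row loop with a self-merge on GROUP_ID.
left = df.reset_index()  # keep each row's position as a join key
pairs = left.merge(df, on='GROUP_ID', suffixes=('', '_other'))

# Same conditions as inside the loop, applied to the "other" rows
in_window = ((pairs['BEG_DATE_other'] >= pairs['THREE_YEAR_AGO']) &
             (pairs['END_DATE_other'] <= pairs['BEG_DATE']))

# Sum the matching GAPs per original row, then add the row's own GAP
gap_sums = pairs[in_window].groupby('index')['GAP_other'].sum()
df['GAP_THREE'] = df.index.to_series().map(gap_sums).fillna(0).astype(int) + df['GAP']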
I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data with your Time data.
You can use a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use it for slicing (as suggested in the questions "python pandas dataframe slicing by date conditions" and "Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval").
I'm using pandas 0.14.1 and the DateOffset objects (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you might adapt the code. DateOffset goes down to the microsecond level.
Using the following code:
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np

# Create some test data: one row per millisecond
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)},
                  index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))

# Define the periods that delimit the groups (ticks), here one tick every 5 ms
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())

# Find the nearest positions in the index matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))

# Make a dataframe that will hold the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])

# Set the group ids (positional slicing, since searchsorted returns positions)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i

# Update the original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
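To adapt the same fixed-tick scheme to the 2-second windows from the question, only the tick frequency would need to change, for example (a sketch, assuming the real data's DatetimeIndex and the pto import from above):

ticks = pd.date_range(df.index.min(), df.index.max(), freq=2 * pto.Second())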