Group By Having Count in Pandas - python

Here is my data:
{'SystemID': {0: '95EE8B57',
1: '5F891F03',
2: '5F891F03',
3: '5F891F03'},
'Day': {0: '06/08/2018', 1: '05/08/2018', 2: '04/08/2018', 3: '05/08/2018'},
'AlarmClass-S': {0: 4, 1: 2, 2: 4, 3: 0},
'AlarmClass-ELM': {0: 0, 1: 0, 2: 0, 3: 2}}
I would like to perform an aggregation and filtering which in SQL would be formulated as
SELECT SystemID, COUNT(*) as count FROM table GROUP BY SystemID HAVING COUNT(*) > 2
Thus the result shall be
{'SystemID': {0: '5F891F03'},
'count': {0: '3'}}
How to do this in pandas?

You can use groupby and count, then filter at the end. Note that renaming via a dict in agg (.agg({'count': 'count'})) was deprecated in pandas 0.25 and removed in 1.0, so use named aggregation instead:
(df.groupby('SystemID', as_index=False)['SystemID']
 .agg(count='count')
 .query('count > 2'))
SystemID count
0 5F891F03 3
(df.groupby('SystemID', as_index=False)['SystemID']
 .agg(count='count')
 .query('count > 2')
 .to_dict())
# {'SystemID': {0: '5F891F03'}, 'count': {0: 3}}
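For reference, groupby.size is an even closer analog of SQL's COUNT(*), since it counts rows per group regardless of NaNs. A minimal sketch using the df built from the dict above:
out = (df.groupby('SystemID')
       .size()                      # rows per group, like COUNT(*)
       .reset_index(name='count')   # turn the group key back into a column
       .query('count > 2'))         # the HAVING clause
out.to_dict()
# {'SystemID': {0: '5F891F03'}, 'count': {0: 3}}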

Related

Fixing column names and renaming them after grouping the dataframe by two columns

I have a dataframe:
{'ARTICLE_ID': {0: 111, 1: 111, 2: 222, 3: 222, 4: 222}, 'CITEDIN_ARTICLE_ID': {0: 11, 1: 11, 2: 11, 3: 22, 4: 22}, 'enrollment': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}, 'Trial_year': {0: 2017, 1: 2017, 2: 2017, 3: 2017, 4: 2017}, 'AUTHOR_ID': {0: 'aaa', 1: 'aaa', 2: 'aaa', 3: 'aaa', 4: 'aaa'}, 'AUTHOR_RANK': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
I am grouping it by two columns
df_grouped = df.groupby(['AUTHOR_ID', 'Trial_year']).agg({'ARTICLE_ID': "count",
'enrollment': ["count", 'sum']}).reset_index()
As a result, I receive this dataframe, where column names have two levels
{('AUTHOR_ID', ''): {0: 'aaa'}, ('Trial_year', ''): {0: 2017}, ('ARTICLE_ID', 'count'): {0: 5}, ('enrollment', 'count'): {0: 5}, ('enrollment', 'sum'): {0: 50}}
My ideal output - the dataframe with one level of column names and renamed column names
`AUTHOR_ID`, `Trial_year`, `ARTICLE_ID_count`, `enrollment_count`, `enrollment_sum`
You can modify the columns:
df_grouped.columns = [f"{i}_{j}" if j!='' else i for i,j in df_grouped.columns]
or use NamedAgg from the beginning:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'])
.agg(ARTICLE_ID_count=('ARTICLE_ID', "count"),
enrollment_count=('enrollment','count'),
enrollment_sum=('enrollment','sum')).reset_index())
You can also unpack a dictionary into groupby.agg for slightly more concise code:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
.agg(**{'_'.join(pair): pair for pair in [('ARTICLE_ID', 'count'),
('enrollment','count'),
('enrollment','sum')]}))
Output:
AUTHOR_ID Trial_year ARTICLE_ID_count enrollment_count enrollment_sum
0 aaa 2017 5 5 50

In python pandas, count the integers in a particular column and also count all the elements in particular column

There is a huge df with multiple columns, but I want to work with only one specific column.
In the data below, I am interested only in the column 'Type 1'.
import numpy as np
import pandas as pd
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 'HH', 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'np.NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df = pd.DataFrame(data)
df
int_count = df['Type 1'].count(0, numeric_only=True)    # should count only cells that contain integers and return 8
total_count = df['Type 1'].count(0, numeric_only=False) # should count all the cells and return 9
I want to count only the numeric values in a particular column,
e.g. df['Type 1'].count(0, numeric_only=True) should return 8 (excluding the string 'HH' in the Type 1 column), and
df['Type 1'].count(0, numeric_only=False) should return 9 (the total number of cells in the column),
but df['Type 1'].count(0, numeric_only=True/False) does not work the way I expect...
I would suggest the below:
int_count = len(df.loc[df['Type 1'].astype(str).str.isnumeric()])
total_count = len(df)
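Note that str.isnumeric matches only non-negative integer strings. If the column could also hold negatives or floats, a sketch using pd.to_numeric may be more robust (errors='coerce' turns anything non-numeric into NaN):
int_count = pd.to_numeric(df['Type 1'], errors='coerce').notna().sum()  # 8: 'HH' becomes NaN and is dropped
total_count = df['Type 1'].size                                         # 9: every cell, numeric or not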

Replace value with different value for multi level column in pandas

df = pd.DataFrame({('Quarter', 'Range'): {0: 'A', 1: 'B'}, ('Q1(0.25)', 'Low'): {0: 0, 1: 0}, ('Q1(0.25)', 'High'): {0: 10, 1: 630}, ('Q2(0.5)', 'Low'): {0: 10, 1: 630}, ('Q2(0.5)', 'High'): {0: 50, 1: 3000}, ('Q3(0.75)', 'Low'): {0: 50, 1: 3000}, ('Q3(0.75)', 'High'): {0: 100, 1: 8500}, ('Q4(1.0)', 'Low'): {0: 100, 1: 8500}, ('Q4(1.0)', 'High'): {0: 'np.inf', 1: 'np.inf'}})
Given the dataframe above, if the value for A is between 0 and 10, it should be replaced with 0.25; if it is between 10 and 50, with 0.5; and similarly for all the other ranges and rows.
Expected output:
If the value is 12 for A and 3210 for B:
df2 = pd.DataFrame({'Column': {0: 'A', 1: 'B'}, 'Prob': {0: 0.5, 1: 0.75}})
How to do it?
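A minimal sketch of one possible approach (the helper prob below is hypothetical): parse the probability out of the quarter column name, and first convert the 'np.inf' strings to real infinities so they compare as numbers:
import numpy as np
import pandas as pd

df = df.replace('np.inf', np.inf)  # the literal strings cannot be compared with numbers

def prob(row_label, value):
    # select the row for 'A' or 'B' and scan its four [Low, High) ranges
    row = df[df[('Quarter', 'Range')] == row_label].iloc[0]
    for q in ['Q1(0.25)', 'Q2(0.5)', 'Q3(0.75)', 'Q4(1.0)']:
        if row[(q, 'Low')] <= value < row[(q, 'High')]:
            return float(q[q.index('(') + 1:-1])  # 'Q2(0.5)' -> 0.5
    return np.nan

df2 = pd.DataFrame({'Column': ['A', 'B'], 'Prob': [prob('A', 12), prob('B', 3210)]})
#   Column  Prob
# 0      A  0.50
# 1      B  0.75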

How do I extract these SQL queries from these pandas dataframes?

SQL QUERIES
customers = pd.DataFrame({'customer_id': {0: 5386596, 1: 32676876}, 'created_at': {0: pd.Timestamp('2017-01-27 00:00:00'), 1: pd.Timestamp('2018-06-07 00:00:00')}, 'venture_code': {0: 'MY', 1: 'ID'}})
visits = pd.DataFrame({'customer_id': {0: 3434886, 1: 10053}, 'date': {0: pd.Timestamp('2016-10-02 00:00:00'), 1: pd.Timestamp('2017-12-14 00:00:00')}})
orders = pd.DataFrame({'order_id': {0: 112525, 1: 112525}, 'date': {0: pd.Timestamp('2019-02-01 00:00:00'), 1: pd.Timestamp('2019-02-01 00:00:00')}, 'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'customer_id': {0: 46160566, 1: 46160566}})
products = pd.DataFrame({'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'brand': {0: 1, 1: 1}, 'supplier': {0: 'A', 1: 'B'}, 'category': {0: 'Mapp', 1: 'Macc'}, 'price': {0: 15, 1: 45}})
segment = pd.DataFrame({'Age Range': {0: '<20', 1: '<20'},
'Gender': {0: 'female', 1: 'female'},
'Category': {0: 'Wsho', 1: 'Wapp'},
'Discount %': {0: 0.246607432, 1: 0.174166503},
'NMV': {0: 2509.580375, 1: 8910.447587},
'# Items': {0: 169, 1: 778},
'# Orders': {0: 15, 1: 135}})
buying = pd.DataFrame({'Supplier Name': {0: 'A', 1: 'A'},
'Brand Name': {0: 1, 1: 2},
'# SKU': {0: 506, 1: 267},
'# Item Before Return': {0: 5663, 1: 3256},
'# Item Returned': {0: 2776, 1: 1395},
'Margin %': {0: 0.266922793, 1: 0.282847894},
'GMV': {0: 191686.749171408, 1: 115560.037075292}})
Using SQL or Pandas, please tell me how to
1. Compare the monthly sales (GMV) trend in Q4 2019, across all countries (venture_code)
2. Show the top 10 brands for each product category, based on total sales (GMV)
I wrote the following, but got the query wrong!
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name]
Concerning the error: in SQL, if a column name contains a space, you must wrap it in brackets:
MyTable.[My Column]
Your query also aggregates with SUM, so it needs a GROUP BY clause:
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name] GROUP BY category
I don't have access to your data, so I can't test, but I think these queries are correct. You may need to tweak them some.
Part 1:
select c.venture_code, month(o.date) as sales_month, sum(b.GMV) GMVSum
from customers c
join orders o on c.customer_id = o.customer_id
join products p on o.sku = p.sku
join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]
where o.date >= '2019-10-01' and o.date <= '2019-12-31' -- 2019 4th qtr
group by c.venture_code, month(o.date)
Part 2:
select * from
(select p.category, p.brand, sum(b.GMV) GMVSum,
        rank() over (partition by p.category order by sum(b.GMV) desc) rk
 from products p
 join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]
 group by p.category, p.brand) x
where rk <= 10
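If you would rather do Part 2 in pandas, here is a rough sketch under the same join assumptions (merged, totals, and top10 are hypothetical names):
merged = products.merge(buying,
                        left_on=['brand', 'supplier'],
                        right_on=['Brand Name', 'Supplier Name'])
# total GMV per (category, brand), then rank brands within each category
totals = merged.groupby(['category', 'brand'], as_index=False)['GMV'].sum()
totals['rk'] = totals.groupby('category')['GMV'].rank(method='dense', ascending=False)
top10 = totals[totals['rk'] <= 10]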

Pandas: population new columns from other column's values

I have a pandas DataFrame of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The column 'field' contains the name of the reporting field (i.e. the indicator), and the column 'value' contains the value of that indicator. The other columns describe the SEC filing (ticker + date + fiscal periods form a unique key for a filing). There are about 60-70 indicators per filing (the number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the features (say N of them for one submission). But the length of this dataframe also equals the number of indicators N, with NaN everywhere off the diagonal.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from 'field' as new columns (going from narrow to wide format). In the end I want a dataframe with one row per SEC report.
So the expected output for the provided example is a 1-row dataframe with N new columns, where N is the number of unique indicators in the 'field' column of the initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns from the 'field'/'value' pairs, or alternatively to clean up the multiple NaNs in the resulting dataframe?
What worked for me was setting a new index on the DataFrame and then unstacking the 'field' level:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
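Note that the unstacked frame still carries two column levels (and the 'Unnamed: 0' helper column is unstacked along with 'value'). A sketch of a flat variant, selecting only 'value' before unstacking; the cik_reported rename is a hypothetical way to avoid a clash with the index level of the same name:
meta = ['ticker', 'cik', 'fiscalyear', 'fiscalquarter', 'dcn', 'receiveddate', 'periodenddate']
wide = (Adf.set_index(meta + ['field'])['value']     # keep only 'value' so unstack yields flat columns
          .unstack()
          .rename(columns={'cik': 'cik_reported'})   # 'cik' already exists as an index level
          .reset_index())
wide.columns.name = None  # drop the leftover 'field' label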
