SQL QUERIES
import pandas as pd
from pandas import Timestamp

customers = pd.DataFrame({'customer_id': {0: 5386596, 1: 32676876}, 'created_at': {0: Timestamp('2017-01-27 00:00:00'), 1: Timestamp('2018-06-07 00:00:00')}, 'venture_code': {0: 'MY', 1: 'ID'}})
visits = pd.DataFrame({'customer_id': {0: 3434886, 1: 10053}, 'date': {0: Timestamp('2016-10-02 00:00:00'), 1: Timestamp('2017-12-14 00:00:00')}})
orders = pd.DataFrame({'order_id': {0: 112525, 1: 112525}, 'date': {0: Timestamp('2019-02-01 00:00:00'), 1: Timestamp('2019-02-01 00:00:00')}, 'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'customer_id': {0: 46160566, 1: 46160566}})
products = pd.DataFrame({'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'brand': {0: 1, 1: 1}, 'supplier': {0: 'A', 1: 'B'}, 'category': {0: 'Mapp', 1: 'Macc'}, 'price': {0: 15, 1: 45}})
segment = pd.DataFrame({'Age Range': {0: '<20', 1: '<20'},
                        'Gender': {0: 'female', 1: 'female'},
                        'Category': {0: 'Wsho', 1: 'Wapp'},
                        'Discount %': {0: 0.246607432, 1: 0.174166503},
                        'NMV': {0: 2509.580375, 1: 8910.447587},
                        '# Items': {0: 169, 1: 778},
                        '# Orders': {0: 15, 1: 135}})
buying = pd.DataFrame({'Supplier Name': {0: 'A', 1: 'A'},
                       'Brand Name': {0: 1, 1: 2},
                       '# SKU': {0: 506, 1: 267},
                       '# Item Before Return': {0: 5663, 1: 3256},
                       '# Item Returned': {0: 2776, 1: 1395},
                       'Margin %': {0: 0.266922793, 1: 0.282847894},
                       'GMV': {0: 191686.749171408, 1: 115560.037075292}})
Using SQL or Pandas, please tell me how to:
1. Compare the monthly sales (GMV) trend in Q4 2019, across all countries (venture_code)
2. Show the top 10 brands for each product category, based on total sales (GMV)
I wrote this, but got the query wrong!
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name]
Concerning the error: you have a space in the column name.
In SQL, if a column name contains a space, wrap it in brackets (SQL Server/Access syntax; standard SQL uses double quotes and MySQL uses backticks):
MyTable.[My Column]
In your code, use this SQL (I also added the GROUP BY, which is needed because category appears alongside an aggregate):
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name] GROUP BY category
I don't have access to your data, so I can't test, but I think these queries are correct. You may need to tweak them some.
Part 1:
select c.venture_code, month(o.date) mth, sum(b.GMV) GMVSum from customers c join orders o on c.customer_id = o.customer_id
join products p on o.sku = p.sku
join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]
where o.date >= '2019-10-01' and o.date <= '2019-12-31' -- 2019 4th qtr
group by c.venture_code, month(o.date) -- grouping by month gives the monthly trend
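If you prefer pandas for Part 1, here is a rough equivalent over the frames defined at the top (a sketch, untested against the real data; column names are taken from those frames):
# Join orders -> customers -> products -> buying, then sum GMV per
# venture and month, restricted to Q4 2019.
m = (orders.merge(customers, on='customer_id')
           .merge(products, on='sku')
           .merge(buying, left_on=['brand', 'supplier'],
                  right_on=['Brand Name', 'Supplier Name']))
q4 = m[(m['date'] >= '2019-10-01') & (m['date'] <= '2019-12-31')]
trend = (q4.assign(month=q4['date'].dt.to_period('M'))
           .groupby(['venture_code', 'month'])['GMV'].sum()
           .reset_index())
print(trend)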
Part 2:
select * from
(select *, RANK() over (PARTITION BY category order by GMV desc) rk from
(select p.brand, p.category, sum(b.GMV) GMV from products p join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]
group by p.brand, p.category) x) xx
where rk <= 10
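And a pandas sketch of Part 2, under the same assumptions:
# Total GMV per (category, brand), then keep the 10 highest-GMV brands
# within each category.
pb = products.merge(buying, left_on=['brand', 'supplier'],
                    right_on=['Brand Name', 'Supplier Name'])
totals = pb.groupby(['category', 'brand'], as_index=False)['GMV'].sum()
top10 = (totals.sort_values('GMV', ascending=False)
               .groupby('category')
               .head(10))
print(top10)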
I am building a program to track my employees. I have a CSV where I keep track of information. I'm trying to loop through and print out the rows of employees that don't have an end_date, i.e. those that are still working.
I have been able to get the correct rows to print, but the formatting is not how I'm hoping: I want the output in rows.
Here is an example of my CSV:
employee_id,name,address,Phone,date_of_birth,job_title,start_date,end_date
1,Arya,New York,1234567890,1/1/1970,lecturer,1/1/2021,10/20/2022
2,Terri,New York,25151521,010109,Nurse,10/10/2022,
42,Bill,New York,2314,09/10/1994,Teacher,10/14/2022,
48,Steve,New York,454554,08/10/1994,Teacher,02/25/2022,
9,Stephen,New York,526415252,10/08/1994,Teacher,10/15/2022,N/A
Here is the program that I'm running:
import pandas as pd

df2 = pd.read_csv('employees.csv')
print()
for index, row in df2.iterrows():
    if len(str(row['end_date'])) <= 3:  # NaN stringifies to 'nan' (3 chars)
        print(df2.loc[index])
    else:
        continue
print()
This printout looks like the following for each matching row (one block per row):
employee_id 8
name Bill
address New York
phone 25235
date_of_birth 081019
job_title Engineer
start_date 081019
end_date NaN
Name: 2, dtype: object
However, I want the printout to look like the original CSV, but only showing the rows for people without values in the 'end_date' column, like this:
employee_id,name,address,Phone,date_of_birth,job_title,start_date,end_date
2,Terri,New York,25151521,010109,Nurse,10/10/2022,
42,Bill,New York,2314,09/10/1994,Teacher,10/14/2022,
48,Steve,New York,454554,08/10/1994,Teacher,02/25/2022,
I don't want to use df.drop because I want to keep a record of everyone.
This should work.
import pandas as pd
import numpy as np

df = pd.DataFrame({'employee_id': {0: 1, 1: 2, 2: 42, 3: 48, 4: 9}, 'name': {0: 'Arya', 1: 'Terri', 2: 'Bill', 3: 'Steve', 4: 'Stephen'}, 'address': {0: 'New York', 1: 'New York', 2: 'New York', 3: 'New York', 4: 'New York'}, 'Phone': {0: '1234567890', 1: '25151521', 2: '2314', 3: '454554', 4: '526415252'}, 'date_of_birth': {0: '1/1/1970', 1: '010109', 2: '09/10/1994', 3: '08/10/1994', 4: '10/08/1994'}, 'job_title': {0: 'lecturer', 1: 'Nurse', 2: 'Teacher', 3: 'Teacher', 4: 'Teacher'}, 'start_date': {0: '1/1/2021', 1: '10/10/2022', 2: '10/14/2022', 3: '02/25/2022', 4: '10/15/2022'}, 'end_date': {0: '10/20/2022', 1: '', 2: '', 3: '', 4: 'N/A'}})
# Normalize the "missing" markers, then keep only rows with no end_date.
df['end_date'] = df['end_date'].replace(['N/A', ''], np.nan)
print(df[df['end_date'].isna()])
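If you also want the filtered frame printed back in the same shape as your CSV, to_csv can render it (a small sketch):
# Render the still-employed rows as CSV text, header included.
print(df[df['end_date'].isna()].to_csv(index=False))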
Also, a simpler way to get your vertical printout for each employee:
for e in df['employee_id']:
    print(df[df['employee_id'] == e].transpose())
#output
employee_id 1
name Arya
address New York
Phone 1234567890
date_of_birth 1/1/1970
job_title lecturer
start_date 1/1/2021
end_date 10/20/2022
I'm trying to clean up a dataframe by merging the columns on a multi-index so all values in columns that belong to the same first-level index appear in one column.
From this: the survey-style frame built from the dict below, with a two-level column index.
To this: a frame where each first-level question becomes one combined, pipe-joined column.
I was doing it manually by defining each column and joining them like this:
df['Subjects'] = df['Which of the following subjects are you taking this semester?'].apply(lambda x: '|'.join(x.dropna()), axis = 1)
df.drop('Which of the following subjects are you taking this semester?', axis = 1, level = 0, inplace = True)
The problem is I have a large dataframe with many more columns than this, so I was wondering if there is a way to do this dynamically for all columns instead of copying this code and defining each column individually?
import numpy as np
import pandas as pd

data = {('Name', ''): {0: 'Jane', 1: 'John', 2: 'Lisa', 3: 'Michael'},
        ('Location', ''): {0: 'Houston', 1: 'LA', 2: 'LA', 3: 'Dallas'},
        ('Which of the following subjects are you taking this semester?', 'Math'): {0: 'Math', 1: 'Math', 2: np.nan, 3: 'Math'},
        ('Which of the following subjects are you taking this semester?', 'Science'): {0: 'Science', 1: np.nan, 2: np.nan, 3: 'Science'},
        ('Which of the following subjects are you taking this semester?', 'Art'): {0: np.nan, 1: 'Art', 2: 'Art', 3: np.nan},
        ('Which of the following electronic devices do you own?', 'Laptop'): {0: 'Laptop', 1: 'Laptop', 2: 'Laptop', 3: 'Laptop'},
        ('Which of the following electronic devices do you own?', 'Phone'): {0: 'Phone', 1: 'Phone', 2: 'Phone', 3: 'Phone'},
        ('Which of the following electronic devices do you own?', 'TV'): {0: np.nan, 1: 'TV', 2: np.nan, 3: np.nan},
        ('Which of the following electronic devices do you own?', 'Tablet'): {0: 'Tablet', 1: np.nan, 2: 'Tablet', 3: np.nan},
        ('Age', ''): {0: 24, 1: 20, 2: 19, 3: 29},
        ('Which Social Media Platforms Do You Use?', 'Instagram'): {0: np.nan, 1: 'Instagram', 2: 'Instagram', 3: 'Instagram'},
        ('Which Social Media Platforms Do You Use?', 'Facebook'): {0: 'Facebook', 1: 'Facebook', 2: np.nan, 3: np.nan},
        ('Which Social Media Platforms Do You Use?', 'Tik Tok'): {0: np.nan, 1: 'Tik Tok', 2: 'Tik Tok', 3: np.nan},
        ('Which Social Media Platforms Do You Use?', 'LinkedIn'): {0: 'LinkedIn', 1: 'LinkedIn', 2: np.nan, 3: np.nan}}
df = pd.DataFrame(data)
You can try this:
df.T.groupby(level=0).agg(list).T
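Note this keeps the NaNs inside each list (and groupby sorts the first-level labels). A variant that drops the NaNs and pipe-joins what's left, as a sketch (it also stringifies scalar columns such as Age):
df.T.groupby(level=0).agg(lambda c: '|'.join(str(v) for v in c if pd.notna(v))).T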
You can use melt as a starting point to flatten your dataframe, filter out the NaN values, then pivot_table to reshape it:
pat = r'(subjects|electronic devices|Social Media Platforms)'
cols = ['Name', 'Location', 'Age']
out = df.droplevel(1, axis=1).melt(cols, ignore_index=False).query('value.notna()')
out['variable'] = out['variable'].str.extract(pat, expand=False).str.title()
out = out.reset_index().pivot_table('value', ['index'] + cols, 'variable', aggfunc='|'.join) \
.reset_index(cols).rename_axis(index=None, columns=None)
Output:
>>> out
Name Location Age Electronic Devices Social Media Platforms Subjects
0 Jane Houston 24 Laptop|Phone|Tablet Facebook|LinkedIn Math|Science
1 John LA 20 Laptop|Phone|TV Instagram|Facebook|Tik Tok|LinkedIn Math|Art
2 Lisa LA 19 Laptop|Phone|Tablet Instagram|Tik Tok Art
3 Michael Dallas 29 Laptop|Phone Instagram Math|Science
df = pd.DataFrame({('Quarter', 'Range'): {0: 'A', 1: 'B'}, ('Q1(0.25)', 'Low'): {0: 0, 1: 0}, ('Q1(0.25)', 'High'): {0: 10, 1: 630}, ('Q2(0.5)', 'Low'): {0: 10, 1: 630}, ('Q2(0.5)', 'High'): {0: 50, 1: 3000}, ('Q3(0.75)', 'Low'): {0: 50, 1: 3000}, ('Q3(0.75)', 'High'): {0: 100, 1: 8500}, ('Q4(1.0)', 'Low'): {0: 100, 1: 8500}, ('Q4(1.0)', 'High'): {0: np.inf, 1: np.inf}})
Given the above dataframe: if the value for column A is between 0 and 10, replace it with 0.25; if it is between 10 and 50, replace it with 0.5. Similarly, this has to be repeated for all ranges and columns.
Expected output, if the value is 12 for column A and 3210 for column B:
df2 = pd.DataFrame({'Column': {0: 'A', 1: 'B'}, 'Prob': {0: 0.5, 1: 0.75}})
How to do it?
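A minimal sketch of one approach, assuming each Low/High pair defines a half-open bin [Low, High) and the probability to assign is the number embedded in the quarter label (prob_for and lookups are illustrative names, not from the question):
import numpy as np

rows = df.set_index(('Quarter', 'Range'))  # df as defined above

def prob_for(column, value):
    # Walk the quarter bins for one row; the probability is encoded in
    # the column label itself, e.g. 'Q2(0.5)' -> 0.5.
    row = rows.loc[column]
    for q in ['Q1(0.25)', 'Q2(0.5)', 'Q3(0.75)', 'Q4(1.0)']:
        if row[(q, 'Low')] <= value < row[(q, 'High')]:
            return float(q[q.index('(') + 1:-1])
    return np.nan

lookups = {'A': 12, 'B': 3210}  # the example inputs from the question
df2 = pd.DataFrame({'Column': list(lookups),
                    'Prob': [prob_for(k, v) for k, v in lookups.items()]})
print(df2)  # A -> 0.5, B -> 0.75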
Here is my data:
{'SystemID': {0: '95EE8B57',
1: '5F891F03',
2: '5F891F03',
3: '5F891F03'},
'Day': {0: '06/08/2018', 1: '05/08/2018', 2: '04/08/2018', 3: '05/08/2018'},
'AlarmClass-S': {0: 4, 1: 2, 2: 4, 3: 0},
'AlarmClass-ELM': {0: 0, 1: 0, 2: 0, 3: 2}}
I would like to perform an aggregation and filtering which in SQL would be formulated as
SELECT SystemID, COUNT(*) as count FROM table GROUP BY SystemID HAVING COUNT(*) > 2
The result should therefore be
{'SystemID': {0: '5F891F03'},
'count': {0: '3'}}
How to do this in pandas?
You can use groupby with a named aggregation, then filter at the end (the older .agg({'count': 'count'}) renaming form was removed in pandas 1.0):
(df.groupby('SystemID', as_index=False)
   .agg(count=('SystemID', 'count'))
   .query('count > 2'))
  SystemID  count
0 5F891F03      3
(df.groupby('SystemID', as_index=False)
   .agg(count=('SystemID', 'count'))
   .query('count > 2')
   .to_dict())
# {'SystemID': {0: '5F891F03'}, 'count': {0: 3}}
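An equivalent spelling with value_counts, if you prefer it (a sketch; same result here):
vc = df['SystemID'].value_counts()
vc[vc > 2].rename_axis('SystemID').reset_index(name='count')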
I have a pandas.dataframe of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The column 'field' contains the name of the reporting field (e.g. Indicator), column 'value' contains value for that indicator. Other columns are description for the SEC filing (ticker+date+fiscal_periods = unique set of features to describe certain filing). There are about 60-70 indicators per filing (number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the features (let's say N of them for one submission). But the length of this dataframe also equals the number of indicators N, with NaN everywhere off the diagonal.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from 'field' as new columns (going from narrow to wide format). In the end I want a dataframe with one row per SEC report.
So the expected output for the provided example is a one-row dataframe with N new columns, where N is the number of unique indicators in the 'field' column of the initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns from the initial dataframe, or to clean up the resulting dataframe's many NaNs?
What worked for me was setting a new index on the DF and then unstacking the 'field' level:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
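One caveat: unstack leaves a two-level column index, so the new columns come out as ('value', <field>). A small sketch to flatten the names:
# After reset_index the columns are tuples like ('ticker', '') and
# ('value', 'taxonomyid'); keep the field name where present.
aa.columns = [field if field else top for top, field in aa.columns]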