Pandas: populating new columns from another column's values - python

I have a pandas DataFrame of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The column 'field' contains the name of the reporting field (i.e. an indicator), and the column 'value' contains the value of that indicator. The other columns describe the SEC filing (ticker + date + fiscal periods form a unique key for a particular filing). There are about 60-70 indicators per filing (the number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the features (say there are N of them for one submission). But the length of this dataframe also equals the number of indicators N, with NaN everywhere off the diagonal.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from 'field' as new columns (going from long to wide format), so that I end up with one row per SEC report.
So the expected output for the provided example is a 1-row dataframe with N new columns, where N is the number of unique indicators in the 'field' column of the initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns, or what is the proper way to clean up the resulting dataframe's multiple NaNs?
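For the clean-up route, one option (a sketch, not from the original post, assuming each filing key holds at most one non-null value per indicator) is to group the concatenated frame on the descriptor columns and take the first non-null value per column:
# Collapse the N sparse rows into one row per filing;
# groupby(...).first() picks the first non-null value in each column.
e = (pd.concat([d, c], sort=False, axis=1)
       .groupby(['ticker', 'cik', 'fiscalyear', 'fiscalquarter',
                 'dcn', 'receiveddate', 'periodenddate'], as_index=False)
       .first())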

What worked for me is setting a new index on the DF and then unstacking the 'field' and 'value' columns:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
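One follow-up detail (an assumption about how unstack() behaves here, not something stated in the original post): after unstack() the columns form a two-level MultiIndex such as ('value', 'taxonomyid'), and the stray 'Unnamed: 0' column would be unstacked too, so a fuller sketch drops it first and then flattens the columns:
aa = (Adf.drop(columns=['Unnamed: 0'])
         .set_index(['ticker', 'cik', 'fiscalyear', 'fiscalquarter',
                     'dcn', 'receiveddate', 'periodenddate', 'field'])
         .unstack())
# Keep only the indicator names from the second column level.
aa.columns = aa.columns.get_level_values('field')
aa = aa.reset_index()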

Related

Write a function to perform calculations on multiple columns in a Pandas dataframe

I have the following dataframe (the real one has a lot more columns and rows, so just using this as an example):
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
I'd like to write a function to perform calculations on the dataframe, for specific columns. The calculation is in the code below.
As I only want to apply the calculation to specific columns, I've set up a list of them; there is also a pre-defined 'factor' we need to take into account, so I set that up too:
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row):
    return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
Then, I apply the function to the dataframe, and I want to overwrite the original column values with the new ones, so I do this:
for cols in df.columns:
    df[cols] = df[cols].apply(multiply_columns)
But I get the following error:
~\AppData\Local\Temp/ipykernel_8544/3939806184.py in multiply_columns(row)
3
4 def multiply_columns(row):
----> 5 return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
6
7
TypeError: string indices must be integers
But the values I'm using in the calculation aren't strings:
sample object
sample id int64
replicate int64
taste float64
smell float64
shape float64
volume int64
weight float64
dtype: object
The desired output would be:
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 0.0074, 1: 0.028366667, 2: 0.2183, 3: 3.08333e-05},
'smell': {0: 0.123333333, 1: 0.141833333, 2: 0.01295, 3: 0.032683333},
'shape': {0: 2.46667e-05, 1: 0.001233333, 2: 0.00074, 3: 0.067833333},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
Can anyone kindly show me the error of my ways?
This has a few issues.
If you wanted to index elements in row, the index you're using is a string (the column name) rather than an integer position. To get the integer positions of the column names you're interested in, you could use this:
cols = ['taste', 'smell', 'shape']
cols_idx = [df.columns.get_loc(col) for col in cols]
However, if I understand your question, you could perform this operation on columns directly with the understanding that the operation will be performed on each row. See a test case that worked for me:
import pandas as pd
df = pd.DataFrame({'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}})
cols = ['taste', 'smell', 'shape']
factor = 72
for col in cols:
    df[col] = ((df[col] / df['volume']) * (factor * df['volume'] / df['weight']) / 1000)
Note that your line
for cols in df.columns:
told the loop to run the operation on every column (inside the loop, cols was rebound to each column name and was no longer your list).
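As an aside (a sketch that is not in the original answers): the loop can be vectorized entirely, since row['volume'] cancels out of the formula algebraically:
# (col / volume) * (factor * volume / weight) / 1000 == col * factor / weight / 1000;
# div(..., axis=0) aligns the 'weight' Series with the rows of df[cols].
df[cols] = df[cols].div(df['weight'], axis=0) * factor / 1000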
You also have to pass the column name to the function:
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row, col):
    return ((row[col] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)

for col in cols:
    df[col] = df.apply(lambda x: multiply_columns(x, col), axis=1)
Also, the output I'm getting is a bit different from your desired output, even though I used the same formula:
sample sample id replicate taste smell shape volume weight
0 orange 1 1 0.00720000000 0.12000000000 0.00002400000 23 12.00000000000
1 orange 1 2 0.25476923077 1.27384615385 0.01107692308 23 1.30000000000
2 banana 5 1 1.06200000000 0.06300000000 0.00360000000 23 2.40000000000
3 banana 5 2 0.00011250000 0.11925000000 0.24750000000 23 3.20000000000

Fixing column names and renaming them after grouping the dataframe by two columns

I have a dataframe:
{'ARTICLE_ID': {0: 111, 1: 111, 2: 222, 3: 222, 4: 222}, 'CITEDIN_ARTICLE_ID': {0: 11, 1: 11, 2: 11, 3: 22, 4: 22}, 'enrollment': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}, 'Trial_year': {0: 2017, 1: 2017, 2: 2017, 3: 2017, 4: 2017}, 'AUTHOR_ID': {0: 'aaa', 1: 'aaa', 2: 'aaa', 3: 'aaa', 4: 'aaa'}, 'AUTHOR_RANK': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
I am grouping it by two columns
df_grouped = df.groupby(['AUTHOR_ID', 'Trial_year']).agg({'ARTICLE_ID': "count",
'enrollment': ["count", 'sum']}).reset_index()
As a result, I receive this dataframe, where column names have two levels
{('AUTHOR_ID', ''): {0: 'aaa'}, ('Trial_year', ''): {0: 2017}, ('ARTICLE_ID', 'count'): {0: 5}, ('enrollment', 'count'): {0: 5}, ('enrollment', 'sum'): {0: 50}}
My ideal output is a dataframe with a single level of column names, renamed like this:
`AUTHOR_ID`, `Trial_year`, `ARTICLE_ID_count`, `enrollment_count`, `enrollment_sum`
You can modify the columns:
df_grouped.columns = [f"{i}_{j}" if j!='' else i for i,j in df_grouped.columns]
or use NamedAgg from the beginning:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'])
.agg(ARTICLE_ID_count=('ARTICLE_ID', "count"),
enrollment_count=('enrollment','count'),
enrollment_sum=('enrollment','sum')).reset_index())
You can also build the named aggregations from a dictionary, for slightly more concise code:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
.agg(**{'_'.join(pair): pair for pair in [('ARTICLE_ID', 'count'),
('enrollment','count'),
('enrollment','sum')]}))
Output:
AUTHOR_ID Trial_year ARTICLE_ID_count enrollment_count enrollment_sum
0 aaa 2017 5 5 50

How do I extract these SQL queries from these pandas dataframes?

SQL QUERIES
customers = pd.DataFrame({'customer_id': {0: 5386596, 1: 32676876}, 'created_at': {0: pd.Timestamp('2017-01-27 00:00:00'), 1: pd.Timestamp('2018-06-07 00:00:00')}, 'venture_code': {0: 'MY', 1: 'ID'}})
visits = pd.DataFrame({'customer_id': {0: 3434886, 1: 10053}, 'date': {0: pd.Timestamp('2016-10-02 00:00:00'), 1: pd.Timestamp('2017-12-14 00:00:00')}})
orders = pd.DataFrame({'order_id': {0: 112525, 1: 112525}, 'date': {0: pd.Timestamp('2019-02-01 00:00:00'), 1: pd.Timestamp('2019-02-01 00:00:00')}, 'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'customer_id': {0: 46160566, 1: 46160566}})
products = pd.DataFrame({'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'brand': {0: 1, 1: 1}, 'supplier': {0: 'A', 1: 'B'}, 'category': {0: 'Mapp', 1: 'Macc'}, 'price': {0: 15, 1: 45}})
segment = pd.DataFrame({'Age Range': {0: '<20', 1: '<20'},
                        'Gender': {0: 'female', 1: 'female'},
                        'Category': {0: 'Wsho', 1: 'Wapp'},
                        'Discount %': {0: 0.246607432, 1: 0.174166503},
                        'NMV': {0: 2509.580375, 1: 8910.447587},
                        '# Items': {0: 169, 1: 778},
                        '# Orders': {0: 15, 1: 135}})
buying = pd.DataFrame({'Supplier Name': {0: 'A', 1: 'A'},
                       'Brand Name': {0: 1, 1: 2},
                       '# SKU': {0: 506, 1: 267},
                       '# Item Before Return': {0: 5663, 1: 3256},
                       '# Item Returned': {0: 2776, 1: 1395},
                       'Margin %': {0: 0.266922793, 1: 0.282847894},
                       'GMV': {0: 191686.749171408, 1: 115560.037075292}})
Using SQL or Pandas, please tell me how to
1. Compare the monthly sales (GMV) trend in Q4 2019, across all countries (venture_code)
2. Show the top 10 brands for each product category, based on total sales (GMV)
I wrote this, but I got the query wrong!
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name]
Concerning the error, you have a space in the column name.
In SQL, if the column has a space, use brackets to wrap the column name:
MyTable.[My Column]
In your code, use this SQL (note that the SUM aggregate also needs a GROUP BY):
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name] GROUP BY category
I don't have access to your data, so I can't test, but I think these queries are correct. You may need to tweak them some.
Part 1:
select c.venture_code, sum(b.GMV) GMVSum from customers c join orders o on c.customer_id = o.customer_id
join products p on o.sku = p.sku
join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]
where o.date >= '2019-10-01' and o.date <= '2019-12-31' -- 2019 4th qtr
group by c.venture_code
Part 2:
select * from
(select *, RANK() over (PARTITION BY category order by GMV desc) rk from
(select p.brand, p.category, b.GMV from products p join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]) x) xx
where rk <= 10
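Since the question allows pandas as well, here is a sketch of Part 2 in pandas (under the same join assumptions as the SQL above):
# Join products to buying on brand/supplier, total GMV per (category, brand),
# then keep the 10 largest brands within each category.
merged = products.merge(buying, left_on=['brand', 'supplier'],
                        right_on=['Brand Name', 'Supplier Name'])
top10 = (merged.groupby(['category', 'brand'], as_index=False)['GMV'].sum()
               .sort_values(['category', 'GMV'], ascending=[True, False])
               .groupby('category').head(10))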

Merge DataFrame with many-to-many

I have 2 DataFrames containing examples, and I would like to see whether an example from DataFrame 1 is present in DataFrame 2.
Normally I would aggregate the rows per example and simply merge the DataFrames. Unfortunately, the merge has to go through a "matching table" that has a many-to-many relationship between the keys (id_low vs. id_high).
Simplified example: the matching table, the input DataFrames, how they match, and the expected output were shown as images in the original post; the Python version below carries the same information.
Simplified example (for Python)
import pandas as pd
# Dataframe 1 - containing 1 Example
d1 = pd.DataFrame.from_dict({'Example': {0: 'Example 1', 1: 'Example 1', 2: 'Example 1'},
'id_low': {0: 1, 1: 2, 2: 3}})
# DataFrame 2 - containing 1 Example
d2 = pd.DataFrame.from_dict({'Example': {0: 'Example 2', 1: 'Example 2', 2: 'Example 2'},
'id_low': {0: 1, 1: 4, 2: 6}})
# DataFrame 3 - matching table
dm = pd.DataFrame.from_dict({'id_low': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4, 6: 5, 7: 6, 8: 6},
'id_high': {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'B',
6: 'B',
7: 'E',
8: 'F'}})
d1 and d2 are matchable as you can see above.
Expected Output (or similar):
df_output = pd.DataFrame.from_dict({'Example': {0: 'Example 1'}, 'Example_2': {0: 'Example 2'}})
Failed attempts
I tried translating the values through the matching table, aggregating, and then merging. I also considered using a regex with the OR operator.
IIUC:
(d2.merge(dm)
   .merge(d1.merge(dm), on='id_high')
   .groupby(['Example_x', 'Example_y'])['id_high'].agg(list)
   .reset_index())
Output:
Example_x Example_y id_high
0 Example 2 Example 1 [A, B, E]
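In case the chain is hard to read, here is the same logic step by step (a restatement with hypothetical intermediate names): both examples are first translated from id_low to id_high via the matching table, and the inner merge on id_high then keeps only the high-level ids the two examples share.
d1_high = d1.merge(dm, on='id_low')   # Example 1 -> id_high: A, B, C, D, E
d2_high = d2.merge(dm, on='id_low')   # Example 2 -> id_high: A, B, E, F
shared = d2_high.merge(d1_high, on='id_high')   # intersection: A, B, E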

Group By Having Count in Pandas

Here is my data:
{'SystemID': {0: '95EE8B57',
1: '5F891F03',
2: '5F891F03',
3: '5F891F03'},
'Day': {0: '06/08/2018', 1: '05/08/2018', 2: '04/08/2018', 3: '05/08/2018'},
'AlarmClass-S': {0: 4, 1: 2, 2: 4, 3: 0},
'AlarmClass-ELM': {0: 0, 1: 0, 2: 0, 3: 2}}
I would like to perform an aggregation and filtering which in SQL would be formulated as
SELECT SystemID, COUNT(*) as count FROM table GROUP BY SystemID HAVING COUNT(*) > 2
Thus the result should be:
{'SystemID': {0: '5F891F03'},
'count': {0: '3'}}
How to do this in pandas?
You can use groupby and count, then filter at the end.
(df.groupby('SystemID', as_index=False)['SystemID']
.agg({'count': 'count'})
.query('count > 2'))
SystemID count
0 5F891F03 3
(df.groupby('SystemID', as_index=False)['SystemID']
.agg({'count': 'count'})
.query('count > 2')
.to_dict())
# {'SystemID': {0: '5F891F03'}, 'count': {0: 3}}
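One caveat that postdates this answer: the dict-renaming form .agg({'count': 'count'}) on a single column was deprecated as a "nested renamer" and raises SpecificationError in pandas 1.0+. A version-proof sketch of the same query uses size():
# Count rows per SystemID and keep groups with more than 2 rows;
# equivalent to SQL's GROUP BY ... HAVING COUNT(*) > 2.
result = (df.groupby('SystemID')
            .size()
            .reset_index(name='count')
            .query('count > 2'))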
