Flatten multi-index columns into one pandas - python

I'm trying to clean up a dataframe by merging the columns on a multi-index so all values in columns that belong to the same first-level index appear in one column.
From This:
To This:
I was doing it manually by defining each column and joining them like this:
df['Subjects'] = df['Which of the following subjects are you taking this semester?'].apply(lambda x: '|'.join(x.dropna()), axis = 1)
df.drop('Which of the following subjects are you taking this semester?', axis = 1, level = 0, inplace = True)
The problem is that I have a large dataframe with many more columns than this, so I was wondering whether there is a way to do this dynamically for all columns instead of copying this code and defining each column individually.
data = {('Name', ''): {0: 'Jane',
1: 'John',
2: 'Lisa',
3: 'Michael'},
('Location', ''): {0: 'Houston', 1: 'LA', 2: 'LA', 3:
'Dallas'},
('Which of the following subjects are you taking this semester?', 'Math'): {0: 'Math',
1: 'Math',
2: np.nan,
3: 'Math'},
('Which of the following subjects are you taking this semester?', 'Science'): {0: 'Science',
1: np.nan,
2: np.nan,
3: 'Science'},
('Which of the following subjects are you taking this semester?', 'Art'): {0: np.nan,
1: 'Art',
2: 'Art',
3: np.nan},
('Which of the following electronic devices do you own?',
'Laptop'): {0: 'Laptop',
1: 'Laptop',
2: 'Laptop',
3: 'Laptop'},
('Which of the following electronic devices do you own?',
'Phone'): {0: 'Phone',
1: 'Phone',
2: 'Phone',
3: 'Phone'},
('Which of the following electronic devices do you own?',
'TV'): {0: np.nan,
1: 'TV',
2: np.nan,
3: np.nan},
('Which of the following electronic devices do you own?',
'Tablet'): {0: 'Tablet',
1: np.nan,
2: 'Tablet',
3: np.nan},
('Age', ''): {0: 24, 1: 20, 2: 19, 3: 29},
('Which Social Media Platforms Do You Use?', 'Instagram'):
{0: np.nan,
1: 'Instagram',
2: 'Instagram',
3: 'Instagram'},
('Which Social Media Platforms Do You Use?', 'Facebook'):
{0: 'Facebook',
1: 'Facebook',
2: np.nan,
3: np.nan},
('Which Social Media Platforms Do You Use?', 'Tik Tok'):
{0: np.nan,
1: 'Tik Tok',
2: 'Tik Tok',
3: np.nan},
('Which Social Media Platforms Do You Use?', 'LinkedIn'):
{0: 'LinkedIn',
1: 'LinkedIn',
2: np.nan,
3: np.nan}
}

You can try this:
df.T.groupby(level=0).agg(list).T

You can use melt as a starting point to flatten your dataframe, filter out the NaN values, then use pivot_table to reshape it:
pat = r'(subjects|electronic devices|Social Media Platforms)'
cols = ['Name', 'Location', 'Age']
out = df.droplevel(1, axis=1).melt(cols, ignore_index=False).query('value.notna()')
out['variable'] = out['variable'].str.extract(pat, expand=False).str.title()
out = out.reset_index().pivot_table('value', ['index'] + cols, 'variable', aggfunc='|'.join) \
.reset_index(cols).rename_axis(index=None, columns=None)
Output:
>>> out
      Name  Location  Age   Electronic Devices               Social Media Platforms      Subjects
0     Jane   Houston   24  Laptop|Phone|Tablet                    Facebook|LinkedIn  Math|Science
1     John        LA   20      Laptop|Phone|TV  Instagram|Facebook|Tik Tok|LinkedIn      Math|Art
2     Lisa        LA   19  Laptop|Phone|Tablet                    Instagram|Tik Tok           Art
3  Michael    Dallas   29         Laptop|Phone                            Instagram  Math|Science
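
For a frame with many question groups, the same kind of result can probably be reached without a regular expression or a hard-coded column list by looping over the first column level. The following is only a sketch, not the method from the answers above: the `collapse_level0` helper and the shortened column labels are made up for illustration, and it assumes every multi-column group holds strings or NaN.

```python
import numpy as np
import pandas as pd

# Small stand-in for the survey frame (hypothetical, shortened labels)
df = pd.DataFrame({
    ('Name', ''): ['Jane', 'John', 'Lisa'],
    ('Subjects', 'Math'): ['Math', 'Math', np.nan],
    ('Subjects', 'Science'): ['Science', np.nan, np.nan],
    ('Subjects', 'Art'): [np.nan, 'Art', 'Art'],
    ('Age', ''): [24, 20, 19],
})

def collapse_level0(frame, sep='|'):
    """Join all columns sharing a first-level label into one column."""
    level0 = frame.columns.get_level_values(0)
    out = {}
    for top in level0.unique():
        # Boolean mask keeps the selection a DataFrame even for one column
        block = frame.loc[:, level0 == top]
        if block.shape[1] == 1:
            # Single-column groups are passed through unchanged
            out[top] = block.iloc[:, 0]
        else:
            # Drop missing answers in each row and join the rest
            out[top] = block.apply(
                lambda row: sep.join(row.dropna().astype(str)), axis=1)
    return pd.DataFrame(out)

flat = collapse_level0(df)
print(flat)
```

Single-column groups such as Name and Age are copied as they are, so numeric dtypes survive; only groups with several sub-columns are turned into joined strings.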

Related

Fixing column names and renaming them after grouping the dataframe by two columns

I have a dataframe:
{'ARTICLE_ID': {0: 111, 1: 111, 2: 222, 3: 222, 4: 222}, 'CITEDIN_ARTICLE_ID': {0: 11, 1: 11, 2: 11, 3: 22, 4: 22}, 'enrollment': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}, 'Trial_year': {0: 2017, 1: 2017, 2: 2017, 3: 2017, 4: 2017}, 'AUTHOR_ID': {0: 'aaa', 1: 'aaa', 2: 'aaa', 3: 'aaa', 4: 'aaa'}, 'AUTHOR_RANK': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
I am grouping it by two columns
df_grouped = df.groupby(['AUTHOR_ID', 'Trial_year']).agg({'ARTICLE_ID': "count",
'enrollment': ["count", 'sum']}).reset_index()
As a result, I receive this dataframe, where column names have two levels
{('AUTHOR_ID', ''): {0: 'aaa'}, ('Trial_year', ''): {0: 2017}, ('ARTICLE_ID', 'count'): {0: 5}, ('enrollment', 'count'): {0: 5}, ('enrollment', 'sum'): {0: 50}}
My ideal output - the dataframe with one level of column names and renamed column names
`AUTHOR_ID`, `Trial_year`, `ARTICLE_ID_count`, `enrollment_count`, `enrollment_sum`
You can modify the columns:
df_grouped.columns = [f"{i}_{j}" if j!='' else i for i,j in df_grouped.columns]
or use NamedAgg from the beginning:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'])
.agg(ARTICLE_ID_count=('ARTICLE_ID', "count"),
enrollment_count=('enrollment','count'),
enrollment_sum=('enrollment','sum')).reset_index())
You can also build the keyword arguments as a dictionary and unpack it into groupby.agg for slightly more concise code:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
.agg(**{'_'.join(pair): pair for pair in [('ARTICLE_ID', 'count'),
('enrollment','count'),
('enrollment','sum')]}))
Output:
   AUTHOR_ID  Trial_year  ARTICLE_ID_count  enrollment_count  enrollment_sum
0        aaa        2017                 5                 5              50

How do I extract these SQL queries from these pandas dataframes?

SQL QUERIES
import pandas as pd
from pandas import Timestamp

customers = pd.DataFrame({'customer_id': {0: 5386596, 1: 32676876}, 'created_at': {0: Timestamp('2017-01-27 00:00:00'), 1: Timestamp('2018-06-07 00:00:00')}, 'venture_code': {0: 'MY', 1: 'ID'}})
visits = pd.DataFrame({'customer_id': {0: 3434886, 1: 10053}, 'date': {0: Timestamp('2016-10-02 00:00:00'), 1: Timestamp('2017-12-14 00:00:00')}})
orders = pd.DataFrame({'order_id': {0: 112525, 1: 112525}, 'date': {0: Timestamp('2019-02-01 00:00:00'), 1: Timestamp('2019-02-01 00:00:00')}, 'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'customer_id': {0: 46160566, 1: 46160566}})
products = pd.DataFrame({'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'}, 'brand': {0: 1, 1: 1}, 'supplier': {0: 'A', 1: 'B'}, 'category': {0: 'Mapp', 1: 'Macc'}, 'price': {0: 15, 1: 45}})
segment = pd.DataFrame({'Age Range': {0: '<20', 1: '<20'},
'Gender': {0: 'female', 1: 'female'},
'Category': {0: 'Wsho', 1: 'Wapp'},
'Discount %': {0: 0.246607432, 1: 0.174166503},
'NMV': {0: 2509.580375, 1: 8910.447587},
'# Items': {0: 169, 1: 778},
'# Orders': {0: 15, 1: 135}})
buying = pd.DataFrame({'Supplier Name': {0: 'A', 1: 'A'},
'Brand Name': {0: 1, 1: 2},
'# SKU': {0: 506, 1: 267},
'# Item Before Return': {0: 5663, 1: 3256},
'# Item Returned': {0: 2776, 1: 1395},
'Margin %': {0: 0.266922793, 1: 0.282847894},
'GMV': {0: 191686.749171408, 1: 115560.037075292}})
Using SQL or Pandas, please tell me how to
1. Compare the monthly sales (GMV) trend in Q4 2019, across all countries (venture_code)
2. Show the top 10 brands for each product category, based on total sales (GMV)
I wrote the following query, but it is wrong:
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name]
Concerning the error, you have a space in the column name.
In SQL, if the column has a space, use brackets to wrap the column name:
MyTable.[My Column]
In your code, use this SQL:
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name] GROUP BY category
I don't have access to your data, so I can't test, but I think these queries are correct. You may need to tweak them some.
Part 1:
select c.venture_code, sum(b.GMV) GMVSum from customers c join orders o on c.customer_id = o.customer_id
join products p on o.sku = p.sku
join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]
where o.date >= '2019-10-01' and o.date <= '2019-12-31' -- 2019 4th qtr
group by c.venture_code
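
Since the question also allows pandas, a month-by-country comparison for Q4 2019 could be sketched roughly as follows. The rows are made up and only the column names follow the question's tables. Because `orders` carries no GMV column, the item `price` is used here as a stand-in for GMV, which is purely an assumption.

```python
import pandas as pd

# Made-up rows; only the column names follow the question's tables
customers = pd.DataFrame({'customer_id': [1, 2], 'venture_code': ['MY', 'ID']})
orders = pd.DataFrame({
    'order_id': [10, 11, 12, 13],
    'date': pd.to_datetime(['2019-10-05', '2019-11-20',
                            '2019-11-21', '2020-01-02']),
    'sku': ['A', 'B', 'A', 'B'],
    'customer_id': [1, 1, 2, 2],
})
products = pd.DataFrame({'sku': ['A', 'B'], 'price': [15, 45]})

# Keep only orders placed in Q4 2019
in_q4 = ((orders['date'] >= pd.Timestamp('2019-10-01'))
         & (orders['date'] < pd.Timestamp('2020-01-01')))

merged = (orders[in_q4]
          .merge(customers, on='customer_id')
          .merge(products, on='sku')
          # Label each order with its calendar month, e.g. '2019-10'
          .assign(month=lambda d: d['date'].dt.to_period('M').astype(str)))

# Assumption: the item price stands in for GMV
monthly = merged.pivot_table(index='month', columns='venture_code',
                             values='price', aggfunc='sum')
print(monthly)
```

One row per month and one column per venture_code makes the trend easy to compare side by side; a month with no orders for a country shows up as NaN.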
Part 2:
select * from
(select *, RANK() over (PARTITION BY category order by GMV desc) rk from
(select p.brand, p.category, b.GMV from products p join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]) x) xx
where rk <= 10
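
The same "top N per group" idea can be sketched in pandas with a grouped rank. The `sales` frame below is made up, and a cutoff of 2 is used instead of 10 so that the small example stays readable.

```python
import pandas as pd

# Made-up sales rows: one row per (category, brand) sale
sales = pd.DataFrame({
    'category': ['Mapp', 'Mapp', 'Mapp', 'Macc', 'Macc'],
    'brand': [1, 2, 3, 1, 2],
    'GMV': [100.0, 300.0, 200.0, 50.0, 75.0],
})

# Total sales per brand within each category
totals = sales.groupby(['category', 'brand'], as_index=False)['GMV'].sum()

# Rank brands inside each category, highest GMV first
totals['rk'] = totals.groupby('category')['GMV'].rank(method='min',
                                                      ascending=False)

# Keep the top 2 brands per category (use 10 for the real data)
top = totals[totals['rk'] <= 2].sort_values(['category', 'rk'])
print(top)
```

`rank(method='min')` gives tied brands the same rank, which mirrors the behaviour of SQL's RANK().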

Joining two columns in the same data frame

I am trying to add one column at the end of another column. I have included a picture that kind of demonstrates what I want to achieve. How can this be done?
For example, in this case I added the age column under the name column
Dummy data:
{'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}}
One way is to use pd.concat (DataFrame.append, which older answers use, was removed in pandas 2.0). If your data is in the DataFrame df:
# Split out the relevant parts of your DataFrame
# (note the trailing space in the 'age ' column name)
top_df = df[['name', 'sex']]
bottom_df = df[['age ', 'sex']].copy()
# Make the column names match
bottom_df.columns = ['name', 'sex']
# Concatenate the two
full_df = pd.concat([top_df, bottom_df])
You might have to decide on what kind of indexing you want. This method above will have non-unique indexing in full_df, which could be fixed by running the following line:
full_df.reset_index(drop=True, inplace=True)
You can use pd.melt and then drop the resulting variable column with df.drop.
df = pd.DataFrame({'Unnamed: 0': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}})
# The age column is named 'age ' (with a trailing space) in the data
df.melt(id_vars=['sex'], value_vars=['name', 'age ']).drop(columns='variable')
      sex   value
0  female  andrea
1    male    juan
2    male    jose
3    male  manuel
4  female      35
5    male      56
6    male      22
7    male      16

Calculating total unique values per column

I am trying to use the data below to get the total Facebook likes for each unique actor. The output should have two columns: column 1 containing the unique actor names from all the actor_name columns, and column 2 containing the total likes from all three actor_facebook_likes columns. Any idea on how this can be done would be appreciated.
{'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: nan, 1: 27000.0, 2: 9800.0, 3: nan, 4: 3300.0}}
Use pivot to get sum of likes for each actor in each facebook like category
df3=pd.pivot_table(df,columns=['actor_1_name', 'actor_2_name', 'actor_3_name'],values=['actor_1_facebook_likes', 'actor_2_facebook_likes',
'actor_3_facebook_likes'],aggfunc=[np.sum]).reset_index()
Melt the Actors, groupby and sum all categories
res=pd.melt(df3,id_vars=['sum'], value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name']).groupby('value').agg(Totallikes =('sum', 'sum')).reset_index()
Rename the columns
res.columns=['Actor','Totallikes']
print(res)
                Actor  Totallikes
 0         Amiée Conn     33000.0
 1          Amy Adams     40300.0
 2      Casey Affleck     74818.0
 3          Dev Patel    138800.0
 4         Emma Stone     33000.0
 5    Forest Whitaker     40300.0
 6   Ginnifer Goodwin     57800.0
 7         Idris Elba     57800.0
 8      Jason Bateman     57800.0
 9      Jeremy Renner     40300.0
10      Kyle Chandler     74818.0
11  Michelle Williams     74818.0
12      Nicole Kidman    138800.0
13        Rooney Mara    138800.0
14       Ryan Gosling     33000.0
This does the job:
df0 = pd.DataFrame({'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: 0, 1: 27000.0, 2: 9800.0, 3: 0, 4: 3300.0}})
# Stack each (name, likes) column pair, then total the likes per actor
parts = []
for i in range(1, 4):
    part = df0[[f'actor_{i}_name', f'actor_{i}_facebook_likes']].copy()
    part.columns = ['Actor', 'Totallikes']
    parts.append(part)
dfa = pd.concat(parts, ignore_index=True).groupby('Actor', as_index=False)['Totallikes'].sum()

Pandas: populating new columns from another column's values

I have a pandas.dataframe of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The column 'field' contains the name of the reporting field (e.g. an indicator), and the column 'value' contains the value of that indicator. The other columns describe the SEC filing (ticker + date + fiscal periods form a unique set of features identifying a filing). There are about 60-70 indicators per filing (the number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the indicators (say N in total for one submission). But the number of rows in this dataframe also equals the number of indicators, N, with NaN everywhere off the diagonal.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from the 'field' column as new columns (going from narrow to wide format). In the end I want a dataframe with one row for each SEC report.
So the expected output for provided example is a 1-row dataframe with N new columns, where N = number of unique indicators from the 'field' column of initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns, or, failing that, to clean up the resulting dataframe by removing the many NaN values?
What worked for me is setting new index to DF and unstacking 'field' and 'value' columns
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
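
A variant of the same idea that selects only the 'value' column before unstacking, so the result has a single level of column names, might look like the sketch below. The frame is a shortened, partly made-up version of the question's data (the 'XYZ' filing is invented), and the `keys` list is an assumption about which columns identify a filing.

```python
import pandas as pd

# Shortened, partly made-up version of the narrow SEC frame
Adf = pd.DataFrame({
    'field': ['taxonomyid', 'companyname', 'primaryexchange'] * 2,
    'value': ['50', 'CONAGRA BRANDS INC.', 'NYSE',
              '50', 'EXAMPLE CORP', 'NASDAQ'],
    'ticker': ['CAG'] * 3 + ['XYZ'] * 3,
    'fiscalyear': [2019] * 6,
    'fiscalquarter': [1] * 6,
})

# Columns that together identify one filing (assumed for this sketch)
keys = ['ticker', 'fiscalyear', 'fiscalquarter']

wide = (Adf.set_index(keys + ['field'])['value']  # keep only the values
           .unstack('field')                      # one column per indicator
           .reset_index()
           .rename_axis(columns=None))            # drop the 'field' label
print(wide)
```

If an indicator name duplicates one of the key columns, as 'cik' does in the question's data, `reset_index` will refuse to insert the clashing column, so one of the two would need to be renamed first.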
