Formatting the printout of a pandas DataFrame - Python

I am building a program to track my employees. I have a CSV where I keep track of information. I'm trying to loop through and print out the rows of employees that don't have an end_date, i.e. those that are still working.
I have been able to get the correct rows to print, but the formatting is not what I'm hoping for, which is one row per employee.
Here is an example of my CSV:
employee_id,name,address,Phone,date_of_birth,job_title,start_date,end_date
1,Arya,New York,1234567890,1/1/1970,lecturer,1/1/2021,10/20/2022
2,Terri,New York,25151521,010109,Nurse,10/10/2022,
42,Bill,New York,2314,09/10/1994,Teacher,10/14/2022,
48,Steve,New York,454554,08/10/1994,Teacher,02/25/2022,
9,Stephen,New York,526415252,10/08/1994,Teacher,10/15/2022,N/A
Here is the program that I'm running:
df2 = pd.read_csv('employees.csv')
print()
for index, row in df2.iterrows():
    if len(str(row['end_date'])) <= 3:
        print(df2.loc[index])
    else:
        continue
print()
The printout looks like this for each matching employee (repeated once per row):
employee_id 8
name Bill
address New York
phone 25235
date_of_birth 081019
job_title Engineer
start_date 081019
end_date NaN
Name: 2, dtype: object
However, I want the printout to look like the original CSV, but only showing the rows for people without values in the 'end_date' column, like this:
employee_id,name,address,Phone,date_of_birth,job_title,start_date,end_date
2,Terri,New York,25151521,010109,Nurse,10/10/2022,
42,Bill,New York,2314,09/10/1994,Teacher,10/14/2022,
48,Steve,New York,454554,08/10/1994,Teacher,02/25/2022,
I don't want to use df.drop because I want to keep a record of everyone.

This should work.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'employee_id': {0: 1, 1: 2, 2: 42, 3: 48, 4: 9},
    'name': {0: 'Arya', 1: 'Terri', 2: 'Bill', 3: 'Steve', 4: 'Stephen'},
    'address': {0: 'New York', 1: 'New York', 2: 'New York', 3: 'New York', 4: 'New York'},
    'Phone': {0: '1234567890', 1: '25151521', 2: '2314', 3: '454554', 4: '526415252'},
    'date_of_birth': {0: '1/1/1970', 1: '010109', 2: '09/10/1994', 3: '08/10/1994', 4: '10/08/1994'},
    'job_title': {0: 'lecturer', 1: 'Nurse', 2: 'Teacher', 3: 'Teacher', 4: 'Teacher'},
    'start_date': {0: '1/1/2021', 1: '10/10/2022', 2: '10/14/2022', 3: '02/25/2022', 4: '10/15/2022'},
    'end_date': {0: '10/20/2022', 1: '', 2: '', 3: '', 4: 'N/A'},
})
# Treat 'N/A' and empty strings as missing values
df['end_date'] = df['end_date'].replace(['N/A', ''], np.nan)
#prints only rows with null values in end_date
df[df['end_date'].isna()]
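If you are reading straight from your CSV, you can also let read_csv do the NaN conversion for you; a minimal sketch, assuming the employees.csv file from the question (pandas treats empty fields and 'N/A' as NaN by default):
import pandas as pd

df = pd.read_csv('employees.csv')  # '' and 'N/A' are parsed as NaN by default

# Print the still-active employees in the same shape as the original CSV
active = df[df['end_date'].isna()]
print(active.to_csv(index=False))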
Also, a simpler way to get your vertical printout for each employee:
for e in df['employee_id']:
    print(df[df['employee_id'] == e].transpose())
#output
employee_id 1
name Arya
address New York
Phone 1234567890
date_of_birth 1/1/1970
job_title lecturer
start_date 1/1/2021
end_date 10/20/2022
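Equivalently, without a Python-level loop, transposing the whole frame prints every employee side by side; a one-line sketch, assuming the df built above:
print(df.set_index('employee_id').T.to_string())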

Flatten multi-index columns into one pandas

I'm trying to clean up a dataframe by merging the columns on a multi-index so all values in columns that belong to the same first-level index appear in one column.
From this (the multi-index frame reproduced by the data dict below) to this (one '|'-joined column per question; before/after screenshots omitted):
I was doing it manually by defining each column and joining them like this:
df['Subjects'] = df['Which of the following subjects are you taking this semester?'].apply(lambda x: '|'.join(x.dropna()), axis = 1)
df.drop('Which of the following subjects are you taking this semester?', axis = 1, level = 0, inplace = True)
The problem is I have a large dataframe with many more columns than this, so I was wondering if there is a way to do this dynamically for all columns instead of copying this code and defining each column individually?
import numpy as np
import pandas as pd

data = {('Name', ''): {0: 'Jane', 1: 'John', 2: 'Lisa', 3: 'Michael'},
        ('Location', ''): {0: 'Houston', 1: 'LA', 2: 'LA', 3: 'Dallas'},
        ('Which of the following subjects are you taking this semester?', 'Math'): {0: 'Math', 1: 'Math', 2: np.nan, 3: 'Math'},
        ('Which of the following subjects are you taking this semester?', 'Science'): {0: 'Science', 1: np.nan, 2: np.nan, 3: 'Science'},
        ('Which of the following subjects are you taking this semester?', 'Art'): {0: np.nan, 1: 'Art', 2: 'Art', 3: np.nan},
        ('Which of the following electronic devices do you own?', 'Laptop'): {0: 'Laptop', 1: 'Laptop', 2: 'Laptop', 3: 'Laptop'},
        ('Which of the following electronic devices do you own?', 'Phone'): {0: 'Phone', 1: 'Phone', 2: 'Phone', 3: 'Phone'},
        ('Which of the following electronic devices do you own?', 'TV'): {0: np.nan, 1: 'TV', 2: np.nan, 3: np.nan},
        ('Which of the following electronic devices do you own?', 'Tablet'): {0: 'Tablet', 1: np.nan, 2: 'Tablet', 3: np.nan},
        ('Age', ''): {0: 24, 1: 20, 2: 19, 3: 29},
        ('Which Social Media Platforms Do You Use?', 'Instagram'): {0: np.nan, 1: 'Instagram', 2: 'Instagram', 3: 'Instagram'},
        ('Which Social Media Platforms Do You Use?', 'Facebook'): {0: 'Facebook', 1: 'Facebook', 2: np.nan, 3: np.nan},
        ('Which Social Media Platforms Do You Use?', 'Tik Tok'): {0: np.nan, 1: 'Tik Tok', 2: 'Tik Tok', 3: np.nan},
        ('Which Social Media Platforms Do You Use?', 'LinkedIn'): {0: 'LinkedIn', 1: 'LinkedIn', 2: np.nan, 3: np.nan}}
df = pd.DataFrame(data)
You can try this:
df.T.groupby(level=0).agg(list).T
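The agg(list) above keeps NaNs and returns lists. If you want the '|'-joined strings from the question instead, a sketch along the same lines (assuming df is built from the data dict above; note that non-string values such as Age come out as strings):
flat = df.T.groupby(level=0).agg(
    lambda s: '|'.join(s.dropna().astype(str))
).T
Each first-level column group ('Name', 'Age', each survey question) collapses to one column; single-column groups pass through unchanged, since joining one value is a no-op.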
You can use melt as a starting point to flatten your dataframe, filter out NaN values, then pivot_table to reshape your dataframe:
pat = r'(subjects|electronic devices|Social Media Platforms)'
cols = ['Name', 'Location', 'Age']
out = df.droplevel(1, axis=1).melt(cols, ignore_index=False).query('value.notna()')
out['variable'] = out['variable'].str.extract(pat, expand=False).str.title()
out = out.reset_index().pivot_table('value', ['index'] + cols, 'variable', aggfunc='|'.join) \
.reset_index(cols).rename_axis(index=None, columns=None)
Output:
>>> out
Name Location Age Electronic Devices Social Media Platforms Subjects
0 Jane Houston 24 Laptop|Phone|Tablet Facebook|LinkedIn Math|Science
1 John LA 20 Laptop|Phone|TV Instagram|Facebook|Tik Tok|LinkedIn Math|Art
2 Lisa LA 19 Laptop|Phone|Tablet Instagram|Tik Tok Art
3 Michael Dallas 29 Laptop|Phone Instagram Math|Science

How do I extract these SQL queries from these pandas dataframes?

SQL QUERIES
import pandas as pd
from pandas import Timestamp

customers = pd.DataFrame({'customer_id': {0: 5386596, 1: 32676876},
                          'created_at': {0: Timestamp('2017-01-27 00:00:00'), 1: Timestamp('2018-06-07 00:00:00')},
                          'venture_code': {0: 'MY', 1: 'ID'}})
visits = pd.DataFrame({'customer_id': {0: 3434886, 1: 10053},
                       'date': {0: Timestamp('2016-10-02 00:00:00'), 1: Timestamp('2017-12-14 00:00:00')}})
orders = pd.DataFrame({'order_id': {0: 112525, 1: 112525},
                       'date': {0: Timestamp('2019-02-01 00:00:00'), 1: Timestamp('2019-02-01 00:00:00')},
                       'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'},
                       'customer_id': {0: 46160566, 1: 46160566}})
products = pd.DataFrame({'sku': {0: 'SA108SH89OLAHK', 1: 'RO151AA60REHHK'},
                         'brand': {0: 1, 1: 1},
                         'supplier': {0: 'A', 1: 'B'},
                         'category': {0: 'Mapp', 1: 'Macc'},
                         'price': {0: 15, 1: 45}})
segment = pd.DataFrame({'Age Range': {0: '<20', 1: '<20'},
                        'Gender': {0: 'female', 1: 'female'},
                        'Category': {0: 'Wsho', 1: 'Wapp'},
                        'Discount %': {0: 0.246607432, 1: 0.174166503},
                        'NMV': {0: 2509.580375, 1: 8910.447587},
                        '# Items': {0: 169, 1: 778},
                        '# Orders': {0: 15, 1: 135}})
buying = pd.DataFrame({'Supplier Name': {0: 'A', 1: 'A'},
                       'Brand Name': {0: 1, 1: 2},
                       '# SKU': {0: 506, 1: 267},
                       '# Item Before Return': {0: 5663, 1: 3256},
                       '# Item Returned': {0: 2776, 1: 1395},
                       'Margin %': {0: 0.266922793, 1: 0.282847894},
                       'GMV': {0: 191686.749171408, 1: 115560.037075292}})
Using SQL or Pandas, please tell me how to
1. Compare the monthly sales (GMV) trend in Q4 2019, across all countries (venture_code)
2. Show the top 10 brands for each product category, based on total sales (GMV)
I wrote this, but got the query wrong:
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name]
Concerning the error, you have a space in the column name.
In SQL, if the column has a space, use brackets to wrap the column name:
MyTable.[My Column]
In your code, use this SQL:
SELECT category, SUM(GMV) as Total_Sales FROM products INNER JOIN buying ON products.brand = buying.[Brand Name] GROUP BY category
I don't have access to your data, so I can't test, but I think these queries are correct. You may need to tweak them some.
Part 1:
select c.venture_code, sum(b.GMV) GMVSum from customers c join orders o on c.customer_id = o.customer_id
join products p on o.sku = p.sku
join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]
where o.date >= '2019-10-01' and o.date <= '2019-12-31' -- 2019 4th qtr
group by c.venture_code
Part 2:
select * from
(select *, RANK() over (PARTITION BY category order by GMV desc) rk from
(select p.brand, p.category, b.GMV from products p join buying b on p.brand = b.[Brand Name] and p.supplier = b.[Supplier Name]) x) xx
where rk <= 10
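And since the question allows pandas as well, a rough pandas sketch of both parts, assuming the frames defined above (the toy rows barely share any keys, so expect empty output until the full tables are used):
# Part 1: monthly GMV by venture_code in Q4 2019
q4 = orders[(orders['date'] >= '2019-10-01') & (orders['date'] <= '2019-12-31')]
m = (q4.merge(customers, on='customer_id')
       .merge(products, on='sku')
       .merge(buying, left_on=['brand', 'supplier'],
              right_on=['Brand Name', 'Supplier Name']))
part1 = m.groupby(['venture_code', m['date'].dt.to_period('M')])['GMV'].sum()

# Part 2: top 10 brands per category by total GMV
totals = (products.merge(buying, left_on=['brand', 'supplier'],
                         right_on=['Brand Name', 'Supplier Name'])
                  .groupby(['category', 'brand'])['GMV'].sum())
part2 = totals.sort_values(ascending=False).groupby(level='category').head(10)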

Joining two columns in the same data frame

I am trying to add one column at the end of another column. I have included a picture that kind of demonstrates what I want to achieve. How can this be done?
For example, in this case I added the age column under the name column
Dummy data:
{'Unnamed: 0': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}}
One way is to stack the two pieces with pd.concat (DataFrame.append used to do this, but it was removed in pandas 2.0). If your data is in the DataFrame df:
# Split out the relevant parts of your DataFrame
# (note the trailing space in the 'age ' column name)
top_df = df[['name', 'sex']]
bottom_df = df[['age ', 'sex']].copy()
# Make the column names match
bottom_df.columns = ['name', 'sex']
# Stack the two together
full_df = pd.concat([top_df, bottom_df])
You might have to decide on what kind of indexing you want. This method above will have non-unique indexing in full_df, which could be fixed by running the following line:
full_df.reset_index(drop=True, inplace=True)
You can use pd.melt and drop the variable column using df.drop here (note that the column name in this data is 'age ' with a trailing space):
df = pd.DataFrame({'Unnamed: 0': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}})
df.melt(id_vars=['sex'], value_vars=['name', 'age ']).drop(columns='variable')
sex value
0 female andrea
1 male juan
2 male jose
3 male manuel
4 female 35
5 male 56
6 male 22
7 male 16
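Since the column names and some string values in this data carry trailing spaces, a defensive first step (a sketch, assuming the df above) is to normalize them before melting:
# Strip stray whitespace from column names and from string columns
df.columns = df.columns.str.strip()
df = df.apply(lambda c: c.str.strip() if c.dtype == 'object' else c)
df.melt(id_vars=['sex'], value_vars=['name', 'age']).drop(columns='variable')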

Merge DataFrame with many-to-many

I have 2 DataFrames containing examples, and I would like to see whether an example from DataFrame 1 is present in DataFrame 2.
Normally I would aggregate the rows per example and simply merge the DataFrames. Unfortunately the merging has to be done with a "matching table" which has a many-to-many relationship between the keys (id_low vs. id_high).
Simplified example: the matching table, the input DataFrames, how they match, and the expected output were shown as screenshots; the Python version below reproduces the same data.
Simplified example (for Python)
import pandas as pd
# Dataframe 1 - containing 1 Example
d1 = pd.DataFrame.from_dict({'Example': {0: 'Example 1', 1: 'Example 1', 2: 'Example 1'},
'id_low': {0: 1, 1: 2, 2: 3}})
# DataFrame 2 - containing 1 Example
d2 = pd.DataFrame.from_dict({'Example': {0: 'Example 2', 1: 'Example 2', 2: 'Example 2'},
'id_low': {0: 1, 1: 4, 2: 6}})
# DataFrame 3 - matching table
dm = pd.DataFrame.from_dict({'id_low': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4, 6: 5, 7: 6, 8: 6},
'id_high': {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'B',
6: 'B',
7: 'E',
8: 'F'}})
d1 and d2 are matchable as you can see above.
Expected Output (or similar):
df_output = pd.DataFrame.from_dict({'Example': {0: 'Example 1'}, 'Example_2': {0: 'Example 2'}})
Failed attempts
Aggregating the values after translating them via the matching table, then merging; I also considered using a regex with the OR operator.
IIUC:
(d2.merge(dm)
   .merge(d1.merge(dm), on='id_high')
   .groupby(['Example_x', 'Example_y'])['id_high'].agg(list)
   .reset_index())
Output:
Example_x Example_y id_high
0 Example 2 Example 1 [A, B, E]
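If what you ultimately need is a yes/no "is Example 1 present in Example 2?", one possible reading (a sketch under my own interpretation of "present": every id_low of d1 must translate to at least one id_high that d2 also reaches) is:
high2 = set(d2.merge(dm)['id_high'])
# True when each of d1's id_lows shares at least one id_high with d2
matched = (d1.merge(dm)
             .groupby('id_low')['id_high']
             .apply(lambda s: bool(set(s) & high2)))
print(matched.all())  # True for the sample data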

Pandas: populating new columns from another column's values

I have a pandas.dataframe of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
Adf = pd.DataFrame({'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}})
The column 'field' contains the name of the reporting field (e.g. Indicator), column 'value' contains value for that indicator. Other columns are description for the SEC filing (ticker+date+fiscal_periods = unique set of features to describe certain filing). There are about 60-70 indicators per filing (number varies).
With the code below I've managed to create a pivot dataframe whose columns are the features (say N of them for one submission). But the length of this dataframe also equals the number of indicators N, with NaN everywhere off the diagonal.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from the 'field' column as new columns (going from narrow to wide format). In the end I want to have a dataframe with 1 row for each SEC report.
So the expected output for the provided example is a 1-row dataframe with N new columns, where N = the number of unique indicators from the 'field' column of the initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns, or alternatively, what is the proper way to clean up the multiple NaNs in the resulting dataframe?
What worked for me is setting a new index on the DF and then unstacking the 'field' level:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
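As written, the unstack leaves a column MultiIndex of ('value', field) pairs and also spreads the leftover 'Unnamed: 0' column. A variant sketch (assuming the Adf built above) that selects 'value' first so the columns come out flat:
aa = (Adf.set_index(['ticker', 'cik', 'fiscalyear', 'fiscalquarter',
                     'dcn', 'receiveddate', 'periodenddate', 'field'])['value']
         .unstack()
         .reset_index())
aa.columns.name = None  # drop the residual 'field' axis label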
