Adding sub-headers to a pandas DataFrame - python

I have a large dataframe with lots of columns, each with its own header. I'd like to group them under sub-headers to improve readability. For example, here are my column headers:
df1 = df1[['Date', 'Time', 'USA', 'Canada', 'SD', 'HD']]
I'd like to have headers above certain columns so the output looks like this:
When          Country       Channel
Date   Time   USA   Canada  SD   HD
However, I'm not sure how to go about this. Any assistance or direction is much appreciated.
Thanks!

You can use:
df.columns = pd.MultiIndex.from_tuples([('When', 'Date'), ('When', 'Time'),
                                        ('Country', 'USA'), ('Country', 'Canada'),
                                        ('Channel', 'SD'), ('Channel', 'HD')])
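Put together as a runnable sketch (the sample row values below are made up just to illustrate the grouped headers), the grouped columns then behave like this:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df1 = pd.DataFrame([['2017-04-28', '12:00', 1, 2, 3, 4]],
                   columns=['Date', 'Time', 'USA', 'Canada', 'SD', 'HD'])

# Replace the flat columns with a two-level MultiIndex of (group, name) pairs.
df1.columns = pd.MultiIndex.from_tuples([('When', 'Date'), ('When', 'Time'),
                                         ('Country', 'USA'), ('Country', 'Canada'),
                                         ('Channel', 'SD'), ('Channel', 'HD')])

print(df1)
```

A side benefit is that whole groups can then be selected at once, e.g. `df1['When']` returns just the Date and Time sub-columns.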

Related

Why is there no duplicates in pandas dataframe.index?

I just wrote a program for college using pandas to structure some unstructured data. I definitely made it harder than it needed to be, but I ended up finding something interesting.
Here is the data I parsed:
Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.
National Science Foundation Child Development Center
23.
4201 Wilson Blvd., Suite 180 22203
703-292-4794
Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
Here is the code (aggressively commented for school); it is mostly irrelevant but included for completeness:
import csv
import pandas as pd

lines = []
"""opening the raw data from a text file"""
with open('raw_data.txt') as f:
    lines = f.readlines()
"""removing newline characters"""
for i in range(len(lines)):
    lines[i] = lines[i].rstrip('\n')
df = pd.DataFrame(lines, columns=['info'],
                  index=['business type', 'address', 'location',
                         'phone number', 'unknown', 'owner', 'website', 'description',
                         'null', 'business type', 'unknown', 'address', 'phone number',
                         'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
"""here I am taking every column and adding corresponding values from the original dataframe;
extra data frames could be garbage collected but this serves for demonstration"""
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('bus')]
df2 = df[df.index.str.contains('address')]
df3 = df[df.index.str.contains('location')]
df4 = df[df.index.str.contains('number')]
df5 = df[df.index.str.contains('know')]
df6 = df[df.index.str.contains('owner')]
df7 = df[df.index.str.contains('site')]
df8 = df[df.index.str.contains('descript')]
df9 = df[df.index.str.contains('null')]
for i in range(len(df1)):
    df['business type'][i] = df1['info'][i]
for i in range(len(df2)):
    df['address'][i] = df2['info'][i]
for i in range(len(df3)):
    df['location'][i] = df3['info'][i]
for i in range(len(df4)):
    df['phone number'][i] = df4['info'][i]
for i in range(len(df5)):
    df['unknown'][i] = df5['info'][i]
for i in range(len(df6)):
    df['owner'][i] = df6['info'][i]
for i in range(len(df7)):
    df['website'][i] = df7['info'][i]
for i in range(len(df8)):
    df['description'][i] = df8['info'][i]
for i in range(len(df9)):
    df['null'][i] = df9['info'][i]
"""dropping unnecessary columns"""
df.drop(columns='info', inplace=True)
df.drop(columns='null', inplace=True)
df.drop(columns='unknown', inplace=True)
"""changing the index values to int to make it easier to drop unused rows"""
idx = []
for i in range(0, len(df)):
    idx.append(i)
df.index = idx
"""dropping unused rows"""
for i in range(2, 15):
    df.drop([i], inplace=True)
"""writing to csv and printing to console"""
df.to_csv("new.csv", index=False)
print(df.to_string())
I'm just curious why, when I create more columns using the name of each index item here:
df = pd.DataFrame(lines, columns=['info'],
                  index=['business type', 'address', 'location',
                         'phone number', 'unknown', 'owner', 'website', 'description',
                         'null', 'business type', 'unknown', 'address', 'phone number',
                         'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
the resulting columns contain no duplicates.
When I add
print(df.columns)
I get this output:
Index(['info', 'business type', 'address', 'location', 'phone number',
       'unknown', 'owner', 'website', 'description', 'null'],
      dtype='object')
I'm just generally curious why there are no duplicates, since I'm sure that could be problematic in certain situations. Pandas is interesting, I hardly understand it, and I'd like to know more. Also, if you feel extra enthusiastic, any info on a more efficient way to do this would be greatly appreciated; if not, no worries, I'll eventually read the docs.
The pandas DataFrame is designed for tabular data in which all the entries in any one column have the same type (e.g. integer or string). One row usually represents one instance, sample, or individual. So the natural way to parse your data into a DataFrame is to have two rows, one for each institution, and define the columns as what you have called index (perhaps with the address split into several columns), e.g. business type, street, city, state, post code, phone number, etc.
So there would be one row per institution, and the index would be used to assign a unique identifier to each of them. That's why it's desirable for the index to contain no duplicates.
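On the "why no duplicates" part specifically: `df[name] = value` is dict-like label assignment, so assigning to a label that already exists overwrites that column instead of appending a second one. A minimal sketch of that behavior:

```python
import pandas as pd

df = pd.DataFrame({'info': ['a', 'b']})

# Assigning to the same column label twice: the second assignment
# replaces the first column rather than creating a duplicate.
df['address'] = ''
df['address'] = ''

print(df.columns)  # a single 'address' column, not two
```

So looping over an index that contains repeated labels and doing `df[i] = ''` can only ever produce each column name once.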

pandas using apply method and sending column names?

I have a pandas dataframe where I want to convert each row to a json message to upload. I figured this would be a great use case for the apply method, but I'm having a slight problem: it's not sending the column names.
Here's my data:
               Industry CountryName  periodDate      predicted
0  Advertising Agencies         USA      1995.0  144565.060000
1  Advertising Agencies         USA      1996.0  165903.120000
2  Advertising Agencies         USA      2001.0  326320.740300
When I use apply I lose the column names (Industry, CountryName, periodDate, etc.):
def sendAggData(row):
    uploadDataJson = row.to_json(orient='records')
    print(json.loads(uploadDataJson))

aggValue.apply(sendAggData, axis=1)
I get this result:
['Advertising Agencies', 'USA', 1995.0, 144565.06]
['Advertising Agencies', 'USA', 1996.0, 165903.12]
['Advertising Agencies', 'USA', 2001.0, 326320.7403]
I want this as a json message, so I'd like the column names on it, something like {'Industry': 'Advertising Agencies', 'CountryName': 'USA', ...}. Previously I got this to work using a for loop over each row, but was told that apply is the more pandas way. :-) Any suggestion on how to use apply correctly?
You can just do:
df.apply(lambda x: x.to_json(), axis=1)
Which gives you:
0    {"Industry":"Advertising Agencies","CountryNam...
1    {"Industry":"Advertising Agencies","CountryNam...
2    {"Industry":"Advertising Agencies","CountryNam...
dtype: object
However, what about df.to_dict('records'), which gives:
[{'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 1995.0, 'predicted': 144565.06},
 {'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 1996.0, 'predicted': 165903.12},
 {'Industry': 'Advertising Agencies', 'CountryName': 'USA', 'periodDate': 2001.0, 'predicted': 326320.7403}]
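As a self-contained sketch (sample data reconstructed from the question): the reason the names disappear in the question's code is `orient='records'`, which for a Series serializes only the values. `Series.to_json()` with its default orient keeps the row's index, i.e. the column names, as JSON keys:

```python
import json
import pandas as pd

df = pd.DataFrame({'Industry': ['Advertising Agencies', 'Advertising Agencies'],
                   'CountryName': ['USA', 'USA'],
                   'periodDate': [1995.0, 1996.0],
                   'predicted': [144565.06, 165903.12]})

# Each row passed to apply is a Series whose index is the column names;
# the default orient serializes those names as JSON keys.
messages = df.apply(lambda row: row.to_json(), axis=1)

print(json.loads(messages.iloc[0]))
```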

Splitting columns that contain delimiters using Python

I have an incoming file with 100+ columns, where some columns contain comma-separated values.
I have to convert those delimited columns into multiple columns with the same column header plus a sequence number.
For example, if my input is:
name,age,interests,sports,gender,year
aaa,44,"movies,poker","tennis,baseball",M,2000
bbb,23,"movies","hockey,baseball",F,2018
the output should be as below. We should not hardcode the column names; whichever column contains a comma should be split.
name,age,interests_1,interests_2,sports_1,sports_2,gender,year
aaa,44,movies,poker,tennis,baseball,M,2000
bbb,23,movies,,hockey,baseball,F,2018
Use these columns as the columns of the file you are going to create:
st = '''name,age,interests,sports,gender,year aaa,44,"movies,poker","tennis,baseball",M,2000 bbb,23,"movies","hockey,baseball",F,2018'''
columns = st.split(',')
>>> columns
['name',
'age',
'interests',
'sports',
'gender',
'year aaa',
'44',
'"movies',
'poker"',
'"tennis',
'baseball"',
'M',
'2000 bbb',
'23',
'"movies"',
'"hockey',
'baseball"',
'F',
'2018']
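A raw `split(',')` on the whole string ignores the quoting, though, which is why `"movies,poker"` comes out in pieces. A sketch of the full transformation with pandas (the sample data comes from the question; nothing is hardcoded beyond the delimiter):

```python
import pandas as pd
from io import StringIO

# Sample input reconstructed from the question.
csv_data = '''name,age,interests,sports,gender,year
aaa,44,"movies,poker","tennis,baseball",M,2000
bbb,23,"movies","hockey,baseball",F,2018'''

# read_csv honors the quoting, so "movies,poker" stays one field.
df = pd.read_csv(StringIO(csv_data))

parts = []
for col in df.columns:
    values = df[col].astype(str)
    if values.str.contains(',').any():
        # Split only the columns that actually contain the delimiter,
        # numbering the pieces col_1, col_2, ...
        split = values.str.split(',', expand=True)
        split.columns = [f'{col}_{i + 1}' for i in range(split.shape[1])]
        parts.append(split)
    else:
        parts.append(df[[col]])

result = pd.concat(parts, axis=1)
print(result.columns.tolist())
```

Rows with fewer pieces than the widest row (like bbb's single interest) are padded with missing values, which matches the empty `interests_2` slot in the desired output.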

Adding a column to a pandas dataframe with values from columns of another dataframe, depending on a key from a dictionary

I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft': '', 'Apple': '1', 'Amazon': '2', 'Facebook': '3',
          'Berkshire Hathaway': '4', 'Johnson & Johnson': '5',
          'JPMorgan': '6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from d2 to d1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains a keyword "Amazon" and it was posted on 04/28/2017. I now need to go to df2 to the relevant column name for Amazon (i.e. "ReturnOnAssets.2") and fetch the value for the specified date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', '10.5'], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ['blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which were not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and also this gives me an error: "Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
It can probably be shortened and you can add if statements to deal with when there are missing values.
import pandas as pd
import numpy as np

df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])
# creating myself a bigger df2 to cover all the way to Netflix
for i in range(9):
    df2['ReturnOnAsset.' + str(i)] = np.random.randint(1, 1000, df1.shape[0])
stocks = {'Microsoft': '0', 'Apple': '1', 'Amazon': '2', 'Facebook': '3',
          'Berkshire Hathaway': '4', 'Johnson & Johnson': '5',
          'JPMorgan': '6', 'Alphabet': '7', 'Netflix': '8'}
# new col where to store values
df1['ReturnOnAsset'] = np.nan
for index, row in df1.iterrows():
    colname = 'ReturnOnAsset.' + stocks[row['keyword']]
    df1.loc[index, 'ReturnOnAsset'] = df2.loc[df2['dates'] == row['dates'], colname].values[0]
Next time, please give us correct test data; I modified your dates and dictionary so the first and second columns match (the Netflix and Amazon values).
This code will work if and only if all dates from df1 are in df2. (Note that in df1 the column name is date and in df2 it is dates.)
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft': '', 'Apple': '5', 'Amazon': '2', 'Facebook': '3',
          'Berkshire Hathaway': '4', 'Netflix': '1',
          'JPMorgan': '6', 'Alphabet': '7'}
df1["ReturnOnAssets"] = [
    df2["ReturnOnAssets." + stocks[df1["keyword"][index]]][
        df2.index[df2["dates"] == df1["date"][index]][0]
    ]
    for index in range(len(df1))
]
df1
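A slightly more robust variant of the same lookup idea (dates and the company-to-suffix mapping below are made up so the lookups resolve, in the spirit of the answers above): index df2 by date once, then resolve each row with a single `.loc` call, returning None when the date or company is missing instead of raising:

```python
import pandas as pd

# Hypothetical sample data; dates are chosen to exist in both frames.
df1 = pd.DataFrame([['blala Amazon', '02/30/2017', 'Amazon'],
                    ['blabla Netflix', '03/28/2017', 'Netflix']],
                   columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['02/30/2017', '3.7', '10.5'],
                    ['03/28/2017', '6.0', '10.9']],
                   columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Netflix': '1', 'Amazon': '2'}

# Index df2 by date so each lookup is a direct label access.
df2_by_date = df2.set_index('dates')

def lookup(row):
    col = 'ReturnOnAssets.' + stocks[row['keyword']]
    try:
        return df2_by_date.loc[row['date'], col]
    except KeyError:
        return None  # date or company missing from df2

df1['ReturnOnAssets'] = df1.apply(lookup, axis=1)
```

This avoids both the chained assignment and the "all dates must be present" restriction of the loop-based answers.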

Trim each column's values in pandas

I am working with .xls files; after importing the data into a dataframe with pandas, I need to trim the values. I have a lot of columns. Each value starts with xxx: or yyy: in its column, for example:
xxx:abc yyy:def
xxx:def yyy:ghi
xxx:ghi yyy:jkl
...
I need to trim that xxx: and yyy: from each column. I researched and tried some suggested solutions, but they didn't work. How can I trim these? I need effective code. Thanks in advance.
(The unnecessary chars don't have a static length; I just know what they look like, like stop words. For example:
['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB', ...]
['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB', ...]
I want the new dataset to look like:
['Apple', 'iPhone', '2018', '128GB', ...]
['Samsung', 'Note', '2017', '64GB', ...]
So I want to trim the ('Comp:', 'Product:', 'Year:', ...) stop words from each column.)
You can use pd.Series.str.split for this:
import pandas as pd

df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
                   ['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
                  columns=['Comp', 'Product', 'Year', 'Memory'])
for col in ['Comp', 'Product', 'Year']:
    df[col] = df[col].str.split(':').str.get(1)

#       Comp Product  Year Memory
# 0    Apple  iPhone  2018  128GB
# 1  Samsung    Note  2017   64GB
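Since the question asks not to hardcode the column list, an alternative sketch (assuming every prefix ends at the first `:`) is a regex replace over all columns; values without a colon pass through unchanged:

```python
import pandas as pd

df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
                   ['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
                  columns=['Comp', 'Product', 'Year', 'Memory'])

# Strip everything up to and including the first ':' in every column;
# values without a ':' (like '128GB') are left untouched.
for col in df.columns:
    df[col] = df[col].str.replace(r'^[^:]*:', '', regex=True)

print(df)
```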
