Extract series objects from Pandas DataFrame - python

I have a dataframe with the columns
['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory']
Within each 'Part Number', the 'Calendar Year/Month' values are unique.
I want to convert each part number to a univariate Series with 'Calendar Year/Month' as the index and either 'Sales' or 'Inventory' as the value.
How can I accomplish this using pandas built-in functions and not iterating through the dataframe manually?

In pandas this is called a MultiIndex. Try:
import pandas as pd
# df is the DataFrame from the question; move the two key columns into the index
df = df.set_index(['Part Number', 'Calendar Year/Month'])[['Sales', 'Inventory']]
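As a minimal sketch of the MultiIndex idea, assuming made-up sample values (only the column names come from the question), one part number can be pulled out as a Series indexed by month:

```python
import pandas as pd

# Made-up sample rows; only the column names are taken from the question.
df = pd.DataFrame({
    "CPL4": ["X", "X", "Y", "Y"],
    "Part Number": ["A1", "A1", "B2", "B2"],
    "Calendar Year/Month": ["1.2015", "2.2015", "1.2015", "2.2015"],
    "Sales": [10.0, 12.0, 7.0, 9.0],
    "Inventory": [100, 90, 50, 45],
})

# Move the two key columns into a MultiIndex.
indexed = df.set_index(["Part Number", "Calendar Year/Month"])

# Take one column, then one part number; the month level remains as the index.
sales_a1 = indexed["Sales"].xs("A1", level="Part Number")
print(sales_a1)
```

Swapping "Sales" for "Inventory" gives the other series for the same part.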

You can use the groupby method, such as:
grouped_df = df.groupby('Part Number')
You can then access the DataFrame for a given part number and set its index, such as:
new_df = grouped_df.get_group('THEPARTNUMBERYOUWANT').set_index('Calendar Year/Month')
If you only want the two value columns, you can do:
print(new_df[['Sales', 'Inventory']])
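The groupby idea can also produce every series in one pass. This is a sketch on made-up sample values, and the name series_by_part is only illustrative:

```python
import pandas as pd

# Made-up sample rows; only the column names are taken from the question.
df = pd.DataFrame({
    "Part Number": ["A1", "A1", "B2", "B2"],
    "Calendar Year/Month": ["1.2015", "2.2015", "1.2015", "2.2015"],
    "Sales": [10.0, 12.0, 7.0, 9.0],
    "Inventory": [100, 90, 50, 45],
})

# One Series per part number, indexed by month, collected in a dict.
series_by_part = {
    part: group.set_index("Calendar Year/Month")["Sales"]
    for part, group in df.groupby("Part Number")
}

print(series_by_part["B2"])
```

The loop here is over groups rather than rows, so the per-row work stays inside pandas.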

From the answers and comments here, along with a little more research, I ended up with the following solution.
temp_series = df[df["Part Number"] == sku].pivot(columns = ["Calendar Year/Month"], values = "Sales").iloc[0]
Where sku is a specific part number from df["Part Number"].unique()
This will give you a univariate time series (temp_series) indexed by "Calendar Year/Month" with values of "Sales", e.g.:
1.2015 NaN
1.2016 NaN
2.2015 NaN
2.2016 NaN
3.2015 NaN
3.2016 NaN
4.2015 NaN
4.2016 NaN
5.2015 NaN
5.2016 NaN
6.2015 NaN
6.2016 NaN
7.2015 NaN
7.2016 NaN
8.2015 NaN
8.2016 NaN
9.2015 NaN
10.2015 NaN
11.2015 NaN
12.2015 NaN
Name: 161, dtype: float64
<class 'pandas.core.series.Series'>
from the columns
['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory']

Related

How to add a new row and new column to a multiindex Pandas dataframe?

I am trying to use .loc to add a new row and a new column to a MultiIndex pandas DataFrame by specifying both axes. The problem is that it creates the new index entry without the new column, and at the same time throws an obscure KeyError: 6.
How can I do that? A one-line solution would be much appreciated.
> df
side total value
city code type
NaN NTE urban ouest 0.01949 391.501656
> df.loc[(np.nan, 'NTE', 'rural'), 'population'] = 1000
KeyError: 6
> df
side total value
city code type
NaN NTE urban ouest 0.01949 391.501656
NaN NTE rural NaN NaN NaN
Now, when I try the same command again it complains the index doesn't exist.
> df.loc[(np.nan, 'NTE', 'rural'), 'population'] = 1000
KeyError: (nan, 'NTE', 'rural')
The desired output would be this dataframe:
side total value population
city code type
NaN NTE urban ouest 0.01949 391.501656 NaN
NaN NTE rural NaN NaN NaN 1000
The problem here is the missing value in the index. A possible workaround is to assign with an empty string in place of NaN and then rename it:
df.loc[('', 'NTE', 'rural'), 'population'] = 1000
print (df.index)
MultiIndex([(nan, 'NTE', 'urban'),
( '', 'NTE', 'rural')],
names=['city', 'code', 'type'])
df = df.rename({'':np.nan}, level=0)
print (df.index)
MultiIndex([(nan, 'NTE', 'urban'),
(nan, 'NTE', 'rural')],
names=['city', 'code', 'type'])
print (df)
side total value population
city code type
NaN NTE urban ouest 0.01949 391.501656 NaN
rural NaN NaN NaN 1000.0
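A different route, not taken from the answer above, is to build the new row as its own one-row frame and append it with pd.concat, which avoids enlargement through .loc. This is only a sketch: the frame is rebuilt from the values printed in the question, and new_row and result are illustrative names.

```python
import numpy as np
import pandas as pd

names = ["city", "code", "type"]

# Rebuild the one-row frame shown in the question.
df = pd.DataFrame(
    {"side": ["ouest"], "total": [0.01949], "value": [391.501656]},
    index=pd.MultiIndex.from_tuples([(np.nan, "NTE", "urban")], names=names),
)

# The new row carries only the new column; its index labels are built up front.
new_row = pd.DataFrame(
    {"population": [1000]},
    index=pd.MultiIndex.from_tuples([(np.nan, "NTE", "rural")], names=names),
)

# Row-wise concat takes the union of the columns and fills the gaps with NaN.
result = pd.concat([df, new_row])
print(result)
```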

Python Pandas dataframe - add a new column based on index value

I have a Python pandas dataframe that looks like this:
year_2000 year_1999 year_1998 year_1997 year_1996 year_1995 (MANH, stock name)
MANH 454.47 -71.90 nan nan nan nan TEST
LH 385.52 180.95 -24.14 -41.67 -68.92 -26.47 TEST
DGX 373.33 68.04 4.01 nan nan nan TEST
SKX 306.56 nan nan nan nan nan TEST
where the stock tickers are the index. I want to add the name of each stock as a new column.
I tried adding the stock name column via yearly_best['MANH','stock name']='TEST' but it adds the same name in all rows.
I have a dictionary called ticker_name which contains the tickers and the names
Out[65]:
{'TWOU': '2U',
'MMM': '3M',
'ABT': 'Abbott Laboratories',
'ABBV': 'AbbVie Inc.',
'ABMD': 'Abiomed',
'ACHC': 'Acadia Healthcare',
'ACN': 'Accenture',
'ATVI': 'Activision Blizzard',
'AYI': 'Acuity Brands',
'ADNT': 'Adient',
Thus I would like to get the names from the dict and put them in a column in the dataframe. How can I do that?
As the keys of your dictionary are the index values of your DataFrame, you can try:
d = {'TWOU': '2U',
'MMM': '3M',
'ABT': 'Abbott Laboratories',
'ABBV': 'AbbVie Inc.',
'ABMD': 'Abiomed',
'ACHC': 'Acadia Healthcare',
'ACN': 'Accenture',
'ATVI': 'Activision Blizzard',
'AYI': 'Acuity Brands',
'ADNT': 'Adient',}
df['stock name'] = pd.Series(d)
You can try:
# Create a new column "stock_name" with the index values
yearly_best['stock_name'] = yearly_best.index
# Replace the "stock_name" values based on the dictionary
# (Series.map has no inplace option, so assign the result back)
yearly_best['stock_name'] = yearly_best['stock_name'].map(ticker_name)
Note that in this case, the dataframe's index will remain as it was (stock tickers). If you would like to replace the index with row numbers, consider using reset_index.
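The two steps can be collapsed by mapping the index directly. The sketch below uses a shortened copy of the dictionary and made-up numbers; the ticker "ZZZZ" is hypothetical and is there only to show what happens when a key is missing.

```python
import pandas as pd

# Shortened copy of the ticker -> name dictionary from the question.
ticker_name = {"MMM": "3M", "ABT": "Abbott Laboratories", "ACN": "Accenture"}

# Made-up values; "ZZZZ" is a hypothetical ticker absent from the dictionary.
yearly_best = pd.DataFrame(
    {"year_2000": [1.0, 2.0, 3.0]},
    index=["ABT", "MMM", "ZZZZ"],
)

# Turn the index into a Series, then look each ticker up in the dictionary.
yearly_best["stock_name"] = yearly_best.index.to_series().map(ticker_name)
print(yearly_best)
```

Tickers that are not in the dictionary end up as NaN rather than raising an error.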

PANDAS dataframe concat and pivot data

I'm learning Python pandas and playing with some example data. I have a CSV file of a dataset with net worth by percentile of the US population by quarter.
I've successfully subsetted the data by percentile to create three scatter plots of net worth by year, one plot for each of three population sections. However, I'm trying to combine those three plots into one data frame so I can draw the lines on a single figure.
Data here:
https://www.federalreserve.gov/releases/z1/dataviz/download/dfa-income-levels.csv
Code thus far:
import pandas as pd
import matplotlib.pyplot as plt
# importing numpy as np
import numpy as np
df = pd.read_csv("dfa-income-levels.csv")
df99th = df.loc[df['Category']=="pct99to100"]
df99th.plot(x='Date',y='Net worth', title='Net worth by percentile')
dfmid = df.loc[df['Category']=="pct40to60"]
dfmid.plot(x='Date',y='Net worth')
dflow = df.loc[df['Category']=="pct00to20"]
dflow.plot(x='Date',y='Net worth')
data = dflow['Net worth'], dfmid['Net worth'], df99th['Net worth']
headers = ['low', 'mid', '99th']
newdf = pd.concat(data, axis=1, keys=headers)
And that yields a dataframe shown below, which is not what I want for plotting the data.
low mid 99th
0 NaN NaN 3514469.0
3 NaN 2503918.0 NaN
5 585550.0 NaN NaN
6 NaN NaN 3602196.0
9 NaN 2518238.0 NaN
... ... ... ...
747 NaN 8610343.0 NaN
749 3486198.0 NaN NaN
750 NaN NaN 32011671.0
753 NaN 8952933.0 NaN
755 3540306.0 NaN NaN
Any recommendations for other ways to approach this?
# filter your dataframe to only the categories you're interested in
filtered_df = df[df['Category'].isin(['pct99to100', 'pct00to20', 'pct40to60'])]
filtered_df = filtered_df[['Date', 'Category', 'Net worth']]
fig, ax = plt.subplots() # ax is an axes object, which can hold several lines
# draw one line per category on the same axes, with Date on the x axis
for category, group in filtered_df.groupby('Category'):
    group.plot(x='Date', y='Net worth', ax=ax, label=category)
I don't see the categories mentioned in your code in the CSV file you shared. To concatenate dataframes along columns, you can use pd.concat with axis=1. It lines up rows that share the same index label, so first set the Date column as the index, then concatenate, and finally bring Date back as a regular column.
Set the Date column as the index of each dataframe: df1 = df1.set_index('Date') and df2 = df2.set_index('Date')
Concatenate df1 and df2 using df_merge = pd.concat([df1,df2],axis=1) or df_merge = pd.merge(df1,df2,on='Date')
Bring Date back as a column with df_merge = df_merge.reset_index()
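Another way to get one column per category, not mentioned in the answers above, is to pivot the long table. This is a sketch on made-up numbers: the column names and category labels are taken from the question's code, while the Date strings and values are invented.

```python
import pandas as pd

# Made-up long-format data shaped like the question's frame.
df = pd.DataFrame({
    "Date": ["2019:Q1"] * 3 + ["2019:Q2"] * 3,
    "Category": ["pct99to100", "pct40to60", "pct00to20"] * 2,
    "Net worth": [30.0, 8.0, 3.0, 31.0, 8.5, 3.2],
})

wanted = ["pct99to100", "pct40to60", "pct00to20"]

# One row per Date, one column per Category.
wide = (
    df[df["Category"].isin(wanted)]
    .pivot(index="Date", columns="Category", values="Net worth")
)
print(wide)
```

Calling wide.plot() on a frame shaped like this draws one line per column on a single axes.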

NaN columns pandas turning into empty row when using pivot_table

I use read_csv to fill a pandas DataFrame. One of its columns is entirely NaN, and this becomes a problem when I use pivot_table.
Here is my situation:
d= {'dates': ['01/01/20','01/02/20','01/03/20'], 'country':['Fra','Fra','Fra'], 'val': [np.nan,np.nan,np.nan]}
df = pd.DataFrame(data=d)
piv=df.pivot_table(index='country',values='val',columns='dates')
print(piv)
Empty DataFrame
Columns: []
Index: []
I would like to have this :
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
From the docs for DataFrame.pivot_table: set dropna=False.
piv = df.pivot_table(index='country',values='val',columns='dates', dropna=False)
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
Just use the dropna argument of pivot_table:
df.pivot_table(index='country',columns='dates', values='val', dropna = False)
The output is:
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
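When every (country, dates) pair occurs only once, as in the example data, DataFrame.pivot is another option: it reshapes without aggregating. A small sketch using the question's own data:

```python
import numpy as np
import pandas as pd

# Same data as in the question.
d = {
    "dates": ["01/01/20", "01/02/20", "01/03/20"],
    "country": ["Fra", "Fra", "Fra"],
    "val": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data=d)

# pivot only reshapes, so the all-NaN cells are kept as they are.
piv = df.pivot(index="country", columns="dates", values="val")
print(piv)
```

Unlike pivot_table, pivot raises an error if an (index, columns) pair is duplicated, so it only fits data without repeats.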

Pandas aligning multiple dataframes with TimeStamp index

This has been the bane of my life for the past couple of days. I have numerous Pandas Dataframes that contain time series data with irregular frequencies. I try to align these into a single dataframe.
Below is some code, with representative dataframes, df1, df2, and df3 ( I actually have n=5, and would appreciate a solution that would work for all n>2):
# df1, df2, df3 are given at the bottom
import pandas as pd
import datetime
# I can align df1 to df2 easily
df1aligned, df2aligned = df1.align(df2)
# And then concatenate into a single dataframe
combined_1_n_2 = pd.concat([df1aligned, df2aligned], axis =1 )
# Since I don't know any better, I then try to align df3 to combined_1_n_2 manually:
combined_1_n_2.align(df3)
error: Reindexing only valid with uniquely valued Index objects
I have an idea why I get this error, so I get rid of the duplicate indices in combined_1_n_2 and try again:
combined_1_n_2 = combined_1_n_2.groupby(combined_1_n_2.index).first()
combined_1_n_2.align(df3) # But still get the same error
error: Reindexing only valid with uniquely valued Index objects
Why am I getting this error? Even if this worked, it is completely manual and ugly. How can I align >2 time series and combine them in a single dataframe?
Data:
df1 = pd.DataFrame( {'price' : [62.1250,62.2500,62.2375,61.9250,61.9125 ]},
index = [pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
for s in ['2008-06-01 06:03:59.614000', '2008-06-01 06:03:59.692000',
'2008-06-01 06:15:42.004000', '2008-06-01 06:15:42.083000','2008-06-01 06:17:01.654000' ] ])
df2 = pd.DataFrame({'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750 ]},
index = [pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
for s in ['2008-06-01 06:13:34.524000', '2008-06-01 06:13:34.602000',
'2008-06-01 06:15:05.399000', '2008-06-01 06:15:05.399000','2008-06-01 06:15:42.082000' ] ])
df3 = pd.DataFrame({'price': [67.656, 67.875, 67.8125, 67.75, 67.6875 ]},
index = [pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
for s in ['2008-06-01 06:03:52.281000', '2008-06-01 06:03:52.359000',
'2008-06-01 06:13:34.848000', '2008-06-01 06:13:34.926000','2008-06-01 06:15:05.321000' ] ])
Your specific error is due to the column names of combined_1_n_2 containing duplicates (both columns are named 'price'). If you rename the columns, the second align will work.
An alternative is to chain the join method, which merges frames on the index, as below.
In [23]: df1.join(df2, how='outer', rsuffix='_1').join(df3, how='outer', rsuffix='_2')
Out[23]:
price price_1 price_2
2008-06-01 06:03:52.281000 NaN NaN 67.6560
2008-06-01 06:03:52.359000 NaN NaN 67.8750
2008-06-01 06:03:59.614000 62.1250 NaN NaN
2008-06-01 06:03:59.692000 62.2500 NaN NaN
2008-06-01 06:13:34.524000 NaN 241.0625 NaN
2008-06-01 06:13:34.602000 NaN 241.5000 NaN
2008-06-01 06:13:34.848000 NaN NaN 67.8125
2008-06-01 06:13:34.926000 NaN NaN 67.7500
2008-06-01 06:15:05.321000 NaN NaN 67.6875
2008-06-01 06:15:05.399000 NaN 241.3750 NaN
2008-06-01 06:15:05.399000 NaN 241.2500 NaN
2008-06-01 06:15:42.004000 62.2375 NaN NaN
2008-06-01 06:15:42.082000 NaN 241.3750 NaN
2008-06-01 06:15:42.083000 61.9250 NaN NaN
2008-06-01 06:17:01.654000 61.9125 NaN NaN
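For an arbitrary number of frames, the chained join can be folded with functools.reduce. This is a sketch on small made-up frames: the names a, b, and c and all values are illustrative, and each 'price' column is renamed first so the column names stay unique.

```python
from functools import reduce

import pandas as pd

# Made-up frames; "b" repeats a timestamp, like df2 in the question.
frames = {
    "a": pd.DataFrame(
        {"price": [1.0, 2.0]},
        index=pd.to_datetime(["2008-06-01 06:00:00", "2008-06-01 06:01:00"]),
    ),
    "b": pd.DataFrame(
        {"price": [3.0, 4.0, 5.0]},
        index=pd.to_datetime(
            ["2008-06-01 06:01:00", "2008-06-01 06:01:00", "2008-06-01 06:02:00"]
        ),
    ),
    "c": pd.DataFrame(
        {"price": [6.0]},
        index=pd.to_datetime(["2008-06-01 05:59:00"]),
    ),
}

# Give every frame a unique column name before combining.
renamed = [frame.rename(columns={"price": name}) for name, frame in frames.items()]

# Outer-join the frames on their index, one after another.
combined = reduce(lambda left, right: left.join(right, how="outer"), renamed)
print(combined)
```

Duplicate timestamps are kept as separate rows, so values from the other frames are repeated on each of them.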
