PANDAS dataframe concat and pivot data - python

I'm learning Python pandas and playing with some example data. I have a CSV file of a dataset with net worth by percentile of the US population, by quarter.
I've successfully subsetted the data by percentile to create three scatter plots of net worth by year, one plot for each of three population sections. However, I'm now trying to combine those three subsets into one data frame so I can draw the lines on a single plot figure.
Data here:
https://www.federalreserve.gov/releases/z1/dataviz/download/dfa-income-levels.csv
Code thus far:
import pandas as pd
import matplotlib.pyplot as plt
# importing numpy as np
import numpy as np
df = pd.read_csv("dfa-income-levels.csv")
df99th = df.loc[df['Category']=="pct99to100"]
df99th.plot(x='Date',y='Net worth', title='Net worth by percentile')
dfmid = df.loc[df['Category']=="pct40to60"]
dfmid.plot(x='Date',y='Net worth')
dflow = df.loc[df['Category']=="pct00to20"]
dflow.plot(x='Date',y='Net worth')
data = dflow['Net worth'], dfmid['Net worth'], df99th['Net worth']
headers = ['low', 'mid', '99th']
newdf = pd.concat(data, axis=1, keys=headers)
And that yields the dataframe shown below, which is not what I want for plotting the data.
           low        mid        99th
0          NaN        NaN   3514469.0
3          NaN  2503918.0         NaN
5     585550.0        NaN         NaN
6          NaN        NaN   3602196.0
9          NaN  2518238.0         NaN
..         ...        ...         ...
747        NaN  8610343.0         NaN
749  3486198.0        NaN         NaN
750        NaN        NaN  32011671.0
753        NaN  8952933.0         NaN
755  3540306.0        NaN         NaN
Any recommendations for other ways to approach this?

# filter your dataframe to only the categories you're interested in
filtered_df = df[df['Category'].isin(['pct99to100', 'pct00to20', 'pct40to60'])]
filtered_df = filtered_df[['Date', 'Category', 'Net worth']]
fig, ax = plt.subplots()  # ax is an Axes object, allowing multiple plots on one axis
filtered_df.groupby('Category').plot(x='Date', y='Net worth', ax=ax)
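If you would rather end up with a single dataframe, as in the question, you can also pivot Category into columns; a minimal sketch, assuming each (Date, Category) pair occurs exactly once in the CSV:
# pivot so each category becomes its own column, indexed by Date
pivoted = (df[df['Category'].isin(['pct00to20', 'pct40to60', 'pct99to100'])]
           .pivot(index='Date', columns='Category', values='Net worth'))
pivoted.plot(title='Net worth by percentile')  # one line per category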

I don't see the categories mentioned in your code in the csv file you shared. To concatenate dataframes along columns you can use pd.concat with axis=1; it aligns rows that share the same index value. So first set the Date column as the index, concat the frames, and then bring Date back as an ordinary column.
To set the Date column as the index of each dataframe: df1 = df1.set_index('Date') and df2 = df2.set_index('Date')
Concat the dataframes with df_merge = pd.concat([df1, df2], axis=1), or, keeping Date as a column instead, df_merge = pd.merge(df1, df2, on='Date')
Bring Date back into a column with df_merge = df_merge.reset_index()
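Applied to the three subsets in the question, a minimal sketch (variable and column names taken from the code above):
# index each subset by Date so concat aligns rows on dates, not row numbers
pieces = [s.set_index('Date')['Net worth'] for s in (dflow, dfmid, df99th)]
newdf = pd.concat(pieces, axis=1, keys=['low', 'mid', '99th']).reset_index()
newdf.plot(x='Date', title='Net worth by percentile')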

Related

Python turning a list of dataframes into one dataframe

I'm using a Bloomberg API to extract the data; the output was a list of dataframes. I'm trying to merge them together. Here is what a snippet of the list looks like:
In [60]: df
Out[60]:
[FDIDFDMO INDEX
 BN_SURVEY_AVERAGE                           0.9
 ECO_RELEASE_DT                       2022-03-15
 ECO_RELEASE_TIME                       08:30:00
 NAME                 US PPI Final Demand MoM SA,
 INJCJC INDEX
 BN_SURVEY_AVERAGE                         215.3
 ECO_RELEASE_DT                       2022-03-10
 ECO_RELEASE_TIME                       08:30:00
 NAME               US Initial Jobless Claims SA]
Where "FDIDFDMO INDEX, INJCJC INDEX,... etc" are the column names. each dataframe is a dim of 4x1.
As #voidpointercast stated, pd.concat was correct.
import pandas as pd
df = pd.concat(df, axis = 1)
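For reference, a minimal self-contained sketch with two 4x1 frames shaped like the output above:
import pandas as pd

rows = ["BN_SURVEY_AVERAGE", "ECO_RELEASE_DT", "ECO_RELEASE_TIME", "NAME"]
a = pd.DataFrame({"FDIDFDMO INDEX": [0.9, "2022-03-15", "08:30:00", "US PPI Final Demand MoM SA"]}, index=rows)
b = pd.DataFrame({"INJCJC INDEX": [215.3, "2022-03-10", "08:30:00", "US Initial Jobless Claims SA"]}, index=rows)
merged = pd.concat([a, b], axis=1)  # rows align on the shared index labels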

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds on that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns, because I don't want the rows to repeat every time; if it stays like this I'll have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the functions you want to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd

cat = ["NumericIndex", "OriginMovementID", "DestinationMovementID", "MeanTravelTimeSeconds",
       "RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
    [{"Date": d, "Observation": cat[random.randint(0, len(cat) - 1)],
      "Value": random.randint(1000, 10000)}
     for i in range(random.randint(5, 20))
     for d in pd.date_range(dt.datetime(2016, 1, 2), dt.datetime(2016, 3, 31), freq="14D")])
# starting point....
df = df.sort_values(["Date", "Observation"]).set_index(["Date", "Observation"])
# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s = 0
p = ""
for i, v in enumerate(df.index):
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
                                  Value
Observation       DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date       SeqNo
2016-01-02 0                        NaN                   NaN       2560.0           5324.0                           5085.0
           1                        NaN                   NaN       1066.0           7372.0                              NaN
2016-01-16 0                        NaN                6226.0          NaN           7832.0                              NaN
           1                        NaN                1384.0          NaN           8839.0                              NaN
           2                        NaN                7892.0          NaN              NaN                              NaN
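As an aside, the same per-key counter can be built without the Python loop: groupby(...).cumcount() numbers the rows within each (Date, Observation) key. A minimal sketch, assuming the frame is sorted as above:
# vectorized equivalent of the sequential-number loop
df["SeqNo"] = df.groupby(level=["Date", "Observation"]).cumcount()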

Mapping a new column to a DataFrame by rows from another DataFrame

I have a Pandas DataFrame stations with index as id:
id station lat lng
1 Boston 45.343 -45.333
2 New York 56.444 -35.690
I have another DataFrame df1 that has the following:
duration date station gender
NaN 20181118 NaN M
9 20181009 2.0 F
8 20170605 1.0 F
I want to add to df1 so that it looks like the following DataFrame:
duration date station gender lat lng
NaN 20181118 NaN M NaN NaN
9 20181009 New York F 56.444 -35.690
8 20170605 Boston F 45.343 -45.333
I tried doing this iteratively by looking up each row with stations.loc[], as shown in the following example, but I have about 2 million rows and it ended up taking a lot of time.
stat_list = []
lng_list = []
lat_list = []
for stat in df1['station']:
    if not np.isnan(stat):
        ref = stations.loc[int(stat)]  # look up the station row by id
        stat_list.append(ref.station)
        lng_list.append(ref.lng)
        lat_list.append(ref.lat)
    else:
        stat_list.append(np.nan)
        lng_list.append(np.nan)
        lat_list.append(np.nan)
Is there a faster way to do this?
Looks like this would be best solved with a merge which should significantly boost performance:
df1.merge(stations, left_on="station", right_index=True, how="left")
This will leave you with two columns, station_x and station_y. If you only want the station column with the string names in it, you can do:
df_merged = df1.merge(stations, left_on="station", right_index=True, how="left", suffixes=("_x", ""))
df_final = df_merged[df_merged.columns.difference(["station_x"])]
(or just rename one of them before you merge)
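Alternatively, since stations is indexed by id, Series.map can pull over individual columns without a merge; a minimal sketch:
# map station ids to each stations column; overwrite the id column last,
# since the name lookup still needs the original ids
df1['lat'] = df1['station'].map(stations['lat'])
df1['lng'] = df1['station'].map(stations['lng'])
df1['station'] = df1['station'].map(stations['station'])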

Pandas reindex and interpolate time series efficiently (reindex drops data)

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between the old and new indices. For example:
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index)
# I want rounded date-times
desired_index = pandas.date_range("2010-10-08",periods=10,freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex, as that would lose precision, so I came up with the following: concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
One simple way of doing this is to reindex onto the union of the old and new indices, interpolate on that, and then reindex again down to the desired index.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
      .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
                         + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
              Value=np.random.randn(337))
      .set_index(['SampleTime'])
      )
Let's see what the data looks like:
df.head()
                        Value
SampleTime
2018-10-01 00:00:03  0.033171
2018-10-01 00:30:03  0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (pad, for example).
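For instance, a sketch that back-fills that leading NaN (the first desired timestamp precedes every original sample, so only a fill from the later side can cover it):
result = (df
          .reindex(df.index.union(desired_index))
          .interpolate(method='time')
          .reindex(desired_index)
          .bfill())  # fill the leading edge from the first interpolated value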
My methods:
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
# put the original observations back, then interpolate across the union
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df = df.sort_index().interpolate(method='linear')
Method 2:
# drop duplicate timestamps, keeping the last observation
prices = data.loc[~data.index.duplicated(keep='last')]
idx1 = pd.to_datetime(prices.index, errors='coerce')
prices.index = idx1  # ensure a datetime index so the union aligns
merged = idx1.union(idx2)  # idx2 is the desired target index
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df

Simple way to subset multiple data frames with pandas groupby or other function?

I have a DataFrame with two columns that I got with this command: result = pd.concat([Value, Date], axis=1)
import pandas as pd
>>> result
Value Date
189 9.0 11/14/15
191 10.0 11/14/15
192 1.0 11/14/15
193 4.0 11/14/15
... ... ...
2920 6.0 2/20/16
2921 8.0 2/20/16
2923 10.0 2/20/16
2925 2.0 2/20/16
But what I need is multiple dataframes, one per Date, holding all the Value data for that date. I know that I can execute something like x = result.groupby('Date').mean(), which gives me the mean Value for each Date, but I want the actual data used to produce each mean, in its own dataframe.
Is there another argument or function to simply get this data frame?
From your comments, you can use seaborn's FacetGrid to plot a distplot of every date directly, without any grouping or looping. Here is some fake data for 12 days, and then the plot.
Create fake data and then plot:
import numpy as np
import pandas as pd
import seaborn as sns

date = pd.date_range('1-1-2016', '1-13-2016', freq='h', closed='left').date
df = pd.DataFrame({'num': np.random.rand(len(date)), 'date': date})
g = sns.FacetGrid(df, col='date', col_wrap=4)
g.map(sns.distplot, "num", hist=False, rug=True)
Your specific data:
g = sns.FacetGrid(result, col='Date', col_wrap=4)
g.map(sns.distplot, 'Value', hist=False, rug=True)
If you do want each date's data in its own DataFrame, you need a place to put them all. Let's say you put them in a dictionary d:
d = {day: group for day, group in result.groupby('Date')}
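Each entry of d is then the original rows for that date, so, for example (dates taken from the sample above):
d['11/14/15']                  # all rows for that date, as their own DataFrame
d['11/14/15']['Value'].mean()  # matches that date's row in result.groupby('Date').mean()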
