I'm using the Bloomberg API to extract data, and the output is a list of dataframes that I'm trying to merge together. Here is a snippet of what the list looks like:
In[60]: df
Out[60]:
[ FDIDFDMO INDEX
BN_SURVEY_AVERAGE 0.9
ECO_RELEASE_DT 2022-03-15
ECO_RELEASE_TIME 08:30:00
NAME US PPI Final Demand MoM SA,
INJCJC INDEX
BN_SURVEY_AVERAGE 215.3
ECO_RELEASE_DT 2022-03-10
ECO_RELEASE_TIME 08:30:00
NAME US Initial Jobless Claims SA]
Here "FDIDFDMO INDEX", "INJCJC INDEX", etc. are the column names, and each dataframe has dimensions 4x1.
As @voidpointercast stated, pd.concat was correct:
import pandas as pd
df = pd.concat(df, axis=1)
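For illustration, here is a minimal sketch that rebuilds two 4x1 dataframes shaped like the snippet above (the values are copied from the printed output, not pulled from Bloomberg) and concatenates them side by side:
import pandas as pd

fields = ["BN_SURVEY_AVERAGE", "ECO_RELEASE_DT", "ECO_RELEASE_TIME", "NAME"]
fdidfdmo = pd.DataFrame(
    {"FDIDFDMO INDEX": [0.9, "2022-03-15", "08:30:00", "US PPI Final Demand MoM SA"]},
    index=fields)
injcjc = pd.DataFrame(
    {"INJCJC INDEX": [215.3, "2022-03-10", "08:30:00", "US Initial Jobless Claims SA"]},
    index=fields)

# concatenating along axis=1 aligns on the shared row labels,
# giving a single 4x2 dataframe with one column per ticker
merged = pd.concat([fdidfdmo, injcjc], axis=1)
print(merged)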
I'm learning Python pandas and playing with some example data. I have a CSV file of a dataset with net worth by percentile of the US population by quarter of year.
I've successfully subsetted the data by percentile to create three scatter plots of net worth by year, one plot for each of three population segments. However, I'm now trying to combine those three subsets into one dataframe so I can draw the lines on a single plot figure.
Data here:
https://www.federalreserve.gov/releases/z1/dataviz/download/dfa-income-levels.csv
Code thus far:
import pandas as pd
import matplotlib.pyplot as plt
# importing numpy as np
import numpy as np
df = pd.read_csv("dfa-income-levels.csv")
df99th = df.loc[df['Category']=="pct99to100"]
df99th.plot(x='Date',y='Net worth', title='Net worth by percentile')
dfmid = df.loc[df['Category']=="pct40to60"]
dfmid.plot(x='Date',y='Net worth')
dflow = df.loc[df['Category']=="pct00to20"]
dflow.plot(x='Date',y='Net worth')
data = dflow['Net worth'], dfmid['Net worth'], df99th['Net worth']
headers = ['low', 'mid', '99th']
newdf = pd.concat(data, axis=1, keys=headers)
And that yields a dataframe shown below, which is not what I want for plotting the data.
low mid 99th
0 NaN NaN 3514469.0
3 NaN 2503918.0 NaN
5 585550.0 NaN NaN
6 NaN NaN 3602196.0
9 NaN 2518238.0 NaN
... ... ... ...
747 NaN 8610343.0 NaN
749 3486198.0 NaN NaN
750 NaN NaN 32011671.0
753 NaN 8952933.0 NaN
755 3540306.0 NaN NaN
Any recommendations for other ways to approach this?
# filter your dataframe to only the categories you're interested in
filtered_df = df[df['Category'].isin(['pct99to100', 'pct00to20', 'pct40to60'])]
filtered_df = filtered_df[['Date', 'Category', 'Net worth']]
fig, ax = plt.subplots() #ax is an axis object allowing multiple plots per axis
filtered_df.groupby('Category').plot(ax=ax)
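Alternatively, a pivot gives one column per category, so a single plot call draws all three lines with a labelled legend (a sketch, assuming each Date/Category pair appears only once in the filtered data):
fig, ax = plt.subplots()
# rows become dates, columns become the three percentile categories
filtered_df.pivot(index='Date', columns='Category', values='Net worth').plot(ax=ax, title='Net worth by percentile')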
I don't see the categories mentioned in your code in the CSV file you shared. To concatenate dataframes along columns you can use pd.concat with axis=1; it aligns columns on matching index values. So first set the Date column as the index, then concat, and then bring Date back as a regular column.
To set the Date column as the index of each dataframe: df1 = df1.set_index('Date') and df2 = df2.set_index('Date')
Concatenate the dataframes with df_merge = pd.concat([df1, df2], axis=1) (or merge them with df_merge = pd.merge(df1, df2, on='Date'))
Bring Date back as a column with df_merge = df_merge.reset_index()
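Applied to the three subsets from the question's code (dflow, dfmid, df99th), the whole sequence might look like this sketch (assuming each subset has one row per Date):
# index each subset by Date so concat aligns rows by date rather than by row number
low = dflow.set_index('Date')['Net worth'].rename('low')
mid = dfmid.set_index('Date')['Net worth'].rename('mid')
high = df99th.set_index('Date')['Net worth'].rename('99th')

newdf = pd.concat([low, mid, high], axis=1).reset_index()
newdf.plot(x='Date', y=['low', 'mid', '99th'], title='Net worth by percentile')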
Sample data:
Dataframe 1:
cusip_id    trd_exctn_dt    time_to_maturity
00077AA2    2015-05-09      1.20 years
00077TBO    2015-05-06      3.08 years
Dataframe 2:
Index         SVENY01    SVENY02    SVENY03    SVENY04
2015-05-09    1.35467    1.23367    1.52467    1.89467
2015-05-08    1.65467    1.87967    1.43251    1.98765
2015-05-07    1.35467    1.76567    1.90271    1.43521
2015-05-06    1.34467    1.35417    1.67737    1.11167
Desired output:
I want to exactly match the 'trd_exctn_dt' in df1 with the date in the index of df2, while at the same time matching the 'time_to_maturity' in df1 with the nearest SVENYXX column in df2 (rounded up, e.g. 1.20 years maps to SVENY02). For example, for cusip_id 00077AA2 the trd_exctn_dt is 2015-05-09 and the time_to_maturity is 1.20 years, so I want the corresponding value in df2 at the 2015-05-09 row of column SVENY02.
I want to repeat this for several cusip_ids, how would I achieve this?
Any help would be appreciated!
Here is my solution code:
import pandas as pd

SVENYXX = []
for i in range(df1['cusip_id'].shape[0]):
    cusip_id = df1['cusip_id'][i]
    trd_exctn_date = df1['trd_exctn_dt'][i]
    maturity_time = df1['time_to_maturity'][i]  # assumed to be numeric (years as a float)
    # all SVENY columns for the trade's execution date
    svenyVals = df2.loc[trd_exctn_date]
    # pick the value in that row numerically closest to the time to maturity
    closestSvenyVal = svenyVals.iloc[(svenyVals - maturity_time).abs().argsort()[0]]
    SVENYXX.append(closestSvenyVal)
where df1 is Dataframe 1, df2 is Dataframe 2, and SVENYXX is the list of the closest SVENY values for each cusip_id.
I loop through all the cusip_ids and obtain the corresponding trd_exctn_dt and time_to_maturity values. Then, with the extracted data, I find the matching date row in Dataframe 2 and, by taking the smallest difference between svenyVals and time_to_maturity, I append that value to the SVENYXX list.
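If you instead want the exact mapping described in the question (round time_to_maturity up and read that SVENY column directly), a minimal sketch could look like the following, assuming time_to_maturity is stored as text such as "1.20 years" and that the dates in df1 match the index type of df2 (the sveny_value column name is just illustrative):
import math

def sveny_lookup(row, yields):
    # "1.20 years" -> 1.2 -> SVENY02 (round the maturity up to the nearest whole year)
    years = float(str(row['time_to_maturity']).split()[0])
    column = 'SVENY{:02d}'.format(math.ceil(years))
    # read the yield for that maturity on the trade's execution date
    return yields.loc[row['trd_exctn_dt'], column]

df1['sveny_value'] = df1.apply(sveny_lookup, axis=1, yields=df2)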
My name is Nick and I am new to coding. I recently completed Codecademy's Analyze Financial Data with Python course. I've started working on some projects of my own, and I've run into a roadblock.
I'm importing stock index daily closing price data from the Federal Reserve API (FRED) using pandas-datareader:
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
from datetime import datetime
start = datetime(2020, 1, 1)
sp_data = web.DataReader('SP500', 'fred', start)
The dataframe sp_data is formatted like so:
SP500
DATE
2020-01-01 NaN
2020-01-02 3257.85
2020-01-03 3234.85
2020-01-06 3246.28
2020-01-07 3237.18
The problem with this dataframe is that on days when the markets are closed (weekends, holidays), those dates are completely omitted. You can see above that 2020-01-04 and 2020-01-05 are missing because they are weekends. I would like my dataframe to have all dates, even when the market is closed. On dates when the markets are closed, I would like the SP500 column to just have the most recent closing price. So on 2020-01-04 and 2020-01-05, the SP500 column would have 3234.85.
I've tried to create a new dataframe with every date I need:
date_list = pd.date_range(start, np.datetime64('today'))
df = pd.DataFrame(date_list)
df.columns =['date']
This creates:
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
I'm now trying to create an 'SP500' column in df by iterating through each row of sp_data: if the dates match, the closing value is assigned to that date in df. I will then use pd.DataFrame.ffill to fill the missing values. The lambda function I am using to create the new column is:
df['SP500'] = sp_data.apply(lambda row: row['SP500'] if row.index == df.date else 0, axis=1)
This returns:
ValueError: Lengths must match to compare
I know that the dataframes need to be the same length to compare them like this. I guess my question is: what is the best way to iterate over each row of a pandas dataframe to assign the proper values to the correct dates in the new dataframe? Are there easier ways to accomplish the same end goal than the approach I'm taking?
Any and all suggestions are welcome!
This is what indexes are used for: if there is a match between the index of the new empty dataframe (df) and the index of the dataframe with the data (sp_data), the value is copied into the new dataframe; otherwise NaN is assigned. Your df should be an empty dataframe with date_list as its index, and after that you just assign the new column:
date_list = pd.date_range(start, np.datetime64('today'))
df = pd.DataFrame(index=date_list)
df['SP500'] = sp_data['SP500']
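To get the behaviour described in the question, where closed-market days carry the most recent close, a forward fill can then be applied (a minimal sketch):
# forward-fill so weekends and holidays repeat the latest closing price,
# e.g. 2020-01-04 and 2020-01-05 both show 3234.85
df['SP500'] = df['SP500'].ffill()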
I have a pandas dataframe which contains time series data, so the index of the dataframe is of type datetime64 at weekly intervals, each date occurs on the Monday of each calendar week.
There are only entries in the dataframe when an order was recorded, so if there was no order placed, there isn't a corresponding record in the dataframe. I would like to "pad" this dataframe so that any weeks in a given date range are included in the dataframe and a corresponding zero quantity is entered.
I have managed to get this working by creating a dummy dataframe that contains a zero-quantity entry for every week I want, merging the two dataframes, and dropping the dummy column. This results in a third, padded dataframe.
I don't feel this is a great solution to the problem, and being new to pandas I wanted to know if there is a more idiomatic or Pythonic way to achieve this, preferably without having to create a dummy dataframe and then merge.
The code I used is below to get my current solution:
# Create the dummy product
# Week holds the week date of the order; I want to set this as the index later
group_by_product_name = df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
first_date = group_by_product_name.index[0][0]   # first week in the entire dataset (level 0 of the MultiIndex)
last_date = group_by_product_name.index[-1][0]   # last week in the dataset
bdates = pd.bdate_range(start=first_date, end=last_date, freq='W-MON')
qty = np.zeros(bdates.shape)
dummy_product = {'Week':bdates, 'DummyQty':qty}
df_dummy_product = pd.DataFrame(dummy_product)
df_dummy_product.set_index('Week', inplace=True)
group_by_product_name = df_all_products.groupby('Week')['Qty'].sum()
df_temp = pd.concat([df_dummy_product, group_by_product_name], axis=1, join='outer')
df_temp.fillna(0, inplace=True)
df_temp.drop(columns=['DummyQty'], axis=1, inplace=True)
The problem with this approach is that sometimes (I don't know why) the indexes don't match correctly; I think the dtype of the index on one of the dataframes goes to object instead of staying datetime64. So I am sure there is a better way to solve this problem than my current solution.
EDIT
Here is a sample dataframe with "missing entries"
df1 = pd.DataFrame({'Week':['2018-05-28', '2018-06-04',
'2018-06-11', '2018-06-25'], 'Qty':[100, 200, 300, 500]})
df1.set_index('Week', inplace=True)
df1.head()
Here is an example of the padded dataframe that contains the additional missing dates between the date range
df_zero = pd.DataFrame({'Week':['2018-05-21', '2018-05-28', '2018-06-04',
'2018-06-11', '2018-06-18', '2018-06-25', '2018-07-02'], 'Dummy Qty':[0, 0, 0, 0, 0, 0, 0]})
df_zero.set_index('Week', inplace=True)
df_zero.head()
And this is the intended outcome after concatenating the two dataframes
df_padded = pd.concat([df_zero, df1], axis=1, join='outer')
df_padded.fillna(0, inplace=True)
df_padded.drop(columns=['Dummy Qty'], inplace=True)
df_padded.head(6)
Note that the missing entries are added before and between other entries where necessary in the final dataframe.
Edit 2:
As requested here is an example of what the initial product dataframe would look like:
df_all_products = pd.DataFrame({'Week':['2018-05-21', '2018-05-28', '2018-05-21', '2018-06-11', '2018-06-18',
'2018-06-25', '2018-07-02'],
'Product Name':['A', 'A', 'B', 'A', 'B', 'A', 'A'],
'Qty':[100, 200, 300, 400, 500, 600, 700]})
OK, given your original data you can achieve the expected result by using pivot and then resample to fill in any missing weeks, like the following:
# make sure Week is a datetime so that resample can work on the pivoted index
df_all_products['Week'] = pd.to_datetime(df_all_products['Week'])

results = df_all_products.groupby(
    ['Week', 'Product Name']
)['Qty'].sum().reset_index().pivot(
    index='Week', columns='Product Name', values='Qty'
).resample('W-MON').asfreq().fillna(0)
Output results:
Product Name A B
Week
2018-05-21 100.0 300.0
2018-05-28 200.0 0.0
2018-06-04 0.0 0.0
2018-06-11 400.0 0.0
2018-06-18 0.0 500.0
2018-06-25 600.0 0.0
2018-07-02 700.0 0.0
So if you want to get the df for Product Name A, you can do results['A'].
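If you want to visualise the padded result, a short sketch (assuming matplotlib is available) plots one line per product:
import matplotlib.pyplot as plt

# each product column becomes its own line; the padded zero-quantity weeks are included
results.plot(marker='o', title='Weekly quantity by product')
plt.show()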
Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data, index)
# I want rounded date-times
desired_index = pandas.date_range("2018-10-08", periods=10, freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following: concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
A simple way to do this is to temporarily reindex to the union of the existing and desired indices, interpolate based on time, and then keep only the desired timestamps.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
.assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
+ pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
Value=np.random.randn(337)
)
.set_index(['SampleTime'])
)
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
.reindex(df.index.union(desired_index))
.interpolate(method='time')
.reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it falls before the first point in the original index, so interpolation cannot fill it; there are a number of ways to deal with this (a backward fill, for example).
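For instance, a backward fill can be chained onto the end (a minimal sketch; it simply copies the first interpolated value back to the leading timestamp):
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
 .bfill()  # fill the leading NaN from the first available value
)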
Here are my methods. Method 1:
# nyse_trading_dates supplies the target (trading-calendar) index; prices is the original series
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
# copy the original values over for any dates present in both indexes
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df = df.interpolate(method='linear')
Method 2:
# data holds the original series, idx2 is the desired target index
prices = data.loc[~data.index.duplicated(keep='last')]  # drop duplicate timestamps, keeping the last
idx1 = pd.to_datetime(prices.index, errors='coerce')
merged = idx1.union(idx2)  # combine the original and desired indexes
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df