Combining a list of DataFrames into a new DataFrame - Python

I want to create a DataFrame that combines historical price data for various ETFs (XLF, XLP, XLU, etc.) downloaded from Yahoo Finance. I created a DataFrame for each ETF, holding just the date and the adjusted close, and now I want to combine them all into one DataFrame. Any idea why this doesn't work?
I've saved each CSV as a DataFrame that only has Date and Adj Close; I then want to combine them using Date as the index.
# Here's what I've tried.
import os
import pandas as pd
from functools import reduce
XLF_df = pd.read_csv("datasets/XLF.csv").set_index("Date")["Adj Close"]
XRT_df = pd.read_csv("datasets/XRT.csv").set_index("Date")["Adj Close"]
XLP_df = pd.read_csv("datasets/XLP.csv").set_index("Date")["Adj Close"]
XLY_df = pd.read_csv("datasets/XLY.csv").set_index("Date")["Adj Close"]
XLV_df = pd.read_csv("datasets/XLV.csv").set_index("Date")["Adj Close"]
dfList = [XLF_df, XRT_df, XLP_df, XLY_df, XLV_df]
df = reduce(lambda df1,df2: pd.merge(df1,df2,on="Date"), dfList)
df.head()
I get an empty dataframe! And if I say on="id" then I get a KeyError.
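One likely culprit: after .set_index("Date")["Adj Close"], each entry in dfList is a Series indexed by Date rather than a DataFrame with a Date column, and all five Series share the same "Adj Close" name, which makes pd.merge fragile here. Concatenating on the index is simpler and avoids merge entirely; a minimal sketch (the rename is only so each column gets the ticker's name instead of five identical "Adj Close" labels):
import pandas as pd
tickers = ["XLF", "XRT", "XLP", "XLY", "XLV"]
# each item is an Adj Close Series indexed by Date, as in the question
series = [pd.read_csv(f"datasets/{t}.csv").set_index("Date")["Adj Close"].rename(t)
          for t in tickers]
df = pd.concat(series, axis=1)  # aligns on the shared Date index, one column per ticker
df.head()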

Related

How do I save each iteration of a for loop in one big DataFrame - Python

I want to gather all the historical prices of each stock in the S&P500 in Python. I'm using a package from IEX Cloud which gives me the historical prices of an individual stock. I want a for loop to run through a list of the tickers/symbols from the stock index so that I get all the data in a single DataFrame.
This is the code that produces a DataFrame - in this example I've chosen AAPL for a two year period:
import pyEX as p
sym = 'AAPL'
stock_list = stocks['Ticker'].tolist()
c = p.Client(api_token='TOKEN', version='stable')
timeframe = '2y'
df = c.chartDF(symbol=sym, timeframe=timeframe)[['close']]
df
This DataFrame contains the date and the daily closing price. Now do any of you have any ideas how to loop through my list of tickers, so that I get a comprehensive DataFrame of all the historical prices?
Thank you.
Create an empty list to append to, then concat everything together after you iterate over all the tickers:
import pyEX as p
import pandas as pd
stock_list = stocks['Ticker'].tolist()
c = p.Client(api_token='TOKEN', version='stable')
timeframe = '2y'
dfs = []  # create an empty list
for sym in stock_list:  # iterate over your ticker list
    df = c.chartDF(symbol=sym, timeframe=timeframe)[['close']]  # create your frame
    dfs.append(df)  # append frame to list
final_df = pd.concat(dfs)  # concat all your frames together into one
Update with Try-Except
import pyEX as p
import pandas as pd
stock_list = stocks['Ticker'].tolist()
c = p.Client(api_token='TOKEN', version='stable')
timeframe = '2y'
dfs = []  # create an empty list
for sym in stock_list:  # iterate over your ticker list
    try:
        df = c.chartDF(symbol=sym, timeframe=timeframe)[['close']]  # create your frame
        dfs.append(df)  # append frame to list
    except KeyError:
        print(f'KeyError for {sym}')
final_df = pd.concat(dfs)  # concat all your frames together into one
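One optional refinement, as a sketch: assuming the frames that come back from chartDF are indexed by date only, after the concat you can no longer tell which rows belong to which ticker. Tagging each frame before appending preserves that (the assign call and the 'symbol' column name are my additions, not part of pyEX):
for sym in stock_list:
    try:
        df = c.chartDF(symbol=sym, timeframe=timeframe)[['close']]
        dfs.append(df.assign(symbol=sym))  # remember which ticker these rows came from
    except KeyError:
        print(f'KeyError for {sym}')
final_df = pd.concat(dfs)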

How to use pandas, Python and TA-Lib to build DataFrames from many CSVs in order to calculate technical indicators

I can use the code below to build a frame from a single file (which has ticker, date, OHLC and volume) and then use TA-Lib to build the technical indicators. Works fine. I can also use glob to combine thousands of CSVs into a blob, import it into SQL, and then run a Python script against SQL (with other plugins) to parse the values and build the technical indicator values.
However, the moving averages are not being calculated for each symbol individually; they are calculated straight across all of the symbols in the SQL table (or CSV), which messes everything up. In other words, say the first 10 rows are all ticker A and on row 11 the ticker changes to B: TA-Lib treats row 11 as if it were still part of ticker A's data, when it should start over with each unique ticker.
If I can find a way to build an individual DataFrame for each CSV file, run the calculations, and then output thousands of newly created CSVs (one for each unique ticker), that will solve the problem. I can also avoid SQL altogether. Thanks in advance.
import pandas as pd
import talib
csv_file = "C:\\Users\\Bob\\IBM.csv"
df = pd.read_csv(csv_file)
Symbol = df['Symbol']
Date = df['Date']
Open = df['Open']
High = df['High']
Low = df['Low']
Close = df['Close']
Volume = df['Volume']
from talib import SMA,T3
SMA = SMA(Close, timeperiod=5)
print(SMA)
T3 = T3(Close, timeperiod=5, vfactor=0)
print(T3)
total_df = pd.concat([Symbol, Date, Open, High, Low, Close, Volume, SMA, T3])
print(total_df)
total_df.to_csv("test.csv")
************** here is my latest code below **************
import pandas as pd
import talib
import glob, os
from talib import SMA, T3
csv_file_list = glob.glob(r"H:\EOD_DATA_RECENT\TEST\\*.csv")
print(csv_file_list)
for csv_file in csv_file_list:
    df = pd.read_csv(csv_file)
    print(df)
    df['SMA'] = SMA(df['AdjustedClose'], timeperiod=5)
    # print(df['SMA'])
    df['T3'] = T3(df['AdjustedClose'], timeperiod=5, vfactor=0)
    # print(df['T3'])
    print(df)
    df.to_csv("test.csv")
There are two ways, I believe, you can do this. If you want separate files, you just read in your csv files in a loop, perform the action, and write each file to disk. Also, I'm making some assumptions here...
from talib import SMA, T3  # move this up to the top with other modules
csv_file_list = [however you get list of files]
for csv_file in csv_file_list:
    df = pd.read_csv(csv_file)
    # I'm not sure why you are reading these into series; I think you can use the columns directly
    # Symbol = df['Symbol']
    # Date = df['Date']
    # Open = df['Open']
    # High = df['High']
    # Low = df['Low']
    # Close = df['Close']
    # Volume = df['Volume']
    df['SMA'] = SMA(df['Close'], timeperiod=5)  # create column in df automatically
    print(df['SMA'])
    df['T3'] = T3(df['Close'], timeperiod=5, vfactor=0)  # create column in df automatically
    print(df['T3'])
    # df is already built from above, so the next line isn't needed
    # total_df = pd.concat([Symbol, Date, Open, High, Low, Close, Volume, SMA, T3])
    print(df)
    Symbol = df.Symbol[0]
    fn = Symbol + '_indicators.csv'
    df.to_csv(fn)
The second way would be to read all the csv files into dfs and concat. You can save this one df to a 'master' csv if you will, then use groupby to get the SMA and T3 by ticker (sketched below). If you have thousands of tickers this might be too cumbersome, but it does alleviate having to read thousands of files. I do both methods depending on what type of analysis I'm running. A df of 500 tickers is manageable from a compute-time perspective as long as what you are doing is coded correctly. Otherwise, I look at one ticker at a time, then go to the larger df.
Try the first reworked suggested code and see what you come up with.
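For completeness, a minimal sketch of that second, groupby-based approach, assuming each CSV has Symbol and Close columns like your single-file example and reusing the glob pattern from your latest code (TA-Lib wants float64 input, hence the to_numpy calls):
import glob
import pandas as pd
from talib import SMA, T3
csv_file_list = glob.glob(r"H:\EOD_DATA_RECENT\TEST\*.csv")
# build one 'master' frame; each file's rows keep their own Symbol value
master = pd.concat((pd.read_csv(f) for f in csv_file_list), ignore_index=True)
# compute the indicators per ticker so each symbol's history starts over cleanly
master['SMA'] = master.groupby('Symbol')['Close'].transform(
    lambda s: SMA(s.to_numpy(dtype=float), timeperiod=5))
master['T3'] = master.groupby('Symbol')['Close'].transform(
    lambda s: T3(s.to_numpy(dtype=float), timeperiod=5, vfactor=0))
# one output file per ticker, matching the first approach
for sym, grp in master.groupby('Symbol'):
    grp.to_csv(sym + '_indicators.csv', index=False)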

Filter data from a created list

I am working on a Covid data set from GitHub, and I would like to filter the data set down to the countries that appear in this EU_members list.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
df = df[df.continent == 'Europe']
# From here I want to just pick those countries that appear in the following list:
EU_members= ['Austria','Italy','Belgium''Latvia','Bulgaria','Lithuania','Croatia','Luxembourg','Cyprus','Malta','Czechia','Netherlands','Denmark','Poland','Estonia',
'Portugal','Finland','Romania','France','Slovakia','Germany','Slovenia','Greece','Spain','Hungary','Sweden','Ireland']
# I have tried something like this but it is not what I expected:
df.location.str.find('EU_members')
You can use .isin(). One thing to watch: your list is missing a comma between 'Belgium' and 'Latvia', so Python silently concatenates them into the single string 'BelgiumLatvia' and both countries get dropped. With that fixed:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
EU_members = ['Austria','Italy','Belgium','Latvia','Bulgaria','Lithuania','Croatia','Luxembourg','Cyprus','Malta','Czechia','Netherlands','Denmark','Poland','Estonia',
              'Portugal','Finland','Romania','France','Slovakia','Germany','Slovenia','Greece','Spain','Hungary','Sweden','Ireland']
df_out = df[df['location'].isin(EU_members)]
df_out.to_csv('data.csv')
This creates data.csv containing only the rows for the EU member countries.

Excluding rows from a DataFrame count if the value is 0

I have the code below:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
# Preview the first 5 lines of the loaded data
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
per_day = df.groupby(df['DateTime'].dt.date)['Burned'].count().reset_index(name='Trx')
The last line creates a DataFrame called per_day which groups all the transactions in df by day. Many of the transactions in df have 'Burned' = 0. My current code counts the total number of transactions, but I want to exclude the transactions in which 'Burned' = 0.
Also, how can I consolidate these 3 lines of code? I don't even want to create per_day_burned. How do I just make this its own column in per_day without doing it the way I did it?
per_day = dg.groupby(dg['DateTime'].dt.date)['Burned'].count().reset_index(name='Trx')
per_day_burned = dg.groupby(dg['DateTime'].dt.date)['Burned'].sum().reset_index(name='Burned')
per_day['Burned'] = per_day_burned['Burned']
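One way to do both at once, as a sketch (assuming dg is the same frame as df above): filter the zero rows out first, then compute the count and the sum in a single groupby with named aggregations (pandas 0.25+):
burned = df[df['Burned'] != 0]  # drop the Burned == 0 transactions up front
per_day = (burned.groupby(burned['DateTime'].dt.date)['Burned']
           .agg(Trx='count', Burned='sum')
           .reset_index())
Because the filter runs before the groupby, Trx counts only the non-zero transactions, and the daily sum lands in the same frame without ever creating per_day_burned.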

Saving with to_csv reads just the columns, not the rows

I'm stuck reading all the rows of a CSV file and saving them into a new CSV file (I'm using pandas 0.17.1).
I have a list of tickers in a CSV file, one per column, like this:
Column A: AAPL / Column B: TSLA / Column C: EXPD... and so on.
Now I have to add 3000 new tickers to this list, so I changed the orientation of the CSV, putting every ticker into its own row of the first column, like this:
Column A
AAPL
TSLA
EXPD
...and so on.
The issue is: when I save the document into a csv file, it reads only the first row, and nothing else.
In my example, if I have "AAPL" on the first row, I obtain a csv file that has only the data for AAPL.
This is my code:
symbols_list = pd.read_csv('/home/andrea/htrade/python/titoli_rows.csv')
symbols = []
for ticker in symbols_list:
    r = DataReader(ticker, "yahoo",
                   start=datetime.datetime.now() - BDay(20),
                   end=datetime.datetime.now())
    # add a symbol column
    r['Symbol'] = ticker
    symbols.append(r)
# concatenate all the dfs
df = pd.concat(symbols)
# define cell with the columns that I need
cell = df[['Symbol', 'Open', 'High', 'Low', 'Adj Close', 'Volume']]
cell.reset_index().sort_values(['Symbol', 'Date'], ascending=[1, 0]).set_index('Symbol').to_csv('/home/andrea/Dropbox/HT/stock20.csv', date_format='%d/%m/%Y')
Why is it that if I put a ticker in each column, the csv contains all the data for every ticker, but if I put a ticker in each row, it reads just the first one?
I already checked whether read_csv was reading the csv correctly, and it is, so I don't understand why it isn't processing them all.
Your loop iterates over the DataFrame itself, which yields column labels rather than row values; convert the column to a list first. I just ran the below, and with a short list of symbols imported via read_csv it seemed to work fine:
from datetime import datetime
import pandas as pd
import pandas.io.data as web
from pandas.tseries.offsets import BDay

df = pd.read_csv(path_to_file).loc[:, ['symbols']].dropna().squeeze()
symbols = []
for ticker in df.tolist():
    r = web.DataReader(ticker, "yahoo",
                       start=datetime.now() - BDay(20),
                       end=datetime.now())
    r['Symbol'] = ticker
    symbols.append(r)
df = pd.concat(symbols).drop('Close', axis=1)
cell = df[['Symbol', 'Open', 'High', 'Low', 'Adj Close', 'Volume']]
cell.reset_index().sort_values(['Symbol', 'Date'], ascending=[1, 0]).set_index('Symbol').to_csv(path_to_file, date_format='%d/%m/%Y')
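To see that failure mode in isolation: since titoli_rows.csv apparently has no header row, read_csv promotes the first ticker to the header, and iterating the frame only ever yields that one label. A tiny self-contained illustration (the three-ticker string is just sample data):
import io
import pandas as pd
data = 'AAPL\nTSLA\nEXPD\n'
symbols_list = pd.read_csv(io.StringIO(data))
print([t for t in symbols_list])  # ['AAPL'] -- only the promoted header
print(pd.read_csv(io.StringIO(data), header=None)[0].tolist())  # ['AAPL', 'TSLA', 'EXPD']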
