I am going to Yahoo Finance and pulling data for German stocks, then writing each one to an individual CSV file.
I then want to read them back into a single dataframe.
# Code to get stocks
import datetime

import pandas_datareader as pdr

tickers = ["MUV2.DE", "DTE.DE", "VNA.DE", "ALV.DE", "BAYN.DE", "EOAN.DE", "RWE.DE", "CON.DE", "HEN3.DE", "BAS.DE", "FME.DE", "WDI.DE", "IFX.DE", "SAP.DE", "BMW.DE", "DPW.DE", "DB1.DE", "DAI.DE", "BEI.DE", "SIE.DE", "ADS.DE", "DBK.DE", "FRE.DE", "HEI.DE", "MRK.DE", "LHA.DE", "VOW3.DE", "1COV.DE", "LIN.DE", "TKA.DE"]
start = datetime.datetime(2012, 5, 31)
end = datetime.datetime(2020, 3, 1)

# Go to Yahoo, pull data for each ticker, and write it to a CSV
for ticker in tickers:
    df = pdr.get_data_yahoo(ticker, start=start, end=end)
    df.to_csv(f"{ticker}.csv")
Once the above has been done, I read in a CSV of all the ticker names and then concatenate the individual CSV files into one dataframe. Well, that's what I want to do, at least.
import pandas as pd

tickers = pd.read_csv('C:/Users/XXX/Desktop/Momentum/tickers.csv', header=None)[1].tolist()

stocks = pd.concat(
    [pd.read_csv(f"C:/Users/XXX/Desktop/Momentum/{ticker}.csv",
                 index_col='Date', parse_dates=True)['Adj Close']
         .rename(ticker.replace(".DE", ""))
     for ticker in tickers],
    axis=1,
    sort=True)

# drop any duplicated ticker columns
stocks = stocks.loc[:, ~stocks.columns.duplicated()]
Now, I've had this code working before, but with other stock tickers. Here, my Jupyter notebook just spins indefinitely.
I was wondering what the issue is, and whether it's because the CSV file names look like ADS.DE.csv and the first . is causing problems.
I solved this problem with the following code.
import os

dirname = 'dirname'  # directory containing the files to rename
for filename in os.listdir(dirname):
    os.rename(os.path.join(dirname, filename),
              os.path.join(dirname, filename.replace('_intsect_d', '')))
I am parsing a large Excel data file into another one, but the headers are very abnormal. I tried read_excel's skiprows and that did not work. I also tried specifying the header with
df = pd.read_excel(user_input, header=[1, 3], sheet_name='PN Projection'), but then I get the error "ValueError: cannot join with no overlapping index names." To get around this, I tried to name the columns by location, and that did not work either.
When I run the code shown below everything works fine, but past column "U" the header titles come out as "Unnamed: 1, 2, ...". I understand this is because pandas is treating the first row (which is empty there) as the header, but how do I fix this? Is there a way to preserve the headers without manually typing in the name for each column? Any and all help is appreciated, thank you!
small section of the excel file header
the code I am trying to run
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'item status' and 'xcvr' in 'description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top two rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
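If the real column names actually span two rows, read_excel can also take a list for header and return MultiIndex columns, which you then flatten into single strings. A minimal sketch of the flattening step, with an invented frame standing in for what header=[0, 1] would give you (the column names here are made up for illustration):

```python
import pandas as pd

# Two header rows read with header=[0, 1] come back as a MultiIndex;
# mimic that here without touching a real workbook.
cols = pd.MultiIndex.from_tuples([("Part", "Number"), ("Qty", "On Hand")])
df = pd.DataFrame([["A100", 3], ["B200", 7]], columns=cols)

# Flatten the two header rows into single strings like "Part Number"
df.columns = [" ".join(str(p) for p in col).strip() for col in df.columns]
print(df.columns.tolist())
```

The same list comprehension works on the frame read_excel returns, so you keep both header rows' information without retyping anything.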
I can use the code below to build a frame from a single file (which has ticker, date, OHLC, and volume) and then use TA-Lib to build the technical indicators. That works fine. I can also use glob to combine thousands of CSVs into one blob, import it into SQL, and then run a Python script against SQL (with other plugins) to parse the values and build the technical indicator values. However, the moving averages are not being calculated for each symbol individually; they are calculated across all of the symbols in the SQL table (or CSV), which messes everything up. In other words, if the first 10 rows are all ticker A and the ticker changes to B on row 11, TA-Lib treats row 11 as if it were still part of ticker A's data. The calculation should start over with each unique ticker.
If I can find a way to build an individual dataframe for each CSV file, run the calculations, and then output thousands of newly created CSV files (one for each unique ticker), that will solve the problem. I can also avoid SQL altogether. Thanks in advance.
import pandas as pd
from talib import SMA, T3

csv_file = "C:\\Users\\Bob\\IBM.csv"
df = pd.read_csv(csv_file)

Symbol = df['Symbol']
Date = df['Date']
Open = df['Open']
High = df['High']
Low = df['Low']
Close = df['Close']
Volume = df['Volume']

# wrap the TA-Lib results in named Series so they line up on concat
sma = pd.Series(SMA(Close, timeperiod=5), name='SMA')
print(sma)
t3 = pd.Series(T3(Close, timeperiod=5, vfactor=0), name='T3')
print(t3)

# axis=1 puts the series side by side as columns instead of stacking them
total_df = pd.concat([Symbol, Date, Open, High, Low, Close, Volume, sma, t3], axis=1)
print(total_df)
total_df.to_csv("test.csv")
Here is my latest code below:
import glob

import pandas as pd
from talib import SMA, T3

csv_file_list = glob.glob(r"H:\EOD_DATA_RECENT\TEST\*.csv")
print(csv_file_list)

for csv_file in csv_file_list:
    df = pd.read_csv(csv_file)
    print(df)
    df['SMA'] = SMA(df['AdjustedClose'], timeperiod=5)
    # print(df['SMA'])
    df['T3'] = T3(df['AdjustedClose'], timeperiod=5, vfactor=0)
    # print(df['T3'])
    print(df)
    df.to_csv("test.csv")  # note: this overwrites the same file on every pass
There are two ways, I believe, you can do this. If you want separate files, you just read in your CSV files in a loop, perform the calculation, and write each result to disk. Also, I'm making some assumptions here...
from talib import SMA, T3  # move this up to the top with your other imports

csv_file_list = [however you get list of files]

for csv_file in csv_file_list:
    df = pd.read_csv(csv_file)
    # I'm not sure why you were reading these into series; you can use the columns directly
    # Symbol = df['Symbol']
    # Date = df['Date']
    # Open = df['Open']
    # High = df['High']
    # Low = df['Low']
    # Close = df['Close']
    # Volume = df['Volume']
    df['SMA'] = SMA(df['Close'], timeperiod=5)           # creates the column in df automatically
    print(df['SMA'])
    df['T3'] = T3(df['Close'], timeperiod=5, vfactor=0)  # creates the column in df automatically
    print(df['T3'])
    # df is already built from above, so the concat isn't needed
    # total_df = pd.concat([Symbol, Date, Open, High, Low, Close, Volume, SMA, T3])
    print(df)
    symbol = df.Symbol[0]
    fn = symbol + '_indicators.csv'
    df.to_csv(fn)
The second way would be to read all the CSV files into dataframes and concat them. You can save that one dataframe as a 'master' CSV if you like, then use groupby to get the SMA and T3 by ticker. If you have thousands of tickers this might be too cumbersome, but it does avoid having to read thousands of files. I use both methods depending on what type of analysis I'm running. A dataframe of 500 tickers is manageable from a compute-time perspective as long as what you are doing is coded correctly; otherwise, I look at one ticker at a time, then go to the larger dataframe.
Try the first reworked code above and see what you come up with.
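A minimal sketch of that second approach, computing a per-ticker moving average on a combined frame with groupby (the column names and 5-period window are assumptions, and a plain pandas rolling mean stands in for TA-Lib's SMA so the example is self-contained):

```python
import pandas as pd

# a combined frame as it might look after concatenating all the per-ticker CSVs
df = pd.DataFrame({
    'Symbol': ['A'] * 6 + ['B'] * 6,
    'Close':  [10, 11, 12, 13, 14, 15, 100, 101, 102, 103, 104, 105],
})

# groupby restarts the rolling window at each ticker, so B's first rows
# are never averaged together with A's last rows
df['SMA5'] = df.groupby('Symbol')['Close'].transform(
    lambda s: s.rolling(5).mean())

print(df)
```

Each ticker's first four SMA5 values come out NaN, exactly as they would if each ticker lived in its own file.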
Thanks in advance! I have been struggling for a few days, so it's time for me to ask a question. I have a program that pulls information for three stocks using the yfinance module, from a ticker list in a txt file. I can get the intended information into a dataframe for each ticker in the list using a for loop. I then want to save the information for each ticker on its own sheet in an Excel workbook, with the sheet name being the ticker. As of now I create three distinct dataframes, but the Excel output has only one tab, containing the information for the last requested ticker (MSFT). I think I may need some kind of append process to create a new tab for each dataframe. Thanks for any suggestions.
Code
import platform
import csv

import pandas as pd
import yfinance as yf

# check versions
print('Python Version: ' + platform.python_version())
print('YFinance Version: ' + yf.__version__)

# load txt of tickers into a list; contains three tickers
tickerlist = []
with open('tickers.txt') as inputfile:
    for row in csv.reader(inputfile):
        tickerlist.append(row)

# iterate through ticker txt file
for i in range(len(tickerlist)):
    tickersymbol = tickerlist[i]
    stringticker = str(tickersymbol)
    stringticker = stringticker.replace("[", "")
    stringticker = stringticker.replace("]", "")
    stringticker = stringticker.replace("'", "")

    # set data to a retrievable variable
    tickerdata = yf.Ticker(stringticker)
    tickerinfo = tickerdata.info

    # data items requested
    investment = tickerinfo['shortName']
    country = tickerinfo['country']

    # create dataframe from lists
    dfoverview = pd.DataFrame({'Label': ['Company', 'Country'],
                               'Value': [investment, country]})
    print(dfoverview)
    print('-----------------------------------------------------------------')

    # export data to each tab (PROBLEM AREA: this rewrites the whole file each pass)
    dfoverview.to_excel('output.xlsx', sheet_name=stringticker)
Output
Python Version: 3.7.7
YFinance Version: 0.1.54
Company Walmart Inc.
Country United States
Company Tesla, Inc.
Country United States
Company Microsoft Corporation
Country United States
Process finished with exit code 0
If all of your ticker information is in a single data frame, Pandas groupby() method works well for you here (if I'm understanding your problem correctly). This is pseudo, but try something like this instead:
import pandas as pd
# df here represents your single data frame with all your ticker info
# column_value is the column you choose to group by
# this column_value will also be used to dynamically create your sheet names
ticker_group = df.groupby('column_value')

# create the writer obj
with pd.ExcelWriter('output.xlsx') as writer:
    # key = the column_value string, data = the rows belonging to that key
    for key, data in ticker_group:
        data.to_excel(writer, sheet_name=key, index=False)
I'm stuck reading all the rows of a CSV file and saving the data into CSV files (I'm using pandas 0.17.1).
I have a list of tickers in a CSV file, one per column, like this:
Column A: AAPL / Column B:TSLA / Column C: EXPD... and so on.
Now, I have to add 3000 new tickers to this list, so I changed the orientation of the CSV, putting every ticker into its own row in the first column, like this:
Column A
AAPL
TSLA
EXPD
...and so on.
The issue is: when I save the document to a CSV file, it reads only the first row and nothing else.
In my example, if the first row is "AAPL", I obtain a CSV file that has only the data for AAPL.
This is my code:
import datetime

import pandas as pd
from pandas.io.data import DataReader
from pandas.tseries.offsets import BDay

symbols_list = pd.read_csv('/home/andrea/htrade/python/titoli_rows.csv')

symbols = []
for ticker in symbols_list:
    r = DataReader(ticker, "yahoo",
                   start=datetime.datetime.now() - BDay(20),
                   end=datetime.datetime.now())
    # add a symbol column
    r['Symbol'] = ticker
    symbols.append(r)

# concatenate all the dfs
df = pd.concat(symbols)

# keep only the columns that I need
cell = df[['Symbol', 'Open', 'High', 'Low', 'Adj Close', 'Volume']]
cell.reset_index().sort_values(['Symbol', 'Date'], ascending=[1, 0]).set_index('Symbol') \
    .to_csv('/home/andrea/Dropbox/HT/stock20.csv', date_format='%d/%m/%Y')
Why is it that when I paste a ticker into each column, the CSV contains the data for every ticker, but when I paste a ticker into each row, it reads just the first row?
I already checked that read_csv is reading the CSV correctly, and it is, so I don't understand why it isn't processing them all.
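The difference comes from the for loop: iterating over a DataFrame yields its column labels, not its row values. With one ticker per column, the loop visits every ticker; with one ticker per row, the loop visits only the single column header. A small self-contained sketch of the difference (the column name 'Column A' is taken from the example above):

```python
import pandas as pd

# column-oriented: each ticker is a column header
wide = pd.DataFrame(columns=['AAPL', 'TSLA', 'EXPD'])
print([t for t in wide])  # iteration yields every column label

# row-oriented: tickers are values in one column named 'Column A'
tall = pd.DataFrame({'Column A': ['AAPL', 'TSLA', 'EXPD']})
print([t for t in tall])          # just the single column label
print(tall['Column A'].tolist())  # the values you actually want
```

So for the row-oriented file, loop over `symbols_list['Column A'].tolist()` (or squeeze the frame to a Series) instead of the frame itself.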
I just ran the below and with a short list of symbols imported via read_csv it seemed to work fine:
from datetime import datetime

import pandas as pd
import pandas.io.data as web
from pandas.tseries.offsets import BDay

df = pd.read_csv(path_to_file).loc[:, ['symbols']].dropna().squeeze()

symbols = []
for ticker in df.tolist():
    r = web.DataReader(ticker, "yahoo",
                       start=datetime.now() - BDay(20),
                       end=datetime.now())
    r['Symbol'] = ticker
    symbols.append(r)

df = pd.concat(symbols).drop('Close', axis=1)
cell = df[['Symbol', 'Open', 'High', 'Low', 'Adj Close', 'Volume']]
cell.reset_index().sort_values(['Symbol', 'Date'], ascending=[1, 0]).set_index('Symbol') \
    .to_csv(path_to_file, date_format='%d/%m/%Y')
I have the following code, which pulls daily data for a number of stocks I have stored in a worksheet. What I was hoping to accomplish was to have the daily data returned and stored in another worksheet.
I am struggling to write code which accomplishes this task. Currently I am able to pull the data for each individual stock, but I have no way of storing it. Any help will be appreciated. For the sake of testing I only tried to store Open and Close; ideally I would like all the parameters from Yahoo Finance to be stored.
import numpy as np
import pandas as pd
import xlsxwriter

df = pd.read_csv('Stock Companies Modified.csv', sep=',', header=0)
df.columns = ['StockSymbol', 'CompanyName', 'ClosingPrice', 'MarketCap', 'IPOYear', 'Sector', 'Industry']

workbook = xlsxwriter.Workbook('New Workbook.xlsx')
worksheet = workbook.add_worksheet()

df = df.convert_objects(convert_numeric=True)
df.dtypes

from pandas.io.data import DataReader
from datetime import datetime

for x in df.StockSymbol:
    if len(x) <= 4:
        ClosingPrice = DataReader(x, 'yahoo', datetime(2015, 1, 1), datetime(2015, 7, 1))
        row = 0
        col = 0
        # This is the area where I am getting an error, and to be honest I don't know how to do it correctly
        for Open, Close in (ClosingPrice):
            worksheet.write(row, col, (ClosingPrice['Open']))
            worksheet.write(row, col + 1, (ClosingPrice['Close']))
            row += 1
        workbook.close()
        print x
    else:
        print("This is not working")
I've yet to find a clean way to append data to sheets with xlsxwriter, so typically I create a temporary dataframe with all of the values (plus the current sheet, if it already exists) and then overwrite. I would definitely prefer it if we could append to sheets as you attempted, but it doesn't seem possible.
import pandas as pd
from pandas.io.data import DataReader
from datetime import datetime

symbols = ['GOOG', 'AAPL']

try:
    df = pd.read_excel('NewFile.xlsx')
except:
    df = pd.DataFrame()

for symbol in symbols:
    ClosingPrice = DataReader(symbol, 'yahoo', datetime(2015, 1, 1), datetime(2015, 9, 1))
    ClosingPrice = ClosingPrice.reset_index()
    ClosingPrice['Symbol'] = symbol
    df = df.append(ClosingPrice)

writer = pd.ExcelWriter('NewFile.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()
If you were later appending to this same file, it would be ok:
df = pd.read_excel('NewFile.xlsx')
symbols = ['G']

for symbol in symbols:
    ClosingPrice = DataReader(symbol, 'yahoo', datetime(2015, 1, 1), datetime(2015, 9, 1))
    ClosingPrice = ClosingPrice.reset_index()
    ClosingPrice['Symbol'] = symbol
    df = df.append(ClosingPrice)

writer = pd.ExcelWriter('NewFile.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()
What is the error you are getting? Have you tried pandas' DataFrame.to_excel?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html
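A minimal to_excel round trip, which replaces the manual worksheet.write loop entirely (the file name and prices are invented for illustration, and it assumes an Excel engine such as openpyxl is installed):

```python
import pandas as pd

# invented frame standing in for the DataReader result
prices = pd.DataFrame({'Open': [10.0, 10.5], 'Close': [10.4, 10.9]},
                      index=pd.to_datetime(['2015-01-02', '2015-01-05']))

# one call writes the whole frame, dates included, no per-cell loops
prices.to_excel('prices.xlsx', sheet_name='WMT')

# read it back to confirm the data survived
back = pd.read_excel('prices.xlsx', sheet_name='WMT', index_col=0)
print(back['Close'].tolist())
```

From there you can write all the Yahoo columns at once instead of picking out Open and Close by hand.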