I recently jumped from VBA to Python to improve the reporting at my company, so I am fairly new to Python.
After consolidating various CSV files into one dataframe, my program applies data validation in the form of a dropdown list with xlsxwriter.
I was getting a corrupted file every time the process finished, and so the data validation was gone.
Playing with the limits, I found that it can only apply data validation to 71215 rows of data; beyond that the file becomes corrupt.
Because this only happens with large data, I can't provide an example dataframe, but it is read like this:
df = pd.read_csv(fichero, sep=";", dtype={
    "Dirección": "string",
    "Territorio": "string",
    "DNI": "string",
    "Agente": "string",
    "CentroACC": "string",
    "ModoACC": "string",
    "Fecha": "string",
    "Presentes": "string",
    "Contratados": "string",
    "Modo Planificación": "string",
})
Here is the code that applies the data validation. It runs fine and raises no errors, but the result is a corrupted xlsx file.
import pandas as pd
import xlsxwriter as xl
import numpy as np
import time
from diccionarios import diccionarioHabilidadesAgente as dh
def cargaListaHabilidades(df, Fecha):
    df = df.copy()
    df['Habilidades Vigentes'] = np.where((df['INICIO'] <= Fecha) & (Fecha <= df['FIN']), 1, 0)
    df = df.loc[df['Habilidades Vigentes'] == 1]
    return df['Campaña'].tolist()

def main(df, rutaficheros):
    inicio = time.perf_counter()
    dfHabilidades = pd.read_csv(rutaficheros + 'Habilidades.csv', sep=";")
    diccionarioHabilidades = dh(dfHabilidades)
    fin1 = time.perf_counter()
    print('diccionario habilidades generado en: ', fin1 - inicio, ' segundos.')
    del(dfHabilidades)
    fichero = rutaficheros + 'Consolidado_ACC.xlsx'
    writer = pd.ExcelWriter(fichero, engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Descarga ACC')
    workbook = writer.book
    worksheet = writer.sheets['Descarga ACC']
    fin2 = time.perf_counter()
    print('Excel generado en: ', fin2 - fin1, ' segundos.')
    for fila in range(0, 71214):  # len(df)):
        Agente = df.iloc[fila]['DNI']
        Fecha = df.iloc[fila]['Fecha']
        if Agente in diccionarioHabilidades:
            if Fecha in diccionarioHabilidades[Agente]:
                lista = diccionarioHabilidades[Agente][Fecha]
                worksheet.data_validation('K' + str(fila + 2), {'validate': 'list',
                                                                'source': lista})
    fin3 = time.perf_counter()
    print('validación aplicada en: ', fin3 - fin2)
    workbook.close()
    print('guardado el fichero en: ', time.perf_counter() - fin3, ' segundos.')
    return
With 71214 entered manually in the loop the code works, but the total number of rows is around 100k.
I wanted to ask whether someone knows the reason for this before I fall back on the less pretty approach of splitting the dataframe in two and generating two separate files (a rough sketch of that fallback follows the edit below).
Edit:
Excel itself can handle far more data validations (I did the same thing in VBA).
Each data validation needs to be applied on its own because people have a different set of skills for each day of the month.
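For completeness, a rough sketch of the split-into-two-files fallback mentioned above. The split point, output file names and the with-block structure are my assumptions; it reuses df, rutaficheros and diccionarioHabilidades from the code above.

# Hedged sketch of the fallback mentioned in the question: split the dataframe
# in two halves and write two separate workbooks with the same per-row validation.
mitad = len(df) // 2
for n, parte in enumerate([df.iloc[:mitad], df.iloc[mitad:]], start=1):
    fichero = rutaficheros + f'Consolidado_ACC_{n}.xlsx'
    with pd.ExcelWriter(fichero, engine='xlsxwriter') as writer:
        parte.to_excel(writer, sheet_name='Descarga ACC')
        worksheet = writer.sheets['Descarga ACC']
        for fila in range(len(parte)):
            Agente = parte.iloc[fila]['DNI']
            Fecha = parte.iloc[fila]['Fecha']
            lista = diccionarioHabilidades.get(Agente, {}).get(Fecha)
            if lista:
                worksheet.data_validation('K' + str(fila + 2),
                                          {'validate': 'list', 'source': lista})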
I have a program that takes over 2 minutes to run, and I'm not sure what I can do to reduce the runtime. The bottleneck is almost certainly the for loop shown in the code below.
"CPU Start Address in decimal"
CPU_START_ADDR = 1644167168 #0x62000000
"Import necessary libraries"
import time
import pandas as pd
import numpy as np
"Creates variable for time to display runtime"
start_time = time.time()
"Create a blank dataframe with necessary column headers"
columnNames = ["Settings 1 Values", "Settings 2 Values", "CPU Address", "FPGA Address", "Delta (Decimal)", "Delta (Pos. Difference)", "Register Name", "R/W"]
output = pd.DataFrame(columns = columnNames)
"Fill values from settings files into output dataframe"
df1 = pd.read_csv("50MHzWholeFPGA.csv")
df2 = pd.read_csv("75MHzWholeFPGA.csv")
spec = pd.read_excel("Mozart FPGA Register Specification.xlsx", skiprows=[0,1,2,3])
output.loc[:, "Settings 1 Values"] = df1.iloc[:, 0]
output.loc[:, "Settings 2 Values"] = df2.iloc[:, 0]
output['Delta (Decimal)'] = output['Settings 2 Values'] - output['Settings 1 Values']
"For loop generates CPU Addresses for all values"
for index, row in output.iterrows():
output.loc[index, 'CPU Address'] = hex(CPU_START_ADDR + (2 * index))[2:]
settingXor = bin(int(output.loc[index, "Settings 1 Values"]) ^ int(output.loc[index, "Settings 2 Values"]))
output.loc[index, "Delta (Pos. Difference)"] = settingXor
I'm having trouble properly adding JSON from a websocket stream to a Pandas dataframe. In my code I've tried a few different ways to append the data to the dataframe, but it ends up completely mangled.
Looking at the data, I see 321 before each of the lines I want the data from, but I don't know how to access that data: I thought something like mv = check['321'] would access it, but it did not. The result variable is what each received message is assigned to, so I'm just trying to figure out how to get that into the dataframe.
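For reference, here is what one of those lines looks like after json.loads: it parses to a plain Python list, not a dict, so there is no '321' key to index; 321 is just the first element (the channel ID). A small illustration using one of the messages below (my own sketch, not part of the original code):

import json

# One trade message from the stream, parsed: a list, not a dict.
msg = json.loads('[321,[["37491.40000","0.00420457","1612471467.490327","b","l",""]],"trade","XBT/USD"]')
channel_id, trades, channel_name, pair = msg
print(channel_id)    # 321
print(trades[0][0])  # 37491.40000 (price of the first trade in the batch)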
Code:
import json, time
from websocket import create_connection
import pandas as pd

# start with empty dataframe
df = pd.DataFrame()

for i in range(3):
    try:
        ws = create_connection("wss://ws.kraken.com/")
    except Exception as error:
        print('Caught this error: ' + repr(error))
        time.sleep(3)
    else:
        break

ws.send(json.dumps({
    "event": "subscribe",
    #"event": "ping",
    "pair": ["BTC/USD"],
    #"subscription": {"name": "ticker"}
    #"subscription": {"name": "spread"}
    "subscription": {"name": "trade"}
    #"subscription": {"name": "book", "depth": 10}
    #"subscription": {"name": "ohlc", "interval": 5}
}))

csv_file = "kraken-test.csv"
timeout = time.time() + 60*1

# collect raw messages in a list
data = []
#while True:
while time.time() < timeout:
    try:
        result = ws.recv()
        converted = json.loads(result)
        check = json.dumps(result)
        #mv = converted['321']
        #data.append(pd.DataFrame.from_dict(pd.json_normalize(check)))
        #data.append(pd.DataFrame.from_dict(converted, orient='columns'))
        #data.append(pd.json_normalize(converted), orient='columns')
        data.append(check)
        print(check)
        #print ("Received '%s'" % converted, time.time())
        #print(df)
    except Exception as error:
        print('Caught this error: ' + repr(error))
        time.sleep(3)

ws.close()

df = pd.DataFrame(data)
df.to_csv(csv_file, index=False, encoding='utf-8')
Output from print(check):
"[321,[[\"37491.40000\",\"0.00420457\",\"1612471467.490327\",\"b\",\"l\",\"\"]],\"trade\",\"XBT/USD\"]"
"{\"event\":\"heartbeat\"}"
"[321,[[\"37491.40000\",\"0.00154223\",\"1612471468.547627\",\"b\",\"l\",\"\"]],\"trade\",\"XBT/USD\"]"
"{\"event\":\"heartbeat\"}"
"{\"event\":\"heartbeat\"}"
"[321,[[\"37491.40000\",\"0.00743339\",\"1612471470.533849\",\"b\",\"m\",\"\"],[\"37491.40000\",\"0.00001187\",\"1612471470.537466\",\"b\",\"m\",\"\"],[\"37491.40000\",\"0.00000002\",\"1612471470.539063\",\"b\",\"m\",\"\"]],\"trade\",\"XBT/USD\"]"
"{\"event\":\"heartbeat\"}"
"{\"event\":\"heartbeat\"}"
"{\"event\":\"heartbeat\"}"
"{\"event\":\"heartbeat\"}"
"{\"event\":\"heartbeat\"}"
CSV output:
0
"""{\""connectionID\"":18300780323084664829,\""event\"":\""systemStatus\"",\""status\"":\""online\"",\""version\"":\""1.7.0\""}"""
"""{\""channelID\"":321,\""channelName\"":\""trade\"",\""event\"":\""subscriptionStatus\"",\""pair\"":\""XBT/USD\"",\""status\"":\""subscribed\"",\""subscription\"":{\""name\"":\""trade\""}}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""{\""event\"":\""heartbeat\""}"""
"""[321,[[\""37500.20000\"",\""0.07021874\"",\""1612471427.916155\"",\""b\"",\""l\"",\""\""],[\""37500.20000\"",\""0.30978126\"",\""1612471427.918316\"",\""b\"",\""l\"",\""\""]],\""trade\"",\""XBT/USD\""]"""
"""[321,[[\""37500.10000\"",\""0.01275000\"",\""1612471428.366246\"",\""s\"",\""l\"",\""\""]],\""trade\"",\""XBT/USD\""]"""
Print output of the result variable:
{"connectionID":13755154340899011582,"event":"systemStatus","status":"online","version":"1.7.0"}
{"channelID":321,"channelName":"trade","event":"subscriptionStatus","pair":"XBT/USD","status":"subscribed","subscription":{"name":"trade"}}
{"event":"heartbeat"}
[321,[["37679.30000","0.00462919","1612473049.044471","s","l",""]],"trade","XBT/USD"]
{"event":"heartbeat"}
{"event":"heartbeat"}
{"event":"heartbeat"}
[321,[["37684.00000","0.00300000","1612473051.657296","s","m",""]],"trade","XBT/USD"]
Cleaning your code up:
remove the exception handling that masks what is going on
it then becomes clear that ws.recv() sometimes returns a dict and sometimes a list
construct a dict from the list
not sure what is contained in the 2D list in position 1, so I called it measure
pd.concat() is used to build up the dataframe
import json, time
from websocket import create_connection
import pandas as pd

# start with empty dataframe
df = pd.DataFrame()

ws = create_connection("wss://ws.kraken.com/")
ws.send(json.dumps({
    "event": "subscribe",
    "pair": ["BTC/USD"],
    "subscription": {"name": "trade"}
}))

timeout = time.time() + 60*1
while time.time() < timeout:
    js = json.loads(ws.recv())
    if isinstance(js, dict):
        df = pd.concat([df, pd.json_normalize(js)])
    elif isinstance(js, list):
        df = pd.concat([df, pd.json_normalize({"event": "data",
                                               "data": {
                                                   "channelID": js[0],
                                                   "measure": js[1],
                                                   "channelName": js[2],
                                                   "pair": js[3]}
                                               })
                        ])
    else:
        assert False, f"unknown socket data {js}"
    time.sleep(1)
Pick out the individual values from "measure".
Note this does not consider the lengths of either dimension; what's being thrown away?
df = pd.concat([df, pd.json_normalize({"event": "data",
                                       "data": {
                                           "channelID": js[0],
                                           "measure": js[1],
                                           "m0": js[1][0][0],
                                           "m1": js[1][0][1],
                                           "m2": js[1][0][2],
                                           "m3": js[1][0][3],
                                           "m4": js[1][0][4],
                                           "channelName": js[2],
                                           "pair": js[3]}
                                       })
                ])
Save the JSON data to a file ws.txt:
import json, time
from websocket import create_connection
import pandas as pd

ws = create_connection("wss://ws.kraken.com/")
ws.send(json.dumps({
    "event": "subscribe",
    "pair": ["BTC/USD"],
    "subscription": {"name": "trade"}
}))

timeout = time.time() + 5
with open('ws.txt', 'a') as fw:
    while time.time() < timeout:
        data = ws.recv() + '\n'
        fw.write(data)
        print(data, end='')
Parse ws.txt:
df_ws = pd.read_csv('ws.txt', header=None, sep='\n')
obj = df_ws[0].map(json.loads)
df = pd.DataFrame(obj[obj.map(lambda x: isinstance(x, list))].tolist(),
columns=['channelID', 'trade', 'event', 'pair']).explode('trade')
df[['price', 'volume', 'time', 'side', 'orderType', 'misc']] = pd.DataFrame(df['trade'].tolist()).values
cols = ['event', 'price', 'volume', 'time', 'side', 'orderType', 'misc', 'pair']
dfn = df[cols].copy()
print(dfn.head())
# event price volume time side orderType misc pair
# 0 trade 46743.2 0.00667696 1612850630.079810 s m XBT/USD
# 1 trade 46761.1 0.00320743 1612850633.402091 b l XBT/USD
# 2 trade 46766.3 0.04576695 1612850634.419905 s m XBT/USD
# 3 trade 46794.8 0.12000000 1612850637.033033 s l XBT/USD
# 3 trade 46787.2 0.08639234 1612850637.036229 s l XBT/USD
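A possible follow-up, not taken from the answer above: the Kraken fields arrive as strings, so the numeric columns can be cast before further use.

# Optional follow-up (my addition): cast the string columns to numeric types.
dfn[['price', 'volume', 'time']] = dfn[['price', 'volume', 'time']].astype(float)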
I have a dataframe with many columns, including close, high and low, with a shape of roughly (50k+, 20).
In order to compute the true range and then the ATR, I have to compute the element-wise max of tr1, tr2 and tr3, defined as:
tr1 = (dataTrad['high'] - dataTrad['low'])
tr2 = abs(dataTrad['high'] - dataTrad['close'].shift(1))
tr3 = abs(dataTrad['low'] - dataTrad['close'].shift(1))
I tried the max() function but get an error message.
I also tried the numpy maximum function, but that crashes the Spyder kernel...
I am a bit lost as to what options are available for computing this maximum without a loop.
Any idea?
Thanks a lot in advance.
dataTrad is a DataFrame of shape (60946, 13). Here is a bit of the code:
connect = sq.connect('test.db')
cursor = connect.cursor()
trlist = pd.read_sql_query("SELECT * FROM TRdataindex", connect)
dataTrad = trlist.rename(columns={"Price Open": "open", "Price Close": "close", "Price High": "high", "Price Low": "low", "Volume": "volume", "index": "reindex" })
dataTrad.drop(dataTrad[pd.isnull(dataTrad["close"])].index, inplace=True)
dataTrad.drop(dataTrad[pd.isnull(dataTrad["open"])].index, inplace=True)
dataTrad.drop(dataTrad[pd.isnull(dataTrad["high"])].index, inplace=True)
dataTrad.drop(dataTrad[pd.isnull(dataTrad["low"])].index, inplace=True)
# ATR - Average True Range
tr1 = (dataTrad['high'] - dataTrad['low'])
tr2 = abs(dataTrad['high'] - dataTrad['close'].shift(1))
tr3 = abs(dataTrad['low'] - dataTrad['close'].shift(1))
trRange = [tr1, tr2, tr3].max(axis=1)
atr10 = (atr10.shift(-1) * (10 - 1) + trRange) / 10
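In case it's useful, a hedged sketch of one way to take the element-wise maximum of the three series without an explicit loop; it reuses dataTrad and the tr1/tr2/tr3 definitions from above.

import numpy as np
import pandas as pd

# Element-wise maximum across the three true-range candidates, no Python loop.
# tr1, tr2, tr3 are the Series computed above from dataTrad.
trRange = pd.concat([tr1, tr2, tr3], axis=1).max(axis=1)

# An equivalent with numpy, pairing np.maximum calls:
# trRange = np.maximum(tr1, np.maximum(tr2, tr3))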
I'm attempting to create a table as follows, where equities in a list get appended as columns to the dataframe:
Fundamentals CTRP EBAY ...... MPNGF
price
dividend
five_year_dividend
pe_ratio
pegRatio
priceToBook
price_to_sales
book_value
ebit
net_income
EPS
DebtEquity
threeYearAverageReturn
At the moment, based on the code below, only the last equity in the list is showing up:
Fundamentals MPNGF
price
dividend
five_year_dividend
pe_ratio
pegRatio
priceToBook
price_to_sales
book_value
ebit
net_income
EPS
DebtEquity
threeYearAverageReturn
from yahoofinancials import YahooFinancials
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
from datetime import datetime

def scrape_table(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    table = tree.xpath('//table')
    assert len(table) == 1
    df = pd.read_html(lxml.etree.tostring(table[0], method='html'))[0]
    df = df.set_index(0)
    df = df.dropna()
    df = df.transpose()
    df = df.replace('-', '0')
    df[df.columns[0]] = pd.to_datetime(df[df.columns[0]])
    cols = list(df.columns)
    cols[0] = 'Date'
    df = df.set_axis(cols, axis='columns', inplace=False)
    numeric_columns = list(df.columns)[1::]
    df[numeric_columns] = df[numeric_columns].astype(np.float64)
    return df

ecommerce = ['CTRP', 'EBAY', 'GRUB', 'BABA', 'JD', 'EXPE', 'AMZN', 'BKNG', 'MPNGF']

price = []
dividend = []
five_year_dividend = []
pe_ratio = []
pegRatio = []
priceToBook = []
price_to_sales = []
book_value = []
ebit = []
net_income = []
EPS = []
DebtEquity = []
threeYearAverageReturn = []

for i, symbol in enumerate(ecommerce):
    yahoo_financials = YahooFinancials(symbol)
    balance_sheet_url = 'https://finance.yahoo.com/quote/' + symbol + '/balance-sheet?p=' + symbol
    df_balance_sheet = scrape_table(balance_sheet_url)
    df_balance_sheet_de = pd.DataFrame(df_balance_sheet, columns=["Total Liabilities", "Total stockholders' equity"])
    j = df_balance_sheet_de.loc[[1]]
    j['DebtEquity'] = j["Total Liabilities"] / j["Total stockholders' equity"]
    k = j.iloc[0]['DebtEquity']
    X = yahoo_financials.get_key_statistics_data()
    for d in X.values():
        PEG = d['pegRatio']
        PB = d['priceToBook']
        three_year_ave_return = d['threeYearAverageReturn']
    data = [['price', yahoo_financials.get_current_price()], ['dividend', yahoo_financials.get_dividend_yield()], ['five_year_dividend', yahoo_financials.get_five_yr_avg_div_yield()], ['pe_ratio', yahoo_financials.get_pe_ratio()], ['pegRatio', PEG], ['priceToBook', PB], ['price_to_sales', yahoo_financials.get_price_to_sales()], ['book_value', yahoo_financials.get_book_value()], ['ebit', yahoo_financials.get_ebit()], ['net_income', yahoo_financials.get_net_income()], ['EPS', yahoo_financials.get_earnings_per_share()], ['DebtEquity', mee], ['threeYearAverageReturn', three_year_ave_return]]
    data.append(symbol.text)
    df = pd.DataFrame(data, columns=['Fundamentals', symbol])

df
Could you kindly advise where I may have gone wrong with the table above? Thank you so much!
You need to build your df outside of your for loop. As currently written, your code recreates a new df on every iteration, so only the last symbol's data survives.
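A rough sketch of that restructuring, assuming a hypothetical fetch_fundamentals(symbol) helper that wraps the per-symbol lookups from the question and returns the 13 values in row order:

import pandas as pd

rows = ['price', 'dividend', 'five_year_dividend', 'pe_ratio', 'pegRatio',
        'priceToBook', 'price_to_sales', 'book_value', 'ebit', 'net_income',
        'EPS', 'DebtEquity', 'threeYearAverageReturn']

columns = {}
for symbol in ecommerce:
    # fetch_fundamentals is hypothetical: it stands in for the YahooFinancials
    # and scrape_table calls in the question and returns one value per row label.
    columns[symbol] = fetch_fundamentals(symbol)

# Build the dataframe once, after the loop, so every symbol becomes a column.
df = pd.DataFrame(columns, index=rows)
df.index.name = 'Fundamentals'
print(df)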