Problem:
I need to write a program that pulls data from an online data set with pandas pd.read_html, then takes a temperature from the user and displays the matching row of data with a few parameters.
The trouble is the final part: when the user input is out of range or doesn't match one of the documented temperatures, I need it to loop back with a retry prompt and a message saying something like "invalid input".
What I have tried:
I tried a while loop, try/except, and if/elif, but I'm sure I did it wrong because it almost always freezes my Spyder session, so I have to close it and try again.
Any recommendations or solutions would be super helpful. I'm past the point of vague hints that are supposed to lead me to an answer but leave me more confused.
My code:
import pandas as pd

def get_t_data(t):
    t_table = pd.read_html('https://thermo.pressbooks.com/chapter/saturation-properties-temperature-table/', header=0)
    t_df = t_table[0]
    data_df = t_df.loc[t_df['Temp'] == t]
    df_result = data_df[['Pressure', 'Volume ()', 'Energy (kJ/kg)', 'Enthalpy (kJ/kg)', 'Entropy (kJ/kg.K)']]
    df_final = df_result.to_string(index=False)
    return df_final

user_t = input('Please enter the temp you would like to research: ')
print('\n')
data = get_t_data(user_t)
print('For temperature {}°C your outputs are \n'.format(user_t))
print(data)
[upd]
Something like this:
import pandas as pd

def get_t_data(t):
    t_table = pd.read_html('https://thermo.pressbooks.com/chapter/saturation-properties-temperature-table/', header=0)
    t_df = t_table[0]
    t_df = t_df.iloc[1:, :]  # skip the additional header line
    ind = list(t_df['Temp'].astype(float))  # get all indexes as float, since the table has non-integer entries (0.01 and 373.95)
    if float(t) not in ind:  # check whether 't' is in the index
        return {'exist': False, 'result': 'no such temp'}
    data_df = t_df.loc[t_df['Temp'] == t]
    df_result = data_df[['Pressure', 'Volume ()', 'Energy (kJ/kg)', 'Enthalpy (kJ/kg)', 'Entropy (kJ/kg.K)']]
    df_final = df_result.to_string(index=False)
    return {'exist': True, 'result': df_final}
# data format for the get_t_data response
data = {'exist': False, 'result': ''}
while data['exist'] == False:
    user_t = input('Please enter the temp you would like to research: ')
    print('\n')
    data = get_t_data(user_t)
    print('For temperature {}°C your outputs are \n'.format(user_t))
    print(data['result'])
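One gap in the loop above: float(t) inside get_t_data raises a ValueError if the user types something non-numeric, which is exactly the kind of thing that can appear to break a Spyder session. A minimal sketch of a retry loop that also guards against that, reusing the get_t_data above (the prompt and message text are illustrative):

while True:
    user_t = input('Please enter the temp you would like to research: ')
    try:
        data = get_t_data(user_t)
    except ValueError:
        # non-numeric input such as 'abc': float(t) raised, so ask again
        print('Invalid input, please enter a number.\n')
        continue
    if not data['exist']:
        print('That temperature is not in the table, please retry.\n')
        continue
    break

print('For temperature {}°C your outputs are \n'.format(user_t))
print(data['result'])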
Working through my first project in pandas, I'm trying to build a simple program that retrieves the IDs and information from a large CSV.
I'm able to get it working when I print it. However, when I try to make it a conditional statement, it won't work. Is this because of the way I set the index?
The intent is to input an ID and, in return, list out all the rows with this ID in the CSV.
import numpy as np
import pandas as pd

file = "PartIdsWithVehicleData.csv"
excel = pd.read_csv(file, index_col="ID", dtype={'ID': "str"})
UserInput = input("Enter a part number: ")
result = (excel.loc[UserInput])

# If I print this.... it will work
print(result)

# however when I try to make it a conditional statement it runs my else clause.
if UserInput in excel:
    print(result)
else:
    print("Part number is not in file.")

# one thing I did notice is that if I find the boolean value of the part number (index) it says false..
print(UserInput in excel)
Is this what you are trying to accomplish? I had to build my own table to help visualize the process, but you should be able to accomplish what you asked with a little tweaking to your own data.
df = pd.DataFrame({
    'Part_Number': [1, 2, 3, 4, 5],
    'Part_Name': ['Name', 'Another Name', 'Third Name', 'Almost last name', 'last name']
})

user_input = int(input('Please enter part number'))
if user_input in df['Part_Number'].tolist():
    print(df.loc[df['Part_Number'] == user_input])
else:
    print('part number does not exist')
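For what it's worth, the reason print(UserInput in excel) showed False in the original code is that the in operator on a DataFrame tests the column labels, not the index. Since the CSV was read with index_col="ID", a minimal sketch of the membership test against the index instead (same file name as in the question):

import pandas as pd

excel = pd.read_csv("PartIdsWithVehicleData.csv", index_col="ID", dtype={'ID': "str"})
UserInput = input("Enter a part number: ")

# 'in' on a DataFrame checks column names; check the index instead
if UserInput in excel.index:
    print(excel.loc[UserInput])
else:
    print("Part number is not in file.")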
On the first run through the loop everything works fine, but on the second iteration I get a KeyError on the column names of my df. I don't understand why this is happening, since every iteration triggers the same set of functions.
Part of the code that creates the error:
def market_data(crypto, ts_float):
    # request to kraken for pricing data
    r = requests.get('https://futures.kraken.com/api/charts/v1/trade/' + crypto + '/15m?from=' + ts_float)
    # set JSON response to data
    data = r.json()
    # normalize data into dataframe
    df = pd.json_normalize(data, record_path=['candles'])
    # convert unix time back into readable time
    df['time'] = pd.to_datetime(df['time'], unit='ms')
    # set time as index
    df = df.set_index('time')
    # convert into integers for calculations
    df['open'] = df['open'].astype(float).astype(int)
    df['high'] = df['high'].astype(float).astype(int)
    df['low'] = df['low'].astype(float).astype(int)
    df['close'] = df['close'].astype(float).astype(int)
    df['volume'] = df['volume'].astype(float).astype(int)
    return df
crypto_pairs = [
    {"crypto": "pf_ethusd", "size": 0.05},
    {"crypto": "pf_btcusd", "size": 0.0003},
    {"crypto": "pf_avaxusd", "size": 3},
    {"crypto": "pf_dotusd", "size": 10},
    {"crypto": "pf_ltcusd", "size": 1.5}
]

# getting the timestamp to get the data from
ts = (datetime.now() - timedelta(hours=48)).timestamp()
ts_float = str(int(ts))

for cryptos in enumerate(crypto_pairs):
    data = market_data(cryptos[1]['crypto'], ts_float)
KeyError: 'time'
I have a set of functions in my enumerate loop, and market_data, which is the first one, generates this error on the second iteration. The errors always happen on the lines that touch the column names, such as "time" and "open".
I'm not experienced with requests, but this worked for me. Try the following: in the market_data function, after building the dataframe, add a check, and if len(df) <= 0, return early.
When the dataframe turns out to be empty, the request still returns 200, so the call itself is fine. I printed out 'crypto': the empty dataframe shows up for 'pf_btcusd'. I tried swapping the order, and again the empty dataframe came from 'pf_btcusd', so something is wrong with that symbol.
def market_data(crypto, ts_float):
    # request to kraken for pricing data
    r = requests.get('https://futures.kraken.com/api/charts/v1/trade/' + crypto + '/15m?from=' + ts_float)
    # print(r.status_code)
    # set JSON response to data
    data = r.json()
    # normalize data into dataframe
    df = pd.json_normalize(data, record_path=['candles'])
    if len(df) <= 0:
        print(r.status_code)
        print(crypto)
        return
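Since this guarded market_data returns None when the response is empty, the calling loop has to skip those symbols rather than crash. A minimal sketch of that caller, under the same crypto_pairs and ts_float as in the question:

for cryptos in enumerate(crypto_pairs):
    data = market_data(cryptos[1]['crypto'], ts_float)
    if data is None:
        # empty response for this symbol (e.g. pf_btcusd above), so skip it
        continue
    # ...the rest of the per-symbol processing goes here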
I am new to Python and have never really used pandas, so forgive me if this doesn't make sense. I am trying to create a df based on frontend data I am sending to a Flask route. The data is looped through and appended for each row. My only problem is that I don't know how to get the df columns to reflect that. Here is my code to build the rows, and the current output:
claims = csv_data["claims"]
setups = csv_data["setups"]

for setup in setups:
    setup = setups[0]
    offerings = setup["currentOfferings"]
    considered = setup["considerationSet"]
    reach_dict = setup["reach"]
    favorite_dict = setup["favorite"]
    summary_dict = setup["summaryMetrics"]

rows = []
for i, claim in enumerate(claims):
    row = []
    row.append(i + 1)
    row.append(claim)
    for setup in setups:
        setup = setups[0]
        row.append("X") if claim in setup["currentOfferings"] else row.append(float('nan'))
        row.append("X") if claim in setup["considerationSet"] else row.append(float('nan'))
        if claim in setup["currentOfferings"]:
            reach_score = reach_dict[claim]
            reach_percentage = "{:.0%}".format(reach_score)
            row.append(reach_percentage)
        else:
            row.append(float('nan'))
        if claim in setup["currentOfferings"]:
            favorite_score = favorite_dict[claim]
            fav_percentage = "{:.0%}".format(favorite_score)
            row.append(fav_percentage)
        else:
            row.append(float('nan'))
    rows.append(row)
I know that I can put columns = ["#", "Claims", "Setups", etc...] in the df, but that doesn't work because the rows loop through multiple setups, and the number of setups can change. If I don't specify the column names (as in the image), I just get numbers as column names. Ideally it should loop through the data it receives in the route, and would start with "#", "Claims" as columns, and then for each setup "Setup 1", "Consideration Set 1", "Reach", "Favorite", "Setup 2", "Consideration Set 2", and so on.
I tried to create a similar type of loop for the columns:
my_columns = []
for i, row in enumerate(rows):
    col = []
    if row[0] != None:
        col.append("#")
    else:
        pass
    if row[1] != None:
        col.append("Claims")
    else:
        pass
    if row[2] != None:
        col.append("Setup")
    else:
        pass
    if row[3] != None:
        col.append("Consideration Set")
    else:
        pass
    if row[4] != None:
        col.append("Reach")
    else:
        pass
    if row[5] != None:
        col.append("Favorite")
    else:
        pass
    my_columns.append(col)

df = pd.DataFrame(
    rows,
    columns=my_columns
)
But this didn't work because I have the same issue of no loop: I pass 6 column names but 10 data columns. I'm not sure if I'm just not writing the column loop properly, or if I'm making everything more complicated than it needs to be.
This is what I am trying to accomplish without having to explicitly name the columns, because this is just sample data. There could end up being 3, 4, or however many setups in the actual app.
(image: what I would like the output to look like)
I don't know if this is the most efficient way of doing something like this, but I think it achieves what you want:
def create_columns(df):
    new_cols = []
    for i in range(len(df.columns)):
        repeated_cols = 6  # the number of columns repeated for every setup
        idx = 1 + i // repeated_cols
        basic = ['#', 'Claims', f'Setup_{idx}', f'Consideration_Set_{idx}', 'Reach', 'Favorite']
        new_cols.append(basic[i % len(basic)])
    return new_cols

df.columns = create_columns(df)
If your data comes as CSV, then try pd.read_csv() to create the dataframe.
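Another way to get there, since every row is built as "#" and "Claims" once followed by four values per setup, is to build the column list with the same loop shape that builds the rows. A minimal sketch under that assumption (the column names are illustrative, and pandas allows the repeated "Reach"/"Favorite" labels):

import pandas as pd

# assumes each row is [#, claim] followed by 4 values per setup
columns = ["#", "Claims"]
for i in range(len(setups)):
    columns += [f"Setup {i + 1}", f"Consideration Set {i + 1}", "Reach", "Favorite"]

df = pd.DataFrame(rows, columns=columns)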
So I am writing code for a Tkinter GUI. The code pulls data from FRED and uses it to present graphs. There is an option at the start to save the pulled data in a CSV file so you can run it without the internet. But when the code runs from the CSV, something happens with the scale and it gives me a graph like this. I think it has something to do with the datetime data not being remembered. The current code situation follows:
Imports:

from tkinter import *
from tkinter import ttk
import pandas_datareader as pdr
import pandas as pd
from datetime import datetime
Example of how data is called:
def getBudgetData():
    '''
    PURPOSE: Get the government budget balance data
    INPUTS: None
    OUTPUTS: The dataframe of the selected country
    '''
    global namedCountry
    # Reads what country is in the combobox when selected, then gives the
    # index value so the correct code is used in the graph
    namedCountry = countryCombo.get()
    selectedCountry = countryOptions.index(namedCountry)
    df = dfBudget[dfBudget.columns[selectedCountry]]
    return df
Code for getting/reading the dataframes
def readDataframeCSV():
    global dfCPIQuarterly, dfCPIMonthly, dfGDP, dfUnemployment, dfCashRate, dfBudget
    dfCPIQuarterly = pd.read_csv('dataframes\dfCPIQuarterly.csv', infer_datetime_format=True)
    dfCPIMonthly = pd.read_csv('dataframes\dfCPIMonthly.csv')
    dfGDP = pd.read_csv('dataframes\dfGDP.csv')
    dfUnemployment = pd.read_csv('dataframes\dfUnemployment.csv')
    dfCashRate = pd.read_csv('dataframes\dfCashRate.csv')
    dfBudget = pd.read_csv('dataframes\dfBudget.csv')
def LogDiff(x, frequency):
    '''
    PURPOSE: Transform level data into growth
    INPUTS: x (time series), frequency (frequency of time series)
    OUTPUTS: x_diff (growth rate of time series)
    REFERENCE: Tao, Ran, & Chris Brooks. (2019). Python Guide to accompany
               Introductory Econometrics for Finance (4th Edition).
               Cambridge University Press.
    '''
    x_diff = 100 * log(x / x.shift(frequency))
    x_diff = x_diff.dropna()
    return x_diff
def getAllFredData():
    '''
    PURPOSE: Extract all required data from FRED
    INPUTS: None
    OUTPUTS: Dataframes of all time series
    REFERENCE: https://fred.stlouisfed.org/
    '''
    global dfCPIQuarterly, dfCPIMonthly, dfGDP, dfUnemployment, dfCashRate, dfBudget
    # Country codes
    countryCPIQuarterlyCodes = ['AUSCPIALLQINMEI', 'NZLCPIALLQINMEI']
    countryCPIMonthlyCodes = ['CPALCY01CAM661N', 'JPNCPIALLMINMEI', 'GBRCPIALLMINMEI', 'CPIAUCSL']
    countryGDPCodes = ['AUSGDPRQDSMEI', 'NAEXKP01CAQ189S', 'JPNRGDPEXP',
                       'NAEXKP01NZQ189S', 'CLVMNACSCAB1GQUK', 'GDPC1']
    countryUnemploymentCodes = ['LRUNTTTTAUQ156S', 'LRUNTTTTCAQ156S', 'LRUN64TTJPQ156S',
                                'LRUNTTTTNZQ156S', 'LRUNTTTTGBQ156S', 'LRUN64TTUSQ156S']
    countryCashRateCodes = ['IR3TBB01AUM156N', 'IR3TIB01CAM156N', 'INTDSRJPM193N',
                            'IR3TBB01NZM156N', 'IR3TIB01GBM156N', 'FEDFUNDS']
    countryBudgetCodes = ['GGNLBAAUA188N', 'GGNLBACAA188N', 'GGNLBAJPA188N',
                          'NZLGGXCNLG01GDPPT', 'GGNLBAGBA188N', 'FYFSGDA188S']
    # Inflation
    dfCPIQuarterly = pdr.DataReader(countryCPIQuarterlyCodes, 'fred', start, end)
    for country in countryCPIQuarterlyCodes:
        dfCPIQuarterly[country] = pd.DataFrame({"Inflation rate": LogDiff(dfCPIQuarterly[country], 4)})
    dfCPIMonthly = pdr.DataReader(countryCPIMonthlyCodes, 'fred', start, end)
    for country in countryCPIMonthlyCodes:
        dfCPIMonthly[country] = pd.DataFrame({"Inflation rate": LogDiff(dfCPIMonthly[country], 12)})
    # GDP
    dfGDP = pdr.DataReader(countryGDPCodes, 'fred', start, end)
    for country in countryGDPCodes:
        dfGDP[country] = pd.DataFrame({"Economic Growth": LogDiff(dfGDP[country], 4)})
    # Unemployment
    dfUnemployment = pdr.DataReader(countryUnemploymentCodes, 'fred', start, end)
    # Cash Rate
    dfCashRate = pdr.DataReader(countryCashRateCodes, 'fred', start, end)
    # Budget
    dfBudget = pdr.DataReader(countryBudgetCodes, 'fred', start, end)

    print('')
    saveToCSVLoop = True
    while saveToCSVLoop == True:
        saveToCSV = input('Would you like to save the dataframes to a CSV file so start-up will be quicker next time (y or n): ')
        if saveToCSV == 'y':
            dfCPIQuarterly.to_csv('dataframes\dfCPIQuarterly.csv', index=True)
            dfCPIMonthly.to_csv('dataframes\dfCPIMonthly.csv', index=False)
            dfGDP.to_csv('dataframes\dfGDP.csv', index=False)
            dfUnemployment.to_csv('dataframes\dfUnemployment.csv', index=False)
            dfCashRate.to_csv('dataframes\dfCashRate.csv', index=False)
            dfBudget.to_csv('dataframes\dfBudget.csv', index=False)
            saveToCSVLoop = False
        elif saveToCSV == 'n':
            saveToCSVLoop = False
        else:
            print('\nNot a valid option')
            sleep(1)
It's hard to help without the CSV data. It could be that the dates aren't saved properly, or aren't interpreted properly. Maybe you could try parsing the datetime first. It kind of looks like there are no years, or that something expected to be the year is actually a month.
Since it starts at 1970, I have a feeling it's interpreting your time as the unix epoch, not normal yyyy-mm-dd dates. Try printing dfCPIQuarterly and see if the index looks like dates. Maybe you shouldn't use infer_datetime_format=True when reading from the CSV, but it's hard to tell without more details.
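One likely culprit, looking at the to_csv calls above: most of the dataframes are saved with index=False, which discards the DatetimeIndex that FRED data comes with, so there is nothing date-like left to parse on reload. A minimal sketch of a round trip that keeps the dates, assuming the saved index ends up in the first column (dfGDP used as the example):

import pandas as pd

# save WITH the index so the dates actually end up in the file
dfGDP.to_csv('dataframes/dfGDP.csv', index=True)

# read it back, treating the first column as a DatetimeIndex
dfGDP = pd.read_csv('dataframes/dfGDP.csv', index_col=0, parse_dates=True)
print(dfGDP.index.dtype)  # should be datetime64[ns], not object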
Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while regarding this, and it seems my code is somehow assigning a sliced dataframe to a new variable, which is problematic.
The problem is **I can't find where my code gets problematic**.
I tried the copy function and separating the nested functions, but it is not working.
I attached my code below.
def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get is "|x|":
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically, what I was trying to do was make a Flask API that gets an Excel file as input and returns a CSV file with some filtering. So I defined some functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
    if '브랜드' in final_list:
        final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // remove unnecessary rows and columns with na values
    df_data = df_data.dropna(axis=0 & 1, how='all')
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains the brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list containing all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brands into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe based on the brand name, and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframes
    df_merged = pd.DataFrame()
    for brand_each in brand_list:
        df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
I am going to import this function in my app.py later.
I am quite new to coding, so I'm really sorry if my code is hard to understand, but I just really want to get rid of this annoying warning message. Thanks for the help in advance :)
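For what it's worth, SettingWithCopyWarning usually means an in-place modification of a dataframe that was produced by slicing another one; the usual fix is an explicit .copy() before mutating, or assigning through .loc. A minimal, generic sketch of the pattern (not the exact line in the code above, which the full traceback would point to):

import pandas as pd

df = pd.DataFrame({'brand': ['a', 'a', 'b'], 'score': [1, None, 3]})

# likely to warn: 'subset' is a slice of 'df', and fillna mutates it in place
subset = df[df['brand'] == 'a']
# subset.fillna(method='pad', inplace=True)  # <- SettingWithCopyWarning

# safe: make the slice an independent copy before mutating it
subset = df[df['brand'] == 'a'].copy()
subset.fillna(method='pad', inplace=True)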