Failing while passing dataframe column separated by comma into an API - python

I have to pass locations to an API to retrieve values.
Working Code
dfs = []
locations = ['ZRH', 'SIN']
for loc in locations:
    response = requests.get(f'https://risk.dev.tyche.eu-central-1.aws.int.kn/il/risk/location/{loc}', headers=headers, verify=False)
    data = json.loads(response.text)
    if 'items' in data:
        df = pd.json_normalize(data, 'items', 'totalItems')
        df1 = pd.concat([pd.DataFrame(x) for x in df.pop('relatedEntities')], keys=df.index).add_prefix('relatedEntities.')
        df3 = df.join(df1.reset_index(level=1, drop=True))
        dfs.append(df3)
df = pd.concat(dfs, ignore_index=True)
Failing Code (while passing as a parameter)
When I try to pass the location as a parameter built from another dataframe column, it fails.
Unique_Location = data['LOCATION'].unique()
Unique_Location = pd.DataFrame(list(zip(Unique_Location)), columns=['Unique_Location'])
t = ','.join(map(repr, Unique_Location['Unique_Location']))
locations = [t]
for loc in locations:
    response = requests.get(f'https://risk.dev.logindex.com/il/risk/location/{loc}', headers=headers)
    data = json.loads(response.text)
    df = pd.json_normalize(data, 'items', 'totalItems')
What is wrong in my code?
Error
c:\users\ashok.eapen\pycharmprojects\rs-components\venv\lib\site-packages\pandas\io\json\_normalize.py in _pull_records(js, spec)
    246         if has non iterable value.
    247         """
--> 248         result = _pull_field(js, spec)
    249
    250         # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
c:\users\ashok.eapen\pycharmprojects\rs-components\venv\lib\site-packages\pandas\io\json\_normalize.py in _pull_field(js, spec)
    237             result = result[field]
    238         else:
--> 239             result = result[spec]
    240         return result
    241
KeyError: 'items'

You can test whether 'items' exists in the JSON like this:
dfs = []
locations = ['NZAKL', 'NZ23-USBCH', 'DEBAD', 'ARBUE', 'AR02_GSTI', 'AEJEA', 'UYMVD', 'UY03', 'AE01_GSTI', 'TH02_GSTI', 'JO01_GSTI', 'ITSIM', 'GB75_GSTI', 'DEAMA', 'DE273_GSTI', 'ITPRO', 'AT07_GSTI', 'FR05', 'FRHAU', 'FR01_GSTI', 'FRHER', 'ES70X-FRLBM', 'THNEO']
for loc in locations:
    response = requests.get(f'https://risk.dev.logindex.com/il/risk/location/{loc}', headers=headers)
    data = json.loads(response.text)
    if 'items' in data:
        if len(data['items']) > 0:
            df = pd.json_normalize(data, 'items', 'totalItems')
            # NaN in the column caused the failure - replace NaN with an empty list
            f = lambda x: x if isinstance(x, list) else []
            df['raw.identifiers'] = df['raw.identifiers'].apply(f)
            df['raw.relationships'] = df['raw.relationships'].apply(f)
            df1 = pd.concat([pd.DataFrame(x) for x in df.pop('raw.identifiers')], keys=df.index).add_prefix('raw.identifiers.')
            df2 = pd.concat([pd.DataFrame(x) for x in df.pop('raw.relationships')], keys=df.index).add_prefix('raw.relationships.')
            df3 = df.join(df1.join(df2).reset_index(level=1, drop=True))
            dfs.append(df3)
df = pd.concat(dfs, ignore_index=True)
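The root cause of the failing version: ','.join(map(repr, ...)) collapses every location into one string like "'ZRH','SIN'", so a single request is sent with a malformed path segment, the response carries no 'items' key, and json_normalize raises the KeyError. A minimal sketch of looping over the unique column values directly (assuming data, headers and dfs from the question; payload is used here so the source DataFrame data is not overwritten):
# Iterate the unique locations one by one instead of joining them
# into a single comma-separated string (which builds one invalid URL).
for loc in data['LOCATION'].unique():
    response = requests.get(f'https://risk.dev.logindex.com/il/risk/location/{loc}', headers=headers)
    payload = json.loads(response.text)
    if 'items' in payload and len(payload['items']) > 0:
        dfs.append(pd.json_normalize(payload, 'items', 'totalItems'))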

Related

TypeError for >= between float and string values

I have been using the script below to calculate the storage capacity on one of our server environments. It reads the values from a report I get every two weeks and then creates a file I can import into PowerBI to create graphs. It ran without an issue two weeks ago, but today when I tried to run it I get a TypeError. I assume if float(df['Capacity(TB)']) >= 0.01: is causing the issue, as per the error message.
The data I am importing is an xls sheet with a header name and values underneath it. I had a look to see if there are any blank fields but could not find any. Any help/suggestions would be greatly appreciated.
import pandas as pd
import numpy as np
from datetime import datetime
import os
from os import listdir
from os.path import isfile, join

# SCCM resource import as 'df'
pathres = r'C:\Capacity Reports\SCOM Reports'
onlyfiles = [f for f in listdir(pathres) if isfile(join(pathres, f))]
df = pd.DataFrame()
for i in onlyfiles:
    print(i)
    dfresimp = pd.read_excel(pathres + '\\' + i)
    df = pd.concat([df, dfresimp])

# CMDB import as 'df2'
df2 = pd.read_excel('C:\\Capacity Reports\\CMDB_Export.xlsx')
# Windows Lifecycle import as 'df3'
df3 = pd.read_excel('C:\\Capacity Reports\\Windows Server Lifecycle.xlsx')
# SCVMM clusters import as 'df4'
df4 = pd.read_excel('C:\\Capacity Reports\\HyperV Overview.xlsx')

# SCVMM Storage reports import as 'df5'
pathstor = r'C:\Capacity Reports\Hyper-V Storage'
Storfiles = [f for f in listdir(pathstor) if isfile(join(pathstor, f))]
df5 = pd.DataFrame()
for i in Storfiles:
    print(i)
    dfstorimp = pd.read_excel(pathstor + '\\' + i)
    df5 = pd.concat([df5, dfstorimp])

# CREATE MAIN TABLE
df['NAME'] = df['Computer Name'].str.upper()
df11 = pd.DataFrame()
df11['NAME'] = df2['NAME'].str.upper()
df11['Application Owner'] = df2['Application Owner'].str.title()
df11['HW EOSL'] = df2['HW EOSL'].str.title()
#print(df11['HW EOSL'])
Main_Table = df.merge(df11, on='NAME', how='left')
Main_Table = Main_Table.merge(df3, on='Operating System Edition', how='left')
df13 = pd.DataFrame()
df13['Hyper V Cluster name'] = df4['Hyper V Cluster name']
df13['Computer Name'] = df4['Server Name'].str.upper()
Main_Table = Main_Table.merge(df13, on='Computer Name', how='left')
Main_Table['OS_Support'] = pd.to_datetime(Main_Table['Extended_Support_End_Date'], format='%Y-%m-%d %H:%S:%f')
Main_Table['OS_Support'] = Main_Table['OS_Support'].dt.strftime("%Y-%m-%d")
#print(Main_Table['OS_Support'])

def f(df):
    if df['Host/GuestVM'] == 'GuestVM':
        result = (df['Total Physical Memory GB'] - (df['Total Physical Memory GB'] * (df['Memory % Used Max Value'] / 100))) / 2
        return result
    else:
        return np.nan

Main_Table['Reclaimable Memory Calculated'] = Main_Table.apply(f, axis=1)

def f(df):
    if df['Host/GuestVM'] == 'GuestVM':
        result = (df['Total Logical Processors'] - (df['Total Logical Processors'] * (df['CPU % Used Max Value'] / 100))) / 2
        return result
    else:
        return np.nan

Main_Table['Reclaimable CPU Calculated'] = Main_Table.apply(f, axis=1)
Main_Table['Reclaimable Memory Calculated'] = round(Main_Table['Reclaimable Memory Calculated'])
Main_Table['Reclaimable CPU Calculated'] = round(Main_Table['Reclaimable CPU Calculated'])
Main_Table['Report Timestamp'] = Main_Table['Report Timestamp'].dt.strftime("%Y%m%d")
Main_Table = Main_Table.drop_duplicates()
Main_Table['Report Timestamp Number'] = Main_Table['Report Timestamp']
column = Main_Table["Report Timestamp Number"]
max_value = column.max()
Total_Memory_Latest = 0

def f(df):
    global Total_Memory_Latest
    if df['Report Timestamp Number'] == max_value and df['Host/GuestVM'] == 'Host':
        Total_Memory_Latest += df['Total Physical Memory GB']
        return 0
    else:
        return np.nan

Main_Table['DummyField'] = Main_Table.apply(f, axis=1)
Main_Table.to_excel(r'C:\Users\storm_he\OneDrive - MTN Group\Documents\Testing\Main_Table.xlsx')

# CREATE STORAGE TABLE AND EXPORT
def f(df):
    #if df['Host/GuestVM'] == 'Host':
    #try:
    if float(df['Capacity(TB)']) >= 0.01:
        result = (df['Available(TB)'] / df['Capacity(TB)']) * 100
        return round(result)
    else:
        return ''
    #except:
    #    return np.nan

df5['% Storage free'] = df5.apply(f, axis=1)
pattern = '|'.join(['.mtn.co.za', '.mtn.com'])
df5['VMHost'] = df5['VMHost'].str.replace(pattern, '')
df5['VMHost'] = df5['VMHost'].str.upper()
df5['Report Timestamp'] = df5['Report Timestamp'].dt.strftime("%Y%m%d")
#print(df5['Report Timestamp'])
df5.to_excel(r'C:\Users\storm_he\OneDrive - MTN Group\Documents\Testing\Main_Storage_table.xlsx')
print('Run Finished')
Stack trace
TypeError Traceback (most recent call last)
<ipython-input-1-3c53bb32e311> in <module>
108 column = Main_Table["Report Timestamp Number"]
109
--> 110 max_value = column.max()
111 Total_Memory_Latest = 0
112
~\Anaconda3\lib\site-packages\pandas\core\generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
11212 if level is not None:
11213 return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
> 11214 return self._reduce(
11215 f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
11216 )
~\Anaconda3\lib\site-packages\pandas\core\series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
3889 )
3890 with np.errstate(all="ignore"):
-> 3891 return op(delegate, skipna=skipna, **kwds)
3892
3893 # TODO(EA) dispatch to Index
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
123 result = alt(values, axis=axis, skipna=skipna, **kwds)
124 else:
--> 125 result = alt(values, axis=axis, skipna=skipna, **kwds)
126
127 return result
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in reduction(values, axis, skipna, mask)
835 result = np.nan
836 else:
--> 837 result = getattr(values, meth)(axis)
838
839 result = _wrap_results(result, dtype, fill_value)
~\Anaconda3\lib\site-packages\numpy\core\_methods.py in _amax(a, axis, out, keepdims, initial, where)
28 def _amax(a, axis=None, out=None, keepdims=False,
29 initial=_NoValue, where=True):
---> 30 return umr_maximum(a, axis, None, out, keepdims, initial, where)
31
32 def _amin(a, axis=None, out=None, keepdims=False,
TypeError: '>=' not supported between instances of 'float' and 'str'
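A likely cause (an assumption read from the trace, not confirmed in the question): dt.strftime("%Y%m%d") returns strings, but any NaT timestamps come back as NaN, which is a float, so 'Report Timestamp Number' mixes str and float and column.max() cannot compare them. A hedged sketch of one way to make the comparison well-defined:
# Assumption: NaT rows survive dt.strftime() as NaN (float), so the column
# mixes str and float. Coercing to numeric makes .max() well-defined;
# unparseable entries become NaN and are skipped by default (skipna=True).
Main_Table['Report Timestamp Number'] = pd.to_numeric(Main_Table['Report Timestamp Number'], errors='coerce')
max_value = Main_Table['Report Timestamp Number'].max()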

Looping through a list of names of dataframes

I have a list of dataframes, each created from a unique web query:
bngimp = parse_forecast_data(get_json('419524'), None)
belimp = parse_forecast_data(get_json('419525'), None)
braimp = parse_forecast_data(get_json('419635'), None)
chilimp = parse_forecast_data(get_json('419526'), None)
chinimp = parse_forecast_data(get_json('419527'), None)
domimp = parse_forecast_data(get_json('419633'), None)
fraimp = parse_forecast_data(get_json('419636'), None)
greimp = parse_forecast_data(get_json('419528'), None)
ghaimp = parse_forecast_data(get_json('419638'), None)
indimp = parse_forecast_data(get_json('419530'), None)
indoimp = parse_forecast_data(get_json('419639'), None)
itaimp = parse_forecast_data(get_json('419533'), None)
japimp = parse_forecast_data(get_json('419534'), None)
kuwimp = parse_forecast_data(get_json('419640'), None)
litimp = parse_forecast_data(get_json('419641'), None)
meximp = parse_forecast_data(get_json('419537'), None)
I need to format each dataframe in the same way, as follows:
bngimp = bngimp[['From Date','Sales Volume']]
bngimp = bngimp.set_index('From Date')
bngimp.index = pd.to_datetime(bngimp.index)
bngimp = bngimp.groupby(by=[bngimp.index.year, bngimp.index.month]).sum()
bngimp.columns = ['bngimp']
Is there any way I could loop through the names of the dataframes without having to copy and paste each dataframe name into the above code?
There will be many more dataframes, so the copying and pasting is quite time consuming!
Any help is much appreciated.
I suggest creating a dictionary that maps the numbers to DataFrame names, and building a dictionary of DataFrames called out:
d = {'419524': 'bngimp', '419525': 'belimp', ...}

out = {}
for k, v in d.items():
    df = parse_forecast_data(get_json(k), None)
    df = df[['From Date','Sales Volume']]
    df = df.set_index('From Date')
    df.index = pd.to_datetime(df.index)
    df = df.groupby(by=[df.index.year, df.index.month]).sum()
    df.columns = [v]
    out[v] = df
Then, to get a DataFrame, select it by key:
print (out['bngimp'])
Also, if you want to create one big DataFrame, it is possible to use:
df = pd.concat(out, axis=1)
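One detail worth knowing (standard pandas behavior, not covered in the answer above): pd.concat on a dict promotes the keys to the outer level of a column MultiIndex, and since each frame's single column is already named after its key, the result carries two near-identical levels. A small sketch of flattening it back, if a single level is preferred:
big = pd.concat(out, axis=1)            # dict keys become the outer column level
big.columns = big.columns.droplevel(0)  # optional: back to single-level columns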

Getting Error When Trying to Import Data From Website Using Loop

I'm trying to import data from multiple web pages into a data table using Python.
Basically, I'm trying to download attendance data for certain teams since 2000.
Here is what I have so far:
import requests
import pandas as pd
import numpy as np

# What is the effect of a rival team's performance on a team's attendance?
Teams = ['LAA', 'LAD', 'NYY', 'NYM', 'CHC', 'CHW', 'OAK', 'SFG']
Years = []
for year in range(2000, 2020):
    Years.append(str(year))

bbattend = pd.DataFrame(columns=['GM_Num','Date','Team','Home','Opp','W/L','R','RA','Inn','W-L','Rank','GB','Time','D/N','Attendance','Streak','Game_Win','Wins','Losses','Net_Wins'])
for team in Teams:
    for year in Years:
        url = 'https://www.baseball-reference.com/teams/' + team + '/' + year + '-schedule-scores.shtml'
        html = requests.get(url).content
        df_list = pd.read_html(html)
        df = df_list[-1]
        # Formatting data table
        df.rename(columns={"Gm#": "GM_Num", "Unnamed: 4": "Home", "Tm": "Team", "D/N": "Night"}, inplace=True)
        df['Home'] = df['Home'].apply(lambda x: 0 if x == '#' else 1)
        df['Game_Win'] = df['W/L'].astype(str).str[0]
        df['Game_Win'] = df['Game_Win'].apply(lambda x: 0 if x == 'L' else 1)
        df['Night'] = df['Night'].apply(lambda x: 1 if x == 'N' else 0)
        df['Streak'] = df['Streak'].apply(lambda x: -1 * len(x) if '-' in x else len(x))
        df.drop('Unnamed: 2', axis=1, inplace=True)
        df.drop('Orig. Scheduled', axis=1, inplace=True)
        df.drop('Win', axis=1, inplace=True)
        df.drop('Loss', axis=1, inplace=True)
        df.drop('Save', axis=1, inplace=True)
        # Drop rows that do not have data
        df = df[df['GM_Num'].str.isdigit()]
        WL = df["W-L"].str.split("-", n=1, expand=True)
        df["Wins"] = WL[0].astype(dtype=np.int64)
        df["Losses"] = WL[1].astype(dtype=np.int64)
        df['Net_Wins'] = df['Wins'] - df['Losses']
        bbattend.append(df)
bbattend
When I run the body of the loop on its own with a specific hard-coded link, instead of building the url by concatenation, it seems to work.
However, with this code I am getting the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-77-997e6aeea77e> in <module>
16 url = 'https://www.baseball-reference.com/teams/' + team + '/' + year +'-schedule-scores.shtml'
17 html = requests.get(url).content
---> 18 df_list = pd.read_html(html)
19 df = df_list[-1]
20 #Formatting data table
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
1092 decimal=decimal, converters=converters, na_values=na_values,
1093 keep_default_na=keep_default_na,
-> 1094 displayed_only=displayed_only)
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
914 break
915 else:
--> 916 raise_with_traceback(retained)
917
918 ret = []
~/anaconda3/lib/python3.7/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
418 if traceback == Ellipsis:
419 _, _, traceback = sys.exc_info()
--> 420 raise exc.with_traceback(traceback)
421 else:
422 # this version of raise is a syntax error in Python 3
ValueError: No tables found
I don't really understand what the error message is saying.
I'd appreciate any help!
Some pages do not have any table in them, so for those pages df_list = pd.read_html(html) raises ValueError: No tables found.
You need to use try-except here.
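A minimal sketch of the inner loop with that guard (names taken from the question; skipping the team/year pages that have no table is one reasonable choice, not the only one):
for team in Teams:
    for year in Years:
        url = 'https://www.baseball-reference.com/teams/' + team + '/' + year + '-schedule-scores.shtml'
        html = requests.get(url).content
        try:
            df_list = pd.read_html(html)
        except ValueError:
            # "No tables found" - this page has no data, skip it
            continue
        df = df_list[-1]
        # ... same formatting steps as in the question ...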

Function output into loop

I am trying to replicate the following code, which works smoothly, add a date parameter to the function, and then run the function with different dates in a loop:
FUNCTION V1:
def getOHLCV(currencies):
    c_price = []
    data = {}
    try:
        url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical'
        parameters = {
            'symbol': ",".join(currencies),
            #'time_start': ",".join(start_dates),
            'count': '91',
            'interval': 'daily',
            'convert': 'JPY',
        }
        headers = {
            'Accepts': 'application/json',
            'X-CMC_PRO_API_KEY': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
        }
        session = Session()
        session.headers.update(headers)
        response = session.get(url, params=parameters)
        data = json.loads(response.text)
        for currency in data['data']:
            used_list = [
                item['quote']['JPY']
                for item in data['data'][currency]['quotes']
            ]
            price = pd.DataFrame.from_records(used_list)
            price['timestamp'] = pd.to_datetime(price['timestamp'])
            price['timestamp'] = price['timestamp'].astype(str).str[:-15]
            price_c = price.set_index('timestamp').close
            c_price.append(price_c.rename(currency))
    except Exception as e:
        print(data)
    return c_price

c_price = []
c_price.extend(getOHLCV(available[:61]))
c_price.extend(getOHLCV(available[61:]))
c_price = pd.concat(c_price, axis=1, sort=True)
pd.set_option('display.max_columns', 200)
c_price = c_price.transpose()
c_price.index.name = 'currency'
c_price.sort_index(axis=0, ascending=True, inplace=True)
OUTPUT:
2019-07-25 2019-07-26 2019-07-27 2019-07-28 2019-07-29 \
currency
1WO 2.604104 2.502526 2.392313 2.418967 2.517868
ABX 1.015568 0.957774 0.913224 0.922612 1.037273
ADH 0.244782 0.282976 0.309931 0.287933 0.309613
... ... ... ... ... ...
XTX 0.156103 0.156009 0.156009 0.165103 0.156498
ZCO 0.685255 0.661324 0.703521 0.654763 0.616204
ZPR 0.214395 0.204968 0.181529 0.178460 0.177596
FUNCTION V2:
The V2 function adds a start_dates parameter, and the function is called in a loop with this new parameter. The issue is that I get an empty dataframe from it. I assume there is an issue with the date, but I don't know where. Any help is appreciated.
def getOHLCV(currencies, start_dates):
    ...
        'symbol': ",".join(currencies),
        'time_start': ",".join(start_dates),
    ...

date_list = [(date.today() - timedelta(days=x * 91)) for x in range(3)][1:]
one = []
for i in date_list:
    c_price = []
    c_price.extend(getOHLCV(available[:61], i))
    c_price.extend(getOHLCV(available[61:], i))
    c_price = pd.concat(c_price, axis=1, sort=True)
    one = pd.concat(c_price, axis=1, sort=True)
pd.set_option('display.max_columns', 200)
The list you are extending is being cleared at each iteration of the for loop; it can be fixed like so:
date_list = [(date.today() - timedelta(days=x * 91)) for x in range(3)][1:]
one = []
c_price = []
for i in date_list:
    c_price.extend(getOHLCV(available[:61], i))
    c_price.extend(getOHLCV(available[61:], i))
c_price = pd.concat(c_price, axis=1, sort=True)
one = pd.concat(c_price, axis=1, sort=True)
pd.set_option('display.max_columns', 200)
Hope that works for you
EDIT 1
So we need to fix the error: '"time_start" must be a valid ISO 8601 timestamp or unix time value'.
This is because the return from
date_list = [(date.today() - timedelta(days=x * 91)) for x in range(3)][1:]
is
[datetime.date(2019, 7, 24), datetime.date(2019, 4, 24)]
So we need to convert the list from datetime objects to something the API will understand, which we can do the following way:
date_list = list(map(date.isoformat, date_list))
And we get the following output:
['2019-07-24', '2019-04-24']
EDIT 2
The error happens when we try to call join on something that isn't a list, so we can fix it by doing
'time_start': start_dates
instead of
'time_start': ",".join(start_dates),

How to reshape data in Python?

I have a data set as given below:
Timestamp = 22-05-2019 08:40 :Light = 64.00 :Temp_Soil = 20.5625 :Temp_Air = 23.1875 :Soil_Moisture_1 = 756 :Soil_Moisture_2 = 780 :Soil_Moisture_3 = 1002
Timestamp = 22-05-2019 08:42 :Light = 64.00 :Temp_Soil = 20.5625 :Temp_Air = 23.125 :Soil_Moisture_1 = 755 :Soil_Moisture_2 = 782 :Soil_Moisture_3 = 1002
And I want to reshape (rearrange) the dataset so that the header columns are [Timestamp, Light, Temp_Soil, Temp_Air, Soil_Moisture_1, Soil_Moisture_2, Soil_Moisture_3], with the values as row entries, in Python.
One possible solution:
Instead of a "true" input file, I used a string:
inp="""Timestamp = 22-05-2019 08:40 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.1875 :SoilMoist1 = 756 :SoilMoist2 = 780 :SoilMoist3 = 1002
Timestamp = 22-05-2019 08:42 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.125 :SoilMoist1 = 755 :SoilMoist2 = 782 :SoilMoist3 = 1002"""
import re
from io import StringIO

buf = StringIO(inp)
To avoid "folding" of output lines, I shortened field names.
Then let's create the result DataFrame and a list of "rows" to append to it.
For now, both of them are empty.
df = pd.DataFrame(columns=['Timestamp', 'Light', 'TempSoil', 'TempAir',
'SoilMoist1', 'SoilMoist2', 'SoilMoist3'])
src = []
Below is a loop processing input rows:
while True:
    line = buf.readline()
    if not line:  # EOF
        break
    lst = re.split(r' :', line.rstrip())  # Field list
    if len(lst) < 2:  # Skip empty source lines
        continue
    dct = {}  # Source "row" (dictionary)
    for elem in lst:  # Process fields
        k, v = re.split(r' = ', elem)
        dct[k] = v  # Add field : value to the "row"
    src.append(dct)
And the last step is to append the rows from src to df:
df = df.append(src, ignore_index =True, sort=False)
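A side note for newer environments (an addition, not part of the original answer): DataFrame.append was removed in pandas 2.0, so on recent versions the same step can be written as:
df = pd.concat([df, pd.DataFrame(src)], ignore_index=True, sort=False)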
When you print(df), for my test data, you will get:
Timestamp Light TempSoil TempAir SoilMoist1 SoilMoist2 SoilMoist3
0 22-05-2019 08:40 64.00 20.5625 23.1875 756 780 1002
1 22-05-2019 08:42 64.00 20.5625 23.125 755 782 1002
For now all columns are of string type, so you can change the required
columns to either float or int:
df.Light = pd.to_numeric(df.Light)
df.TempSoil = pd.to_numeric(df.TempSoil)
df.TempAir = pd.to_numeric(df.TempAir)
df.SoilMoist1 = pd.to_numeric(df.SoilMoist1)
df.SoilMoist2 = pd.to_numeric(df.SoilMoist2)
df.SoilMoist3 = pd.to_numeric(df.SoilMoist3)
Note that the to_numeric() function is clever enough to recognize the type to convert to, so the first three columns change their type to float64 and the next three to int64.
You can check this by executing df.info().
One more possible conversion is to change the Timestamp column to DateTime type:
df.Timestamp = pd.to_datetime(df.Timestamp, dayfirst=True)  # dates like 22-05-2019 are day-first
