variable structure in json data source - python

Thanks for your time.
I have a DataFrame in PySpark on Databricks that reads JSON. The source data does not always have the same structure; sometimes the 'emailAddress' field does not appear, which causes the error "org.apache.spark.sql.AnalysisException: cannot resolve ...".
I have tried to solve this by applying a try/except block in this way:
try:
    df_json = df_json.select("responseID", "surveyID", "surveyName", "timestamp", "customVariables.Id_Cliente", "timestamp", "responseSet", "emailAddress")
except ValueError:
    None
But it does not work; it returns the same error I mentioned above. I have also tried another alternative, without success:
if 'Id_Cliente' in s_fields:
    try:
        df_json = df_json.select("responseID", "surveyID", "surveyName", "timestamp", "customVariables.Id_Cliente", "timestamp", "responseSet", "emailAddress")
    except ValueError:
        df_json = df_json.select("responseID", "surveyID", "surveyName", "timestamp", "customVariables.Id_Cliente", "timestamp", "responseSet")
Can anyone suggest a way to handle this situation? I need to stop the execution of my notebook when the field is not found in the structure; otherwise (when it finds the emailAddress variable), processing should continue.
Thank you very much in advance.

You're catching ValueError, but the exception being raised is AnalysisException; that's why it doesn't work.
from pyspark.sql.utils import AnalysisException

try:
    df.select('xyz')
except AnalysisException:
    print(123)
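If you also need to stop the notebook when the field is missing (as the question asks), a minimal sketch building on the answer above is to re-raise after catching the error. The column names follow the question (with the duplicated "timestamp" dropped), and the RuntimeError message is only an illustration:

from pyspark.sql.utils import AnalysisException

try:
    df_json = df_json.select("responseID", "surveyID", "surveyName", "timestamp",
                             "customVariables.Id_Cliente", "responseSet", "emailAddress")
except AnalysisException as e:
    # Re-raising aborts the notebook run instead of silently continuing.
    raise RuntimeError(f"Required field missing in source JSON: {e}")

If you prefer not to rely on the exception at all, an equivalent pre-check is to test "emailAddress" in df_json.columns before calling select.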

Related

Web scraping table with missing attributes via Python Selenium and Pandas

I am scraping a table from a website, but encountering empty cells during the process. The try/except block below is screwing up the data at the end. I also don't want to exclude the complete row, as the information is still relevant even when some attribute is missing.
try:
    for i in range(10):
        data = {'ID': IDs[i].get_attribute('textContent'),
                'holder': holder[i].get_attribute('textContent'),
                'view': view[i].get_attribute('textContent'),
                'material': material[i].get_attribute('textContent'),
                'Addons': addOns[i].get_attribute('textContent'),
                'link': link[i].get_attribute('href')}
        list.append(data)
except:
    print('Error')
Any ideas?
What you can do is place all the objects whose attributes you want to access into a dictionary, like this:
objects = {"IDs": IDs, "holder": holder, "view": view, "material": material, ...}
Then you can iterate through this dictionary and if the specific attribute does not exist, simply append an empty string to the value corresponding to the dict key. Something like this:
the_keys = list(objects.keys())
for i in range(len(objects["IDs"])):  # I assume the ID field will never be empty,
    # so making a for loop like this is better since you iterate only through
    # existing objects
    data = {}
    for j in range(len(objects)):
        try:
            data[the_keys[j]] = objects[the_keys[j]][i].get_attribute('textContent')
        except Exception as e:
            print("Exception: {}".format(e))
            data[the_keys[j]] = ""  # this means we had an exception
            # it is better to catch the specific exception that is thrown
            # when the attribute of the element does not exist, but I don't know what it is
    list.append(data)
I don't know if this code works since I didn't try it but it should give you an overall idea on how to solve your problem.
If you have any questions, doubts, or concerns please ask away.
Edit: To get another object's attribute like the href you can simply include an if statement checking the value of the key. I also realized you can just loop through the objects dictionary getting the keys and values instead of accessing each key and value by an index. You could change the inner loop to be like this:
for key, value in objects.items():
    try:
        if key == "link":
            data[key] = objects[key][i].get_attribute("href")
        else:
            data[key] = objects[key][i].get_attribute("textContent")
    except Exception as e:
        print("Error: ", e)
        data[key] = ""
Edit 2:
data = {}
for i in list(objects.keys()):
    data[i] = []
for key, value in objects.items():
    for i in range(len(objects["IDs"])):
        try:
            if key == "link":
                data[key].append(objects[key][i].get_attribute("href"))
            else:
                data[key].append(objects[key][i].get_attribute("textContent"))
        except Exception as e:
            print("Error: ", e)
            data[key].append("")
Try with this. You won't have to append the data dictionary to the list. Without the original data I won't be able to help much more. I believe this should work.
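As a small follow-up, assuming the end goal is the Pandas table mentioned in the question title: the dict of lists built in Edit 2 can be handed straight to a DataFrame, so there is no need to append row dictionaries to a list at all. The output filename below is just a placeholder.

import pandas as pd

# `data` is the dict of lists built above; each key becomes a column.
df = pd.DataFrame(data)
df.to_csv("scraped_table.csv", index=False)  # placeholder filename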

KeyError: "Key 'fields' not found. If specifying a record_path, all elements of data should have the path."

I am getting deeply nested JSON for different items through an API and am then trying to convert some of the received information into a DataFrame.
I have been using this line to get the DataFrame I want:
df = pd.json_normalize(result, record_path=['fields'], errors='ignore')
This works sometimes, but other times I either get a KeyError for the record-path:
KeyError: "Key 'fields' not found. If specifying a record_path, all elements of data should have the path."
I assume this is because the JSON I receive is not always exactly the same, but varies according to the type of item that information is requested about.
My question is whether there is a way to skip data that doesn't have any of these keys, or whether there are other options to ignore the data that doesn't have those keys in it?
Thanks for the well written question. To do this, you want to learn about "Exception Handling".
It's worth learning a bit more about it, but here is the TL;DR:
import json
import pandas as pd

try:
    df = pd.json_normalize(result, record_path=['fields'], errors='ignore')
except KeyError as e:
    print(f"Unable to normalize json: {json.dumps(result, indent=4)}")

HTTPError when appending DataFrame

I am reading Python code from another programmer, particularly the following code block:
try:
    df.append(df_extension)
except HTTPError as e:
    if ("No data could be loaded!" in str(e)):
        print("No data could be loaded. Error was caught.")
    else:
        raise
In this, df and df_extension are pandas.DataFrames.
I wonder how an HTTPError could occur with pandas.DataFrame.append. At least from the documentation, I cannot see how append would raise an HTTPError.
Any ideas will be welcome.
According to comments on the question by @JCaesar and @Neither, you don't have to worry about an HTTPError arising from the use of df.append. The try/except block does not seem to have any justification. The one-liner
df.append(df_extension)
suffices.
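One side note, unrelated to the HTTPError: DataFrame.append returns a new DataFrame rather than modifying df in place, and it is deprecated in recent pandas versions, so code that actually uses the combined frame would typically assign the result, for example with pd.concat:

import pandas as pd

# append() does not modify df in place; concat is the current idiom.
df = pd.concat([df, df_extension], ignore_index=True)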

Pandas/Python how to skip errors and goto the next line of code?

Please mind you, I'm new to Pandas/Python and I don't know what I'm doing.
I'm working with CSV files and I basically filter currencies.
Every other day, the exported CSV file may contain or not contain certain currencies.
I have several such cells of code:
AUDdf = df.loc[df['Currency'] == 'AUD']
AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
AUDdesc = AUDdf['Amount'].describe()
When the CSV doesn't contain AUD, I get ValueError: cannot set a frame with no defined columns.
What I'd like is a function, an if statement, or a loop that checks whether the column contains AUD; if it does, it runs the above code, and if it doesn't, it skips it and proceeds to the next currency.
Any idea how I can accomplish this?
Thanks in advance.
This can be done in 2 ways:
You can use a try/except statement; this will try to process the given currency, and if a ValueError occurs it will skip it and move on:
try:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf, index=["Username"], values=["Amount"], aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
except ValueError:
    pass
You can create an if statement which checks for the currency's presence first:
currency_set = set(list(df['Currency'].values))
if 'AUD' in currency_set:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf, index=["Username"], values=["Amount"], aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
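Since the same block is repeated for several currencies, a sketch that combines both ideas is to loop over the currency codes you care about and only process the ones present in today's file. The currency list here is an assumption; df, pd, and np are as in the question.

import numpy as np
import pandas as pd

currencies = ['AUD', 'USD', 'EUR']  # hypothetical list of currencies to report on
present = set(df['Currency'])

tables, descriptions = {}, {}
for ccy in currencies:
    if ccy not in present:
        continue  # currency missing from today's export, skip it
    sub = df.loc[df['Currency'] == ccy]
    table = pd.pivot_table(sub, index=["Username"], values=["Amount"], aggfunc=np.sum)
    table.loc[f'{ccy} Amounts Rejected Grand Total'] = sub['Amount'].sum()
    tables[ccy] = table
    descriptions[ccy] = sub['Amount'].describe()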
1. Worst way to skip over the error/exception:
try:
    <Your Code>
except:
    pass
The above is probably the worst way, because you want to know when an exception occurs. Using generic except statements is bad practice: you want to avoid "catch 'em all" code. You should only catch exceptions that you know how to handle, know which specific exception occurred, and handle each one on an exception-by-exception basis. Generic except statements lead to missed bugs and tend to mislead when testing the code.
2. A slightly less bad, but still generic, way to handle the exception:
try:
    <Your Code>
except Exception as e:
    <Some code to handle an exception>
Still not optimal, as the handling is still generic.
3. Average way to handle it for your case:
try:
    <Your Code>
except ValueError:
    <Some code to handle this exception>
Other suggestions, and much better ways to deal with this:
1. Get the set of currency values present at run time and only aggregate if 'AUD' is in that set.
2. Clean your data set.
You can use try and except where
try:
    #your code here
except:
    #some print statement
    pass

Try and except whilst trying to writerow in Python

I have the following code, which is throwing an out-of-range error in the barcode part of the loop below.
for each in data['articles']:
    f.writerow([each['local']['name'],
                each['information'][0]['barcodes'][0]['barcode']])
I wrote a try/except to catch and handle the case when a barcode is not present in the JSON I am parsing. This worked perfectly during testing with print; however, I have been having trouble getting the try/except to work when using writerow to write to a CSV file.
Does anyone have any suggestions, or another method I could try to get this to work?
My try/except, which worked when testing with print, was as follows:
for each in data['articles']:
    print(each['local']['name'])
    try:
        print(each['information'][0]['barcodes'][0]['barcode'])
    except:
        "none"
Any help is much appreciated!
As komatiraju032 points out, one way of doing this is via get(), although if there are different elements of the dictionary that might have empty/incorrect values, it might get unwieldy to provide a default for each one. To do this via a try/except you might do:
for each in data['articles']:
    row = [each['local']['name']]
    try:
        row.append(each['information'][0]['barcodes'][0]['barcode'])
    except (IndexError, KeyError):
        row.append("none")
    f.writerow(row)
This will give you that "none" replacement value regardless of which of those lists/dicts is missing the requested index/key, since any of those lookups might raise but they'll all end up at the same except.
Use the dict.get() method. It will return None if the key does not exist:
res = each['information'][0]['barcodes'][0].get('barcode')
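Note that .get() only guards the final 'barcode' lookup; if 'information' or 'barcodes' is missing or an empty list, the [0] indexing still raises. A small sketch that falls back at every step (field names as in the question):

# .get() only protects the last lookup; an empty 'information' or
# 'barcodes' list would still raise IndexError on [0].
info = each.get('information') or [{}]
barcodes = info[0].get('barcodes') or [{}]
barcode = barcodes[0].get('barcode', 'none')  # "none" matches the question's placeholder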
