Web scraping table with missing attributes via Python Selenium and Pandas

I'm scraping a table from a website, but I run into empty cells during the process. The try-except block below corrupts the data at the end. I also don't want to exclude the complete row, as the information is still relevant even when some attribute is missing.
try:
    for i in range(10):
        data = {'ID': IDs[i].get_attribute('textContent'),
                'holder': holder[i].get_attribute('textContent'),
                'view': view[i].get_attribute('textContent'),
                'material': material[i].get_attribute('textContent'),
                'Addons': addOns[i].get_attribute('textContent'),
                'link': link[i].get_attribute('href')}
        list.append(data)
except:
    print('Error')
Any ideas?

What you can do is put all the element lists whose attributes you want to read into a dictionary like this:
objects={"IDs":IDs,"holder":holder,"view":view,"material":material...}
Then you can iterate through this dictionary and, if the specific attribute does not exist, simply store an empty string as the value for that dict key. Something like this:
the_keys=list(objects.keys())
for i in range(len(objects["IDs"])): #I assume the ID field will never be empty
    #so making a for loop like this is better since you iterate only through
    #existing objects
    data={}
    for j in range(len(objects)):
        try:
            data[the_keys[j]]=objects[the_keys[j]][i].get_attribute('textContent')
        except Exception as e:
            print("Exception: {}".format(e))
            data[the_keys[j]]="" #this means we had an exception
            #it is better to catch the specific exception that is thrown
            #when the attribute of the element does not exist but I don't know what it is
    list.append(data)
I don't know if this code works since I didn't try it but it should give you an overall idea on how to solve your problem.
If you have any questions, doubts, or concerns please ask away.
Edit: To get another object's attribute like the href you can simply include an if statement checking the value of the key. I also realized you can just loop through the objects dictionary getting the keys and values instead of accessing each key and value by an index. You could change the inner loop to be like this:
for key,value in objects.items():
    try:
        if key=="link":
            data[key]=objects[key][i].get_attribute("href")
        else:
            data[key]=objects[key][i].get_attribute("textContent")
    except Exception as e:
        print("Error: ",e)
        data[key]=""
Edit 2:
data={}
for i in list(objects.keys()):
    data[i]=[]
for key,value in objects.items():
    for i in range(len(objects["IDs"])):
        try:
            if key=="link":
                data[key].append(objects[key][i].get_attribute("href"))
            else:
                data[key].append(objects[key][i].get_attribute("textContent"))
        except Exception as e:
            print("Error: ",e)
            data[key].append("")
Try with this. You won't have to append the data dictionary to the list. Without the original data I won't be able to help much more. I believe this should work.
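
For illustration, here is a minimal sketch of the same idea with the fallback wrapped in a small helper. It assumes each column is a separate list of elements and that a missing cell shows up either as a list that is too short or as an attribute that returns None. The names safe_attribute and FakeElement and the example.com URLs are made up for this sketch; FakeElement only stands in for a Selenium element so the snippet runs without a browser.

```python
def safe_attribute(elements, index, attribute, default=""):
    """Return elements[index].get_attribute(attribute), or default if unavailable."""
    try:
        value = elements[index].get_attribute(attribute)
    except IndexError:
        # The column has fewer elements than the ID column.
        return default
    # A missing attribute is reported as None; map that to the default too.
    return default if value is None else value


class FakeElement:
    """Hypothetical stand-in for a Selenium element, used only for this demo."""

    def __init__(self, **attrs):
        self._attrs = attrs

    def get_attribute(self, name):
        return self._attrs.get(name)


# Each entry maps a column name to (list of elements, attribute to read).
columns = {
    "ID": ([FakeElement(textContent="1"), FakeElement(textContent="2")], "textContent"),
    "holder": ([FakeElement(textContent="Ann")], "textContent"),  # second cell missing
    "link": ([FakeElement(href="https://example.com/1"),
              FakeElement(href="https://example.com/2")], "href"),
}

rows = []
for i in range(len(columns["ID"][0])):  # assumes the ID column is never empty
    rows.append({name: safe_attribute(elems, i, attr)
                 for name, (elems, attr) in columns.items()})
```

Every row is kept, and the missing cell becomes an empty string. A list of dictionaries like rows can then be passed to pd.DataFrame(rows).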

Related

Python: Receiving 'none' when trying to add the string of an exception to a dictionary

So I'm trying to get this working: I subtract the week's stats (weeklyDict) from the current stats (instantDict) so I have accurate weekly progress for every key of instantDict (the keys are members). It works fine and dandy, but when a new member joins (adding a key to instantDict), everything breaks. So I use try/except and attempt to add the missing member to weeklyDict too, but when I do that using except KeyError as e and str(e), I get a None value. Any idea what to do?
Code:
for member, wins in instantDict.items():
    try:
        instantDict[member] = instantDict[member] - weeklyDict[member]
    except KeyError as e:
        weeklyDict[str(e)] = instantDict.get(str(e)) #error occurs here
        instantDict[member] = instantDict[member] - weeklyDict[member] #so this line breaks too

Based on my testing, str(e) returns a string as such:
"'test'"
The value is a string displaying a string, so .get() is not finding the value. Try something like:
for member, wins in instantDict.items():
    try:
        instantDict[member] = instantDict[member] - weeklyDict[member]
    except KeyError as e:
        weeklyDict[str(e).strip("'")] = instantDict.get(str(e).strip("'"))
        instantDict[member] = instantDict[member] - weeklyDict[member]
That should take the extra string characters off of the keyword, and allow .get() to actually find the value.
Alternatively, if you know that it errored because you know that member is not in the dictionary, why pull the exact same variable from the exception when you could just use member again?
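
A minimal sketch of that alternative, using member directly instead of parsing the exception text. The sample dictionaries and the names instant, weekly, and progress are made up for illustration, and the result is written to a separate dictionary rather than back into the one being iterated.

```python
# str() of a KeyError wraps the key in quotes, which is why .get() missed it.
quoted = str(KeyError("test"))  # "'test'"

instant = {"alice": 10, "bob": 7, "carol": 3}  # carol is a new member
weekly = {"alice": 4, "bob": 7}

progress = {}
for member, wins in instant.items():
    # setdefault returns the stored baseline, or stores and returns `wins`
    # for a member that has no weekly entry yet.
    baseline = weekly.setdefault(member, wins)
    progress[member] = wins - baseline
```

A new member starts with a baseline equal to their current wins, so their weekly progress is 0, which matches what the original code was trying to do.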

Maybe it can't fetch the value, so try this:
weeklyDict[str(e)] = instantDict.get(str(e))

How to jump a list member in python?

I'm trying to send a message to a group of people. I want to know how I can jump to 'test2' (the next list member) if 'test1' (the current list member) raises an error.
profile_id = ['test1','test2']
for ids in profile_id:
    api.send_direct_message(ids,text)

The for loop iterates through profile_id. The variable ids is first the element 'test1', and whatever is inside the loop runs (i.e., a message is sent to the person called 'test1'). Then ids becomes 'test2', and a message is sent to 'test2'. But you are trying to send the message to the whole list of people, not to the person picked (ids). I assume that send_direct_message does not accept lists, so your third line needs to be api.send_direct_message(ids, text).

Why not use try and then except for the relevant exception class?
You can then use pass to do nothing and continue to the next member.

As Menashe already said, you can use try, except and pass,
like this:
profile_id = ['test1','test2']
for ids in profile_id:
    try:
        api.send_direct_message(ids,text)
    except:
        pass

You can use try/except for that, but only if send_direct_message actually raises an exception on failure:
profile_ids = ['test1','test2']
for profile_id in profile_ids:
    try:
        api.send_direct_message(profile_id,text)
    # catching plain Exception is not good style; use the specific exception
    # that the method raises here instead
    except Exception as exc:
        # the for loop already moves on to the next member, so there is no need
        # to send to profile_ids[index+1] by hand; that would message the next
        # member twice and needs a guard against list index out of range
        continue
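
To make the skip-and-continue pattern concrete, here is a small self-contained sketch. SendError and the stub send_direct_message are hypothetical stand-ins for whatever the real API raises and does; the point is only that catching a specific exception inside the loop lets the remaining members still be processed, while the failures are recorded instead of silently dropped.

```python
class SendError(Exception):
    """Hypothetical error type standing in for whatever the real API raises."""


def send_direct_message(profile_id, text):
    """Stub sender for the demo: fails for 'test1', succeeds otherwise."""
    if profile_id == "test1":
        raise SendError("cannot message " + profile_id)
    return profile_id + ": " + text


profile_ids = ["test1", "test2"]
sent, failed = [], []

for profile_id in profile_ids:
    try:
        send_direct_message(profile_id, "hello")
    except SendError as exc:
        # Record the failure and let the loop move on to the next member.
        failed.append((profile_id, str(exc)))
        continue
    sent.append(profile_id)
```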

Pandas/Python how to skip errors and goto the next line of code?

Please bear in mind that I'm new to Pandas/Python and I don't know what I'm doing.
I'm working with CSV files and I basically filter currencies.
Every other day, the exported CSV file may or may not contain certain currencies.
I have several cells of code like this:
AUDdf = df.loc[df['Currency'] == 'AUD']
AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
AUDdesc = AUDdf['Amount'].describe()
When the CSV doesn't contain AUD, I get ValueError: cannot set a frame with no defined columns.
What I'd like to produce is a function or an if statement or a loop that checks if the column contains AUD, and if it does, it runs the above code, and if it doesn't, it simply skips it and proceeds to the next line of code for the next currency.
Any idea how I can accomplish this?
Thanks in advance.

This can be done in 2 ways:
You can use a try/except statement. This attempts the processing for the given currency and, if a ValueError occurs, skips it and moves on:
try:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
except ValueError:
    pass
You can use an if statement which checks for the currency's presence first:
currency_set = set(df['Currency'].values)
if 'AUD' in currency_set:
    AUDdf = df.loc[df['Currency'] == 'AUD']
    AUDtable = pd.pivot_table(AUDdf,index=["Username"],values=["Amount"],aggfunc=np.sum)
    AUDtable.loc['AUD Amounts Rejected Grand Total'] = (AUDdf['Amount'].sum())
    AUDdesc = AUDdf['Amount'].describe()
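
Since the question mentions several such cells, one per currency, the presence check can be folded into a function and a loop. This is only a sketch under assumptions: the sample DataFrame, the function name summarize_currency, and the list of currency codes are invented for illustration, and the column names are taken from the question.

```python
import pandas as pd

# Made-up sample data with the columns used in the question; no AUD rows.
df = pd.DataFrame({
    "Username": ["ann", "bob", "ann"],
    "Currency": ["USD", "USD", "EUR"],
    "Amount": [10.0, 5.0, 7.5],
})


def summarize_currency(frame, code):
    """Return (pivot table, describe() output) for one currency, or None if absent."""
    subset = frame.loc[frame["Currency"] == code]
    if subset.empty:
        return None  # currency not in this export, so skip it
    table = pd.pivot_table(subset, index=["Username"], values=["Amount"], aggfunc="sum")
    table.loc[code + " Amounts Rejected Grand Total"] = subset["Amount"].sum()
    return table, subset["Amount"].describe()


results = {}
for code in ["AUD", "USD", "EUR"]:
    summary = summarize_currency(df, code)
    if summary is not None:
        results[code] = summary
```

Currencies missing from the export simply never appear in results, so nothing downstream has to handle the ValueError.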

1. Worst way to skip over the error/exception:
try:
    <Your Code>
except:
    pass
The above is probably the worst way, because you want to know when an exception occurs. Using generic except statements is bad practice because you want to avoid "catch 'em all" code. You want to catch only exceptions that you know how to handle, know which specific exception occurred, and handle each one on an exception-by-exception basis. Generic except statements lead to missed bugs and give misleading results when you run the code to test it.
2. Slightly better way to handle the exception:
try:
    <Your Code>
except Exception as e:
    <Some code to handle an exception>
Still not optimal, as it is still generic handling.
3. Average way to handle it for your case:
try:
    <Your Code>
except ValueError:
    <Some code to handle this exception>
Other suggestions - much better ways to deal with this:
1. You can get the set of values present in the Currency column at run time and aggregate only if 'AUD' is in it.
2. Clean your data set.
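
A short sketch of why the specific handler is preferable, using a made-up parse_amount helper: the expected failure (text that is not a number) is handled, while an unexpected one (passing None) still surfaces as a TypeError instead of being swallowed.

```python
def parse_amount(text):
    """Convert text to float; return None only for text that is not a number."""
    try:
        return float(text)
    except ValueError:
        # Expected failure: the text does not represent a number.
        return None


ok = parse_amount("12.5")     # parsed normally
missing = parse_amount("n/a")  # handled: ValueError is caught

# parse_amount(None) is a different kind of bug (wrong type), so the
# TypeError is deliberately NOT caught and still reaches the caller.
```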

You can use try and except:
try:
    pass  # your code here
except:
    # some print statement
    pass

Try and except whilst trying to writerow in Python

The following code throws an out-of-range error on the barcode lookup:
for each in data['articles']:
    f.writerow([each['local']['name'],
                each['information'][0]['barcodes'][0]['barcode']])
I wrote a try and except to catch and handle the case where a barcode is not present in the JSON I am parsing. This worked perfectly during testing with the print function, but I have been having trouble getting the try and except to work while trying to writerow to a CSV file.
Does anyone have any suggestions or another method I could try to get this to work?
My try and except, which worked when testing with print, was as follows:
for each in data['articles']:
    print(each['local']['name'])
    try:
        print(each['information'][0]['barcodes'][0]['barcode'])
    except:
        "none"
Any help is much appreciated!

As komatiraju032 points out, one way of doing this is via get(), although if there are different elements of the dictionary that might have empty/incorrect values, it might get unwieldy to provide a default for each one. To do this via a try/except you might do:
for each in data['articles']:
    row = [each['local']['name']]
    try:
        row.append(each['information'][0]['barcodes'][0]['barcode'])
    except (IndexError, KeyError):
        row.append("none")
    f.writerow(row)
This will give you that "none" replacement value regardless of which of those lists/dicts is missing the requested index/key, since any of those lookups might raise but they'll all end up at the same except.
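
Here is a runnable sketch of that approach with the lookup moved into a helper. The sample data, the helper name first_barcode, and the in-memory io.StringIO target (used instead of a real file) are assumptions made so the snippet is self-contained.

```python
import csv
import io

# Made-up sample shaped like the JSON in the question; the second article
# has an empty 'barcodes' list.
data = {"articles": [
    {"local": {"name": "Widget"},
     "information": [{"barcodes": [{"barcode": "12345"}]}]},
    {"local": {"name": "Gadget"},
     "information": [{"barcodes": []}]},
]}


def first_barcode(article, default="none"):
    """Return the first barcode of an article, or default if any lookup fails."""
    try:
        return article["information"][0]["barcodes"][0]["barcode"]
    except (IndexError, KeyError):
        return default


buffer = io.StringIO()  # stands in for the CSV file
writer = csv.writer(buffer)
for article in data["articles"]:
    writer.writerow([article["local"]["name"], first_barcode(article)])

lines = buffer.getvalue().splitlines()
```

Keeping the try/except inside the helper means writerow is always called with a complete row.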

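Use the dict.get() method. It returns None if the key does not exist:
res = each['information'][0]['barcodes'][0].get('barcode')
Note that this only covers a missing 'barcode' key; an empty 'barcodes' list still raises IndexError.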

Specific exception output

I am trying to find out which portion of my code contains a KeyError in my events list. Events is a list that contains JSON elements. I want to put timestamp, event_sequence_number, and device_id in their respective variables. However each JSON object is different and some do not contain the timestamp, event_sequence_number, or device_id keys. How can I change my bit of code so that I am able to output which specific key(s) is missing?
Example:
When timestamp is missing:
"timestamp key is missing"
When timestamp and device_id are missing:
"timestamp key is missing"
"device_id key is missing"
etc.
Code:
for event in events:
    try:
        timestamp = event["event"]["timestamp"]
        event_sequence_num = event["event"]["properties"]["event_sequence_number"]
        device_id = event["application"]["mobile"]["device_id"]
        event_identifier = str(device_id) + "_" + str(timestamp) + "_" + str(event_sequence_num)
        event_dict[event_identifier] = 1
    except KeyError:
        print("JSON Key does not exist")

You can print the exception, as that will include the key for which the KeyError was raised:
except KeyError as exc:
    print("JSON Key does not exist: " + str(exc))
You can also access the key by looking at exc.args[0]:
except KeyError as exc:
    print("JSON Key does not exist: " + str(exc.args[0]))
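
A single try block stops at the first KeyError, so it can only ever name one missing key. A minimal sketch of one way to report every missing key is to test each path separately. The REQUIRED_PATHS mapping, the missing_keys function, and the sample event are invented for illustration; the key paths themselves come from the question.

```python
# Name to report -> path of keys leading to that value in the event.
REQUIRED_PATHS = {
    "timestamp": ("event", "timestamp"),
    "event_sequence_number": ("event", "properties", "event_sequence_number"),
    "device_id": ("application", "mobile", "device_id"),
}


def missing_keys(event):
    """Return the names of all required values that cannot be reached in event."""
    missing = []
    for name, path in REQUIRED_PATHS.items():
        node = event
        try:
            for key in path:
                node = node[key]
        except (KeyError, IndexError, TypeError):
            # Any broken step on the path means this value is unavailable.
            missing.append(name)
    return missing


# Sample event lacking both timestamp and device_id.
event = {
    "event": {"properties": {"event_sequence_number": 7}},
    "application": {"mobile": {}},
}

messages = [name + " key is missing" for name in missing_keys(event)]
for message in messages:
    print(message)
```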

Simeon Visser's answer is spot-on. Reporting the key causing the KeyError is probably the best that can be done in bare, straightforward Python. If you're only accessing the JSON structure once, that's the way to go.
I offer a longer alternative, however, for situations where you need to access the multi-level event data repeatedly. If you're accessing it often, your program can afford a few more lines of setup and infrastructure. Consider:
def getpath(obj, path, post=str):
    """
    Use path as sequence of keys/indices into obj. Return the value
    there, filtered through the post (postprocessing function).
    If there is no such value, raise KeyError displaying the
    partial path to the point where there is no index/key.
    """
    c = obj
    try:
        for i, p in enumerate(path):
            c = c[p]
        return post(c) if post else c
    except (KeyError, IndexError) as e:
        msg = "JSON keys {0!r} don't exist".format(path[:i+1])
        raise KeyError(msg)
        # raise type(e)(msg) # Alternative if you want more exception variety

EID_COMPONENTS = [('application', 'mobile', 'device_id'),
                  ('event', 'timestamp'),
                  ('event', 'properties', 'event_sequence_number')]

for event in events:
    event_identifier = '_'.join(getpath(event, p) for p in EID_COMPONENTS)
    event_dict[event_identifier] = 1
There is more preparation here, with a separate getpath function and globally defined specification of what paths into the JSON data to get. On the plus side, the assembly of event_identifier is much shorter (if it were wrapped in a function, it'd be about 1/3 the size in either source lines or bytecodes).
If an attempted access fails, it raises a KeyError with a more complete error message, giving the path into the structure up to that point, not just the final key that was missing. In complex JSON with duplicated keys in different sub-structures (multiple timestamps, e.g.), knowing which attempted access failed can save you much debugging effort. You may also notice that the code is prepared to use integer indices and gracefully handle IndexError; in JSON, array values are common.
This is abstraction in action: more framework and more setup, but if you need to do a lot of deep structure accesses, the code size savings and better error reporting would benefit multiple parts of your program, making it potentially a good investment.
