Handle KeyError while scraping - Python

I am currently working on a script to scrape data from ClinicalTrials.gov. To do this I have written the following script:
def clinicalTrialsGov(id):
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
    enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = tryExceptCT(data, '.results_reference.PMID')
    citedPMID = data.results_reference.PMID
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType
However, the script above won't always work, as not all studies contain all items (i.e. a KeyError will occur). To resolve this, I could simply wrap each statement in a try-except, like this:
try:
    studyType = data.study_type.text
except:
    studyType = ""
but that seems like a bad way to implement it. What's a better/cleaner solution?

This is a good question. Before I address it, let me say that you should consider changing the second parameter to the BeautifulSoup (BS) constructor from lxml to xml. Otherwise, BS does not flag the parsed markup as XML (to verify this for yourself, access the is_xml attribute on the data variable in your code).
You can avoid generating an error when attempting to access a non-existent element by passing a list of desired element names to the find_all() method:
subset = ['results_reference', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'eligibility', 'official_title', 'arm_group', 'condition']
tag_matches = data.find_all(subset)
Then, if you want to get a specific element from the list of Tags without iterating through it, you can convert it to a dict using the Tag names as keys:
tag_dict = {tag.name: tag for tag in tag_matches}
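With the dict in hand, missing elements can be looked up safely via dict.get(), which sidesteps the try/except entirely. A minimal, self-contained sketch (the two-tag document and the html.parser backend are stand-ins, not the real ClinicalTrials.gov markup):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a ClinicalTrials.gov record: only two of the tags are present
doc = "<clinical_study><official_title>A Study</official_title><condition>Asthma</condition></clinical_study>"
data = BeautifulSoup(doc, "html.parser")

subset = ['official_title', 'condition', 'allocation']
tag_matches = data.find_all(subset)
tag_dict = {tag.name: tag for tag in tag_matches}

# dict.get() returns None (or a default) instead of raising, so an absent
# element like <allocation> no longer needs its own try/except
official_title = tag_dict.get('official_title')
allocation = tag_dict.get('allocation')
print(official_title.text)  # A Study
print(allocation)           # None
```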

Related

BeautifulSoup4: Extracting tables, now how do I exclude certain tags and bits of information I do not want

Trying to extract coin names, prices, and market caps from coinmarketcap.com. I first tried using soup.find_all to search for certain tags with a specific class, but it always picked up information I didn't need or want. So instead I used find_all to search for 'td', then planned on using a for loop to look for specific class names, append those to a new list, and print that list. But it returns a data type for some reason.
coin_table = soup.find_all('td')
class_value = 'sc-1eb5slv-0 iJjGCS'
for i in coin_table:
    if class_value in coin_table:
        list.append(i)
print(list)
But this returns:
<class 'list'>
to the console even though I'm not asking to see the data type. I'm very new to BeautifulSoup and coding in general, so sorry if this is a very basic question. Still trying to get my head around all of this stuff.
As #RJAdriaansen mentioned, you don't need to scrape a website when it provides an API. Here is how you do it with the requests library:
import requests
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,tags,platform,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
response = requests.get(url)
data = response.json()
This will give you json data. Now you can grab all you need by accessing correct keys:
final_list = []
temp = []
for each_crypto in data['data']['cryptoCurrencyList']:
    temp.append(each_crypto['name'])
    # each_crypto['quotes'] gives you a list of price and market cap quotes for each crypto
    for quote in each_crypto['quotes']:
        # assuming you want the USD price of each crypto
        if quote['name'] == "USD":
            temp.append(quote['price'])
            temp.append(quote['marketCap'])
    final_list.append(temp)
    temp = []
Final result would look like this:
[
    ['Bitcoin', 34497.01819639692, 646704595579.0485],
    ['Ethereum', 2195.11816422801, 255815488972.87268],
    ['Tether', 1.0003936138399, 62398426501.02027],
    ['Binance Coin', 294.2550537711805, 45148405357.003],
    ...
]

Can you iterate over only tags with the .children iterator from BeautifulSoup?

I am pulling down an xml file using BeautifulSoup with this code
dlink = r'https://www.sec.gov/Archives/edgar/data/1040188/000104018820000126/primary_doc.xml'
dreq = requests.get(dlink).content
dsoup = BeautifulSoup(dreq, 'lxml')
There is a level I'm trying to access and then place the elements into a dictionary. I've got it working with this code:
if dsoup.otherincludedmanagerscount.text != '0':
    inclmgr = []
    for i in dsoup.find_all('othermanagers2info'):
        for m in i.find_all('othermanager2'):
            for o in m.find_all('othermanager'):
                imd = {}
                if o.cik:
                    imd['cik'] = o.cik.text
                if o.form13ffilenumber:
                    imd['file_no'] = o.form13ffilenumber.text
                imd['name'] = o.find('name').text
                inclmgr.append(imd)
    comp_dict['incl_mgr'] = inclmgr
I assume it's easier to use the .children or .descendants generators, but every time I run it, I get an error. Is there a way to iterate over only tags using the BeautifulSoup generators?
Something like this?
for i in dsoup.othermanagers2info.children:
    imd['cik'] = i.cik.text

AttributeError: 'NavigableString' object has no attribute 'cik'
Assuming othermanagers2info is a single item, you can produce the same results with one for loop:
for i in dsoup.find('othermanagers2info').find_all('othermanager'):
    imd = {}
    if i.cik:
        imd['cik'] = i.cik.text
    if i.form13ffilenumber:
        imd['file_no'] = i.form13ffilenumber.text
    imd['name'] = i.find('name').text
    inclmgr.append(imd)
comp_dict['incl_mgr'] = inclmgr
You can also do for i in dsoup.find('othermanagers2info').findChildren():. However, this will produce different results (unless you add additional code): it will flatten the tree and include both parent and child items. You can also pass in a node name.
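To answer the original question directly: .children also yields the NavigableString whitespace between elements, which is what triggers the AttributeError above, so you can filter on isinstance(child, Tag). A minimal sketch against a hypothetical fragment mirroring the primary_doc.xml structure (html.parser is used here only so the snippet is self-contained):

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment shaped like the 13F filing discussed above
xml = """<othermanagers2info>
  <othermanager2><othermanager><cik>123</cik><name>Mgr A</name></othermanager></othermanager2>
  <othermanager2><othermanager><cik>456</cik><form13ffilenumber>028-1</form13ffilenumber><name>Mgr B</name></othermanager></othermanager2>
</othermanagers2info>"""
dsoup = BeautifulSoup(xml, "html.parser")

inclmgr = []
for child in dsoup.othermanagers2info.children:
    if not isinstance(child, Tag):
        continue  # skip the NavigableString whitespace between elements
    o = child.othermanager
    imd = {}
    if o.cik:
        imd['cik'] = o.cik.text
    if o.form13ffilenumber:
        imd['file_no'] = o.form13ffilenumber.text
    imd['name'] = o.find('name').text
    inclmgr.append(imd)

print(inclmgr)
```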

Why can't I use a variable in a results.find_all?

I'm trying to do a pretty basic web scrape but I'd like to be able to use variables so I don't have to keep repeating code for multiple pages.
example line of code:
elems = results.find_all("span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P")
I would like it to read as:
elems = results.find_all(<variable>)
and then have my variable be:
'"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
However, when I do this, I get no results. I've included the rest of the function below. Does anyone know why this will not work?
EDIT:
I've also tried splitting it up like below example but still get the same issue:
elems = results.find_all(variable1 , class_=variable2)
variable1 = '"span"'
variable2 = '"st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
code:
def get_name(sym, result, elem, name):
    url = URL + sym
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id=result)
    elems = results.find_all(elem)
    for elem in elems:
        name_elem = elem.find(name)
        print(name_elem.text)

get_name('top', "app", '"span",class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"', '"span","st_3lrv4Jo"')
The find_all method takes more than one parameter.
You are passing the whole query as a single string in the first argument, which will never match anything.
You will need to split the variable into multiple variables, so your variable '"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"' will need to be split into two variables:
elem = "span" and class_value = "st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"
and in your code it will look like
elems = results.find_all(elem, class_=class_value)
(Note that class is a reserved word in Python, which is why BeautifulSoup spells the keyword argument class_.)
Just a few more things:
according to the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all), the class filter also accepts a list of strings, which matches tags carrying any one of the listed values, so your call can also be written as
find_all(elem, {'class': ['st_2XVIMK7', 'st_8u0ePN3', 'st_2oUi2Vb', 'st_3kXJm4P']})
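Put together, the split variables can also be bundled into a single dict and unpacked with **, so each page still needs only one variable. A sketch against hypothetical markup (the class names come from the question; the snippet itself is made up):

```python
from bs4 import BeautifulSoup

# Made-up markup using the class names from the question
html = ('<div id="app">'
        '<span class="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P">'
        '<span class="st_3lrv4Jo">AAPL</span></span></div>')
soup = BeautifulSoup(html, 'html.parser')
results = soup.find(id='app')

# One dict per page: equivalent to find_all('span', class_='st_2XVIMK7 ...')
query = {'name': 'span', 'class_': 'st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P'}
elems = results.find_all(**query)

print(len(elems))                   # 1
print(elems[0].find('span').text)   # AAPL
```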

Python for loop returns only first element

I'm trying to extract Pubmed IDs from Pubmed URLs, using https://pypi.org/project/pymed/.
Problem: Despite proper inputs all throughout my for loop, my "pubmed query" is apparently being performed only on the first iteration? Please help.
# url_list: a pd.Series of URLs, dtype='str'
ID_list = list(range(0, len(url_list)))
for i in url_list:
    spider = Pubmed("MyTool", my_email)
    url = url_list.iloc[i].to_string(index=False)
    print(url)  # CORRECTLY ITERATES
    if re.search(pattern, url):
        lookup = spider.query(url)  # Unique <class itertools.chain object>
        results = list(lookup)  # Unique <'pymed.article.PubMedArticle' object>
        ID = results.pop().pubmed_id
        # OR
        ID = results[0].pubmed_id
        print(ID)  # RETURNS ONLY THE FIRST ID
        ID_list[i] = ID
    else:
        ID_list[i] = None
    print("Extracted " + url + " with ID: " + ID)
I've tried setting the spider and lookup vars to None at the end of the whole loop, as well as using "del VAR" on both of them.
Nothing gives. For whatever reason, the spider.query() method is only pulling from the first url it is fed. Note that I also tried putting the Pubmed() spider outside the for loop, where it probably should go, but this was me trying to be thorough.
Thanks so much for the help, and I apologize for any issues reproducing this, or with anything stylistic that causes you pain and suffering.

Python assign text to new variable

In Python, I am using BeautifulSoup to parse text. I want to save a set of 'str' objects into a list. The following code won't run, but the idea should come across:
listings = soup.find_all('h6')
for i in listings:
    projecturls[i] = i.find_all('a', href=True)[0]['href']
So I want to cycle through the elements 'listings' and extract a string. I then want to save this string into projecturls, which I want to be a list. But I get the following error:
NameError: name 'projecturls' is not defined
How do I define this? Or is there a better way to do what I want?
I suppose that dynamically defining N variables would also work, but it is not preferred.
Define projecturls as a list object, then use list.append method to add an item there:
listings = soup.find_all('h6')
projecturls = []  # <-------------
for i in listings:
    url = i.find_all('a', href=True)[0]['href']
    projecturls.append(url)  # <------
You could also use list comprehension:
listings = soup.find_all('h6')
projecturls = [i.find_all('a', href=True)[0]['href'] for i in listings]
Or map function:
listings = soup.find_all('h6')
projecturls = list(map(lambda i: i.find_all('a', href=True)[0]['href'], listings))
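All three variants build the same list; here is a self-contained check against a small hypothetical snippet (the <h6>/<a> structure is assumed from the question):

```python
from bs4 import BeautifulSoup

# Made-up listings page: each <h6> wraps one project link
html = '<h6><a href="/p/1">One</a></h6><h6><a href="/p/2">Two</a></h6>'
soup = BeautifulSoup(html, 'html.parser')
listings = soup.find_all('h6')

# Explicit loop with append
loop_urls = []
for i in listings:
    loop_urls.append(i.find_all('a', href=True)[0]['href'])

# List comprehension and map equivalents
comp_urls = [i.find_all('a', href=True)[0]['href'] for i in listings]
map_urls = list(map(lambda i: i.find_all('a', href=True)[0]['href'], listings))

print(loop_urls)                            # ['/p/1', '/p/2']
print(loop_urls == comp_urls == map_urls)   # True
```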
