I'm trying to do a pretty basic web scrape but I'd like to be able to use variables so I don't have to keep repeating code for multiple pages.
example line of code:
elems = results.find_all("span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P")
I would like it to read as:
elems = results.find_all(<variable>)
and then have my variable be:
'"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
However, when I do this, I get no results. I've included the rest of the function below. Does anyone know why this will not work?
EDIT:
I've also tried splitting it up like below example but still get the same issue:
elems = results.find_all(variable1, class_=variable2)
variable1 = '"span"'
variable2 = '"st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
code:
def get_name(sym, result, elem, name):
    url = URL + sym
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id=result)
    elems = results.find_all(elem)
    for elem in elems:
        name_elem = elem.find(name)
        print(name_elem.text)

get_name('top', "app", '"span",class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"', '"span","st_3lrv4Jo"')
The find_all method takes more than one parameter.
You are passing everything as one string in the first argument, so BeautifulSoup looks for a tag whose name is that entire string and finds nothing.
You will need to split the variable into multiple variables, so your variable '"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"' will need to be split into two variables, without the embedded quotes:
elem = "span" and class_value = "st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"
(class is a reserved word in Python, so it can't be used as a variable name) and in your code it will look like
elems = results.find_all(elem, class_=class_value)
Just a few more things:
According to the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) and what I can find online, the second positional argument is an attrs dict, which can take a list of strings for multiple class values (it matches tags that have any of those classes), so your call will look more like
find_all(elem, {'class': ['st_2XVIMK7', 'st_8u0ePN3', 'st_2oUi2Vb', 'st_3kXJm4P']})
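Putting that together, a minimal sketch of the corrected function (the base URL is a placeholder, since your URL constant isn't shown, and the parameter names are mine):

import requests
from bs4 import BeautifulSoup

URL = 'https://example.com/'  # placeholder; substitute your own base URL

def get_name(sym, result, tag, tag_classes, name_tag, name_class):
    url = URL + sym
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id=result)
    # Tag name and class list are passed as separate arguments,
    # not as one pre-quoted string.
    elems = results.find_all(tag, {'class': tag_classes})
    for elem in elems:
        name_elem = elem.find(name_tag, class_=name_class)
        print(name_elem.text)

get_name('top', 'app', 'span',
         ['st_2XVIMK7', 'st_8u0ePN3', 'st_2oUi2Vb', 'st_3kXJm4P'],
         'span', 'st_3lrv4Jo')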
I am trying to store the result of my get_text() calls in variables.
I am filtering my HTML in order to find the information I need. For example, if I want to extract a filing number ('número de radicado'), several can be present; this is how the information I get is displayed:
<span cetxt\"="" class='\"rSpnValor' vidc0='\"74922\"'>74922</span>
<span cetxt\"="" class='\"rSpnValor' vidc0='\"75005\"'>75005</span>
With get_text it would look like this:
74922
75005
I share a bit of my code:
def getValBySpanName(name):
    dataArray = soup.find_all('div', {'class': '\\\"rDivDatosAseg'})
    for data in dataArray:
        data_container = data
        spans_data = data_container.find_all("span")
        info = []
        if spans_data[0].get_text() == name:
            container_values = spans_data[1].get_text()
            return container_values

file_number = getValBySpanName('Número de radicado')
print(file_number)
The problem is that I only get the first value, "74922". I need a way to store each value in a variable (I will then insert this data into SQL), so I need to save them one by one.
I tried to go through them with a for loop, but since the function returns a single string, the loop iterates over its characters, something like '7,4,9,2,2'.
If I understand you correctly, you are probably looking for something like this:
dataArray = soup.select('div.\\"rSpnValor span')
container_values = []
for data in dataArray:
    container_values.append(data.text)
print(container_values)
output
['74922', '75005']
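Since the plan is to insert this data into SQL, you can then write one row per value straight from that list. A minimal sketch with sqlite3 (the database file, table, and column names here are hypothetical):

import sqlite3

conn = sqlite3.connect('scrape.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS radicados (valor TEXT)')  # hypothetical table
# one row per scraped value
conn.executemany('INSERT INTO radicados (valor) VALUES (?)',
                 [(v,) for v in container_values])
conn.commit()
conn.close()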
I am pulling down an xml file using BeautifulSoup with this code
dlink = r'https://www.sec.gov/Archives/edgar/data/1040188/000104018820000126/primary_doc.xml'
dreq = requests.get(dlink).content
dsoup = BeautifulSoup(dreq, 'lxml')
There is a level I'm trying to access and then place the elements into a dictionary. I've got it working with this code:
if dsoup.otherincludedmanagerscount.text != '0':
    inclmgr = []
    for i in dsoup.find_all('othermanagers2info'):
        for m in i.find_all('othermanager2'):
            for o in m.find_all('othermanager'):
                imd = {}
                if o.cik:
                    imd['cik'] = o.cik.text
                if o.form13ffilenumber:
                    imd['file_no'] = o.form13ffilenumber.text
                imd['name'] = o.find('name').text
                inclmgr.append(imd)
    comp_dict['incl_mgr'] = inclmgr
I assume it's easier to use the .children or .descendants generators, but every time I run it I get an error. Is there a way to iterate over only tags using the BeautifulSoup generators?
Something like this?
for i in dsoup.othermanagers2info.children:
    imd['cik'] = i.cik.text
AttributeError: 'NavigableString' object has no attribute 'cik'
Assuming othermanagers2info is a single item, you can produce the same results using one for loop:
for i in dsoup.find('othermanagers2info').find_all('othermanager'):
    imd = {}
    if i.cik:
        imd['cik'] = i.cik.text
    if i.form13ffilenumber:
        imd['file_no'] = i.form13ffilenumber.text
    imd['name'] = i.find('name').text
    inclmgr.append(imd)
comp_dict['incl_mgr'] = inclmgr
You can also do for i in dsoup.find('othermanagers2info').findChildren(). However, this will produce different results (unless you add additional code): it flattens the tree and includes both parent and child items. You can also pass in a node name.
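As for iterating with the generators directly: .children also yields NavigableString nodes (the whitespace between tags), which is what raised the AttributeError above. A minimal sketch that filters the generator down to tags only:

from bs4 import Tag

for child in dsoup.find('othermanagers2info').children:
    if isinstance(child, Tag):  # skip NavigableString whitespace nodes
        print(child.name)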
I am currently working on a script to scrape data from ClinicalTrials.gov. To do this I have written the following script:
def clinicalTrialsGov(id):
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
    enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = tryExceptCT(data, '.results_reference.PMID')
    citedPMID = data.results_reference.PMID
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType
However, the above script won't always work, as not all studies will have all items (i.e. an AttributeError will occur when an element is missing). To resolve this, I could simply wrap each statement in a try-except, like this:
try:
    studyType = data.study_type.text
except:
    studyType = ""
but it seems a bad way to implement this. What's a better/cleaner solution?
This is a good question. Before I address it, let me say that you should consider changing the second parameter to the BeautifulSoup (BS) constructor from lxml to xml. Otherwise, BS does not flag the parsed markup as XML (to verify this for yourself, access the is_xml attribute on the data variable in your code).
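For example, a quick way to confirm the difference (is_xml is the attribute mentioned above):

data = BeautifulSoup(requests.get(url).text, 'xml')
print(data.is_xml)  # True with the 'xml' parser; False with 'lxml'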
You can avoid generating an error when attempting to access a non-existent element by passing a list of desired element names to the find_all() method:
subset = ['results_reference', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'eligibility', 'official_title', 'arm_group', 'condition']
tag_matches = data.find_all(subset)
Then, if you want to get a specific element from the list of Tags without iterating through it, you can convert it to a dict using the Tag names as keys:
tag_dict = {tag.name: tag for tag in tag_matches}
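A field lookup with a default then replaces a per-element try/except; for example (using 'enrollment' from the subset above):

tag = tag_dict.get('enrollment')
enrollment = tag.text if tag is not None else ''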
I am trying to parse a list of urls from a webpage. I did the following things:
Got a list of all "a" tags.
Used a for loop to get("href")
While looping, I kept assigning the get value to a new empty list called links
But I kept getting an index out of range error. I thought it might be because of the way I was incrementing the index of links, but I am sure that is not the case.
This is the error prone code:
import urllib
import bs4

url = "http://tellerprimer.ucdavis.edu/pdf/"
response = urllib.urlopen(url)
webpage = response.read()
soup = bs4.BeautifulSoup(webpage, 'html.parser')
i = 0
links = []
for tags in soup.find_all('a'):
    links[i] = str(tags.get('href'))
    i += 1
print i, links
I gave links a fixed length and it fixed it, like so:
links = [0]*89 #89 is the length of soup.find_all('a')
I want to know what was causing this problem.
You are attempting to assign something to a non-existent index. When you create links, you create it as an empty list.
Then you do links[i], but links is empty, so there is no ith index.
The proper way to do this is:
links.append(str(tags.get('href')))
This also means that you can eliminate your i variable. It's not needed.
for tags in soup.find_all('a'):
    links.append(str(tags.get('href')))
print links
This will print all 89 links in your links list.
The list is initially empty, so you're trying to assign values to non-existing index locations in the list.
Use append() to add items to a list:
links = []
for tags in soup.find_all('a'):
    links.append(str(tags.get('href')))
Or use map() instead:
links = map(lambda tags: str(tags.get('href')), soup.find_all('a'))
Or use a list comprehension:
links = [str(tags.get('href')) for tags in soup.find_all('a')]
In Python, I am using BeautifulSoup to parse text. I want to save a set of 'str' objects into a list. The following code won't run, but the idea should come across:
listings = soup.find_all('h6')
for i in listings:
    projecturls[i] = i.find_all('a', href=True)[0]['href']
So I want to cycle through the elements 'listings' and extract a string. I then want to save this string into projecturls, which I want to be a list. But I get the following error:
NameError: name 'projecturls' is not defined
How do I define this? Or is there a better way to do what I want?
I suppose that dynamically defining N variables would also work, but it is not preferred.
Define projecturls as a list object, then use list.append method to add an item there:
listings = soup.find_all('h6')
projecturls = []  # <-------------
for i in listings:
    url = i.find_all('a', href=True)[0]['href']
    projecturls.append(url)  # <------
You could also use list comprehension:
listings = soup.find_all('h6')
projecturls = [i.find_all('a', href=True)[0]['href'] for i in listings]
Or map function:
listings = soup.find_all('h6')
projecturls = list(map(lambda i: i.find_all('a', href=True)[0]['href'], listings))