How can I scrape this page? - python

I'm scraping a page, but I'm getting errors trying to scrape WANTED-DATA:
<td class="class-1" data-reactid="41"><a class="class-2" data-reactid="42" data-symbol="MORE-DATA" href="/quote/HKxlkPH4-x" title="WANTED-DATA">text</a></td>
The closest I can get is extracting the text, by doing:
getText.find('a', attrs={'class':'class-2'}).text
# output: 'text'
How can I scrape 'WANTED-DATA'?

Try this one (note that find_all returns a list, so you iterate over it rather than calling .text on it):
links = soup.find_all('a', attrs={'class': 'class-2'})
for link in links:
    title = link.get('title')

From the docs: you can write tag[attr_name] to get a single attribute, and tag.attrs to get a dictionary of all attributes with their values.
soup.find('a', attrs={'class':'class-2'})['title']
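For example, a minimal, self-contained sketch showing both access styles on the markup from the question:
from bs4 import BeautifulSoup

html = '<a class="class-2" data-symbol="MORE-DATA" href="/quote/HKxlkPH4-x" title="WANTED-DATA">text</a>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('a', attrs={'class': 'class-2'})
print(tag['title'])  # 'WANTED-DATA' - a single attribute
print(tag.attrs)     # dictionary of all attributes and their values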

You could also do it like this:
html = """<td class="class-1" data-reactid="41"><a class="class-2" data-reactid="42" data-symbol="MORE-DATA" href="/quote/HKxlkPH4-x" title="WANTED-DATA">text</a></td>"""
soup = BeautifulSoup(html, 'html.parser')
## adding title=True below prevents errors in case you have links without the 'title' attribute
titles = [x.get('title') for x in soup.find_all('a', title=True)]
print(titles)
Output:
['WANTED-DATA']

Related

How to scrape links without href attribute?

I want to extract links, but there is no href attribute given. How do I scrape the links from the page?
from bs4 import BeautifulSoup
import requests
for count in range(1, 421):
    r = requests.get('http://iapsm.org/MemberPage/members.php?page=' + str(count) + '&Search=',
                     headers={'User-Agent': 'Googleboat'})
    soup = BeautifulSoup(r.text, 'lxml')
    links = soup.find_all('div', class_='Table')
    for link in soup.find_all('tr'):
        c = link.get('a')
        print(c)
I'm not getting any output, and no error either.
To scrape all the details, first search for all the divs whose class is modal-content.
You can try my code below to get all the information of users.
modals = soup.find_all('div', {'class': 'modal-content'})
user_data = []
for modal in modals:
    uls = modal.find_all('ul', {'class': 'Modal-List'})
    info = {}
    for ul in uls:
        # assumes each <ul> holds a label/value pair in its first two <li> elements
        items = ul.find_all('li')
        info[items[0].get_text(strip=True)] = items[1].get_text(strip=True)
    user_data.append(info)
print(user_data)
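Since we can't see the real page here, a minimal, self-contained sketch with made-up HTML (the names are invented) that mirrors the structure the code above assumes:
from bs4 import BeautifulSoup

# made-up HTML mirroring the assumed structure
html = """
<div class="modal-content">
  <ul class="Modal-List"><li>Name</li><li>Dr. A. Sharma</li></ul>
  <ul class="Modal-List"><li>City</li><li>Delhi</li></ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
user_data = []
for modal in soup.find_all('div', {'class': 'modal-content'}):
    info = {}
    for ul in modal.find_all('ul', {'class': 'Modal-List'}):
        items = ul.find_all('li')
        info[items[0].get_text(strip=True)] = items[1].get_text(strip=True)
    user_data.append(info)
print(user_data)  # [{'Name': 'Dr. A. Sharma', 'City': 'Delhi'}]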

Short & Easy - soup.find_all Not Returning Multiple Tag Elements

I need to scrape all 'a' tags with the "result-title" class, and all 'span' tags with either class 'results-price' or 'results-hood', then write the output to a .csv file across multiple columns. The current code does not print anything to the csv file. This may be bad syntax, but I really can't see what I am missing. Thanks.
f = csv.writer(open(r"C:\Users\Sean\Desktop\Portfolio\Python - Web Scraper\RE Competitor Analysis.csv", "wb"))

def scrape_links(start_url):
    for i in range(0, 2500, 120):
        source = urllib.request.urlopen(start_url.format(i)).read()
        soup = BeautifulSoup(source, 'lxml')
        for a in soup.find_all("a", "span", {"class": ["result-title hdrlnk", "result-price", "result-hood"]}):
            f.writerow([a['href']], span['results-title hdrlnk'].getText(), span['results-price'].getText(), span['results-hood'].getText())
        if i < 2500:
            sleep(randint(30, 120))
        print(i)

scrape_links('my_url')
If you want to find multiple tags with one call to find_all, you should pass them in a list. For example:
soup.find_all(["a", "span"])
Without access to the page you are scraping, it's too hard to give you a complete solution, but I recommend extracting one variable at a time and printing it to help you debug. For example:
a = soup.find('a', class_='result-title')
a_link = a['href']
a_text = a.text
spans = soup.find_all('span', class_=['results-price', 'result-hood'])
row = [a_link, a_text] + [s.text for s in spans]
print(row) # verify we are getting the results we expect
f.writerow(row)
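For instance, a runnable sketch against made-up HTML (the class names follow the question; the listing content is invented):
from bs4 import BeautifulSoup

html = '''<p class="result-info">
  <a class="result-title hdrlnk" href="/listing/123">Nice apartment</a>
  <span class="result-price">$1,500</span>
  <span class="result-hood">(Downtown)</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')

a = soup.find('a', class_='result-title')  # matches even with the extra hdrlnk class
spans = soup.find_all('span', class_=['result-price', 'result-hood'])
row = [a['href'], a.text] + [s.text for s in spans]
print(row)  # ['/listing/123', 'Nice apartment', '$1,500', '(Downtown)']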

return value inside html tag with beautifulsoup

I'm trying to get data from some social networks and put it into MongoDB.
This is the information inside the html tag
<span class="ProfileNav-value" data-count="347235" data-is-compact="true">347K</span>
I was able to recover the 347K as follows:
page = requests.get("https://twitter.com/cancaonova")
soup = BeautifulSoup(page.content, 'html.parser')
followers = soup.find_all(class_="ProfileNav-value")
seguidores = followers[2]
print(seguidores.get_text())
However, I want the value of the data-count attribute. I tried this, but the result was None:
page = requests.get("https://twitter.com/cancaonova")
soup = BeautifulSoup(page.content, 'html.parser')
followers = soup.find('data-count')
print(followers)
Thanks.
Use element.attrs to read an attribute:
seguidores = followers[2]
datacount = seguidores.attrs['data-count']
rel_soup = BeautifulSoup('<span class="ProfileNav-value" data-count="347235" data-is-compact="true">347K</span>','html.parser')
rel_soup.span['data-count']
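If you then need the count as a number, a small sketch (using .get() so a missing attribute yields None instead of a KeyError):
count = seguidores.get('data-count')
if count is not None:
    print(int(count))  # 347235 as an integer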

BeautifulSoup - Python - Find the key from HTML

I have been practicing with bs4 and Python, and now I'm stuck.
My plan is to write an if/else statement along the lines of:
if (I find a value inside this html):
    do this method
else:
    do something else
I scraped some HTML I found randomly, which looks like this:
<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>
What I have done so far:
s = requests.Session()
Url = 'www.myhtml.com' #Just took a random page which I don't feel to insert
r = s.get(Url)
soup = soup(r, "lxml")
findKey = soup.find(('div', {'class': 'Talkinghand'})['data-key'])
print(findKey)
but no luck. It gives me this error:
TypeError: object of type 'Response' has no len()
Once I find or print the key, I want to write an if/else statement along the lines of:
if (there is a value inside that data-key):
...
To display the data-key attribute from inside the <div> tag, you can do the following:
from bs4 import BeautifulSoup
html = '<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.div['data-key'])
This would print:
123456
You would need to pass r.content to your soup call.
Your script had an extra ( and ), so the following would also work:
findKey = soup.find('div', {'class': 'Talkinghand'})['data-key']
print(findKey)
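For the if/else check you describe, a minimal sketch (the branch bodies are placeholders):
div = soup.find('div', {'class': 'Talkinghand'})
if div is not None and div.get('data-key'):
    # the div exists and data-key is present and non-empty
    print('key found:', div['data-key'])
else:
    # no matching div, or the attribute is missing/empty
    print('no key')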

Parsing a div with a "class" attribute

Using the BeautifulSoup module in Python, I'm trying to parse the HTML below.
<div class="span-body"><div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div></div>
I'm trying to get the script below to return 2016-05-08T1231Z, which is found in the second div with the timestamp updated class.
with open("index.html", 'rb') as source_file:
    soup = BeautifulSoup(source_file.read())  # Read the source file and get BeautifulSoup to work with it.
    div_1 = soup.find("div", {"class": "span-body"}).contents[0]  # Parse the first div.
    div_2 = div_1("div", {"class": "timestamp updated"})  # Parse the second div.
    print(div_2)
div_1 returns what I expect (the second div), but div_2 doesn't work; it only gives me an empty list in return.
How can I fix this problem?
A couple of options, in all of which you should drop contents[0]:
div_1 = soup.find("div", {"class": "span-body"}) # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"})
This will return a list with one element in it:
[<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>]
Just use find():
div_1 = soup.find("div", {"class": "span-body"})
div_2 = div_1.find("div", {'class': 'timestamp updated'})
print(div_2)
Result:
<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>
If you don't need the intermediate div_1 why not just go straight to div_2?
div_2 = soup.find("div", {'class': 'timestamp updated'})
Edit from comment: To get the value of the title attribute you can index it like this:
div_2['title']
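If you also want that string as a datetime object, a small sketch (the format string matches the 2016-05-08T1231Z value shown above):
from datetime import datetime

ts = datetime.strptime(div_2['title'], '%Y-%m-%dT%H%MZ')
print(ts)  # 2016-05-08 12:31:00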
To find what you want from div_1, you need to use the find function again; you can also get rid of contents[0], since find doesn't return a list.
soup = BeautifulSoup(source_file.read())  # Read the source file and get BeautifulSoup to work with it.
div_1 = soup.find("div", {"class": "span-body"})  # Parse the first div.
div_2 = div_1.find("div", {"class": "timestamp updated"})  # Parse the second div.
print(div_2)
