Here is the HTML code:
<div class="rendering rendering_person rendering_short rendering_person_short">
<h3 class="title">
<a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate" class="link person"><span>Massimo Eraldo Abate</span></a>
</h3>
<ul class="relations email">
<li class="email"><span>massimo.abate@ior.it</span></li>
</ul>
<p class="type"><span class="family">Person: </span>Academic</p>
</div>
From the above code, how can I extract "Massimo Eraldo Abate"?
Please help me.
You can extract the name using
response.xpath('//h3[@class="title"]/a/span/text()').extract_first()
Also, look at this Scrapinghub blog post for an introduction to XPath.
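If you want to try it end to end, here is a minimal, self-contained sketch using Scrapy's Selector directly on the snippet you posted (pasted as a string here rather than fetched from the site):

from scrapy.selector import Selector

html = '''
<div class="rendering rendering_person rendering_short rendering_person_short">
<h3 class="title">
<a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate" class="link person"><span>Massimo Eraldo Abate</span></a>
</h3>
</div>
'''

# The XPath targets the span inside the title link.
name = Selector(text=html).xpath('//h3[@class="title"]/a/span/text()').extract_first()
print(name)  # Massimo Eraldo Abate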
Please take a look at this page; there are lots of ways of extracting text:
scrapy docs
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']
I am trying to use Selenium to loop through a list of properties on a web page and return the property address and auction time. I have the following Python code so far; the HTML for the web page is below.
I'm able to return the links to every property in the list, but can't seem to return the values I need from the "H4" tags. I think I'm doing something wrong with getting the elements by XPath, but I can't seem to figure it out.
Any help would be greatly appreciated!
HTML:
<div data-elem-id="asset_list_content">
<a href="/details/123-memory-lane">
<div data-elm-id="asset_2352111_address" class="styles__address-container--2l39p styles__u-mr-1--3qZyj">
<h4 data-elm-id="asset_2352111_address_content_1" class="styles__asset-font-big--vQU7K">123 memory-lane</h4>
<label data-elm-id="asset_2352111_address_content_2" class="styles__asset-font-small--2JgrX">POWDER SPRINGS, GA 30127, Cobb County</label>
</div>
<div class="styles__auction-container--45DZU styles__u-ml-1--34mF_">
<h4 data-elm-id="asset_2352111_auction_date" class="styles__asset-font-big--vQU7K">Apr 04, 10:00am</h4>
</div>
</a>
<a href="/details/456-memory-lane">
<div data-elm-id="asset_8463157_address" class="styles__address-container--2l39p styles__u-mr-1--3qZyj">
<h4 data-elm-id="asset_8463157_address_content_1" class="styles__asset-font-big--vQU7K">456 memory-lane</h4>
<label data-elm-id="asset_8463157_address_content_2" class="styles__asset-font-small--2JgrX">POWDER SPRINGS, GA 30127, Cobb County</label>
</div>
<div class="styles__auction-container--45DZU styles__u-ml-1--34mF_">
<h4 data-elm-id="asset_8463157_auction_date" class="styles__asset-font-big--vQU7K">March 10, 10:00am</h4>
</div>
</a>
</div>
Python (Selenium):
propertyList = browser.find_elements_by_xpath('//div[@data-elm-id="asset_list_content"]')
for element in propertyList:
    propertyLinks = element.find_elements_by_tag_name('a')
    for propertyLink in propertyLinks:
        propertyAddress = propertyLink.get_element_by_xpath('//h4[1]')
        propertyAuctionTime = propertyLink.get_element_by_xpath('//h4[2]')
        print(propertyAddress).text
        print(propertyAuctionTime).text
Output:
propertyAddress = propertyLink.get_element_by_xpath('//h4[1]')
AttributeError: 'WebElement' object has no attribute 'get_element_by_xpath'
The error is that you are using get_element_by_xpath(), which isn't a valid method. You used find_elements_by_xpath() earlier in your code; to fetch a single element, use the singular method find_element_by_xpath() instead.
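For completeness, here is a sketch of the corrected loop. Beyond the method name, two details are worth noting: an XPath evaluated against an element needs a leading "." (with parentheses for positional indexing) to stay relative to that element, and .text belongs inside the print call. This assumes the Selenium 3 API and the markup posted above, where the outer container spells its attribute data-elem-id:

propertyList = browser.find_elements_by_xpath('//div[@data-elem-id="asset_list_content"]')
for element in propertyList:
    propertyLinks = element.find_elements_by_tag_name('a')
    for propertyLink in propertyLinks:
        # "(.//h4)[n]" selects the n-th h4 under the current link element
        propertyAddress = propertyLink.find_element_by_xpath('(.//h4)[1]')
        propertyAuctionTime = propertyLink.find_element_by_xpath('(.//h4)[2]')
        print(propertyAddress.text)
        print(propertyAuctionTime.text)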
I am trying to scrape URLs from an HTML website. I use Beautiful Soup. Here's a part of the HTML.
<li style="display: block;">
<article itemscope itemtype="http://schema.org/Article">
<div class="col-md-3 col-sm-3 col-xs-12" >
<a href="/stroke?p=3083" class="article-image">
<img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
</a>
</div>
<div class="col-md-9 col-sm-9 col-xs-12">
<div class="article-content">
<a href="/stroke">
<img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
</a>
<a href="/stroke?p=3083" class="article-title">
<div>
<h4 itemprop="name" id="playground">
Banana Good for health </h4>
</div>
</a>
<div>
<div class="clear"></div>
<span itemprop="dateCreated" style="font-size:10pt;color:#777;">
<i class="fa fa-clock-o" aria-hidden="true"></i>
09/10 </span>
</div>
<p itemprop="description" class="hidden-phone">
<a href="/stroke?p=3083">
I love Banana.
</a>
</p>
</div>
</div>
</article>
</li>
My code:
from bs4 import BeautifulSoup
import requests

re = requests.get('http://xxxxxx')
bs = BeautifulSoup(re.text.encode('utf-8'), "html.parser")
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
The result will print out all the URLs from this page, but this is not what I am looking for; I only want a particular one, like "/stroke?p=3083" in this example. How can I set that condition in Python? (I know there are three occurrences of "/stroke?p=3083" in total, but I just need one.)
Another question: this URL is not complete, so I need to combine it with "http://www.abcde.com" to get "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)
Just put your link in the scraper in place of some_link and give it a go. You should get your desired link in its full form.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

res = requests.get(some_link).text
soup = BeautifulSoup(res, "lxml")
for item in soup.select(".article-image"):
    print(urljoin(some_link, item['href']))
Another question: this URL is not complete, so I need to combine it with "http://www.abcde.com" to get "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python?
link = 'http://www.abcde.com' + link
You are getting most of it right already. Collect the hrefs as follows (just a list-comprehension version of what you are doing already):
urls = [link['href'] for link in bs.find_all('a') if link.has_attr('href')]
This will give you the URLs. To get one of them and prepend the abcde base, you could simply do the following:
if urls:
    new_url = 'http://www.abcde.com{}'.format(urls[0])
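If you only want that one particular link rather than all of them, you can narrow the search instead of collecting everything; here's a minimal sketch, assuming the article-title class from the posted HTML and the bs object from the code above:

from urllib.parse import urljoin

base = 'http://www.abcde.com'  # base URL, as given in the question
tag = bs.find('a', class_='article-title')  # the article link, e.g. href="/stroke?p=3083"
if tag and tag.has_attr('href'):
    print(urljoin(base, tag['href']))  # http://www.abcde.com/stroke?p=3083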
I have the following HTML structure:
<li class="g">
  <div class="vsc">
    <div class="alpha"></div>
    <div class="beta"></div>
    <h3 class="r">
      <a href="http://www.stackoverflow.com"></a>
    </h3>
  </div>
</li>
The above HTML structure keeps repeating. What is the easiest way to parse all the links (stackoverflow.com) from it using BeautifulSoup and Python?
BeautifulSoup 4 offers a convenient way of accomplishing this, using CSS selectors:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print [a["href"] for a in soup.select('h3.r a')]
This also has the advantage of constraining the selection by context: it selects only those anchor nodes that are descendants of an h3 node with class r.
Omitting the constraint or choosing one most suitable for the need is easy by just tweaking the selector; see the CSS selector docs for that.
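For example, a looser selector that grabs every anchor anywhere inside the repeated li.g blocks would be (a sketch against the same soup object):

print [a["href"] for a in soup.select('li.g a')]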
Using CSS selectors as proposed by Petri is probably the best way to do it with BS. However, I can't hold myself back from recommending lxml.html and XPath, which are pretty much perfect for this job.
Test HTML:
html="""
<html>
<li class="g">
  <div class="vsc"></div>
  <div class="alpha"></div>
  <div class="beta"></div>
  <h3 class="r">
    <a href="http://www.correct.com"></a>
  </h3>
</li>
<li class="g">
  <div class="vsc"></div>
  <div class="alpha"></div>
  <div class="beta"></div>
  <h3 class="r">
    <a href="http://www.correct.com"></a>
  </h3>
</li>
<li class="g">
  <div class="vsc"></div>
  <div class="gamma"></div>
  <div class="beta"></div>
  <h3 class="r">
    <a href="http://www.incorrect.com"></a>
  </h3>
</li>
</html>"""
and it's basically a one-liner:
import lxml.html as lh

doc = lh.fromstring(html)
doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')
Out[264]:
['http://www.correct.com', 'http://www.correct.com']
Hi all. I'm having trouble getting at links in nested HTML with Mechanize in Python. Here's my current code (I've tried everything; this is just the latest copy, which doesn't work correctly), and pardon my variable names (thing, stuff):
soup = BeautifulSoup(resultsPage)
if not soup.find(attrs={'class': 'paging'}):
    print "Only one product listed!"
else:
    stuff = soup.find('div', attrs={'class': 'paging'}).ul.li
    for thing in stuff:
        print thing
Here's the HTML I'm looking at:
<div class="paging">
<ul>
<li><
</li>
<li class='on'>
1-10
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl01_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=2">11-20</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl02_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=3">21-30</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl03_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=4">31-40</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl04_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=5">41-50</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl05_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=6">51-60</a>
</li>
<li>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_lnkNext" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=7">>></a>
</li>
</ul>
</div>
I need to determine whether or not there are <li> tags with hyperlinks in them; if there are, I need to store them for clicking on later. This is the page the code came from, in case you're curious: http://www.kraftrecipes.com/Products/ProductInfoSearchResults.aspx?CatalogType=1&BrandId=22&SearchText=Jell-O&PageNo=1. I'm working on something to scrape food websites for product info, and I need to be able to navigate around the search results.
I have another quick side question. Is it bad to chain together tags and searches like this?
ingredients = soup.find(attrs={'class' : "TitleAndDescription"}).div.find(text=re.compile("Ingredients")).next
I'm just learning Python, but this seems kind of kludge-y, and I'd like to know what you guys think. Here's a sample of the HTML I'm scraping:
<table>
<tr>
<td>
<div id="contHeader" class="TitleAndDescription">
<h1>JELL-O - GELATIN DESSERT - RASPBERRY</h1>
<div class="textArea">
<strong>Ingredients:</strong> SUGAR, GELATIN, ADIPIC ACID (FOR TARTNESS), CONTAINS LESS THAN 2% OF ARTIFICIAL FLAVOR, DISODIUM PHOSPHATE AND SODIUM CITRATE (CONTROL ACIDITY), FUMARIC ACID (FOR TARTNESS), RED 40.<br/>
<strong>Size:</strong> 6 OZ<br/><strong>Upc:</strong> 4300020052<br/>
<br/>
<!--<br/>-->
<br/>
</div>
</div>
...
</td>
...
</tr>
...
</table>
Sorry for the wall of text. Let me know if you need any more information.
Thanks.
"HTMLParser" module of python could be one of the solution to the problem. Find more details at http://docs.python.org/library/htmlparser.html
If I understood correctly, what you want to get is the list of all li tags that contain any a tag (no matter how deep in the DOM tree). If that's correct, then you can do something like this:
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(resultsPage)
list_items = [list_item for list_item in soup.findAll('li')
              if list_item.findAll('a')]
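To then store the link targets for clicking later, you can pull the href out of each match; a minimal sketch along the same lines (BeautifulSoup 3 API, using list_items from above):

page_links = [a['href'] for li in list_items
              for a in li.findAll('a', href=True)]
print page_links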
Consider the following:
<div id=hotlinklist>
  <a>Foo1</a>
  <div id=hotlink>
    <a>Home</a>
  </div>
  <div id=hotlink>
    <a>Extract</a>
  </div>
  <div id=hotlink>
    <a>Sitemap</a>
  </div>
</div>
How would you go about taking out the Sitemap line with regex in Python?
<a>Sitemap</a>
The following can be used to pull out the anchor tags.
'/<a(.*?)a>/i'
However, there are multiple anchor tags, and there are multiple hotlink divs, so we can't really use those either.
Don't use a regex. Use BeautifulSoup, an HTML parser.
from BeautifulSoup import BeautifulSoup
html = \
"""
<div id=hotlinklist>
  <a>Foo1</a>
  <div id=hotlink>
    <a>Home</a>
  </div>
  <div id=hotlink>
    <a>Extract</a>
  </div>
  <div id=hotlink>
    <a>Sitemap</a>
  </div>
</div>"""
soup = BeautifulSoup(html)
soup.findAll("div", id="hotlink")[2].a
# <a>Sitemap</a>
Parsing HTML with regular expressions is a bad idea!
Think about the following piece of HTML:
<a></a > <!-- legal html, but won't pass your regex -->
<a>Sitemap</a ><!-- proof that a>b iff ab>1 -->
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
You should consider using Beautiful Soup, the Python HTML parser.
Anyhow, an ad-hoc solution using regex is:
import re
data = """
<div id=hotlinklist>
  <a>Foo1</a>
  <div id=hotlink>
    <a>Home</a>
  </div>
  <div id=hotlink>
    <a>Extract</a>
  </div>
  <div id=hotlink>
    <a>Sitemap</a>
  </div>
</div>
"""
e = re.compile('<a *[^>]*>.*</a *>')
print e.findall(data)
Output:
>>> e.findall(data)
['<a>Foo1</a>', '<a>Home</a>', '<a>Extract</a>', '<a>Sitemap</a>']
In order to extract the contents of the tag
<a>Sitemap</a>
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
  <a>Foo1</a>
  <div id=hotlink>
    <a>Home</a>
  </div>
  <div id=hotlink>
    <a>Extract</a>
  </div>
  <div id=hotlink>
    <a>Sitemap</a>
  </div>
</div>'''
>>> m = re.compile(r'<a[^>]*>([^<]*)</a>\s*</div>\s*</div>').search(s)
>>> m.group(1)
'Sitemap'
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at findall.
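A sketch of that route (assuming the question's HTML is in the string data, as in the answer above): with a single capture group, findall returns just the link texts.

import re

links = re.findall(r'<a[^>]*>([^<]*)</a>', data)
print links  # ['Foo1', 'Home', 'Extract', 'Sitemap']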