I am using this code to parse some content from a URL:
response = requests.get(url)
cnbeta_article_content = BeautifulSoup(response.content, "html.parser").find("div", {"class": "cnbeta-article-body"})
return cnbeta_article_content.contents
But cnbeta_article_content.contents gives me a list. How do I get the plain HTML of the div with class cnbeta-article-body from the URL? cnbeta_article_content.text is not the original HTML.
Does cnbeta_article_content.prettify() render what you expect?
You are getting multiple results for the class, so you will have to work out which one to pick. If possible, use a unique selector for the specific element, or extract it from the current list (cnbeta_article_content.contents).
Go to the website and find the position of the element you expect (since you are getting multiple elements with the same class, work out which position yours occupies). You will then get the text like this:
cnbeta_article_content.contents[4].text
Here index 4 refers to the 5th element (zero-based indexing).
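If what you actually want is the element's raw HTML rather than a list of children, you can serialize the tag itself instead of reading .contents or .text. A minimal sketch, assuming the div is found (str() gives the outer HTML and Tag.decode_contents() gives the inner HTML in BeautifulSoup):
import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url as in the question
body = BeautifulSoup(response.content, "html.parser").find("div", {"class": "cnbeta-article-body"})
outer_html = str(body)               # the div tag plus everything inside it
inner_html = body.decode_contents()  # only the markup inside the div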
New to Python here: I have been using this piece of code to get the class name as text for my CSV, but I can't make it extract only the first one. Do you have any idea how?
for x in book_url_soup.findAll('p', class_="star-rating"):
    for k, v in x.attrs.items():
        review = v[1]
        reviews.append(review)
del reviews[1]
print(review)
The URL is: http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html
The output is:
Two
Two
One
One
Three
Five
Five
I only need the first output and don't know how to prevent the code from also picking up the "star ratings" further down the page that share the same class name.
Instead of find_all(), which creates a ResultSet, you could use find() or select_one() to select only the first occurrence of your element, then pick the last item from its list of class names:
soup.find('p', class_='star-rating').get('class')[-1]
or with css selector
soup.select_one('p.star-rating').get('class')[-1]
In newer code, also avoid the old syntax findAll(); use find_all() instead. For more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
import requests
url = 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')
print(soup.find('p', class_='star-rating').get('class')[-1])
Output
Two
Is there any way to pull links from href values and open them one by one, and perform an action (click) on a specific ID?
First, find the element by ID (note: newer Selenium versions replace find_element_by_id with find_element(By.ID, ...)):
from selenium.webdriver.common.by import By

element = driver.find_element(By.ID, 'YOUR_ID')
Second, get the href attribute from your element:
link = element.get_attribute("href")
Third, you could navigate with Selenium, but I would suggest using something like requests and performing a GET request for that link, for example:
import requests
response = requests.get(link)
assert response.status_code == 200
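Putting the three steps together, here is a minimal sketch of the loop described in the question: collect every href on the page, open each link in turn, and click a specific element. The starting URL and 'YOUR_ID' are placeholders for your own values, and this assumes the Selenium 4 find_element(By...) style:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder starting page

# Collect all href values first, before navigating away from the page
links = [a.get_attribute('href')
         for a in driver.find_elements(By.TAG_NAME, 'a')
         if a.get_attribute('href')]

for link in links:
    driver.get(link)  # open each link one by one
    try:
        driver.find_element(By.ID, 'YOUR_ID').click()
    except Exception:
        pass  # the element may not exist on every page

driver.quit()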
I keep running into an issue when I scrape data with lxml using XPath. I want to scrape the Dow price, but when I print it out in Python it shows Element span at 0x448d6c0. I know that must be a reference to memory, but I just want the price. How can I print the price instead of its memory location?
from lxml import html
import requests
page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
content = html.fromstring(page.content)
# This will create a list of price elements:
prices = content.xpath('//*[@id="site"]/div/div[3]/div/div[3]/div[2]/div/table/tbody/tr[1]/th[1]/div/div/div/span')
print(prices)
You're getting lxml Element objects, which print as memory locations as you saw. To get the value out of each one, read an attribute on it; in this case you want the text, so .text.
Additionally, I would highly recommend changing your XPath since it's a literal location and subject to change.
prices = content.xpath("//div[@id='site']//div[@class='price']//span[@class='push-data ']")
prices_holder = [i.text for i in prices]
prices_holder
['25,389.06',
'25,374.60',
'7,251.60',
'2,813.60',
'22,674.50',
'12,738.80',
'3,500.58',
'1.1669',
'111.7250',
'1.3119',
'1,219.58',
'15.43',
'6,162.55',
'67.55']
Also of note, you will only get the values at load. If you want the prices as they change, you'd likely need to use Selenium.
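If you do want the prices as they change, a minimal Selenium sketch, assuming Chrome is available and the class names above are still current (they may have changed on the live site):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
# Same elements as the XPath above, read after the page has rendered
spans = driver.find_elements(By.CSS_SELECTOR, 'span.push-data')
print([s.text for s in spans])
driver.quit()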
The variable prices is a list containing web elements. You need to access the text attribute to extract the value.
print(prices[0].text)
'25,396.03'
I am currently taking the Python specialization on Coursera. I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract the URL at a position given by user input, open that link, and repeat the process for some number of iterations, with all links identified through the anchor tag.
While I am able to program this using lists, I am wondering if there is any simpler way of doing it without using lists or dictionaries?
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList=list()
loc=''
count=0
for tag in tags:
    loc = tag.get('href', None)
    nameList.append(loc)
url = nameList[pos-1]
In the above code, you will notice that after locating the links using the 'a' tag and 'href', I can't help but create a list called nameList to locate the position of the link. As this is inefficient, I would like to know if I could directly locate the URL without using lists. Thanks in advance!
The easiest way is to get an element out of the tags list and then extract the href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use the soup.select_one() method to query the :nth-of-type element:
soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from the pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to CSS :nth-of-type pseudo-class, as it always selects only one element, while CSS selector may select many elements at once.
And if you're looking for "the most efficient way", then you need to look at lxml:
from lxml.html import fromstring
tree = fromstring(r.content)  # r is your requests response
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]
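For the repeated-hop exercise in the question, a minimal sketch that just re-applies the one-liner in a loop; pos and count are example inputs that would normally come from the user:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
pos, count = 3, 4  # example inputs: follow the 3rd link, repeat 4 times
for _ in range(count):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    url = soup('a')[pos - 1].get('href')
    print(url)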
I'm trying to get python to extract text from one spot of a website. I've identified the HTML div:
<div class="number">76</div>
which is in:
...div/div[1]/div/div[2]
I'm trying to use lxml to extract the '76' from that, but can't get a return out of it other than:
[]
Here's my code:
from lxml import html
import requests
url = 'https://sleepiq.sleepnumber.com/#/##1'
values = {'username': 'my@gmail.com',
          'password': 'mypassword'}
page = requests.get(url, data=values)
tree = html.fromstring(page.content)
hr = tree.xpath('//div[@class="number"]/text()')
print(hr)
Any suggestions? I feel this should be pretty easy, thanks in advance!
Update: the element I want is not contained in the page.content from requests.get
Updated Update: It looks like this is not logging me in to the page where the content I want is. It is only getting the login screen content.
Have you tried printing page.content to make sure requests.get is retrieving the content you want? That is often where things break. And the empty list returned from the XPath search indicates "not found."
Assuming that's okay, your parsing is close. I just tried the following, which is successful:
from lxml import html
tree = html.fromstring('<body><div class="number">76</div></body>')
number = tree.xpath('//div[@class="number"]/text()')[0]
number now equals '76'. Note the [0] indexing, because xpath always returns a list of what's found. You have to dereference to find the content.
A common gotcha here is that the XPath text() function isn't as inclusive or straightforward as it might seem. If there are any sub-elements to the div--e.g. if the text is really <div class="number"><strong>76</strong></div> then text() will return an empty list, because the text belongs to the strong not the div. In real-world HTML--especially HTML that's ever been cut-and-pasted from a word processor, or otherwise edited by humans--such extra elements are entirely common.
While it won't solve all known text management issues, one handy workaround is to use the // multi-level indirection instead of the / single-level indirection to text:
number = ''.join(tree.xpath('//div[@class="number"]//text()'))
Now, regardless of whether there are sub-elements or not, the total text will be concatenated and returned.
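To see the difference, a quick check against the nested markup described above:
from lxml import html

tree = html.fromstring('<body><div class="number"><strong>76</strong></div></body>')
print(tree.xpath('//div[@class="number"]/text()'))            # [] -- the text belongs to <strong>
print(''.join(tree.xpath('//div[@class="number"]//text()')))  # '76'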
Update: OK, if your problem is logging in, you probably want to try requests.post (rather than .get) at minimum. In simpler cases, just that change might work. In others, the login must be done on a separate page from the page you want to retrieve/scrape. In that case, you probably want to use a session object:
with requests.Session() as session:
    # First POST to the login page
    landing_page = session.post(login_url, data=values)
    # Now make the authenticated request within the session
    page = session.get(url)
    # ...use page as above...
This is a bit more complex, but shows the logic for a separate login page. Many sites (e.g. WordPress sites) require this. Post-authentication, they often take you to pages (like the site home page) that aren't interesting content in themselves (though they can be scraped to check whether the login was successful). This altered login workflow doesn't change any of the parsing techniques, which work as above.
Beautiful Soup (http://www.pythonforbeginners.com/beautifulsoup/web-scraping-with-beautifulsoup) will help you out.
Another way: http://docs.python-guide.org/en/latest/scenarios/scrape/
I'd use plain regex over xml tools in this case. It's easier to handle.
import re
import requests
url = 'http://sleepiq.sleepnumber.com/#/user/-9223372029758346943##2'
values = {'email-email': 'my@gmail.com', 'password-clear': 'Combination',
          'password-password': 'mypassword'}
page = requests.get(url, data=values, timeout=5)
# page.content is bytes; use page.text so the str pattern can match
m = re.search(r'(\w*)(<div class="number">)(.*)(<\/div>)', page.text)
# m = re.search(r'(\w*)(<title>)(.*)(<\/title>)', page.text)
if m:
    print(m.group(3))
else:
    print('Not found')