Python BeautifulSoup extracting text from result - python

I am trying to get the text from contents, but when I try Beautiful Soup functions on the result variable I get errors.
from bs4 import BeautifulSoup as bs
import requests
webpage = 'http://www.dictionary.com/browse/coypu'
r = requests.get(webpage)
page_text = r.text
soup = bs(page_text, 'html.parser')
result = soup.find_all('meta', attrs={'name':'description'})
print (result.get['contents'])
I am trying to get the result to read:
"Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more."

soup.find_all() returns a list. Since in your case, it returns only one element in the list, you can do:
>>> type(result)
<class 'bs4.element.ResultSet'>
>>> type(result[0])
<class 'bs4.element.Tag'>
>>> result[0].get('content')
Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more.

When you only want the first or a single tag, use find; find_all returns a list/ResultSet:
result = soup.find('meta', attrs={'name':'description'})["content"]
You can also use a css selector with select_one:
result = soup.select_one('meta[name=description]')["content"]
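Note that find and select_one return None when nothing matches, and subscripting None raises a TypeError. A small guard avoids that (a sketch, reusing the soup from the question):
tag = soup.find('meta', attrs={'name': 'description'})
if tag is not None:
    # .get() returns None instead of raising KeyError for a missing attribute
    print(tag.get('content'))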

You don't need find_all here; using find alone gets the desired output:
from bs4 import BeautifulSoup as bs
import requests
webpage = 'http://www.dictionary.com/browse/coypu'
r = requests.get(webpage)
page_text = r.text
soup = bs(page_text, 'html.parser')
result = soup.find('meta', {'name':'description'})
print(result.get('content'))
it will print:
Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more.

Related

Change scraped output

I have a loop putting URLs into my browser and scraping their content, generating this output:
2PRACE,0.0014
Hispanic,0.1556
API,0.0688
Black,0.0510
AIAN,0.0031
White,0.7200
The code looks like this:
f1 = open('urlz.txt','r',encoding="utf8")
ethnicity_urls = f1.readlines()
f1.close()
from urllib import request
from bs4 import BeautifulSoup
import time
import openpyxl
import pprint
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    print(soup1)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup1))
    resultFile.close()
My problem is quite simple, yet I cannot find any tool that helps me achieve it. I would like to change the output from a list with "\n" in it to this:
2PRACE,0.0014 Hispanic,0.1556 API,0.0688 Black,0.0510 AIAN,0.0031 White,0.7200
I did not succeed with replace, as it told me I was treating a number of elements like a single element.
My approach here was:
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = soup1.replace('\n',' ')
    print(soup2)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
Can you help me find the correct approach to mutate the output before writing it to a csv?
The error message I get:
AttributeError: ResultSet object has no attribute 'replace'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
See the solution to the problem in my answer below. Thanks for all the responses!
soup1 is an iterable (a ResultSet), so you cannot just call replace on it.
Instead you can loop through all items in soup1, call replace on the string form of every single one of them, and save the changed strings to your soup2 variable. Something like this:
soup2 = []  # collect the cleaned strings here
for e in soup1:
    # convert each Tag to a string before replacing; Tags themselves have no replace()
    soup2.append(str(e).replace('\n', ' '))
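If you'd rather skip the explicit loop, a join over a generator expression does the same in one line (a sketch, using the same names):
soup2 = ' '.join(str(e).replace('\n', ' ') for e in soup1)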
You need to iterate over the soup: soup1 is a list of elements.
The BS4 Documentation is excellent and has many many examples:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Use strip() to remove the \n
for x in soup1:
    for r in x.children:
        try:
            print(r.strip())
        except TypeError:
            # non-string children (nested tags) raise TypeError here; skip them
            pass
Thank you both for the ideas and resources. I think I could implement what you suggested. The current build is:
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    # str() on the whole ResultSet stringifies every <p> at once,
    # so no inner loop is needed before stripping the newlines
    soup2 = str(soup1).replace('\n','')
    print(soup2)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
And it works just fine. I can do the final adjustments now in Excel.
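A slightly cleaner variant (a sketch, assuming the same loop) extracts just the text of each paragraph and writes one line per URL, which avoids the list brackets that pprint.pformat leaves in the file:
row = ' '.join(p.get_text().replace('\n', ' ').strip() for p in soup1)
with open('results.csv', 'a', encoding='utf8') as f:
    f.write(row + '\n')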

Get all elements that match a specific attribute value, but match any tag or attribute name with BeautifulSoup

Is it possible to get all elements that match a specific attribute value, but match any tag or attribute name, with BeautifulSoup? If so, does anyone know how to do it?
Here's an example of how I'm trying to do it:
from bs4 import BeautifulSoup
import requests
text_to_match = 'https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg'
url = 'https://www.betts.com.au/item/37510-command.html?colour=chocolate'
r = requests.get(url)
bs = BeautifulSoup(r.text, features="html.parser")
possibles = bs.find_all(None, {None: text_to_match})
print(possibles)
This gives me an empty list [].
If I replace {None: text_to_match} with {'href': text_to_match} this example will give some results as expected. I'm trying to figure out how to do this without specifying the attribute's name, and only matching the value.
You can try find_all with no limitation and filter out the tags that don't correspond to your needs, as such:
text_to_match = 'https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg'
url = 'https://www.betts.com.au/item/37510-command.html?colour=chocolate'
r = requests.get(url)
bs = BeautifulSoup(r.text, features="html.parser")
tags = [tag for tag in bs.find_all() if text_to_match in str(tag)]
print(tags)
This sort of solution is a bit clumsy, as you might get some irrelevant tags (a parent tag matches whenever one of its children does). You can make your text a bit more tag-specific by:
text_to_match = r'="https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg"'
which is a bit closer to the str representation of an attribute inside a tag.
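A more precise option is to pass a filter function to find_all; BeautifulSoup calls it once per tag and keeps the tags for which it returns True. A sketch, reusing the names above (note that multi-valued attributes such as class come back as lists, so both cases are checked):
def attr_value_matches(tag):
    # True if any attribute value on this tag equals the target URL
    for value in tag.attrs.values():
        if value == text_to_match or (isinstance(value, list) and text_to_match in value):
            return True
    return False

possibles = bs.find_all(attr_value_matches)
print(possibles)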

Get element's text with CDATA

Say, I have an element:
>>> el = etree.XML('<tag><![CDATA[content]]></tag>')
>>> el.text
'content'
What I'd like to get is <![CDATA[content]]>. How can I go about it?
When you do el.text, that's always going to give you the plain text content.
To see the serialized element try tostring() instead:
from lxml import etree

el = etree.XML('<tag><![CDATA[content]]></tag>')
print(etree.tostring(el).decode())
This will print:
<tag>content</tag>
To preserve the CDATA, you need to use an XMLParser() with strip_cdata=False:
parser = etree.XMLParser(strip_cdata=False)
el = etree.XML('<tag><![CDATA[content]]></tag>', parser=parser)
print(etree.tostring(el).decode())
This will print:
<tag><![CDATA[content]]></tag>
This should be sufficient to fulfill your "I want to make sure in a test that content is wrapped in CDATA" requirement.
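For instance, a minimal test along those lines (a sketch, assuming pytest or any assert-based runner):
from lxml import etree

def test_cdata_is_preserved():
    parser = etree.XMLParser(strip_cdata=False)
    el = etree.XML('<tag><![CDATA[content]]></tag>', parser=parser)
    # serialization should keep the CDATA wrapper intact
    assert etree.tostring(el) == b'<tag><![CDATA[content]]></tag>'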
You might consider using BeautifulSoup and looking for CDATA instances:
import bs4
from bs4 import BeautifulSoup
data='''<tag><![CDATA[content]]></tag>'''
soup = BeautifulSoup(data, 'html.parser')
"<![CDATA[{}]]>".format(soup.find(text=lambda x: isinstance(x, bs4.CData)))
Output
<![CDATA[content]]>

Python - Beautiful Soup - How to filter the extracted data for keywords?

I want to scrape the data of websites using Beautiful Soup and requests. I've come so far that I've got the data I want, but now I want to filter it:
from bs4 import BeautifulSoup
import requests
url = "website.com"
keyword = "22222"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
for article in soup.find_all('a'):
    for a in article:
        if article.has_attr('data-variant-code'):
            print(article.get("data-variant-code"))
Let's say this prints the following:
11111
22222
33333
How can I filter this so it only returns me the "22222"?
Assuming that article.get("data-variant-code") prints 11111, 22222, 33333,
you can simply use an if statement:
for article in soup.find_all('a'):
    for a in article:
        if article.has_attr('data-variant-code'):
            x = article.get("data-variant-code")
            if x == '22222':
                print(x)
If you want to print the 2nd group of characters in a string delimited by spaces, you can split the string using space as the delimiter. This gives you a list of strings; then access the 2nd item of the list.
For example:
print(article.get("data-variant-code").split(" ")[1])
result: 22222
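Alternatively, a CSS attribute selector can do the keyword filtering for you (a sketch, assuming the soup and keyword variables from the question):
# select only <a> tags whose data-variant-code is exactly the keyword
for article in soup.select('a[data-variant-code="{}"]'.format(keyword)):
    print(article.get('data-variant-code'))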

Find and extract curr_id number from Investing

I need to know the curr_id to submit a request with Python to investing.com and extract historic data for a number of currencies/commodities. To do this I need the curr_id number, as in the example below. I'm able to extract all scripts, but I cannot figure out how to find the correct script index that contains curr_id and extract the digits '2103'. Example: I need the code to find 2103.
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
#URL
url='http://www.investing.com/currencies/usd-brl-historical-data'
#OPEN URL
r = requests.get(url)
#DETERMINE FORMAT
soup=BeautifulSoup(r.content,'html.parser')
#FIND TABLE WITH VALUES IN soup
curr_data = soup.find_all('script', {'type':'text/javascript'})
UPDATE
I did it like this:
g_data_string = str(g_data)  # g_data is the ResultSet of script tags gathered above
if 'curr_id' in g_data_string:
    print('success')
    start = g_data_string.find('curr_id') + 9
    end = g_data_string.find('curr_id') + 13
    print(g_data_string[start:end])
But I'm sure there is a better way to do it.
You can use a regular expression pattern as a text argument to find a specific script element. Then, search inside the text of the script using the same regular expression:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
pattern = re.compile(r"curr_id: (\d+)")
script = soup.find('script', text=pattern)
match = pattern.search(script.text)
if match:
    print(match.group(1))
Prints 2103.
Here (\d+) is a capturing group that would match one or more digits.
You don't actually need a regex; you can get the id by extracting the value attribute from the input tag with name=item_ID:
In [6]: from bs4 import BeautifulSoup
In [7]: import requests
In [8]: r = requests.get("http://www.investing.com/currencies/usd-brl-historical-data").content
In [9]: soup = BeautifulSoup(r, "html.parser")
In [10]: soup.select_one("input[name=item_ID]")["value"]
Out[10]: u'2103'
You could also look for the id starting with item_id:
In [11]: soup.select_one("input[id^=item_id]")["value"]
Out[11]: u'2103'
Or look for the first div with the pair_id attribute:
In [12]: soup.select_one("div[pair_id]")["pair_id"]
Out[12]: u'2103'
There are actually numerous ways to get it.
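For example, you could even skip the HTML parse entirely and run the same regex over the raw page text (a sketch, reusing the pattern from the first answer):
import re
import requests

r = requests.get('http://www.investing.com/currencies/usd-brl-historical-data')
match = re.search(r'curr_id:\s*(\d+)', r.text)
if match:
    print(match.group(1))  # 2103 at the time of the question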
