How can I extract only the text from a Scrapy selector in Python

I have this code:
site = hxs.select("//h1[@class='state']")
log.msg(str(site[0].extract()), level=log.ERROR)
The output is:
[scrapy] ERROR: <h1 class="state"><strong>
1</strong>
<span> job containing <strong>php</strong> in <strong>region</strong> paying <strong>$30-40k per year</strong></span>
</h1>
Is it possible to get only the text, without any HTML tags?

//h1[@class='state']
In your above XPath you are selecting the h1 tag that has a class attribute of state,
so that's why it selects everything that comes inside the h1 element.
If you just want the text of the h1 tag itself, all you have to do is
//h1[@class='state']/text()
If you want the text of the h1 tag as well as of its child tags, you have to use
//h1[@class='state']//text()
So the difference is: /text() gives the text of the specific tag only, while //text() gives the text of that tag and of all its children.
The code below should work for you:
site = ''.join(hxs.select("//h1[@class='state']/text()").extract()).strip()
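To make the /text() vs //text() difference concrete, here is a minimal sketch using lxml (the same XPath expressions work in Scrapy selectors); the HTML fragment is invented for illustration:

```python
from lxml import html

doc = html.fromstring(
    '<h1 class="state">Found <strong>1</strong>'
    '<span> job containing <strong>php</strong></span></h1>'
)

# /text(): text nodes that are direct children of the h1 only
direct = doc.xpath("//h1[@class='state']/text()")
print(direct)  # ['Found ']

# //text(): text nodes of the h1 and all of its descendants
full = doc.xpath("//h1[@class='state']//text()")
print(''.join(full))  # Found 1 job containing php
```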

You can use BeautifulSoup's get_text() feature.
from bs4 import BeautifulSoup
text = '''
<td>Please can you strip me?
<br/>I am waiting....
</td>
'''
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())
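get_text() also takes separator and strip arguments, which handle the whitespace cleanup in the same call; a small sketch with the fragment above:

```python
from bs4 import BeautifulSoup

text = '''
<td>Please can you strip me?
<br/>I am waiting....
</td>
'''
soup = BeautifulSoup(text, "html.parser")

# strip=True trims each text fragment, separator joins them
print(soup.get_text(separator=' ', strip=True))
# Please can you strip me? I am waiting....
```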

I haven't got a Scrapy instance running so I couldn't test this, but you could try using text() within your search expression.
For example:
site = hxs.select("//h1[@class='state']/text()")
(got it from the tutorial)

You can use BeautifulSoup to strip HTML tags. Here is an example (note this is the old BeautifulSoup 3 import; with bs4 it would be from bs4 import BeautifulSoup):
from BeautifulSoup import BeautifulSoup
''.join(BeautifulSoup(str(site[0].extract())).findAll(text=True))
You can then strip all the additional whitespace, newlines, etc.
If you don't want to use additional modules, you can try a simple regex:
import re
# replace html tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))

You can use html2text
import html2text
converter = html2text.HTML2Text()
print(converter.handle("<div>Please!!!<span>remove me</span></div>"))


How to remove a hyperlink tag under a header tag using beautifulSoup

I am trying to scrape a webpage. I want to extract only "Freelancer" from the H3 header, but when I run the code below I also get "(More Jobs)", which is under an a tag. How can I extract only "Freelancer" from the link below?
https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=work+from+home&txtLocation=
my code is:
company_name = job.find('h3', class_='joblist-comp-name').text
Result is: Freelancer (More Jobs)
Expected: Freelancer
You can simply split the string on the space and take the first part:
from bs4 import BeautifulSoup
html = """<h3 class="joblist-comp-name">Freelancer <a class="jobs-frm-comp" href="/candidate/companySearchResult.html?from=submit&encid=V1VUNYG9OfxywnPTmYOKIg==&searchType=byCompany&luceneResultSize=25">(More Jobs)</a></h3>"""
soup=BeautifulSoup(html,"lxml")
soup.find("h3",class_="joblist-comp-name").text.split(" ")[0]
Output:
'Freelancer'
Update with URL given
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=work+from+home&txtLocation=")
soup=BeautifulSoup(res.text,"lxml")
Here it will find the main ul tag and then find all the li tags inside it; this returns a list, so we can take the first element and get the text associated with it!
all_li=soup.find("ul",class_="new-joblist").find_all("li")
all_li[0].find("h3",class_="joblist-comp-name").get_text(strip=True).split("(")[0]
Output:
'Freelancer'
Your html is not well formed, but if it's fixed like this:
<h3 class="joblist-comp-name"> Freelancer
<a class="jobs-frm-comp" href="/whatever"> More Jobs</a>
</h3>
something like the below should get you there. It uses the lxml library and an XPath search to zero in on the target. Obviously, you'll have to modify it to fit your actual HTML:
import lxml.html as lh
company = """the modified html string above"""
job = lh.fromstring(company)
job.xpath('//h3[@class="joblist-comp-name"]/text()')[0].strip()
Output:
'Freelancer'

Can you write a CSS selector in BeautifulSoup that uses either the class or the style to identify the desired info in a div?

I am scraping a webpage using BeautifulSoup, and there is a piece of information I want that is contained in a <div>. Sometimes the div only has a value for class and sometimes only a value for style, like below:
<div class="text-one">
Text I want
</div>
<div style="display-style">
Text I want
</div>
Using Selenium, I would be able to grab the text I want, regardless of how it's formatted on the page, by doing this:
driver.find_element_by_xpath(
    ".//div[contains(@class, 'text-one') or contains(@style, 'display-style')]"
).text
Right now I have a workaround where I use an if statement to determine which selector to use to find the desired text (e.g. I do a string search of the raw HTML), like:
if "<div style" in str(rawhtml):
    want = soup.find("div", {"style": "display-style"}).text
else:
    want = soup.find("div", {"class": "text-one"}).text
Is there an equivalent to the Selenium call I have above in BeautifulSoup? Or is determining the correct selector using an if-statement the only option?
You can use the CSS OR syntax to match either of those patterns. The "," is the OR operator, [] indicates an attribute selector, and . a class selector.
data = [i.text for i in soup.select("div.text-one, div[style='display-style']")]
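As a runnable sketch with the two divs from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="text-one">Text I want</div>
<div style="display-style">Text I want</div>
'''
soup = BeautifulSoup(html, "html.parser")

# "," is OR: match on the class, or on the exact style value
data = [i.get_text(strip=True)
        for i in soup.select("div.text-one, div[style='display-style']")]
print(data)  # ['Text I want', 'Text I want']
```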
I believe there's no support for XPath in BeautifulSoup, only for CSS selectors. If you are heavily invested in XPath, the similar library lxml can be used instead:
from lxml import html
dom = html.fromstring('<html><div class="text-one">test i want1</div><div style="display-style">text i want2</div></html>')
selection = dom.xpath(".//div[contains(#class, 'text-one') or contains(#style, 'display-style')]")
[n.text for n in selection]
Output: ['test i want1', 'text i want2']

Get text and child from a span with BeautifulSoup

I am in a specific situation where I want to extract both the text and a child node from a span:
<span>condition:<b>good</b></span>
However, when I try to select the span with the text:
x = soup.find('span', text=re.compile(r'^condition:$'))
I get None back.
I've verified that that tag exists in the HTML document I am working with.
And I can't figure out how to get the internal tag either.
What am I doing wrong?
The following has the same issue: BeautifulSoup - search by text inside a tag
You can separate the logic into a function to solve the problem, like this:
def find_all_with_regex(soup, target_tag, regex):
    elements = soup.find_all(target_tag)
    return list(filter(lambda tag_found: regex.match(tag_found.text), elements))

print(find_all_with_regex(soup, 'span', re.compile(r'^condition:.*')))
Try the following CSS selector.
print(soup.select_one('span:contains("condition:")').text)
Code:
from bs4 import BeautifulSoup
html='''<span>condition:<b>good</b></span>'''
soup=BeautifulSoup(html,"html.parser")
print(soup.select_one('span:contains("condition:")').text)
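For context, find('span', text=...) returns None here because text= matches against a tag's .string, and .string is None as soon as the tag has an element child like <b>. Once the span is located, both pieces can be read directly; a sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span>condition:<b>good</b></span>', 'html.parser')
span = soup.find('span')

label = span.contents[0]   # leading text node: 'condition:'
value = span.b.get_text()  # text of the <b> child: 'good'
print(label, value)        # condition: good
```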

BeautifulSoup - how to extract text without an opening tag and before a <br> tag?

I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. This I managed to extract it.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 elements and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
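Put together with the HTML from the question, this becomes a self-contained sketch (string=True is the modern spelling of text=True):

```python
from bs4 import BeautifulSoup

html = '''<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>'''
soup = BeautifulSoup(html, "html.parser")

for h4 in soup.find_all("h4", class_="actorboxLink"):
    print(h4.get_text(strip=True))  # Decheterie de Bagnols
    # text-node siblings after the </h4>, skipping whitespace-only ones
    for text in h4.find_next_siblings(string=True):
        if text.strip():
            print(text.strip())
```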
If you don't need each of the 3 elements you are looking for in a different variable, you could just use the get_text() function on the <div> to get them all in one string. If there are other div tags but they all have classes, you can find all the <div> with class_=False. If you can't isolate the <div> you are interested in, then this solution won't work for you.
import urllib.request
from bs4 import BeautifulSoup
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
BTW this is python 3 & bs4

BeautifulSoup 4: Dealing with urls containing <br />

I'm dealing with HTML/XHTML links with BeautifulSoup 4.3.2 and have run into some strangeness with br tags occurring inside a elements.
import re
from bs4 import BeautifulSoup
html = BeautifulSoup('<html><head></head><body><a href="ABCD0000000">ABCD0000000<br /></a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
Gives an empty list.
As I've found, it's caused by the br tag appearing inside the a tag.
Hmm. Well, let's replace it with a newline, as someone advised here..
html.find('br').replaceWith('\n')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
Again an empty list, damn.
Maybe,
html.find('br').replaceWith('')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
The same result..
But
html = BeautifulSoup('<html><head></head><body><a href="ABCD0000000">ABCD0000000</a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
[<a href="ABCD0000000">ABCD0000000</a>]
- Works fine.
So, as I see it, there is no way to bypass this except to clean or replace the br tags before feeding the data to bs4.
import re
re.sub(re.compile('<br\s*/>', re.IGNORECASE), '\n', '<html><head></head><body>ABCD0000000<br /></body></html>')
Or any?
Thanks for suggestions and comments.
Best regards,
~S.
One option would be to remove all br tags using extract() and then perform the search:
import re
from bs4 import BeautifulSoup

html = BeautifulSoup('<html><head></head><body><a href="ABCD0000000">ABCD0000000<br /></a></body></html>')
for br in html('br'):
    br.extract()
print html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
Prints:
[<a href="ABCD0000000">ABCD0000000</a>]
Another option would be to check that the href attribute ends with ABCD0000000, using a CSS selector:
html.select('a[href$="ABCD0000000"]')
Another option would be to use a function and check that the link text starts with ABCD0000000:
html.find_all(lambda tag: tag.name == 'a' and tag.text.startswith('ABCD0000000'))
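For example, the lambda option against markup like the question's (the href value here is assumed for illustration):

```python
from bs4 import BeautifulSoup

html = BeautifulSoup(
    '<html><head></head><body><a href="ABCD0000000">ABCD0000000<br /></a></body></html>',
    'html.parser'
)

# .text concatenates all descendant text, so the <br /> no longer gets in the way
links = html.find_all(lambda tag: tag.name == 'a' and tag.text.startswith('ABCD0000000'))
print(links[0]['href'])  # ABCD0000000
```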
