Remove CSS style code when getting a webpage's text - Python

I would like to get the full text of a webpage. Unfortunately, my scraper is also capturing CSS code. How can I extend the code below so that it also removes the CSS style code?
page = " ".join(response.xpath('//body//descendant-or-self::*[not(self::script)]/text()').extract())

Try
//body//descendant-or-self::*[not(self::script or self::style)]
I tested it and it works: it excludes both STYLE and SCRIPT tags.
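For anyone who wants to check this outside a spider, here is a minimal sketch. It assumes the third-party lxml package is installed and uses a made-up sample page (SAMPLE and visible_text are hypothetical names, not part of the question). Scrapy's response.xpath() should accept the same expression.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# Hypothetical sample page, not the asker's actual page.
SAMPLE = (
    "<html><head><title>t</title></head><body>"
    "<h1>Title</h1>"
    "<style>p { color: red; }</style>"
    "<p>Hello</p>"
    "<script>var x = 1;</script>"
    "</body></html>"
)


def visible_text(markup):
    """Join the text nodes under <body>, skipping <script> and <style>."""
    tree = html.fromstring(markup)
    nodes = tree.xpath(
        "//body//descendant-or-self::*[not(self::script or self::style)]/text()"
    )
    # Drop whitespace-only nodes so the joined string stays clean.
    return " ".join(n.strip() for n in nodes if n.strip())


page = visible_text(SAMPLE)
print(page)
```

The whitespace filtering is an addition of this sketch; the original one-liner joins every text node as-is.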

Related

Scrapy CSS selector: get text from the first occurrence of a class

I'm trying to scrape text from the class .s-recipe-header__info-item, but as you can see in the picture, there are three elements with the same class and I would like to extract only the first one to get the text "Do hodiny". See the image of the code here. So far I have tried this code:
recipe_item["preparation_time"] = response.css(".s-recipe-header__info > .s-recipe-header__info-items > .s-recipe-header__info-item::text").extract_first()
I have also tried .get() instead of .extract_first(), but neither seems to work.
I am new to web scraping and have only elementary HTML and CSS knowledge. Thank you in advance for your help.

How to add tags between tags in html using selenium web driver python

I want to insert formatted text, such as bold, into a text editor/text box on a website. How would I do that?
I tried this :
node = browser.find_element_by_xpath("/html/body/div[4]/div/div/div[2]/div/div")
browser.execute_script("arguments[0].insertAdjacentHTML('beforebegin', arguments[1])", node, "<strong> bold </strong>")
However, this seems to add the markup adjacent to the element's HTML, not inside the text box.
I would go with the JS method 'setAttribute':
browser.execute_script("arguments[0].setAttribute('style', arguments[1]);", node, "font-weight: bold")

I am confused why this XPath selector does not work

I am learning to use Scrapy and playing with XPath selectors, and decided to practice by scraping job titles from Craigslist.
Here is the HTML of a single job link from the Craigslist page I am trying to scrape the job titles from:
Full Stack .NET C# Developer (Mid-Level, Senior) ***LOCAL ONLY***
What I wanted to do was retrieve all of the similar a tags with the class result-title, so I used the XPath selector:
titles = response.xpath('//a[@class="result-title"/text()]').getall()
but the output I receive is an empty list: []
I was able to copy the XPath directly from Chrome's inspector, which ended up working perfectly and gave me a full list of job title names. This selector was:
titles = response.xpath('*//div[@id="sortable-results"]/ul/li/p/a/text()').getall()
I can see why this second XPath selector works, but I don't understand why my first attempt did not work. Can someone explain to me why my first XPath selector failed? I have also provided a link to the full HTML for the Craigslist page below if that is helpful/necessary. I am new to Scrapy and want to learn from my mistakes. Thank you!
view-source:https://orangecounty.craigslist.org/search/sof
Like this:
'//a[contains(@class,"result-title ")]/text()'
Or:
'//a[starts-with(@class,"result-title ")]/text()'
I use contains() or starts-with() because the class of the a node is
result-title hdrlnk
not just
result-title
In your XPath:
'//a[@class="result-title"/text()]'
even if the class were just result-title, the syntax is wrong; you should use:
'//a[@class="result-title"]/text()'
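The difference between an exact @class comparison and contains() can be checked offline. The sketch below assumes the third-party lxml package and a hand-written stand-in for one Craigslist result link (SAMPLE is hypothetical; only the class value "result-title hdrlnk" comes from this thread). The last expression is a common token-matching idiom, added here as an extra; it is not taken from the answers.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# Hypothetical stand-in for a single result link on the page.
SAMPLE = (
    '<ul><li><p>'
    '<a class="result-title hdrlnk" href="#">Full Stack Developer</a>'
    '</p></li></ul>'
)
tree = html.fromstring(SAMPLE)

# "=" compares the whole attribute string, so a partial class list matches nothing.
exact_partial = tree.xpath('//a[@class="result-title"]/text()')

# The full attribute value matches.
exact_full = tree.xpath('//a[@class="result-title hdrlnk"]/text()')

# contains() is a plain substring test on the attribute value.
substring = tree.xpath('//a[contains(@class, "result-title")]/text()')

# Token-style match: pad with spaces so only whole class names match.
token = tree.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " result-title ")]/text()'
)

print(exact_partial, exact_full, substring, token)
```

Because contains() is a substring test, it would also match a class such as "result-title-extra"; the padded form avoids that at the cost of a longer expression.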
Simply '//a[@class="result-title hdrlnk"]/text()'
It needed two fixes:
/text() goes outside the []
"result-title hdrlnk", not only "result-title", in the attribute test, because XPath compares the whole attribute string rather than matching individual class names the way CSS does; the exact attribute content is needed to match.

Printing Text from 2nd Div in Class in Python + Selenium

Newbie here trying to learn Selenium. I have the following HTML code:
<div class="lower-text">
<div data-text="winScreen.yourCodeIs">Your Code Is:</div>
<div>OUTPUTCODE</div>
</div>
I am trying to only print the text OUTPUTCODE, however the following code only prints "Your Code Is:".
text = browser.find_elements_by_class_name('lower-text')
for test in text:
    print(test.text)
Any help would be appreciated. Thank you.
Try the XPath below.
//div[@class='lower-text']/div[last()]
Your code should be:
print(driver.find_element_by_xpath("//div[@class='lower-text']/div[last()]").text)
Try the solutions below:
1. XPath:
//div[@class='gs_copied']
2. CSS selector:
.lower-text > div:nth-child(2)
Your site is unstable and does not always generate a coupon code. Currently I am getting the error below (check the screenshot), so I am not able to identify the elements mentioned above.
You need to amend your logic based on the site's functionality: if the user is unlucky and gets no coupon code, your script has to handle the other outcome the site shows (e.g. "Check out our Hot Deals Page").
Try the following approach:
text = driver.find_element_by_xpath("//div[text()='Your Code Is:']//following-sibling::div[text()]").get_attribute('innerHTML')
print(text)
I copied your HTML into a new text file and tried the following XPath, which works perfectly:
//div[@class='lower-text']/div[text()='Your Code Is:']/following-sibling::div
Attaching screenshot link also. Please have a look and hopefully it will solve your problem.
https://imgur.com/EujgZrI
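The positional and sibling-based expressions can both be tried against the HTML from the question without launching a browser. This is only a sketch: it assumes the third-party lxml package, and it appends /text() to get strings back, which Selenium's element lookups do not allow (there you would keep the element XPath and read .text instead).

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# The HTML fragment quoted in the question.
SNIPPET = (
    '<div class="lower-text">'
    '<div data-text="winScreen.yourCodeIs">Your Code Is:</div>'
    '<div>OUTPUTCODE</div>'
    '</div>'
)
tree = html.fromstring(SNIPPET)

# Positional: the last <div> child of the wrapper.
by_position = tree.xpath("//div[@class='lower-text']/div[last()]/text()")

# Relative: the <div> that follows the "Your Code Is:" label.
by_sibling = tree.xpath(
    "//div[@class='lower-text']/div[text()='Your Code Is:']"
    "/following-sibling::div/text()"
)

print(by_position, by_sibling)
```

The sibling form keeps working if more elements are added before the label, while the positional form depends on the code staying the last child.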

White space and selectors

I tried to use a selector in the Scrapy shell to extract information from a web page and it didn't work properly. I believe that happened because there is white space in the class name. Any idea what's going wrong?
I tried different syntaxes like:
response.xpath('//p[@class="text-nnowrap hidden-xs"]').getall()
response.xpath('//p[@class="text-nnowrap hidden-xs"]/text()').get()
# what I type into my scrapy shell
response.css('div.offer-item-details').xpath('//p[@class="text-nowrap hidden-xs"]/text()').get()
# html code that I need to extract:
<p class="text-nowrap hidden-xs">Apartamento para arrendar: Olivais, Lisboa</p>
expected result: Apartamento para arrendar: Olivais, Lisboa
actual result: []
The whitespace in the class attribute means that there are multiple classes: the "text-nowrap" class and the "hidden-xs" class. To select by XPath for multiple classes, you can use the following format:
"//element[contains(@class, 'class1') and contains(@class, 'class2')]"
(grabbed this from How to get html elements with multiple css classes)
So in your example, I believe this would work:
response.xpath("//p[contains(@class, 'text-nowrap') and contains(@class, 'hidden-xs')]").getall()
For these cases I prefer CSS selectors because of their minimal syntax:
response.css("p.text-nowrap.hidden-xs::text")
Also, Google Chrome's developer tools display CSS selectors when you inspect the HTML, which makes scraper development much easier.
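A quick offline check of the two-class XPath, as a sketch only: it assumes the third-party lxml package, and the wrapping div is a guess based on the div.offer-item-details selector in the question. It also shows that a misspelled class name such as text-nnowrap silently returns nothing, which is worth ruling out first.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# The <p> element from the question; the wrapping <div> is an assumption.
SNIPPET = (
    '<div class="offer-item-details">'
    '<p class="text-nowrap hidden-xs">'
    'Apartamento para arrendar: Olivais, Lisboa'
    '</p></div>'
)
tree = html.fromstring(SNIPPET)

# Both class names must appear somewhere in the class attribute.
both = tree.xpath(
    "//p[contains(@class, 'text-nowrap') and contains(@class, 'hidden-xs')]/text()"
)

# A misspelled class name matches nothing and raises no error.
misspelled = tree.xpath(
    "//p[contains(@class, 'text-nnowrap') and contains(@class, 'hidden-xs')]/text()"
)

print(both, misspelled)
```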
