I am trying to scrape article headings from
https://time.com/
I want to select only the article links that appear under the "The Brief" heading.
I have tried to select the nested div using this code:
for url in response.xpath('//div[@class="column text-align-left visible-desktop visible-mobile last-column"]/div[@class="column-tout"]/a/@href').extract():
but it did not work.
Can someone please help me extract those specific articles?
You can find this div by its text content and then get all following-sibling divs:
for url in response.xpath('//div[.="The Brief"]/following-sibling::div//a/@href').extract():
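To illustrate the following-sibling trick on its own, here is a minimal sketch using lxml instead of a live Scrapy response; the snippet markup and URLs are invented for the example:

```python
from lxml import html

# Hypothetical markup: a heading div followed by sibling article touts
snippet = """
<div class="column">
  <div class="heading">The Brief</div>
  <div class="column-tout"><a href="/article-1">Story one</a></div>
  <div class="column-tout"><a href="/article-2">Story two</a></div>
</div>
"""
doc = html.fromstring(snippet)

# Locate the div by its string value, then collect links from its following siblings
urls = doc.xpath('//div[.="The Brief"]/following-sibling::div//a/@href')
print(urls)  # ['/article-1', '/article-2']
```

Note that `.="The Brief"` compares the element's entire string value, so it only matches when the div's text is exactly that heading.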
I am learning to use scrapy and playing with XPath selectors, and decided to practice by scraping job titles from craigslist.
Here is the html of a single job link from the craigslist page I am trying to scrape the job titles from:
Full Stack .NET C# Developer (Mid-Level, Senior) ***LOCAL ONLY***
What I wanted to do was retrieve all of the similar a tags with the class result-title, so I used the XPath selector:
titles = response.xpath('//a[@class="result-title"/text()]').getall()
but the output I receive is an empty list: []
I was able to copy the XPath directly from Chrome's inspector, which ended up working perfectly and gave me a full list of job title names. This selector was:
titles = response.xpath('*//div[@id="sortable-results"]/ul/li/p/a/text()').getall()
I can see why this second XPath selector works, but I don't understand why my first attempt did not. Can someone explain why my first XPath selector failed? I have also provided a link to the full HTML of the craigslist page below, in case that is helpful/necessary. I am new to scrapy and want to learn from my mistakes. Thank you!
view-source:https://orangecounty.craigslist.org/search/sof
Like this:
'//a[contains(@class,"result-title ")]/text()'
Or:
'//a[starts-with(@class,"result-title ")]/text()'
I use contains() or starts-with() because the class of the a node is
result-title hdrlnk
not just
result-title
In your XPath:
'//a[@class="result-title"/text()]'
even if the class were result-title, the syntax is wrong; you should use:
'//a[@class="result-title"]/text()'
Simply '//a[@class="result-title hdrlnk"]/text()'
Two fixes were needed:
/text() goes outside of the []
"result-title hdrlnk", not just "result-title", in the attribute test, because XPath's @class= compares the entire attribute string exactly; it is not a CSS class selector, so the full attribute content must match.
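Both fixes can be checked against a tiny standalone snippet; this sketch uses lxml rather than a Scrapy response, and the job link is made up:

```python
from lxml import html

snippet = '<p><a class="result-title hdrlnk" href="/job/1">Full Stack Developer</a></p>'
doc = html.fromstring(snippet)

# Exact match fails: @class is the whole string "result-title hdrlnk"
exact = doc.xpath('//a[@class="result-title"]/text()')
print(exact)  # []

# contains() matches a substring of the class attribute
partial = doc.xpath('//a[contains(@class,"result-title")]/text()')
print(partial)  # ['Full Stack Developer']
```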
I'm trying to extract specific links on a page full of links. The links I need contain the word "apartment" in them.
But whatever I try, I get way more data extracted than only the links I need.
<a href="https://www.website.com/en/ad/apartment/abcd123" title target="IWEB_MAIN">
If anyone could help me out on this, it'd be much appreciated!
Also, if you have a good source that could inform me better about this, it would be double appreciated!
You can use a regular expression with the re module:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'html.parser')
all_tags = soup.find_all("a", attrs={"href": re.compile("apartment")})
for item in all_tags:
    print(item['href'])  # grab the href value
Or you can use a CSS selector:
soup = BeautifulSoup(page_source, 'html.parser')
all_tags = soup.select("a[href*='apartment']")
for item in all_tags:
    print(item['href'])
You can find the details in the official BeautifulSoup documentation.
Edited:
You need to consider the parent div first, then find the anchor tag:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select("div[data-type='resultgallery-resultitem'] > a[href*='apartment']"):
    print(item['href'])
I am crawling a website with scrapy.
I have come across a situation where I need to extract an attribute value from a div tag, e.g.
I need to extract "lmnop" from the web page.
I have tried a few CSS selectors but they return an empty list.
For the above example I wrote a CSS selector as:
response.css('div.blahblah::attr(abc)').extract()
For this piece of HTML code, the expected output is shown below.
Code: <div class="searching-result" data-id="somehashvalue" abc="xyz"></div>
Expected output:
["somehashvalue"]
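Assuming the expected output means the data-id attribute is the one wanted, a selector that targets that attribute works; here is a minimal sketch using lxml on the snippet above (in Scrapy the equivalent would be response.css('div.searching-result::attr(data-id)').extract()):

```python
from lxml import html

snippet = '<div class="searching-result" data-id="somehashvalue" abc="xyz"></div>'
doc = html.fromstring(snippet)

# Select the attribute directly with the @ axis
result = doc.xpath('//div[@class="searching-result"]/@data-id')
print(result)  # ['somehashvalue']
```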
Below is a section of HTML code I am currently scraping.
<div class="RadAjaxPanel" id="LiveBoard1_LiveBoard1_litGamesPanel">
<a href="leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=8&season=2016&month=0&season1=2016&ind=0&team=0&rost=0&age=0&filter=&players=p2018-04-20">
Today's Probable Starters and Lineups Leaderboard
</a>
</div>
Throughout the code, I need to figure out a way to scrape all the links in this div class with the exception of the one posted above. Does anyone know how to decompose one specific link within a specific div class but still scrape the remaining links? Regarding this specific link, the beginning ("leaders.aspx") of the URL differs from the links I am currently targeting. Below is a sample of my current code.
import requests
import csv
from bs4 import BeautifulSoup
page = requests.get('https://www.fangraphs.com/livescoreboard.aspx?date=2018-04-18')
soup = BeautifulSoup(page.text, 'html.parser')

# Remove unwanted links
[link.decompose() for link in soup.find_all(class_='lineup')]
[yesterday.decompose() for yesterday in soup.find_all('td', attrs={'colspan': '2'})]

team_name_list = soup.find(class_='RadAjaxPanel')
team_name_list_items = team_name_list.find_all('a')
for team_name in team_name_list_items:
    teams = team_name.contents[0]
    print(teams)

winprob_list = soup.find(class_='RadAjaxPanel')
winprob_list_items = winprob_list.find_all('td', attrs={'style': 'border:1px solid black;'})
for winprob in winprob_list_items:
    winprobperc = winprob.contents[0]
    print(winprobperc)
To summarize, I just need to remove the "Today's Probable Starters and Lineups Leaderboard" link that was posted in the first code block. Thanks in advance!
Just use CSS selectors with .select_one():
soup.select_one('.RadAjaxPanel > center > a').decompose()
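The decompose-then-scrape pattern can also key off the "leaders.aspx" prefix the question mentions; a minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one leaderboard link plus the links we want to keep
html_doc = """
<div class="RadAjaxPanel">
  <a href="leaders.aspx?pos=all">Today's Probable Starters and Lineups Leaderboard</a>
  <a href="/game/1">Game 1</a>
  <a href="/game/2">Game 2</a>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Drop only the leaders.aspx link, keep the rest
for a in soup.find_all('a', href=lambda h: h and h.startswith('leaders.aspx')):
    a.decompose()

links = [a['href'] for a in soup.find_all('a')]
print(links)  # ['/game/1', '/game/2']
```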
There have been some attempts, but I cannot find a way to get the name and content of every
div class
div id
I'm using lxml and BeautifulSoup in my project, but I simply can't seem to find a way to find divs that are unknown to me.
Can someone show me a method or any tips on how to do this?
Thanks in advance.
You can use the find_all method to find all tags of a certain type, then look at their attributes via their attrs dict, e.g.:
soup = BeautifulSoup(html, 'lxml')
for div in soup.find_all('div'):
    print(div.attrs)
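For example, to collect each div's id, classes, and text content without knowing them in advance (the markup here is invented, and 'html.parser' is used so no extra parser is required):

```python
from bs4 import BeautifulSoup

html_doc = '<div id="header" class="top"><p>Hello</p></div><div class="body">World</div>'
soup = BeautifulSoup(html_doc, 'html.parser')

# .get() returns None when an attribute is absent; get_text() gathers nested text
info = [(div.get('id'), div.get('class'), div.get_text(strip=True))
        for div in soup.find_all('div')]
print(info)  # [('header', ['top'], 'Hello'), (None, ['body'], 'World')]
```

Note that BeautifulSoup reports class as a list, since HTML allows multiple classes per element.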