I'm attempting to gather some data from a wikipedia page and I can't seem to narrow down my fetch to the ui and li items within the div. Here's what I have so far:
soup.findAll('div', attrs={'class': "mw-parser-output"})
I'm reading through the documentation and I cant seem to find where or how I can drill down to the ul or li within div class = mw-parser-output.
This is the first time I've used BeautifulSoup, so please excuse my ignorance.
Thanks in advance!
BeautifulSoup supports CSS selectors with the .select(selector) syntax. You could use something like soup.select('div.mw-parser-output ul li')
try this:
items=soup.find_all('div', attrs={'class': "mw-parser-output"})
for item in items:
print(item.find_all('li'))
^__^
I got what I was looking for by simply using items = soup.findAll('li'). If there is a better way, please let me know.
Related
I'm trying to extract specific links on a page full of links. The links I need contain the word "apartment" in them.
But whatever I try, I get way more data extracted than only the links I need.
<a href="https://www.website.com/en/ad/apartment/abcd123" title target="IWEB_MAIN">
If anyone could help me out on this, it'd be much appreciated!
Also, if you have a good source that could inform me better about this, it would be double appreciated!
Yon can use regular expression re.
import re
soup=BeautifulSoup(Pagesource,'html.parser')
alltags=soup.find_all("a",attrs={"href" : re.compile("apartment")})
for item in alltags:
print(item['href']) #grab href value
Or You can use css selector
soup=BeautifulSoup(Pagesource,'html.parser')
alltags=soup.select("a[href*='apartment']")
for item in alltags:
print(item['href'])
You find the details in official documents Beautifulsoup
Edited:
You need to consider parent div first then find the anchor tag.
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select("div[data-type='resultgallery-resultitem'] >a[href*='apartment']"):
print(item['href'])
Is there any way to get a the link out of an a element that's nested within a div element when all I can match is the div element? Something similar to the get_text() method but for links.
Here is what i have matched:
<div class="headline">כל שעה שווה לכם קצת יותר
this is my bs4 call:
>>> headlines = davarsoup.select('.headline')
I figured it out! i thought i had tried this solution already, but apparently there was a typo. here is how i solved it:
>>> headlines = davarsoup.select('.headline > a')
sorry to bother anyone who might have bothered.
So there were some attempts, but i cannot find a way to get the name and content of every
div class
div id
Im using lxml and beautysoup in my project, but i simply cant seem to find a way to find div's that are unknown to me.
Can someone show me a method or any tips how to do this?
Thanks in advance.
You can use the find_all method to find all tags of a certain type, then look at their attributes via ther attrs dict, e.g.:
soup = BeautifulSoup(html, 'lxml')
for div in soup.find_all('div'):
print(div.attrs)
I have the following code with a purpose to parse specific information from each of multiple pages. The http of each of the multiple pages is structured and therefore I use this structure to collect all links at the same time for further parsing.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
Links = ["http://www.newyorksocialdiary.com/party-pictures?page=" + str(i) for i in range(2,27)]
This command gives me a list of http links. I go further to read in and make soups.
Rs = [urllib.urlopen(Link).read() for Link in Links]
soups = [BeautifulSoup(R) for R in Rs]
As these make the soups that I desire, I cannot achieve the final goal - parsing structure . For instance,
Something for Everyone
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'. However, the code below cannot serve this purpose.
As = [soup.find_all('a', attr = {"href"}) for soup in soups]
Could someone tell me where went wrong? I highly appreciate your assistance. Thank you.
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'.
The next would be going for regular expression!!
You don't necessarily need to use regular expressions, and, from what I understand, you can filter out the desired links with BeautifulSoup:
[[a["href"] for a in soup.select('a[href*=party-pictures]')]
for soup in soups]
This, for example, would give you the list of links having party-pictures inside the href. *= means "contains", select() is a CSS selector search.
You can also use find_all() and apply the regular expression filter, for example:
pattern = re.compile(r"/party-pictures/2007/")
[[a["href"] for a in soup.find_all('a', href=pattern)]
for soup in soups]
This should work :
As = [soup.find_all(href=True) for soup in soups]
This should give you all href tags
If you only want hrefs with name 'a', then the following would work :
As = [soup.find_all('a',href=True) for soup in soups]
Can someone direct me as how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" busing BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tags children are available via .contents
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag be using its CSS class to extract the contents
from bs4 import BeautifulSoup
soup=BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the details nevessary