Trying to parse all linked text from wiki - python

I'm trying to parse a wiki page here, but I only want certain parts: the links in the main article. I'd like to parse them all. Is there an article or tutorial on how to do this? I'm assuming I'd be using BS4. Can anyone help?
Specifically, the links that are under all the main headers on the page.

Well, it really depends on what you mean by "parse", but here is a full working example of how to extract all links from the main section with BeautifulSoup:
from bs4 import BeautifulSoup
import urllib.request

def main():
    url = 'http://yugioh.wikia.com/wiki/Card_Tips%3aBlue-Eyes_White_Dragon'
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page.read(), 'html.parser')  # specify a parser explicitly to avoid a warning
    content = soup.find('div', id='mw-content-text')  # the main article container
    links = content.findAll('a')
    for link in links:
        print(link.get_text())

if __name__ == "__main__":
    main()
This code should be self-explanatory, but just in case:
First we open the page with urllib.request.urlopen and pass its contents to BeautifulSoup.
Then we extract the main content div by its id. (The id mw-content-text can be found in the page's source.)
We proceed to extract all the links inside the main content.
In a for loop we print all the links.
Additional methods you might need for parsing the links:
link.get('href') extracts the destination URL
link.get('title') extracts the title attribute of the link
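For instance, a small extension of the loop above that prints the destination alongside the text, skipping anchors without an href:
for link in links:
    href = link.get('href')  # None for anchors without a destination
    if href:
        print(link.get_text(), '->', href, '|', link.get('title'))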
And since you asked for resources: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ is the first place you should start.

Related

How to get href from a class containing a specific text using CSS selector (Scrapy)

I am working with the following website: https://inmuebles.mercadolibre.com.mx/venta/, and I am trying to get the link from the "ver todos" button in the "Inmueble" section (in red). However, the "Tour virtual" and "Publicados hoy" sections (in blue) may or may not appear when visiting the site.
As shown in the image below, the ui-search-filter-dl classes contain the specific sections of the menu from the image above, while the ui-search-filter-container classes contain the sub-sections displayed by the site (e.g. Casas, Departamento & Terrenos for Inmueble). With the intention of obtaining the link from the "ver todos" button in the "Inmueble" section, I was using this line of code:
ver_todos = response.css('div.ui-search-filter-dl')[2].css('a.ui-search-modal__link').attrib['href']
But since "Tour virtual" and "Publicados hoy" are not always in the page, I cannot be sure that ui-search-filter-dl at index 2 is always the index corresponding to "ver todos" button.
I was trying to get the link from "ver todos" by using this line of code:
response.css(''':contains("Inmueble") ~ .ui-search-filter-dt-title
.ui-search-modal__link::attr(href)''').extract()
Basically, I was trying to get the href from a ui-search-filter-dt-title class that contains the title "Inmueble". Unfortunately, the output is an empty list. I would like to find the link from "ver todos" by using css and regex but I'm having trouble with it. How may I achieve that?
I think XPath makes it easier to select the target elements in most cases:
Code:
xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(#class,'ui-search-modal__link')]/#href"
url = response.xpath(xpath).extract()[0]
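In newer Scrapy versions you can also write this with .get(), which returns None instead of raising an IndexError when nothing matches:
url = response.xpath(xpath).get()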
Actually, I didn't create a Scrapy project to check your code. Instead, I verified the XPath by implementing the following code with lxml:
from lxml import html
import requests
res = requests.get("https://inmuebles.mercadolibre.com.mx/venta/")
dom = html.fromstring(res.text)
xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(@class, 'ui-search-modal__link')]/@href"
url = dom.xpath(xpath)[0]
assert url == 'https://inmuebles.mercadolibre.com.mx/venta/_FiltersAvailableSidebar?filter=PROPERTY_TYPE'
Since the XPath is the same for Scrapy and lxml, I hope the code shown at the beginning will also work fine in your Scrapy project.
An easy way you could do it is by getting all the <a> links and then checking if any of their text matches "ver todos".
import requests
from bs4 import BeautifulSoup

link = "https://inmuebles.mercadolibre.com.mx/venta/"

def main():
    res = requests.get(link)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, "html.parser")
        # only anchors that actually have an href, to avoid a KeyError
        links = [a["href"] for a in soup.select("a[href]") if a.text.strip().lower() == "ver todos"]
        print(links)

if __name__ == "__main__":
    main()

python crawling text from <em></em>

Hi, I want to get the text (the number 18) from the em tag, as shown in the picture above.
When I ran my code, it did not work and gave me only an empty list. Can anyone help me? Thank you~
Here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable JavaScript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website; then you can parse the rendered content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
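For example, a minimal sketch using Splash's render.html endpoint, assuming you have a local Splash instance running on its default port 8050:
import requests
from bs4 import BeautifulSoup

# assumes a local Splash instance, e.g. started with:
# docker run -p 8050:8050 scrapinghub/splash
res = requests.get('http://localhost:8050/render.html',
                   params={'url': 'https://blog.naver.com/kwoohyun761/221945923725', 'wait': 2})
soup = BeautifulSoup(res.text, 'lxml')
print(soup.find_all('em', class_='u_cnt _count'))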
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print statement).
Furthermore, looking at the issue again, it is a bit more complicated. When you look at the source code of the page, it actually loads an iframe, and in that iframe you have the actual content. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original URL and the URL in the iframe; at first glance it looks like the iframe URL can be constructed from the original URL.
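If you want to experiment with skipping the first request, here is a hypothetical sketch; the PostView path and its blogId/logNo parameters are an assumption based on what the iframe src looked like, and may change:
# hypothetical: assumes the iframe src follows a PostView pattern with
# blogId/logNo taken from the original url's path segments
blog_id, log_no = "kwoohyun761", "221945923725"
real_url = "https://blog.naver.com/PostView.naver?blogId={}&logNo={}".format(blog_id, log_no)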
Even then, you'll still need a rendered version of the iframe URL to grab your desired value.
I don't know what this site is about, but it seems they do not want to be crawled; maybe you should respect that.

How to crawl every page in a website in Python BeautifulSoup

Is there any way to crawl every page of a website?
Such as https://gogo.mn/, to find every article page on the site?
The following is what I have so far. The problem is that the news article URL patterns are weird, for example https://gogo.mn/r/qqm4m.
So code like the following will never find the articles.
base_url = 'https://gogo.mn/'
for i in range(number_pages):
    url = base_url + str(i)
    req = requests.get(url)
    soup = BeautifulSoup(req.content)
How do I crawl such websites?
The easiest way is to first get the page from the website. This can be accomplished thusly:
url = 'https://gogo.mn/'
response = requests.get(url)
Then your page is contained in the response variable, which you can examine by looking at response.text.
Now use BeautifulSoup to parse the page and find all of the links it contains:
soup = BeautifulSoup(response.text, 'html.parser')
a_links = soup.find_all('a')
This returns a bs4.element.ResultSet type that can be iterated through using a for loop. Looking at your particular site, I found that they don't include the base URL in many of their links, so some normalization of the URLs has to be performed.
for link in a_links:
    if ('https' in link['href']) or ('http' in link['href']):
        print(link['href'])
    else:
        xLink = link['href'][1:]
        print(f'{url}{xLink}')
Once you've done that, you have all of the links from a given page. You would then need to eliminate duplicates and, for each new page, run through its links as well. This involves recursively stepping through all the links you find.
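A rough sketch of that idea, using a simple breadth-first crawl with a visited set for de-duplication (the page limit and one-second delay are assumptions you should tune):
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100):
    seen = {start_url}
    queue = [start_url]
    while queue and len(seen) <= max_pages:
        page = queue.pop(0)
        try:
            res = requests.get(page, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(res.text, 'html.parser')
        for a in soup.find_all('a', href=True):
            link = urljoin(page, a['href'])  # resolves relative links
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(1)  # be polite to the server
    return seen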
I have not used Scrapy. But to get all the content using only requests and BeautifulSoup, you need to find the index page (sometimes archives or search results) of the website, save the URLs of all the pages, loop through the URLs, and save the content of each page.
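A minimal sketch of that approach, assuming the article URLs have already been collected from an index or archive page (the example URL is the one from the question, and the filename scheme is just an illustration):
import requests

# urls collected beforehand from an index/archive page
article_urls = ['https://gogo.mn/r/qqm4m']

for i, url in enumerate(article_urls):
    res = requests.get(url)
    with open('article_{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(res.text)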

Python: (BeautifulSoup) How to limit extracted text from an HTML news article to only the news article.

I wrote this test code which uses BeautifulSoup.
url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('p'):
print(n.get_text())
It works fine, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, copyright notices, etc.
I would like it to only retrieve text from the news article itself; how would one go about this?
You might have much better luck with the newspaper library, which is focused on scraping articles.
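For instance, a minimal sketch with newspaper3k (installed via pip install newspaper3k), which downloads a page and extracts just the article text:
from newspaper import Article

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"

article = Article(url)
article.download()
article.parse()
print(article.text)  # article body only, page boilerplate stripped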
If we talk about BeautifulSoup only, one option to get closer to the desired result and have more relevant paragraphs is to find them in the context of the div element with the itemprop="articleBody" attribute:
article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
    print(p.get_text())
You'll need to target more specifically than just the p tag. Try looking for a div with class="article" or something similar, then only grab paragraphs from there.
Be more specific; you need to catch the div with the itemprop="articleBody" attribute, so:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('div', attrs={'itemprop': "articleBody"}):
    print(n.get_text())
Responses on SO are not just for you, but also for people coming from Google searches and such. As you can see, attrs is a dict, so it is possible to pass more attributes/values if needed.
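For example, to match on two attributes at once (the extra class value here is hypothetical):
soup.find_all('div', attrs={'itemprop': 'articleBody', 'class': 'article-text'})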

Checking webpage for results with python and beautifulsoup

I need to check a webpage search results and compare them to user input.
import urllib
from bs4 import BeautifulSoup

ui = raw_input()  # for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica = urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty
Since there are 1502 results, my idea was to change the k=10 to k=1502. Now I need some kind of function to check if the search results contain my user input. I know that my names are the text after TEXT.
So how do I do it? Maybe using regex?
The second part is: if there are matching results, how do I get the link of the results? Again, I know that the link is inside the href="", but how do I get it out and make it usable?
Finding if Niels Bohr is listed is as easy as using a large batch number and loading the resulting page:
import sys
import urllib2
from bs4 import BeautifulSoup
url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for link in soup.find_all(class_='AllWordsTextHit', text=name):
    print link
This produces any links that contain the text 'Bohr, Niels' as the link text. You can use a regular expression if you need a partial match.
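For example, a partial match on the surname only:
import re

for link in soup.find_all(class_='AllWordsTextHit', text=re.compile('Bohr')):
    print(link)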
The link object has a (relative) href attribute you can then use to load the next page:
professor_page = 'http://www.enciklopedija.hr/' + link['href']
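And a minimal continuation, loading that page with the same urllib2 approach used above:
professor_soup = BeautifulSoup(urllib2.urlopen(professor_page).read())
print(professor_soup.title)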
