How to take hyperlinks from a wikipedia page

How to take hyperlinks from a wikipedia page - python

My current project requires to obtain the summaries of some wikipedia pages. This is really easy to do, but I want to make a general script for that. More specifically I also want to obtain the summaries of hyperlinks. For example I want to get the summary of this page: https://en.wikipedia.org/wiki/Creative_industries (this is easy). Moreover, I would also like to get the summaries of the hyperlinks in Section: Definitions -> 'Advertising', 'Marketing', 'Architecture',...,'visual arts'. My problem is that some of these hyperlinks have different page names. For example, the previous mentioned page has the hyperlink 'Software' (number 6), but I want the summary of the page, which is 'Software Engineering'.
Can someone help me with that? I can find the summaries of the pages with the same hyperlink name, but that is not always the case. So basically I am looking for a way to use (page.links) to only one area of the page.
Thank you in advance

Try using Beautiful soup, this will print all the links with the given prefix
from bs4 import BeautifulSoup
import requests, re
''' Dont forget to install/setup package = 'lxml' '''
url = "your link"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data,'lxml')
tags = soup.find_all('a')
''' This will print every available link'''
for tag in tags:
print(tag.get('href'))
''' this will print links with only prefix as given'''
for link in soup.find_all('a',attrs={'href': re.compile("^{{you prefix here}}")}):
print(link.get('href')

Related

Unable to find tag when data scraping

I am new to Python and I've been working on a program that alerts you when a new item is uploaded to jp.mercari.com (a shopping site). I have the alert part of the program working, but it operates based on the number of items that come up on the search results. When I scrape the website I am unable to find what I am looking for despite being able to locate it when I inspect element on the page. The scraping program looks like this:
from bs4 import BeautifulSoup
import requests
url = "https://jp.mercari.com/search?keyword=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("mer-text")
print(tag)
For more context, this is the website and some of the HTML. I've circled the parts I am trying to find in red:
Does anyone know why I am unable to find what I'm looking for?
Here is another example of the same problem but from a website that is in English:
import requests
url = "https://www.vinted.co.uk/vetements?search_text=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("span")
print(tag)
Again, I can see the part of HTML I want to find when I inspect element but I can't find it when I scrape the website:

Here's what's happening with me: the element you seek (<mer-text>) is being found. However, the output is in Japanese, and Python doesn't know what to do with that. In my browser, it's being translated to English automatically by Google, so that's easier to deal with.

Extract tables in webpages from Python/R or other software

I would like to extract Name, Address of School, Tel, Fax ,Head of School from the website:
https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html
Is it possible to do so?

yes it is possible and there are many tool that help you do that. If you do not want to use a programming language, you can use plenty of tools (but probably have to pay for them, here is an article that might be useful: https://popupsmart.com/blog/web-scraping-tools).
However, If you want to use python, what you should do is to load the page and then parse HTML. Then you should look you desirable element and fetch its data. This article explains the whole process with code: https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
Here is a simple code that shows the tables in the page that you posted, based on the code from above paper:
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html")
soup = BeautifulSoup(page.content, 'html.parser')
# Create top_items as empty list
top_items = []
# Extract and store in top_items according to instructions on the left
products = soup.select('table')
for elem in products:
print(elem)
You can try it out here:
https://colab.research.google.com/drive/13EzFWNBqpkGf4CZvGt5pYySCuW7Ij6I4?usp=sharing

Equivalent regular expression to extract link using Beautiful Soup

I am trying to randomly explore Webscraping through python.I have link of google search results page. I used url lib to extract all the links which are present in the GOOGLE SEARCH RESULT PAGE. From that parsed page of google I am extracting all possible anchor tags with the help of Beautiful Soup library. So now I have lots of links. Among those I want to pick selected links which matches my required pattern.
Example I want to pick all such lines:
This is one of the many links that got parsed. But I want to narrow down the result of the links which are like this
/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl
And among such picks I need to extract only this part
http://avadl.uploadt.com/DL4/Film/
I tried this and this
possible_websites.append(re.findall('/url?q=(\S+)',links))
possible_websites.append(re.findall('/url?q=(\S+^&)',links))
Here's my code
soup = BeautifulSoup(webpage, 'html.parser')
tags = soup('a')
possible_websites=[]
for tag in tags:
links = tag.get('href', None)
possible_websites.append(re.findall('/url?q=(\S+)',links))
I want to use regular expression to extract the required text part. I am using Beautiful soup module to extract the HTML data. In short this is much of a reguar expression problem.

It’s not regex, but I’d use urllib:
from urllib.parse import parse_qs, urlparse
url = urlparse('/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl')
qs = parse_qs(url.query)
print(qs['q'][0])

If you really need a regex, use q=(.*/)& otherwise go with Ry-'s answer, i.e.:
import re
u = "/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl"
m = re.findall("q=(.*/)&", u)
if m:
print(m[0])
# http://avadl.uploadt.com/DL4/Film/
Demo

Python (Selenium/BeautifulSoup) Search Result Dynamic URL

Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
I am able to Inspect the information I need per this image:

Once you have your soup variable with the HTML follow the code below..
import json
data = soup.find('search-result')['data']
print(data)
Output:
Now treat each value like a dict.
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
#This prints the first and last name but you can get others, just get the key like 'date_of_birth' or 'siteId'. You can also assign them to variables.

xpath how to format path

I would like to get #src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from webpage
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/#src')[0]
but I can't. No matter how I format xpath, i tdooesnt work.

In the spirit of #pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print HTMLn.xpath('//img/#data-src')[14]
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print HTMLn.xpath('(//img/#data-src)[14]')[0]
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns.
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content seen by a browser, after a dynamic JavaScript-based update process, is different from the content seen by your request. Your request is not running JS, and is doing no such updates. Different content, different address needed--if the address is static and fragile, at any rate.
Part of the updates here seem to be taking src URIs, which initially point to an "I'm loading!" gif, and replacing them with the "real" src values, which are found in the data-src attribute to begin.
So you need two changes:
a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
to fetch the URIs you want from data-src not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(#alt, "{}")]/#data-src'.format(search_phrase)
print HTMLn.xpath(path)[0]
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.

I used a combination of requests and beautiful soup libraries. They both are wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.