Cleaning scraped URLs in Python

I am writing a web scraper to scrape links from websites. It works, but the output links are not clean: it outputs broken (relative) HTML links and also retrieves the same link multiple times. This is the code:
links = re.findall('<a class=.*?href="?\'?([^"\'>]*)', sourceCode)
for link in links:
    print link
And this is what the output looks like:
/preferences?hl=en&someting
/preferences?hl=en&someting
/history/something
/history/something
/support?pr=something
/support?pr=something
http://www.web1.com/parameters
http://www.web1.com/parameters
http://www.web2.com/parameters
http://www.web2.com/parameters
I tried cleaning out the links that are not full URLs using this regex:
link = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
print link
It cleans the URL but adds square brackets to it. How do I get the URL without the square brackets? And how do I prevent printing the same URL twice or multiple times?
/preferences?hl=en&someting -> []
http://www.web1.com/parameters -> [http://www.web1.com/parameters]

You are getting [] around matched items because re.findall returns a list of matches, not a single string.
link = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
# note the iteration over set(links), not links
for link in set(links):
    print link
Notice that I've wrapped links in set() in the for loop so that you iterate over unique links only; that way you avoid printing the same URL more than once.

Try using
links = re.findall('href="(http.*?)"', sourceCode)
links = sorted(set(links))
for link in links:
    print(link)
This will get only the links that begin with http, remove duplicates, and sort them.
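If you would rather keep the relative links (like /preferences?hl=en&someting) instead of discarding them, a minimal Python 3 sketch using urllib.parse.urljoin could look like this; base_url is a placeholder for whatever page sourceCode was fetched from:
import re
from urllib.parse import urljoin
base_url = 'http://www.example.com/'  # placeholder: the page the HTML came from
links = re.findall('<a class=.*?href="?\'?([^"\'>]*)', sourceCode)
for link in sorted(set(urljoin(base_url, link) for link in links)):
    print(link)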

Related

Traversal Parsing of Text from .HTML

I am trying to scrape text from webpages contained in tags of type titles, headings or paragraphs. When I try the code below I get mixed results depending on where the URL is from. For some sources (e.g. Wikipedia or Reuters) the code works more or less fine and at least finds all the text. For other sources (e.g. Politico, The Economist) I start to miss a lot of the text contained in the webpage.
I am using a traversal algorithm to walk through the tree and check whether each tag is 'of interest'. Maybe find_all(True, recursive=False) is for some reason missing children that subsequently contain the text I am looking for? I'm unsure how to investigate that. Or maybe some sites are blocking the scraping somehow? But then why can I scrape one paragraph from The Economist?
The code below replicates the issue for me: the Wikipedia page (urls[3]) prints as desired, the Politico page (urls[0]) is missing all the text in the article, and The Economist page (urls[1]) is missing all but one paragraph.
from bs4 import BeautifulSoup
import requests

urls = ["https://www.politico.com/news/2022/01/17/democrats-biden-clean-energy-527175",
        "https://www.economist.com/finance-and-economics/the-race-to-power-the-defi-ecosystem-is-on/21807229",
        "https://www.reuters.com/world/significant-damage-reported-tongas-main-island-after-volcanic-eruption-2022-01-17/",
        "https://en.wikipedia.org/wiki/World_War_II"]

# get soup
url = urls[0]  # first two urls don't work, last two do work
response = requests.get(url)
soup = BeautifulSoup(response.text, features="html.parser")

# tags with text that i want to print
tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]

def read(soup):
    for tag in soup.find_all(True, recursive=False):
        if (tag.name in tags_of_interest):
            print(tag.name + ": ", tag.text.strip())
        for child in tag.find_all(True, recursive=False):
            read(child)

# call the function
read(soup)
BeautifulSoup's find_all() will return a list of tags in the order of a DFT (depth first traversal) as per this answer here. This allows easy access to the desired elements.
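Since find_all() already visits every tag in document order, a minimal sketch that skips the manual recursion (assuming the same soup and tags_of_interest as in the question) could be:
# passing a list of tag names lets BeautifulSoup do the depth-first walk for you
for tag in soup.find_all(tags_of_interest):
    print(tag.name + ": ", tag.text.strip())
If Politico or The Economist still come back mostly empty with this version, the missing text is probably not present in response.text at all (e.g. it is rendered by JavaScript or the request is being blocked), which no traversal strategy can recover.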

How to open a list of links and scrape the text with Selenium

I am new to programming in Python and I want to write code to scrape text from articles on Reuters using Selenium. I'm trying to open the article links and then get the full text from each article, but it doesn't work. I would be glad if somebody could help me.
article_links1 = []
for link in driver.find_elements_by_xpath("/html/body/div[4]/section[2]/div/div[1]/div[4]/div/div[3]/div[*]/div/h3/a"):
    links = link.get_attribute("href")
    article_links1.append(links)
article_links = article_links1[:5]
article_links
This is a shortened list of the articles, so it doesn't take that long to scrape for testing. It contains 5 links; this is the output:
['https://www.reuters.com/article/idUSKCN2DM21B',
'https://www.reuters.com/article/idUSL2N2NS20U',
'https://www.reuters.com/article/idUSKCN2DM20N',
'https://www.reuters.com/article/idUSKCN2DM21W',
'https://www.reuters.com/article/idUSL3N2NS2F7']
Then I tried to iterate over the links and scrape the text out of the paragraphs, but it doesn't work.
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
    full_text.append(article_text.text)
full_text
The output is only the empty list:
[]
There are a couple issues with your current code. The first one is an easy fix. You need to indent your second for loop, so that it's within the for loop that is iterating through each article. Otherwise, you won't be adding anything to the full_text list until it gets to the last article. It should look like this:
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
    for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
        full_text.append(article_text.text)
The second problem lies within your XPath. XPaths can be very long when they're generated automatically by a browser. (I'd suggest learning CSS selectors, which are pretty concise. A good place to learn CSS selectors is a site called CSS Diner.)
I've changed your find_elements_by_xpath() function to find_elements_by_css_selector(). You can see the example below.
for article_text in driver.find_elements_by_css_selector("article p"):
    full_text.append(article_text.text)
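Putting both fixes together, a minimal sketch (assuming driver and article_links are set up as in the question; the fixed five-second sleep is kept from the original, though an explicit wait would be more robust):
import time
full_text = []
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)  # crude wait for the page to load
    for article_text in driver.find_elements_by_css_selector("article p"):
        full_text.append(article_text.text)
print(full_text)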

Python (Selenium/BeautifulSoup) Search Result Dynamic URL

Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
Using the browser's Inspect tool, I can see that the information you need is in the data attribute of the <search-result> element.
Once you have your soup variable with the HTML, follow the code below.
import json
data = soup.find('search-result')['data']
print(data)
Output:
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Now treat each value like a dict.
Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name, but you can get other fields too: just use the key, like 'date_of_birth' or 'siteId'. You can also assign them to variables.
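Putting this together with the Selenium loop from the question, a minimal sketch (assuming driver and detail_urls are set up as in the question; records is just an illustrative name) could be:
import json
from bs4 import BeautifulSoup
records = []
for url in detail_urls:
    driver.get(url)
    # note: a WebDriverWait for the <search-result> element would be more robust
    # than assuming it is already present in page_source
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    tag = soup.find('search-result')
    if tag and tag.has_attr('data'):
        records.append(json.loads(tag['data']))
for info in records:
    print(info['first_name'], info['last_name'])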

Scraping pages with multiple parts with python

I want to scrape this site for a complete list of the teammates. I know how to do that with BeautifulSoup for the first page, but the results are split across a lot of pages. Is there a way to scrape all of the parts?
Thank you!
https://www.transfermarkt.co.uk/yvon-mvogo/profil/spieler/147051
https://www.transfermarkt.co.uk/steve-von-bergen/profil/spieler/4793
https://www.transfermarkt.co.uk/scott-sutter/profil/spieler/34520
Above are some links to the player profiles. You can open the page in BeautifulSoup and parse it to get all the links in it. Then write a regular expression to filter out only the links that satisfy the above pattern, and write another function to extract information from the profile pages:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, 'html.parser')
for a in soup.find_all('a', href=True):
    m = re.search('/[a-z\-]+/profil/spieler/[0-9]+', a['href'])
    if m:
        found = m.group(0)
        print(found)
Output
/michael-frey/profil/spieler/147043
/yvon-mvogo/profil/spieler/147051
/scott-sutter/profil/spieler/34520
/leonardo-bertone/profil/spieler/194975
/steve-von-bergen/profil/spieler/4793
/alain-nef/profil/spieler/4945
/raphael-nuzzolo/profil/spieler/32574
/marco-wolfli/profil/spieler/4860
/moreno-costanzo/profil/spieler/41207
/jan-lecjaks/profil/spieler/62854
/alain-rochat/profil/spieler/4843
/christoph-spycher/profil/spieler/2871
/gonzalo-zarate/profil/spieler/52731
/christian-schneuwly/profil/spieler/52556
/yuya-kubo/profil/spieler/186260
/alexander-farnerud/profil/spieler/10255
/salim-khelifi/profil/spieler/147049
/alexander-gerndt/profil/spieler/45881
/adrian-winter/profil/spieler/59681
/victor-palsson/profil/spieler/97241
/milan-gajic/profil/spieler/46928
/dusan-veskovac/profil/spieler/28705
/marco-burki/profil/spieler/172192
/elsad-zverotic/profil/spieler/25542
/pa-modou/profil/spieler/66449
/yoric-ravet/profil/spieler/82461
You can loop through all the links and call a function that extracts the information that you require from the profile pages. Hope this helps
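A minimal sketch of that loop, with a few assumptions: requests is used for fetching, a browser-like User-Agent header is sent because the site tends to reject the default one, and extract_profile is a hypothetical parsing function you would still have to write:
import re
import requests
from bs4 import BeautifulSoup
base = 'https://www.transfermarkt.co.uk'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: site may reject the default requests agent
html_page = requests.get(base + '/yvon-mvogo/profil/spieler/147051', headers=headers).text
soup = BeautifulSoup(html_page, 'html.parser')
profile_paths = set()
for a in soup.find_all('a', href=True):
    m = re.search(r'/[a-z\-]+/profil/spieler/[0-9]+', a['href'])
    if m:
        profile_paths.add(m.group(0))
for path in profile_paths:
    profile_html = requests.get(base + path, headers=headers).text
    # extract_profile(profile_html)  # hypothetical: pull out whatever fields you need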
Use this link, which I got from inspecting the pagination buttons:
https://www.transfermarkt.co.uk/michael-frey/gemeinsameSpiele/spieler/147043/ajax/yw2/page/1
You can change the number at the end to get each page.
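A minimal sketch of walking those AJAX pages (assumptions: the page numbers run upwards from 1, a page with no profile links means we are past the end, and a browser-like User-Agent header is needed):
import requests
from bs4 import BeautifulSoup
base = 'https://www.transfermarkt.co.uk/michael-frey/gemeinsameSpiele/spieler/147043/ajax/yw2/page/{}'
headers = {'User-Agent': 'Mozilla/5.0'}
for page in range(1, 51):  # safety cap on the number of pages; adjust as needed
    soup = BeautifulSoup(requests.get(base.format(page), headers=headers).text, 'html.parser')
    links = soup.select('a[href*="/profil/spieler/"]')
    if not links:
        break  # no teammates on this page, so we are done
    for a in links:
        print(a['href'])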

How do I get a list of URLs using lxml's start-with() if they are relative links?

I'm looking to make a list of the URLs that contain "page.php". Do I parse all the links and then loop through them, or is there a better way?
The URLs look like this:
<a href="../path/page.php?something=somewhere&yes=no">
And I tried this:
resumes = doc.xpath('//a[starts-with(@href, "../path/page.php")]/text()')
Is this correct or should I be using the absolute URL with starts-with()?
I'd do it this way, provided you want all links containing page.php
links = doc.findall('.//a')  # Finds all links
resume = [res for res in links if 'page.php' in (res.get('href') or '')]
First, I get all the links on the page, then I make a list of all the links with page.php in them.
This is untested (I don't have all your code, so I can't test it as quickly as usual) but should still work.
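If you would rather stay with XPath, a sketch that sidesteps the relative-versus-absolute question by matching anywhere inside the href (untested against your page; it returns the hrefs rather than the link text) could be:
resumes = doc.xpath('//a[contains(@href, "page.php")]/@href')
for href in resumes:
    print(href)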
