Scraping pages with multiple parts with Python

I want to scrape this site for a complete list of the teammates. I know how to do that with BeautifulSoup for the first page, but the results are spread across many pages. Is there a way to scrape all of the pages?
Thank you!

https://www.transfermarkt.co.uk/yvon-mvogo/profil/spieler/147051
https://www.transfermarkt.co.uk/steve-von-bergen/profil/spieler/4793
https://www.transfermarkt.co.uk/scott-sutter/profil/spieler/34520
Above are some links to the player profiles. You can open the page with BeautifulSoup and parse it to get all the links in it, then use a regular expression to filter out only the links that match the pattern above, and write another function to extract information from the profile pages:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, 'html.parser')
for a in soup.find_all('a', href=True):
    m = re.search(r'/[a-z\-]+/profil/spieler/[0-9]+', a['href'])
    if m:
        found = m.group(0)
        print(found)
Output
/michael-frey/profil/spieler/147043
/yvon-mvogo/profil/spieler/147051
/scott-sutter/profil/spieler/34520
/leonardo-bertone/profil/spieler/194975
/steve-von-bergen/profil/spieler/4793
/alain-nef/profil/spieler/4945
/raphael-nuzzolo/profil/spieler/32574
/marco-wolfli/profil/spieler/4860
/moreno-costanzo/profil/spieler/41207
/jan-lecjaks/profil/spieler/62854
/alain-rochat/profil/spieler/4843
/christoph-spycher/profil/spieler/2871
/gonzalo-zarate/profil/spieler/52731
/christian-schneuwly/profil/spieler/52556
/yuya-kubo/profil/spieler/186260
/alexander-farnerud/profil/spieler/10255
/salim-khelifi/profil/spieler/147049
/alexander-gerndt/profil/spieler/45881
/adrian-winter/profil/spieler/59681
/victor-palsson/profil/spieler/97241
/milan-gajic/profil/spieler/46928
/dusan-veskovac/profil/spieler/28705
/marco-burki/profil/spieler/172192
/elsad-zverotic/profil/spieler/25542
/pa-modou/profil/spieler/66449
/yoric-ravet/profil/spieler/82461
You can then loop through all the links and call a function that extracts the information you require from each profile page, as in the sketch below. Hope this helps!
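For illustration, a minimal sketch of that loop. The h1 selector for the player name and the User-Agent header are assumptions, and profile_links is assumed to be the list of paths collected by the regex above:

import requests
from bs4 import BeautifulSoup

base = 'https://www.transfermarkt.co.uk'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: the site may reject the default user agent

def scrape_profile(path):
    # Fetch a single profile page and pull out whatever fields you need.
    resp = requests.get(base + path, headers=headers)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # Assumption: the page's <h1> holds the player name
    name = soup.find('h1')
    return name.get_text(strip=True) if name else None

for path in profile_links:  # profile_links: paths collected by the regex above
    print(scrape_profile(path))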
Use this link; I got it from inspecting the pagination buttons:
https://www.transfermarkt.co.uk/michael-frey/gemeinsameSpiele/spieler/147043/ajax/yw2/page/1
You can change the number at the end to fetch each page; a sketch of walking the pages follows.
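A hedged sketch that walks the paginated AJAX endpoint until a page comes back without profile links. The stopping condition and the safety cap are assumptions, not verified behaviour of the site:

import re
import requests
from bs4 import BeautifulSoup

# Template for the paginated AJAX endpoint found by inspecting the buttons
url_template = ('https://www.transfermarkt.co.uk/michael-frey/'
                'gemeinsameSpiele/spieler/147043/ajax/yw2/page/{}')
headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: the site may reject the default user agent

teammates = []
page = 1
max_pages = 50  # safety cap (assumption), in case the site never returns an empty page
while page <= max_pages:
    html = requests.get(url_template.format(page), headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True)
             if re.search(r'/[a-z\-]+/profil/spieler/[0-9]+', a['href'])]
    if not links:  # assumption: a page with no profile links marks the end
        break
    teammates.extend(links)
    page += 1

print(len(set(teammates)), 'unique teammate links')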

Related

Python Beautiful Soup (select only the one class with the same name)

I am using Beautiful Soup to parse elements of an email, and I have successfully been able to extract the links from a button in the email. However, the class name on the button appears twice in the email HTML, so two links get extracted and printed. I only need the first link, i.e. the first element with that class name.
This is my code:
soup = BeautifulSoup(msg.html, 'html.parser')
for link in soup('a', class_='mcnButton', href=True):
    print(link['href'])
The 'mcnButton' class matches two HTML buttons within the email, each containing a separate link. I only need the first 'mcnButton' element and the link it contains.
The above code prints two links (again, I only need the first):
https://leemarpet.us10.list-manage.com/track/XXXXXXX1
https://leemarpet.us10.list-manage.com/track/XXXXXXX2
I figured there should be a way to index and separately access the first reference to the class and link. Any assistance would be greatly appreciated, Thanks!
I tried select_one, find, and attempts to index the class, which unfortunately resulted in a syntax error.
To find only the first element matching your pattern, use .find():
soup.find('a', class_='mcnButton', href=True)
and to get its href:
soup.find('a', class_='mcnButton', href=True).get('href')
For more information, check the docs.
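A small self-contained demonstration of the difference, with the HTML inlined (the example.com URLs are hypothetical):

from bs4 import BeautifulSoup

html = '''
<a class="mcnButton" href="https://example.com/first">Button 1</a>
<a class="mcnButton" href="https://example.com/second">Button 2</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() stops at the first match, so only one link comes back
print(soup.find('a', class_='mcnButton', href=True).get('href'))
# -> https://example.com/first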

How to open a list of links and scrape the text with Selenium

I am new to programming in Python and I want to write code to scrape article text from Reuters using Selenium. I'm trying to open the article links and then get the full text from each article, but it doesn't work. I would be glad if somebody could help me.
article_links1 = []
for link in driver.find_elements_by_xpath("/html/body/div[4]/section[2]/div/div[1]/div[4]/div/div[3]/div[*]/div/h3/a"):
    links = link.get_attribute("href")
    article_links1.append(links)

article_links = article_links1[:5]
article_links
This is a shortened list of the articles so that testing doesn't take too long. It contains 5 links; this is the output:
['https://www.reuters.com/article/idUSKCN2DM21B',
'https://www.reuters.com/article/idUSL2N2NS20U',
'https://www.reuters.com/article/idUSKCN2DM20N',
'https://www.reuters.com/article/idUSKCN2DM21W',
'https://www.reuters.com/article/idUSL3N2NS2F7']
Then I tried to iterate over the links and scrape the text out of the paragraphs, but it doesn't work.
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)

for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
    full_text.append(article_text.text)

full_text
The output is only the empty list:
[]
There are a couple of issues with your current code. The first one is an easy fix: you need to indent your second for loop so that it's within the for loop that iterates through each article. Otherwise, you won't be adding anything to the full_text list until it gets to the last article. It should look like this:
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
    for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
        full_text.append(article_text.text)
The second problem lies within your XPath. XPath expressions can be very long when generated automatically by a browser. (I'd suggest learning CSS selectors, which are pretty concise. A good place to learn them is the CSS Diner.)
I've changed your find_elements_by_xpath() function to find_elements_by_css_selector(). You can see the example below.
for article_text in driver.find_elements_by_css_selector("article p"):
    full_text.append(article_text.text)
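Putting both fixes together, the whole loop might look like this sketch. Note that full_text is initialised before the loop; the fixed 5-second sleep is kept from the question, though an explicit wait would be more robust:

import time

full_text = []  # initialise once, before the loop
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)  # crude wait for the page to render; WebDriverWait would be sturdier
    # the inner loop is indented, so every article's paragraphs are collected
    for article_text in driver.find_elements_by_css_selector("article p"):
        full_text.append(article_text.text)

print(full_text)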

Web scraping the links from a table

I want to web scrape the links and their respective texts from a table. I plan to use regex to accomplish this.
So let's say in this page I have multiple text_i tags. I want to get all the text_i's into a list and then get all the href's into a separate list.
I have:
import re
import requests

web = requests.get(url)
web_text = web.text
texts = re.findall(r'<table .*><a .*>(.*)</a></table>', web_text)
The regex finds all the anchor tags, of whatever class, inside an HTML table of whatever class, and returns the texts, correct? This is taking an extraordinarily long time. Is this the right way to do it?
Also, how do I go about getting the href URLs now?
I suggest you use Beautiful Soup to parse the HTML text of the table.
Adapted from Beautiful Soup's documentation, you could do, for example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(web_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
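Since the question asks for both the link texts and the hrefs, here is a sketch that collects both, restricted to anchors inside tables (that every link of interest sits in a <table> is an assumption taken from the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(web_text, 'html.parser')

texts, hrefs = [], []
for table in soup.find_all('table'):
    for a in table.find_all('a', href=True):
        texts.append(a.get_text(strip=True))
        hrefs.append(a['href'])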

What is the most efficient way to get a specific link using Beautiful Soup in Python 3.0?

I am currently taking the Python specialization on Coursera. I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract a URL at a position given by user input, open that link, and repeat for some number of iterations, with all links identified through their anchor tags.
While I am able to program this using lists, I am wondering if there is any simpler way of doing it without using a list or dictionary.
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList = list()
loc = ''
count = 0
for tag in tags:
    loc = tag.get('href', None)
    nameList.append(loc)
url = nameList[pos-1]
In the above code, you will notice that after locating the links via the 'a' tag and 'href', I can't help but create a list called nameList just to locate the link at a given position. As this is inefficient, I would like to know whether I could locate the URL directly without using a list. Thanks in advance!
The easiest way is to get the element out of the tags list and then extract its href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use the soup.select_one() method to query the :nth-of-type element:
soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from the pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to CSS :nth-of-type pseudo-class, as it always selects only one element, while CSS selector may select many elements at once.
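For example, to go from a 1-based position straight to the href (a small sketch; select_one returns None when nothing matches):

# pos is the 1-based position supplied by the user
tag = soup.select_one('a:nth-of-type({})'.format(pos))
loc = tag.get('href') if tag is not None else None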
And if you're looking for "the most efficient way", then you need to look at lxml:
from lxml.html import fromstring

tree = fromstring(r.content)  # r is the requests response for the page
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]
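For completeness, a runnable sketch of the lxml variant, using requests to fetch the page from the question:

import requests
from lxml.html import fromstring

r = requests.get('http://py4e-data.dr-chuck.net/known_by_Fikret.html')
tree = fromstring(r.content)
pos = 3  # example: take the third anchor, 1-based
print(tree.xpath('(//a)[{}]/@href'.format(pos))[0])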

Cleaning scraped url in python

I am writing a web scraper to scrape links from websites. It works, but the output links are not clean: it outputs broken HTML links and also retrieves the same link more than once. This is the code:
links = re.findall('<a class=.*?href="?\'?([^"\'>]*)', sourceCode)
for link in links:
    print link
And this is what the output looks like:
/preferences?hl=en&someting
/preferences?hl=en&someting
/history/something
/history/something
/support?pr=something
/support?pr=something
http://www.web1.com/parameters
http://www.web1.com/parameters
http://www.web2.com/parameters
http://www.web2.com/parameters
I tried cleaning out links that are not full URLs using this regex:
link = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
print link
It cleans the URL but adds square brackets to it. How do I get the result without the square brackets? And how should I prevent printing the same URL twice or multiple times?
/preferences?hl=en&someting -> []
http://www.web1.com/parameters -> [http://www.web1.com/parameters]
You are getting [] around matched items because re.findall returns a list of matches:
links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', sourceCode)

# pay attention: iterate over set(links), not links
for link in set(links):
    print link
Notice that I've added a set around links in the for loop so that only unique links are kept; that way you prevent printing the same URL twice.
Try using
links = re.findall('href="(http.*?)"', sourceCode)
links = sorted(set(links))
for link in links:
    print(link)
This will get only links that begin with http, remove duplicates, and sort them.
