I have written some code that helps me scrape websites. It has worked well on some sites, but I am currently running into an issue.
The collectData() function collects data from a site and appends it to 'dataList'. From this dataList I can create a CSV file to export the data.
The issue I am having right now is that the function appends multiple whitespace and \n characters to my list. The output looks like this (the excessive whitespace is not shown here):
dataList = ['\n 2.500.000 ']
Does anyone know what could cause this? As I mentioned, there are some websites where the function works fine.
Thank you!
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape():
    dataList = []
    pageNr = range(0, 1)
    for page in pageNr:
        pageUrl = 'https://www.example.com/site:{}'.format(page)
        print(pageUrl)

    def getUrl(pageUrl):
        linkList = []
        openUrl = urlopen(pageUrl)
        soup = BeautifulSoup(openUrl, 'lxml')
        links = soup.find_all('a', class_="ellipsis")
        for link in links:
            linkNew = link.get('href')
            linkList.append(linkNew)
        #print(linkList)
        return linkList

    anzList = getUrl(pageUrl)
    length = len(anzList)
    print(length)

    anzLinks = []
    for i in range(length):
        anzLinks.append('https://www.example.com' + anzList[i])
    print(anzLinks)

    def collectData():
        for link in anzLinks:
            openAnz = urlopen(link)
            soup = BeautifulSoup(openAnz, 'lxml')
            try:
                kaufpreisSuche = soup.find('h2')
                kaufpreis = kaufpreisSuche.text
                dataList.append(kaufpreis)
                print(kaufpreis)
            except AttributeError:
                kaufpreis = None
                dataList.append(kaufpreis)
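For what it's worth, the stray '\n' and repeated spaces almost certainly come from the page's own markup: .text returns the element's text verbatim, including whatever whitespace the site's HTML happens to contain around the number, which varies from site to site. A minimal sketch of the usual fix, using a hypothetical snippet that stands in for the real page:

from bs4 import BeautifulSoup

# Hypothetical markup that reproduces the '\n        2.500.000 ' output
html = "<h2>\n        2.500.000 </h2>"
soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) trims the surrounding whitespace from each text node
kaufpreis = soup.find('h2').get_text(strip=True)
print(repr(kaufpreis))  # '2.500.000'

Alternatively, calling .strip() on the result of kaufpreisSuche.text works just as well for a single string.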
Python code:
import string
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/'
initial = list(string.ascii_lowercase)
initial_url = [url + i for i in initial]
html_initial = [urllib.request.urlopen(i).read() for i in initial_url]
soup_initial = [BeautifulSoup(i, 'html.parser') for i in html_initial]
tags_initial = [i('a') for i in soup_initial]
print(tags_initial[0][50])
Results example:
<a href="/players/a/abdursh01.html">Shareef Abdur-Rahim</a>
From the example above, I want to extract the name of the player, which is 'Shareef Abdur-Rahim', but I want to do it for all of the tags_initial lists.
Does anyone have an idea?
Could you modify your post by adding your code so that we can help you better?
Maybe this could help you:
name = soup.findAll(YOUR_SELECTOR)[0].string
UPDATE
import re
import string
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://www.basketball-reference.com/players/'

# Alphabet
initial = list(string.ascii_lowercase)
datas = []

# URLs
urls = [url + i for i in initial]

for url in urls:
    # Soup object
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # Player links
    url_links = soup.findAll("a", href=re.compile("players"))
    for link in url_links:
        # Player name
        datas.append(link.string)

print("datas : ", datas)
Then, "datas" contains all the names of the players, but I advise you to do a little processing afterwards to remove some erroneous information like "..." or perhaps duplicates
There are probably better ways, but I'd do it like this:
html = "a href=\"/teams/LAL/2021.html\">Los Angeles Lakers</a"
index = html.find("a href")
index = html.find(">", index) + 1
index_end = html.find("<", index)
print(html[index:index_end])
If you're using a scraper library it probably has a similar function built-in.
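For comparison, a parser does the same extraction in one call; a sketch with BeautifulSoup on the snippet above:

from bs4 import BeautifulSoup

html = '<a href="/teams/LAL/2021.html">Los Angeles Lakers</a>'
print(BeautifulSoup(html, 'html.parser').a.string)  # Los Angeles Lakers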
I am trying to get all emails from a specific page and separate them into an individual variable, or even better a dictionary. Here is some code.
import requests
import re
import json
from bs4 import BeautifulSoup
page = "http://www.example.net"
info = requests.get(page)
if info.status_code == 200:
    print("Page accessed")
else:
    print("Error accessing page")
code = info.content
soup = BeautifulSoup(code, 'lxml')
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
print(allEmails)
sep = ","
allEmailsStr = str(allEmails)
print(type(allEmails))
print(type(allEmailsStr))
j = allEmailsStr.split(sep, 1)[0]
print(j)
Excuse the poor variable names; I put this together quickly so it would run by itself. The output from the example website would be something like
[k, kolyma, location, balkans]
So if I ran the program it would return only
[k
But if I wanted it to return every email on there individually, how would I do that?
To get just the email str you can try:
emails = []
for email_link in allEmails:
    emails.append(email_link.get("href").replace('mailto:', ''))
print(emails)
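Since you mentioned a dictionary, a possible variant (assuming the links' visible texts are distinct enough to serve as keys):

# Map each link's visible text to its address
email_dict = {link.text.strip(): link.get("href").replace('mailto:', '')
              for link in allEmails}
print(email_dict)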
Based on your expected output, you can use the unwrap function of BeautifulSoup
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
for Email in allEmails:
    print(Email.unwrap())  # this will print the whole element along with the tag
    # k
I am a beginner in web scraping, and I need help with this problem.
The website allrecipes.com lets you find recipes based on a search, which in this case is 'pie':
link to the html file:
'view-source:https://www.allrecipes.com/search/results/?wt=pie&sort=re'
(right click-> view page source)
I want to create a program that takes an input, searches it on allrecipes, and returns a list of tuples for the first five recipes, with data such as the time it takes to make, the serving yield, the ingredients, and more.
This is my program so far:
import requests
from bs4 import BeautifulSoup

def searchdata():
    inp = input('what recipe would you like to search')
    url = 'http://www.allrecipes.com/search/results/?wt=' + str(inp) + '&sort=re'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []
    # fill in code for finding top 3 or five links
    for i in range(3):
        a = requests.get(links[i])
        soupa = BeautifulSoup(a.text, 'html.parser')
        # fill in code to find name, ingredients, time, and serving size with data from soupa
        names = []
        time = []
        servings = []
        ratings = []
        ingrediants = []

searchdata()
Yes, I know my code is very messy, but what should I fill in in the two code fill-in areas?
Thanks
After searching for the recipe, you have to get the link of each recipe and then send a request to each of those links, because the information you're looking for is not available on the search page. That would not look clean without OOP, so here's the class I wrote that does what you want.
import requests
from time import sleep
from bs4 import BeautifulSoup

class Scraper:
    links = []
    names = []

    def get_url(self, url):
        url = requests.get(url)
        self.soup = BeautifulSoup(url.content, 'html.parser')

    def print_info(self, name):
        self.get_url(f'https://www.allrecipes.com/search/results/?wt={name}&sort=re')
        if self.soup.find('span', class_='subtext').text.strip()[0] == '0':
            print(f'No recipes found for {name}')
            return
        results = self.soup.find('section', id='fixedGridSection')
        articles = results.find_all('article')
        texts = []
        for article in articles:
            txt = article.find('h3', class_='fixed-recipe-card__h3')
            if txt:
                if len(texts) < 5:
                    texts.append(txt)
                else:
                    break
        self.links = [txt.a['href'] for txt in texts]
        self.names = [txt.a.span.text for txt in texts]
        self.get_data()

    def get_data(self):
        for i, link in enumerate(self.links):
            self.get_url(link)
            print('-' * 4 + self.names[i] + '-' * 4)
            info_names = [div.text.strip() for div in self.soup.find_all(
                'div', class_='recipe-meta-item-header')]
            ingredient_spans = self.soup.find_all('span', class_='ingredients-item-name')
            ingredients = [span.text.strip() for span in ingredient_spans]
            # j avoids shadowing the outer loop index i
            for j, div in enumerate(self.soup.find_all('div', class_='recipe-meta-item-body')):
                print(info_names[j].capitalize(), div.text.strip())
            print()
            print('Ingredients'.center(len(ingredients[0]), ' '))
            print('\n'.join(ingredients))
            print()
            print('*' * 50, end='\n\n')

chrome = Scraper()
chrome.print_info(input('What recipe would you like to search: '))
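A note on the design: the class stores links and names on the instance so that get_data() can pair each recipe page it fetches with the name already captured from the search results page.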
I want to create a function that returns a variable I can write to a CSV.
If I write:
from makesoup import make_soup
def get_links(soupbowl):
    linkname = ""
    for boot in soupbowl.findAll('tbody'):
        for record in boot.findAll('tr', {"row0", "row1"}):
            for link in record.find_all('a'):
                if link.has_attr('href'):
                    linkname = linkname + "\n" + (link.attrs['href'])[1:]
    print(linkname)

soup = make_soup("https://www.footballdb.com/teams/index.html")
pyt = get_links(soup)
print(pyt)
It prints what I want (all the links on the page) in the function, and None with print(pyt).
Instead of print(linkname) in the function, I want to return linkname.
But when I do, I only get the first link on the page. Is there a way to pass all the links to the variable pyt, which is outside of the function?
Thank you in advance
Try the following to get all the links in one go (most likely your return ended up inside the loop, so the function exited after the first link):
from makesoup import make_soup
def get_links(soupbowl):
    links_found = []
    for boot in soupbowl.findAll('tbody'):
        for record in boot.findAll('tr', {"row0", "row1"}):
            for link in record.find_all('a'):
                if link.has_attr('href'):
                    links_found.append(link.attrs['href'][1:])
    return links_found

soup = make_soup("https://www.footballdb.com/teams/index.html")
pyt = get_links(soup)
print(pyt)
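Since the goal was writing the result to a CSV, a minimal sketch of that last step (assuming pyt holds the list returned above):

import csv

# One link per row
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for link in pyt:
        writer.writerow([link])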
Or use yield, to return them one by one - while you process the output for something else:
from makesoup import make_soup

def get_links(soupbowl):
    for boot in soupbowl.findAll('tbody'):
        for record in boot.findAll('tr', {"row0", "row1"}):
            for link in record.find_all('a'):
                if link.has_attr('href'):
                    yield link.attrs['href'][1:]

soup = make_soup("https://www.footballdb.com/teams/index.html")
pyt = get_links(soup)
for link in pyt:
    do_something()  # placeholder for whatever processing you need per link
from makesoup import make_soup

def get_links(soupbowl):
    links = []
    for boot in soupbowl.findAll('tbody'):
        for record in boot.findAll('tr', {"row0", "row1"}):
            for link in record.find_all('a'):
                if link.has_attr('href'):
                    linkname = link.attrs['href'][1:]
                    links.append(linkname)
    return links

soup = make_soup("https://www.footballdb.com/teams/index.html")
pyt = get_links(soup)
print(pyt)
import requests

focus_Search = raw_input("Focus Search ")
url = "https://www.google.com/search?q="
res = requests.get(url + focus_Search)
print("You Just Searched")
res_String = res.text
#Now I must get ALL the sections of code that start with "<a href" and end with "/a>"
I'm trying to scrape all the links from a Google search page. I could extract each link one at a time, but I'm sure there's a better way to do it.
This creates a list of all the links on the search page, reusing some of your code, without getting into BeautifulSoup:
import requests
import lxml.html

focus_Search = input("Focus Search ")
url = "https://www.google.com/search?q="
res = requests.get(url + focus_Search).content
dom = lxml.html.fromstring(res)
# XPath borrowed from cheekybastard in the link below:
# http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup
links = [x for x in dom.xpath('//a/@href')]
print(links)
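For reference, the same link extraction with BeautifulSoup would be a short sketch along these lines:

from bs4 import BeautifulSoup

soup = BeautifulSoup(res, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]  # only anchors with an href
print(links)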