I'm fairly new to web scraping in Python, and after reading most of the tutorials on the topic online I decided to give it a shot. I finally got one site working, but the output is not formatted properly.
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")
for div in soup.find_all('h2'):  # prints the name of the food
    print(div.text)
for a in soup.find_all('span', {'class': 'amount'}):  # prints the price of the food
    print(a.text)
Output
I want the name of each food to be printed side by side with its corresponding price, separated by a "-" ... Would appreciate any help given, thanks!
Edit: After @Reblochon Masque's comments below, I've run into another problem. As you can see, there is a $0.00, which is a value from the built-in shopping cart on the website. How would I exclude this as an outlier and continue moving down the loop, while ensuring that the other prices "move up" to correspond to the correct food?
Best practice is to use the zip function in the for loop, but it can also be done by indexing the two lists, as shown here.
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for index in range(len(names)):
    print('{} - {}'.format(names[index].text, rest[index].text))
You could maybe zip the two results:
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for div, a in zip(names, rest):
    print('{} - {}'.format(div.text, a.text))
    # print(f"{div.text} - {a.text}")  # for Python >= 3.6
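To address the edit about the stray $0.00 cart amount: one approach is to drop zero amounts from the price list before zipping, so the remaining prices line up with the names again. A minimal sketch, using hypothetical values standing in for the scraped `.text` results (I can't re-query the live page here):

```python
# Hypothetical scraped text values; the first price is the
# shopping-cart total, not a food price.
names = ["Lunch Box A", "Lunch Box B", "Lunch Box C"]
prices = ["$0.00", "$5.50", "$6.00", "$6.50"]

# Filter out zero amounts before pairing, so each remaining
# price corresponds to the correct food.
prices = [p for p in prices if p != "$0.00"]

for name, price in zip(names, prices):
    print('{} - {}'.format(name, price))
```

If the cart amount can appear anywhere in the list (not just first), the same filter still works, as long as no real food is priced at $0.00.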
Related
I have the below code, which follows random Wikipedia links and prints the titles of the articles. I am trying to limit it to 10 results rather than infinite results, but I am finding it difficult to do. Can anybody help, please?
import requests
from bs4 import BeautifulSoup
import random
def scrape_wiki_article(article_url):
    response = requests.get(url=article_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find(id="firstHeading")
    print(title.text)

    # Get all the links
    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        # We are only interested in other wiki articles, so look for the /wiki/ prefix
        if link['href'].find("/wiki/") == -1:  # -1 is returned by .find if the substring is not found
            continue
        # Use this link to scrape
        linkToScrape = link
        break

    scrape_wiki_article("https://en.wikipedia.org" + linkToScrape['href'])
scrape_wiki_article("https://en.wikipedia.org/wiki/Web_scraping")
You can start by filtering the allLinks list to only include links with the /wiki/ prefix. Once you do that, you can truncate the list by doing something like
allLinks = allLinks[:10]
This way you would search up to 10 Wiki links.
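As a sketch of that suggestion, here is the filter-then-truncate step on its own, using made-up href values in place of the live page's link['href'] results:

```python
# Hypothetical href values standing in for link['href'] from the page.
all_hrefs = ["/wiki/Python", "#cite_note-1", "/wiki/Web_scraping",
             "/w/index.php", "/wiki/HTML"] * 5

# Keep only internal article links, then cap the list at 10.
wiki_hrefs = [h for h in all_hrefs if h.find("/wiki/") != -1]
wiki_hrefs = wiki_hrefs[:10]

print(len(wiki_hrefs))  # prints 10
```

The same two lines can be dropped into the function after the find_all call, before the shuffle.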
Using beautifulsoup I'm able to scrape a web page with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.acbbroker.it/soci_dettaglio.php?r=3")
page
soup = BeautifulSoup(page.content, 'html.parser')
test = soup.find(id="paginainterna-content")
test_items = test.find_all(class_="entry-content")
tonight = test_items[0]
names = []
for x in tonight.find_all('a', itemprop="url"):
    names.append(str(x))
print(names)
but I'm not able to clean the results and obtain only the content inside the <a> tag (removing also the href).
Here is a small snap of my result:
'A&B; Insurance e Reinsurance Brokers Srl', 'A.B.A. BROKERS SRL', 'ABC SRL BROKER E CONSULENTI DI ASSI.NE', 'AEGIS INTERMEDIA SAS',
What is the proper way to handle this kind of data and obtain a clean result?
Thank you
If you want only the text from a tag, use the get_text() method:
for x in tonight.find_all('a', itemprop="url"):
    names.append(x.get_text())
print(names)
Better, with a list comprehension (this is also faster):
names = [x.get_text() for x in tonight.find_all('a', itemprop='url')]
I don't know exactly what output you want, but you can get the text by changing this line to:
names.append(str(x.get_text()))
I am trying to scrape the BBC football results website to get teams, shots, goals, cards and incidents.
I am writing the script in Python and using the Beautiful Soup package. The code provided only retrieves the first entry of the table of incidents. When the incidents table is printed to screen, the full table with all the data is there.
The table I am scraping from is stored in incidents:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.bbc.co.uk/sport/football/result/partial/EFBO815155?teamview=false'
inner_page = urllib2.urlopen(url).read()
soupb = BeautifulSoup(inner_page, 'lxml')
for incidents in soupb.find_all('table', class_="incidents-table"):
    print incidents.prettify()
    home_inc_tag = incidents.find('td', class_='incident-player-home')
    home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
    type_inc_tag = incidents.find('td', 'span', class_='incident-type goal')
    type_inc = type_inc_tag and ''.join(type_inc_tag.stripped_strings)
    time_inc_tag = incidents.find('td', class_='incident-time')
    time_inc = time_inc_tag and ''.join(time_inc_tag.stripped_strings)
    away_inc_tag = incidents.find('td', class_='incident-player-away')
    away_inc = away_inc_tag and ''.join(away_inc_tag.stripped_strings)
    print home_inc, time_inc, type_inc, away_inc
I am just focusing on one match at the moment to get this correct (EFBO815155) before I add a regular expression into the URL to get all match details.
So, the incidents for loop is not getting all the data, just the first entry in the table.
Thanks in advance. I am new to Stack Overflow; if anything is wrong with this post (formatting etc.), please let me know.
Thanks!
First, get the incidents table:
incidentsTable = soupb.find_all('table', class_='incidents-table')[0]
Then loop through all 'tr' tags within that table.
for incidents in incidentsTable.find_all('tr'):
    # your code as it is
    print incidents.prettify()
    home_inc_tag = incidents.find('td', class_='incident-player-home')
    home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
    ...
Gives Output:
Bradford Park Avenue 1-2 Boston United
None None
2' Goal J.Rollins
36' None C.Piergianni
N.Turner 42' None
50' Goal D.Southwell
C.King 60' Goal
This is close to what you want. Hope this helps!
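A side note on the `tag and ''.join(tag.stripped_strings)` idiom used above: it short-circuits to None when find() returns no tag, instead of raising AttributeError, which is why the output shows None for missing cells. A minimal sketch with a stand-in class (FakeTag is hypothetical, just to illustrate the pattern without re-fetching the page; real bs4 tags expose .stripped_strings as a generator):

```python
class FakeTag:
    """Stand-in for a bs4 Tag with a .stripped_strings attribute."""
    def __init__(self, parts):
        self.stripped_strings = parts

present = FakeTag(["2'", "Goal"])
missing = None  # what incidents.find() returns when the <td> is absent

# The idiom: falls through to None instead of raising AttributeError.
text_present = present and ''.join(present.stripped_strings)
text_missing = missing and ''.join(missing.stripped_strings)

print(text_present)  # 2'Goal
print(text_missing)  # None
```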
I'm trying to scrape two values from a webpage using BeautifulSoup. When printing only one value, the content looks good. However, when printing two values (to the same line), HTML code is displayed around one of the values.
Here is my code:
from bs4 import BeautifulSoup
import urllib.request as urllib2
list_open = open("source.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

i = 0
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    sku = soup.find_all(attrs={'class': "identifier"})
    description = soup.find_all(attrs={'class': "description"})
    for text in description:
        print((sku), text.getText())
    i += 1
And the output looks like this:
[<span class="identifier">112404</span>] A natural for...etc
[<span class="identifier">110027</span>] After what...etc
[<span class="identifier">03BA5730</span>] Argentina is know...etc
[<span class="identifier">090030</span>] To be carried...etc
The output should preferably be without the [<span class="identifier">...] wrapping around the numbers...
I guess the problem is in the last for-loop, but I have no idea how to correct it. All help is appreciated. Thanks! -Espen
It looks like you need to zip() identifiers and descriptions and call getText() for every tag found in the loop:
identifiers = soup.find_all(attrs={'class': "identifier"})
descriptions = soup.find_all(attrs={'class': "description"})
for identifier, description in zip(identifiers, descriptions):
    print(identifier.getText(), description.getText())
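Note that zip() pairs items positionally and stops at the shorter sequence, so an unmatched trailing identifier or description is silently dropped. A quick sketch using values like the ones in your sample output:

```python
# Plain strings standing in for the getText() results.
identifiers = ["112404", "110027", "03BA5730"]
descriptions = ["A natural for...", "After what..."]  # one short

# zip stops at the shorter list; the third identifier is dropped.
pairs = list(zip(identifiers, descriptions))
print(pairs)
```

If you need to detect such mismatches rather than drop them, compare the lengths of the two result sets before zipping.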
find_all() returns a ResultSet, which is more or less a fancy list. Printing a ResultSet will include the surrounding left and right square brackets that typically denote a list, and the items (tags) will be displayed within.
Your sample output suggests that the HTML for each URL contains one SKU and one description. If that is correct, then your code could just pick off the first item in each ResultSet like this:
sku = soup.find_all(attrs={'class': "identifier"})
description = soup.find_all(attrs={'class': "description"})
print(sku[0].get_text(), description[0].get_text())
Or, you could just find the first of each using find():
sku = soup.find(attrs={'class': "identifier"})
description = soup.find(attrs={'class': "description"})
print(sku.get_text(), description.get_text())
However, your code suggests that there can be multiple descriptions for each SKU, because you are iterating over the description result set. Perhaps there can be multiple SKUs and descriptions per page (in which case see @alecxe's answer)? It's difficult to tell.
If you could update your question by adding live URLs or sample HTML we could offer better advice.
I've looked at the other beautifulsoup get-same-level-type questions. It seems like mine is slightly different.
Here is the website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
I'm trying to get that table on the right. Notice how the first row of the table expands into a detailed breakdown of that data. I don't want that data. I only want the very top-level data. You can also see that the other rows can also be expanded, but not in this case. So just looping and skipping tr[2] might not work. I've tried this:
r = requests.get(page)
r.encoding = 'gb2312'
soup = BeautifulSoup(r.text,'html.parser')
table=soup.find('div', class_='right1').findAll('tr', {"class" : re.compile('list.*')})
but there is still more nested list* at other levels. How to get only the first level?
Limit your search to direct children of the table element only by setting the recursive argument to False:
table = soup.find('div', class_='right1').table
rows = table.find_all('tr', {"class" : re.compile('list.*')}, recursive=False)
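For what it's worth, when you pass a compiled pattern as a class filter, BeautifulSoup tests each class string against it, so it helps to check what the pattern actually accepts. A quick sketch of the regex from the question (the class names below are made-up examples):

```python
import re

# The filter used above: matches "list", "list1", "list2-sub", etc.
pattern = re.compile('list.*')

print(bool(pattern.search('list1')))  # True
print(bool(pattern.search('other')))  # False
```

A tighter pattern such as r'list\d+' would exclude class names that only contain "list" deeper in the string.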
@MartijnPieters' solution is already perfect, but don't forget that BeautifulSoup allows you to use multiple attributes as well when locating elements. See the following code:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
url = "http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31"
r = rq.get(url)
r.encoding = "gb2312"
soup = bsoup(r.content, "html.parser")
div = soup.find("div", class_="right1")
rows = div.find_all("tr", {"class":re.compile(r"list\d+"), "style":"cursor:pointer;"})
for row in rows:
    first_td = row.find_all("td")[0]
    print first_td.get_text().encode("utf-8")
Notice how I also added "style":"cursor:pointer;". This is unique to the top-level rows and is not an attribute of the inner rows. This gives the same result as the accepted answer:
百度汇总
360搜索
新搜狗
谷歌
微软必应
雅虎
0
有道
其他
Hopefully this also helps.