Python - converting to list

import requests
from bs4 import BeautifulSoup

webpage = requests.get("http://www.nytimes.com/")
soup = BeautifulSoup(webpage.text, "html.parser")
for story_heading in soup.find_all(class_="story-heading"):
    articles = story_heading.text.replace('\n', '').replace(' ', '')
    print(articles)
Here is my code; it prints out a list of all the article titles on the website. I get strings:
Looking Back: 1980 | Funny, but Not Fit to Print
Brooklyn Studio With Room for Family and a Dog
Search for Homes for Sale or Rent
Sell Your Home
So, I want to convert this to a list = ['Search for Homes for Sale or Rent', 'Sell Your Home', ...], which will allow me to make some other manipulations like random.choice etc.
I tried:
alist = articles.split("\n")
print (alist)
['Looking Back: 1980 | Funny, but Not Fit to Print']
['Brooklyn Studio With Room for Family and a Dog']
['Search for Homes for Sale or Rent']
['Sell Your Home']
That is not the list I need. I'm stuck. Can you please help me with this part of the code?

You are constantly overwriting articles with the next value in your list. What you want to do instead is make articles a list, and just append in each iteration:
import requests
from bs4 import BeautifulSoup

webpage = requests.get("http://www.nytimes.com/")
soup = BeautifulSoup(webpage.text, "html.parser")
articles = []
for story_heading in soup.find_all(class_="story-heading"):
    articles.append(story_heading.text.replace('\n', '').replace(' ', ''))
print(articles)
The output is huge, so this is a small sample of what it looks like:
['Global Deal Reached to Curb Chemical That Warms Planet', 'Accord Could Push A/C Out of Sweltering India’s Reach ',....]
Furthermore, you only need to strip whitespace in each iteration; you don't need to do those replacements. So, you can do this with your story_heading.text instead:
articles.append(story_heading.text.strip())
Which can now give you a final solution looking like this:
import requests
from bs4 import BeautifulSoup

webpage = requests.get("http://www.nytimes.com/")
soup = BeautifulSoup(webpage.text, "html.parser")
articles = [story_heading.text.strip() for story_heading in soup.find_all(class_="story-heading")]
print(articles)
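With the headlines collected in a list, random.choice works as the asker intended. A minimal sketch, using a hard-coded stand-in for the scraped articles list:

```python
import random

# Hard-coded stand-in for the `articles` list built by the scraper above.
articles = [
    'Search for Homes for Sale or Rent',
    'Sell Your Home',
    'Brooklyn Studio With Room for Family and a Dog',
]

# random.choice picks one element; random.sample would pick several
# without repeats.
pick = random.choice(articles)
print(pick)
```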


BeautifulSoup deleting first half of HTML?

I'm practicing with BeautifulSoup and HTML requests in general for the first time. The goal of the programme is to load a webpage and its HTML, then search through it (in this case a recipe, to get a substring of its ingredients). I've managed to get it working with the following code:
import requests

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
myHTML = result.text
index1 = myHTML.find("recipeIngredient")
index2 = myHTML.find("recipeInstructions")
ingredients = myHTML[index1:index2]
But when I try and use BeautifulSoup here:
import requests
from bs4 import BeautifulSoup

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find(text = "recipeIngredient")
print(ingredients)
I understand that the code above (even if I could get it working) would produce a different output of just ["recipeIngredient"], but that's all I'm focused on for now whilst I get to grips with BS. Instead, the code above just outputs None. I printed "doc" to the terminal and it would only output what appears to be the second half of the HTML (or at least not all of it). The text file, however, does contain all the HTML, so I assume that's where the problem lies, but I'm not sure how to fix it.
Thank you.
You need to use:
class_="recipe__ingredients"
For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
doc = (
    BeautifulSoup(requests.get(url).text, "html.parser")
    .find(class_="recipe__ingredients")
)
ingredients = "\n".join(
    ingredient.getText() for ingredient in doc.find_all("li")
)
print(ingredients)
Output:
1 large onion , chopped
4 large garlic cloves
thumb-sized piece of ginger
2 tbsp rapeseed oil
4 small skinless chicken breasts, cut into chunks
2 tbsp tikka spice powder
1 tsp cayenne pepper
400g can chopped tomatoes
40g ground almonds
200g spinach
3 tbsp fat-free natural yogurt
½ small bunch of coriander , chopped
brown basmati rice , to serve
It outputs None because find(text=...) looks for a tag whose text content is 'recipeIngredient', which does not exist (there is no such text in the html; that string is an attribute value inside an html tag).
What you are actually trying to do with bs4 is find the specific tags and/or attributes that hold the data/content you want. For example, as @baduker points out, the ingredients in the html are within the tag with the class attribute "recipe__ingredients".
The string 'recipeIngredient' that you pulled out in that first block of code actually comes from within the <script> tag in the html, which holds the ingredients in JSON format:
from bs4 import BeautifulSoup
import requests
import json
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find('script', {'type':'application/ld+json'}).text
jsonData = json.loads(ingredients)
print(jsonData['recipeIngredient'])

printing result of 2 for loops in same line

I'm fairly new to web scraping in Python; and after reading most of the tutorials on the topic online I decided to give it a shot. I finally got one site working but the output is not formatted properly.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")
for div in soup.find_all('h2'):  # prints the name of the food
    print(div.text)
for a in soup.find_all('span', {'class': 'amount'}):  # prints the price of the food
    print(a.text)
Output
I want the name of the food to be printed side by side with the corresponding price of the food, joined by a "-". Would appreciate any help given, thanks!
Edit: After @Reblochon Masque's comments below, I've run into another problem. As you can see, there is a $0.00 value from the built-in shopping cart on the website. How would I exclude this outlier and continue moving down the loop, while ensuring that the other prices "move up" to correspond to the correct food?
Best practice is to use the zip function in the for loop, but it can also be done this way. This is just to show that it works by indexing the two lists.
names = soup.find_all('h2')
rest = soup.find_all('span', {'class': 'amount'})
for index in range(len(names)):
    print('{} - {}'.format(names[index].text, rest[index].text))
You could maybe zip the two results:
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for div, a in zip(names, rest):
    print('{} - {}'.format(div.text, a.text))
    # print(f"{div.text} - {a.text}")  # for Python >= 3.6
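The question's edit asks how to skip the cart's $0.00 amount before zipping. One way is to filter it out of the price list first, sketched here on an inline HTML stand-in for the live page (the food names and prices are made up, and this assumes no real menu item costs $0.00):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the live page: the first span.amount is the
# shopping-cart total, which has no matching h2 food name.
html = """
<span class="amount">$0.00</span>
<h2>Chicken Lunch Box</h2><span class="amount">$5.50</span>
<h2>Fish Lunch Box</h2><span class="amount">$6.00</span>
"""
soup = BeautifulSoup(html, "html.parser")

names = soup.find_all('h2')
# Drop the cart total so the remaining prices line up with the names.
prices = [a for a in soup.find_all('span', {'class': 'amount'})
          if a.text.strip() != '$0.00']

pairs = ['{} - {}'.format(n.text, p.text) for n, p in zip(names, prices)]
print(pairs)
```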

Remove <a> HTML tag from beautifulsoup results

Using beautifulsoup I'm able to scrape a web page with this code:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.acbbroker.it/soci_dettaglio.php?r=3")
soup = BeautifulSoup(page.content, 'html.parser')
test = soup.find(id="paginainterna-content")
test_items = test.find_all(class_="entry-content")
tonight = test_items[0]
names = []
for x in tonight.find_all('a', itemprop="url"):
    names.append(str(x))
print(names)
but I'm not able to clean the results and obtain only the content inside the <a> tags (removing also the href).
Here is a small snap of my result:
'A&B; Insurance e Reinsurance Brokers Srl', 'A.B.A. BROKERS SRL', 'ABC SRL BROKER E CONSULENTI DI ASSI.NE', 'AEGIS INTERMEDIA SAS',
What is the proper way to handle this kind of data and obtain a clean result?
Thank you
If you want only the text from a tag, use the get_text() method:
for x in tonight.find_all('a', itemprop="url"):
    names.append(x.get_text())
print(names)
Better, with a list comprehension (this is also the fastest):
names = [x.get_text() for x in tonight.find_all('a', itemprop='url')]
I don't know what output you want, but you can get the text by changing this:
names.append(str(x.get_text()))

BeautifulSoup: Scraping answers from form

I need to scrape the answers to the questions from the following link, including the check boxes.
Here's what I have so far:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
The following gives me all the written answers, if there are any:
soup.find_all('span', {'class':'PrintHistRed'})
and I think I can piece together all the checkbox answers from this:
soup.find_all('img')
but these aren't going to be ordered correctly, because this doesn't pick up the "No Information Filed" answers that aren't written in red.
I also feel like there's a much better way to be doing this. Ideally I want (for the first 6 questions) to return:
['APEX INVESTMENT FUND, V, L.P',
'805-2054766781',
'Delaware',
'United States',
'APEX MANAGEMENT V, LLC',
'X',
'O',
'No Information Filed',
'NO',
'NO']
EDIT
Martin's answer below seems to do the trick; however, when I put it in a loop, the results begin to change after the 3rd iteration. Any ideas how to fix this?
from bs4 import BeautifulSoup
import requests
import re
for x in range(5):
    url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "lxml")
    tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
    tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])  # 2: skip "are you an adviser" at the top
    tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])
    output = []
    for entry in sorted(tags):
        if entry.name == 'img':
            alt = entry['alt']
            if 'Radio' in alt:
                output.append('NO' if 'not selected' in alt else 'YES')
            else:
                output.append('O' if 'not checked' in alt else 'X')
        else:
            output.append(entry.text)
    print output[:9]
The website does not generate any of the required HTML via Javascript, so I have chosen to use just requests to get the HTML (which should be faster).
One approach to solving your problem is to store all the tags for your three different types into a single array. If this is then sorted, it will result in the tags being in tree order.
The first search simply uses your PrintHistRed class to get the matching span tags. Second, it finds all img tags whose alt text contains either the word Radio or Checkbox. Lastly, it searches for all locations where No Information Filed is found and returns the parent tags.
The tags can now be sorted and a suitable output array built containing the information in the required format:
from bs4 import BeautifulSoup
import requests
import re
url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])  # 2: skip "are you an adviser" at the top
tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])
output = []
for entry in sorted(tags):
    if entry.name == 'img':
        alt = entry['alt']
        if 'Radio' in alt:
            output.append('NO' if 'not selected' in alt else 'YES')
        else:
            output.append('O' if 'not checked' in alt else 'X')
    else:
        output.append(entry.text)
print output[:9]  # Display the first 9 entries
Giving you:
[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', 'X', 'O', u'No Information Filed', 'NO', 'YES']
I've looked fairly carefully at the HTML. I doubt there is an utterly simple way of scraping pages like this.
I would begin with an analysis, looking for similar questions. For instance, 11 through 16 inclusive can likely be handled in the same way. 19 and 21 appear to be similar. There may or may not be others.
I would work out how to handle each type of similar question as given by the rows containing them. For example, how would I handle 19 and 21? Then I would write code to identify the rows for the questions noting the question number for each. Finally I would use the appropriate code using the row number to winkle out information from it. In other words, when I encountered question 19 I'd use the code meant for either 19 or 21.
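The dispatch idea described above can be sketched as a table mapping question numbers to handler functions. Everything here is hypothetical (the real handlers would parse the table rows from the page); it only shows the shape:

```python
# Hypothetical handlers: each knows how to extract the answer from one
# row layout. Questions that share a layout share a handler.
def parse_text_row(row):
    return row.strip()

def parse_checkbox_row(row):
    return 'X' if row == 'checked' else 'O'

# Questions 19 and 21 appear similar, so they get the same handler;
# the mapping below is illustrative, not taken from the real form.
handlers = {
    19: parse_checkbox_row,
    21: parse_checkbox_row,
}

def handle(question_number, row):
    # Fall back to the plain-text handler for unmapped questions.
    return handlers.get(question_number, parse_text_row)(row)

print(handle(19, 'checked'))
print(handle(11, ' Delaware '))
```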

web scraping in python only retrieving one entry

I am trying to scrape the BBC football results website to get teams, shots, goals, cards and incidents.
I am writing the script in Python and using the BeautifulSoup package. The code provided only retrieves the first entry of the table of incidents; when the incidents table is printed to screen, the full table with all the data is there.
The table I am scraping from is stored in incidents:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.bbc.co.uk/sport/football/result/partial/EFBO815155?teamview=false'
inner_page = urllib2.urlopen(url).read()
soupb = BeautifulSoup(inner_page, 'lxml')
for incidents in soupb.find_all('table', class_="incidents-table"):
    print incidents.prettify()
    home_inc_tag = incidents.find('td', class_='incident-player-home')
    home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
    type_inc_tag = incidents.find('td', 'span', class_='incident-type goal')
    type_inc = type_inc_tag and ''.join(type_inc_tag.stripped_strings)
    time_inc_tag = incidents.find('td', class_='incident-time')
    time_inc = time_inc_tag and ''.join(time_inc_tag.stripped_strings)
    away_inc_tag = incidents.find('td', class_='incident-player-away')
    away_inc = away_inc_tag and ''.join(away_inc_tag.stripped_strings)
    print home_inc, time_inc, type_inc, away_inc
I am just focusing on one match at the moment to get this correct (EFBO815155) before I add a regular expression into the URL to get all match details.
So, the incidents for loop is not getting all the data, just the first entry in the table.
Thanks in advance. I am new to Stack Overflow; if anything is wrong with this post, formatting etc., please let me know.
Thanks!
First, get the incidents table:
incidentsTable = soupb.find_all('table', class_='incidents-table')[0]
Then loop through all 'tr' tags within that table:
for incidents in incidentsTable.find_all('tr'):
    # your code as it is
    print incidents.prettify()
    home_inc_tag = incidents.find('td', class_='incident-player-home')
    home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
    ...
Gives Output:
Bradford Park Avenue 1-2 Boston United
None None
2' Goal J.Rollins
36' None C.Piergianni
N.Turner 42' None
50' Goal D.Southwell
C.King 60' Goal
This is close to what you want. Hope this helps!
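If the goal is to work with the incidents rather than just print them, the same per-row loop can build a list of dicts. A sketch on an inline stand-in table (the class names follow the question's code; the players and times are invented):

```python
from bs4 import BeautifulSoup

# Inline stand-in for one incidents table from the results page.
html = """
<table class="incidents-table">
  <tr><td class="incident-player-home"></td>
      <td class="incident-time">2'</td>
      <td class="incident-player-away">J. Rollins</td></tr>
  <tr><td class="incident-player-home">N. Turner</td>
      <td class="incident-time">42'</td>
      <td class="incident-player-away"></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find('table', class_='incidents-table')

# One dict per <tr>, with empty strings where a cell has no player.
incidents = []
for row in table.find_all('tr'):
    incidents.append({
        'home': row.find('td', class_='incident-player-home').get_text(strip=True),
        'time': row.find('td', class_='incident-time').get_text(strip=True),
        'away': row.find('td', class_='incident-player-away').get_text(strip=True),
    })
print(incidents)
```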
