Range loop not working in web scrape - Python

I have written a small web scraper in BS4. With this code I am able to scrape one page at a time; here is the relevant code.
import csv
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=129867").text
soup = BeautifulSoup(html,'lxml')
This code scrapes one page, but I want to scrape more than one page at a time (a range), so I tried adding a for loop like this.
import csv
from bs4 import BeautifulSoup
import requests
for ace in range(129867, 129869):
    html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id= {ace}").text
    soup = BeautifulSoup(html, 'lxml')
Nothing happens when I run the code, and I don't even get any of the usual cryptic messages hinting at what went wrong. Could it be syntax, or is it something else? Any help appreciated.

You need to do everything inside the loop now that you have one. Also, you are not inserting the ace value into the URL, and there is an extra space after the id=. It might also be a good idea to establish a web-scraping session and use the params keyword of the get() method.
Fixed version:
import csv
from bs4 import BeautifulSoup
import requests
with requests.Session() as session:
    for ace in range(129867, 129869):
        url = "http://www.gbgb.org.uk/resultsMeeting.aspx"
        html = session.get(url, params={'id': ace}).text
        soup = BeautifulSoup(html, 'lxml')
Note that this code is still blocking in nature: it processes the pages one at a time. If you want to speed things up, look into the Scrapy web-scraping framework.
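For example, since csv is already imported, the per-page extraction and writing can all live inside the loop. A minimal sketch (the meetings.csv filename and extracting just the page title are illustrative choices, not part of the original question):

import csv
from bs4 import BeautifulSoup
import requests

with requests.Session() as session, open('meetings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['meeting_id', 'page_title'])
    for ace in range(129867, 129869):
        url = "http://www.gbgb.org.uk/resultsMeeting.aspx"
        html = session.get(url, params={'id': ace}).text
        soup = BeautifulSoup(html, 'lxml')
        # any extraction has to happen here, inside the loop,
        # before the next iteration replaces `soup`
        title = soup.title.get_text(strip=True) if soup.title else ''
        writer.writerow([ace, title])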

Related

Python Beautifulsoup get tags under another tag (from selenium)

I am using Selenium for web scraping and would like to use Beautiful Soup instead, but I am new to this library. I want to get all company names and times, and jump to the next page.
Please find my codes using selenium first:
driver.get('http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml')
while True:
    links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@class="sibian"]/tbody/tr/td/table[2]/tbody/tr/td[2]/a')]
    for link in links:
        driver.get(link)
        driver.implicitly_wait(10)
        windows = driver.window_handles
        driver.switch_to.window(windows[-1])
        time = driver.find_element_by_xpath('//*[@class="con_bj"]/table[3]/tbody/tr/td/publishtime').text
        company = driver.find_element_by_xpath('//*[@class="title_A"]').text
        driver.back()
    if len(links) < 20:
        break
I tried doing the same with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
html='http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    num = link.find('a').get('href')
    print(num)
But I get nothing and am stuck at the first step.
Could you please help with that?
You are not making a request. You are treating BeautifulSoup as an HTTP request library, but it is just a parser. Think of driver.get() as requests.get() (yes, I know they are not the same, but it helps for understanding). You need to do something like this:
from bs4 import BeautifulSoup
import requests
html_link='http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    num = link.find('a').get('href')
    print(num)
This will allow you to further debug your code. It MAY NOT work, as some sites require specific headers (such as a User-Agent) or automatically reject your request otherwise. Requests is a very easy (subjectively, of course) library to work with and has a lot of support on this site. To save some head-scratching, I will go ahead and tell you that if the site requires JavaScript, Selenium or some variant is the best option.
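If the plain request does get rejected, a minimal sketch of adding such a header (the Mozilla/5.0 value is just an example, not something this particular site is known to require):

from bs4 import BeautifulSoup
import requests

html_link = 'http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
# some sites reject requests that don't look like a browser, so send a User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(html_link, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
for cell in soup.find_all('td'):
    a = cell.find('a')
    if a is not None:          # not every <td> contains a link
        print(a.get('href'))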

Issues with requests and BeautifulSoup

I'm trying to read a news webpage to get the titles of its stories. I'm attempting to put them in a list, but I keep getting an empty list. Can someone please point me in the right direction here? What am I missing? Please see the code below. Thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://nypost.com/'
ttl_lst = []
soup = BeautifulSoup(requests.get(url).text, "lxml")
title = soup.findAll('h2', {'class': 'story-heading'})
for row in title:
    ttl_lst.append(row.text)
print (ttl_lst)
The requests module only returns the initial HTML document sent back by the server. Sites like nypost load their articles with AJAX requests. You will have to use something like Selenium for this, which lets those AJAX requests run after the page loads.
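A rough sketch of that approach with Selenium (it reuses the question's 'story-heading' class, which may or may not exist once the page is fully rendered, and the older find_elements_by_* API used elsewhere on this page):

from selenium import webdriver

driver = webdriver.Chrome()              # assumes chromedriver is on your PATH
driver.get('https://nypost.com/')
driver.implicitly_wait(10)               # give the AJAX-loaded articles time to appear

# same target as before: the story headings
ttl_lst = [el.text for el in driver.find_elements_by_css_selector('h2.story-heading')]
print(ttl_lst)
driver.quit()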

Web Scraping with Python, Beautiful Soup, and Selenium not working

I am doing a Python exercise, and it requires me to get the top news from the Google News website by web scraping and print it to the console.
As I was doing it, I just used the Beautiful Soup library to retrieve the news. That was my code:
import bs4
from bs4 import BeautifulSoup
import urllib.request
news_url = "https://news.google.com/news/rss";
URLObject = urllib.request.urlopen(news_url);
xml_page = URLObject.read();
URLObject.close();
soup_page = BeautifulSoup(xml_page,"html.parser");
news_list = soup_page.findAll("item");
for news in news_list:
    print(news.title.text);
    print(news.link.text);
    print(news.pubDate.text);
    print("-"*60);
But it kept giving me errors when printing the 'link' and 'pubDate'. After some research, I saw some answers here on Stack Overflow, and they said that, as the website uses JavaScript, one should use the Selenium package in addition to Beautiful Soup.
Despite not understanding how Selenium really works, I updated the code as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver");
driver.maximize_window();
driver.get("https://news.google.com/news/rss");
content = driver.page_source.encode("utf-8").strip();
soup = BeautifulSoup(content, "html.parser");
news_list = soup.findAll("item");
print(news_list);
for news in news_list:
    print(news.title.text);
    print(news.link.text);
    print(news.pubDate.text);
    print("-"*60);
However, when I run it, a blank browser page opens and this is printed to the console:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
(Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)
I just tried it and the following code works for me. The original items = line was horrible, apologies in advance, but for now it works...
EDIT
Just updated the snippet; you can use ElementTree.iter('tag') to iterate over all the nodes with that tag:
import urllib.request
import xml.etree.ElementTree
news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
    xml_page = page.read()

# Parse the XML page
e = xml.etree.ElementTree.fromstring(xml_page)

# Iterate over the item list
for it in e.iter('item'):
    print(it.find('title').text)
    print(it.find('link').text)
    print(it.find('pubDate').text, '\n')
EDIT 2: Discussion of personal library preferences for scraping
Personally, for interactive/dynamic pages where I have to do stuff (click here, fill in a form, obtain results, ...), I use Selenium, and usually I don't need bs4, since you can use Selenium directly to find and parse the specific nodes of the page you are looking for.
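As a minimal illustration of that Selenium-only approach (the URL and the 'h2.headline' selector below are placeholders, not taken from any page above):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/some-dynamic-page')      # placeholder URL
driver.implicitly_wait(10)
# find and read the nodes directly through the driver, no BeautifulSoup pass needed
for heading in driver.find_elements_by_css_selector('h2.headline'):   # placeholder selector
    print(heading.text)
driver.quit()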
I use bs4 in conjunction with requests (instead of urllib.request) to parse more static webpages in projects where I don't want a whole webdriver installed.
There is nothing wrong with using urllib.request, but requests (see here for the docs) is one of the best Python packages out there (in my opinion) and is a great example of how to create a simple yet powerful API.
Simply use BeautifulSoup with requests.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')
news_list = soup.find_all('item')
# do whatever you need with news_list

Live data HTML parsing with Python/BS

I have scoured these pages for days without success, so I am hoping this is not a duplicate. If so, I apologize.
I have a device on a local network that provides a data readout in HTML that is updated live. So far my BeautifulSoup and urllib2 attempts at parsing this data have been unsuccessful.
Any help would be appreciated.
This is the source code, with the data of interest circled:
This is the resultant output:
from bs4 import BeautifulSoup
import re
import urllib2
from urllib import urlopen
url = 'http://192.168.1.2/index.html#home-view'
#___________________________________________________________________
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
soup = BeautifulSoup(data, "html.parser")
result = soup.findAll('p', {'class':'gas-conc'})
print result
SOLVED! Thank you for the assistance. With Selenium I was able to painfully scrape out this data. However, I had to use the BS 'beautify' function on the source code and manually count out which characters to splice out.
I'm 90% sure that you won't get this data unless you manage to render the JavaScript somehow.
Check out this post for more info on how to make that happen.
In a nutshell, you can use:
Selenium
PyQt5
Dryscrape
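For example, going the Selenium route (which is how the question was eventually solved), the same 'gas-conc' paragraphs targeted above could be read after the page's JavaScript has run. A rough sketch, assuming chromedriver is available:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://192.168.1.2/index.html#home-view')
driver.implicitly_wait(10)                 # let the live values render

# same target as the BeautifulSoup attempt: <p class="gas-conc"> elements
for reading in driver.find_elements_by_class_name('gas-conc'):
    print(reading.text)
driver.quit()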

How to scrape all the image links of a product on Flipkart

I am trying to scrape the URLs of all the different images present at this link: https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya?pid=MOBEMZD4KHRF5VZX. I am trying it with the beautifulsoup module of Python, but didn't succeed with this method. I am not able to understand the code structure of flipkart.com and why it is not returning the required data.
The code that I am trying is as follows:
from bs4 import BeautifulSoup
import urllib
from pprintpp import pprint
import pandas as pd
import requests
from time import sleep
x=requests.get("https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya?pid=MOBEMZD4KHRF5VZX").content
#x= urllib._urlopener("https://www.flipkart.com/jbl-t250si-on-the-ear-headphone/p/itmefbgezsc72mgt?pid=ACCEFBGAK5ZDTBF7&")
soup2 = BeautifulSoup(x, 'html.parser')
data=[]
for j in soup2.find_all('img', attrs={'class': "sfescn"}):
    data += [j]
print data
Well, I can clearly see that there are no links to the mobile images in the page source.
So I would recommend using a tool like Fiddler, or your browser's developer console, to track where the actual data comes from; most probably it comes from a request that returns a JSON response.
I am not familiar with BeautifulSoup; I have been working with Scrapy.
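If the developer console or Fiddler does reveal such a JSON request, hitting it directly with requests is usually simpler than parsing the HTML. A purely hypothetical sketch (the endpoint URL is a placeholder; the real endpoint and response layout would have to be discovered in the network tab first):

import requests

# placeholder endpoint, found by watching the network traffic while the product page loads
api_url = 'https://www.flipkart.com/api/some-product-endpoint'
resp = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()       # a JSON response parses straight into Python dicts/lists
print(data)              # inspect the structure, then pull the image URLs out of it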
