Live data HTML parsing with Python/BS

I have scoured these pages for days without success, so I am hoping this is not a duplicate. If so I apologize.
I have a device on a local network that provides a live-updating data readout in HTML. So far my BeautifulSoup and urllib2 attempts at parsing this data have been unsuccessful.
Any help would be appreciated.
This is the source code, with the data of interest circled:
This is the resultant output:
from bs4 import BeautifulSoup
import urllib2

url = 'http://192.168.1.2/index.html#home-view'
#___________________________________________________________________
# Fetch the raw HTML from the device (Python 2 / urllib2)
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

# Parse it and pull out the paragraphs that hold the readings
soup = BeautifulSoup(data, "html.parser")
result = soup.findAll('p', {'class': 'gas-conc'})
print result
SOLVED! Thank you for the assistance. With Selenium I was able to painfully scrape out this data. However, I had to use the BS prettify function on the source code and manually count out which characters to slice out.

I'm 90% sure that you won't get this data unless you manage to render the JavaScript somehow.
Check out this post for more info on how to make that happen.
In a nutshell, you can use one of the following (a Selenium sketch follows the list):
selenium
PyQt5
Dryscrape
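
A minimal sketch of the Selenium route, assuming the device page fills in the p.gas-conc elements via JavaScript (the URL and class name come from the question; the driver setup is an assumption):

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://192.168.1.2/index.html#home-view'

driver = webdriver.Firefox()  # assumes geckodriver is on PATH; Chrome works too
try:
    driver.get(url)
    # page_source holds the DOM *after* the scripts have run;
    # a short time.sleep() or WebDriverWait may be needed for slow updates
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for p in soup.find_all('p', {'class': 'gas-conc'}):
        print(p.get_text(strip=True))
finally:
    driver.quit()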

Related

I tried to parse an internal network webpage with the BeautifulSoup library, but the HTML I get back isn't the same as the site's

I'd like to make an auto-login program for an internal network website.
So, I tried to parse that site using the requests and BeautifulSoup libraries.
It works... but I get HTML that is a lot shorter than that site's HTML.
What's the problem? Maybe a security issue?
Please help me.
import requests
from bs4 import BeautifulSoup as bs

page = requests.get("http://test.com")
soup = bs(page.text, "html.parser")  # note: "html.parser", not "html.parse"
print(soup)  # the HTML printed here is a lot shorter than the site's HTML

Issues with requests and BeautifulSoup

I'm trying to read a news webpage to get the titles of its stories. I'm attempting to put them in a list, but I keep getting an empty list. Can someone please point me in the right direction here? What am I missing? Please see the code below. Thanks.
import requests
from bs4 import BeautifulSoup

url = 'https://nypost.com/'
ttl_lst = []

soup = BeautifulSoup(requests.get(url).text, "lxml")
title = soup.findAll('h2', {'class': 'story-heading'})
for row in title:
    ttl_lst.append(row.text)
print(ttl_lst)
The requests module only returns the first HTML document the server sends back. Sites like nypost.com load their articles with AJAX requests after the page loads, so the headlines are not in that initial HTML. You will have to use something like Selenium for this, which executes the page's JavaScript, including those AJAX requests.
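
A minimal sketch of that approach, assuming the rendered page really does put its headlines in h2.story-heading elements (the class name is taken from the question and may have changed):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
try:
    driver.get('https://nypost.com/')
    # by now the AJAX requests have run, so the headlines are in the DOM
    soup = BeautifulSoup(driver.page_source, 'lxml')
    ttl_lst = [row.text for row in soup.findAll('h2', {'class': 'story-heading'})]
    print(ttl_lst)
finally:
    driver.quit()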

Read data from URL / XML with python

This is my first question. I'm trying to learn some Python, and I have this problem:
how can I get data from this URL, which serves its info as XML?
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/v01/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup)
output: the whole parsed XML document prints.
But I want to get at a specific element's data, <RUTEmisor> for example.
I hope you guys can advise me on the code and how to read XML docs.
By examining the URL you gave, it seems that the data is actually held a few links away at the following URL: http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49
As such, you can access it directly as follows:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find('RUTEmisor').text)
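
If you are not sure which tags the document contains, a short sketch like this (same URL as above) lists every distinct element name so you can pick the fields to query:

import requests
from bs4 import BeautifulSoup

url = 'http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
soup = BeautifulSoup(requests.get(url).content, "lxml-xml")

# find_all(True) matches every tag; collect the distinct names
names = sorted({tag.name for tag in soup.find_all(True)})
print(names)  # then soup.find(name).text for whichever field you need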

How to scrape all the image links of a product on Flipkart

I am trying to scrape the URLs of all the different images present at this link: https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya?pid=MOBEMZD4KHRF5VZX. I am trying it with the beautifulsoup module of Python, but haven't succeeded with this method. I am not able to understand the code structure of flipkart.com and why it is not returning the required data.
The code that I am trying is as follows:
from bs4 import BeautifulSoup
import requests

# Fetch the product page
x = requests.get("https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya?pid=MOBEMZD4KHRF5VZX").content
soup2 = BeautifulSoup(x, 'html.parser')

# Collect every <img> carrying the class seen in the browser inspector
data = []
for j in soup2.find_all('img', attrs={'class': "sfescn"}):
    data.append(j)
print(data)
Well, I can clearly see that there are no links to the mobile images in the page source code.
So I would recommend using a tool like Fiddler, or your browser's developer console, to track down where the actual data is coming from; most probably it arrives in a JSON response to a separate request.
I am not familiar with BeautifulSoup; I have been working with Scrapy.
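
A sketch of that workflow once the developer console has revealed the request that actually carries the images. The endpoint, parameters, and JSON fields below are placeholders, not Flipkart's real API:

import requests

# Hypothetical endpoint spotted in the browser's Network tab;
# substitute the real URL, headers, and parameters you observe there.
api_url = "https://www.example.com/api/product/images"
resp = requests.get(api_url, params={"pid": "MOBEMZD4KHRF5VZX"})
payload = resp.json()

# The JSON layout is an assumption; adjust it to what you actually see.
for image_url in payload.get("imageUrls", []):
    print(image_url)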

Range loop not working in webscrape

I have written a small web scraper in BS4. With the code I am able to scrape one page at a time; here is the relevant code.
import csv
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=129867").text
soup = BeautifulSoup(html,'lxml')
This code scrapes one page, but I want to scrape more than one page at a time (a range), so I tried adding a for loop like this:
import csv
from bs4 import BeautifulSoup
import requests

for ace in range(129867, 129869):
    html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id= {ace}").text
    soup = BeautifulSoup(html,'lxml')
Nothing happens when I run the code, and I don't even get any of the usual cryptic messages hinting at what went wrong. Could it be the syntax, or is it something else? Any help appreciated.
You should do everything inside the loop now. Also, you are not actually inserting the ace value into the URL, and there is an extra space after the id=. It might also be a good idea to establish a web-scraping session and use the params keyword of the get() method.
Fixed version:
import csv
from bs4 import BeautifulSoup
import requests

with requests.Session() as session:
    for ace in range(129867, 129869):
        url = "http://www.gbgb.org.uk/resultsMeeting.aspx"
        html = session.get(url, params={'id': ace}).text
        soup = BeautifulSoup(html, 'lxml')
Note this code is still blocking in nature; it processes the pages one at a time. If you want to speed things up, look into the Scrapy web-scraping framework, or try a thread pool as sketched below.
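
For a lighter-weight speed-up than a full Scrapy port, a thread pool can fetch the pages concurrently. A sketch assuming the same URL and id range as above:

from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import requests

URL = "http://www.gbgb.org.uk/resultsMeeting.aspx"

def fetch(ace):
    # each worker fetches and parses one results page
    html = requests.get(URL, params={'id': ace}).text
    return BeautifulSoup(html, 'lxml')

with ThreadPoolExecutor(max_workers=4) as pool:
    soups = list(pool.map(fetch, range(129867, 129869)))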
