Python Couldn't parse HTML from URL - python

I have tried the code below:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://myip.ms/'
page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
print(soup)
This code works for some websites but not for most others, such as myip.ms.

Works for me. But what are you actually trying to achieve here? Your code appends "1" to the end of the URL and then requests it. If no page exists at that path on the server, the request will return an error. In this case https://myip.ms/1 exists, but it is no surprise that other pages could give you errors.
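To check whether a given page number actually exists before parsing it, one option is to build the URL explicitly and look at the HTTP status. The sketch below uses the standard library's urllib rather than requests so that it runs without third-party packages; build_page_url and fetch_html are hypothetical helper names, not part of the original code.

```python
import urllib.error
import urllib.request
from typing import Optional


def build_page_url(base_url: str, page: int) -> str:
    """Append a page number to the base URL, as the question's code does."""
    return base_url.rstrip("/") + "/" + str(page)


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page body, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as error:
        # For example, 404 when that page number does not exist on the server
        print(f"{url} returned HTTP {error.code}")
        return None


# Building the URL needs no network access
print(build_page_url("https://myip.ms/", 1))
```

Calling fetch_html(build_page_url("https://myip.ms/", 1)) would then return either the HTML text or None, which makes a missing page easy to tell apart from a parsing problem.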

Related

How do I log data from a live website using beautiful soup

Hello, I am trying to use Beautiful Soup and requests to log data from an anemometer that updates live every second. The website is here:
http://88.97.23.70:81/
The piece of data I want to scrape is highlighted in purple in the image, which comes from inspecting the HTML in my browser.
I have written the code below to try to print out the data; however, when I run it, it prints None. I think this means that the soup object doesn't in fact contain the whole HTML page? Upon printing soup.prettify(), I cannot find the same id=js-2-text that I find when inspecting the HTML in my browser. If anyone has any ideas why this might be or how to fix it, I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded from an external URL, so BeautifulSoup doesn't see it in the page's HTML. You can try the API URL that the page connects to:
import requests
from bs4 import BeautifulSoup
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
knots = float(metres_per_second) * 1.9438445
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681
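The parsing and unit conversion in that answer can be tried without touching the device. This is a minimal sketch: the sample string is a hypothetical payload, and the only thing assumed about the real response is what the answer's code relies on, namely that the second and third comma-separated fields are direction and speed in metres per second. parse_measurement is a hypothetical helper name.

```python
from typing import Tuple

MPS_TO_KNOTS = 1.9438445  # conversion factor used in the answer above


def parse_measurement(csv_text: str) -> Tuple[str, float, float]:
    """Split a comma-separated reading into direction, m/s, and knots."""
    # Same unpacking as the answer: skip field 0, ignore trailing fields
    _, direction, metres_per_second, *_ = csv_text.strip().split(",")
    speed = float(metres_per_second)
    return direction, speed, speed * MPS_TO_KNOTS


# Hypothetical payload; the real field layout beyond these positions is unknown
sample = "0,210,006.58,0"
direction, speed, knots = parse_measurement(sample)
print(direction, speed, round(knots, 2))
```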

Read data from URL / XML with python

This is my first question.
I'm trying to learn some Python, and I have this problem:
how can I get data from this URL, which shows its info as XML?
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/v01/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup)
Output:
[Image: output]
But I want to access this type of data, for example the <RUTEmisor> element:
[Image: linkurl_invoice]
I hope you can advise me on the code and on how to read XML documents.
By examining the URL you gave, it seems that the data is actually held a few links away at the following URL: http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49
As such, you can access it directly as follows:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find('RUTEmisor').text)
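For learning how element lookup works, the same kind of search can be tried offline on a small inline document. This is only a sketch: the sample XML and its element names other than RUTEmisor are hypothetical, find_text is a hypothetical helper, and the standard library's xml.etree.ElementTree is used here instead of BeautifulSoup so that nothing needs to be installed.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical document; only the RUTEmisor name comes from the question
SAMPLE_XML = """<Documento>
  <Encabezado>
    <Emisor>
      <RUTEmisor>11111111-1</RUTEmisor>
    </Emisor>
  </Encabezado>
</Documento>"""


def find_text(xml_text: str, local_name: str) -> Optional[str]:
    """Return the text of the first element with the given local name."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags look like "{namespace}name" when the document uses a namespace,
        # so compare only the part after the closing brace
        if element.tag.rsplit("}", 1)[-1] == local_name:
            return element.text
    return None


print(find_text(SAMPLE_XML, "RUTEmisor"))
```

Comparing local names is a design choice that keeps the lookup working whether or not the real document declares a default namespace.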

How to download tickers from webpage, BeautifulSoup didn't get all content

I want to get the ticker values from this webpage https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false
However when using Beautifulsoup I don't seem to get all the content, and I don't quite understand how to change my code in order to achieve my goal
import urllib3
from bs4 import BeautifulSoup
def oslobors():
    http = urllib3.PoolManager()
    url = 'https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false'
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html.parser")
    print(soup)
    return
print(oslobors())
The content you want to parse is generated dynamically. You can either use a browser automation tool such as Selenium, or request the URL below, which returns a JSON response. The latter is the easier way to go.
import requests
url = 'https://www.oslobors.no/ob/servlets/components?type=table&generators%5B0%5D%5Bsource%5D=feed.ob.quotes.EQUITIES%2BPCC&generators%5B1%5D%5Bsource%5D=feed.merk.quotes.EQUITIES%2BPCC&filter=&view=DELAYED&columns=PERIOD%2C+INSTRUMENT_TYPE%2C+TRADE_TIME%2C+ITEM_SECTOR%2C+ITEM%2C+LONG_NAME%2C+BID%2C+ASK%2C+LASTNZ_DIV%2C+CLOSE_LAST_TRADED%2C+CHANGE_PCT_SLACK%2C+TURNOVER_TOTAL%2C+TRADES_COUNT_TOTAL%2C+MARKET_CAP%2C+HAS_LIQUIDITY_PROVIDER%2C+PERIOD%2C+MIC%2C+GICS_CODE_LEVEL_1%2C+TIME%2C+VOLUME_TOTAL&channel=a66b1ba745886f611af56cec74115a51'
res = requests.get(url)
for ticker in res.json()['rows']:
    ticker_name = ticker['values']['ITEM']
    print(ticker_name)
Partial output:
APP
HEX
APCL
ODFB
SAS NOK
WWI
ASC
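The shape the loop relies on, a top-level 'rows' list whose items carry a 'values' mapping with an 'ITEM' key, can be exercised on a small inline sample. This is a sketch under that assumption only: the sample payload is hypothetical and far smaller than a real response, and extract_tickers is a hypothetical helper name.

```python
import json
from typing import List

# Hypothetical payload mirroring only the keys the answer's loop reads
SAMPLE_RESPONSE = """
{"rows": [
  {"values": {"ITEM": "APP", "LONG_NAME": "hypothetical name A"}},
  {"values": {"ITEM": "HEX", "LONG_NAME": "hypothetical name B"}}
]}
"""


def extract_tickers(payload: dict) -> List[str]:
    """Collect the ITEM field from every row of a decoded response."""
    return [row["values"]["ITEM"] for row in payload["rows"]]


tickers = extract_tickers(json.loads(SAMPLE_RESPONSE))
print(tickers)
```

With a live response, the decoded result of res.json() would take the place of json.loads(SAMPLE_RESPONSE).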

How can I force Python to navigate to a web page and print all anchors (a-html)

I am trying to take some data off an intranet site at work. I am testing the code below, which looks fine to me, but it almost seems like it is going to the wrong URL. If I right-click on the page and click 'View Page Source', I can see a bunch of links (anchors) that I want to scrape, but what Python actually prints out is completely different from what I'm seeing in 'View Page Source'.
from bs4 import BeautifulSoup as bs
import requests
from lxml import html
import urllib.request
REQUEST_URL = 'https://corp-intranet-internal.com/admin/?page=0'
response = requests.get(REQUEST_URL, auth=('fname.lname#gmail.com', 'my_pass'))
xml_data = response.text.encode('utf-8', 'ignore')
html_page = urllib.request.urlopen(REQUEST_URL)
delay = 5 # seconds
soup = bs(html_page, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))
I tested the same idea, using Selenium, and I'm getting results that don't match the 'View Page Source'. Any idea what could be wrong here? Thanks.
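One thing visible in the code above is that the authenticated requests response is never parsed: the page is fetched a second time with urllib.request.urlopen, which sends no credentials, so the HTML being parsed may be a login or redirect page rather than the one seen in the browser. Whether that is the actual cause here is an assumption. A minimal sketch of collecting anchors from HTML text that has already been fetched follows; it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and AnchorCollector, extract_hrefs, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class AnchorCollector(HTMLParser):
    """Record the href attribute of every <a> tag that has one."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_hrefs(html_text: str) -> List[str]:
    """Return the href values found in the given HTML text, in order."""
    collector = AnchorCollector()
    collector.feed(html_text)
    return collector.hrefs


# Hypothetical page content
sample_html = '<p><a href="/admin/?page=1">next</a> <a name="top">top</a></p>'
print(extract_hrefs(sample_html))
```

In the question's setting, response.text from the authenticated requests.get call would be the natural argument to extract_hrefs.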

BeautifulSoup url scraping

Trying out BeautifulSoup for the first time.
I have this link http://www.mediafire.com/download/alv8dq6k35n4m2k/For+You.zip
I want to catch the direct download url from the download button which is
http://download2110.mediafire.com/niz8p9iu6r9g/alv8dq6k35n4m2k/For+You.zip
What I have tried so far.
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.findAll('a')
I think the last function, findAll('a'), would find all the links on that page, but I could not find the direct download URL in my links list.
Am I doing something wrong here? If so, how can I grab that link with BeautifulSoup? I inspected the element in the Chrome Developer Console and I can see that the link is there.
You can try this to extract the url from the javascript:
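import re
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.mediafire.com/download/alv8dq6k35n4m2k/For+You.zip")
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(r.content, "html.parser")
# The div whose text holds the JavaScript containing the download URL
link = soup.find("div", {"class": "download_link"})
# Escape the dot and match lazily so the result stops at the first ".zip"
url = re.findall(r"https?://\S+?\.zip", link.text)[0]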
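The idea of pulling a URL out of inline JavaScript with a regular expression can be tried offline. This is a sketch only: the script snippet is hypothetical (the real page's JavaScript is not shown in the thread), the URL inside it is the one quoted in the question, and first_zip_url is a hypothetical helper name.

```python
import re
from typing import Optional

# Hypothetical inline JavaScript; only the URL itself comes from the question
SAMPLE_SCRIPT = (
    'var kNO = "http://download2110.mediafire.com/niz8p9iu6r9g/'
    'alv8dq6k35n4m2k/For+You.zip"; window.location = kNO;'
)

# Lazy match up to the first literal ".zip", stopping at quotes or whitespace
ZIP_URL = re.compile(r"""https?://[^\s"']+?\.zip""")


def first_zip_url(text: str) -> Optional[str]:
    """Return the first URL ending in .zip, or None if there is none."""
    match = ZIP_URL.search(text)
    return match.group(0) if match else None


print(first_zip_url(SAMPLE_SCRIPT))
```

Returning None instead of indexing [0] avoids an IndexError when the page layout changes and no URL is found.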
