Parse data from a webpage using Python

Can anyone please help me parse particular data from a web page? Here is the content on the webpage.
{"sites":[{"id":"XX","name":"YY","url":"ZZ","username":"AA","password":"BB","siteId":"0"},{"id":"XX","name":"YY","url":"ZZ","username":"AA","password":"BB","siteId":"0"}]}
I need just the id values from this content. Note that id appears twice here, so I need every id from the webpage. Here is the code I have written to dump the web content, but I am unable to parse the data I need. Please help me.
import urllib.request

def test(ip):
    url = 'http://%s/' % ip
    response = urllib.request.urlopen(url)
    webContent = response.read()
    print(webContent)

Your content is a JSON document; you can parse it with the json library and use it as a Python object:
import json
import urllib.request

def test(ip):
    url = 'http://%s/' % ip
    response = urllib.request.urlopen(url)
    webContent = response.read()
    content = json.loads(webContent)
    print([site['id'] for site in content['sites']])
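A slightly tighter variant of the same idea (a sketch: json.load can decode straight from the response object, and the IP below is just a placeholder):

import json
import urllib.request

def get_site_ids(ip):
    url = 'http://%s/' % ip
    # json.load decodes the JSON document straight from the response.
    with urllib.request.urlopen(url) as response:
        content = json.load(response)
    # Collect every "id" value from the "sites" list.
    return [site['id'] for site in content['sites']]

# With the sample document above, this would print ['XX', 'XX'].
print(get_site_ids('192.0.2.1'))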

Related

Python Web Scraping json.loads()

I'm trying to fetch only the URLs from a report, given a response in JSON format, using Python.
The response is as below:
text = {'result':[{'URL':'/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_1_Xe2cThkh.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_10_u0Egjf03.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_2_MnC1FzvY.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_3_8APKPJ6E.pdf'}]}
I need to prepend 'http://static.sse.com.cn' to each fetched URL, so I coded a for loop:
data = json.loads(text)
for every_report in data['result']:
    pdf_url = 'http://static.sse.com.cn' + every_report['URL']
    print(pdf_url)
But this is the result I get; I am only able to fetch the first URL with the prefix added.
http://static.sse.com.cn/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_6_Y2pswtvy.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_10_GBwvYOfG.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_11_2LvtFNYz.pdf<br>
What should I do to get all the URLs with the text I want added? Thank you.
The reason is that the string value of the URL key contains <br> separators. You have to split on them before constructing the full URLs.
for every_report in text['result']:
    urls = every_report['URL'].split('<br>')
    pdf_urls = ['http://static.sse.com.cn' + url for url in urls]
    print(pdf_urls)
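If a single flat list of URLs is preferable to one list per report, a nested comprehension does the same splitting in one pass:

BASE = 'http://static.sse.com.cn'
# Split each report's URL field on '<br>' and prefix every path.
pdf_urls = [BASE + url
            for report in text['result']
            for url in report['URL'].split('<br>')]
print(pdf_urls)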

Using XPath to scrape a local .mhtml file returns an empty list

I am trying to scrape information from a website using XPath. It works just fine when I scrape the live website, but as soon as I save the page as .mhtml, it returns an empty list. I really need to use the .mhtml format, so it would be nice to be able to scrape from that.
This is my code, the first part works just fine:
import requests
from lxml import html

url = "https://en.wikipedia.org/wiki/Outline_of_the_Marvel_Cinematic_Universe"
resp = requests.get(url)
tree = html.fromstring(resp.content)
content = tree.xpath('//*[@id="mw-content-text"]/div[1]/table[2]/tbody/tr[*]/th/i/a/text()')
print("Titles: ", content)

with open(r"C:/Users/.../.../marvel.mhtml", "r") as f:
    resp = f.read()
tree = html.fromstring(resp)
content = tree.xpath('//*[@id="mw-content-text"]/div[1]/table[2]/tbody/tr[*]/th/i/a/text()')
print("Titles: ", content)

soup.find_all returns empty even though the parser works and the URL is correct

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

textToSearch = 'python tutorials'
query = urllib.parse.quote(textToSearch)
url = "https://www.youtube.com/results?search_query=" + query
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
for vid in soup.find_all(attrs={'class': 'yt-uix-tile-link'}):
    if not vid['href'].startswith("https://googleads.g.doubleclick.net/"):
        print('https://www.youtube.com' + vid['href'])
I am trying to get the first video URL for a query without using the YouTube Data v3 API, due to its quota limit. find_all returns empty even though the parser works and the URL is correct. I am using Python 3.9.0.
The YouTube search data is now delivered in a JSON object instead of being embedded in the HTML of the search results page, so Beautiful Soup can't access it.
For YouTube, I solved the issue with a library called youtube-search.
(Note: the method I use is soup.find_all, not soup.findAll.)
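A minimal sketch with the youtube-search package (the to_dict() call and the url_suffix field are taken from the package's README; verify against the version you install):

from youtube_search import YoutubeSearch  # pip install youtube-search

# Search YouTube without the Data API and take the first result.
results = YoutubeSearch('python tutorials', max_results=1).to_dict()
if results:
    print('https://www.youtube.com' + results[0]['url_suffix'])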

How can I get the whole web page, including the fragment?

I've tried with the urllib and requests libraries, but the data in the fragment was not written to the .html file. Help me please :(
Here is the attempt with requests:
import requests

url = 'https://xxxxxxxxxxx.co.jp/InService/delivery/#/V=2/partsList/Element.PartsList%3A%3AVj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDEwIl0sIm5uIjoyMTQsInRzIjoxNTc5ODM0OTIwMDE5fQ?filterId=Product%3A%3AVj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = requests.get(url)
print(response)
And here is the attempt with urllib:
import base64
import urllib.request

url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
request = urllib.request.Request(url)
string = '%s:%s' % ('xx', 'xx')
base64string = base64.standard_b64encode(string.encode('utf-8'))
request.add_header("Authorization", "Basic %s" % base64string.decode('utf-8'))
u = urllib.request.urlopen(request)
webContent = u.read()
Here is the home page of the site (url: https://xxxxxx.co.jp/InService/delivery/#/V=2/home), and here is the page I want the data from (url: https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzE...). Every time I request the page shown in the second picture, the HTML content I get back is the HTML of picture 1, because in picture 2 the part after # is a fragment.
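That fragment is handled entirely client-side and is never sent to the server, which is why both requests return the same base HTML. A quick way to see what is actually requested:

from urllib.parse import urldefrag

url = 'https://xxxxxx.co.jp/InService/delivery/#/V=2/home'
base, fragment = urldefrag(url)
print(base)      # https://xxxxxx.co.jp/InService/delivery/  (what the server sees)
print(fragment)  # /V=2/home  (handled by the page's JavaScript)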
If all you would like is the HTML of the webpage, just use requests as you did in the first example, except instead of print(response) use print(response.content).
To save it to a file, use:
import requests
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
with open("output.html", 'w+') as f:
response = requests.get(url)
f.write(response.content)
If you need a certain part of the webpage, use BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = BeautifulSoup(requests.get(url).content, 'html.parser')
Use Inspect Element to find the tag of the table that you want in the second image, e.g. https://imgur.com/a/pGbCCFy.
Then use:
found = response.find('div', attrs={"class":"x-carousel__body no-scroll"}).find_all('ul')
for the eBay example linked above.
This should return that table, which you can then do whatever you like with.
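For instance, to dump the text of the matched lists (a trivial follow-up, assuming the selector matched):

# Each item in 'found' is a <ul> element; print its visible text.
for ul in found:
    print(ul.get_text(strip=True))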

Check if a page is an HTML page in Python?

I am trying to write a web crawler in Python. I want to check whether the page I am about to crawl is an HTML page and not a file like .pdf/.doc/.docx. I do not want to check for the .html extension, because .asp or .aspx pages, or pages like http://bing.com/travel/, have no explicit .html extension but are still HTML pages. Is there a good way to do this in Python?
This gets the header only from the server:
import urllib2  # Python 2; on Python 3 use urllib.request with Request(url, method='HEAD')

url = 'http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.7-rc6.tar.bz2'
req = urllib2.Request(url)
req.get_method = lambda: 'HEAD'  # issue a HEAD request so only headers are fetched
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)
prints
application/x-bzip2
From which you could conclude this is not HTML. You could use
'html' in content_type
to programmatically test if the content is HTML (or possibly XHTML).
If you wanted to be even more sure the content is HTML you could download the contents and try to parse it with an HTML parser like lxml or BeautifulSoup.
Beware of using requests.get like this:
import requests
r = requests.get(url)
print(r.headers['content-type'])
This takes a long time, and my network monitor shows a sustained load, leading me to believe it is downloading the entire file, not just the header.
On the other hand,
import requests
r = requests.head(url)
print(r.headers['content-type'])
gets the header only.
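If you do need a GET (some servers handle HEAD poorly), requests can defer the body download with stream=True, which keeps the header check just as cheap:

import requests

# stream=True fetches only the status line and headers up front; the
# body is not downloaded until r.content is accessed or r is iterated.
r = requests.get(url, stream=True)
print(r.headers['content-type'])
r.close()  # release the connection without downloading the body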
Don't bother with what the standard library throws at you; rather, try requests.
>>> import requests
>>> r = requests.get("http://www.google.com")
>>> r.headers['content-type']
'text/html; charset=ISO-8859-1'
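Putting the pieces together, a small helper along these lines (a sketch; it assumes the server responds sensibly to HEAD requests):

import requests

def is_html(url):
    # Fetch headers only; requests' headers mapping is case-insensitive.
    content_type = requests.head(url, allow_redirects=True).headers.get('content-type', '')
    return 'html' in content_type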
