How can I get a whole web page, including the fragment content, with Python?

I've tried both the urllib and requests libraries, but the data in the fragment part of the page was not written to the .html file. Please help.
Here is my attempt with requests:
import requests
url = 'https://xxxxxxxxxxx.co.jp/InService/delivery/#/V=2/partsList/Element.PartsList%3A%3AVj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDEwIl0sIm5uIjoyMTQsInRzIjoxNTc5ODM0OTIwMDE5fQ?filterId=Product%3A%3AVj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = requests.get(url)
print(response)
Here is my attempt with urllib:
import base64
import urllib.request
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
request = urllib.request.Request(url)
string = '%s:%s' % ('xx','xx')
base64string = base64.standard_b64encode(string.encode('utf-8'))
request.add_header("Authorization", "Basic %s" % base64string.decode('utf-8'))
u = urllib.request.urlopen(request)
webContent = u.read()
Here is the home page of the site (URL: https://xxxxxx.co.jp/InService/delivery/#/V=2/home)
and here is the page whose data I want (URL: https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzE...)
So every time I request the page shown in picture 2, the HTML I get back is the HTML of picture 1, because the content of picture 2 lives in the fragment.

If all you want is the HTML of the web page, use requests as in your first example, but print response.content instead of response.
To save it to a file, use:
import requests
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = requests.get(url)
# response.content is bytes, so the file must be opened in binary mode
with open("output.html", 'wb') as f:
    f.write(response.content)
If you need a certain part of the webpage, use BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
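# name the parser explicitly so the result does not depend on which parsers are installed
response = BeautifulSoup(requests.get(url).content, "html.parser")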
Use inspect element to find the tag of the table you want in the second image, e.g. https://imgur.com/a/pGbCCFy.
Then use:
found = response.find('div', attrs={"class":"x-carousel__body no-scroll"}).find_all('ul')
That selector is for the eBay example I linked above; adapt it to your page.
This should return the table, which you can then process however you like.
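One point worth checking for the question above: the part of a URL after # is the fragment, and HTTP clients do not send it to the server. The sketch below uses a hypothetical URL of the same shape as the one in the question to show which part is actually requested. If the parts list is filled in by JavaScript after the page loads (the #/V=2/... routing suggests this, but the source does not confirm it), a plain download with requests or urllib would not contain it.

```python
from urllib.parse import urldefrag

# Hypothetical URL with the same shape as the one in the question.
url = "https://example.com/InService/delivery/?view=print#/V=2/partsList/abc"

# urldefrag splits a URL into the part before '#' and the fragment after it.
base, fragment = urldefrag(url)

print(base)      # the part an HTTP client requests from the server
print(fragment)  # the part that is only interpreted on the client side
```

Both URLs in the question share the same base path, which would explain why the saved HTML matches the home page in picture 1.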

Related

How to click the 'download as pdf' button on a website with python

I want to click the 'download as pdf' button on this site: https://www.goffs.com/sales-results/sales/december-nh-sale-2021/1
The reason I can't just scrape the download link or download it manually is that there are multiple pages like this:
https://www.goffs.com/sales-results/sales/december-nh-sale-2021/2
https://www.goffs.com/sales-results/sales/december-nh-sale-2021/3
And I want to loop through all of them and download each as a pdf.
Current code:
import urllib.request
from requests import get
from bs4 import BeautifulSoup
url = "https://www.goffs.com/sales-results/sales/december-nh-sale-2021/1"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
This code should get the link to the PDF:
from urllib.request import Request, urlopen
url = "https://www.goffs.com/sales-results/sales/december-nh-sale-2021/{}".format("1")
request = Request(url)
response = urlopen(request)
content = response.read().decode().split('<a href="https://www.goffs.com/GoffsCMS/_Sales/')
content = content[1].split('"')
content = content[0]
output = 'https://www.goffs.com/GoffsCMS/_Sales/'+content
print(output)
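To cover the looping part of the question, the same idea can be wrapped in two functions. This is a minimal sketch, assuming every results page contains a link with the same https://www.goffs.com/GoffsCMS/_Sales/ prefix as above; the function names and the output file names are hypothetical.

```python
import re
from urllib.request import urlopen

# Prefix and page pattern taken from the snippets above.
PDF_PREFIX = "https://www.goffs.com/GoffsCMS/_Sales/"
PAGE_URL = "https://www.goffs.com/sales-results/sales/december-nh-sale-2021/{}"


def extract_pdf_link(html):
    """Return the first href that starts with PDF_PREFIX, or None if absent."""
    pattern = r'<a href="(' + re.escape(PDF_PREFIX) + r'[^"]*)"'
    match = re.search(pattern, html)
    return match.group(1) if match else None


def download_pdfs(page_numbers):
    """Fetch each results page and save the linked file as <page>.pdf."""
    for page in page_numbers:
        html = urlopen(PAGE_URL.format(page)).read().decode()
        link = extract_pdf_link(html)
        if link is None:
            print("no PDF link found on page", page)
            continue
        with open("{}.pdf".format(page), "wb") as f:
            f.write(urlopen(link).read())
```

Calling download_pdfs(range(1, 4)) would then try pages 1 to 3. Returning None for a missing link keeps one odd page from stopping the whole loop.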

Get XHR info from URL

I have this website https://www.futbin.com/22/player/7504 and I want to know if there is a way to get the XHR url for the information using python. For example for the URL above I know the XHR I want is https://www.futbin.com/22/playerPrices?player=231443 (got it from inspect element -> network).
My objective is to get the price value from https://www.futbin.com/22/player/1 to https://www.futbin.com/22/player/10000 at once without using inspect element one by one.
import requests
URL = 'https://www.futbin.com/22/playerPrices?player=231443'
page = requests.get(URL)
x = page.json()
data = x['231443']['prices']
print(data['pc']['LCPrice'])
print(data['ps']['LCPrice'])
print(data['xbox']['LCPrice'])
You can find the player-resource id and build the URL yourself. I use BeautifulSoup. It's made for parsing websites, but you can also pass the requests content to another HTML parser if you don't want to install BeautifulSoup.
With it, read the first URL, get the id, and use your code to pull the prices. To test, change the upper bound of the range to 2 or 3 and you'll see it works.
import re, requests
from bs4 import BeautifulSoup
# range() excludes its end value, so 10001 covers players 1 to 10000
for i in range(1, 10001):
    url = 'https://www.futbin.com/22/player/{}'.format(i)
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    # the element whose id matches 'page-info' carries the id used by the XHR URL
    player_resource = soup.find(id=re.compile('page-info')).get('data-player-resource')
    # print(player_resource)
    URL = 'https://www.futbin.com/22/playerPrices?player={}'.format(player_resource)
    page = requests.get(URL)
    x = page.json()
    # print(x)
    data = x[player_resource]['prices']
    print(data['pc']['LCPrice'])
    print(data['ps']['LCPrice'])
    print(data['xbox']['LCPrice'])
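When looping over thousands of players, it may help to pull the price lookup into a small function that tolerates missing entries. The sketch below assumes the JSON layout used above (player id, then 'prices', then platform, then 'LCPrice'); the function name and the sample values are made up for illustration.

```python
def lc_prices(payload, player_resource):
    """Return {platform: LCPrice} for one player, or None if the player is absent."""
    entry = payload.get(str(player_resource))
    if entry is None:
        return None
    prices = entry["prices"]
    return {platform: prices[platform]["LCPrice"] for platform in ("pc", "ps", "xbox")}


# Made-up payload with the same shape as the playerPrices response shown above.
sample = {
    "231443": {
        "prices": {
            "pc": {"LCPrice": "100"},
            "ps": {"LCPrice": "200"},
            "xbox": {"LCPrice": "300"},
        }
    }
}

result = lc_prices(sample, 231443)
print(result)
```

Inside the loop, the return value of page.json() would take the place of sample.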

How to show full content of another website in Django?

I am trying to show the full content of another website on my Django site, or to modify the links people click when they browse other websites through my site. How can I do this?
import requests
from django.http import HttpResponse
def one(request, myurl='google.com'):
    url = 'http://' + myurl
    r = requests.get(url)
    return HttpResponse(r)
The result of requests.get is a Response object, not a string. You can obtain the body through its content attribute. For example:
import requests
from django.http import HttpResponse
def one(request, myurl='google.com'):
    url = 'http://' + myurl
    r = requests.get(url)
    # pass through the body, content type, and status of the upstream response
    return HttpResponse(
        content=r.content,
        content_type=r.headers.get('Content-Type'),
        status=r.status_code
    )

Parse data from webpage using python

Can anyone please help me parse particular data from a web page? Here is the content on the webpage.
{"sites":[{"id":"XX","name":"YY","url":"ZZ","username":"AA","password":"BB","siteId":"0"},{"id":"XX","name":"YY","url":"ZZ","username":"AA","password":"BB","siteId":"0"}]}
I need just the id values from the content. Note that id appears twice in the page content, so I need every id on the page. Here is the code I have written to dump the web content, but I am unable to parse out the data I need. Please help me.
import urllib
def test(ip):
    url = 'http://%s/' % ip
    response = urllib.urlopen(url)
    webContent = response.read()
    print webContent
Your content is a JSON document. You can parse it with the json library and use it as a Python object:
import json
import urllib
def test(ip):
    url = 'http://%s/' % ip
    # Python 2 API, as in the question; in Python 3 use urllib.request.urlopen
    response = urllib.urlopen(url)
    webContent = response.read()
    content = json.loads(webContent)
    # collect the id of every entry in the "sites" list
    print([site['id'] for site in content['sites']])
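The parsing step can be tried without a server by feeding the sample content from the question straight to json.loads. This is a small Python 3 sketch; the function name site_ids is made up for illustration.

```python
import json

# The sample page content from the question.
raw = (
    '{"sites":[{"id":"XX","name":"YY","url":"ZZ","username":"AA",'
    '"password":"BB","siteId":"0"},{"id":"XX","name":"YY","url":"ZZ",'
    '"username":"AA","password":"BB","siteId":"0"}]}'
)


def site_ids(raw_text):
    """Parse the JSON text and return the id of every entry under 'sites'."""
    return [site["id"] for site in json.loads(raw_text)["sites"]]


ids = site_ids(raw)
print(ids)
```

Because both entries in the sample use the placeholder "XX", the list contains that value twice.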

How to open with urllib, link parsed by BeautifulSoup?

I use python 3, Beautiful Soup 4 and urllib for parsing some html.
I need to parse some pages, get some links from these pages, and then parse the pages behind those links. I tried to do it like this:
import urllib.request
import urllib
from bs4 import BeautifulSoup
with urllib.request.urlopen("https://example.com/mypage?myparam=%D0%BC%D0%B2") as response:
    html = response.read()
soup = BeautifulSoup(html, 'html.parser')
total = soup.find(attrs={"class":"item_total"})
link = u"https://example.com" + total.find('a').get('href')
with urllib.request.urlopen(link) as response:
    # parse the body of the second page (the parser name is the second argument)
    exthtml = BeautifulSoup(response.read(), 'html.parser')
But urllib can't open the second link, because it is not encoded like the first link. It contains non-ASCII characters and white space.
I tried to encode the URL like this:
link = urllib.parse.quote("https://example.com" + total.find('a').get('href'))
But it encodes all symbols. How can I get a proper URL from bs and make the request?
UPD:
Example of the second link, produced by
link = u"https://example.com" + total.find('a').get('href')
is
"https://example.com/mypage?p1url=www.example.net%2Fthatpage%2F01234&text=абвгд еёжз ийклмно"
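You should URL-encode only the part of the link that comes from the page, and keep the URL delimiters and the existing percent-escapes intact:
link = "https://example.com" + urllib.parse.quote(total.find('a').get('href'), safe="/?&=%")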
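To see what the quoting does, here is a sketch that applies urllib.parse.quote to the example link from the update. The safe="/?&=%" argument is an assumption about which characters should stay untouched: it keeps the query delimiters and does not re-encode the existing %2F escapes, while spaces and Cyrillic letters are percent-encoded as UTF-8.

```python
from urllib.parse import quote

# The href as it would come out of the page in the question's example.
href = "/mypage?p1url=www.example.net%2Fthatpage%2F01234&text=абвгд еёжз ийклмно"

# Leave '/', '?', '&', '=' and existing '%' escapes alone; encode everything else.
link = "https://example.com" + quote(href, safe="/?&=%")

print(link)
```

Treating % as safe only works when any % in the href already starts a valid escape, as it does here.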
