I'm trying to fetch only the URLs from a report, using Python, given a response in JSON format. The response is as below:
text = {'result':[{'URL':'/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_1_Xe2cThkh.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_10_u0Egjf03.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_2_MnC1FzvY.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_3_8APKPJ6E.pdf'}]}
I would need to add this url text to fetched url: 'http://static.sse.com.cn', I coded a for loop:
data = json.loads(text)
for every_report in data['result']:
    pdf_url = 'http://static.sse.com.cn' + every_report['URL']
    print(pdf_url)
But this is the result I get: the base URL is only prepended once, at the start, and the rest of the URLs are still joined together:
http://static.sse.com.cn/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_6_Y2pswtvy.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_10_GBwvYOfG.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_11_2LvtFNYz.pdf<br>
What should I do to get all the URLs, each with the base URL prepended? Thank you.
The reason is that the string value of the URL key contains <br> separators. You have to split on them before constructing the full URLs. Note also that, as written, text is already a Python dict, so you can iterate over it directly without json.loads:
for every_report in text['result']:
    urls = every_report['URL'].split('<br>')
    pdf_urls = ['http://static.sse.com.cn' + url for url in urls]
    print(pdf_urls)
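If you would rather end up with one flat list of full URLs across all reports instead of printing one list per report, the same split can feed a single comprehension (using the sample data above):

all_pdf_urls = [
    'http://static.sse.com.cn' + url
    for report in text['result']
    for url in report['URL'].split('<br>')
]
print(all_pdf_urls)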
I would like to get a table from ebi.ac.uk/interpro with the list of all the thousands of protein names, accession numbers, species, and lengths for the entry I put on the website. I tried to write a script in Python using requests, BeautifulSoup, and so on, but I always get the error
AttributeError: 'NoneType' object has no attribute 'find_all'.
The code
import requests
from bs4 import BeautifulSoup

# Set the URL of the website you want to scrape
url = xxxx

# Send a request to the website and get the response
response = requests.get(url)

# Parse the response using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Find the table on the page
table = soup.find("table", class_='xxx')

# Extract the data from the table
# This will return a list of rows, where each row is a list of cells
table_data = []
for row in table.find_all('tr'):
    cells = row.find_all("td")
    row_data = []
    # for cell in cells:
    #     row_data.append(cell.text)
    # table_data.append(row_data)

# Print the extracted table data
# print(table_data)
For table = soup.find("table", class_='xxx'), I fill in the class according to the name I see when I inspect the page.
Thank you.
I would like to get a table listing all the thousands of proteins that the website returns for my request.
Sure, it is possible; take a look at this example:
import requests

url = "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/"
querystring = {"search": "", "page_size": "9999"}

response = requests.get(url, params=querystring)
print(response.text)
Please do not use Selenium unless absolutely necessary. In this example we request all the entries from /hamap/; I have no idea what that dataset means, but this is the API the site itself uses to fetch the data. You can find the API for the dataset you want to scrape as follows:
open Chrome dev tools -> Network -> click Fetch/XHR -> click on the specific source you want -> wait until the page loads -> click the red record icon to stop recording -> look through the captured requests for the one you want. It is important not to keep recording after you have retrieved the initial response: this website sends a tracking request every second or so, and the list becomes cluttered really quickly. Once you have the endpoint you want, just loop over the results array and pick out the fields you need, as in the sketch below. I hope this answer was useful to you.
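As a minimal sketch of that last step, assuming the /hamap/ endpoint returns the same {"count", "next", "results"} envelope as the paginated endpoint in the next answer, and assuming each result nests its fields under a metadata key (check one response in the dev tools to confirm both):

import requests

url = "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/"
data = requests.get(url, params={"search": "", "page_size": "9999"}).json()

# Loop over the results array and pick out the fields you need;
# "metadata", "accession" and "name" are assumed key names.
for entry in data.get("results", []):
    metadata = entry.get("metadata", {})
    print(metadata.get("accession"), metadata.get("name"))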
Hey, I checked it out some more. This site uses something similar to Elasticsearch's scroll. Here is a full implementation of what you are looking for:
import requests
import json

results_array = []

def main():
    starturl = "https://www.ebi.ac.uk/interpro/wwwapi//protein/UniProt/entry/InterPro/IPR002300/?page_size=100&has_model=true" ## This is the URL you want to scrape on page 0
    startpage = requests.get(starturl).json() ## Parse the first page once
    count = int(startpage['count']) ## This is the total number of entries
    next_url = startpage['next'] ## URL of the next page (None on the last page)
    results_array.extend(startpage['results'])
    while next_url is not None:
        count -= 100
        nextpage = requests.get(next_url).json() ## this is the next page
        results_array.extend(nextpage['results']) ## collect results before moving on, so the last page is not dropped
        next_url = nextpage['next']
        print(count)

if __name__ == '__main__':
    main()
    with open("output.json", "w") as f:
        f.write(json.dumps(results_array))
To use this for any other dataset, replace the starturl string with the corresponding API URL. Make sure it is the URL that controls pagination: click on the data you want, then go to the next page, and use that request's URL.
I hope this answer is what you were looking for.
I need to get the JSON containing the info from this URL on hkex.com.hk. I can do so using Firefox > Developer Tools > Network and looking for the JSON I want; I need to do the same using Python. So far I have this:
url='https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
r = requests.get(url)
print(r.text)
But I only receive HTML, so even after using .json() I get the error "Expecting value" because there is no JSON body. How can I achieve this?
The response is HTML text, so you are not able to use the json() method on the entire response. You would have to find the part of the HTML that holds the data you want and convert just that part to JSON.
The JSON is indeed hidden in the URL you mention in one of your comments. You have to get the HTML, extract the JSON, and load it:
import requests
import json

url = 'https://www1.hkex.com.hk/hkexwidget/data/getequityfilter?lang=eng&token=evLtsLsBNAUVTPxtGqVeG8QpVRBPNt2I8CbDELLpyZv%2bff8QFzdfZ6w1Za4TWSJ6&sort=5&order=0&qid=1627367921383&callback=jQuery35106295196366220494_1627367912871&_=1627367912873'
req = requests.get(url)

# Now for the extraction: the body is JSON wrapped in a jQuery callback,
# e.g. jQuery35106295196366220494_1627367912871({...}), so take what sits
# between the first '(' and the last ')':
target = req.text.split('(', 1)[1].rsplit(')', 1)[0]
data = json.loads(target)
data
The output should be your json.
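If the callback name changes between requests (it is generated per page load), a slightly more defensive sketch strips whatever name wraps the parentheses with a regular expression; this assumes the body is a single callback(...) wrapper:

import re
import json

match = re.search(r'^[^(]*\((.*)\)\s*$', req.text, re.S)
if match:
    data = json.loads(match.group(1))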
I've tried with the urllib and requests libraries, but the data in the fragment was not written to the .html file. Help me please :(
Here is the attempt with requests:
url = 'https://xxxxxxxxxxx.co.jp/InService/delivery/#/V=2/partsList/Element.PartsList%3A%3AVj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDEwIl0sIm5uIjoyMTQsInRzIjoxNTc5ODM0OTIwMDE5fQ?filterId=Product%3A%3AVj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = requests.get(url)
print(response)
Here is the attempt with urllib:
import urllib.request
import base64

url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
request = urllib.request.Request(url)
string = '%s:%s' % ('xx', 'xx')
base64string = base64.standard_b64encode(string.encode('utf-8'))
request.add_header("Authorization", "Basic %s" % base64string.decode('utf-8'))
u = urllib.request.urlopen(request)
webContent = u.read()
Here is the home page of the site (url: https://xxxxxx.co.jp/InService/delivery/#/V=2/home), and here is the page I want the data from (url: https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzE...). Every time I request the page as in the second picture, the HTML content I get back is the HTML of picture 1, because what you see in picture 2 comes from the fragment.
If all you would like is the HTML of the webpage, just use requests as you did in the first example, except instead of print(response) use print(response.text). Note that everything after the # in the URL is a fragment: the browser handles it client-side and never sends it to the server, which is why you keep getting the base page's HTML.
To save the HTML into a file use:
import requests

url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = requests.get(url)
with open("output.html", 'w+', encoding="utf-8") as f:
    f.write(response.text)  # response.content is bytes; .text is the decoded HTML string
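To see why the fragment makes no difference to the server, you can split it off with urllib.parse.urldefrag; only the part before the # is actually sent in the request:

from urllib.parse import urldefrag

base, fragment = urldefrag(url)
print(base)      # the URL the server actually receives
print(fragment)  # resolved client-side by the page's JavaScript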
If you need a certain part of the webpage, use BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = BeautifulSoup(requests.get(url).content, "html.parser")
Use inspect element to find the tag of the table you want in the second image, e.g. https://imgur.com/a/pGbCCFy, then use:
found = response.find('div', attrs={"class":"x-carousel__body no-scroll"}).find_all('ul')
(That selector is for the eBay example I linked above.) This should return the table, which you can then do whatever you like with.
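As a follow-up sketch, once find_all('ul') gives you the list elements, you can pull the visible text out of each one; the li tag here is an assumption about how that page's rows are marked up:

for ul in found:
    for li in ul.find_all('li'):
        print(li.get_text(strip=True))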
I'm trying to webscrape URLs from a website and send them to a .CSV file using a set so that duplicate URLs are removed. I understand what a set is and how to create a set, I just don't understand how to send webscraped data to a set. I'm assuming it's in the for loop but I'm newish to Python and am not quite sure. Here is the tail end of my code:
url_list = soup.find_all('a')
with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for link in url_list:
        url = str(link.get('href'))
        if url:
            if 'https://www.example.com' not in url:
                url = 'https://www.example.com' + url
            writer.writerow([url])
I know that I need to create a set() and add the URLs to the set but am unsure how and I'm told that it will also get rid of any duplicates, which would be great. Any help would be much appreciated. Thanks!
You can create a set, add the URLs to it, and then write it to the file. Be sure to give the set a different name from url_list; otherwise the loop would iterate over the empty set you just created instead of the scraped links:
unique_urls = set()
for link in url_list:
    url = str(link.get('href'))
    if url:
        if 'https://www.example.com' not in url:
            url = 'https://www.example.com' + url
        unique_urls.add(url)  # a set silently drops duplicates

with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for i in unique_urls:
        writer.writerow([i])
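Note that sets are unordered, so the rows will come out in an arbitrary order; if you want a stable file, sort the set before writing:

for i in sorted(unique_urls):
    writer.writerow([i])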
Can anyone please help me parse particular data from a web page? Here is the content on the webpage.
{"sites":[{"id":"XX","name":"YY","url":"ZZ","username":"AA","password":"BB","siteId":"0"},{"id":"XX","name":"YY","url":"ZZ","username":"AA","password":"BB","siteId":"0"}]}
I need just the id values from the content. Please note we have id two times here in the content of the webpage, so I need all ids. Here is the code I have written to dump the web content, but I am unable to parse the data I need. Please help me.
import urllib

def test(ip):
    url = 'http://%s/' % ip
    response = urllib.urlopen(url)
    webContent = response.read()
    print webContent
Your content is a JSON document; you can parse it with the json library and use it as a Python object:
import json
import urllib

def test(ip):
    url = 'http://%s/' % ip
    response = urllib.urlopen(url)
    webContent = response.read()
    content = json.loads(webContent)
    print([site['id'] for site in content['sites']])
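If you are on Python 3, where urllib.urlopen no longer exists, the same idea looks like this; using requests here is an assumption that the library is installed (urllib.request.urlopen works as well):

import requests

def test(ip):
    url = 'http://%s/' % ip
    content = requests.get(url).json()  # parse the JSON body directly
    print([site['id'] for site in content['sites']])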