How to add webscraped data to a set in Python

I'm trying to webscrape URLs from a website and write them to a CSV file, using a set so that duplicate URLs are removed. I understand what a set is and how to create one; I just don't understand how to send webscraped data into a set. I'm assuming it happens in the for loop, but I'm newish to Python and am not quite sure. Here is the tail end of my code:
url_list = soup.find_all('a')
with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for link in url_list:
        url = str(link.get('href'))
        if url:
            if 'https://www.example.com' not in url:
                url = 'https://www.example.com' + url
            writer.writerow([url])
f.close()
I know that I need to create a set() and add the URLs to it, but I'm unsure how. I'm told it will also get rid of any duplicates, which would be great. Any help would be much appreciated. Thanks!

You can create a set, add the URLs to the set, then write it to the file. Use a different name for the set (e.g. unique_urls here) so it doesn't clash with the url_list returned by soup.find_all('a'):
unique_urls = set()
for link in url_list:
    url = str(link.get('href'))
    if url:
        if 'https://www.example.com' not in url:
            url = 'https://www.example.com' + url
        unique_urls.add(url)

with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for i in unique_urls:
        writer.writerow([i])
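Note that a set has no defined order, so the rows may come out in arbitrary order. If you want deterministic output, here is a small variant of the writing loop that sorts the set first (just a sketch):
with open('HTMLList.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for url in sorted(unique_urls):
        writer.writerow([url])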

Related

Python Web Scraping json.loads()

I'm trying to fetch only the URLs from a report, using the response given in JSON format, with Python.
The response is as below:
text = {'result':[{'URL':'/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_1_Xe2cThkh.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_10_u0Egjf03.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_2_MnC1FzvY.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600781_20220630_3_8APKPJ6E.pdf'}]}
I need to add this prefix to each fetched URL: 'http://static.sse.com.cn'. I coded a for loop:
data = json.loads(text)
for every_report in data['result']:
    pdf_url = 'http://static.sse.com.cn' + every_report['URL']
    print(pdf_url)
But this is the result I get; the text is only added to the first URL:
http://static.sse.com.cn/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_6_Y2pswtvy.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_10_GBwvYOfG.pdf<br>/disclosure/listedinfo/announcement/c/new/2022-06-30/600532_20220630_11_2LvtFNYz.pdf<br>
What should I do to get all the URLs with the text added? Thank you.
The reason is that the string value of the URL key contains <br> separators. You have to split on them before constructing the full URLs:
for every_report in text['result']:
    urls = every_report['URL'].split('<br>')
    pdf_urls = ['http://static.sse.com.cn' + url for url in urls]
    print(pdf_urls)
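If the URL string ever ends with a trailing <br> (as in the output shown in the question), split() will also produce empty strings. A small sketch that filters those out and builds one flat list across all reports:
all_pdf_urls = []
for every_report in text['result']:
    # split on the <br> separators and drop any empty pieces
    parts = [p for p in every_report['URL'].split('<br>') if p]
    all_pdf_urls.extend('http://static.sse.com.cn' + p for p in parts)
print(all_pdf_urls)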

Python saves empty html file

import requests
url = 'https://hkpropel.humankinetics.com/mylibrary.htm'
r = requests.get(url)
with open('saving.html', 'wb+') as f:
    f.write(r.content)
    f.close()
I'm trying to save the web page to an HTML file. It works fine for other websites, but for this one it always saves empty data.
The page you're trying to copy likely requires you to be logged in. If you're not, it redirects you to the login page and you end up getting a copy of that instead.
Also, I'm pretty new to Python, but I don't think you need f.close() there; the with block already closes the file for you.
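If the page really is behind a login, one common approach is to log in with a requests.Session first so its cookies are reused for the protected page. The login URL and form field names below are only placeholders; you would need to take the real ones from the site's login form:
import requests

session = requests.Session()

# Hypothetical login endpoint and form fields; inspect the site's
# actual login form to find the real URL and input names.
login_url = 'https://hkpropel.humankinetics.com/login'
payload = {'username': 'you@example.com', 'password': 'secret'}
session.post(login_url, data=payload)

# The session now carries any cookies set during login.
r = session.get('https://hkpropel.humankinetics.com/mylibrary.htm')
with open('saving.html', 'wb') as f:
    f.write(r.content)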

How to use requests and read URLs from a txt file

I'm writing a piece of code where I have a list of URLs in a file, and I'm using requests to go through the file, do a GET request on each, and print the status code, but with what I have written I am not getting any output:
import requests
with open('demofile.txt','r') as http:
for req in http:
page=requests.get(req)
print page.status_code
I see two problems: one is that you forgot to indent the lines after the for loop, and the second is that you didn't strip the trailing newline \n from each line (assuming the URLs are on separate lines):
import requests
with open('deleteme', 'r') as urls:
    for url in urls.readlines():
        req = url.strip()
        print(req)
        page = requests.get(req)
        print(page.status_code)
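If some lines in the file are blank or a request fails, the loop above will stop with an exception. A slightly more defensive sketch that skips empty lines and catches request errors:
import requests

with open('demofile.txt', 'r') as urls:
    for line in urls:
        req = line.strip()
        if not req:  # skip blank lines
            continue
        try:
            page = requests.get(req)
            print(req, page.status_code)
        except requests.RequestException as exc:
            print(req, 'failed:', exc)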

Scrape texts from multiple websites and save separately in text files

I am a beginner in Python and have been using it for my master's thesis to conduct textual analysis of the gaming industry. I have been trying to scrape reviews from several gaming critic sites.
I used a list of URLs in the code to scrape the reviews and have been successful. Unfortunately, I could not write each review to a separate file. As I write the files, I either receive only the review from the last URL in the list in all of the files, or all of the reviews in all of the files after changing the indent. Following is my code. Could you kindly suggest what's wrong here?
from bs4 import BeautifulSoup
import requests

urls = ['http://www.playstationlifestyle.net/2018/05/08/ao-international-tennis-review/#/slide/1',
        'http://www.playstationlifestyle.net/2018/03/27/atelier-lydie-and-suelle-review/#/slide/1',
        'http://www.playstationlifestyle.net/2018/03/15/attack-on-titan-2-review-from-a-different-perspective-ps4/#/slide/1']

for url in urls:
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')

for i in range(len(urls)):
    file = open('filename%i.txt' % i, 'w')
    for article_body in soup.find_all('p'):
        body = article_body.text
        file.write(body)
    file.close()
I think you only need one for loop. If I understand correctly, you only want to iterate through urls and store an individual file for each.
Therefore, I would suggest removing the second for statement. You do, though, then need to modify for url in urls so that you also get a unique index for the current url to use for i; you can use enumerate for that.
Your single for statement would become:
for i, url in enumerate(urls):
I've not tested this myself but this is what I believe should resolve your issue.
I can tell you are a beginner in Python. I'll post the corrected code first and then explain it.
for i, url in enumerate(urls):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')
    file = open('filename{}.txt'.format(i), 'w')
    for article_body in soup.find_all('p'):
        body = article_body.text
        file.write(body)
    file.close()
The reason why you receive only the review from the last URL in all the files: one variable holds one value, so after the first for loop finishes you are left with only the last result (the third one); the first and second results have been overwritten:
for url in urls:
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')
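For what it's worth, the same corrected loop can be written with a with block so each file is closed automatically even if writing fails (same logic, just a sketch):
for i, url in enumerate(urls):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')
    with open('filename{}.txt'.format(i), 'w') as f:
        for article_body in soup.find_all('p'):
            f.write(article_body.text)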

Scrape specific urls from a page and convert them to absolute urls

I need some help from you Pythonists: I'm scraping all urls starting with "details.php?" from this page and ignoring all other urls.
Then I need to convert every url I just scraped to an absolute url, so I can scrape them one by one. The absolute urls start with: http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?...
I tried using re.findall like this:
html = scraperwiki.scrape(url)
if html is not None:
    endofurl = re.findall("details.php?(.*?)>", html)
This gets me a list, but then I get stuck. Can anybody help me out?
You can use urlparse.urljoin() to create the full urls:
>>> import urlparse
>>> base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
>>> urlparse.urljoin(base_url, 'details.php?whatever')
'http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?whatever'
You can use a list comprehension to do this for all of your urls:
full_urls = [urlparse.urljoin(base_url, url) for url in endofurl]
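Note that urlparse is the Python 2 module name; in Python 3 the same function lives in urllib.parse:
from urllib.parse import urljoin

base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
full_urls = [urljoin(base_url, url) for url in endofurl]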
If you only need the final urls one at a time and can be done with them after that, you can use a generator expression instead of building a list:
abs_url = "url data"
urls = (abs_url + url for url in endofurl)
If you are worried about special characters in the url, you can quote it with urllib.quote().
Ah! My favorite...list comprehensions!
base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/{0}'
urls = [base_url.format(x) for x in list_of_things_you_scraped]
I'm not a regex genius, so you may need to fiddle with base_url until you get it exactly right.
If you'd like to use lxml.html to parse the html, there is .make_links_absolute():
import lxml.html
html = lxml.html.make_links_absolute(html,
    base_href="http://evenementen.uitslagen.nl/2013/marathonrotterdam/")
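For completeness, here is a sketch of how the whole task could look with lxml: parse the page, make every link absolute, then keep only the details.php? links. The xpath selector is an assumption about the page structure:
import lxml.html

doc = lxml.html.fromstring(html)
doc.make_links_absolute('http://evenementen.uitslagen.nl/2013/marathonrotterdam/')

# keep only the hrefs that point at details.php?...
detail_urls = [href for href in doc.xpath('//a/@href') if 'details.php?' in href]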
