I have a script that extracts all links from a website. I thought that converting to a list would do the job of making sure I only returned unique links, but there are still duplicates in the output (e.g. 'www.commerce.gov/' and 'www.commerce.gov'); the code is not picking up the trailing characters. Below is my code. Any help is appreciated. Thanks.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import csv
req = Request("https://www.census.gov/programs-surveys/popest.html")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
prettyhtml = soup.prettify()
Html_file = open("U:\python_intro\popest_html.txt","w")
Html_file.write(prettyhtml)
Html_file.close()
links = []
for link in soup.findAll('a', attrs={'href': re.compile(r'^(?:http|ftp)s?://')}):
    links.append(link.get('href'))
links = set(links)
myfile = "U:\python_stuff\links.csv"
with open(myfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for a in links:
        writer.writerow([a])
You mean "converting to a set", not a list.
You can remove any possible trailing '/':
links.append(link.get('href').rstrip('/'))
Or, even better, build a set in the first place:
links = set()
for link in soup.findAll('a', attrs={'href': re.compile(r'^(?:http|ftp)s?://')}):
    links.add(link.get('href').rstrip('/'))
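If you want to catch more cosmetic variants of the same link (different case in the host, a trailing slash on the path), a slightly fuller normalization step helps. This is just a sketch using the standard library's urllib.parse, and the "sameness" rules are my own assumptions; adjust them to whatever you consider the same link:
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    # Lower-case the scheme and host and drop any trailing slash from the path.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip('/'), parts.query, parts.fragment))

links = set()
for link in soup.findAll('a', attrs={'href': re.compile(r'^(?:http|ftp)s?://')}):
    links.add(normalize(link.get('href')))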
I'm trying to write a script that iterates through a list of web pages, extracts the links from each page, and checks each link to see if it is in a given set of domains. I have the script set up to write two files: pages with links in the given domains are written to one file, while the rest are written to the other. I'm essentially trying to sort the pages based on the links they contain. Below is my script, but it doesn't look right. I'd appreciate any pointers on how to achieve this (I'm new at this, can you tell?).
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.rose.com', 'https://www.pink.com']
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')
    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com)+ | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')
There are some very basic problems with your code:
if invalid == None is missing a : at the end, but should also be if invalid is None:
not all <a> elements will have an href, so you need to deal with those, or your script will fail.
the regex has some issues (you probably don't want to repeat that first URL with +, the spaces around | become part of the alternatives, and the parentheses are pointless)
you write the URL to the file every time you find a problem, but you only need to write it once if it has any problem at all; or perhaps you wanted a full list of all the problematic links?
you rewrite the files on every iteration of your for loop, so you only get the final result
Fixing all that (and using a few arbitrary URLs that work):
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # if there's no result, the link doesn't match what we need, so write it and stop searching
                g.write(urls[i])
                g.write('\n')
                break
    else:  # no break: every link on the page matched, so the page is good
        f.write(urls[i])
        f.write('\n')
However, there are still a lot of issues:
you open file handles but never close them; use with instead
you loop over a list using an index; that's not needed, loop over urls directly
you compile a regex for efficiency, but do so on every iteration, countering the effect
The same code with those problems fixed:
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # if there's no result, the link doesn't match what we need, so write it and stop searching
                    g.write(url)
                    g.write('\n')
                    break
        else:  # no break: every link on the page matched, so the page is good
            f.write(url)
            f.write('\n')
Or, if you want to list all the problematic URLs on the sites:
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
Down at the bottom, where it prints the links, is where I know the fix needs to go, but I can't think of a way to remove duplicate entries there. Can someone help me with that, please?
Use a set to remove duplicates. You call add() to add an item and if the item is already present then it won't be added again.
Try this:
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls) # urls contains unique set of URLs
Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
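For example, the loop above could use the broader pattern like this (just a small variation on the code in this answer):
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    urls.add(link.get('href'))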
You can also use set comprehension syntax to rewrite the assignment and for statements like this.
urls = {
    link.get("href")
    for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}
Instead of printing each link straight away, you need to collect them somewhere so you can compare them.
Try this:
find_all() gives you a list with all the results; turning it into a set removes the duplicates.
data = set(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^https://")}))
for elem in data:
    print(elem)
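One caveat not mentioned above: a set does not preserve the order in which the links appear on the page. If you want unique links in page order, a dict can act as an ordered set, since dict keys are unique and keep insertion order in Python 3.7+ (an extra suggestion, not part of the original answer):
data = dict.fromkeys(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^https://")}))
for elem in data:
    print(elem)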
I want to open a txt file (which contains multiple links) and scrape the title from each link using BeautifulSoup.
My txt file contains links like this:
https://www.lipsum.com/7845284869/
https://www.lipsum.com/56677788/
https://www.lipsum.com/01127111236/
My code:
import requests as rq
from bs4 import BeautifulSoup as bs
with open('output1.csv', 'w', newline='') as f:
    url = open('urls.txt', 'r', encoding='utf8')
    request = rq.get(str(url))
    soup = bs(request.text, 'html.parser')
    title = soup.findAll('title')
    pdtitle = {}
    for pdtitle in title:
        pdtitle.append(pdtitle.text)
        f.write(f'{pdtitle}')
I want to open all the links from the txt file and scrape the title from each one. The main problem is that opening the txt file into the url variable is not working. How do I open the file and save the data to a csv?
Your code isn't working because the url variable holds the whole file, not an individual URL. You need to read the URLs and request them one by one:
import requests as rq
from bs4 import BeautifulSoup as bs

with open(r'urls.txt', 'r') as f:
    urls = f.readlines()

with open('output1.csv', 'w', newline='') as f:
    for url in urls:
        request = rq.get(url.strip())  # strip the trailing newline from each line
        soup = bs(request.text, 'html.parser')
        titles = soup.findAll('title')
        for pdtitle in titles:
            f.write(f'{pdtitle.text}\n')  # write the title text, one per line
Your urls may not be working because they are being read with a trailing newline character: \n. You need to strip the text before putting them in a list.
Also, you are using .find_all('title'), and this will return a list, which is probably not what you are looking for. You probably just want the first title and that's it. In that case, .find('title') would be better. I have provided some possible corrections below.
from bs4 import BeautifulSoup
import requests

filepath = '...'
with open(filepath) as f:
    urls = [i.strip() for i in f.readlines()]

titles = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    title = soup.find('title')   # Note: will find the FIRST title only
    titles.append(title.text)    # Grabs the TEXT of the title only, removes HTML

new_csv = open('urls.csv', 'w')  # Make sure to prepend with desired location, e.g. 'C:/user/name/urls.csv'
for title in titles:
    new_csv.write(title+'\n')    # The '\n' ensures a new row is written
new_csv.close()
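Since the question asks about saving to a csv, it is also worth noting that the csv module quotes values for you if a title happens to contain a comma. A small sketch reusing the titles list from above (the urls.csv name is just an example):
import csv

with open('urls.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for title in titles:
        writer.writerow([title])  # one title per row, quoted automatically if needed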
I have a Python script that imports a list of urls from a CSV named list.csv, scrapes them, and outputs any anchor text and href links found on each url from the csv:
(For reference, the urls in the csv are all in column A.)
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import pandas
import csv
contents = []
with open('list.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url) # Add each url to list contents

for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    for link in soup.find_all('a'):
        if len(link.text)>0:
            print(url, link.text, '-', link.get('href'))
The output results look something like this where https://www.example.com/csv-url-one/ and https://www.example.com/csv-url-two/ are the url's in column A in the csv:
['https://www.example.com/csv-url-one/'] Creative - https://www.example.com/creative/
['https://www.example.com/csv-url-one/'] Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/'] PPC - https://www.example.com/ppc/
['https://www.example.com/csv-url-two/'] SEO - https://www.example.com/seo/
The issue is I want the output to look more like this, i.e. not repeatedly print the url from the CSV before each result, AND have a break after each url's results:
['https://www.example.com/csv-url-one/']
Creative - https://www.example.com/creative/
Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/']
PPC - https://www.example.com/ppc/
SEO - https://www.example.com/seo/
Is this possible?
Thanks
Does the following solve your problem?
for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    print('\n','********',', '.join(url),'********','\n')
    for link in soup.find_all('a'):
        if len(link.text)>0:
            print(link.text, '-', link.get('href'))
It is possible.
Simply add \n at the end of the print.
\n is the newline special character.
for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    for link in soup.find_all('a'):
        if len(link.text)>0:
            print(url, ('\n'), link.text, '-', link.get('href'), ('\n'),)
To add a separation between urls, add a \n before printing each url.
If you want to print a url only if it has valid links, i.e. if len(link.text)>0:, use the for loop to save the valid links to a list, and only print the url and its links if this list is not empty.
Try this:
for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    valid_links = []
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            valid_links.append(link)  # keep the whole tag so .text and .get('href') work below
    if len(valid_links):
        print('\n', url)
        for item in valid_links:
            print(item.text, '-', item.get('href'))
I first grab all article urls from an RSS feed and check for duplicates within that list. I then want to check those unique article urls against a csv file of old article urls to avoid duplicates with the csv list. I only want to print out the new urls that didn't match with the old urls in the csv.
I'm having trouble with the latter part, any help is appreciated!
import requests
from bs4 import BeautifulSoup
import csv
feed_urls = ["https://www.example.com/rss"]
with open("Old_Articles.csv", "r", encoding="utf-8") as r:
old_articles = csv.reader(r, delimiter=",")
for url in feed_urls:
response = requests.get(url)
html_source = response.text
soup = BeautifulSoup(html_source, "xml")
new_articles = set()
for link in soup.findAll("atom:link"):
new_articles.add(link.get("href"))
for link in new_articles:
if link not in old_articles:
print("Not Matched")
else:
print("Matched")