Extract URL from text file - Python

I am trying to extract a URL from a text file which contains the source code of a website. I want to get the website link inside href, and I wrote some code I borrowed from Stack Overflow, but I can't get it to work.
with open(sourcecode.txt) as f:
    urls = f.readlines()
    urls = ([s.strip('\n') for s in urls])
print(url)

Using a regexp, you can extract all urls from the text file, without the need to loop line by line:
import re
with open('/home/username/Downloads/Stack_Overflow.html') as f:
    urls = f.read()
links = re.findall('"((http)s?://.*?)"', urls)
for url in links:
    print(url[0])

You can use regular expressions for this.
import re
with open('sourcecode.txt') as f:
text = f.read()
href_regex = r'href=[\'"]?([^\'" >]+)'
urls = re.findall(href_regex, text)
print(urls)
You're probably getting an error like name 'sourcecode' is not defined; this is because the argument you pass to open() needs to be a string (see above).
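For reference, a minimal corrected version of the snippet from the question (keeping its variable names) would quote the filename and print the list urls rather than the undefined url:
with open('sourcecode.txt') as f:
    urls = [line.strip('\n') for line in f]
print(urls)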

Related

Python link scraper regex works when only searching for 1 extension type, but fails when matching more than one extension type

This is the test link I am using for this project: https://www.dropbox.com/sh/4cgwf2b6gk4bex4/AADtM1GDYgPDdv8QP6JdSOkba?dl=0
Now, the code below works when matching for .mp3 only (line 8), and outputs the plain link to a text file as asked.
import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.mp3')

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
The issue is that the test link above contains not only .mp3 files but also .flac and .wav.
When I change the code (line 8) to the following, to scrape and return all links with those extensions (.mp3, .flac, .wav), it outputs a text file containing just "mp3", "flac" and "wav". No links.
import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.(mp3|flac|wav)')

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
I've been trying to understand where the error is. Regex, or something else? I can't figure it out.
Thank you.
That's because in your second snippet you capture (with parentheses) only the extension of the URL/file. So one way to fix it is to add another capturing group, like below (read the comments):
import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'(http[s]?://[^\s]+\.(mp3|flac|wav))') # <- added outer parentheses

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link[0] + '\n') # <- link[0] instead of link
Another way to solve this is to use a non-capturing group with ?:
pattern = re.compile(r'http[s]?://[^\s]+\.(?:mp3|flac|wav)')
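With the non-capturing group, re.findall() returns the full matched URLs rather than tuples, so the rest of the question's script works unchanged; roughly:
links = list(set(re.findall(pattern, html_content)))
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')  # full URL, no indexing needed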

Script to extract links from page and check the domain

I'm trying to write a script that iterates through a list of web pages, extracts the links from each page and checks each link to see if they are in a given set of domains. I have the script set up to write two files: pages with links in the given domains are written to one file, while the rest are written to the other. I'm essentially trying to sort the pages based on the links in the pages. Below is my script, but it doesn't look right. I'd appreciate any pointers to achieve this (I'm new at this, can you tell).
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.rose.com', 'https://www.pink.com']

for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')
    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com)+ | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')
There are some very basic problems with your code:
- if invalid == None is missing a : at the end, but it should also be if invalid is None:
- not all <a> elements will have an href, so you need to deal with those, or your script will fail.
- the regex has some issues (you probably don't want to repeat that first URL, and the parentheses are pointless).
- you write the URL to the file every time you find a problem, but you only need to write it to the file if it has a problem at all; or perhaps you wanted a full list of all the problematic links?
- you rewrite the files on every iteration of your for loop, so you only get the final result.
Fixing all that (and using a few arbitrary URLs that work):
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # if there's no result, the link doesn't match what we need, so write it and stop searching
                g.write(urls[i])
                g.write('\n')
                break
    else:
        f.write(urls[i])
        f.write('\n')
However, there are still a number of issues:
- you open file handles but never close them; use with instead
- you loop over a list using an index; that's not needed, loop over urls directly
- you compile a regex for efficiency, but do so on every iteration, countering the effect
The same code with those problems fixed:
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # if there's no result, the link doesn't match what we need, so write it and stop searching
                    g.write(url)
                    g.write('\n')
                    break
        else:
            f.write(url)
            f.write('\n')
Or, if you want to list all the problematic URLs on the sites:
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')

Python only writing first line

I have a text file which I read in and then I extract the data I require and try sending it to a different new text file, but only the first line gets into the new text file.
import requests

url_file = open('url-test.txt', 'r')
out_file = open('url.NDJSON', 'w')
for url in url_file.readlines():
    html = requests.get(url).text
    out_file.writelines(html)
out_file.close()
Try:
for url in url_file.readlines():
    html = requests.get(url).text
    out_file.write(html)
or:
lines = []
for url in url_file.readlines():
    html = requests.get(url).text
    # verify you are getting the expected data
    print(111111, html)
    lines.append(html)
out_file.writelines(lines)
Either write each html string inside the for loop with write(), or append the strings to a list and pass it to writelines() after the loop.
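Putting the pieces together, a rough sketch of the whole script using context managers (file names taken from the question; stripping the trailing newline from each URL before requesting it is an assumption):
import requests

with open('url-test.txt') as url_file, open('url.NDJSON', 'w') as out_file:
    for url in url_file:
        # strip() assumed: lines read from a file keep their trailing newline
        html = requests.get(url.strip()).text
        out_file.write(html)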

Delete a certain tag with a certain id content from an HTML using python BeautifulSoup

I got a suggestion to use BeautifulSoup to delete a tag with a certain id from an HTML file. For example, deleting <div id=needDelete>...</div>. Below is my code, but it doesn't seem to be working correctly:
import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an extension with HTML
Files = os.listdir(cwd)
print Files

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
            matches = str(soup.find_all("div", id="jp-post-flair"))
            # The soup.find_all part should be correct as I tested it to
            # print the matches and the result matches the texts I want to delete.
            f.write(f.read().replace(matches, ''))
            # maybe the above line isn't correct
            f.close()

func(file)
Would you help check which part of the code is wrong, and maybe suggest how I should approach it?
Thank you very much!!
You can use the .decompose() method to remove the element/tag:
f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')

elements = soup.find_all("div", id="jp-post-flair")
for element in elements:
    element.decompose()

f.write(str(soup))
It's also worth mentioning that you can probably just use the .find() method because an id attribute should be unique within a document (which means that there will likely only be one element in most cases):
f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')

element = soup.find("div", id="jp-post-flair")
if element:
    element.decompose()

f.write(str(soup))
As an alternative, based on the comments below:
If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that allows you to selectively parse parts of the document.
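For instance, a minimal SoupStrainer sketch (html_doc here stands in for the HTML read from the file) that parses only the matching div:
from bs4 import BeautifulSoup, SoupStrainer

only_flair = SoupStrainer("div", id="jp-post-flair")
partial_soup = BeautifulSoup(html_doc, 'html.parser', parse_only=only_flair)
# partial_soup now contains only the matching <div>, not the whole document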
You mentioned that the indentation and formatting in the HTML file were being changed. Rather than just converting the soup object directly into a string, you can check out the relevant output formatting section in the documentation.
Depending on the desired output, here are a few potential options:
soup.prettify(formatter="minimal")
soup.prettify(formatter="html")
soup.prettify(formatter=None)

Extract CSS media queries from websites with python 2.7

I'm trying to find a specific CSS media query (@media only screen) in the CSS files of websites by using a crawler in Python 2.7.
Right now I can crawl websites/URLs (from a CSV file) to find specific keywords in their HTML source code using the following code:
import urllib2

keyword = ['keyword to find']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()
            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')
f.close()
However, I now would like to crawl the websites/URLs (from the CSV file) to find the @media only screen query in their CSS files/source code. What should my code look like?
So, you have to:
1° read the csv file and put each url in a Python list;
2° loop this list, go to the pages and extract the list of css links. You need an HTML parser, for example BeautifulSoup;
3° browse the list of links and extract the item you need. There are CSS parsers like tinycss or cssutils, but I've never used them. A regex can maybe do the trick, even if this is probably not recommended.
4° write the results
Since you know how to read the csv (PS : no need to close the file with f.close() when you use the with open method), here is a minimal suggestion for operations 2 and 3. Feel free to adapt it to your needs and to improve it. I used Python 3, but I think it works on Python 2.7.
import re
import requests
from bs4 import BeautifulSoup

url_list = ["https://76crimes.com/2014/06/25/zambia-to-west-dont-watch-when-we-jail-lgbt-people/"]

for url in url_list:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        css_links = [link["href"] for link in soup.findAll("link") if "stylesheet" in link.get("rel", [])]
        print(css_links)
    except Exception as e:
        print(e, url)
        pass

css_links = ["https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=d9243128ba1c"]

# your regular expression
pattern = re.compile(r'@media only screen.+?\}')

for url in css_links:
    try:
        response = requests.get(url).text
        media_only = pattern.findall(response)
        print(media_only)
    except Exception as e:
        print(e, url)
        pass
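For completeness, steps 1 and 4 might look roughly like this sketch (file names are placeholders; the CSV is assumed to hold one URL per line, as in the question's own loop):
# step 1: read the csv file and put each url in a Python list
with open('listofURLs.csv') as f:
    url_list = [line.strip() for line in f if line.strip()]

# step 4: write the collected results, e.g. the matched media query blocks
with open('media_queries.txt', 'w') as out:
    for block in media_only:
        out.write(block + '\n')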
