Parsing html for domain links - python

I have a script that parses an HTML page for all the links within it. I am getting all of them fine, but I have a list of domains I want to compare them against. So a sample list contains
list=['www.domain.com', 'sub.domain.com']
But I may have a list of links that look like
http://domain.com
http://sub.domain.com/some/other/page
I can strip off the http:// just fine, but in the two example links I just posted, they both should match. The first I would like to match against the www.domain.com, and the second, I would like to match against the subdomain in the list.
Right now I am using urllib2 for parsing the HTML. What are my options for completing this task?

You might consider stripping 'www.' from the list and doing something as simple as:
url = 'domain.com/'
for domain in list:
    if url.startswith(domain):
        ... do something ...
Or trying both won't hurt either, I suppose:
url = 'domain.com/'
for domain in list:
    domain_minus_www = domain
    if domain_minus_www.startswith('www.'):
        domain_minus_www = domain_minus_www[4:]
    if url.startswith(domain) or url.startswith(domain_minus_www):
        ... do something ...
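A slightly more robust sketch (not from the answer above, and assuming Python 3's urllib.parse rather than the urllib2 mentioned in the question): parse the hostname out of each link and compare it against the list with the www. prefix normalised away. The domain and link values are just the examples from the question.
from urllib.parse import urlparse

domains = ['www.domain.com', 'sub.domain.com']
links = ['http://domain.com', 'http://sub.domain.com/some/other/page']

def normalise(host):
    # Treat 'www.domain.com' and 'domain.com' as the same host.
    return host[4:] if host.startswith('www.') else host

wanted = {normalise(d) for d in domains}

for link in links:
    host = urlparse(link).netloc  # e.g. 'sub.domain.com'
    if normalise(host) in wanted:
        print(link, 'matches')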

Related

How to pass a ‘do not crawl’ list to scrapy that is not regex based?

Is there some way to pass scrapy a list of URLs to not crawl?
I found the LinkExtractor class, which has a deny parameter, but it’s regex based. I just want to pass it a list of URLs.
Background: I have a scrapy crawler based on the SitemapSpider class that can extract and index all the links from an XML sitemap no problem.
Since I am using it to index sites on a daily basis (it’s crawling job postings), I only want it to look at new pages. It saves server burden and index time for me to not look at previously indexed pages.
I’ve tried passing a list of links that I’ve previously indexed, but I get an error that the list of links are not regex objects. It could be that I just don’t know how to convert url strings to regex objects.
To convert a string to a regular expression that matches only that string, use re.escape to escape any regex metacharacters, then re.compile to compile the resulting string into a regex.
import re
url_list = ['http://example.com/', 'http://example.net/']
deny_list = [re.compile(re.escape(url)) for url in url_list]
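For completeness, a minimal sketch of handing that list to the extractor; it assumes LinkExtractor's deny argument accepts the compiled patterns (passing the escaped strings themselves works as well), and the URL list is just the example above:
import re
from scrapy.linkextractors import LinkExtractor

previously_indexed = ['http://example.com/', 'http://example.net/']

# One exact-match pattern per previously indexed URL.
deny_list = [re.compile(re.escape(url)) for url in previously_indexed]

# Links matching any pattern in deny are skipped by the extractor.
link_extractor = LinkExtractor(deny=deny_list)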
As a side note, this may or may not help for your use case, but if you know the list of denied URLs in advance (i.e. when writing your program), then you can use an external tool to compile them directly to an efficient regex. I'm particularly fond of regexp-opt, which is built into Emacs.
For instance, if we know in advance that we want to block those two URLs above (example.com and example.net), then we can do this in Emacs
(regexp-opt '("http://example.com/" "http://example.net"))
which gives us this
\(?:http://example\.\(?:com/\|net\)\)
and that's a much more efficient regex than checking the two URLs separately. But this only works if you know the URLs in advance, as I don't know of any Python tool to precompile match lists like this offhand. (I'm sure one exists; I just don't know of it.)
Option #1
The link extractor also has a process_value parameter that can be set to a callable. The callable can filter out certain URLs by returning None when the extracted URL is in your list of URLs; otherwise it simply returns the URL unchanged.
docs
Example:
previously_indexed = [....]  # list of urls

def link_filter(value):
    if value in previously_indexed:
        return None
    return value

linkextractor = LinkExtractor(..., process_value=link_filter)
Option #2
Since you are using the sitemap spider, you can override the spider's sitemap_filter method to filter out the URLs that have been previously indexed.
docs
For example:
from scrapy.spiders import SitemapSpider

class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered_sitemap_spider'
    allowed_domains = ['example.com']
    sitemap_urls = ['http://example.com/sitemap.xml']
    previously_indexed = [....]

    def sitemap_filter(self, entries):
        for entry in entries:
            if entry['loc'] not in self.previously_indexed:
                yield entry

Get substring from a string with specific pattern

I am working with Twitter links and I have extracted the links from Twitter's API.
https://twitter.com/shiromiru36/status/1302713597521403904/photo/1
The above is an example of links that I have.
I would like to get the front part of the link only; I want to delete
/photo/1
So that I can get
https://twitter.com/shiromiru36/status/1302713597521403904
Currently I extract the link by counting the number of characters that I don't want:
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
url[:-8]
I would like to ask if there is any way to extract the link by finding a pattern instead. Since the unwanted part comes after the second-to-last "/", I am wondering whether I can delete it by finding the second-to-last "/" first and then removing everything after it.
Thank you.
You could do something like this:
'/'.join(s.split('/')[:-2])
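For example, applied to the URL from the question (here s is just that URL as a string):
s = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
print('/'.join(s.split('/')[:-2]))
# https://twitter.com/shiromiru36/status/1302713597521403904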
Try this
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
# Reverse the string, split off the last two '/'-separated pieces, then reverse back.
url = (((url[::-1]).split('/', 2))[-1])[::-1]
print(url)
https://twitter.com/shiromiru36/status/1302713597521403904

pdf-redactor syntax for URL replacement

Using https://github.com/JoshData/pdf-redactor
if I provide a PDF with multiple URL links and use the example code:
options.link_filters = [
    lambda href, annotation : "https://www.google.com"
]
the effect is to change every single URL in the PDF to https://www.google.com
How can I get it to only replace, for example, https://www.example.com with https://www.google.com and leave the other URLs untouched?
Many thanks in advance.
Actually, you can do a lot with that lambda in that library. In the specific case you gave us, anything you pass into that function will come out with https://www.google.com as the output.
But if you want to do something different, you can use either the URL (href) or the annotation (or both!) as parameters to change the URLs in the document. Here is a way you can change multiple URLs at once:
options.link_filters = [lambda href, annotation:
    'www.google.com' if href == 'www.example.com' else
    'www.anything.com' if href == 'www.whatever.com' else
    'www.nevermind.com' if href == 'www.bye.com' else href]
Here, you replace all occurrences of www.example.com with www.google.com, www.whatever.com with www.anything.com, and www.bye.com with www.nevermind.com, while keeping all the other URLs. You can even pass those URLs as variables if you ever need to make things a little bit more dynamic.
If you want to remove all the other URLs that aren't one of those three (example, whatever and bye.com), you can just replace the final href with None at the end of the code above.
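A minimal sketch of the "pass those URLs as variables" idea, using a plain dict as the mapping; the URL pairs are just the examples above, and options is the same pdf_redactor options object from the question:
replacements = {
    'www.example.com': 'www.google.com',
    'www.whatever.com': 'www.anything.com',
    'www.bye.com': 'www.nevermind.com',
}

# Look each href up in the mapping and fall back to the original href;
# use replacements.get(href) instead to drop every unmatched link.
options.link_filters = [
    lambda href, annotation: replacements.get(href, href)
]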
Well, I think we both agree that the pdf_redactor guy should spend a little more time working on documentation. :)

How do I get a list of URLs using lxml's start-with() if they are relative links?

I'm looking at making a list of URLs that contain "page.php". Do I parse all the links and then loop through them, or is there a better way?
The URLs look like this:
<a href="../path/page.php?something=somewhere&yes=no">
And I tried this:
resumes = doc.xpath('//a[starts-with(@href, "../path/page.php")]/text()')
Is this correct or should I be using the absolute URL with starts-with()?
I'd do it this way, provided you want all links containing page.php:
links = doc.findall('.//a')  # Finds all links
resume = [res for res in links if 'page.php' in (res.get('href') or '')]
First, I get all the links on the page, then I make a list of all the links with page.php in their href.
This is untested (I don't have all your code so I can't test it as quick as usual) but should still work.
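For reference, a small sketch of the XPath approach from the question, corrected to use @href (not #href); with contains(), relative links are fine and no absolute URL is needed. Here html_string stands in for however you load the page source:
from lxml import html

doc = html.fromstring(html_string)  # html_string: the page source

# Every href that contains "page.php"; starts-with(@href, "../path/page.php")
# would also work for the relative links shown in the question.
hrefs = doc.xpath('//a[contains(@href, "page.php")]/@href')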

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want to make a simple solution that gets the links that I need, and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the URLs seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the URLs have a common ID or something like that I could use to target the correct ones each time.
I figure it would be possible using RegEx to go through the webpage and find all URLs that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re

for item in listofurls:
    l = re.findall("uge\d\d?", item, re.IGNORECASE)
    if l:
        print(item)  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)

# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("[Uu]ge\d+"))]

# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]

for url in list_of_urls:
    if 'uge' in url.lower():
        pass  # Code to execute
The regex expression would look something like: uge\d\d
