Python and BeautifulSoup || Regex with variable before writing to file - python

I would love some assistance with an issue I'm currently having.
I'm working on a little Python scanner as a project.
The libraries I'm currently importing are:
requests
BeautifulSoup
re
tld
The exact issue is regarding the 'scope' of the scanner.
I'd like to pass a URL to the code and have the scanner grab all the anchor tags from the page, but only the ones relevant to the base URL, ignoring out-of-scope links and also subdomains.
Here is my current code. I'm by no means a programmer, so please excuse the sloppy, inefficient code.
import requests
from bs4 import BeautifulSoup
import re
from tld import get_tld, get_fld

# This grabs the URL
print("Please type in a URL:")
URL = input()

# This strips out everything, leaving only the base domain (future scope function)
def strip_domain(URL):
    global domain_name
    domain_name = get_fld(URL)
strip_domain(URL)

# This makes the request and cleans up the source code
def connection(URL):
    r = requests.get(URL)
    status = r.status_code
    sourcecode = r.text
    soup = BeautifulSoup(sourcecode, features="html.parser")
    cleanupcode = soup.prettify()

    # This strips the anchor tags and adds them to the links list
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    # This writes our clean anchor tags to a file
    with open('source.txt', 'w') as f:
        for item in links:
            f.write("%s\n" % item)

connection(URL)
The exact code issue is around the "for link in soup.findAll" section.
I have been trying to filter the list for anchor tags that only contain the base domain, which is the global var "domain_name", so that it only writes the relevant links to the source.txt file.
google.com accepted
google.com/file accepted
maps.google.com not written
If someone could assist me or point me in the right direction I'd appreciate it.
I was also thinking it would be possible to write every link to the source.txt file and then filter it afterwards, removing the 'out of scope' links, but I thought it more beneficial to do it without having to create additional code.
Additionally, I'm not the strongest with regex, but here is something that may help.
This is some regex code to catch all variations of http, www, https
(^http:\/\/+|www.|https:\/\/)
To this I was going to append
.*{}'.format(domain_name)
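For what it's worth, here is a rough sketch (untested) of how that combined pattern could be applied inside connection(), assuming domain_name already holds the result of get_fld(URL) (e.g. "google.com"); scope_pattern and in_scope_links are just illustrative names. Note that anchoring the domain right after the optional www. (rather than using .*) is what keeps subdomains like maps.google.com out of scope:

scope_pattern = re.compile(r'^https?://(www\.)?{}(/|$)'.format(re.escape(domain_name)))
in_scope_links = [href for href in links if scope_pattern.match(href)]

with open('source.txt', 'w') as f:
    for item in in_scope_links:
        f.write("%s\n" % item)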

I provide two different situations, because I don't agree that the href value is always xxx.com. In practice you will get three or four or more kinds of href values, such as /file, folder/file, etc. So you have to transform relative paths into absolute paths, otherwise you cannot gather all of the URLs.
Regex: (\/{2}([w]+.)?)([a-z.]+)(?=\/?)
(\/{2}([w]+.)?) matches the non-main part, starting from //
([a-z.]+)(?=\/?) matches the specified characters until we reach a /; we ought not to use .* (it over-matches)
My Code
import re
_input = "http://www.google.com/blabla"
all_part = re.findall(r"(\/{2}([w]+.)?)([a-z.]+)(?=\/?)",_input)[0]
_partA = all_part[2] # google.com
_partB = "".join(all_part[1:]) # www.google.com
print(_partA,_partB)
site = [
    "google.com",
    "google.com/file",
    "maps.google.com"
]
href = [
    "https://www.google.com",
    "https://www.google.com/file",
    "http://maps.google.com"
]
for ele in site:
    if re.findall("^{}/?".format(_partA), ele):
        print(ele)
for ele in href:
    if re.findall("{}/?".format(_partB), ele):
        print(ele)
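If you would rather avoid hand-rolled regex altogether, a minimal alternative sketch using the standard library's urllib.parse (not part of the answer above; in_scope is just an illustrative helper name) could compare hostnames directly:

from urllib.parse import urlparse

def in_scope(link, domain_name):
    # Treat "domain.com" and "www.domain.com" as in scope; any other
    # subdomain (e.g. "maps.domain.com") counts as out of scope.
    host = urlparse(link).netloc.lower()
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host == domain_name

print([h for h in href if in_scope(h, "google.com")])
# ['https://www.google.com', 'https://www.google.com/file']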

Related

Beautiful Soup - Blank screen for a long time without any output

I am quite new to Python and am working on a scraping-based project, where I am supposed to extract all the contents from links containing a particular search term and place them in a CSV file. As a first step, I wrote this code to extract all the links from a website based on a search term entered. I only get a blank screen as output and I am unable to find my mistake.
import urllib
import mechanize
from bs4 import BeautifulSoup
import datetime

def searchAP(searchterm):
    newlinks = []
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    text = ""
    start = 0
    while "There were no matches for your search" not in text:
        url = "http://www.marketing-interactive.com/" + "?s=" + searchterm
        text = urllib.urlopen(url).read()
        soup = BeautifulSoup(text, "lxml")
        results = soup.findAll('a')
        for r in results:
            if "rel=bookmark" in r['href']:
                newlinks.append("http://www.marketing-interactive.com" + str(r["href"]))
        start += 10
    return newlinks

print searchAP("digital marketing")
You made four mistakes:
You are defining start but you never use it. (Nor can you, as far as I can see on http://www.marketing-interactive.com/?s=something; there is no URL-based pagination.) So you are endlessly looping over the first set of results.
"There were no matches for your search" is not the no-results string returned by that site. So it would go on forever anyway.
You are appending the link, which already includes http://www.marketing-interactive.com, to http://www.marketing-interactive.com. So you would end up with http://www.marketing-interactive.comhttp://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/
Concerning the rel=bookmark selection: arif's solution is the proper way to go. But if you really want to do it this way you'd need to do something like this:
for r in results:
    if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
        newlinks.append(r["href"])
This first checks whether rel exists and then checks whether its first item is "bookmark", since r['href'] simply does not contain the rel. That's not how BeautifulSoup structures things.
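As a quick illustration (not from the original answer), this is how BeautifulSoup exposes a multi-valued attribute such as rel, which is why r['href'] can never contain "rel=bookmark":

from bs4 import BeautifulSoup

tag = BeautifulSoup('<a rel="bookmark" href="/some-post/">x</a>', 'lxml').a
print(tag.attrs)    # {'rel': ['bookmark'], 'href': '/some-post/'} (order may vary)
print(tag['href'])  # /some-post/ -- rel lives in a separate, list-valued attribute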
To scrape this specific site you can do two things:
You could do something with Selenium or something else that supports Javascript and press that "Load more" button. But this is quite a hassle.
You can use this loophole: http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=1&postType=search&searchValue=digital+marketing
This is the url that feeds the list. It has pagination, so you can easily loop over all results.
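A rough sketch (not from the original answer) of looping over that URL could look like the following; the stop condition, an empty page of results, is an assumption about how the endpoint behaves:

import requests
from bs4 import BeautifulSoup

def collect_links(search_value):
    base = ("http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/"
            "loop_handler.php?pageNumber={}&postType=search&searchValue={}")
    links = []
    page = 1
    while True:
        soup = BeautifulSoup(requests.get(base.format(page, search_value)).text, "html.parser")
        found = [a["href"] for a in soup.find_all("a", attrs={"rel": "bookmark"})]
        if not found:   # assumed: an exhausted page returns no bookmark links
            break
        links.extend(found)
        page += 1
    return links

print(collect_links("digital+marketing"))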
The following script extracts all the links from the web page for a given search key, but it does not explore beyond the first page. However, the code can easily be modified to get all results from multiple pages by manipulating the page number in the URL (as described by Rutger de Knijf in the other answer).
from pprint import pprint
import requests
from BeautifulSoup import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content)
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
Usage:
pprint(get_url_for_search_key('digital marketing'))
Output:
[u'http://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/',
u'http://www.marketing-interactive.com/singapore-polytechnic-on-the-hunt-for-digital-marketing-agency/',
u'http://www.marketing-interactive.com/how-to-get-your-bosses-on-board-your-digital-marketing-plan/',
u'http://www.marketing-interactive.com/digital-marketing-institute-launches-brand-refresh/',
u'http://www.marketing-interactive.com/entropia-highlights-the-7-original-sins-of-digital-marketing/',
u'http://www.marketing-interactive.com/features/futurist-right-mindset-digital-marketing/',
u'http://www.marketing-interactive.com/lenovo-brings-board-new-digital-marketing-head/',
u'http://www.marketing-interactive.com/video/discussing-digital-marketing-indonesia-video/',
u'http://www.marketing-interactive.com/ubs-melvin-kwek-joins-credit-suisse-as-apac-digital-marketing-lead/',
u'http://www.marketing-interactive.com/linkedins-top-10-digital-marketing-predictions-2017/']
Hope this is what you wanted as the first step for your project.

Retrieving a subset of href's from findall() in BeautifulSoup

My goal is to write a Python script that takes an artist's name as a string input and appends it to the base URL for the Genius search query, then retrieves all the lyrics from the returned web page's links (the required subset for this problem, where every link in that subset also contains the artist's name). I am in the initial phase right now and have only been able to retrieve all links from the web page, including the ones that I don't want in my subset. I tried to find a simple solution but failed continuously.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input
header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
for link in soup.find_all('a', href=True):
    print(link['href'])
This returns the complete list below, while I only need the ones that end with "lyrics" and contain the artist's name (here, for instance, Drake). These will be the links from which I should be able to retrieve the lyrics.
https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands#genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated
My next step would be to use Selenium to emulate scrolling, which in the case of genius.com gives the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments about the way I wish to proceed with this solution. Can we make it more generic?
P.S. I may not have explained my problem very lucidly, but I have tried my best. Also, questions about any ambiguities are welcome. I am new to scraping, Python, and programming as well, so I just wanted to make sure that I am following the right path.
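Before reaching for Selenium, it may be worth noting that the /search?page=N&q=drake links in the output above suggest the results are reachable through plain pagination. A minimal sketch of that idea (search_page_links is a hypothetical helper, and the assumption that later pages are served without JavaScript is unverified):

import requests
from bs4 import BeautifulSoup

def search_page_links(artist, pages=3):
    all_links = []
    for page in range(1, pages + 1):
        url = "https://genius.com/search?page={}&q={}".format(page, artist.replace(" ", "+"))
        response = requests.get(url, headers={'User-Agent': ''})
        soup = BeautifulSoup(response.content, "lxml")
        all_links += [a['href'] for a in soup.find_all('a', href=True)]
    return all_links

print(search_page_links("Drake"))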
Use the re module to match only the links you want.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
import re

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input
header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")

pattern = re.compile(r"[\S]+-lyrics$")
for link in soup.find_all('a', href=True):
    if pattern.match(link['href']):
        print(link['href'])
Output:
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
This just checks whether your link matches the pattern ending in -lyrics. You may use similar logic to filter using the user_input variable as well, for example as sketched below.
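A sketch of that "similar logic", reusing re, user_input, and soup from the code above; the assumption that Genius lyrics URLs start with https://genius.com/<artist>- is not verified here:

artist = user_input.replace("+", "-")
artist_pattern = re.compile(r"https://genius\.com/{}[\S]*-lyrics$".format(re.escape(artist)), re.IGNORECASE)

for link in soup.find_all('a', href=True):
    if artist_pattern.match(link['href']):
        print(link['href'])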
Hope this helps.

Python HTML parsing: getting site top level hosts

I have a program that takes in a site's source code/HTML and outputs the a href tags - it is extremely helpful and makes use of BeautifulSoup4.
I want a variation of this code that only looks at <a href="..."> tags but returns just the top-level host names from a site's source code, for example
stackoverflow.com
google.com
etc., but NOT lower-level ones like stackoverflow.com/questions/ etc. Right now it's outputting everything, including /, #t8, etc., and I need to filter those out.
Here is the current code I use to extract all the a href tags.
import sys
import urllib
from bs4 import BeautifulSoup

url = sys.argv[1]  # when the program is invoked, it takes the site in like www.google.com etc.
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# get hosts
for a in soup.find_all('a', href=True):
    print a['href']
Thank you!
It sounds like you're looking for the .netloc attribute of urlparse. It's part of the Python standard library: https://docs.python.org/2/library/urlparse.html
For example:
>>> from urlparse import urlparse
>>> url = "http://stackoverflow.com/questions/26351727/python-html-parsing-getting-site-top-level-hosts"
>>> urlparse(url).netloc
'stackoverflow.com'
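Applying that to the loop from the question, a small untested sketch that collects unique hosts and skips relative links could look like this (hosts is just an illustrative name):

from urlparse import urlparse

hosts = set()
for a in soup.find_all('a', href=True):
    host = urlparse(a['href']).netloc
    if host:  # empty for relative links such as "/questions/" or "#t8"
        hosts.add(host)

for host in sorted(hosts):
    print host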

Python Crawler is ignoring links on page

So I wrote a crawler for my friend that will go through a large list of web pages that are search results, pull all the links off each page, check if they're in the output file, and add them if they're not there. It took a lot of debugging but it works great! Unfortunately, the little bugger is really picky about which anchor tags it deems important enough to add.
Here's the code:
#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin  # urljoin is a function that's included in urlparse
import urllib2
import requests  # not necessary, but keeping it here in case of additions to the code in future

urls_filename = "myurls.txt"    # this is the input text file, a list of urls or objects to scan
output_filename = "output.txt"  # this is the output file that you will export to Excel
keyword = "skin"                # optional keyword, not used for this script. Ignore

with open(urls_filename, "r") as f:
    url_list = f.read()  # this opens the input text file and reads the information inside it

with open(output_filename, "w") as f:
    for url in url_list.split("\n"):  # this splits the text file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'}  # this (attempts) to tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url)  # tells the program to open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read()  # this assigns a variable to the opened page; like algebra, X = page opened
        soup = BeautifulSoup(page)  # we are feeding the variable to BeautifulSoup so it can analyze it
        urls_all = soup('a')  # BeautifulSoup gathers all the anchor links in the page
        for link in urls_all:
            if 'href' in dict(link.attrs):
                url = urljoin(url, link['href'])  # this combines the relative link, e.g. "/support/contactus.html", with the domain
                if url.find("'") != -1: continue  # skip hrefs that contain a quote character
                url = url.split('#')[0]
                if url[0:4] == 'http' and url not in output_filename:  # this checks if the item is a webpage and if it's already in the list
                    f.write(url + "\n")  # if it's not in the list, it writes it to the output_filename
It works great except for the following link:
https://research.bidmc.harvard.edu/TVO/tvotech.asp
This page has a number of links like "tvotech.asp?Submit=List&ID=796" and the script is straight up ignoring them. The only anchor that goes into my output file is the main page itself. It's bizarre, because looking at the source code their anchors are pretty standard.
They have 'a' and 'href'; I see no reason bs4 would just pass over them and only include the main link. Please help. I've tried removing http from line 30 or changing it to https and that just removes all the results; not even the main page comes into the output.
That's because one of the links there has a mailto: in its href. It then gets assigned to the url variable and breaks the rest of the links as well, because they no longer pass the url[0:4] == 'http' condition. It looks like this:
mailto:research#bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
You should either filter it out or not reuse the same url variable in the loop; note the change to url1:
for link in urls_all:
    if 'href' in dict(link.attrs):
        url1 = urljoin(url, link['href'])  # this combines the relative link, e.g. "/support/contactus.html", with the domain
        if url1.find("'") != -1: continue  # skip hrefs that contain a quote character
        url1 = url1.split('#')[0]
        if url1[0:4] == 'http' and url1 not in output_filename:  # this checks if the item is a webpage and if it's already in the list
            f.write(url1 + "\n")  # if it's not in the list, it writes it to the output_filename
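The first option, filtering the mailto link out, might look something like this untested sketch, reusing the names from the loop above; the seen set is an added illustrative detail that replaces the url not in output_filename check (which, as written, only tests against the file name rather than the file contents):

seen = set()
for link in urls_all:
    href = link.attrs.get('href', '')
    if not href or href.startswith('mailto:'):  # drop the mailto: anchor so it can't clobber the loop
        continue
    absolute = urljoin(url, href).split('#')[0]
    if absolute.startswith('http') and absolute not in seen:
        seen.add(absolute)
        f.write(absolute + "\n")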

Python href and save to .txt (no worries, not another regex question)

I am currently working on a Python script that lets a user input a torrent's hash (via the terminal) and checks for more trackers via a website. However, I am at a loss and was hoping to receive some advice, since I'm new to Python programming. I'm running into trouble because the result in html_page has another link to go to. So, my program fetches html_page from "http://torrentz.eu/*******, but now I find myself trying to get it to follow another link on that page to arrive at http://torrentz.eu/announcelist_* ... that being said, I have found it can be retrieved (as it would appear from viewing the source):
µTorrent compatible list here
or possibly retrieved from here, since the values are the same as they appear in /announcelist_**:
<a name="post-comment"></a>
<input type="hidden" name="torrent" value="******" />
Since /announcelist_** appears in text format, I was also wondering how I might save the resulting tracker list to a .txt file. That being said, this is my progress so far on the Python script.
from BeautifulSoup import BeautifulSoup
import urllib2
import re

var = raw_input("Enter hash:")
html_page = urllib2.urlopen("http://torrentz.eu/" + var)
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
I'd also like to thank all of y'all in advance for your support, knowledge, advice, and skills.
Edit: I've altered the code to appear as follows:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

hsh = raw_input("Enter Hash:")
html_data = urllib2.urlopen("http://torrentz.eu/" + hsh).read()
soup = BeautifulSoup(html_data)
announce = soup.find('a', attrs={'href': re.compile("^/announcelist")})
print announce
Which results in:
<a href="/announcelist_00000">µTorrent compatible list here</a>
So, now I'm just looking for a way to get the /announcelist_00000 portion of output only.
Once you have opened the URL, you are able to find the href, as you point out. Now open that href using urlopen. When you encounter the file that you want to copy over, open it like so:
remote_file = open(filepath)
local_file = open(path_to_local_file, 'w')
local_file.write(remote_file.read())
local_file.close()
remote_file.close()
Here's how you should probably go about doing this:
# insert code that you've already written
for link in soup.findAll('a'):
    print link.get('href')
    remote_file = open(link.get('href'))
    local_file = open(path_to_local_file, 'w')
    local_file.write(remote_file.read())
    local_file.close()
    remote_file.close()
I haven't tested this code, but I think it should work.
Hope this helps
If what you are looking for is the value of the href attribute, then
see what you get if you add the line:
print announce['href']
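If the goal is then to save the tracker list itself, here is a small untested sketch that follows that href and writes the response to a text file, reusing hsh and announce from the code above; the output file name is just an example:

import urllib2
from urlparse import urljoin

announce_url = urljoin("http://torrentz.eu/", announce['href'])
tracker_list = urllib2.urlopen(announce_url).read()

with open(hsh + "_trackers.txt", "w") as out_file:
    out_file.write(tracker_list)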
