Printing the output from a loop under two different titles - python

I want to format an output so that it is listed under a title.
I made a Python (v3.6) script which checks URLs (contained in a text file) and outputs whether each one is safe or malicious.
Loop statement:
"""
This iterates through each url in the textfile and checks it against google's database.
"""
f = open("file.txt", "r")
for weburl in f:
    sb = threat_matches_find(weburl)  # Google API module
    if sb == {}:  # an empty response '{}' means the URL is safe
        print("Safe :", weburl)
    else:
        print("Malicious :", weburl)
The resulting output is:
>>>python url_checker.py
Safe : url1.com
Malicious : url2.com
Safe : url3.com
Malicious : url4.com
Safe : url5.com
The objective is to get the url to be listed/sorted under a Title (group), as follows:
If the url is safe, print the url under 'Safe URL', else 'Malicious'.
>>> python url_checker.py
Safe URLs:
url1.com
url3.com
url5.com
Malicious URLs:
url2.com
url4.com
I was unsuccessful in finding other posts related to my problem. Any help would be much appreciated.

You could append to lists as you loop, then print once both lists are populated:
safe = []
malicious = []
for weburl in f:
    sb = threat_matches_find(weburl)  # Google API module
    if sb == {}:  # an empty response '{}' means the URL is safe
        safe.append(weburl)
    else:
        malicious.append(weburl)
print('Safe URLs', *safe, '', sep='\n')
print('Malicious URLs', *malicious, '', sep='\n')
Sample input:
safe = ['url1.com', 'url3.com', 'url5.com']
malicious = ['url2.com', 'url4.com']
Output:
Safe URLs
url1.com
url3.com
url5.com
Malicious URLs
url2.com
url4.com
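If more groups might be needed later, the same idea scales with a dictionary of lists. A minimal sketch, where `check` is a hypothetical stand-in for the `threat_matches_find` call (an empty dict meaning safe, as above):

```python
def group_urls(urls, check):
    """Sort urls into 'Safe' and 'Malicious' buckets using check(url)."""
    groups = {"Safe": [], "Malicious": []}
    for url in urls:
        # an empty response means the URL is safe, mirroring the API above
        key = "Safe" if check(url) == {} else "Malicious"
        groups[key].append(url.strip())
    return groups

# toy stand-in: flags url2.com and url4.com as malicious
check = lambda u: {} if u not in ("url2.com", "url4.com") else {"threat": "MALWARE"}
groups = group_urls(["url1.com", "url2.com", "url3.com", "url4.com", "url5.com"], check)
for title, members in groups.items():
    print(title + " URLs:")
    for member in members:
        print(member)
```

Adding a third category later only needs one more key in `groups`, instead of another parallel list.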

Related

How to get all emails from a page individually

I am trying to get all emails from a specific page and separate them into an individual variable or even better a dictionary. This is some code.
import requests
import re
import json
from bs4 import BeautifulSoup

page = "http://www.example.net"
info = requests.get(page)
if info.status_code == 200:
    print("Page accessed")
else:
    print("Error accessing page")
code = info.content
soup = BeautifulSoup(code, 'lxml')
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
print(allEmails)
sep = ","
allEmailsStr = str(allEmails)
print(type(allEmails))
print(type(allEmailsStr))
j = allEmailsStr.split(sep, 1)[0]
print(j)
Excuse the poor variable names; I put this together quickly so it would run on its own. The output from the example website would be something like
[k, kolyma, location, balkans]
So if I ran the program it would return only
[k
But if I wanted it to return every email on there individually how would I do that?
To get just the email str you can try:
emails = []
for email_link in allEmails:
    emails.append(email_link.get("href").replace('mailto:', ''))
print(emails)
Based on your expected output, you can use the unwrap function of BeautifulSoup
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
for Email in allEmails:
    print(Email.unwrap())  # this will print the whole element along with its tag
# k
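If pulling in BeautifulSoup is not an option, the mailto targets can also be picked out of raw HTML with a regular expression. A minimal stdlib-only sketch (the HTML snippet is made up for illustration):

```python
import re

html = '<a href="mailto:k@example.net">k</a> <a href="mailto:kolyma@example.net">kolyma</a>'

# capture whatever follows "mailto:" inside the href attribute,
# stopping at the closing quote or a query string like ?subject=...
emails = re.findall(r'href="mailto:([^"?]+)"', html)
print(emails)  # -> ['k@example.net', 'kolyma@example.net']
```

This is brittle against unusual markup, so the BeautifulSoup version above is the safer default.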

Python - print 2nd argument

I am new to Python and I've written this test-code for practicing purposes, in order to find and print email addresses from various web pages:
def FindEmails(*urls):
    for i in urls:
        totalemails = []
        req = urllib2.Request(i)
        aResp = urllib2.urlopen(req)
        webpage = aResp.read()
        patt1 = '(\w+[-\w]\w+@\w+[.]\w+[.\w+]\w+)'
        patt2 = '(\w+[\w]\w+@\w+[.]\w+)'
        regexlist = [patt1, patt2]
        for regex in regexlist:
            match = re.search(regex, webpage)
            if match:
                totalemails.append(match.group())
                break
    #return totalemails
    print "Mails from webpages are: %s " % totalemails

if __name__ == "__main__":
    FindEmails('https://www.urltest1.com', 'https://www.urltest2.com')
When I run it, it prints only one argument.
My goal is to print the emails acquired from webpages and store them in a list, separated by commas.
Thanks in advance.
The problem here is the line totalemails = []. You are re-instantiating the variable totalemails with zero entries on every pass, so in each iteration it holds at most one entry, and after the last iteration you end up with just the last entry in the list. To get a list of all emails, you need to move the initialisation outside of the for loop.
Example:
def FindEmails(*urls):
    totalemails = []
    for i in urls:
        req = urllib2.Request(i)
        ....
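The difference is easy to reproduce without any network calls. A small sketch, where `find_email` is a hypothetical stand-in for the urllib2/regex work:

```python
def find_emails_wrong(pages, find_email):
    for page in pages:
        totalemails = []              # re-created on every iteration
        totalemails.append(find_email(page))
    return totalemails                # only the last page's result survives

def find_emails_right(pages, find_email):
    totalemails = []                  # created once, before the loop
    for page in pages:
        totalemails.append(find_email(page))
    return totalemails

stub = lambda page: page + "@example.com"
print(find_emails_wrong(["a", "b"], stub))  # -> ['b@example.com']
print(find_emails_right(["a", "b"], stub))  # -> ['a@example.com', 'b@example.com']
```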

Issue with web crawler: IndexError: string index out of range

I am making a web crawler. I'm not using Scrapy or anything; I'm trying to have my script do most things itself. I have searched for the issue, but I can't seem to find anything that helps with the error. I've tried switching around some of the variables to narrow down the problem. I am getting an error on line 24 saying IndexError: string index out of range. The functions run on the first url (the original url), then the second, and fail on the third in the original array. I'm lost; any help would be appreciated greatly! Note: I'm only printing all of them for testing; I'll eventually have them printed to a text file.
import requests
from bs4 import BeautifulSoup

# creating requests from user input
url = raw_input("Please enter a domain to crawl, without the 'http://www' part : ")

def makeRequest(url):
    r = requests.get('http://' + url)
    # Adding in BS4 for finding 'a' tags in the HTML
    soup = BeautifulSoup(r.content, 'html.parser')
    # Collects the links found in the href attributes
    output = soup.find_all('a')
    return output

def makeFilter(link):
    # Creating array for our links
    found_link = []
    for a in link:
        a = a.get('href')
        a_string = str(a)
        # if statements to filter our links
        if a_string[0] == '/':  # this is the line with the error
            # Relative links
            found_link.append(a_string)
        if 'http://' + url in a_string:
            # Links from the same site
            found_link.append(a_string)
        if 'https://' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)
        if 'http://www.' + url in a_string:
            # Links from the same site
            found_link.append(a_string)
        if 'https://www.' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)
        #else:
        #    found_link.write(a_string + '\n')  # testing only
    output = found_link
    return output

# Function for removing duplicates
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

# Run the functions in this order -> Makes the request -> Filters the links -> Removes duplicates
def createURLList(values):
    requests = makeRequest(values)  # note: this local name shadows the requests module
    new_list = makeFilter(requests)
    filtered_list = remove_duplicates(new_list)
    return filtered_list

result = createURLList(url)
# print result

# for verifying and crawling resulting pages
for b in result:
    sub_directories = createURLList(url + b)
    crawler = []
    crawler.append(sub_directories)
    print crawler
After a_string = str(a) try adding:
if not a_string:
    continue
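For background, the error comes from `a` tags whose `href` is missing or empty: `a.get('href')` then yields `None` or `''`, and indexing an empty string with `a_string[0]` raises IndexError. Using `str.startswith` sidesteps the indexing entirely; a small sketch with made-up links:

```python
links = ['/about', '', None, 'http://example.com/']

found = []
for a in links:
    a_string = str(a)  # str(None) == 'None', str('') == ''
    # ''[0] would raise IndexError, but startswith is safe on empty strings
    if a_string.startswith('/'):
        found.append(a_string)
print(found)  # -> ['/about']
```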

Scrapy, how to limit time per domain?

I have been searching for an answer, and although several questions have been asked on this forum, there is no answer to this one. One answer is that it is possible to stop a spider after a certain time, but that is not suitable for me because I usually launch 10 websites per spider. So my challenge is this: I have a spider for 10 websites, and I would like to limit the time to 20 seconds per domain in order to avoid getting stuck at some webshop. How do I do that?
In general I can also tell you that I crawl 2000 company websites, and in order to do it in one day I divide these websites into 200 groups of 10 websites and launch 200 spiders in parallel. That may be amateurish, but it is the best that I know. The computer almost freezes because the spiders consume all the CPU and memory, but the next day I have the results. What I am looking for are employment webpages on companies' websites. Does anyone have a better idea how to crawl 2000 websites? In case there is a webshop among the websites, the crawling can take days, which is why I would like to limit the time per domain.
Thank you in advance.
Marko
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem

# We need this in order to force Slovenian pages instead of English ones. It happened at
# "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian ones.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {
    'Accept': 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'sl',
}
#Settings.set(name, value, priority='cmdline')
#start_time = time.time()

# We run the programme from the command line with:
#   scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files: 1) urls.csv, 2) log.txt

# Url whitelist.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/url_whitelist.txt", "r+") as kw:
    url_whitelist = kw.read().replace('\n', '').split(",")
url_whitelist = map(str.strip, url_whitelist)

# Tab whitelist.
# We need to replace characters the same way as in the detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/tab_whitelist.txt", "r+") as kw:
    tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
tab_whitelist = tab_whitelist.replace('L', 'č')
tab_whitelist = tab_whitelist.replace('Ő', 'š')
tab_whitelist = tab_whitelist.replace('Ü', 'š')
tab_whitelist = tab_whitelist.replace('Ä', 'ž')
tab_whitelist = tab_whitelist.replace('×', 'ž')
tab_whitelist = tab_whitelist.replace('\n', '').split(",")
tab_whitelist = map(str.strip, tab_whitelist)

# Look for occupations in the url.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_url.txt", "r+") as occ_url:
    occupations_url = occ_url.read().replace('\n', '').split(",")
occupations_url = map(str.strip, occupations_url)

# Look for occupations in the tab.
# We need to replace characters the same way as in the detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_tab.txt", "r+") as occ_tab:
    occupations_tab = occ_tab.read().decode(sys.stdin.encoding).encode('utf-8')
occupations_tab = occupations_tab.replace('Ŕ', 'č')
occupations_tab = occupations_tab.replace('L', 'č')
occupations_tab = occupations_tab.replace('Ő', 'š')
occupations_tab = occupations_tab.replace('Ü', 'š')
occupations_tab = occupations_tab.replace('Ä', 'ž')
occupations_tab = occupations_tab.replace('×', 'ž')
occupations_tab = occupations_tab.replace('\n', '').split(",")
occupations_tab = map(str.strip, occupations_tab)

# Join url whitelist and occupations.
url_whitelist_occupations = url_whitelist + occupations_url
# Join tab whitelist and occupations.
tab_whitelist_occupations = tab_whitelist + occupations_tab

#base = open("G:/myVE/vacancies/bazni.txt", "w")
#non_base = open("G:/myVE/vacancies/ne_bazni.txt", "w")

class JobSpider(scrapy.Spider):
    # Name of the spider.
    name = "jobs"
    #start_urls = open("Q:\Big_Data\Utrip\spletne_strani.txt", "r+").readlines()[0]
    #start_urls = map(str.strip, urls)
    # Start urls.
    start_urls = ["http://www.alius.si"]
    print "\nSpletna stran ", start_urls, "\n"
    # The result of the programme is this list of job vacancy webpages.
    jobs_urls = []

    def parse(self, response):
        # Theoretically I could save the HTML of the webpage to be able to check later how it
        # looked at the time of downloading. That is important for validation, because it is
        # easier to look at a nice HTML webpage instead of naked text, but I would have to write
        # a pipeline: http://doc.scrapy.org/en/0.20/topics/item-pipeline.html
        response.selector.remove_namespaces()
        # Take the url of the response, because we would like to stay on the same domain.
        parsed = urlparse(response.url)
        # Base url.
        #base_url = get_base_url(response).strip()
        base_url = parsed.scheme + '://' + parsed.netloc
        # If the url grows from the seeds, it's ok, otherwise not.
        if base_url in self.start_urls:
            #base.write(response.url + "\n")
            # Take all urls, marked by "href" or "data-link". These are either webpages on our
            # website or new websites.
            urls_href = response.xpath('//@href').extract()
            urls_datalink = response.xpath('//@data-link').extract()
            urls = urls_href + urls_datalink
            # Loop through all urls on the webpage.
            for url in urls:
                # If the url doesn't start with "http", it is a relative url, and we add the
                # base url to get the absolute url.
                if not url.startswith("http"):
                    # Join the partial url with the base url.
                    url = urljoin(base_url, url).strip()
                new_parsed = urlparse(url)
                new_base_url = new_parsed.scheme + '://' + new_parsed.netloc
                if new_base_url in self.start_urls:
                    url = url.replace("\r", "")
                    url = url.replace("\n", "")
                    url = url.replace("\t", "")
                    url = url.strip()
                    # Remove anchors '#' that point to a section on the same webpage, because
                    # that is the same webpage. But we keep question marks '?', which mean that
                    # different content is pulled from the database.
                    if '#' in url:
                        index = url.find('#')
                        url = url[:index]
                    if url in self.jobs_urls:
                        continue
                    # Ignore ftp and sftp.
                    if url.startswith("ftp") or url.startswith("sftp"):
                        continue
                    # If the last character is a slash '/', remove it to avoid duplicates.
                    if url[len(url) - 1] == '/':
                        url = url[:(len(url) - 1)]
                    # If the url includes characters like % or ~, it is LIKELY NOT the one I am
                    # looking for, so I ignore it. However, this also excludes good urls like
                    # http://www.mdm.si/company#employment
                    if any(x in url for x in ['%', '~',
                            # images
                            '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                            '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                            # documents
                            '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                            '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                            # music and video
                            '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                            '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                            # archives and other
                            '.zip', '.rar', '.css', '.flv', '.xml',
                            '.ZIP', '.RAR', '.CSS', '.FLV', '.XML',
                            # Twitter, Facebook, Youtube
                            '://twitter.com', '://mobile.twitter.com', 'www.twitter.com',
                            'www.facebook.com', 'www.youtube.com',
                            # feeds, RSS, archive
                            '/feed', '=feed', '&feed', 'rss.xml', 'arhiv']):
                        continue
                    # The main part. We look for webpages whose urls include one of the
                    # employment words. We check the tab (link text) of the url as well; this is
                    # an additional filter, suggested by Dan Wu, to improve accuracy.
                    tabs = response.xpath('//a[@href="%s"]/text()' % url).extract()
                    # Sometimes tabs can be just whitespace like '\t' and '\n', in which case we
                    # replace them with []. That was the case when the spider didn't find this
                    # employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                    tabs = [tab.encode('utf-8') for tab in tabs]
                    tabs = [tab.replace('\t', '') for tab in tabs]
                    tabs = [tab.replace('\n', '') for tab in tabs]
                    tab_empty = True
                    for tab in tabs:
                        if tab != '':
                            tab_empty = False
                    if tab_empty == True:
                        tabs = []
                    # -- Instruction: users in other languages, please insert employment words in
                    # your own language, like jobs, vacancies, career, employment ...
                    # keyword_url starts empty; we add keywords as we find them in the url.
                    keyword_url = ''
                    for keyword in url_whitelist_occupations:
                        if keyword in url:
                            keyword_url = keyword_url + keyword + ' '
                    # a) If we find at least one keyword in the url, we continue.
                    if keyword_url != '':
                        # 1. Tabs are empty.
                        if tabs == []:
                            # We found a url that includes one of the magic words. We check
                            # whether we have seen it before; if it is new, we add it to the
                            # list "jobs_urls".
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                # "Zaposlitvena podstran" means "employment subpage".
                                print "Zaposlitvena podstran ", url
                                # We return the item.
                                yield item
                        # 2. There are one or more texts in the tabs.
                        else:
                            # For the same partial url, several texts are possible.
                            for tab in tabs:
                                # We search for keywords in the tabs.
                                keyword_url_tab = ''
                                for key in tab_whitelist_occupations:
                                    if key in tab:
                                        keyword_url_tab = keyword_url_tab + key + ' '
                                # If we find keywords in the tabs, then we have found keywords in
                                # both the url and the tab, and we can save the url.
                                if keyword_url_tab != '':
                                    # keyword_url_tab starts with keyword_url from before,
                                    # because we want to remember keywords from both url and tab.
                                    keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                    if url not in self.jobs_urls:
                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        print "Zaposlitvena podstran ", url
                                        yield item
                                # We haven't found any keywords in the tabs, but the url is still
                                # good because it contains some keywords, so we save it.
                                else:
                                    if url not in self.jobs_urls:
                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        print "Zaposlitvena podstran ", url
                                        yield item
                    # b) If keyword_url is empty, there are no keywords in the url, but perhaps
                    # there are keywords in the tabs, so we check the tabs.
                    else:
                        for tab in tabs:
                            keyword_tab = ''
                            for key in tab_whitelist_occupations:
                                if key in tab:
                                    keyword_tab = keyword_tab + key + ' '
                            if keyword_tab != '':
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    print "Zaposlitvena podstran ", url
                                    yield item
                    # We keep exploring, because an employment webpage may link to further
                    # employment webpages, until we reach the DEPTH set in settings.py.
                    yield Request(url, callback=self.parse)
        #else:
        #    non_base.write(response.url + "\n")
Just use scrapyd to schedule 2000 single-website crawls. Set max_proc = 10 [1] to run 10 spiders in parallel. Set the spider's CLOSESPIDER_TIMEOUT [2] to 20 to run every spider for 20 seconds. Stop using Windows natively because it's a pain. I've observed Scrapy and scrapyd run faster inside a VM than natively on Windows. I might be wrong, so try it yourself to cross-check, but I have a strong feeling that if you use an Ubuntu 14.04 VirtualBox image on Windows, it will be faster. Your crawl will take roughly 2000 * 20 / 10 = 4000 seconds, i.e. about 67 minutes.
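A sketch of the two knobs mentioned above; the values are illustrative, and [1]/[2] refer to the scrapyd and Scrapy documentation respectively:

```python
# In scrapyd.conf (INI format), cap the number of parallel spider processes:
#
#   [scrapyd]
#   max_proc = 10
#
# In the project's settings.py (or a spider's custom_settings dict), stop each
# spider after 20 seconds; CLOSESPIDER_TIMEOUT is a standard Scrapy setting
# handled by the CloseSpider extension.
CLOSESPIDER_TIMEOUT = 20
```

Note that CLOSESPIDER_TIMEOUT closes the spider gracefully, so requests already in flight may still finish after the 20 seconds elapse.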

Very fast webpage scraping (Python)

So I'm trying to filter through a list of urls (potentially in the hundreds) and filter out every article whose body is less than X number of words (ARTICLE_LENGTH). But when I run my application it takes an unreasonable amount of time, so much so that my hosting service times out. I'm currently using Goose (https://github.com/grangier/python-goose) with the following filter function:
def is_news_and_data(url):
    """A function that returns a list of the form
    [True, title, meta_description]
    or
    [False]
    """
    result = []
    if url == None:
        return False
    try:
        article = g.extract(url=url)
        if len(article.cleaned_text.split()) < ARTICLE_LENGTH:
            result.append(False)
        else:
            title = article.title
            meta_description = article.meta_description
            result.extend([True, title, meta_description])
    except:
        result.append(False)
    return result
It is used in the context of the following. Don't mind the debug prints and messiness (tweepy is my Twitter API wrapper):
def get_links(auth):
    """Returns a list of t.co links from a list of given tweets"""
    api = tweepy.API(auth)
    page_list = []
    tweets_list = []
    links_list = []
    news_list = []
    regex = re.compile('http://t.co/.[a-zA-Z0-9]*')
    for page in tweepy.Cursor(api.home_timeline, count=20).pages(1):
        page_list.append(page)
    for page in page_list:
        for status in page:
            tweet = status.text.encode('utf-8', 'ignore')
            tweets_list.append(tweet)
    for tweet in tweets_list:
        links = regex.findall(tweet)
        links_list.extend(links)
    #print 'The length of the links list is: ' + str(len(links_list))
    for link in links_list:
        news_and_data = is_news_and_data(link)
        if True in news_and_data:
            news_and_data.append(link)
            #[True, title, meta_description, link]
            news_list.append(news_and_data[1:])
    print 'The length of the news list is: ' + str(len(news_list))
Can anyone recommend a perhaps faster method?
This code is probably causing your slow performance:
len(article.cleaned_text.split())
This is performing a lot of work, most of which is discarded. I would profile your code to see whether this is the culprit; if so, replace it with something that just counts spaces, like so:
article.cleaned_text.count(' ')
That won't give you exactly the same result as your original code, but will be very close. To get closer you could use a regular expression to count words, but it won't be quite as fast.
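For a concrete feel of the trade-off, a small sketch comparing the three counting approaches on a made-up headline (`split()` and a `\S+` regex agree, while `count(' ')` over-counts runs of spaces):

```python
import re

text = "Breaking  news: something   happened today"

exact = len(text.split())               # splits on runs of whitespace -> 5
approx = text.count(' ')                # counts every single space -> 7
regex = len(re.findall(r'\S+', text))   # same result as split() here -> 5

print(exact, approx, regex)  # -> 5 7 5
```

For the "is this article longer than X words" test, the over-count only matters for articles whose length is close to the threshold.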
I'm not saying this is the absolute best you can do, but it will be faster. You'll have to redo some of your code to fit this new function.
It will at least give you fewer function calls.
You'll have to pass the whole url list.
def is_news_in_data(listings):
    new_listings = {}
    is_news = {}
    for url in listings:
        is_news[url] = 0
        article = g.extract(url=url).cleaned_text
        tmp_listing = ''
        # count characters while building up the text; note this counts
        # characters rather than words, so scale ARTICLE_LENGTH accordingly
        for s in article:
            is_news[url] += 1
            tmp_listing += s
        if is_news[url] > ARTICLE_LENGTH:
            new_listings[url] = tmp_listing
            del is_news[url]
    return new_listings
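Since most of the time is spent waiting on the network, fetching the urls concurrently usually helps more than trimming the counting work. A sketch with `concurrent.futures`, where `fetch_and_check` is a hypothetical stand-in for the Goose extraction plus the length test:

```python
from concurrent.futures import ThreadPoolExecutor

ARTICLE_LENGTH = 100

def fetch_and_check(url):
    # stand-in for g.extract(url=url) plus the word-count test;
    # a real version would fetch the page and count words in cleaned_text
    fake_word_counts = {"a.com": 500, "b.com": 20, "c.com": 800}
    return url, fake_word_counts.get(url, 0) >= ARTICLE_LENGTH

urls = ["a.com", "b.com", "c.com"]
with ThreadPoolExecutor(max_workers=10) as pool:
    # map preserves input order while running the fetches in parallel
    results = list(pool.map(fetch_and_check, urls))

long_enough = [url for url, ok in results if ok]
print(long_enough)  # -> ['a.com', 'c.com']
```

With hundreds of urls, threads overlap the network waits, so wall-clock time is bounded by the slowest batch rather than the sum of all fetches.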
