I am trying to get all the emails from a specific page and separate them into individual variables, or even better a dictionary. This is the code so far:
import requests
import re
import json
from bs4 import BeautifulSoup
page = "http://www.example.net"
info = requests.get(page)
if info.status_code == 200:
    print("Page accessed")
else:
    print("Error accessing page")
code = info.content
soup = BeautifulSoup(code, 'lxml')
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
print(allEmails)
sep = ","
allEmailsStr = str(allEmails)
print(type(allEmails))
print(type(allEmailsStr))
j = allEmailsStr.split(sep, 1)[0]
print(j)
Excuse the poor variable names; I threw this together just so it would run on its own. The output from the example website would be something like
[k, kolyma, location, balkans]
So if I ran the program it would return only
[k
But if I wanted it to return every email on there individually how would I do that?
To get just the email strings you can try:
emails = []
for email_link in allEmails:
    emails.append(email_link.get("href").replace('mailto:', ''))
print(emails)
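Since you mentioned wanting a dictionary, a minimal sketch that keys each address by its link text (assuming the link text is what you want as the key) would be:
emails_by_text = {}
for email_link in allEmails:
    # Link text as key, bare address (without the mailto: prefix) as value.
    key = email_link.get_text(strip=True)
    emails_by_text[key] = email_link.get("href").replace('mailto:', '')
print(emails_by_text)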
Based on your expected output, you can use the unwrap() function of BeautifulSoup:
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
for Email in allEmails:
    print(Email.unwrap())  # This will print the whole element along with the tag
# k
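If what you are actually after is only the text inside each link rather than the whole tag, get_text() may be closer to your expected output; a small sketch, assuming the link texts are the names you listed:
for Email in allEmails:
    print(Email.get_text(strip=True))  # k, kolyma, location, balkans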
I am new to Python and I've written this test code for practice, in order to find and print email addresses from various web pages:
import re
import urllib2

def FindEmails(*urls):
    for i in urls:
        totalemails = []
        req = urllib2.Request(i)
        aResp = urllib2.urlopen(req)
        webpage = aResp.read()
        patt1 = r'(\w+[-\w]\w+@\w+[.]\w+[.\w+]\w+)'
        patt2 = r'(\w+[\w]\w+@\w+[.]\w+)'
        regexlist = [patt1, patt2]
        for regex in regexlist:
            match = re.search(regex, webpage)
            if match:
                totalemails.append(match.group())
                break
    #return totalemails
    print "Mails from webpages are: %s " % totalemails

if __name__ == "__main__":
    FindEmails('https://www.urltest1.com', 'https://www.urltest2.com')
When I run it, it prints only one argument.
My goal is to print the emails acquired from webpages and store them in a list, separated by commas.
Thanks in advance.
The problem here is the line totalemails = []. You are re-initializing the variable totalemails to an empty list on every iteration, so at any point it holds at most one entry, and after the last iteration you are left with only the last entry. To collect all the emails, move the variable outside of the for loop.
Example:
def FindEmails(*urls):
    totalemails = []
    for i in urls:
        req = urllib2.Request(i)
        ....
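For completeness, a sketch of the whole function with that change applied. Note that it also swaps re.search for re.findall, because search only ever returns the first match per pattern, which is a separate issue from the one above; treat the patterns themselves as your own:
import re
import urllib2

def FindEmails(*urls):
    totalemails = []
    for i in urls:
        webpage = urllib2.urlopen(urllib2.Request(i)).read()
        patt1 = r'(\w+[-\w]\w+@\w+[.]\w+[.\w+]\w+)'
        patt2 = r'(\w+[\w]\w+@\w+[.]\w+)'
        for regex in [patt1, patt2]:
            matches = re.findall(regex, webpage)
            if matches:
                totalemails.extend(matches)  # collect every address on the page, not just the first
                break                        # stop after the first pattern that matches
    print "Mails from webpages are: %s" % ', '.join(totalemails)

if __name__ == "__main__":
    FindEmails('https://www.urltest1.com', 'https://www.urltest2.com')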
I have been searching for an answer and there is none on this forum, although several similar questions have been asked. One answer says it is possible to stop a spider after a certain time, but that is not suitable for me, because I usually launch 10 websites per spider. So my challenge is: I have a spider for 10 websites and I would like to limit the time to 20 seconds per domain, in order to avoid getting stuck on some webshop. How do I do that?
In general I can also tell you that I crawl 2000 company websites, and in order to finish in one day I divide them into 200 groups of 10 websites and launch 200 spiders in parallel. That may be amateurish, but it is the best I know. The computer almost freezes because the spiders consume all the CPU and memory, but the next day I have the results. What I am looking for is employment pages on the companies' websites. Does anyone have a better idea how to crawl 2000 websites? If there is a webshop among the websites, the crawling can take days, which is why I would like to limit the time per domain.
Thank you in advance.
Marko
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem
#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}
#start_time = time.time()
# We run the programme in the command line with this command:
# scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files
# 1) urls.csv
# 2) log.txt
# Url whitelist.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/url_whitelist.txt", "r+") as kw:
url_whitelist = kw.read().replace('\n', '').split(",")
url_whitelist = map(str.strip, url_whitelist)
# Tab whitelist.
# We need to replace character the same way as in detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/tab_whitelist.txt", "r+") as kw:
tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
tab_whitelist = tab_whitelist.replace('L', 'č')
tab_whitelist = tab_whitelist.replace('Ő', 'š')
tab_whitelist = tab_whitelist.replace('Ü', 'š')
tab_whitelist = tab_whitelist.replace('Ä', 'ž')
tab_whitelist = tab_whitelist.replace('×', 'ž')
tab_whitelist = tab_whitelist.replace('\n', '').split(",")
tab_whitelist = map(str.strip, tab_whitelist)
# Look for occupations in url.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_url.txt", "r+") as occ_url:
occupations_url = occ_url.read().replace('\n', '').split(",")
occupations_url = map(str.strip, occupations_url)
# Look for occupations in tab.
# We need to replace character the same way as in detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_tab.txt", "r+") as occ_tab:
occupations_tab = occ_tab.read().decode(sys.stdin.encoding).encode('utf-8')
occupations_tab = occupations_tab.replace('Ŕ', 'č')
occupations_tab = occupations_tab.replace('L', 'č')
occupations_tab = occupations_tab.replace('Ő', 'š')
occupations_tab = occupations_tab.replace('Ü', 'š')
occupations_tab = occupations_tab.replace('Ä', 'ž')
occupations_tab = occupations_tab.replace('×', 'ž')
occupations_tab = occupations_tab.replace('\n', '').split(",")
occupations_tab = map(str.strip, occupations_tab)
#Join url whitelist and occupations.
url_whitelist_occupations = url_whitelist + occupations_url
#Join tab whitelist and occupations.
tab_whitelist_occupations = tab_whitelist + occupations_tab
#base = open("G:/myVE/vacancies/bazni.txt", "w")
#non_base = open("G:/myVE/vacancies/ne_bazni.txt", "w")
class JobSpider(scrapy.Spider):
    #Name of spider
    name = "jobs"
    #start_urls = open("Q:\Big_Data\Utrip\spletne_strani.txt", "r+").readlines()[0]
    #print urls
    #start_urls = map(str.strip, urls)
    #Start urls
    start_urls = ["http://www.alius.si"]
    print "\nSpletna stran ", start_urls, "\n"
    #Result of the programme is this list of job vacancies webpages.
    jobs_urls = []
    def parse(self, response):
        #Theoretically I could save the HTML of the webpage to be able to check later how it looked
        #at the time of downloading. That is important for validation, because it is easier to look at a nice HTML webpage instead of naked text,
        #but I would have to write a pipeline http://doc.scrapy.org/en/0.20/topics/item-pipeline.html
        response.selector.remove_namespaces()
        #print "response url" , str(response.url)
        #Take the url of the response, because we would like to stay on the same domain.
        parsed = urlparse(response.url)
        #Base url.
        #base_url = get_base_url(response).strip()
        base_url = parsed.scheme+'://'+parsed.netloc
        #print "base url" , str(base_url)
        #If the urls grow from the seeds, it's ok, otherwise not.
        if base_url in self.start_urls:
            #print "base url is in start_urls"
            #base.write(response.url+"\n")
            #net1 = parsed.netloc
            #Take all urls; they are marked by "href" or "data-link". These are either webpages on our website or new websites.
            urls_href = response.xpath('//@href').extract()
            urls_datalink = response.xpath('//@data-link').extract()
            urls = urls_href + urls_datalink
            #print "links on this page ", urls
            #Loop through all urls on the webpage.
            for url in urls:
                #Test all new urls. DOESN'T WORK
                #print "url ", str(url)
                #If the url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.
                if not (url.startswith("http")):
                    #Join the partial url with the base url.
                    url = urljoin(base_url,url).strip()
                #print "new url ", str(url)
                new_parsed = urlparse(url)
                new_base_url = new_parsed.scheme+'://'+new_parsed.netloc
                #print "new base url ", str(new_base_url)
                if new_base_url in self.start_urls:
                    #print "yes"
                    url = url.replace("\r", "")
                    url = url.replace("\n", "")
                    url = url.replace("\t", "")
                    url = url.strip()
                    #Remove anchors '#', which point to a section on the same webpage, because it is still the same webpage.
                    #But we keep question marks '?', which mean that different content is pulled from the database.
                    if '#' in url:
                        index = url.find('#')
                        url = url[:index]
                    if url in self.jobs_urls:
                        continue
                    #Ignore ftp and sftp.
                    if url.startswith("ftp") or url.startswith("sftp"):
                        continue
                    #Compare each url on the webpage with the original url, so that the spider doesn't wander away on the net.
                    #net2 = urlparse(url).netloc
                    #test.write("location of the new url "+ str(net2)+"\n")
                    #if net2 != net1:
                    #    continue
                    #test.write("not the same location, we continue\n")
                    #If the last character is a slash /, I remove it to avoid duplicates.
                    if url[len(url)-1] == '/':
                        url = url[:(len(url)-1)]
                    #If the url includes characters like %, ~ ... it is LIKELY NOT the one I am looking for, so I ignore it.
                    #However in this case I also exclude good urls like http://www.mdm.si/company#employment
                    if any(x in url for x in ['%', '~',
                            #images
                            '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                            '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                            #documents
                            '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                            '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                            #music and video
                            '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                            '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                            #compressed files and other
                            '.zip', '.rar', '.css', '.flv', '.xml',
                            '.ZIP', '.RAR', '.CSS', '.FLV', '.XML',
                            #Twitter, Facebook, Youtube
                            '://twitter.com', '://mobile.twitter.com', 'www.twitter.com',
                            'www.facebook.com', 'www.youtube.com',
                            #Feeds, RSS, archive
                            '/feed', '=feed', '&feed', 'rss.xml', 'arhiv']):
                        continue
                    #We need to save the original url for xpath, in case we change it later (join it with base_url).
                    #url_xpath = url
                    #We don't want to go to other websites. We want to stay on our website, so we keep only urls with the domain (netloc) of the company we are investigating.
                    #if (urlparse(url).netloc == urlparse(base_url).netloc):
                    #The main part. We look for webpages whose urls include one of the employment words as strings.
                    #We check the text (tab) of the link as well. This is an additional filter, suggested by Dan Wu, to improve accuracy.
                    #tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()
                    tabs = response.xpath('//a[@href="%s"]/text()' % url).extract()
                    #Sometimes tabs can be just empty spaces like '\t' and '\n', so in that case we replace them with [].
                    #That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                    tabs = [tab.encode('utf-8') for tab in tabs]
                    tabs = [tab.replace('\t', '') for tab in tabs]
                    tabs = [tab.replace('\n', '') for tab in tabs]
                    tab_empty = True
                    for tab in tabs:
                        if tab != '':
                            tab_empty = False
                    if tab_empty == True:
                        tabs = []
                    # -- Instruction.
                    # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                    #keyword_url starts empty, then we add keywords as we find them in the url.
                    keyword_url = ''
                    #for keyword in url_whitelist:
                    for keyword in url_whitelist_occupations:
                        if keyword in url:
                            keyword_url = keyword_url + keyword + ' '
                    # a) If we find at least one keyword in the url, we continue.
                    if keyword_url != '':
                        #1. Tabs are empty.
                        if tabs == []:
                            #We found a url that includes one of the magic words.
                            #We check whether we have seen this url before. If it is new, we add it to the list "jobs_urls".
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = keyword_url
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = ' '
                                print "Zaposlitvena podstran ", url
                                #We return the item.
                                yield item
                        #2. There are one or more texts in tabs.
                        else:
                            #For the same partial url several texts are possible.
                            for tab in tabs:
                                #We search for keywords in the tabs.
                                keyword_url_tab = ''
                                #for key in tab_whitelist:
                                for key in tab_whitelist_occupations:
                                    if key in tab:
                                        keyword_url_tab = keyword_url_tab + key + ' '
                                #If we find some keywords in the tab, then we have found keywords in both the url and the tab and we can save the url.
                                if keyword_url_tab != '':
                                    #keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both the url and the tab.
                                    keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                    #We found a url that includes one of the magic words and the tab includes a magic word as well.
                                    #We check whether we have seen this url before. If it is new, we add it to the list "jobs_urls".
                                    if url not in self.jobs_urls:
                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        #item["keyword_url"] = ' '
                                        #item["keyword_url_tab"] = keyword_url_tab
                                        #item["keyword_tab"] = ' '
                                        print "Zaposlitvena podstran ", url
                                        #We return the item.
                                        yield item
                                #We haven't found any keywords in the tab, but the url is still good, because it contains some keywords, so we save it.
                                else:
                                    if url not in self.jobs_urls:
                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        #item["keyword_url"] = keyword_url
                                        #item["keyword_url_tab"] = ' '
                                        #item["keyword_tab"] = ' '
                                        print "Zaposlitvena podstran ", url
                                        #We return the item.
                                        yield item
                    # b) If keyword_url is empty, there are no keywords in the url, but perhaps there are keywords in the tabs. So we check the tabs.
                    else:
                        for tab in tabs:
                            keyword_tab = ''
                            #for key in tab_whitelist:
                            for key in tab_whitelist_occupations:
                                if key in tab:
                                    keyword_tab = keyword_tab + key + ' '
                            if keyword_tab != '':
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = ' '
                                    #item["keyword_tab"] = keyword_tab
                                    print "Zaposlitvena podstran ", url
                                    #We return the item.
                                    yield item
                    #We don't add an "else" branch here, because we want to keep exploring the webpage to find possible new employment webpages.
                    #We keep looking for employment webpages until we reach the DEPTH set in settings.py.
                    yield Request(url, callback = self.parse)
        #else:
            #non_base.write(response.url+"\n")
Just use scrapyd to schedule 2000 single-website crawls. Set max_proc = 10 [1] to run 10 spiders in parallel. Set the spider's CLOSESPIDER_TIMEOUT [2] to 20 to run every spider for at most 20 seconds. Stop using Windows natively, because it's a pain: I've observed Scrapy and scrapyd run faster inside a VM than natively on Windows. I might be wrong, so try it yourself to cross-check, but I have a strong feeling that an Ubuntu 14.04 VirtualBox image on Windows will be faster. Your crawl will then take roughly 2000 * 20 / 10 = 4000 seconds, i.e. a little over an hour.
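For illustration, a minimal sketch of both pieces, assuming a reasonably recent Scrapy (which supports per-spider custom_settings), a scrapyd instance on localhost:6800 whose scrapyd.conf sets max_proc = 10, and a deployed project named vacancies; the start_url argument is something you would add yourself, it is not part of your current spider:
import scrapy

class JobSpider(scrapy.Spider):
    name = "jobs"
    #Close every crawl after 20 seconds and don't hang on slow responses.
    custom_settings = {
        "CLOSESPIDER_TIMEOUT": 20,
        "DOWNLOAD_TIMEOUT": 10,
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super(JobSpider, self).__init__(*args, **kwargs)
        #One website per scrapyd job, passed in as a spider argument.
        self.start_urls = [start_url] if start_url else []

#In a separate script: schedule one scrapyd job per company website.
#scrapyd keeps at most max_proc of them running at the same time.
import requests
for site in ["http://www.alius.si", "http://www.example.si"]:
    requests.post("http://localhost:6800/schedule.json",
                  data={"project": "vacancies", "spider": "jobs", "start_url": site})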
So I'm trying to filter through a list of urls (potentially in the hundreds) and filter out every article whose body is less than X number of words (ARTICLE_LENGTH). But when I run my application it takes an unreasonable amount of time, so much so that my hosting service times out. I'm currently using Goose (https://github.com/grangier/python-goose) with the following filter function:
def is_news_and_data(url):
    """A function that returns a list of the form
    [True, title, meta_description]
    or
    [False]
    """
    result = []
    if url == None:
        return False
    try:
        article = g.extract(url=url)
        if len(article.cleaned_text.split()) < ARTICLE_LENGTH:
            result.append(False)
        else:
            title = article.title
            meta_description = article.meta_description
            result.extend([True, title, meta_description])
    except:
        result.append(False)
    return result
This is used in the context of the following function (don't mind the debug prints and the messiness; tweepy is my twitter api wrapper):
def get_links(auth):
    """Returns a list of t.co links from a list of given tweets"""
    api = tweepy.API(auth)
    page_list = []
    tweets_list = []
    links_list = []
    news_list = []
    regex = re.compile('http://t.co/.[a-zA-Z0-9]*')
    for page in tweepy.Cursor(api.home_timeline, count=20).pages(1):
        page_list.append(page)
    for page in page_list:
        for status in page:
            tweet = status.text.encode('utf-8','ignore')
            tweets_list.append(tweet)
    for tweet in tweets_list:
        links = regex.findall(tweet)
        links_list.extend(links)
    #print 'The length of the links list is: ' + str(len(links_list))
    for link in links_list:
        news_and_data = is_news_and_data(link)
        if True in news_and_data:
            news_and_data.append(link)
            #[True, title, meta_description, link]
            news_list.append(news_and_data[1:])
    print 'The length of the news list is: ' + str(len(news_list))
Can anyone recommend a perhaps faster method?
This code is probably causing your slow performance:
len(article.cleaned_text.split())
This is performing a lot of work, most of which is discarded. I would profile your code to confirm this is the culprit; if so, replace it with something that just counts spaces, like so:
article.cleaned_text.count(' ')
That won't give you exactly the same result as your original code, but will be very close. To get closer you could use a regular expression to count words, but it won't be quite as fast.
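A rough sketch of the three word-count options side by side, assuming article is the Goose result from g.extract(url=url) and ARTICLE_LENGTH is your threshold; the counts differ slightly, but for a length cut-off they behave about the same:
import re

text = article.cleaned_text

exact_words = len(text.split())                # original: builds a list of every word
approx_words = text.count(' ') + 1             # just counts spaces, much cheaper
regex_words = len(re.findall(r'\S+', text))    # closer to split(), still slower than count

passes_filter = approx_words >= ARTICLE_LENGTH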
I'm not saying this is the absolute best you can do, but it will be faster. You'll have to redo some of your code to fit this new function, but it will at least give you fewer function calls. You'll have to pass it the whole url list.
def is_news_in_data(listings):
    new_listings = {}
    is_news = {}
    for url in listings:
        is_news[url] = 0
        article = g.extract(url=url).cleaned_text
        tmp_listing = ''
        #Count characters while building up the text.
        for s in article:
            is_news[url] += 1
            tmp_listing += s
        #Keep only the articles that are long enough.
        if is_news[url] > ARTICLE_LENGTH:
            new_listings[url] = tmp_listing
            del is_news[url]
    return new_listings
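A brief usage sketch, assuming g and ARTICLE_LENGTH are defined as in the question and links_list is the list built in get_links():
long_articles = is_news_in_data(links_list)
#long_articles maps each sufficiently long article's url to its cleaned text.
for url in long_articles:
    print 'Keeping: ' + url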