Python - print 2nd argument - python

I am new to Python and I've written this test-code for practicing purposes, in order to find and print email addresses from various web pages:
def FindEmails(*urls):
    for i in urls:
        totalemails = []
        req = urllib2.Request(i)
        aResp = urllib2.urlopen(req)
        webpage = aResp.read()
        patt1 = '(\w+[-\w]\w+@\w+[.]\w+[.\w+]\w+)'
        patt2 = '(\w+[\w]\w+@\w+[.]\w+)'
        regexlist = [patt1,patt2]
        for regex in regexlist:
            match = re.search(regex,webpage)
            if match:
                totalemails.append(match.group())
                break
    #return totalemails
    print "Mails from webpages are: %s " % totalemails
if __name__== "__main__":
    FindEmails('https://www.urltest1.com', 'https://www.urltest2.com')
When I run it, it prints only one argument.
My goal is to print the emails acquired from webpages and store them in a list, separated by commas.
Thanks in advance.

The problem is the line totalemails = []. Because it sits inside the for loop, the list is reset to empty on every iteration, so it never holds more than the current page's match. After the last iteration you are left with only the last entry. To collect all the emails, create the list once, before the for loop.
Example:
def FindEmails(*urls):
    totalemails = []
    for i in urls:
        req = urllib2.Request(i)
        ....
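As a minimal sketch of the same idea without any network access, the list is created once and extended inside the loop. The name find_emails, the sample strings, and the simplified pattern are illustrative assumptions, not the asker's code, and the sketch is written for Python 3:

```python
import re

# Simplified, hypothetical pattern; real-world email matching is more involved.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(pages):
    """Collect every email-like match from each page of text."""
    total_emails = []  # created once, before the loop
    for page in pages:
        # findall returns all matches on the page, not just the first one
        total_emails.extend(EMAIL_PATTERN.findall(page))
    return total_emails

# Stand-ins for downloaded page contents
sample_pages = [
    "Contact: alice@example.com or bob@example.org",
    "Support: help@example.net",
]
emails = find_emails(sample_pages)
print(", ".join(emails))
```

Using findall instead of search also picks up more than one address per page, which may or may not be what is wanted.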

Related

Unable to use different values in multiple lines while using yield

I'm trying to figure out how I can print multiple values using yield in different lines. To be clearer: not using multiple yield; rather, one yield having multiple lines. In case of return I can use something like this:
return (placeholder_one,placeholder_two,placeholder_three +
        placeholder_four,placeholder_five,placeholder_six,title,link)
However, I get stuck when it comes to do the same using yield.
My goal is to write the values in a csv file. If I use return, I could write the same in the following manner:
placeholder_one,placeholder_two,placeholder_three,placeholder_four,placeholder_five,placeholder_six,title,link = fetch_items()
writer.writerow([placeholder_one,placeholder_two,placeholder_three,placeholder_four,placeholder_five,placeholder_six,title,link])
If I use yield, I can simply do this inside the if __name__ == '__main__' block (which would be ideal):
if __name__ == '__main__':
    for item in fetch_items():
        writer.writerow(item)
        print(item)
I've used some placeholders to make the line bigger;
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = "https://stackoverflow.com"
url = "https://stackoverflow.com/questions/tagged/web-scraping"
def fetch_items():
    res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"html.parser")
    placeholder_one = "Some name"
    placeholder_two = "Some id"
    placeholder_three = "Gender info"
    placeholder_four = "Some phone"
    placeholder_five = "Some email"
    placeholder_six = "Some credit info"
    for items in soup.select(".summary"):
        title = items.select_one(".question-hyperlink").get_text(strip=True)
        link = urljoin(base,items.select_one(".question-hyperlink").get("href"))
        yield placeholder_one,placeholder_two,placeholder_three,placeholder_four,placeholder_five,placeholder_six,title,link
if __name__ == '__main__':
    for item in fetch_items():
        print(item)
How can I yield the values in two or three lines like the way I did with return?
Could you be just expecting something close to:
for items in soup.select(".summary"):
    title = items.select_one(".question-hyperlink").get_text(strip=True)
    link = urljoin(base,items.select_one(".question-hyperlink").get("href"))
    yield (placeholder_one,placeholder_two,placeholder_three,
           placeholder_four,placeholder_five,placeholder_six,
           title,link)
Try creating a list or a tuple to yield:
yield [placeholder_one,placeholder_two,placeholder_three + placeholder_four,placeholder_five,placeholder_six,title,link]
As an example consider:
def num():
    for i in range(10):
        yield [2 + i, 6 + i, 7 + i]
for i in num():
    print(i[0])
    print(i[1])
    print(i[2])
Also take a look at yield-multiple-values
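To tie this back to the CSV goal, here is a small sketch that yields a parenthesized tuple spread over several lines and writes each yielded row with csv.writer. The sample titles, links, and the in-memory buffer are assumptions chosen so the example runs without a network request:

```python
import csv
import io

def fetch_items():
    """Yield one row per item; parentheses let the tuple span several lines."""
    name, user_id = "Some name", "Some id"
    # Hypothetical scraped results standing in for the parsed page
    for title, link in [("First", "https://example.com/1"),
                        ("Second", "https://example.com/2")]:
        yield (name, user_id,
               title, link)

buffer = io.StringIO()       # in-memory stand-in for an open CSV file
writer = csv.writer(buffer)
for item in fetch_items():
    writer.writerow(item)    # each yielded tuple becomes one CSV row
print(buffer.getvalue())
```

The parentheses matter only for line continuation; a bare comma-separated yield produces the same tuple.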

Link Scraping Program Redundancy?

I am attempting to create a small script that takes a given website and a keyword, follows all the links a certain number of times (only links on the website's domain), and finally searches all the found links for the keyword and returns any successful matches. Ultimately, its goal is this: if you remember a website where you saw something and know a good keyword that the page contained, this program might help find the link to the lost page. Now my bug: after looping through all these pages, extracting their URLs, and building a list of them, it seems to end up redundantly going over and removing the same links from the list. I did add a safeguard for this, but it doesn't seem to work as expected. I feel like some URL(s) are mistakenly being duplicated in the list and end up being checked an infinite number of times.
Here's my full code (sorry about the length); the problem area seems to be the for loop at the very end:
import bs4, requests, sys
def getDomain(url):
    if "www" in url:
        domain = url[url.find('.')+1:url.rfind('.')]
    elif "http" in url:
        domain = url[url.find("//")+2:url.rfind('.')]
    else:
        domain = url[:url.rfind(".")]
    return domain
def findHref(html):
    '''Will find the link in a given BeautifulSoup match object.'''
    link_start = html.find('href="')+6
    link_end = html.find('"', link_start)
    return html[link_start:link_end]
def pageExists(url):
    '''Returns true if url returns a 200 response and doesn't redirect to a dns search.
    url must be a requests.get() object.'''
    response = requests.get(url)
    try:
        response.raise_for_status()
        if response.text.find("dnsrsearch") >= 0:
            print response.text.find("dnsrsearch")
            print "Website does not exist"
            return False
    except Exception as e:
        print "Bad response:",e
        return False
    return True
def extractURLs(url):
    '''Returns list of urls in url that belong to same domain.'''
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text)
    matches = soup.find_all('a')
    urls = []
    for index, link in enumerate(matches):
        match_url = findHref(str(link).lower())
        if "." in match_url:
            if not domain in match_url:
                print "Removing",match_url
            else:
                urls.append(match_url)
        else:
            urls.append(url + match_url)
    return urls
def searchURL(url):
    '''Search url for keyword.'''
    pass
print "Enter homepage:(no http://)"
homepage = "http://" + raw_input("> ")
homepage_response = requests.get(homepage)
if not pageExists(homepage):
    sys.exit()
domain = getDomain(homepage)
print "Enter keyword:"
#keyword = raw_input("> ")
print "Enter maximum branches:"
max_branches = int(raw_input("> "))
links = [homepage]
for n in range(max_branches):
    for link in links:
        results = extractURLs(link)
        for result in results:
            if result not in links:
                links.append(result)
Partial output(about .000000000001%):
Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard
You are repeatedly looping over the same link multiple times with your outer loop:
for n in range(max_branches):
    for link in links:
        results = extractURLs(link)
I would also be careful about appending to a list while you are iterating over it, or you could well end up with an infinite loop.
Okay, I found a solution. All I did was change the links variable to a dictionary with the values 0 representing a not searched link and 1 representing a searched link. Then I iterated through a copy of the keys in order to preserve the branches and not let it wildly go follow every link that is added on in the loop. And finally if a link is found that is not already in links it is added and set to 0 to be searched.
links = {homepage: 0}
for n in range(max_branches):
    for link in links.keys()[:]:
        if not links[link]:
            results = extractURLs(link)
            for result in results:
                if result not in links:
                    links[result] = 0
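One way to make the "searched / not searched" bookkeeping explicit is to keep a set of links already seen plus a separate frontier of links found on the previous level, so each page is fetched at most once and new links are never appended to the collection being iterated. The sketch below is written for Python 3 and uses a hypothetical in-memory LINK_GRAPH in place of the real extractURLs, so it is an illustration of the approach rather than a drop-in replacement:

```python
# Hypothetical link graph standing in for real pages; no network access.
LINK_GRAPH = {
    "home": ["about", "blog"],
    "about": ["home", "team"],
    "blog": ["home", "post-1"],
    "team": [],
    "post-1": ["post-2"],
    "post-2": [],
}

def extract_urls(url):
    """Stand-in for extractURLs(): return the links found on a page."""
    return LINK_GRAPH.get(url, [])

def crawl(start, max_branches):
    """Follow links level by level, fetching each page at most once."""
    seen = {start}
    frontier = [start]                 # pages discovered on the previous level
    for _ in range(max_branches):
        next_frontier = []
        for link in frontier:
            for result in extract_urls(link):
                if result not in seen:
                    seen.add(result)
                    next_frontier.append(result)
        frontier = next_frontier       # only new pages are expanded next time
    return seen

print(sorted(crawl("home", 2)))
```

A set gives constant-time membership checks, which matters once the list of links grows large.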

iterate the list in python

I have a loop inside a loop, with a try/except around the request. When an error occurs the except block handles it fine, but the loop then moves on to the next value. What I need is for the loop to retry the same value where it failed instead of continuing to the next one. How can I do that in my code? (In other languages, such as C++, I would use i--.)
for
r = urllib2.urlopen(url)
encoding = r.info().getparam('charset')
html = r.read()
c = td.find('a')['href']
urls = []
urls.append(c)
#collecting urls from first page then from those url collecting further info in below loop
for abc in urls:
    try:
        r = urllib2.urlopen(abc)
        encoding = r.info().getparam('charset')
        html = r.read()
    except Exception as e:
        last_error = e
        time.sleep(retry_timeout) #here is the problem once get error then switch from next value
I need a more pythonic way to do this.
Waiting for a reply. Thank you.
Unfortunately, there is no simple way to step an iterator backwards in Python:
http://docs.python.org/2/library/stdtypes.html
You may be interested in this Stack Overflow thread:
Making a python iterator go backwards?
For your particular case, I would use a simple while loop:
url = []
i = 0
while i < len(url): # url is a list of all the urls; it keeps growing as new urls are added every day
    data = url[i]
    try:
        # getting data from there
        i += 1
    except:
        # show the error received; i is not incremented, so the loop retries the same position
        pass
The problem with handling it this way is that you risk an infinite loop. For example, if a link is broken, r = urllib2.urlopen(abc) will always raise an exception and you will stay at the same position forever. You should consider doing something like this:
r = urllib2.urlopen(url)
encoding = r.info().getparam('charset')
html = r.read()
c = td.find('a')['href']
urls = []
urls.append(c)
#collecting urls from first page then from those url collecting further info in below loop
NUM_TRY = 3
for abc in urls:
    for _ in range(NUM_TRY):
        try:
            r = urllib2.urlopen(abc)
            encoding = r.info().getparam('charset')
            html = r.read()
            break #if we arrive to this line, it means no error occur so we don't need to retry again
                  #this is why we break the inner loop
        except Exception as e:
            last_error = e
            time.sleep(retry_timeout) #here is the problem once get error then switch from next value
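The bounded-retry idea can also be wrapped in a small helper. This is a sketch in Python 3 under stated assumptions: fetch_with_retries and flaky_fetch are hypothetical names, the fetcher is passed in as a function so the example runs without a network, and num_try is assumed to be at least 1:

```python
import time

def fetch_with_retries(fetch, url, num_try=3, retry_timeout=0.0):
    """Call fetch(url) up to num_try times; re-raise the last error if all fail."""
    last_error = None
    for _ in range(num_try):
        try:
            return fetch(url)          # success: stop retrying immediately
        except Exception as error:
            last_error = error
            time.sleep(retry_timeout)  # wait before trying the same url again
    raise last_error

# Hypothetical fetcher that fails twice before succeeding.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("temporary failure")
    return "contents of " + url

print(fetch_with_retries(flaky_fetch, "http://example.com"))
```

Because the number of attempts is capped, a permanently broken link surfaces as an exception instead of stalling the loop.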

Very fast webpage scraping (Python)

So I'm trying to go through a list of urls (potentially in the hundreds) and filter out every article whose body is less than X number of words (ARTICLE_LENGTH). But when I run my application, it takes an unreasonable amount of time, so much so that my hosting service times out. I'm currently using Goose (https://github.com/grangier/python-goose) with the following filter function:
def is_news_and_data(url):
    """A function that returns a list of the form
    [True, title, meta_description]
    or
    [False]
    """
    result = []
    if url == None:
        return False
    try:
        article = g.extract(url=url)
        if len(article.cleaned_text.split()) < ARTICLE_LENGTH:
            result.append(False)
        else:
            title = article.title
            meta_description = article.meta_description
            result.extend([True, title, meta_description])
    except:
        result.append(False)
    return result
It is used in the context of the following. Don't mind the debug prints and messiness (tweepy is my Twitter API wrapper):
def get_links(auth):
    """Returns a list of t.co links from a list of given tweets"""
    api = tweepy.API(auth)
    page_list = []
    tweets_list = []
    links_list = []
    news_list = []
    regex = re.compile('http://t.co/.[a-zA-Z0-9]*')
    for page in tweepy.Cursor(api.home_timeline, count=20).pages(1):
        page_list.append(page)
    for page in page_list:
        for status in page:
            tweet = status.text.encode('utf-8','ignore')
            tweets_list.append(tweet)
    for tweet in tweets_list:
        links = regex.findall(tweet)
        links_list.extend(links)
    #print 'The length of the links list is: ' + str(len(links_list))
    for link in links_list:
        news_and_data = is_news_and_data(link)
        if True in news_and_data:
            news_and_data.append(link)
            #[True, title, meta_description, link]
            news_list.append(news_and_data[1:])
    print 'The length of the news list is: ' + str(len(news_list))
Can anyone recommend a perhaps faster method?
This code is probably causing your slow performance:
len(article.cleaned_text.split())
This does a lot of work, most of which is discarded. I would profile your code to see whether this is the culprit. If it is, replace it with something that just counts spaces, like so:
article.cleaned_text.count(' ')
That won't give you exactly the same result as your original code, but will be very close. To get closer you could use a regular expression to count words, but it won't be quite as fast.
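To see how the counts can differ, here is a small comparison on a made-up string (the sample text and variable names are assumptions for illustration). split() treats any run of whitespace as one separator, while count(' ') counts individual space characters and ignores newlines:

```python
import re

# Sample text with a double space and a newline
text = "one two  three\nfour"

words_by_split = len(text.split())              # splits on any run of whitespace
spaces_counted = text.count(" ")                # counts single space characters only
words_by_regex = len(re.findall(r"\S+", text))  # counts runs of non-whitespace

print(words_by_split, spaces_counted, words_by_regex)
```

Whether the difference matters depends on how close the articles are to the length threshold.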
I'm not saying this is the absolute best you can do, but it will be faster. You'll have to rework some of your code to fit this new function.
It will at least give you fewer function calls.
You'll have to pass in the whole url list.
def is_news_in_data(listings):
    new_listings = {}
    is_news = {}
    for url in listings:
        is_news[url] = 0
        article = g.extract(url=url).cleaned_text
        tmp_listing = ''
        # note: iterating over a string yields characters, so this counts characters, not words
        for s in article:
            is_news[url] += 1
            tmp_listing += s
            if is_news[url] > ARTICLE_LENGTH:
                new_listings[url] = tmp_listing
                del is_news[url]
                break  # stop counting once the threshold is passed
    return new_listings

Recursive function gives no output

I'm scraping all the URL of my domain with recursive function.
But it outputs nothing, without any error.
#usr/bin/python
from bs4 import BeautifulSoup
import requests
import tldextract
def scrape(url):
    for links in url:
        main_domain = tldextract.extract(links)
        r = requests.get(links)
        data = r.text
        soup = BeautifulSoup(data)
        for href in soup.find_all('a'):
            href = href.get('href')
            if not href:
                continue
            link_domain = tldextract.extract(href)
            if link_domain.domain == main_domain.domain :
                problem.append(href)
            elif not href == '#' and link_domain.tld == '':
                new = 'http://www.'+ main_domain.domain + '.' + main_domain.tld + '/' + href
                problem.append(new)
    return len(problem)
    return scrape(problem)
problem = ["http://xyzdomain.com"]
print(scrape(problem))
When I create a new list, it works, but I don't want to make a list every time for every loop.
You need to structure your code so that it follows the pattern for recursion, which your current code doesn't. As written, it will only ever return len(problem), because that return is reached unconditionally before return scrape(problem). You should also avoid reusing one name for two different things: href = href.get('href') rebinds the loop variable from the tag to its URL string, which makes the code harder to follow. The general pattern is:
def Recursive(Factorable_problem):
    if Factorable_problem is Simplest_Case:
        return AnswerToSimplestCase
    else:
        return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case))
for example:
def Factorial(n):
    """ Recursively Generate Factorials """
    if n < 2:
        return 1
    else:
        return n * Factorial(n-1)
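Applied to link scraping, the same pattern might look like the sketch below. The simplest case is a page that has already been visited, and every other page reduces to scraping the pages it links to. The SITE dictionary is a hypothetical stand-in for real HTTP requests and parsing, so this only illustrates the recursion shape:

```python
# Hypothetical site map used in place of real HTTP requests.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": [],
    "/c": ["/a"],
}

def scrape(url, visited=None):
    """Return the set of pages reachable from url, visiting each page once."""
    if visited is None:
        visited = set()
    if url in visited:          # simplest case: nothing new to explore
        return visited
    visited.add(url)
    for link in SITE.get(url, []):
        scrape(link, visited)   # simpler case: one page further along
    return visited

print(sorted(scrape("/")))
```

On a large site this shape can still hit Python's recursion depth limit, in which case an explicit stack or queue is the usual alternative.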
Hello, I've made a non-recursive version of this that appears to get all the links on the same domain.
I've tested the code below using the problem list included in it. Once I had fixed the issues with the recursive version, the next problem was hitting the recursion depth limit, so I rewrote it to run iteratively. The code and a description of the result are below:
from bs4 import BeautifulSoup
import requests
import tldextract
def print_domain_info(d):
    print "Main Domain:{0} \nSub Domain:{1} \nSuffix:{2}".format(d.domain,d.subdomain,d.suffix)
SEARCHED_URLS = []
problem = [ "http://Noelkd.neocities.org/", "http://youpi.neocities.org/"]
while problem:
    # Get a link from the stack of links
    link = problem.pop()
    # Check we haven't been to this address before
    if link in SEARCHED_URLS:
        continue
    # We don't want to come back here again after this point
    SEARCHED_URLS.append(link)
    # Try and get the website
    try:
        req = requests.get(link)
    except:
        # If its not working i don't care for it
        print "borked website found: {0}".format(link)
        continue
    # Now we get to this point worth printing something
    print "Trying to parse:{0}".format(link)
    print "Status Code:{0} Thats: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMTHINGS UP" )
    # Get the domain info
    dInfo = tldextract.extract(link)
    print_domain_info(dInfo)
    # I like utf-8
    data = req.text.encode("utf-8")
    print "Lenght Of Data Retrived:{0}".format(len(data)) # More info
    soup = BeautifulSoup(data) # This was here before so i left it.
    print "Found {0} link{1}".format(len(soup.find_all('a')),"s" if len(soup.find_all('a')) > 1 else "")
    FOUND_THIS_ITERATION = [] # Getting the same links over and over was boring
    found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS] # Find me all the links i don't got
    for href in found_links:
        href = href.get('href') # You wrote this seems to work well
        if not href:
            continue
        link_domain = tldextract.extract(href)
        if link_domain.domain == dInfo.domain: # JUST FINDING STUFF ON SAME DOMAIN RIGHT?!
            if href not in FOUND_THIS_ITERATION: # I'ma check you out next time
                print "Check out this link: {0}".format(href)
                print_domain_info(link_domain)
                FOUND_THIS_ITERATION.append(href)
                problem.append(href)
            else: # I got you already
                print "DUPE LINK!"
        else:
            print "Not on same domain moving on"
    # Count down
    print "We have {0} more sites to search".format(len(problem))
    if problem:
        continue
    else:
        print "Its been fun"
        print "Lets see the URLS we've visited:"
        for url in SEARCHED_URLS:
            print url
After a lot of other logging, this prints loads of neocities websites!
What's happening is that the script pops a value off the list of websites yet to visit, then gets all the links on that page which are on the same domain. If those links point to pages we haven't visited, we add them to the list of links to be visited. After that we pop the next page and do the same thing again until there are no pages left to visit.
I think this is what you're looking for. Get back to us in the comments if this doesn't work the way you want, or if anyone can improve it, please leave a comment.
