Issue with Lambda execution time - Python

I am working on a project where I have to scrape as many URLs as possible (listed in a file in an S3 bucket) in a limited time and store them in a searchable database. Right now I am having an issue while scraping web pages inside an AWS Lambda. I have a function for this task which, when run in a Google Colab environment, takes only 7-8 seconds to execute and produces the desired results. But the same function, when deployed as a Lambda, takes almost 10x more time to execute. Here is my code:
import requests
import re
import validators
import boto3
from smart_open import open
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
nltk.data.path.append("/tmp")
nltk.download("stopwords", download_dir = "/tmp")
def CrawlingLambda(event, context):
    """
    This lambda crawls a list of web pages, reading URLs from an S3 bucket, and returns a dictionary
    pairing each URL with its keywords.
    Args:
        event: Lambda invocation event
        context: Lambda runtime context
    Returns:
        dict: Each successfully crawled URL paired with the keywords extracted from its page
    """
    results = {}
    client = boto3.client('s3')
    for line in open('s3://urls-to-monitor/URLs1T.txt', transport_params={'client': client}):
        if line[len(line)-1] != '/':
            url = line[:len(line)-2]
        else:
            url = line
        if validation(url) == False:
            continue
        try:
            web_content = scrape_web(url)
            results[url] = web_content
        except:
            continue
    return results
def validation(url):
    """
    Validates the URL string. This method uses regular expressions for the validation under the hood.
    Args:
        url: URL to validate
    Returns:
        bool: True if the passed string is a valid URL and False otherwise.
    """
    return validators.url(url)
def scrape_web(url):
    """
    This function scrapes a given URL's web page for a specific set of keywords.
    Args:
        url: Page's URL to be scraped
    Returns:
        filtered_words: A refined list of extracted words from the web page.
    """
    try:
        res = requests.get(url, timeout=2)
    except:
        raise ValueError
    if res.status_code != 200:
        raise ValueError
    html_page = res.content
    soup = remove_tags(html_page)
    content = soup.get_text()
    words = re.split(r"\s+|/", content.lower())
    filtered_words = clean_wordlist(words)
    return tuple(filtered_words)
def remove_tags(html):
    """
    Removes the specified tags from the HTML response received from the requests.get() method.
    Args:
        html: HTML response of the web page
    Returns:
        soup: Parsed response of HTML
    """
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script', 'noscript']):
        # remove the tags
        data.decompose()
    # return data by retrieving the tag content
    return soup
def clean_wordlist(wordlist):
    """
    This function removes any punctuation marks and stop words from our extracted wordlist.
    Args:
        wordlist: A list of raw words extracted from the HTML response of the web page.
    Returns:
        key_words: A filtered list of words containing only key words
    """
    words_without_symbol = []
    for word in wordlist:
        # symbols to ignore
        symbols = "!##$%^&*()_-+={[}]|\\;:\"<>?/., "
        for i in range(len(symbols)):
            word = word.replace(symbols[i], '')
        if len(word) > 0:
            words_without_symbol.append(word)
    # ignoring the stopwords
    key_words = [word for word in words_without_symbol if word not in stopwords.words()]
    return key_words
Any direction as to why there is such a big time difference, and how can I reduce it?

The only thing you can configure that affects performance is memory allocation. Try increasing the memory allocated to your function until you get at least the same performance as in Colab.
Billing shouldn't be affected much, as it is calculated as the product of memory and execution time.
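If you manage the function programmatically, the memory setting can be raised with boto3, for example; a minimal sketch (the function name below is a placeholder for whatever your Lambda is actually called):
import boto3

lambda_client = boto3.client("lambda")
# Raise the memory allocation; Lambda scales CPU share with the memory setting.
lambda_client.update_function_configuration(
    FunctionName="CrawlingLambda",  # placeholder: substitute your deployed function's name
    MemorySize=1024,                # in MB; increase until the runtime is acceptable
)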

Related

List of all US ZIP Codes using uszipcode

I've been trying to fetch all US zip codes for a web scraping project for my company.
I'm trying to use the uszipcode library to do it automatically rather than manually from the website I'm interested in, but I can't figure it out.
This is my manual attempt:
from bs4 import BeautifulSoup
import requests

url = 'https://www.unitedstateszipcodes.org'
headers = {'User-Agent': 'Chrome/50.0.2661.102'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
all_zipcodes = []
# Extract all
for data in soup.find_all('div', class_='state-list'):
    for a in data.find_all('a'):
        if a is not None:
            hrefs.append(a.get('href'))
hrefs.remove(None)
def get_zipcode_list():
    """
    get_zipcode_list crawls each state's page and collects every zip code listed there.
    :return: a list of all zip codes found.
    """
    for state in hrefs:
        state_url = url + state
        state_page = requests.get(state_url, headers=headers)
        states_soup = BeautifulSoup(state_page.text, 'html.parser')
        div = states_soup.find(class_='list-group')
        for a in div.findAll('a'):
            if str(a.string).isdigit():
                all_zipcodes.append(a.string)
    return all_zipcodes
This takes a lot of time, and I would like to know how to do the same thing more efficiently using uszipcode.
You may try searching by the empty pattern '':
from uszipcode import SearchEngine

s = SearchEngine()
l = s.by_pattern('', returns=1000000)
print(len(l))
More details in the docs and in their basic tutorial.
import pandas as pd
from uszipcode import SearchEngine

engine = SearchEngine()
allzips = {}
for i in range(100000):  # get zipcode info for every possible 5-digit combination
    zipcode = str(i).zfill(5)
    try:
        allzips[zipcode] = engine.by_zipcode(zipcode).to_dict()
    except:
        pass

# Convert dictionary to DataFrame
allzips = pd.DataFrame(allzips).T.reset_index(drop=True)
Since zip codes are only 5 digits, you can iterate up to 100k and see which zip codes don't return an error. This solution gives you a DataFrame with all the stored information for each saved zip code.
The regex that US zip codes follow is [0-9]{5}(?:-[0-9]{4})?
You can simply check with the re module:
import re

regex = r"[0-9]{5}(?:-[0-9]{4})?"
zipcode = "12345-6789"  # example value to test
if re.match(regex, zipcode):
    print("match")
else:
    print("not a match")
You can download the list of zip codes from the official source and then parse it, if it's for one-time use and you don't need any of the other metadata associated with each zip code, like what uszipcode provides.
uszipcode also has another database which is quite big and should have all the data you need.
from uszipcode import SearchEngine

zipSearch = SearchEngine(simple_zipcode=False)
allZipCodes = zipSearch.by_pattern('', returns=200000)
print(len(allZipCodes))

Extracting follower count from Instagram

I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method within Requests; however, the string that I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it will no longer run.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive a "-1" error, which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and first ' in end...this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as the answer from Derek Eden above from 2019 no longer works, as stated in its comments.
The solution was to add the r prefix before the regular expression in the re.search, like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' prefix is really important: without it, Python treats the expression as a regular string, which leads to the query not giving any results.
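To see why the prefix matters, here is a small, self-contained illustration (the sample string below just mimics the escaped JSON found in the saved page):
import re

# sample text mimicking the escaped JSON in the saved Instagram page
html = r'\"edge_followed_by\":{\"count\":110070}'

raw_pattern = r'\\"count\\":([0-9]+)'    # the regex engine sees \\" -> a literal backslash, then a quote
plain_pattern = '\\"count\\":([0-9]+)'   # the regex engine sees \"  -> just a quote, the backslash is lost

print(re.search(raw_pattern, html))      # matches, group(1) == '110070'
print(re.search(plain_pattern, html))    # None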
Also, the Instagram page seems to have backslashes in the object we look for, at least in my tests, so the code example I use is the following, in Python 3.10, working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
        # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0

print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines so as not to hammer Instagram's web server and get blocked while testing, by just saving the response in a file.
This is a slice of the original html content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in regexr; it seems to work just fine at this point in time.
There are many posts about the regex r prefix, like this one.
Also, the documentation of the re package clearly shows that this is the issue with the code above.

Scraping json content from a site ordered in pages

I'm trying to scrape a site. When I run the following code without region_id=[any number from 1 to 32] I get a [500], but if I set region_id=1 I only get the first page by default (in the URL it is pagina=&); pages go up to 500. Is there a command or parameter for retrieving every page (every possible value of pagina=), avoiding for loops?
import requests
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id=&parent_id=&pagina=&nombre="
resp = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
data = resp.json()
Even without a for loop, you are still going to need iteration. You could do it with recursion or map as I've done below, but the iteration is still there. This solution has the advantage that everything is a generator, so only when you ask for a page's json from all_data will url be formatted, the request made, checked and converted to json. I added a filter to make sure you got a valid response before trying to get the json out. It still makes every request sequentially, but you could replace map with a parallel implementation quite easily.
import requests
from itertools import product, starmap
from functools import partial

def is_valid_resp(resp):
    return resp.status_code == requests.codes.ok

def get_json(resp):
    return resp.json()

# There's a .format hiding on the end of this really long url,
# with {} in appropriate places
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id={}&parent_id=&pagina={}&nombre=".format

regions = range(1, 33)
pages = range(1, 501)
urls = starmap(url, product(regions, pages))

moz_get = partial(requests.get, headers={'User-Agent': 'Mozilla/5.0'})
responses = map(moz_get, urls)
valid_responses = filter(is_valid_resp, responses)
all_data = map(get_json, valid_responses)
# all_data is a generator that will give you each page's json.
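If you later want the parallel implementation mentioned above, a minimal sketch is to swap the plain map for a thread pool, reusing the same helpers (the worker count here is an arbitrary choice):
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as pool:
    responses = pool.map(moz_get, urls)           # fetch pages concurrently
    valid_responses = filter(is_valid_resp, responses)
    all_data = [get_json(resp) for resp in valid_responses]  # consume before the pool shuts down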

blocking api redirect with requests in python 3.4

I am creating a Python program that uses an online thesaurus and returns synonyms. Unfortunately, it will sometimes take a word that is spelled wrong and redirect to a page for a word that is close to it, which is problematic. How can I stop it from redirecting? I would appreciate any advice. This is the code that applies:
import requests
from json import loads

def get_synonym(the_word):
    # return a dictionary of the thesaurus results of the word
    theurl = (the api key for the thesaurus)
    new_word = the_word + "/json"
    theurl = theurl + new_word
    r = requests.get(theurl)
    thewords = r.text  # all the text for the results
    thewords = loads(thewords)  # make a dictionary of terms
    return thewords  # return dictionary of synonyms for the_word
Use the allow_redirects=False keyword argument:
r = requests.get(url, allow_redirects=False)
By default, requests follows redirects on all methods except HEAD.
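With redirects disabled, a rough sketch of how you might detect the "did you mean" redirect and handle it (theurl below is a placeholder, just as in the question):
import requests

theurl = "https://thesaurus.example.com/some_word/json"  # placeholder URL
r = requests.get(theurl, allow_redirects=False)
if r.is_redirect:
    # a 3xx response: the API wanted to send us to a different (corrected) word
    print("No exact match; suggested page:", r.headers.get("Location"))
else:
    thewords = r.json()  # parse the synonym dictionary directly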

python fetch data from website sub pages

I am attempting to create a bot that fetches market links from Steam, but I have run into a problem. I was able to return all the data from a single page, but when I attempt to get multiple pages it just gives me copies of the first page, even though I give it working links (e.g. http://steamcommunity.com/market/search?q=appid%3A753#p1 and then http://steamcommunity.com/market/search?q=appid%3A753#p2). I have tested the links and they work in my browser. This is my code.
import urllib2
import random
import time

start_url = "http://steamcommunity.com/market/search?q=appid%3A753"
end_page = 3
urls = []

def get_raw(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

def get_market_urls(html):
    index = 0
    while index != -1:
        index = html.find("market_listing_row_link", index+25)
        beg = html.find("http", index)
        end = html.find('"', beg)
        print html[beg:end]
        urls.append(html[beg:end])

def go_to_page(page):
    return start_url + "#p" + str(page)

def wait(min, max):
    wait_t = random.randint(min, max)
    time.sleep(wait_t)

for i in range(end_page):
    url = go_to_page(i+1)
    raw = get_raw(url)
    get_market_urls(raw)
Your problem is that you've misunderstood what the URL says.
The number after the hash sign doesn't make it a different URL that can be fetched; it's the fragment. On that particular page the fragment tells the JavaScript which page to pull via AJAX. (Read about it Here and Here if you're interested.)
Anyway, you should look at this URL instead: http://steamcommunity.com/market/search/render/?query=appid%3A753&start=00&count=10. You can play with the start=00&count=10 parameters to get the results you want.
Enjoy.
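For example, a rough sketch in Python 3 with requests that walks the start/count parameters of that render endpoint (the structure of the returned JSON isn't guaranteed, so inspect it before relying on specific fields):
import requests

render_url = "http://steamcommunity.com/market/search/render/"
pages = []
for start in range(0, 30, 10):  # first three pages of 10 results each
    resp = requests.get(render_url,
                        params={"query": "appid:753", "start": start, "count": 10},
                        headers={"User-Agent": "Mozilla/5.0"})
    if resp.status_code == 200:
        pages.append(resp.json())  # each page comes back as a JSON payload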
