Using thread causes "python.exe has stopped working" - python

Recently I tried to add threading to my scraper so that it can have higher efficiency while scraping.
But somehow it randomly causes python.exe to "stop working", with no further information given, so I have no idea how to debug it.
Here is some relevant code:
Where the threads are initiated:
def run(self):
    """
    create the threads and run the scraper
    :return:
    """
    self.__load_resource()
    self.__prepare_threads_args()  # each thread is allocated a different set of links to scrape from, so there should be no collisions
    for item in self.threads_args:
        try:
            t = threading.Thread(target=self.urllib_method, args=(item,))
            # use the following expression to use the selenium scraper
            # t = threading.Thread(target=self.__scrape_site, args=(item,))
            self.threads.append(t)
            t.start()
        except Exception as ex:
            print ex
What the Scraper is like:
def urllib_method(self, thread_args):
    """
    :param thread_args: arguments containing the files to scrape and the proxy to use
    :return:
    """
    site_scraper = SiteScraper()
    for file in thread_args["files"]:
        current_folder_path = self.__prepare_output_folder(file["name"])

        articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
        articles_without_comments_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

        articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
        articles_scraped_links = get_links_from_file(articles_scraped_file) if isfile(articles_scraped_file) else []

        links = get_links_from_file(file["path"])
        for link in links:
            article_id = extract_article_id(link)

            if isfile(join(current_folder_path, article_id)):
                print "skip: ", link
                if link not in articles_scraped_links:
                    append_text_to_file(articles_scraped_file, link)
                continue
            if link in articles_without_comments_links:
                continue

            comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])

            if comments != "Pro article" and comments != "Crash" and comments != "No Comments" and comments is not None:
                print article_id, comments[0:14]
                write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                sleep(1)
                append_text_to_file(articles_scraped_file, link)
            elif comments == "No Comments":
                print "article without comments: ", article_id
                if link not in articles_without_comments_links:
                    append_text_to_file(articles_without_comments_file, link)
                sleep(1)
I have tried running the script on both Windows 10 and 8.1; the issue exists on both.
Also, the more data it scrapes, the more frequently it happens, and the more threads used, the more frequently it happens.

Threads in Python pre 3.2 are very unsafe to use, due to the diabolical Global Interpreter Lock.
The preferred way to utilize multiple cores and processes in python is via the multiprocessing package.
https://docs.python.org/2/library/multiprocessing.html
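For example, here is a rough sketch of farming the per-worker scraping out to a process pool. The worker function and the argument list are placeholders, not your scraper's actual code:
import multiprocessing

def scrape_worker(worker_args):
    # placeholder for the real per-worker scraping logic
    return len(worker_args["files"])

if __name__ == "__main__":
    # hypothetical list of per-worker argument dicts, one per process
    all_args = [{"files": [], "proxy": None} for _ in range(4)]

    pool = multiprocessing.Pool(processes=4)
    results = pool.map(scrape_worker, all_args)  # blocks until all workers finish
    pool.close()
    pool.join()
    print(results)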

Related

Python - Instaloader ProfileNotExistsException

I am new to Instaloader and I am running into a problem when trying to pull in bio information. We have scraped Google for a list of Instagram handles for our accounts; unfortunately the data isn't perfect and some of the handles we pulled in are no longer active (the user has changed the profile handle or deleted the account). This causes a ProfileNotExistsException error and stops pulling in the information for all subsequent accounts.
Is there a way to ignore this and continue pulling in the rest of the bios, just leaving this one blank?
Here is the code that is throwing me the error. handles is the list of handles we have.
bios = []
for element in handles:
    if element == '':
        bios.append('NULL')
    else:
        bios.append(instaloader.Profile.from_username(L.context, element).biography)
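What I'm hoping for is something along these lines, just a sketch of the idea (assuming ProfileNotExistsException can be imported from instaloader.exceptions):
from instaloader.exceptions import ProfileNotExistsException

bios = []
for element in handles:
    if element == '':
        bios.append('NULL')
        continue
    try:
        bios.append(instaloader.Profile.from_username(L.context, element).biography)
    except ProfileNotExistsException:
        # handle no longer exists; leave this one blank and keep going
        bios.append('')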
I have tried using the workaround found in this forum (can't find the post), but it is not working for me. No errors, it just doesn't solve the issue. The code they suggested was:
def _obtain_metadata(self):
    try:
        if self._rhx_gis == None:
            metadata = self._context.get_json('{}/'.format(self.username), params={})
            self._node = metadata['entry_data']['ProfilePage'][0]['graphql']['user']
            self._rhx_gis = metadata['rhx_gis']
        metadata = self._context.get_json('{}/'.format(self.username), params={})
        self._node = metadata['entry_data']['ProfilePage'][0]['graphql']['user']
    except (QueryReturnedNotFoundException, KeyError) as err:
        raise ProfileNotExistsException('Profile {} does not exist.'.format(self.username)) from err
Thanks in advance!

Downloading Multiple torrent files with Libtorrent in Python

I'm trying to write a torrent application that can take in a list of magnet links and then download them all together. I've been trying to read and understand the Libtorrent documentation, but I haven't been able to tell whether what I try works or not. I've managed to apply a SOCKS5 proxy to a Libtorrent session and download a single magnet link using this code:
import libtorrent as lt
import time
import os

ses = lt.session()
r = lt.proxy_settings()
r.hostname = "proxy_info"
r.username = "proxy_info"
r.password = "proxy_info"
r.port = 1080
r.type = lt.proxy_type_t.socks5_pw
ses.set_peer_proxy(r)
ses.set_web_seed_proxy(r)
ses.set_proxy(r)

t = ses.settings()
t.force_proxy = True
t.proxy_peer_connections = True
t.anonymous_mode = True
ses.set_settings(t)
print(ses.get_settings())
ses.peer_proxy()
ses.web_seed_proxy()
ses.set_settings(t)

magnet_link = "magnet"
params = {
    "save_path": os.getcwd() + r"\torrents",
    "storage_mode": lt.storage_mode_t.storage_mode_sparse,
    "url": magnet_link
}
handle = lt.add_magnet_uri(ses, magnet_link, params)
ses.start_dht()

print('downloading metadata...')
while not handle.has_metadata():
    time.sleep(1)
print('got metadata, starting torrent download...')
while handle.status().state != lt.torrent_status.seeding:
    s = handle.status()
    state_str = ['queued', 'checking', 'downloading metadata', 'downloading', 'finished', 'seeding', 'allocating']
    print('%.2f%% complete (down: %.1f kb/s up: %.1f kB/s peers: %d) %s' % (
        s.progress * 100, s.download_rate / 1000, s.upload_rate / 1000, s.num_peers, state_str[s.state]))
    time.sleep(5)
This is great and all for running on its own with a single link. What I want to do is something like this:
def torrent_download(magnetic_link_list):
    for mag in range(len(magnetic_link_list)):
        handle = lt.add_magnet_uri(ses, magnetic_link_list[mag], params)
        # Then download all the files
        # Once all files complete, stop the torrents so they don't seed.
    return torrent_name_list
I'm not sure if this is even on the right track or not, but some pointers would be helpful.
UPDATE: This is what I now have and it works fine in my case
def magnet2torrent(magnet_link):
    global LIBTORRENT_SESSION, TORRENT_HANDLES
    if LIBTORRENT_SESSION is None and TORRENT_HANDLES is None:
        TORRENT_HANDLES = []
        settings = lt.default_settings()
        settings['proxy_hostname'] = CONFIG_DATA["PROXY"]["HOST"]
        settings['proxy_username'] = CONFIG_DATA["PROXY"]["USERNAME"]
        settings['proxy_password'] = CONFIG_DATA["PROXY"]["PASSWORD"]
        settings['proxy_port'] = CONFIG_DATA["PROXY"]["PORT"]
        settings['proxy_type'] = CONFIG_DATA["PROXY"]["TYPE"]
        settings['force_proxy'] = True
        settings['anonymous_mode'] = True
        LIBTORRENT_SESSION = lt.session(settings)
    params = {
        "save_path": os.getcwd() + r"/torrents",
        "storage_mode": lt.storage_mode_t.storage_mode_sparse,
        "url": magnet_link
    }
    TORRENT_HANDLES.append(LIBTORRENT_SESSION.add_torrent(params))

def check_torrents():
    global TORRENT_HANDLES
    for torrent in range(len(TORRENT_HANDLES)):
        print(TORRENT_HANDLES[torrent].status().is_seeding)
It's called "magnet links" (not magnetic).
In new versions of libtorrent, the way you add a magnet link is:
params = lt.parse_magnet_uri(uri)
handle = ses.add_torrent(params)
That also gives you an opportunity to tweak the add_torrent_params object, to set the save directory for instance.
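For instance, roughly like this (a sketch; the save_path attribute is what the newer python bindings expose on add_torrent_params, and the magnet URI here is a placeholder):
import libtorrent as lt

uri = "magnet:?xt=urn:btih:..."   # placeholder magnet link
params = lt.parse_magnet_uri(uri)
params.save_path = "./torrents"   # tweak add_torrent_params before adding
handle = ses.add_torrent(params)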
If you're adding a lot of magnet links (or regular torrent files for that matter) and want to do it quickly, a faster way is to use:
ses.add_torrent_async(params)
That function will return immediately and the torrent_handle object can be picked up later in the add_torrent_alert.
As for downloading multiple magnet links in parallel, your pseudo code for adding them is correct. You just want to make sure you either save off all the torrent_handle objects you get back or query all torrent handles once you're done adding them (using ses.get_torrents()). In your pseudo code you seem to overwrite the last torrent handle every time you add a new one.
The condition you expressed for exiting was that all torrents were complete. The simplest way of doing that is simply to poll them all with handle.status().is_seeding. i.e. loop over your list of torrent handles and ask that. Keep in mind that the call to status() requires a round-trip to the libtorrent network thread, which isn't super fast.
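In code, that polling loop would look roughly like this (assuming torrent_handles is the list of handles you saved off):
import time

# torrent_handles is the list of handles returned by add_torrent()
while not all(h.status().is_seeding for h in torrent_handles):
    time.sleep(5)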
The faster way of doing this is to keep track of all torrents that aren't seeding yet, and "strike them off your list" as you get torrent_finished_alerts for torrents. (you get alerts by calling ses.pop_alerts()).
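Roughly like this (a sketch; it assumes your session's alert mask includes status notifications, so the finished alerts are actually delivered):
import time
import libtorrent as lt

# info-hashes of torrents that haven't finished yet
remaining = set(str(h.info_hash()) for h in ses.get_torrents())

while remaining:
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.torrent_finished_alert):
            # strike the finished torrent off the list
            remaining.discard(str(alert.handle.info_hash()))
    time.sleep(1)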
Another suggestion I would make is to set up your settings_pack object first, then create the session. It's more efficient and a bit cleaner. Especially with regards to opening listen sockets and then immediately closing and re-opening them when you change settings.
i.e.
p = lt.settings_pack()
p['proxy_hostname'] = '...'
p['proxy_username'] = '...'
p['proxy_password'] = '...'
p['proxy_port'] = 1080
p['proxy_type'] = lt.proxy_type_t.socks5_pw
p['proxy_peer_connections'] = True
ses = lt.session(p)

nfcpy: How to get on-release event correctly with NFCPY?

I am trying to listen for different RFID cards with an ACR122 reader and the nfcpy Python library.
I would like to get the card's ID when the user presents it (without recognizing it over and over) and get an event when the user removes it. Ideally in a loop, so the reader listens for the next card once the user takes their card away.
Below is my code, but the on-release event is fired even if the card is still on the reader. What is the correct way to:
Get on-connect without recognizing the card over and over?
Get on-release when the user takes the card away?
import nfc

def on_startup(targets):
    return targets

def on_connect(tag):
    uid = str(tag.identifier).encode("hex").upper()
    print(uid)
    return True

def on_release(tag):
    print('Released')
    return tag

rdwr_options = {
    'on-startup': on_startup,
    'on-connect': on_connect,
    'on-release': on_release,
    'beep-on-connect': False,
}

with nfc.ContactlessFrontend('usb') as clf:
    tag = clf.connect(rdwr=rdwr_options)
You might need to set an interval in your ContactlessFrontend config. Try this example:
import nfc
import ndef

tags = set()
rec = ndef.UriRecord("https://google.com")

def on_connect(tag):
    if tag.identifier not in tags:
        tags.add(tag.identifier)
        fmt = tag.format()
        if fmt is None:
            print("Tag cannot be formatted (not supported).")
        elif fmt is False:
            print("Tag failed to be formatted (for some reason).")
        else:
            tag.ndef.records = [rec]

if __name__ == "__main__":
    clf = nfc.ContactlessFrontend()
    if not clf.open('usb'):
        raise RuntimeError("Failed to open NFC device.")
    while True:
        config = {
            'interval': 0.35,
            'on-connect': on_connect
        }
        ret = clf.connect(rdwr=config)
        if ret is None:
            pass
        elif not ret:
            print("NFC connection terminated due to an exception.")
            break
        else:
            pass
    clf.close()
https://gist.github.com/henrycjc/c1632b2d1f210ae0ff33d860c7c2eb8f
This discussion helped me figure out how to solve this.
When reading the docs (‘on-release’ : function(tag)) very carefully – yes, it took me some loops – it becomes apparent that on-release is called as soon as on-connect returns True.
This function is called when the presence check was run (the ‘on-connect’ function returned a true value) and determined that communication with the tag has become impossible, or when the ‘terminate’ function returned a true value. The tag object may be used for cleanup actions but not for communication.
It seems that on-release must not be understood in a physical way, but rather in a communicative way (Released from communications, you may now remove the card).
To solve this issue, one needs to determine whether a card is present after it connected (or more precisely, after it was released – more about that later). The following code does the trick:
import nfc
import time
import logging
import inspect

logging.basicConfig(format="[%(name)s:%(levelname).4s] %(message)s")
logging.getLogger().setLevel(logging.DEBUG)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

def on_startup(targets):
    logger.debug(inspect.currentframe().f_code.co_name)
    for target in targets:
        target.sensf_req = bytearray.fromhex("0012FC0000")
    return targets

def on_discover(target):
    logger.debug(inspect.currentframe().f_code.co_name)
    logger.info(target)
    return True

def on_connect(tag):
    logger.debug(inspect.currentframe().f_code.co_name)
    logger.info(tag)
    return True

def on_release(tag):
    logger.debug(inspect.currentframe().f_code.co_name)
    # Loop while card is present
    while True:
        time.sleep(1)
        if not clf.sense(*[nfc.clf.RemoteTarget(target) for target in rdwr_options["targets"]]):
            logger.info("Card removed")
            break
    return True

rdwr_options = {
    "targets": ("106A", "106B", "212F"),
    "on-startup": on_startup,
    "on-discover": on_discover,  # Here just for completeness :)
    "on-connect": on_connect,
    "on-release": on_release,
}

if __name__ == "__main__":
    logger.debug(inspect.currentframe().f_code.co_name)
    with nfc.ContactlessFrontend() as clf:
        if not clf.open("usb"):
            raise RuntimeError("Failed to open NFC device.")
        while True:
            ret = clf.connect(rdwr=rdwr_options)
            if not ret:
                break
Now is later: If we wait for removal of the card during the on-connect state, we run into trouble as on-release expects to retrieve information from the card (tag argument), which it cannot get anymore as communication is not possible with the card removed.
PS: The above-mentioned discussion notes that the behavior of on-release depends on the type of card one is using.
So, I have a need to register when a card is present and when it leaves again. I have some Type2 tags for this.
Given this code:
def connected(tag):
    print(tag)
    return True

def released(tag):
    print("Bye")

tag = clf.connect(rdwr={'on-connect': connected, 'on-release': released})
I would expect it to echo the tag ID when I present a card to the reader, and then echo "Bye" once I remove it. This works as expected on a Type4 tag I have.
I'm afraid those cards are Mifare Classic 1K not supported by nfcpy. It is possible to read the UID but any other command requires to first authenticate and use the Mifare Crypto scheme. This should be doable with NXP reader ICs [...]. And NXP has a good selection of NFC Forum compatible Type 2 Tags that work just great.

Capturing click har before new page with Selenium 2 and Browsermob

I have this automation tool I've built with Selenium 2 and Browsermob proxy that works quite well for most of what I need. However, I've run into a snag capturing network traffic.
I basically want to capture the HAR that a click produces before the page redirects. For example, I have an analytics call happening on the click that I want to capture, then another analytics call on the page load that I don't want to capture.
All of my attempts so far capture the HAR too late, so I see both the click analytics call and the page-load one. Is there any way to get this working? I've included my current relevant code sections below.
METHODS INSIDE HELPER CLASS
class _check_for_page_load(object):
    def __init__(self, browser, parent):
        self.browser = browser
        self.maxWait = 5
        self.parent = parent

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def wait_for(self, condition_function):
        start_time = time.time()
        while time.time() < start_time + self.maxWait:
            if condition_function():
                return True
            else:
                time.sleep(0.01)
        raise Exception(
            'Timeout waiting for {}'.format(condition_function.__name__)
        )

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        ###self.parent.log("testing ---- " + str(new_page.id) + " " + str(self.old_page.id))
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        try:
            self.wait_for(self.page_has_loaded)
        except:
            pass
def startNetworkCalls(self):
    if self._p != None:
        self._p.new_har("Step" + str(self._currStep))

def getNetworkCalls(self, waitForTrafficToStop=True):
    if self._p != None:
        if waitForTrafficToStop:
            self._p.wait_for_traffic_to_stop(5000, 30 * 1000)
        return self._p.har
    else:
        return "{}"
def click(self, selector):
    ''' clicks on an element '''
    self.log("Clicking element '" + selector + "'")
    el = self.findEl(selector)
    traffic = ""
    with self._check_for_page_load(self._d, self):
        try:
            self._curr_window = self._d.window_handles[0]
            el.click()
        except:
            actions = ActionChains(self._d)
            actions.move_to_element(el).click().perform()
    traffic = self.getNetworkCalls(False)
    try:
        popup = self._d.switch_to.alert
        if popup != None:
            popup.dismiss()
    except:
        pass
    try:
        window_after = self._d.window_handles[1]
        if window_after != self._curr_window:
            self._d.close()
            self._d.switch_to_window(self._curr_window)
    except:
        pass
    return traffic
INSIDE FILE THAT RUNS MULTIPLE SELENIUM ACTIONS
## inside a for loop, we get an action that looks like "click('#selector')"
util.startNetworkCalls()
if action.startswith("click"):
    temp_traffic = eval(action)
    if temp_traffic == "":
        temp_traffic = util.getNetworkCalls()
    traffic = json.dumps(temp_traffic, sort_keys=True)  ## gives json har info that is saved later
You can see from these couple of snippets that I call the "click" function, which returns network traffic. Inside the click function, you can see it references the class "_check_for_page_load". However, the first time it reaches this line:
###self.parent.log("testing ---- " + str(new_page.id) + " " + str(self.old_page.id))
The log (when enabled) shows that the element ids don't match the first time it logs, indicating the page load has already started. I'm pretty stuck right now, as I've tried everything I can think of to accomplish this.
I found a solution to my own question - though it isn't perfect. I told my network calls to capture headers:
def startNetworkCalls(self):
    if self._p != None:
        self._p.new_har("Step" + str(self._currStep), {"captureHeaders": "true"})
Then, when I retrieve the HAR data, I can look for the "Referer" header and compare it with the page that was initially loaded (before the redirect from the click). From there, I can split the HAR into two separate lists of network calls for further processing later.
This works for my needs, but it isn't perfect. Some things, like image requests, sometimes get the same referrer as the previous page's URL, so the splitting puts those into the first bucket rather than the appropriate second bucket. However, since I'm more interested in requests that aren't on the same domain, this isn't really an issue.
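The splitting itself looks roughly like this (a sketch; split_har_by_referer and original_page_url are just illustrative names, and the entry layout follows the standard HAR format):
def split_har_by_referer(har, original_page_url):
    """Split HAR entries into calls made from the original page vs. afterwards."""
    before_redirect, after_redirect = [], []
    for entry in har["log"]["entries"]:
        headers = {h["name"].lower(): h["value"] for h in entry["request"]["headers"]}
        if headers.get("referer") == original_page_url:
            before_redirect.append(entry)
        else:
            after_redirect.append(entry)
    return before_redirect, after_redirect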

Threading memory profiling

So I hope this isn't a duplicate; I either haven't been able to find an adequate solution or I'm just not 100% sure what I'm looking for. I've written a program to thread lots of requests. I create a thread to:
1. Fetch responses from a number of APIs (such as share.yandex.ru/gpp.xml?url=MY_URL) as well as scrape blogs
2. Parse the responses of all requests (the example above, JSON, and python-goose to extract articles)
3. Return the parsed results back to the primary thread and insert them into a database.
It's all been going well until it needs to pull back larger amounts of data, which I haven't tested before. The main problem is that it takes me over my memory limit on a shared Linux server (512 MB), triggering a kill. This should be enough as it's only a few thousand requests, although I could be wrong. I'm clearing all large data variables/objects within the main thread, but that doesn't seem to help.
I ran memory_profiler on the primary function which creates the threads, with a thread class which looks like this:
class URLThread(Thread):
    def __init__(self, request):
        super(URLThread, self).__init__()
        self.url = request['request']
        self.post_id = request['post_id']
        self.domain_id = request['domain_id']
        self.post_data = request['post_params']
        self.type = request['type']
        self.code = ""
        self.result = ""
        self.final_results = ""
        self.error = ""
        self.encoding = ""

    def run(self):
        try:
            self.request = get_page(self.url, self.type)
            self.code = self.request['code']
            self.result = self.request['result']
            self.final_results = response_handler(dict(result=self.result, type=self.type, orig_url=self.url))
            self.encoding = chardet.detect(self.result)
            self.error = self.request['error']
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            errors.append((exc_type, fname, exc_tb.tb_lineno, e, 'NOW()'))
            pass

@profile
def multi_get(uris, timeout=2.0):
    def alive_count(lst):
        alive = map(lambda x: 1 if x.isAlive() else 0, lst)
        return reduce(lambda a, b: a + b, alive)
    threads = [URLThread(uri) for uri in uris]
    for thread in threads:
        thread.start()
    while alive_count(threads) > 0 and timeout > 0.0:
        timeout = timeout - UPDATE_INTERVAL
        sleep(UPDATE_INTERVAL)
    return [{"request": x.url,
             "code": str(x.code),
             "result": x.result,
             "post_id": str(x.post_id),
             "domain_id": str(x.domain_id),
             "final_results": x.final_results,
             "error": str(x.error),
             "encoding": str(x.encoding),
             "type": x.type}
            for x in threads]
And the results look like this on the first batch of requests I pump through it (FYI it's a link, as the output text isn't readable here; I can't paste an HTML table or embed an image until I get 2 more points):
http://tinypic.com/r/28c147d/8
And it doesn't seem to drop any of the memory in subsequent passes (I'm batching 100 requests/threads through at a time). By this I mean that once a batch of threads is complete they seem to stay in memory, and every time it runs another batch, memory is added, as below:
http://tinypic.com/r/nzkeoz/8
Am I doing something really stupid here?
Python will generally free the memory taken up by an object when there are no references to that object left. Your multi_get function returns a list that contains references to every thread that you have created. So it's unlikely that Python would free that memory. But we would need to see what the code that is calling multi_get is doing in order to be sure.
To start freeing the memory you will need to stop returning references to the threads' data from this function. Or, if you want to continue doing that, at least delete the references somewhere (e.g. del x) once you are done with them.
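For instance, the calling code could look something like this (a sketch; run_batches, batches and process_batch are placeholder names, not from your code):
def run_batches(batches):
    for batch in batches:
        results = multi_get(batch)
        process_batch(results)   # e.g. parse and insert into the database
        # drop the last references to the per-thread data so it can be
        # garbage collected before the next batch is started
        del results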
