I wrote a small tool that gathers data from Facebook using its API. The tool uses multiprocessing, queues and the httplib module. Here is part of the code:
main process:
def extract_and_save(args):
put_queue = JoinableQueue()
get_queue = Queue()
for index in range(args.number_of_processes):
process_name = u"facebook_worker-%s" % index
grabber = FacebookGrabber(get_queue=put_queue, put_queue=get_queue, name=process_name)
friend_list = get_user_friends(args.default_user_id, ["id"])
for index, friend_id in enumerate(friend_list):
if not get_queue.empty():
... save to database ...
logger.info(u"There is no data to save")
worker process:
class FacebookGrabber(Process):
def __init__(self, *args, **kwargs):
self.connection = httplib.HTTPSConnection("graph.facebook.com", timeout=2)
self.get_queue = kwargs.pop("get_queue")
self.put_queue = kwargs.pop("put_queue")
super(FacebookGrabber, self).__init__(*args, **kwargs)
self.daemon = True
def run(self):
while True:
friend_id = self.get_queue.get(block=True)
friend_obj = self.get_friend_obj(friend_id)
except Exception, e:
logger.info(u"Friend id %s: facebook responded with an error (%s)", friend_id, e)
if friend_obj:
common code:
def get_json_from_facebook(connection, url, kwargs=None):
url_parts = list(urlparse.urlparse(url))
query = dict(urlparse.parse_qsl(url_parts[4]))
if kwargs:
url_parts[4] = urllib.urlencode(query)
url = urlparse.urlunparse(url_parts)
connection.request("GET", url)
except Exception, e:
print "<<<", e
response = connection.getresponse()
data = json.load(response)
return data
This code works perfectly on Ubuntu, but when I run it on Windows 7 I get the message "There is no data to save". The problem is here:
connection.request("GET", url)
except Exception, e:
print "<<<", e
I get the following error: <<< a float is required
Does anybody know how to fix this problem?
Python version: 2.7.5
One of the gotchas that occasionally comes up with socket timeout values is that most operating systems expect them as floats. I believe later versions of the Linux kernel account for this.
Try changing:
self.connection = httplib.HTTPSConnection("graph.facebook.com", timeout=2)
to:
self.connection = httplib.HTTPSConnection("graph.facebook.com", timeout=2.0)
That's 2 seconds, by the way. Default is typically 5 seconds. Might be a little low.
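A minimal sketch of the same idea, assuming Python 3 (where httplib is named http.client). make_connection is a hypothetical helper, not part of the original tool. It coerces whatever timeout it is given to a float before the connection object is created, so the caller cannot accidentally hand an int to the socket layer:

```python
import http.client


def make_connection(host, timeout=2):
    """Build an HTTPS connection whose timeout is always a float.

    Creating the object does not open a socket; that only happens on
    the first request().
    """
    # float() makes the type explicit regardless of what the caller passed.
    return http.client.HTTPSConnection(host, timeout=float(timeout))
```

With this in place, FacebookGrabber.__init__ could call make_connection("graph.facebook.com", timeout=2) and still end up with a float timeout.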
There is a REST API that I want to communicate with via the requests library.
First, I have to authenticate with a username and password (BasicAuth).
After a failed attempt I want to wait 3 seconds and repeat the whole process (at most 3 times).
On success, I get a JSON string as the response.
You can see my current implementation below. The method get_order() is just an example. There are about 12 methods with a similar structure (some use POST requests, some use GET requests).
I think my implementation is overly complicated, and I am hoping for suggestions and ideas on how to simplify it.
def session_wrapper(self, func, *args, **kwargs) -> Union[bool, dict]:
has_authorization = False
attempts = 3
timeout_s = 3
for attempt in range(attempts):
resp = func(*args, **kwargs)
if resp.status_code == requests.status_codes.codes.UNAUTHORIZED:
self.session.auth = self.login_details
result = self.session.get(f"{self.base_uri}/auth/token", verify=False)
if not result.ok:
print("Cannot reach the authentication url.")
has_authorization = True
raise UnauthorizedException
return resp.json()
except json.JSONDecodeError as e:
print(f"Could not parse json: {e}")
return False
except requests.exceptions.RequestException as e:
print(f"RequestException: {e}")
except UnauthorizedException:
print("The session is expired and needs to be reestablished.")
except Exception as e:
print(f"Something unexpected went wrong: {e}")
if not has_authorization:
print("The session couldn't be reestablished.")
return False
def get_order(self, id: str) -> Optional[dict]:
payload = {'id': f"{id}"}
return self.session_wrapper(
def get_customor(self, id: str) -> Optional[dict]:
# ...
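One way to take the wait-and-repeat logic out of session_wrapper is a small retry decorator. This is only a sketch under assumptions: retry, its parameters and the injectable sleep argument are hypothetical names, and the re-authentication step is left to the wrapped function. It makes at most 3 attempts and waits 3 seconds between them, as described in the question:

```python
import functools
import time


def retry(attempts=3, delay_s=3.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the decorated function when it raises one of `exceptions`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as error:
                    last_error = error
                    # Only wait if another attempt is still to come.
                    if attempt < attempts:
                        sleep(delay_s)
            # All attempts failed: surface the last error to the caller.
            raise last_error
        return wrapper
    return decorator
```

Each of the roughly 12 methods could then be decorated individually. Raising after the last attempt, instead of returning False, keeps the return type a plain dict.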
I have written some simple multithreaded Python code using Queue. The main thread keeps adding data to the Queue (maxsize is 2000), and 5 other threads take items off the Queue and publish them to Redis on a specific channel.
The code works perfectly fine, but after 5 or 6 hours the publish mechanism becomes slow.
The threads that remove data from the Queue slow down, and the program starts throwing a buffer overflow error as the Queue reaches its maxsize. The rate at which data is added to the Queue stays the same as at the beginning.
The issue shows up differently on differently configured Linux systems. How can I identify what kind of error it is throwing, and how can I debug the problem?
As stated, the code is very simple: the main thread adds data to the Queue one item at a time, and the 5 other threads take data out of the Queue one item at a time.
Here is the code:
import redis
import logging
import sys
logging.basicConfig(level = logging.DEBUG)
redisPub = redis.StrictRedis(host='', port=6379)
def main():
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_UDP)
recv_sock.bind(("", 8000))
except Exception,e:
exc_type, exc_obj, exc_tb = sys.exc_info()
logging.error("Socket Connection Unsuccessful")
print "Program halted."
sendThEvent = threading.Event()
sendPktQ = Queue.Queue(maxsize=2000)
for i in range(0,5):
thSend = threading.Thread(name = 'sendThread', target = sendThread , args = (sendPktQ,sendThEvent))
packetsCounting = 0
while True:
recvData, recvAddr = recv_sock.recvfrom(2048)
packetsCounting += 1
logging.error("There is some error")
except socket.timeout:
except Exception,e:
exc_type, exc_obj, exc_tb = sys.exc_info()
def sendThread(sendPktQ,listenEvent):
if sendPktQ == None:
while True:
instanceSentCnt += 1
if sendPktQ.qsize() < 1:
event_is_set = listenEvent.wait(0)
packetDict = sendPktQ.get()
data = redisPub.publish('chnlName', packetToSend)
logging.info("packet size reached to ============================================ %s ------------ %s"%(len(packetToSend),data))
if __name__ == "__main__":
Any input would be highly appreciated.
I see a sendPktQ.get(), but no sendPktQ.task_done(), for example:
async def _dispatch_packet(self, destination=None) -> None:
"""Send a command unless in listen_only mode."""
if not self.command_queue.empty():
cmd = self.command_queue.get()
if not (destination is None or self.config.get("listen_only")):
await asyncio.sleep(0.05)
See queue library docs
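For reference, a minimal sketch of the get()/task_done() pairing with worker threads, written for Python 3 (the module is named queue there, Queue in Python 2). run_workers and the publish callable are hypothetical stand-ins for sendThread and redisPub.publish. Note that task_done() only matters to code that calls join() on the queue, so a missing call may not by itself explain the slowdown:

```python
import logging
import queue
import threading


def run_workers(items, publish, worker_count=5, maxsize=2000):
    """Feed items through a bounded queue to a fixed set of worker threads."""
    work_queue = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            item = work_queue.get()
            try:
                if item is None:  # sentinel: no more work for this thread
                    return
                publish(item)
            except Exception:
                # Log and keep the worker alive instead of dying silently.
                logging.exception("publish failed for %r", item)
            finally:
                # Exactly one task_done() for every get().
                work_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for item in items:
        work_queue.put(item)  # blocks while the queue is full
    for _ in threads:
        work_queue.put(None)  # one sentinel per worker
    work_queue.join()  # returns once every put() has a matching task_done()
    for thread in threads:
        thread.join()
```

Logging inside the except branch also gives a place to see what the publisher threads are failing on when they start to fall behind.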
I'm trying to get some data from a web page. To speed up this process (they allow me to make 1000 requests per minute), I use ThreadPool.
Since there is a huge amount of data, the process is quite vulnerable to connection failures and the like, so I log everything I can in order to catch every mistake in my code.
The problem is that the program sometimes just stops without any exception (it looks like it is still running, but nothing happens; I use PyCharm). I log caught exceptions everywhere I can, but no exception shows up in any log.
I assume that if there were a timeout reached, the exception would be raised and logged.
I've found out where the problem could be. Here is the code:
As a pool, I use: from multiprocessing.pool import ThreadPool as Pool
And lock: from threading import Lock
The download_category function is being used in loop.
def download_category(url):
# some code
# ...
log('Create pool...')
_pool = Pool(_workers_number)
with open('database/temp_produkty.txt') as f:
log('Spracovavanie produktov... vytvaranie vlakien...') # I see this in log
for url_product in f:
x = _pool.apply_async(process_product, args=(url_product.strip('\n'), url))
log('Presuvanie produktov z temp export do export.csv...') # I can't see this in log
except Exception as e:
logging.exception('Got exception on download_one_category: {}'.format(url))
And process_product function:
def process_product(url, cat):
data = get_product_data(url)
log('{}: {} exception while getting product data... #') # I don't see this in log
print_to_temp_export(data, cat) # I don't see this in log
log('{}: {} exception while printing to csv... #') # I don't see this in log
LOG function:
def log(text):
now = datetime.now().strftime('%d.%m.%Y %H:%M:%S')
mLib.printToFile('logging/log.log', '{} -> {}'.format(now, text))
I use the logging module too. In that log I can see that a request was sent probably 8 times (the number of workers), but no response was ever received.
def get_product_data(url):
data = defaultdict(lambda: '-')
root = load_root(url)
nazov = root.xpath('//h1[@itemprop="name"]/text()')[0]
nazov = root.xpath('//h1/text()')[0]
under_block = root.xpath('//h2[@id="lowest-cost"]')
if len(under_block) < 1:
under_block = root.xpath('//h2[contains(text(),"Naj")]')
if len(under_block) < 1:
return False
data['nazov'] = nazov
data['url'] = url
blocks = under_block[0].xpath('./following-sibling::div[@class="shp"]/div[contains(@class,"shp")]')
i = 0
for block in blocks:
i += 1
data['dat{}_men'.format(i)] = block.xpath('.//a[@class="link"]/text()')[0]
del root
return data
class RedirectException(Exception):
def load_url(url):
r = requests.get(url, allow_redirects=False)
if r.status_code == 301:
raise RedirectException
if r.status_code == 404:
if '-q-' in url:
url = url.replace('-q-','-')
mLib.printToFileWOEncoding('logging/neexistujuce.txt','Skusanie {} kategorie...'.format(url))
return load_url(url) # THIS IS NOT LOOPING
html = r.text
return html
def load_root(url):
html = load_url(url)
except Exception as e:
return etree.fromstring(html, etree.HTMLParser())
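One thing that may be hiding errors here: an exception raised inside a function started with apply_async is only re-raised when get() is called on the returned result object, and download_category never calls it. A minimal sketch of collecting the results with a timeout, where run_all and its parameters are hypothetical names and not part of the original program:

```python
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool


def run_all(func, inputs, workers=8, per_task_timeout=30.0):
    """Run func on every input and report failures instead of losing them."""
    results, failures = [], []
    with ThreadPool(workers) as pool:
        pending = [(item, pool.apply_async(func, (item,))) for item in inputs]
        for item, async_result in pending:
            try:
                # get() re-raises whatever the worker raised.
                results.append(async_result.get(timeout=per_task_timeout))
            except TimeoutError:
                failures.append((item, "timed out"))
            except Exception as error:
                failures.append((item, repr(error)))
    return results, failures
```

A timeout here only stops the caller from waiting; the worker thread itself keeps running until its blocking call returns. requests.get also accepts a timeout argument and waits indefinitely when none is given, which could be what keeps the workers stuck.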
I've read many answers, but I have not found a proper solution.
The problem is that I'm reading mixed/replace HTTP streams that do not expire or end by default.
You can try it yourself using curl:
curl http://agent.mtconnect.org/sample\?interval\=0
So, now I'm using Python threads and requests to read data from multiple streams.
import requests
import uuid
from threading import Thread
tasks = ['http://agent.mtconnect.org/sample?interval=5000',
thread_id = []
def http_handler(thread_id, url, flag):
print 'Starting task %s' % thread_id
requests_stream = requests.get(url, stream=True, timeout=2)
for line in requests_stream.iter_lines():
if line:
print line
if flag and line.endswith('</MTConnectStreams>'):
# Wait until XML message end is reached to receive the full message
except requests.exceptions.RequestException as e:
print('error: ', e)
except BaseException as e:
print e
if __name__ == '__main__':
for task in tasks:
uid = str(uuid.uuid4())
t = Thread(target=http_handler, args=(uid, task, False), name=uid)
print thread_id
# Wait Time X or until user is doing something
# Send flag = to desired thread to indicate the loop should stop after reaching the end.
Any suggestions? What is the best solution? I don't want to kill the thread, because I want to read up to the closing tag so that I have a full XML message.
I found a solution using the threading module and threading.Event. It may not be the best solution, but it works fine for now.
import logging
import threading
import time
import uuid
import requests
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-10s) %(message)s', )
tasks = ['http://agent.mtconnect.org/sample?interval=5000',
d = dict()
def http_handler(e, url):
logging.debug('wait_for_event starting')
message_buffer = []
filter_namespace = True
requests_stream = requests.get(url, stream=True, timeout=2)
for line in requests_stream.iter_lines():
if line:
if e.isSet() and line.endswith('</MTConnectStreams>'):
except requests.exceptions.RequestException as e:
print('error: ', e)
except BaseException as e:
print e
if __name__ == '__main__':
logging.debug('Waiting before calling Event.set()')
for task in tasks:
uid = str(uuid.uuid4())
e = threading.Event()
d[uid] = {"stop_event": e}
t = threading.Thread(name=uid,
target=http_handler,
args=(e, task))
logging.debug('Waiting 3 seconds before calling Event.set()')
for key in d:
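The stop condition itself can be tried without any network access. A minimal sketch, assuming the stream is any iterable of text lines; read_until_stopped is a hypothetical helper and not part of the code above. It keeps reading until the event is set and the current line closes a complete message:

```python
import threading

END_TAG = "</MTConnectStreams>"


def read_until_stopped(lines, stop_event, end_tag=END_TAG):
    """Collect lines until stop_event is set AND a message has just ended."""
    received = []
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        received.append(line)
        # Only stop on a message boundary, so the last message is complete.
        if stop_event.is_set() and line.endswith(end_tag):
            break
    return received
```

Because the event is only checked when a line arrives, a stream that goes quiet will not notice the stop request until its next line or until the read times out.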
I'm currently learning SSH via the brute-force, just-keep-hacking-until-I-understand-it approach. After some trial and error I've been able to send a "pty-req" followed by a "shell" request successfully. I can get the login preamble, send commands and receive stdout, but I'm not sure how to tell the SSH service that I want to receive stderr and status messages. Reading through other SSH implementations (paramiko, Net::SSH) hasn't been much of a guide so far.
That said, looking at one of the RFCs for SSH, I believe that one of the listed requests might be what I am looking for: https://www.rfc-editor.org/rfc/rfc4250#section-4.9.3
#!/usr/bin/env python
from twisted.conch.ssh import transport
from twisted.conch.ssh import userauth
from twisted.conch.ssh import connection
from twisted.conch.ssh import common
from twisted.conch.ssh.common import NS
from twisted.conch.ssh import keys
from twisted.conch.ssh import channel
from twisted.conch.ssh import session
from twisted.internet import defer, protocol, reactor
from twisted.python import log
import struct, sys, getpass, os
USER = 'dward'
HOST = '' # pristine.local
PASSWD = "password"
PRIVATE_KEY = "~/id_rsa"
class SimpleTransport(transport.SSHClientTransport):
def verifyHostKey(self, hostKey, fingerprint):
print 'host key fingerprint: %s' % fingerprint
return defer.succeed(1)
def connectionSecure(self):
class SimpleUserAuth(userauth.SSHUserAuthClient):
def getPassword(self):
return defer.succeed(PASSWD)
def getGenericAnswers(self, name, instruction, questions):
print name
print instruction
answers = []
for prompt, echo in questions:
if echo:
answer = raw_input(prompt)
answer = getpass.getpass(prompt)
return defer.succeed(answers)
def getPublicKey(self):
path = os.path.expanduser(PRIVATE_KEY)
# this works with rsa too
# just change the name here and in getPrivateKey
if not os.path.exists(path) or self.lastPublicKey:
# the file doesn't exist, or we've tried a public key
return keys.Key.fromFile(filename=path+'.pub').blob()
def getPrivateKey(self):
path = os.path.expanduser(PRIVATE_KEY)
return defer.succeed(keys.Key.fromFile(path).keyObject)
class SimpleConnection(connection.SSHConnection):
def serviceStarted(self):
self.openChannel(SmartChannel(2**16, 2**15, self))
class SmartChannel(channel.SSHChannel):
name = "session"
def getResponse(self, timeout = 10):
self.onData = defer.Deferred()
self.timeout = reactor.callLater( timeout, self.onData.errback, Exception("Timeout") )
return self.onData
def openFailed(self, reason):
print "Failed", reason
def channelOpen(self, ignoredData):
self.data = ''
self.oldData = ''
self.onData = None
self.timeout = None
term = os.environ.get('TERM', 'xterm')
#winsz = fcntl.ioctl(fd, tty.TIOCGWINSZ, '12345678')
winSize = (25,80,0,0) #struct.unpack('4H', winsz)
ptyReqData = session.packRequest_pty_req(term, winSize, '')
result = yield self.conn.sendRequest(self, 'pty-req', ptyReqData, wantReply = 1 )
except Exception as e:
print "Failed with ", e
result = yield self.conn.sendRequest(self, "shell", '', wantReply = 1)
except Exception as e:
print "Failed shell with ", e
#fetch preamble
data = yield self.getResponse()
Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-8-server x86_64)
* Documentation: http://www.ubuntu.com/server/doc
System information as of Sat Oct 29 13:09:50 MDT 2011
System load: 0.0 Processes: 111
Usage of /: 48.0% of 6.62GB Users logged in: 1
Memory usage: 39% IP address for eth1:
Swap usage: 3%
Graph this data and manage this system at https://landscape.canonical.com/
New release 'oneiric' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Sat Oct 29 01:23:16 2011 from
print data
while data != "" and data.strip().endswith("~$") == False:
data = yield self.getResponse()
print repr(data)
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
except Exception as e:
print e
#fetch response
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
print data
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
#fetch response
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
print data
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
self.write("echo Hello World\n\x00")
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
print data
echo Hello World
Hello World
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
#Close up shop
dbgp = 1
def request_exit_status(self, data):
status = struct.unpack('>L', data)[0]
print 'status was: %s' % status
def dataReceived(self, data):
self.data += data
if self.onData is not None:
if self.timeout and self.timeout.active():
if self.onData.called == False:
def extReceived(self, dataType, data):
dbgp = 1
print "Extended Data received! dataType = %s , data = %s " % ( dataType, data, )
self.extendData = data
def closed(self):
print 'got data : %s' % self.data.replace("\\r\\n","\r\n")
protocol.ClientCreator(reactor, SimpleTransport).connectTCP(HOST, 22)
Additionally I tried adding in an explicit bad command to the remote shell:
self.write("ls -alF badPathHere\n\x00")
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
print data
ls -alF badPathHere
ls: cannot access badPathHere: No such file or directory
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
And it looks like stderr is being mixed into stdout.
Digging through the source code for OpenSSH, channel session logic is handled in session.c at line 2227, in the function session_input_channel_req. Given a "pty-req" followed by a "shell" request, it leads to do_exec_pty, which ultimately calls session_set_fds(s, ptyfd, fdout, -1, 1, 1). The fourth argument would normally be the file descriptor responsible for handling stderr, but since none is supplied there won't be any extended data for stderr.
Ultimately, even if I modified OpenSSH to provide a stderr FD, the problem lies with the shell. This is complete guesswork at this point, but I believe that, just as when logging into an SSH service through a terminal such as xterm or PuTTY, stderr and stdout are sent together unless explicitly redirected with something like "2> someFile", which is beyond the scope of an SSH service provider.
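The effect of sharing one descriptor can be reproduced locally, without SSH. This is only an analogy, sketched with the standard subprocess module (Python 3) under the assumption that a pty behaves like a single shared output stream; run_child is a hypothetical helper. When the child's stderr points at the same pipe as its stdout, the parent has no way to tell the two apart:

```python
import subprocess
import sys

# Child program: one line on stdout, one line on stderr.
CHILD = (
    "import sys; "
    "sys.stdout.write('out\\n'); sys.stdout.flush(); "
    "sys.stderr.write('err\\n')"
)


def run_child(merge_streams):
    """Run the child and return (stdout_text, stderr_text)."""
    # STDOUT makes stderr share the stdout pipe, much like a pty does.
    stderr_target = subprocess.STDOUT if merge_streams else subprocess.PIPE
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        stderr=stderr_target,
        universal_newlines=True,
    )
    return proc.stdout, proc.stderr
```

With merge_streams=False the two texts come back separately; with merge_streams=True the second value is None and both lines arrive in the first.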