I have a fairly large python package that interacts synchronously with a third party API server and carries out various operations with the server. Additionally, I am now also starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:
while True:
    do_existing_api_stuff()...
    # additional data pickling
    data = {'info': []}  # there are multiple keys in real version!
    if pickle_file_exists:
        data = unpickle_file()
    data['info'].append(new_data)
    pickle_data(data)
    if len(data['info']) >= 100:  # file size limited for read/write speed
        create_new_pickle_file()
    # intensive section...
    # move files from "wip" (Work In Progress) dir to "complete"
    if number_of_pickle_files >= 100:
        compress_pickle_files()  # with lzma
        move_compressed_files_to_another_dir()
My main issue is that the compressing and moving of the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way without any major modifications to my existing code? I do not need any return from the function, however it will raise an error if anything fails. Another "nice to have" would be for the pickle.dump() to also be non-blocking. Again, I am not interested in the return beyond "did it raise an error?". I am aware that unpickle/append/re-pickle every loop is not particularly efficient, however it does avoid data loss when the api drops out due to connection issues, server errors, etc.
I have zero knowledge on threading, multiprocessing, asyncio, etc and after much searching, I am currently more confused than I was 2 days ago!
FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.
EDIT:
There may be multiple calls to the above functions, so I guess some sort of queuing will be required?
The easiest solution is probably the threading package from the standard library. This will allow you to spawn a thread to do the compression while your main loop continues.
There is almost certainly quite a bit of 'dead time' in your existing loop while it waits for the API to respond, and conversely quite a bit of time spent on compression when you could usefully be making another API call. For this reason I'd suggest separating these two aspects. There are lots of good tutorials on threading, so I'll just describe a pattern which you could aim for:
Keep the API call and the pickling in the main loop, but add a step which puts the file path of each pickle on a queue after it is written.
Write a function which takes the queue as its input and works through the file paths, performing the compression.
Before starting the main loop, start a thread with the new function as its target.
I am planning to set up a small proxy service for a remote sensor that only accepts one connection. I have a temporary solution and am now designing a more robust version, so I dived deeper into the python multiprocessing module.
I have written a couple of systems in python using a main process which spawns subprocesses with the multiprocessing module and uses multiprocessing.Queue to communicate between them. This works quite well, and some of these programs/scripts are doing their job in a production environment.
The new case is slightly different since it uses 2+n processes:
One data-collector, that reads data from the sensor (at 100Hz) and every once in a while receives short ASCII strings as command
One main-server, that binds to a socket and listens, for new connections and spawns...
n child-servers, that handle clients who want to have the sensor data
While communication from the child servers to the data collector seems pretty straightforward using a multiprocessing.Queue, which manages an n:1 connection well enough, I have problems with the other direction. I can't use a queue for that as well, because all child-servers need to get all the data the sensor produces while they are active. At least I haven't found a way to configure a Queue to mimic that behaviour, as get removes the topmost item from the Queue by design.
I have already looked into shared memory, which massively increases the management overhead: as far as I understand it, I would basically need to implement a streaming buffer myself.
The only safe way I see right now is using a redis server and message queues, but I am a bit hesitant, since that would need more infrastructure than I like.
Is there a pure python internal way?
Maybe you can use MQTT for that?
You did not specify it clearly, but this sounds like the observer pattern.
Or do you want the clients to poll each time they need data?
It depends on which delay, data rate, jitter, etc. you can accept.
After you provided this information:
The whole setup runs on one machine in one process space. What I would like to have, is a way without going through a third party process
I would suggest looking at the observer pattern.
More information can be found, for example, at:
https://www.youtube.com/watch?v=_BpmfnqjgzQ&t=1882s
and
https://refactoring.guru/design-patterns/observer/python/example
and
https://www.protechtraining.com/blog/post/tutorial-the-observer-pattern-in-python-879
and
https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Observer.html
Your server should fork for each new connection and register as an observer; it will then be informed about every change.
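As a rough illustration of the idea, here is a minimal in-process sketch in which every client handler gets its own queue and the collector publishes each sample to all of them. The SensorFeed class, its method names, and the drop-oldest policy for slow clients are assumptions of mine; it also assumes a single publisher and uses threads with queue.Queue, so a multi-process setup would need one multiprocessing.Queue per child instead.

```python
import queue
import threading


class SensorFeed:
    """Observer-style hub: every subscriber has its own queue and sees every sample."""

    def __init__(self, maxsize=1000):
        self._lock = threading.Lock()
        self._subscribers = set()
        self._maxsize = maxsize

    def subscribe(self):
        """Register a new observer and return the queue it should read from."""
        q = queue.Queue(self._maxsize)
        with self._lock:
            self._subscribers.add(q)
        return q

    def unsubscribe(self, q):
        with self._lock:
            self._subscribers.discard(q)

    def publish(self, sample):
        """Deliver one sample to every currently registered subscriber."""
        with self._lock:
            targets = list(self._subscribers)
        for q in targets:
            try:
                q.put_nowait(sample)
            except queue.Full:
                # Slow client: drop its oldest sample rather than stall the collector.
                try:
                    q.get_nowait()
                except queue.Empty:
                    pass
                q.put_nowait(sample)
```

Each child-server would call subscribe() when a client connects, read from the returned queue, and call unsubscribe() when the client disconnects.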
Newbie-ish python/pandas user here. I've been playing with the chunksize arg in read_fwf and iterating value_counts of variables. I wrote a function that takes args such as the file iterator and the variables to parse and count. I was hoping to parallelize this function and be able to read 2 files at the same time with the same function.
It does appear to work... However, I'm getting unexpected slowdowns. The threads finish at the same time, but one seems to be slowing the other down (IO bottleneck?). I'm getting faster times by running the functions sequentially rather than in parallel (324 secs vs 172 secs). Ideas? Am I executing this wrong? I've tried multiprocessing, but starmap errors because it can't pickle the file iterator (the output of read_fwf).
testdf1=pd.read_fwf(filepath_or_buffer='200k.dat',header=None,colspecs=wlist,names=nlist,dtype=object,na_values=[''],chunksize=1000)
testdf2=pd.read_fwf(filepath_or_buffer='200k2.dat',header=None,colspecs=wlist,names=nlist,dtype=object,na_values=[''],chunksize=1000)
def tfuncth(df,varn,q,*args):
    td={}
    for key in varn.keys():
        td[key]=pd.Series()
    for rdf in df:
        if args is not None:
            for arg in args:
                rdf=eval(f"rdf.query(\"{arg}\")")
        for key in varn.keys():
            ecode=f'rdf.{varn[key]}.value_counts()'
            td[key]=pd.concat([td[key],eval(ecode)])
            td[key]=td[key].groupby(td[key].index).sum()
    for key in varn.keys():
        td[key]=pd.DataFrame(td[key].reset_index()).rename(columns={'index':'Value',0:'Counts'}).assign(Var=key,PCT=lambda x:round(x.Counts/x.Counts.sum()*100,2))[['Var','Value','Counts','PCT']]
    q.put(td)
bands={
'1':'A',
'2':'B',
'3':'C',
'4':'D',
'5':'E',
'6':'F',
'7':'G',
'8':'H',
'9':'I'
}
vdict={
'var1':'e1270.str.slice(0,2)',
'var2':'e1270.str.slice(2,3)',
'band':'e7641.str.slice(0,1).replace(bands)'
}
my_q1=queue.Queue()
my_q2=queue.Queue()
thread1=threading.Thread(target=tfuncth,args=(testdf1,vdict,my_q1,flter1))
thread2=threading.Thread(target=tfuncth,args=(testdf2,vdict,my_q2))
thread1.start()
thread2.start()
UPDATE:
After much reading, this is the conclusion I've come to. It is an extremely simplified conclusion, I'm sure, so if someone knows otherwise please inform me.
Pandas is not a fully multi-thread friendly package.
Apparently there's a package called 'dask' that is, and it replicates a lot of pandas functions. So I'll be looking into that.
Python is not truly a multi-threading compatible language in many cases.
CPython, the standard interpreter, has a global interpreter lock (GIL), so only one thread executes Python bytecode at a time.
Multiple threads can be spun off, but they only run in parallel while waiting on I/O or inside code that releases the GIL, not for CPU-bound Python code.
My code mixes IO and CPU work. The simple IO is probably running in parallel but getting held up waiting on the processor for execution.
I plan to test this out by writing IO-only operations and attempting threading.
Other Python implementations, such as Jython and IronPython, don't have a global interpreter lock (GIL) on threads.
Packages such as 'dask' get parallelism by running work in multiple processes, or in threads where the underlying library releases the GIL.
I did manage to get this to work and fix my problems by using the multiprocessing package. I ran into two issues:
1) the multiprocessing package did not work for me from inside Jupyter Notebook
and
2) you can't pickle a handle to a pandas reader (multiprocessing pickles the objects passed to the processes).
I fixed 1 by coding outside the Notebook environment, and I fixed 2 by passing each process the arguments needed to open a chunked reader, so that each process starts its own chunked read.
After doing those two things I was able to get a 60% increase in speed over sequential runs.
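To show the shape of the fix for issue 2, here is a minimal sketch in which each worker receives only picklable arguments (a path and column positions) and opens the file itself. To keep it dependency-free it slices fixed-width lines by hand and counts with collections.Counter instead of calling pd.read_fwf(..., chunksize=...) inside the worker as I actually did; count_field, count_files and the job tuples are names made up for this example.

```python
from collections import Counter
from multiprocessing import Pool


def count_field(job):
    """Worker: open the file inside the process and count one fixed-width field."""
    path, start, stop = job  # plain values only, so the job can be pickled
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            counts[line[start:stop]] += 1
    return counts


def count_files(jobs, processes=2):
    """Run one job per file in separate processes; results come back in job order."""
    with Pool(processes=processes) as pool:
        return pool.map(count_field, jobs)
```

From a script, something like count_files([("200k.dat", 0, 2), ("200k2.dat", 0, 2)]) would be called under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that start workers by re-importing the main module. The column positions here are purely illustrative.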
I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App Engine in python that lets a user upload a file, which is then processed by a piece of python code, and the resulting processed file is sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that, since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks and then pieces everything together at the end.
My present code for making the single deferred task looks like this:
_=deferred.defer(transform_function,filename,from,to,email)
so that the transform_function code gets the values of filename, from, to and email and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the documentation on Google App Engine that I can think of, but it is unfortunately not written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a url for my task, but rather a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer, either for the next task in the chain or to continue where it left off, depending on whether it's a discrete set of steps or a continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However the problem you will face (it doesn't matter if it's a normal task queue or deferred) is that each stage could fail for some reason, and then be re-run so each phase must be idempotent.
For long running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job and then you can just keep rerunning the same task until completion. On completion it marks the job as complete.
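The pattern can be sketched without any App Engine code. In this minimal sketch a plain dict stands in for the datastore entity, an enqueue callable stands in for deferred.defer, and item.upper() stands in for the real processing; all three, along with the name run_job and the 540-second budget, are assumptions for illustration only.

```python
import time


def run_job(job, enqueue, budget_seconds=540.0, clock=time.monotonic):
    """Process job['items'] from job['cursor'], re-enqueueing before the deadline.

    `job` is a dict standing in for a datastore entity and `enqueue` stands in
    for deferred.defer; both are hypothetical placeholders.
    """
    if job["done"]:
        return  # a retried task after completion is a harmless no-op
    started = clock()
    while job["cursor"] < len(job["items"]):
        if clock() - started > budget_seconds:
            # Out of time: hand the rest of the work to the next task in the chain.
            enqueue(run_job, job, enqueue, budget_seconds, clock)
            return
        index = job["cursor"]
        # Keyed write: repeating this step after a retry gives the same result.
        job["results"][index] = job["items"][index].upper()
        job["cursor"] = index + 1  # record progress only after the work is stored
    job["done"] = True
```

Because each step writes its result under a fixed key before the cursor advances, rerunning a failed task repeats at most one step and produces the same state, which is what makes the phases idempotent.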
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, is there any reason you need to process the chunks sequentially? If all you need is some notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each deferred task for a chunk can decrease a shared datastore counter that was initialized with the number of chunks (read the state, decrease and update, all in the same transaction). If the datastore update was successful and the counter has reached zero, you can proceed with combining all the pieces together. An alternative to using deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
I have a long-running twisted server.
In a large system test, at one particular point several minutes into the test, when some clients enter a particular state and a particular outside event happens, then this server takes several minutes of 100% CPU and does its work very slowly. I'd like to know what it is doing.
How do you get a profile for a particular span of time in a long-running server?
I could easily send the server start and stop messages via HTTP if there was a way to enable or inject the profiler at runtime?
Given the choice, I'd like stack-based/call-graph profiling but even leaf sampling might give insight.
yappi profiler can be started and stopped at runtime.
There are two interesting tools that came up that try to solve that specific problem, where you might not necessarily have instrumented profiling in your code in advance but want to profile production code in a pinch.
pyflame will attach to an existing process using the ptrace(2) syscall and create "flame graphs" of the process. It's written in C++.
py-spy works by reading the process memory instead and figuring out the Python call stack. It also provides a flame graph but also a "top-like" interface to show which function is taking the most time. It's written in Rust and Python.
Not a very Pythonic answer, but maybe running strace on the process gives some insight (assuming you are on Linux or similar).
Using strictly Python, for such things I trace all calls, store the results in a ring buffer, and use a signal (maybe you could do that via your HTTP message) to dump that ring buffer. Of course, tracing slows down everything, but in your scenario you could switch on the tracing by an HTTP message as well, so it will only be enabled while your trouble is active.
Pyliveupdate is a tool designed for this purpose: profiling long-running programs without restarting them. It allows you to dynamically select specific functions to profile, or to stop profiling, without instrumenting your code ahead of time. It instruments the code dynamically to do the profiling.
Pyliveupdate has three key features:
Profile the call time of specific Python functions (selected by function name or module name).
Add / remove profiling without restarting programs.
Show profiling results with a call summary and flamegraphs.
Check out a demo here: https://asciinema.org/a/304465.
With regard to the Python Twisted framework, can someone explain to me how to write asynchronously a very large data string to a consumer, say the protocol.transport object?
I think what I am missing is a write(data_chunk) function that returns a Deferred. This is what I would like to do:
data_block = get_lots_and_lots_data()
CHUNK_SIZE = 1024 # write 1-K at a time.
def write_chunk(data, i):
    d = transport.deferredWrite(data[i:i+CHUNK_SIZE])
    d.addCallback(write_chunk, data, i+1)
write_chunk(data, 0)
But, after a day of wandering around in the Twisted API/Documentation, I can't seem to locate anything like the deferredWrite equivalence. What am I missing?
As Jean-Paul says, you should use IProducer and IConsumer, but you should also note that the lack of deferredWrite is a somewhat intentional omission.
For one thing, creating a Deferred for potentially every byte of data that gets written is a performance problem: we tried it in the web2 project and found that it was the most significant performance issue with the whole system, and we are trying to avoid that mistake as we backport web2 code to twisted.web.
More importantly, however, having a Deferred which gets returned when the write "completes" would provide a misleading impression: that the other end of the wire has received the data that you've sent. There's no reasonable way to discern this. Proxies, smart routers, application bugs and all manner of network contrivances can conspire to fool you into thinking that your data has actually arrived on the other end of the connection, even if it never gets processed. If you need to know that the other end has processed your data, make sure that your application protocol has an acknowledgement message that is only transmitted after the data has been received and processed.
The main reason to use producers and consumers in this kind of code is to avoid allocating memory in the first place. If your code really does read all of the data that it's going to write to its peer into a giant string in memory first (data_block = get_lots_and_lots_data() pretty directly implies that) then you won't lose much by doing transport.write(data_block). The transport will wake up and send a chunk of data as often as it can. Plus, you can simply do transport.write(hugeString) and then transport.loseConnection(), and the transport won't actually disconnect until either all of the data has been sent or the connection is otherwise interrupted. (Again: if you don't wait for an acknowledgement, you won't know if the data got there. But if you just want to dump some bytes into the socket and forget about it, this works okay.)
If get_lots_and_lots_data() is actually reading a file, you can use the included FileSender class. If it's something which is sort of like a file but not exactly, the implementation of FileSender might be a useful example.
Large amounts of data are generally handled in Twisted using the Producer/Consumer APIs. This doesn't give you a write method that returns a Deferred, but it does give you notification about when it's time to write more data.
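For illustration, here is a minimal sketch of a pull producer that writes a large in-memory string one chunk at a time. The class name StringChunkProducer and its start() helper are made up for this example. The sketch is duck-typed so that it runs without Twisted installed; real code would normally also declare the interface with @implementer(IPullProducer) from zope.interface and twisted.internet.interfaces.

```python
class StringChunkProducer:
    """Pull producer: hands the consumer one chunk each time it asks for more."""

    def __init__(self, consumer, data, chunk_size=1024):
        self.consumer = consumer  # e.g. a protocol's transport
        self.data = data
        self.chunk_size = chunk_size
        self.offset = 0

    def start(self):
        # streaming=False registers this object as a pull producer.
        self.consumer.registerProducer(self, False)

    def resumeProducing(self):
        # Called by the consumer whenever it is ready for more data.
        chunk = self.data[self.offset:self.offset + self.chunk_size]
        self.offset += len(chunk)
        if chunk:
            self.consumer.write(chunk)
        if self.offset >= len(self.data):
            # Everything has been handed over; detach from the consumer.
            self.consumer.unregisterProducer()

    def stopProducing(self):
        # Called if the consumer goes away; make sure nothing more is sent.
        self.offset = len(self.data)
```

Inside a protocol this would be used as StringChunkProducer(self.transport, data_block).start(). The transport then calls resumeProducing() each time its buffer has drained, which is the "time to write more data" notification described above.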