Detecting if an event happens multiple times in a short timespan - python

This might be a bit hard to explain, but I hope I can explain it in a sufficient and understandable way.
I want to create a system to detect if a large amount of users suddenly joins our server, but i'm not sure how this would be setup.
Should I store every single new user in Redis and a timestamp, and then with a background tasks every n seconds/minutes check if the amount of users in a timespan is larger than an estimate of new users that I know we get ?
So like
if new_user_count > 10:
sendwarning()
But how would I incorporate and check within a certain timespan? Maybe the amount of new users since last scan?
Is this a sufficient way or is there some other smart method that I don't know of?

when a user connects a server (depending upon what kind of server you have) say if you have a linux server every user connection to your server has a entry in (/var/log/<file.log>) location you can write a simple python program to monitor this file and put regex to count the user "successful login". You can also use opensource IDS(Intrusion Detection System) like Snort in your server to do the same for you.
https://www.snort.org/

Related

What is the best way to handle multiple user requests when lot of back end calculation is involved?

Hi I am quite new to web application development. I have been designing an application where a user uploads a file, some calculation is done and an output table will be shown. This process takes approximately 5-6 seconds.
I am saving my data in sessions like this:
request.session ['data']=resultDATA.
And loading the data whenever I need from sessions like this:
resultDATA = request.session['data']
I dont need DATA once the user is signed out. So is approach correct to save user data (not involving passwords)?
My biggest problem is if n number of users upload their files at exact moment do the last user have to wait for n*6 seconds for his calculation to complete? If yes is there any solution for this?
Right now I am using django built-in web server.
Do I have to use a different server to solve this problem?
There are quiet some questions in this question, however i think they are related enough and concise enough to deserve an answer:
So is approach correct to save user data (not involving passwords)?
I don't see any problem with this approach, since it's volatile data and it's not sensitive.
My biggest problem is if n number of users upload their files at exact moment do the last user have to wait for n*6 seconds for his calculation to complete?
This shouldn't be an issue as you put it. obviously if your server is handling huge ammounts of traffic it will slow down, and it will take a bit longer than your usual 5-6 seconds. However it won't be n*6, the server should be able to handle multiple requests at once.
Do I have to use a different server to solve this problem?
No, but kind of yes... what i mean is that in development the built-in server is great. It does everything you need it to do, however when you decide to push the app into production, you'll need a proper server for it.
As a side note, try to see if you can improve the data collection time, because right now everything is running on your own PC, which means it will probably be faster than when you push it to production. When you "upload" a file to localhost it takes a lot less time than when you upload it to an actual server over the internet, so that's a thing to keep in mind.

Uploading code to server and run automatically

I'm fairly competent with Python but I've never 'uploaded code' to a server before and have it run automatically.
I'm working on a project that would require some code to be running 24/7. At certain points of the day, if a criteria is met, a process is started. For example: a database may contain records of what time each user wants to receive a daily newsletter (for some subjective reason) - the code would at the right time of day send the newsletter to the correct person. But of course, all of this is running out on a Cloud server.
Any help would be appreciated - even correcting my entire formulation of the problem! If you know how to do this in any other language - please reply with your solutions!
Thanks!
Here are two approaches to this problem, both of which require shell access to the cloud server.
Write the program to handle the scheduling itself. For example, sleep and wake up every few miliseconds to perform the necessary checks. You would then transfer this file to the server using a tool like scp, login, and start it in the background using something like python myscript.py &.
Write the program to do a single run only, and use the scheduling tool cron to start it up every minute of the day.
Took a few days but I finally got a way to work this out. The most practical way to get this working is to use a VPS that runs the script. The confusing part of my code was that each user would activate the script at a different time for themselves. To do this, say at midnight, the VPS runs the python script (using scheduled tasking or something similar) and runs the script. the script would then pull times from a database and process the code at those times outlined.
Thanks for your time anyways!

Scanning MySQL table for updates Python

I am creating a GUI that is dependent on information from MySQL table, what i want to be able to do is to display a message every time the table is updated with new data. I am not sure how to do this or even if it is possible. I have codes that retrieve the newest MySQL update but I don't know how to have a message every time new data comes into a table. Thanks!
Quite simple and straightforward solution will be just to poll the latest autoincrement id from your table, and compare it with what you've seen at the previous poll. If it is greater -- you have new data. This is called 'active polling', it's simple to implement and will suffice if you do this not too often. So you have to store the last id value somewhere in your GUI. And note that this stored value will reset when you restart your GUI application -- be sure to think what to do at the start of the GUI. Probably you will need to track only insertions that occur while GUI is running -- then, at the GUI startup you need just to poll and store current id value, and then poll peroidically and react on its changes.
#spacediver gives some good advice about the active polling approach. I wanted to post some other options as well.
You could use some type of message passing to communcate notifications between clients. ZeroMQ, twisted, etc offer these features. One way to do it is have the updating client issue the message along with their successful database insert. Clients can all listen to a channel for notifications instead of always polling the db.
If you cant control adding an update message to the client doing the insertions, you could also look at this link for using a database trigger to call a script which would simply issue an update message to your messaging framework. It explains installing a UDF extension to allow you to run a sys_exec command in a trigger and call a simple script.
This way clients simply respond to a notification instead of all checking regularily.

Timed Quiz: How to consider internet interruptions?

I am preparing a Test or Quiz in Django. The quiz needs to be completed in certain time frame. Say 30 minutes for 40 questions.I can always initiate a clock at start of the test, and then calculate time by the time the Quiz is completed. However it's likely that during the attempt, there may be issues such as internet connection drops, or system crashes/power outages etc.
I need a strategy to figure out when such an accident happened, and stop the clock, then let the user take the test again from where it stopped, and start the clock again.
What is the right strategy? Any help including sample code/examples/ideas are most welcome
Your strategy should depend on importance of the test and ability to retake whole test.
Is test/quiz for fun or competence/knowledge checking?
Are you dealing with logged users?
Are tests generated randomly from large poll of available questions?
these are the questions you need to answer yourself first.
Remember that:
malicious user CAN simulate connection outage / power failure,
only clock you can trust is one on server side,
everything on browser side can be manipulated (think firebug/console js injection)
My approach would be:
Inform users that TIME is important factor and connection issues may not be taken into account when grade will be given...,
Serve only one question, wait for answer, serve another one,
Whole test time should be calculated as SUM of each answer time:
save each "question send" / "answer received" timestamps and calculate answer time from it,
time between questions wouldn't count,
you'd get extra scope on which questions was harder / took longer to answer.
Add some kind of heartbeat to your question page (like ajax request every X seconds), when heartbeat stops you can (depending on options you have):
invalidate question and notify user via dialog that he has connection issues and have to refresh to get new question instead if you have larger poll of questions to use,
pause time on server side (and for example dim question page so user cannot answer until his connection is restored) IMO only for games/fun quiz/tests
save information on server side on each interruption which would later ease decision to allow retake whole test e.g. he was fine until 20th question and then on 3-4 easy questions in a row he was dropping...
The simplest way would be to add a timestamp when the person starts the quiz and then compare that to when they submit. Of course, this doesn't take into account connection drops, crashes, etc... like you mentioned.
To account for these issues I'd probably use something like node.js. Each client has "check-in" when they connect to the quiz. Then at regular intervals (every 1s, 10s, 1m, etc...) the client checks in. If at these intervals the client doesn't check-in you can assume they've had the connection drop. You could keep track of when they connect again and start the timer from where they left off.
This is my initial thought on how to keep track of connection drops and crashes. The same could be done with a front-end ajax call to a Django view.
Either you do the clock on the client side, in which case they can always cheat somehow, or you do it on the server side, and then you aren't taking into account these interruptions.
To reduce cheating somewhat and still allow for interruptions, you could do a 'keep alive'.
Here the client side code announces to the server that it is still there every so often, say every 5 seconds. The server side notes when it stops getting these messages, and pauses/stops the clock. However it still has the start and end time, so you know how long it really took in wall time, and also how long it took while the client was supposedly there.
With these two pieces of information you could very easily track down odd behaviour and blacklist people. Blacklisted people might not be aware that they are blacklisted, but their quiz scores don't show up for other users of your quiz system.
The problem with pausing the clock when the connection to the user drops, is that the user could just disconnect their computer from the internet each time they received a new question, and then reconnect once they had worked out the answer.
One thing you could do, is give the user a certain amount of time for each question.
The clock is started when the user successfully receives the question to their browser, and if the user submits an answer before the time limit, it is accepted, otherwise it is void.
That would mean if a user lost connection it would only affect the question they are currently on. But it would also mean that the user would have no flexibility in how much time they want to allot to each question, you decide for them.
I was thinking you could do something like removing the question from the screen unless the connection to the server was still alive, but the user could always just screen-shot the question before disconnecting.

Should I learn/use MapReduce, or some other type of parallelization for this task?

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:
Is this a flow that would be well suited to parallelizing with MapReduce?
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hour's up when the job is complete? (I'm not sure exactly what counts as a "Job", so I don't know exactly how I'll be billed)
If no, Is there another system/pattern I should use? and Is there a library that will help me do this in python (On AWS, usign EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Ok, now onto the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
There are two calls I can make:
Get Followed Users -- Which returns all the users being followed by the requested user, and
Get Favorite Items -- Which returns all the favorite items of the requested user.
After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:
Jobs in this flow include:
Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
So, again, my questions are:
Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hour's up when the job is complete? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
If no, Is there another system/pattern I should use? and Is there a library that will help me do this in python (On AWS, usign EC2 + EBS?)?
Are there any problems you see with the way I've designed this job flow?
Thanks for reading, I'm looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load when I'm really just looking to parallelize HTTP requests.
What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
So, what I'm leaning towards is:
A non-MapReduce Job/Worker model, written in Python. A google friend of mine turned me onto learning Python for this, since it's low overhead and scales well.
Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
Possibly using Amazon's Simple Message queue thingy. It sounds like this 3rd amazon widget is designed to keep track of job queues, move results from one task into the inputs of another and gracefully handle failed tasks. It's very cheap. May be worth implementing instead of a custom job-queue system.
The work you describe is probably a good fit for either a queue, or a combination of a queue and job server. It certainly could work as a set of MapReduce steps as well.
For a job server, I recommend looking at Gearman. The documentation isn't awesome, but the presentations do a great job documenting it, and the Python module is fairly self-explanatory too.
Basically, you create functions in the job server, and these functions get called by clients via an API. The functions can be called either synchronously or asynchronously. In your example, you probably want to asynchronously add the "Start update" job. That will do whatever preparatory tasks, and then asynchronously call the "Get followed users" job. That job will fetch the users, and then call the "Update followed users" job. That will submit all the "Get Favourites for UserA" and friend jobs together in one go, and synchronously wait for the result of all of them. When they have all returned, it will call the "Calculate new queue" job.
This job-server-only approach will initially be a bit less robust, since ensuring that you handle errors and any down servers and persistence properly is going to be fun.
For a queue, SQS is an obvious choice. It is rock-solid, and very quick to access from EC2, and cheap. And way easier to set up and maintain than other queues when you're just getting started.
Basically, you will put a message onto the queue, much like you would submit a job to the job server above, except you probably won't do anything synchronously. Instead of making the "Get Favourites For UserA" and so forth calls synchronously, you will make them asynchronously, and then have a message that says to check whether all of them are finished. You'll need some sort of persistence (a SQL database you're familiar with, or Amazon's SimpleDB if you want to go fully AWS) to track whether the work is done - you can't check on the progress of a job in SQS (although you can in other queues). The message that checks whether they are all finished will do the check - if they're not all finished, don't do anything, and then the message will be retried in a few minutes (based on the visibility_timeout). Otherwise, you can put the next message on the queue.
This queue-only approach should be robust, assuming you don't consume queue messages by mistake without doing the work. Making a mistake like that is hard to do with SQS - you really have to try. Don't use auto-consuming queues or protocols - if you error out, you might not be able to ensure that you put a replacement message back on the queue.
A combination of queue and job server may be useful in this case. You can get away with not having a persistence store to check job progress - the job server will allow you to track job progress. Your "get favourites for users" message could place all the "get favourites for UserA/B/C" jobs into the job server. Then, put a "check all favourites fetching done" message on the queue with a list of tasks that need to be complete (and enough information to restart any jobs that mysteriously disappear).
For bonus points:
Doing this as a MapReduce should be fairly easy.
Your first job's input will be a list of all your users. The map will take each user, get the followed users, and output lines for each user and their followed user:
"UserX" "UserA"
"UserX" "UserB"
"UserX" "UserC"
An identity reduce step will leave this unchanged. This will form the second job's input. The map for the second job will get the favourites for each line (you may want to use memcached to prevent fetching favourites for UserX/UserA combo and UserY/UserA via the API), and output a line for each favourite:
"UserX" "UserA" "Favourite1"
"UserX" "UserA" "Favourite2"
"UserX" "UserA" "Favourite3"
"UserX" "UserB" "Favourite4"
The reduce step for this job will convert this to:
"UserX" [("UserA", "Favourite1"), ("UserA", "Favourite2"), ("UserA", "Favourite3"), ("UserB", "Favourite4")]
At this point, you might have another MapReduce job to update your database for each user with these values, or you might be able to use some of the Hadoop-related tools like Pig, Hive, and HBase to manage your database for you.
I'd recommend using Cloudera's Distribution for Hadoop's ec2 management commands to create and tear down your Hadoop cluster on EC2 (their AMIs have Python set up on them), and use something like Dumbo (on PyPI) to create your MapReduce jobs, since it allows you to test your MapReduce jobs on your local/dev machine without access to Hadoop.
Good luck!
Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stubb of the code, and now it's just a matter of filling out the code to hook into the right APIs.
Thanks for the answers, they were a lot of help finding the solution I was looking for.
I am working with a similar problem that i need to solve. I was also looking at MapReduce and using the Elastic MapReduce service from Amazon.
I'm pretty convinced MapReduce will work for this problem. The implementation is where I'm getting hung up, becauase I'm not sure my reducer even needs to do anything.
I'll answer your questions as I understand your (and my) problem, and hopefully it helps.
Yes I think it'll be suited well. You could look at leveraging the Elastic MapReduce service's multiple steps option. You could use 1 Step to fetch a the people a user is following, and another step to compile a list of tracks for each of those followers, and the reducer for that 2nd step would probably be the one to build the cache.
Depends on how big your data-set is and how often you'll be running it. It's hard to say without knowing how big the data-set is (or is going to get) if it'll be cost effective or not. Initially, it'll probably be quite cost-effective, as you won't have to manage your own hadoop cluster, nor have to pay for EC2 instances (assuming that's what you use) to be up all the time. Once you reach the point where you're actually crunching this data for a long period of time, it probably will make less and less sense to use Amazon's MapReduce service, because you'll constantly have nodes online all the time.
A job is basically your MapReduce task. It can consist of multiple steps (each MapReduce task is a step). Once your data has been processed and all steps have been completed, your Job is done. So you're effectively paying for CPU time for each node in the Hadoop cluster. so, T*n where T is the Time (in hours) it takes to process your data, and n is the number of nodes you tell Amazon to spin up.
I hope this helps, good luck. I'd like to hear how you end up implementing your Mappers and Reducers, as I'm solving a very similar problem and I'm not sure my approach is really the best.

Categories

Resources