Realtime data processing with Python

I am working on a project that is going to consume data from the Twitter Stream API and count certain hashtags, but I am having difficulty understanding what kind of architecture I need in my case. Should I use Tornado, or are there more suitable frameworks for this?

It really depends on what you want to do with the Tweets. Simply reading a stream of Tweets has not been an issue in my experience; in fact, that can be done on an AWS Micro instance. I even run more advanced regression algorithms on the real-time feed. The scalability problem arises if you try to process a set of historical Tweets. Since Tweets are produced so fast, processing historical Tweets can be very slow. That's when you should try to parallelize.
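For the hashtag-counting part itself, here is a minimal sketch using Tweepy; it assumes the older Tweepy 3.x StreamListener API, and the credentials and tracked terms are placeholders:

```python
import tweepy
from collections import Counter

class HashtagCounter(tweepy.StreamListener):
    """Counts hashtags seen on the live stream."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def on_status(self, status):
        # Each status carries its hashtags in the entities dict.
        for tag in status.entities.get("hashtags", []):
            self.counts[tag["text"].lower()] += 1

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. when rate limited (420).
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

listener = HashtagCounter()
stream = tweepy.Stream(auth=auth, listener=listener)
stream.filter(track=["#python", "#bigdata"])  # filter() blocks until the stream disconnects
print(listener.counts.most_common(10))
```

Nothing here requires Tornado: a single long-lived connection and an in-memory Counter are enough until you move to historical volumes, which is where the parallelization mentioned above comes in.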

Related

Automation of performance monitoring of mulesoft application

I would like to automate the process of viewing logs in the dashboard and typing out the information (total messages sent in a time period, total errors, CPU usage, memory usage); this task is very time-consuming at the moment.
The info is gathered from the MuleSoft Anypoint Platform. I'm currently thinking of extracting all of the data using Python web scraping, but I don't know how to do that well.
You'll find here a screenshot of the website I'm trying to get the data off of; you can choose to see the logs for a specific time and date. My question is: do I start learning Python web scraping, or is there another way of doing this that I am just unaware of?
Logs website example
It doesn't make any sense to use web scraping. All services in Anypoint Platform have a REST API. Most of them are documented at https://anypoint.mulesoft.com/exchange/portals/anypoint-platform/. Scraping may break with any minor change to the UI; the REST APIs are stable.
The screenshot seems to be from Anypoint Monitoring. I see an Anypoint Monitoring Archive API in the catalog. I'm not sure whether the API for getting Monitoring Dashboards data is documented. You could alternatively use the older CloudHub Dashboards API; it is probably not exactly the same, but it should be a close approximation.
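As a rough sketch of the REST approach using Python's requests library: the login call below follows the commonly documented username/password token flow, but the monitoring path is only a placeholder; take the real endpoint from the API's page in Exchange.

```python
import requests

BASE = "https://anypoint.mulesoft.com"

# 1. Authenticate and obtain a bearer token (assumes basic username/password login).
login = requests.post(BASE + "/accounts/login",
                      json={"username": "YOUR_USER", "password": "YOUR_PASSWORD"})
login.raise_for_status()
token = login.json()["access_token"]

headers = {"Authorization": "Bearer " + token}

# 2. Call whichever monitoring API you find documented in Exchange.
#    The path below is a placeholder, not a real endpoint.
resp = requests.get(BASE + "/monitoring/api/<your-endpoint-here>", headers=headers)
resp.raise_for_status()
print(resp.json())
```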

Multiple users' data with single Twitter API request

I am trying to build a script that will take a Twitter handle and calculate its engagement rate based on the last 10 tweets or so. If I understand Twitter's API correctly, I would have to make a request for each calculation. If I understand Twitter's pricing correctly, I would be paying between $0.75 and $1 per request, depending on my package. That seems very expensive for building such a simple tool. Am I missing something? Is there a cheaper way of doing it?

Load chart data into a webapp from google datastore

I've got a Google App Engine application that loads time-series data in real time into a Google Datastore NoSQL-style table. I was hoping to get some feedback on the right type of architecture to pull this data into a web-application-style chart (and ideally something I could also plug into a content management system like WordPress).
Most of my server-side code is Python. What's a reasonable client-server setup to pull the data from the Datastore database and display it on my webpage? Ideally I'd have something that scales and doesn't cause an unnecessary number of reads on my database (potentially using App Engine's built-in caching, etc.).
I'm guessing this is a common use case, but I'd like to get an idea of what some best practices around this might be. I've seen some examples using client-side JavaScript/AJAX with server-side PHP to read the DB. Is this really the best way?
Welcome to "it depends".
You have some choices. Imagine the classic four-quadrant chart. Along one axis is data size, along the other is staleness/freshness.
If your time-series data changes rapidly but is small enough to safely be retrieved within a request, you can query for it on demand, convert it to JSON, and squirt it to the browser to be rendered by the JavaScript charting package of your choice. If the data is large, your app will need to do some sort of server-side pre-processing so that when the data is needed, it can be retrieved in few enough requests that the request won't time out. This might involve something data-dependent like pre-bucketing the time series.
If the data changes slowly, you have the option of generating your chart on the server side, perhaps using matplotlib. When new data is ingested, or perhaps at intervals, spawn off a task to generate and cache the chart (or the JSON to hand to the front end) as a blob in the datastore. If the data is sufficiently large that a task will time out, you might need to use a backend process. If the data is sufficiently large and you don't pre-process, you're in the quadrant of unhappiness.
In my experience, GAE memcache is best for caching data between requests where the time between requests is very short. Don't rely on generating artifacts, stuffing them in memcache, and hoping that they'll be there a few minutes later. I've rarely seen that work.
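For the "small and fresh" quadrant, a minimal sketch of the server side on the first-generation Python runtime (webapp2 + ndb), with a hypothetical Reading entity and a short memcache lifetime, might look like this:

```python
import json
import webapp2
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Reading(ndb.Model):
    """Hypothetical time-series entity."""
    timestamp = ndb.DateTimeProperty()
    value = ndb.FloatProperty()

class ChartData(webapp2.RequestHandler):
    def get(self):
        payload = memcache.get('chart-data')
        if payload is None:
            # Newest first; reverse if your chart wants ascending order.
            points = Reading.query().order(-Reading.timestamp).fetch(500)
            payload = json.dumps([
                {'t': p.timestamp.isoformat(), 'v': p.value} for p in points
            ])
            # Cache briefly so bursts of page views don't each hit the datastore.
            memcache.set('chart-data', payload, time=60)
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(payload)

app = webapp2.WSGIApplication([('/chart-data', ChartData)])
```

The browser-side JavaScript then just fetches /chart-data and feeds the JSON to whatever charting library you prefer.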

requests library: how to speed it up?

I am trying to send multiple requests to different web pages. At the moment I am using the requests library with multithreading, because I have found it more performant than urllib2. Is it possible to load only part of a webpage? Do you have any other ideas for speeding up my requests besides keep-alive and multithreading?
Thanks.
As you clarified in a comment:
Hi, I'm trying to extract several stock quotes and financial ratios from the Italian Stock Exchange website. Every page that I load is related to a specific company.
This means there aren't very many easy optimisations left to make. If the web pages themselves are very large and the data you want appears early on in the page, you might be able to avoid downloading some of the data by streaming the download: that is, setting stream=True on the request and then using Response.iter_content() to read it a chunk at a time.
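A small sketch of that streaming approach (the URL and the end-of-table marker are placeholders for whatever pages you are scraping):

```python
import requests

url = "https://www.example.com/quote-page"   # placeholder
marker = b"</table>"                         # placeholder: end of the data you need

resp = requests.get(url, stream=True)
buf = b""
for chunk in resp.iter_content(chunk_size=8192):
    buf += chunk
    if marker in buf:
        break          # stop once the part of the page we care about has arrived
resp.close()           # drop the connection without downloading the rest

# parse `buf` here
```

Note that closing the response early means the connection can't be reused for keep-alive, so this only pays off when the pages are large and your data really does come first.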
If you're fortunate, you might be able to take advantage of caching to reduce response times or sizes. Try plugging something like CacheControl into your Session objects to see if this improves anything.
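Wiring CacheControl in is a one-liner around your existing Session; it only helps if the server actually sends caching headers:

```python
import requests
from cachecontrol import CacheControl

sess = CacheControl(requests.Session())

# Repeat requests within the cache lifetime may be answered from the local
# cache instead of going back over the network.
resp = sess.get("https://www.example.com/quote-page")  # placeholder URL
resp = sess.get("https://www.example.com/quote-page")
```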
Otherwise, you're already getting almost as big an improvement as you can get in software alone. If the Italian Stock Exchange supports SPDY (they probably don't), using a SPDY library could improve things, but that rules out Requests (and possibly multithreading as well, for reasons that are totally tangential to this answer). Another outside-the-box option is to run on a machine closer to the web server providing the data.

Is there any better way to access twitter streaming api through python?

I need to fetch Twitter historical data for a given set of keywords. The Twitter Search API returns tweets that are no more than 9 days old, so that will not do. I'm currently using the Tweepy library (http://code.google.com/p/tweepy/) to call the Streaming API, and it is working fine except for the fact that it is too slow. For example, when I run a search for "$GOOG", it sometimes takes more than an hour between two results. There are definitely tweets containing that keyword, but it isn't returning results fast enough.
What could the problem be? Is the Streaming API slow, or is there some problem in my method of accessing it? Is there any better way to get that data free of cost?
How far back do you need? To fetch historical data, you might want to keep the stream on indefinitely (the stream API allows for this) and store the stream locally, then retrieve historical data from your db.
I also use Tweepy for live Stream/Filtering and it works well. The latency is typically < 1s and Tweepy is able to handle large volume streams.
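A rough sketch of the "keep the stream on and store it locally" idea, again assuming the older Tweepy StreamListener API and placeholder credentials, with SQLite as the local store:

```python
import sqlite3
import tweepy

db = sqlite3.connect("tweets.db")
db.execute("CREATE TABLE IF NOT EXISTS tweets "
           "(id INTEGER PRIMARY KEY, created_at TEXT, body TEXT)")

class ArchiveListener(tweepy.StreamListener):
    def on_status(self, status):
        # Persist every tweet the stream delivers so it can be queried later.
        db.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
                   (status.id, status.created_at.isoformat(), status.text))
        db.commit()

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
tweepy.Stream(auth=auth, listener=ArchiveListener()).filter(track=["$GOOG"])

# Later, answer historical questions from the local database, e.g.:
#   db.execute("SELECT body FROM tweets WHERE body LIKE '%GOOG%'")
```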
The Streaming API is fast: you get a message as soon as it is posted (we use twitter4j). But the streamer only delivers current messages, so if you are not listening on the stream at the moment a tweet is sent, that message is lost.
