Analysing Twitter for research: moving from small data to big data - Python

We are doing research work as part of our college project in which we need to analyse Twitter data.
We have already built a prototype for classification and analysis using pandas and NLTK, reading the comments from a CSV file and then processing them. We now want to scale it so that it can read and analyse a much larger comments file as well. The problem is that we don't have anybody who could guide us on which technologies to use for this massive amount of data (most of the people around us are from a biology background).
Our issues are:
1. How do we store a massive comments file (5 GB, offline data)? Until now we had only 5000-10000 lines of comments, which we processed using pandas. How do we store and process such a huge file, and which database should we use for it?
2. Since we plan to use NLTK and machine learning on this data, what should our pipeline look like, along the lines of: CSV -> pandas, NLTK, machine learning -> model -> prediction? That is, where in this path do we need changes, and with which technologies should we replace the current ones to handle the huge data?

Generally speaking, there are two types of scaling:
Scale up
Scale out
Scaling up, most of the time, means taking what you already have and running it on a bigger machine (more CPU, RAM, disk throughput).
Scaling out generally means partitioning your problem and handling the parts on separate threads/processes/machines.
Scaling up is much easier: keep the code you already have and run it on a big machine (possibly on Amazon EC2 or Rackspace, if you don't have one available).
If scaling up is not enough, you will need to scale out. Start by identifying which parts of your problem can be partitioned. Since you're processing Twitter comments, there's a good chance you can simply partition your file into multiple ones and train N independent models.
Since you're just processing text data, there isn't a big advantage to using a database over plain text files (for storing the input data, at least). Simply split your file into multiple files and distribute each one to a different processing unit.
Depending on the specific machine learning techniques you're using, it may be easy to merge the independent models into a single one, but it will likely require expert knowledge.
If you're using K-nearest-neighbors, for example, it's trivial to join the independent models.
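As a minimal sketch of the partitioning idea (assuming the 5 GB file is a CSV with a 'text' column; the file name and column name here are made up), pandas can already stream the file in chunks, and each chunk can be processed, or handed to a separate worker, independently:

import pandas as pd

def process_chunk(chunk):
    # Placeholder for your existing pandas/NLTK pipeline, applied to one partition.
    return chunk['text'].str.len().mean()

results = []
# chunksize makes pandas stream the file instead of loading all 5 GB at once.
for chunk in pd.read_csv('comments.csv', chunksize=100_000):
    results.append(process_chunk(chunk))

Each partition's model or partial statistics can then be merged at the end, as described above.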

Related

Are temporary files generated while working with pandas data frames?

Until now, I have always used SAS to work with sensitive data. It would be nice to change to Python instead. However, I realize I do not understand how the data is handled during processing in pandas.
While running SAS, one knows exactly where all the temporary files are stored (hence it is easy to keep these in an encrypted container). But what happens when I use pandas data frames? I think I would not even notice if the data left my computer during processing.
The mere flat files, of which I typically have dozens to merge, are a couple of GB in size. Hence I cannot simply rely on the hope that everything will be kept in RAM during processing - or can I? I am currently using a desktop with 64 GB of RAM.
If it's a matter of life and death, I would write the data merging function in C. This is the only way to be 100% sure of what happens with the data. The general philosophy of Python is to hide whatever happens "under the hood", which does not seem to fit your particular use case.
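That said, if you do stay with pandas, a rough sanity check (only a sketch; the file name is a placeholder) is to measure the in-memory footprint of each frame before merging:

import pandas as pd

df = pd.read_csv('sensitive_flat_file.csv')   # placeholder file name
# deep=True counts the actual size of Python objects such as strings,
# giving a realistic estimate of the RAM the frame occupies.
print(df.memory_usage(deep=True).sum() / 1e9, "GB")

As far as I know, pandas keeps DataFrames entirely in RAM and does not write its own temporary files during ordinary merges; the remaining risk is OS-level swapping, which is outside Python's control.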

What is a sensible way to store matrices (which represent images) either in memory or on disk, to make them available to a GUI application?

I am looking for some high level advice about a project that I am attempting.
I want to write a PyQt application (following the model-view pattern) to read in images from a directory one by one and process them. Typically there will be a few thousand .png images (each around 1 megapixel, 16 bit grayscale) in the directory. After being read in, the application will then process the integer pixel values of each image in some way, and crucially the result will be a matrix of floats for each. Once processed, the user should then be able to go back and explore any of the matrices they choose (or multiple at once), and possibly apply further processing.
My question is regarding a sensible way to store the matrices in memory, and access them when needed. After reading in the raw .png files and obtaining the corresponding matrix of floats, I can then see the following options for handling the result:
1. Simply store each matrix as a numpy array and keep every one of them in a class attribute. That way they will all be easily accessible to the code when requested by the user, but will this be poor in terms of the RAM required?
2. After processing each image, write the matrix out to a text file, and read it back in from the text file when requested by the user.
3. I have seen examples (see here) of people using SQLite databases to store data for a GUI application (using the MVC pattern), and then querying the database when access to the data is needed. This seems to have the advantage that the data is not held in RAM by the "model" part of the application (unlike option 1), and it is possibly more storage-efficient than option 2, but is it suitable given that my data are matrices?
4. I have seen examples (see here) of people using something called HDF5 for storing application data, which might be similar to using a SQLite database? Again, is it suitable for matrices? (See the sketch after this question.)
5. Finally, I see that PyQt has the classes QImage and QPixmap. Do these make sense for solving the problem I have described?
I am a little lost with all the options, and don't want to spend too much time investigating each of them in detail, so I would appreciate some general advice. If someone could offer comments on each of the options I have described (as well as letting me know if any can be ruled out in this situation), that would be great!
Thank you
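To make the HDF5 option (point 4 above) concrete, here is a minimal sketch using h5py, assuming the results are roughly 1024x1024 float32 matrices; the file and dataset names are made up for illustration:

import h5py
import numpy as np

# Write one dataset per processed image (compression is optional).
with h5py.File('results.h5', 'w') as f:
    for i in range(3):
        matrix = np.random.rand(1024, 1024).astype(np.float32)
        f.create_dataset(f'image_{i:04d}', data=matrix, compression='gzip')

# The GUI can later open the file and pull back only the matrix it needs,
# so only the requested array has to sit in RAM.
with h5py.File('results.h5', 'r') as f:
    m = f['image_0001'][:]

For comparison, option 1 is not automatically ruled out: a 1 megapixel float32 matrix is about 4 MB, so a few thousand of them amount to several GB, which may or may not fit the machines you target.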

In which case would one use the uncompressed codec for Parquet files in Spark?

I'm new to Spark and trying to understand how different compression codecs work. I'm using the Cloudera Quickstart VM 5.12x, Spark 1.6.0 and the Python API.
If I compress and save as Parquet files using the logic below:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
df.write.parquet("/user/cloudera/data/orders_parquet_snappy")
then I can read them as:
sqlContext.read.parquet("/user/cloudera/data/orders_parquet_snappy").show()
I believe the read above doesn't require any explicit decompression on my part. I wonder why, and under which conditions I would use uncompressed instead:
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
Not sure if my understanding is correct.
Compression is good because it saves at-rest storage and transfer bandwidth (both local disk I/O and network), but it does so at the cost of computing power. These are the metrics to keep in mind when selecting a compression codec for your data: depending on which resource you expect to be the bottleneck, you can pick the appropriate one. In practice, you would choose uncompressed only when CPU is the scarce resource and storage and bandwidth are not a concern.
But in general, Snappy has been specifically designed to be relatively easy on CPU while providing a fair storage/bandwidth saving, which makes it more than appropriate for many use cases (which is why it's the default).
The sensible suggestion is of course to measure and come to a decision based on your observations on your particular setup, but I believe it's fair to say you should not expect a massive relative improvement on resource usage (but probably enough to be of some economic relevance if you are working with a truly massive cluster).
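If you want to see the trade-off on your own data, a simple experiment (using the same sqlContext API as in the question; the paths are placeholders) is to write the same DataFrame with both codecs and compare:

sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/user/cloudera/data/orders_parquet_snappy")

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
df.write.parquet("/user/cloudera/data/orders_parquet_uncompressed")

# Compare the on-disk size of the two directories (e.g. hdfs dfs -du -h)
# and time a typical read/aggregation on each copy; pick the codec that
# best fits your storage-versus-CPU trade-off.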

Convert CSV table to Redis data structures

I am looking for a method/data structure to implement an evaluation system for a binary matcher used for verification.
This system will be distributed over several PCs.
Basic idea is described in many places over the internet, for example, in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf
The matcher that I am testing takes two data items as input and calculates a matching score that reflects their similarity (a threshold will then be chosen, depending on the false match/false non-match rate).
Currently I store the matching scores along with labels in a CSV file, like the following:
label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
...
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...
(I've got a labelled database.)
Then I run a Python script that loads this table into a pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in figure 2 of the link above. The processing is rather simple: just sorting the DataFrame, scanning the rows from top to bottom and counting the number of impostors/genuines in the rows above and below each row.
The system should also support finding outliers in order to help improve the matching algorithm (labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores). This is also pretty easy with DataFrames (just sort and take the head rows).
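For reference, a vectorised version of that sort-and-scan step looks roughly like this (a sketch only; it assumes the CSV columns shown above, that higher scores mean a stronger match, and it ignores tied scores):

import pandas as pd

df = pd.read_csv('scores.csv', header=None, skipinitialspace=True,
                 names=['label1', 'label2', 'kind', 'score'])
df = df.sort_values('score', ascending=False).reset_index(drop=True)

n_genuine = (df['kind'] == 'genuine').sum()
n_impostor = (df['kind'] == 'impostor').sum()

# Treat each row's score as a candidate threshold: FMR is the fraction of
# impostors at or above it, FNMR the fraction of genuines strictly below it.
impostors_above = (df['kind'] == 'impostor').cumsum()
genuines_above = (df['kind'] == 'genuine').cumsum()
df['FMR'] = impostors_above / n_impostor
df['FNMR'] = (n_genuine - genuines_above) / n_genuine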
Now I'm thinking about how to store the comparison data in RAM instead of CSV files on HDD.
I am considering Redis for this: the amount of data is large, several PCs are involved in the computations, and Redis has master-slave replication that allows it to quickly sync data over the network, so that several PCs have exact clones of the data.
It is also free.
However, Redis does not seem to me to be very well suited to storing such tabular data.
Therefore, I would need to change the data structures and the algorithms that process them.
However, it is not obvious to me how to translate this table into Redis data structures.
Another option would be to use some other data storage system instead of Redis. However, I am unaware of such systems and would be grateful for suggestions.
You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then think about your questions.
TL;DR - Redis isn't a "tabular data" store, it is a store for data structures. It is up to you to use the data structure(s) that serves your query(ies) in the most optimal way.
IMO what you want to do is actually keep the large data (how big is it anyway?) on slower storage and just store the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more optimally with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))
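As a hedged illustration of the Sorted Set idea using redis-py 3.x (the key names and the way a pair of labels is encoded into a member string are assumptions for this sketch):

import redis

r = redis.Redis(host='localhost', port=6379)

# One sorted set per class; the member is the label pair, the score is the match score.
r.zadd('scores:genuine', {'label1|label2': 0.1, 'label1|label4': 0.2})
r.zadd('scores:impostor', {'label_2|label_n+1': 0.8, 'label_2|label_n+3': 0.9})

# Outlier queries become range queries over the sorted sets.
lowest_genuines = r.zrange('scores:genuine', 0, 9, withscores=True)
highest_impostors = r.zrevrange('scores:impostor', 0, 9, withscores=True)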
Disclaimer: I work at Redis Labs, home of the open source Redis and provider of commercial solutions that leverage it, including the above-mentioned module (open source, AGPL licensed).

Efficiently Reading Large Files with ATpy and numpy?

I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:
sat = atpy.Table('satellite_data.tbl')
From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:
w1 = np.array([sat['w1_column']])
w2 = np.array([sat['w2_column']])
w3 = np.array([sat['w3_column']])
colorw1w2 = w1 - w2 #just subtracting w2 values from w1 values for each element
colorw1w3 = w1 - w3
etc.
But for very large files the computer can't handle it. I think all the data is getting stored in memory before parsing begins, and that's not feasible for 2GB files. So, what can I use instead to handle these large files?
I've seen lots of posts where people are breaking up the data into chunks and using for loops to iterate over each line, but I don't think that's going to work for me here given the nature of these files, and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file, because each line contains a number of parameters that are assigned to columns, and in some cases I need to do multiple operations with figures from a single column.
Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.
For very large arrays (larger than your memory capacity) you can use pytables which stores arrays on disk in some clever ways (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once. Then, you won't have to manually break up your datasets or manipulate them one line at a time.
I know nothing about ATpy so you might be better off asking on an ATpy mailing list or at least some astronomy python users mailing list, as it's possible that ATpy has another solution built in.
From the PyTables website:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
... fast, yet extremely easy to use tool for interactively browse, process and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...
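A minimal sketch of what the out-of-core pattern can look like with PyTables (the file layout and the column names w1/w2 are assumptions, not anything ATpy-specific):

import numpy as np
import tables

with tables.open_file('satellite_data.h5', mode='r') as h5:
    table = h5.root.data                             # assumed table node
    pieces = []
    for start in range(0, table.nrows, 100_000):
        chunk = table.read(start, start + 100_000)   # only this slice is in RAM
        pieces.append(chunk['w1'] - chunk['w2'])
    colorw1w2 = np.concatenate(pieces)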
Look into using pandas. It's built for this kind of work. But the data files need to be stored in a well-structured binary format like HDF5 to get good performance with any solution.
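For instance, if the table were converted once to HDF5 (the file name, the 'sat' key and the format='table' choice are assumptions here), pandas can stream it chunk by chunk:

import pandas as pd

# One-off conversion is assumed to have been done with something like:
#   df.to_hdf('satellite_data.h5', 'sat', format='table')
pieces = []
for chunk in pd.read_hdf('satellite_data.h5', 'sat', chunksize=100_000):
    pieces.append(chunk['w1_column'] - chunk['w2_column'])
colorw1w2 = pd.concat(pieces)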
