A large data frame (a couple of million rows, a few thousand columns) is created with Pandas in Python. This data frame is to be passed to R using PyRserve. This has to be quick - a few seconds at most.
There is a to_json function in pandas. Is conversion to and from JSON the only way for such large objects? Is it OK for such large objects?
I can always write it to disk and read it back (fast using fread, and that is what I have done so far), but what is the best way to do this?
Without having tried it out, to_json seems like a very bad idea, and it gets worse with larger dataframes, since it carries a lot of overhead both when writing and when reading the data.
I'd recommend using rpy2 (which is supported directly by pandas) or, if you want to write something to disk (maybe because the dataframe is only generated once), you can use HDF5 (see this thread for more information on interfacing pandas and R using this format).
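For illustration, here is a rough sketch of both routes. The rpy2 conversion API has changed across versions, and the HDF5 file written by pandas needs a compatible reader on the R side (e.g. the rhdf5 or hdf5r packages), so treat the exact calls as an assumption-laden example rather than a recipe:

```python
# Illustrative sketch only: exact rpy2 calls vary by version, and the
# DataFrame below is a stand-in for the real one.
import pandas as pd
from rpy2 import robjects
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # enable pandas <-> R data.frame conversion

df = pd.DataFrame({"x": range(5), "y": [0.5 * v for v in range(5)]})

# Hand the frame to the embedded R session under the (made-up) name "df"
robjects.globalenv["df"] = pandas2ri.py2rpy(df)
robjects.r("print(summary(df))")

# On-disk alternative: HDF5 (requires the PyTables package on the Python side)
df.to_hdf("data.h5", key="df", mode="w")
```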
Related
Despite following best practices for reducing Pandas DataFrame memory usage, I still find that the memory usage is too high. I've tried chunking, converting dtypes, reading less data, etc.
For example, even though the CSV file I'm reading is 2.7 GB, task manager shows that 25 GB of RAM are used when I call pd.read_csv. I've tried converting object columns to category, but some columns are not suitable for the conversion, so the object data type is the only choice I have.
Does anyone have advice on how to reduce the memory usage, or alternative Python libraries with low-memory dataframe objects? I've tried PySpark, but the lazy evaluation is killing me every time I want to run a simple show statement.
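For reference, the dtype-at-read-time approach mentioned above looks roughly like this; the column names and dtypes are hypothetical and would need to match the real file:

```python
# Hypothetical sketch of trimming memory at read time; adapt names/dtypes.
import pandas as pd

dtypes = {
    "user_id": "int32",     # downcast integers where the value range allows it
    "price": "float32",     # and floats, if the precision loss is acceptable
    "country": "category",  # low-cardinality strings compress well as category
}

df = pd.read_csv(
    "big_file.csv",
    usecols=list(dtypes),   # read only the columns that are actually needed
    dtype=dtypes,
)
print(df.memory_usage(deep=True).sum() / 1e9, "GB")
```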
Why use a Dask DataFrame? From the Dask documentation (the "Common Uses and Anti-Uses" and "Best Practices" pages):

Dask DataFrame is used in situations where Pandas is commonly needed, usually when Pandas fails due to data size or computation speed. For data that fits into RAM, Pandas can often be faster and easier to use than Dask DataFrame. While “Big Data” tools can be exciting, they are almost always worse than normal data tools while those remain appropriate.
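A minimal sketch of what that looks like in practice; the file name and column names are placeholders:

```python
# Minimal Dask sketch; "big_file.csv", "country" and "price" are placeholders.
import dask.dataframe as dd

# Lazily partition the CSV instead of loading it all at once
ddf = dd.read_csv("big_file.csv", blocksize="256MB")

# Operations only build a task graph; nothing is read yet
result = ddf.groupby("country")["price"].mean()

# compute() executes the graph, streaming partitions through memory
print(result.compute())
```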
I'm taking a MOOC on TensorFlow 2, and in the class the assignments insist that we need to use TF Datasets; however, it seems like all the hoops you have to jump through to do anything with Datasets make everything way more difficult than using a Pandas dataframe or a NumPy array... So why use them?
The things you mention are generally meant for small data, i.e. when a dataset can fit entirely in RAM (which usually means a few GB).
In practice, datasets are often much bigger than that. One way of dealing with this is to store the dataset on a hard drive; there are other, more complicated storage solutions as well. TF Datasets let you interface with these storage solutions easily. You create the Dataset object, which represents the storage, in your script, and then as far as you're concerned you can train a model on it as usual. Behind the scenes, TF repeatedly reads the data into RAM, uses it, and then discards it.
TF Datasets provide many helpful methods for handling big data, such as prefetching (doing the storage reads and data preprocessing at the same time as other work), multithreading (running calculations like preprocessing on several examples at once), shuffling (which is harder to do when you can't just reorder the whole dataset in RAM), and batching (preparing sets of several examples to feed to a model as a batch). All of this would be a pain to do yourself in an optimised way with Pandas or NumPy; a rough sketch of such a pipeline is shown below.
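The sketch below strings those pieces together; the file pattern, parser, and batch size are made-up placeholders, not part of any course assignment:

```python
# Rough tf.data pipeline sketch; paths, column layout, and sizes are placeholders.
import tensorflow as tf

def parse_line(line):
    # Hypothetical parser: split a CSV line into three float features and an int label
    fields = tf.io.decode_csv(line, record_defaults=[0.0, 0.0, 0.0, 0])
    return tf.stack(fields[:-1]), fields[-1]

dataset = (
    tf.data.TextLineDataset(tf.data.Dataset.list_files("data/part-*.csv"))
    .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)  # multithreaded preprocessing
    .shuffle(buffer_size=10_000)                            # buffer-based shuffling
    .batch(32)                                              # group examples into batches
    .prefetch(tf.data.AUTOTUNE)                             # overlap I/O with training
)

# model.fit(dataset) would then stream from disk instead of holding everything in RAM
```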
I'm implementing a Kalman filter on two types of measurements. I have a GPS measurement every second (1 Hz) and 100 acceleration measurements per second (100 Hz).
So basically I have two huge tables that have to be fused at some point. My aim is: I really want to write readable and maintainable code.
My first approach was: there is a class for each of the data tables (so an object is a data table), and I do bulk calculations in the class methods (so almost all of my methods contain a for loop), until I get to the actual filter. I found this approach a bit too rigid. It works, but there is a lot of data-type transformation, and it is just not that convenient.
Now I want to change my code. If I wanted to stick to OOP, my second try would be: every single measurement is an object of either the GPS_measurment or the acceleration_measurement class. This approach seems better, but this way thousands of objects would be created.
My third try would be a data-driven design, but I'm not really familiar with that approach.
Which paradigm should I use? Or perhaps it should be solved by some mixture of the above paradigms? Or should I just use procedural programming with pandas dataframes?
It sounds like you would want to use pandas. OOP is a concept, by the way, not something you have to apply rigidly. Generally speaking, you only want to define your own classes if you plan on extending them or encapsulating certain features. pandas and numpy are two modules that already do almost everything you could ask for with regard to data, and they are faster in execution.
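As an illustration of the dataframe route, the two measurement streams can be fused with an as-of join; the timestamps and column names below are invented for the example:

```python
# Invented example data: a 1 Hz GPS table and a 100 Hz acceleration table.
import pandas as pd

gps = pd.DataFrame({
    "t": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:01"]),
    "lat": [47.50, 47.51],
    "lon": [19.04, 19.05],
})
accel = pd.DataFrame({
    "t": pd.date_range("2024-01-01 00:00:00", periods=200, freq="10ms"),
    "a": [0.01 * i for i in range(200)],
})

# Attach the most recent GPS fix to every acceleration sample;
# both frames must be sorted on the key column for merge_asof.
fused = pd.merge_asof(accel.sort_values("t"), gps.sort_values("t"),
                      on="t", direction="backward")
print(fused.head())
```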
I am opening one very large CSV in chunks using pandas read_csv with a chunksize set, because the CSV is too large to fit into memory. I am performing transformations on each chunk. I then want to append the transformed chunk to the end of another existing (and very large) CSV.
I have been running into out-of-memory errors, though. Does pandas to_csv(mode='a', header=False) open the CSV in order to append the new chunk? In other words, is the to_csv() causing my memory errors?
I have had this same issue several times. What you might try is to export your data chunks to several CSVs (without headers) and then concatenate them with a non-pandas function (e.g. writing new lines to a text file as you read from your different CSVs).
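A sketch of that workflow, with placeholder file names and a stand-in transform:

```python
# Rough sketch of the suggested workflow; paths and the transform are placeholders.
import glob
import shutil
import pandas as pd

def transform(chunk):
    # stand-in for whatever per-chunk transformation is actually needed
    return chunk

# 1) Write each transformed chunk to its own headerless CSV
for i, chunk in enumerate(pd.read_csv("huge_input.csv", chunksize=100_000)):
    transform(chunk).to_csv(f"part_{i:05d}.csv", index=False, header=False)

# 2) Concatenate the parts onto the existing big CSV with plain file I/O,
#    so pandas never has to hold the combined result in memory
with open("existing_big.csv", "a", newline="") as out:
    for part in sorted(glob.glob("part_*.csv")):
        with open(part) as src:
            shutil.copyfileobj(src, out)
```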
I have two CSV files, and I would like to validate (find the differences and similarities between) the data in these two files.
I am retrieving this data from Vertica, and because the data is so large I would like to do the validation at the CSV level.
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
I don't think you can directly compare sheets using openpyxl without manually looping over each row and writing your own validation code.
Whether that is acceptable depends on your performance goals; if speed is not a requirement, then why not, but it will require some additional work.
Instead, I would use pandas dataframes for any CSV validation needs; if you can add this dependency, it becomes much easier to compare the files while keeping performance high.
Here is a link to complete example:
http://pbpython.com/excel-diff-pandas.html
However, use read_csv() instead of read_excel() to read data from your files.
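For example, a row-level diff of the two exports can be done with an outer merge; the file names below are placeholders:

```python
# Placeholder file names; merging on all shared columns treats identical rows as matches.
import pandas as pd

old = pd.read_csv("export_old.csv")
new = pd.read_csv("export_new.csv")

# Outer join with an indicator column flags rows present on only one side
merged = old.merge(new, how="outer", indicator=True)

only_in_old = merged[merged["_merge"] == "left_only"]
only_in_new = merged[merged["_merge"] == "right_only"]
in_both = merged[merged["_merge"] == "both"]

print(f"{len(only_in_old)} rows only in the old export, "
      f"{len(only_in_new)} only in the new export, {len(in_both)} in both")
```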