Pandas Dataframe consumes too much memory. Any alternatives? [closed] - python

Despite following best practices for reducing Pandas DataFrame memory usage, I still find that the memory usage is too high. I've tried chunking, converting dtypes, reading in less data, etc.
For example, even though the CSV file I'm reading is only 2.7 GB, Task Manager shows 25 GB of RAM being used after pd.read_csv. I've tried converting object columns to category, but some columns are not suitable for the conversion, so object is the only dtype I can use for them.
Does anyone have advice on how to reduce the memory usage, or alternative Python libraries that offer lower-memory DataFrame objects? I've tried PySpark, but the lazy evaluation is painful every time I want to run a simple show statement.

Have a look at Dask DataFrame. Its documentation sections "Common Uses and Anti-Uses" and "Best Practices" explain when it is worth using:
Dask DataFrame is used in situations where Pandas is commonly needed, usually when Pandas fails due to data size or computation speed. For data that fits into RAM, Pandas can often be faster and easier to use than Dask DataFrame. While "Big Data" tools can be exciting, they are almost always worse than normal data tools while those remain appropriate.
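For a rough idea of what that looks like (the file path and column names below are hypothetical), a Dask version of the code is almost identical to pandas, but the CSV is read lazily in partitions instead of being loaded into RAM all at once:

import dask.dataframe as dd

# Hypothetical file; blocksize controls how much of the CSV each partition holds.
ddf = dd.read_csv("big_file.csv", blocksize="256MB")

# This only builds a task graph; nothing is read into memory yet.
result = ddf.groupby("some_column")["some_value"].mean()

# compute() streams the partitions through memory and returns a pandas object.
print(result.compute())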

Related

What is the use of numpy in python [closed]

I'm a novice in AI & ML, and I always see
import numpy as np
So I really want to know: when is NumPy used, and what is it used for?
Ideally, when do we have to include this library in our code?
NumPy is a Python library with a variety of use cases; which ones matter depends on the scenario you are working on.
As a novice ML learner myself, I've found NumPy very useful when working with ML algorithms, which typically involve working with arrays. NumPy tends to be more versatile than the traditional Python list. NumPy offers the following features:
Smaller memory consumption than lists
Multi-dimensional arrays (Python has no built-in array type of this kind)
NumPy arrays are faster than Python lists
Functions to transform arrays, such as reshape, sort, reverse, etc.
Usually you will have to deal with a lot of arrays of varying sizes and might have to sort or reshape them; NumPy is very useful in that case.
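A minimal sketch of those points (the values are just illustrative):

import numpy as np

# A fixed-dtype array: compact in memory and fast to operate on.
arr = np.arange(12, dtype=np.int32)

# Reshape into a 3x4 matrix without copying the data.
matrix = arr.reshape(3, 4)

# Vectorised, element-wise arithmetic with no Python-level loop.
doubled = matrix * 2

# Sorting and reversing.
sorted_rows = np.sort(matrix, axis=1)
reversed_arr = arr[::-1]

# The memory footprint is explicit, unlike a Python list of ints.
print(arr.itemsize, arr.nbytes)  # 4 bytes per element, 48 bytes total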

(Conceptual question) Tensorflow dataset... why use it? [closed]

I'm taking a MOOC on TensorFlow 2, and in the class the assignments insist that we need to use TF Datasets; however, all the hoops you have to jump through to do anything with Datasets seem to make everything way more difficult than using a Pandas DataFrame or a NumPy array... So why use it?
The things you mention are generally meant for small data, when a dataset can fit entirely in RAM (which is usually a few GB).
In practice, datasets are often much bigger than that. One way of dealing with this is to store the dataset on a hard drive; there are other, more complicated storage solutions too. TF Datasets let you interface with these storage solutions easily. You create the Dataset object, which represents the storage, in your script, and then as far as you're concerned you can train a model on it as usual. Behind the scenes, though, TF is repeatedly reading the data into RAM, using it, and then discarding it.
TF Datasets also provide many helpful methods for handling big data, such as prefetching (doing the storage reads and data preprocessing at the same time as other work), multithreading (running calculations like preprocessing on several examples at once), shuffling (which is harder to do when you can't just reorder the whole dataset in RAM), and batching (preparing sets of several examples to feed to a model as a batch). All of this would be a pain to do yourself in an optimised way with Pandas or NumPy.
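A rough sketch of such a pipeline (the toy arrays and the doubling step are placeholders; tf.data.AUTOTUNE is spelled tf.data.experimental.AUTOTUNE in older TF 2 releases):

import numpy as np
import tensorflow as tf

# Toy in-memory data standing in for something much larger stored on disk.
features = np.random.rand(1000, 10).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds = ds.shuffle(buffer_size=1000)                 # shuffle via a buffer, not a full in-RAM reorder
ds = ds.map(lambda x, y: (x * 2.0, y),            # placeholder preprocessing step
            num_parallel_calls=tf.data.AUTOTUNE)  # preprocess several examples in parallel
ds = ds.batch(32)                                 # group examples into batches
ds = ds.prefetch(tf.data.AUTOTUNE)                # overlap preprocessing with training

# A Keras model can then consume the Dataset directly:
# model.fit(ds, epochs=5)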

Data analysis in python [closed]

I have to make two data analysis reports using descriptive statistics, plenty of informative plots, etc.
The problem is, I'm not sure which tools I should use. I started preparing one report in a Jupyter Notebook using pandas, scipy.stats and matplotlib, with the intention of converting it to PDF later so I could have a report without code. After an hour or two I realised it might not be the best idea. I even had problems producing a good description of categorical data, since pandas describe() has limited functionality for that type of data.
Could you suggest some tools that would work best in this case? I want to prepare an aesthetic, informative report. It's my first time doing data analysis that includes preparing a report.
Your report doesn't require code, as you said. So why not just type up the report in Word and include the relevant tables and plots? You can produce the plots in Python using matplotlib (or seaborn for more aesthetic plots). As for the statistics, you are not limited to what pandas offers: depending on the kind of data you can, for example, use scipy and apply its functions to columns of your DataFrame to generate insights.
Also check out the data analysis and visualization software Tableau. You can quickly create some beautiful and insightful plots with it; however, there is a learning curve.
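As a small sketch of that workflow (the file name and column names are hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Descriptive statistics: describe() for numeric columns,
# value_counts() as a fuller summary of a categorical column.
print(df.describe())
print(df["category_col"].value_counts(normalize=True))

# scipy.stats applied to a single numeric column of the DataFrame.
print(stats.describe(df["numeric_col"].dropna()))

# A seaborn plot saved to a file you can paste into the Word report.
sns.histplot(data=df, x="numeric_col", hue="category_col")
plt.tight_layout()
plt.savefig("numeric_col_hist.png", dpi=150)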

Python Pandas IDE that would "know" columns and types [closed]

I'm doing some development in Python, mostly using a simple text editor (Sublime Text). I mostly deal with databases that I fit into Pandas DataFrames. My issue is that I often lose track of the column names, and occasionally the column types as well. Is there some IDE / plug-in / debug tool that would allow me to look into each DataFrame and see how it's defined, a little like Eclipse can do for Java classes?
Thank you.
I don't believe that something like that exists, but you can always use df.info().
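For example (df here stands for whatever DataFrame you have in memory):

import pandas as pd

df = pd.DataFrame({"spam": [1, 2], "eggs": [0.5, 1.5], "bar": ["a", "b"]})

df.info()         # column names, non-null counts and dtypes
print(df.dtypes)  # just the column -> dtype mapping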
You're looking for Spyder; at least that's the one I currently use. I'm not sure if other IDEs have the same capabilities.
It has MATLAB-like features that let you view your variables, and a great tool for NumPy arrays and Pandas DataFrames.
Here's an example:
import numpy as np
import pandas as pd

foo = 7                       # a plain scalar
arr = np.array([0, 5, 9])     # a NumPy array
df = pd.DataFrame(arr).T      # transpose so the three values become three columns
df.columns = ['spam', 'eggs', 'bar']
In the Variable Explorer you'll see foo, arr and df listed with their types, sizes and values. Double-clicking df opens a spreadsheet-like view of the DataFrame showing its column names and contents; the same works for arrays.
Note: for NumPy arrays you can switch the view from showing max/min to truncated values, or to the full array.

pandas dataframe to R using pyRserve [closed]

A large data frame (a couple of million rows, a few thousand columns) is created with Pandas in Python. This data frame is to be passed to R using pyRserve. This has to be quick - a few seconds at most.
There is a to_json function in pandas. Is converting to and from JSON the only way for such large objects? Is it even OK for objects this large?
I can always write it to disk and read it back (fast using fread, and that is what I have done so far), but what is the best way to do this?
Without having tried it, to_json seems like a very bad idea; it gets worse with larger dataframes, as it carries a lot of overhead both when writing and when reading the data.
I'd recommend using rpy2 (which pandas supports directly) or, if you want to write something to disk (maybe because the dataframe is only generated once), you can use HDF5 (see this thread for more information on interfacing pandas and R using that format).
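A minimal sketch of the rpy2 route (the conversion API varies between rpy2 versions; this follows the rpy2 3.x converter style, and the columns are just placeholders):

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

pd_df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Convert the pandas DataFrame to an R data.frame in memory (no JSON, no disk).
with localconverter(ro.default_converter + pandas2ri.converter):
    r_df = ro.conversion.py2rpy(pd_df)

# Expose it to R and run R code on it.
ro.globalenv["df"] = r_df
print(ro.r("summary(df)"))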
