How is it possible to optimize a Pandas DataFrame to use the ubyte data type (0..255)? (By default, integers are stored as int64.)
If I convert the data to the Categorical type, will the DataFrame use less memory?
Or is the only way to optimize it to use NumPy instead of Pandas?
For unsigned integer data in the range 0..255, you can reduce the memory footprint from the default int64 (8 bytes per value) to uint8 (1 byte per value). You can refer to this article for an example where the memory usage is reduced substantially, from 1.5MB to 332KB (around one fifth).
As for the Categorical type, since Pandas stores categorical columns as objects, this storage is not optimal. One of the reasons is that it creates a list of pointers to the memory address of each value in your column. Refer to this article for more information.
To use uint8, you can either do it when you load your data, e.g. by specifying the dtype of the input columns as uint8 in the pd.read_csv call (see the first article for an example), or, if your data is already loaded, convert the DataFrame columns with Series.astype() or DataFrame.astype(), e.g. .astype('uint8').
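A minimal, self-contained sketch of both approaches (the tiny inline CSV is only there to make the example runnable):

from io import StringIO
import pandas as pd

csv_data = StringIO("a,b\n0,1\n128,2\n255,3")

# Option 1: declare the dtype while reading
df = pd.read_csv(csv_data, dtype={'a': 'uint8', 'b': 'uint8'})

# Option 2: downcast columns that were already loaded as int64
df2 = pd.DataFrame({'a': [0, 128, 255], 'b': [1, 2, 3]})
df2 = df2.astype({'a': 'uint8', 'b': 'uint8'})

print(df.dtypes)                    # a, b -> uint8
print(df2.memory_usage(deep=True))  # footprint after the downcast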
Related
What would cause pandas to set a column type to 'object' when the values I have checked are strings? I have explicitly set that column to "string" in the dtype dictionary passed to the read_excel call that loads the data. I have checked for NaN or NULL values but haven't found any, as I know those may cause an object type to be set. I recall reading that string types need a maximum length set, but I was under the impression that pandas sets that to the maximum length found in the column.
Edit 1:
This seems to only happen in fields holding email addresses. While I don't think this has an effect, could the # character be triggering this behavior?
The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. But for strings, the length is not fixed. So instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which stores pointers to the string objects; because of this, the dtype of this kind of ndarray is object.
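A small illustration: a column of Python strings reports the object dtype; as an aside, the dedicated nullable "string" extension dtype (available since pandas 1.0) is reported separately:

import pandas as pd

s = pd.Series(['alice@example.com', 'bob@example.com'])
print(s.dtype)           # object: the array holds pointers to Python str objects

s2 = s.astype('string')  # pandas 1.0+ nullable string extension dtype
print(s2.dtype)          # string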
I want to know the list of all possible data types returned by pandas.DataFrame.dtypes.
(A) As per https://pbpython.com/pandas_dtypes.html the following are all the possible data types in pandas:
object, int64, float64, bool, datetime64, timedelta, category
(B) This SO answer talks about pandas supporting many more kinds of data types, including PeriodDtype, CategoricalDtype, etc.
Is it correct to say that (A) represents all possible data types with 'object' doing the heavy lifting for all the additional datatypes not specified (i.e., including for those in (B))?
https://pandas.pydata.org/docs/user_guide/basics.html#dtypes
This should give you the info required. TL;DR: pandas generally supports the NumPy dtypes float, int, bool, timedelta64[ns] and datetime64[ns], in addition to the generic object dtype, which is a catch-all.
However, pandas has been introducing extension dtypes for a while now.
Is it correct to say that (A) represents all possible data types with 'object' doing the heavy lifting for all the additional datatypes not specified (i.e., including for those in (B))?
No, object is primarily there for either string columns or columns with mixed types. The newer ExtensionDtypes are described as being similar to a np.dtype:
A pandas.api.extensions.ExtensionDtype is similar to a numpy.dtype object. It describes the data type.
https://pandas.pydata.org/docs/development/extending.html#extending-extension-types
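To make the distinction concrete, a small sketch showing a couple of extension dtypes alongside the classic NumPy-backed ones (the column contents are arbitrary):

import pandas as pd

df = pd.DataFrame({
    'a': pd.array([1, 2, None], dtype='Int64'),                       # nullable integer (extension dtype)
    'b': pd.Categorical(['x', 'y', 'x']),                             # CategoricalDtype (extension dtype)
    'c': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),  # datetime64[ns]
    'd': ['foo', 'bar', 'baz'],                                       # plain object dtype
})
print(df.dtypes)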
When loading the output of a query into a DataFrame using pandas, the standard behavior was to convert integer fields containing NULLs to float, so that NULLs became NaN.
Starting with pandas 1.0.0, they included a new type called pandas.NA to deal with integer columns having NULLs. However, when using pandas.read_sql(), the integer columns are still being transformed into float instead of integer when NULLs are present. Added to that, the read_sql() method doesn't support the dtype parameter to coerce fields, like read_csv() does.
Is there a way to load integer columns from a query directly into the Int64 dtype, instead of first coercing them to float and then having to manually convert them to Int64?
Have you tried using
select isnull(col_name, 0) from table_name? This converts all NULL values to 0.
Integers are automatically cast to float values, just as boolean values are cast to objects, when some values are n/a.
It seems that, as of the current version, there is no direct way to do that. There is no way to coerce a column to this dtype, and pandas won't use the dtype for inference.
There's a similar problem discussed in this thread: Convert Pandas column containing NaNs to dtype `int`
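As a workaround, here is a sketch of the manual conversion the question wants to avoid, using an in-memory SQLite table purely for illustration (with a recent pandas, astype('Int64') turns the NaNs into pd.NA):

import sqlite3
import pandas as pd

# Illustrative in-memory table with an integer column containing NULLs
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (id INTEGER, val INTEGER)')
con.executemany('INSERT INTO t VALUES (?, ?)', [(1, 10), (2, None), (3, 30)])

df = pd.read_sql('SELECT * FROM t', con)
print(df['val'].dtype)                 # float64: the NULL became NaN

df['val'] = df['val'].astype('Int64')  # nullable integer; NaN becomes pd.NA
print(df['val'].dtype)                 # Int64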
As a followup to my question on mixed types in a column:
Can I think of a DataFrame as a list of columns or is it a list of rows?
In the former case, it means that (optimally) each column has to be homogeneous (type-wise) and different columns can be of different types. The latter case suggests that each row is type-wise homogeneous.
From the documentation:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
This implies that a DataFrame is a list of columns.
Does it mean that appending a row to a DataFrame is more expensive than appending a column?
You are fully correct that a DataFrame can be seen as a list of columns, or, even better, as an (ordered) dictionary of columns (see the explanation here).
Indeed, each column has to be homogeneous in type, and different columns can be of different types. But by using the object dtype you can still hold different kinds of objects in one column (although this is not recommended, apart from e.g. strings).
To illustrate, if you ask for the data types of a DataFrame, you get the dtype of each column:
In [2]: df = pd.DataFrame({'int_col':[0,1,2], 'float_col':[0.0,1.1,2.5], 'bool_col':[True, False, True]})
In [3]: df.dtypes
Out[3]:
bool_col bool
float_col float64
int_col int64
dtype: object
Internally, the values are stored as blocks of the same type. Each column, or collection of columns of the same type, is stored in a separate array.
And this indeed implies that appending a row is more expensive. In general, appending many single rows is not a good idea: it is better to e.g. preallocate an empty DataFrame to fill, or to put the new rows/columns in a list and concat them all at once.
See the note at the end of the concat/append docs (just before the first subsection "Set logic on the other axes").
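A small sketch of the recommended pattern (collect the rows first, build or concat once) versus growing the DataFrame row by row:

import pandas as pd

rows = [{'a': i, 'b': i * 2} for i in range(1000)]

# Slow pattern: grow the DataFrame one row at a time
# (every iteration copies the existing blocks into a new DataFrame)
df_slow = pd.DataFrame(rows[:1])
for row in rows[1:]:
    df_slow = pd.concat([df_slow, pd.DataFrame([row])], ignore_index=True)

# Recommended pattern: keep the rows in a plain list and construct once
df_fast = pd.DataFrame(rows)

assert df_slow.equals(df_fast)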
To address the question: Is appending a row to a DataFrame more expensive than appending a column?
We need to take various factors into account, but the most important one is the internal physical data layout of a Pandas DataFrame.
The short and somewhat naive answer:
If the table (aka DataFrame) is stored in a column-wise physical layout, then adding or fetching a column is faster than doing the same with a row; if the table is stored in a row-wise physical layout, it's the other way around. In general, the default Pandas DataFrame is stored column-wise (but NOT all the time). So, in general, appending a row to a DataFrame is indeed more expensive than appending a column. And you could consider the nature of a Pandas DataFrame to be a dict of columns.
A longer answer:
Pandas needs to choose a way to arrange the internal layout of a table in memory (such as a DataFrame of 10 rows and 2 columns). The two most common approaches are column-wise and row-wise.
Pandas is built on top of NumPy, and DataFrame and Series are built on top of the NumPy array. But note that although a NumPy array is internally stored row-wise in memory, this is NOT the case for a Pandas DataFrame. How a DataFrame is stored depends on how it was initialized, cf. this post: https://krbnite.github.io/Memory-Efficient-Windowing-of-Time-Series-Data-in-Python-2-NumPy-Arrays-vs-Pandas-DataFrames/
It's actually quite natural that Pandas adopts a column-wise layout most of the time, because Pandas was designed as a data analysis tool that relies more heavily on column-oriented operations than row-oriented ones. cf. https://www.stitchdata.com/columnardatabase/
In the end, the answer to the question Is appending a row to a DataFrame more expensive than appending a column? also depends on caching, prefetching, etc. Thus it's a rather complicated question to answer and can depend on specific runtime conditions. But the most important factor is the data layout.
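A rough, illustrative timing sketch of the two operations (the absolute numbers will vary with pandas version and hardware; the data here is arbitrary):

import time
import numpy as np
import pandas as pd

base = pd.DataFrame(np.random.rand(100_000, 50))
df_col = base.copy()
df_row = base.copy()
new_row = pd.DataFrame(np.random.rand(1, 50))

t0 = time.perf_counter()
df_col[50] = np.random.rand(len(df_col))                  # append a column: adds one new block
t1 = time.perf_counter()
df_row = pd.concat([df_row, new_row], ignore_index=True)  # append a row: copies all blocks
t2 = time.perf_counter()

print(f'append column: {t1 - t0:.5f}s, append row: {t2 - t1:.5f}s')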
Answer from the authors of Pandas
The authors of Pandas actually mention this point in their design documentation, cf. https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst#what-is-blockmanager-and-why-does-it-exist
So, to do anything row oriented on an all-numeric DataFrame, pandas would concatenate all of the columns together (using numpy.vstack or numpy.hstack) then use array broadcasting or methods like ndarray.sum (combined with np.isnan to mind missing data) to carry out certain operations.
What exactly happens when Pandas issues this warning? Should I worry about it?
In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139:
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I assume that this means that Pandas is unable to infer the type from the values in those columns. But if that is the case, what type does Pandas end up using for those columns?
Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?
Finally, how exactly does low_memory=False fix the problem?
Revisiting mbatchkarov's link, low_memory is not deprecated.
It is now documented:
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while
parsing, but possibly mixed type inference. To ensure no
mixed types either set False, or specify the type with the dtype
parameter. Note that the entire file is read into a single DataFrame
regardless, use the chunksize or iterator parameter to return the data
in chunks. (Only valid with C parser)
I have asked what "resulting in mixed type inference" means, and chris-b1 answered:
It is deterministic - types are consistently inferred based on what's
in the data. That said, the internal chunksize is not a fixed number
of rows, but instead bytes, so whether you get a mixed dtype warning
or not can feel a bit random.
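As an aside on the chunksize/iterator parameters mentioned in the quoted documentation, a minimal sketch of explicit chunked reading (the inline CSV is only there to keep it self-contained):

from io import StringIO
import pandas as pd

csv_data = 'col\n' + '\n'.join(str(x) for x in range(10))

# Read in explicit chunks instead of one DataFrame
for chunk in pd.read_csv(StringIO(csv_data), chunksize=4):
    print(len(chunk), chunk['col'].dtype)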
So, what type does Pandas end up using for those columns?
This is answered by the following self-contained example:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
type(df.loc[524287,'0'])
Out[50]: int
type(df.loc[524288,'0'])
Out[51]: str
The first part of the csv data was seen as containing only ints, so it was converted to int; the second part also contained a string, so all of its entries were kept as strings.
Can the type always be recovered after the fact (after getting the warning)?
I guess re-exporting to csv and re-reading with low_memory=False should do the job.
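A sketch of that round trip, reusing the mixed-column example from above (an in-memory buffer stands in for the exported csv file):

import io
import pandas as pd

# Re-create the mixed-type column, then round-trip it through csv
df = pd.read_csv(io.StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf, low_memory=False)  # whole-column inference, so no DtypeWarning
print(df2['0'].dtype)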
How exactly does low_memory=False fix the problem?
It reads all of the file before deciding the type, therefore needing more memory.
low_memory is apparently kind of deprecated, so I wouldn't bother with it.
The warning means that some of the values in a column have one dtype (e.g. str), and some have a different dtype (e.g. float). I believe pandas uses the lowest common super type, which for a mix of str and float would be object.
You should check your data, or post some of it here. In particular, look for missing values or inconsistently formatted int/float values. If you are certain your data is correct, then use the dtype parameter to help pandas out.
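A minimal sketch of that last suggestion, reusing the mixed-column example from the other answer (the column is named '0' there because the first value doubles as the header):

from io import StringIO
import pandas as pd

data = '\n'.join([str(x) for x in range(1000000)] + ['a string'])

# Declaring the intended dtype up front avoids the per-chunk guessing (and the warning)
df = pd.read_csv(StringIO(data), dtype={'0': str})
print(df['0'].dtype)    # object: every value is kept as a string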