I have a large DataFrame (circa 4e+07 rows).
When summing it, I get two significantly different results depending on whether I do the sum before or after selecting the columns.
Also, the dtype changes from float32 to float64, even though the totals are all below 2**31.
df[[col1, col2, col3]].sum()
Out[1]:
col1 9.36e+07
col2 1.39e+09
col3 6.37e+08
dtype: float32
df.sum()[[col1, col2, col3]]
Out[2]:
col1 1.21e+08
col2 1.70e+09
col3 7.32e+08
dtype: float64
I am obviously missing something; has anybody had the same issue?
Thanks for your help.
To understand what's going on here, you need to understand what Pandas is doing under the hood. I'm going to simplify a bit, since there are lots of bells and whistles and special cases to consider, but roughly it looks like this:
Suppose you've got a Pandas DataFrame object df with various numeric columns (we'll ignore datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:
Extracts the values of the dataframe into a two-dimensional NumPy array.
Applies the NumPy sum function to that 2d array with axis=0 to compute the column sums.
It's the first step that's important here. The columns of a DataFrame might have different dtypes, but a 2d NumPy array can only have a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose a single dtype that's appropriate for both columns simultaneously, and in this case it chooses float64. So when the sum is computed, it's computed on double-precision values, using double-precision arithmetic. This is what's happening in your second example.
On the other hand, if you cut down to just the float32 columns in the first place, then Pandas can and will use the float32 dtype for the 2d NumPy array, and so the sum computation is performed in single precision. This is what's happening in your first example.
Here's a simple example showing this in action: we'll set up a DataFrame with 100 million rows and three columns, of dtypes float32, float32 and int32 respectively. All the values are ones:
>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
A B C
0 1.0 1.0 1
1 1.0 1.0 1
2 1.0 1.0 1
3 1.0 1.0 1
4 1.0 1.0 1
>>> df.dtypes
A float32
B float32
C int32
dtype: object
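As a quick sanity check of step 1, we can look at the dtype of the extracted values array directly (a minimal sketch; the exact repr may vary by NumPy version):
>>> df.values.dtype               # mixed float32/int32 columns: promoted to float64
dtype('float64')
>>> df[['A', 'B']].values.dtype   # float32 columns only: stays float32
dtype('float32')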
Now when we compute the sums directly, Pandas first turns everything into float64s. The computation is also done using the float64 type, for all three columns, and we get an accurate answer.
>>> df.sum()
A 100000000.0
B 100000000.0
C 100000000.0
dtype: float64
But if we first cut down our dataframe to just the float32 columns, then float32-arithmetic is used for the sum, and we get very poor answers.
>>> df[['A', 'B']].sum()
A 16777216.0
B 16777216.0
dtype: float32
The inaccuracy is of course due to using a dtype that doesn't have enough precision for the task in question: at some point in the summation, we end up repeatedly adding 1.0 to 16777216.0, and getting 16777216.0 back each time, thanks to the usual floating-point problems. The solution is to explicitly convert to float64 yourself before doing the computation.
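For example, one way to do that conversion is a minimal sketch like the following, using the df above (astype makes a float64 copy before the sum is computed):
>>> df[['A', 'B']].astype(np.float64).sum()
A    100000000.0
B    100000000.0
dtype: float64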
However, this isn't quite the end of the surprises that Pandas has in store for us. With the same dataframe as above, let's try just computing the sum for column "A":
>>> df[['A']].sum()
A 100000000.0
dtype: float32
Suddenly we're getting full accuracy again! So what's going on? This has little to do with dtypes: we're still using float32 to do the summation. It's now the second step (the NumPy summation) that's responsible for the difference.
What's happening is that NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with float32 dtype and the size arrays that we're using, that accuracy can make a hugely significant difference to the final result. However, it only uses that algorithm when summing along the fastest-varying axis of an array; see this NumPy issue for related discussion.
In the case where we compute the sum of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, and we're computing the sum along axis 0, so the naive summation algorithm is used and we get poor results. But if we only ask for the sum of column "A", we get the accurate sum result, computed using pairwise summation.
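You can see the same effect with plain NumPy, taking Pandas out of the picture entirely (a small sketch reusing the s array defined above; exact results may differ slightly across NumPy builds):
>>> s.sum()                              # 1-D, summed along the fastest-varying axis: pairwise summation
100000000.0
>>> np.column_stack([s, s]).sum(axis=0)  # 2-D, summed along axis 0: naive accumulation
array([16777216., 16777216.], dtype=float32)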
In sum, when working with DataFrames of this size, you want to be careful to (a) work with double precision rather than single precision whenever possible, and (b) be prepared for differences in output results due to NumPy making different algorithm choices.
You can lose precision with np.float32 relative to np.float64
np.finfo(np.float32)
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
And
np.finfo(np.float64)
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
A contrived example
df = pd.DataFrame(dict(
    x=[-60499999.315, 60500002.685] * int(2e7),
    y=[-60499999.315, 60500002.685] * int(2e7),
    z=[-60499999.315, 60500002.685] * int(2e7),
)).astype(dict(x=np.float64, y=np.float32, z=np.float32))
print(df.sum()[['y', 'z']], df[['y', 'z']].sum(), sep='\n\n')
y 80000000.0
z 80000000.0
dtype: float64
y 67108864.0
z 67108864.0
dtype: float32
Related
I have a DataFrame as below, and I want to make it a NumPy array.
When I use the df.values command it does produce a NumPy array, but all the columns are converted to float. I checked the df.values documentation but it was not helpful; can I keep the same datatypes as in the DataFrame?
Thanks in advance for your help.
High Low ... Volume Adj Close
Date ...
2018-12-20 2509.629883 2441.179932 ... 5585780000 2467.419922
2018-12-21 2504.409912 2408.550049 ... 7609010000 2416.620117
2018-12-24 2410.340088 2351.100098 ... 2613930000 2351.100098
2018-12-26 2467.760010 2346.580078 ... 4233990000 2467.699951
2018-12-27 2489.100098 2397.939941 ... 4096610000 2488.830078
2018-12-28 2520.270020 2472.889893 ... 3702620000 2485.739990
2018-12-31 2509.239990 2482.820068 ... 3442870000 2506.850098
2019-01-02 2519.489990 2467.469971 ... 3733160000 2510.030029
Numpy arrays have a uniform data type, as you can see from the documentation:
class numpy.ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None)
An array object represents a multidimensional, homogeneous array of fixed-size items. An associated data-type object describes the format of each element in the array (its byte-order, how many bytes it occupies in memory, whether it is an integer, a floating point number, or something else, etc.)
When you use df.values, Pandas casts all values to a single common datatype to keep the array homogeneous.
The documentation for pandas.DataFrame.values also mentions this:
Notes
The dtype will be a lower-common-denominator dtype (implicit
upcasting); that is to say if the dtypes (even of numeric types) are
mixed, the one that accommodates all will be chosen. Use this with
care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to
float32. If dtypes are int32 and uint8, dtype will be upcast to int32.
By numpy.find_common_type() convention, mixing int64 and uint64 will
result in a float64 dtype.
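As a small illustration of that lowest-common-denominator rule, here is a minimal sketch (not the OP's data) mixing an int64 and a float64 column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [2467.42, 2416.62],          # float64
                   'volume': [5585780000, 7609010000]})  # int64
print(df.dtypes)        # price: float64, volume: int64
print(df.values.dtype)  # float64 -- the one dtype that accommodates both columns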
You can do it with NumPy structured arrays.
I will create a DataFrame with only 2 rows and 2 columns similar to yours to demonstrate how you can do it with any size of DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({'High': [2509.629883, 2504.409912],
                   'Volume': [5585780000, 7609010000]},
                  index=np.array(['2018-12-20', '2018-12-21'], dtype='datetime64[D]'))
Then you create an empty NumPy array, defining what datatype each column must have. In my example I only have 2 rows, so the array will also have only 2 rows, as follows:
array = np.empty(2, dtype={'names':('col1', 'col2', 'col3'),
'formats':('datetime64[D]', 'f8', 'i8')})
array['col1'] = df.index
array['col2'] = df['High']
array['col3'] = df['Volume']
and the array will look like:
array([('2018-12-20', 2509.629883, 5585780000),
('2018-12-21', 2504.409912, 7609010000)],
dtype=[('col1', '<M8[D]'), ('col2', '<f8'), ('col3', '<i8')])
You can also create a np.recarray using np.rec.array. This is almost identical to a structured array, with one extra feature: you can access fields as attributes, i.e. array.col1 instead of array['col1']. However, NumPy record arrays are apparently slower than structured arrays!
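For completeness, a small sketch of the record-array variant, reusing the structured array built above:
rec = np.rec.array(array)   # wrap the structured array as a recarray
print(rec.col2)             # attribute-style access instead of rec['col2']
print(rec['col3'])          # dictionary-style access still works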
I am starting with a large matrix which I convert to a pandas DataFrame, allowing pandas to infer the data types of the columns.
The columns are inferred as float64, but I am subsequently able to downcast these columns to float32 using the pandas to_numeric function without a loss of precision.
Why is pandas inefficiently inferring the columns as float64 if they are able to be downcast to float32 without a loss of precision?
a = np.matrix('0.1 0.2; 0.3 0.4')
a_df = pd.DataFrame(list(map(np.ravel, a)), dtype=None)
print(a_df.dtypes)
# the columns are float64
genotype_data_df = a_df.apply(pd.to_numeric, downcast='float')
# the columns are now float32
I am assuming that there is an underlying technical or practical reason why the library is implemented in this way? If so I am expecting an answer which would explain why this is the case.
Why is pandas inefficiently inferring the columns as int64
It's not clear to me that the cast to int64 is inefficient. It is simply the default dtype for numeric values, and using it avoids having to examine every value in a column up front in order to assign a narrower dtype, and then re-cast to a higher precision later if needed.
Why did they implement it that way instead of, say, a smaller integer type or float32? Because if any value in the column exceeds that default precision, then the entire column needs to be re-cast to a greater precision, and to do that would require examining every single value in the column. So it is cheaper to just assume the higher precision from the start, rather than examine every value and re-cast.
Of course this may not seem "optimal", but this is a tradeoff you have to make if you're not able to specify a dtype for the constructor.
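For example, if you already know the precision you need, one way around the default is to pass the dtype to the constructor yourself (a minimal sketch, not from the original answer):
import numpy as np
import pandas as pd

data = np.array([[0.1, 0.2], [0.3, 0.4]])
a_df = pd.DataFrame(data, dtype=np.float32)
print(a_df.dtypes)   # both columns are float32 -- no inference, no later downcast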
they are able to be downcast to int32 without a loss of precision?
You're mistaken about this. There is apparently no loss of precision, but if you check genotype_data_df.dtypes you'll see that the columns haven't actually been cast to a lower-precision integer type; in fact they remain float64.
>>> a = np.matrix('0.1 0.2; 0.3 0.4')
>>> a_df = pd.DataFrame(list(map(np.ravel, a)), dtype=None)
>>> genotype_data_df = a_df.apply(pd.to_numeric, downcast='integer')
>>> genotype_data_df.dtypes
0    float64
1    float64
dtype: object
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like these columns to stay int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
)
This capability has been added to pandas beginning with version 0.24.
At this point, it requires the use of extension dtype 'Int64' (capitalized), rather than the default dtype 'int64' (lowercase).
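A minimal sketch of what that looks like (in pandas 0.24 the missing value displays as NaN; in later versions it displays as <NA>):
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s.dtype)   # Int64 -- the nullable integer extension dtype
print(s)         # the missing entry stays missing, the others stay integers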
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)))
Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then inserts asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
If you are trying to convert a float vector (e.g. 1.143) to an integer one (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. To solve this, you have to round the numbers first and then do .astype('Int64'):
s1 = pd.Series([1.434, 2.343, np.nan])
#without round() the next line returns an error
s1.astype('Int64')
#cannot safely cast non-equivalent float64 to int64
##with round() it works
s1.round().astype('Int64')
0 1
1 2
2 NaN
dtype: Int64
My use case is that I have a float series that I want to round to int, but .round() on its own still returns a float, so you need to convert to an integer dtype to drop the decimal part.
This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 to represent NaN:
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows the proper 'native' column type to be used; operations like subtraction, comparison, etc. work as expected.
Pandas v0.24+
Functionality to support NaN in integer series will be available in v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.
The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) give the reason why integer series are upcast to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
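A quick way to see those promotion rules in action (a minimal sketch; this is the behaviour of the classic NumPy-backed dtypes, not the nullable extension dtypes):
import numpy as np
import pandas as pd

print(pd.Series([1.0, np.nan]).dtype)   # float64  (floating: no change)
print(pd.Series([1, np.nan]).dtype)     # float64  (integer: cast to float64)
print(pd.Series([True, np.nan]).dtype)  # object   (boolean: cast to object)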
New for Pandas v1.00 +
You do not (and cannot) use numpy.nan any more.
Now you have pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
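A small sketch showing the pandas.NA-backed integer array (pandas 1.0+; the repr details may vary slightly by version):
import pandas as pd

arr = pd.array([1, 2, None], dtype='Int64')
print(arr)               # <IntegerArray> [1, 2, <NA>]
print(arr[2] is pd.NA)   # True -- the missing value is pandas.NA, not numpy.nan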
If there are blanks in the text data, columns that would normally be integers will be cast to float64, because the int64 dtype cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which will end up as float64) and others without (which will end up as int64).
This code will attempt to convert any numeric columns to Int64 (as opposed to int64), since Int64 can handle nulls:
import pandas as pd
import numpy as np
#show datatypes before transformation
mydf.dtypes
for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('casted {} as Int64'.format(c))
    except:
        print('could not cast {} to Int64'.format(c))
#show datatypes after transformation
mydf.dtypes
This is now possible, since pandas v0.24.0.
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
I know that the OP asked about NumPy or Pandas only, but I think it is worth mentioning Polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.
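For example, a minimal sketch with Polars (assuming polars is installed; the exact repr may vary by version):
import polars as pl

s = pl.Series("ints", [1, 2, None])
print(s.dtype)         # Int64 -- the column stays an integer column
print(s.null_count())  # 1 -- the missing entry is a null, not a float NaN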
I'm currently learning how to use Pandas, and I'm in a situation where I'm attempting to replace missing data (Horsepower feature) using a best-fit line generated from linear regression with the Displacement column. What I'm doing is iterating through only the parts of the dataframe that are marked as NaN in the Horsepower column and replacing the data by feeding in the value of Displacement in that same row into the best-fit algorithm. My code looks like this:
for row, value in auto_data.HORSEPOWER[pd.isnull(auto_data.HORSEPOWER)].iteritems():
auto_data.HORSEPOWER[row] = int(round(slope * auto_data.DISPLACEMENT[row] + intercept))
Now, the code works and the data is replaced as expected, but it generates the SettingWithCopyWarning when I run it. I understand why the warning is generated, and that in this case I'm fine, but if there is a better way to iterate through the subset, or a method that's just more elegant, I'd rather avoid chained indexing that could cause a real problem in the future.
I've looked through the docs, and through other answers on Stack Overflow. All solutions to this seem to use .loc, but I just can't seem to figure out the correct syntax to get the subset of NaN rows using .loc. Any help is appreciated. If it helps, the dataframe looks like this:
auto_data.dtypes
Out[15]:
MPG float64
CYLINDERS int64
DISPLACEMENT float64
HORSEPOWER float64
WEIGHT int64
ACCELERATION float64
MODELYEAR int64
NAME object
dtype: object
IIUC you should be able to just do:
auto_data.loc[auto_data['HORSEPOWER'].isnull(), 'HORSEPOWER'] = np.round(slope * auto_data['DISPLACEMENT'] + intercept)
The above is vectorised and avoids looping; the warning you get comes from doing this:
auto_data.HORSEPOWER[row]
I think if you did:
auto_data.loc[row,'HORSEPOWER']
then the warning should not be raised
Instead of looping through the DataFrame row-by-row, it would be more efficient to calculate the extrapolated values in a vectorized way for the entire column:
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
and then use update to replace the NaN values:
auto_data['HORSEPOWER'].update(y)
Using update works for the particular case of replacing NaN values.
Ed Chum's solution shows how to replace the value in arbitrary rows by using a boolean mask and auto_data.loc.
For example,
import numpy as np
import pandas as pd
auto_data = pd.DataFrame({
    'HORSEPOWER': [1, np.nan, 2],
    'DISPLACEMENT': [3, 4, 5]})
slope, intercept = 2, 0.5
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
auto_data['HORSEPOWER'].update(y)
print(auto_data)
yields
DISPLACEMENT HORSEPOWER
0 3 6
1 4 8
2 5 10