I have a Pandas dataframe, with 4 rows, and one of the columns (named limit) contains floating point values, where any zeros must be replaced with 9999999999 (9.999999999 billion). The column is set to the float32 data type, and I use the pandas.DataFrame.where method to do the replacement. But it's not working as expected because Numpy is rounding up 9999999999 to 10000000000 (10 billion).
I've tried this in iPython 3 (Python 3.6.8), Pandas version 0.24.2, Numpy version 1.14.0.
This is the replacement statement
df['limit'] = df['limit'].where(df['limit'] != 0, 9999999999)
I'm seeing the following column values for limit:
0 1.000000e+10
1 1.000000e+10
2 1.000000e+10
3 1.000000e+10
but I'm expecting
0 9999999999.0
1 9999999999.0
2 9999999999.0
3 9999999999.0
Why does the rounding up happen? This doesn't happen with plain Python
In [1]: (9.999999999) * 10**9
Out[1]: 9999999999.0
This is simply because a 32-bit type is not capable of preserving that number. You can check this by calculating the number of bits needed to represent it:
In [24]: np.floor(np.log2(9999999999)) + 1
Out[24]: 34.0
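As you can see, you need at least 34 bits to represent that number. A float32 has only a 24-bit significand, so the value is rounded to the nearest number it can represent, which is 10000000000. Use a 64-bit type instead (float64, or int64 if the column can hold integers).
Even if you test this by putting the number in an integer series of the same width, you'll see an unexpected result again (this time an overflow):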
In [25]: s = pd.Series([9999999999], dtype=pd.np.int32)
In [26]: s
Out[26]:
0 1410065407
dtype: int32
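A minimal sketch of the same check using only the standard library, in case it helps to see the rounding outside pandas. The helper name to_float32 is made up for this illustration; it emulates float32 storage by round-tripping through the struct module's 4-byte float format.

```python
import struct

def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

n = 9999999999

# The integer needs 34 bits; float32 keeps only 24 significant bits.
print(n.bit_length())          # 34

# Stored as float32, the value is rounded to the nearest representable number.
print(to_float32(float(n)))    # 10000000000.0

# A 64-bit float has a 53-bit significand, so the value survives unchanged.
print(float(n))                # 9999999999.0

# Emulated int32 wrap-around, matching the Series output above.
wrapped = (n + 2**31) % 2**32 - 2**31
print(wrapped)                 # 1410065407
```

In pandas terms this suggests converting the column first, for example with df['limit'].astype('float64'), before calling where.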
My task is to read data from excel to dataframe. The data is a bit messy and to clean that up I've done:
df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]':'good_name',
'Штрихкод':'barcode',
'Цена шт. руб.':'price',
'Остаток': 'balance'
})
df_1 = df_1[new_columns]
# I don't know why, but the code doesn't work unless NaN is first replaced with another character
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()
It returns the barcode column with dtype float64 (why?):
0 0.000000e+00
1 7.613037e+12
2 7.613037e+12
3 7.613034e+12
4 7.613035e+12
Name: barcode, dtype: float64
Then I try to convert that column to integer.
df_1.barcode = df_1.barcode.astype(int)
But I keep getting silly negative numbers.
df_1.barcode[0:5]
0 0
1 -2147483648
2 -2147483648
3 -2147483648
4 -2147483648
Name: barcode, dtype: int32
Thanks to @Will and @micric, I eventually got a solution.
df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replacing NaN with 0, it'll help to convert the column explicitly to dtype integer
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')
Summary:
pd.to_numeric returns float64 when the data contains NaN, so a column with both NaN and non-NaN values ends up with dtype float64.
Check the size of the numbers you're dealing with. int32 has a limit: its largest value is 2**31 - 1 = 2147483647.
Thanks a lot for your help, guys!
That number is the lower limit of a 32-bit integer. Your values are outside the int32 range you are trying to use, so the conversion returns the limit instead (notice that 2**32 = 4294967296, and half of that is 2147483648, which is your number without the sign).
You should use astype('int64') instead.
I ran into the same problem as OP, using
astype(np.int64)
solved mine, see the link here.
I like this solution because it's consistent with how I usually change the dtype of a pandas column. Maybe someone could compare the performance of these solutions.
Many questions in one.
So your expected dtype...
pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
With downcast='integer', pd.to_numeric would normally give you an integer dtype. However, you have NaNs in your data, and pandas needs a float64 dtype to represent NaN.
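Putting the answers in this thread together, here is a hedged sketch of one way to get an integer barcode column. The sample values are invented for illustration, and the step that maps an empty string to "0" is an extra assumption for rows that contain no digits at all.

```python
import pandas as pd

# Made-up sample: a clean barcode, a missing value, a messy one, and junk.
raw = pd.Series(["7613037000000", None, "76-13034 000001", "n/a"])

digits = (
    raw.fillna("0")                              # no NaN left, so no float64 detour
       .astype(str)
       .str.replace(r"[^0-9]", "", regex=True)   # keep digits only
       .replace("", "0")                         # rows that had no digits at all
)

# int64 is wide enough for 13-digit barcodes; int32 is not.
barcode = digits.astype("int64")

print(barcode.dtype)
print(barcode.tolist())
```

Because the NaNs are filled before any numeric conversion, the column never passes through float64, and asking for int64 explicitly avoids the int32 default that some platforms use for astype(int).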
I am trying to subtract two columns in a dataframe, but I get the same result for every row. Why?
Here is my data:
a b
0 0.35805 -0.01315
1 0.35809 -0.01311
2 0.35820 -0.01300
3 0.35852 -0.01268
I tried the approach suggested here, but it repeats the same result in every row.
This looks more like a precision issue. I always use decimal for this:
from decimal import Decimal
df.z.map(Decimal)-df.dist.map(Decimal)
Out[189]:
0 0.3711999999999999796246319406
1 0.3712000000000000195232718880
2 0.3712000000000000177885484121
3 0.3712000000000000056454840802
dtype: object
I think this will work fine
df['a-b'] = df['a']-df['b']
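For what it's worth, with the four sample rows shown in the question the difference a - b really is the same in every row, so a repeated result is what you'd expect for this data. A small sketch with the standard library, building Decimal values from the printed strings (an assumption, since the real columns may carry more digits than the printout shows):

```python
from decimal import Decimal

# Values copied from the sample data in the question, kept as strings
# so that Decimal sees exactly the digits that were printed.
a = ["0.35805", "0.35809", "0.35820", "0.35852"]
b = ["-0.01315", "-0.01311", "-0.01300", "-0.01268"]

exact_diffs = [Decimal(x) - Decimal(y) for x, y in zip(a, b)]
float_diffs = [float(x) - float(y) for x, y in zip(a, b)]

print(exact_diffs)   # every entry equals 0.3712
print(float_diffs)   # the same values, up to floating-point noise
```

Note that Decimal(float_value), as used with map(Decimal) above, exposes the binary representation error of each float, which is where the long digit tails in that output come from.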
I have a large DataFrame (circa 4e+07 rows).
When summing it, I get two significantly different results depending on whether I do the sum before or after the column selection.
Also, the type changes from float32 to float64 even though totals are all below 2**31
df[[col1, col2, col3]].sum()
Out[1]:
col1 9.36e+07
col2 1.39e+09
col3 6.37e+08
dtype: float32
df.sum()[[col1, col2, col3]]
Out[2]:
col1 1.21e+08
col2 1.70e+09
col3 7.32e+08
dtype: float64
I am obviously missing something, has anybody had the same issue?
Thanks for your help.
To understand what's going on here, you need to understand what Pandas is doing under the hood. I'm going to simplify a bit, since there are lots of bells and whistles and special cases to consider, but roughly it looks like this:
Suppose you've got a Pandas DataFrame object df with various numeric columns (we'll ignore datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:
1. Extracts the values of the dataframe into a two-dimensional NumPy array.
2. Applies the NumPy sum function to that 2d array with axis=0 to compute the column sums.
It's the first step that's important here. The columns of a DataFrame might have different dtypes, but a 2d NumPy array can only have a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose a single dtype that's appropriate for both columns simultaneously, and in this case it chooses float64. So when the sum is computed, it's computed on double-precision values, using double-precision arithmetic. This is what's happening in your second example.
On the other hand, if you cut down to just the float32 columns in the first place, then Pandas can and will use the float32 dtype for the 2d NumPy array, and so the sum computation is performed in single precision. This is what's happening in your first example.
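The dtype choice described above can be checked with NumPy's own promotion rules. This is only a sketch of the rule itself; how a particular pandas version assembles the array internally may differ.

```python
import numpy as np

# Common dtype for a mix of float32 and int32 columns.
mixed = np.result_type(np.float32, np.int32)

# Common dtype when only float32 columns are involved.
single = np.result_type(np.float32, np.float32)

print(mixed)    # float64
print(single)   # float32
```

float32 cannot hold every int32 value exactly, which is why the mixed case is promoted to float64.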
Here's a simple example showing this in action: we'll set up a DataFrame with 100 million rows and three columns, of dtypes float32, float32 and int32 respectively. All the values are ones:
>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
A B C
0 1.0 1.0 1
1 1.0 1.0 1
2 1.0 1.0 1
3 1.0 1.0 1
4 1.0 1.0 1
>>> df.dtypes
A float32
B float32
C int32
dtype: object
Now when we compute the sums directly, Pandas first turns everything into float64s. The computation is also done using the float64 type, for all three columns, and we get an accurate answer.
>>> df.sum()
A 100000000.0
B 100000000.0
C 100000000.0
dtype: float64
But if we first cut down our dataframe to just the float32 columns, then float32-arithmetic is used for the sum, and we get very poor answers.
>>> df[['A', 'B']].sum()
A 16777216.0
B 16777216.0
dtype: float32
The inaccuracy is of course due to using a dtype that doesn't have enough precision for the task in question: at some point in the summation, we end up repeatedly adding 1.0 to 16777216.0, and getting 16777216.0 back each time, thanks to the usual floating-point problems. The solution is to explicitly convert to float64 yourself before doing the computation.
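To see the stall at 16777216.0 in isolation, here is a small standard-library sketch. The helper to_float32 is made up for this illustration; it emulates float32 by rounding through the struct module's 4-byte float format after every addition.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

acc32 = 16777210.0   # accumulator rounded to float32 after each step
acc64 = 16777210.0   # ordinary double-precision accumulator

for _ in range(20):
    acc32 = to_float32(acc32 + 1.0)
    acc64 = acc64 + 1.0

print(acc32)   # 16777216.0, stuck at 2**24
print(acc64)   # 16777230.0
```

Above 2**24 the gap between neighbouring float32 values is 2, so adding 1.0 lands exactly halfway and rounds back down every time.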
However, this isn't quite the end of the surprises that Pandas has in store for us. With the same dataframe as above, let's try just computing the sum for column "A":
>>> df[['A']].sum()
A 100000000.0
dtype: float32
Suddenly we're getting full accuracy again! So what's going on? This has little to do with dtypes: we're still using float32 to do the summation. It's now the second step (the NumPy summation) that's responsible for the difference. What's happening is that NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with float32 dtype and the size arrays that we're using, that accuracy can make a hugely significant difference to the final result. However, it only uses that algorithm when summing along the fastest-varying axis of an array; see this NumPy issue for related discussion. In the case where we compute the sum of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, and we're computing the sum along axis 0, so the naive summation algorithm is used and we get poor results. But if we only ask for the sum of column "A", we get the accurate sum result, computed using pairwise summation.
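The difference between the two summation strategies can be reproduced at a much smaller scale. The sketch below is a scaled-down analogue, not NumPy's actual implementation: it emulates half precision (struct format "e"), where the stall happens at 2048 instead of 16777216, and uses a textbook recursive pairwise sum. All function names are made up for the illustration.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def naive_sum(values):
    """Left-to-right accumulation, rounding after every addition."""
    acc = 0.0
    for v in values:
        acc = to_float16(acc + v)
    return acc

def pairwise_sum(values):
    """Split in half, sum each half recursively, then add the two results."""
    if len(values) <= 1:
        return values[0] if values else 0.0
    mid = len(values) // 2
    return to_float16(pairwise_sum(values[:mid]) + pairwise_sum(values[mid:]))

ones = [1.0] * 4096

print(naive_sum(ones))      # 2048.0
print(pairwise_sum(ones))   # 4096.0
```

The naive loop keeps adding a tiny value to a large accumulator, while the pairwise version only ever adds partial sums of similar size, so far less is lost to rounding.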
In sum, when working with DataFrames of this size, you want to be careful to (a) work with double precision rather than single precision whenever possible, and (b) be prepared for differences in output results due to NumPy making different algorithm choices.
You can lose precision with np.float32 relative to np.float64
np.finfo(np.float32)
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
And
np.finfo(np.float64)
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
A contrived example
df = pd.DataFrame(dict(
x=[-60499999.315, 60500002.685] * int(2e7),
y=[-60499999.315, 60500002.685] * int(2e7),
z=[-60499999.315, 60500002.685] * int(2e7),
)).astype(dict(x=np.float64, y=np.float32, z=np.float32))
print(df.sum()[['y', 'z']], df[['y', 'z']].sum(), sep='\n\n')
y 80000000.0
z 80000000.0
dtype: float64
y 67108864.0
z 67108864.0
dtype: float32
When shifting a column of integers, I know how to fix my column when Pandas automatically converts the integers to floats because of the presence of a NaN.
I basically use the method described here.
However, if the shift introduces a NaN thereby converting all integers to floats, there's some rounding that happens (e.g. on epoch timestamps) so even recasting it back to integer doesn't replicate what it was originally.
Any way to fix this?
Example Data:
pd.DataFrame({'epochee':[1495571400259317500,1495571400260585120,1495571400260757200, 1495571400260866800]})
Out[19]:
epoch
0 1495571790919317503
1 1495999999999999999
2 1495571400265555555
3 1495571400267777777
Example Code:
df['prior_epochee'] = df['epochee'].shift(1)
df.dropna(axis=0, how='any', inplace=True)
df['prior_epochee'] = df['prior_epochee'].astype(int)
Resulting output:
Out[22]:
epoch prior_epoch
1 1444444444444444444 1400000000000000000
2 1433333333333333333 1490000000000000000
3 1777777777777777777 1499999999999999948
Because you know what happens when an int column is cast to float because of np.nan, and you know that you don't want the np.nan rows anyway, you can do the shift yourself with NumPy:
df[1:].assign(prior_epoch=df.epoch.values[:-1])
epoch prior_epoch
1 1495571400260585120 1495571400259317500
2 1495571400260757200 1495571400260585120
3 1495571400260866800 1495571400260757200
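As a side note on why casting back to int cannot recover the original values: a float64 has a 53-bit significand, and nanosecond epoch timestamps need about 61 bits. A quick check in plain Python, using the first value from the example data:

```python
epoch = 1495571400259317500

# int -> float64 -> int, the same detour the shifted column takes
round_trip = int(float(epoch))

print(epoch.bit_length())         # 61
print(round_trip == epoch)        # False
print(abs(round_trip - epoch))    # small, but not zero
print(2**53)                      # 9007199254740992, largest range where every integer is exact
```

Any approach that keeps the column in an integer dtype throughout, like the NumPy slice above, avoids this loss.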
I have a large Pandas dataframe (a subclass of Numpy ndarray for most purposes) containing binary strings (0s and 1s). I need to find the positions of all the zeros in these strings and then label them. Also, I expect the positions of the zeros to be relatively sparse (~1% of all bit positions).
Basically, I want to run something like this:
import pandas as pd
x = pd.Series([ '11101110', '11111101' ], ) # start with strings
x = pd.Series([ 0b11101110, 0b11111101 ], ) # ... or integers of a known bit length
zero_positions = find_zero_positions( x )
Yielding zero_positions =...
value
row bit
0 4 0
0 0
1 1 0
I've tried a few different ways to do this, but haven't come up with anything better than looping through one row at a time. (EDIT: The actual strings I want to look at are much longer than the 8-bit examples here, so a lookup table won't work.)
I'm not sure whether it will be more efficient to approach this as a string problem (Pandas's Vectorized string methods don't offer a substring-position-finding method) or a numeric problem (using something like numpy.unpackbits, maybe?).
You could use numpy.unpackbits as follows, starting with an ndarray of this form:
In [1]: x = np.array([[0b11101110], [0b11111101]], dtype=np.uint8)
In [2]: x
Out[2]:
array([[238],
[253]], dtype=uint8)
In [3]: df = pd.DataFrame(np.unpackbits(x, axis=1))
In [4]: df.columns = df.columns[::-1]
In [5]: df
Out[5]:
7 6 5 4 3 2 1 0
0 1 1 1 0 1 1 1 0
1 1 1 1 1 1 1 0 1
Then from the DataFrame, just stack and find the zeros:
In [6]: s = df.stack()
In [7]: s.index.names = ['row', 'bit']
In [8]: s[s == 0]
Out[8]:
row bit
0 4 0
0 0
1 1 0
dtype: uint8
I think this would be a reasonably efficient method.
One good solution would be to split the input into smallish chunks and use that in a memoized lookup table (where you compute the first time through).
E.g., if each number/array is 128 bits, break it into eight 16-bit parts that are looked up in a table. At worst, the lookup table needs 2**16 = 65536 entries, but if zeros are very sparse (e.g., at most two zeros in any group of 8 bits) it only needs about 64. Depending on how sparse the zeros are, you can increase the size of the chunk.
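A rough sketch of the memoized chunk lookup, using functools.lru_cache as the table so entries are only computed the first time a chunk pattern is seen. The names CHUNK_BITS, zero_bits_in_chunk and zero_positions are made up for this example, and bit 0 is taken to be the least significant bit.

```python
from functools import lru_cache

CHUNK_BITS = 16  # chunk width; an assumption, tune it to how sparse the zeros are

@lru_cache(maxsize=None)
def zero_bits_in_chunk(chunk):
    """Positions (0 = least significant) of the zero bits in one chunk."""
    return tuple(b for b in range(CHUNK_BITS) if not (chunk >> b) & 1)

def zero_positions(value, bit_length):
    """Zero-bit positions of `value`, treated as `bit_length` bits wide."""
    mask = (1 << CHUNK_BITS) - 1
    positions = []
    for shift in range(0, bit_length, CHUNK_BITS):
        chunk = (value >> shift) & mask
        # Drop positions past the declared width (padding in the last chunk).
        positions.extend(
            shift + b for b in zero_bits_in_chunk(chunk) if shift + b < bit_length
        )
    return positions

print(zero_positions(0b11101110, 8))   # [0, 4]
print(zero_positions(0b11111101, 8))   # [1]

# A 128-bit value with only bits 3 and 100 cleared.
wide = ((1 << 128) - 1) & ~(1 << 100) & ~(1 << 3)
print(zero_positions(wide, 128))       # [3, 100]
```

With sparse zeros, most chunks are all ones and hit the same cached entry, so the table stays far smaller than the worst case.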
In the "yuck" department, I would like to enter the following contestant:
def numpyToBinString(numpyValue):
    return "".join([str((numpyValue[0] >> shiftLength) & 1) for shiftLength in range(numpyValue.dtype.itemsize * 8)])
Works for shape (,) ndarrays, but could be extended with the @vectorize decorator.
You can use a lookup table.
Create a table that has the 0 positions for each number from 0-255 and a function to access it, call it zeroBitPositions, this returns a list.
Then, assuming that you are storing your numbers as a Python long (which, I believe, has unlimited precision), you can do the following:
allZeroPositions = []
shift = 0
# walk through num one byte at a time, starting from the least significant byte
while (num >> shift) > 0:
    allZeroPositions += [x + shift for x in zeroBitPositions((num >> shift) & 0xFF)]
    shift += 8
Hopefully this is a good start.