I've got a pandas DataFrame with a float (decimal) index which I use to look up values (similar to a dictionary). Since floats are not stored exactly, I multiplied everything by 1000 and converted to integers with .astype(int) before setting it as the index. However, this seems to floor instead of round, so 1.999999999999999992 is converted to 1 instead of 2. Rounding with the pandas.DataFrame.round() method beforehand does not avoid this problem, as the values are still stored as floats.
The original idea (which obviously raises a KeyError) was this:
import numpy as np
import pandas as pd

idx = np.arange(1, 3, 0.001)
s = pd.Series(range(2000))
s.index = idx
print(s[2.022])  # KeyError: the stored float label is not exactly 2.022
Trying to convert to integers:
idx_int = idx * 1000
idx_int = idx_int.astype(int)  # truncates toward zero instead of rounding
s.index = idx_int
for i in range(1000, 3000):
    print(s[i])
The output is always a bit random, as the 'real' value of an integer can be slightly above or below the wanted value. In this case the index contains the value 1000 twice and does not contain the value 2999.
You are right, astype(int) does a conversion toward zero:
‘integer’ or ‘signed’: smallest signed int dtype
from the pandas.to_numeric documentation (which is linked from astype() for numeric conversions).
If you want to round, you need to do a float round, and then convert to int:
df.round(0).astype(int)
Use other rounding functions according to your needs.
the output is always a bit random as the 'real' value of an integer can be slightly above or below the wanted value
Floats are able to represent whole numbers exactly (up to 2**53 for float64), making a conversion after round(0) lossless and non-risky.
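Applied to the question's example, a minimal sketch of that fix (round the scaled floats, then cast):

import numpy as np
import pandas as pd

idx = np.arange(1, 3, 0.001)
s = pd.Series(range(2000))
# round first, then cast: 2021.999... becomes 2022 instead of 2021
s.index = np.round(idx * 1000).astype(int)
print(s[2022])  # 1022, the value that sat at the intended float index 2.022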
If I understand right, you could just perform the rounding operation followed by converting to an integer:
s1 = pd.Series([1.2,2.9])
s1 = s1.round().astype(int)
Which gives the output:
0 1
1 3
dtype: int32
In case the data frame contains both numeric and non-numeric values and you only want to touch the numeric fields:
df = df.applymap(lambda x: int(round(x, 0)) if isinstance(x, (int, float)) and not pd.isna(x) else x)  # skip NaN: int(nan) raises
There is the potential that NaN (a float type) exists in the dataframe, so an alternative solution is: df.fillna(0).astype('int')
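A minimal sketch combining both ideas on a hypothetical frame (the column names num and label are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1.5, np.nan, 2.4], 'label': ['a', 'b', 'c']})
# plain int64 cannot hold NaN, so fill it first, then round, then cast
df['num'] = df['num'].fillna(0).round().astype(int)
print(df['num'].tolist())  # [2, 0, 2]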
I have a dataframe that looks like this:
df = pd.DataFrame([0.70,1.0,0.75,0.0,5.0], columns=['pitch'])
I want to convert it into
df = pd.DataFrame([0.7,1,0.75,0,5], columns=['pitch'])
If I convert the float to int, 0.7 will become 0.
How can I solve this problem? Thanks!
You can use the astype(int) function in conjunction with the round() function on the dataframe. This will round each value in the 'pitch' column to the desired number of decimal places and then convert those rounded values to integers.
df['pitch'] = df['pitch'].round(1).astype(int)
or use round(0) to remove the decimal places completely:
df['pitch'] = df['pitch'].round(0).astype(int)
Be careful using astype(int) on its own to convert to int: it truncates toward zero rather than rounding, which is why the round() call beforehand matters.
You can use the round() function to round the decimal values to the desired number of decimal places before converting them to integers. For example, you can use round(df['pitch'], 1) to round the values in the 'pitch' column to one decimal place, then use astype(int) to convert them to integers.
df['pitch'] = round(df['pitch'], 1).astype(int)
How can I retain only 2 decimals for each value in a pandas Series? (I'm working with latitudes and longitudes.) The dtype is float64.
series = [-74.002568, -74.003085, -74.003546]
I tried using the round function but, as the name suggests, it rounds. I looked into trunc(), but this can only remove all decimals. Then I figured, why not try running a for loop. I tried the following:
for i in series:
    i = "{0:.2f}".format(i)  # rebinds the loop variable only; the list is untouched
I was able to run the code without any errors but it didn't modify the data in any way.
Expected output would be the following:
[-74.00, -74.00, -74.00]
Anyone knows how to achieve this? Thanks!
series = [-74.002568, -74.003085, -74.003546]
["%0.2f" % (x,) for x in series]
['-74.00', '-74.00', '-74.00']
This converts your data to string/object dtype; it is just for display purposes. If you want to use the values for calculation, you can cast them back to float, but then only one decimal digit will be visible:
[float('{0:.2f}'.format(x)) for x in series]
[-74.0, -74.0, -74.0]
Here is one way to do it, assuming you meant a pandas.Series (the snippet defines only a plain list):

series = [-74.002568, -74.003085, -74.003546]
s = pd.Series(series)
# use regex extract to keep everything up to the first two decimal places (truncates, no rounding)
out = s.astype(str).str.extract(r"(.*\..{2})")[0]
out
0 -74.00
1 -74.00
2 -74.00
Name: 0, dtype: object
Change the display options. This shouldn't change your underlying data.
pd.options.display.float_format = "{:,.2f}".format
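A quick usage sketch; the option only changes how series and frames are printed:

import pandas as pd

pd.options.display.float_format = "{:,.2f}".format
s = pd.Series([-74.002568, -74.003085, -74.003546])
print(s)          # each row displays as -74.00
print(s.iloc[0])  # the stored value is still -74.002568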
This is the column (screenshots omitted), and when I try med_app['patient_id'].astype(int) it results in a negative output.
I want the output in this format: 2.987250e+13 to 29872499824296.
The max size of int32 is 2**31 - 1 = 2147483647 ≈ 2.147e9; a negative result is the classic symptom of overflowing it. If you want ints larger than this, you should use int64, which has max size 2**63 - 1 = 9223372036854775807 ≈ 9.2233e+18. When I test this myself, it automatically chooses int64 when I perform .astype(int) (on Windows the default C long is 32-bit, so .astype(int) can give int32 there), but you can be explicit and do .astype(np.int64) (when you import numpy as np). If you need to go even larger, you can use uint64, which goes all the way up to 2**64 - 1 ≈ 1.845e+19.
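A minimal sketch of the explicit cast (the single value is taken from the question):

import numpy as np
import pandas as pd

s = pd.Series([2.987250e+13])
print(s.astype(np.int64))  # 29872500000000 fits comfortably in int64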
I think if you check the dtype of the column without any type conversion, the underlying values may be fine; what you are seeing is just the pandas display format, which by default uses scientific notation for extremely large/small numbers.
If you wish to visualise the numbers as stored, you just need to set an appropriate pandas display option:
med_app[["patient_id"]].style.format("{:.0f}")
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like to be int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of the extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)))  # NaN rows stay NaN via index alignment
Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, 0, 1234567890, or some other dedicated sentinel value to represent NaN.
You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
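A minimal sketch of the string workaround, assuming a hypothetical column col:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})
# NaN rows survive via index alignment; the rest become strings
df.col = df.col.dropna().apply(lambda x: str(int(x)))
print(df.col.tolist())  # ['1', nan, '3'] -- object dtype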
In case you are trying to convert a float vector (e.g. 1.143) to an integer one (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. To solve this you have to round the numbers first and then do .astype('Int64'):
s1 = pd.Series([1.434, 2.343, np.nan])
# without round() the next line raises:
s1.astype('Int64')
# TypeError: cannot safely cast non-equivalent float64 to int64
# with round() it works:
s1.round().astype('Int64')
0 1
1 2
2 NaN
dtype: Int64
My use case is that I have a float series that I want to round to int, but .round() alone still leaves floats (e.g. 2.0), so you need the cast to actually remove the decimals.
This is not a solution for all cases, but for mine (genomic coordinates) I've resorted to using 0 as NaN:
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows the proper 'native' column type to be used, and operations like subtraction and comparison work as expected.
Pandas v0.24+
Functionality to support NaN in integer series is available from v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
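A minimal sketch of the nullable dtype (recent pandas):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s.dtype)  # Int64: the missing value no longer forces a float cast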
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to the inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be needed.
The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcasted to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass Promotion dtype for storing NAs
floating no change
object no change
integer cast to float64
boolean cast to object
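A quick demonstration of those promotion rules:

import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)        # int64
print(pd.Series([1, 2, np.nan]).dtype)   # float64: integer upcast to float
print(pd.Series([True, np.nan]).dtype)   # object: boolean upcast to object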
New for Pandas v1.0+
For the nullable integer dtype you do not use numpy.nan any more.
Now you have pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
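A minimal sketch showing the new missing value (pandas >= 1.0):

import pandas as pd

s = pd.Series([1, None], dtype='Int64')
print(s[1] is pd.NA)  # True: the missing value is pandas.NA, not numpy.nan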
If there are blanks in the text data, columns that would normally be integers will be cast to float64, because the int64 dtype cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which will end up as float64) and others without (which will end up as int64).
This code will attempt to convert any numeric columns to Int64 (as opposed to int64), since Int64 can handle nulls:
import pandas as pd
import numpy as np

# mydf is your existing DataFrame
# show datatypes before transformation
mydf.dtypes

for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('casted {} as Int64'.format(c))
    except (TypeError, ValueError):
        print('could not cast {} to Int64'.format(c))

# show datatypes after transformation
mydf.dtypes
This is now possible, since pandas v0.24.0:
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
I know that the OP asked for NumPy or Pandas only, but I think it is worth mentioning Polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.
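A minimal sketch, assuming polars is installed:

import polars as pl

s = pl.Series("ints", [1, 2, None])
print(s.dtype)  # Int64: the null stays null and the column stays integer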