Pandas Datatype Conversion issue - python

I have a pandas series that looks like this: a bunch of unicode strings
>>> some_id
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
Name: some_id, dtype: object
I can do the following but I lose the precision.
>>> some_id.convert_objects(convert_numeric=True)
0 4.007428e+17
1 4.007405e+17
2 3.988299e+17
3 3.988240e+17
4 3.987990e+17
Name: some_id, dtype: float64
But if I do some_id.astype(int), I get the following: ValueError: invalid literal for long() with base 10
How can I convert them to int or int64 type while preserving the precision?
I am using Pandas 0.16.2
UPDATE: I found the bug. some_id.astype(int) or any other form of it should work. Somewhere along the thousands of rows I have, some_id has a string of text (not a stringed number), so it was stopping the int64 conversion.
Thanks

Dagrha is right, you should be able to use:
some_id.astype(np.int64)
the type will then be :
In[40]: some_id.dtypes
Out[41]:
some_id int64
dtype: object
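As the question's update notes, a single non-numeric string among thousands of rows is enough to make astype fail. Here is a minimal sketch for locating such rows before converting, assuming the data is a Series of strings (the sample values and the names is_bad, bad_rows and clean_ids are illustrative, not taken from the original data):

```python
import numpy as np
import pandas as pd

# Illustrative data: numeric strings plus one stray text value.
some_id = pd.Series(
    ["400742773466599424", "398829879107809281",
     "not a number", "398823962966097921"],
    dtype=object,
)

# Coerce only to build a mask of values that cannot be parsed as numbers.
is_bad = pd.to_numeric(some_id, errors="coerce").isna()
bad_rows = some_id[is_bad]

# Convert the remaining strings directly, without going through float64.
clean_ids = some_id[~is_bad].astype(np.int64)

print(bad_rows)
print(clean_ids)
```

The coerced values are used only for the mask. The conversion itself runs on the original strings, so no value passes through float64 and no digits are lost. Note that pd.to_numeric also accepts strings such as '1.5', which astype(np.int64) would still reject.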

Original series of numbers:
s = pd.Series([400742773466599424, 400740479161352192, 398829879107809281,
398823962966097921, 398799036070653952], dtype=object)
>>> s
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: object
Simply converting using .astype(int) should be sufficient.
>>> s.astype(int)
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: int64
As an interesting side note (as pointed out by @Warren Weckesser and @DSM), you can lose precision due to floating-point representation. For example, int(1e23) gets represented as 99999999999999991611392L. I'm not sure if this was the precision to which you referred, or if you were merely talking about the displayed precision.
With your sample data above, two numbers would be off by one after a round trip through float:
>>> s.astype(np.int64) - s.astype(float).astype(np.int64)
0 0
1 0
2 1
3 1
4 0
dtype: int64
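The off-by-one results come from the spacing between adjacent float64 values at this magnitude. A small sketch, using two of the sample IDs (the names exact, via_float and spacing are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([400742773466599424, 398829879107809281], dtype=object)

exact = s.astype(np.int64)                    # direct conversion keeps every digit
via_float = s.astype(float).astype(np.int64)  # round trip through float64

difference = (exact - via_float).tolist()

# Distance from this value to the next representable float64.
spacing = np.spacing(float(398829879107809281))

print(difference)
print(spacing)
```

Near 4e17, float64 can only represent multiples of 64, so any ID that is not a multiple of 64 is rounded to the nearest one. The first sample ID happens to be a multiple of 64, which is why it survives the round trip unchanged.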

Related

Pandas Series - force dtype in Series constructor

I have this very simple series.
pd.Series(np.random.randn(10), dtype=np.int32)
I want to force a dtype, but pandas will overrule my initial setup:
Out[6]:
0 0.764638
1 -1.451616
2 -0.318875
3 -1.882215
4 1.995595
5 -0.497508
6 -1.004066
7 -1.641371
8 -1.271198
9 0.907795
dtype: float64
I know I could do this:
pd.Series(np.random.randn(10), dtype=np.int32).astype("int32")
But my question is: Why does pandas not handle the data how I want it in the Series constructor? There is no force parameter or something like that.
Can somebody explain to me what happens there, and how I can force the dtype in the Series constructor or at least get a warning if the output differs from what I wanted initially?
You can use this:
>>> pd.Series(np.random.randn(10).astype(np.int32))
0 0
1 1
2 1
3 1
4 0
5 0
6 -1
7 0
8 0
9 0
dtype: int32
Pandas infers the data type correctly. You can force your datatype, with one exception: if your data is float and you want to force the dtype to intX, this will not work, because pandas does not take responsibility for losing information by truncating the result.
That is why you see this behaviour.
>>> np.random.randn(10).dtype
dtype('float64')
>>> pd.Series(np.random.randn(10)).dtype
dtype('float64') # OK
>>> pd.Series(np.random.randn(10), dtype=np.int32).dtype
dtype('float64') # KO -> Pandas does not truncate the data
>>> np.random.randint(1, 10, 10).dtype
dtype('int64')
>>> pd.Series(np.random.randint(1, 10, 10)).dtype
dtype('int64') # OK
>>> pd.Series(np.random.randint(1, 10, 10), dtype=np.float64).dtype
dtype('float64') # OK -> float64 is a super set of int64
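If the loss of the fractional part is intended, it can be made explicit on the array before the Series is built. A minimal sketch with fixed illustrative values in place of np.random.randn (the names truncated and rounded are hypothetical):

```python
import numpy as np
import pandas as pd

values = np.array([0.76, -1.45, 1.99, -0.49])

# astype on the NumPy array truncates toward zero.
truncated = pd.Series(values.astype(np.int32))

# Rounding first is a different, equally explicit choice.
rounded = pd.Series(np.rint(values).astype(np.int32))

print(truncated)
print(rounded)
```

The constructor's handling of float data combined with an integer dtype may differ between pandas versions, so converting the array explicitly keeps the truncation visible either way.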

understand behavior of sqrt - giving different results when written different

I have pandas series that has the following numbers:
0 -1.309176
1 -1.226239
2 -1.339079
3 -1.298509
...
I'm trying to calculate the square root of each number in the series.
When I tried the whole series:
s**0.5
>>>
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10778 NaN
but if I take the numbers it works:
-1.309176**0.5
I also tried to slice the numbers from the series:
b1[0]**0.5
>>>
nan
So I'm trying to understand why it works when I write the number directly but doesn't work when I use the series.
Note that the values are of float type:
s.dtype
>>>dtype('float64')
s.to_frame().info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10783 entries, 0 to 10782
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 B1 10783 non-null float64
dtypes: float64(1)
memory usage: 84.4 KB
You can't take a square root of a negative number (without venturing to complex numbers).
>>> np.sqrt(-1.30)
<stdin>:1: RuntimeWarning: invalid value encountered in sqrt
nan
When you do -1.309176**0.5, you're actually doing -(1.309176 ** 0.5), which is valid.
This has to do with operator precedence in python. Precedence of ** > unary operator -.
The square root of a negative number should be a complex number. But when you compute -1.309176**0.5, Python first computes 1.309176**0.5 and then negates the result, because ** has higher precedence than unary -.
>>> -1.309176**0.5
-1.144192291531454
>>> (-1.309176)**0.5
(7.006157137165352e-17+1.144192291531454j)
Now, the numbers in your series are already negative; no unary - is being applied to them. Hence the square roots of these numbers should be complex, which the Series shows as NaN because the dtype is float.
>>> s = pd.Series([-1.30, -1.22])
>>> s
0 -1.30
1 -1.22
dtype: float64
Square root of this series gives nan.
>>> s**0.5
0 NaN
1 NaN
dtype: float64
Change the dtype to complex (the np.complex alias was removed in NumPy 1.24; the builtin complex works in all versions):
>>> s = s.astype(complex)
>>> s
0 -1.300000+0.000000j
1 -1.220000+0.000000j
dtype: complex128
Now you can take the square root of s:
>>> s**0.5
The results are complex: approximately 1.140175j and 1.104536j, with a real part that is zero up to floating-point rounding.
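Both points from the answers above, operator precedence and the float versus complex dtype, can be checked side by side. A short sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Unary minus binds less tightly than **, so this is -(1.309176 ** 0.5).
literal = -1.309176 ** 0.5

# With parentheses the base really is negative; Python 3 returns a complex.
parenthesised = (-1.309176) ** 0.5

s = pd.Series([-1.30, -1.22])

# A float64 Series cannot hold complex results, so these are NaN.
float_roots = s ** 0.5

# Casting to complex first gives the complex square roots.
complex_roots = np.sqrt(s.astype(complex))

print(literal, parenthesised)
print(float_roots)
print(complex_roots)
```

If only the magnitude matters, np.sqrt(s.abs()) stays in float64 and avoids complex numbers altogether.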

How to correctly identify float values [0, 1] containing a dot, in DataFrame object dtype?

I have a dataframe like so, where my values are object dtype:
df = pd.DataFrame(data=['A', '290', '0.1744175757', '1', '1.0000000000'], columns=['Value'])
df
Out[65]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
Value 5 non-null object
dtypes: object(1)
memory usage: 120.0+ bytes
What I want to do is select only percentages, in this case values of 0.1744175757 and 1.0000000000, which just so happen in my data will all have a period/dot in them. This is a key point - I need to be able to differentiate between a 1 integer value, and a 1.0000000000 percentage, as well as a 0 and 0.0000000000.
I've tried to look for the presence of the dot character, but this doesn't work, it returns true for every value, and I'm unclear why.
df[df['Value'].str.contains('.')]
Out[67]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
I've also tried isdecimal(), but this isn't quite what I want:
df[df['Value'].str.isdecimal()]
Out[68]:
Value
1 290
3 1
The closest I've come is with a function:
def isPercent(x):
    if pd.isnull(x):
        return False
    try:
        x = float(x)
        return x % 1 != 0
    except (TypeError, ValueError):
        return False
df[df['Value'].apply(isPercent)]
Out[74]:
Value
2 0.1744175757
but this fails to correctly identify scenarios of 1.0000000000 (and 0.0000000000).
I have two questions:
Why doesn't str.contains('.') work in this context? This seems like it's the easiest way since it will 100% of the time get me what I need in my data, but it returns True even if no '.' character is clearly in the value.
How might I correctly identify all values [0, 1] that have a dot character in the value?
str.contains performs a regex based search by default, and '.' will match any character by the regex engine. To disable it, use regex=False:
df[df['Value'].str.contains('.', regex=False)]
Value
2 0.1744175757
4 1.0000000000
You can also escape it to treat it literally:
df[df['Value'].str.contains(r'\.')]
Value
2 0.1744175757
4 1.0000000000
If you really want to pick up just float numbers, try using a regex that is a little more robust.
df[df['Value'].str.contains(r'\d+\.\d+')].astype(float)
Value
2 0.174418
4 1.000000
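To answer the second question directly, the literal dot test can be combined with a numeric range check. A minimal sketch on the question's data, assuming "percentage" means a value in [0, 1] written with a dot (the names has_dot, in_unit_interval and percentages are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    data=['A', '290', '0.1744175757', '1', '1.0000000000'],
    columns=['Value'],
)

# Literal dot search; regex=False stops '.' from matching any character.
has_dot = df['Value'].str.contains('.', regex=False)

# Non-numeric strings become NaN, and NaN never falls inside the interval.
as_number = pd.to_numeric(df['Value'], errors='coerce')
in_unit_interval = as_number.between(0, 1)

percentages = df[has_dot & in_unit_interval]
print(percentages)
```

The dot test is what separates '1' from '1.0000000000', while the range check filters out values such as '290.5' that contain a dot but are not in [0, 1].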

Casting pandas float64 to string with specific format

I have very small numbers in one pandas column. For example:
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 1.984500e+07
Unfortunately, when I cast it to a string, it gives me unusable values, such as:
df.astype('str').tolist()
['8.494e-07', ]
Is there a way to return the actual value of an item when casting to a string, such as:
'0.0000008494'
Given
# s = df[c]
s
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Name: 1, dtype: float64
You can either call str.format through apply,
s.apply('{:.10f}'.format)
0 656.0000000000
1 673.0000000000
2 624.0000000000
3 1325.0000000000
4 0.0000008494
Name: 1, dtype: object
s.apply('{:.10f}'.format).tolist()
# ['656.0000000000', '673.0000000000', '624.0000000000',
# '1325.0000000000', '0.0000008494']
Or perhaps, through a list comprehension.
['{:f}'.format(x) for x in s]
# ['656.000000', '673.000000', '624.000000', '1325.000000', '0.000001']
Notice that if you do not specify the decimal precision, the default of six decimal places applies, so the last value is rounded to 0.000001.
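If the padded zeros are unwanted, they can be stripped after formatting. A small sketch building on the fixed-precision format above (the helper name to_plain_string and its decimals parameter are hypothetical):

```python
import pandas as pd

s = pd.Series([656.0, 673.0, 624.0, 1325.0, 8.494e-07])

def to_plain_string(x, decimals=10):
    """Format x without an exponent, then drop trailing zeros."""
    text = f"{x:.{decimals}f}"
    # Strip zeros first, then a dangling decimal point.
    return text.rstrip("0").rstrip(".") if "." in text else text

plain = s.map(to_plain_string).tolist()
print(plain)
```

The decimals value still caps the precision: anything smaller than 10 to the power of minus decimals is formatted as zero, so it should be chosen to fit the smallest values in the column.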

change string object to number in dataframe

I have an 880184*1 dataframe whose only column holds either integer objects or string objects. I want to change all string objects to the number 0. It looks like this:
index column
..... ......
23155 WILLS ST / MIDDLE POINT RD
23156 20323
23157 400 Block of BELLA VISTA WY
23158 19090
23159 100 Block of SAN BENITO WY
23160 20474
Now the problem is that both the numbers and the strings are of 'object' type, and I don't know how to change the string-like objects to 0, like below:
index column
..... ......
23155 0
23156 20323
23157 0
23158 19090
23159 0
23160 20474
Another problem is that the sample size is too large, so fixing it row by row with for loops takes too long. I want to use something like:
df.loc[df.column == ...] = 0
You can convert the column to numeric with pd.to_numeric and pass errors='coerce', so that you get NaN for the values that cannot be converted to numbers. Finally, you can replace the NaNs with zero:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0)
Out[15]:
0 0.0
1 20323.0
2 0.0
3 19090.0
4 0.0
5 20474.0
Name: column, dtype: float64
If you want the integer values, add astype('int64') to the end:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0).astype("int64")
Out[16]:
0 0
1 20323
2 0
3 19090
4 0
5 20474
Name: column, dtype: int64
Try converting everything to integers using the int() function.
Strings that are not numbers cannot be converted, so an error is raised. Wrap the call in a try/except block and you are set.
Like this:
def converter(currentRowObj):
    try:
        obj = int(currentRowObj)
    except (TypeError, ValueError):
        obj = 0
    return obj
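The converter is meant to be applied to each value of the column. A minimal sketch comparing it with the vectorized pd.to_numeric approach from the other answer, on a few illustrative rows (the column contents and the names row_wise and vectorized are assumptions for the example):

```python
import pandas as pd

def converter(value):
    """Return value as an int, or 0 if it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0

df = pd.DataFrame({"column": [
    "WILLS ST / MIDDLE POINT RD", 20323,
    "400 Block of BELLA VISTA WY", "19090",
]})

# Row-wise: calls the Python function once per value.
row_wise = df["column"].apply(converter)

# Vectorized: coerce, fill the NaNs with 0, then cast to integers.
vectorized = (
    pd.to_numeric(df["column"], errors="coerce").fillna(0).astype("int64")
)

print(row_wise.tolist())
print(vectorized.tolist())
```

On a column with hundreds of thousands of rows the vectorized version is likely to be noticeably faster, since apply runs a Python-level call for every row.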
