Understanding the behavior of sqrt - giving different results when written differently - Python

I have a pandas Series that has the following numbers:
0 -1.309176
1 -1.226239
2 -1.339079
3 -1.298509
...
I'm trying to calculate the square root of each number in the series.
When I try the whole series:
s**0.5
>>>
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10778 NaN
But if I take the numbers individually, it works:
-1.309176**0.5
I also tried to slice the numbers from the series:
b1[0]**0.5
>>>
nan
So I'm trying to understand why it works when I write the number directly but doesn't work when I use the series.
*The values are float type:
s.dtype
>>>dtype('float64')
s.to_frame().info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10783 entries, 0 to 10782
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 B1 10783 non-null float64
dtypes: float64(1)
memory usage: 84.4 KB

You can't take a square root of a negative number (without venturing to complex numbers).
>>> np.sqrt(-1.30)
<stdin>:1: RuntimeWarning: invalid value encountered in sqrt
nan
When you do -1.309176**0.5, you're actually doing -(1.309176 ** 0.5), which is valid.

This has to do with operator precedence in Python: ** binds more tightly than the unary operator -.
The square root of a negative number is a complex number, but when you compute -1.309176**0.5, Python first computes 1.309176**0.5 and then negates the result, because ** has higher precedence than unary -.
>>> -1.309176**0.5
-1.144192291531454
>>> (-1.309176)**0.5
(7.006157137165352e-17+1.144192291531454j)
Now, the numbers in your series are already negative; there is no unary - being applied after exponentiation. Hence the square root of these numbers would be a complex number, which the Series shows as NaN because its dtype is float.
>>> s = pd.Series([-1.30, -1.22])
>>> s
0 -1.30
1 -1.22
dtype: float64
Square root of this series gives nan.
>>> s**0.5
0 NaN
1 NaN
dtype: float64
Change the dtype to complex:
>>> s = s.astype(complex)
>>> s
0 -1.300000+0.000000j
1 -1.220000+0.000000j
dtype: complex128
Now you get the square root of s.
>>> np.sqrt(s)
0 0.000000+1.140175j
1 0.000000+1.104536j
dtype: complex128
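As a side note, if you want the complex square roots without manually changing the dtype, a sketch using numpy.emath.sqrt (the math-domain variant of np.sqrt) works, since it promotes negative real input to complex automatically:

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.30, -1.22])

# np.emath.sqrt returns complex output when any input is negative,
# instead of warning and producing NaN like plain np.sqrt
roots = np.emath.sqrt(s.to_numpy())
print(roots)  # complex array: 0+1.140175j, 0+1.104536j

# if you only want the real root of the magnitudes, drop the sign first
real_roots = np.sqrt(s.abs())
```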

Related

change a column values with calculations

here is my dataframe: [dataFrame screenshot]
I just want to multiply all values in "sous_nutrition" by 10^6.
When I do this code: proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition * 1000000
it gives me this... [newDataFrame screenshot]
I want to multiply by 1 million because the value is specified 'in millions', and it will make it easier to calculate other things afterwards.
Any help would be greatly appreciated.
You can use pd.to_numeric(..., errors='coerce') to force to NaN values that cannot be converted into numeric.
Try:
proportion_sous_nutrition_2017['sous_nutrition'] = 1e6 * pd.to_numeric(proportion_sous_nutrition_2017['sous_nutrition'], errors='coerce')
Try:
# create a new column called 'sous_nutrition_float' that only has 1.1 or 0.3 etc. and removes the > or < etc.
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.extract(r'([0-9.]+)').astype(float)
proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition_float * 1000000
To find the dtypes run:
print(proportion_sous_nutrition_2017.info())
The types should be float or int before multiplying etc.
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sous_nutrition 3 non-null float64
1 sous_nutrition_float 3 non-null float64
A solution is to convert all '< xxx' values to xxx:
>>> df['sous_nutrition']
0 1.2
1 NaN
2 < 0.1
Name: sous_nutrition, dtype: object
>>> df['sous_nutrition'].str.replace('<', '').astype(float)
0 1.2
1 NaN
2 0.1
Name: sous_nutrition, dtype: float64
So, this should work:
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.replace('<', '').astype(float) * 1000000
The error you have is due to the fact that the format of your column 'sous_nutrition' is not float as you expect, but string (or object). For the solution, you need to change the format as indicated by Hamza usman ghani.
If there are errors when changing the type, try this code:
df['sous_nutrition'] = pd.to_numeric(df['sous_nutrition'], downcast='float', errors='coerce')
and then you can do this correctly:
df['sous_nutrition'] = df['sous_nutrition']*1000000
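For reference, here is a minimal end-to-end sketch of both approaches on a toy frame (the column name mirrors the question; the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'sous_nutrition': ['1.2', None, '< 0.1']})

# option 1: coerce anything non-numeric (like '< 0.1') to NaN
coerced = pd.to_numeric(df['sous_nutrition'], errors='coerce') * 1e6

# option 2: strip the '<' marker first so the bound survives as a number
stripped = df['sous_nutrition'].str.replace('<', '', regex=False).astype(float) * 1e6

print(coerced.tolist())   # [1200000.0, nan, nan]
print(stripped.tolist())  # [1200000.0, nan, 100000.0]
```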

How to correctly identify float values [0, 1] containing a dot, in DataFrame object dtype?

I have a dataframe like so, where my values are object dtype:
df = pd.DataFrame(data=['A', '290', '0.1744175757', '1', '1.0000000000'], columns=['Value'])
df
Out[65]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
Value 5 non-null object
dtypes: object(1)
memory usage: 120.0+ bytes
What I want to do is select only percentages, in this case values of 0.1744175757 and 1.0000000000, which just so happen in my data will all have a period/dot in them. This is a key point - I need to be able to differentiate between a 1 integer value, and a 1.0000000000 percentage, as well as a 0 and 0.0000000000.
I've tried to look for the presence of the dot character, but this doesn't work; it returns True for every value, and I'm unclear why.
df[df['Value'].str.contains('.')]
Out[67]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
I've also tried isdecimal(), but this isn't quite what I want:
df[df['Value'].str.isdecimal()]
Out[68]:
Value
1 290
3 1
The closest I've come up with a function:
def isPercent(x):
    if pd.isnull(x):
        return False
    try:
        x = float(x)
        return x % 1 != 0
    except:
        return False
df[df['Value'].apply(isPercent)]
Out[74]:
Value
2 0.1744175757
but this fails to correctly identify scenarios of 1.0000000000 (and 0.0000000000).
I have two questions:
Why doesn't str.contains('.') work in this context? This seems like the easiest way, since it will get me what I need 100% of the time in my data, but it returns True even when no '.' character is in the value.
How might I correctly identify all values [0, 1] that have a dot character in the value?
str.contains performs a regex based search by default, and '.' will match any character by the regex engine. To disable it, use regex=False:
df[df['Value'].str.contains('.', regex=False)]
Value
2 0.1744175757
4 1.0000000000
You can also escape it to treat it literally:
df[df['Value'].str.contains(r'\.')]
Value
2 0.1744175757
4 1.0000000000
If you really want to pick up just float numbers, try using a regex that is a little more robust.
df[df['Value'].str.contains(r'\d+\.\d+')].astype(float)
Value
2 0.174418
4 1.000000
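If the goal is specifically values in [0, 1] written with a dot, one sketch is to anchor the pattern with str.fullmatch (available in pandas 1.1+) and then filter numerically:

```python
import pandas as pd

df = pd.DataFrame(data=['A', '290', '0.1744175757', '1', '1.0000000000'],
                  columns=['Value'])

# keep only strings shaped like <digits>.<digits>; '1' and '290' fail this
mask = df['Value'].str.fullmatch(r'\d+\.\d+')
pct = df.loc[mask, 'Value'].astype(float)

# restrict to the closed interval [0, 1]
pct = pct[pct.between(0, 1)]
print(pct.tolist())  # [0.1744175757, 1.0]
```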

Casting pandas float64 to string with specific format

I have very small numbers in one pandas column. For example:
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Unfortunately, when I cast it to a string, it gives me unusable values, such as:
df[c].astype(str).tolist()
['8.494e-07', ...]
Is there a way to return the actual value of an item when casting to a string, such as:
'0.0000008494'
Given
# s = df[c]
s
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Name: 1, dtype: float64
You can either call str.format through apply,
s.apply('{:.10f}'.format)
0 656.0000000000
1 673.0000000000
2 624.0000000000
3 1325.0000000000
4 0.0000008494
Name: 1, dtype: object
s.apply('{:.10f}'.format).tolist()
# ['656.0000000000', '673.0000000000', '624.0000000000',
# '1325.0000000000', '0.0000008494']
Or perhaps, through a list comprehension.
['{:f}'.format(x) for x in s]
# ['656.000000', '673.000000', '624.000000', '1325.000000', '0.000001']
Notice that if you do not specify decimal precision, the last value is rounded up.
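If you'd rather not commit to a fixed precision, a sketch using numpy.format_float_positional also works: it prints the shortest positional decimal that round-trips, and trim='-' drops trailing zeros:

```python
import numpy as np
import pandas as pd

s = pd.Series([656.0, 673.0, 624.0, 1325.0, 8.494e-07])

# shortest positional representation that uniquely identifies each float
formatted = s.map(lambda x: np.format_float_positional(x, trim='-'))
print(formatted.tolist())
# ['656', '673', '624', '1325', '0.0000008494']
```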

How do I print entire number in Python from describe() function?

I am doing some statistical work using Python's pandas and I am having the following code to print out the data description (mean, count, median, etc).
data=pandas.read_csv(input_file)
print(data.describe())
But my data is pretty big (around 4 million rows) and each row holds very small values. So inevitably, the count is big and the mean is pretty small, and thus Python prints them in scientific notation.
I just want to print these numbers in full for ease of use and understanding; for example, 4393476 is better than 4.393476e+06. I have googled around and the closest I can find is Display a float with two decimal places in Python and some other similar posts. But those only work if I already have the numbers in a variable, which is not my case: the numbers are created by the describe() function, so I don't know in advance what numbers I will get.
Sorry if this seems like a very basic question; I am still new to Python. Any response is appreciated. Thanks.
Suppose you have the following DataFrame:
Edit: I checked the docs; you should probably use the pandas.set_option API to do this:
In [13]: df
Out[13]:
a b c
0 4.405544e+08 1.425305e+08 6.387200e+08
1 8.792502e+08 7.135909e+08 4.652605e+07
2 5.074937e+08 3.008761e+08 1.781351e+08
3 1.188494e+07 7.926714e+08 9.485948e+08
4 6.071372e+08 3.236949e+08 4.464244e+08
5 1.744240e+08 4.062852e+08 4.456160e+08
6 7.622656e+07 9.790510e+08 7.587101e+08
7 8.762620e+08 1.298574e+08 4.487193e+08
8 6.262644e+08 4.648143e+08 5.947500e+08
9 5.951188e+08 9.744804e+08 8.572475e+08
In [14]: pd.set_option('float_format', '{:f}'.format)
In [15]: df
Out[15]:
a b c
0 440554429.333866 142530512.999182 638719977.824965
1 879250168.522411 713590875.479215 46526045.819487
2 507493741.709532 300876106.387427 178135140.583541
3 11884941.851962 792671390.499431 948594814.816647
4 607137206.305609 323694879.619369 446424361.522071
5 174424035.448168 406285189.907148 445616045.754137
6 76226556.685384 979050957.963583 758710090.127867
7 876261954.607558 129857447.076183 448719292.453509
8 626264394.999419 464814260.796770 594750038.747595
9 595118819.308896 974480400.272515 857247528.610996
In [16]: df.describe()
Out[16]:
a b c
count 10.000000 10.000000 10.000000
mean 479461624.877280 522785202.100082 536344333.626082
std 306428177.277935 320806568.078629 284507176.411675
min 11884941.851962 129857447.076183 46526045.819487
25% 240956633.919592 306580799.695412 445818124.696121
50% 551306280.509214 435549725.351959 521734665.600552
75% 621482597.825966 772901261.744377 728712562.052142
max 879250168.522411 979050957.963583 948594814.816647
End of edit
In [7]: df
Out[7]:
a b c
0 4.405544e+08 1.425305e+08 6.387200e+08
1 8.792502e+08 7.135909e+08 4.652605e+07
2 5.074937e+08 3.008761e+08 1.781351e+08
3 1.188494e+07 7.926714e+08 9.485948e+08
4 6.071372e+08 3.236949e+08 4.464244e+08
5 1.744240e+08 4.062852e+08 4.456160e+08
6 7.622656e+07 9.790510e+08 7.587101e+08
7 8.762620e+08 1.298574e+08 4.487193e+08
8 6.262644e+08 4.648143e+08 5.947500e+08
9 5.951188e+08 9.744804e+08 8.572475e+08
In [8]: df.describe()
Out[8]:
a b c
count 1.000000e+01 1.000000e+01 1.000000e+01
mean 4.794616e+08 5.227852e+08 5.363443e+08
std 3.064282e+08 3.208066e+08 2.845072e+08
min 1.188494e+07 1.298574e+08 4.652605e+07
25% 2.409566e+08 3.065808e+08 4.458181e+08
50% 5.513063e+08 4.355497e+08 5.217347e+08
75% 6.214826e+08 7.729013e+08 7.287126e+08
max 8.792502e+08 9.790510e+08 9.485948e+08
You need to fiddle with the pandas.options.display.float_format attribute. Note, in my code I've used import pandas as pd. A quick fix is something like:
In [29]: pd.options.display.float_format = "{:.2f}".format
In [10]: df
Out[10]:
a b c
0 440554429.33 142530513.00 638719977.82
1 879250168.52 713590875.48 46526045.82
2 507493741.71 300876106.39 178135140.58
3 11884941.85 792671390.50 948594814.82
4 607137206.31 323694879.62 446424361.52
5 174424035.45 406285189.91 445616045.75
6 76226556.69 979050957.96 758710090.13
7 876261954.61 129857447.08 448719292.45
8 626264395.00 464814260.80 594750038.75
9 595118819.31 974480400.27 857247528.61
In [11]: df.describe()
Out[11]:
a b c
count 10.00 10.00 10.00
mean 479461624.88 522785202.10 536344333.63
std 306428177.28 320806568.08 284507176.41
min 11884941.85 129857447.08 46526045.82
25% 240956633.92 306580799.70 445818124.70
50% 551306280.51 435549725.35 521734665.60
75% 621482597.83 772901261.74 728712562.05
max 879250168.52 979050957.96 948594814.82
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 4393476
df = pd.DataFrame(np.random.uniform(1e-4, 0.1, size=(N,3)), columns=list('ABC'))
desc = df.describe()
desc.loc['count'] = desc.loc['count'].astype(int).astype(str)
desc.iloc[1:] = desc.iloc[1:].applymap('{:.6f}'.format)
print(desc)
yields
A B C
count 4393476 4393476 4393476
mean 0.050039 0.050056 0.050057
std 0.028834 0.028836 0.028849
min 0.000100 0.000100 0.000100
25% 0.025076 0.025081 0.025065
50% 0.050047 0.050050 0.050037
75% 0.074987 0.075027 0.075055
max 0.100000 0.100000 0.100000
Under the hood, DataFrames are organized in columns. The values in a column can only have one data type (the column's dtype).
The DataFrame returned by df.describe() has columns of floating-point dtype:
In [116]: df.describe().info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, count to max
Data columns (total 3 columns):
A 8 non-null float64
B 8 non-null float64
C 8 non-null float64
dtypes: float64(3)
memory usage: 256.0+ bytes
DataFrames do not allow you to treat one row as integers and the other rows as floats.
However, if you change the contents of the DataFrame to strings, then you have full control over the way the values are displayed
since all the values are just strings.
Thus, to create a DataFrame in the desired format, you could use
desc.loc['count'] = desc.loc['count'].astype(int).astype(str)
to convert the count row to integers (by calling astype(int)), and then convert the integers to strings (by calling astype(str)). Then
desc.iloc[1:] = desc.iloc[1:].applymap('{:.6f}'.format)
converts the rest of the floats to strings using the str.format method to format the floats to 6 digits after the decimal point.
Alternatively, you could use
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 4393476
df = pd.DataFrame(np.random.uniform(1e-4, 0.1, size=(N,3)), columns=list('ABC'))
desc = df.describe().T
desc['count'] = desc['count'].astype(int)
print(desc)
which yields
count mean std min 25% 50% 75% max
A 4393476 0.050039 0.028834 0.0001 0.025076 0.050047 0.074987 0.1
B 4393476 0.050056 0.028836 0.0001 0.025081 0.050050 0.075027 0.1
C 4393476 0.050057 0.028849 0.0001 0.025065 0.050037 0.075055 0.1
By transposing the desc DataFrame, the counts are now in their own column.
So now the problem can be solved by converting that column's dtype to int.
One advantage of doing it this way is that the values in desc remain numerical.
So further calculations based on the numeric values can still be done.
I think this solution is preferable, provided that the transposed format is acceptable.
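If you only need the readable formatting for a single printout, pandas.option_context scopes the change to a with block, so the global display settings are untouched afterwards (a sketch; the frame is made up):

```python
import numpy as np
import pandas as pd

np.random.seed(2016)
df = pd.DataFrame(np.random.uniform(1e-4, 0.1, size=(1000, 3)), columns=list('ABC'))

# fixed-point output inside the block only
with pd.option_context('display.float_format', '{:.6f}'.format):
    print(df.describe())

# back to the default display behavior outside the block
print(df.describe().loc['mean'])
```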

Pandas Datatype Conversion issue

I have a pandas Series of unicode strings that looks like this:
>>> some_id
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
Name: some_id, dtype: object
I can do the following but I lose the precision.
>>> some_id.convert_objects(convert_numeric=True)
0 4.007428e+17
1 4.007405e+17
2 3.988299e+17
3 3.988240e+17
4 3.987990e+17
Name: some_id, dtype: float64
But if I do some_id.astype(int), I get the following: ValueError: invalid literal for long() with base 10
How can I convert them to int or int64 type while preserving the precision ?
I am using Pandas 0.16.2
UPDATE: I found the bug. some_id.astype(int) or any other form of it should work. Somewhere along the thousands of rows I have, some_id has a string of text (not a stringified number), so it was stopping the int64 conversion.
Thanks
Dagrha is right; you should be able to use:
some_id.astype(np.int64)
The dtype will then be:
In[40]: some_id.dtypes
Out[41]:
some_id int64
dtype: object
Original series of numbers:
s = pd.Series([400742773466599424, 400740479161352192, 398829879107809281,
398823962966097921, 398799036070653952], dtype=object)
>>> s
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: object
Simply converting using .astype(int) should be sufficient.
>>> s.astype(int)
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: int64
As an interesting side note (as pointed out by @Warren Weckesser and @DSM), you can lose precision due to floating point representation. For example, int(1e23) is represented as 99999999999999991611392L. I'm not sure if this was the precision to which you referred, or if you were merely talking about the displayed precision.
With your sample data above, two numbers would be off by one:
>>> s.astype(np.int64) - s.astype(float).astype(np.int64)
0 0
1 0
2 1
3 1
4 0
dtype: int64
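Regarding the bug the asker found (a stray non-numeric string breaking astype(int)): one sketch that avoids any float round-trip is to keep only the pure digit strings and cast just those, since string-to-int64 parsing is exact for values up to 2**63 - 1 (the data here is made up):

```python
import pandas as pd

s = pd.Series(['400742773466599424', '398829879107809281', 'oops'], dtype=object)

# rows that are pure digit strings convert exactly, with no float step
is_num = s.str.fullmatch(r'\d+')
ids = s[is_num].astype('int64')
print(ids.tolist())  # [400742773466599424, 398829879107809281]
```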
