Change a column's values with calculations - Python

Here is my dataframe: dataFrame
I just want to multiply all values in "sous_nutrition" by 10^6.
When I run this code: proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition * 1000000
it gives me this: newDataFrame
I want to multiply by 1 million because the values are specified 'in millions', and it will make other calculations easier afterwards.
Any help would be greatly appreciated.

You can use pd.to_numeric(..., errors='coerce') to coerce values that cannot be converted to numeric into NaN.
Try:
proportion_sous_nutrition_2017['sous_nutrition'] = 1e6 * pd.to_numeric(proportion_sous_nutrition_2017['sous_nutrition'], errors='coerce')
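As a quick illustration on hypothetical sample values modeled on the question's column, coercion turns unparseable strings into NaN instead of raising:

```python
import pandas as pd

# Hypothetical sample: numeric strings mixed with a value like '< 0.1'
s = pd.Series(['1.2', None, '< 0.1'], name='sous_nutrition')
result = 1e6 * pd.to_numeric(s, errors='coerce')
print(result)
# '1.2' is scaled to 1200000.0; '< 0.1' cannot be parsed, so it becomes NaN
```

Note that this silently drops values like '< 0.1'; if those should be kept, extract the numeric part first, as in the other answers.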

Try:
# create a new column called 'sous_nutrition_float' that only has 1.1 or 0.3 etc. and removes the > or < etc.
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.extract(r'([0-9.]+)').astype(float)
proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition_float * 1000000
To find the dtypes run:
print(proportion_sous_nutrition_2017.info())
The types should be float or int before multiplying etc.
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   sous_nutrition        3 non-null      float64
 1   sous_nutrition_float  3 non-null      float64
...
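On hypothetical sample data, the extraction approach looks like this (using expand=False so that str.extract returns a Series rather than a one-column DataFrame):

```python
import pandas as pd

# Hypothetical sample with the '<' / '>' markers described in the question
df = pd.DataFrame({'sous_nutrition': ['1.2', '< 0.1', '> 2.5']})
# Pull out just the numeric part, dropping markers such as '<' or '>'
df['sous_nutrition_float'] = df['sous_nutrition'].str.extract(r'([0-9.]+)', expand=False).astype(float)
df['sous_nutrition'] = df['sous_nutrition_float'] * 1000000
print(df)
```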

One solution is to convert every '< xxx' value to 'xxx':
>>> df['sous_nutrition']
0 1.2
1 NaN
2 < 0.1
Name: sous_nutrition, dtype: object
>>> df['sous_nutrition'].str.replace('<', '').astype(float)
0 1.2
1 NaN
2 0.1
Name: sous_nutrition, dtype: float64
So, this should work:
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.replace('<', '').astype(float) * 1000000

The error you see is due to the fact that the dtype of your column 'sous_nutrition' is not float as you expect, but string (object). To fix it, you need to convert the type as indicated.
If there are errors when changing the type, try this code:
df['sous_nutrition'] = pd.to_numeric(df['sous_nutrition'], downcast='float', errors='coerce')
and then this works correctly:
df['sous_nutrition'] = df['sous_nutrition']*1000000

Related

Understanding the behavior of sqrt - giving different results when written differently

I have a pandas series with the following numbers:
0 -1.309176
1 -1.226239
2 -1.339079
3 -1.298509
...
I'm trying to calculate the square root of each number in the series.
When I try it on the whole series:
s**0.5
>>>
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10778 NaN
but if I take the numbers it works:
-1.309176**0.5
I also tried to slice the numbers from the series:
b1[0]**0.5
>>>
nan
So I'm trying to understand why it works when I write the number directly but doesn't work when I use the series.
*The values are float type:
s.dtype
>>>dtype('float64')
s.to_frame().info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10783 entries, 0 to 10782
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 B1 10783 non-null float64
dtypes: float64(1)
memory usage: 84.4 KB
You can't take a square root of a negative number (without venturing to complex numbers).
>>> np.sqrt(-1.30)
<stdin>:1: RuntimeWarning: invalid value encountered in sqrt
nan
When you do -1.309176**0.5, you're actually doing -(1.309176 ** 0.5), which is valid.
This has to do with operator precedence in python. Precedence of ** > unary operator -.
The square root of a negative number should be a complex number, but when you compute -1.309176**0.5, Python first computes 1.309176**0.5 and then negates the result, because ** has higher precedence than unary -.
>>> -1.309176**0.5
-1.144192291531454
>>> (-1.309176)**0.5
(7.006157137165352e-17+1.144192291531454j)
Now, the numbers in your series are already negative; you are not applying the unary - operator to them, so their square roots should be complex numbers, which the Series shows as NaN because the dtype is float.
>>> s = pd.Series([-1.30, -1.22])
>>> s
0 -1.30
1 -1.22
dtype: float64
Square root of this series gives nan.
>>> s**0.5
0 NaN
1 NaN
dtype: float64
Change the dtype to complex128 (np.complex was removed in newer NumPy; use np.complex128 or the builtin complex):
>>> s = s.astype(np.complex128)
>>> s
0 -1.300000+0.000000j
1 -1.220000+0.000000j
dtype: complex128
Now you get the square root of s.
>>> s**0.5
0 0.000000+1.140175j
1 0.000000+1.104536j
dtype: complex128
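Alternatively, numpy's np.emath.sqrt (an alias of np.lib.scimath.sqrt) promotes to complex output automatically whenever the input contains negative values, so the explicit astype can be skipped; a sketch on the same sample values:

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.30, -1.22])
# emath.sqrt returns complex results when the input contains negatives
roots = np.emath.sqrt(s.to_numpy())
print(roots)
```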

How to make sure the integer part is unchanged when rounding a float32 with pandas

Here is my code:
df = pd.DataFrame([110100.0], dtype=np.float32)
df.round(7)
the result is:
110099.992188
what I expect is 110100.0.
How can I make sure that the round operation only affects the decimal part and leaves the integer part unchanged?
for example:
input => expected output
1.0 => 1.0
1.12345678 => 1.1234567
You can use numpy.modf to split the floats into fractional and integral parts, round only the fractional part, and finally add the two back together:
df = pd.DataFrame([110100.0, 1.0, 1.12345678], dtype=np.float32, columns=['col'])
a, b = np.modf(df['col'])
print (b + np.round(a, 7))
0 110100.000000
1 1.000000
2 1.123457
Name: col, dtype: float32
Another solution is to multiply by a power of ten, convert to integer, and divide back (note that the integer conversion truncates rather than rounds the last digit):
val = 10**7
print (df['col'].mul(val).astype(np.int64).div(val))
0 110100.000000
1 1.000000
2 1.123457
Name: col, dtype: float64
Use Python's built-in round function with the desired precision, e.g.:
a = 1.112346789
round(a, 6)
1.112347
round(a, 1)
1.1
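The root cause here is float32 precision: round() multiplies by 10^7 internally, and 110100 × 10^7 overflows the 24-bit float32 mantissa. A simple alternative (assuming float64 is acceptable for your data) is to upcast before rounding:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([110100.0, 1.0, 1.12345678], dtype=np.float32, columns=['col'])
# float64 has enough precision for the multiply/divide inside round()
print(df['col'].astype(np.float64).round(7))
```

Keep in mind that float32 had already stored 1.12345678 as roughly 1.1234568, so digits beyond float32 precision cannot be recovered by the upcast.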

Get the avg of 2 numbers in one CSV field - Python

I am trying to clean a dataset (CSV) in Python (pandas).
In the 'Projected investment' column I have data that contains 2 numbers, for example 30-35. How can I get the average of this so that the field contains 32.5?
I think it is best to create a float column rather than mixing numeric values with strings.
First replace 'missing' with NaN, then split, convert to floats, and finally take the mean:
df = pd.DataFrame({'Projected investment':['missing','30-35','77']})
print (df)
Projected investment
0 missing
1 30-35
2 77
df['Projected investment'] = df['Projected investment'].replace('missing', np.nan) \
.str.split('-', expand=True) \
.astype(float) \
.mean(axis=1)
print (df)
Projected investment
0 NaN
1 32.5
2 77.0
print (df['Projected investment'].dtypes)
float64
If you need to keep 'missing' as a string:
def parse_number(x):
    try:
        return np.mean(np.array(str(x).split('-')).astype(float))
    except ValueError:
        return x
df['Projected investment'] = df['Projected investment'].map(parse_number)
print (df)
Projected investment
0 missing
1 32.5
2 77
print (df['Projected investment'].apply(type))
0 <class 'str'>
1 <class 'numpy.float64'>
2 <class 'numpy.float64'>
Name: Projected investment, dtype: object
This will work as long as you do not have NaN or missing values in that column; you need to take care of those first.
df['Projected Investment'] = df['Projected Investment'].apply(lambda x: np.mean([int(i) for i in x.split('-')]))
This should work:
string_of_nums = "30-35"
nums = [int(num) for num in string_of_nums.split("-")]
avg = sum(nums) / len(nums)
print(avg)
#>>> 32.5
df['Projected Investment'].apply(lambda x: x if x == 'missing' else np.mean([int(i) for i in x.split('-')]))
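Run on sample values modeled on the question (the 'missing' placeholder is an assumption), this one-liner keeps placeholder strings as-is and averages the ranges:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a placeholder string, a range, and a single number
df = pd.DataFrame({'Projected Investment': ['missing', '30-35', '77']})
result = df['Projected Investment'].apply(
    lambda x: x if x == 'missing' else np.mean([int(i) for i in x.split('-')]))
print(result.tolist())
```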

Change string objects to numbers in a dataframe

I have an 880184*1 dataframe; the only column contains either integer-like or string values, all stored as object. I want to change all string values to the number 0. It looks like below:
index column
..... ......
23155 WILLS ST / MIDDLE POINT RD
23156 20323
23157 400 Block of BELLA VISTA WY
23158 19090
23159 100 Block of SAN BENITO WY
23160 20474
Now the problem is that both the numbers and the strings are 'object' dtype, and I don't know how to change the string-like entries to 0, like below:
index column
..... ......
23155 0
23156 20323
23157 0
23158 19090
23159 0
23160 20474
Another problem is that the sample size is too large, so fixing it row by row with a for loop takes too long. I want to use something like:
df.loc[df.column == ...] = 0
You can convert the type to numeric with pd.to_numeric and pass errors='coerce' so that you would get NaN for the ones cannot be converted to numbers. In the end, you can replace the NaNs with zero:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0)
Out[15]:
0 0.0
1 20323.0
2 0.0
3 19090.0
4 0.0
5 20474.0
Name: column, dtype: float64
If you want the integer values, add astype('int64') to the end:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0).astype("int64")
Out[16]:
0 0
1 20323
2 0
3 19090
4 0
5 20474
Name: column, dtype: int64
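Put together on a small sample modeled on the question's column (values assumed), the whole pipeline is:

```python
import pandas as pd

# Hypothetical sample: address-like strings mixed with numeric strings
df = pd.DataFrame({'column': ['WILLS ST / MIDDLE POINT RD', '20323',
                              '400 Block of BELLA VISTA WY', '19090']})
df['column'] = pd.to_numeric(df['column'], errors='coerce').fillna(0).astype('int64')
print(df['column'].tolist())
```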
Try converting everything to integers using the int() function.
The strings cannot be converted, so an error is raised; wrap the conversion in a try/except block and you are set.
Like this:
def converter(currentRowObj):
    try:
        obj = int(currentRowObj)
    except ValueError:
        obj = 0
    return obj

df['column'] = df['column'].apply(converter)

Pandas Datatype Conversion issue

I have a pandas series of unicode strings that looks like this:
>>> some_id
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
Name: some_id, dtype: object
I can do the following but I lose the precision.
>>> some_id.convert_objects(convert_numeric=True)
0 4.007428e+17
1 4.007405e+17
2 3.988299e+17
3 3.988240e+17
4 3.987990e+17
Name: some_id, dtype: float64
But if I do some_id.astype(int), I get the following: ValueError: invalid literal for long() with base 10
How can I convert them to int or int64 type while preserving the precision ?
I am using Pandas 0.16.2
UPDATE: I found the bug. some_id.astype(int) or any other form of it should work. Somewhere along the thousands of rows I have, some_id has a string of text (not a stringed number), so it was stopping the int64 conversion.
Thanks
Dagrha is right; you should be able to use:
some_id.astype(np.int64)
the type will then be :
In[40]: some_id.dtypes
Out[41]:
some_id int64
dtype: object
Original series of numbers:
s = pd.Series([400742773466599424, 400740479161352192, 398829879107809281,
398823962966097921, 398799036070653952], dtype=object)
>>> s
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: object
Simply converting using .astype(int) should be sufficient.
>>> s.astype(int)
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: int64
As an interesting side note (as pointed out by @Warren Weckesser and @DSM), you can lose precision due to floating point representation. For example, int(1e23) gets represented as 99999999999999991611392L. I'm not sure if this was the precision to which you referred, or if you were merely talking about the displayed precision.
With your sample data above, two numbers would be off by one:
>>> s.astype(np.int64) - s.astype(float).astype(np.int64)
0 0
1 0
2 1
3 1
4 0
dtype: int64
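The off-by-one rows above are exactly the integers that float64 cannot represent: above 2**53 not every integer has an exact double representation, so the round trip through float lands on the nearest representable neighbor. A quick check with one of the ids from the question:

```python
# This id exceeds 2**53, so converting it through float rounds to the
# nearest representable double, changing the value slightly
n = 398829879107809281
assert n > 2**53
print(n - int(float(n)))  # the round trip is off by a small amount
```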
