Change string objects to numbers in a dataframe - python

I have an 880184x1 dataframe whose only column holds either integer-like or string values (both stored as object). I want to change every string value to the number 0. It looks like below:
index column
..... ......
23155 WILLS ST / MIDDLE POINT RD
23156 20323
23157 400 Block of BELLA VISTA WY
23158 19090
23159 100 Block of SAN BENITO WY
23160 20474
Now the problem is that both the numbers and the strings have dtype 'object', and I don't know how to change the string-like entries to 0, like below:
index column
..... ......
23155 0
23156 20323
23157 0
23158 19090
23159 0
23160 20474
Another problem is that the sample size is so large that fixing it row by row with a for loop would take too long. I want to use something vectorized, like:
df.loc[df.column == ...] = 0

You can convert the column to numeric with pd.to_numeric, passing errors='coerce' so that values which cannot be converted to numbers become NaN. In the end, you can replace the NaNs with zero:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0)
Out[15]:
0 0.0
1 20323.0
2 0.0
3 19090.0
4 0.0
5 20474.0
Name: column, dtype: float64
If you want the integer values, add astype('int64') to the end:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0).astype("int64")
Out[16]:
0 0
1 20323
2 0
3 19090
4 0
5 20474
Name: column, dtype: int64
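If you prefer the masked-assignment style the question was reaching for, a hedged equivalent sketch (pd.to_numeric still does the detection; note the column is left as object dtype this way):
# mark entries that are not parseable as numbers, then zero them in place
mask = pd.to_numeric(df["column"], errors="coerce").isna()
df.loc[mask, "column"] = 0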

Try converting everything to integers using the int() function.
The strings cannot be converted, so an error is raised; pack the call in a try/except block and you are set.
Like this:
def converter(currentRowObj):
    try:
        obj = int(currentRowObj)
    except (ValueError, TypeError):  # non-numeric values fall through to 0
        obj = 0
    return obj
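To run the function over the whole column, a usage sketch (the column name "column" is assumed from the question):
df["column"] = df["column"].apply(converter)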


Change a column's values with calculations

Here is my dataframe: [screenshot: dataFrame]
I just want to multiply all values in "sous_nutrition" by 10^6.
When I run proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition * 1000000
it gives me this: [screenshot: newDataFrame]
I want to multiply by 1 million because the values are expressed 'in millions', and that will make other calculations easier afterwards.
Any help would be greatly appreciated.
You can use pd.to_numeric(..., errors='coerce') to force values that cannot be converted to numeric to NaN.
Try:
proportion_sous_nutrition_2017['sous_nutrition'] = 1e6 * pd.to_numeric(proportion_sous_nutrition_2017['sous_nutrition'], errors='coerce')
Try:
# create a new column called 'sous_nutrition_float' that only has 1.1 or 0.3 etc. and removes the > or < etc.
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.extract(r'([0-9.]+)').astype(float)
proportion_sous_nutrition_2017['sous_nutrition'] = proportion_sous_nutrition_2017.sous_nutrition_float * 1000000
To find the dtypes run:
print(proportion_sous_nutrition_2017.info())
The types should be float or int before multiplying etc.
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   sous_nutrition        3 non-null      float64
 1   sous_nutrition_float  3 non-null      float64
A solution is to convert every '< xxx' value to 'xxx':
>>> df['sous_nutrition']
0 1.2
1 NaN
2 < 0.1
Name: sous_nutrition, dtype: object
>>> df['sous_nutrition'].str.replace('<', '').astype(float)
0 1.2
1 NaN
2 0.1
Name: sous_nutrition, dtype: float64
So, this should work:
proportion_sous_nutrition_2017['sous_nutrition_float'] = proportion_sous_nutrition_2017['sous_nutrition'].str.replace('<', '').astype(float) * 1000000
The error you are seeing is due to the fact that the dtype of your 'sous_nutrition' column is not float as you expect, but string (object). To fix it, change the dtype as indicated above.
If there are errors when changing the type, try this code:
df['sous_nutrition'] = pd.to_numeric(df['sous_nutrition'], downcast='float', errors='coerce')
and then you can do this correctly:
df['sous_nutrition'] = df['sous_nutrition']*1000000

How to correctly identify float values [0, 1] containing a dot, in DataFrame object dtype?

I have a dataframe like so, where my values are object dtype:
df = pd.DataFrame(data=['A', '290', '0.1744175757', '1', '1.0000000000'], columns=['Value'])
df
Out[65]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
Value 5 non-null object
dtypes: object(1)
memory usage: 120.0+ bytes
What I want to do is select only percentages, in this case values of 0.1744175757 and 1.0000000000, which just so happen in my data will all have a period/dot in them. This is a key point - I need to be able to differentiate between a 1 integer value, and a 1.0000000000 percentage, as well as a 0 and 0.0000000000.
I've tried looking for the presence of the dot character, but this doesn't work: it returns True for every value, and I'm unclear why.
df[df['Value'].str.contains('.')]
Out[67]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
I've also tried isdecimal(), but this isn't quite what I want:
df[df['Value'].str.isdecimal()]
Out[68]:
Value
1 290
3 1
The closest I've come is with a function:
def isPercent(x):
    if pd.isnull(x):
        return False
    try:
        x = float(x)
        return x % 1 != 0
    except (ValueError, TypeError):
        return False
df[df['Value'].apply(isPercent)]
Out[74]:
Value
2 0.1744175757
but this fails to correctly identify scenarios of 1.0000000000 (and 0.0000000000).
I have two questions:
Why doesn't str.contains('.') work in this context? This seems like the easiest way, since in my data it will get me what I need 100% of the time, but it returns True even when there is clearly no '.' character in the value.
How might I correctly identify all values [0, 1] that have a dot character in the value?
str.contains performs a regex-based search by default, and '.' matches any character in the regex engine. To disable regex matching, use regex=False:
df[df['Value'].str.contains('.', regex=False)]
Value
2 0.1744175757
4 1.0000000000
You can also escape it to treat it literally:
df[df['Value'].str.contains(r'\.')]
Value
2 0.1744175757
4 1.0000000000
If you really want to pick up just float numbers, try using a regex that is a little more robust.
df[df['Value'].str.contains(r'\d+\.\d+')].astype(float)
Value
2 0.174418
4 1.000000
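If you also want to enforce the [0, 1] range from the question, one possible refinement (a sketch, not from the original answers) combines the dot test with a numeric bound:
# keep only dotted numbers whose numeric value lies in [0, 1]
vals = pd.to_numeric(df['Value'], errors='coerce')
mask = df['Value'].str.contains(r'^\d+\.\d+$', na=False) & vals.between(0, 1)
df[mask]  # rows 2 and 4, as above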

Casting pandas float64 to string with specific format

I have very small numbers in one pandas column. For example:
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Unfortunately, when I cast it to a string, it gives me unusable values, such as:
df.astype('str').tolist()
['8.494e-07', ]
Is there a way to return the actual value of an item when casting to a string, such as:
'0.0000008494'
Given
# s = df[c]
s
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Name: 1, dtype: float64
You can either call str.format through apply,
s.apply('{:.10f}'.format)
0 656.0000000000
1 673.0000000000
2 624.0000000000
3 1325.0000000000
4 0.0000008494
Name: 1, dtype: object
s.apply('{:.10f}'.format).tolist()
# ['656.0000000000', '673.0000000000', '624.0000000000',
# '1325.0000000000', '0.0000008494']
Or perhaps, through a list comprehension.
['{:f}'.format(x) for x in s]
# ['656.000000', '673.000000', '624.000000', '1325.000000', '0.000001']
Notice that if you do not specify decimal precision, the last value is rounded up.
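For what it's worth, on Python 3.6+ an f-string in a list comprehension gives the same result as the format-string approaches above:
[f'{x:.10f}' for x in s]
# ['656.0000000000', '673.0000000000', '624.0000000000',
#  '1325.0000000000', '0.0000008494']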

df.iloc[1].ColumnName is returning the entire row, not the one specific value?

Overview
I am trying to get the value of my tickers column (T) and use it for a file name.
Approach
I am using df.iloc[1].T with the column name to capture one specific value, which I then want to concatenate with another string to create a file path.
Problem
df.iloc[1].T is giving me the whole row, not just the one value.
print(df2.iloc[1].T)
Unnamed: 0 1
D 2010-12-01
T tvix
O 99.98
H 100.69
L 98.8
C 100.69
V 0
AC 2.51725e+06
Y 2010
dow 3
M 12
W 48
Name: 1, dtype: object
I am expecting to get "tvix". When I print any other column I get the single value, for example:
print(df2.iloc[1].D)
print(df2.iloc[1].H)
print(df2.iloc[1].L)
2010-12-01
100.689997
98.799998
It prints the Date, High, and Low as expected. The difference between the tickers column (T) and all the others is that the ticker column has the same value in every row (I have it this way for groupBy purposes):
print(df2.head())
Unnamed: 0 D T O H L \
0 0 2010-11-30 tvix 106.269997 112.349997 104.389997
1 1 2010-12-01 tvix 99.979997 100.689997 98.799998
2 2 2010-12-02 tvix 98.309998 98.309998 86.499998
3 3 2010-12-03 tvix 88.359998 88.359998 78.119998
4 4 2010-12-06 tvix 79.769998 79.999998 74.619998
I am assuming that since the T column has the same value all the way down, this is the reason I am having this problem. Any input on this would be greatly appreciated.
Accessing columns via .column_name is a little problematic (same for indices). It doesn't always work: when the column name has spaces, when it is a number, or, as in this case, when a method or attribute has the same name. .T transposes a DataFrame, so you should use brackets:
df.iloc[1]['T']
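That returns the scalar, so building the file name works; a small sketch (the file-path pattern here is hypothetical, not from the question):
ticker = df2.iloc[1]['T']   # 'tvix', a scalar rather than the whole row
filename = ticker + '.csv'  # hypothetical naming pattern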
You can also specify the column position directly in iloc[]:
df.iloc[1,2]
This will get you 'tvix'
Or you can use mixed indexing with .ix[]:
df.ix[1,'T']
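Note that .ix was deprecated in pandas 0.20 and later removed; on a modern pandas the equivalents would be (assuming the default RangeIndex, where label 1 is row 1):
df2.loc[1, 'T']   # label-based lookup
df2.iat[1, 2]     # purely positional scalar access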

Pandas Datatype Conversion issue

I have a pandas series that looks like this: a bunch of unicode strings
>>> some_id
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
Name: some_id, dtype: object
I can do the following but I lose the precision.
>>> some_id.convert_objects(convert_numeric=True)
0 4.007428e+17
1 4.007405e+17
2 3.988299e+17
3 3.988240e+17
4 3.987990e+17
Name: some_id, dtype: float64
But if I do some_id.astype(int), I get the following: ValueError: invalid literal for long() with base 10
How can I convert them to int or int64 type while preserving the precision ?
I am using Pandas 0.16.2
UPDATE: I found the bug. some_id.astype(int), or any other form of it, should work. Somewhere along the thousands of rows I have, some_id contains a string of text (not a stringified number), which was stopping the int64 conversion.
Thanks
Dagrha is right; you should be able to use:
some_id.astype(np.int64)
the type will then be :
In[40]: some_id.dtypes
Out[41]:
some_id int64
dtype: object
Original series of numbers:
s = pd.Series([400742773466599424, 400740479161352192, 398829879107809281,
               398823962966097921, 398799036070653952], dtype=object)
>>> s
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: object
Simply converting using .astype(int) should be sufficient.
>>> s.astype(int)
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: int64
As an interesting side note (as pointed out by @Warren Weckesser and @DSM), you can lose precision due to floating point representation. For example, int(1e23) is represented as 99999999999999991611392L. I'm not sure if this was the precision to which you referred, or if you were merely talking about the displayed precision.
With your sample data above, two numbers would be off by one:
>>> s.astype(np.int64) - s.astype(float).astype(np.int64)
0 0
1 0
2 1
3 1
4 0
dtype: int64
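The underlying reason: float64 has a 53-bit mantissa, so integers above 2**53 are not all exactly representable, and a round-trip through float silently changes the low digits. A quick check (an illustrative sketch, not from the original answer):
x = 398829879107809281       # one of the IDs that came back off by one
print(2**53)                 # 9007199254740992, the exact-integer limit for float64
print(int(float(x)) == x)    # False: the float round-trip loses the low bits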
