I have very small numbers in one pandas column. For example:
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Unfortunately, when I cast it to a string, it gives me unusable values, such as:
df.astype('str').tolist()
['8.494e-07', ]
Is there a way to return the actual value of an item when casting to a string, such as:
'0.0000008494'
Given
# s = df[c]
s
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Name: 1, dtype: float64
You can either call str.format through apply,
s.apply('{:.10f}'.format)
0 656.0000000000
1 673.0000000000
2 624.0000000000
3 1325.0000000000
4 0.0000008494
Name: 1, dtype: object
s.apply('{:.10f}'.format).tolist()
# ['656.0000000000', '673.0000000000', '624.0000000000',
# '1325.0000000000', '0.0000008494']
Or perhaps, through a list comprehension.
['{:f}'.format(x) for x in s]
# ['656.000000', '673.000000', '624.000000', '1325.000000', '0.000001']
Notice that if you do not specify a precision, '{:f}' defaults to six decimal places, so the last value is rounded up to 0.000001.
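If you would rather not pick a precision up front, one option is to go through decimal.Decimal, which can print the shortest round-tripping representation of a float in plain notation. This is only a sketch and not part of the original answer; the helper name to_plain_string is made up for illustration.

```python
from decimal import Decimal

def to_plain_string(x):
    """Format a float in positional notation without choosing a precision."""
    # repr(float) gives the shortest string that round-trips to the same float;
    # Decimal keeps those digits, and the 'f' format drops the exponent.
    return format(Decimal(repr(float(x))), "f")

values = [656.0, 1325.0, 8.494e-07]
plain = [to_plain_string(v) for v in values]
print(plain)
```

With a Series this could be applied as s.apply(to_plain_string). The float(x) call is there so that NumPy scalars are converted to plain Python floats before repr is taken.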
Related
I have a pd.Series as follows:
I want to calculate the difference between each value in a and a.max(), and at the same time convert the difference to float. The approach in picture 2 does what I want, but why does the approach in picture 3 fail?
Picture 2 (correct):
Picture 3 (wrong):
Error message:
The problem with a.apply(lambda x:x-x.max())/np.timedelta64(1,'D') is that apply passes each element of the Series to the lambda, so x is a single Timestamp, not the Series. On a Timestamp, max is an attribute holding the largest representable Timestamp rather than a method, which is why calling x.max() raises "'Timestamp' object is not callable". You are looking for a.max() instead of x.max().
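A quick way to check this yourself is to inspect max on a single Timestamp. The snippet below is a small sketch with a made-up date; it only shows where the error comes from.

```python
import pandas as pd

ts = pd.Timestamp("1997-01-01")

# On a Timestamp, `max` is a plain attribute (another Timestamp), not a method.
is_callable = callable(ts.max)
print(type(ts.max).__name__, is_callable)

# Calling it therefore fails with the same message as in the question.
try:
    ts.max()
except TypeError as exc:
    message = str(exc)
    print(message)
```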
Data
import pandas as pd
from datetime import datetime
data = [datetime(1997,1,1),datetime(1997,1,12),datetime(1998,5,28),
datetime(1997,12,12),datetime(1998,1,3)]
a = pd.Series(data, index=range(1,6), name='user_id')
print(a)
1 1997-01-01
2 1997-01-12
3 1998-05-28
4 1997-12-12
5 1998-01-03
Name: user_id, dtype: datetime64[ns]
Code
# using `pd.Timedelta` avoids having to import `np`
b = (a-a.max())/pd.Timedelta(days=1)
print(b)
1 -512.0
2 -501.0
3 0.0
4 -167.0
5 -145.0
Name: user_id, dtype: float64
# use `a.max()` instead of `x.max()`:
c = a.apply(lambda x:x-a.max())/pd.Timedelta(days=1)
print(b.equals(c))
# True
# refactored solution:
d = a.sub(a.max()).dt.days
print(d)
1 -512
2 -501
3 0
4 -167
5 -145
Name: user_id, dtype: int64
# chain `.astype(float)`, if you specifically want `floats`:
print(a.sub(a.max()).dt.days.astype(float).equals(b))
# True
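One thing to keep in mind when choosing between the two approaches: .dt.days only returns the whole-day component, while dividing by pd.Timedelta(days=1) keeps fractions of a day. With the dates in the question there are no time-of-day parts, so both agree. The sketch below uses made-up timestamps to show where they differ.

```python
import pandas as pd

# Hypothetical data with a time-of-day component.
stamps = pd.Series(pd.to_datetime(["1997-01-01 00:00", "1997-01-02 12:00"]))
delta = stamps - stamps.min()

whole_days = delta.dt.days.tolist()                   # integer day component only
fractional = (delta / pd.Timedelta(days=1)).tolist()  # keeps the half day

print(whole_days)
print(fractional)
```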
I have a dataframe like so, where my values are object dtype:
df = pd.DataFrame(data=['A', '290', '0.1744175757', '1', '1.0000000000'], columns=['Value'])
df
Out[65]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
Value 5 non-null object
dtypes: object(1)
memory usage: 120.0+ bytes
What I want to do is select only percentages, in this case values of 0.1744175757 and 1.0000000000, which just so happen in my data will all have a period/dot in them. This is a key point - I need to be able to differentiate between a 1 integer value, and a 1.0000000000 percentage, as well as a 0 and 0.0000000000.
I've tried to look for the presence of the dot character, but this doesn't work, it returns true for every value, and I'm unclear why.
df[df['Value'].str.contains('.')]
Out[67]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
I've also tried isdecimal(), but this isn't quite what I want:
df[df['Value'].str.isdecimal()]
Out[68]:
Value
1 290
3 1
The closest I've come is with a function:
def isPercent(x):
    if pd.isnull(x):
        return False
    try:
        x = float(x)
        return x % 1 != 0
    except (TypeError, ValueError):
        return False
df[df['Value'].apply(isPercent)]
Out[74]:
Value
2 0.1744175757
but this fails to correctly identify scenarios of 1.0000000000 (and 0.0000000000).
I have two questions:
Why doesn't str.contains('.') work in this context? This seems like it's the easiest way since it will 100% of the time get me what I need in my data, but it returns True even if no '.' character is clearly in the value.
How might I correctly identify all values [0, 1] that have a dot character in the value?
str.contains performs a regex-based search by default, and in a regex '.' matches any character. To disable this, use regex=False:
df[df['Value'].str.contains('.', regex=False)]
Value
2 0.1744175757
4 1.0000000000
You can also escape it to treat it literally:
df[df['Value'].str.contains(r'\.')]
Value
2 0.1744175757
4 1.0000000000
If you really want to pick up just float numbers, try using a regex that is a little more robust.
df[df['Value'].str.contains(r'\d+\.\d+')].astype(float)
Value
2 0.174418
4 1.000000
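Note that str.contains reports a match anywhere in the string, so a value that merely contains something like "1.2" would also pass. If the whole value has to look like a decimal number, str.fullmatch may be a better fit. This is a sketch assuming pandas 1.1 or later (where Series.str.fullmatch is available); the value 'v1.2.3' is invented to show the difference.

```python
import pandas as pd

values = pd.Series(['A', '290', '0.1744175757', '1', '1.0000000000', 'v1.2.3'])

pattern = r'\d+\.\d+'
loose = values.str.contains(pattern)    # True if the pattern occurs anywhere
strict = values.str.fullmatch(pattern)  # True only if the entire string matches

print(values[loose].tolist())
print(values[strict].tolist())
```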
I have an 880184*1 dataframe whose only column holds either integer objects or string objects. I want to change all the string objects to the number 0. It looks like this:
index column
..... ......
23155 WILLS ST / MIDDLE POINT RD
23156 20323
23157 400 Block of BELLA VISTA WY
23158 19090
23159 100 Block of SAN BENITO WY
23160 20474
The problem is that both the numbers and the strings have 'object' dtype, and I don't know how to change the string-like objects to 0, like below:
index column
..... ......
23155 0
23156 20323
23157 0
23158 19090
23159 0
23160 20474
Another problem is that the sample size is too large, making it too long to use for loops to fix row by row. I want to use something like:
df.loc[df.column == ...] = 0
You can convert the column to numeric with pd.to_numeric and pass errors='coerce', so that you get NaN for the values that cannot be converted to numbers. Then you can replace the NaNs with zero:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0)
Out[15]:
0 0.0
1 20323.0
2 0.0
3 19090.0
4 0.0
5 20474.0
Name: column, dtype: float64
If you want the integer values, add astype('int64') to the end:
df["column"] = pd.to_numeric(df["column"], errors="coerce").fillna(0).astype("int64")
Out[16]:
0 0
1 20323
2 0
3 19090
4 0
5 20474
Name: column, dtype: int64
Try converting everything to integers using the int() function.
Strings that are not numbers cannot be converted, so a ValueError is raised. Wrap the call in a try/except block and you are set.
Like this:
def converter(currentRowObj):
    try:
        obj = int(currentRowObj)
    except (TypeError, ValueError):
        obj = 0
    return obj
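For completeness, here is roughly how such a function would be used. The sketch runs it over a plain list with a few values taken from the question; with a dataframe the equivalent would presumably be df["column"].apply(converter), which calls the function once per row and so will be slower than a vectorized approach on 880184 rows.

```python
def converter(value):
    """Return value as an int, or 0 when it cannot be parsed as one."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0

column = ["WILLS ST / MIDDLE POINT RD", "20323", "400 Block of BELLA VISTA WY", 19090]
converted = [converter(v) for v in column]
print(converted)
```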
Say I have a list (or numpy array or pandas series) as below
l = [1,2,6,6,4,2,4]
I want to return a list of each value's ordinal, 1-->1 (smallest), 2-->2, 4-->3, 6-->4, so that
to_ordinal(l) == [1,2,4,4,3,2,3]
and I want it to also work for list of strings input.
I can try
s = numpy.unique(l)
then loop over each element in l and find its index in s. Just wonder if there is a direct method?
In pandas you can call rank and pass method='dense':
In [18]:
l = [1,2,6,6,4,2,4]
s = pd.Series(l)
s.rank(method='dense')
Out[18]:
0 1
1 2
2 4
3 4
4 3
5 2
6 3
dtype: float64
This also works for strings:
In [19]:
l = ['aaa','abc','aab','aba']
s = pd.Series(l)
Out[19]:
0 aaa
1 abc
2 aab
3 aba
dtype: object
In [20]:
s.rank(method='dense')
Out[20]:
0 1
1 4
2 2
3 3
dtype: float64
I don't think that there is a "direct method" for this [1]. The most straightforward way that I can think of is to sort a set of the elements:
sorted_unique = sorted(set(l))
Then make a dictionary mapping each value to its ordinal:
ordinal_map = {val: i for i, val in enumerate(sorted_unique, 1)}
Now one more pass over the data and we can get your list:
ordinals = [ordinal_map[val] for val in l]
Note that this is roughly an O(NlogN) algorithm (due to the sort), and the more non-unique elements you have, the closer it gets to O(N).
[1] Certainly not in vanilla python and I don't know of anything in numpy. I'm less familiar with pandas so I can't speak to that.
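On the numpy side, the numpy.unique idea from the question can be taken a step further without an explicit loop: with return_inverse=True it also returns, for each element, its index in the sorted unique array. The sketch below adds 1 to that index to get 1-based ordinals; it is one possible approach, not something from the answers above.

```python
import numpy as np

l = [1, 2, 6, 6, 4, 2, 4]

# `uniques` is sorted; `inverse[i]` is the position of l[i] within `uniques`.
uniques, inverse = np.unique(l, return_inverse=True)
ordinals = (inverse + 1).tolist()
print(ordinals)

# The same call works for strings, which are sorted lexicographically.
_, inverse_str = np.unique(['aaa', 'abc', 'aab', 'aba'], return_inverse=True)
string_ordinals = (inverse_str + 1).tolist()
print(string_ordinals)
```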
I have a pandas series that looks like this: a bunch of unicode strings
>>> some_id
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
Name: some_id, dtype: object
I can do the following but I lose the precision.
>>> some_id.convert_objects(convert_numeric=True)
0 4.007428e+17
1 4.007405e+17
2 3.988299e+17
3 3.988240e+17
4 3.987990e+17
Name: some_id, dtype: float64
But if I do some_id.astype(int), I get the following: ValueError: invalid literal for long() with base 10
How can I convert them to int or int64 type while preserving the precision ?
I am using Pandas 0.16.2
UPDATE: I found the bug. some_id.astype(int) or any other form of it should work. Somewhere along the thousands of rows I have, some_id has a string of text (not a stringed number), so it was stopping the int64 conversion.
Thanks
Dagrha is right, you should be able to use :
some_id.astype(np.int64)
the type will then be :
In[40]: some_id.dtypes
Out[41]:
some_id int64
dtype: object
Original series of numbers:
s = pd.Series([400742773466599424, 400740479161352192, 398829879107809281,
398823962966097921, 398799036070653952], dtype=object)
>>> s
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: object
Simply converting using .astype(int) should be sufficient.
>>> s.astype(int)
0 400742773466599424
1 400740479161352192
2 398829879107809281
3 398823962966097921
4 398799036070653952
dtype: int64
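If the conversion raises a ValueError, as described in the update to the question, it can help to locate the offending rows first. The sketch below assumes the ids are non-negative digit strings (str.isdigit would reject a leading minus sign) and uses a made-up text value to stand in for the bad row.

```python
import pandas as pd

some_id = pd.Series(["400742773466599424", "oops, some text", "398829879107809281"])

# Rows that are not made up purely of digits are the ones blocking the cast.
is_numeric = some_id.str.isdigit()
bad_rows = some_id[~is_numeric]
print(bad_rows)

# The remaining rows convert to int64 without going through float.
clean = some_id[is_numeric].astype("int64")
print(clean)
```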
As an interesting side note (as pointed out by @Warren Weckesser and @DSM), you can lose precision due to floating point representation. For example, int(1e23) gets represented as 99999999999999991611392L. I'm not sure if this was the precision to which you referred, or if you were merely talking about the displayed precision.
With your sample data above, two numbers would be off by one:
>>> s.astype(np.int64) - s.astype(float).astype(np.int64)
0 0
1 0
2 1
3 1
4 0
dtype: int64
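The same effect can be reproduced without pandas. A float64 has a 53-bit significand, so integers above 2**53 are not all representable, and near 4e17 adjacent floats are 64 apart. The sketch below takes two of the ids from the sample and compares a direct int parse with a round trip through float.

```python
ids = ["400742773466599424", "398829879107809281"]

exact = [int(s) for s in ids]             # parsed directly, no precision loss
via_float = [int(float(s)) for s in ids]  # rounded to the nearest float64 first
errors = [e - f for e, f in zip(exact, via_float)]

print(errors)
print(exact[0] % 64, exact[1] % 64)  # only multiples of 64 survive in this range
```

The first id happens to be a multiple of 64 and survives the round trip, while the second one does not and comes back off by one.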