Python pandas: issues when subsetting Series using .values attribute [duplicate]

This question already has answers here:
python numpy arange unexpected results
(5 answers)
Closed 2 years ago.
I'm having an issue with a pandas Series: I've created an array with some values in it. For testing purposes I wanted to make sure certain values are present in the Series, so I'm subsetting it as below:
A = np.arange(start=-10, stop=10, step=0.1)
Aseries = pd.Series(A)
Aseries[Aseries.values == 9]
and this returns an empty Series. But if I just change the step (from 0.1 to 1), then it works... I've double-checked that the Series actually contains the value I'm looking for (for both step values...).
Here's the code for when I change the step (With the output as proof)
#Generating an array containing 200 values from -10 to 10 with a step of 0.1
A = np.arange(start=-10, stop=10, step=0.1)
Aseries = pd.Series(A)
Aseries[Aseries.values == 9]
#Generating an array containing 20 values from -10 to 10 with a step of 1
B = np.arange(start=-10, stop=10, step=1)
Bseries = pd.Series(B)
print("'Aseries' having the value 9:")
print(Aseries[Aseries.values == 9])
print("'Bseries' having the value 9:")
print(Bseries[Bseries.values == 9])
output:
'Aseries' having the value 9:
Series([], dtype: float64)
'Bseries' having the value 9:
19 9
dtype: int32
any idea of what's going on here? thanks in advance!
[EDIT]: for some reason I can't add any other post to this thread, so I'll add the solution I found here:
As #Quang Hoang and #Kim Rop explained, this is caused by the non-integer step value, which doesn't really return what it's supposed to. So after:
Aseries = pd.Series(A)
I simply added a rounding instruction to keep only one decimal for the values in the array, and adapted my subsetting operation to something like this:
Aseries[(Aseries.values > 8.9) & (Aseries.values < 9.1)]
I'm not having the issue anymore... Thanks #Quang Hoang and #Kim Rop
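For reference, a minimal sketch of the rounding idea mentioned above (assuming one decimal of precision is enough; once the values are rounded, the exact comparison also works, because 9.0 is exactly representable):
import numpy as np
import pandas as pd

A = np.arange(start=-10, stop=10, step=0.1)
# Round to one decimal before building the Series
Aseries = pd.Series(np.round(A, 1))

# The exact comparison now finds the value
print(Aseries[Aseries.values == 9])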

According to the official documentation:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use numpy.linspace for these cases.
And this is also partially because of floating point precision.
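A minimal sketch of both suggestions (illustrative only, not the sole fix): build the grid with numpy.linspace, and when testing for a value like 9, compare with a tolerance instead of ==:
import numpy as np
import pandas as pd

# numpy.linspace lets you state the number of points instead of a float step
A = np.linspace(-10, 10, 201)   # 201 points: -10.0, -9.9, ..., 9.9, 10.0
Aseries = pd.Series(A)

# Regardless of how the grid was built, a tolerance-based comparison is
# the robust way to test for a value like 9
print(Aseries[np.isclose(Aseries.values, 9)])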


How to retain 2 decimals without rounding in python/pandas?

How can I retain only 2 decimals for each value in a pandas Series? (I'm working with latitudes and longitudes). dtype is float64.
series = [-74.002568, -74.003085, -74.003546]
I tried using the round function but, as the name suggests, it rounds. I looked into trunc() but this can only remove all decimals. Then I figured, why not try running a for loop. I tried the following:
for i in series:
    i = "{0:.2f}".format(i)
I was able to run the code without any errors but it didn't modify the data in any way.
Expected output would be the following:
[-74.00, -74.00, -74.00]
Anyone knows how to achieve this? Thanks!
series = [-74.002568, -74.003085, -74.003546]
["%0.2f" % (x,) for x in series]
['-74.00', '-74.00', '-74.00']
This converts your data to string/object data type, which is fine for display purposes. If you want to use the values for calculations, you can cast them back to float; note that only one decimal digit will then be visible, because trailing zeros are dropped.
[float('{0:.2f}'.format(x)) for x in series]
[-74.0, -74.0, -74.0]
Here is one way to do it, assuming you meant pandas.Series (you indicated a Series but defined only a list):
series = [-74.002568, -74.003085, -74.003546]
s=pd.Series(series)
# use regex extract to pick the number until first two decimal places
out=s.astype(str).str.extract(r"(.*\..{2})")[0]
out
0 -74.00
1 -74.00
2 -74.00
Name: 0, dtype: object
Change the display options. This shouldn't change your underlying data.
pd.options.display.float_format = "{:,.2f}".format
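If you need the truncation numerically (float dtype, no rounding) rather than as text or display formatting, here is a small sketch using numpy, assuming two decimals is all you ever want to keep:
import numpy as np
import pandas as pd

s = pd.Series([-74.002568, -74.003085, -74.003546])

# Scale, drop the fractional part, scale back: truncates toward zero
truncated = np.trunc(s * 100) / 100
print(truncated)
# 0   -74.0
# 1   -74.0
# 2   -74.0
# dtype: float64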

Input values to a List

When I try to do the following, the subsequent error occurs.
ranges = []
a_values = []
b_values = []
for x in params:
    a = min(fifa[params][x])
    a = a - (a * .25)
    b = max(fifa[params][x])
    b = b + (b * .25)
    ranges.append((a, b))
for x in range(len(fifa['short_name'])):
    if fifa['short_name'][x] == 'Nunez':
        a_values = df.iloc[x].values.tolist()
[Error traceback shown as a screenshot]
What does it mean? How do I solve this?
Thank you in advance
The problem is on this line:
if fifa['short_name'][x]=='Nunez':
fifa['short_name'] is a Series;
fifa['short_name'][x] tries to index that series with x;
your code doesn't show it, but the stack trace suggests x is some scalar type;
pandas tries to look up x in the index of fifa['short_name'], and it's not there, resulting in the error.
Since the Series will share the index of the dataframe fifa, this means that the index x isn't in the dataframe. And it probably isn't, because you let x range from 0 up to (but not including) len(fifa).
What is the index of your dataframe? You didn't include the definition of params, nor that of fifa, but your problem is most likely in the latter, or you should loop over the dataframe differently, by looping over its index instead of just integers.
However, there are more efficient ways to do what you're trying to do in pandas - you should include some definition of the dataframe to allow people to show you the correct one.
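As an illustrative sketch only (the real fifa and params definitions aren't shown, so the column name below is taken from the question): selecting the row by value avoids relying on a 0..len-1 integer index altogether:
# Boolean selection instead of looping over integer positions
nunez_rows = fifa.loc[fifa['short_name'] == 'Nunez']
if not nunez_rows.empty:
    a_values = nunez_rows.iloc[0].values.tolist()

# If you do want to loop, iterate over positions and use .iloc,
# which is positional and independent of the index labels
for pos in range(len(fifa)):
    if fifa['short_name'].iloc[pos] == 'Nunez':
        a_values = fifa.iloc[pos].values.tolist()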

Python numpy array filter

I got the following numpy array named 'data'. It consists of 15118 rows and 2 columns. The first column mostly consists of 0.01 steps, but sometimes there is a step in between (shown in red in the screenshot) which I would like to remove/filter out.
I achieved this with the following code:
# Create array [0, 0.01 .... 140], rounded 2 decimals to prevent floating point error
b = np.round(np.arange(0,140.01,0.01),2)
# New empty data array
new_data = np.empty(shape=[0, 2])
# Loop over values to remove/filter out data
for x in b:
    Index = np.where(x == data[:, 0])[0][0]
    new_data = np.vstack([new_data, data[Index]])
I feel like this code is far from optimal and I was wondering if anyone knows a faster/better way of achieving this?
Here's a solution using pandas for resampling. You can probably achieve the same result in pure numpy, but there are a number of floating point and rounding error pitfalls you are going to face; maybe it's better to let a trusted library do the work for you.
Let's say arr is your data array and assume your index to be in fractions of seconds. You can convert your array to a dataframe with a timedelta index:
df = pd.DataFrame(arr[:,1], index=arr[:,0])
df.index = pd.to_timedelta(df.index, unit="s")
Then resampling is pretty easy: 10ms is the frequency you want, and first() should give you the expected result, dropping everything but the records at 10ms ticks. Feel free to experiment with other aggregation functions.
df = df.resample("10ms").first()
Finally, you could get back to your array with something like:
np.vstack([pd.to_numeric(df.index, downcast="float").values / 1e9,
           df.values.squeeze()]).T
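For completeness, a vectorized numpy-only sketch, assuming data is the array from the question. Note it is not exactly equivalent to the loop: it keeps every row whose first-column value lies on the 0.01 grid, rather than only the first occurrence of each grid value:
import numpy as np

# Target grid, rounded to avoid floating point error, as in the question
b = np.round(np.arange(0, 140.01, 0.01), 2)

# Keep only rows whose (rounded) first-column value is on the grid
mask = np.isin(np.round(data[:, 0], 2), b)
new_data = data[mask]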

Unique values from pandas.Series [duplicate]

This question already has answers here:
in operator, float("NaN") and np.nan
(2 answers)
Closed 5 years ago.
Consider the following pandas.Series:
import pandas as pd
import numpy as np
s = pd.Series([np.nan, 1, 1, np.nan])
s
0 NaN
1 1.0
2 1.0
3 NaN
dtype: float64
I want to find only unique values in this particular series using the built-in set function:
unqs = set(s)
unqs
{nan, 1.0, nan}
Why are there duplicate NaNs in the resultant set? Using a similar function (pandas.unique) does not produce this result, so what's the difference, here?
pd.unique(s)
array([ nan, 1.])
Like in Java and JavaScript, nan in numpy does not equal itself.
>>> np.nan == np.nan
False
This means that when the set constructor checks "do I have an instance of nan in this set yet?" it always returns False.
So… why?
nan in both cases stands for "not a number": a value used when a result cannot be represented as an ordinary number. That's why comparisons against it necessarily fail. It also can't be sorted, because there's no way to tell whether nan is supposed to be larger or smaller than any number.
After all, which is bigger "cat" or 7? And is "goofy" == "pluto"?
SO… what do I do?
There are a couple of ways to resolve this problem. Personally, I generally try to fill nan before processing: DataFrame.fillna will help with that, and I would always use df.unique() to get a set of unique values.
no_nas = s.dropna().unique()
with_nas = s.unique()
with_replaced_nas = s.fillna(-1).unique() # using a placeholder
(Note: all of the above can be passed into the set constructor.)
What if I don't want to use the Pandas way?
There are reasons not to use Pandas, or to rely on native objects instead of Pandas. The following should suffice.
Your other option is to filter and remove the nan.
unqs = set(item for item in s if not np.isnan(item))
You could also replace things inline:
placeholder = '{placeholder}' # There are a variety of placeholder options.
unqs = set(item if not np.isnan(item) else placeholder for item in s)
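A minimal demonstration of the behaviour described above, using the values from the question (the expected results are shown as comments):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 1, np.nan])
items = list(s)

print(items[0] == items[3])   # False - NaN never equals NaN
print(set(s))                 # {nan, 1.0, nan} - set() cannot deduplicate them
print(pd.unique(s))           # [nan  1.] - pandas treats all NaNs as one value
print(set(s.dropna()))        # {1.0} - dropping NaN first avoids the issue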

Pandas selecting row by column value, strange behaviour

Ok, I have a pandas dataframe like this:
lat long level date time value
3341 29.232 -15.652 10.0 20100109.0 700.0 0.5
3342 27.887 -13.668 120.0 20100109.0 700.0 3.2
...
3899 26.345 -11.234 0.0 20100109.0 700.0 5.8
The reason for the strange index numbers is that the dataframe comes from a csv converted to a pandas dataframe with some values filtered out. The columns level, date, and time are not really relevant.
I am trying, in ipython, to look at some rows by filtering on latitude, so I do (if the dataframe is c):
c[c['lat'] == 26.345]
or
c.loc[c['lat'] == 26.345]
and I can see whether the value is present or not, but sometimes it outputs nothing for latitude values that I can see in the dataframe!? (For instance, I can see the latitude value 27.702 in the dataframe, but when I do c[c['lat'] == 27.702] or c.loc[c['lat'] == 27.702] I get an empty dataframe, even though the value for that latitude is there.) What is happening here?
Thank you.
This is probably because you are asking for an exact match against floating point values, which is very, very dangerous. They are approximations, often printed to less precision than actually stored.
It's very easy to see 0.735471 printed, say, and think that's all there is, when in fact the value is really 0.73547122072282867; the display function has simply truncated the result. But when you try a strict equality test on the attractively short value, boom. Doesn't work.
Instead of
c[c['lat'] == 26.345]
Try:
import numpy as np
c[np.isclose(c['lat'], 26.345)]
Now you'll get values that are within a certain range of the value you specified. You can set the tolerance.
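For example, to widen or tighten the match (rtol and atol are the relative and absolute tolerance parameters of np.isclose; the values below are just illustrative):
import numpy as np

# Match latitudes within 0.001 of the target, regardless of magnitude
c[np.isclose(c['lat'], 26.345, rtol=0, atol=1e-3)]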
It is a bit difficult to give a precise answer, as the question does not contain a reproducible example, but let me try. Most probably, this is due to floating point issues. It is possible that the number you see (and try to compare with) is not the same number that is stored in memory, due to rounding. For example:
import numpy as np
x = 0.1
arr = np.array([x + x + x])
print(np.array([x + x + x]))
# [ 0.3]
print(arr[arr == 0.3])
# []
print(x + x + x)
# 0.30000000000000004
# in fact 0.1 is not exactly equal to 1/10,
# so 0.1 + 0.1 + 0.1 is not equal to 0.3
You can overcome this issue using np.isclose instead of ==:
print(np.isclose(arr, 0.3))
# [ True]
print(arr[np.isclose(arr, 0.3)])
# [ 0.3]
In addition to the answers addressing comparison on floating point values, some of the values in your lat column may be string type instead of numeric.
EDIT: You indicated that this is not the problem, but I'll leave this response here in case it helps someone else. :)
Use the to_numeric() function from pandas to convert them to numeric.
import pandas as pd
df['lat'] = pd.to_numeric(df['lat'])
# you can adjust the errors parameter as you need
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
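A short illustrative sketch of what errors='coerce' does (the bad value 'N/A' is a made-up example):
import pandas as pd

lat = pd.Series(['26.345', '27.702', 'N/A'])
print(pd.to_numeric(lat, errors='coerce'))
# 0    26.345
# 1    27.702
# 2       NaN
# dtype: float64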
