Getting the index of a float in a column using pandas - python

I have a dataset that I am pulling using pandas. It looks like this:
import pandas as pd
dataset = pd.read_csv('D:\\filename.csv', header=None, usecols=[3, 4, 10, 16, 22, 28])
time=dataset.iloc[:,0]
Now, the 'time' column has a value of 0.00017 somewhere down it, and I want to find the index of that location. How can I get that?

Assuming you're dealing with floats, you can't use an equality comparison here (because of floating point inaccuracies creeping in).
Use np.isclose + np.argmax:
idx = np.isclose(df['time'], 0.00017).argmax()
If there's a possibility this value may not exist:
m = np.isclose(df['time'], 0.00017)
if m.sum() > 0:
    idx = m.argmax()
Otherwise, set idx to whatever (None, -1, etc.).
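For illustration, here is a small self-contained sketch of the same idea; the Series values below are made up:
import numpy as np
import pandas as pd

# Hypothetical 'time' column containing the value we are looking for
time = pd.Series([0.00005, 0.00011, 0.00017, 0.00023])

mask = np.isclose(time, 0.00017)   # tolerant float comparison
if mask.any():
    idx = mask.argmax()            # position of the first match
else:
    idx = None                     # value not present

print(idx)  # 2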

Related

How to convert comma separated numbers from a dataframe to numbers and get the avg value

I'm working on a dataset where a column has numbers separated by commas.
I want to convert the values to integers and obtain their mean value to replace the current anomaly.
ex: 50,45,30,20
I want to get the mean value and replace the current value with it.
You can simply define a function that unpacks those values and then computes their mean.
def get_mean(x):
    # Split into a list of strings
    parts = x.split(',')
    # Transform into numbers
    y = [float(n) for n in parts]
    return sum(y) / len(y)

# Apply on the desired column
df['col'] = df['col'].apply(get_mean)
from numpy import mean
data.apply(lambda x: mean(list(map(lambda y: int(y.strip()), x.split(",")))))
You can apply a custom function like GabrielBoehme suggests, but if you are in control of the data import, handling the issue at the data import stage may be a bit cleaner.
import pandas as pd
data = pd.read_csv('foobar.csv', sep=',', thousands=',')
Obviously you are going to need to make sure everything is quoted appropriately so that the CSV is parsed correctly.
Mine is a longer explanation and the others here are probably better... but this might be easier to understand if you are newer to Python.
cell_num = "1,2,3,4,5,6,7"

# Split the numbers on "," and make a list of them
cell_numbers = cell_num.split(",")

# Run a loop to sum the values in the list
sum_num = 0
for num in cell_numbers:
    sum_num += int(num)

# Get the mean
mean = sum_num / len(cell_numbers)

# Now print your final number
print(mean)
If you have decimals... be sure to swap int with float.
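To use the same approach on a DataFrame, the loop can be wrapped in a function and passed to .apply; here is a sketch with a made-up column name 'col' and a hypothetical helper mean_of_cell, using float so decimals work too:
import pandas as pd

def mean_of_cell(cell):
    # Same loop as above, with float so decimal values are handled
    cell_numbers = cell.split(",")
    sum_num = 0.0
    for num in cell_numbers:
        sum_num += float(num)
    return sum_num / len(cell_numbers)

df = pd.DataFrame({"col": ["50,45,30,20", "1,2,3"]})
df["col"] = df["col"].apply(mean_of_cell)
print(df)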

Set max for a particular column of numpy array

Is there any way to take a column of a numpy array and, whenever the absolute value is greater than some number, set the value to that signed number?
i.e.
for val in col:
    if abs(val) > max:
        val = (signed) max
I know this can be done by looping and such, but I was wondering if there was a cleaner/builtin way to do this.
I see there is something like
arr[arr > 255] = x
Which is kind of what I want, but I want to do this by column instead of the whole array. As a bonus, maybe a way to handle absolute values instead of having to do two separate operations for positive and negative.
The other answer is good but it doesn't get you all the way there. Frankly, this is somewhat of a RTFM situation. But you'd be forgiven for not grokking the Numpy indexing docs on your first try, because they are dense and the data model will be alien if you are coming from a more traditional programming environment.
You will have to use np.clip on the columns you want to clip, like so:
x[:,2] = np.clip(x[:,2], 0, 255)
This applies np.clip to column index 2 of the array, "slicing" down all rows, then reassigns it to that column. The : is Python syntax meaning "give me all elements of an indexable sequence".
More generally, you can use the boolean subsetting index that you discovered in the same fashion, by slicing across rows and selecting the desired columns:
x[x[:,2] > 255, 2] = -1
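For the absolute-value bonus, the same column slice can be clipped symmetrically in one step; a small sketch with made-up data and a cap of 255:
import numpy as np

x = np.array([[ 10., -300.,  400.],
              [-20.,  500., -600.],
              [ 30., -100.,  700.]])

cap = 255
# Clip only column index 1, keeping the sign:
# values below -cap become -cap, values above +cap become +cap.
x[:, 1] = np.clip(x[:, 1], -cap, cap)
print(x)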
Try calling clip on your numpy array:
import numpy as np
values = np.array([-3,-2,-1,0,1,2,3])
values.clip(-2,2)
Out[292]:
array([-2, -2, -1, 0, 1, 2, 2])
Maybe it's a little late, but I think it's a good option:
import numpy as np
values = np.array([-3,-2,-1,0,1,2,3])
values = np.clip(values,-2,2)

Strange issue when storing FFT periods in Pandas dataframe

I am trying to store the results of FFT calculations in a Pandas data frame:
import numpy as np
import pandas as pd

ft = pd.DataFrame(index=range(90))
ft['y'] = ft.index.map(lambda x: np.sin(2*x))
ft['spectrum'] = np.fft.fft(ft['y'])
ft['freq'] = np.fft.fftfreq(len(ft.index)).real
ft['T'] = ft['freq'].apply(lambda f: 1/f if f != 0 else 0)
Everything seems to be working fine until the last line: the column T, which is supposed to store the periods, for some reason contains all the columns of the frame, i.e.:
In [499]: ft.T[0]
Out[499]:
y 0j
spectrum (0.913756021471+0j)
freq 0j
T 0j
Name: 0, dtype: complex128
I cannot figure out why is that. It happens also when I only take the real part of freq:
ft['freq'] = np.fft.fftfreq(len(ft.index)).real
or I try to calculate T values using alternative ways, such as:
ft.T = ft.index.map(lambda i: 1/ft.freq[i] if ft.freq[i] else np.inf)
ft.T = 1/ft.freq
All other columns look tidy when I run head() or describe() on them, no matter if they contain real or complex values. The freq column looks like a normal 1D series, because np.fft.fftfreq() returns a 1D array of complex numbers, so what could be the reason why the column T is so messed up?
I am using Pandas v. 0.19.2 and Numpy v. 1.12.0.
Pandas DataFrame objects have a built-in property called T, which is used "to transpose index and columns" of the DataFrame. So ft.T does not refer to your column: ft.T[0] is row 0 of the transposed frame, which is why it shows all the columns. If you use a different column name instead of T, everything works as expected.
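A minimal sketch of the fix, simply using a column name (here 'period', an arbitrary choice) that does not collide with the transpose property:
import numpy as np
import pandas as pd

ft = pd.DataFrame(index=range(90))
ft['y'] = ft.index.map(lambda x: np.sin(2 * x))
ft['spectrum'] = np.fft.fft(ft['y'])
ft['freq'] = np.fft.fftfreq(len(ft.index))

# 'period' does not clash with the built-in DataFrame.T (transpose) property
ft['period'] = ft['freq'].apply(lambda f: 1 / f if f != 0 else 0)
print(ft['period'].head())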

pandas: check whether an element is in dataframe or given column leads to strange results

I am doing some data handling based on a DataFrame with the shape of (135150, 12), so double-checking my results manually is not feasible anymore.
I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am a beginner in data analysis, I am not able to explain this behavior.
I know that I am using different approaches involving different datatypes (like pd.Series, np.ndarray or np.array) to check if the given value exists in the dataframe. I am also aware that machine precision comes into play when using np.array or np.ndarray.
However, in the end I need to implement several functions that filter the dataframe and count the occurrences of certain values, which I have done successfully several times before using boolean columns in combination with comparisons like > and <.
But in this case I need to filter by an exact value and count its occurrences, which ultimately led me to the issue described above.
So could anyone explain, what's going on here?
The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods though, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.
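Since the stated goal is counting occurrences, the same boolean mask can be reused; a short sketch built on the example data above:
import numpy as np
import pandas as pd

arr = np.arange(1e-3, 2e-3 + 0.01e-3, 0.01e-3)
df = pd.DataFrame(arr, columns=['example_value'])

mask = np.isclose(df['example_value'], 0.0019)
print(mask.any())   # True: the value is present within tolerance
print(mask.sum())   # number of matching rows
print(df[mask])     # the matching rows themselves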

mean() of column in pandas DataFrame returning inf: how can I solve this?

I'm trying to implement some machine learning algorithms, but I'm having some difficulties putting the data together.
In the example below, I load an example dataset from UCI, remove lines with missing data (thanks to the help from a previous question), and now I would like to try to normalize the data.
For many datasets, I just used:
valores = (valores - valores.mean()) / (valores.std())
But for this particular dataset the approach above doesn't work. The problem is that the mean function is returning inf, perhaps due to a precision issue. See the example below:
import pandas as pd

bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
for col in bcw.columns:
    if bcw[col].dtype != 'int64':
        print("Removing possible '?' in column %s..." % col)
        bcw = bcw[bcw[col] != '?']
valores = bcw.iloc[:, 1:10]
# mean returns inf
print(valores.iloc[:, 5].mean())
My question is how to deal with this. It seems that I need to change the type of this column, but I don't know how to do it.
Not so familiar with pandas, but if you convert to a numpy array it works; try
np.asarray(valores.iloc[:, 5], dtype=float).mean()
NaN values should not matter when computing the mean of a pandas.Series. Precision is also irrelevant. The only explanation I can think of is that one of the values in valores is equal to infinity.
You could exclude any values that are infinite when computing the mean like this:
import numpy as np
is_inf = valores.iloc[:, 5] == np.inf
valores.iloc[:, 5][~is_inf].mean()
If the elements of the pandas Series are strings, you get inf as the mean result. In this specific case you can simply convert the Series elements to float and then calculate the mean. No need to use numpy.
Example:
valores.iloc[:,5].astype(float).mean()
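A related, slightly more forgiving variant (not from the original answer) is pd.to_numeric with errors='coerce', which turns non-numeric entries into NaN so they are skipped by mean(); a sketch with made-up values:
import pandas as pd

# Made-up column of numeric strings, shaped like the problem above
s = pd.Series(['1', '3', '10', '?', '5'])

# '?' becomes NaN and is ignored by mean()
print(pd.to_numeric(s, errors='coerce').mean())  # 4.75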
I had the same problem with a column that was of dtype 'O' (object), and whose max value was 9999. Have you tried using the convert_objects method with the convert_numeric=True parameter? This fixed the problem for me.
For me, the reason was an overflow: my original data was in float16 and calling .mean() on that would return inf. After converting my data to float32 (e.g. via .astype("float32")), .mean() worked as expected.
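A small illustration of that overflow case with made-up values: the float16 running sum exceeds the type's maximum (about 65504), so the mean comes out as inf:
import pandas as pd

s = pd.Series([60000.0] * 10, dtype="float16")
print(s.mean())                     # inf: the intermediate sum overflows float16
print(s.astype("float32").mean())   # 60000.0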
