Replace str values in a Series with np.nan - python

I have the following series
s = pd.Series(['hey','hey',2,2.14], index=[1,2,3,4])
I basically want to mask the series and check whether the values are a str; if so, I want to replace them with np.nan. How could I achieve that?
Wanted result
s = pd.Series([np.nan,np.nan,2,2.14], index=[1,2,3,4])
I tried this
s.mask(isinstance(s,str))
But I got the following ValueError: "Array conditional must be same shape as self". I am kind of a newb when it comes to these methods and would appreciate an explanation of why.

You can use
out = s.mask(s.apply(type).eq(str))
print(out)
1 NaN
2 NaN
3 2
4 2.14
dtype: object
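To the "why" part of the question: isinstance(s, str) evaluates to a single Python bool, while mask expects a boolean condition with the same shape as the Series. A minimal sketch, assuming the Series from the question:

import pandas as pd

s = pd.Series(['hey', 'hey', 2, 2.14], index=[1, 2, 3, 4])

# isinstance(s, str) is one scalar bool for the whole Series object,
# which is why mask() raises "Array conditional must be same shape as self".
print(isinstance(s, str))          # False

# s.apply(type).eq(str) builds an element-wise boolean Series instead,
# which is the shape mask() expects.
print(s.apply(type).eq(str))       # True, True, False, False
print(s.mask(s.apply(type).eq(str)))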

If you are set on using mask, you could try:
s = pd.Series(['hey','hey',2,2.14],index=[1,2,3,4])
s = s.mask(s.apply(isinstance, args=(str,)))
print(s)
1 NaN
2 NaN
3 2
4 2.14
dtype: object
But as you can see, many roads lead to Rome...

Use to_numeric with the errors="coerce" parameter.
s = pd.to_numeric(s, errors = 'coerce')
Out[73]:
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64
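One caveat, going beyond the answer above: unlike the type-based approaches, to_numeric also converts strings that merely look numeric. A small sketch with a slightly modified example:

import pandas as pd

s = pd.Series(['hey', '3', 2, 2.14], index=[1, 2, 3, 4])

# The string '3' becomes the number 3.0 here, whereas the isinstance/type
# based approaches above would have turned it into NaN.
print(pd.to_numeric(s, errors='coerce'))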

IIUC, you need to create the pd.Series as below and then use isinstance:
import numpy as np
import pandas as pd
s = pd.Series(['hey','hey',2,2.14],index=[1,2,3,4])
s = s.apply(lambda x: np.nan if isinstance(x, str) else x)
print(s)
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64

You could use:
s[s.str.match(r'\D+').fillna(False)] = np.nan
But if you are looking to convert all string types, including numeric-looking representations like "1.23", then refer to @Ynjxsjmh's answer.
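For context, the fillna(False) is needed because the .str accessor returns NaN for non-string elements; a sketch assuming the same Series as above:

import numpy as np
import pandas as pd

s = pd.Series(['hey', 'hey', 2, 2.14], index=[1, 2, 3, 4])

# .str.match gives True/True for the strings and NaN for the numbers,
# so fillna(False) turns it into a usable boolean mask.
mask = s.str.match(r'\D+')
print(mask)
s[mask.fillna(False)] = np.nan
print(s)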

Related

removing numbers from a column in python pandas

I want to remove all numbers within the entries of a certain column in a Python pandas dataframe. Unfortunately, commands like .join() and .find() are not iterable (when I define a function to iterate on the entries, it gives me a message that floating variables do not have .find and .join attributes). Are there any commands that take care of this in pandas?
def remove(data):
    for i in data:
        if not i.isdigit():
            data = ''
            data = data.join(i)
    return data

myfile['column_name'] = myfile['column_name'].apply(remove)
You can remove all numbers like this:
import pandas as pd
df = pd.DataFrame({'x': ['1', '2', 'C', '4']})
df[df["x"].str.isdigit()] = "NaN"
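If an actual missing value is wanted rather than the literal string "NaN", a small variation (my assumption, not part of the answer above) is:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['1', '2', 'C', '4']})

# Assign a real missing value instead of the string "NaN".
df[df['x'].str.isdigit()] = np.nan
print(df)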
Impossible to know for sure without a data sample, but your code implies data contains strings since you call isdigit on the elements.
Assuming the above, there are many ways to do what you want. One of them is conditional list comprehension:
import pandas as pd
s = pd.DataFrame({'x':['p','2','3','d','f','0']})
out = [ x if x.isdigit() else '' for x in s['x'] ]
# Output: ['', '2', '3', '', '', '0']
Or look at using pd.to_numeric with errors='coerce' to cast the column as numeric and eliminate non-numeric values:
Using @Raidex's setup:
s = pd.DataFrame({'x':['p','2','3','d','f','0']})
pd.to_numeric(s['x'], errors='coerce')
Output:
0 NaN
1 2.0
2 3.0
3 NaN
4 NaN
5 0.0
Name: x, dtype: float64
EDIT to handle either situation.
s['x'].where(~s['x'].str.isdigit())
Output:
0 p
1 NaN
2 NaN
3 d
4 f
5 NaN
Name: x, dtype: object
OR
s['x'].where(s['x'].str.isdigit())
Output:
0 NaN
1 2
2 3
3 NaN
4 NaN
5 0
Name: x, dtype: object
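where() also accepts a replacement value, so (as a variation on the sketch above, not in the original answer) the masked-out entries can be blanked instead of becoming NaN:

import pandas as pd

s = pd.DataFrame({'x': ['p', '2', '3', 'd', 'f', '0']})

# Entries where the condition is False (the non-digit strings) are
# replaced with '' instead of NaN.
print(s['x'].where(s['x'].str.isdigit(), ''))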

Python/Pandas: Unexpected indices when doing a groupby-apply

I'm using Pandas and Numpy on Python3 with the following versions:
Python 3.5.1 (via Anaconda 2.5.0) 64 bits
Pandas 0.19.1
Numpy 1.11.2 (probably not relevant here)
Here is the minimal code producing the problem:
import pandas as pd
import numpy as np
a = pd.DataFrame({'i' : [1,1,1,1,1], 'a': [1,2,5,6,100], 'b': [2, 4,10, np.nan, np.nan]})
a.set_index(keys='a', inplace=True)
v = a.groupby(level=0).apply(lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())
v.index.names
This code is a simple groupby-apply, but I don't understand the outcome:
FrozenList(['a', 'a'])
For some reason, the index of the result is ['a', 'a'], which seems to be a very doubtful choice from pandas. I would have expected a simple ['a'].
Does anyone have some idea about why Pandas chooses to duplicate the column in the index?
Thanks in advance.
This is happening because sort_values returns a DataFrame or Series, so its index gets concatenated to the existing groupby index. The same thing happens if you call shift on the 'b' column:
In [99]:
v = a.groupby(level=0).apply(lambda x: x['b'].shift())
v
Out[99]:
a a
1 1 NaN
2 2 NaN
5 5 NaN
6 6 NaN
100 100 NaN
Name: b, dtype: float64
Even with as_index=False it would still produce a MultiIndex:
In [102]:
v = a.groupby(level=0, as_index=False).apply(lambda x: x['b'].shift())
v
Out[102]:
a
0 1 NaN
1 2 NaN
2 5 NaN
3 6 NaN
4 100 NaN
Name: b, dtype: float64
If the lambda returns a plain scalar value, then no duplicated index is created:
In [104]:
v = a.groupby(level=0).apply(lambda x: x['b'].max())
v
Out[104]:
a
1 2.0
2 4.0
5 10.0
6 NaN
100 NaN
dtype: float64
I don't think this is a bug; rather, it is a semantic to be aware of: some methods return an object whose index is aligned with the pre-existing index.
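If the duplicated level is unwanted, one option (my addition, not part of the original answer) is to drop the prepended group-key level afterwards:

import numpy as np
import pandas as pd

a = pd.DataFrame({'i': [1, 1, 1, 1, 1],
                  'a': [1, 2, 5, 6, 100],
                  'b': [2, 4, 10, np.nan, np.nan]})
a.set_index(keys='a', inplace=True)

v = a.groupby(level=0).apply(
    lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())

# Drop the group-key level that apply() prepended, keeping the original 'a'.
v.index = v.index.droplevel(0)
print(v.index.names)   # FrozenList(['a'])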

python, pandas, work through bad data

So I've got a very large dataframe of mostly floats (read from a CSV), but every now and then I get a string, or NaN:
date load
0 2016-07-12 19:04:31.604999 0
...
10 2016-07-12 19:04:31.634999 nan
...
50 2016-07-12 19:04:31.664999 ".942.197"
...
I can deal with NaNs (interpolate), but can't figure out how to use replace to catch strings and not numbers:
df.replace(to_replace='^[a-zA-Z0-9_.-]*$',regex=True,value = float('nan'))
returns all NaNs. I want NaNs only where the value is actually a string.
I think you want pandas.to_numeric. It works with series-like data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
In [3]: df
Out[3]:
load
0 0
1 NaN
2 .942.197
In [4]: pd.to_numeric(df['load'], errors='coerce')
Out[4]:
0 0.0
1 NaN
2 NaN
Name: load, dtype: float64
Actually, to_numeric will try to convert every item to a number, so a string that looks like a number will be converted:
In [5]: df = pd.DataFrame([0, float('nan'), '123.456'], columns=['load'])
In [6]: df
Out[6]:
load
0 0
1 NaN
2 123.456
In [7]: pd.to_numeric(df['load'], errors='coerce')
Out[7]:
0 0.000
1 NaN
2 123.456
Name: load, dtype: float64
I am not aware of any way to convert every non-numeric type to NaN other than iterating (or using apply or map) and checking with isinstance.
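A minimal sketch of that apply-with-isinstance route, assuming sample data similar to the question's:

import numpy as np
import pandas as pd

df = pd.DataFrame([0, float('nan'), '123.456', '.942.197'], columns=['load'])

# Only genuine strings become NaN; the 0 and the existing NaN pass through.
# Note that, unlike to_numeric, this also drops numeric-looking strings
# such as '123.456'.
print(df['load'].apply(lambda x: np.nan if isinstance(x, str) else x))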
It's my understanding that .replace() will only apply to string datatypes. If you apply it to a non-string datatype (e.g. your numeric types), it will return NaN. Converting the entire frame/series to strings before using replace would work around this, but probably isn't the "best" way of doing so (e.g. see @Goyo's answer)!
See the notes on this page.

How to get the max/min value in a Pandas DataFrame when there are NaN values in it

Since one column of my pandas DataFrame has NaN values, when I try to get the max value of that column it just returns an error.
>>> df.iloc[:, 1].max()
'error:512'
How can I skip that nan value and get the max value of that column?
You can use NumPy's np.nanmax and np.nanmin:
In [28]: df
Out[28]:
A B C
0 7 NaN 8
1 3 3 5
2 8 1 7
3 3 0 3
4 8 2 7
In [29]: np.nanmax(df.iloc[:, 1].values)
Out[29]: 3.0
In [30]: np.nanmin(df.iloc[:, 1].values)
Out[30]: 0.0
You can use Series.dropna.
res = df.iloc[:, 1].dropna().max()
If you don't use iloc or loc, it is as simple as:
df['column'].max()
or
df['column'][df.index.min():df.index.max()]
or any kind of range in the second pair of square brackets.
You can set numeric_only = True when calling max:
df.iloc[:, 1].max(numeric_only = True)
Attention: for everyone trying to use this with a pandas Series, it does not work, even though it is mentioned in the docs. See the post on GitHub.
The DataFrame aggregate function .agg() will automatically ignore NaN values.
df.agg({'income':'max'})
Besides, it can also be used together with .groupby:
df.groupby('column').agg({'income':['max','mean']})
When the df contains NaN values, it reports NaN; using np.nanmax(df.values) gave the desired answer.
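For completeness, a sketch with made-up numeric data (the original column is not shown): once the column actually has a numeric dtype, max and min skip NaN by default.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [7, 3, 8, 3, 8],
                   'B': [np.nan, 3, 1, 0, 2]})

# On a numeric column, max()/min() skip NaN by default (skipna=True).
print(df['B'].max())   # 3.0
print(df['B'].min())   # 0.0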

Pandas: convert column with empty strings to float

In my application, I receive a pandas DataFrame (say, block), that has a column called est. This column can contain a mix of strings or floats. I need to convert all values in the column to floats and have the column type be float64. I do so using the following code:
block[est].convert_objects(convert_numeric=True)
block[est].astype('float')
This works for most cases. However, in one case, est contains all empty strings. In this case, the first statement executes without error, but the empty strings in the column remain empty strings. The second statement then causes an error: ValueError: could not convert string to float:.
How can I modify my code to handle a column with all empty strings?
Edit: I know I can just do block[est].replace("", np.NaN), but I was wondering if there's some way to do it with just convert_objects or astype that I'm missing.
Clarification: For project-specific reasons, I need to use pandas 0.16.2.
Here's an interaction with some sample data that demonstrates the failure:
>>> block = pd.DataFrame({"eps":["", ""]})
>>> block = block.convert_objects(convert_numeric=True)
>>> block["eps"]
0
1
Name: eps, dtype: object
>>> block["eps"].astype('float')
...
ValueError: could not convert string to float:
It's easier to do it using:
pandas.to_numeric
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_numeric.html
import pandas as pd
df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df['eps'] = pd.to_numeric(df['eps'], errors='coerce')
errors='coerce' will convert any value that cannot be parsed to NaN
df['eps'].astype('float')
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
Then you can apply other functions without getting errors:
df['eps'].round()
0 1.0
1 2.0
2 2.0
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
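Applied to the all-empty-string column from the question, a minimal sketch:

import pandas as pd

block = pd.DataFrame({'eps': ['', '']})

# to_numeric coerces the empty strings to NaN and returns a float64 column,
# which is what astype('float') could not do here.
block['eps'] = pd.to_numeric(block['eps'], errors='coerce')
print(block['eps'].dtype)   # float64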
Alternatively, define a small helper that falls back to NaN:
import numpy as np
import pandas as pd

def convert_float(val):
    try:
        return float(val)
    except ValueError:
        return np.nan

df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
>>> df.eps.apply(lambda x: convert_float(x))
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
