removing numbers from a column in python pandas - python

I want to remove all numbers from the entries of a certain column in a pandas DataFrame. Unfortunately, string methods like .join() and .find() fail when I apply them: when I define a function to iterate over the entries, I get an error saying that float objects have no .find or .join attributes. Are there any pandas commands that take care of this?
def remove(data):
    for i in data if not i.isdigit():
        data=''
        data=data.join(i)
    return data
myfile['column_name']=myfile['column_name'].apply(remove())

You can replace every entry that consists only of digits like this (note that it writes the string "NaN", not a real missing value):
import pandas as pd
df = pd.DataFrame ( {'x' : ['1','2','C','4']})
df[ df["x"].str.isdigit() ] = "NaN"

Impossible to know for sure without a data sample, but your code implies data contains strings since you call isdigit on the elements.
Assuming the above, there are many ways to do what you want. One of them is conditional list comprehension:
import pandas as pd
s = pd.DataFrame({'x':['p','2','3','d','f','0']})
out = [ x if x.isdigit() else '' for x in s['x'] ]
# Output: ['', '2', '3', '', '', '0']

Or look at using pd.to_numeric with errors='coerce' to cast the column as numeric and eliminate non-numeric values:
Using @Raidex's setup:
s = pd.DataFrame({'x':['p','2','3','d','f','0']})
pd.to_numeric(s['x'], errors='coerce')
Output:
0 NaN
1 2.0
2 3.0
3 NaN
4 NaN
5 0.0
Name: x, dtype: float64
EDIT to handle either situation.
s['x'].where(~s['x'].str.isdigit())
Output:
0 p
1 NaN
2 NaN
3 d
4 f
5 NaN
Name: x, dtype: object
OR
s['x'].where(s['x'].str.isdigit())
Output:
0 NaN
1 2
2 3
3 NaN
4 NaN
5 0
Name: x, dtype: object
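If the goal is instead to strip digit characters from inside each entry, as the question's wording suggests, a vectorized string replace is one option. The following is a minimal sketch on made-up sample data: the contents of column_name are an assumption, and the "float has no attribute" error in the question suggests the real column contains NaN values, which the .str accessor passes through as missing.

```python
import pandas as pd

# Hypothetical sample data; the real column contents are not shown in the question.
myfile = pd.DataFrame({"column_name": ["abc123", "4d5", "xyz", float("nan")]})

# Remove every run of digits inside each string; missing values stay missing.
myfile["column_name"] = myfile["column_name"].str.replace(r"\d+", "", regex=True)

print(myfile["column_name"].tolist())
```

Because the replacement goes through the .str accessor, no Python-level function ever receives a float, so the attribute error from the question does not occur.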

Related

Concatenate multiple columns of dataframe with a separating character for Non-null values

I have a data frame like this:
df:
C1 C2 C3
1 4 6
2 NaN 9
3 5 NaN
NaN 7 3
I want to concatenate the 3 columns into a single column with a comma as the separator.
But I want the comma (",") only between non-null values.
I tried this, but it doesn't handle the null values correctly:
df['New_Col'] = df[['C1','C2','C3']].agg(','.join, axis=1)
This gives me the output:
New_Col
1,4,6
2,,9
3,5,
,7,3
This is my ideal output:
New_Col
1,4,6
2,9
3,5
7,3
Can anyone help me with this?
Judging by your (wrong) output, you have a dataframe of strings and NaN values are actually empty strings (otherwise it would throw TypeError: expected str instance, float found because NaN is a float).
Since you're dealing with strings, pandas is not optimized for it, so a vanilla Python list comprehension is probably the most efficient choice here.
df['NewCol'] = [','.join([e for e in x if e]) for x in df.values]
In your case, use stack (it drops the NaN values before joining):
df['new'] = df.stack().astype(int).astype(str).groupby(level=0).agg(','.join)
Out[254]:
0 1,4,6
1 2,9
2 3,5
3 7,3
dtype: object
You can use filter to get rid of NaNs:
df['New_Col'] = df.apply(lambda x: ','.join(filter(lambda x: x is not np.nan,list(x))), axis=1)
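If the columns hold real NaN values rather than empty strings, a row-wise variant that drops the nulls before joining may be easier to read. This is only a sketch: the frame below rebuilds the question's table under the assumption that the values are numeric, and join_non_null is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data with genuine NaN values.
df = pd.DataFrame({
    "C1": [1, 2, 3, np.nan],
    "C2": [4, np.nan, 5, 7],
    "C3": [6, 9, np.nan, 3],
})

def join_non_null(row):
    # Drop missing cells, show the rest as integers, then join with commas.
    values = row.dropna().astype(int).astype(str)
    return ",".join(values)

# Select the source columns explicitly so New_Col is never fed back in.
df["New_Col"] = df[["C1", "C2", "C3"]].apply(join_non_null, axis=1)

print(df["New_Col"].tolist())  # ['1,4,6', '2,9', '3,5', '7,3']
```

The astype(int) step assumes every non-null value is a whole number; leave it out if the columns can contain fractions.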

Replace str values in series into np.nan

I have the following series:
s = pd.Series(['hey','hey',2,2.14], index=[1,2,3,4])
I basically want to mask the series: check whether each value is a str and, if so, replace it with np.nan. How could I achieve that?
Wanted result:
s = pd.Series([np.nan,np.nan,2,2.14], index=[1,2,3,4])
I tried this:
s.mask(isinstance(s,str))
But I got the following error: ValueError: Array conditional must be same shape as self. I am new to these methods and would appreciate an explanation of why this happens.
You can use
out = s.mask(s.apply(type).eq(str))
print(out)
1 NaN
2 NaN
3 2
4 2.14
dtype: object
If you are set on using mask, you could try:
s = pd.Series(['hey','hey',2,2.14],index=[1,2,3,4])
s = s.mask(s.apply(isinstance, args=(str,)))
print(s)
1 NaN
2 NaN
3 2
4 2.14
dtype: object
But as you can see, many roads lead to Rome...
Use to_numeric with the errors="coerce" parameter.
s = pd.to_numeric(s, errors = 'coerce')
Out[73]:
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64
IIUC, you need to create the pd.Series as shown below and then apply isinstance to each element:
import numpy as np
import pandas as pd
s = pd.Series(['hey','hey',2,2.14],index=[1,2,3,4])
s = s.apply(lambda x: np.nan if isinstance(x, str) else x)
print(s)
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64
You could use:
s[s.str.match(r'\D+').fillna(False)] = np.nan
But if you want to convert every value of type str, not just strings that are non-numeric (a string like "1.23" is left untouched by the match above), then refer to @Ynjxsjmh's answer.
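As for why the original attempt fails: isinstance(s, str) is evaluated once on the Series object itself, so it yields a single False rather than one boolean per element, and mask needs a condition with the same shape as the series. The sketch below makes that visible; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['hey', 'hey', 2, 2.14], index=[1, 2, 3, 4])

# One scalar for the whole object: a Series is not a str.
whole = isinstance(s, str)

# One boolean per element, aligned with the index of s.
is_str = s.map(lambda v: isinstance(v, str))

# mask replaces the values where the condition is True with NaN.
out = s.mask(is_str)

print(whole)
print(is_str.tolist())
print(out.tolist())
```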

Parse Out Last Sequence Of Numbers From Pandas Column to create new column

I have a dataframe with codes like the following and would like to create a new column that has the last sequence of numbers parsed out.
array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
So the new column would contain the following:
array([2,2,2,12,None])
Sample data
df:
codes
0 K9ADXXL2
1 K9ADXL2
2 K9ADXS2
3 IVERMAXSCM12
4 HPDMUDOGDRYL
Use str.extract to get the digits at the end of each string, then pass the result to pd.to_numeric:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce')
Out[11]:
0 2.0
1 2.0
2 2.0
3 12.0
4 NaN
Name: 0, dtype: float64
If you want the value as a string of digits, you can use str.findall or str.extract as follows:
df.codes.str.findall(r'\d+$').str[0]
or
df.codes.str.extract(r'(\d+$)')[0]
Out[20]:
0 2
1 2
2 2
3 12
4 NaN
Name: codes, dtype: object
import re
import pandas as pd
def get_trailing_digits(s):
    # Search for a run of digits anchored at the end of the string.
    match = re.search(r"[0-9]+$", s)
    return match.group(0) if match else None
original_column = pd.array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
new_column = pd.array([get_trailing_digits(s) for s in original_column])
# ['2', '2', '2', '12', None]
[0-9] means any digit
+ means one or more times
$ means only at the end of the string
You can use the apply function of a series/data frame with get_trailing_digits as the function.
eg.
my_df["new column"] = my_df["old column"].apply(get_trailing_digits)
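If the new column should hold integers with a proper missing value, as in the desired array([2,2,2,12,None]), one possibility is to convert the extracted digits to pandas' nullable Int64 dtype. This is a sketch under that assumption; last_number is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame(
    {"codes": ["K9ADXXL2", "K9ADXL2", "K9ADXS2", "IVERMAXSCM12", "HPDMUDOGDRYL"]}
)

# Extract the trailing digits as strings (NaN where nothing matches).
digits = df["codes"].str.extract(r"(\d+)$", expand=False)

# Convert to numbers, then to the nullable integer dtype so that
# the missing entry does not force the whole column to float.
df["last_number"] = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(df)
```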

pandas converting floats to strings without decimals

I have a dataframe
df = pd.DataFrame([
['2', '3', 'nan'],
['0', '1', '4'],
['5', 'nan', '7']
])
print(df)
0 1 2
0 2 3 nan
1 0 1 4
2 5 nan 7
I want to convert these strings to numbers and sum the columns and convert back to strings.
Using astype(float) seems to get me to the number part. Then summing is easy with sum(). Then back to strings should be easy too with astype(str)
df.astype(float).sum().astype(str)
0 7.0
1 4.0
2 11.0
dtype: object
That's almost what I wanted. I wanted the string version of integers. But floats have decimals. How do I get rid of them?
I want this
0 7
1 4
2 11
dtype: object
Converting to int (i.e. with .astype(int).astype(str)) won't work if your column contains nulls. It's often a better idea to use string formatting to specify the format explicitly; for display purposes you can set this in pd.options:
>>> pd.options.display.float_format = '{:,.0f}'.format
>>> df.astype(float).sum()
0 7
1 4
2 11
dtype: float64
Add an astype(int) to the mix:
df.astype(float).sum().astype(int).astype(str)
0 7
1 4
2 11
dtype: object
Demonstration of example with empty cells. This was not a requirement from the OP but to satisfy the detractors
df = pd.DataFrame([
['2', '3', 'nan', None],
[None, None, None, None],
['0', '1', '4', None],
['5', 'nan', '7', None]
])
df
0 1 2 3
0 2 3 nan None
1 None None None None
2 0 1 4 None
3 5 nan 7 None
Then
df.astype(float).sum().astype(int).astype(str)
0 7
1 4
2 11
3 0
dtype: object
Because the OP didn't specify what they'd like to happen when a column was all missing, presenting zero is a reasonable option.
However, we could also drop those columns
df.dropna(axis=1, how='all').astype(float).sum().astype(int).astype(str)
0 7
1 4
2 11
dtype: object
For pandas >= 1.0:
<NA> type was introduced for 'Int64'. You can now do this:
df['your_column'].astype('Int64').astype('str')
And it will properly convert 1.0 to 1.
Alternative:
If you do not want to change the display options for all of pandas, as @maxymoo's solution does, you can use apply:
df['your_column'].apply(lambda x: f'{x:.0f}')
Add astype(int) right before conversion to a string:
print (df.astype(float).sum().astype(int).astype(str))
Generates the desired result.
An alternative based on toto_tico's solution, with a minor change so that null values become empty strings instead of 'nan':
df['your_column'].apply(lambda x: f'{x:.0f}' if not pd.isnull(x) else '')
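The above didn't work for me, so I'm adding my own solution.
Convert to a string and strip away a trailing .0. Use a regex anchored at the end of the string rather than rstrip('.0'), which strips any trailing '.' and '0' characters and would turn '10.0' into '1':
db['a'] = db['a'].astype(str).str.replace(r'\.0$', '', regex=True)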
The above solutions, when converting to string, will turn NaN into a string as well. To get around that and retain NaN, use:
c = ...  # your column
np.where(
    df[c].isnull(), np.nan,
    df[c].apply('{:.0f}'.format)
)
Retaining NaN allows you to do stuff like convert a nullable column of integers like 19991231, 20000101, np.nan, 20000102 into date time without triggering date parsing errors.
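Putting the formatting approach together with the question's own data, a compact version might look like the sketch below. It leaves the global display options alone; totals and as_text are illustrative names, and '{:.0f}' rounds, so it is only appropriate when the sums are expected to be whole numbers.

```python
import pandas as pd

df = pd.DataFrame([
    ['2', '3', 'nan'],
    ['0', '1', '4'],
    ['5', 'nan', '7'],
])

# 'nan' strings become NaN, which sum() skips by default.
totals = df.astype(float).sum()

# Format each float with zero decimal places to get integer-looking strings.
as_text = totals.map('{:.0f}'.format)

print(as_text.tolist())  # ['7', '4', '11']
```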

Pandas: convert column with empty strings to float

In my application, I receive a pandas DataFrame (say, block), that has a column called est. This column can contain a mix of strings or floats. I need to convert all values in the column to floats and have the column type be float64. I do so using the following code:
block[est].convert_objects(convert_numeric=True)
block[est].astype('float')
This works for most cases. However, in one case, est contains all empty strings. In this case, the first statement executes without error, but the empty strings in the column remain empty strings. The second statement then causes an error: ValueError: could not convert string to float:.
How can I modify my code to handle a column with all empty strings?
Edit: I know I can just do block[est].replace("", np.NaN), but I was wondering if there's some way to do it with just convert_objects or astype that I'm missing.
Clarification: For project-specific reasons, I need to use pandas 0.16.2.
Here's an interaction with some sample data that demonstrates the failure:
>>> block = pd.DataFrame({"eps":["", ""]})
>>> block = block.convert_objects(convert_numeric=True)
>>> block["eps"]
0
1
Name: eps, dtype: object
>>> block["eps"].astype('float')
...
ValueError: could not convert string to float:
It's easier to do it using:
pandas.to_numeric
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_numeric.html
import pandas as pd
df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df['eps'] = pd.to_numeric(df['eps'], errors='coerce')
'coerce' will convert any value error to NaN
df['eps'].astype('float')
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
Then you can apply other functions without getting errors :
df['eps'].round()
0 1.0
1 2.0
2 2.0
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
def convert_float(val):
    # Return the value as a float, or NaN if it cannot be parsed.
    try:
        return float(val)
    except ValueError:
        return np.nan
df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
>>> df.eps.apply(lambda x: convert_float(x))
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
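For the case the question singles out, where the only non-numeric values are empty strings, masking them before the cast is another option that avoids pd.to_numeric. This is a sketch on made-up data and has not been checked against pandas 0.16.2; any other non-numeric string would still make astype(float) raise.

```python
import pandas as pd

# Hypothetical sample: empty strings mixed with numeric strings and floats.
block = pd.DataFrame({"est": ["", "1.5", 2.0, ""]}, dtype=object)

est = block["est"]
# Replace empty strings with NaN, then cast the column to float64.
block["est"] = est.mask(est == "").astype(float)

# The all-empty column from the question is handled the same way.
all_empty = pd.Series(["", ""], dtype=object)
converted = all_empty.mask(all_empty == "").astype(float)

print(block["est"])
print(converted)
```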
