How to drop parentheses within column or data frame - python

df =
    A   B
0   1   5
1   2  6)
2  (3   7
3   4   8
To remove parentheses I did:
df.A = df.A.str.replace(r"\(.*\)","")
But it has no effect. I have checked a lot of replies here, but I still get the same result. I would appreciate a way to remove the parentheses from the whole data set, or at least from one column.

to remove parentheses from the whole data set
With regex character class [...] :
In [15]: df.apply(lambda s: s.str.replace(r'[()]', '', regex=True))
Out[15]:
A B
0 1 5
1 2 6
2 3 7
3 4 8
Or the same with df.replace(r'[()]', '', regex=True), which is more concise.

If you want a regex, you can use r"[()]" instead of alternation groups, as long as you only need to replace one character at a time.
df.A = df.A.str.replace(r"[()]", "", regex=True)
I find it easier to read and alter if needed.
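As a self-contained illustration of the whole-frame approach, here is a minimal sketch with made-up data (the column values are assumptions, not the asker's actual frame):

```python
import pandas as pd

# Hypothetical frame with stray parentheses in its string columns
df = pd.DataFrame({'A': ['1', '2', '(3', '4'],
                   'B': ['5', '6)', '7', '8']})

# regex=True makes replace() treat '[()]' as a character class
# matching either '(' or ')' anywhere in each cell
cleaned = df.replace(r'[()]', '', regex=True)
print(cleaned)
```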

Related

How to get some string of dataframe column?

I have a dataframe like this.
print(df)
      ID  ... Control
0  PDF-1  ...     NaN
1  PDF-3  ...     NaN
2  PDF-4  ...     NaN
I want to get only the number from the ID column, so the result will be:
1
3
4
How can I get just that part of the strings in the dataframe column?
How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')
Could you please try the following.
df['ID'].replace(regex=True,to_replace=r'([^\d])',value=r'')
One can refer to the documentation for df.replace.
This uses a regex to remove everything apart from digits in the column named ID: \d denotes a digit, and [^\d] matches everything that is not a digit.
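A minimal runnable sketch of this approach (the sample IDs mirror the question's data):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['PDF-1', 'PDF-3', 'PDF-4']})

# [^\d] matches any single non-digit character; replacing every
# such match with '' leaves only the digits
digits = df['ID'].replace(to_replace=r'[^\d]', value='', regex=True)
print(digits.tolist())
```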
Another possibility using Regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID':['PDF-1','PDF-2','PDF-3','PDF-4','PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
0
0 1
1 2
2 3
3 4
4 5
ID
0 PDF-1
1 PDF-2
2 PDF-3
3 PDF-4
4 PDF-5
Find "PDF-" and replace it with nothing:
df['ID'] = df['ID'].str.replace('PDF-', '')
Then, to print it the way you asked, I'd convert the column to a string with no index.
print(df['ID'].to_string(index=False))
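If the result is needed as integers rather than strings, the extracted digits can be cast afterwards; a sketch assuming the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['PDF-1', 'PDF-3', 'PDF-4']})

# str.extract returns a DataFrame (one column per capture group);
# take column 0 and cast the digit strings to int
nums = df['ID'].str.extract(r'(\d+)')[0].astype(int)
print(nums.tolist())
```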

How to remove all occurrences of a character in a dataframe?

I have a dataframe with multiple columns, and most have special characters like $, % or ^ and so on. How can I delete these characters throughout the entire data frame? I only know how to delete them column by column, for example:
df['Column'] = df['Column'].str.replace(r'^\d+', '', regex=True)
Just noticed that pandas.DataFrame.replace may not handle special characters like $, %, ^ etc. as expected, since some of them are regex metacharacters. So you can use the following snippet to get rid of these special characters from the whole dataframe. We need to make sure a column is of string dtype before applying str.replace.
import pandas as pd
from pandas.api.types import is_string_dtype
f = pd.DataFrame({'A': [1, 2, 3],
                  'B': [4, 5, 6],
                  'C': ['f;', 'd:', 'sda$sd'],
                  'D': ['s%', 'd;', 'd^p'],
                  'E': [5, 3, 6],
                  'F': [7, 4, 3]})
f looks as follows.
A B C D E F
0 1 4 f; s% 5 7
1 2 5 d: d; 3 4
2 3 6 sda$sd d^p 6 3
Now use a loop to replace the strings.
for col in f.columns:
    if is_string_dtype(f[col]):
        f[col] = f[col].str.replace(r'[^A-Za-z0-9\s-]+', '', regex=True)
Output:
A B C D E F
0 1 4 f s 5 7
1 2 5 d d 3 4
2 3 6 sdasd dp 6 3
UPDATE:
The pandas version 0.24.1 does not replace some special characters, but versions 0.23.4 and 0.25.1 do work. So if you have any of these working versions, you can easily use pandas.DataFrame.replace to delete the special characters as follows. Make sure to escape these characters with \.
f = f.replace({r'\$': '', r'\^': '', '%': ''}, regex=True)
This results in the same output as above.
I think you want:
pandas.DataFrame.replace(to_replace, value)
The parameters accept regex and it should cover the whole df. Hope this helps.
Here's the documentation on this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace
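One way to sidestep the escaping issue entirely is to put the special characters inside a regex character class, where most metacharacters lose their special meaning; a minimal sketch with made-up data:

```python
import pandas as pd

f = pd.DataFrame({'C': ['f;', 'd:', 'sda$sd'],
                  'D': ['s%', 'd;', 'd^p']})

# Inside a character class, $ and % are literal, and ^ is literal
# when it is not the first character, so no backslash-escaping is needed
out = f.replace(r'[;:%$^]', '', regex=True)
print(out)
```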

I am not getting matching string

This is my data, which contains numbers and strings.
df2 = pd.DataFrame({'A': ['1,008$','4,000$','6,000$','10,00$','8,00$','45€','45€']})
df2 = pd.DataFrame(df2, columns = ['A'])
vv=df2[df2['A'].str.match('$')]
I want an output like this.
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
but I am getting this output:
Out[144]:
Empty DataFrame
Columns: [A]
Index: []
can anyone please help me?
A somewhat verbose way using Numpy's defchararray module.
I always want to give this some attention.
# Using #cᴏʟᴅsᴘᴇᴇᴅ's suggestion
# Same function as below but shorter namespace path
df2[np.char.find(df2.A.values.astype(str), '$') >= 0]
Old Answer
from numpy.core.defchararray import find
df2[find(df2.A.values.astype(str), '$') >= 0]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
str.match starts matching from the beginning of the string. However, your $ will be found only at the end.
The fix requires either modifying your pattern or changing the function.
Option 1
str.match with a modified pattern (so \$ is matched at the end) -
df2[df2.A.str.match(r'.*\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
If you want to be specific about what is matched, you can match only on digits and commas -
df2[df2.A.str.match(r'[\d,]+\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
Note that this does not account for invalid entries in your column (they're matched as long as they have those characters somewhere in the string, and are terminated by $).
Option 2
str.contains
df2[df2.A.str.contains(r'\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
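Since the test here is simply "the value ends with $", the plain (non-regex) string method str.endswith also works and avoids escaping altogether; a sketch using the question's data:

```python
import pandas as pd

df2 = pd.DataFrame({'A': ['1,008$', '4,000$', '6,000$',
                          '10,00$', '8,00$', '45€', '45€']})

# str.endswith does a literal suffix test, so '$' needs no escaping
dollars = df2[df2['A'].str.endswith('$')]
print(dollars)
```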

changing values in a column of data set using regex in pandas

This is a subset of a data frame:
Index duration
1 4 months20mg 1X D
2 1 years10 1X D
3 2 weeks10 mg
4 8 years300 MG 1X D
5 20 days
6 10 months
The output should be like this:
Index duration
1 4 month
2 1 year
3 2 week
4 8 year
5 20 day
6 10 month
This is my code:
df.dosage_duration.replace(r'year[0-9a-zA-z]*' , 'year', regex=True)
df.dosage_duration.replace(r'day[0-9a-zA-z]*' , 'day', regex=True)
df.dosage_duration.replace(r'month[0-9a-zA-z]*' , 'month', regex=True)
df.dosage_duration.replace(r'week[0-9a-zA-z]*' , 'week', regex=True)
But it does not work. Any suggestions?
There are two problems.
The first is that your regular expression doesn't match all the parts you want it to match. Look at months20mg 1X D - there is a space in the part you want to replace. I think you could probably just use 'year.*' as your pattern.
The second is that you are calling replace without storing the result. Either assign the result back to the column, or, if you want to keep the calls the way you have them, specify inplace=True.
You can also use a single call if you use a slightly extended regular expression. We can use \1 to refer to the first matching group for the regular expression. The groups are indicated by the parentheses:
df.dosage_duration.replace(r'(year|month|week|day).*', r'\1',
                           regex=True, inplace=True)
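Put together as a runnable sketch (reconstructing the question's frame; the question's text shows the column as duration but the code uses dosage_duration, which is assumed here):

```python
import pandas as pd

df = pd.DataFrame({'dosage_duration': ['4 months20mg 1X D', '1 years10 1X D',
                                       '2 weeks10 mg', '8 years300 MG 1X D',
                                       '20 days', '10 months']})

# (year|month|week|day) captures the unit; .* consumes everything after
# it (including the plural 's' and the dosage); r'\1' writes the
# captured unit back
df['dosage_duration'] = df['dosage_duration'].replace(
    r'(year|month|week|day).*', r'\1', regex=True)
print(df['dosage_duration'].tolist())
```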

How to select DataFrame columns based on partial matching?

I was struggling this afternoon to find a way of selecting few columns of my Pandas DataFrame, by checking the occurrence of a certain pattern in their name (label?).
I had been looking for something like contains or isin for ndarrays / pd.Series, but got no luck.
This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]
However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns - which is an Index - nor to the one returned by df.columns.values - which is an ndarray. It works fine on what is returned by the "slicing" operation df[column_name], i.e. a Series, though.
My first solution involved a for loop and the creation of a help list:
ll = []
for a in df.columns:
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]
(one could apply any of the str functions, of course)
Then, I found the map function and got it to work with the following code:
import re
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]
Of course in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.
I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.
I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would really appreciate if you could correct any mistakes I could have made in the code and point me in the right direction.
Thanks,
Michele
EDIT: I just found the Index method Index.to_series(), which returns - ehm - a Series to which I could apply .str.contains('whatever').
However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.
Select column by partial string, can simply be done, via:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string match, you can pass axis=0 to filter:
df.filter(like='hello', axis=0)
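A minimal sketch of filter with hypothetical column names; besides like=, filter also accepts a regex= argument for full regular-expression matching on the labels:

```python
import pandas as pd

df = pd.DataFrame({'hello_a': [1], 'hello_b': [2], 'other': [3]})

# like= does plain substring matching on the labels;
# regex= matches the labels against a regular expression
cols_like = df.filter(like='hello').columns.tolist()
cols_regex = df.filter(regex=r'^hello').columns.tolist()
print(cols_like, cols_regex)
```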
Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):
In [1]: df
Out[1]:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
In [2]: df.columns.to_series().str.contains('x')
Out[2]:
x True
y False
z False
dtype: bool
In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]:
x
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
UPDATE I just read your last paragraph. From the documentation, str.contains allows you to pass a regex by default (str.contains('^myregex'))
I think df.keys().tolist() is the thing you're searching for.
A tiny example:
from pandas import DataFrame as df
d = df({'somename': [1, 2, 3], 'othername': [4, 5, 6]})
names = d.keys().tolist()
for n in names:
    print(n)
    print(type(n))
Output:
somename
<class 'str'>
othername
<class 'str'>
Then with the strings you got, you can do any string operation you want.
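For instance, since the column labels iterate as plain strings, a list comprehension gives the same partial-match selection without converting to a Series (the column names here are made up):

```python
import pandas as pd
import re

df = pd.DataFrame({'start_exp1_x': [1], 'start_exp2_y': [2], 'other': [3]})

# each label is an ordinary str, so any str method or re function applies
cols = [c for c in df.columns if re.search(r'^start_exp', c)]
print(df[cols].columns.tolist())
```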
