How to delete substrings with specific characters in a pandas dataframe? - python

I have a pandas dataframe that looks like this:
COL
hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?
...
Im fine, what A/P_49 A/P_0.0309 about you?
The expected result should be:
COL
hi how are you?
...
Im fine, what about you?
How can I efficiently remove, from one column or from the full pandas dataframe, all the substrings that start with A/P_?
I tried with this regular expression:
A/P_(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
However, I do not know if there's a simpler or more robust way of removing all those substrings from my dataframe. How can I remove all the substrings that start with A/P_?
UPDATE
I tried:
df_sess['COL'] = df_sess['COL'].str.replace(r'A/P_(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', regex=True)
And it works; however, I would like to know if there's a more robust way of doing this, possibly with a simpler regular expression.

One way could be to use \S* to match all non-whitespace characters after A/P_, and to add \s to also remove the whitespace after the substring, such as:
df_sess['col'] = df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True)
In your input, it seems there is a typo (AP_wueiwo instead of A/P_wueiwo, or at least I think so), so with this input:
df_sess = pd.DataFrame({'col':['hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
'Im fine, what A/P_49 A/P_0.0309 about you?']})
print(df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True))
0 hi how are you ?
1 Im fine, what about you?
Name: col, dtype: object
you get the expected output.
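Note that A/P_\S*\s requires a whitespace after the token, so a token at the very end of a string would survive. A hedged variant (data adapted from the question) that consumes the leading space instead:

```python
import pandas as pd

df_sess = pd.DataFrame({'col': ['hi A/P_90890 how A/P_True you A/P_?9028k',
                                'Im fine, what A/P_49 about you?']})
# \s* consumes the space *before* each token, so a token at the end of
# the string is removed too and no double spaces are left behind
df_sess['col'] = df_sess['col'].str.replace(r'\s*A/P_\S*', '', regex=True)
print(df_sess['col'].tolist())
# ['hi how you', 'Im fine, what about you?']
```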

How about:
(df['COL'].replace(r'A[/P|P][^ ]+', '', regex=True)
 .replace(r'\s+', ' ', regex=True))
Full example:
import pandas as pd
df = pd.DataFrame({
'COL':
["hi A/P_90890 how A/P_True A/P_/93290 AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
"Im fine, what A/P_49 A/P_0.0309 about you?"]
})
df['COL'] = (df['COL'].replace(r'A[/P|P][^ ]+', '', regex=True)
             .replace(r'\s+', ' ', regex=True))
Returns (note the extra space before the ?):
COL
0 hi how you ?
1 Im fine, what about you?

Because of a pandas 0.23.0 bug in the replace() function (https://github.com/pandas-dev/pandas/issues/21159), the following error occurs when trying to replace by regex pattern:
df.COL.str.replace(regex_pat, '', regex=True)
...
--->
TypeError: Type aliases cannot be used with isinstance().
I would suggest using the pandas.Series.apply function with a precompiled regex pattern:
In [1170]: df4 = pd.DataFrame({'COL': ['hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?', 'Im fine, what A/P_49 A/P_0.0309 about you?']})
In [1171]: import re; pat = re.compile(r'\s*A/?P_[^\s]*')
In [1172]: df4['COL']= df4.COL.apply(lambda x: pat.sub('', x))
In [1173]: df4
Out[1173]:
COL
0 hi how are you ?
1 Im fine, what about you?

Related

Removing # mentions from pandas DataFrame column

I am working on a thesis project on smart working. I downloaded some tweets using Python and I wanted to get rid of users/mentions before implementing word clouds. However, I can't delete the users; with the commands shown I only delete the "#".
df['token']=df['token'].apply(lambda x:re.sub(r"#mention","", x))
df['token']=df['token'].apply(lambda x:re.sub(r"#[A-Za-z0-9]+","", x))
Your second snippet should work; however, for efficiency use str.replace:
df['token2'] = df['token'].str.replace(r'#[A-Za-z0-9]+\s?', '', regex=True)
# or use \w, which matches [A-Za-z0-9_]
# df['token2'] = df['token'].str.replace(r'#\w+\s?', '', regex=True)
example:
token token2
0 this is a #test case this is a case

Pandas in df column extract string after colon if colon exits; if not, keep text

Looked for a while on here, but couldn't find the answer.
df['Products'] = ['CC: buns', 'people', 'CC: help me']
Trying to get only text after colon or keep text if no colon is in the string.
Tried a lot of things, but this was my final attempt.
x['Product'] = x['Product'].apply(lambda i: i.extract(r'(?i):(.+)') if ':' in i else i)
I get this error:
Might take two steps, I assume.
I tried this:
x['Product'] = x['Product'].str.extract(r'(?i):(.+)')
Got me everything after the colon and a bunch of NaN, so my regex is working. I am assuming my lambda sucks.
Use str.split and get the last item:
df['Products'] = df['Products'].str.split(': ').str[-1]
Out[9]:
Products
0 buns
1 people
2 help me
Try this
df['Products'] = df.Products.apply(lambda x: x.split(': ')[-1] if ':' in x else x)
print(df)
Output:
Products
0 buns
1 people
2 help me
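The two-step route the asker guessed at also works: str.extract everything after the colon, then fall back to the original value for the rows without one. A sketch, assuming the Products column from the question:

```python
import pandas as pd

df = pd.DataFrame({'Products': ['CC: buns', 'people', 'CC: help me']})
# rows without a colon yield NaN from str.extract ...
after_colon = df['Products'].str.extract(r':\s*(.+)')[0]
# ... which fillna then replaces with the original text
df['Products'] = after_colon.fillna(df['Products'])
print(df['Products'].tolist())
# ['buns', 'people', 'help me']
```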

How can I convert into lower case the first letter of each word in a pandas column?

I would like to know how to convert the first letter of each word in this column:
Test
There is a cat UNDER the table
The pen is working WELL.
Into lower case, in order to have
Test
there is a cat uNDER the table
the pen is working wELL.
I know there is capitalize() but I would need a function which does the opposite.
Many thanks
Please note that the strings are within a column.
I don't believe there is a builtin for this, but I could be mistaken. This is, however, quite easy to do with a comprehension!
" ".join(i[0].lower()+i[1:] for i in line.split(" "))
Where line is each individual line.
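Applied to a DataFrame column (column name Test assumed from the question), the comprehension could be wrapped in apply; note it assumes words are separated by single spaces, so there are no empty tokens:

```python
import pandas as pd

df = pd.DataFrame({'Test': ['There is a cat UNDER the table',
                            'The pen is working WELL.']})
# lower-case only the first character of each word, keep the rest as-is
df['Test'] = df['Test'].apply(
    lambda line: ' '.join(w[0].lower() + w[1:] for w in line.split(' ')))
print(df['Test'].tolist())
# ['there is a cat uNDER the table', 'the pen is working wELL.']
```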
According to this solution you can do :
>>> func = lambda s: s[:1].lower() + s[1:] if s else ''
>>> sent = "There is a cat UNDER the table "
>>> res = " ".join(list(map(func , sent.split())))
>>> res
'there is a cat uNDER the table'
You can use .str.lower, .str.split and ' '.join:
s=df.Test.str.split()
df.Test=s.str[0].str.lower()+' '+s.str[1:].agg(' '.join)
Same as splitting the words with .str.split and then modifying with apply:
df.Test=df.Test.str.split().apply(lambda x: [x[0].lower()]+x[1:] ).agg(' '.join)
Both outputs:
df
Test
0 there is a cat UNDER the table
1 the pen is working WELL.

How to remove - from values in a field- python or pyspark

I have a field that looks like
field1
231-206-2222
231-206-2344
231-206-1111
231-206-1111
I tried regexing it but to no avail. I am new to this so any ideas would help. Any suggestions?
It seems like a dataframe to me; if so, try this:
df['field1'].apply(lambda x: x.replace("-",""))
There are many ways of doing it.
Demo:
import re
1) # re.sub replaces each hyphen with an empty string
df = pd.DataFrame({'field1': ['123-456-999', '333-222-111']})
df['field1'] = df['field1'].apply(lambda x: re.sub(r'-', '', x))
2) # \D+ matches one or more non-digits and removes them
df['field1'] = df['field1'].str.replace(r'\D+', '', regex=True)
3) # plain replacement of - with an empty string
df['field1'] = df['field1'].str.replace('-', '', regex=False)
Result:
field1
0 123456999
1 333222111

Replace multiple substrings in a Pandas series with a value

All,
To replace one string in one particular column I have done this and it worked fine:
dataUS['sec_type'].str.strip().str.replace("LOCAL","CORP")
I would like now to replace multiple strings with one string say replace ["LOCAL", "FOREIGN", "HELLO"] with "CORP"
How can make it work? the code below didn't work
dataUS['sec_type'].str.strip().str.replace(["LOCAL", "FOREIGN", "HELLO"], "CORP")
You can perform this task by forming a |-separated string. This works because pd.Series.str.replace accepts regex:
Replace occurrences of pattern/regex in the Series/Index with some
other string. Equivalent to str.replace() or re.sub().
This avoids the need to create a dictionary.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
pattern = '|'.join(['LOCAL', 'FOREIGN', 'HELLO'])
df['A'] = df['A'].str.replace(pattern, 'CORP', regex=True)
# A
# 0 CORP TEST
# 1 TEST CORP
# 2 ANOTHER CORP
# 3 NOTHING
replace can accept a dict, so we just create a dict of the values that need to be replaced:
dataUS['sec_type'].str.strip().replace(dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3)),regex=True)
Info of the dict
dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3))
Out[585]: {'FOREIGN': 'CORP', 'HELLO': 'CORP', 'LOCAL': 'CORP'}
The reason why you receive the error: str.replace is different from replace.
The answer of @Rakesh is very neat but does not allow for substrings. With a small change, however, it does.
Use a replacement dictionary, because it makes the code much more generic.
Add the keyword argument regex=True to Series.replace() (not Series.str.replace). This does two things: it changes your replacement to regex replacement, which is much more powerful but means you have to escape special characters (beware of that), and it makes the replace work on substrings instead of only on the entire string. Which is really cool!
replacement = {
    "LOCAL": "CORP",
    "FOREIGN": "CORP",
    "HELLO": "CORP"
}
dataUS['sec_type'].replace(replacement, regex=True)
Full code example
dataUS = pd.DataFrame({'sec_type': ['LOCAL', 'Sample text LOCAL', 'Sample text LOCAL sample FOREIGN']})
replacement = {
    "LOCAL": "CORP",
    "FOREIGN": "CORP",
    "HELLO": "CORP"
}
dataUS['sec_type'].replace(replacement, regex=True)
Output
0 CORP
1 CORP
2 Sample text CORP
3 Sample text CORP sample CORP
Name: sec_type, dtype: object
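If the keys of the replacement dict contain regex metacharacters, they can be escaped with re.escape before being passed to replace; a small sketch with made-up values:

```python
import re
import pandas as pd

s = pd.Series(['price (USD): LOCAL', 'rate+fee FOREIGN'])
raw = {'price (USD)': 'PRICE', 'rate+fee': 'RATE'}
# escape each key so '(', ')' and '+' are matched literally
replacement = {re.escape(k): v for k, v in raw.items()}
print(s.replace(replacement, regex=True).tolist())
# ['PRICE: LOCAL', 'RATE FOREIGN']
```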
@JJP's answer is a good one if you have a long list. But if you have just two or three entries, you can simply use '|' within the pattern. Make sure to add the regex=True parameter.
Clearly .str.strip() is not a requirement, but it is good practice.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
df['A'] = df['A'].str.strip().str.replace("LOCAL|FOREIGN|HELLO", "CORP", regex=True)
output
A
0 CORP TEST
1 TEST CORP
2 ANOTHER CORP
3 NOTHING
A function to replace multiple values in a pandas Series:
def replace_values(series, to_replace, value):
    for i in to_replace:
        series = series.str.replace(i, value)
    return series
Hope this helps someone
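A quick usage sketch of the helper above (the function is repeated so the snippet is self-contained; regex=False is added here since the substrings are literals):

```python
import pandas as pd

def replace_values(series, to_replace, value):
    # replace each literal substring in turn
    for i in to_replace:
        series = series.str.replace(i, value, regex=False)
    return series

s = pd.Series(['LOCAL A', 'B FOREIGN', 'HELLO C', 'NOTHING'])
print(replace_values(s, ['LOCAL', 'FOREIGN', 'HELLO'], 'CORP').tolist())
# ['CORP A', 'B CORP', 'CORP C', 'NOTHING']
```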
Try:
dataUS.replace({"sec_type": { 'LOCAL' : "CORP", 'FOREIGN' : "CORP"}})
