I have a field that looks like
field1
231-206-2222
231-206-2344
231-206-1111
231-206-1111
I tried regexing it but to no avail. I am new to this so any ideas would help. Any suggestions?
It seems like a DataFrame to me; if so, try this:
df['field1'].apply(lambda x: x.replace("-",""))
There are many ways of doing it.
Demo:
1) # re.sub replaces each hyphen with an empty string
import re
df = pd.DataFrame({'field1': ['123-456-999', '333-222-111']})
df['field1'] = df['field1'].apply(lambda x: re.sub(r'-', '', x))
2) # \D+ matches one or more non-digit characters and removes them
df['field1'] = df['field1'].str.replace(r'\D+', '', regex=True)
3) # plain (non-regex) replacement of '-' with an empty string
df['field1'] = df['field1'].str.replace('-', '', regex=False)
Result:
field1
0 123456999
1 333222111
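Putting those options together, a minimal runnable sketch (assuming the column always holds strings, using the phone numbers from the question):

```python
import re

import pandas as pd

# Sample data mirroring the question's field1 column
df = pd.DataFrame({"field1": ["231-206-2222", "231-206-2344", "231-206-1111"]})

# Option 1: re.sub via apply - replaces each hyphen with an empty string
opt1 = df["field1"].apply(lambda x: re.sub(r"-", "", x))

# Option 2: regex str.replace - \D+ strips any run of non-digit characters
opt2 = df["field1"].str.replace(r"\D+", "", regex=True)

# Option 3: plain (non-regex) replacement of the hyphen
opt3 = df["field1"].str.replace("-", "", regex=False)

print(opt1.tolist())  # ['2312062222', '2312062344', '2312061111']
```

All three produce the same result here; the \D+ variant is the most general, since it also strips spaces, parentheses, and any other non-digit characters.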
I have an Excel file with a lot of data from some customers.
For some reason, when reading the Excel file, the Jupyter notebook cannot display the fields with missing values.
The values are missing because the customers didn't fill in this data.
I tried a lot of things when importing the data.
df = pd.read_excel(r'Recomeco/Decisões 06.09 a 12.09.xlsx', index_col=0, skiprows=2)#, na_values=['CELULAR', 'EMAIL']) #keep_default_na = False, na_filter= False, verbose= True)
I don't know why this happens.
In the CELULAR column:
Cellphone_original Cellphone_GOAL
(12)98272-8620 55 12 98272-8620
I used this function:
def split_CELULAR(celular):
    number = re.findall(r"\d+-\d+", celular)
    return number
df['CELULAR1'] = df['CELULAR'].apply(split_CELULAR)
Value found: 98272-8620
But I have to add the values 55 12 to every row, and I can't do it.
Could someone help me?
You can use
df['CELULAR1'] = df['CELULAR'].str.replace(r'^\(\d+\)', '55 12 ', regex=True)
The regex (see its online demo) means:
^ - start of string
\( - a ( char
\d+ - one or more digits
\) - a ) char.
Note the regex=True argument; it is necessary to avoid warnings, see the FutureWarning: The default value of regex will change from True to False in a future version thread.
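As a self-contained sketch of that substitution (the CELULAR column name comes from the question; the sample number is the one shown there):

```python
import pandas as pd

df = pd.DataFrame({"CELULAR": ["(12)98272-8620"]})

# ^\(\d+\) matches a leading "(digits)" group at the start of the string
# and replaces it with the "55 12 " prefix the question asks for
df["CELULAR1"] = df["CELULAR"].str.replace(r"^\(\d+\)", "55 12 ", regex=True)

print(df["CELULAR1"].iloc[0])  # 55 12 98272-8620
```

Note this hard-codes "55 12"; if the area code inside the parentheses should be preserved instead, a capture group like r"^\((\d+)\)" with replacement r"55 \1 " would keep it.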
I previously asked this but I got the data type wrong!
I have my Pandas Dataframe, which looks like this
print(data)
text
0 FollowFriday for being top engaged members...
1 Hey James! How odd :/ Please call our Contact...
2 we had a listen last night :) As You Bleed is...
In this dataframe there are links, which all start with "http". I already have a function, below, which removes words starting with '#' and applies other cleaning methods.
def cleanData(data):
    # Keep only ASCII characters in each string
    data['text'] = data['text'].apply(lambda s: "".join(char for char in s if char.isascii()))
    # Remove any digits
    data['text'] = data['text'].apply(lambda s: "".join(char for char in s if not char.isdigit()))
    # Remove any words which start with #, which are replies
    data['text'] = data['text'].str.replace(r'(#\w+.*?)', "", regex=True)
    # Remove any leftover characters
    data = data['text'].str.replace(r'[^\w\s]', '', regex=True)
    # Return the cleaned data
    return data
Can anyone help me remove words which start with 'http', please? I have already tried to edit what I have, but no luck so far.
Thanks in advance!
Use Series.str.replace:
data['text'] = data['text'].str.replace(r'http[^\s]*', "", regex=True)
One option is to use the str.replace() method:
df = pd.DataFrame( dict(text = [r'FollowFridayhttphttp http http for being top engaged members.',r'James!http How odd http:/ Please call ou',r'httpe had a listen last night :) As You Bleed is...']))
df['text'] = df['text'].apply(lambda x: x.replace('http',''))
And you could do something like this in your function.
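One caveat: replacing the literal substring 'http' leaves the rest of each URL behind, so removing the whole token with a regex (as in the str.replace answer above) is usually what you want. A small sketch with made-up sample text:

```python
import pandas as pd

df = pd.DataFrame({"text": ["call us http://example.com today", "no links here"]})

# http\S* matches "http" plus everything up to the next whitespace,
# removing the entire URL token rather than just the "http" prefix;
# the second replace collapses the doubled spaces left behind
df["text"] = (
    df["text"]
    .str.replace(r"http\S*", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(df["text"].tolist())  # ['call us today', 'no links here']
```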
I looked for a while on here, but couldn't find the answer.
df['Products'] = ['CC: buns', 'people', 'CC: help me']
Trying to get only the text after the colon, or keep the text if there is no colon in the string.
Tried a lot of things, but this was my final attempt.
x['Product'] = x['Product'].apply(lambda i: i.extract(r'(?i):(.+)') if ':' in i else i)
I get this error:
Might take two steps, I assume.
I tried this:
x['Product'] = x['Product'].str.extract(r'(?i):(.+)')
That got me everything after the colon and a bunch of NaN, so my regex is working. I am assuming my lambda is the problem.
Use str.split and get the last item:
df['Products'] = df['Products'].str.split(': ').str[-1]
Out[9]:
Products
0 buns
1 people
2 help me
Try this
df['Products'] = df.Products.apply(lambda x: x.split(': ')[-1] if ':' in x else x)
print(df)
Output:
Products
0 buns
1 people
2 help me
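The same result can also be had in a single regex pass, without any conditional, since rows that don't match are simply left unchanged (a sketch, using the data from the question):

```python
import pandas as pd

df = pd.DataFrame({"Products": ["CC: buns", "people", "CC: help me"]})

# ^[^:]*:\s* matches any prefix ending in a colon, plus trailing spaces;
# rows without a colon don't match and pass through as-is
df["Products"] = df["Products"].str.replace(r"^[^:]*:\s*", "", regex=True)

print(df["Products"].tolist())  # ['buns', 'people', 'help me']
```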
I have a pandas dataframe that looks like this:
COL
hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?
...
Im fine, what A/P_49 A/P_0.0309 about you?
The expected result should be:
COL
hi how are you?
...
Im fine, what about you?
How can I efficiently remove, from a column or from the full pandas dataframe, all the strings that have A/P_?
I tried with this regular expression:
A/P_(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
However, I do not know if there's a simpler or more robust way of removing all those substrings from my dataframe. How can I remove all the strings that have A/P_ at the beginning?
UPDATE
I tried:
df_sess['COL'] = df_sess['COL'].str.replace(r'A/P(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '')
And it works; however, I would like to know if there's a more robust way of doing this, possibly with a regular expression.
One way could be to use \S*, matching all non-whitespace characters after A/P_, and also add \s to remove the whitespace after the string to remove, such as:
df_sess['COL'] = df_sess['col'].str.replace(r'A/P_\S*\s', '')
In your input, it seems there is a typo (or at least I think so), so with this input:
df_sess = pd.DataFrame({'col':['hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
'Im fine, what A/P_49 A/P_0.0309 about you?']})
print (df_sess['col'].str.replace(r'A/P_\S*\s', ''))
0 hi how are you ?
1 Im fine, what about you?
Name: col, dtype: object
you get the expected output
How about:
(df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
.replace('\s+',' ', regex=True))
Full example:
import pandas as pd
df = pd.DataFrame({
'COL':
["hi A/P_90890 how A/P_True A/P_/93290 AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
"Im fine, what A/P_49 A/P_0.0309 about you?"]
})
df['COL'] = (df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
.replace('\s+',' ', regex=True))
Returns (note there is an extra space before the ?):
COL
0 hi how you ?
1 Im fine, what about you?
Because of a pandas 0.23.0 bug in the replace() function (https://github.com/pandas-dev/pandas/issues/21159), the following error occurs when trying to replace by regex pattern:
df.COL.str.replace(regex_pat, '', regex=True)
...
--->
TypeError: Type aliases cannot be used with isinstance().
I would suggest to use pandas.Series.apply function with precompiled regex pattern:
In [1170]: df4 = pd.DataFrame({'COL': ['hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?', 'Im fine, what A/P_49 A/P_0.0309 about you?']})
In [1171]: pat = re.compile(r'\s*A/?P_[^\s]*')
In [1172]: df4['COL']= df4.COL.apply(lambda x: pat.sub('', x))
In [1173]: df4
Out[1173]:
COL
0 hi how are you ?
1 Im fine, what about you?
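On recent pandas versions the str.replace bug mentioned above is fixed, so a vectorized one-liner also works; this is a sketch assuming pandas >= 1.x, where consuming the leading space in the same pattern avoids the stray gap before the final "?":

```python
import pandas as pd

df = pd.DataFrame({"COL": [
    "hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
    "Im fine, what A/P_49 A/P_0.0309 about you?",
]})

# \s*A/?P_\S* matches optional leading whitespace, "A/P_" (or the "AP_"
# typo variant), and everything up to the next whitespace
df["COL"] = df["COL"].str.replace(r"\s*A/?P_\S*", "", regex=True)

print(df["COL"].tolist())
```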
I have a string that looks like this
name = '1/23/20151'
And now I want to remove just the trailing 1 at the end of 2015, so that it becomes
1/23/2015
So I tried this
sep = '2015'
name = name.split(sep, 1)[0]
but this removes the 2015 also; I want the 2015 to stay. How could I do this?
Thanks for the help in advance.
EDIT
Sorry, I didn't fully explain the problem. I have two strings: the one previously mentioned and a normal date, '1/22/2015'. I loop through and only want to remove the extra character if it is there, which is why name = name[:-1] doesn't work.
name = name.rstrip('1')
will only remove trailing '1'
name = '1/23/20151'
name = name.rstrip('1') # 1/23/2015
'1/23/2015'.rstrip('1') # 1/23/2015
Just do this:
name = name[:-1]
That should do it.
If you only want to remove the fifth digit after the year, I'd do this:
name = name.split('/')
name = '/'.join([name[0],name[1],name[2][:4]])
List slicing can easily accomplish this:
>>> name[:-1]
'1/23/2015'
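Since the data mixes '1/23/20151' with normal dates like '1/22/2015', unconditionally slicing off the last character would corrupt the clean dates. One option, a sketch assuming the dates always use four-digit years, is to extract only the M/D/YYYY prefix:

```python
import re

def clean_date(name):
    # Keep only the leading M/D/YYYY portion; anything after the
    # four-digit year (such as a stray trailing '1') is discarded
    m = re.match(r"(\d{1,2}/\d{1,2}/\d{4})", name)
    return m.group(1) if m else name

print(clean_date("1/23/20151"))  # 1/23/2015
print(clean_date("1/22/2015"))   # 1/22/2015
```

This leaves well-formed dates untouched while trimming the extra digit, and returns the input unchanged if it doesn't look like a date at all.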