replacing quotes, commas, apostrophes with regex - python/pandas

I have a column with addresses, and sometimes it contains characters I want to remove: apostrophes ('), double quotes ("), and commas (,).
I would like to replace these characters with a space in one shot. I'm using pandas, and this is the code I have so far to replace one of them:
test['Address 1'].map(lambda x: x.replace(',', ''))
Is there a way to modify this code so I can replace all of these characters in one shot? Sorry for being a noob, but I would like to learn more about pandas and regex.
Your help will be appreciated!

You can use str.replace:
test['Address 1'] = test['Address 1'].str.replace(r"[\"',]", '', regex=True)
Sample:
import pandas as pd
test = pd.DataFrame({'Address 1': ["'aaa",'sa,ss"']})
print(test)
  Address 1
0      'aaa
1    sa,ss"
test['Address 1'] = test['Address 1'].str.replace(r"[\"',]", '', regex=True)
print(test)
  Address 1
0       aaa
1      sass
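Since the question actually asks for a space rather than outright removal, the same call with a space as the replacement string would do it (a minimal variant of the line above):
test['Address 1'] = test['Address 1'].str.replace(r"[\"',]", ' ', regex=True)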

Here's the pandas solution.
To apply it to an entire dataframe, use df.replace. Don't forget the \ character to escape the apostrophe inside the quoted pattern.
Example:
import pandas as pd
df = #some dataframe
df.replace('\'','', regex=True, inplace=True)
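As a runnable sketch of the whole-frame version (the sample frame below is made up for illustration and assumes every column holds strings):
import pandas as pd

df = pd.DataFrame({'Address 1': ["'aaa", 'sa,ss"'], 'Address 2': ["b'b", 'c,c']})
# one character class handles apostrophes, double quotes and commas in every column
df = df.replace(r"[\"',]", '', regex=True)
print(df)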

Related

Extract words after a symbol in python

I have the following data, where I would like to extract the words after source= in the values. Is there a way to create a general regex function that I can apply to other columns as well, to extract the words after the equals sign?
Data            Data2
source=book     social-media=facebook
source=book     social-media=instagram
source=journal  social-media=facebook
I'm using Python and I have tried the following:
df['Data'].astype(str).str.replace(r'[a-zA-Z]\=', '', regex=True)
but it didn't work.
You can try this:
df.replace(r'[a-zA-Z]+-?[a-zA-Z]+=', '', regex=True)
It gives you the following result:
      Data      Data2
0     book   facebook
1     book  instagram
2  journal   facebook
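As a self-contained sketch (the frame construction here is assumed from the sample data above):
import pandas as pd

df = pd.DataFrame({
    'Data': ['source=book', 'source=book', 'source=journal'],
    'Data2': ['social-media=facebook', 'social-media=instagram', 'social-media=facebook'],
})
# strip everything up to and including the equals sign in both columns
print(df.replace(r'[a-zA-Z]+-?[a-zA-Z]+=', '', regex=True))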
Regex is not required in this situation:
print(df['Data'].apply(lambda x : x.split('=')[-1]))
print(df['Data2'].apply(lambda x : x.split('=')[-1]))
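To run the no-regex version over several columns at once, one option (a sketch, assuming every targeted column is a string column):
print(df.apply(lambda col: col.str.split('=').str[-1]))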
You have to repeat the character class 1 or more times, and you don't have to escape the equals sign.
What you can do is make the match a bit broader, matching all characters except whitespace or an equals sign.
Then set the result to the new value.
import pandas as pd
data = [
    "source=book",
    "source=journal",
    "social-media=facebook",
    "social-media=instagram"
]
df = pd.DataFrame(data, columns=["Data"])
df['Data'] = df['Data'].astype(str).str.replace(r'[^\s=]+=', '', regex=True)
print(df)
Output:
        Data
0       book
1    journal
2   facebook
3  instagram
If there has to be a value after the equals sign, you can also use str.extract:
df['Data'] = df['Data'].astype(str).str.extract(r'[^\s=]+=([^\s=]+)')
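With the sample frame above this yields the same book/journal/facebook/instagram values; the practical difference is that a row with nothing after the equals sign comes back as NaN from str.extract, rather than as an empty string from str.replace.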

Remove unwanted str in Pandas dataframe

I am reading a CSV file using pandas read_csv, which contains data like:
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039;
Id;LibId;1;mod;modId;4;f9e9003e;
...
In the last column, I want to remove the Index, Step and data= parts and keep only the hex value.
I have created a list with the unwanted values and used a regex, but nothing seems to work:
to_remove = ['Index','Step','data=']
rex = '[' + re.escape(''.join(to_remove)) + ']'
output_csv['Column_name'].str.replace(rex, '', regex=True)
I suggest that you fix your code using:
to_remove = ['Index','Step','data=']
output_csv['Column_name'] = output_csv['Column_name'].str.replace('|'.join([re.escape(x) for x in to_remove]), '', regex=True)
The '|'.join([re.escape(x) for x in to_remove]) part creates a regex like Index|Step|data= and will match any of the to_remove substrings. Your original pattern put everything inside square brackets, which builds a character class that matches single characters, not the substrings themselves.
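A self-contained sketch of that fix (the one-row frame is made up for illustration):
import re
import pandas as pd

output_csv = pd.DataFrame({'Column_name': ['Index=10, Step=0, data=d720983f0000c0bf']})
to_remove = ['Index', 'Step', 'data=']
pattern = '|'.join(re.escape(x) for x in to_remove)  # Index|Step|data=
# removes only the listed substrings; the =10, =0 fragments survive
output_csv['Column_name'] = output_csv['Column_name'].str.replace(pattern, '', regex=True)
print(output_csv)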
Input (column names added for reference; they can be omitted):
col1;col2;col3;col4;col5;col6;col7
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d7203ad1de3039
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720e47bf0fe7c23ad1de3039
Code:
import pandas as pd

df = pd.read_csv(r"check.csv", sep=";")
df["col7"] = df["col7"].replace(regex=True, to_replace="(Index=)(.*)(data=)", value="")
This keeps only the hex value from the "data" part and removes everything else. Note the assignment back to the column: calling replace with inplace=True on df["col7"] operates on an intermediate object and may not modify the original frame in newer pandas.
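If only the hex payload matters, a hedged alternative is to extract it directly instead of deleting everything around it (this assumes the value always follows data= and is hexadecimal):
df["col7"] = df["col7"].str.extract(r'data=([0-9a-fA-F]+)', expand=False)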

How to strip/replace "domain\" from Pandas DataFrame Column?

I have a pandas DataFrame, read in from a CSV, that has computer hostnames including the domain they belong to, along with a bunch of other columns. I'm trying to strip out the domain information so that I'm left with ONLY the hostname.
DataFrame ex:
name
domain1\computername1
domain1\computername45
dmain3\servername1
dmain3\computername3
domain1\servername64
....
I've tried using both str.strip() and str.replace() with a regex as well as a string literal, but I can't seem to target the domain information correctly.
Examples of what I've tried thus far:
df['name'].str.strip('.*\\')
df['name'].str.replace('.*\\', '', regex = True)
df['name'].str.replace(r'[.*\\]', '', regex = True)
df['name'].str.replace('domain1\\\\', '', regex = False)
df['name'].str.replace('dmain3\\\\', '', regex = False)
None of these seem to make any changes when I spit the DataFrame out using logging.debug(df)
You are already close to the answer, just use:
df['name'] = df['name'].str.replace(r'.*\\', '', regex = True)
which just adds the r-string prefix to one of the versions you already tried.
Without the r-string, the literal '.*\\' contains only a single backslash, so the regex engine receives a pattern ending in a bare backslash, which is invalid. With the r-string, r'.*\\' keeps both backslashes, and the regex engine interprets the pair \\ as one literal backslash, which is exactly what you need to match the separator between the domain and the hostname.
Output:
0     computername1
1    computername45
2       servername1
3     computername3
4      servername64
Name: name, dtype: object
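A minimal runnable version, with the sample frame assumed from the question:
import pandas as pd

df = pd.DataFrame({'name': [r'domain1\computername1', r'dmain3\servername1']})
# the greedy .* consumes everything up to the last backslash, leaving just the hostname
df['name'] = df['name'].str.replace(r'.*\\', '', regex=True)
print(df)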
You can use .str.split:
df["name"] = df["name"].str.split("\\", n=1).str[-1]
print(df)
Prints:
             name
0   computername1
1  computername45
2     servername1
3   computername3
4    servername64
A no-regex approach with ntpath.basename:
import pandas as pd
import ntpath

df = pd.DataFrame({'name': [r'domain1\computername1']})
df["name"] = df["name"].apply(ntpath.basename)
Result: computername1.
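ntpath implements Windows path semantics regardless of the OS you run on, which is why it splits on backslashes here where os.path.basename would only do so on Windows.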
With rsplit:
df["name"] = df["name"].str.rsplit('\\').str[-1]

How to check the pattern of a column in a dataframe

I have a dataframe which has some id's. I want to check the pattern of those column values.
Here is how the column looks like-
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that can distinguish characters and numerics in the pattern above and display an output like 'SSSS99SS', where 'S' represents a character and '9' represents a numeric. This dataset is large, so I can't predefine the positions the characters and numerics will be in; I want the code to work out those positions itself. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"

def decode_pattern(my_string):
    # first map every digit to '9', then every letter to 'S'
    my_string = ''.join('9' if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string

decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well, as shown below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
               id         pattern
0        ASDH12HK        SSSS99SS
1        GHST67KH        SSSS99SS
2        AGSH90IL        SSSS99SS
3        THKI86LK        SSSS99SS
4  SOMEPATTERN123  SSSSSSSSSSS999
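A vectorized sketch of the same idea, using two chained regex substitutions instead of a Python-level map (same output as above):
df['pattern'] = (df['id']
                 .str.replace(r'\d', '9', regex=True)
                 .str.replace(r'[A-Za-z]', 'S', regex=True))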
You can use a regular expression:
import re

st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 characters, followed by 2 digits, followed by 4 more characters.
So you can use this to filter your df.
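For example, a small sketch of that filter (the frame and the non-matching row are made up; str.fullmatch avoids partial hits, and the trailing {2} matches the question's 8-character ids):
import pandas as pd

df = pd.DataFrame({'id': ['ASDH12HK', 'GHST67KH', 'bad_id_1']})
mask = df['id'].str.fullmatch(r'[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}')
print(df[mask])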
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)
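Output:
['ASDH12HK', 'GHST67KH', 'AGSH90IL', 'THKI86LK']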

removing newlines from messy strings in pandas dataframe cells?

I've used multiple ways of splitting and stripping the strings in my pandas dataframe to remove all the '\n' characters, but for some reason it simply doesn't want to delete the characters that are attached to other words, even though I split them. I have a pandas dataframe with a column that captures text from web pages using BeautifulSoup. The text has already been cleaned a bit by BeautifulSoup, but it failed to remove the newlines attached to other characters. My strings look a bit like this:
"hands-on\ndevelopment of games. We will study a variety of software technologies\nrelevant to games including programming languages, scripting\nlanguages, operating systems, file systems, networks, simulation\nengines, and multi-media design systems. We will also study some of\nthe underlying scientific concepts from computer science and related\nfields including"
Is there an easy python way to remove these "\n" characters?
EDIT: the correct answer to this is:
df = df.replace(r'\n',' ', regex=True)
I think you need replace:
df = df.replace('\n','', regex=True)
Or:
df = df.replace('\n',' ', regex=True)
Or:
df = df.replace(r'\\n',' ', regex=True)
Sample:
text = '''hands-on\ndev nologies\nrelevant scripting\nlang
'''
df = pd.DataFrame({'A':[text]})
print(df)
                                                   A
0  hands-on\ndev nologies\nrelevant scripting\nla...
df = df.replace('\n',' ', regex=True)
print(df)
                                               A
0  hands-on dev nologies relevant scripting lang
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
worked for me.
Source:
https://gist.github.com/smram/d6ded3c9028272360eb65bcab564a18a
To remove carriage returns (\r), newlines (\n) and tabs (\t):
df = df.replace(r'\r+|\n+|\t+','', regex=True)
In messy data it might be a good idea to remove all whitespace: df.replace(r'\s', '', regex=True, inplace=True).
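If the strings contain spaces you want to keep, collapsing runs of whitespace to a single space is a gentler variant:
df = df.replace(r'\s+', ' ', regex=True)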
df = pd.DataFrame({'text': ['Sarah Marie Wimberly So so beautiful!!!\nAbram Staten You guys look good man.\nTJ Sloan I miss you guys\n']})
df = df.replace(r'\n', ' ', regex=True)
This worked for the messy data I had. (Note: as written, the sample contains real newline characters, so the pattern is r'\n'; use r'\\n' only if your data contains a literal backslash followed by an n.)
