How to check the pattern of a column in a dataframe - python

I have a dataframe with a column of ids, and I want to check the pattern of the values in that column.
Here is what the column looks like:
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that distinguishes letters from digits in the values above and displays an output like 'SSSS99SS' as the pattern of the column, where 'S' represents a letter and '9' represents a digit. The dataset is large, so I can't predefine the positions of the letters and digits; I want the code to work out those positions. I am new to Python, so any leads will be helpful!

You can try something like:
my_string = "ASDH12HK"
def decode_pattern(my_string):
    my_string = ''.join(str(9) if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string
decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well, as shown below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
               id         pattern
0        ASDH12HK        SSSS99SS
1        GHST67KH        SSSS99SS
2        AGSH90IL        SSSS99SS
3        THKI86LK        SSSS99SS
4  SOMEPATTERN123  SSSSSSSSSSS999
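
If the column is large, the same idea can also be expressed with pandas' vectorized string methods instead of a Python function. This is only a minimal sketch, assuming the ids contain nothing but ASCII letters and digits (unlike isalpha()/isdigit(), the character classes used here are not Unicode-aware). The names pattern and pattern_counts are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    ['ASDH12HK', 'GHST67KH', 'AGSH90IL', 'THKI86LK', 'SOMEPATTERN123'],
    columns=['id'],
)

# Replace every ASCII letter with 'S', then every digit with '9'.
pattern = (
    df['id']
    .str.replace(r'[A-Za-z]', 'S', regex=True)
    .str.replace(r'[0-9]', '9', regex=True)
)

# Count how many ids share each pattern (illustrative summary step).
pattern_counts = pattern.value_counts()
print(pattern_counts)
```

The value_counts() summary is one way to see which patterns occur in the whole column, and how often, without inspecting each row.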

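You can use a regular expression:
import re
st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It returns a match if the string starts with 4 letters followed by 2 digits and then 4 more letters; otherwise it returns None. Note that re.match only anchors at the start of the string, so use re.fullmatch if nothing may follow the pattern.
You can use this on your df to filter the rows.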
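
One possible way to do that filtering is Series.str.fullmatch, which tests each value against the whole pattern. The sketch below assumes the 4-letter, 2-digit, 2-letter layout of the ids in the question (not the 4-2-4 layout of the example string above); the names mask, matching and non_matching are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'id': ['ASDH12HK', 'GHST67KH', 'AGSH90IL', 'THKI86LK', 'SOMEPATTERN123']}
)

# True where the entire id is 4 letters, 2 digits, 2 letters.
mask = df['id'].str.fullmatch(r'[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}')

matching = df[mask]        # rows that follow the expected pattern
non_matching = df[~mask]   # rows that deviate from it
print(non_matching)
```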

You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

Related

Find if the string (sentence) in a list of string in other columns in Python

I want to check whether the Sentence column contains any keyword from the Keyword column (case-insensitively).
I also ran into a problem when importing the file from CSV: the keyword list is read as a string with quotes around it, so when I tried str.join('|') it added | between every character.
Sentence = ["Clear is very good","Fill- low light, compact","stripping topsoil"]
Keyword =[['Clearing', 'grubbing','clear','grub'],['Borrow,', 'Fill', 'and', 'Compaction'],['Fall']]
df = pd.DataFrame({'Sentence': Sentence, 'Keyword': Keyword})
My expected output would be
df['Match'] = [True,True,False]
You can try DataFrame.apply on rows
import re
df['Match'] = df.apply(lambda row: bool(re.search('|'.join(row['Keyword']), row['Sentence'], re.IGNORECASE)), axis=1)
print(df)
Sentence Keyword Match
0 Clear is very good [Clearing, grubbing, clear, grub] True
1 Fill- low light, compact [Borrow,, Fill, and, Compaction] True
2 stripping topsoil [Fall] False
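
A slightly more defensive variant is sketched below. It assumes, as the question describes for the CSV case, that each keyword list arrives as a string, and converts it back with ast.literal_eval. It also escapes each keyword with re.escape so that characters with a special meaning in a regex are matched literally. The helper name has_keyword is hypothetical.

```python
import ast
import re

import pandas as pd

sentences = ["Clear is very good", "Fill- low light, compact", "stripping topsoil"]
# Assumption: after reading from CSV, each keyword list is a string.
keywords = [
    "['Clearing', 'grubbing', 'clear', 'grub']",
    "['Borrow,', 'Fill', 'and', 'Compaction']",
    "['Fall']",
]
df = pd.DataFrame({'Sentence': sentences, 'Keyword': keywords})

# Turn the string representation back into a real Python list.
df['Keyword'] = df['Keyword'].map(ast.literal_eval)

def has_keyword(row):
    """Return True if any keyword occurs in the sentence, ignoring case."""
    words = row['Keyword']
    if not words:
        # An empty pattern would match everything, so handle it explicitly.
        return False
    pattern = '|'.join(re.escape(word) for word in words)
    return bool(re.search(pattern, row['Sentence'], re.IGNORECASE))

df['Match'] = df.apply(has_keyword, axis=1)
print(df)
```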

Extract words after a symbol in python

I have the following data, where I would like to strip the source= prefix from the values. Is there a way to write a general regex function that I can apply to other columns as well, to extract the words after the equals sign?
Data Data2
source=book social-media=facebook
source=book social-media=instagram
source=journal social-media=facebook
I'm using Python and I have tried the following:
df['Data'].astype(str).str.replace(r'[a-zA-Z]\=', '', regex=True)
but it didn't work.
You can try this:
df.replace(r'[a-zA-Z]+-?[a-zA-Z]+=', '', regex=True)
It gives you the following result:
Data Data2
0 book facebook
1 book instagram
2 journal facebook
Regex is not required in this situation:
print(df['Data'].apply(lambda x : x.split('=')[-1]))
print(df['Data2'].apply(lambda x : x.split('=')[-1]))
You have to repeat the character class 1 or more times and you don't have to escape the equals sign.
What you can do is make the match a bit broader matching all characters except a whitespace char or an equals sign.
Then set the result to the new value.
import pandas as pd
data = [
"source=book",
"source=journal",
"social-media=facebook",
"social-media=instagram"
]
df = pd.DataFrame(data, columns=["Data"])
df['Data'] = df['Data'].astype(str).str.replace(r'[^\s=]+=', '', regex=True)
print(df)
Output
Data
0 book
1 journal
2 facebook
3 instagram
If there has to be a value after the equals sign, you can also use str.extract
df['Data'] = df['Data'].astype(str).str.extract(r'[^\s=]+=([^\s=]+)')
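
To apply the same extraction to several columns at once, one option is to wrap it in a small function and pass that to DataFrame.apply. This is a sketch using the sample data from the question; the function name value_after_equals is hypothetical, and expand=False is used so that each column comes back as a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'Data': ['source=book', 'source=book', 'source=journal'],
    'Data2': ['social-media=facebook', 'social-media=instagram',
              'social-media=facebook'],
})

def value_after_equals(column):
    """Return the text that follows the equals sign in each value."""
    return column.astype(str).str.extract(r'=([^\s=]+)', expand=False)

# DataFrame.apply calls the function once per column.
cleaned = df.apply(value_after_equals)
print(cleaned)
```

Values without an equals sign come back as NaN with this approach, which makes missing or malformed entries easy to spot.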

How to strip/replace "domain\" from Pandas DataFrame Column?

I have a pandas DataFrame that's being read in from a CSV that has hostnames of computers including the domain they belong to along with a bunch of other columns. I'm trying to strip out the Domain information such that I'm left with ONLY the Hostname.
DataFrame ex:
name
domain1\computername1
domain1\computername45
dmain3\servername1
dmain3\computername3
domain1\servername64
....
I've tried both str.strip() and str.replace(), with a regex as well as a string literal, but I can't seem to target the domain information correctly.
Examples of what I've tried thus far:
df['name'].str.strip('.*\\')
df['name'].str.replace('.*\\', '', regex = True)
df['name'].str.replace(r'[.*\\]', '', regex = True)
df['name'].str.replace('domain1\\\\', '', regex = False)
df['name'].str.replace('dmain3\\\\', '', regex = False)
None of these seem to make any changes when I spit the DataFrame out using logging.debug(df)
You are already close to the answer; just use:
df['name'] = df['name'].str.replace(r'.*\\', '', regex = True)
This takes one of your attempts, adds the r prefix, and assigns the result back to the column.
Without the r prefix, the Python literal '.*\\' is the three-character string .*\ because \\ is the escape sequence for a single backslash. The regex engine then sees a pattern that ends in a lone backslash, which is not a valid pattern. With the r prefix, r'.*\\' keeps both backslashes, and the regex engine reads that pair as one literal backslash, so the pattern matches everything up to and including the last backslash.
Output:
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
Name: name, dtype: object
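
The difference between the plain and the raw string literal can be checked directly with the re module, outside pandas. The snippet below is only an illustration; the variable names are made up for the example.

```python
import re

plain = '.*\\'    # three characters: dot, star, one backslash
raw = r'.*\\'     # four characters: dot, star, two backslashes

print(len(plain), len(raw))

# The plain version ends in a lone backslash, which is not a valid regex.
try:
    re.compile(plain)
    plain_is_valid = True
except re.error:
    plain_is_valid = False

# The raw version matches everything up to and including the last backslash.
hostname = re.sub(raw, '', r'domain1\computername1')
print(plain_is_valid, hostname)
```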
You can use .str.split:
df["name"] = df["name"].str.split("\\", n=1).str[-1]
print(df)
Prints:
name
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
No regex approach with ntpath.basename:
import pandas as pd
import ntpath
df = pd.DataFrame({'name':[r'domain1\computername1']})
df["name"] = df["name"].apply(lambda x: ntpath.basename(x))
Results: computername1.
With rsplit:
df["name"] = df["name"].str.rsplit('\\').str[-1]

Pandas: Clean up String column containing Single Quotes and Brackets using Regex?

I want to clean the following Pandas dataframe column, but in a single, more efficient statement than what I am doing in the code below.
Input:
string
0 ['string', '#string']
1 ['#string']
2 []
Output:
string
0 string, #string
1 #string
2 NaN
Code:
import pandas as pd
import numpy as np
d = {"string": ["['string', '#string']", "['#string']", "[]"]}
df = pd.DataFrame(d)
df['string'] = df['string'].astype(str).str.strip('[]')
df['string'] = df['string'].replace("\'", "", regex=True)
df['string'] = df['string'].replace(r'^\s*$', np.nan, regex=True)
print(df)
You can use (regex=True is passed explicitly because a callable replacement requires it):
df['string'] = df['string'].astype(str).str.replace(r"^[][\s]*$|(^\[+|\]+$|')", lambda m: '' if m.group(1) else np.nan, regex=True)
Details:
^[][\s]*$ - matches a string that only consists of zero or more [, ] or whitespace chars
| - or
(^\[+|\]+$|') - captures into Group 1 one or more [ chars at the start of string, or one or more ] chars at the end of string or any ' char.
If Group 1 matches, the replacement is an empty string (the match is removed), else, the replacement is np.nan.
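
If every cell is a valid Python list literal, as in the sample data, a different technique from the regex above is to parse the cell and join its items. This is a sketch under that assumption only; ast.literal_eval raises an error on cells that are not valid literals, and the helper name join_items is hypothetical.

```python
import ast

import numpy as np
import pandas as pd

df = pd.DataFrame({"string": ["['string', '#string']", "['#string']", "[]"]})

def join_items(text):
    """Parse a list literal and join its items; empty lists become NaN."""
    items = ast.literal_eval(text)
    return ", ".join(items) if items else np.nan

df['string'] = df['string'].map(join_items)
print(df)
```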

Is there a way to use str.count() function with a LIST of values instead of a single string?

I am trying to count the number of times that any string from a list_of_strings appears in a csv file cell.
For example, the following would work fine.
import pandas as pd
data_path = "SurveryResponses.csv"
df = pd.read_csv(data_path)
totalCount = 0
for row in df['rowName']:
    if type(row) == str:
        print(row.count('word_of_interest'))
However, I would like to be able to enter a list of strings (['str1', 'str2', 'str3']) rather than just one 'word_of_interest', such that if any of those strings appears, the count increases by one.
Is there a way to do this?
Perhaps something along the lines of
totalCount = 0
words_of_interst = ['cat','dog','foo','bar']
for row in df['rowName']:
    if type(row) == str:
        if sum([word in row for word in words_of_interst]) > 0:
            totalCount += 1
Use the str accessor:
df['rowName'].str.count('word_of_interest')
If you need to convert the column to string first, use astype:
df['rowName'].astype(str).str.count('word_of_interest')
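
To extend this to a list of strings, one option is to join the strings into a single regex alternation, since str.count treats its argument as a regular expression. The sketch below uses made-up sample data in place of the CSV file and escapes each word with re.escape; the names per_cell and cells_with_match are illustrative. Note that this matches substrings, so 'cat' would also be counted inside 'category'.

```python
import re

import pandas as pd

# Made-up sample data standing in for the CSV column.
df = pd.DataFrame(
    {'rowName': ['cat and dog', 'foo', None, 'nothing here', 'cat cat']}
)
words_of_interest = ['cat', 'dog', 'foo', 'bar']

# Build one pattern that matches any of the words literally.
pattern = '|'.join(re.escape(word) for word in words_of_interest)

# Number of occurrences of any word in each cell (NaN for missing cells).
per_cell = df['rowName'].str.count(pattern)

# Number of cells that contain at least one of the words.
cells_with_match = int(df['rowName'].str.contains(pattern, na=False).sum())
print(per_cell, cells_with_match)
```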
Assuming list_of_strings = ['str1', 'str2', 'str3'], you can try the following:
if any(map(lambda x: x in row, list_of_strings)):
    totalCount += 1
You can use this method to count the cells whose whole value appears in an external list:
strings = ['string1','string2','string3']
sum([1 if sr in strings else 0 for sr in df.rowName])
Here is an example:
import io
filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""
df = pd.read_csv(io.StringIO(filedata))
Returns this dataframe:
animal amount
0 ['cat','dog'] 2
1 ['cat','horse'] 2
Search for word cat (looping through all columns as series):
search = "cat"
# sum the True values in each series, then sum those per-column totals
# sum([2,0]) in this case
sum([sum(df[cols].astype(str).str.contains(search)) for cols in df.columns])
Returns 2
