Pandas: Clean up String column containing Single Quotes and Brackets using Regex? - python

I want to clean the following Pandas dataframe column, but with a single, more efficient statement than the approach I am using in the code below.
Input:
string
0 ['string', '#string']
1 ['#string']
2 []
Output:
string
0 string, #string
1 #string
2 NaN
Code:
import pandas as pd
import numpy as np
d = {"string": ["['string', '#string']", "['#string']", "[]"]}
df = pd.DataFrame(d)
df['string'] = df['string'].astype(str).str.strip('[]')
df['string'] = df['string'].replace("\'", "", regex=True)
df['string'] = df['string'].replace(r'^\s*$', np.nan, regex=True)
print(df)

You can use
df['string'] = df['string'].astype(str).str.replace(r"^[][\s]*$|(^\[+|\]+$|')", lambda m: '' if m.group(1) else np.nan, regex=True)
Details:
^[][\s]*$ - matches a string that consists only of zero or more [, ] or whitespace chars
| - or
(^\[+|\]+$|') - captures into Group 1 one or more [ chars at the start of the string, one or more ] chars at the end of the string, or any ' char.
If Group 1 matched, the replacement is an empty string (the match is removed); otherwise, the replacement is np.nan.
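An alternative that avoids regex altogether (a sketch, assuming every cell really holds a valid Python list literal, as in the sample data) is to parse the strings with ast.literal_eval and join the items, turning empty lists into NaN:
import ast
import numpy as np
import pandas as pd

d = {"string": ["['string', '#string']", "['#string']", "[]"]}
df = pd.DataFrame(d)
# Parse each cell into a real list, then join its items; empty lists become NaN
parsed = df['string'].map(ast.literal_eval)
df['string'] = parsed.map(lambda lst: ', '.join(lst) if lst else np.nan)
print(df)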

Related

How to replace last three characters of a string in a column if it starts with character

I have a pandas dataframe of postcodes which have been concatenated with the two-letter country code. Some of these are Brazilian postcodes and I want to replace the last three characters of any postcode which starts with 'BR' with '000'.
import pandas as pd
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
I have tried the below, but it is not changing any of the postcodes:
if df['postcode'].str.startswith('BR').all():
    df["postcode"] = df["postcode"].str.replace(r'.{3}$', '000')
Use str.replace with a capturing group:
df['postcode'] = df['postcode'].str.replace(r'(BR.*)...', r'\g<1>000', regex=True)
# or, more generic
df['postcode'] = df['postcode'].str.replace(r'(BR.*).{3}', r'\g<1>'+'0'*3, regex=True)
Output:
postcode
0 BR86037-000
1 GBBB7
2 BR86071-000
3 BR86200-000
4 BR86026-000
5 BR86082-000
6 GBCW9
7 NO3140
The code is not working because df['postcode'].str.startswith('BR').all() returns a single boolean indicating whether all postcodes in the column start with 'BR', so the replacement either runs for every row or (as here) for none.
Try this instead:
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
mask = df['postcode'].str.startswith('BR')
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str.replace(r'.{3}$', '000', regex=True)
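Since the replacement always overwrites exactly the last three characters, a regex-free variant of the same masked approach (just a sketch) is plain string slicing:
mask = df['postcode'].str.startswith('BR')
# drop the last three characters and append '000'
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str[:-3] + '000'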

Find if the string (sentence) contains any string from a list in another column in Python

I want to check whether the Sentence column contains any keyword from the Keyword column (case-insensitively).
I also had a problem when importing the file from CSV: the keyword list is read as a single string (with quotes in it), so when I tried to use str.join('|') it added | between every character.
import pandas as pd

Sentence = ["Clear is very good", "Fill- low light, compact", "stripping topsoil"]
Keyword = [['Clearing', 'grubbing', 'clear', 'grub'], ['Borrow,', 'Fill', 'and', 'Compaction'], ['Fall']]
df = pd.DataFrame({'Sentence': Sentence, 'Keyword': Keyword})
My expected output would be:
df['Match'] = [True,True,False]
You can try DataFrame.apply on the rows:
import re
df['Match'] = df.apply(lambda row: bool(re.search('|'.join(row['Keyword']), row['Sentence'], re.IGNORECASE)), axis=1)
print(df)
Sentence Keyword Match
0 Clear is very good [Clearing, grubbing, clear, grub] True
1 Fill- low light, compact [Borrow,, Fill, and, Compaction] True
2 stripping topsoil [Fall] False
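If the keywords can contain regex metacharacters (e.g. +, ? or parentheses), escaping them before joining is a safer variant of the same idea (a sketch):
import re

df['Match'] = df.apply(
    lambda row: bool(re.search('|'.join(map(re.escape, row['Keyword'])), row['Sentence'], re.IGNORECASE)),
    axis=1,
)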

How to delete some value in string columns python

How can I delete these leading numbers in my "Kecamatan" column?
You can replace the leading digits and dot using a regular expression, as follows:
import pandas as pd
df = pd.DataFrame({
'Kecamatan': ['010. Donomulyo', '020. Kalipare', '030. Pagak', '040. Bankur']
})
df['Kecamatan'] = df['Kecamatan'].str.replace(r'\A[0-9. ]+', '', regex=True)
print(df)
Output:
Kecamatan
0 Donomulyo
1 Kalipare
2 Pagak
3 Bankur
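If you prefer to avoid regex, Series.str.lstrip with a character set does the same job here (a sketch, assuming the names themselves never start with digits, dots, or spaces):
# strip any leading digits, dots, and spaces
df['Kecamatan'] = df['Kecamatan'].str.lstrip('0123456789. ')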

How to check the pattern of a column in a dataframe

I have a dataframe which has some ids. I want to check the pattern of the values in that column.
Here is what the column looks like:
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that can distinguish characters and numerics in the pattern above and display an output like 'SSSS99SS' as the pattern of the column, where 'S' represents a character and '9' represents a numeric. This is a large dataset, so I can't predefine the positions the characters and numerics will be in; I want the code to work out those positions. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"
def decode_pattern(my_string):
    my_string = ''.join(str(9) if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string
decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well, as shown below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
id pattern
0 ASDH12HK SSSS99SS
1 GHST67KH SSSS99SS
2 AGSH90IL SSSS99SS
3 THKI86LK SSSS99SS
4 SOMEPATTERN123 SSSSSSSSSSS999
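The same mapping can also be expressed directly with chained str.replace calls (a sketch of an equivalent alternative):
df['pattern'] = (
    df['id'].str.replace(r'[A-Za-z]', 'S', regex=True)   # letters -> S
            .str.replace(r'[0-9]', '9', regex=True)      # digits  -> 9
)
print(df)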
You can use a regular expression:
import re

st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 letters followed by 2 digits and then 4 more letters.
You can then use this on your df to filter the rows, as sketched below.
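For example (a sketch using Series.str.match, with the pattern adjusted to the 4-letters/2-digits/2-letters shape of the sample ids):
# keep only ids shaped like 4 letters + 2 digits + 2 letters
mask = df['id'].str.match(r'^[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}$')
print(df[mask])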
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

Replace string in pandas dataframe if it contains specific substring

I have a dataframe generated from a .csv file (I use Python 3.5). df['category'] contains only strings. I want to check this column and, if a string contains a specific substring (I'm not really interested in where it occurs in the string, as long as it exists), replace the whole value. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records, though they contain some of the substrings I defined, do not get replaced. Is it a regex-related problem?
Thank you in advance.
Create a dictionary of lists of substrings keyed by the replacement strings, loop over it, and join each list's values with | for a regex OR, so you can check the column with contains and set the matched rows with loc:
df = pd.DataFrame({'category': ['sss mdma df', 'milit ss aa', 'aa ss']})
a = ['mdma', 'xanax', 'kamagra']
b = ['weapon', 'milit', 'gun']
g = 'Drugs'
z = 'Weapons'
c = 'Flowers'
d = {g: a, z: b}
df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k
print(df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
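As to why the original apply never behaves as expected: a chain like ('mdma' or 'xanax' or 'kamagra') is evaluated by Python before the in test and collapses to its first truthy operand, so only that one substring is ever checked. A quick illustration:
x = 'some xanax listing'
print('mdma' or 'xanax' or 'kamagra')                      # mdma - the or-chain collapses to the first value
print(('mdma' or 'xanax' or 'kamagra') in x)               # False - only 'mdma' is tested against x
print(any(s in x for s in ('mdma', 'xanax', 'kamagra')))   # True - test each substring explicitly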
