I'm wondering if someone in the community could help with the following:
I aim to regex-replace substrings in a pandas DataFrame, based on a dictionary I pass as an argument. However, the key:value replacement should only take place if the dict key is found as a standalone substring (not as part of a word). By standalone substring I mean one that starts after a whitespace.
e.g.:
mapping = {
"sweatshirt":"sweat_shirt",
"sweat shirt":"sweat_shirt",
"shirt":"shirts"
}
df = pd.DataFrame([
["men sweatshirt"],
["men sweat shirt"],
["yellow shirt"]
])
df = df.replace(mapping, regex=True)
expected result:
substring "shirt" within sweatshirt should NOT be replaced with "shirts" as value is part of another string not a standalone value(\b)
NOTE:
the dictionary I pass is rather long, so ideally there is a way to pass the standalone requirement (\b) as part of the dict I pass to df.replace(dict, regex=True)
Thanks upfront
You can use
df[0].str.replace(fr"\b(?:{'|'.join(mapping)})\b", lambda x: mapping[x.group()], regex=True)
The regex will look like \b(?:sweatshirt|shirt)\b; it will match sweatshirt or shirt as whole words. Each match object is passed to the lambda and the corresponding value is fetched with mapping[x.group()]. Note that regex=True is required here because the replacement is a callable.
Multiword Search Term Update
Since you may have multiword terms to search for among the mapping keys, you should make sure the longest search terms come first in the alternation group. That is, \b(?:abc def|abc)\b and not \b(?:abc|abc def)\b.
import pandas as pd
mapping = {
"sweat shirt": "sweat_shirt",
"shirt": "shirts"
}
df = pd.DataFrame([
["men sweatshirt"],
["men sweat shirt"]
])
rx = fr"\b(?:{'|'.join(sorted(mapping, key=len, reverse=True))})\b"
df[0].str.replace(rx, lambda x: mapping[x.group()], regex=True)
Output:
0 men sweatshirt
1 men sweat_shirt
Name: 0, dtype: object
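As for the NOTE in the question: you can also bake the \b requirement into the dict keys themselves and keep using df.replace. A sketch (the patterns are applied one after another, so longer keys must again come first; if keys could contain regex metacharacters, wrap them with re.escape):
import pandas as pd

mapping = {
    "sweatshirt": "sweat_shirt",
    "sweat shirt": "sweat_shirt",
    "shirt": "shirts"
}
df = pd.DataFrame([
    ["men sweatshirt"],
    ["men sweat shirt"],
    ["yellow shirt"]
])
# Wrap each key in \b...\b, longest keys first
mapping_b = {fr"\b{k}\b": v
             for k, v in sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True)}
df = df.replace(mapping_b, regex=True)
# Expected:
# 0  men sweat_shirt
# 1  men sweat_shirt
# 2    yellow shirts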
Include the white-space in your pattern! :)
mapping = {
" sweatshirt":" sweat_shirt",
" shirt":" shirts"
}
import pandas as pd
df = pd.DataFrame([
["men sweatshirt"]
])
df = df.replace(mapping, regex=True)
Note that this relies on a leading space, so it will not match a term at the very start of a string.
Try this code, which prepends a space to every value so each term gets a leading whitespace, applies the mapping, and then strips the padding off:
mapping = {
" sweatshirt":" sweat_shirt",
" shirt":" shirts"
}
import pandas as pd
df = pd.DataFrame({'ID': ["men sweatshirt", "black shirt"]})
df = df.apply(lambda x: ' ' + x, axis=1).replace(mapping, regex=True).ID.str.strip()
print(df)
How can I apply a merge function, or any other method, to column A?
For example, in layman's terms, I want to convert the string "(A|B|C,D)|(A,B|C|D)|(B|C|D)" into
"(D A|D B|D C)|(A B|A C|A D)|(B|C|D)"
The group (B|C|D) stays the same since it has no comma value to merge. Basically, I want to merge the comma-separated values into each of the other values in their group.
I have the below data frame.
import pandas as pd
data = {'A': [ '(A|B|C,D)|(A,B|C|D)|(B|C|D)'],
'B(Expected)': [ '(D A|D B|D C)|(A B|A C|A D)|(B|C|D)']
}
df = pd.DataFrame(data)
print (df)
My expected result is mentioned in column B(Expected)
Below are the methods I tried:
(1)
df['B(Expected)'] = df['A'].apply(lambda x: x.replace("|", " ").replace(",", "|") if "|" in x and "," in x else x)
(2)
# Split the string by the pipe character
df['string'] = df['string'].str.split('|')
df['string'] = df['string'].apply(lambda x: '|'.join([' '.join(i.split(' ')) for i in x]))
You can use a regex to extract the values in parentheses, then a custom function with itertools.product to reorganize the values:
from itertools import product
def split(s):
    # for e.g. "A|B|C,D": split on "," -> ["A|B|C", "D"], split each part on "|",
    # then take the cartesian product and join the combinations back with "|"
    return '|'.join([' '.join(x) for x in product(*[x.split('|') for x in s.split(',')])])
df['B'] = df['A'].str.replace(r'([^()]+)', lambda m: split(m.group()), regex=True)
print(df)
Note that this requires non-nested parentheses.
Output:
A B
0 (A|B|C,D)|(A,B|C|D)|(B|C|D) (A D|B D|C D)|(A B|A C|A D)|(B|C|D)
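In isolation, the helper expands a single parenthesized group like this (a quick check on the first group from the question):
print(split('A|B|C,D'))
# A D|B D|C D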
I'm trying to do a Twitter sentiment analysis comparing Johnny Depp and Amber Heard. I've extracted the data for 2021, and the pandas DataFrames for both individuals are stored in the df_dict dictionary described below. The error I am receiving is TypeError: unhashable type: 'Series'.
As far as I've learnt, this error happens when you use something unhashable where a hashable value (like a dict key) is expected. I first tested it with a single key but got the same error. I've hit a roadblock and don't know how to solve this issue.
This is my preprocess method:
import re

def preprocess(df_dict, remove_rows, keep_rows):
    for key, df in df_dict.items():
        print(key)
        initial_count = len(df_dict[key])
        df_dict[key] = (
            df
            # Make everything lower case
            .assign(Text=lambda x: x['Text'].str.lower())
            # Keep the rows that mention the name
            .query(f'Text.str.contains("{keep_rows[key]}")')
            # Remove the rows that mention the other three people
            .query(f'~Text.str.contains("{remove_rows[key]}")')
            # Remove all the URLs
            .assign(Text=lambda x: x['Text'].apply(lambda s: re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', s)))
        )
        final_count = len(df_dict[key])
        print("%d tweets kept out of %d" % (final_count, initial_count))
    return df_dict
This is the code I'm using to call the preprocess method:
df_dict = {
'johnny depp': johnny_data,
"amber heard": amber_data
}
remove_rows = {
'johnny depp': 'amber|heard|camila|vasquez|shannon|curry',
"amber heard": 'johnny|depp|camila|vasquez|shannon|curry'
}
keep_rows = {
'johnny depp': 'johnny|depp',
"amber heard": 'amber|heard'
}
df_test_data = preprocess(df_dict, remove_rows, keep_rows)
I hope I've explained my issue clearly; since this is my first post here, I also hope I've followed all the usual posting protocols.
Since DataFrame.query is really for simple logical operations, you cannot access Series methods of columns inside it. As a workaround, assign boolean flag columns first and then query against them. Consider also Series.str.replace for the regex clean-up:
df_dict[key] = (
    df
    # Make everything lower case and flag the rows to keep/drop
    .assign(
        Text=lambda x: x['Text'].str.lower(),
        keep_flag=lambda x: x['Text'].str.contains(keep_rows[key]),
        drop_flag=lambda x: x['Text'].str.contains(remove_rows[key])
    )
    # Keep the rows that mention the name
    .query("keep_flag == True")
    # Remove the rows that mention the other three people
    .query("drop_flag == False")
    # Remove all the URLs
    .assign(
        Text=lambda x: x['Text'].str.replace(
            r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*',
            '',
            regex=True)
    )
    # Drop the helper flag columns
    .drop(["keep_flag", "drop_flag"], axis="columns")
)
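As a minimal end-to-end check, with the fixed pipeline above substituted into preprocess (the tweets here are made up for illustration):
import pandas as pd

johnny_data = pd.DataFrame({'Text': [
    "Johnny Depp was great http://example.com",  # kept, URL stripped
    "Amber Heard and Johnny Depp in court",      # dropped: mentions the other party
    "totally unrelated tweet"                    # dropped: no mention of the name
]})
df_dict = {'johnny depp': johnny_data}
out = preprocess(df_dict,
                 remove_rows={'johnny depp': 'amber|heard'},
                 keep_rows={'johnny depp': 'johnny|depp'})
print(out['johnny depp'])  # one row kept: "johnny depp was great "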
I have the following data where I would like to strip source= from the values. Is there a way to create a general regex function that I can apply to other columns as well, to extract the words after the equals sign?
Data Data2
source=book social-media=facebook
source=book social-media=instagram
source=journal social-media=facebook
I'm using Python and I have tried the following:
df['Data'].astype(str).str.replace(r'[a-zA-Z]\=', '', regex=True)
but it didn't work.
You can try this:
df.replace(r'[a-zA-Z]+-?[a-zA-Z]+=', '', regex=True)
It gives you the following result:
Data Data2
0 book facebook
1 book instagram
2 journal facebook
Regex is not required in this situation:
print(df['Data'].apply(lambda x: x.split('=')[-1]))
print(df['Data2'].apply(lambda x: x.split('=')[-1]))
Note that split('=')[-1] leaves a value unchanged when it contains no equals sign at all.
You have to repeat the character class 1 or more times, and you don't have to escape the equals sign.
What you can do is make the match a bit broader, matching all characters except a whitespace char or an equals sign.
Then set the result to the new value.
import pandas as pd
data = [
"source=book",
"source=journal",
"social-media=facebook",
"social-media=instagram"
]
df = pd.DataFrame(data, columns=["Data"])
df['Data'] = df['Data'].astype(str).str.replace(r'[^\s=]+=', '', regex=True)
print(df)
Output
Data
0 book
1 journal
2 facebook
3 instagram
If there has to be a value after the equals sign, you can also use str.extract
df['Data'] = df['Data'].astype(str).str.extract(r'[^\s=]+=([^\s=]+)')
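To make this reusable across columns, as asked in the question, one option is a small helper (a sketch; after_equals is just a name I made up):
def after_equals(col):
    # drop everything up to and including the '=' in each cell
    return col.astype(str).str.replace(r'[^\s=]+=', '', regex=True)

df2 = pd.DataFrame({'Data': ['source=book', 'source=journal'],
                    'Data2': ['social-media=facebook', 'social-media=instagram']})
print(df2.apply(after_equals))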
I have a pandas dataframe containing a lot of variables:
df.columns
Out[0]:
Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V6_count_nr_lesion_PRATZE',
'COUNADU_SOIL_P_SAUDPC_150_DA_B_V6_lesion_saudpc_PRATZE',
'CONTRO_SOIL_P_pUNCK_150_DA_B_V6_lesion_p_control_PRATZE',
'COUNJUV_SOIL_P_p_0_100_16_DA_B_V6_lesion_incidence_PRATZE',
'COUNADU_SOIL_P_p_0_100_50_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_p_0_100_128_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_V6_count_nr_spiral_HELYSP',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V10_count_nr_spiral_HELYSP', # and so on
I would like to keep only the number followed by DA, so the first column becomes 16_DA. I have been using the pandas function findall():
df.columns.str.findall(r'[0-9]*\_DA')
Out[595]:
Index([ ['16_DA'], ['50_DA'], ['128_DA'], ['150_DA'], ['150_DA'],
['16_DA'], ['50_DA'], ['128_DA'], ['50_DA'], ['128_DA'], ['150_DA'],
['150_DA'], ['50_DA'], ['128_DA'],
But this returns a list for each column, which I would like to avoid, so that I end up with a column index looking like this:
df.columns
Out[595]:
Index('16_DA', '50_DA', '128_DA', '150_DA', '150_DA',
'16_DA', '50_DA', '128_DA', '50_DA', '128_DA', '150_DA',
Is there a smoother way to do this?
You can use .str.join(", ") to join all found matches with a comma and space:
df.columns.str.findall(r'\d+_DA').str.join(", ")
Or, just use str.extract to get the first match:
df.columns.str.extract(r'(\d+_DA)', expand=False)
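A quick check of the extract approach on a small Index (assuming the column names from the question):
import pandas as pd

cols = pd.Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
                 'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE'])
print(cols.str.extract(r'(\d+_DA)', expand=False))
# Index(['16_DA', '50_DA'], dtype='object')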
Alternatively, you can flatten all the matches into a single comma-separated string:
from typing import List

# findall gives a list per column name; sum(..., []) flattens them into one list
pattern = r'[0-9]+_DA'
flattened: List[str] = sum(df.columns.str.findall(pattern), [])
output: str = ",".join(flattened)
I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] column contains only strings. What I want is to check this column and, if a string contains a specific substring (I'm not really interested where it is in the string, as long as it exists), replace the value. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records, though they contain some of the substrings I defined, do not get replaced. Is it a regex-related problem?
Thank you in advance.
It is not a regex problem: an expression like ('mdma' or 'xanax' or ...) evaluates to just 'mdma', because Python's or returns its first truthy operand, so only the first substring is ever tested. Instead, create a dictionary mapping each replacement string to a list of substrings, loop over it, join each list with | for a regex OR, check the column with Series.str.contains, and overwrite the matched rows with loc:
df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})
a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']
g='Drugs'
z='Weapons'
c='Flowers'
d = {g:a, z:b}
df['new_category'] = c
for k, v in d.items():
pat = '|'.join(v)
mask = df.category.str.contains(pat, case=False)
df.loc[mask, 'new_category'] = k
print (df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
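If you prefer to avoid the explicit loop, the same masks can feed numpy.select (an alternative sketch, not part of the original answer; conditions are checked in order, so the first match wins):
import numpy as np

conditions = [
    df.category.str.contains('|'.join(a), case=False),
    df.category.str.contains('|'.join(b), case=False)
]
df['new_category'] = np.select(conditions, [g, z], default=c)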