I have been trying to clean a particular column from a dataset. I am using the .apply() function multiple times in order to throw out any symbol that could be in the string values of the column.
For each symbol, the call looks like this: .apply(lambda x: x.replace("<symbol>", ""))
Although my code works, it is quite long and not that clean. I would like to know if there is a shorter and/or better way of cleaning a column.
Here is my code:
df_reviews = pd.read_csv("reviews.csv")
df_reviews = df_reviews.rename(columns={"Unnamed: 0" : "index", "0" : "Name"})
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
df_reviews['name'] = df_reviews['name'].apply(lambda x: x.replace("Review", "")) \
    .apply(lambda x: x.replace(":", "")) \
    .apply(lambda x: x.replace("'", "")) \
    .apply(lambda x: x.replace('"', "")) \
    .apply(lambda x: x.replace("#", "")) \
    .apply(lambda x: x.replace("{", "")) \
    .apply(lambda x: x.replace("}", "")) \
    .apply(lambda x: x.replace("_", "")) \
    .apply(lambda x: x.replace(":", ""))
df_reviews['name'] = df_reviews['name'].str.strip()
As you can see, the many .apply() calls make it difficult to clearly see what is being removed from the "name" column.
Could someone help me?
Kind regards
You can also use regex:
df_reviews['name'] = df_reviews['name'].str.replace('Review|[:\'"#{}_]', "", regex=True)
Regex pattern:
'Review|[:\'"#{}_]'
Review : matches the literal word "Review"
| : or
[:\'"#{}_] : matches any one of the characters inside the square brackets []
Note:
If you are looking to remove ALL punctuation: you can use this instead
import string
df_reviews['name'] = df_reviews['name'].str.replace(f'Review|[{string.punctuation}]', "", regex=True)
Which will remove the following characters:
!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~
Try this one:
df['name'] = df['name'].str.replace(r'Review| \:| \'|\"|\#| \_', "", regex=True).str.strip()
import pandas as pd

REMOVE_CHARS = ["Review", ":", "#", "{", "}", "_", "'", '"']

def process_name(name: str) -> str:
    # str.replace is simply a no-op when the substring is absent,
    # so no index()/try/except (or debug print) is needed
    for removal_char in REMOVE_CHARS:
        name = name.replace(removal_char, "")
    return name
def clean_code(df_reviews: pd.DataFrame) -> pd.DataFrame:
    # Rename `Unnamed: 0` to `index` and `0` to `Name`
    df_reviews = df_reviews.rename(columns={"Unnamed: 0": "index", "0": "Name"})
    # The Name column holds words separated by ':', so `expand=True` splits them
    # into separate columns and we keep only the zeroth one
    df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
    # Preprocess the name column:
    # if `name` contains any of ["Review", ":", "#", "{", "}", "_", "'", '"'], remove it
    df_reviews['name'] = df_reviews['name'].apply(process_name)
    df_reviews['name'] = df_reviews['name'].str.strip()
    return df_reviews
if __name__ == "__main__":
    df_reviews = pd.read_csv("reviews.csv")
    df_reviews = clean_code(df_reviews)
Related
How can I apply a merge function, or any other method, on column A?
For example, in layman's terms, I want to convert the string "(A|B|C,D)|(A,B|C|D)|(B|C|D)" into
"(D A|D B|D C)|(A B|A C|A D)|(B|C|D)"
The group (B|C|D) stays the same because it contains no comma to merge on. Basically, I want to merge the comma-separated values into each of the other values in the group.
I have the DataFrame below:
import pandas as pd
data = {'A': ['(A|B|C,D)|(A,B|C|D)|(B|C|D)'],
        'B(Expected)': ['(D A|D B|D C)|(A B|A C|A D)|(B|C|D)']}
df = pd.DataFrame(data)
print(df)
My expected result is shown in column B(Expected).
Below are the methods I tried:
(1)
df['B(Expected)'] = df['A'].apply(lambda x: x.replace("|", " ").replace(",", "|") if "|" in x and "," in x else x)
(2)
# Split the string by the pipe character
df['string'] = df['string'].str.split('|')
df['string'] = df['string'].apply(lambda x: '|'.join([' '.join(i.split(' ')) for i in x]))
You can use a regex to extract the values in parentheses, then a custom function with itertools.product to reorganize the values:
from itertools import product
def split(s):
    return '|'.join([' '.join(x) for x in product(*[x.split('|') for x in s.split(',')])])
df['B'] = df['A'].str.replace(r'([^()]+)', lambda m: split(m.group()), regex=True)
print(df)
Note that this requires non-nested parentheses.
Output:
A B
0 (A|B|C,D)|(A,B|C|D)|(B|C|D) (A D|B D|C D)|(A B|A C|A D)|(B|C|D)
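To see why product does the heavy lifting here, trace a single parenthesized group (taking 'A|B|C,D' from the sample input): the comma split gives ['A|B|C', 'D'], the pipe splits give [['A', 'B', 'C'], ['D']], and product pairs every element of the first list with every element of the second:
from itertools import product
list(product(['A', 'B', 'C'], ['D']))  # [('A', 'D'), ('B', 'D'), ('C', 'D')]
split('A|B|C,D')                       # 'A D|B D|C D'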
I have a DataFrame with columns that look like this:
df=pd.DataFrame(columns=['(NYSE_close, close)','(NYSE_close, open)','(NYSE_close, volume)', '(NASDAQ_close, close)','(NASDAQ_close, open)','(NASDAQ_close, volume)'])
df:
(NYSE_close, close) (NYSE_close, open) (NYSE_close, volume) (NASDAQ_close, close) (NASDAQ_close, open) (NASDAQ_close, volume)
I want to remove everything after the underscore and append whatever comes after the comma to get the following:
df:
NYSE_close NYSE_open NYSE_volume NASDAQ_close NASDAQ_open NASDAQ_volume
I tried to strip the column name but it replaced it with nan. Any suggestions on how to do that?
Thank you in advance.
You could use re.sub to extract the appropriate parts of the column names to replace them with:
import re
df=pd.DataFrame(columns=['(NYSE_close, close)','(NYSE_close, open)','(NYSE_close, volume)', '(NASDAQ_close, close)','(NASDAQ_close, open)','(NASDAQ_close, volume)'])
df.columns = [re.sub(r'\(([^_]+_)\w+, (\w+)\)', r'\1\2', c) for c in df.columns]
Output:
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
You could use re.sub to blank out the separators and rebuild each name from the tokens:
import re

def cvt_col(x):
    # replace '(', ')', '_' and ',' with spaces, then split into tokens
    s = re.sub('[()_,]', ' ', x).split()
    return s[0] + '_' + s[2]

df.rename(columns=cvt_col)
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
Use a list comprehension, twice:
step1 = [ent.strip('()').split(',') for ent in df]
df.columns = ["_".join([left.split('_')[0], right.strip()])
for left, right in step1]
df
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
Given the following source data:
import pandas as pd, numpy as np
import re
data = [
    ("1 Bedroom 1 Bathroom Apartment", 1, 1),
    ("We've got a great 2br2ba over here!", np.nan, np.nan),
    ("Luxurious Apartment. Bedrooms: 3 Bathrooms: 3", np.nan, np.nan)]
df = pd.DataFrame(data, columns = ['description', 'bedrooms', 'bathrooms'])
I want to scrape the description field for the bedrooms and bathrooms. I have a regular expression and a function that will do this:
def quantity_in_string(search_text, pattern):
    '''Receives a string and a pattern; returns the highest quantity for the
    pattern described in the string. If there is no match, returns NaN.'''
    unusable_matches = ['.', '..', '...', '']
    matches = re.findall(pattern, search_text, flags=re.IGNORECASE)
    if not matches:
        return np.nan
    # findall returns a tuple per match (one slot per capture group);
    # flatten them and drop the empty/unusable entries
    matches = [j for i in matches for j in i if j not in unusable_matches]
    return max(matches) if matches else np.nan
bedroom_expression = r"(?:bedrooms:[ ]*(\d+\.*\d*))|(?:(\d+\.*\d*)[ ]*(?:bed|br|bd|bedroom))"
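As a quick check (assuming the function and pattern defined above), the helper already behaves as expected on a bare string:
quantity_in_string("We've got a great 2br2ba over here!", bedroom_expression)
# '2'  -- note the quantity comes back as a string, not a number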
My question is, how do I apply quantity_in_string to df and replace missing values with the output from this function?
You can use apply to compute a guess for each row, then use isna to find the rows where the value is missing and modify only those:
bedroom_guess = df['description'].apply(lambda x: quantity_in_string(x, bedroom_expression))
df.loc[df['bedrooms'].isna(), 'bedrooms'] = bedroom_guess[df['bedrooms'].isna()]
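As an aside (my addition, not part of the original answer), since bedroom_guess is aligned on the same index as df, Series.fillna achieves the same masked update in one step:
df['bedrooms'] = df['bedrooms'].fillna(bedroom_guess)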
I'm wondering if someone in the community could help with the following:
The aim is to regex-replace substrings in a pandas DataFrame, based on a dictionary I pass as an argument. However, the key:value replacement should only take place if the dict key is found as a standalone substring (not as part of a word). By standalone substring I mean one that starts after a whitespace.
e.g.:
mapping = {
    "sweatshirt": "sweat_shirt",
    "sweat shirt": "sweat_shirt",
    "shirt": "shirts"
}
df = pd.DataFrame([
    ["men sweatshirt"],
    ["men sweat shirt"],
    ["yellow shirt"]
])
df = df.replace(mapping,regex=True)
expected result:
the substring "shirt" within "sweatshirt" should NOT be replaced with "shirts", since there it is part of another word rather than a standalone value (\b)
NOTE:
the dictionary I pass is rather long, so ideally there is a way to express the standalone requirement (\b) as part of the dict I pass to df.replace(dict, regex=True)
Thanks upfront
You can use
df[0].str.replace(rf"\b(?:{'|'.join(mapping)})\b", lambda x: mapping[x.group()], regex=True)
The regex will look like \b(?:sweatshirt|sweat shirt|shirt)\b; it matches any of the keys as a whole word. The match object is passed to the lambda and the corresponding value is fetched via mapping[x.group()]. Note that regex=True must be passed explicitly in recent pandas versions, because the replacement is a callable.
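A quick run with the question's data (a sketch, assuming the three-key mapping above):
import pandas as pd
mapping = {"sweatshirt": "sweat_shirt", "sweat shirt": "sweat_shirt", "shirt": "shirts"}
df = pd.DataFrame([["men sweatshirt"], ["men sweat shirt"], ["yellow shirt"]])
print(df[0].str.replace(rf"\b(?:{'|'.join(mapping)})\b", lambda m: mapping[m.group()], regex=True))
# 0    men sweat_shirt
# 1    men sweat_shirt
# 2      yellow shirts
# Name: 0, dtype: object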
Multiword Search Term Update
Since you may have multiword terms to search in the mapping keys, you should make sure the longest search terms come first in the alternation group. That is, \b(?:abc def|abc)\b and not \b(?:abc|abc def)\b.
import pandas as pd
mapping = {
    "sweat shirt": "sweat_shirt",
    "shirt": "shirts"
}
df = pd.DataFrame([
    ["men sweatshirt"],
    ["men sweat shirt"]
])
rx = rf"\b(?:{'|'.join(sorted(mapping, key=len, reverse=True))})\b"
df[0].str.replace(rx, lambda x: mapping[x.group()], regex=True)
Output:
0 men sweatshirt
1 men sweat_shirt
Name: 0, dtype: object
Include the white-space in your pattern! :)
mapping = {
    " sweatshirt": " sweat_shirt",
    " shirt": " shirts"
}
df = pd.DataFrame([
    ["men sweatshirt"]
])
df = df.replace(mapping, regex=True)
Try this code:
import pandas as pd

mapping = {
    " sweatshirt": " sweat_shirt",
    " shirt": " shirts"
}
df = pd.DataFrame({'ID': ["men sweatshirt", "black shirt"]})
df = df.apply(lambda x: ' ' + x, axis=1).replace(mapping, regex=True).ID.str.strip()
print(df)
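For the two sample rows this should print (a sketch of the expected output, assuming the mapping above):
0    men sweat_shirt
1       black shirts
Name: ID, dtype: object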
I have this regex_func helper function below that has been working well to extract a match from a df column using map and lambda.
def regex_func(regex_compile, x, item=0, return_list=False):
    """Function to handle the list returned by re.findall().
    Takes the first value of the list.
    If the list is empty, returns an empty string."""
    match_list = regex_compile.findall(x)
    if return_list:
        match = match_list
    elif match_list:
        try:
            match = match_list[item]
        except IndexError:
            match = ""
    else:
        match = ""
    return match
# Working example
regex_1 = re.compile(r'(?i)(?<=\()[^ ()]+')
df['colB'] = df['colA'].map(lambda x: regex_func(regex_1, x))
I am having trouble doing a similar task. I want the regex to be based on a value in another column and then applied. One method I was trying that did not work:
# Regex should be based on value in col1
# Extracting that value and prepping to input into my regex_func()
value_list = df['col1'].tolist()
value_list = ['(?i)(?<=' + d + ' )[^ ]+' for d in value_list]
value_list = [re.compile(d) for d in value_list]
# Adding prepped list back into df as col2
df.insert(1,'col2',value_list)
# Trying to create col4 by applying the compiled regex in col2 to the value in col3
df.insert(2, 'col4', df['col3'].map(lambda x: df['col2'], x))
I understand why the above doesn't work, but have not been able to find a solution.
You can zip the columns and then build (and compile) the regex on the fly; note regex_func expects a compiled pattern, since it calls .findall on it:
df['colB'] = [regex_func(re.compile('(?i)(?<=' + y + ' )[^ ]+'), x)
              for x, y in zip(df['colA'], df['col1'])]
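A tiny end-to-end sketch, using the regex_func defined above (the frame and its values are hypothetical, just to show the per-row anchor word at work):
import re
import pandas as pd
df = pd.DataFrame({'colA': ['price 42 usd', 'qty 7 boxes'],
                   'col1': ['price', 'qty']})
df['colB'] = [regex_func(re.compile('(?i)(?<=' + y + ' )[^ ]+'), x)
              for x, y in zip(df['colA'], df['col1'])]
# df['colB'].tolist() -> ['42', '7']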