How can I apply a merge function (or any other method) to column A?
For example, in layman's terms, I want to convert the string "(A|B|C,D)|(A,B|C|D)|(B|C|D)" into
"(D A|D B|D C)|(A B|A C|A D)|(B|C|D)"
The group (B|C|D) stays the same because it has no comma-separated value to merge. Basically, I want to merge the comma-separated values into each of the other values in the same group.
I have the data frame below.
import pandas as pd
data = {'A': [ '(A|B|C,D)|(A,B|C|D)|(B|C|D)'],
'B(Expected)': [ '(D A|D B|D C)|(A B|A C|A D)|(B|C|D)']
}
df = pd.DataFrame(data)
print (df)
My expected result is mentioned in column B(Expected)
Below are the methods I tried:
(1)
df['B(Expected)'] = df['A'].apply(lambda x: x.replace("|", " ").replace(",", "|") if "|" in x and "," in x else x)
(2)
# Split the string by the pipe character
df['string'] = df['string'].str.split('|')
df['string'] = df['string'].apply(lambda x: '|'.join([' '.join(i.split(' ')) for i in x]))
You can use a regex to extract the values in parentheses, then a custom function with itertools.product to reorganize the values:
from itertools import product
def split(s):
    return '|'.join([' '.join(x) for x in product(*[x.split('|') for x in s.split(',')])])
df['B'] = df['A'].str.replace(r'([^()]+)', lambda m: split(m.group()), regex=True)
print(df)
Note that this requires non-nested parentheses.
Output:
A B
0 (A|B|C,D)|(A,B|C|D)|(B|C|D) (A D|B D|C D)|(A B|A C|A D)|(B|C|D)
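For reference, here is how the helper behaves on a single parenthesised group, as a standalone sketch using the same split function as above:

```python
from itertools import product

def split(s):
    # split the group on commas, split each comma-part on pipes,
    # then take the cartesian product of the parts
    return '|'.join([' '.join(x)
                     for x in product(*[x.split('|') for x in s.split(',')])])

print(split('A|B|C,D'))  # -> A D|B D|C D
print(split('B|C|D'))    # -> B|C|D (no comma, so the group is unchanged)
```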
I have been trying to clean a particular column from a dataset. I am using the function .apply() multiple times in order to throw out any symbol that could be in the string values of the column.
For each symbol, here's the function : .apply(lambda x: x.replace("", ""))
Although my code works, it is quite long and not that clean. I would like to know if there is a shorter and/or better manner of cleaning a column.
Here is my code:
df_reviews = pd.read_csv("reviews.csv")
df_reviews = df_reviews.rename(columns={"Unnamed: 0" : "index", "0" : "Name"})
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
df_reviews['name'] = df_reviews['name'].apply(lambda x: x.replace("Review", "")).apply(lambda x: x.replace(":", "")).apply(lambda x: x.replace("'", "")).apply(lambda x: x.replace('"', "")).apply(lambda x: x.replace("#", ""))\
.apply(lambda x: x.replace("{", "")).apply(lambda x: x.replace("}", "")).apply(lambda x: x.replace("_", "")).apply(lambda x: x.replace(":", ""))
df_reviews['name'] = df_reviews['name'].str.strip()
As you can see, the many .apply() functions makes it difficult to clearly see what is getting removed from the "name" column.
Could someone help me?
Kind regards
You can also use regex:
df_reviews['name'] = df_reviews['name'].str.replace('Review|[:\'"#{}_]', "", regex=True)
Regex pattern:
'Review|[:\'"#{}_]'
Review : replace the word "Review"
| : or
[:\'"#{}_] - any of these characters within the square brackets []
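To illustrate, the pattern can be tried on a plain string with re.sub (the sample value below is made up for demonstration):

```python
import re

# a made-up sample value, just to illustrate the pattern
s = "Review: John's #1 {pick}_"
cleaned = re.sub(r'Review|[:\'"#{}_]', '', s).strip()
print(cleaned)  # -> Johns 1 pick
```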
Note:
If you are looking to remove ALL punctuation: you can use this instead
import string
df_reviews['name'] = df_reviews['name'].str.replace(f'Review|[{string.punctuation}]', "", regex=True)
Which will remove the following characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Try this one:
df['name'] = df['name'].str.replace(r'Review|[:\'"#_]', "", regex=True).str.strip()
import pandas as pd

REMOVE_CHARS = ["Review", ":", "#", "{", "}", "_", "'", '"']

def process_name(name: str) -> str:
    for removal_char in REMOVE_CHARS:
        try:
            print(f"removal char {removal_char}", name.index(removal_char))
            name = name.replace(removal_char, "")
        except ValueError:
            # name.index raises ValueError when the char is absent
            continue
    return name

def clean_code(df_reviews: pd.DataFrame) -> pd.DataFrame:
    # Renaming `Unnamed: 0` as `index`; `0` as `Name`
    df_reviews = df_reviews.rename(columns={"Unnamed: 0": "index", "0": "Name"})
    # `Name` contains words separated by ':', so `expand=True` splits them
    # into columns and we keep the zeroth one
    df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
    # Remove each entry of REMOVE_CHARS from the `name` column
    df_reviews['name'] = df_reviews['name'].apply(process_name)
    df_reviews['name'] = df_reviews['name'].str.strip()
    return df_reviews

if __name__ == "__main__":
    df_reviews = pd.read_csv("reviews.csv")
    df_reviews = clean_code(df_reviews)
I want to search for names in column col_one, using the list of names in the variable list20. If the value of col_one matches an entry in list20, put that name in a new column named new_col.
Most of the time the name is at the front, such as ZEN, W, WICE, but some names have a symbol after them, such as ZEN-R, ZEN-W2, ZEN13P2302A.
My data:
import pandas as pd
list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']
data = {
"col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK", "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG", "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)
# Each of the approaches below gives an incorrect result
# or--------
df['new_col'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
# or--------
import re
pattern = re.compile(r"|".join(x for x in list20))
df = (df
.assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
# or----------
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit
df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
The results obtained from the code above are not right.
Expected Output
Try to sort the list first:
pattern = re.compile(r"|".join(x for x in sorted(list20, reverse=True, key=len)))
(df
.assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
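To see why sorting by length matters, here is a standalone demonstration with a made-up subset of the names:

```python
import re

names = ['ZEN', 'W', 'WICE', 'WIN', 'WINNER']  # made-up subset for illustration
samples = ['WINNER', 'W-W5', 'WICE']

unsorted_pat = re.compile('|'.join(names))
sorted_pat = re.compile('|'.join(sorted(names, key=len, reverse=True)))

# with the unsorted pattern, the short alternative 'W' wins as soon as it matches
print([unsorted_pat.findall(s)[0] for s in samples])  # -> ['W', 'W', 'W']
# longest-first, the longer alternatives are tried before 'W'
print([sorted_pat.findall(s)[0] for s in samples])    # -> ['WINNER', 'W', 'WICE']
```

Python's regex alternation is first-match-wins at each position, not longest-match, so the order of alternatives decides the result.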
Try with str.extract
df['new'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
df
Out[121]:
col_one new
0 CFER CFER
1 ABCP6P45C9 ABC
2 LOU-W5 LOU
3 CFER-R CFER
4 ABC-W1 ABC
5 LOU13C2465 LOU
One way to do this, less attractive in terms of efficiency, is to use a simple function with a lambda such that:
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit
df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
df
expected results:
col_one new_col
0 CFER CFER
1 ABCP6P45C9 ABC
2 LOU-W5 LOU
3 CFER-R CFER
4 ABC-W1 ABC
5 LOU13C2465 LOU
Another approach:
pattern = re.compile(r"|".join(x for x in list20))
(df
.assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
I have a DataFrame with columns that look like this:
df=pd.DataFrame(columns=['(NYSE_close, close)','(NYSE_close, open)','(NYSE_close, volume)', '(NASDAQ_close, close)','(NASDAQ_close, open)','(NASDAQ_close, volume)'])
df:
(NYSE_close, close) (NYSE_close, open) (NYSE_close, volume) (NASDAQ_close, close) (NASDAQ_close, open) (NASDAQ_close, volume)
I want to remove everything after the underscore and append whatever comes after the comma to get the following:
df:
NYSE_close NYSE_open NYSE_volume NASDAQ_close NASDAQ_open NASDAQ_volume
I tried to strip the column name but it replaced it with nan. Any suggestions on how to do that?
Thank you in advance.
You could use re.sub to extract the appropriate parts of the column names to replace them with:
import re
df=pd.DataFrame(columns=['(NYSE_close, close)','(NYSE_close, open)','(NYSE_close, volume)', '(NASDAQ_close, close)','(NASDAQ_close, open)','(NASDAQ_close, volume)'])
df.columns = [re.sub(r'\(([^_]+_)\w+, (\w+)\)', r'\1\2', c) for c in df.columns]
Output:
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
You could:
import re
def cvt_col(x):
    s = re.sub('[()_,]', ' ', x).split()
    return s[0] + '_' + s[2]
df.rename(columns = cvt_col)
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
Use a list comprehension, twice:
step1 = [ent.strip('()').split(',') for ent in df]
df.columns = ["_".join([left.split('_')[0], right.strip()])
for left, right in step1]
df
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
I have this regex_func helper function below that has been working well to extract a match from a df column using map and lambda.
def regex_func(regex_compile, x, item=0, return_list=False):
    """Function to handle the list returned by re.findall().
    Takes the first value of the list.
    If the list is empty, returns an empty string."""
    match_list = regex_compile.findall(x)
    if return_list:
        match = match_list
    elif match_list:
        try:
            match = match_list[item]
        except IndexError:
            match = ""
    else:
        match = ""
    return match
#Working example
regex_1 = re.compile(r'(?i)(?<=\()[^ ()]+')
df['colB'] = df['colA'].map(lambda x: regex_func(regex_1, x))
I am having trouble doing a similar task. I want the regex to be based on a value in another column and then applied. One method I was trying that did not work:
# Regex should be based on value in col1
# Extracting that value and prepping to input into my regex_func()
value_list = df['col1'].tolist()
value_list = ['(?i)(?<=' + d + ' )[^ ]+' for d in value_list]
value_list = [re.compile(d) for d in value_list]
# Adding prepped list back into df as col2
df.insert(1,'col2',value_list)
#Trying to create col4, based on applying my re.compile in col 2 to a value in col3.
df.insert(2,'col4', df['col3'].map(lambda x: df['col2'],x)
I understand why the above doesn't work, but have not been able to find a solution.
You can zip the columns and then build the regex on the fly:
df['colB'] = [regex_func(re.compile('(?i)(?<=' + y + ' )[^ ]+'), x)
              for x, y in zip(df['colA'], df['col1'])]
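A self-contained sketch of the zip approach, with made-up data where col1 holds the word whose following token we want from colA:

```python
import re
import pandas as pd

# made-up data for illustration
df = pd.DataFrame({'colA': ['price 100 usd', 'qty 5 items'],
                   'col1': ['price', 'qty']})

# build a row-specific lookbehind pattern from col1 and apply it to colA
df['colB'] = [re.compile('(?i)(?<=' + y + ' )[^ ]+').findall(x)[0]
              for x, y in zip(df['colA'], df['col1'])]
print(df['colB'].tolist())  # -> ['100', '5']
```

Because each row's pattern is compiled separately, the fixed-width lookbehind restriction applies per row, so the anchor words may have different lengths.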
I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] contains only strings. What I want is to check this column and if a string contains a specific substring(not really interested where they are in the string as long as they exist) to be replaced. I am using this script
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records though they contain some of the substrings I defined, do not get replaced. Is it a regex related problem?
Thank you in advance.
Create a dictionary mapping each replacement string to its list of substrings, loop over it, join each list's values with | (regex OR), check the column with contains, and replace the matched rows with loc:
df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})
a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']
g='Drugs'
z='Weapons'
c='Flowers'
d = {g:a, z:b}
df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k
print (df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
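For reference, the reason the original `or`-chained membership test misbehaves: `or` returns its first truthy operand, so `('mdma' or 'xanax' or ...)` evaluates to just 'mdma' before `in` is ever applied, and only that first substring is checked.

```python
# ('mdma' or 'xanax') evaluates to 'mdma', so only 'mdma' is tested
print(('mdma' or 'xanax') in 'some xanax tabs')                # False
# any() with a generator tests each substring individually
print(any(s in 'some xanax tabs' for s in ('mdma', 'xanax')))  # True
```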