Match the Exact substring from string of pandas series object - python

I am trying to match an exact substring within the strings of a pandas DataFrame column, but str.contains doesn't seem to work here. The documentation suggests passing regex=False, which doesn't work either. Can anyone suggest a solution?
Output:
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual
1 ff~tg~conbhv contextual
2 ff~tg~con contextual
Expected Output:
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual + behavioral
1 ff~tg~conbhv contextual + behavioral
2 ff~tg~con contextual
Approach:
import pandas as pd
import numpy as np
column = {'col_name': ['Revised Targeting Type']}
data = {"Creative Name":["ff~pd~q4-smartphones-note10-pdp-iphone7_mk~gb_ch~social_md~h_ad~ss1x1_dt~cross_fm~spost_pb~fcbk_sz~1x1_rt~cpm_tg~conbhv_sa~lo_vv~ia_it~soc_ts~lo-iphone7_ff~ukp q4 smartphones ukc q4 - smartphones - static ukt lo-iphone7 ukcdj buy_ct~fb_cs~1x1_lg~engb_cv~ge_ce~loc_mg~oth_ta~lrn_cw~na",
"ff~tg~conbhv",
"ff~tg~con"], "Revised Targeting Type":["ABC", "NA", "NA"]}
mapping = {"Code": ['con', 'conbhv'], "Actual": ['contextual', 'contextual + behavioral'], "OtherPV": [np.nan, np.nan],
"SheetName": ['tg', 'tg']}
# Creating a dataFrame
dataframe_data = pd.DataFrame(data)
mapping_data = pd.DataFrame(mapping)
column_data = pd.DataFrame(column)
print(dataframe_data)
print(mapping_data)
print(column_data)
# loop through the DataFrame columns listed in the column_data dataframe
for i in column_data.iloc[:,0]:
    print(i)
    # loop through mapping dataframe (mapping_data)
    for k, l, m in zip(mapping_data.iloc[:, 0], mapping_data.iloc[:, 1], mapping_data.iloc[:, 3]):
        # mask the rows of dataframe_data that are still 'NA'
        mask_null_revised_new_col = (dataframe_data['{}'.format(i)].isin(['NA']))
        # apply the mapped values in the main dataframe (dataframe_data)
        dataframe_data['{}'.format(i)] = np.select([mask_null_revised_new_col &
            dataframe_data['Creative Name'].str.contains('{}~{}'.format(m, k))],
            [l], default=dataframe_data['{}'.format(i)])
print(dataframe_data)
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual
1 ff~tg~conbhv contextual
2 ff~tg~con contextual

To be honest, I'm a little confused by your question, but is this what you're looking for?
dataframe_data['Revised Targeting Type'] = np.where(dataframe_data['Creative Name'].str.contains('conbhv', regex=False), 'contextual + behavioral', 'contextual')
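A more general sketch, assuming every code in the creative name is followed either by a non-alphanumeric separator (such as `_`) or by the end of the string. The shortened creative names and the plain-dict `mapping` below are made up for illustration. The loop in the question fails because `tg~con` is also a substring of `tg~conbhv`, so the first mapping entry claims every row; a negative lookahead prevents that:

```python
import re

import pandas as pd

# Shortened, made-up creative names for illustration
df = pd.DataFrame({
    "Creative Name": ["ff~pd~q4_tg~conbhv_sa~lo", "ff~tg~conbhv", "ff~tg~con"],
    "Revised Targeting Type": ["ABC", "NA", "NA"],
})
mapping = {"con": "contextual", "conbhv": "contextual + behavioral"}

for code, label in mapping.items():
    # (?![A-Za-z0-9]) rejects a match when the code continues with more
    # letters or digits, so "tg~con" no longer matches inside "tg~conbhv"
    pattern = r"tg~" + re.escape(code) + r"(?![A-Za-z0-9])"
    mask = (
        df["Revised Targeting Type"].eq("NA")
        & df["Creative Name"].str.contains(pattern, regex=True)
    )
    df.loc[mask, "Revised Targeting Type"] = label

print(df)
```

Rows that already hold a value other than "NA" (the first row here) are left untouched, as in the original loop.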

Related

Splitting column by multiple custom delimiters in Python

I need to split a column called Creative where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
The two-letter code preceding each parenthesized section ( ) is the title of the desired column, and the codes are the same in every row. The only data that changes is what is inside the parentheses. I want the data to look like:
pn    io   ta   pt    cn    cs
2021  302  Yes  Blue  John  Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')','')
but got the error "error: missing ), unterminated subpattern at position 2", which I assume has something to do with regular expressions.
Is there an easy way to split these ? Thanks.
Use str.extract with named capturing groups:
import pandas as pd
# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])
# extract with a named capturing group
res = df["Creative"].str.extract(
    r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
    expand=True)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
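As for the error in the question: str.split treats a pattern longer than one character as a regular expression, so the unescaped ( in 'cs(' opens a group that is never closed. A minimal sketch of the original two-step approach with the parentheses handled as literals, assuming the single-row toy frame below (re.escape and regex=False are the only changes):

```python
import re

import pandas as pd

df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]})

# re.escape turns "cs(" into "cs\(" so the parenthesis is matched literally
parts = df["Creative"].str.split(re.escape("cs("), expand=True)
df["Creative"] = parts[0]

# regex=False makes ")" a plain character regardless of the pandas version
df["Creative Size"] = parts[1].str.replace(")", "", regex=False)

print(df)
```

This only peels off the last field; the extract-based answers are a better fit when every field is needed.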
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent a row of the desired dataframe, then construct a dataframe out of it. I can build it in one nested comprehension:
import re
# (.+?) is non-greedy, so each match stops at the first closing parenthesis
rows = [{r[0]:r[1] for r in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
    df[col] = subtable[col].values
Basically, I regex search for instances of ab(*), capture the two-letter prefix and the contents of the parentheses, and store them in a list of tuples. Then I create a dictionary out of each list of tuples, which is essentially a row like the one you display in your question. Then I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
David
Try with extractall:
names = df["Creative"].str.extractall(r"(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall(r"\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract matching column name-value pairs
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, different way of packaging final DataFrame:
import re
import pandas as pd
txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
data = list(zip(*re.findall(r'([^(]+)\(([^)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])

When adding SpaCy output to existing dataframe, columns do not align

I have a csv with a column of article titles from which I've used SpaCy to extract any people's names that appear in the titles. When trying to add a new column to the csv with the names extracted by SpaCy, they do not align with the rows from which they were extracted.
I believe this is because the SpaCy results have their own index which is independent of the original data's index.
I've tried adding , index=df.index) to the new column line but I get "ValueError: Length of passed values is 2, index implies 10."
How do I align the SpaCy output to the rows from which they originated?
Here's my code:
import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
usecols=['article_title']))
article = [_ for _ in df['article_title']]
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)
import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())
This is the resulting dataframe:
article_title artist_names
0 “They’re like, is that? Oh it’s!” – ... (Hannah, Ward)
1 Billed as London’s biggest public festival of ... (Dylan, Mulvaney)
2 Transport yourself back to the dusky skies and... NaN
3 Turning to art at the beginning of quarantine ... NaN
4 Dylan Mulvaney, head of design at Gretel, expl... NaN
This is what I'm expecting:
article_title artist_names
0 “They’re like, is that? Oh it’s!” – ... (Hannah, Ward)
1 Billed as London’s biggest public festival of ... NaN
2 Transport yourself back to the dusky skies and... NaN
3 Turning to art at the beginning of quarantine ... NaN
4 Dylan Mulvaney, head of design at Gretel, expl... (Dylan, Mulvaney)
You can see the 5th value in artist_names column is related to the 5th article title. How can I get them to align?
Thank you for your help.
I would iterate through the articles, detect entities from each article separately, and put the detected entities in a list with one element per article:
nlp = spacy.load('en_core_web_lg')
article = [_ for _ in df['article_title']]
entities_by_article = []
for doc in nlp.pipe(article):
    people = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.append(ent)
    entities_by_article.append(people)
df['artist_names'] = pd.Series(entities_by_article)
Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping through a list of texts and could be replaced by:
for a in article:
    doc = nlp(a)
    ## rest of code within loop

if ent.label_ == "PERSON":
    people.append(ent)
else:
    people.append(np.nan) # if ent.label_ is not a PERSON
Include an else statement so that an entity whose label_ is not PERSON is recorded as NaN.
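The misalignment itself can be reproduced without spaCy. In the sketch below, `find_people` and `KNOWN_PEOPLE` are hypothetical stand-ins for the PERSON entity lookup, and the titles are made up. A flat list of names gets a fresh 0..n-1 index when wrapped in pd.Series, so pandas fills the first n rows; producing one result per row keeps the original index:

```python
import pandas as pd

df = pd.DataFrame({"article_title": [
    "Hannah Ward on type design",
    "A festival of architecture",
    "Dylan Mulvaney explains the rebrand",
]})

# Hypothetical stand-in for the spaCy PERSON lookup, for illustration only
KNOWN_PEOPLE = ["Hannah Ward", "Dylan Mulvaney"]

def find_people(title):
    """Return the known names that occur in one title (possibly an empty list)."""
    return [name for name in KNOWN_PEOPLE if name in title]

# Flat list across all titles: positions no longer correspond to rows
flat = [name for title in df["article_title"] for name in find_people(title)]
df["misaligned"] = pd.Series(flat)  # index 0..1, so rows 0 and 1 get filled

# One result per row: apply keeps the original index
df["artist_names"] = df["article_title"].apply(find_people)

print(df)
```

With spaCy, the per-row function would wrap the nlp call and the PERSON filter instead of the lookup list.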

spacy stemming on pandas df column not working

How do I apply stemming to a pandas DataFrame column?
I am using this function for stemming, which works perfectly on a string:
xx='kenichan dived times ball managed save 50 rest'
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    print(" ".join(x_list))
make_to_base(xx)
But when I apply this function to my pandas DataFrame column, it does not work, and it does not raise an error either:
x = list(df['text']) #my df column
x = str(x)#converting into string otherwise it is giving error
make_to_base(x)
I've tried different things, but nothing works. For example:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
my dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value you build inside the make_to_base function instead of printing it. Use
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))
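The difference between printing and returning can be seen without spaCy. In this sketch, `shout_print` and `shout_return` are hypothetical helpers that stand in for the lemmatizer; apply stores whatever the function returns, and a function that only prints returns None:

```python
import pandas as pd

df = pd.DataFrame({"text": ["Hope you are having a good week",
                            "K..give back my thanks."]})

def shout_print(x):
    print(x.upper())      # shows the value but implicitly returns None

def shout_return(x):
    return x.upper()      # hands the value back to apply

df["printed"] = df["text"].apply(shout_print)    # column of None
df["returned"] = df["text"].apply(shout_return)  # column of strings

print(df)
```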

Apply row logic on date while extracting only multiple columns of a dataframe

I am extracting a data frame in pandas and want to only extract rows where the date is after a variable.
I can do this in multiple steps but would like to know if it is possible to apply all logic in one call for best practice.
Here is my code
import pandas as pd
self.min_date = "2020-05-01"
#Extract DF from URL
self.df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
self.df = self.df.loc[["Date of case" < self.min_date], ["Subject","Reference","Date of case"]]
return(self.df)
I keep getting the error: "IndexError: Boolean index has wrong length: 1 instead of 100"
I cannot find the solution online because every answer is too specific to the scenario of the person that asked the question.
e.g. this solution only works for if you are calling one column: How to select rows from a DataFrame based on column values?
I appreciate any help.
Replace this:
["Date of case" < self.min_date]
with this:
self.df["Date of case"] < self.min_date
That is (switch < to > if you want the dates after min_date, as the question describes):
self.df = self.df.loc[self.df["Date of case"] < self.min_date,
                      ["Subject","Reference","Date of case"]]
You have a slight syntax issue.
Keep in mind that it's best practice to convert string dates into pandas datetime objects using pd.to_datetime.
min_date = pd.to_datetime("2020-05-01")
#Extract DF from URL
df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
# Convert the column to datetime, then filter rows and select columns in a single .loc call
df['Date of case'] = pd.to_datetime(df['Date of case'])
df = df.loc[df["Date of case"] > min_date, ["Subject","Reference","Date of case"]]
Output:
Subject Reference Date of case
0 Salmonella enterica ser. Enteritidis (presence... 2020.2145 2020-05-22
1 migration of primary aromatic amines (0.4737 m... 2020.2131 2020-05-22
2 celery undeclared on green juice drink from Ge... 2020.2118 2020-05-22
3 aflatoxins (B1 = 29.4 µg/kg - ppb) in shelled ... 2020.2146 2020-05-22
4 too high content of E 200 - sorbic acid (1772 ... 2020.2125 2020-05-22
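A self-contained sketch of the same idea on made-up data (the rows and the extra "Other" column are assumptions, since the live page cannot be relied on here). It also shows where the IndexError comes from: comparing two plain strings yields a single bool, so the list passed to .loc has length 1 rather than one entry per row:

```python
import pandas as pd

# Made-up rows standing in for the table scraped from the URL
df = pd.DataFrame({
    "Subject": ["a", "b", "c"],
    "Reference": ["2020.1", "2020.2", "2020.3"],
    "Date of case": ["2020-04-30", "2020-05-01", "2020-05-22"],
    "Other": [1, 2, 3],
})
min_date = pd.to_datetime("2020-05-01")

# String-vs-string comparison gives one bool, not a per-row mask
bad_mask = ["Date of case" < "2020-05-01"]

# Comparing the column gives a boolean Series with one entry per row
df["Date of case"] = pd.to_datetime(df["Date of case"])
mask = df["Date of case"] > min_date

# Row filter and column selection in a single .loc call
result = df.loc[mask, ["Subject", "Reference", "Date of case"]]
print(result)
```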

Replace partial string/char in column data of pandas dataframe

I have a dataframe as follows:
Name Rating
0 ABC Good
1 XYZ Good #
2 GEH Good
3 ABH *
4 FEW Normal
In the Rating column, I want to replace # with Can be improve and * with Very Poor. I tried the following, but it replaces the whole string, whereas I want to replace only the special character when it is present. (It does work for the case where the cell contains only the special character.)
import pandas as pd
df = pd.DataFrame() # Load with data
df['Rating'] = df['Rating'].str.replace('.*#+.*', 'Can be improve')
is returning
Name Rating
0 ABC Good
1 XYZ Can be improve
2 GEH Good
3 ABH Very Poor
4 FEW Normal
Can anybody help me out with this?
import pandas as pd
df = pd.DataFrame({"Rating": ["Good", "Good #", "*"]})
df["Rating"] = df["Rating"].str.replace("#", "Can be improve", regex=False)
df["Rating"] = df["Rating"].str.replace("*", "Very Poor", regex=False)
print(df)
Output:
0 Good
1 Good Can be improve
2 Very Poor
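The whole string is replaced because .* matches any character zero or more times, so '.*#+.*' matches the entire cell.
If your special values are always at the end of the string you might use (regex=True is passed explicitly because the default changed to False in pandas 2.0):
.str.replace(r'#$', "Can be improve", regex=True)
.str.replace(r'\*$', "Very Poor", regex=True)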
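Putting this together on the data from the question, as a sketch: the `replacements` dict is an illustrative name, and regex=False is passed so that * is treated as a plain character on any pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ABC", "XYZ", "GEH", "ABH", "FEW"],
    "Rating": ["Good", "Good #", "Good", "*", "Normal"],
})

# Literal symbol -> replacement text
replacements = {"#": "Can be improve", "*": "Very Poor"}

for symbol, text in replacements.items():
    # regex=False: replace only the symbol itself, wherever it appears
    df["Rating"] = df["Rating"].str.replace(symbol, text, regex=False)

print(df)
```

Only the special character is swapped out, so "Good #" becomes "Good Can be improve" rather than losing the "Good" prefix.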
