I have a DF similar to the below:
Name     Text
Michael  66l additional text
John     55i additional text
Mary     88l additional text
What I want to do is replace "l" with "P" wherever it occurs in the first word of the "Text" column.
Current code
DF['Text'] = DF['Text'].replace({"l", "P", 1})
Desired Outcome
Name     Text
Michael  66P additional text
John     55i additional text
Mary     88P additional text
You can use pandas.Series.str.replace with regex to identify the first word of the string.
>>> import pandas as pd
>>>
>>>
>>> df
Text
0 66l additional text
1 55i additional text
2 88l additional text
>>>
>>>
>>> df['Text'] = df['Text'].str.replace(r"^\w+\b", lambda x: x.group(0).replace("l", "P"), regex=True)
>>> df
Text
0 66P additional text
1 55i additional text
2 88P additional text
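The same callable-replacement idea can be tried outside pandas with the standard library's re.sub. This is only a sketch: the helper name replace_in_first_word and the third sample row are made up for illustration.

```python
import re

def replace_in_first_word(text: str, old: str = "l", new: str = "P") -> str:
    # Match the leading run of word characters, then rewrite only that run.
    return re.sub(r"^\w+\b", lambda m: m.group(0).replace(old, new), text)

rows = ["66l additional text", "55i additional text", "l8l last label"]
fixed = [replace_in_first_word(r) for r in rows]
print(fixed)
```

Because the callable only ever sees the first word, every l inside it is replaced while any l later in the string is left alone.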
Assuming the l occurs only once in the first word (as in your sample dataframe), you can use
df['Text'].str.replace(r'^(\S*)l', r'\1P', regex=True)
# => 0 66P additional text
# 1 55i additional text
# 2 88P additional text
# Name: Text, dtype: object
See the regex demo. Details:
^ - start of string
(\S*) - Group 1: zero or more non-whitespace characters
l - an l char (letter).
The replacement is \1P, i.e. the Group 1 value + P letter.
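To see why the "only occurs once" assumption matters, here is a small sketch with plain re (the input with two l characters is invented for illustration). Since \S* is greedy, the pattern replaces only the last l of the first word.

```python
import re

pattern = re.compile(r"^(\S*)l")

# One l in the first word: it is replaced.
single = pattern.sub(r"\1P", "66l additional text")
# Two l's in the first word: greedy \S* leaves only the last one for "l".
double = pattern.sub(r"\1P", "6l6l additional text")
# No l in the first word: \S* cannot cross the space, so nothing changes.
untouched = pattern.sub(r"\1P", "55i additional text")

print(single, double, untouched, sep="\n")
```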
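With your shown samples only, this can be done with pandas' str[...] slicing; please try the following code. Note that it overwrites the third character in every row, so it matches the desired outcome only when that character is always l (in the sample, 55i would become 55P).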
import pandas as pd
##Create your df here....
df['Text'] = df['Text'].str[:2] + 'P ' + df['Text'].str[4:]
Explanation:
df['Text'].str[:2]: takes the first two characters (indices 0 and 1) of each value in the Text column.
+ 'P ' +: concatenates P and a space, as the question requires.
df['Text'].str[4:]: takes everything from index 4 (the fifth character) to the end of each value. The combined result of df['Text'].str[:2] + 'P ' + df['Text'].str[4:] is then assigned back to the Text column of the DataFrame.
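The slice-and-concatenate expression rewrites index 2 whatever character sits there. A guarded variant, sketched in plain Python with a hypothetical helper replace_third_char, only touches rows where that position really holds l:

```python
def replace_third_char(text: str, old: str = "l", new: str = "P") -> str:
    # Only rewrite index 2 when it actually holds the character to replace.
    if len(text) > 2 and text[2] == old:
        return text[:2] + new + text[3:]
    return text

rows = ["66l additional text", "55i additional text", "88l additional text"]
fixed = [replace_third_char(r) for r in rows]
print(fixed)
```

This still assumes the character of interest is always at a fixed position, which is the main limitation of the slicing approach.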
Related
I have a column of product names and information. I need to remove the codes from the names; every code starts with four or more zeros. Some names also have four or more zeros in the weight, and some codes are joined to the name, as in the example below:
data = {
'Name' : ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
'Name' : ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
You can split the strings at the start of the code pattern, which is expressed by the regex (?<!\d)0{4,}. This pattern matches four or more 0s that are not preceded by a digit. After splitting, take the first fragment; str.strip then removes any trailing space.
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Note that this only works when the codes are always at the end of your string.
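The effect of the negative lookbehind can be checked without pandas. The following is a minimal sketch using re.split on the sample names; strip_code and CODE_START are illustrative names, not part of the answer.

```python
import re

# Four or more zeros that are not preceded by a digit mark the start of a code.
CODE_START = re.compile(r"(?<!\d)0{4,}")

def strip_code(name: str) -> str:
    # Keep everything before the first code-like run of zeros.
    return CODE_START.split(name, maxsplit=1)[0].strip()

names = [
    "ANOA 250g 00004689",
    "ANOA 10000g 00000059884",
    "80%c asjw 150000001568 ",
    "Shivangi000000478761",
]
cleaned = [strip_code(n) for n in names]
print(cleaned)
```

The zeros in 10000g and 150000001568 are each preceded by a digit, so the lookbehind rejects them and those tokens survive.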
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*',
'', regex=True)
Or, similar to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'\s*(?<!\d)0{4,}\d*',
                                            '', regex=True)
Output:
Name
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
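Try splitting each name on whitespace and dropping every item that contains 0000. Note that this also drops tokens such as 10000g, and removes a name entirely when the code is attached to it (Shivangi000000478761), so it only suits data where the code is the only token containing four zeros in a row:
answer = []
for i in testdf["Name"]:
    answer.append(" ".join([j for j in i.split() if "0000" not in j]))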
Let's say I have a list
lst = ["fi", "ap", "ko", "co", "ex"]
and we have this series
Explanation
a "fi doesn't work correctly"
b "apples are cool"
c "this works but translation is ko"
and I'm looking to get something like this:
Explanation Explanation Extracted
a "fi doesn't work correctly" "fi"
b "apples are cool" "N/A"
c "this works but translation is ko" "ko"
With a dataframe like
df = pd.DataFrame(
{"Explanation": ["fi doesn't co work correctly",
"apples are cool",
"this works but translation is ko"]},
index=["a", "b", "c"]
)
you can use .str.extract() to do
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)
to get
Explanation Explanation Extracted
a fi doesn't co work correctly fi
b apples are cool NaN
c this works but translation is ko ko
The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an occurrence of one of the lst items either at the beginning with whitespace afterwards, in the middle with whitespace before and after, or at the end with whitespace before. str.extract() extracts the capture group (the part in the middle in ()). Without a match the return is NaN.
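As a rough check of how this pattern behaves, the sketch below runs it with plain re.search on the sample sentences. The helper first_match is hypothetical, and wrapping the items in re.escape is an added precaution (an assumption, in case an item ever contains regex metacharacters), not something the answer does.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
# Assumption: escape each item so it is always treated as literal text.
pattern = re.compile(
    r"(?:^|\s+)(" + "|".join(map(re.escape, lst)) + r")(?:\s+|$)"
)

def first_match(text: str):
    # Return the first whole-token match, or None when nothing matches.
    m = pattern.search(text)
    return m.group(1) if m else None

texts = [
    "fi doesn't co work correctly",
    "apples are cool",
    "this works but translation is ko",
]
extracted = [first_match(t) for t in texts]
print(extracted)
```

Here None plays the role that NaN plays in the pandas result.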
If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
Alternative without regex:
df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
If you only want to match at the beginning or end of the sentences, then replace the first part with:
df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
(splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
I think this solves your problem.
import pandas as pd
lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])
extracted = []
for index, row in df.iterrows():
    tempList = []
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList) > 0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')
df['Explanation Extracted'] = extracted
The pandas apply function might be helpful:
def extract_explanation(row):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    # The column is named 'Explanation' (capital E) in the question's data
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation
df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch is that this assumes a single match per row (the last one found wins); the function can return a list instead if multiple matches are expected.
Option 1
Assuming that one wants to extract the exact string in the list lst one can start by creating a regex
regex = f'\\b({"|".join(lst)})\\b'
where \b is a word boundary (the beginning or end of a word), which ensures the match has no word characters directly before or after it. So, with the string ap in the list lst, the word apple in the dataframe won't be matched.
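The word-boundary behaviour is easy to verify with the re module alone. In this sketch the helper extract and the test sentences are invented for illustration:

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
regex = re.compile(r"\b(" + "|".join(lst) + r")\b", flags=re.IGNORECASE)

def extract(text: str):
    # Return the first item that appears as a whole word, else None.
    m = regex.search(text)
    return m.group(1) if m else None

results = [
    extract("apples are cool"),     # "ap" and "co" are only prefixes here
    extract("an ap on the phone"),  # "ap" stands alone
    extract("KO at the end"),       # matched thanks to IGNORECASE
]
print(results)
```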
And then, using pandas.Series.str.extract, and, to make it case insensitive, use re.IGNORECASE
import re
df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool NaN
2 3 this works but translation is ko ko
Option 2
One can also use pandas.Series.apply with a custom lambda function.
df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool N/A
2 3 this works but translation is ko ko
Notes:
.lower() is to make it case insensitive.
.split() is one way to make sure that, even though ap is in the list, the string apple doesn't produce a match in the Explanation Extracted column.
I have a dataset in Excel:
column1
Bank A : 12
Bank B : 40
Bank C : 55
where it contains only a single row with Bank A, B and C information inside one cell.
How would I be able to use regex in Python to create 3 columns whereby my new dataset is:
Bank A Bank B Bank C
12 40 55
Thank You!
You could do it with the following regex:
(.*?)\s:\s(\d+)
Or using this regex to be more forgiving with the spaces before and after the :
(.*?)(?:\s+)?:(?:\s+)?(\d+)
Explanation:
(.*?) # Group 1: lazily match any characters
\s:\s # until reaching whitespace + : + whitespace
(\d+) # Group 2: match one or more digits
Then with your python code you can access contents of Group 1 and 2 using the Match.group() method and build the columns as you need.
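One way to put the groups to use is sketched below with re.findall. It assumes the cell's text has already been read into a string with one bank per line (the variable cell is a stand-in for that value), and it uses the more forgiving \s* spacing.

```python
import re

# Assumption: the single Excel cell was loaded as one newline-separated string.
cell = "Bank A : 12\nBank B : 40\nBank C : 55"

# Each match yields (Group 1, Group 2) = (bank name, number).
pairs = re.findall(r"(.*?)\s*:\s*(\d+)", cell)

# Build a {column name: value} mapping for the new one-row dataset.
columns = {name.strip(): int(value) for name, value in pairs}
print(columns)
```

Passing a one-element list such as [columns] to pd.DataFrame would then give a single row with one column per bank.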
I am trying to write a regex that matches columns in my dataframe. All the columns in the dataframe are
cols = ['after_1', 'after_2', 'after_3', 'after_4', 'after_5', 'after_6',
'after_7', 'after_8', 'after_9', 'after_10', 'after_11', 'after_12',
'after_13', 'after_14', 'after_15', 'after_16', 'after_17', 'after_18',
'after_19', 'after_20', 'after_21', 'after_22', 'after_10_missing',
'after_11_missing', 'after_12_missing', 'after_13_missing',
'after_14_missing', 'after_15_missing', 'after_16_missing',
'after_17_missing', 'after_18_missing', 'after_19_missing',
'after_1_missing', 'after_20_missing', 'after_21_missing',
'after_22_missing', 'after_2_missing', 'after_3_missing',
'after_4_missing', 'after_5_missing', 'after_6_missing',
'after_7_missing', 'after_8_missing', 'after_9_missing']
I want to select all the columns that have values in the strings that range from 1-14.
This code works
df.filter(regex = '^after_[1-9]$|after_([1-9]\D|1[0-4])').columns
but I'm wondering how to write it as a single pattern instead of splitting it in two. The first part selects all strings that end in a number between 1 and 9 (i.e. 'after_1' ... 'after_9') but not their "missing" counterparts. The second part (after the |) selects any string that contains 'after_' followed by a digit from 1 to 9 and then a non-digit character, or followed by 1 and a digit from 0 to 4.
Is there a better way to write this?
I already tried
df.filter(regex = 'after_([1-9]|1[0-4])').columns
But that picks up strings that begin with a 1 or a 2 (i.e. 'after_20')
Try this: after_([1-9]|1[0-4])[a-zA-Z_]*\b
import re
regexp = '''(after_)([1-9]|1[0-4])(_missing)*\\b'''
cols = ['after_1', 'after_14', 'after_15', 'after_14_missing', 'after_15_missing', 'after_9_missing']
for i in cols:
    print(i, re.findall(regexp, i))
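Since df.filter(regex=...) applies re.search to each column label, a candidate pattern can be tested on a plain list first. The sketch below uses a fully anchored variant (^ ... $ with an optional _missing suffix) rather than the \b form above; that variant is an assumption about the column naming, not the answer's exact pattern.

```python
import re

cols = ['after_1', 'after_9', 'after_10', 'after_14', 'after_15', 'after_20',
        'after_1_missing', 'after_14_missing', 'after_15_missing',
        'after_22_missing']

# Anchored variant: a number from 1 to 14, optionally followed by "_missing".
pattern = re.compile(r"^after_([1-9]|1[0-4])(_missing)?$")

selected = [c for c in cols if pattern.search(c)]
print(selected)
```

The trailing $ is what stops after_15 and after_20 from matching on their first digit alone.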
I would like to remove stopwords from a column of a data frame.
Inside the column there is text which needs to be split.
For example my data frame looks like this:
ID Text
1 eat launch with me
2 go outside have fun
I want to remove the stopwords from the Text column, so the text has to be split into words first.
I tried this:
for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')
So my output should be like this:
ID Text
1 eat launch
2 go fun
It means stopwords have been deleted.
But it does not work correctly. I also tried the other way around, converting my data frame to a series and then looping through that, but it also did not work.
Thanks for your help.
replace (by itself) isn't a good fit here, because you want to perform partial string replacement. You want regex based replacement.
One simple solution, when you have a manageable number of stop words, is using str.replace.
import re
# \b keeps stop words from matching inside longer words
p = re.compile(r"\b(?:{})\b".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)
df
ID Text
0 1 eat launch
1 2 outside have fun
If performance is important, use a list comprehension.
cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words])
for x in df['Text'].tolist()]
df
ID Text
0 1 eat launch
1 2 outside have fun
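The token-based filtering can be sketched without pandas as follows. The stop-word set here is a small hypothetical stand-in for cached_stop_words, and remove_stop_words is an illustrative helper name.

```python
# Hypothetical stand-in for the real stop-word list.
cached_stop_words = {"with", "me", "have"}

def remove_stop_words(text: str, stop_words: set) -> str:
    # Compare whole lowercase tokens, so "me" never matches inside "some".
    return " ".join(w for w in text.lower().split() if w not in stop_words)

texts = ["eat launch with me", "go outside have fun", "some time with me"]
cleaned = [remove_stop_words(t, cached_stop_words) for t in texts]
print(cleaned)
```

Using a set keeps each membership test constant-time, which is the reason the answer converts cached_stop_words before the comprehension.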