I need to remove a list of strings:
list_strings=['describe','include','any']
from a column in pandas:
My_Column
include details about your goal
describe expected and actual results
show some code anywhere
I tried
df['My_Column']=df['My_Column'].str.replace('|'.join(list_strings), '')
but it removes parts of words.
For example:
My_Column
details about your goal
expected and actual results
show some code where # here it should be anywhere
My expected output:
My_Column
details about your goal
expected and actual results
show some code anywhere
Use the "word boundary" expression \b, like so:
In [46]: df.My_Column.str.replace(r'\b{}\b'.format('|'.join(list_strings)), '')
Out[46]:
0 details about your goal
1 expected and actual results
2 show some code anywhere
Name: My_Column, dtype: object
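One note on newer pandas: since 2.0, str.replace treats the pattern as a literal string unless regex=True is passed, so a current version of the call above needs the explicit flag (a sketch using the sample data):

```python
import pandas as pd

list_strings = ['describe', 'include', 'any']
df = pd.DataFrame({'My_Column': ['include details about your goal',
                                 'describe expected and actual results',
                                 'show some code anywhere']})

# \b word boundaries keep "any" from matching inside "anywhere";
# regex=True is required in pandas >= 2.0 for pattern replacement
pattern = r'\b(?:{})\b'.format('|'.join(list_strings))
df['My_Column'] = df['My_Column'].str.replace(pattern, '', regex=True).str.strip()
```

The trailing .str.strip() removes the space left behind where a word was deleted at the start of a line.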
Your issue is that pandas doesn't see words, it simply sees a list of characters. So when you ask pandas to remove "any", it doesn't start by delineating words. So one option would be to do that yourself, maybe something like this:
# Your data
df = pd.DataFrame({'My_Column':
                   ['Include details about your goal',
                    'Describe expected and actual results',
                    'Show some code anywhere']})

list_strings = ['describe', 'include', 'any']  # make sure it's lower case

def remove_words(s):
    if s is not None:
        return ' '.join(x for x in s.split() if x.lower() not in list_strings)

# Apply the function to your column
df.My_Column = df.My_Column.map(remove_words)
The first parameter of the .str.replace() method must be a string or a compiled regex, not a list as you have.
You probably wanted
list_strings = ['Describe', 'Include', 'any']  # note capital D and capital I

for s in [f"\\b{s}\\b" for s in list_strings]:  # surround each word with word boundaries (\b)
    # regex=True is needed in pandas >= 2.0, where the default is a literal replacement
    df['My_Column'] = df['My_Column'].str.replace(s, '', regex=True)
to obtain
My_Column
0 details about your goal
1 expected and actual results
2 Show some code anywhere
Related
I have a column containing strings that are comprised of different words but always have a similar structure. E.g.:
2cm off ORDER AGAIN (191 1141)
I want to extract the sub-string that starts after the second space and ends at the space before the opening bracket/parenthesis. So in this example I want to extract ORDER AGAIN.
Is this possible?
You could use str.extract here:
df["out"] = df["col"].str.extract(r'^\w+ \w+ (.*?)(?: \(|$)')
Note that this answer is robust even if the string doesn't have a (...) term at the end.
Here is a demo showing that the regex logic is working.
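A small pandas sketch (with a made-up second row added to show the no-parenthesis case):

```python
import pandas as pd

df = pd.DataFrame({'col': ['2cm off ORDER AGAIN (191 1141)',
                           '3mm on SHIP NOW']})  # second row has no (...) suffix

# skip the first two words, then capture lazily up to ' (' or end of string
df['out'] = df['col'].str.extract(r'^\w+ \w+ (.*?)(?: \(|$)', expand=False)
```

With expand=False the single capture group comes back as a Series, which slots straight into the new column.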
You can try the following:
r"2cm off ORDER AGAIN (191 1141)".split(r"(")[0].split(" ", maxsplit=2)[-1].strip()
#Out[3]: 'ORDER AGAIN'
If the pattern of data is similar to what you have posted then I think the below code snippet should work for you:
import re

data = "2cm off ORDER AGAIN (191 1141)"
extr = re.match(r".*?\s.*?\s(.*)\s\(.*", data)
if extr:
    print(extr.group(1))
You can try the following code:
s = '2cm off ORDER AGAIN (191 1141)'
second_space = s.find(' ', s.find(' ') + 1)
openparenthesis = s.find('(')
substring = s[second_space:openparenthesis].strip()  # strip the surrounding spaces
print(substring)  # ORDER AGAIN
I'm trying to split on 5x asterisk in Pandas by reading in data that looks like this
"This place is not good ***** less salt on the popcorn!"
My code attempt is trying to read in the reviews column and get the zero index
review = review_raw["reviews"].str.split('*****').str[0]
print(review)
The error
sre_constants.error: nothing to repeat at position 0
My expectation
This place is not good
pandas.Series.str.split
Series.str.split(pat=None, n=-1, expand=False)
Parameters:
pat : str, optional — String or regular expression to split on. If not specified, split on whitespace.
The * character is part of regex syntax: it means "zero or more occurrences of the preceding token", and since there is nothing before it to repeat, your pattern fails.
You can either try escaping the character:
>>df['review'].str.split(r'\*\*\*\*\*').str[0]
0 This place is not good
Name: review, dtype: object
Or you can just pass the regex:
>>df['review'].str.split('[*]{5}').str[0]
0 This place is not good
Name: review, dtype: object
Third option would be to use Python's built-in str.split() instead of pandas' Series.str.split():
>>df['review'].apply(lambda x: x.split('*****')).str[0]
0 This place is not good
Name: review, dtype: object
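A fourth variant (a small sketch, not from the answers above) avoids escaping by hand by letting re.escape build the pattern:

```python
import re
import pandas as pd

df = pd.DataFrame({'review': ["This place is not good ***** less salt on the popcorn!"]})

# re.escape turns the literal '*****' into a safe regex pattern
out = df['review'].str.split(re.escape('*****')).str[0].str.strip()
```

This keeps the literal-string intent explicit while still satisfying the regex engine.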
Try out this code:
def replace_str(string):
    return str(string).replace("*****", ',').split(',')[0]

review = review_raw["reviews"].apply(lambda x: replace_str(x))
However, suppose the input string already contains a ','. In that case the code can be tweaked a little, as below: since I am replacing *****, I can replace it with any character that does not occur in the text, such as '[' in the modified answer.
def replace_str(string):
    return str(string).replace("*****", '[').split('[')[0]

review = review_raw["reviews"].apply(lambda x: replace_str(x))
Currently I have a dataframe. Here is an example of my dataframe:
I also have a list of keywords/sentences. I want to match it against the column 'Content' and see if any of the keywords or sentences match.
Here is what I've done
# instructions_list is just the list of keywords and key sentences
instructions_list = instructions['Key words & sentence search'].tolist()
pattern = '|'.join(instructions_list)
bureau_de_sante[bureau_de_sante['Content'].str.contains(pattern, regex = True)]
While it is giving me the results, it is also giving me this warning: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Questions:
How can I prevent the userwarning from showing up?
After finding a match in the column, how can I print the specific match in a new column?
You are supplying a regex to search the dataframe. If you have parentheses in your instruction list (as is the case in your example), then they constitute a match group. To avoid this, you have to escape them (i.e. add \ in front of them, so that (Critical risk) becomes \(Critical risk\)). You will probably also want to escape other regex metacharacters such as \, ., *, +, and ?.
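Rather than escaping by hand, Python's re.escape does this for you. Escape each keyword individually before joining, since escaping the already-joined string would also escape the | separators (a minimal sketch with made-up keywords):

```python
import re
import pandas as pd

instructions_list = ['(Critical risk)', 'deal', 'C++']

# escape each keyword separately so the '|' separators stay regex alternation
pattern = '|'.join(map(re.escape, instructions_list))

df = pd.DataFrame({'Content': ['a (Critical risk) case', 'no match here', 'learn C++ fast']})
mask = df['Content'].str.contains(pattern, regex=True)
```

Because the parentheses are now literal characters rather than a match group, the UserWarning goes away as well.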
Now, you can use these groups to extract the match from your data. Here is an example:
df = pd.DataFrame(["Hello World", "Foo Bar Baz", "Goodbye"], columns=["text"])
pattern = "(World|Bar)"
print(df["text"].str.extract(pattern))
# 0
# 0 World
# 1 Bar
# 2 NaN
You can add this column to your dataframe with a simple assignment (e.g. df["result"] = df["text"].str.extract(pattern, expand=False)).
I have a list of words negative that has 4783 elements. I want to use the following code
tweets3 = tweets2[tweets2['full_text'].str.contains('|'.join(negative))]
But it gives an error: multiple repeat at position 4193.
I do not understand this error. Apparently, if I use a single word in str.contains such as str.contains("deal") I am able to get results.
All I need is a new dataframe that keeps only those rows of tweets2 whose full_text column contains any of the words.
I would also like to have a boolean column, with 1 for present and 0 for absent.
I arrived at using the following code with the help of @wp78de:
tweets2['negative'] = tweets2.loc[tweets2['full_text'].str.contains(r'(?:{})'.format('|'.join(negative)), regex=True, na=False)].copy()
For arbitrary literal strings that may contain regular expression metacharacters you can use the re.escape() function. Escape each word individually, since escaping the already-joined string would also escape the | separators. Something along this line should be sufficient:
.str.contains(r'(?:{})'.format('|'.join(map(re.escape, words))), regex=True, na=False)
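Putting it together, a sketch (with a toy word list standing in for the real 4,783-element negative list) that also builds the requested 0/1 indicator column:

```python
import re
import pandas as pd

negative = ['bad', 'awful', 'no deal**']   # '**' would break a plain '|'.join
tweets2 = pd.DataFrame({'full_text': ['this is bad', 'all good', 'no deal** today']})

# escape each word separately, then join: the '|' stays regex alternation
pattern = r'(?:{})'.format('|'.join(map(re.escape, negative)))

mask = tweets2['full_text'].str.contains(pattern, regex=True, na=False)
tweets3 = tweets2[mask]                 # rows containing any negative word
tweets2['negative'] = mask.astype(int)  # 1 if present, 0 if absent
```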
I have a list of strings that looks like this:
Input:
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
I want to remove everything except digits and the characters '.' and ','. In other words, I would like to split before the first occurrence of any digit with maxsplit=1:
Desired output:
["1234", "4.421,00", "1,000", "432"]
First attempt (two regex replacements):
# Step 1: Remove special characters
prices_list = [re.sub(r'[^\x00-\x7F]+',' ', price).encode("utf-8") for price in prices_list]
# Step 2: Remove [A-Aa-z]
prices_list = [re.sub(r'[A-Za-z]','', price).strip() for price in prices_list]
Current output:
['1234', '$ 4.421,00', '1,000', '432'] # $ still in there
Second attempt (still two regex replacements):
prices_list = [''.join(re.split("[A-Za-z]", re.sub(r'[^\x00-\x7F]+','', price).encode("utf-8").strip())) for price in price_list]
This (of course) leads to the same output as my first attempt. Also, this isn't much shorter and looks very ugly. Is there a better (shorter) way to do this?
Third attempt (list comprehension / nested for-loop / no regex):
prices_list = [''.join(token) for token in price for price in price_list if token.isdigit() or token == ',|;']
which yields:
NameError: name 'price' is not defined
How to best parse the above-mentioned price list?
If you need to keep only specific characters, it's better to tell the regex to do exactly that:
import re

prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
prices = list()
pattern = r"[\d.,]+"  # inside a character class, '|' would match a literal pipe, so leave it out
for it in prices_list:
    s = re.search(pattern, it)
    if s:
        prices.append(s.group())

> ['1234', '4.421,00', '1,000', '432']
The Problem
Correct me if I'm wrong, but essentially you're trying to remove symbols and such and only leave any trailing digits, right?
I would like to split before the first occurrence of any digit
That, I feel, is the simplest way to frame the regex problem that you are trying to solve.
A Solution
# -*- coding: utf-8 -*-
import re
# Match any contiguous non-digit characters
regex = re.compile(r"\D+")
# Input list
prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432"]
# Regex mapping
desired_output = list(map(lambda price: regex.split(price, 1)[-1], prices_list))
This gives me ['1234', '4.421,00', '1,000', '432'] as the output.
Explanation
The reason this works is the combination of map and the lambda. The map function takes a lambda (a portable, one-line function, if you will) and applies it to every element of the list; the negative index then takes the last element of the list that the split method generates.
Essentially, this works because of the assumption that you don't want any initial non-digits in your output.
Caveats
This code keeps not just . and , but everything after the first run of non-digit characters, so an input string of "$10e7" will be output as '10e7'.
If you were to have just digits and . and ,, such as "10.00" as an input string, you would get '00' in the corresponding location in the output list.
If none of these are desired behavior, you would have to get rid of the negative indexing next to the regex.split(price, 1) and do further processing on the resulting list of lists so that you can handle all of those pesky edge cases that arise with using regex.
Either way, I would try and throw more extreme examples at it just to make sure that it's what you need.
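If those caveats matter, one alternative (a sketch, not part of the original answer) is to search for the first run of digits, dots, and commas instead of splitting:

```python
import re

prices_list = ["CNY1234", "$ 4.421,00", "PHP1,000", "€432", "10.00", "$10e7"]

cleaned = []
for p in prices_list:
    # first run that starts with a digit and continues with digits, dots, or commas
    m = re.search(r"\d[\d.,]*", p)
    cleaned.append(m.group() if m else "")
```

This handles both edge cases above: "10.00" stays whole, and "$10e7" yields only '10'.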