Pandas unable to split on multiple asterisk - python

I'm trying to split on five asterisks (*****) in pandas. The data I'm reading in looks like this:
"This place is not good ***** less salt on the popcorn!"
My attempt reads in the reviews column and takes the element at index zero:
review = review_raw["reviews"].str.split('*****').str[0]
print(review)
The error
sre_constants.error: nothing to repeat at position 0
My expectation
This place is not good

From the documentation for pandas.Series.str.split:
Series.str.split(pat=None, n=-1, expand=False)
Parameters:
pat : str, optional
    String or regular expression to split on. If not specified, split on whitespace.
In a regular expression, the * character is a quantifier meaning "zero or more occurrences of the preceding element". Your pattern starts with *, so there is nothing for it to repeat, and that is why your code fails.
You can either escape each asterisk (using a raw string so the backslashes reach the regex engine unchanged):
>>> df['review'].str.split(r'\*\*\*\*\*').str[0]
0 This place is not good
Name: review, dtype: object
Or you can put the asterisk in a character class, where it is literal, and give a repeat count:
>>> df['review'].str.split('[*]{5}').str[0]
0 This place is not good
Name: review, dtype: object
A third option is to use Python's built-in str.split() instead of pandas' Series.str.split():
>>> df['review'].apply(lambda x: x.split('*****')).str[0]
0 This place is not good
Name: review, dtype: object
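The failure can be reproduced without pandas, because the error comes from Python's re module. The following is a minimal sketch rather than part of the answers above: the variable names are illustrative, and it uses re.escape as an alternative to escaping each asterisk by hand.

```python
import re

text = "This place is not good ***** less salt on the popcorn!"
delimiter = "*****"

# A leading '*' is a quantifier with nothing before it to repeat,
# so compiling the raw delimiter as a regex raises re.error.
try:
    re.compile(delimiter)
    compiled_ok = True
except re.error:
    compiled_ok = False

# re.escape backslash-escapes every special character, making the
# delimiter safe to use as a regex pattern.
escaped = re.escape(delimiter)
first_part = re.split(escaped, text)[0].strip()

print(compiled_ok)
print(first_part)
```

The escaped string is an ordinary regex pattern, so it can be passed wherever a pattern is expected. The .strip() call removes the space left in front of the delimiter.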

Try this code:
def replace_str(string):
    return str(string).replace("*****", ',').split(',')[0]
review = review_raw["reviews"].apply(lambda x:replace_str(x))
If the input string may already contain a ',', the code can be tweaked as below. Because ***** is only being swapped for a placeholder, any character that does not occur in the text, such as '[', will do.
def replace_str(string):
    return str(string).replace("*****", '[').split('[')[0]
review = review_raw["reviews"].apply(lambda x:replace_str(x))
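If choosing a placeholder character that never occurs in the data feels fragile, str.partition avoids the placeholder entirely. This is a hedged alternative sketch, not the answerer's code; the function name before_delimiter is made up for illustration.

```python
def before_delimiter(value, delimiter="*****"):
    """Return the text in front of the first delimiter (hypothetical helper)."""
    # partition splits on the literal delimiter, with no regex involved,
    # and returns (head, delimiter, tail); we only keep the head.
    head, _, _ = str(value).partition(delimiter)
    return head.rstrip()


print(before_delimiter("This place is not good ***** less salt on the popcorn!"))
print(before_delimiter("Okay, [sort of] ***** meh"))
```

Because it takes a single value, the function can be passed to apply in the same way as replace_str above.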

Related

How can I remove '\n + 1' from a DataFrame?

I have the following dataframe:
Senior  Location
False   Warszawa
True    Warszawa\n + 1
I try to remove that "\n + 1", which looks like a hidden character to me. At first, I tried with:
df['Location']=df['Location'].str.replace('Warszawa\n + 1','Warszawa')
but nothing happened.
I managed to remove those characters manually, with a long chain of splits and replaces, but that is not a viable solution: it gives me weird results later in the program. Although both rows of the df now show "Warszawa", they are treated as two different locations.
What I want is this:
Senior  Location
False   Warszawa
True    Warszawa
How can I correctly remove that "\n + 1"? And what character is it?
In pandas versions before 2.0, the .str.replace method treats the pattern as a regex (regular expression) by default. In a regex, + has a special meaning (one or more of the preceding element), so your pattern never matches the literal + in the data. To tell pandas that you are searching for exactly +, set regex=False:
df['Location'] = df['Location'].str.replace(r'Warszawa\n + 1', 'Warszawa', regex=False)
With the r prefix, \n means a literal backslash followed by n. If the cell holds a real newline character instead, drop the r prefix.
You can read more about the parameters here:
pandas.Series.str.replace
You will have the same problem if the text you search for contains any of the following characters:
., [, ], *, ?
For the complete list, search for "regex special characters".
In pandas versions before 2.0, str.replace() has regex=True by default. Since you just want to replace a literal string, you can either do what @Amir Py has done and pass regex=False, or use the replace() method, whose regex parameter is False by default. Note that replace() without regex only replaces cells whose entire value equals the search string.
Code:
df['Location'] = df['Location'].replace('Warszawa\n + 1', 'Warszawa')
It can also be useful if you have other similar issues in other columns of your dataframe. For more information there is a great question and answer on stack:
str.replace v replace
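The difference between regex and literal matching can be checked with the standard library alone. The sketch below assumes the cell contains a real newline character (an assumption, since the question does not settle this) and uses re.sub to stand in for a regex-mode replacement.

```python
import re

# Assumed cell content: a real newline, a space, a plus sign, a space, a digit.
value = "Warszawa\n + 1"
target = "Warszawa\n + 1"

# As a regex, " +" means "one or more spaces", so the literal '+' in the
# data is never matched and the value comes back unchanged.
as_regex = re.sub(target, "Warszawa", value)

# A plain string replacement treats every character literally.
as_literal = value.replace(target, "Warszawa")

# Escaping the target makes the regex behave like the literal search.
as_escaped = re.sub(re.escape(target), "Warszawa", value)

print(repr(as_regex))
print(repr(as_literal))
print(repr(as_escaped))
```

This also explains why "nothing happened" in the original attempt: the pattern was valid, it simply described a different string.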

Removing list of strings from column in pandas

I would need to remove a list of strings:
list_strings=['describe','include','any']
from a column in pandas:
My_Column
include details about your goal
describe expected and actual results
show some code anywhere
I tried
df['My_Column']=df['My_Column'].str.replace('|'.join(list_strings), '')
but it removes parts of words.
For example:
My_Column
details about your goal
expected and actual results
show some code where # here it should be anywhere
My expected output:
My_Column
details about your goal
expected and actual results
show some code anywhere
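Use the "word boundary" expression \b, and wrap the alternatives in a group so that \b applies to every word rather than only the first and last:
In [46]: df.My_Column.str.replace(r'\b(?:{})\b'.format('|'.join(list_strings)), '', regex=True)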
Out[46]:
0 details about your goal
1 expected and actual results
2 show some code anywhere
Name: My_Column, dtype: object
Your issue is that pandas doesn't see words; it simply sees a sequence of characters. So when you ask pandas to remove "any", it doesn't start by delineating words. One option is to do that yourself, for example:
# Your data
import pandas as pd
df = pd.DataFrame({'My_Column':
                   ['Include details about your goal',
                    'Describe expected and actual results',
                    'Show some code anywhere']})
list_strings = ['describe', 'include', 'any']  # make sure it's lower case
def remove_words(s):
    if s is not None:
        return ' '.join(x for x in s.split() if x.lower() not in list_strings)
# Apply the function to your column
df.My_Column = df.My_Column.map(remove_words)
The first parameter of the .str.replace() method must be a string or a compiled regex, not a list.
You probably wanted
list_strings = ['Describe', 'Include', 'any']  # Note capital D and capital I
for s in [f"\\b{s}\\b" for s in list_strings]:  # surrounded by word boundaries (\b)
    df['My_Column'] = df['My_Column'].str.replace(s, '', regex=True)
to obtain
My_Column
0 details about your goal
1 expected and actual results
2 Show some code anywhere
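When joining several words with | and adding \b, the alternation needs its own group; otherwise \b binds only to the first and last alternative. The following sketch uses only the re module to show the difference. The helper name remove_whole_words and the test strings are illustrative, not taken from the question.

```python
import re

list_strings = ['describe', 'include', 'any']
alternatives = '|'.join(map(re.escape, list_strings))

# Ungrouped: parsed as  \bdescribe  OR  include  OR  any\b
ungrouped = re.compile(r'\b{}\b'.format(alternatives))
# Grouped: \b applies on both sides of every alternative.
grouped = re.compile(r'\b(?:{})\b'.format(alternatives))


def remove_whole_words(text):
    """Drop the listed words and collapse the leftover whitespace."""
    return ' '.join(grouped.sub('', text).split())


print(ungrouped.sub('', 'company included'))   # damages both words
print(grouped.sub('', 'company included'))     # leaves both words alone
print(remove_whole_words('show some code anywhere'))
print(remove_whole_words('include details about your goal'))
```

The grouped pattern string can be passed to a regex-mode replacement in the same way as the patterns in the answers above.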

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring (a!b2, as an example) and return them with the 4 characters that follow the match. These 4 characters are always dynamic and can be any letter, digit, or symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
import re
s = "a!b2foobar"  # example input
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters, except newlines unless the re.DOTALL flag is set.
group(1) is the text captured by the parentheses; group(0) is the complete match. You can read about group ids here.
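To make the group numbering and the newline caveat concrete, here is a small sketch. The input string is invented for illustration; the second occurrence deliberately has a newline among the 4 following characters.

```python
import re

s = "a!b2foobar a!b2ba\nqux"

m = re.search(r'a!b2(.{4})', s)
whole = m.group(0)      # the substring plus the 4 following characters
captured = m.group(1)   # only the 4 following characters

# By default '.' does not match a newline, so the second occurrence is skipped.
default_matches = re.findall(r'a!b2.{4}', s)
# With re.DOTALL, '.' matches any character, including a newline.
dotall_matches = re.findall(r'a!b2.{4}', s, flags=re.DOTALL)

print(whole, captured)
print(default_matches)
print(dotall_matches)
```

Dropping the parentheses makes findall return the whole match, which may be closer to "return them with the 4 characters that follow".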
If you're only looking for how to grab the following 4 characters using regex, what you probably want is the curly-brace quantifier: '{}'.
They go into more detail in the post here, but essentially you would use [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
However, a more Pythonic way to solve this problem would be to use string slicing and the index function.
E.g., given an input_string, once you find the substring at index i using index, you can use input_string[i+len(sub_str):i+len(sub_str)+4] to grab those following characters.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
symbol is now 'efg' (which shows it also works when fewer than 4 characters follow the substring).
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.
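The loop described above might look like the sketch below. It uses str.find rather than index, because find returns -1 instead of raising when nothing more is found; the function name following_chars and its parameters are illustrative.

```python
def following_chars(text, sub, width=4):
    """Collect the `width` characters after every occurrence of `sub`."""
    results = []
    start = text.find(sub)
    while start != -1:
        begin = start + len(sub)
        # Slicing never raises, so a short tail simply yields fewer characters.
        results.append(text[begin:begin + width])
        # Resume the search one position after the previous hit.
        start = text.find(sub, start + 1)
    return results


print(following_chars("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
print(following_chars("abcdefg", "abcd"))
```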

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But it gives the wrong output for the first string. I also tried another one (here):
(.)(.*?)*?\1
But this one again gives the wrong output. What's going wrong here?
The Python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string
It can be solved without a regular expression, like this:
>>> s, s1 = 'abbcabb', 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop:
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
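The non-overlapping behavior described above can be observed directly with re.finditer, which reports where each match starts and ends. This is an illustrative sketch; the helper name match_spans is made up.

```python
import re

pattern = re.compile(r'(.)(.*?)\1')


def match_spans(s):
    """List (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(s)]


spans = match_spans('abbcabb')
one_pass = pattern.sub(r'\g<2>', 'abbcabb')

print(spans)     # the second match starts where the first one ended
print(one_pass)  # result of a single re.sub pass
```

The second match begins at the index where the first one ended, so the bb inside the first replacement is never revisited in the same pass.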
Regular expressions don't seem to be the ideal solution here:
- they don't handle overlapping matches, so a loop is needed (like in this answer), and strings are created over and over (performance suffers)
- they're overkill here; we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all the elements each time.
It can be solved without regular expressions and in O(n) rather than O(n**2), using collections.Counter:
- first, count the characters of the string, which is easy and quick
- then filter the string, testing each character's count against the counter we just created
Like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
EDIT: based on the comment exchange, if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because the first group is not greedy, our "subpattern A" can just match the first a in the string, since it does appear again later. If we use a greedy match instead, Python will backtrack until it finds the longest match for subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, up to the point where that same character is encountered again.
So, for abbcabb, the "sandwiched" portion is bbc, between the two a characters.
EDIT:
You can try something like this instead, without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the last odd occurrence of each character, not the first.
To keep the first occurrence instead, use a counter as suggested in this answer and change the condition to check for odd counts. Pseudocode: count[letter] % 2 == 1
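The counter-based variant mentioned in the pseudocode could be sketched as follows. It assumes the goal is to keep one copy of every character that occurs an odd number of times, at the position of its first occurrence; that reading of the requirement is an assumption, and the function name keep_odd_counts is illustrative.

```python
from collections import Counter


def keep_odd_counts(s):
    """Keep the first occurrence of each character with an odd total count."""
    counts = Counter(s)      # one O(n) pass to count every character
    seen = set()
    kept = []
    for ch in s:
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            kept.append(ch)
    return ''.join(kept)


print(keep_odd_counts("abbcabb"))
print(keep_odd_counts("abca"))
```

For the two strings in the question this gives the same results as the list-based loop, but the kept characters stay in first-occurrence order.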

Formatting the contents of pandas column. Removing trailing text and digits

I've used BeautifulSoup and pandas to create a csv with columns that contain error codes and corresponding error messages.
Before formatting, the columns look something like this
-132456ErrorMessage
-3254Some other Error
-45466You've now used 3 different examples. 2 more to go.
-10240 This time there was a space.
-1232113That was a long number.
I've successfully isolated the text of the codes like this:
dfDSError['text'] = dfDSError['text'].map(lambda x: x.lstrip('-0123456789'))
This returns just what I want.
But I've been struggling to come up with a solution for the codes.
I tried this:
dfDSError['codes'] = dfDSError['codes'].replace(regex=True,to_replace= r'\D',value=r'')
But that will append numbers from the error message to the end of the code number. So for the third example above instead of 45466 I would get 4546632. Also I would like to keep the leading minus sign.
I thought maybe that I could somehow combine rstrip() with a regex to find where there was a nondigit or a space next to a space and remove everything else, but I've been unsuccessful.
for_removal = re.compile(r'\d\D*')
dfDSError['codes'] = dfDSError['codes'].map(lambda x: x.rstrip(re.findall(for_removal,x)))
TypeError: rstrip arg must be None, unicode or str
Any suggestions? Thanks!
You can use extract:
dfDSError[['code','text']] = dfDSError.text.str.extract('([-0-9]+)(.*)', expand=True)
print (dfDSError)
   text                                                code
0  ErrorMessage                                        -132456
1  Some other Error                                    -3254
2  You've now used 3 different examples. 2 more t...   -45466
3  This time there was a space.                        -10240
4  That was a long number.                             -1232113
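The capture-group idea can be tried on single strings with the re module before applying it to a column. The pattern below is an assumed, slightly stricter variant of the one in the answer: -?\d+ allows a minus sign only at the front, and \s* discards the optional space after the code. The function name split_code is illustrative.

```python
import re

# Assumed pattern: optional leading minus, digits, optional whitespace, message.
CODE_AND_TEXT = re.compile(r'(-?\d+)\s*(.*)')


def split_code(raw):
    """Return (code, message); code is None when no leading number is found."""
    match = CODE_AND_TEXT.match(raw)
    if match is None:
        return None, raw
    return match.group(1), match.group(2)


rows = [
    "-132456ErrorMessage",
    "-45466You've now used 3 different examples. 2 more to go.",
    "-10240 This time there was a space.",
]
for row in rows:
    print(split_code(row))
```

Because the first group stops at the first non-digit, digits inside the message are never appended to the code. The earlier rstrip attempt failed for a different reason: re.findall returns a list, while rstrip expects a string of characters to strip.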
