Pandas don't recognize "||" as string to split

I'm trying to split a DataFrame column in two and keep the left part, but pandas doesn't recognize that string and gives me empty output.
import pandas as pd
q = ['Sar || var', 'lol ||']
y = pd.DataFrame(q)
split_data = y[0].str.split("||", n=1, expand=False).str[0]
print(split_data)
out
0
1
Name: 0, dtype: object

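The documentation is somewhat deceptive for this method. What is happening is that for separators longer than 1 character, pandas interprets the separator as a regular expression. As a regex, "||" is an alternation of empty patterns, so it matches the empty string at the very start of each value and the left part of the split comes back empty.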
You can use "||" as a literal, non-regex separator by escaping the character "|" (which has special meaning in regular expressions) using a backslash:
series.str.split("\\|\\|")
Note that python provides a "raw" syntax for string literals that can be useful for writing regular expressions, removing the need to escape the backslashes themselves:
series.str.split(r"\|\|")
You can consult the documentation for the re module for a list of special characters that will need to be escaped when using multi-character separators. Alternatively, just use the function re.escape:
import re
series.str.split(re.escape("||"))
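As a quick check of the escaping itself, here is a minimal sketch that uses only the standard re module rather than pandas. It assumes, as the answer above says, that a multi-character separator ends up being treated as a regular expression; the variable names are illustrative.

```python
import re

text = "Sar || var"

# Unescaped, "||" is an alternation of empty patterns, so it never
# matches the two literal pipe characters we meant.
literal = re.escape("||")          # builds the escaped pattern for us
print(literal)                     # \|\|

# Splitting on the escaped pattern cuts at the literal "||".
left, right = re.split(literal, text, maxsplit=1)
print(repr(left), repr(right))     # 'Sar ' ' var'
```

On pandas 1.4 or later, Series.str.split also accepts regex=False, which treats the separator as a literal string and avoids the escaping altogether.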

Related

regular expression problems in re.findall python [duplicate]

I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?

The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is kept as a literal backslash, but I don't recommend depending on this: since Python 3.6 such unrecognized escapes trigger a DeprecationWarning (a SyntaxWarning from 3.12). Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
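A small sketch of the difference between printing a string and printing a container that holds it, using the same s as above:

```python
s = 'a\\b\tc'          # five characters: a, backslash, b, tab, c

print(len(s))          # 5
print(s)               # shows one backslash and a real tab
print([s])             # ['a\\b\tc']  (the list shows the repr, escapes included)
print(repr(s))         # 'a\\b\tc'
```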
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
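To make the two levels concrete, the following sketch matches a single literal backslash, once with a doubled-up normal literal and once with a raw string. The names subject, normal and raw are only illustrative.

```python
import re

subject = 'a\\b'                 # three characters: a, backslash, b

# Four backslashes in a normal literal become two in the string,
# which the regex engine reads as one literal backslash.
normal = '\\\\'
raw = r'\\'                      # same pattern written as a raw string
print(normal == raw)             # True

match = re.search(normal, subject)
print(match.group())             # \
```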

An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t, n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise the backslash is left in place as an ordinary '\' character.
Consider the following.
>>> s = '\t'
>>> print("[" + s + "]")
[       ]    # an actual tab character after preprocessing
>>> s = '\d'
>>> print("[" + s + "]")
[\d]    # '\d' after preprocessing
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with another '\'. Now when Python sees '\\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print("[" + s + "]")
[\t]    # '\t' after preprocessing
After the Python interpreter takes a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t', 'a\\t')  # match is None
Here, match is None. Why? Let's look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for an 'a' followed by a tab character in the string 'a\t' (an 'a', a backslash and a 't'), which is not a match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t', 'a\\t')  # match contains 'a\t'
Again, let's look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t', 'a\\t')  # match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
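The three calls discussed above can be run side by side. This is only a sketch that restates them with Python comments; the result names are illustrative.

```python
import re

subject = 'a\\t'   # a, backslash, t

no_match = re.search('a\\t', subject)        # regex a\t: "a" then a TAB
escaped = re.search('a\\\\t', subject)       # regex a\\t: "a", backslash, "t"
raw = re.search(r'a\\t', subject)            # same regex via a raw string

print(no_match)            # None
print(escaped.group())     # a\t
print(raw.group())         # a\t
```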

Python's own string parsing (partially) gets in your way.
If you want to see what re sees, type
print('\d')
print('\\d')
print('\\\d')
on the Python command prompt. You will see that \d and \\d both result in \d (the latter one being taken care of by the Python string parser), while \\\d results in \\d.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.
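Applied to the question's own case, a sketch with raw strings throughout (so that no literal relies on an unrecognized escape) might look like this:

```python
import re

subject = r'\d'            # two characters: backslash, d

print(re.search(r'\d', subject))           # None: regex \d wants a digit
print(re.search(r'\\d', subject).group())  # \d
print(r'\\d' == '\\\\d')                   # True
```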

How to write a regular expression in Python that accepts alphabets, numbers and a few selected special characters(,.-|;!_?)?

A Regular Expression in Python that accepts letters,numbers and only these special characters (,.-|;!_?).
I have tried solving the problem through the following regular expressions but it didn't work:
'([a-zA-Z0-9,.-|;!_?])$'
'([a-zA-Z0-9][.-|;!_?])$'
Can someone please help me write the regular expression.

I think the following should work (tested on RegExr against Foo123,.-|;!_?):
^[\w,.\-|;!_?]*$
In your regular expressions, you forgot to escape the '-' character. Unescaped between two characters inside brackets, it defines a range, so .-| matches every character from . to | rather than just those three characters.
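A short sketch of what the unescaped hyphen lets through. The sample string is made up for illustration; "<" and "@" were chosen because they fall between "." and "|" in ASCII.

```python
import re

unescaped = r'^[a-zA-Z0-9,.-|;!_?]*$'   # ".-|" is a range from "." to "|"
escaped = r'^[a-zA-Z0-9,.\-|;!_?]*$'    # "\-" is a literal hyphen

sample = 'a<b@c'

print(bool(re.match(unescaped, sample)))            # True
print(bool(re.match(escaped, sample)))              # False
print(bool(re.match(escaped, 'Foo123,.-|;!_?')))    # True
```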

Use this for only one character:
'[a-zA-Z0-9,.\-|;!_?]' or '[\w,.\-|;!_?]'
Use this for all characters:
'[a-zA-Z0-9,.\-|;!_?]*' or '[\w,.\-|;!_?]*'
Use this for an equal check:
'^[a-zA-Z0-9,.\-|;!_?]*$' or '^[\w,.\-|;!_?]*$'

Try this (you should escape - like this \-):
^[a-zA-Z0-9,.\-|;!_?]+$
Use + to prevent matching empty strings; to allow them, use * instead.
Examples:
>>> import re
>>>
>>> re.match('^[a-zA-Z0-9,.\-|;!_?]+$', '12.0')
<_sre.SRE_Match object at 0x00000000027EB850>
>>> re.match('^[a-zA-Z0-9,.\-|;!_?]+$', '')
>>>
>>> re.match('^[a-zA-Z0-9,.\-|;!_?]+$', 'test!?')
<_sre.SRE_Match object at 0x00000000027EB7E8>

You could use \w (bonus: unicode and locale support!):
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
See Python's documentation. Also, you might want to use a raw string when specifying your regular expression pattern:
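m = re.match(r'[\w,.\-|;!?]+', your_string)
Notice the use of + (repeat once or more) and the escaped hyphen, which would otherwise form a range from . to |. You also used $ to match the end of the string but I did not include it in mine, so this accepts any string that merely starts with an allowed character. YMMV.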
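To see what leaving out the $ changes, here is a sketch comparing re.match with re.fullmatch, with the hyphen escaped as the other answers suggest. The test strings are invented for illustration.

```python
import re

pattern = r'[\w,.\-|;!?]+'

# re.match only anchors at the start, so a valid prefix is enough.
print(re.match(pattern, 'ok#bad').group())    # ok

# re.fullmatch requires the whole string to consist of allowed characters.
print(re.fullmatch(pattern, 'ok#bad'))        # None
print(bool(re.fullmatch(pattern, 'test!?')))  # True
```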

basic regex operations in Python

I'm following a tutorial about regular expression. I'm getting an error when doing the following:
regex = r'(+|-)?\d*\.?\d*'
Apparently, Python doesn't like (+|-). What could be the problem?
Also, what could be the issue with not adding r ahead of the regex?

You need to escape + in regular expressions to get a literal +, because it usually means "one or more instances of something":
regex = r'(\+|-)?\d*\.?\d*'
And r makes it a "raw" string. Without the r, the regular expression escape sequences will be interpreted as string escape sequences, and they'll cause all sorts of problems. (\b being a backspace instead of a word boundary, and that kind of thing.)

+ is a special character. You can use square brackets to specify a set of characters, which is better than using an "or" with the pipe character in this case:
regex = r'([+-])?\d*\.?\d*'
Otherwise, you just need to escape it in your original version:
regex = r'(\+|-)?\d*\.?\d*'
Using the r is the preferred way of specifying a regex string in python because it indicates a raw string, which should not be interpreted and reduces the amount of escaping you must perform with backslashes. It is just a python regex idiom you will see everywhere.
r'(\+|-)?\d*\.?\d*'
#'(\\+|-)?\\d*\\.?\\d*'
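Both points can be checked directly. This sketch assumes nothing beyond the standard re module; the name number is illustrative.

```python
import re

try:
    re.compile(r'(+|-)?\d*\.?\d*')      # bare "+" has nothing to repeat
except re.error as exc:
    print('invalid pattern:', exc)

number = re.compile(r'[+-]?\d*\.?\d*')
print(number.fullmatch('-3.14').group())   # -3.14
print(number.fullmatch('+42').group())     # +42

# Without the r prefix, "\b" is a backspace character, not a word boundary.
print('\b' == '\x08')                      # True
print(len(r'\b'))                          # 2
```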

Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)
For some reason, adding the Verbose flag, X, to compile breaks the pattern.
Also, I wanted to use unicode for whitespace_pattern recognition (supposedly, we'll get patterns that use non-unicode spaces and we need to explicitly check for that one unicode character as a valid space), but the pattern keeps breaking.

VERBOSE gives you the ability to write comments in your regex to document it.
In order to do so, it ignores spaces, since you need to use line breaks to write comments.
Replace all spaces in your regex by \s to specify they are spaces you want to match in your pattern, and not just some spaces to format your comments.
What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret special notations such as \n and let you use backslashes without escaping them.

Always define regexes with the r prefix to indicate they are raw strings.
r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}

When creating a regex to match unicode characters you do not want to use a Python unicode string. In your example, the regular expression needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020". (Note that the re module only understands \uXXXX escapes itself from Python 3.3 onward.)
As other answers have mentioned, you should also use the r prefix for time_sig_pattern. After those two changes your code should work fine.
For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ], as described in the re.VERBOSE documentation.
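The effect of VERBOSE on the literal space can be reproduced with a cut-down version of the pattern. This is a sketch for Python 3; the sample line and the names pattern and verbose_safe are assumptions made for illustration.

```python
import re

ws = r"\s"
pattern = (r"^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*/"
           r"%(ws)s*(?P<bottom>\d+)%(ws)s*$") % {"ws": ws}
line = "time signature: 3 / 4"

print(bool(re.match(pattern, line)))          # True
# Under VERBOSE the literal space in "time signature" is ignored.
print(bool(re.match(pattern, line, re.X)))    # False

# Escaping the space makes it significant again.
verbose_safe = pattern.replace("time signature", r"time\ signature")
m = re.match(verbose_safe, line, re.X)
print(m.group("top"), m.group("bottom"))      # 3 4

# On Python 3.3+ the regex engine itself expands \u0020 to a space.
print(bool(re.fullmatch(r"\u0020", " ", re.X)))   # True
```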

Need regular expression expert: round bracket within string literal

I'm searching for strings within strings using Regex. The pattern is a string literal that ends in (, e.g.
# pattern
" before the bracket ("
# string
this text is before the bracket (and this text is inside) and this text is after the bracket
I know the pattern will work if I escape the character with a backslash, i.e.:
# pattern
" before the bracket \\("
But the pattern strings are coming from another search and I can not control what characters will be or where. Is there a way of escaping an entire string literal so that anything between markers is treated as a string? For example:
# pattern
\" before the ("
The only other option I have is to do a substitute adding escapes for every protected character.
re.escape is exactly what I need, but I'm using regexp in Access VBA, which doesn't have that method. I only have the replace, execute and test methods.
Is there a way to escape everything within a string in VBA?
Thanks

You didn't specify the language, but it looks like Python, so if you have a string in Python whose special regex characters you need to escape, use re.escape():
>>> import re
>>> re.escape("Wow. This (really) is *cool*")
'Wow\\.\\ This\\ \\(really\\)\\ is\\ \\*cool\\*'
Note that spaces are escaped, too (probably to ensure that they still work in a re.VERBOSE regex).
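Applied to the pattern from the question, a sketch could look like this (needle and text are just illustrative names for the question's pattern and string):

```python
import re

needle = " before the bracket ("
text = ("this text is before the bracket (and this text is inside) "
        "and this text is after the bracket")

try:
    re.compile(needle)                 # unbalanced "(" is rejected
except re.error as exc:
    print("raw pattern fails:", exc)

match = re.search(re.escape(needle), text)
print(repr(match.group()))             # ' before the bracket ('
```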

Maybe write your own VBA escape function:
Function EscapeRegEx(text As String) As String
Dim regEx As RegExp
Set regEx = New RegExp
regEx.Global = True
regEx.Pattern = "(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})"
EscapeRegEx = regEx.Replace(text, "\$1")
End Function

I'm pretty sure that with the limitations of the RegExp abilities in VBA/VBScript, you are going to have to replace the special characters in your pattern before using it. There doesn't seem to be anything built into it like there is in Python.
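For readers more at home in Python, the same replace-based idea as the VBA function above can be sketched as follows. The helper escape_regex is hypothetical and only mirrors the character set of the VBA pattern; in Python itself, re.escape is the better choice.

```python
import re

# Same character set as the VBA pattern: [ \ ^ $ . | ? * + ( ) { }
_SPECIAL = re.compile(r'([\[\\^$.|?*+(){}])')

def escape_regex(text):
    """Prefix each special character with a backslash."""
    return _SPECIAL.sub(r'\\\1', text)

print(escape_regex("a.b(c)"))   # a\.b\(c\)
```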

The following regex will capture everything from the beginning of the string to the first (. The first captured group $1 will contain the portion before (.
^([^(]+)\(
Depending on your language, you might have to escape it as:
"^([^(]+)\\("
