Python regex: having trouble understanding results [duplicate] - python

This question already has answers here:
Removing a list of characters in string
(20 answers)
Closed 3 years ago.
I have a dataframe that I need to write to disk but pyspark doesn't allow any of these characters ,;{}()\\n\\t= to be present in the headers while writing as a parquet file.
So I wrote a simple script to detect if this is happening
import re
for each_header in all_headers:
    print(re.match(",;{}()\\n\\t= ", each_header))
But for each header, None was printed. This is wrong because I know my file has spaces in its headers.
So, I decided to check it out by executing the following couple of lines
a = re.match(",;{}()\\n\\t= ", 'a s')
print(a)
a = re.search(",;{}()\\n\\t= ", 'a s')
print(a)
This too resulted in None getting printed.
I am not sure what I am doing wrong here.
PS: I am using python3.7

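There are two problems. Your pattern is not a character class, so it asks for that exact sequence of characters (and {} and () are regex metacharacters with a special meaning there). Also, re.match only looks at the start of the string, so it cannot find a space in the middle of a header. Perhaps the easiest way to write your logic would be to use re.search with the pattern:
[ ,;{}()\n\t=]
This says to match any one of the literal characters which PySpark does not allow to be present in the headers, including the space.
a = re.search(r"[ ,;{}()\n\t=]", 'a s')
print(a)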
If you wanted to remove these characters, you could try using re.sub:
header = '...'
header = re.sub(r'[,;{}()\n\t=]+', '', header)
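As a rough sketch of the check-and-clean step, assuming the headers are available as a plain list of strings: the names FORBIDDEN, find_bad_headers and clean_header are made up for illustration, and the replacement character "_" is an arbitrary choice.

```python
import re

# Characters assumed to be rejected in the column names (space included).
FORBIDDEN = re.compile(r"[ ,;{}()\n\t=]")


def find_bad_headers(headers):
    """Return the headers that contain at least one forbidden character."""
    # search() scans the whole string; match() would only test position 0.
    return [h for h in headers if FORBIDDEN.search(h)]


def clean_header(header, replacement="_"):
    """Replace each run of forbidden characters with one replacement string."""
    return re.sub(r"[ ,;{}()\n\t=]+", replacement, header)


headers = ["id", "first name", "total(usd)", "a=b"]
print(find_bad_headers(headers))
print([clean_header(h) for h in headers])
```

Replacing instead of deleting keeps headers such as "first name" readable, but either choice may produce duplicate column names, so it is worth checking the result for collisions.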

If you want to check whether a text contains any of the "forbidden"
characters, you have to put them between [ and ].
Another point is that in "normal" strings (not r-strings) Python
processes backslash escapes itself, so any backslash meant for the regex engine should be doubled.
So change your regex to:
"[,;{}()\\n\\t= ]"
Or use r-string:
r"[,;{}()\n\t= ]"
Note that I also included a space, which was missing.
One more remark: {} and () have a special meaning only outside [...].
Between [ and ] they represent themselves, so they do not need to be
escaped with a backslash.
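A small experiment can make the escaping rules concrete. This is only a sketch: the sample strings and the helper name flags are invented for illustration. It compares a plain string, a doubled-backslash string and an r-string, and shows that braces and parentheses inside [...] are matched literally.

```python
import re

plain = "[,;{}()\n\t= ]"      # Python itself turns \n and \t into real characters
doubled = "[,;{}()\\n\\t= ]"  # the regex engine receives the two-character escapes
raw = r"[,;{}()\n\t= ]"       # identical text to the doubled version

samples = ["a s", "a{b", "a(b)", "line\nbreak", "tab\there", "clean"]


def flags(pattern):
    """Return, per sample, whether the pattern is found anywhere in it."""
    return [bool(re.search(pattern, s)) for s in samples]


for pattern in (plain, doubled, raw):
    print(flags(pattern))
```

For \n and \t all three spellings find the same matches, because the regex engine accepts both a real newline and the \n escape. The r-string is still the safer habit, since escapes such as \b mean different things to Python and to the regex engine.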

As already explained, you can use a regex to look for forbidden characters. I want to add that you can also do it without a regex, in the following way:
forbidden = ",;{}()\n\t="
def has_forbidden(txt):
    for i in forbidden:
        if i in txt:
            return True
    return False
print(has_forbidden("ok name")) # False
print(has_forbidden("wrong=name")) # True
print(has_forbidden("with\nnewline")) # True
Note that with this approach you do not have to worry about escaping special regex characters such as *.
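The same regex-free idea can be written more compactly, and extended to removal, with built-in string tools. This is a sketch only: the names FORBIDDEN_CHARS and strip_forbidden are illustrative, and the character list is copied from the answer above (so a space is not treated as forbidden here).

```python
FORBIDDEN_CHARS = ",;{}()\n\t="

# A translation table whose third argument lists the characters to delete.
_DELETE_TABLE = str.maketrans("", "", FORBIDDEN_CHARS)


def has_forbidden(txt):
    """True if any character of txt is in the forbidden list."""
    return any(ch in FORBIDDEN_CHARS for ch in txt)


def strip_forbidden(txt):
    """Return txt with every forbidden character removed."""
    return txt.translate(_DELETE_TABLE)


print(has_forbidden("ok name"))
print(has_forbidden("wrong=name"))
print(strip_forbidden("wrong=name"))
```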

Related

Looking for a way to correctly strip a string [duplicate]

This question already has answers here:
python split() vs rsplit() performance?
(5 answers)
Closed 2 years ago.
I'm using the Spotify API to get song data for a lot of songs. To this end, I need to pass the song URI into an API call. To obtain the song URIs, I'm using another API endpoint. It returns the URI in this form: 'spotify:track:5CQ30WqJwcep0pYcV4AMNc'. I only need the URI part (the text after the last colon).
So I used 'spotify:track:5CQ30WqJwcep0pYcV4AMNc'.strip("spotify:track") to strip away the first part. However, this did not work as expected, as the call also removes the trailing "c".
I tried to build a regex to strip away the first part, but instructions were too complicated and D**K is now stuck in ceiling fan :'(. Any help would be greatly appreciated.
strip() removes all the leading and trailing characters that are in the argument string; it doesn't match the string exactly.
You can use replace() to remove an exact string:
'spotify:track:5CQ30WqJwcep0pYcV4AMNc'.replace("spotify:track:", "")
or split it at : characters:
'spotify:track:5CQ30WqJwcep0pYcV4AMNc'.split(":")[-1]
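To see the difference side by side, here is a short sketch. The helper name remove_prefix is made up for illustration; on Python 3.9 and later the built-in str.removeprefix does the same job.

```python
uri = "spotify:track:5CQ30WqJwcep0pYcV4AMNc"
prefix = "spotify:track:"

# strip() treats its argument as a set of characters, so the trailing "c"
# (which appears in "track") is removed along with the leading text.
stripped = uri.strip("spotify:track")
print(stripped)


def remove_prefix(text, prefix):
    """Remove prefix only if text actually starts with it."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


print(remove_prefix(uri, prefix))
```

Unlike replace(), this only touches the start of the string, so an identical substring later in the text would be left alone.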
Use simple regex replace:
import re
txt = 'spotify:track:5CQ30WqJwcep0pYcV4AMNc'
pat_to_strip = [r'^spotify:track', r'MNc$']
pat = f'({")|(".join(pat_to_strip)})'
txt = re.sub(pat, '', txt)
# outputs:
>>> txt
:5CQ30WqJwcep0pYcV4A
Essentially the patterns starting with ^ will be stripped from the beginning, and the ones ending with $ will be stripped from the end.
I stripped last 3 letters just as an example.

Search and replace --.sub(replacement, string[, count=0])-does not replace special character \ [duplicate]

This question already has an answer here:
Search and replace --.sub(replacement, string[, count=0])-does not work with special characters
(1 answer)
Closed 6 years ago.
I have a string and I want to replace special characters with html code. The code is as follows:
s= '\nAxes.axvline\tAdd a vertical line across the axes.\nAxes.axvspan\tAdd a vertical span (rectangle) across the axes.\nSpectral\nAxes.acorr'
p = re.compile('(\\t)')
s= p.sub('<\span>', s)
p = re.compile('(\\n)')
s = p.sub('<p>', s)
This code replaces \t in the string with <\\span> rather than with <\span> as asked by the code.
I have tested the regex pattern on regex101.com and it works. I cannot understand why the code is not working.
My objective is to use the output as HTML code. The '<\span>' string is not recognized as a tag by HTML and thus it is useless. I must find a way to replace the \t in the text with <\span> and not with <\\span>. Is this impossible in Python? I posted a similar question earlier, but that question did not specifically address the problem that I raise here, nor did it make clear my objective of using the corrected text as HTML code. The answer I received did not work properly, possibly because the person responding was not aware of these facts.
No, it does work. It's just that you printed the repr of it. Were you testing this in the python shell?
In the python shell:
>>> '\\'
'\\'
>>> print('\\')
\
>>> print(repr('\\'))
'\\'
>>>
The shell outputs the returned value (if it's not None) using the repr function. To overcome
this, you can use the print function, which returns None (so nothing is echoed by the shell) and
doesn't call the repr function.
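The following sketch shows that the doubled backslash only exists in the repr, not in the string itself. The sample text and variable names are invented for illustration.

```python
tag = "<\\span>"  # "\\" in source code is a single backslash in the string
result = "a\tb".replace("\t", tag)

print(repr(result))  # what the interactive shell echoes, with the backslash doubled
print(result)        # the actual text, with one backslash
print(len(tag))      # counts the backslash once
```

Separately, an HTML closing tag is written with a forward slash (</span>), so a tag spelled with a backslash would not be recognized by a browser regardless of how Python displays it.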
Note that in this case, you don't need regex. You just do a simple replace:
s = s.replace('\n', '<p>').replace('\t', '<\span>')
And, for your regex, you should prefix your strings with r:
compiled_regex = re.compile(r'[a-z]+\s?') # for example
matchobj = compiled_regex.search('in this normal string')
othermatchobj = compiled_regex.search('in this other string')
Note that if you're not using your compiled regex more than once, you can do this in one step:
matchobj = re.search(r'[a-z]+\s?', '<- the pattern -> the string to search in')
Regex are super powerful though. Don't give up!

Why string getting from file is not equal to common string? [duplicate]

This question already has answers here:
Is there a difference between "==" and "is"?
(13 answers)
Closed 6 years ago.
I am on Python 3.5 and want to find the matching words in a file. The word I am giving is awesome, and the very first word in the .txt file is also awesome. Then why is addedWord not equal to word? Can someone give me the reason?
myWords.txt
awesome
shiny
awesome
clumsy
Code for matching
addedWord = "awesome"
with open("myWords.txt" , 'r') as openfile:
    for word in openfile:
        if addedWord is word:
            print ("Match")
I also tried as :
d = word.replace("\n", "").rstrip()
a = addedWord.replace("\n", "").rstrip()
if a is d:
    print ("Matched :" +word)
I also tried to get the class of the variables with type(addedWord) and type(word). Both are of class 'str', but they are not equal. Is anything wrong here?
There are two problems with your code.
1) Strings returned from iterating files include the trailing newline. As you suspected, you'll need to .strip(), .rstrip() or .replace() the newline away.
2) String comparison should be performed with ==, not is.
So, try this:
if addedWord == word.strip():
    print ("Match")
Those two strings will never be the same object, so you should not use is to compare them. Use ==.
Your intuition to strip off the newlines was spot-on, but you just need a single call to strip() (it will strip all whitespace including tabs and newlines).
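Putting both fixes together, a self-contained sketch might look like this. io.StringIO stands in for open("myWords.txt") so the example runs without the file, and the variable names are illustrative.

```python
import io

added_word = "awesome"
# Stand-in for the file contents shown in the question.
fake_file = io.StringIO("awesome\nshiny\nawesome\nclumsy\n")

matches = []
for line_number, line in enumerate(fake_file, start=1):
    word = line.strip()      # drop the trailing newline and other whitespace
    if word == added_word:   # compare values, not object identity
        matches.append(line_number)

print(matches)
```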

re.sub() doesn't replace middle of string [duplicate]

This question already has answers here:
How to replace only the contents within brackets using regular expressions?
(2 answers)
Closed 6 years ago.
I am trying to replace the contents of brackets in a string with nothing. The code I am using right now is like this:
tstString = "OUTPUT:TRACK[:STATE]?"
modString = re.sub("[\[\]]","",tstString)
When I print the results, I get:
OUTPUT:TRACK:STATE?
But I want the result to be:
OUTPUT:TRACK?
How can I do this?
I guess this one will work fine. The regexp now matches any string inside []. Note the ? after *: it makes * non-greedy.
import re
tstString = "OUTPUT:TRACK[:STATE]?"
modString = re.sub(r"\[.*?\]", "", tstString)
print(modString)
Your regular expression "[\[\]]" says 'any of these characters: "[", "]"'.
But you want to delete what's between the square brackets too, so you should use something like r"\[:\w+\]". It says '[, then :, then one or more word characters (letters, digits or underscore), then ]'.
And please, always use raw strings (r in front of quotes) when working with regular expressions to avoid funny things connected with Python string processing.
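A short comparison of the patterns discussed here, as a sketch. The second test string, multi, is invented to show where greedy and non-greedy matching differ.

```python
import re

text = "OUTPUT:TRACK[:STATE]?"
multi = "A[1]B[2]C"  # hypothetical input with two bracketed groups

brackets_only = re.sub(r"[\[\]]", "", text)    # removes just the bracket characters
non_greedy = re.sub(r"\[.*?\]", "", text)      # removes brackets and their contents
lazy_multi = re.sub(r"\[.*?\]", "", multi)     # each group is removed separately
greedy_multi = re.sub(r"\[.*\]", "", multi)    # one match from first [ to last ]
negated = re.sub(r"\[[^\]]*\]", "", multi)     # alternative: anything except ]

for value in (brackets_only, non_greedy, lazy_multi, greedy_multi, negated):
    print(value)
```

The negated character class gives the same result as the non-greedy version here and makes the intent ("up to the next closing bracket") explicit.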

Issues with string appending - python

I'm trying to append a string in Python, and the following code produces unexpected output:
buildVersion =request.values.get("buildVersion", None)
pathToSave = 'processedImages/%s/'%buildVersion
print pathToSave
prints out
processedImages/V41
/
I'm expecting the string to be of format: processedImages/V41/
It doesn't seem to be a new line character.
pathToSave = pathToSave.replace("\n", "")
This didn't really help.
It might not be relevant to the actual question, but in addition to Alex Martelli's answer, I would also check whether buildVersion exists in the first place, because otherwise all the solutions posted here will give you other errors:
import re
# inside the request handler function
buildVersion = request.values.get('buildVersion')
if buildVersion is not None:
    return 'processedImages/{}/'.format(re.sub(r'\W+', '', buildVersion))
else:
    return None
It might be a \r or other special whitespace character. Just clean up buildVersion of all such whitespace before executing
pathToSave = 'processedImages/%s/' % buildVersion
You can approach the clean-up task in several ways -- for example, if valid characters in buildVersion are only "word characters" (letters, digits, underscore), something like
import re
buildVersion = re.sub(r'\W+', '', buildVersion)
would usefully clean up even whitespace inside the string. It's hard to be more specific without knowing exactly what characters you need to accept in buildVersion, of course.
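One way to find out which invisible character is involved is to print the repr of the value. The sketch below assumes, purely for illustration, that the value arrives as "V41\r\n"; the real input may differ.

```python
build_version = "V41\r\n"  # hypothetical value, e.g. as submitted by a form

# repr() makes invisible characters such as \r and \n visible.
print(repr(build_version))

# Removing only "\n" leaves the carriage return behind.
only_newline_removed = build_version.replace("\n", "")
print(repr(only_newline_removed))

# strip() removes all leading and trailing whitespace, including \r and \n.
path_to_save = "processedImages/{}/".format(build_version.strip())
print(path_to_save)
```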
