Regular Expressions, Python 3

I am having trouble understanding this regular expression in Python:
re.findall(r'([a-z]+?)\w*', "Ham, spam, and, eggs")
I understand that:
[a-z] is a class that includes all the letters from a to z
+ says that it must appear at least once
? says that it can appear once or never
My output for ([a-z]+?) is:
['a', 'm', 's', 'p', 'a', 'm', 'a', 'n', 'd', 'e', 'g', 'g', 's']
Now the problems start:
if I test:
re.findall(r'([a-z]+?)\w', "Ham, spam, and, eggs")
My output is:
['a', 's', 'a', 'a', 'e', 'g'] # Why?
And if I test the full expression:
re.findall(r'([a-z]+?)\w*', "Ham, spam, and, eggs")
my output is:
['a', 's', 'a', 'e'] # Why?
Can somebody explain this to me, please?

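You misunderstand the use of +? (*): it means at least once, non-greedy, i.e. as few additional characters as needed to match. In these patterns, this is the same as a plain [a-z]: what follows (\w or \w*) can always match right after the first letter, so "at least once and as few times as possible" comes out as, simply, "once".
The other token in your pattern, \w, means any "word character"; for ASCII text it is equivalent to [A-Za-z0-9_] (in Python 3, \w in a str pattern also matches Unicode letters and digits unless re.ASCII is set).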
Your first attempt, ([a-z]+?)\w, captures any single, lower-case letter that is followed by any other word character - hence ['a', 's', 'a', 'a', 'e', 'g']:
"Ham, spam, and, eggs"
# ^.  ^.^.  ^.   ^.^.
(Note: ^ is the captured character, . is the non-captured match.)
Your second attempt, ([a-z]+?)\w*, captures any single lower-case letter followed by as many other word characters as possible, and hence captures only once per word (the first lower-case letter):
"Ham, spam, and, eggs"
# ^.  ^...  ^..  ^...
In both cases, as you have specified a capture group, findall only returns the characters within that group. If you remove the capture group parentheses, findall returns the whole match:
>>> re.findall(r'[a-z]+?\w*', "Ham, spam, and, eggs")
['am', 'spam', 'and', 'eggs']
* You have confused it with ? on its own, which does mean "zero or one times".
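To see the quantifiers side by side, here is a small sketch on the same sample string. The variable names are made up for illustration, and finditer is swapped in only to show the whole match next to the captured group.

```python
import re

text = "Ham, spam, and, eggs"

# +? (lazy): stops after the first letter, so each letter is its own match.
lazy = re.findall(r'[a-z]+?', text)

# + (greedy): takes as many lower-case letters as it can.
greedy = re.findall(r'[a-z]+', text)

# ? on its own: zero or one letter, so empty matches appear as well.
optional = re.findall(r'[a-z]?', text)

# finditer exposes both the whole match (group 0) and the captured group (group 1).
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'([a-z]+?)\w*', text)]

print(lazy)
print(greedy)    # ['am', 'spam', 'and', 'eggs']
print(optional)
print(pairs)     # [('am', 'a'), ('spam', 's'), ('and', 'a'), ('eggs', 'e')]
```

The last line makes the findall behaviour visible: with a capture group in the pattern, findall reports only the second element of each pair.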

I am going to take a stab, although I am not sure if I am correct.
Maybe the "?" makes the 1-or-more matching (the + sign) non-greedy?
So you do not match the "H" in Ham because it is upper case.
Next we look at "a". Since it's followed by a word character (\w), the match captures the letter "a" there and carries on to the ",", where we start over.
Next it matches the "s" in spam; with the following word characters consumed, it captures the "s" and moves on to the next ",", and so on.

Related

Converting to lower-case: every letter gets tokenized

I have a text document that I want to convert to lower case, but when I do it in the following way every letter of my document gets tokenized. Why does it happen?
with open('assign_1.txt') as g:
    assign_1 = g.read()
assign_new = [word.lower() for word in assign_1]
What I get:
assign_new
['b',
'a',
'n',
'g',
'l',
'a',
'd',
'e',
's',
'h',]
You iterated through the entire input one character at a time, lower-cased each character, and collected the results in a list. It's simpler than that: call lower() on the whole string as you read it, inside the with block:
assign_lower = g.read().lower()
Naming the loop variable "word" doesn't make you iterate over words; assign_1 is still a sequence of characters.
If you want to break this into words, use the split method ... which is independent of the lower-case operation.
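As a minimal sketch of the difference, assume a short sample string standing in for the file contents (the text and variable names here are invented for illustration):

```python
text = "Bangladesh Is Green"  # assumed stand-in for the contents of assign_1.txt

# Iterating over a str yields single characters, whatever the loop variable is called.
per_char = [ch.lower() for ch in text]

# Lower-casing the whole string at once keeps it as one string.
lowered = text.lower()

# Splitting on whitespace is a separate step that produces the words.
words = lowered.split()

print(per_char[:3])  # ['b', 'a', 'n']
print(lowered)       # bangladesh is green
print(words)         # ['bangladesh', 'is', 'green']
```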

Regular expression issue in python

Why is the unbounded quantifier (*) in a regular expression treated differently in Python? Please tell me why I'm getting different output in case one than in case two.
CASE ONE:
import re
b = None
a = None
while a != 'chk':
    a = input()
    b = re.findall('[A-Z][a-z]{1,400}', a)
    if b != None:
        print(b, bool(b), type(b))
    if a == 'chk':
        break
output:
CAPITALLETTERSsmallletters
['Ssmallletters'] True <class 'list'>
CASE TWO:
import re
b = None
a = None
while a != 'chk':
    a = input()
    b = re.findall('[A-Z][a-z]*', a)
    if b != None:
        print(b, bool(b), type(b))
    if a == 'chk':
        break
output:
CAPITALLETTERSsmallletters
['C', 'A', 'P', 'I', 'T', 'A', 'L', 'L', 'E', 'T', 'T', 'E', 'R', 'Ssmallletters'] True <class 'list'>
CASE ONE:
The regular expression says:
Look for things that have an uppercase letter followed by 1 to 400 lowercase letters
This gives one hit, the one it prints.
CASE TWO:
The regular expression says:
Look for things that have one uppercase letter followed by zero or more lowercase letters
In this case each capital letter on its own is one hit, plus the same hit you had before.
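The two cases can be compared without the input loop. This is only a sketch on the sample input from the question; the third pattern, with +, is added here as an illustration of "one or more" and is not part of the original code.

```python
import re

text = "CAPITALLETTERSsmallletters"

# {1,400}: at least one lowercase letter must follow the capital.
at_least_one = re.findall(r'[A-Z][a-z]{1,400}', text)

# *: zero lowercase letters is acceptable, so every lone capital matches too.
zero_or_more = re.findall(r'[A-Z][a-z]*', text)

# +: one or more, with no upper bound; behaves like case one on this input.
one_or_more = re.findall(r'[A-Z][a-z]+', text)

print(at_least_one)   # ['Ssmallletters']
print(zero_or_more)
print(one_or_more)    # ['Ssmallletters']
```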

How to take out symbols from string with regex

I am trying to extract the symbols that are useful to me from strings, using regex and Python 3.4.
For example, I need to extract any lowercase letter + any uppercase letter + any digit. The order is not important.
'adkkeEdkj$4' --> 'aE4'
'4jdkg5UU' --> 'jU4'
Or, maybe, a list of the symbols, e.g.:
'adkkeEdkj$4' --> ['a', 'E', 4]
'4jdkg5UU' --> ['j', 'U', 4]
I know that it's possible to match them using:
r'(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])'
Is it possible to get them using regex?
You can get those values by using capturing groups in the look-aheads you have:
import re
p = re.compile('^(?=[^a-z]*([a-z]))(?=[^A-Z]*([A-Z]))(?=[^0-9]*([0-9]))', re.MULTILINE)
test_str = "adkkeEdkj$4\n4jdkg5UU"
print(re.findall(p, test_str))
The output:
[('a', 'E', '4'), ('j', 'U', '4')]
Note I have edited the look-aheads to use negated character classes for better performance, and the ^ anchor is important here, too: without it, findall would retry the look-aheads at every position in the string.
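For a single string rather than several lines, the same look-aheads can be wrapped in a small helper. This is a sketch only: first_of_each is a hypothetical name, and returning None when a class is missing is an assumption about the desired behaviour.

```python
import re

# Same three look-aheads as above; match() already anchors at the start,
# so re.MULTILINE is not needed for a single string.
PATTERN = re.compile(r'^(?=[^a-z]*([a-z]))(?=[^A-Z]*([A-Z]))(?=[^0-9]*([0-9]))')

def first_of_each(s):
    """Return the first lowercase letter, uppercase letter and digit, joined."""
    m = PATTERN.match(s)
    # groups() holds the three captured characters in pattern order.
    return ''.join(m.groups()) if m else None

print(first_of_each('adkkeEdkj$4'))  # aE4
print(first_of_each('4jdkg5UU'))     # jU4
print(first_of_each('abc'))          # None
```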

Comparing and printing elements in nested loops

The program identifies whether each element of the string word is a consonant by looping through the word string and, for each iteration, looping through the consonants list and checking whether the current element of word equals the current element of consonants.
If yes, the current element of the word string is a consonant and the consonant gets printed (not the index of the consonant, but the actual consonant, e.g. "d").
The problem is, I get this instead:
1
1
What am I doing wrong? Shouldn't the nested loops work so that the inner loop iterates over every element for each element of the outer loop? That is, each index in the outer loop makes the inner loop iterate through each index?
That's the program:
word = "Hello"
consonants = ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z']
for character in range(len(word)):
    for char in range(len(consonants)):
        if consonants[char] == word[character]:
            consonant = word[character]
            print(consonant)
You are misreading the output. The character is the letter L lowercase, not the number 1.
In other words, your code is working as designed. The capital letter H is not in your consonants list, but the two lowercase letters l in Hello are.
Note that it'd be much more efficient to use a set for consonants here; you'd not have to loop over that whole list and just use in to test for membership. That works with lists too, but is much more efficient with a set. If you lowercase the word value you'd also be able to match the H.
Last but not least, you can loop over the word string directly rather than looping over range(len(word)) and then indexing with the generated value:
word = "Hello"
consonants = set('bcdfghjklmnpqrstvwxz')
for character in word.lower():
    if character in consonants:
        print(character)
Demo:
>>> word = "Hello"
>>> consonants = set('bcdfghjklmnpqrstvwxz')
>>> for character in word.lower():
...     if character in consonants:
...         print(character)
...
h
l
l

How is this weird behaviour of escaping special characters explained?

Just for fun, I wrote this simple function to reverse a string in Python:
def reverseString(s):
    ret = ""
    for c in s:
        ret = c + ret
    return ret
Now, if I pass in the following two strings, I get interesting results.
print(reverseString("Pla\net"))
print(reverseString("Plan\et"))
The output of this is
te
alP
te\nalP
My question is: Why does the special character \n get translated into a new line when passed into the function, but not when the function pieces it together by reversing n\? Also, how could I stop \n from being interpreted so that the function returns n\ instead?
You should take a look at the individual character sequences to see what happens:
>>> list("Pla\net")
['P', 'l', 'a', '\n', 'e', 't']
>>> list("Plan\et")
['P', 'l', 'a', 'n', '\\', 'e', 't']
So as you can see, \n is a single character, while \e is two characters because it is not a valid escape sequence.
To prevent this from happening, escape the backslash itself, or use raw strings:
>>> list("Pla\\net")
['P', 'l', 'a', '\\', 'n', 'e', 't']
>>> list(r"Pla\net")
['P', 'l', 'a', '\\', 'n', 'e', 't']
The reason is that '\n' is a single character in the string. I'm guessing \e isn't a valid escape, so it's treated as two characters.
Look into raw strings for what you want, or just use '\\' wherever you actually want a literal '\'.
The translation is a function of Python's syntax, so it only occurs when Python parses source code. It doesn't occur at other times.
In the case of your programme, you have one string which, by the time it is constructed as an object, contains the single character denoted by '\n', and another string which, when constructed, contains the sub-string '\e'. After you reverse them, Python doesn't reparse them.
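A short sketch of both points, assuming a slicing-based reverse_string helper in place of the loop from the question (the helper and variable names are illustrative). The backslash in the last example is written as \\ so that the literal does not depend on how \e is handled.

```python
def reverse_string(s):
    # Slicing with a step of -1 reverses the string.
    return s[::-1]

escaped = "Pla\net"   # \n in the literal becomes one newline character
raw = r"Pla\net"      # raw literal keeps the backslash and the n as two characters

print(len(escaped), len(raw))      # 6 7
print(reverse_string(raw))         # ten\alP

# A backslash followed by n inside an existing string object is never
# re-parsed into a newline, even after reversing puts them in that order.
reversed_text = reverse_string("Plan\\et")
print(reversed_text)               # te\nalP
print("\n" in reversed_text)       # False
```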
