I'm trying to learn how to use regular expressions but have a question. Let's say I have the string
line = 'Cow Apple think Woof'
I want to see if line has at least two words that begin with capital letters (which, of course, it does). In Python, I tried to do the following
import re
test = re.search(r'(\b[A-Z]([a-z])*\b){2,}',line)
print(bool(test))
but that prints False. If I instead do
test = re.search(r'(\b[A-Z]([a-z])*\b)',line)
I find that print(test.group(1)) is Cow but print(test.group(2)) is w, the last letter of the first match (there are no other elements in test.group).
Any suggestions on pinpointing this issue and/or how to approach the problem better in general?
The last letter of the match ends up in group 2 because of the inner parentheses: a repeated capturing group keeps only its last repetition. Just drop those and you'll be fine.
>>> t = re.findall('([A-Z][a-z]+)', line)
>>> t
['Cow', 'Apple', 'Woof']
>>> t = re.findall('([A-Z]([a-z])+)', line)
>>> t
[('Cow', 'w'), ('Apple', 'e'), ('Woof', 'f')]
The count of capitalised words is, of course, len(t).
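As for why the original {2,} pattern prints False: a quantifier repeats the group back to back, so the second capitalised word would have to start immediately after Cow, and the space in between breaks the repetition. Here is a minimal sketch of that behaviour. The relaxed variant (a non-capturing group with \s* inside it) is my own assumption for illustration, not something from the question, and it only finds capitalised words that sit next to each other.

```python
import re

line = 'Cow Apple think Woof'

# The original pattern: the second repetition must begin right after "Cow",
# but the next character is a space, so nothing matches.
strict = re.search(r'(\b[A-Z]([a-z])*\b){2,}', line)

# Assumed variant: optional whitespace inside the repeated group lets
# adjacent capitalised words count as repetitions.
relaxed = re.search(r'(?:\b[A-Z][a-z]*\b\s*){2,}', line)

print(strict)                 # no match object
print(repr(relaxed.group()))  # the run of adjacent capitalised words
```

For words that are not adjacent (like Apple and Woof here), counting the findall results as above is the simpler route.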
I use the findall function to find all instances that match the regex, then use len to see how many matches there are; in this case there are 3. You can check whether the length is at least 2 to get True or False.
import re
line = 'Cow Apple think Woof'
test = re.findall(r'(\b[A-Z]([a-z])*\b)',line)
print(len(test) >= 2)
If you want to use only regex, you can search for a capitalized word then some characters in between and another capitalized word.
test = re.search(r'(\b[A-Z][a-z]*\b)(.*)(\b[A-Z][a-z]*\b)',line)
print(bool(test))
(\b[A-Z][a-z]*\b) - finds a capitalized word
(.*) - matches 0 or more characters
(\b[A-Z][a-z]*\b) - finds the second capitalized word
This method isn't as flexible, since the pattern has to be rewritten if you want to match 3 capitalized words.
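If you do want a regex-only check for an arbitrary number of words, one option is to repeat a "capitalized word, then anything" unit n times. This is only a sketch: has_n_capitalized is a hypothetical helper name, and the lazy .*? filler is my assumption about what may sit between the words.

```python
import re

def has_n_capitalized(line, n):
    """Return True if line contains at least n capitalized words."""
    # One capitalized word, then lazily skip anything up to the next one;
    # the whole unit is repeated n times.
    pattern = r'(?:\b[A-Z][a-z]*\b.*?){%d}' % n
    return re.search(pattern, line) is not None

line = 'Cow Apple think Woof'
print(has_n_capitalized(line, 2))
print(has_n_capitalized(line, 3))
print(has_n_capitalized(line, 4))
```

Note that . does not match a newline by default, so this only looks within a single line.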
import re
sent = "His email is abc#some.com, however his wife uses xyz#gmail.com"
x = re.findall(r'[A-Za-z]+#[A-Za-z\.]+', sent)
print(x)
If there is a period at the end of an email ID (abc#some.com.), it will be returned at the end of the email address. However, this can be dealt with separately.
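One possible way to deal with the trailing period is to require at least one letter after every dot, so a sentence-ending period is never part of the match. This is a sketch that keeps the # separator from the example above; the sample sentence with a final period is made up.

```python
import re

sent = "His email is abc#some.com, however his wife uses xyz#gmail.com."

# Each dot must be followed by at least one letter, so the period that
# ends the sentence is left out of the match.
addresses = re.findall(r'[A-Za-z]+#[A-Za-z]+(?:\.[A-Za-z]+)+', sent)
print(addresses)
```

The group is non-capturing, so findall returns the whole address rather than just the last dotted part.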
I am making a Python script that finds the word "Hold" in a list string and confirms if it is holdable or not.
File = [
"Hold_Small_Far_BG1_123456789.jpg",
"Firm_Large_Near_BG1_123456789.jpg",
"Move_Large_Far_BG1_123456789.jpg",
"Firm_Large_Far_BG1_123456789.jpg",
"Hold_Small_Hold_BG1_123456789.jpg",
"Hold_Small_Near_BG1_123456789.jpg",
"Small_Small_Far_BG1_123456789.jpg",
]
for item in File:
    if "Hold" in item:
        print('Yes, object is holdable.')
    else:
        print('No, object is not holdable.')
The holdable objects are the ones that have 'Hold' as the third word. The problem is that the code above sees the first 'Hold' word and returns true. I want the code to check whether the word 'Hold' appears in the filename while ignoring 'Hold' as the first word.
Please note that I cannot split the string using the '_' because it is generated by people. So, sometimes it can be a comma, dot, or space even.
Is there an expression for this? Sorry for the bad English.
Thank you. :)
You can use a regex pattern:
import re
holdables = ['yes' if re.findall(r'\w.*(Hold)', x) else 'no' for x in File]
for x in holdables:
    print(x)
The regex here only assumes that 'Hold' is not the first word in the string but does exist elsewhere, since you said you can't be sure whether underscores or other delimiters will be present. If you need more stringent conditions for the regex pattern, you can always update it.
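For example, a more stringent version could split on the delimiters mentioned in the question and look only at the third word. This is a sketch, not a definitive solution: is_holdable is a hypothetical name, the delimiter set (underscore, comma, dot, whitespace) is an assumption taken from the question, and the third sample filename is made up.

```python
import re

def is_holdable(filename):
    # Assumed delimiters: underscore, comma, dot, or whitespace.
    tokens = re.split(r'[_,.\s]+', filename)
    # Holdable only when the third word is exactly 'Hold'.
    return len(tokens) >= 3 and tokens[2] == 'Hold'

names = [
    "Hold_Small_Far_BG1_123456789.jpg",
    "Hold_Small_Hold_BG1_123456789.jpg",
    "Hold Small,Hold.BG1_123456789.jpg",
]
for name in names:
    print(name, is_holdable(name))
```

If other delimiters turn up in practice, they can be added to the character class.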
If I have understood the question correctly, we want to find if the filename contains "Hold" anywhere ignoring the first occurrence. Without definite separators, however, it is difficult. Here are two approaches that I think could work:
Using regex like:
import re
for fname in File:
    if re.match("(^.+Hold.*$)", fname):
        pass  # code to run if "Hold" is found
Assumptions: This answer relies on the assumption that "Hold", if it occurs at all, can only occur in the first and third positions. We ignore the first one and search for the "Hold" in the third position.
>>> re.match("(^.+Hold.*$)", "Hold_Small_Hold_BG1_123456789.jpg")
<re.Match object; span=(0, 33), match='Hold_Small_Hold_BG1_123456789.jpg'>
>>> re.match("(^.+Hold.*$)", "Hold_Small_Far_BG1_123456789.jpg")
>>>
Use split()
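We can split the string on "Hold". The result has one more element than the number of occurrences, so when "Hold" is present in both the first and third positions we get a list with 3 elements, and when it is present only in the first position we get 2.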
for fname in File:
    if len(fname.split("Hold")) == 3:
        pass  # code to run if "Hold" is found
Again, the assumption is that "Hold", if it occurs at all, can only occur in the first and third positions.
>>> "Hold_Small_Hold_BG1_123456789.jpg".split("Hold")
['', '_Small_', '_BG1_123456789.jpg'] #list with 3 elements
>>> "Hold_Small_Far_BG1_123456789.jpg".split("Hold")
['', '_Small_Far_BG1_123456789.jpg'] #list with 2 elements
i = 0
while i < len(File):
    s = File[i]
    ct = s.count('Hold')
    if ct > 1:
        print('Yes, object is holdable.')
    else:
        print('No, object is not holdable.')
    i += 1
Edited: it now reports an object as holdable only if 'Hold' appears more than once.
I'm trying to solve a problem where I am given a set of strings and have to count how many times a certain word, like 'code', appears within a string. The program should also count any variant where the 'd' changes, like 'coze', but something like 'coz' doesn't count. This is what I made:
def count(word):
    count=0
    for i in range(len(word)):
        lo=word[i:i+4]
        if lo=='co': # this is what gives me trouble
            count+=1
    return count
Test if the first two characters match co and the 4th character matches e.
def count(word):
    count = 0
    for i in range(len(word) - 3):
        # the first two characters are 'co' and the fourth is 'e'
        if word[i:i+2] == 'co' and word[i+3] == 'e':
            count += 1
    return count
The loop only goes up to len(word)-3 so that word[i+3] won't go out of range.
You could use regex for this, through the re module.
import re
string = 'this is a string containing the words code, coze, and coz'
re.findall(r'co.e', string)
['code', 'coze']
From there you could write a function such as:
def count(string, word):
    return len(re.findall(word, string))
Regex is the answer to your question, as mentioned above, but what you need is a more refined pattern. Since you are looking for occurrences of a whole word, you need to match at word boundaries. So your pattern should be something like this:
pattern = r'\bco.e\b'
This way your search will not match inside words like testcodetest or cozetest; it will only match code, coze, coke and so on, with no leading or trailing characters.
If you are going to run the search many times, it's better to compile the pattern once and reuse it, which saves repeated work on every call.
In [1]: import re
In [2]: string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
In [3]: pattern = re.compile(r'\bco.e\b')
In [4]: pattern.findall(string)
Out[4]: ['code', 'coze']
Hope that helps.
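Putting the pieces together, the boundary pattern can be built from the word itself instead of being typed by hand. This is a rough sketch: count_variants and wildcard_index are hypothetical names, and using \w for the changeable letter is an assumption (it also accepts digits and underscores, whereas . would accept any character, including a space).

```python
import re

def count_variants(text, word, wildcard_index):
    """Count whole-word occurrences of word where one letter may vary."""
    head = re.escape(word[:wildcard_index])
    tail = re.escape(word[wildcard_index + 1:])
    # \w stands in for the letter that is allowed to change.
    pattern = re.compile(r'\b' + head + r'\w' + tail + r'\b')
    return len(pattern.findall(text))

text = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
print(count_variants(text, 'code', 2))
```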
I have a sentence that I want to parse to check for some conditions:
a) If there is a period and it is followed by a whitespace followed by a lowercase letter
b) If there is a period internal to a sequence of letters with no adjacent whitespace (e.g. www.abc.com)
c) If there is a period followed by a whitespace followed by an uppercase letter and preceded by one of a short list of titles (e.g. Mr., Dr., Mrs.)
Currently I am iterating through the string (line) and using the next() function to see whether the next character is a space or lowercase, etc. And then I just loop through the line. But how would I check to see what the next, next character would be? And how would I find the previous ones?
line = "This is line.1 www.abc.com. Mr."
t = iter(line)
b = next(t)
for i in line[:len(line)-1]:
    a = next(t)
    if i == "." and (a.isdigit()): #for example, this checks to see if the value after the period is a number
        print("True")
Any help would be appreciated. Thank you.
Regular expressions are what you want.
Since you're going to check for a pattern in a string, you can make use of Python's built-in support for regular expressions through the re library.
Example:
#To check if there is a period internal to a sequence of letters with no adjacent whitespace
import re
text = 'www.google.com'
pattern = r'[A-Za-z]\.[A-Za-z]'  # a period with a letter directly on each side
obj = re.compile(pattern)
if obj.search(text):
    print("Pattern matched")
Similarly generate patterns for the conditions you want to check in your string.
#If there is a period and it is followed by a whitespace followed by a lowercase letter
regex = r'\.\s[a-z]'
You can generate and test your regular expressions online using this simple tool
Read more extensively about re library here
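To make that concrete, here is one way the three conditions from the question could be written and checked with search. Treat it as a sketch under assumptions: the constant names and check_conditions are hypothetical, the title list (Mr, Mrs, Dr) is just the examples from the question, and the sample sentences are made up.

```python
import re

# Hypothetical names; the title list is an assumption and can be extended.
PERIOD_THEN_LOWERCASE = re.compile(r'\.\s[a-z]')                # condition (a)
PERIOD_INSIDE_LETTERS = re.compile(r'[A-Za-z]\.[A-Za-z]')       # condition (b)
TITLE_THEN_UPPERCASE = re.compile(r'\b(?:Mr|Mrs|Dr)\.\s[A-Z]')  # condition (c)

def check_conditions(line):
    """Report which of the three conditions occur somewhere in line."""
    return {
        'a': bool(PERIOD_THEN_LOWERCASE.search(line)),
        'b': bool(PERIOD_INSIDE_LETTERS.search(line)),
        'c': bool(TITLE_THEN_UPPERCASE.search(line)),
    }

print(check_conditions("Visit www.abc.com today. Mr. Smith agrees. yes"))
print(check_conditions("No periods here"))
```

Using finditer instead of search would give the position of every occurrence rather than a yes/no answer.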
You can use multiple next operations to get more data
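line = "This is line.1 www.abc.com. Mr."
t1 = iter(line)
next(t1)  # t1 now runs one character ahead of i
t2 = iter(line)
next(t2); next(t2)  # t2 runs two characters ahead of i
for i in line[:len(line)-2]:
    a = next(t1)  # the character after i
    c = next(t2)  # the character two after i
    if i == "." and (a.isdigit()): #for example, this checks to see if the value after the period is a number
        print("True")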
You can get previous ones by saving your iterations to a temporary list
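A small sketch of that idea: build the previous and next characters as shifted copies of the line, so each step sees all three at once. The helper name with_neighbours is hypothetical, and using None for a missing neighbour is my own choice.

```python
def with_neighbours(text):
    """Return (previous, current, next) triples; None marks a missing neighbour."""
    chars = list(text)
    previous = [None] + chars[:-1]
    following = chars[1:] + [None]
    return list(zip(previous, chars, following))

line = "This is line.1 www.abc.com. Mr."
for prev, cur, nxt in with_neighbours(line):
    # Example check: a period directly followed by a digit.
    if cur == "." and nxt is not None and nxt.isdigit():
        print("True, preceded by", repr(prev))
```

The same shifting trick extends to two characters ahead or behind by adding another shifted list.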
I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.
I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.
This is what I have so far, I have been trying various ways and this post got me the closest:
Python - Finding word frequencies of list of words in text file
So I have a file that contains a couple of paragraphs and my list of strings is:
listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']
My full code is:
#!/usr/bin/python
import re
from collections import Counter
f = open('text.txt','r')
wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']
words = re.findall(r'\w+', f.read().lower())
cnt = Counter()
for word in words:
    if word in wanted:
        print(word)
        cnt[word] += 1
print(cnt)
my output thus far looks like:
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
Counter({'the': 17})
It is counting my "the" strings with punctuation, but not as separate counters. I know it is because of the \w+. I am just not sure what the proper regex pattern to use here is, or whether I'm going about this the wrong way.
I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by a whitespace or by some punctuation characters such as ;,.!'. You want to count the number of all the distinct instances of this general pattern.
I would define a single (non-disjunctive) regular expression that defines this, something like this:
import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
(That might not be exactly what you are looking for in general. I am just assuming it is, based on what you stated above.)
Now, let's say our text is
text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''
We could do
matches = pattern.findall(text)
where matches will be
[' the ',
';the,',
' The ',
"'the ",
' the;',
" the'",
' the;',
' the,',
' the.']
And then you just count.
from collections import Counter
count = Counter()
for match in matches:
count[match] += 1
which in this case would lead to
Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.
Just to add, a difficulty with using a disjunctive regular expression like
'the|the;|the,|the!'
is that the strings like "the," and "the;" will also match the first option, i.e. "the", and that will be returned as the match. Even though this problem could be avoided by more careful ordering of the options, I think it might not be easier in general.
The simplest option is to combine all "wanted" strings into one regular expression:
rr = '|'.join(map(re.escape, wanted))
and then find all matches in the text using re.findall.
To make sure longer strings match first, just sort the wanted list by length:
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))
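Put together with a Counter, that could look like the sketch below. The sample text is made up, and I use sorted() to get a new list (called ordered here) instead of sorting wanted in place; both are my own choices for the example.

```python
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]

# Longest alternatives first, so 'the,' is tried before plain 'the'.
ordered = sorted(wanted, key=len, reverse=True)
rr = '|'.join(map(re.escape, ordered))

text = "the, the; the"  # made-up sample text
cnt = Counter(re.findall(rr, text))
print(cnt)
```

Each match is consumed once, so a "the," in the text is counted under 'the,' and not again under 'the'.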
I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
someDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any possible long string and should cope with spaces and any other weird thing that might appear in it. That should be possible with regex, right? Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName. But there's another http://www. in the string.
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
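A quick check of that pattern on a made-up long string that also contains a second, unrelated http://www. address (the input text is purely illustrative):

```python
import re

# Made-up input: it contains another http://www. address that must be skipped.
long_string = ("junk http://www.otherSite.org/page more junk "
               "http://www.someDomainName.com/12345 trailing text")

results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
print(results)
```

Because the domain name is spelled out literally in the pattern, the other address never matches.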
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, there are 5 groups in total:
Group 0 is the complete string that is matched.
The rest are numbered in the order of the brackets you see. (So, you are looking for the second one: (\\w*).)
If you want, you can capture only the part of the string you are interested in. Remove the brackets from the parts of the pattern that you don't need and keep just (\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, you won't have groups 2, 3 and 4 as in the previous example, since we have captured only 1 group. And yes, group 0 is always captured; it is the complete string that matches.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in someDomainName, you can just take the first occurrence of the string ".com/" and take everything from that index on.
This lets you avoid regexes, which are harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])