from mrjob.job import MRJob
import re

Creation_date=re.compile('CreationDate=\"[0-9]*\"[:17]')

class Part2(MRJob):
    def mapper(self, _, line):
        DateOnly=Creation_date.group(0).split("=")
        if(DateOnly > 2013):
            yield None, 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    Part1.run()
I have written Python code for a MapReduce job over input where CreationDate="2010-07-28T19:04:21.300". I have to find all the dates where the creation date is on or after 2014-01-01, but I have encountered an error.
Creation_date is just a compiled regex.
You need to match it against your input string before you can call group(0).
A regular expression object (the result of re.compile) does not have a group method:
>>> pattern = re.compile('CreationDate="([0-9]+)"')
>>> pattern.group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: '_sre.SRE_Pattern' object has no attribute 'group'
To get a match object (which has a group method), you need to match the pattern against the string (line) using the pattern's search method (or its match method, depending on your needs):
>>> pattern.search('CreationDate="2013"')
<_sre.SRE_Match object at 0x7fac5c64e8a0>
>>> pattern.search('CreationDate="2013"').group(1) # returns a string
'2013'
Creation_date = re.compile('CreationDate="([0-9]+)"')

def mapper(self, _, line):
    # search() returns a match object; group(1) is the captured digits
    date_only = Creation_date.search(line).group(1)
    if int(date_only) > 2013:
        yield None, 1
NOTE: I modified the regular expression to capture the numeric part as a group, and converted the matched string to int (comparing a string with the number 2013 is meaningless in Python 2 and raises a TypeError in Python 3).
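Note that the pattern above only matches an attribute value made up entirely of digits, whereas the sample value in the question is a full timestamp (2010-07-28T19:04:21.300). Here is a minimal sketch, assuming each input line carries a CreationDate attribute in that timestamp format. It captures only the four-digit year and skips lines that do not match. The helper names creation_year and is_2014_or_later are hypothetical, not part of mrjob:

```python
import re

# Capture only the leading four-digit year of the timestamp.
CREATION_YEAR = re.compile(r'CreationDate="(\d{4})-')


def creation_year(line):
    """Return the year as an int, or None when the line has no CreationDate."""
    match = CREATION_YEAR.search(line)
    return int(match.group(1)) if match else None


def is_2014_or_later(line):
    """True when the line's creation date is on or after 2014-01-01."""
    year = creation_year(line)
    return year is not None and year >= 2014
```

Inside the mapper this would reduce to: if is_2014_or_later(line): yield None, 1.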
I'm trying to get the domain from a list of URLs. For that, I'm using a regex in a function for pattern matching.
def get_domain(url):
    m = re.search(r"https:\/\/(.*)\/", url)
    result = m.group(1)
    return result;

string_array = ("hTTps://stack0verflow.com/", "hTTps://stackoverfl0w.com/", "hTTps://stackoverfiow.com/")
m = list(map(get_domain, string_array))
The function get_domain works if I use a for-loop to iterate over the list of strings, but whenever I try to use the map function, I get the error below.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-575cbcab950e> in <module>
9 print(get_domain(url))
10
---> 11 m = list(map(get_domain, string_array))
12 ##print(m)
<ipython-input-19-fc11e511d74d> in get_domain(url)
12 def get_domain(url):
13 m = re.search(r"https:\/\/(.*)\/", url)
---> 14 result = m.group(1)
15 return result;
AttributeError: 'NoneType' object has no attribute 'group'
Why does this happen and what am I doing wrong? I've seen a lot of examples of the map function online, and I think I have the syntax down.
This regex captures only the domain name in its first group:
(?:https?:\/\/)?(?:(?:www|ssh).)?((?=.*\.)[^\n\/]*)
Don't forget to make it case-insensitive.
Example:
import re

arr = ["https://www.exemple.com/?query=blablabla","https://www.exemple.com/aaa","hTTp://www.exemple.com","www.exemple.com/aaa","exemple.com"]
for i in arr:
    m = re.search(r"(?:https?:\/\/)?(?:(?:www|ssh).)?((?=.*\.)[^\n\/]*)",i,re.IGNORECASE)
    print(m.group(1))
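As for why the call fails: re.search returns None when the pattern does not match, and the question's pattern https:// is case-sensitive, so it cannot match the hTTps:// strings in the tuple. The function behaves the same under map() and a for-loop when given the same inputs. Here is a minimal sketch, assuming you want None for URLs that do not match. The [^/]+ pattern is a deliberate simplification, not the regex from the answer above:

```python
import re

# Case-insensitive, and stops at the first slash after the host.
DOMAIN = re.compile(r"https?://([^/]+)", re.IGNORECASE)


def get_domain(url):
    """Return the host part of the URL, or None when the pattern does not match."""
    m = DOMAIN.search(url)
    return m.group(1) if m else None


urls = ("hTTps://stack0verflow.com/", "hTTps://stackoverfl0w.com/", "hTTps://stackoverfiow.com/")
domains = list(map(get_domain, urls))
```

Checking for None before calling group() keeps a single unmatched URL from aborting the whole map() call.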
I want to extract (abc)(def) using a regex,
but I ended up with the error below.
import re

def main():
    str = "-->(abc)(def)<--"
    match = re.search("\-->(.*?)\<--" , str).group(1)
    print match
The error is:
Traceback (most recent call last):
File "test.py", line 7, in <module>
match = re.search("\-->(.*?)\<--" , str).group()
File "/usr/lib/python2.7/re.py", line 146, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
Corrected:
import re

def main():
    my_string = "-->(abc)(def)<--"
    # raw string; "-" and "<" need no escaping here
    match = re.search(r"-->(.*?)<--", my_string).group(1)
    print(match)
    # (abc)(def)

main()
Note that I renamed str to my_string (do not use built-in names as your own variable names!). You may still be able to optimize your regex with lookarounds; the lazy quantifier (.*?) can be very inefficient at times.
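The "expected string or buffer" message is what you get when something that is not a string reaches re.search. With a variable named str, any code outside the function where it was assigned sees the built-in type instead. Here is a minimal sketch of that failure and of a helper that avoids it. The name extract is hypothetical:

```python
import re

PATTERN = re.compile(r"-->(.*?)<--")


def extract(text):
    """Return the text between the arrows, or None when there is no match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Passing the built-in type str instead of a string raises TypeError.
try:
    PATTERN.search(str)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
```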
I tried the code from "Natural Language Processing with Python", but a TypeError occurred.
import nltk
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist.inc(word[-1:])
    suffix_fdist.inc(word[-2:])
    suffix_fdist.inc(word[-3:])
common_suffixes = suffix_fdist.items()[:100]

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

pos_features('people')
The error is below:
Traceback (most recent call last):
File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 323, in <module>
pos_features('people')
File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 321, in pos_features
features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
TypeError: not all arguments converted during string formatting
Could anyone help me find out where I went wrong?
suffix is a tuple, because .items() returns (key,value) tuples. When you use %, if the right hand side is a tuple, the values will be unpacked and substituted for each % format in order. The error you get is complaining that the tuple has more entries than % formats.
You probably want just the key (the actual suffix), in which case you should use suffix[0], or .keys() to only retrieve the dictionary keys.
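A minimal sketch of the behaviour described above, using a made-up (key, value) pair rather than real Brown corpus counts, so it runs without NLTK:

```python
# A (key, value) pair like those returned by .items(); the count is invented.
suffix = ("e", 202)

# One %s but two tuple entries: the tuple is unpacked, so formatting fails.
try:
    label = "endswith(%s)" % suffix
    message = None
except TypeError as exc:
    message = str(exc)

# Fix 1: format only the key, which is the actual suffix string.
label_from_key = "endswith(%s)" % suffix[0]

# Fix 2: wrap the tuple in a one-element tuple to format it as a single value.
label_whole_tuple = "endswith(%s)" % (suffix,)
```

For the feature extractor in the question, the first fix is the one that matches the intent, since endswith() also needs the suffix string rather than the pair.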
I have a string that looks like a path, from which I am trying to extract 022014-101 with a regular expression I got from here.
str1 = "Test 123 <C:\User\Test\xyz\022014-101\more\stuff\022014\1> Text"
Actually, I am reading the string from a text file, so I don't have to escape it, but for testing purposes I used this string instead:
str1 = "<C:\\User\\Test\\xyz\\022014-101\\more\\stuff\\022014\\1>"
Here is the code I tried in order to match the first occurring 022014-101:
import re
p = re.compile('(?<=\\)[\d]{6}[^\\]*')
m = p.match(str1)
print m.group(0) #Line 6
It gave me this error:
Traceback (most recent call last):
File "test12.py", line 6, in <module>
print m.group(0)
AttributeError: 'NoneType' object has no attribute 'group'
How can I get the desired output 022014-101?
EDIT:
That did it:
import re
m = re.search(r'(?<=\\)[\d]{6}[^\\]*', str1)
print m.group(0)
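The change works because match() only tries the pattern at the very start of the string, while search() scans forward for the first position where it matches. The pattern also needs a raw string so that \\ reaches the regex engine as an escaped backslash. Here is a minimal sketch of the difference, using the sample string from the question written as a raw string:

```python
import re

# Raw string: each backslash in the path is a single literal backslash.
str1 = r"Test 123 <C:\User\Test\xyz\022014-101\more\stuff\022014\1> Text"

# Six digits preceded by a backslash, followed by anything up to the next backslash.
pattern = re.compile(r"(?<=\\)\d{6}[^\\]*")

anchored = pattern.match(str1)   # tried only at position 0, so no match
scanned = pattern.search(str1)   # scans the whole string

result = scanned.group(0) if scanned else None
```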
I am working through some example code which I've found on What's the most efficient way to find one of several substrings in Python?. I've changed the code to:
import re
to_find = re.compile("hello|there")
search_str = "blah fish cat dog haha"
match_obj = to_find.search(search_str)
#the_index = match_obj.start()
which_word_matched = ""
which_word_matched = match_obj.group()
Since there is now no match, I get:
Traceback (most recent call last):
File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
What is the standard way in Python to handle the case of no match, so as to avoid the error?
match_obj = to_find.search(search_str)
if match_obj:
    # do things with match_obj
    pass
Other handling will go in an else block if you need to do something even when there's no match.
Your match_obj is None because the regular expression did not match. Test for it explicitly:
which_word_matched = match_obj.group() if match_obj else ''
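Both answers come down to checking the result of search() before calling group(). Here is a minimal sketch that wraps the check in a function. The name which_word and its default parameter are hypothetical:

```python
import re

to_find = re.compile("hello|there")


def which_word(text, default=""):
    """Return the first matched word, or default when nothing matches."""
    match_obj = to_find.search(text)
    if match_obj is None:
        # No match: search() returned None, so there is no group() to call.
        return default
    return match_obj.group()
```

Comparing with None explicitly makes the intent clear, although a plain truth test works as well because match objects are always truthy.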