Regex sub only removes certain expressions

I'm running a program that creates product labels from CSV data. The function I'm struggling with takes a string consisting of a number combination (the width of a wooden plank) followed by the name of the product. The combinations I need to handle look like this:
5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF
My function needs to take in the data, split the width from the name and return them both as separate variables as follows:
desc = row[1]
if filter.lower() in desc.lower():
    size = re.search(r'(\d{1})(\-*)(\d{0,1})(\/*)(\d{0,2})(\+*)(\d{0,1})(\-*)(\d{0,1})(\/*)(\d{0,2})', desc)
    if size:
        # remove size from description
        desc = re.sub(size.group(), '', desc)
        size = size.group() # extract match from obj
    else:
        size = "None"
The function works as intended for the first two samples. However, when it reaches the last product, it recognizes the size but does not remove it from the description: printing size + "\n" + desc shows the size still present in desc.
Is there an issue with my re expression or elsewhere?
Thanks

re.sub() expects its first argument to be a regex. It works for the first two samples because they don't contain any characters that have special meaning in a regex. The third contains +, which is a quantifier ("one or more of the preceding item"), so the pattern no longer matches the literal text.
There's not actually any reason to use regex there... regular string replacement should work:
desc = desc.replace(size.group(), '')
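To make the difference concrete, here is a small sketch using the third sample string from the question. If a regex substitution is still wanted, re.escape() from the standard library is one way to neutralise the special characters. The variable names are only for illustration.

```python
import re

desc = "2-1/4+4-1/4 MAPLE TIMBERWOLF"
size = "2-1/4+4-1/4"

# Unescaped: "4+" means "one or more 4s", so the literal "+" is never matched
# and the description comes back unchanged.
unescaped = re.sub(size, "", desc)

# Escaped: re.escape() makes every special character literal.
escaped = re.sub(re.escape(size), "", desc).strip()

# Plain string replacement needs no escaping at all.
replaced = desc.replace(size, "").strip()

print(repr(unescaped))
print(repr(escaped))
print(repr(replaced))
```

The first call leaves the description untouched, which is the behaviour described in the question; the other two both strip the size.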

Why replace and not simply match what you need?
import re
text = """5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF""".split('\n')
print(text)
for t in text:
    pattern = r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
    m = re.search(pattern,t)
    print(m.group('size'))
    print(m.group('species'))
Output:
5
MAPLE PEPPER-ANTIQUE
3-1/4
MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4
MAPLE TIMBERWOLF
Regex:
r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
2 named groups, between them 0-n spaces.
1st group only 0123456789-+/ allowed
2nd group any but 0123456789 allowed

Related

what is the fast way to match words in text?

I have a list of regexes like this:
regex_list = [".+rive.+",".+ll","[0-9]+ blue car.+"......] ## list of length 3000
What is the best way to match all of these regexes against my text?
For example:
text : Hello, Owning 2 blue cars for a single driver
In the output, I want to have a list of the matched words:
matched_words = ["Hello","4 blue cars","driver"] ##Hello <==>.+llo

Alright, first of all, you will probably want to adjust your regex_list, because as it stands, matching those strings will give you the entire text back as the match. This is because of .+, which matches any character any number of times. What I have done here is the following:
import re
regex_list = [".rive.",".+ll.","[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"
# Returns all the spans of matched regex items in text
spans = [re.search(regex_item,text).span() for regex_item in regex_list]
# Sorts the spans on first occurrence (so, first element in item for every item in span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
print(matching_texts)
I adjusted your regex_list slightly, so it does not match the entire text. Then I retrieve the spans of all the matches in the text and sort them by first occurrence. Lastly, I retrieve the texts via the indexes of the spans and print them out. What you will get is the following:
['Hello', '2 blue cars', 'driver']
NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
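One thing to watch with a list of 3000 patterns: re.search() returns None when a pattern does not occur in the text, and calling .span() on None raises an AttributeError. A minimal sketch of the same idea with that case handled is below. The function name first_matches and the extra non-matching pattern are made up for the illustration.

```python
import re

def first_matches(patterns, text):
    """Return the first match of each pattern, ordered by position in text."""
    found = []
    for pattern in patterns:
        m = re.search(pattern, text)
        if m is not None:  # skip patterns that do not occur in the text
            found.append(m)
    found.sort(key=lambda m: m.start())
    return [m.group() for m in found]

# The last pattern is a hypothetical one that does not match anything.
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car.", "[0-9]+ red car."]
text = "Hello, Owning 2 blue cars for a single driver"
matching_texts = first_matches(regex_list, text)
print(matching_texts)
```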

You could also try this, which is a multi-threaded version of @Lexpj's answer:
from concurrent.futures import ThreadPoolExecutor, as_completed
import re
# list of length 3000
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "
def test(text, regex):
    # Returns all the spans of matched regex items in text
    spans = [re.search(regex, text).span()]
    # Sorts the spans on first occurrence (so, first element in item for every item in span).
    spans.sort()
    # Retrieves the text via index of spans in text.
    matching_texts = [text[x[0]:x[1]] for x in spans]
    return matching_texts
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(test, my_string, regex)
               for regex in regex_list}
    # as_completed() gives you the threads once finished
    matched = set()
    for f in as_completed(futures):
        # Get the results
        rs = f.result()
        matched = matched.union(set(rs))
print(matched)

Looking at the desired result, your regexes are not correct. You don't want to match .+, but \w+, and also with the second regex, you'll want to match some letters after ll too.
The main idea is then to make one regex for all, by concatenating them with the | symbol:
import re
regex_list = [r"\w+rive\w+", r"\w+ll\w+", r"\d+ blue car\w+"]
regex = re.compile('|'.join(regex_list))
text = "Hello, Owning 2 blue cars for a single driver "
print(regex.findall(text)) # ["Hello","2 blue cars","driver"]
This could still give undesired results when part of your string matches more than one regex in the list. In that case the first one "wins". So when several regexes could match the same text, make sure they are ordered by their desired priority.
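The ordering effect can be seen in a tiny sketch. The two patterns here are invented for the demonstration and are not from the question's list; they differ only in the trailing "s".

```python
import re

text = "2 blue cars"

# Same two alternatives, joined in a different order.
short_first = re.compile("|".join([r"blue car", r"blue cars"]))
long_first = re.compile("|".join([r"blue cars", r"blue car"]))

# At a given position the leftmost alternative that matches is used,
# even when a later alternative would match more text.
short_result = short_first.findall(text)
long_result = long_first.findall(text)

print(short_result)
print(long_result)
```

With the shorter alternative first, the trailing "s" is left out of the match; putting the longer alternative first captures it.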

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would randomly be periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the book's title and author:
titleRegex = re.compile(r'(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: I like apples because they are green (they are sometimes red as well).
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted.
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile(r'(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
    newString = re.sub(regexList[x], ' ', string)
    string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)

There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
    r"\n" +
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)
for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This matches one whole clipping entry at a time and parses it, relying on the fact that the book title is on the first line after the row of equals signs and that the quote itself comes after the metadata line and a blank line.
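As a rough check of that idea, the sketch below runs the same pattern over a hand-written sample. The sample text is an assumption about the clippings layout (it was written to fit the pattern, not copied from the real file), so real data may need the pattern adjusted.

```python
import re

# Hypothetical sample in the layout the pattern expects; real clippings may differ.
data = (
    "==========\n"
    "Book title (Author name)\n"
    "- Your Highlight on page 12 | Location 180-181 | Added on Monday, January 1, 2018 10:15:00 AM\n"
    "\n"
    "I like apples because they are green (they are sometimes red as well).\n"
    "==========\n"
)

quote_info = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>\d+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n",
    flags=re.MULTILINE,
)

entries = [m.groupdict() for m in quote_info.finditer(data)]
for entry in entries:
    print(entry["title"], "|", entry["author"], "|", entry["quote"])
```

Because the title is only looked for directly after the separator line, the highlighted sentence that ends in brackets is kept whole in the quote group instead of being mistaken for a title.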

Parsing the String to different format in Python

I have a text document and I need to add two # symbols before the keywords present in an array.
Sample text and Array:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
Required Text:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32"

Just use the replace function
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr = ['name','employee_id','blood_group','age']
for w in arr:
    str = str.replace(w, f'##{w}')
print(str)

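You can simply loop over arr and use the str.replace method. Strings are immutable, so the result has to be assigned back:
for repl in arr:
    strng = strng.replace(repl, '##'+repl)
print(strng)
However, I urge you to rename the variable str (called strng above): str is the name of the built-in string type, and assigning to it shadows that built-in.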

You could use the re module for this task as follows:
import re
txt = "This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
newtxt = re.sub('('+'|'.join(arr)+')',r'##\1',txt)
print(newtxt)
Output:
This is a sample text document which consists of all demographic information of employee here is the value you may need,##name: George ##employee_id:14296##blood_group:b positive this is the blood group of the employee##age:32
Explanation: here I used regular expression to catch words from your list and replace each with ##word. This is single pass, as opposed to X passes when using multiple str.replace (where X is length of arr), so should be more efficient for cases where arr is long.
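A possible variation on the single-pass idea, sketched under two assumptions: the keywords might contain regex metacharacters (hence re.escape), and the spacing of the required text is wanted (exactly one space before each ##). The names alternatives and pattern are introduced here only for illustration.

```python
import re

txt = "This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr = ['name', 'employee_id', 'blood_group', 'age']

# Longest keywords first, so a shorter keyword can never shadow a longer one
# that starts the same way; re.escape() keeps each keyword literal.
alternatives = "|".join(re.escape(w) for w in sorted(arr, key=len, reverse=True))

# \s* swallows any whitespace already in front of the keyword, so the
# replacement always leaves exactly one space before "##".
pattern = re.compile(r"\s*(" + alternatives + r")")
newtxt = pattern.sub(r" ##\1", txt)
print(newtxt)
```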

As an alternative, you can use the approach below and convert it into a loop for a longer list. Note that the required output also has a space before each ##.
str= str[:str.find(arr[0])] + ' ##' + str[str.find(arr[0]):]
str= str[:str.find(arr[1])] + ' ##' + str[str.find(arr[1]):]
str= str[:str.find(arr[2])] + ' ##' + str[str.find(arr[2]):]
str= str[:str.find(arr[3])] + ' ##' + str[str.find(arr[3]):]
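A loop form of the same slicing idea might look like the sketch below. It uses text instead of str as the variable name and adds a check for keywords that are missing, since str.find() returns -1 in that case and the slices would then insert the marker in the wrong place.

```python
text = "This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr = ['name', 'employee_id', 'blood_group', 'age']

for key in arr:
    idx = text.find(key)  # index of the first occurrence, or -1 if absent
    if idx != -1:
        # Splice " ##" in just before the keyword.
        text = text[:idx] + " ##" + text[idx:]

print(text)
```

Only the first occurrence of each keyword is marked, and a keyword that already had a space in front of it ends up with two.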

You can replace each keyword with a space and ## followed by the keyword, and then replace double spaces with a single space in the result.
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
for i in arr:
    str = str.replace(i, " ##{}".format(i))
print(str.replace("  ", " "))
Output
This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32

Regex Python [python-2.7]

I'm working on a Python program that sifts through a .txt file to find the genus and species name. The lines are formatted like this (yes, the equals signs are consistently around the common name):
1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.
I can't seem to figure out a regex that will work to match only the genus and species and not the common name. I know the equals signs (=) will probably help in some way but I cannot think of how to use them.
Edit: Some real data:
1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.
2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.
3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.
4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.

You probably don't need regex for this one. If the order of the words you need and the count of the words is always the same, you can just split each line into list of substrings and get the third (genus) and the fourth (species) element of that list. The code will probably look like that:
myfile = open('myfilename.txt', 'r')
for line in myfile.readlines():
    words = line.split()
    genus, species = words[2], words[3]
It just looks a little more "pythonic" to me.
If the common name can consist of multiple words, the code above will return an incorrect result. To get the right result in that case too, you can use this code:
myfile = open('myfilename.txt', 'r')
for line in myfile.readlines():
    words = line.split('=')[2].split() # If the program returns wrong results, try changing the index from 2 to 1 or 3. What number is the right one depends on whether there can be any symbols before the first "=".
    genus, species = words[0], words[1]
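Applied to two of the real lines from the question, the split-based approach could be sketched as follows (written for Python 3). The helper name genus_species is made up here, and stripping the trailing period is an added step that the real data seems to need.

```python
lines = [
    "1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.",
    "3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.",
]

def genus_species(line):
    # Everything after the second "=" is the scientific name plus the notes.
    after_common_name = line.split("=")[2]
    words = after_common_name.split()
    # The real data puts a period after the name, so strip it.
    return words[0].rstrip("."), words[1].rstrip(".")

results = [genus_species(line) for line in lines]
for genus, species in results:
    print(genus, species)
```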

If it is enough to capture the words in groups (and you don't want a direct match), you can try:
(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))
The desired values will be in the groups <genus> and <species>. The whole regex is a positive lookahead, so it matches a zero-width position at the start of the entry, but it still captures content into the groups.
(?=\d\.\s*=[^=]+=\s - a digit followed by a period, then some content between equals signs and a whitespace character,
(?:(?P<genus>\w+)\s(?P<species>\w+))) - captures the first word into the genus group and the second word into the species group.
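For reference, a sketch of how this pattern could be used from Python 3, applied to one of the real lines from the question. The variable names are illustrative only.

```python
import re

# The lookahead pattern from above, unchanged.
pattern = re.compile(r"(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))")

line = "2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America."

m = pattern.search(line)
# The match itself is empty (zero-width); the useful parts are the groups.
whole_match = m.group(0)
genus = m.group("genus")
species = m.group("species")
print(repr(whole_match), genus, species)
```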

You can try something like:
import re
txt='1. =Common Name= Genus Species some other words that I don\'t want.'
re1='.*?' # Non-greedy match on filler
re2='(?:[a-z][a-z]+)' # Uninteresting: word
re3='.*?' # Non-greedy match on filler
re4='(?:[a-z][a-z]+)' # Uninteresting: word
re5='.*?' # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?' # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    word2=m.group(2)
    print "("+word1+")"+"("+word2+")"+"\n"
In your test input as shown in txt, this will print
(Genus)(Species)
You can use this awesome site to help build regexes like this!
Hope this helps

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudo-Python, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?

Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
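import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)' # title with the trailing parenthesis put back
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()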
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
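The same pattern can also be written with named groups, which may be easier to read than numbered ones. The sketch below is in Python 3 syntax; the group names band, venue and date are chosen here, and the strip() call removes the trailing space that the [\w\s]+ groups capture.

```python
import re

title_pattern = re.compile(
    r'(?P<band>[\w\s]+)\((?P<venue>[\w\s]+)(?P<date>\d+/\d+)\)'
)

m = title_pattern.match('Michael Schenker Group (House of Blues Dallas 3/26)')

# groupdict() maps each group name to its text; strip the captured whitespace.
fields = {name: value.strip() for name, value in m.groupdict().items()}
print(fields)
```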
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!

Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.

Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # str() guards against lat and lng being numbers, which join() cannot handle
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
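For the CSV part of the question, the standard csv module can take care of the delimiters and quoting. The following Python 3 sketch leaves out the feed download and the geocoding (so there are no lat/long columns) and works on two titles taken from the question with the trailing parenthesis restored. The function name titles_to_csv is hypothetical.

```python
import csv
import io
import re

TITLE_RE = re.compile(r'(.*) \((.*) (\d+/\d+)\)')

def titles_to_csv(titles, band):
    """Return CSV text with one row per title whose band equals `band`."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["band", "venue", "date"])
    for title in titles:
        m = TITLE_RE.match(title)
        if m and m.group(1) == band:
            writer.writerow(m.groups())
    return out.getvalue()

sample_titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
]
csv_text = titles_to_csv(sample_titles, "randy travis")
print(csv_text)
```

Writing to a real file instead would mean passing a file object opened with newline="" to csv.writer in place of the StringIO buffer.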
