I need some help with a regex string to pull any filename that looks like it might be part of a frame sequence out of a previously generated list of filenames.
Frames in a sequence will generally have a minimum padding of 3 digits and will be preceded by either a '.' or a '_'. An exception is when the filename consists only of a number and the .jpg extension (e.g. 0001.jpg, 0002.jpg, etc.). I'd like to capture all of these in one line of regex, if possible.
Here's what I have so far:
(.*?)(.|_)(\d{3,})(.*)\.jpg
Now, I know this doesn't do the "preceded by . or _" bit and instead just finds a . or _ anywhere in the string to return a positive. I've tried a bit of negative lookbehind testing, but can't get the syntax to work.
A sample of data is:
test_canon_shot02.jpg
test_shot01-04.jpg
test_shot02-03.jpg
test_shot02-02.jpg
test_shot01-03.jpg
test_canon_shot03.jpg
test_shot01-02.jpg
test_shot02.jpg
test_canon_shot02.jpg
test_shot01.jpg
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
OrangeXmas2015_Print_A ct2.jpg
sh120_HF_V01-01.jpg
sh120_HF_V01-02.jpg
sh200_DMP_v04.jpg
sh120_HF_V04.jpg
sh120_HF_V03.jpg
sh120_HF_V02.jpg
blah_v02.jpg
blah_v01.jpg
blah_Capture0 4.jpg
blah_Capture03 .jpg
blah_Capture01. jpg
blah_Capture02.jpg
Wall_GraniteBlock_G rey_TC041813.jpg
Renders10_wire.jpg
Renders10.jpg
Renders09_wire.jpg
Renders09.jpg
Renders08_wire.jpg
Renders08.jpg
Renders07_wire.jpg
Renders07.jpg
Renders06_wire.jpg
Renders06.jpg
Renders05_wire.jpg
Renders05.jpg
Renders04_wire.jpg
Renders04.jpg
Renders03_wire.jpg
Renders03.jpg
Renders02_wire.jpg
Renders02.jpg
Renders01_wire.jpg
Renders01.jpg
archmodels58_057_carpinusbetulus_leaf_diffuse.jpg
archmodels58_042_bark_bump.jpg
archmodels58_023_leaf_diffuse.jpg
WINDY TECHNICZNE-reflect00.jpg
archmodels58_057_leaf_opacity.jpg
archmodels58_057_bark_reflect.jpg
archmodels58_057_bark_bump.jpg
blahC-00-oknaka.jpg
bed
debt
cab
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
The result I'm after is 2 sequences identified:
GameAssets_.00000.jpg to GameAssets_.00024.jpg
00000.jpg to 00018.jpg
Based on the rules you specified in your question, this pattern should accomplish what you need:
(^|\r?\n|.*_|.*\.)\d{3,}.*\.jpg
import re
# data: the filename list from the question as one newline-separated string
for item in re.findall(r'.*?[._]?0{3,}.*', data):
    print(item)
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
Try
(.*?)(\.|_?)(000\d{0,})(.*)\.jpg
Notice that I had to escape the '.' in the second group. Also, I had to make the search for '.' and '_' optional in the second group. Finally, I had to add the minimum padding to the third group.
I used regex101.com to test and refine the regex.
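A minimal Python sketch of the rules from the question, for reference. It assumes the frame number sits directly before the .jpg extension (the sample data suggests this, but the question does not state it). The names FRAME_RE and find_sequences are made up for illustration:

```python
import re
from collections import defaultdict

# Optional prefix ending in '.' or '_', then 3+ digits, then ".jpg".
# Assumption: nothing may follow the frame number except the extension.
FRAME_RE = re.compile(r'^(?P<prefix>(?:.*[._])?)(?P<frame>\d{3,})\.jpg$')

def find_sequences(filenames):
    """Group frame-like filenames by prefix; return {prefix: (first, last)}."""
    groups = defaultdict(list)
    for name in filenames:
        m = FRAME_RE.match(name)
        if m:
            groups[m.group('prefix')].append((int(m.group('frame')), name))
    sequences = {}
    for prefix, frames in groups.items():
        if len(frames) > 1:  # a single file is not a sequence
            frames.sort()    # order by frame number
            sequences[prefix] = (frames[0][1], frames[-1][1])
    return sequences

# A few names taken from the sample data in the question
names = ['test_shot01.jpg', 'GameAssets_.00001.jpg', 'GameAssets_.00000.jpg',
         'archmodels58_057_bark_bump.jpg', '00001.jpg', '00000.jpg', 'bed']
for prefix, (first, last) in find_sequences(names).items():
    print(first, 'to', last)
```

Grouping by prefix is what separates the GameAssets_. frames from the bare-number frames. If frame numbers can be followed by more text before the extension, the pattern would need to be loosened.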
I need a Python regex which matches to mobile phone numbers from Germany and Austria.
In order to do so, we first have to understand the structure of a phone number:
A mobile number can be written with a country calling code at the beginning. However, this code is optional!
If we use the country calling code, the trunk prefix is redundant!
The prefix is composed of the trunk prefix and the company code.
The prefix is followed by an individual, unique number with 7 or 8 digits.
List of German prefixes:
0151, 0160, 0170, 0171, 0175, 0152, 0162, 0172, 0173, 0174, 0155, 0157, 0159, 0163, 0176, 0177, 0178, 0179, 0164, 0168, 0169
List of Austrian prefixes:
0664, 0680, 0688, 0681, 0699, 0664, 0667, 0650, 0678, 0650, 0677, 0676, 0660, 0699, 0690, 0665, 0686, 0670
Now that we know all the rules needed to build a regex, we have to consider that humans sometimes write numbers in very strange ways, with multiple whitespaces, / or (). For example:
0176 98 600 18 9
+49 17698600189
+(49) 17698600189
0176/98600189
0176 / 98600189
many more ways to write the same number
I am looking for a Python regex which can match all Austrian and German mobile numbers.
What I have so far is this:
^(?:\+4[39]|004[39]|0|\+\(49\)|\(\+49\))\s?(?=(?:[^\d\n]*\d){10,11}(?!\d))(\()?[19][1567]\d{1,2}(?(1)\))\s?\d(?:[ /-]?\d)+
You can use
(?x)^ # Free spacing mode on and start of string
(?: # A container group:
(\+49|0049|\+\(49\)|\(\+49\))? [ ()\/-]* # German: country code
(?(1)|0)1(?:5[12579]|6[023489]|7[0-9]) # trunk prefix and company code
| # or
(\+43|0043|\+\(43\)|\(\+43\))? [ ()\/-]* # Austrian: country code
(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])) # trunk prefix and company code
)
[ ()\/-]* # zero or more spaces, parens, / and -
\d(?:[ \/-]*\d){6,7} # a digit and then six or seven occurrences of space, / or - and a digit
\s* # zero or more whites
$ # end of string
See the regex demo.
A one-line version of the pattern is
^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$
See this demo.
How to create the company code regex
Go to the "Optimize long lists of fixed string alternatives in regex" post.
Click the Run code snippet button at the bottom of the answer to run the last code snippet.
Resize the input box if you wish.
Get the list of your supported numbers, either comma or linebreak separated, and paste it into the field.
Click the Generate button and grab the pattern that appears below.
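As a rough usage sketch, the one-line pattern above can be compiled with Python's re module and tried against the spellings from the question. The helper name is_mobile and the Austrian sample number are made up for illustration:

```python
import re

# The one-line pattern from the answer, split across lines for readability only.
MOBILE_RE = re.compile(
    r'^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])'
    r'|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))'
    r'[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$'
)

def is_mobile(number):
    """Return True if the string matches the German/Austrian mobile pattern."""
    return MOBILE_RE.match(number) is not None

samples = [
    '0176 98 600 18 9',
    '+49 17698600189',
    '+(49) 17698600189',
    '0176/98600189',
    '0176 / 98600189',
    '+43 664 1234567',     # hypothetical Austrian number
    '0049 0176 98600189',  # country code plus trunk prefix: rejected
    '0123 4567890',        # not a listed mobile prefix: rejected
]
for number in samples:
    print(repr(number), is_mobile(number))
```

The conditional groups (?(1)|0) and (?(2)|0) are what enforce the rule that the trunk prefix 0 appears only when no country code was matched.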
I am trying to do autodetection of bra size in a list of clothes. While I managed to extract only the bra items, I am now looking at extracting the size information and I think I am almost there (thanks to the stackoverflow community). However, there is a particular case that I could not find on another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So:
A possible white space, if not at the beginning of the description
two or three digits
an optional white space
any letter (lower or upper case) between A and F
and then another letter for the two special cases AA and FF, or the end of the string.
My question is: is there a way to require the second letter to match the first letter (AA or FF)? In my case, my code outputs some BA and CA sizes, which do not exist.
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" return "42 ba" instead of not found
"puma, sport-bh, strl: 34cd, svart/grå", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!
You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match 2-3 digits and an optional space (prepend a literal space, as in your original pattern, to require one before the digits)
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
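A short sketch of how the pattern behaves on the four descriptions from the question, using the pattern exactly as shown above (without a leading space). The names SIZE_RE and find_size are made up for illustration:

```python
import re

SIZE_RE = re.compile(r'\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])')

def find_size(text):
    """Return the matched size string, or None if no size is found."""
    m = SIZE_RE.search(text)
    return m.group(0) if m else None

descriptions = [
    "bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain",
    "puma, sport-bh, strl: 34cd, svart/grå",
    "victoria's secret, bh, strl: 32c, gul/vit",
    "pink victorias secret bh 75dd burgundy",
]
for text in descriptions:
    print(find_size(text))
```

Note that the lookahead only rejects a following letter from a to f, so a description such as "42 big" would still yield "42 b". Widening the lookahead to all letters may be worth considering if that matters for your data.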
I have a txt file that contains multiple lines, and the result I want spans multiple lines.
For example, my data can be simplified as follows:
target_str =
x:-2.12343234
aaa:-3.05594480202
aaa:-3.01292995004
aaa:-2.383299
456:-2.232342
x:-2.53739230
aaa:-2.96875038099
aaa:-2.92326261448
aaa:-2.87628054847
bbb:-2.82755928961
456:-2.77678240323
x:-2.3433210
aaa:-2.72356707049
aaa:-2.6675072938
aaa:-2.60827106148
456:-2.3323232
x:-2.8743920
aaa:-2.433233
aaa:-2.9747893
aaa:-2.9747893
bbb:-2.43873
456:-2.43434
I want to match
x:.....
aaa:.....
aaa:.....
aaa:.....
bbb:.....
456:.....
This means that if bbb exists, I pick up the lines from x:... to 456:....
The expected result for the example data is:
x:-2.53739230
aaa:-2.96875038099
aaa:-2.92326261448
aaa:-2.87628054847
bbb:-2.82755928961
456:-2.77678240323
x:-2.8743920
aaa:-2.433233
aaa:-2.9747893
aaa:-2.9747893
bbb:-2.43873
456:-2.43434
I wrote:
a=re.findall(r"x:(.*\n){4}bbb:.*\n456.*",target_str)
print(a)
But the result is:
['aaa:-2.87628054847\n', 'aaa:-2.9747893\n']
This is not correct. Can anyone help me? Thanks a lot.
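Try the following regex:
(x:(?:.*\n){4}bbb:.*\n456.*)
(?:.*\n) - the ?: makes the group non-capturing, so it won't appear in the output.
Adding parentheses around the whole regex makes it a group, which is what you will see as output. When a pattern contains capturing groups, re.findall returns the groups rather than the whole match, which is why your version returned only the last repeated line.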
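The difference can be seen in a small sketch that uses only the first two blocks of the sample data. The variable names with_group and without_group are just for illustration:

```python
import re

# First two blocks of the sample data from the question
target_str = """x:-2.12343234
aaa:-3.05594480202
aaa:-3.01292995004
aaa:-2.383299
456:-2.232342
x:-2.53739230
aaa:-2.96875038099
aaa:-2.92326261448
aaa:-2.87628054847
bbb:-2.82755928961
456:-2.77678240323
"""

# Capturing group: findall returns only the last repetition of the group
with_group = re.findall(r"x:(.*\n){4}bbb:.*\n456.*", target_str)
# Non-capturing group: findall returns the whole match
without_group = re.findall(r"x:(?:.*\n){4}bbb:.*\n456.*", target_str)

print(with_group)
print(without_group)
```

With no capturing groups at all, the outer parentheses are optional, since findall then returns the full match anyway.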
I'm working on a Python program that sifts through a .txt file to find the genus and species name. The lines are formatted like this (yes, the equals signs are consistently around the common name):
1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.
I can't seem to figure out a regex that will work to match only the genus and species and not the common name. I know the equals signs (=) will probably help in some way but I cannot think of how to use them.
Edit: Some real data:
1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.
2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.
3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.
4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.
You probably don't need regex for this one. If the order and the count of the words you need are always the same, you can just split each line into a list of substrings and take the third (genus) and the fourth (species) elements of that list. The code will probably look like this:
with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        words = line.split()
        genus, species = words[2], words[3]
It just looks a little more "pythonic" to me.
If the common name can consist of multiple words, then the suggested code will return an incorrect result. To get the right result in this case too, you can use this code:
with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        # If the program returns wrong results, try changing the index from 2 to 1 or 3. Which number is right depends on whether there can be any symbols before the first "=".
        words = line.split('=')[2].split()
        genus, species = words[0], words[1]
If it is enough to capture the words in groups (and you don't want a direct match), you can try:
(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))
The desired values will be in the groups <genus> and <species>. The whole regex is a positive lookahead, so it matches a zero-width position at the beginning of the string, but it still captures some content into groups.
(?=\d\.\s*=[^=]+=\s - a digit followed by a dot, then some content between equals signs and a space,
(?:(?P<genus>\w+)\s(?P<species>\w+))) - capture the first word into the genus group and the second word into the species group.
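A small sketch of the lookahead pattern applied with Python's re module to two of the real lines from the question (shortened here). The names LINE_RE and pairs are made up for illustration:

```python
import re

LINE_RE = re.compile(r'(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))')

# Two lines from the question's real data, shortened after the species name
lines = [
    "1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species.",
    "2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north.",
]

pairs = []
for line in lines:
    m = LINE_RE.search(line)
    if m:
        # The match itself is empty; the values live in the named groups
        pairs.append((m.group('genus'), m.group('species')))

for genus, species in pairs:
    print(genus, species)
```

In Python 3, \w matches Unicode letters by default, so Æ is handled. Only the first two words after the common name are captured, so a third word such as a subspecies name would be dropped.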
You can try something like:
import re
txt='1. =Common Name= Genus Species some other words that I don\'t want.'
re1='.*?' # Non-greedy match on filler
re2='(?:[a-z][a-z]+)' # Uninteresting: word
re3='.*?' # Non-greedy match on filler
re4='(?:[a-z][a-z]+)' # Uninteresting: word
re5='.*?' # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?' # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    word2=m.group(2)
    print("("+word1+")"+"("+word2+")"+"\n")
In your test input as shown in txt, this will print
(Genus)(Species)
You can use this awesome site to help build regexes like this!
Hope this helps
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
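import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # title with the trailing parenthesis restored
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())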
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object:
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3))) # strip() drops the trailing space captured by [\w\s]+
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
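One thing to watch with the breakdown above: because \w also matches digits, the greedy venue group can swallow the first digit of a two-digit month. The sketch below compares the pattern with a lazy variant. TITLE_RE is a suggested alternative, not part of the answer, and the 'Some Band' title is made up:

```python
import re

# Pattern from the answer (greedy groups)
GREEDY_RE = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
# Suggested variant: lazy groups, with the separating whitespace matched outside them
TITLE_RE = re.compile(r'([\w\s]+?)\s*\(([\w\s]+?)\s+(\d+/\d+)\)')

title = 'Some Band (Some Venue 12/31)'  # hypothetical title with a two-digit month

print(GREEDY_RE.match(title).groups())
print(TITLE_RE.match(title).groups())
print(TITLE_RE.match('Michael Schenker Group (House of Blues Dallas 3/26)').groups())
```

The lazy variant also avoids the trailing spaces in the band and venue groups, so no strip() is needed afterwards.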
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # str() in case lat and lng come back as numbers
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
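Since the end goal is a .csv file, here is a hedged sketch of the parsing and writing steps using the standard csv module, which takes care of quoting if a band or venue name ever contains a comma. Geocoding is left out, so the lat and long columns are omitted. The function name titles_to_csv is made up for illustration:

```python
import csv
import io
import re

TITLE_RE = re.compile(r'(.*) \((.*) (\d+/\d+)\)')

def titles_to_csv(titles, band, out):
    """Write band, venue, date rows for the selected band to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(['band', 'venue', 'date'])
    for title in titles:
        m = TITLE_RE.search(title)
        if m and m.group(1) == band:
            writer.writerow(m.groups())

# The two titles from the question, with the trailing parenthesis kept
titles = ['randy travis (Billy Bobs 3/21)',
          'Michael Schenker Group (House of Blues Dallas 3/26)']
buf = io.StringIO()  # stands in for an open file
titles_to_csv(titles, 'randy travis', buf)
print(buf.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='') in place of the StringIO buffer.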