How to pull out language via regex - python

I have the following two strings:
s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'
I want to pull out the first language in each string. Here is what I have so far:
re.findall(r'\s\((.+)', s1)
['English)']
How would I improve this to work on both of the above?

You could use this, which finds only the first language in each string and is only a small tweak to your existing code:
f = re.findall(r'\((\w+)', s1)
e = re.findall(r'\((\w+)', s2)
if f:
    print(f)
if e:
    print(e)
The results are:
f = ['English']
e = ['English']
If you only want the first language, you should use search instead, like so:
f = re.search(r'\((\w+)', s1)
e = re.search(r'\((\w+)', s2)
if f:
    print(f.group(1))
if e:
    print(e.group(1))
This prints a string rather than a list, since search returns only the first match.

Widen the search to start the phrase with a parenthesis or comma+space, and end with a parenthesis or comma+space:
>>> re.findall(r'\s(?:\(|, )(.+)(?:\)|, )', s2)
['English, French']
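The ?: after an opening parenthesis marks a non-capturing group.
Because (.+) is greedy, the result is a single string containing every language. Split it on ', ' and index the resulting list to grab whichever language you're interested in.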
Since the strings you're searching are actually pretty tidy, you can also do this without regex:
>>> s1.split('(')[1].split(')')[0].split(', ')[0]
'English'
>>> s2.split('(')[1].split(')')[0].split(', ')[0]
'English'

You can just use this simple modification of your regular expression:
\s\(([^,\n\)]+)
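As a quick check, here is a minimal sketch that applies the pattern above to both sample strings. The helper name first_language is hypothetical and only used for illustration.

```python
import re

# Pattern from the answer above: whitespace, "(", then capture everything
# up to the first comma, newline, or ")".
FIRST_LANGUAGE = re.compile(r'\s\(([^,\n\)]+)')

def first_language(text):
    """Return the first parenthesized language, or None if nothing matches."""
    match = FIRST_LANGUAGE.search(text)
    return match.group(1) if match else None

s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'
print(first_language(s1))  # English
print(first_language(s2))  # English
```

Returning None for a string without parentheses avoids an AttributeError from calling group on a failed match.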

You're looking for the text after the first opening parenthesis and before the first comma. So, a regex that would match this is:
\(([^,]*),
(Your answer will be in group 1)
Finally, I'd like to point you to https://www.debuggex.com/, which will help you easily visualize your regex questions.
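One caveat worth checking: the pattern \(([^,]*), requires a comma, so it may not match a string that lists a single language. The sketch below compares it with an assumed variant (not from the answer) that also stops at a closing parenthesis.

```python
import re

s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'

# Pattern from the answer above: needs a comma after the captured text.
with_comma = re.compile(r'\(([^,]*),')
# Assumed variant for illustration: stop at either "," or ")".
comma_or_paren = re.compile(r'\(([^,)]*)[,)]')

print(with_comma.search(s1))               # None, because s1 has no comma
print(with_comma.search(s2).group(1))      # English
print(comma_or_paren.search(s1).group(1))  # English
print(comma_or_paren.search(s2).group(1))  # English
```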

Assuming the languages are always at the end of the string, surrounded by parentheses and separated by a comma and a space:
(?<=\()\w+(?=(?:, \w+)*\)$)
The idea is:
(?<=\() - the match must be preceded by an opening parenthesis, (
\w+ - the language itself is a sequence of word characters
(?=(?:, \w+)*\)$) - after it, there can be zero or more other languages, each preceded by a comma and a space, and the closing parenthesis, ), must be the last character of the string
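A short sketch of the lookaround pattern in use. The third test string is an assumption added here to show that the match fails when the parenthesized list is not at the end of the string.

```python
import re

# Lookbehind for "(", the language, then a lookahead that allows more
# ", language" items before a ")" that ends the string.
pattern = re.compile(r'(?<=\()\w+(?=(?:, \w+)*\)$)')

samples = [
    'Audio: Dolby Digital 5.1 (English)',
    'Audio: Stereo (English, French)',
    'Audio: Stereo (English) extra text',  # hypothetical input, list not at the end
]

results = []
for s in samples:
    m = pattern.search(s)
    # group(0) is the whole match; the lookarounds consume no characters.
    results.append(m.group(0) if m else None)

print(results)  # ['English', 'English', None]
```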

Related

Find String Between Two Substrings in Python When There is A Space After First Substring

While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and the space that always appear at the start of my extracted string, thus only printing "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the target string without any further modification. How can I do this?
Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
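To make the character-class behavior concrete, here is a small sketch that runs the original pattern and prints what it actually matched.

```python
import re

s = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'

# Unescaped, [?] is a character class that matches a single "?" character.
class_only = re.fullmatch(r'[?]', '?')
print(class_only is not None)  # True

unescaped = re.search(r'[?](.*?)Reduced', s)
# The match starts at the "?", so the "]" and the space land in group 1.
print(repr(unescaped.group(0)))  # '?] I want this string.Reduced'
print(repr(unescaped.group(1)))  # '] I want this string.'
```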
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(rx, s)
if m:
    print(m.group(1))
# => I want this string.
Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'
The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.
You [co]/[sho]uld use Positive Lookbehind (?<=\[\?\]) :
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.
Like the other answer, this might not be necessary, or it may just be too long-winded for Python.
This method uses one of the common string methods, find.
str.find(sub, start, end) returns the lowest index in str at which sub is found within the slice str[start:end], or -1 if it is not found.
In each iteration, the index of [?] is retrieved, followed by the index of Reduced. The substring between them is printed.
Every time a [?]...Reduced pair is found, the index is moved past it and the search continues from there through the rest of the string.
Code
s = ' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
idx = s.find('[?]')
while idx != -1:  # compare by value; "is not" tests identity, not equality
    start = idx
    end = s.find('Reduced', idx)
    print(s[start+3:end].strip())
    idx = s.find('[?]', end)
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.
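For comparison, the same three quotes can probably be collected in one call with re.findall. This is a swapped-in technique, not the find-based loop described above, and it assumes every [?] is eventually followed by Reduced.

```python
import re

s = (' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced'
     '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>')

# \[\?\] matches the literal marker; the surrounding \s* keeps leading and
# trailing whitespace out of the captured group.
quotes = re.findall(r'\[\?\]\s*(.*?)\s*Reduced', s)

for quote in quotes:
    print(quote)
```

With a single capture group, findall returns a list of the captured strings, so no manual index bookkeeping is needed.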

Get text between last forward slash and then before first hyphen

I need to parse a URL, and get 1585710 from :
http://www.example.com/0/100013573/1585710-key-description-goes-here
So that means it's between the last / and before the first -
I have very little experience with regex, it's a really hard concept for me to understand.
Any help or assistance would be much appreciated
Edit: Using Python.
Use the below regex and get the number from group index 1.
^.*\/([^-]*)-.*$
Code:
>>> import re
>>> s = "http://www.example.com/0/100013573/1585710-key-description-goes-here"
>>> m = re.search(r'^.*\/([^-]*)-.*$', s, re.M)
>>> m
<_sre.SRE_Match object at 0x7f8a51f07558>
>>> m.group(1)
'1585710'
>>> m = re.search(r'.*\/([^-]*)-.*', s)
>>> m.group(1)
'1585710'
>>> m = re.search(r'.*\/([^-]*)', s)
>>> m.group(1)
'1585710'
Explanation:
.*\/ Matches all the characters up to and including the last / symbol, because .* is greedy.
([^-]*) Captures zero or more characters that are not -.
-.* Matches the - and all the remaining characters.
group(1) contains the characters which are captured by the first capturing group. Printing the group(1) will give the desired result.
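As a cross-check of the explanation above, here is a sketch that gets the same value with plain string methods (a swapped-in technique, not the regex from the answer) and compares it with the regex result. The function name id_from_url is hypothetical.

```python
import re

def id_from_url(url):
    """Return the text between the last "/" and the first "-" after it."""
    last_segment = url.rsplit('/', 1)[-1]   # everything after the last "/"
    return last_segment.split('-', 1)[0]    # everything before the first "-"

url = 'http://www.example.com/0/100013573/1585710-key-description-goes-here'

by_split = id_from_url(url)
by_regex = re.search(r'^.*\/([^-]*)-.*$', url).group(1)

print(by_split)  # 1585710
print(by_regex)  # 1585710
```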
You can use matching groups in order to extract the number with the regex \/(\d+)-:
import re
s = 'http://www.example.com/0/100013573/1585710-key-description-goes-here'
m = re.search(r'\/(\d+)-', s)
print(m.group(1))  # 1585710
Well, if you need to find any strings between a / and a -, you could simply do:
/.*-
Since . is any char, and * is any amount. However, this poses a problem, because you could get the whole /www.example.com/0/100013573/1585710-key-description-goes, which is between / and a -. So, what you need to do is to search for anything that is not a / and -:
/[^/-]*-
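Inside [], a leading ^ negates the class, so [^/-] matches any single character that is neither / nor -. Without the ^, a class is, roughly, an OR list of its characters.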
Hope that helps.
EDIT: No, it doesn't help, as user rici mentioned, when you have a - in your url name (as in www.lala-lele.com).
To make sure you got the last /, you can match the rest of your string, making sure it doesn't have any / in it until the end ($), as in:
/[^/-]*-[^/]*$
And, to get just the string inside it, you can wrap that part in a capture group:
/([^/-]*)-[^/]*$
In Python's re syntax, unescaped ( and ) mark the part you want as the output of your regex, while \( and \) would match literal parentheses.
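A sketch of the anchored pattern in Python, where unescaped parentheses form the capture group. The second URL is a made-up example with a hyphen in the host name, as in the www.lala-lele.com case mentioned above.

```python
import re

# "/" then the captured run without "/" or "-", then "-", then no further
# "/" up to the end of the string, which pins the match to the last segment.
pattern = re.compile(r'/([^/-]*)-[^/]*$')

urls = [
    'http://www.example.com/0/100013573/1585710-key-description-goes-here',
    'http://www.lala-lele.com/0/100013573/1585710-key-description-goes-here',
]

ids = [pattern.search(u).group(1) for u in urls]
print(ids)  # ['1585710', '1585710']
```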

Regex help to match groups

I am trying to write a regex for matching a text file that has multiple lines such as :
* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469
My intention is to match the "0050.56aa.3480 " and "Veth1379" and put them in group(1) & group(2) for using later on.
The regex I wrote is :
\*\s*\d{1,}\s*(\d{1,}\.(?:[a-z][a-z]*[0-9]+[a-z0-9]*)\.\d{1,})\s*(?:[a-z][a-z]+)\s*\d{1,}\s*.\s*.\s*((?:[a-z][a-z]*[0-9]+[a-z0-9]*))
But it does not seem to be working when I test at:
http://www.pythonregex.com/
Could someone point to any obvious error I am doing here.
Thanks,
~Newbie
Try this:
^\* [0-9]{3} +([0-9]{4}\.[0-9a-z]{4}\.[0-9a-z]{4}).*(Veth[0-9]{4})$
The first part is in capture group one, the "Veth" code in capture group two.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a list of online testers in the bottom section.
I don't think you need a regex for this:
with open('myfile', 'r') as f:
    for line in f:
        fields = line.split()
        # fields[0] is '*' and fields[1] is the leading number
        print("\n" + fields[2] + "\n" + fields[7])
A very strict version would look something like this:
^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$
Here I assume that:
the columns will be pretty much the same,
your first match group contains a group of decimal digits and two groups of lower-case hex digits,
and the last word can be anything.
A few notes:
\d+ is equivalent to \d{1,} or [0-9]{1,}, but reads better (imo)
use \. to match a literal ., as . would simply match anything
[a-z]{2} is equivalent to [a-z][a-z], but reads better (my opinion, again)
however, you might want to use \w instead to match a word character
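To see the strict pattern at work, here is a sketch that runs it over the two sample lines from the question with re.MULTILINE, so that ^ and $ apply to each line. The variable names are illustrative.

```python
import re

text = (
    "* 964 0050.56aa.3480 dynamic 200 F F Veth1379\n"
    "* 930 0025.b52a.dd7e static 0 F F Veth1469\n"
)

row = re.compile(
    r'^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$',
    re.MULTILINE,
)

# With two capture groups, findall returns one (address, interface) tuple per line.
pairs = row.findall(text)
for address, interface in pairs:
    print(address, interface)
```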
This will do it:
reobj = re.compile(r"^.*?([\w]{4}\.[\w]{4}\.[\w]{4}).*?([\w]+)$", re.IGNORECASE | re.MULTILINE)
match = reobj.search(subject)
if match:
    group1 = match.group(1)
    group2 = match.group(2)
else:
    group1 = group2 = ""

Finding part of string using regular expressions

I'm pretty new in Python and don't really know regex. I've got some strings like:
a = "Tom Hanks XYZ doesn't really matter"
b = "Julia Roberts XYZ don't worry be happy"
c = "Morgan Freeman XYZ all the best"
In the middle of each string there's the word XYZ and then some text. I need a regex that will find and match this part, more precisely: from XYZ to the end of the string.
Unless there is a specific requirement to do this with regex, a non-regex solution will work fine here.
There are two possible ways you can approach this problem
1.
Given
a = "Tom Hanks XYZ doesn't really matter"
Partition the string with the separator, preceded with a space
''.join(a.partition(" XYZ")[1:])[1:]
Please note, if the separator string does not exist this will return an empty string.
2.
a[a.index(" XYZ") + 1:]
This will raise an exception ValueError: substring not found if the string is not found
Use the following expression
(XYZ.*)
What this does is start capturing when it sees the letters "XYZ" and matches anything beyond that zero or more times.
m = re.search("(XYZ.*)", a)
If you want to show that part of the string:
print(m.groups()[0])
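A minimal sketch that wraps the regex approach in a function and handles the no-match case. The name from_marker and the default marker argument are assumptions for illustration.

```python
import re

def from_marker(text, marker='XYZ'):
    """Return the text from marker to the end of the line, or None if absent."""
    # re.escape keeps the marker literal even if it contains regex metacharacters.
    match = re.search(re.escape(marker) + r'.*', text)
    return match.group(0) if match else None

a = "Tom Hanks XYZ doesn't really matter"
c = "Morgan Freeman XYZ all the best"

print(from_marker(a))           # XYZ doesn't really matter
print(from_marker(c))           # XYZ all the best
print(from_marker("no marker")) # None
```

Note that . does not match a newline by default, so the result stops at the end of the line that contains the marker.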

Python Regular Expression Matching: ## ##

I'm searching a file line by line for the occurrence of ##random_string##. It works except for the case of multiple #...
pattern='##(.*?)##'
prog=re.compile(pattern)
string='lala ###hey## there'
result=prog.search(string)
print re.sub(result.group(1), 'FOUND', string)
Desired Output:
"lala #FOUND there"
Instead I get the following because it's grabbing the whole ###hey##:
"lala FOUND there"
So how would I ignore any number of # at the beginning or end, and only capture "##string##"?
To match at least two hashes at either end:
pattern='##+(.*?)##+'
Your problem is with your inner match. You use ., which matches any character that isn't a line end, and that means it matches # as well. So when it gets ###hey##, it matches (.*?) to #hey.
The easy solution is to exclude the # character from the matchable set:
prog = re.compile(r'##([^#]*)##')
Protip: Use raw strings (e.g. r'') for regular expressions so you don't have to go crazy with backslash escapes.
Trying to allow # inside the hashes will make things much more complicated.
EDIT: If you do not want to allow blank inner text (i.e. "####" shouldn't match with an inner text of ""), then change it to:
prog = re.compile(r'##([^#]+)##')
+ means "one or more."
'^#{2,}([^#]*)#{2,}' -- any number of # >= 2 on either end
Be careful with lazy quantifiers like (.*?): because . also matches #, the pattern ##(.*?)## matches '###abc##' and captures '#abc'. Lazy quantifiers can also be slower because of the extra backtracking they cause.
Try the "block comment trick": /##((?:[^#]|#[^#])+?)##/
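A sketch of that pattern in Python, with the surrounding / delimiters dropped since Python's re module does not use them. It appears to allow a single # inside the text while still ending at the first ##. The test strings are made up for illustration.

```python
import re

# Each repetition consumes either one non-# character or a "#" that is
# followed by a non-# character, so a "##" pair can never be swallowed.
pattern = re.compile(r'##((?:[^#]|#[^#])+?)##')

inner_hash = pattern.search('##a#b##').group(1)
plain = pattern.search('##hey##').group(1)
empty = pattern.search('####')

print(inner_hash)  # a#b
print(plain)       # hey
print(empty)       # None, because the inner text must be non-empty
```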
Add + to the regex, which means "match one or more of the preceding character":
pattern='#+(.*?)#+'
prog=re.compile(pattern)
string='###HEY##'
result=prog.search(string)
print(result.group(1))
Output:
HEY
Have you considered doing it the non-regex way?
>>> string='lala ####hey## there'
>>> string.split("####")[1].split("#")[0]
'hey'
>>> import re
>>> text= 'lala ###hey## there'
>>> matcher= re.compile(r"##[^#]+##")
>>> print(matcher.sub("FOUND", text))
lala #FOUND there
