I have the following string:
s1 = AU,Singh Is "Ki,nng",2005,,,No,,,
I need to grab the title, 'Singh Is "Ki,nng"' using a regular expression.
So far I can grab everything before the title --
>>> re.split(r',\d{4}',s2)[0]
'AU,Singh Is "Ki,nng"'
But it is also grabbing the territory, AU. How would I only grab the title here?
use this pattern and check against 2nd match
((?:[^,"]*"[^"]*"[^",]*)+|[^,]+)
Demo
not sure what you want from the output but this might do it
re.search(".+?,(.*?),\d+.*",s1).group(1)
Related
Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think using regex to find the url start from "https" and end up '\' will be a good solution.
If so, how can I write the regex pattern?
I try something like this:
`
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', 'www.example.com/link_2.html']
But then, I need to add "https://" back and choose the first item in the return.
Is there any other solution?
You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) this matches your link but only captures the part after https:// since you used the capturing token after that.
Updated Regex:
(https\:\/\/[^\\\\)]+) this matches the link and also captures the whole token (escaped special characters to avoid errors)
In Code:
import re
input = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
print(re.findall("(https\:\/\/[^\\\\)]+)", input))
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use (https\:\/\/([^\\\\)]+).html) to get the link with https:// and without it as a tuple. (this also avoids the ending ' that you might get in some links)
If you want only the first one, simply do output[0].
Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html
I want to extract a full URL from a string.
My code is:
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print re.match(r'(ftp|http)://.*\.(jpg|png)$', data)
Output:
None
Expected Output
http://www.google.com/a.jpg
I found so many questions on StackOverflow, but none worked for me.
I have seen many posts and this is not a duplicate. Please help me! Thanks.
You were close!
Try this instead:
r'(ftp|http)://.*\.(jpg|png)'
You can visualize this here.
I would also make this non-greedy like this:
r'(ftp|http)://.*?\.(jpg|png)'
You can visualize this greedy vs. non-greedy behavior here and here.
By default, .* will match as much text as possible, but you want to match as little text as possible.
Your $ anchors the match at the end of the line, but the end of the URL is not the end of the line, in your example.
Another problem is that you're using re.match() and not re.search(). Using re.match() starts the match at the beginning of the string, and re.search() searches anywhere in the string. See here for more information.
You should use search instead of match.
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
url=re.search('(ftp|http)://.*\.(jpg|png)', data)
if url:
print url.group(0)
Find the start of the url by using find(http:// , ftp://) . Find the end of url using find(jpg , png). Now get the substring
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')
kk = data[start:]
end = kk.find('.jpg')
print kk[0:end+4]
I want a python regex expression that can pull the contents between script[" and "] but there are other "]" which worries me
expected:
{bunch of javascript here. [\"apple\"] test}
my attempt:
javascript\[\"(.*)"]
target string:
//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]
link to the regex
You can't match nested brackets with the re module since it doesn't have the recursion feature to do that. However, in your example you can skip the innermost square brackets if you choose to ignore all brackets enclosed between double quotes.
try something like this:
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
Note: I assumed here that the predicate is only related to the "text" content of the script node (and not an attribute, a number of item or an axe).
It's very hard to understand exactly what you want to achieve because of the way you have written the question. However if you are looking for the firs instance of "] AFTER a } then try this:
\["([^}]+}.*?)"\]
Link to the regex
This also would work:
\["(.*?}.*?)"\]
Link to the second regex example
Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?
Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you wanted all tags, then you should consider changing it to be non-greedy (ie - .*?):
print re.findall(r'<title>(.*?)</title>', s)
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.
Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.
It will be much easier using BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
How can I get the value from the following strings using one regular expression?
/*##debug_string:value/##*/
or
/*##debug_string:1234/##*/
or
/*##debug_string:http://stackoverflow.com//##*/
The result should be
value
1234
http://stackoverflow.com/
Trying to read behind your pattern
re.findall("/\*##debug_string:(.*?)/##\*/", your_string)
Note that your variations cannot work because you didn't escape the *. In regular expressions, * mean a repetition of the previous character/group. If you really mean the * character, you must use \*.
import re
print re.findall("/\*##debug_string:(.*?)/##\*/", "/*##debug_string:value/##*/")
print re.findall("/\*##debug_string:(.*?)/##\*/", "/*##debug_string:1234/##*/")
print re.findall("/\*##debug_string:(.*?)/##\*/", "/*##debug_string:http://stackoverflow.com//##*/")
Executes as:
['value']
['1234']
['http://stackoverflow.com/']
EDIT: Ok I see that you can have a URL. I've amended the pattern to take it into account.
Use this regex:
[^:]+:([^/]+)
And use capture group #1 for your value.
Live Demo: http://www.rubular.com/r/FxFnpfPHFn
Your regex will be something like: .*:(.*)/.+. Group 1 will be what you are looking for. However this is a REALLY inclusive regex, you might want to post some more details so that you can create some more restrictions.
Assuming that the format stays consistent:
re.findall('debug_string:([^\/]+)\/##', string)