Modifying a re.split statement


I have the following string:
s1 = 'AU,Singh Is "Ki,nng",2005,,,No,,,'
I need to grab the title, 'Singh Is "Ki,nng"', using a regular expression.
So far I can grab everything before the year:
>>> re.split(r',\d{4}', s1)[0]
'AU,Singh Is "Ki,nng"'
But that also includes the territory, AU. How do I grab only the title?

Use this pattern and take the second match:
((?:[^,"]*"[^"]*"[^",]*)+|[^,]+)
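A minimal sketch of how that pattern could be applied, assuming s1 holds the string from the question. The names field_pattern, fields and title are illustrative and not part of the original answer.

```python
import re

s1 = 'AU,Singh Is "Ki,nng",2005,,,No,,,'

# Either a run that contains balanced double quotes (commas inside the
# quotes are allowed), or a plain run of non-comma characters.
field_pattern = re.compile(r'((?:[^,"]*"[^"]*"[^",]*)+|[^,]+)')

fields = field_pattern.findall(s1)  # empty fields produce no match
title = fields[1]                   # the second match is the title
print(title)
```

Because empty fields yield no match, the index is only reliable when the fields before the title are never empty.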

I'm not sure exactly what output you want, but this might do it:
re.search(r".+?,(.*?),\d+.*", s1).group(1)
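As a hedged sketch, the same search can be wrapped so that a line without a match returns None instead of raising AttributeError. The helper name extract_title is hypothetical.

```python
import re

s1 = 'AU,Singh Is "Ki,nng",2005,,,No,,,'

def extract_title(line):
    """Return the text between the first comma and the next ',<digits>'."""
    match = re.search(r".+?,(.*?),\d+.*", line)
    # re.search returns None when nothing matches, so guard before .group()
    return match.group(1) if match else None

print(extract_title(s1))
```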

Related

How to use Regex to extract a string from a specific string until a specific symbol in python?

Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think using a regex to find the URL that starts with "https" and ends at '\' would be a good solution.
If so, how can I write the regex pattern?
I tried something like this:
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', "www.example.com/link_2.html'"]
But then I need to add "https://" back and pick the first item from the result.
Is there any other solution?

You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) matches your link but captures only the part after https://, because the capturing group starts after it.
Updated regex:
(https://[^\\\\)]+) matches the link and captures all of it. (The : and / characters have no special meaning in a Python regex, so they do not need escaping.)
In code:
import re
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
print(re.findall("(https://[^\\\\)]+)", example_text))
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use (https://([^\\\\)]+\.html)) to get each link both with and without https:// as a tuple. Ending the match at .html also drops the trailing ' that the second link picks up above.
If you want only the first one, take element [0] of the returned list.

Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html
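A small sketch of the same idea with a guard for the no-match case, assuming example_text is the string from the question. The names url_pattern and first_url are illustrative.

```python
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# Stop at the first backslash or single quote after "https://".
url_pattern = re.compile(r"https://[^\\']+")

def first_url(text):
    """Return the first URL found in text, or None if there is none."""
    match = url_pattern.search(text)
    return match.group() if match else None

print(first_url(example_text))
print(url_pattern.findall(example_text))  # every URL, if all are needed
```

Excluding the single quote from the character class is what keeps the closing ' of the embedded bytes literal out of the second URL.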

extract URL from string in python

I want to extract a full URL from a string.
My code is:
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print(re.match(r'(ftp|http)://.*\.(jpg|png)$', data))
Output:
None
Expected Output
http://www.google.com/a.jpg
I found so many questions on StackOverflow, but none worked for me.
I have seen many posts and this is not a duplicate. Please help me! Thanks.

You were close!
Try this instead:
r'(ftp|http)://.*\.(jpg|png)'
Your $ anchors the match at the end of the string, but in your example the URL does not end where the string ends, so the $ has to go.
I would also make this non-greedy, like this:
r'(ftp|http)://.*?\.(jpg|png)'
By default, .* matches as much text as possible, whereas here you want to match as little as possible so that the match stops at the first image extension.
Another problem is that you're using re.match() and not re.search(). re.match() only matches at the beginning of the string, whereas re.search() looks for a match anywhere in the string.
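The three points above can be checked with a short sketch. The string two_urls is made up for illustration and is not from the question.

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"

# re.match only tries position 0, and $ demands the URL end the string.
anchored = re.match(r'(ftp|http)://.*\.(jpg|png)$', data)

# re.search scans the whole string; without $ the match may end mid-string.
found = re.search(r'(ftp|http)://.*?\.(jpg|png)', data)
url = found.group(0) if found else None

# Greedy vs. non-greedy on a string holding two URLs (hypothetical input).
two_urls = "http://a.example/x.jpg and http://b.example/y.png"
greedy = re.search(r'(ftp|http)://.*\.(jpg|png)', two_urls).group(0)
lazy = re.search(r'(ftp|http)://.*?\.(jpg|png)', two_urls).group(0)

print(anchored, url, greedy, lazy, sep="\n")
```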

You should use search instead of match.
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    print(url.group(0))

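Find the start of the URL with find('http://') or find('ftp://'), find the end with find('.jpg') or find('.png'), and then take the substring:
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')   # index where the URL begins
kk = data[start:]              # everything from the URL onwards
end = kk.find('.jpg')          # index of the extension within kk
print(kk[0:end+4])             # +4 keeps the '.jpg' itself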

python: regex to extract content between two text

I want a Python regex that can pull the contents between script[" and "], but the string contains other "]" sequences, which worries me.
expected:
{bunch of javascript here. [\"apple\"] test}
my attempt:
javascript\[\"(.*)"]
target string:
//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]

You can't match nested brackets with the re module since it doesn't have the recursion feature to do that. However, in your example you can skip the innermost square brackets if you choose to ignore all brackets enclosed between double quotes.
try something like this:
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
Note: I assumed here that the predicate applies only to the text content of the script node (and not to an attribute, an item position, or an axis).
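A quick sketch of that pattern run against the target string from the question, next to a greedy pattern for contrast. The variable names target, content and greedy are illustrative.

```python
import re

target = r'//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]'

# Unescaped text, then any number of (backslash + any char, unescaped text)
# runs, so an escaped \" never ends the capture.
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
match = p.search(target)
content = match.group(1) if match else None
print(content)

# A greedy .* runs on to the last "] in the string instead.
greedy = re.search(r'script\["(.*)"]', target).group(1)
print(greedy)
```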

It's hard to tell exactly what you want to achieve from the way the question is written. However, if you are looking for the first instance of "] after a }, then try this:
\["([^}]+}.*?)"\]
This would also work:
\["(.*?}.*?)"\]

How do I ensure that re.findall() stops at the right place?

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?

Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you want all of the tags, change the pattern to be non-greedy (i.e. .*?):
print(re.findall(r'<title>(.*?)</title>', s))
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.

Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy

re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.

This is much easier with the BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
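As a rough illustration of parser-based extraction, here is a sketch that uses the standard library's html.parser as a stand-in, so it runs without installing anything. It is not BeautifulSoup's API, and the class name TitleCollector is hypothetical.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text content of every <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data  # text may arrive in several chunks

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
collector = TitleCollector()
collector.feed(a)
collector.close()
print(collector.titles)
```

A parser tracks element boundaries itself, so there is no greedy or non-greedy quantifier to get wrong.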

How can I extract two values from a string like this using a regular expression?

How can I get the value from the following strings using one regular expression?
/*##debug_string:value/##*/
or
/*##debug_string:1234/##*/
or
/*##debug_string:http://stackoverflow.com//##*/
The result should be
value
1234
http://stackoverflow.com/

Reading between the lines of your pattern:
re.findall(r"/\*##debug_string:(.*?)/##\*/", your_string)
Note that your variations cannot work because you didn't escape the *. In regular expressions, * means a repetition of the previous character or group. If you really mean the * character, you must use \*.
import re
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:value/##*/"))
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:1234/##*/"))
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:http://stackoverflow.com//##*/"))
Executes as:
['value']
['1234']
['http://stackoverflow.com/']
EDIT: Ok I see that you can have a URL. I've amended the pattern to take it into account.

Use this regex:
[^:]+:([^/]+)
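Use capture group 1 for your value. Note that [^/]+ stops at the first slash, so for the URL example it captures only http: rather than the whole URL.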
Live Demo: http://www.rubular.com/r/FxFnpfPHFn

Your regex will be something like .*:(.*)/.+, where group 1 is what you are looking for. However, this is a very permissive regex; you might want to post some more details so that it can be made more restrictive.

Assuming that the format stays consistent:
re.findall(r'debug_string:([^/]+)/##', string)
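To see how the approaches differ on the three inputs from the question, here is a small comparison sketch. The names samples, lazy and no_slash are illustrative.

```python
import re

samples = [
    "/*##debug_string:value/##*/",
    "/*##debug_string:1234/##*/",
    "/*##debug_string:http://stackoverflow.com//##*/",
]

# Lazy capture up to the closing "/##*/" marker.
lazy = re.compile(r"/\*##debug_string:(.*?)/##\*/")
# Capture that may not contain a slash at all.
no_slash = re.compile(r"debug_string:([^/]+)/##")

lazy_results = [lazy.findall(s) for s in samples]
no_slash_results = [no_slash.findall(s) for s in samples]

print(lazy_results)
print(no_slash_results)
```

The slash-excluding pattern finds nothing for the URL input, because the value itself contains slashes; the lazy pattern handles all three.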
