I have to create a unique textual marker in my document using Python 2.7, with the following function:
def build_textual_marker(number, id):
    return "[xxxcixxx[[_'" + str(number) + "'] [_'" + id + "']]xxxcixxx]"
The output looks like this: [xxxcixxx[[_'1'] [_'24']]xxxcixxx]
I then have to catch every occurrence of this expression in my document. I came up with the following regular expression, but it does not seem to work:
marker_regex = "\[xxxcixxx\[(\[_*?\])\s(\[_*?\])\]xxxcixxx\]"
How should I write the correct regex in this case?
Try using
\[xxxcixxx\[\[_'.*?'\] \[_'.*?'\]\]xxxcixxx\]
Demo: http://regexr.com/3d887
Rather than the lazy star, you might as well get along with a digit class directly (the function build_textual_marker takes a number parameter, doesn't it?):
\[xxxcixxx\[(\[_'\d+'\])\s(\[_'\d+'\])\]xxxcixxx\]
See a demo on regex101.com.
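To pull the two values out rather than just locate the marker, the capture groups can be moved inside the quotes. The following is only a sketch: it assumes both fields are always digits, and it wraps id in str() so that integer ids work too (a small change from the function in the question). The names MARKER_REGEX, text and pairs are made up for the example.

```python
import re

def build_textual_marker(number, id):
    # Same marker format as in the question; str(id) is an added safeguard
    return "[xxxcixxx[[_'" + str(number) + "'] [_'" + str(id) + "']]xxxcixxx]"

# Groups capture only the digits between the quotes (assumes numeric fields)
MARKER_REGEX = re.compile(r"\[xxxcixxx\[\[_'(\d+)'\] \[_'(\d+)'\]\]xxxcixxx\]")

text = "intro " + build_textual_marker(1, 24) + " middle " + build_textual_marker(7, "305") + " end"

# findall returns one (number, id) tuple per marker found
pairs = MARKER_REGEX.findall(text)
print(pairs)  # [('1', '24'), ('7', '305')]
```

If the id can contain non-digit characters, the second group would need something looser, such as ([^']*).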
I'm trying to extract digits from a unicode string. The strings look like "raised by 64 backers" or "raised by 2062 backers". I tried many different things, but the following code is the only one that actually worked.
backers = browser.find_element_by_xpath('//span[@gogo-test="backers"]').text
match = re.search(r'(\d+)', backers)
print(match.group(0))
Since I'm not sure how often I'll need to extract substrings from strings, and I don't want to be creating tons of extra variables and lines of code, I'm wondering if there's a shorter way to accomplish this?
I know I could do something like this.
def extract_digits(string):
    return re.search(r'(\d+)', string)
But I was hoping for a one liner, so that I could structure the script without using an additional function like so.
backers = ...
title = ...
description = ...
...
I'd like to do something similar to the following, but it doesn't work as intended.
backers = re.search(r'(\d+)', browser.find_element_by_xpath('//span[@gogo-test="backers"]').text)
And the output looks like this.
<_sre.SRE_Match object at 0x000000000542FD50>
Any way to deal with this?!
As an option, you can skip the regex and use Python's built-in str.isdigit() (no additional imports needed):
digit = [sub for sub in browser.find_element_by_xpath('//span[@gogo-test="backers"]').text.split() if sub.isdigit()][0]
You can try this:
number = re.findall(r'\b\d+\b', 'raised by 64 backers')
output:
['64']
So the method could be like this:
def extract_digits(string):
    return re.findall(r'\b\d+\b', string)
EDIT: since you want everything in one line, try this:
import re
backers = re.findall(r'\b\d+\b', browser.find_element_by_xpath('//span[@gogo-test="backers"]').text)[0]
PS:
search ⇒ finds the first match anywhere in the string and returns a match object (or None).
findall ⇒ finds every match anywhere in the string and returns a list of strings.
Documentation for re.search:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
Documentation link: docs.python.org/2/library/re.html
So to do the same with search use this:
backers = re.search(r'(\d+)', browser.find_element_by_xpath('//span[@gogo-test="backers"]').text).group(0)
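One caveat with chaining .group(0) directly onto re.search: when the text contains no digits, search returns None and the call raises AttributeError. If that can happen on your pages, a tiny helper keeps each call site to one line. This is a sketch only; first_number and its default parameter are names invented here, and plain strings stand in for the Selenium element text.

```python
import re

def first_number(text, default=None):
    """Return the first run of digits in text as an int, or default if none."""
    match = re.search(r'\d+', text)
    # search returns None when nothing matches, so guard before calling group()
    return int(match.group(0)) if match else default

print(first_number(u'raised by 64 backers'))    # 64
print(first_number(u'raised by 2062 backers'))  # 2062
print(first_number(u'no digits here'))          # None
```

With that in place, each field can still be assigned on a single line, for example backers = first_number(element_text).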
I have been experimenting with Python's Regex Module: Re.
I decided to write a simple expression that searches for links (href="url") in a file.
Here is my Regex: href *= *(\"|\').*\1
I tried the expression out on a site called GSkinner, where it matched both links. The results are here, along with the code.
When I tried the same expression with Python's re module, I used the following code:
lines = """Code found in link"""
results = re.findall(r"href *= *(\"|\').*\1", lines)
print results # Outputs: ['"', '"'] instead of the two provided links
Why do the results contain only the quote characters instead of the links?
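findall returns only what the groups capture (the whole match is returned only when the pattern has no groups). Your pattern's single group is the quote character, so that is all you get back. You have to capture the value you want as well:
r"href *= *(\"|\')(.*?)\1"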
All together you may want to use something like:
results = [x[1] for x in re.findall(r"href *= *(\"|\')(.*?)\1", lines)]
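Here is a small self-contained comparison of the two patterns. The html string is a made-up sample (the original file is not shown in the question), and the variable names are illustrative.

```python
import re

# Hypothetical sample input with one double-quoted and one single-quoted link
html = '<a href="http://example.com/a">A</a> <a href = \'http://example.com/b\'>B</a>'

# One group: findall returns only the captured quote characters
one_group = re.findall(r"href *= *(\"|').*?\1", html)
print(one_group)  # ['"', "'"]

# Two groups: findall returns (quote, url) tuples, so keep the second item
two_groups = re.findall(r"href *= *(\"|')(.*?)\1", html)
urls = [url for _quote, url in two_groups]
print(urls)  # ['http://example.com/a', 'http://example.com/b']
```

The lazy .*? matters when several links share a line; a greedy .* would run on to the last matching quote.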
I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Let's say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
This is what I have so far:
l = list(string)
# Transform the string into a list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int) is True:
            if isinstance((p+2), int) is True:
                delp = p+3
                a = p-3
                del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
The idea is to get the index of "E" and check whether the two characters after it are integers,
then delete everything after the second integer.
However, this does not work if there is an "E" anywhere else in the string.
At the moment the result I get is:
this is tEst
because it finds the index of the first "E" in the list and deletes everything after index+3.
I guess my question is: how do I get the index in the list only where a particular combination of characters exists?
I can't seem to find out how.
Thanks for everyone's answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than by just copying what others write :)
This is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods: str.replace to swap the periods for spaces, and str.title to put the result in Title Case.
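Put together, those steps could look roughly like this. It is a sketch rather than a finished renamer: clean_name is a name invented here, and returning the input unchanged when no "E plus two digits" tag is found is just one possible choice.

```python
import re

def clean_name(filename):
    # Find the first "E" that is followed by exactly the two-digit tag we want
    match = re.search(r'E\d\d', filename)
    if match is None:
        return filename  # assumption: leave names without a tag untouched
    # Slice notation keeps everything up to the end of the match
    kept = filename[:match.end()]
    # Periods become spaces, then each word is put in Title Case
    return kept.replace('.', ' ').title()

print(clean_name("this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"))  # This Is Test3 E00
```

Because the pattern requires two digits after the "E", the "E" inside "tEst3" is skipped.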
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the re module. Regex patterns tend to use a lot of backslashes, so it's common in Python to prefix the pattern string with 'r' (a raw string), which tells the Python interpreter not to treat backslashes as the start of escape sequences. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # Group 0 is the entire match; end(0) is the index just past its last character
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' holds a character from the list 'l', not an index. You compare it with 'E' (which works), but then try to add 1 to it, and adding an integer to a string raises a TypeError.
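If you want to keep the loop-based approach rather than switch to a regex, iterating with enumerate gives you the index and the character together, and str.isdigit checks the following characters. This is a minimal sketch; index_of_episode_tag is a made-up name, and it returns -1 when nothing is found.

```python
def index_of_episode_tag(text):
    """Return the index of the first 'E' followed by two digits, or -1."""
    for i, ch in enumerate(text):
        following = text[i + 1:i + 3]  # the two characters after position i
        # The length check rejects an 'E' with only one character left after it
        if ch == "E" and len(following) == 2 and following.isdigit():
            return i
    return -1

name = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
p = index_of_episode_tag(name)
print(p)             # 14
print(name[:p + 3])  # this.is.tEst3.E00
```

Note that isinstance(x, int) only tests the type of a Python object; it cannot tell whether a character such as "0" looks like a digit, which is what isdigit is for.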
I'm trying to write a Django query that will filter by a particular regex pattern.
I want to filter on a code field and pull out any rows where the code is one or more non-digit characters, followed by a given number, followed by non-digit characters (white space is fine).
Say some codes are AJDP8EP, jsif28EP, EROE88, oskdpoeks8.
With the number 8, I want my results to return: AJDP8EP, oskdpoeks8.
This is my query, but it's not matching properly. number is a variable.
results = Book.objects.filter(author__contains = firstname,type = "Fiction").filter(code__regex = r'^(\D+)(number)(\D+)')
You cannot use a variable inside a quoted regex; the text "number" is matched literally. Concatenate the strings instead:
r"^(\D+)(" + str(number) + r")(\D+)"
str() converts your number variable to a string in case it is not one already.
Also, as indicated in one of the comments to your question, oskdpoeks8 will not match your pattern, because the final \D+ requires at least one non-digit after the number. If you want to catch cases where the number may come at the end of the code, one solution would be:
r"^(\D+)(" + str(number) + r")(\D*)"
Note the replacement of the + with an * to catch that case of zero occurrences.
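The pattern can be checked outside Django with Python's re module before putting it in a query. The sketch below adds two things that are not in the answer above: re.escape around the number, and a trailing $ so that EROE88 (where another digit follows the 8) is rejected. code_pattern is a hypothetical helper name. Keep in mind that Django's regex lookup uses the database backend's regex syntax, so whether \D is understood depends on your database.

```python
import re

def code_pattern(number):
    # Non-digits, then the number, then only non-digits up to the end
    return r'^\D+' + re.escape(str(number)) + r'\D*$'

codes = ["AJDP8EP", "jsif28EP", "EROE88", "oskdpoeks8"]
pattern = code_pattern(8)

# Local check with Python's re; the same string would be passed to code__regex
matching = [code for code in codes if re.search(pattern, code)]
print(matching)  # ['AJDP8EP', 'oskdpoeks8']
```

In the query this would be used as .filter(code__regex=code_pattern(number)).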
With Python 3.6 or newer you can use f-strings to put variables into the regex and use it in queries. Literal braces in the pattern, such as a {2} quantifier, have to be doubled inside an f-string. For example:
Car.objects.filter(
    car_code__regex=rf'^{settings.SOME_SETTING}0*(\d+)$'
)
For the OP's case:
results = Book.objects.filter(author__contains = firstname,type = "Fiction").filter(code__regex = rf'^(\D+){number}(\D+)')
Is it possible to perform simple math on the output from Python regular expressions?
I have a large file where I need to divide numbers following a ")" by 100. For instance, I would convert the following line containing )75 and )2:
((words:0.23)75:0.55(morewords:0.1)2:0.55);
to )0.75 and )0.02:
((words:0.23)0.75:0.55(morewords:0.1)0.02:0.55);
My first thought was to use re.sub using the search expression "\)\d+", but I don't know how to divide the integer following the parenthesis by 100, or if this is even possible using re.
Any thoughts on how to solve this? Thanks for your help!
You can do it by providing a function as the replacement:
import re
s = "((words:0.23)75:0.55(morewords:0.1)2:0.55);"
# The lambda receives each match object; group 1 holds the digits after ")"
s = re.sub(r"\)(\d+)", lambda m: ")" + str(float(m.group(1)) / 100), s)
print s
# ((words:0.23)0.75:0.55(morewords:0.1)0.02:0.55);
Incidentally, if you wanted to do it using BioPython's Newick tree parser instead, it would look like this:
from Bio import Phylo
# assuming you want to read from a string rather than a file
from StringIO import StringIO
tree = Phylo.read(StringIO(s), "newick")
for c in tree.get_nonterminals():
    if c.confidence is not None:
        c.confidence = c.confidence / 100.0
print tree.format("newick")
(While this particular operation takes more lines than the regex version, other operations involving trees might be made much easier with it.)
The replacement expression for re.sub can be a function. Write a function that takes the matched text, converts it to a number, divides it by 100, and then returns the string form of the result.
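As a sketch of that approach with a named function instead of a lambda (divide_by_100, line and converted are names chosen for this example):

```python
import re

def divide_by_100(match):
    # match.group(1) is the run of digits captured after the ")"
    value = int(match.group(1)) / 100.0
    # Put the ")" back, because it was part of the matched text
    return ")" + str(value)

line = "((words:0.23)75:0.55(morewords:0.1)2:0.55);"

# re.sub calls divide_by_100 once per match and substitutes its return value
converted = re.sub(r"\)(\d+)", divide_by_100, line)
print(converted)  # ((words:0.23)0.75:0.55(morewords:0.1)0.02:0.55);
```

Dividing by 100.0 rather than 100 keeps the result a float under Python 2 as well as Python 3.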