Python: finding a variable in a string - python

I currently have some code that goes to a URL, fetches the source code, and I'm trying to get it to return a variable from the string. So I created:
changetime = refreshsource.find('VARIABLE pm NST')
But it wouldn't find the area in the string because the word is not VARIABLE, it is something else. How would I retrieve the constantly changing VARIABLE from that string?

A regular expression will be able to achieve this for you. If you give some examples of what the variable will be, we could come up with a stricter expression. To match what you have above, something like the following will do:
import re
# this will match 01:23, 11:34, 12:00, etc.
timex = re.compile('.*(\d{2}:\d{2})[ ]?pm NST')
match = timex.match(text, re.M|re.S)
variable = match.groups(0)
Edit: this code will actually work (unlike that first attempt :) ):
import re
# this will match 01:23, 11:34, 12:00, etc.
timex = re.compile(r'(\d{2}:\d{2})[ ]?pm NST')
match = timex.search(text)
if match:
    # group(1) is the captured time as a string
    variable = match.group(1)
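For reference, here is a minimal self-contained sketch of the same search. The sample text in refreshsource is made up for illustration; the real string would be the page source fetched from the URL.

```python
import re

# Hypothetical page source; the real text would come from the fetched URL.
refreshsource = "<b>Current time:</b> 07:45 pm NST"

# Same pattern as above: two digits, a colon, two digits, then "pm NST".
timex = re.compile(r"(\d{2}:\d{2})[ ]?pm NST")

match = timex.search(refreshsource)
# search() returns None when nothing matches, so guard before using it.
changetime = match.group(1) if match else None
print(changetime)  # 07:45
```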

If the pattern is really that simple, then this seems like a typical case where regular expressions come in quite handy.
Note: if you are new to regular expressions, you may want to read an introduction, such as http://www.regular-expressions.info.
On the other hand, if the pattern is more complex, then you may want to use an HTML parser, such as BeautifulSoup.

Related

How to extract a numeric value from a line of text with a regular expression?

I'm new to regular expressions, help me extract the necessary information from the text:
salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"
&salespackquantity=1&itemCode=2313441","quantity_box
I need to take the numbers 3760041 and 2313441 respectively. What should a regular expression look like?
If we're dealing with just line-based data as you show it could be as easy as:
.*itemCode=([0-9]+).*
This is crude but would do the job. You'd extract the first matching group.
Your example seems inconsistent and truncated, though, so this may vary. Please provide more details if there are other conditions.
Example
>>> import re
>>> oneline = "salespackquantity=1&itemCode=3760041\",\"quantity_box_sales_uom\""
>>> match = re.search('.*itemCode=([0-9]+).*', oneline)
>>> match.group(0)
'salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"'
>>> match.group(1)
'3760041'
Do you really need regex?
Arguably, a regex seems an easy way to get what you want here, but it might be grossly inefficient, depending on your use case and input data.
Several other strategies might be easier:
remove unnecessary data first,
use a proper parser for your specific content (here this looks like a mix of a CSV and URL query strings),
don't even bother and cut on appropriate boundaries, if the format is fixed.
Regexes are powerful, and can be overkill for simple scenarios. Totally fair if it's to run a one-off data extraction script, though, or if the cost/benefit analysis of the development effort is worth it.
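As a rough sketch of the "proper parser" strategy, assuming each line is a URL query string followed by a double quote and further fields (an assumption based only on the two samples in the question), the standard library's urllib.parse.parse_qs can pull out itemCode. The helper name item_code is made up for illustration.

```python
from urllib.parse import parse_qs


def item_code(line):
    """Return the itemCode value from a line, or None if it is absent."""
    # Assumption: everything before the first double quote is the query string.
    query = line.split('"', 1)[0]
    values = parse_qs(query).get("itemCode")
    return values[0] if values else None


lines = [
    'salespackquantity=1&itemCode=3760041","quantity_box_sales_uom"',
    '&salespackquantity=1&itemCode=2313441","quantity_box',
]
codes = [item_code(line) for line in lines]
print(codes)  # ['3760041', '2313441']
```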
a = "example is the int and string 223576"
ext = []
b = "1234567890"
for i in a:
    if i in b:
        ext.append(i)
print(ext)

Find string in possibly multiple parentheses?

I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed in parentheses and a string that contains one outside of them. The problem is that the parentheses may be nested inside each other.
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like this one. The problem with this one is it does not match the example 2, because it has a ( string after it.
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
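For this particular problem, a full parser may not be needed. A minimal sketch of the idea is to walk the string once while tracking the parenthesis nesting depth. It assumes a string should match when it has at least one digit inside parentheses and no digit outside them, which is my reading of the examples in the question; the function name is made up.

```python
def digits_only_inside_parens(text):
    """True if text has a digit inside parentheses and none outside."""
    depth = 0            # current nesting level of parentheses
    seen_inside = False  # whether any digit was found at depth > 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            # Assumption: a stray ")" is ignored rather than treated as an error.
            depth = max(depth - 1, 0)
        elif ch.isdigit():
            if depth == 0:
                return False  # a digit outside any parentheses
            seen_inside = True
    return seen_inside


for sample in ["hey(example1)", "this(one)is22misleading"]:
    print(sample, digits_only_inside_parens(sample))
```

Unlike a regular expression, this handles any nesting depth, because the counter carries the state that a pattern cannot.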
These types of regexes are not always easy, but sometimes it's possible to come up with a way, provided the input remains somewhat consistent. A pattern generally like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re
# r'...' instead of ur'...': the ur prefix is a syntax error in Python 3
p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1

Using python to find specific pattern contained in a paragraph

I'm trying to use python to go through a file, find a specific piece of information and then print it to the terminal. The information I'm looking for is contained in a block that looks something like this:
\\Version=EM64L-G09RevD.01\State=1-A1\HF=-1159.6991675\RMSD=4.915e-11\RMSF=1.175e-07\ZeroPoint=0.0353317\
I would like to be able to get the information HF=-1159.6991675. More generally, I would like the script to copy and print \HF=WhateverTheNumberIs\
I've managed to make scripts that are able to copy an entire line and print it out to the terminal, but I am unsure how to accomplish this particular task.
My suggestion is to use regular expressions (regex) in order to catch the required pattern:
import re  # for using regular expressions
with open("<filename here>") as f:  # put your file name here
    s = f.read()  # read the content of the file and hold it as a string to be scanned
# the pattern as you described: starts with \HF= and runs up to (not including) the next backslash
p = re.compile(r"\\HF=[^\\]+")
print(p.findall(s))  # finds all occurrences and prints them
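To try the idea without a file, here is a small sketch that runs on a shortened copy of the sample block from the question. The capture group and the variable names are my own additions for illustration.

```python
import re

# Shortened sample from the question (raw string, so backslashes are literal).
block = r"\\Version=EM64L-G09RevD.01\State=1-A1\HF=-1159.6991675\RMSD=4.915e-11"

# A literal backslash, "HF=", then everything up to the next backslash.
pattern = re.compile(r"\\HF=([^\\]+)")

match = pattern.search(block)
if match:
    hf_field = match.group(0)  # the whole field, including the leading backslash
    hf_value = match.group(1)  # just the number, as a string
    print(hf_field)
    print(float(hf_value))
```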
Regular expressions are the answer, something like r'\\HF=[^\\]*' (the fields in your data are separated by backslashes, not forward slashes).
Tutorial:- regex tutorial
Once you have learned regex, it is an indispensable resource.

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Let's say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
//Transform the string into list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int () is True:
            if isinstance((p+2), int () is True:
                delp = p+3
                a = p-3
                del l[delp:]
                new = "".join(l)
                new = new.replace("."," ")
                print (new)
The idea is to get the index where "E" is and check whether there are 2 integers after the "E".
Then delete everything after the second integer.
However, this will not work if there is an "E" anywhere else.
At the moment the result I get is:
this is tEst
because it is finding the index of the first "E" in the list and deleting everything after index+3.
I guess my question is how to get the index in the list if a combination of strings exists,
but I can't seem to find out how.
Thanks for everyone's answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than by just copying what others write :)
This is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # Group 0 is the entire match; end(0) is the index just past its last character
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' holds a character from the list 'l', not an integer. Comparing it with 'E' works, but then adding 1 to a string raises a TypeError.
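If you would rather stay with a loop than use a regex, one possible fix is to loop over positions instead of characters and test the two following characters with str.isdigit(). This is only a sketch; the function name index_of_exx is made up for illustration.

```python
def index_of_exx(name):
    """Return the index of the first 'E' followed by two digits, or -1."""
    # Stop two characters early so name[p + 2] is always a valid index.
    for p in range(len(name) - 2):
        if name[p] == "E" and name[p + 1].isdigit() and name[p + 2].isdigit():
            return p
    return -1


filename = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
p = index_of_exx(filename)
if p != -1:
    # Keep everything up to and including the two digits, then tidy up.
    result = filename[:p + 3].replace(".", " ").title()
else:
    result = filename
print(result)  # This Is Test3 E00
```

The "E" in "tEst3" is skipped because the characters after it are not digits, which is the case the original list.index("E") approach could not handle.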

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them, or did I just ask yet another question not meant for REs?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
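A minimal sketch of that two-step approach: match only the numeric prefix with a regex, then do the length check in ordinary Python. The function name parse_length_prefixed is an assumption for illustration, and this version slices the payload by length rather than matching it with \w+, so the payload may contain colons.

```python
import re


def parse_length_prefixed(text):
    """Parse 'LenOfStr:Str' at the start of text; return Str or None."""
    m = re.match(r"(\d+):", text)
    if m is None:
        return None
    length = int(m.group(1))
    # Take exactly `length` characters after the first colon.
    body = text[m.end():m.end() + length]
    # Reject input that is shorter than the declared length.
    return body if len(body) == length else None


print(parse_length_prefixed("5:5:str"))  # 5:str
print(parse_length_prefixed("5:ab"))     # None
```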
Well this can be done with a regex (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
