Extract data between pound signs - python

Hi, I am parsing through XML files, grabbing SQL text and parameters. I need to pull the strings that lie between two # signs. For example, if this is my text:
CASE WHEN TRIM (NVL (a.SPLR_RMRK, ' ')) = '' OR TRIM (NVL (a.SPLR_RMRK, ' ')) IS NULL THEN '~' ELSE a.SPLR_RMRK END AS TXT_DESCR_J, 'PO' AS TXT_TYP_CD_J FROM #ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.P_PO_RCPT_DTL a, (SELECT PO_RCPT_DTL_KEY, ETL_CRT_DTM FROM #ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.#jp_PoRcptDtl_Src# WHERE ETL_UPDT_DTM > TO_DATE ('#jp_EtlPrcsDt#', 'YYYY-MM-DD:HH24:MI:SS'))
I want to have ps_RDW_Conn.jp_RDW_SCHEMA_NAME, ps_RDW_Conn.jp_RDW_SCHEMA_NAME, jp_PoRcptDtl_Src and jp_EtlPrcsDt print out.
Some code that I have so far is
for eachLine in testFile:
    print re.findall('#(*?)#', eachLine)
This gives me the following error:
nothing to repeat.
Any help or suggestions is greatly appreciated!

Unlike a bash wildcard, the * is not a wildcard character on its own; instead it says "repeat the thing before me zero or more times".
In your regular expression, your * had no symbol to modify and so you saw the complaint nothing to repeat.
On the other hand, if you supply a . symbol for * to modify, testing with one line as an example,
eachLine = '#ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.P_PO_RCPT_DTL a, (SELECT PO_RCPT_DTL_KEY, '
re.findall('#(.*?)#', eachLine)
We get,
['ps_RDW_Conn.jp_RDW_SCHEMA_NAME']
Some more detail.
I'm not sure if this is what you intended, but your *? is actually well placed.
*? is interpreted as a single qualifier which says repeat 0 or more times the thing before me, but take as little as possible.
So this ends up having a similar effect to what @tobias_k suggests in the comments, preventing multiple groups from being absorbed into one.
>>> line = 'And here is # some interesting code #, where later on there are #fruit flies# ?'
>>> re.findall('#(.*)#', line)
[' some interesting code #, where later on there are #fruit flies']
>>>
>>> re.findall('#(.*?)#', line)
[' some interesting code ', 'fruit flies']
>>>
For reference, see the "Repeating Things" section of the Regular Expression HOWTO on docs.python.org.
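Putting it together on one of the asker's SQL fragments, here is a quick sketch (the line variable and the trimmed-down fragment are mine, just for illustration):
import re
line = ("FROM #ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.P_PO_RCPT_DTL a, "
        "(SELECT PO_RCPT_DTL_KEY, ETL_CRT_DTM FROM "
        "#ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.#jp_PoRcptDtl_Src# "
        "WHERE ETL_UPDT_DTM > TO_DATE ('#jp_EtlPrcsDt#', 'YYYY-MM-DD:HH24:MI:SS'))")
# Non-greedy capture of everything between each pair of # signs
print(re.findall(r'#(.*?)#', line))
# ['ps_RDW_Conn.jp_RDW_SCHEMA_NAME', 'ps_RDW_Conn.jp_RDW_SCHEMA_NAME', 'jp_PoRcptDtl_Src', 'jp_EtlPrcsDt']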

Your regex is not working as intended because you are using both * (0 or more) and ? (0 or 1) to modify the thing before it, but a) there is nothing before it, and b) you should use either * or ?, not both.
If you mean to capture ## or #anything#, then use the regex #(.*)#.

Try to escape ( and ). r'\(.*?\)' should work.
for eachLine in testFile:
    print re.findall(r'\(.*?\)', eachLine)

Related

How to delete whitespace in only a certain part of the string

So let's say I have this string like
string = 'abcd <# string that has whitespace> efgh'
And I want to delete all the whitespace inside this <#...> without affecting anything outside <#...>.
But the characters outside <#...> can change too, so the <#...> is not going to be in a fixed position.
How should I do this?
This is not a complicated operation. You just do it like you would as a human being. Find the two delimiters, keep the part before the first one, remove space from the middle, keep the rest.
string = 'abcd <# string that has whitespace> efgh'
i1 = string.find('<#')
i2 = string.find('>')
res = string[:i1] + string[i1:i2].replace(' ','') + string[i2:]
print(res)
Output:
abcd <#stringthathaswhitespace> efgh
How about this...
string = 'abcd <# string that has whitespace> efgh'
s = string.split()
s = ' '.join( (s[0], ''.join(s[1:-1]), s[-1]) )
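For the example string this produces the same result (a quick check; it assumes the text before and after <#...> is a single word each):
print(s)  # abcd <#stringthathaswhitespace> efgh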
If <#...> appears consistently, one method is to use a regular expression (regex) to find the part of the string with the characters you want to modify, and then strip out the whitespace.
It takes a bit to get your head around regex, but it can be a powerful tool.
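A minimal sketch of that regex approach, assuming the delimiters are always the literal <# and >, is to use re.sub with a callback so only the delimited part is touched:
import re
string = 'abcd <# string that has whitespace> efgh'
# Replace each <#...> chunk with a copy of itself that has its spaces removed
result = re.sub(r'<#.*?>', lambda m: m.group(0).replace(' ', ''), string)
print(result)  # abcd <#stringthathaswhitespace> efgh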

The difference between ( [^,]*) and (.*,) in regular expression? Using python

When I tried to transform the string into a dict-like form, I met this problem
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only difference I made was changing (.*), into ([^,]*).
I thought the first one meant "find anything until it matches a comma".
I thought the second one meant "find anything but a comma".
What's the difference?
In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the (.*) operator is greedy. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma, you are forcing the match to be bound by a comma-- it's like saying "match everything until we hit a comma". This will cause the matching behavior you were expecting in the first place.
As has been said, the .* will be greedy and try to match as much as possible; to make it non-greedy, use the question mark (?) as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
print(result)
Another maybe easier solution is to just use string splits instead of regex:
result = [_s.split(':') for _s in s.split(',')]
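Since the goal was a dict-like form, one way to finish that off is the sketch below (it keeps the & prefixes and strips stray whitespace; the pairs name is mine):
s = '&a: 12, &b:13, &c:14, &d: 15'
# Split on commas, then split each piece on its first ':' into key and value
pairs = (part.split(':', 1) for part in s.split(','))
result = {k.strip(): v.strip() for k, v in pairs}
print(result)  # {'&a': '12', '&b': '13', '&c': '14', '&d': '15'}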

Python inserting spaces in string

Alright, I'm working on a little project for school, a 6-frame translator. I won't go into too much detail, I'll just describe what I wanted to add.
The normal output would be something like:
TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD
The important parts of this string are the M and the _ (the start and stop codons, biology stuff). What I wanted to do was highlight these like so:
TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSD
Now here is where (for me) it gets tricky. I got my output to look like this (adding a space and a ' to highlight the start and stop), but it only does this once, for the first start and stop it finds. If there are any other M....._ combinations it won't highlight them.
Here is my current code, attempting to make it highlight more than once:
def start_stop(translation):
    index_2 = 0
    while True:
        if 'M' in translation[index_2::1]:
            index_1 = translation[index_2::1].find('M')
            index_2 = translation[index_1::1].find('_') + index_1
            new_translation = translation[:index_1] + " '" + \
                translation[index_1:index_2 + 1] + "' " + \
                translation[index_2 + 1:]
        else:
            break
    return new_translation
I really thought this would do it, guess not. So now I find myself being stuck.
If any of you are willing to try and help, here is a randomly generated string with more than one M....._ set:
'TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLYMPPARRLATKSRFLTPVISSG_DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI'
Thank you to anyone willing to help :)
Regular expressions are pretty handy here:
import re
sequence = "TTCP...."
highlighted = re.sub(r"(M\w*?_)", r" '\1' ", sequence)
# Output:
"TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLY 'MPPARRLATKSRFLTPVISSG_' DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI"
Regex explanation:
We look for an M followed by any number of "word characters" \w* then an _, using the ? to make it a non-greedy match (otherwise it would just make one group from the first M to the last _).
The replacement is the matched group (\1 indicates "first group", there's only one), but surrounded by spaces and quotes.
You just need a little slicing here; no external module is required.
Python strings have a method called index, just use it:
string_1='TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD'
before=string_1.index('M')
after=string_1[before:].index('_')
print('{} {} {}'.format(string_1[:before],string_1[before:before+after+1],string_1[before+after+1:]))
output:
TTCPTISPALGLAWS_DLGTLGF MSYSANTASGETLVSLYQLGLFEM_ VVSYGRTKYYLICP_LFHLSVGFVPSD
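If every M....._ pair needs highlighting, the same find-and-slice idea can be put in a loop. The highlight_all helper below is only a sketch of that extension (it mirrors the quoting from the question, and is not part of the answer above):
def highlight_all(translation):
    out = []
    pos = 0
    while True:
        start = translation.find('M', pos)   # next start codon
        if start == -1:
            break
        stop = translation.find('_', start)  # matching stop codon
        if stop == -1:
            break
        out.append(translation[pos:start])
        out.append(" '" + translation[start:stop + 1] + "' ")
        pos = stop + 1
    out.append(translation[pos:])            # whatever is left over
    return ''.join(out)
print(highlight_all(string_1))
# TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSD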

Regex include the negative lookbehind

I'm trying to filter a string before passing it through eval in python. I want to limit it to math functions, but I'm not sure how to strip it with regex. Consider the following:
s = 'math.pi * 8'
I want that to basically translate to 'math.pi*8', stripped of spaces. I also want to strip any letters [A-Za-z] that are not preceded by math\..
So if s = 'while(1): print "hello"', I want any executable part of it to be stripped:
s would ideally equal something like ():"" in that scenario (all letters gone, because they were not preceded by math\.).
Here's the regex I've tried:
(?<!math\.)[A-Za-z\s]+
and the python:
re.sub(r'(?<!math\.)[A-Za-z\s]+', r'', 'math.pi * 8')
But the result is '.p*8', because the leading math is not preceded by math., and the i is not preceded by math..
How can I strip letters that are not in math and are not preceded by math.?
What I ended up doing
I followed @Thomas's answer, but also stripped square brackets, spaces, and underscores from the string, in the hope that no Python function can be executed other than through the math module:
s = re.sub(r'(\[.*?\]|\s+|_)', '', s)
s = eval(s, {
'__builtins__' : None,
'math' : math
})
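For completeness, a small self-contained run of that filter (just a sketch; the sandboxing caveats in the next answer still apply):
import math
import re
s = 'math.pi * 8'
s = re.sub(r'(\[.*?\]|\s+|_)', '', s)  # strip brackets, whitespace, underscores
print(eval(s, {'__builtins__': None, 'math': math}))  # 25.132741228718345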
As @Carl says in a comment, look at what lybniz does for something better. But even this is not enough!
The technique described at the link is the following:
print eval(raw_input(), {"__builtins__":None}, {'pi':math.pi})
But this doesn't prevent something like
([x for x in 1.0.__class__.__base__.__subclasses__()
if x.__name__ == 'catch_warnings'][0]()
)._module.__builtins__['__import__']('os').system('echo hi!')
Source: Several of Ned Batchelder's posts on sandboxing, see http://nedbatchelder.com/blog/201302/looking_for_python_3_builtins.html
Edit: it was pointed out that we don't get square brackets or spaces with the filter above, so:
1.0.__class__.__base__.__subclasses__().__getitem__(i)()._module.__builtins__.get('__import__')('os').system('echo hi')
where you just try a lot of values for i.

How to tokenize the sample string using Regular Expression in Python?

I am new to regular expressions. On top of finding out the pattern to match the following string, please also point out references and/or sample web sites.
The data string
1. First1 Last1 - 20 (Long Description)
2. First2 Last2 - 40 (Another Description)
I want to be able to extract tuples {First1,Last1,20} and {First2,Last2,40} from the above string.
This one seems ok:
http://docs.python.org/howto/regex.html#regex-howto
Just skim it over and try some examples. Regexes are a little tricky (basically a little programming language) and take some time to learn, but they are very useful to know. Just experiment and take one step at a time.
(yes, I could just give you the answer, but fish, man, teach)
...
As requested, a solution when you don't use the split() approach:
Iterate over the lines, and check each line:
p = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')
m = p.match(the_line)
# m.group(1) will be the first word
# m.group(2) the second word
# m.group(3) the first number after the last word
The regexp is: <some digits><a dot>
<some whitespace><alphanumeric characters, captured as group 1>
<some whitespace><alphanumeric characters, captured as group 2>
<some whitespace><a '-'><some whitespace><digits, captured as group 3>
It's a little strict, but that way you'll catch non-conforming lines.
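Putting that pattern into a complete loop over the sample data, a sketch might look like this (the data string and the rows list are mine):
import re
data = """1. First1 Last1 - 20 (Long Description)
2. First2 Last2 - 40 (Another Description)"""
p = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')
rows = []
for the_line in data.splitlines():
    m = p.match(the_line)
    if m:  # non-conforming lines are simply skipped
        rows.append(m.groups())
print(rows)  # [('First1', 'Last1', '20'), ('First2', 'Last2', '40')]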
There is no need to use regex here:
foo = "1. First1 Last1 - 20 (Long Description)"
foo.split(" ")
>>> ['1.', '', 'First1', 'Last1', '-', '20', '(Long', 'Description)']
You can now select the elements you like (they will always be at the same indices).
In 2.7+ you can use itertools.compress to select the elements:
from itertools import compress
tuple(compress(foo.split(" "), [0,0,1,1,0,1]))
Based on Harman's partial solution, I came up with this:
(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)
code and the output:
>>> regex = re.compile("(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)")
>>> r = regex.search(string)
>>> regex.findall(string)
[(u'First1', u'Last1', u'20'), (u'First2', u'Last2', u'40')]
