I am trying to parse a substring using re.
From the string in variable s, I would like to take everything up to the first ! (the string stored in s contains two !) and store it as a substring in the variable result. From that substring, I then want to parse another substring.
Here is the code:
import re
s='ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!ecNumber*2.3.1.11#kmValue*0.081#kmValueMaximum*#!'
Data={}
result = re.search('%s(.*)%s' % ('ec', '!'), s).group(1)
print result
ecNumber = re.search('%s(.*)%s' % ('Number*', '#kmValue*'), result).group(1)
Data["ecNumber"]=ecNumber
print Data
The value for each tag in the substring (for example, ecNumber) is stored between * and # (for example, *2.4.1.11#). I attempted to parse the value stored for the tag ecNumber in the first substring.
The output I obtain is
result='Number*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!ecNumber*2.3.1.11#kmValue*0.081#kmValueMaximum*#'
{'ecNumber': '*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!ecNumber*2.3.1.11#kmValue*0.081'}
The desired output is
result= 'ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#'
{'ecNumber': '2.4.1.11'}
I would like to store each tag and its corresponding value. For example:
{'ecNumber': '2.4.1.11','kmValue':'0.57','kmValueMaximum':'1.25'}
Although you asked for a regular-expression solution, I would say it is much easier to use plain string operations for this problem, since the source string is well formatted.
For the information before the first !:
print dict([i.split('*') for i in s.split('!', 1)[0].split('#') if i])
For all information in s:
print [dict([i.split('*') for i in j.split('#') if i]) for j in s.split('!') if j]
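The two one-liners above can be unpacked into a small function. This is only a sketch in Python 3 syntax; the name parse_records is made up for illustration, and it assumes every non-empty field contains a * separating the tag from the value.

```python
def parse_records(s):
    """Split on '!' into records, on '#' into fields, and on '*' into tag/value."""
    records = []
    for chunk in s.split('!'):
        if not chunk:
            continue  # skip the empty piece after the trailing '!'
        # Assumption: every non-empty field has the form tag*value
        fields = (field.split('*', 1) for field in chunk.split('#') if field)
        records.append({tag: value for tag, value in fields})
    return records


s = ('ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!'
     'ecNumber*2.3.1.11#kmValue*0.081#kmValueMaximum*#!')
records = parse_records(s)
print(records[0])  # the record before the first '!'
print(records[1])  # kmValueMaximum is an empty string here
```

Keeping one dict per record means a tag that appears in both records (such as ecNumber) does not overwrite itself.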
You can try this:
import re
s='ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#'
new_data = re.findall('(?<=^)[a-zA-Z]+(?=\*)|(?<=#)[a-zA-Z]+(?=\*)|(?<=\*)[-\d\.]+(?=#)', s)
final_data = dict([new_data[i:i+2] for i in range(0, len(new_data)-1, 2)])
Output:
{'kmValue': '0.57', 'kmValueMaximum': '1.25', 'ecNumber': '2.4.1.11'}
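If a regex is preferred, a pattern with two capturing groups may be simpler than pairing up alternating matches. The following is a sketch in Python 3 syntax, assuming each field has the form tag*value# and that tags contain only letters; unlike the pattern above, it also accepts an empty value.

```python
import re

s = 'ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#'

# Each field looks like tag*value#; capture the tag and the (possibly empty) value.
pairs = re.findall(r'([A-Za-z]+)\*([^#]*)#', s)
data = dict(pairs)
print(data)
```

With two groups in the pattern, re.findall returns a list of (tag, value) tuples, which dict() accepts directly.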
I have a string in a variable x that includes ">" symbols. I would like to create a new variable each time the string is split at a ">" symbol.
The string in the variable x looks like this (imported from a simple .txt file):
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
The expected output is:
print(var_1)
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
print(var_2)
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
print(var_3)
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
To achieve this I am using a simple for loop
count = 3
for v in range(0, count+1):
    globals()[f"var_{v}"] = x.split('>')
print(var_3)
This way I am successfully getting a new variable for each count (each count is == to the number of ">").
However the output I am currently getting is:
print(var_1)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
print(var_2)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
print(var_3)
['', 'AF1785813GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA', 'AF1785815GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG', 'AF1785814GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
How can I troubleshoot the for loop in order to achieve the expected output?
Try to iterate the split result:
for i, token in enumerate(x.split('>')):
    # do not include empty string
    if token:
        globals()[f"var_{i}"] = token
# then deal with the vars
print(var_1)
print(var_2)
..
I would use re.findall here:
import re
inp = """>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"""
vars = re.findall(r'>[^>]+', inp)
print(vars)
# ['>AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n',
# '>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n',
# '>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA']
Note that re.findall returns all matches inside a single neat list, which can then be iterated or accessed later as needed.
Use a regular expression that matches the > character, followed by the rest of that line and then everything after it, up to the next > character or the end of the string.
[^\n]*: This matches zero or more characters that are not newline characters.
[^>]*: This matches zero or more characters that are not the > character.
import re
x = ">AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n>AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n>AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA"
substrings = re.findall(">[^\n]*\n[^>]*", x)
for i, substring in enumerate(substrings, start = 1):
    globals()[f"var_{i}"] = substring
Output:
>>> print(var_1)
>AF1785813
GTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
>>> print(var_2)
>AF1785815
GTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG
>>> print(var_3)
>AF1785814
GTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA
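As an alternative to creating variables through globals(), the records can be kept in a dictionary keyed by the header line. This is a sketch under the assumption that each record is one header line starting with > followed by sequence lines; the name records is made up for illustration.

```python
import re

x = (">AF1785813\nGTGTGGAGGGAAAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA\n"
     ">AF1785815\nGTGTGGAGTGAGCCAAGATCGCACCACTGCACTCCATTCAG\n"
     ">AF1785814\nGTGTGGAGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCAAGATCGCACCACTGCACTCCA")

records = {}
for header, body in re.findall(r'>([^\n]*)\n([^>]*)', x):
    # Key each record by its header; join sequence lines that were wrapped.
    records[header.strip()] = body.replace('\n', '')

print(records['AF1785815'])
```

A dictionary keeps working when the number of records changes, whereas var_1, var_2, ... have to be known by name in advance.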
I'm trying to extract financial data from a wall of text. Basically, I have a function that splits the text three times; I know there must be a more efficient way of doing it, but I cannot figure it out. The curly braces in the text really throw a wrench into my plan, because I'm trying to format a string.
I want to pass my function a string such as:
"totalCashflowsFromInvestingActivities"
and extract the following raw number:
"-2478000"
This is my current function, which works but is not efficient at all:
def splitting(value, text):
    x= text.split('"{}":'.format(value))[1]
    y=x.split(',"fmt":')[0]
    z=y.split(':')[1]
    return z
any help would be greatly appreciated!
sample text:
"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}
Here is a solution using regex. It assumes the format is always the same, having the raw value always immediately after the title and separated by ":{.
import re
def get_value(value_name, text):
    """ finds all the occurrences of the passed `value_name`
    and returns the `raw` values"""
    pattern = value_name + r'":{"raw":(-?\d*)'
    return re.findall(pattern, text)
text = '"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}'
val = get_value('totalCashflowsFromInvestingActivities', text)
print(val)
['-2478000']
You can convert that result to integers with map by replacing the return line with:
return list(map(int, re.findall(pattern, text)))
If Buran is right and your source is JSON, you might find this helpful:
import json
s = '{"cashflowStatementHistory":{"cashflowStatements":[{"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"},"netBorrowings":{"raw":-31652000,"fmt":"-31.65M","longFmt":"-31,652,000"}}]}}'
j = json.loads(s)
for i in j["cashflowStatementHistory"]["cashflowStatements"]:
    if "totalCashflowsFromInvestingActivities" in i:
        print(i["totalCashflowsFromInvestingActivities"]["raw"])
In this way you can find anything in the wall of text.
Take a look at this too: https://www.w3schools.com/python/python_json.asp
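If the position of the key inside the JSON is not known in advance, a small recursive search over the parsed data is one option. This is only a sketch; find_key is a hypothetical helper, not part of the json module, and the sample string is a shortened copy of the text from the question.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key`, at any depth of the parsed JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


s = ('{"cashflowStatementHistory":{"cashflowStatements":[{'
     '"changeToLiabilities":{"raw":66049000,"fmt":"66.05M","longFmt":"66,049,000"},'
     '"totalCashflowsFromInvestingActivities":{"raw":-2478000,"fmt":"-2.48M","longFmt":"-2,478,000"}'
     '}]}}')
data = json.loads(s)
raws = [hit["raw"] for hit in find_key(data, "totalCashflowsFromInvestingActivities")]
print(raws)
```

Because json.loads already turns the numbers into ints, no separate conversion step is needed.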
I have a list of strings, and I would like to extract "000000_5.612230" from strings such as:
A = '/calibration/test_min000000_5.612230.jpeg'
Since the length of the strings may vary, I tried to locate the position of the "n" of "min". I tried to get the right index with:
print sorted(A, key=len).index('n')
But I got 11, which corresponds to the "n" of "calibration". How can I get the highest index at which a character occurs in the string?
It is difficult to answer, since you don't specify which part of the filename remains constant and which is subject to change. Is it always a jpeg? Is the number always the last part? Is it always preceded by '_min'?
In any case, I would suggest using a regex instead:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
p = re.compile(r'.*min([_\d\.]*)\.jpeg')
value = p.search(A).group(1)
print value
output :
000000_5.612230
Note that this snippet assumes a match is always found. If the filename doesn't contain the pattern, p.search(...) returns None, and calling .group(1) on it raises an exception (AttributeError), so you should check for that case.
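A guarded version of the same search might look like the sketch below (Python 3 syntax). The function name extract_id is made up for illustration, and it assumes the wanted text always sits between 'min' and '.jpeg'.

```python
import re


def extract_id(path):
    """Return the text between 'min' and '.jpeg', or None if the pattern is absent."""
    m = re.search(r'min([_\d.]*)\.jpeg', path)
    return m.group(1) if m else None


print(extract_id('/calibration/test_min000000_5.612230.jpeg'))
print(extract_id('/calibration/readme.txt'))  # no match, so None instead of an exception
```

Returning None leaves it to the caller to decide whether a missing pattern should be skipped or reported.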
You can use the re module and a regex to do that, for example:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
text = re.findall('\d.*\d', A)
Now text is a list. If you print it, the output will look like this: ['000000_5.612230']
To extract the string itself, index the list as below, or loop over it with for:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
text = re.findall('\d.*\d', A)
print text[0]
String slicing seems like a good solution for this
>>> A = '/calibration/test_min000000_5.612230.jpeg'
>>> start = A.index('min') + len('min')
>>> end = A.index('.jpeg')
>>> A[start:end]
'000000_5.612230'
Avoids having to import re
Try this (if extension is always '.jpeg'):
A.split('test_min')[1][:-5]
If your string is regular at the end, you can use negative indices to slice the string:
>>> a = '/calibration/test_min000000_5.612230.jpeg'
>>> a[-20:-5]
'000000_5.612230'
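On the original question of finding the last occurrence of a character: str.rfind returns the highest index at which a substring occurs, whereas str.find and str.index return the lowest. A short sketch, assuming the wanted text lies between the last "n" and the last "." of the string:

```python
A = '/calibration/test_min000000_5.612230.jpeg'

first_n = A.find('n')   # lowest index: the 'n' in 'calibration'
last_n = A.rfind('n')   # highest index: the 'n' in 'min'
last_dot = A.rfind('.')  # the dot before the extension

value = A[last_n + 1:last_dot]
print(first_n, last_n, value)
```

rfind returns -1 when nothing is found, so the result is worth checking before slicing; str.rindex raises ValueError instead.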
I am trying to do something which I thought would be simple (and probably is), however I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-### however in some cases, where the single digit should be, there are multiple single digits separated by a comma (i.e. ######-#,#,#-###). The number of single digits separated by a comma is variable. Below is an example:
For the string below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
I have only gotten as far as returning the strings that match the ######-#-### pattern:
import re
p = re.compile('\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print m
Thanks in advance for any help!
Matt
Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print a + x + c
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
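Or using a list comprehension (re-create the iterator first, since the loop above has already consumed it):
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]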
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
Use '\d{6}-\d(,\d)*-\d{3}'.
* means "as many as you want (0 included)".
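It is applied to the previous element, here '(,\d)'. Note that re.findall returns the contents of the capturing groups when a pattern has any, so write the group as '(?:,\d)' if you want findall to return the whole match.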
I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.
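Looping over the list lets you write a single function to parse and fix each item; you can then push the results onto a new list and display the final string.
string = string.replace('&', ',')
# note: this also splits the comma-separated digits (e.g. '1,2'),
# so myfunction has to cope with fragments such as '2-001'
initialList = string.split(',')
newList = []
for item in initialList:
    # myfunction is a placeholder for your own parse-and-fix logic
    newItem = myfunction(item.strip())
    newList.append(newItem)
newstring = ','.join(newList)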
(\d{6}-)((?:\d,?)+)(-\d{3})
We use three capturing groups. The first and last parts are matched the easy way. The center part is a digit optionally followed by a ',', repeated one or more times. A repeated capturing group only keeps its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole run. What we're left with is the following result:
>>> p = re.compile('(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
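The post-processing step mentioned above could be sketched as follows, in Python 3 syntax. The function name expand is made up for illustration; the pattern is the one from this answer.

```python
import re


def expand(text):
    """Expand numbers like 030421-1,2-001 into one entry per middle digit."""
    p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
    return [prefix + digit + suffix
            for prefix, digits, suffix in p.findall(text)
            for digit in digits.split(',')]


s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
print(expand(s))
```

The outer loop walks the (prefix, digits, suffix) tuples returned by findall, and the inner loop rebuilds one document number per digit.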
I was wondering if anyone has a simpler solution for extracting a few letters from the middle of a string. I want to retrieve the 3 letters (in this case, GMB), and all the entries follow the same pattern. I'm struggling to find a simpler way of doing this.
Here is an example of what I've been using:
entry = "entries-alphabetical.jsp?raceid13=GMB$20140313A"
symbol = entry.strip('entries-alphabetical.jsp?raceid13=')
symbol = symbol[0:3]
print symbol
thanks
First of all, the argument passed to str.strip is not a prefix or suffix; it is a set of characters, and any of them is stripped from both ends of the string.
Since the string looks like a URL, you can use urlparse.parse_qsl:
>>> import urlparse
>>> urlparse.parse_qsl(entry)
[('entries-alphabetical.jsp?raceid13', 'GMB$20140313A')]
>>> urlparse.parse_qsl(entry)[0][1][:3]
'GMB'
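To illustrate the point about str.strip: the call in the question only works because "G" and "A" happen not to be among the stripped characters. The sketch below (Python 3 syntax) shows a case where strip removes too much, and one way to remove a known prefix explicitly. The helper name symbol_after_prefix and the sample string "raceid13=abc$1" are made up for illustration.

```python
entry = "entries-alphabetical.jsp?raceid13=GMB$20140313A"
prefix = "entries-alphabetical.jsp?raceid13="

# strip() treats its argument as a set of characters, not as a prefix:
# here the leading 'a' and the trailing '1' of the value are removed too.
stripped = "raceid13=abc$1".strip("raceid13=")
print(stripped)


def symbol_after_prefix(text, prefix, length=3):
    """Return `length` characters following `prefix`, or None if the prefix is absent."""
    if not text.startswith(prefix):
        return None
    return text[len(prefix):len(prefix) + length]


print(symbol_after_prefix(entry, prefix))
```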
This is what regular expressions are for. http://docs.python.org/2/library/re.html
import re
val = re.search(r'(GMB.*)', entry)
print val.group(1)
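The pattern above hardcodes GMB and returns the rest of the string after it. A more general sketch (Python 3 syntax), assuming the symbol is always exactly three letters between the = and the $:

```python
import re

entry = "entries-alphabetical.jsp?raceid13=GMB$20140313A"

# Capture exactly three letters that sit between '=' and '$'.
m = re.search(r'=([A-Za-z]{3})\$', entry)
symbol = m.group(1) if m else None
print(symbol)
```

If the symbol can be longer or shorter than three letters, the {3} quantifier would need adjusting.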