I am trying to match a line in a text file using a regex, but every time I call pattern.finditer(line) the program freezes. Another part of the program passes a block of text to the formatLine method. The text is in the form:
line="8,6,14,32,42,4,4,4,3,5,3,3,4,2,2,2,1,2,3,2,1,3,4,2,3,10,false,false,false,false,true,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,1,2,1,2,4,"
def formatLine(line):
    print(line)
    print("----------")
    commas=len(line.split(","))
    timestamp="(\d+,){5}"
    q1="([1-5,]+){20}"
    q2="([1-5,]+)"
    q3="(true,|false,){10}"
    q4="(true,|false,){6}"
    q5="(true,|false,){20}"
    q6="([1-5,]+){5}"
    pattern=re.compile(timestamp+q1+q2+q4+q5)
    print("here")
    response=pattern.finditer(line)
    for ans in response:
        numPattern+=1
        #write to file for each instance of ans
        #these check that the file is valid
        print("here")
    #more code, omitted

formatLine(line)  #call method here
The first and second print statements print correctly, but the word "here" is never printed. Anyone know why it freezes and/or what I can do to fix it?
Edit: After reading the comments I realized a better question would be: How can I improve the regex above to get the pattern below? I have just started Python (yesterday) and have been reading the Python regex tutorial repeatedly.
Each value (true or false, or a digit) is separated by a comma; the file I am pulling from is a CSV.
Pattern I am trying to get:
5 digits (each digit is 0-60)
20 digits (each digit is 1-5)
36 true or false (may be in any arrangement of true or false)
5 digits (each digit is 1-5)
Your expression, particularly the ([1-5,]+){20} part, causes catastrophic backtracking. It doesn't hang; it's just busy solving the puzzle: "get me digits repeated N times, repeated 20 times". You might be better off replacing it with something like ([1-5]+,){20}, although I don't think your approach is viable at all. Just split the string by commas and slice what you want from the list.
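For example, a minimal split-and-slice sketch (the field positions here are assumptions based on the layout you describe below):
parts = line.split(",")     # one entry per comma-separated value
timestamp = parts[0:5]      # the first five values
digits = parts[5:25]        # the next twenty, and so on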
Per your update, this seems to be the right pattern:
pattern = r"""(?x)
([0-9], | [1-5][0-9], | 60,) {5} # 5 numbers (each number is 0-60)
([1-5],) {20} # 20 digits (each digit is 1-5)
(true,|false,) {36} # 36 true or false (may be in any arrangement of true or false)
([1-5],) {20} # 20 digits (each digit is 1-5)
"""
Of course, this is what the csv module is designed to do, no regex needed:
line="8,6,14,32,42,4,4,4,3,5,3,3,4,2,2,2,1,2,3,2,1,3,4,2,3,10,false,false,false,false,true,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,1,2,1,2,4"
from io import StringIO # this allows the string to act like a file
f=StringIO(line) # -- just use a file
### cut here
import csv
reader=csv.reader(f)
for e in reader:
# then just slice up the components:
q1=e[0:5] # ['8', '6', '14', '32', '42']
q2=e[6:26] # ['4', '4', '3', '5', '3', '3', '4', '2', '2', '2', '1', '2', '3', '2', '1', '3', '4', '2', '3', '10']
q3=e[27:53] # ['false', 'false', 'false', 'true', 'false', 'false', 'true', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false', 'false']
q4=e[54:] # ['2', '1', '2', '4']
You can then validate each section as desired.
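For instance, a small validation sketch (the allowed ranges are assumptions taken from the layout described in the question):
def valid_digits(values, low=1, high=5):
    # every entry must be an integer within [low, high]
    return all(v.isdigit() and low <= int(v) <= high for v in values)

def valid_flags(values):
    return all(v in ("true", "false") for v in values)

print(valid_digits(q1, 0, 60), valid_digits(q2), valid_flags(q3), valid_digits(q4))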
So I have some data I have been trying to clean up; it's a list and it looks like this:
a = [\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain]
I have tried to clean it up by doing this:
a.replace("\n", "|")
and the output turns out like this:
[london||18||20||30||||japan||6||80||2|||Spain]
If I do this:
a.replace("\n","")
I get this:
[london,"", "", 18,"","",20"","",30,"","","",""japan,"",""6,"","",80,"","",2"","","","",Spain]
Can anyone explain why I am getting multiple pipes and spaces, and what's the best way to clean the data?
Assuming that your input is:
s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
The issue is that there are multiple '\n' characters between the data items, so just replacing each '\n' with another character (say '|') will give you as many of the new characters as there were '\n's.
The simplest approach is to use str.split() to get the non-blank data:
l = list(s.split())
print(l)
# ['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
or, combine it with str.join(), if you want to have it separated by '|':
t = '|'.join(s.split())
print(t)
# london|18|20|30|japan|6|80|2|Spain
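If you want to stay closer to the original replace() idea, a sketch using re.sub to collapse each run of newlines into a single separator (assuming the same string s):
import re

t = re.sub(r'\n+', '|', s).strip('|')
print(t)
# london|18|20|30|japan|6|80|2|Spain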
I tried it and got this:
a = ['\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain']
print(a[0].replace("\n", ""))
Output:
london182030japan6802Spain
Could you please clarify the exact input and the expected output? It does not seem correct yet, and I have taken some liberties.
If your input is a string, you can use split():
a = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
print(a.split())
Output:
['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
I have a string of numbers which may have an incomplete decimal representation, for example:
a = '1. 1,00,000.00 1 .99 1,000,000.999'
Desired output:
['1','1,00,000.00','1','.99','1,000,000.999']
So far I have tried the following two:
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
which turns .99 into 99, which is not desired,
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which gives undesirable empty-string results as well.
This is for finding currency values in a string, so the comma separators don't have a set pattern or may not be present at all.
My suggestion is to use re.split; here is a snippet in Python (the pattern splits on the whitespace after each number, treating a trailing dot as part of the separator):
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
result = re.split(r'(?<=\d)\.?\s+', a)
print(result)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split(r'(?<=\d)\.\s+|(?<=\d)\s+', a)
print(d)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6
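A minimal sketch of applying it in Python (using finditer so each complete match is collected, since findall would return per-group tuples):
import re

a = '1. 1,00,000.00 1 .99 1,000,000.999'
pattern = r'([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)'
result = [m.group(0) for m in re.finditer(pattern, a)]
print(result)
# ['1', '1,00,000.00', '1', '.99', '1,000,000.999']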
I'm working with a dataframe where I wish to change entries in the country column, e.g.:
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'
I have defined the following function:
def process(w):
    for i in range(len(w)):
        if w[i] in ['(', ')', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '&', '/']:
            w = w[0:i]
            w = ''.join(w).replace(" ", "")
            break
    return w
which I have then applied to the dataframe using the pandas apply function:
energy['Country'] = energy['Country'].apply(process)
While I have been able to achieve the desired output, it is not entirely correct. Some entries like 'United Kingdom of Great Britain and Northern Ireland' and 'United States of America20' have changed to 'UnitedKingdomofGreatBritainandNorthernIreland' and 'UnitedStatesofAmerica'.
What am I doing wrong? Also, what would be more effective, concise code to achieve the result?
I could be missing something, but it looks like
replace(" ", "")
is going to remove spaces, which is exactly what is happening with UnitedStatesofAmerica.
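One possible, more concise version (a sketch, assuming the goal is to cut each name at the first digit, parenthesis, '&' or '/' while keeping the internal spaces):
import re

def process(w):
    # cut at the first digit, '(', '&' or '/', then trim surrounding spaces
    return re.split(r'[\d(&/]', w)[0].strip()

energy['Country'] = energy['Country'].apply(process)
# 'Bolivia (Plurinational State of)' -> 'Bolivia'
# 'Switzerland17' -> 'Switzerland'
# 'United States of America20' -> 'United States of America'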
I am building a file stripper to build a config report, and I have a very very long string as my base data. The following is a very small snippet of it, but it at least illustrates what I'm working with.
Snippet Example: DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&
How would I go about matching the following:
between "DEFAULT_GATEWAY=" and "&"
between "DELVRY_AGGREGATION_INTERVAL0=" and "&"
between "DELVRY_AGGREGATION_INTERVAL1=" and "&"
between "DELVRY_SCHEDULE=" and "&"
between "DELVRY_SNI0=" and "&"
between "DELVRY_USE_SSL_TLS1=" and "&"
and building a dict with it like:
{"DEFAULT_GATEWAY":"192.168.88.1",
"DELVRY_AGGREGATION_INTERVAL0":"1",
"DELVRY_AGGREGATION_INTERVAL1":"1",
"DELVRY_SCHEDULE0":"1",
"DELVRY_SNI0":"0",
"DELVRY_USE_SSL_TLS1":"0"}
?
Here is a way to do it.
In [1]: input = 'DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
In [2]: input.split('&')
Out[2]:
['DEFAULT_GATEWAY=192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0=1',
'DELVRY_AGGREGATION_INTERVAL1=1',
'DELVRY_SCHEDULE0=1',
'DELVRY_SNI0=192.168.88.158',
'DELVRY_USE_SSL_TLS1=0',
'']
In [3]: [keyval.split('=') for keyval in input.split('&') if keyval]
Out[3]:
[['DEFAULT_GATEWAY', '192.168.88.1'],
['DELVRY_AGGREGATION_INTERVAL0', '1'],
['DELVRY_AGGREGATION_INTERVAL1', '1'],
['DELVRY_SCHEDULE0', '1'],
['DELVRY_SNI0', '192.168.88.158'],
['DELVRY_USE_SSL_TLS1', '0']]
In [4]: dict(keyval.split('=') for keyval in input.split('&') if keyval)
Out[4]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
Notes
This is the input line
Split by & to get pairs of key-values. Note the last entry is empty
Split each entry by the equal sign while throwing away empty entries
Build a dictionary
Another Solution
In [8]: import urlparse
In [9]: urlparse.parse_qsl(input)
Out[9]:
[('DEFAULT_GATEWAY', '192.168.88.1'),
('DELVRY_AGGREGATION_INTERVAL0', '1'),
('DELVRY_AGGREGATION_INTERVAL1', '1'),
('DELVRY_SCHEDULE0', '1'),
('DELVRY_SNI0', '192.168.88.158'),
('DELVRY_USE_SSL_TLS1', '0')]
In [10]: dict(urlparse.parse_qsl(input))
Out[10]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
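Note that urlparse is the Python 2 module name; on Python 3 the same function lives in urllib.parse, so the equivalent would be:
from urllib.parse import parse_qsl

d = dict(parse_qsl(input))   # 'input' here is the string from In [1]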
Split first by '&' to get a list of strings, then by '=', like this:
d = dict(kv.split('=') for kv in line.split('&') if kv)
The if kv filter skips the empty string produced by the trailing '&'.
import re

keys = {"DEFAULT_GATEWAY",
        "DELVRY_AGGREGATION_INTERVAL0",
        "DELVRY_AGGREGATION_INTERVAL1",
        "DELVRY_SCHEDULE0",
        "DELVRY_SNI0",
        "DELVRY_USE_SSL_TLS1"}

resdict = {}
for k in keys:
    pat = '{}=([^&]*)&'.format(k)
    mo = re.search(pat, bigstring)
    if mo is None: continue  # no match
    resdict[k] = mo.group(1)
This will leave your desired result in resdict, if bigstring is the string you're searching in.
This assumes you know in advance which keys you'll be looking for, and you keep them in a set keys. If you don't know in advance the keys of interest, that's a very different issue of course.
I have the following file to be parsed:
Total Virtual Clients : 10 (1 Machines)
Current Connections : 10
Total Elapsed Time : 50 Secs (0 Hrs,0 Mins,50 Secs)
Total Requests : 337827 ( 6687/Sec)
Total Responses : 337830 ( 6687/Sec)
Total Bytes : 990388848 ( 20571 KB/Sec)
Total Success Connections : 3346 ( 66/Sec)
Total Connect Errors : 0 ( 0/Sec)
Total Socket Errors : 0 ( 0/Sec)
Total I/O Errors : 0 ( 0/Sec)
Total 200 OK : 33864 ( 718/Sec)
Total 30X Redirect : 0 ( 0/Sec)
Total 304 Not Modified : 0 ( 0/Sec)
Total 404 Not Found : 303966 ( 5969/Sec)
Total 500 Server Error : 0 ( 0/Sec)
Total Bad Status : 303966 ( 5969/Sec)
So I have a parsing algorithm to search the file for those values; however, when I do:
for data in temp:
    line = data.strip().split()
    print line
where temp is my temporary buffer, which contains those values,
I get:
['Total', 'I/O', 'Errors', ':', '0', '(', '0/Sec)']
['Total', '200', 'OK', ':', '69807', '(', '864/Sec)']
['Total', '30X', 'Redirect', ':', '0', '(', '0/Sec)']
['Total', '304', 'Not', 'Modified', ':', '0', '(', '0/Sec)']
['Total', '404', 'Not', 'Found', ':', '420953', '(', '5289/Sec)']
['Total', '500', 'Server', 'Error', ':', '0', '(', '0/Sec)']
and I wanted:
['Total I/O Errors', '0', '0']
['Total 200 OK', '69807', '864']
['Total 30X Redirect', '0', '0']
and so on.
How could I accomplish that?
You could use a regular expression as follows:
import re

rex = re.compile('([^:]+\S)\s*:\s*(\d+)\s*\(\s*(\d+)/Sec\)')
for line in temp:
    match = rex.match(line)
    if match:
        print match.groups()
which will give you:
['Total Requests', '337827', '6687']
['Total Responses', '337830', '6687']
['Total Success Connections', '3346', '66']
['Total Connect Errors', '0', '0']
['Total Socket Errors', '0', '0']
['Total I/O Errors', '0', '0']
['Total 200 OK', '33864', '718']
['Total 30X Redirect', '0', '0']
['Total 304 Not Modified', '0', '0']
['Total 404 Not Found', '303966', '5969']
['Total 500 Server Error', '0', '0']
['Total Bad Status', '303966', '5969']
Note that this will only match lines of the form "TITLE : NUMBER ( NUMBER/Sec)". You can adapt the expression to match the other lines as well.
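For example, a sketch of a slightly looser pattern that also accepts lines whose parenthesised part is not a plain rate (the "Total Elapsed Time" line would still need separate handling):
rex = re.compile('([^:]+\S)\s*:\s*(\d+)\s*\(\s*([\d,]+)[^)]*\)')
# 'Total Virtual Clients : 10 (1 Machines)' -> ('Total Virtual Clients', '10', '1')
# 'Total Bytes : 990388848 ( 20571 KB/Sec)' -> ('Total Bytes', '990388848', '20571')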
Regular expressions are overkill for parsing your data, but they are a convenient way to express the fixed-length fields. For example:
for data in temp:
    first, second, third = re.match("(.{28}):(.{21})(.*)", data).groups()
    ...
This means the first field is 28 chars; skip the ':'; the next 21 chars are the second field; the remainder is the third field.
Instead of splitting on whitespace, you will need to split based on the other delimiters in your format; it might look something like this:
for data in temp:
    first, rest = data.split(':')
    second, rest = rest.split('(')
    third, rest = rest.split(')')
    print [x.strip() for x in (first, second, third)]