How to correctly split the following string in Python?

I have the following file to be parsed:
Total Virtual Clients : 10 (1 Machines)
Current Connections : 10
Total Elapsed Time : 50 Secs (0 Hrs,0 Mins,50 Secs)
Total Requests : 337827 ( 6687/Sec)
Total Responses : 337830 ( 6687/Sec)
Total Bytes : 990388848 ( 20571 KB/Sec)
Total Success Connections : 3346 ( 66/Sec)
Total Connect Errors : 0 ( 0/Sec)
Total Socket Errors : 0 ( 0/Sec)
Total I/O Errors : 0 ( 0/Sec)
Total 200 OK : 33864 ( 718/Sec)
Total 30X Redirect : 0 ( 0/Sec)
Total 304 Not Modified : 0 ( 0/Sec)
Total 404 Not Found : 303966 ( 5969/Sec)
Total 500 Server Error : 0 ( 0/Sec)
Total Bad Status : 303966 ( 5969/Sec)
I have a parsing algorithm that searches the file for those values. However, when I do:
for data in temp:
    line = data.strip().split()
    print line
where temp is my temporary buffer, which contains those values,
I get:
['Total', 'I/O', 'Errors', ':', '0', '(', '0/Sec)']
['Total', '200', 'OK', ':', '69807', '(', '864/Sec)']
['Total', '30X', 'Redirect', ':', '0', '(', '0/Sec)']
['Total', '304', 'Not', 'Modified', ':', '0', '(', '0/Sec)']
['Total', '404', 'Not', 'Found', ':', '420953', '(', '5289/Sec)']
['Total', '500', 'Server', 'Error', ':', '0', '(', '0/Sec)']
and I wanted:
['Total I/O Errors', '0', '0']
['Total 200 OK', '69807', '864']
['Total 30X Redirect', '0', '0']
and so on.
How could I accomplish that?

You could use a regular expression as follows:
import re
rex = re.compile(r'([^:]+\S)\s*:\s*(\d+)\s*\(\s*(\d+)/Sec\)')
for line in temp:
    match = rex.match(line)
    if match:
        # groups() returns a tuple; convert it to get the list you asked for
        print list(match.groups())
which will give you:
['Total Requests', '337827', '6687']
['Total Responses', '337830', '6687']
['Total Success Connections', '3346', '66']
['Total Connect Errors', '0', '0']
['Total Socket Errors', '0', '0']
['Total I/O Errors', '0', '0']
['Total 200 OK', '33864', '718']
['Total 30X Redirect', '0', '0']
['Total 304 Not Modified', '0', '0']
['Total 404 Not Found', '303966', '5969']
['Total 500 Server Error', '0', '0']
['Total Bad Status', '303966', '5969']
Note that this will only match lines of the form "TITLE : NUMBER ( NUMBER/Sec)", so lines such as "Total Bytes" (which has KB/Sec) or "Total Elapsed Time" are skipped. You can adapt the expression to match the other lines as well.
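As a rough sketch of how the expression might be adapted (written for Python 3; the names STAT_RE and parse_stats and the embedded sample text are assumptions made for illustration, not part of the answer above), an optional unit such as KB can be allowed before /Sec, and the numbers converted to int:

```python
import re

# Assumed sample input; in practice the lines would come from the report file.
REPORT = """\
Total Elapsed Time : 50 Secs (0 Hrs,0 Mins,50 Secs)
Total Requests : 337827 ( 6687/Sec)
Total Bytes : 990388848 ( 20571 KB/Sec)
Total I/O Errors : 0 ( 0/Sec)
Total 404 Not Found : 303966 ( 5969/Sec)
"""

# title : total ( rate [unit]/Sec ) -- the unit (e.g. "KB") is optional.
STAT_RE = re.compile(
    r"\s*([^:]+?)\s*:\s*(\d+)\s*\(\s*(\d+)\s*(?:[A-Za-z]+)?/Sec\)"
)


def parse_stats(text):
    """Return [title, total, rate] for every line that fits the pattern."""
    rows = []
    for line in text.splitlines():
        match = STAT_RE.match(line)
        if match:  # lines in another format (e.g. elapsed time) are skipped
            title, total, rate = match.groups()
            rows.append([title, int(total), int(rate)])
    return rows


for row in parse_stats(REPORT):
    print(row)
```

Compiling the pattern once outside the loop keeps the per-line work small; lines that do not fit, such as the elapsed-time line, are simply ignored rather than raising an error.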

Regular expressions are overkill for parsing your data, but they are a convenient way to express the fixed-length fields. For example:
import re
for data in temp:
    first, second, third = re.match("(.{28}):(.{21})(.*)", data).groups()
    ...
This means the first field is 28 characters; skip the ':'; the next 21 characters are the second field, and the remainder is the third field.

Instead of splitting on whitespace, you will need to split on the other delimiters in your format. It might look something like this:
for data in temp:
    first, rest = data.split(':')
    second, rest = rest.split('(')
    third, rest = rest.split(')')
    print [x.strip() for x in (first, second, third)]
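The tuple unpacking in a delimiter-based split raises ValueError on lines that lack one of the delimiters (for example "Current Connections : 10" has no parentheses). A more forgiving variant, sketched here for Python 3 with str.partition (the helper name split_stat is hypothetical), might look like this:

```python
def split_stat(line):
    """Split 'title : value ( extra )' into [title, value, extra].

    str.partition never raises: when a delimiter is missing, the
    corresponding trailing parts are simply empty strings.
    """
    title, _, rest = line.partition(":")
    value, _, extra = rest.partition("(")
    # Keep only what precedes the last ')' (empty if there is no ')').
    extra = extra.rpartition(")")[0]
    return [part.strip() for part in (title, value, extra)]


print(split_stat("Total Requests : 337827 ( 6687/Sec)"))
print(split_stat("Current Connections : 10"))
```

Note that the third field still carries its "/Sec" suffix here; stripping it would need one more step, such as another partition on "/".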

Related

python regex for incomplete decimals numbers

I have a string of numbers which may have an incomplete decimal representation,
for example
a = '1. 1,00,000.00 1 .99 1,000,000.999'
desired output
['1','1,00,000.00','1','.99','1,000,000.999']
So far I have tried the following two patterns:
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
which turns .99 into 99, which is not desired,
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which includes undesirable empty strings as well.
This is for finding currency values in a string, so the comma separators don't follow a set pattern and may not be present at all.
My suggestion is to use the regex below:
I've implemented a snippet in python.
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
result = re.split('/\.?\d\.?\,?/', a)
print result
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split(r'(?<=\d)\.\s+|(?<=\d)\s+', a)
print d
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6
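For comparison, a findall-based sketch (Python 3; the names AMOUNT_RE and find_amounts are made up for this example) that uses no capturing groups, so re.findall returns plain strings. It assumes a number is either digits with optional comma-separated groups and an optional fractional part, or a bare fractional part such as .99:

```python
import re

# Either: digits, optional ",digits" groups, optional ".digits"
# Or:     a bare fractional part such as ".99"
AMOUNT_RE = re.compile(r"\d+(?:,\d+)*(?:\.\d+)?|\.\d+")


def find_amounts(text):
    """Return every number-like token found in text, as strings."""
    return AMOUNT_RE.findall(text)


a = '1. 1,00,000.00 1 .99 1,000,000.999'
print(find_amounts(a))
# ['1', '1,00,000.00', '1', '.99', '1,000,000.999']
```

A trailing dot, as in "1.", is left out of the match because the fractional part requires at least one digit after the dot.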

Python version 3.3 iterating over regex freezes

I am trying to match a line in a text file using a regex, but every time I call pattern.finditer(line) the program freezes. Another part of the program passes a block of text to the formatLine method. The text is in the form:
line="8,6,14,32,42,4,4,4,3,5,3,3,4,2,2,2,1,2,3,2,1,3,4,2,3,10,false,false,false,false,true,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,1,2,1,2,4,"
import re
def formatLine(line):
    print(line)
    print("----------")
    commas=len(line.split(","))
    timestamp="(\d+,){5}"
    q1="([1-5,]+){20}"
    q2="([1-5,]+)"
    q3="(true,|false,){10}"
    q4="(true,|false,){6}"
    q5="(true,|false,){20}"
    q6="([1-5,]+){5}"
    pattern=re.compile(timestamp+q1+q2+q4+q5)
    print("here")
    response=pattern.finditer(line)
    for ans in response:
        numPattern+=1
        #write to file for each instance of ans
    #these check that the file is valid
    print("here")
    #more code, omitted
formatLine(line)#call method here
The first and second print statements print correctly, but the word "here" is never printed. Anyone know why it freezes and/or what I can do to fix it?
Edit: After reading the comments I realized a better question would be: how can I improve the regex above to match the pattern below? I just started Python (yesterday) and have been reading the Python regex tutorial repeatedly.
Each value (true, false, or a number) is separated by a comma; the file I am pulling from is a CSV.
Pattern I am trying to match:
5 numbers (each number is 0-60)
20 digits (each digit is 1-5)
36 true or false (may be in any arrangement of true or false)
5 digits (each digit is 1-5)
Your expression, particularly the ([1-5,]+){20} part causes catastrophic backtracking. It doesn't hang, it's just busy solving the puzzle: "get me digits repeated N times repeated 20 times". You might be better off replacing it with something like ([1-5]+,){20}, although I don't think your approach is viable at all. Just split the string by commas and slice what you want from the list.
Per your update, this seems to be the right pattern:
pattern = r"""(?x)
    ([0-9], | [1-5][0-9], | 60,) {5}   # 5 numbers (each number is 0-60)
    ([1-5],) {20}                      # 20 digits (each digit is 1-5)
    (true,|false,) {36}                # 36 true or false (may be in any arrangement of true or false)
    ([1-5],) {5}                       # 5 digits (each digit is 1-5)
"""
Of course, this is what the csv module is designed to do, no regex needed:
line="8,6,14,32,42,4,4,4,3,5,3,3,4,2,2,2,1,2,3,2,1,3,4,2,3,10,false,false,false,false,true,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,1,2,1,2,4"
from io import StringIO # this allows the string to act like a file
f=StringIO(line) # -- just use a file
### cut here
import csv
reader=csv.reader(f)
for e in reader:
    # then just slice up the components; each slice starts where the
    # previous one ended, so that no field is skipped:
    q1=e[0:5] # ['8', '6', '14', '32', '42']
    q2=e[5:26] # the 1-5 answers: '4', '4', '4', '3', '5', ... (in this sample row the run ends with '10')
    q3=e[26:53] # the run of 'true'/'false' flags
    q4=e[53:] # ['1', '2', '1', '2', '4']
You can then validate each section as desired.
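One possible shape for that validation step, as a Python 3 sketch: it assumes the layout stated in the question (5 numbers in 0-60, 20 digits in 1-5, 36 true/false values, 5 digits in 1-5, i.e. 66 fields), and the function name validate_row is hypothetical:

```python
ANSWERS = {"1", "2", "3", "4", "5"}
FLAGS = {"true", "false"}


def validate_row(line):
    """Return True if a CSV line has the assumed 5 + 20 + 36 + 5 layout."""
    # Drop surrounding whitespace and any trailing commas before splitting.
    fields = line.strip().rstrip(",").split(",")
    if len(fields) != 66:
        return False
    timestamp = fields[:5]
    q1 = fields[5:25]
    flags = fields[25:61]
    q2 = fields[61:]
    return (
        all(f.isdecimal() and int(f) <= 60 for f in timestamp)
        and all(f in ANSWERS for f in q1 + q2)
        and all(f in FLAGS for f in flags)
    )


good = ",".join(
    ["8", "6", "14", "32", "42"] + ["3"] * 20 + ["true", "false"] * 18
    + ["1", "2", "1", "2", "4"]
) + ","
print(validate_row(good))
```

Because this only splits and compares strings, it runs in time proportional to the line length and cannot backtrack the way the nested-quantifier regex does.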

Python: extracting patterns from CSV [closed]

Below is a sample of the typical contents of a CSV file.
['**05:32:55PM**', '', '', 'Event Description', '0', "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']
['', '', '', '', '', "17' 29.1 S** ]", '']
['06:09:11PM', '', '', 'Event Description', '0', "89.0 near Someother Street; Suburb Ext 3; in Town Park; [Long 37\xb0 14' 34.9 E Lat 29\xb0", '']
['', '', '', '', '', "17' 29.1 S ]", '']
['Report Line Header ', '', '', '', '', '', '']
['HeaderX', ': HeaderY', '', 'HeaderZ', '', 'HeaderAA', '']
['From Date', ': 2014/01/17 06:00:00 AM', '', 'To Date : 2014/01/17 06:15:36 PM', '', 'HeaderBB', '']
['HeaderA', 'HeaderB', 'Header0', 'Header1', 'Header2', 'Header3', '']
['', '', '', '', 'Header 4', 'Header5', '']
From each line containing the date/time and the location (marked with ** ... **), I would like to extract just that relevant information while ignoring the rest.
Even just printing the results to the screen would be OK; ideally, I would create a new CSV containing only the time and lat/long.
If you really want to extract the data of this file formatted as in your example, then you could use the following since the data in every line has a list representation:
>>> import ast
>>> f = open('data.txt', 'r')
>>> lines = f.readlines()
>>> for line in lines:
...     list_representation_of_line = ast.literal_eval(line)
...     for element in list_representation_of_line:
...         if element.startswith('**') and element.endswith('**'):
...             print list_representation_of_line
...             # or print single fields, e.g. timeIndex = 0 or another index
...             # print list_representation_of_line[timeIndex]
...             break
...
['**05:32:55PM**', '', '', 'Event Description', '0', "89.0 near Some Street; Suburb Ext 3; in Town Park; [**Long 37\xb0 14' 34.8 E Lat 29\xb0", '']
>>>
Otherwise, you should reformat your data as CSV.
If that's really what your CSV file looks like, I wouldn't even bother. It's got different data on different rows, and a huge mess of nested ad-hoc strings, with separators within separators.
Even once you get to your lat and long figures, they look like a bizarre mix of decimal, hex and character data.
I think you'd be asking for trouble by giving the impression that you can deal with data in that format programmatically. If it's just a once off task, and that's the extent of the data, I'd do it by hand.
If not, I think the correct solution is to push back and try to get some cleaner data.
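If the data does have to be processed as it stands, one possible approach is sketched below (Python 3). It rests on several assumptions taken from the sample only: rows are already parsed into 7-element lists, the time is in column 0, the location text is in column 5, and the coordinates may spill into column 5 of the following row. The names ROWS, TIME_RE, COORD_RE and extract_events are invented for this example:

```python
import re

# Assumed sample rows, shaped like the ones in the question.
ROWS = [
    ['**05:32:55PM**', '', '', 'Event Description', '0',
     "89.0 near Some Street; Suburb Ext 3; in Town Park; "
     "[**Long 37\xb0 14' 34.8 E Lat 29\xb0", ''],
    ['', '', '', '', '', "17' 29.1 S** ]", ''],
    ['Report Line Header ', '', '', '', '', '', ''],
]

TIME_RE = re.compile(r"\d{2}:\d{2}:\d{2}[AP]M")
COORD_RE = re.compile(r"Long\s+(.+?)\s+Lat\s+(.+?)\s*\]")


def extract_events(rows):
    """Return (time, longitude, latitude) tuples for rows that start with a time."""
    events = []
    for i, row in enumerate(rows):
        time_match = TIME_RE.search(row[0])
        if not time_match:
            continue  # header or continuation row
        location = row[5]
        # Assumption: a following row with an empty first column continues
        # the location text of this row.
        if i + 1 < len(rows) and not rows[i + 1][0]:
            location += " " + rows[i + 1][5]
        # The '*' characters are only markers from the question, so drop them.
        coords = COORD_RE.search(location.replace("*", ""))
        if coords:
            events.append((time_match.group(), coords.group(1), coords.group(2)))
    return events


for event in extract_events(ROWS):
    print(event)
```

The resulting tuples could then be written out with csv.writer. Whether the continuation-row rule holds for the whole file is something only the real data can confirm.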

Python - slow speed when extracting numbers/words from a string

Noob here trying to learn python by doing a project as I don't learn well from books.
I am using a huge lump of code to perform what seems to me to be a small operation -
I want to extract 4 variables from the following string
'Miami 0, New England 28'
(variables being home_team, away_team, home_score, away_score)
My program is running pretty slow and I think it might be this bit of code. I guess I am looking for the quickest/most efficient way of doing this.
Would regex be quicker? Thanks
It seems like your text could be split twice: first on ',' and then on whitespace:
info1,info2 = s.split(',')
home,home_score = info1.rsplit(None,1)
away,away_score = info2.rsplit(None,1)
e.g.:
>>> s = 'Miami 0, New England 28'
>>> info1,info2 = s.split(',')
>>> home,home_score = info1.rsplit(None,1)
>>> away,away_score = info2.rsplit(None,1)
>>> print [home,home_score,away,away_score]
['Miami', '0', ' New England', '28']
You could do this with regex without too much difficulty -- but you pay for it in terms of readability.
In case you do want a regex:
import re
s='Miami 0, New England 28'
l=re.findall(r'^([^\d]+)\s(\d+)\s*,\s*([^\d]+)\s(\d+)',s)
# the groups come back in the order: team, score, team, score
hm_team,hm_score,away_team,away_score=l[0]
print l
Prints [('Miami', '0', 'New England', '28')] and assigns those values to the variables.
import re
reg = re.compile('\s*(\D+?)\s*(\d+)'
                 '[,;:.#=#\s]*'
                 '(\D+?)\s*(\d+)'
                 '\s*')
for s in ('Miami 0, New England 28',
          'Miami0,New England28 ',
          ' Miami 0 . New England28',
          'Miami 0 ; New England 28',
          'Miami0#New England28 ',
          ' Miami 0 # New England28'):
    print reg.search(s).groups()
result
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
('Miami', '0', 'New England', '28')
'\D' matches any character that is not a digit.
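Putting the pieces together, a small Python 3 sketch that returns the four values with the scores converted to int. The function name parse_score is hypothetical, and it assumes exactly the 'Team N, Team N' shape with a comma separator; the pattern is compiled once so that repeated calls do not recompile it:

```python
import re

# team, score, comma, team, score; \D+? is lazy so it stops before the score.
SCORE_RE = re.compile(r"\s*(\D+?)\s*(\d+)\s*,\s*(\D+?)\s*(\d+)\s*$")


def parse_score(text):
    """Return (home_team, home_score, away_team, away_score)."""
    match = SCORE_RE.match(text)
    if match is None:
        raise ValueError("unexpected score format: %r" % (text,))
    home_team, home_score, away_team, away_score = match.groups()
    return home_team, int(home_score), away_team, int(away_score)


print(parse_score('Miami 0, New England 28'))
# ('Miami', 0, 'New England', 28)
```

For a single short string like this, either the split-based or the regex-based version is likely to be fast; if the program is slow, profiling it may point to a different part of the code.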

Regex to match a capturing group one or more times

I'm trying to match pairs of digits in a string and capture them in groups; however, I seem to be able to capture only the last group.
Regex:
(\d\d){1,3}
Input String: 123456 789101
Match 1: 123456
Group 1: 56
Match 2: 789101
Group 1: 01
What I want is to capture all the groups like this:
Match 1: 123456
Group 1: 12
Group 2: 34
Group 3: 56
Update:
It looks like Python does not keep every capture of a repeated group (in .NET, for example, you can get all the captures in a single pass), so re.findall('\d\d', '123456') does the job instead.
You cannot do that with a single match: a repeated capturing group keeps only its last repetition, so (\d\d){1,3} reports just the final pair of each match.
The regex library in Python comes with a non-overlapping routine, namely re.findall(), that does the trick, as in:
re.findall('\d\d', '123456')
will return ['12', '34', '56']
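To keep the pairs grouped per match, as in the desired output, one option is a two-step approach, sketched here for Python 3 (the variable names are made up for the example): match each run of pairs first, then split the run with re.findall:

```python
import re

text = "123456 789101"

# A repeated capturing group only remembers its last repetition.
last_only = [m.group(1) for m in re.finditer(r"(\d\d){1,3}", text)]

# Step 1: match each run of one to three pairs (non-capturing group).
# Step 2: split the matched run into its pairs.
pairs = [
    re.findall(r"\d\d", m.group())
    for m in re.finditer(r"(?:\d\d){1,3}", text)
]

print(last_only)  # ['56', '01']
print(pairs)      # [['12', '34', '56'], ['78', '91', '01']]
```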
(\d{2})+(\d)?
I'm not sure how python handles its matching, but this is how i would do it
Try this:
import re
re.findall(r'\d\d','123456')
Is this what you want ? :
import re
regx = re.compile('(?:(?<= )|(?<=\A)|(?<=\r)|(?<=\n))'
                  '(\d\d)(\d\d)?(\d\d)?'
                  '(?= |\Z|\r|\n)')
for s in (' 112233 58975 6677 981 897899\r',
          '\n123456 4433 789101 41586 56 21365899 362547\n',
          '0101 456899 1 7895'):
    print repr(s),'\n',regx.findall(s),'\n'
result
' 112233 58975 6677 981 897899\r'
[('11', '22', '33'), ('66', '77', ''), ('89', '78', '99')]
'\n123456 4433 789101 41586 56 21365899 362547\n'
[('12', '34', '56'), ('44', '33', ''), ('78', '91', '01'), ('56', '', ''), ('36', '25', '47')]
'0101 456899 1 7895'
[('01', '01', ''), ('45', '68', '99'), ('78', '95', '')]
