PyParsing : how to use SkipTo and OR(^) operator


My input lines start with date prefixes in different formats, plus other prefixes. I need a grammar that skips these prefixes and extracts the required data. But when I use SkipTo together with the Or (^) operator, I do not get the desired results.
from pyparsing import *
import pprint

def print_cal(v):
    print(v)

f = open("test", "r")
NAND_TIME = Group(SkipTo(Literal("NAND TIMES"), include=True) + Word(nums) + Literal(":").suppress() + Word(nums)).setParseAction(lambda t: print_cal('NAND TIME'))
TEST_TIME = Group(SkipTo(Literal("TEST TIMES"), include=True) + Word(nums) + Literal(":").suppress() + Word(nums)).setParseAction(lambda t: print_cal('TEST TIME'))
testing = NAND_TIME ^ TEST_TIME
watch = OneOrMore(testing)
watch.parseString(f.read())
File Contents:
01 may 2015 15:15:100 NAND TIMES 1: 88008888
01 april 2015 15:15:100 NAND TIMES 2: 77777777
1154544 15:15:100 TEST TIMES 1: 78544545
8787878 aug 2015 15:15:100 TEST TIMES 2: 78787878
OUTPUT :
TEST TIME
TEST TIME
Desired output :
NAND TIME
NAND TIME
TEST TIME
TEST TIME
Can anyone help me understand this ?

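Your original parser prints TEST TIME twice because ^ (Or) tries every alternative and keeps the longest match: starting from the top of the file, SkipTo(Literal("TEST TIMES")) skips straight over both NAND lines, so TEST_TIME always produces the longer match. Using SkipTo as the first element of a parser is a bit bold, and may indicate that searchString or scanString would be a better choice than parseString. searchString and scanString let you define just the part of the input that you are interested in, and the rest is skipped over automatically; you do have to take care that your definition of "what you want" is unambiguous and does not accidentally pick up unwanted bits. Here is your parser implemented using searchString: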
NAND_TIME = (Literal("NAND TIMES") + Word(nums) + Literal(":").suppress() + Word(nums)).setParseAction(lambda t: print_cal('NAND TIME'))
TEST_TIME = (Literal("TEST TIMES") + Word(nums) + Literal(":").suppress() + Word(nums)).setParseAction(lambda t: print_cal('TEST TIME'))
testing = NAND_TIME | TEST_TIME
testdata = f.read()
for match in testing.searchString(testdata):
    print(match.asList())
'|' is perfectly fine to use in this case, as there is no possible confusion between starting with NAND or starting with TEST.
You might also consider just parsing this file a line at a time:
for line in f:
    # each line still carries its newline, so strip before testing for emptiness
    if not line.strip():
        continue
    print(line)
    print(testing.searchString(line).asList())
    print()
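For comparison, the same scan-instead-of-parse idea can be sketched with the standard library alone. This is not pyparsing: it swaps in re.finditer, and the names SAMPLE, TIMES_RE and scan_times are made up for illustration. The sample text is the file contents from the question.

```python
import re

# File contents copied from the question.
SAMPLE = """\
01 may 2015 15:15:100 NAND TIMES 1: 88008888
01 april 2015 15:15:100 NAND TIMES 2: 77777777
1154544 15:15:100 TEST TIMES 1: 78544545
8787878 aug 2015 15:15:100 TEST TIMES 2: 78787878
"""

# Describe only the interesting part; everything before it is skipped.
TIMES_RE = re.compile(r"(NAND|TEST) TIMES\s+(\d+)\s*:\s*(\d+)")


def scan_times(text):
    """Return (kind, index, value) for every NAND/TEST TIMES entry."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in TIMES_RE.finditer(text)
    ]


for kind, index, value in scan_times(SAMPLE):
    print(kind, "TIME", index, value)
```

As with searchString, the prefix formats never need to be described, which is why the varying date prefixes stop being a problem.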

Related

How to extract multiple time from same string in Python?

I'm trying to extract time from single strings where in one string there will be texts other than only time. An example is s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'.
I've tried using datefinder module like this :
from datetime import datetime as dt
import datefinder as dfn
for m in dfn.find_dates(s):
    print(dt.strftime(m, "%H:%M:%S"))
Which gives me this :
17:58:00
In this case the time "06:00" is missed out. Now if I try without datefinder with only datetime module like this :
dt.strftime(s, "%H:%M")
It notifies me that the input must be a datetime object already, not a string with the following error :
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'str'
So I tried to use dateutil module to parse this string s to a datetime object with this :
from dateutil.parser import parse
parse(s)
But now it says that my string is not in a proper format (in most cases it will not be in any fixed format), showing me this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1358, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '12/Jul/2019 12/Aug/2019 MEISHAN BRIDGE 06:00 17:58')
I have thought of getting the time with regex like
import re
p = r"\d{2}\:\d{2}"
times = [i.group() for i in re.finditer(p, s)]
# Gives me ['06:00', '17:58']
But doing it this way means I have to check again whether the regex-matched chunks are actually times, because even "99:99" would match the regex and be wrongly reported as a time. Is there any workaround without regex to get all the times from a single string?
Please note that the string may or may not contain a date, but it will always contain a time. Even if it contains a date, the date format could be anything, and the string may or may not contain other irrelevant text.

I don't see many options here, so I would go with a heuristic. I would run the following against the whole dataset and extend the config/regexes until it covers all/most of the cases:
import re
import logging
from datetime import datetime as dt

s = 'Dates : 12/Jul/2019 12/08/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58:59'

SUPPORTED_DATE_FMTS = {
    re.compile(r"(\d{2}/\w{3}/\d{4})"): "%d/%b/%Y",
    re.compile(r"(\d{2}/\d{2}/\d{4})"): "%d/%m/%Y",
    re.compile(r"(\d{2}/\w{3}\w+/\d{4})"): "%d/%B/%Y",
    # Capture more here
}

SUPPORTED_TIME_FMTS = {
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9])[^:]"): "%H:%M",
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9]:[0-5][0-9])"): "%H:%M:%S",
    # Capture more here
}

def extract_supported_dt(config, s):
    """
    Loop thru the given config (keys are regexes, values are date/time format)
    and attempt to gather all valid data.
    """
    valid_data = []
    for regex, fmt in config.items():
        # Extract what you think looks like date
        valid_ish_data = regex.findall(s)
        if not valid_ish_data:
            continue
        print("Checking " + str(valid_ish_data))
        # validate it
        for d in valid_ish_data:
            try:
                valid_data.append(dt.strptime(d, fmt))
            except ValueError:
                pass
    return valid_data

# Handle dates
dates = extract_supported_dt(SUPPORTED_DATE_FMTS, s)
# Handle times
times = extract_supported_dt(SUPPORTED_TIME_FMTS, s)

print("Found dates: ")
for date in dates:
    print("\t" + str(date.date()))
print("Found times: ")
for t in times:
    print("\t" + str(t.time()))
Example output:
Checking ['12/Jul/2019']
Checking ['12/08/2019']
Checking ['06:00']
Checking ['17:58:59']
Found dates:
    2019-07-12
    2019-08-12
Found times:
    06:00:00
    17:58:59
This is a trial-and-error approach, but I do not think there is an alternative in your case. My goal here is to make it as easy as possible to add support for more date/time formats, rather than trying to find a solution that covers 100% of the data on day one. This way, the more data you run it against, the more complete your config will be.
One thing to note is that you will have to detect strings that appear to have no dates and log them somewhere. Later you will need to review them manually and see whether something that was missed could be captured.
Now, assuming that your data is being generated by another system, sooner or later you will be able to match 100% of it. If the input comes from humans, you will probably never get to 100%! (People tend to make spelling mistakes and sometimes enter random stuff... date=today :) )
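The "log what you could not match" step might look something like the sketch below. It is only an illustration under assumptions: the logger name dt_extract, the single TIME_RE pattern and the function split_unmatched are all hypothetical, and a real version would loop over the full config of supported formats instead of one regex.

```python
import logging
import re

logger = logging.getLogger("dt_extract")  # hypothetical logger name

# One HH:MM pattern for brevity; a real config would hold several formats.
TIME_RE = re.compile(r"\b(?:[01][0-9]|2[0-3]):[0-5][0-9]\b")


def split_unmatched(strings):
    """Separate inputs with at least one time from those with none.

    Inputs without a match are logged so they can be reviewed by hand
    and used to extend the set of supported formats later.
    """
    matched, unmatched = [], []
    for s in strings:
        if TIME_RE.search(s):
            matched.append(s)
        else:
            logger.warning("no time found in %r", s)
            unmatched.append(s)
    return matched, unmatched


matched, unmatched = split_unmatched(["Time : 06:00 17:58", "date=today"])
print(len(matched), "matched,", len(unmatched), "left for manual review")
```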

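If you only need the times, this regex should work (the hour part is restricted to 00 through 23, since a plain [0-2][0-9] would also accept 24 through 29):
r"(?:[01][0-9]|2[0-3]):[0-5][0-9]"
If there could be spaces around the colon, as in 23 : 59, use this:
r"(?:[01][0-9]|2[0-3])\s*:\s*[0-5][0-9]"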

Use a regex, but something like this:
(?=[0-1])[0-1][0-9]\:[0-5][0-9]|(?=2)[2][0-3]\:[0-5][0-9]
This matches:
00:00 00:59 01:00 01:59 02:00 02:59
09:00 10:00 11:59 20:00 21:59 23:59
It does not match:
99:99 23:99 01:99
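A quick way to check a pattern like this is to run it against the string from the question and against the known-bad values. The sketch below assumes a slightly simplified pattern with word boundaries added; TIME_RE and find_times are illustrative names, not part of any answer above.

```python
import re

# Hours 00-23, minutes 00-59; \b keeps the match from starting or
# ending in the middle of a longer run of digits.
TIME_RE = re.compile(r"\b(?:[01][0-9]|2[0-3]):[0-5][0-9]\b")


def find_times(text):
    """Return every HH:MM substring with a valid hour and minute."""
    return TIME_RE.findall(text)


s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'
print(find_times(s))
print(find_times("99:99 23:99 01:99 24:00"))
```

Because the range check lives in the pattern itself, no second validation pass over the matches is needed for plain HH:MM values.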

you could use dictionaries:
my_dict = {}
for i in s.split(', '):
    m = i.strip().split(' : ', 1)
    my_dict[m[0]] = m[1].split()
my_dict
Out:
{'Dates': ['12/Jul/2019', '12/Aug/2019'],
'Loc': ['MEISHAN', 'BRIDGE'],
'Time': ['06:00', '17:58']}

Parsing numbers in strings from a file

I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
What I want to do is check that the percentages in each "res" add up to 100%.
def addPoll(pid, party, state, res, filetype):
    with open('Polls.txt', 'a+') as file:  # open file temporarily for writing and reading
        lines = file.readlines()  # get all lines from file
        file.seek(0)
        next(file)  # go to next line --
        # this is supposed to skip the 1st line with pid/party/state/res
        for line in lines:  # loop
            line = line.split(',', 3)[3]
            y = line.split()
            print(y)
        #else:
            #file.write(pid + "," + party + "," + state + "," + res+"\n")
    #file.close()
    return "pass"

print(addPoll("123", "Democratic", "OH", "bla bla 50%-Asd ASD 50%", 'f'))
So in my code I manage to split off everything after the last ',' and put it into a list, but I'm not sure how to get only the numbers out of that text.

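You can use regex to find all the percentages:
import re
for line in lines:
    numbers = re.findall(r'(\d+)%', line)
    numbers = [int(n) for n in numbers]
    print(sum(numbers))
This will print
0 # no percentages in the header line
92
84
95
91
91
The re.findall() function finds all substrings matching the specified pattern, which in this case is (\d+)%, meaning a continuous run of digits followed by a percent sign; because the pattern contains one capture group, only the digits are returned. Matching a bare \d+ on the whole line would also pick up the digit in the pid column (the 5 in SC5, for example) and inflate every total. The result is a list of strings, which we convert to a list of ints and then sum.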

It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Do something with row['res']
        pass
For the # Do something part, you can parse the field manually (it appears to be structured): split('-'), then rsplit(' ', 1) each '-'-separated part (the last item should be the percent). If you're trying to enforce a format, I'd definitely go this route, but regex is also a fine solution for quickly pulling out what you want. You'll want to read up on it, but in your case the pattern you want is \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate in row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
    # Handle bad percent (not ending in %)
    pass
else:
    # Throws ValueError if any of the percents aren't integers
    percents = [int(p[:-1]) for p in percents]
    if sum(percents) != 100:
        # Handle bad total
        pass
Or with regex (this needs import re):
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
    # Handle bad total here
    pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
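Putting the pieces together, a minimal end-to-end sketch could look like this. It reads the sample rows from the question out of an in-memory string instead of Polls.txt so that it runs on its own; POLLS and poll_totals are names invented for the example.

```python
import csv
import io
import re

# Sample data copied from the question; a real script would open Polls.txt.
POLLS = """\
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
"""


def poll_totals(text):
    """Map each pid to the sum of the percentages in its res column."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["pid"]: sum(int(n) for n in re.findall(r"(\d+)%", row["res"]))
        for row in reader
    }


for pid, total in poll_totals(POLLS).items():
    status = "ok" if total == 100 else "does not reach 100%"
    print(pid, total, status)
```

DictReader consumes the header row itself, so there is no need to skip the first line by hand.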
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file = open("outputfile.txt", "w")
    for line in a:

Without regex:
for line in file:
    keep = []
    line = line.strip()
    if line.startswith('FRAME'):
        continue
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))

My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print(result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.

Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
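As a rough sketch of how that pattern could be applied to the whole input, the snippet below runs it over the sample lines from the question. FRAME_RE and convert are illustrative names, and the leading \s* is an added assumption so that indented lines also match.

```python
import re

# Three capture groups: frame number, first joint, second joint.
FRAME_RE = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")


def convert(lines):
    """Return 'n j1 j2' for each line that matches; skip the rest."""
    out = []
    for line in lines:
        m = FRAME_RE.match(line)
        if m:  # the FRAME header does not match and is dropped
            out.append(" ".join(m.groups()))
    return out


raw = [
    "FRAME",
    "1 J=1,8 SEC=CL1 NSEG=2 ANG=0",
    "2 J=8,15 SEC=CL2 NSEG=2 ANG=0",
    "3 J=15,22 SEC=CL3 NSEG=2 ANG=0",
]
print("\n".join(convert(raw)))
```

Since match() only anchors at the start of the line, the SEC, NSEG and ANG fields never need to appear in the pattern at all.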

Getting 'TypeError, not enough arguments for format string' while writing to file?

I am trying to write code that takes a number from some output, creates a text file named after that number, and then writes some lines that use that number into the file.
I tried this code:
for i in Nums:
    sh1 = '%d.txt' %i
    target = open (sh1, 'w') ## a will append, w will over-write
    text = '%d * 0\n %d *1'
    target.write(text %(i))
    target.close()
But I get this error: TypeError: not enough arguments for format string.
I do not understand why this error appears. I searched, but the solutions I found did not work with my code.
What I need is for the code to create the text file: for example, if I enter the number 1, it creates a file named 1.txt and writes these lines to it:
1 * 0
1 * 1
1 * 2
1 * 3
Any help ?
Thanks in advance

The problem is with the expression
text %(i)
text refers to the format string '%d * 0\n %d *1', which contains two %d placeholders, but you’re only passing one argument, i. You need to do something like
text % (i, j)
For example, text % (4, 5) would give you
4 * 0\n 5 *1
By the way, it’s standard to put a space on both sides of the % operator used for formatting. And if you’re passing just one argument to a formatting operation and you want to use a tuple, you need to write (i,) rather than (i), because (i) is simply i in parentheses, not a tuple.
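To make the placeholder-counting rule concrete, here is a small sketch that builds the desired lines with one argument per %d. The helper names times_table and file_name, and the upto parameter, are made up for illustration.

```python
def times_table(i, upto=3):
    """Return the lines 'i * 0' through 'i * upto', one per line."""
    # Two %d placeholders, so the right-hand side is a two-item tuple.
    return "".join("%d * %d\n" % (i, j) for j in range(upto + 1))


def file_name(i):
    """Return the file name for number i, e.g. '1.txt'."""
    # One placeholder: a one-item tuple needs the trailing comma.
    return "%d.txt" % (i,)


print(file_name(1))
print(times_table(1), end="")
```

Writing times_table(i) to the file opened under file_name(i) would then produce the four lines asked for in the question.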

Why don't you use str.format?
for i in Nums:
    target = open('{}.txt'.format(i), 'w')
    target.write('{0} * 0\n{0} * 1\n{0} * 2\n{0} * 3\n'.format(i))
    target.close()
The '%d.txt' % i style of string formatting is slowly becoming less common, perhaps due to its slightly confusing usability. str.format is a bit more concise and gives you much the same functionality. You only need to write {} to mark where the parameter will go. You can also specify the index of the parameter inside the braces, such as {0} or {1}.

How can I read one line from a telnet response with Python?

I was surprised that I couldn't find this question on here.
I would like to extract one line from a telnet response and store it in a variable (actually just one number from that line). I can extract up to where I need using telnet.read_until(), but the whole beginning is still there. The printout shows different statuses of a machine.
The line I am trying to get is formatted like this:
CPU Utilization : 5 %
I really only need the number, but there are many ':' and '%' characters in the rest of the output. Can anyone help me extract this value? Thanks in advance!
Here is my code (this reads the whole output and prints):
import socket  # needed for socket.timeout below
import telnetlib, time

print("Starting Client...")
host = input("Enter IP Address: ")
timeout = 120
print("Connecting...")
try:
    session = telnetlib.Telnet(host, 23, timeout)
except socket.timeout:
    print("socket timeout")
else:
    print("Sending Commands...")
    session.write("command".encode('ascii') + b"\r")
    print("Reading...")
    # read up to the prompt that follows a blank line
    output = session.read_until(b"\r\n\r\n#>", timeout)
    session.close()
    print(output)
    print("Done")
Edit: some example of what an output could be:
Boot Version : 1.1.3 (release_82001975_C)
Post Version : 1.1.3 (release_82001753_E)
Product VPD Version : release_82001754_C
Product ID : 0x0076
Hardware Strapping : 0x004C
CPU Utilization : 5 %
Uptime : 185 days, 20 hours, 31 minutes, 29 seconds
Current Date/Time : Fri Apr 26 17:50:30 2013

As you say in the question:
I can extract up to where I need using telnet.read_until(), but the whole beginning is still there.
So you can get all of the lines up to and including the one you want into a variable output. The only thing you're missing is how to get just the last line in that output string, right?
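That's easy: just split output into lines and take the last one (in Python 3, read_until() returns bytes, so decode it first, for example with output.decode('ascii')):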
output.splitlines()[-1]
Or just split off the last line:
output.rpartition('\n')[-1]
This doesn't change output, it's just an expression that computes a new value (the last line in output). So, just doing this, followed by print(output), won't do anything visibly useful.
Let's take a simpler example:
a = 3
a + 1
print(a)
That's obviously going to print 3. If you want to print 4, you need something like this:
a = 3
b = a + 1
print(b)
So, going back to the real example, what you want is probably something like this:
line = output.rpartition('\n')[-1]
print(line)
And now you'll see this:
CPU Utilization : 5 %
Of course, you still need something like Johnny's code to extract the number from the rest of the line:
numbers = [s for s in line.split() if s.isdigit()]
print(numbers)
Now you'll get this:
['5']
Notice that gives you a list of one string. If you want just the one string, you still have another step:
number = numbers[0]
print(number)
Which gives you:
5
And finally, number is still the string '5', not the integer 5. If you want that, replace that last bit with:
number = int(numbers[0])
print(number)
This will still print out 5, but now you have a variable you can actually use as a number:
print(number / 100.0) # convert percent to decimal
I'm depending on the fact that telnet defines end-of-line as \r\n, and any not-quite-telnet-compatible server that gets it wrong is almost certainly going to use either Windows-style (also \r\n) or Unix-style (just \n) line endings. So, splitting on \n will always get the last line, even for screwy servers. If you don't need to worry about that extra robustness, you can split on \r\n instead of \n.
There are other ways you could solve this. I would probably either use something like session.expect([r'CPU Utilization\s*: (\d+)\s*%']), or wrap the session as an iterator of lines (like a file) and then just write the standard itertools solution. But this seems to be the simplest given what you already have.
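The regular-expression route can also be tried without a live telnet session by applying the same kind of pattern to the captured output. The sketch below assumes the output is the bytes object returned by read_until() in Python 3; SAMPLE is trimmed from the example output in the question, and CPU_RE and cpu_utilization are illustrative names.

```python
import re

# Bytes pattern, because telnet output arrives as bytes in Python 3.
CPU_RE = re.compile(rb"CPU Utilization\s*:\s*(\d+)\s*%")

# A few lines from the example output in the question.
SAMPLE = (
    b"Hardware Strapping : 0x004C\r\n"
    b"CPU Utilization : 5 %\r\n"
    b"Uptime : 185 days, 20 hours, 31 minutes, 29 seconds\r\n"
)


def cpu_utilization(output):
    """Return the CPU percentage as an int, or None if it is absent."""
    m = CPU_RE.search(output)
    return int(m.group(1)) if m else None


print(cpu_utilization(SAMPLE))
```

Searching the whole buffer means it does not matter whether the CPU line is the last line read or somewhere in the middle.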

As I understand the problem, you want to select 1 line out of a block of lines, but not necessarily the last line.
The line you're interested in always starts with "CPU Utilization"
This should work:
for line in output.splitlines():
    if 'CPU Utilization' in line:
        cpu_utilization = line.split()[-2]

If you want to get only numbers:
>>> output = "CPU Utilization : 5 %"
>>> [int(s) for s in output.split() if s.isdigit()]
[5]
>>> output = "CPU Utilization : 5 % % 4.44 : 1 : 2"
>>> [int(s) for s in output.split() if s.isdigit()]
[5, 1, 2]
Note that 4.44 is dropped, because '4.44'.isdigit() is False.
EDIT:
for line in output.splitlines():
    print(line)  # this prints every single line in the loop, so you can also do:
    print([int(s) for s in line.split() if s.isdigit()])

In [27]: mystring= "% 5 %;%,;;;;;%"
In [28]: ''.join(c for c in mystring if c.isdigit())
Out[28]: '5'
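Another way, using filter:
def find_digit(mystring):
    # in Python 3, filter() returns an iterator, so join it back into a string
    return ''.join(filter(str.isdigit, mystring))
find_digit(mystring)
'5'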
