Python - regex parsing file - python

I have a file like this
module modulename(wire1, wire2, \wire3[0], \wire3[1], \wire3[2], wire4, wire5,wire6, wire7, \wire8[0], wire9); nonmodule modulename(wire1, wire2, \wire3[0], \wire3[1], \wire3[2], wire4, wire5,wire6, wire7, \wire8[0], wire9)
I want to change this string to:
module modulename(wire1, wire2, wire3[0:2],wire4, wire5, wire6, wire7,wire8[0],wire9) ; nonmodule modulename(wire1, wire2, wire3[0], wire3[1], wire3[2], wire4, wire5,wire6, wire7, wire8[0], wire9)
So basically: remove the backslashes, and when the statement starts with the keyword module, also drop the individual copies of each wire and rewrite them as a single [start:stop] range. When the keyword after ";" is not module, just remove the backslashes.
If I can parse it with a regex, I can do the rest. I am trying the code below, but it is not matching anything. The code is adapted from the question "pattern to dictionary of lists Python".
lines=f.read()
d = defaultdict(list)
module_pattern = r'(\w+)\s(\w+)\(([^;]+)'
mod_rex = re.compile(module_pattern)
wire_pattern = r'(\w+)\s[\\]?(\w+)['
wire_rex = re.compile(wire_pattern)
for match in mod_rex.finditer(lines):
    #print '\n'.join(match.groups())
    module, instance, wires = match.groups()
    for match in wire_rex.finditer(wires):
        wire, connection = match.groups()
        #print '\t', wire, connection
        d[wire].append((module, instance, connection))
for k, v in d.items():
    print k, ':', v
Any help is appreciated; I haven't been able to identify the tokens.

This should get you started. I'm not sure of what assumptions you can make about your file format, but it should be straightforward enough to modify this code to suit your needs.
Also, I assumed that the ordering of the ports was strict, so they have been left unmodified. This is also the reason I didn't use dicts.
This code will strip out all backslashes and collapse adjacent bits into vectors. This will also handle vectors that do not start at 0 (for example someport[3:8]). I also chose to make single bit vectors say [0:0] rather than [0].
import re
import sys
mod_re = re.compile(r'module\s*([^(]+)\(([^)]*)\);(.*)')
wire_re = re.compile(r'([^[]+)\[([0-9]+)\]')
def process(line):
    # Get rid of all backslashes. You can make this more selective if you want
    clean = line.replace('\\', '')
    m = mod_re.search(clean)
    if m:
        ports = []
        mod_name, wires, remaining = m.groups()
        for wire in wires.split(','):
            wire = wire.replace(' ', '')
            m = wire_re.search(wire)
            if m:
                # Found a vector
                n = int(m.group(2))
                # If the previous port was the same vector, tack on the next value.
                # The "ports and" guard avoids an IndexError when the first port is a vector.
                if ports and ports[-1][0] == m.group(1):
                    ports[-1][1][1] = n
                else:
                    ports.append((m.group(1), [n, n]))
            else:
                # Found a scalar
                ports.append((wire, None))
        # Stringify ports
        out = []
        for port in ports:
            name, val = port
            if val is None:
                out.append(name)
            else:
                start, end = val
                out.append('%s[%s:%s]' % (name, start, end))
        print('module %s(%s); %s' % (mod_name, ', '.join(out), remaining))
f = open(sys.argv[1], 'r')
for l in f.readlines():
    process(l)
f.close()
Output:
module modulename(wire1, wire2, wire3[0:2], wire4, wire5, wire6, wire7, wire8[0:0], wire9); nonmodule modulename(wire1, wire2, wire3[0], wire3[1], wire3[2], wire4, wire5,wire6, wire7, wire8[0], wire9)
PS: I don't know exactly what you are trying to do, but changing the module definition will also require changing the instantiation.
EDIT: Removed with keyword when opening file for Python2.5 support.

It seems like you're removing \ regardless of where it is, so replace the backslashes after applying this pattern:
(\bmodule\b[^()]+\([^;]*?)(\\wire(\d+)\[(\d+)\][^;]*\wire\3\[(\d+)\])
and replace with \1wire\3[\4:\5]
Per the comment, try this new pattern:
(\bmodule\b[^\\;]+)\\([^[]+)\[(\d+)\][^;]+\2\[(\d+)\]
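For reference, here is a minimal sketch of how the second pattern could be applied from Python with re.sub. The answer does not give a replacement string for that pattern, so \1\2[\3:\4] is an assumption made by analogy with the first one, and collapse_first_vector is a hypothetical helper name. As far as I can tell, the pattern collapses only the first escaped vector in each module statement.

```python
import re

# Second pattern from the answer, unchanged.
VECTOR_RE = re.compile(r'(\bmodule\b[^\\;]+)\\([^[]+)\[(\d+)\][^;]+\2\[(\d+)\]')

def collapse_first_vector(text):
    # Assumed replacement: keep the prefix, then emit name[first:last].
    collapsed = VECTOR_RE.sub(r'\1\2[\3:\4]', text)
    # Strip the remaining backslashes only after the pattern has been applied.
    return collapsed.replace('\\', '')

sample = (r'module modulename(wire1, wire2, \wire3[0], \wire3[1], \wire3[2], wire4); '
          r'nonmodule modulename(wire1, \wire3[0], \wire3[1])')
print(collapse_first_vector(sample))
```

The nonmodule statement is left alone by the pattern because \bmodule\b does not match inside "nonmodule" or "modulename"; it only loses its backslashes in the final replace step.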

Related

How to search string in a line and extract data between two characters in python?

file contents:
module traffic(
green_main, yellow_main, red_main, green_first, yellow_first,
red_first, clk, rst, waiting_main, waiting_first
);
I need to search for the string 'module' and extract the contents between the brackets, i.e. between ( and );.
Here is the code I tried, but I am not able to get the result:
fp = open(file_name)
contents = fp.read()
unique_word_a = '('
unique_word_b = ');'
s = contents
for line in contents:
    if 'module' in line:
        your_string=s[s.find(unique_word_a)+len(unique_word_a):s.find(unique_word_b)].strip()
        print(your_string)
The problem with your code is here:
for line in contents:
    if 'module' in line:
Here, contents is a single string holding the entire content of the file, not a list of strings (lines) or a file handle that can be looped line-by-line. Thus, your line is in fact not a line, but a single character in that string, which obviously can never contain the substring "module".
Since you never actually use the line within the loop, you could just remove both the loop and the condition and your code will work just fine. (And if you changed your code to actually loop lines, and find within those lines, it would not work since the ( and ) are not on the same line.)
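A minimal sketch of that first suggestion, working on the whole file content at once with str.find. The function name extract_port_list and the sample string are illustrative assumptions; in practice the content would come from fp.read().

```python
def extract_port_list(contents):
    """Return the text between the '(' that follows 'module' and the next ');'."""
    keyword_at = contents.find('module')
    if keyword_at == -1:
        return None
    open_at = contents.find('(', keyword_at)
    if open_at == -1:
        return None
    close_at = contents.find(');', open_at)
    if close_at == -1:
        return None
    return contents[open_at + 1:close_at].strip()

sample = """module traffic(
green_main, yellow_main, red_main, green_first, yellow_first,
red_first, clk, rst, waiting_main, waiting_first
);"""
ports = [name.strip() for name in extract_port_list(sample).split(',')]
print(ports)
```

Searching for '(' only after the position of 'module' keeps an earlier parenthesis elsewhere in the file from being picked up by mistake.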
Alternatively, you can use a regular expression:
>>> content = """module traffic(green_main, yellow_main, red_main, green_first, yellow_first,
... red_first, clk, rst, waiting_main, waiting_first);"""
...
>>> re.search(r"module \w+\((.*?)\);", content, re.DOTALL).group(1)
'green_main, yellow_main, red_main, green_first, yellow_first, \n red_first, clk, rst, waiting_main, waiting_first'
Here, module \w+\((.*?)\); means
the word module followed by a space and some word-type \w characters
a literal opening (
a capturing group (...) with anything ., including line breaks (re.DOTALL), non-greedy *?
a literal closing ) and ;
and group(1) gets you what's found in between the (non-escaped) pair of (...).
And if you want those as a list:
>>> list(map(str.strip, _.split(",")))
['green_main', 'yellow_main', 'red_main', 'green_first', 'yellow_first', 'red_first', 'clk', 'rst', 'waiting_main', 'waiting_first']
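The same pattern can also be written with re.VERBOSE, so that each part explained above carries its own comment. This is only a sketch; the name MODULE_PORTS and the sample content are assumptions for illustration.

```python
import re

# Same pattern as above, spelled out piece by piece.
MODULE_PORTS = re.compile(r"""
    module[ ]      # the word module followed by one space
    \w+            # the module name (word characters)
    \(             # a literal opening parenthesis
    (.*?)          # capturing group: anything, non-greedy
    \);            # a literal closing parenthesis and semicolon
""", re.VERBOSE | re.DOTALL)

content = """module traffic(
green_main, yellow_main, red_main, green_first, yellow_first,
red_first, clk, rst, waiting_main, waiting_first
);"""

match = MODULE_PORTS.search(content)
ports = [name.strip() for name in match.group(1).split(",")] if match else []
print(ports)
```

In verbose mode unescaped whitespace in the pattern is ignored, which is why the space after module is written as [ ].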
If you want to extract the content between "(" and ")", you can do the following (but first take care how you handle the content):
for line in content.split('\n'):
    if 'module' in line:
        line_content = line[line.find('(') + 1: line.find(')')]
If your content is not all on one line:
import math
def find_all(your_string, search_string, max_index=math.inf, offset=0):
    index = your_string.find(search_string, offset)
    while index != -1 and index < max_index:
        yield index
        index = your_string.find(search_string, index + 1)
s = content.replace('\n', '')
for offset in find_all(s, 'module'):
    # str.find() takes its start position positionally, not as a keyword argument
    max_index = s.find('module', offset + len('module'))
    if max_index == -1:
        max_index = math.inf
    print([s[start + 1: stop] for start, stop in zip(find_all(s, '(', max_index, offset), find_all(s, ')', max_index, offset))])

Python - Error Caused by Space in argv Arument [duplicate]

I'm a Python learner. If I have lines of text in a file that look like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see any other markers to split on, but my lack of Python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations but keep getting ValueError: empty separator. I can see why it's saying that, I just don't know how to change it. Any help is, as always, very appreciated.
Many thanks
You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
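A note on the \x00 that shows up in these results: it comes from writing the test path as a non-raw string literal, where \000 is read as an octal escape. Text read from a file is not affected, because escape processing only applies to literals in source code. The following sketch, with illustrative variable names, shows the difference and the same split on a raw literal.

```python
# Non-raw literal: "\000" is an octal escape for a NUL character.
non_raw = "Y:\\DATA\00001"
# Raw literal: every backslash is kept exactly as typed.
raw = r"Y:\DATA\00001"

print(len(non_raw), len(raw))

# Splitting on the double quote and dropping whitespace-only pieces.
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
paths = [part for part in line.split('"') if part.strip()]
print(paths)
```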
I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
    for line in fin:
        print shlex.split(line)
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
No regex, no split, just use csv.reader
import csv
sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'
def main():
    for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
        print l
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']
The shlex module can help you.
import shlex
my_string = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will output
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html
Finding all regular expression matches will do it:
input=r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', input)  # or use '"[^"]+"' as the pattern
This will return the list of file names, still wrapped in their quotes:
['"Y:\\DATA\\00001\\SERVER\\DATA.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
To get the file names without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]
The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
I think what you want is to extract the filepaths, which are separated by spaces. That is you want to split the line about items contained within quotations. I.e with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
    for line in f:
        print(re.split(r'(?<=")\s(?=")',line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re
def simpleParse(input_):
    def reduce_(quotes):
        return '' if quotes.group(0) == '"' else '"'
    rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
    return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
# trial is the input string to tokenize; it is not defined in this snippet
for z in range(len(trial)):
    char = trial[z]
    if re.match(r'[^\s]', char):
        if not reading:
            reading = True
            begin = z
            if re.match(r'"', char):
                begin = z
                qc = 1
            else:
                begin = z - 1
                qc = 0
            lc = begin
        else:
            if re.match(r'"', char):
                qc = qc + 1
                lq = z
    elif reading and qc % 2 == 0:
        reading = False
        if lq == z - 1:
            tokens.append(trial[begin + 1: z - 1])
        else:
            tokens.append(trial[begin + 1: z])
if reading:
    tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]
I know this got answered a million years ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
My question "Python - Error Caused by Space in argv Arument" was marked as a duplicate of this one. We have a number of Python books going back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.

How to remove brackets and the contents inside from a file

I have a file named sample.txt which looks like below
ServiceProfile.SharediFCList[1].DefaultHandling=1
ServiceProfile.SharediFCList[1].ServiceInformation=
ServiceProfile.SharediFCList[1].IncludeRegisterRequest=n
ServiceProfile.SharediFCList[1].IncludeRegisterResponse=n
My requirement is to remove the brackets and the integer inside them, and then build OS commands from the result:
ServiceProfile.SharediFCList.DefaultHandling=1
ServiceProfile.SharediFCList.ServiceInformation=
ServiceProfile.SharediFCList.IncludeRegisterRequest=n
ServiceProfile.SharediFCList.IncludeRegisterResponse=n
I am quite a newbie in Python. This is my first attempt. I have used this code to remove the brackets:
#!/usr/bin/python
import re
import os
import sys
f = os.open("sample.txt", os.O_RDWR)
ret = os.read(f, 10000)
os.close(f)
print ret
var1 = re.sub("[\(\[].*?[\)\]]", "", ret)
print var1
f = open("removed.cfg", "w+")
f.write(var1)
f.close()
After this, using the file as input, I want to form application-specific commands that look like this:
cmcli INS "DefaultHandling=1 ServiceInformation="
and the next set as
cmcli INS "IncludeRegisterRequest=n IncludeRegisterRequest=y"
So basically I now want all the output to be grouped into sets of two, so that I can execute the commands on the operating system.
Is there any way that I could group them in sets of two?
Reading 10,000 bytes of text into a string is really not necessary when your file is line-oriented text, and isn't scalable either. And you need a very good reason to be using os.open() instead of open().
So, treat your data as the lines of text that it is, and every two lines, compose a single line of output.
from __future__ import print_function
import re
command = [None,None]
cmd_id = 1
bracket_re = re.compile(r".+\[\d\]\.(.+)")
# This doesn't just remove the brackets: what you actually seem to want is
# to pick out everything after [1]. and ignore the rest.
with open("removed_cfg","w") as outfile:
    with open("sample.txt") as infile:
        for line in infile:
            m = bracket_re.match(line)
            cmd_id = 1 - cmd_id # gives 0, 1, 0, 1
            command[cmd_id] = m.group(1)
            if cmd_id == 1: # we have a pair
                output_line = """cmcli INS "{0} {1}" """.format(*command)
                print (output_line, file=outfile)
This gives the output
cmcli INS "DefaultHandling=1 ServiceInformation="
cmcli INS "IncludeRegisterRequest=n IncludeRegisterResponse=n"
The second line doesn't correspond to your sample output. I don't know how the input IncludeRegisterResponse=n is supposed to become the output IncludeRegisterRequest=y. I assume that's a mistake.
Note that this code depends on your input data being precisely as you describe it and has no error checking whatsoever. So if the format of the input is in reality more variable than that, then you will need to add some validation.
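As a rough idea of what such validation could look like, here is a sketch that rejects lines that do not match and refuses an odd number of settings. The function name build_commands is hypothetical, and using \d+ instead of \d to allow multi-digit indices is an assumption about the real input.

```python
import re

# Captures everything after the bracketed index, e.g. "DefaultHandling=1".
# \d+ (an assumption) also accepts indices such as [12].
BRACKET_RE = re.compile(r".+\[\d+\]\.(.+)")

def build_commands(lines):
    settings = []
    for number, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        m = BRACKET_RE.match(line)
        if m is None:
            raise ValueError("line %d has an unexpected format: %r" % (number, line))
        settings.append(m.group(1))
    if len(settings) % 2:
        raise ValueError("odd number of settings; cannot pair them all")
    # Pair items 0 and 1, 2 and 3, and so on.
    return ['cmcli INS "{0} {1}"'.format(a, b)
            for a, b in zip(settings[0::2], settings[1::2])]

sample = [
    "ServiceProfile.SharediFCList[1].DefaultHandling=1",
    "ServiceProfile.SharediFCList[1].ServiceInformation=",
    "ServiceProfile.SharediFCList[1].IncludeRegisterRequest=n",
    "ServiceProfile.SharediFCList[1].IncludeRegisterResponse=n",
]
for command in build_commands(sample):
    print(command)
```

Raising an error is only one possible policy; skipping bad lines with a warning would work just as well if the input is known to contain other content.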

Checking to see if a specific string is in a file txt

I'm trying to check whether a specific string is in a text file.
I have a file that contains the following:
Active Internet connections
Proto Recv-Q Send-Q Local Address Foreign Address (state) rxbytes txbytes
tcp4 0 0 192.168.1.6.50860 72.21.91.29.http CLOSE_WAIT 892 691
tcp4 0 0 192.168.1.6.50858 www.v.dropbox.co.https ESTABLISHED 27671 7563
tcp4 0 0 192.168.1.6.50857 162.125.17.1.https ESTABLISHED 17581 3642
and here is my code:
char = ""
file = open("location")
for i, line in enumerate(file):
    addi = i + 1
    if line.strip() == char:
        print "MATCH FOUND on line " + str(addi)
print "finished"
For this to work, I have to paste the entire line into my char variable. For example, it works if I paste "Active Internet connections", but if I put "Internet", it goes straight to the print "finished" line. How would I fix this?
You need to test for containment (in) rather than equality (==). You can also use a list comprehension to get all the matches and then print the results:
char = "<search-string>"
with open("location") as file:
    results = [i for i, line in enumerate(file, 1) if char in line]
if results:
    # join() needs strings, so convert the line numbers first
    print "MATCHES FOUND on lines " + ', '.join(str(i) for i in results)
print "finished"
If you need more complicated search rules, then you may want to look at the regex module re
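As a sketch of what a regex-based search might look like, the function below returns the 1-based numbers of all lines matching a pattern. The name find_matching_lines and the example pattern are assumptions for illustration; the sample lines are taken from the question.

```python
import re

def find_matching_lines(lines, pattern):
    """Return the 1-based numbers of the lines in which pattern is found."""
    regex = re.compile(pattern)
    return [number for number, line in enumerate(lines, 1) if regex.search(line)]

sample = [
    "Active Internet connections",
    "tcp4 0 0 192.168.1.6.50860 72.21.91.29.http CLOSE_WAIT 892 691",
    "tcp4 0 0 192.168.1.6.50858 www.v.dropbox.co.https ESTABLISHED 27671 7563",
]
# Match either connection state as a whole word.
print(find_matching_lines(sample, r"\bESTABLISHED\b|\bCLOSE_WAIT\b"))
```

Because a file object yields lines when iterated, passing an open file in place of the sample list should work the same way.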
You might want to use with open(...) as ... for proper file handling.
Using the in keyword will work better than == because you want a match whenever the line contains your string.
Also, str.format is more readable, in my opinion, than "stuff" + str(value).
find = "Active Internet connections"
with open('location') as f:
    for i, line in enumerate(f, 1):
        if find in line:
            print("Match found on line {}".format(i))
print("finished")
In Python, strings are sequences of characters. To check whether one string occurs inside another, you can use the in operator.
if char in line:
    # do something
As simple as char in line.
Example usage is that "hi" in "hit" will be True, and "hi" in "hello" will be False.
You are checking whether the line is equal to char, but you should check whether char is contained in the line, since the entire line isn't equal to char:
for i, line in enumerate(file):
    line_index = i + 1
    if char in line:
        print "MATCH FOUND on line " + str(line_index)
print "finished"
Also, I would recommend not using char as a variable name. Try to use more explicit and less ambiguous names, like pattern_to_find.
Looking for a substring in Python is a very simple task. Python's string methods find() and count() are very useful in this context.
# This is the string you're looking for
ip = "192.168.1.6.50860"
# You need to both open and read the file to get its content
file = open("/home/my/own/directory/here/file.txt").read()
def findLine(text, string):
    if string in text:
        # Count the newlines before the match to get its 1-based line number
        return "MATCH FOUND on line {}".format(text[:text.find(string)].count("\n") + 1)
    else:
        return "MATCH NOT FOUND"
print(findLine(file, ip)) # Prints "MATCH FOUND on line 3" for the file above
Try this:
search = "what you want to find goes here"
filename = "file to read"
with open(filename) as f:
    for i, line in enumerate(f, 1):
        if search in line:
            print "MATCH FOUND in line", i
