How to Parse a MadAppLauncher Configuration (.mal) file?

How to Parse a MadAppLauncher Configuration (.mal) file? - python

Trying to parse through this .mal file to get a database of starting and resulting techniques. I have tried packages in python to strip and split the file based on certain characters and have the columns 'Filename', 'Category', 'Asset', 'Starting Technique', and 'Resulting Technique' but it does not pick up on all the techniques (i.e., where there are ->, leading with & and |). For instance:
if 'category' in line.lower():
cat_def = line.split()[1].strip().replace('{','')
#print(cat_def)
if 'asset' in line.lower() and 'extends' not in line.lower():
asset_def = line.split()[1].strip().replace('{','')
#print('\t',asset_def)
if '&' in line.lower() or '|' in line.lower():
multi_tech = False
tech_def = line.split()[1].strip().replace('{','')
for idx in [1,2,3]:
try:
if '->' in file_lines[line_num + idx] and '// ->' not in file_lines[line_num + idx] and '//->' not in file_lines[line_num + idx]:
next_tech = file_lines[line_num + idx].split()[1].strip().split('//')[0].strip()
if ',' in next_tech:
multi_tech = True
next_tech = next_tech.replace(',','')
print('\t'.join([file, cat_def, asset_def, tech_def, next_tech]))
Is there another way to parse through this file or a package to handle a malware file like the example given?

Related

How to search string in a line and extract data between two characters in python?

file contents:
module traffic(
green_main, yellow_main, red_main, green_first, yellow_first,
red_first, clk, rst, waiting_main, waiting_first
);
I need to search the string 'module' and I need to extract the contents between (.......); the brackets.
Here is the code I tried out, I am not able to get the result
fp = open(file_name)
contents = fp.read()
unique_word_a = '('
unique_word_b = ');'
s = contents
for line in contents:
if 'module' in line:
your_string=s[s.find(unique_word_a)+len(unique_word_a):s.find(unique_word_b)].strip()
print(your_string)

The problem with your code is here:
for line in contents:
if 'module' in line:
Here, contents is a single string holding the entire content of the file, not a list of strings (lines) or a file handle that can be looped line-by-line. Thus, your line is in fact not a line, but a single character in that string, which obviously can never contain the substring "module".
Since you never actually use the line within the loop, you could just remove both the loop and the condition and your code will work just fine. (And if you changed your code to actually loop lines, and find within those lines, it would not work since the ( and ) are not on the same line.)
Alternatively, you can use a regular expression:
>>> content = """module traffic(green_main, yellow_main, red_main, green_first, yellow_first,
... red_first, clk, rst, waiting_main, waiting_first);"""
...
>>> re.search("module \w+\((.*?)\);", content, re.DOTALL).group(1)
'green_main, yellow_main, red_main, green_first, yellow_first, \n red_first, clk, rst, waiting_main, waiting_first'
Here, module \w+\((.*?)\); means
the word module followed by a space and some word-type \w characters
an literal opening (
a capturing group (...) with anything ., including linebreaks (re.DOTALL), non-greedy *?
an literal closing ) and ;
and group(1) gets you what's found in between the (non-escaped) pair of (...)
And if you want those as a list:
>>> list(map(str.strip, _.split(",")))
['green_main', 'yellow_main', 'red_main', 'green_first', 'yellow_first', 'red_first', 'clk', 'rst', 'waiting_main', 'waiting_first']

if you want to extract content between "(" ")" you can do:(but first take care how you handle the content):
for line in content.split('\n'):
if 'module' in line:
line_content = line[line.find('(') + 1: line.find(')')]
if your content is not only in one line :
import math
def find_all(your_string, search_string, max_index=math.inf, offset=0,):
index = your_string.find(search_string, offset)
while index != -1 and index < max_index:
yield index
index = your_string.find(search_string, index + 1)
s = content.replace('\n', '')
for offset in find_all(s, 'module'):
max_index = s.find('module', offset=offset + len('module'))
if max_index == -1:
max_index = math.inf
print([s[start + 1: stop] for start, stop in zip(find_all(s, '(',max_index, offset), find_all(s, ')', max_index, offset))])

Python - Error Caused by Space in argv Arument [duplicate]

I'm a python learner. If I have a lines of text in a file that looks like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see how I can use any other way to do those markers to split on, but my lack of python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations but keep getting valueerror: empty seperator. I can see why it's saying that, I just don't know how to change it. Any help is, as always very appreciated.
Many thanks

You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
for line in fin:
print shlex.split(line)
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

No regex, no split, just use csv.reader
import csv
sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'
def main():
for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
print l
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']

shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will spit
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html

Finding all regular expression matches will do it:
input=r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', # or '"[^"]+"', input)
This will return the list of file names:
["Y:\DATA\00001\SERVER\DATA.TXT", "V:\DATA2\00002\SERVER2\DATA2.TXT"]
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]

The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

I think what you want is to extract the filepaths, which are separated by spaces. That is you want to split the line about items contained within quotations. I.e with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
for line in f:
print(re.split(r'(?<=")\s(?=")',line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']

This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re
def simpleParse(input_):
def reduce_(quotes):
return '' if quotes.group(0) == '"' else '"'
rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
for z in range(len(trial)):
char = trial[z]
if re.match(r'[^\s]', char):
if not reading:
reading = True
begin = z
if re.match(r'"', char):
begin = z
qc = 1
else:
begin = z - 1
qc = 0
lc = begin
else:
if re.match(r'"', char):
qc = qc + 1
lq = z
elif reading and qc % 2 == 0:
reading = False
if lq == z - 1:
tokens.append(trial[begin + 1: z - 1])
else:
tokens.append(trial[begin + 1: z])
if reading:
tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]

I know this got answered a million year ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

My question Python - Error Caused by Space in argv Arument was marked as a duplicate of this one. We have a number of Python books doing back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:-
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.

Python - splitting lines in txt file by semicolon in order to extract a text title...except sometimes the title has semicolons in it

So, I have an extremely inefficient way to do this that works, which I'll show, as it will help illustrate the problem more clearly. I'm an absolute beginner in python and this is definitely not "the python way" nor "remotely sane."
I have a .txt file where each line contains information about a large number of .csv files, following format:
File; Title; Units; Frequency; Seasonal Adjustment; Last Updated
(first entry:)
0\00XALCATM086NEST.csv;Harmonized Index of Consumer Prices: Overall Index Excluding Alcohol and Tobacco for AustriaÂ©; Index 2005=100; M; NSA; 2015-08-24
and so on, repeats like this for a while. For anyone interested, this is the St.Louis Fed (FRED) data.
I want to rename each file (currently named the alphanumeric code # the start, 00XA etc), to the text name. So, just split by semicolon, right? Except, sometimes, the text title has semicolons within it (and I want all of the text).
So I did:
data_file_data_directory = 'C:\*****\Downloads\FRED2_csv_3\FRED2_csv_2'
rename_data_file_name = 'README_SERIES_ID_SORT.txt'
rename_data_file = open(data_file_data_directory + '\\' + rename_data_file_name)
for line in rename_data_file.readlines():
data = line.split(';')
if len(data) > 2 and data[0].rstrip().lstrip() != 'File':
original_file_name = data[0]
These last 2 lines deal with the fact that there is some introductory text that we want to skip, and we don't want to rename based on the legend # the top (!= 'File'). It saves the 00XAL__.csv as the oldname. It may be possible to make this more elegant (I would appreciate the tips), but it's the next part (the new, text name) that gets really ugly.
if len(data) ==6:
new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1][:-2].replace(':',' -').replace('"','').replace('/',' or ')
else:
if len(data) ==7:
new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2][:-2].replace(':',' -').replace('"','').replace('/',' or ')
else:
if len(data) ==8:
new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3][:-2].replace(':',' -').replace('"','').replace('/',' or ')
else:
if len(data) ==9:
new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[4][:-2].replace(':',' -').replace('"','').replace('/',' or ')
else:
if len(data) ==10:
new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[4].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[5][:-2].replace(':',' -').replace('"','').replace('/',' or ')
else:
(etc)
What I'm doing here is that there is no way to know for each line in the .csv how many items are in the list created by splitting it by semicolons. Ideally, the list would be length 6 - as follows the key # the top of my example of the data. However, for every semicolon in the text name, the length increases by 1...and we want everything before the last four items in the list (counting backwards from the right: date, seasonal adjustment, frequency, units/index) but after the .csv code (this is just another way of saying, I want the text "title" - everything for each line after .csv but before units/index).
Really what I want is just a way to save the entirety of the text name as "new_name" for each line, even after I split each line by semicolon, when I have no idea how many semicolons are in each text name or the line as a whole. The above code achieves this, but OMG, this can't be the right way to do this.
Please let me know if it's unclear or if I can provide more info.

Read only one line every time, with or without an "\n" at the end

I have a file which is filled like this:
Samsung CLP 680/ CLX6260 + CLT-C506S/ELS + CLT-M506S/ELS + CLT-Y506S/ELS + 39.50
Xerox Phaser 6000/6010/6015 + 106R01627 + 106R01628 + 106R01629 + 8.43
Xerox DocuPrint 6110/6110mfp + 106R01206 + 106R01204 + 106R01205 + 7.60
Xerox Phaser 6121/6121D + 106R01466 + 106R01467 + 106R01468 + 18.20
When I read it with:
for line in excelRead:
title=line.split("+")
title=[lines.strip()for lines in title]
sometimes there is an "\n" at the end of the line, and sometimes there is not, if line ends with \n splitting gives me 5 elements, if not 9 and etc., until it founds and "\n" as I guess
So, the question is: How do I read only one line in file each time, and obtain 5 elements every time, with or without an "\n" at the end? I can't check all all file whether there is, or not an "\n" at the end
Thanks

You might consider using the csv module to parse this, and placing into a dict by model:
import csv
data={}
with open('/tmp/excel.csv') as f:
for line in csv.reader(f, delimiter='+', skipinitialspace=True):
data[line[0].strip()]=[e.strip() for e in line[1:]]
print data
# {'Samsung CLP 680/ CLX6260': ['CLT-C506S/ELS', 'CLT-M506S/ELS', 'CLT-Y506S/ELS', '39.50'],
'Xerox Phaser 6121/6121D': ['106R01466', '106R01467', '106R01468', '18.20'],
'Xerox DocuPrint 6110/6110mfp': ['106R01206', '106R01204', '106R01205', '7.60'],
'Xerox Phaser 6000/6010/6015': ['106R01627', '106R01628', '106R01629', '8.43']}

for line in excelRead:
title = [x.strip() for x in line.rstrip('\n').split('+')]
It's better to avoid making one variable (title) mean two different things. Rather than give it a different name in your second line, I just removed the line entirely and put the split inside the list comprehension.
Instead of feeding line into split, first I rstrip the \n (removes that character from the end)

When \n is missing, this will split title[4] to give two titles:
import re
data = []
with open('aa.txt') as excelRead:
for line in excelRead:
title=line.split("+")
title=[lines.strip()for lines in title]
while len(title) > 5:
one = re.sub('(\d+\.\d+)', '', title[4])
five = title[4].replace(one, '')
title1 = title[:4] + [five]
title = [one] + title[5:]
data.append(title1)
data.append(title)
for item in data:
print(item)
You could easily make data a dictionary instead of a list.

Python split string on quotes

I'm a python learner. If I have a lines of text in a file that looks like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see how I can use any other way to do those markers to split on, but my lack of python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations but keep getting valueerror: empty seperator. I can see why it's saying that, I just don't know how to change it. Any help is, as always very appreciated.
Many thanks

You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
for line in fin:
print shlex.split(line)
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

No regex, no split, just use csv.reader
import csv
sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'
def main():
for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
print l
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']

shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will spit
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html

Finding all regular expression matches will do it:
input=r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', # or '"[^"]+"', input)
This will return the list of file names:
["Y:\DATA\00001\SERVER\DATA.TXT", "V:\DATA2\00002\SERVER2\DATA2.TXT"]
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]

The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

I think what you want is to extract the filepaths, which are separated by spaces. That is you want to split the line about items contained within quotations. I.e with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
for line in f:
print(re.split(r'(?<=")\s(?=")',line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']

This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re
def simpleParse(input_):
def reduce_(quotes):
return '' if quotes.group(0) == '"' else '"'
rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
for z in range(len(trial)):
char = trial[z]
if re.match(r'[^\s]', char):
if not reading:
reading = True
begin = z
if re.match(r'"', char):
begin = z
qc = 1
else:
begin = z - 1
qc = 0
lc = begin
else:
if re.match(r'"', char):
qc = qc + 1
lq = z
elif reading and qc % 2 == 0:
reading = False
if lq == z - 1:
tokens.append(trial[begin + 1: z - 1])
else:
tokens.append(trial[begin + 1: z])
if reading:
tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]

I know this got answered a million year ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

My question Python - Error Caused by Space in argv Arument was marked as a duplicate of this one. We have a number of Python books doing back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:-
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to Parse a MadAppLauncher Configuration (.mal) file? - python

Related

How to search string in a line and extract data between two characters in python?

Python - Error Caused by Space in argv Arument [duplicate]

Python - splitting lines in txt file by semicolon in order to extract a text title...except sometimes the title has semicolons in it

Read only one line every time, with or without an "\n" at the end

Python split string on quotes

Categories

Resources