How do I process numerous files using numerous regex conditions?

How do I process numerous files using numerous regex conditions? - python

I want to process the cna and linear_cna files by reading only lines that do not contain either Hugo_Symbol or -01.
import os
import re
class DataProcessing:
def __init__(self, data):
self.line = [line.rstrip('\n') for line in data]
self.data = data
def read_data(self):
with open(self.data):
pass
return self.line
def read_cna(self):
# In cna and linear_cna files, skip lines that either begin with "Hugo_Symbol" or "-01"
for lines in self.line:
cna_lines = [lines for l in cna if not re.findall(r"^(Hugo_Symbol|[-01])", l)]
return cna_lines
...continue...
dp_cna = DataProcessing("data_cna.txt")
dp_linear_cna = DataProcessing("data_linear_cna.txt")
dp_cna.read_data()
dp_linear_cna.read_data()
Traceback:
Traceback (most recent call last):
File "C:/Users/User/PycharmProjects/testing/main.py", line 24, in <module>
cna = DataProcessing.read_data("data_cna.txt")
File "C:/Users/User/PycharmProjects/testing/main.py", line 14, in read_data
with open(self.data) as f:
AttributeError: 'str' object has no attribute 'data'

The right way to use your class consists of two steps.
Step 1: Create an instance of DataProcessing by invoking __init__. You do this by declaring dp = DataProcessing("data_cna.txt"). You can replace dp with any name you want.
Now dp is an instance of DataProcessing. Its data field is set to "DataProcessing". In other words, dp remembers the name of the file.
Step 2: Call read_data on dp. Note that read_data has only one parameter, namely self, which should not be passed as an argument, meaning it takes no arguments. Therefore, the right way to call read_data is just read_data(). To call read_data on dp you do dp.read_data().

Related

How to set value of parent argument to child method?

I have a Paragraph class:
from googletrans import Translator
class Paragraph:
def __init__(self, text, origin_lang='en'):
self.text = text
self.origin_lang = origin_lang
def translate(self, dest_lang='ne'):
translator = Translator()
translation = translator.translate(text = self.text,
dest=dest_lang)
return translation.text
I made a subclass out of it:
class FileParagraph(Paragraph):
def __init__(self, filepath):
super().__init__(text=self.get_from_file())
self.filepath = filepath
def get_from_file(self):
with open(self.filepath) as file:
return file.read()
While Paragraph got the text directly as argument, the subclass generates the text from the get_from_file method.
However, I cannot seem to call the inherited translate method:
fp = FileParagraph("sample.txt")
print(fp.translate(dest_lang='de'))
That throws an error:
Traceback (most recent call last):
File "C:/main.py", line 66, in <module>
fp = FileParagraph("sample.txt")
File "C:/main.py", line 20, in __init__
super().__init__(text=self.get_from_file())
File "C:/main.py", line 25, in get_from_file
with open(self.filepath) as file:
AttributeError: 'FileParagraph' object has no attribute 'filepath'
One solution is to change the subclass init to:
def __init__(self, filepath):
self.filepath = filepath
self.text = self.get_from_file()
However, that means removing the initialization of super(). Is there another solution without having to remove super().__init__?
Or is this not even the case to make use of inheritance?

The error comes from calling the get_from_file method, which relies on self.filepath, before self.filepath is set. Simply changing the order of the two lines in __init__ fixes this
class FileParagraph(Paragraph):
def __init__(self, filepath):
# set member variable first
self.filepath = filepath
# then call super's init
super().__init__(text=self.get_from_file())
def get_from_file(self):
with open(self.filepath) as file:
return file.read()

i think that you should also give a value for the filepath while creating the object here
fp = FileParagraph("sample.txt")
you should also input a value for the filepath along with text
eg
fp = FileParagraph(text = "sample.txt", filepath = " ")

How to chain file objects in Python?

I'm trying to find a simple way to chain file-like objects. I have a single CSV file which is split into a number of segments on disk. I'd like to be able to pass them to csv.DictReader without having to make a concatenated temporary first.
Something like:
files = map(io.open, filenames)
for row in csv.DictReader(io.chain(files)):
print(row[column_name])
But I haven't been able to find anything like io.chain. If I were parsing it myself, I could do something like:
from itertools import chain
def lines(fp):
for line in fp.readlines():
yield line
a = open('segment-1.dat')
b = open('segment-2.dat')
for line in chain(lines(a), lines(b)):
row = line.strip().split(',')
However DictReader needs something it can call read() on, so this method doesn't work. I can iterate over the files, copying the fieldnames property from the previous reader, but I was hoping for something which let me put all the processing within a single loop body.

An iterable might help
from io import BytesIO
a = BytesIO(b"1st file 1st line \n1st file 2nd line")
b = BytesIO(b"2nd file 1st line \n2nd file 2nd line")
class Reader:
def __init__(self, *files):
self.files = files
self.current_idx = 0
def __iter__(self):
return self
def __next__(self):
f = self.files[self.current_idx]
for line in f:
return line
else:
if self.current_idx < len(self.files) - 1:
self.current_idx += 1
return next (self)
raise StopIteration("feed me more files")
r = Reader(a, b)
for l in r:
print(l)
Result:
b'1st file 1st line \n'
b'1st file 2nd line'
b'2nd file 1st line \n'
b'2nd file 2nd line'
Edit:
:D then there are standard library goodies.
https://docs.python.org/3.7/library/fileinput.html
with fileinput.input(files=('spam.txt', 'eggs.txt')) as f:
for line in f:
process(line)

You could create a class that's an iterator that returns a string each time its __next__() method is called (quoting the docs).
import csv
class ChainedCSVfiles:
def __init__(self, filenames):
self.filenames = filenames
def __iter__(self):
return next(self)
def __next__(self):
for filename in self.filenames:
with open(filename, 'r', newline='') as csvfile:
for line in csvfile:
yield line
filenames = 'segment-1.dat', 'segment-2.dat'
reader = csv.DictReader(ChainedCSVfiles(filenames),
fieldnames=('field1', 'field2', 'field3'))
for row in reader:
print(row)

Object orientation: error with method in Python

I've made a Python programm with a interface that receives the name of the file and a numerical data. When I create methods to manipulate filename, directory, among others, it returns an error.
I believe the error comes with object orientation. How can I solve this?
I've divided the program in two parts: one to solve my problem (no object orientation) and another to receive user data.
Error:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python27\lib\lib-tk\Tkinter.py", line 1541, in __call__
return self.func(*args)
File "teste.py", line 60, in verificaSenha
if (Procura_nome(nome_arq) == 1):
NameError: global name 'Procura_nome' is not defined
The complete code: https://pastebin.com/Br6JAcuR
Problematic method:
def Procura_nome(nome_arq):
dir = Percorre_dir_entrada()
arquivo = dir + dir[2] + nome_arq + + ".shp"
os.path.isfile(nome_arq)
try:
with open(arquivo, 'r') as f:
return 1
except IOError:
return 0

All python method class must have self param as first argument, this argument refers to the instance of your class, also when using class methods and attributs inside your class you should refer to them with self.
You probably need to add self to all your method class in your file.
You also need to remove one '+' on the 3rd line.
def Procura_nome(self, nome_arq):
dir = self.Percorre_dir_entrada()
arquivo = dir + dir[2] + nome_arq + ".shp"
os.path.isfile(nome_arq)
try:
with open(arquivo, 'r') as f:
return 1
except IOError:
return 0
Your Percorre_dir_entrada and Percorre_dir_saida function are doing exactly the same thing on different files, you should think about doing a generic version who take the file name as param like so :
def Percorre_dir(self, file_name):
achou = 0
dirlist = os.listdir(".")
for i in dirlist:
filename = os.path.abspath(i)
if((filename.find(file_name)) != -1):
achou = 1
return filename
if(achou == 0):
return 0
then call it :
Percorre_dir("Saida")
Percorre_dir("Entrada")

How to pickle a dictionary with classes in the hash?

So I have an program with the list of Class's (TestEVAL) and a I use a QMdiArea to present the multiple Class data in a subwindow. I am trying to enhance my pickle to include the QSubwindow data which is stored in it's subwindow widget with the following code:
with open(fname, 'wb') as fout:
for dataset in self.datasets:
# Temp save chdlg - do not want saved
ochdlg = dataset.chdlg
dataset.chdlg = None
pickle.dump(dataset, fout)
# Restore chdlg
dataset.chdlg = ochdlg
pickle.dump('plots', fout)
for sub in self.mdi.subWindowList():
mw = sub.widget()
old_plot = dict()
old_plot['title'] = sub.windowTitle()
old_plot['setup'] = mw.setup
pickle.dump(old_plot, fout)
pickle.dump(mw.channels, fout)
fout.close()
At which point I get a:
Traceback (most recent call last):
File "TestEVAL.pyw", >line 1448, in saveas_session
pickle.dump(mw.channels, fout)
File "AppData\Local\Continuum\Anaconda\lib\copy_reg.py", >line 71, in _reduce_ex
state = base(self)
TypeError: the sip.wrapper type cannot be instantiated or sub-classed
Since mw.channels is a dict with hash values from self.datasets (not a simple object) I think I need to do something more? Maybe I need to implement a .setstate() and .getstate() (with __ before and after!) But I do not understand what it returns in getstate. The dataset class is large so I do not want to duplicate it's pickle information. I could add a dataset.id and hash the mw.channels[dataset.id] but this is not clear to me. furthermore, mw.channels[dataset] = values and values is a list of another Class (headers) with Meta states associated (not too big data wise). What do I do?

Annotating Python print() output with comments

Given a Python script with print() statements, I'd like to be able to run through the script and insert a comment after each statement that shows the output from each. To demonstrate, take this script named example.py:
a, b = 1, 2
print('a + b:', a + b)
c, d = 3, 4
print('c + d:', c + d)
The desired output would be:
a, b = 1, 2
print('a + b:', a + b)
# a + b: 3
c, d = 3, 4
print('c + d:', c + d)
# c + d: 7
Here's my attempt, which works for simple examples like the one above:
import sys
from io import StringIO
def intercept_stdout(func):
"redirect stdout from a target function"
def wrapper(*args, **kwargs):
"wrapper function for intercepting stdout"
# save original stdout
original_stdout = sys.stdout
# set up StringIO object to temporarily capture stdout
capture_stdout = StringIO()
sys.stdout = capture_stdout
# execute wrapped function
func(*args, **kwargs)
# assign captured stdout to value
func_output = capture_stdout.getvalue()
# reset stdout
sys.stdout = original_stdout
# return captured value
return func_output
return wrapper
#intercept_stdout
def exec_target(name):
"execute a target script"
with open(name, 'r') as f:
exec(f.read())
def read_target(name):
"read source code from a target script & return it as a list of lines"
with open(name) as f:
source = f.readlines()
# to properly format last comment, ensure source ends in a newline
if len(source[-1]) >= 1 and source[-1][-1] != '\n':
source[-1] += '\n'
return source
def annotate_source(target):
"given a target script, return the source with comments under each print()"
target_source = read_target(target)
# find each line that starts with 'print(' & get indices in reverse order
print_line_indices = [i for i, j in enumerate(target_source)
if len(j) > 6 and j[:6] == 'print(']
print_line_indices.reverse()
# execute the target script and get each line output in reverse order
target_output = exec_target(target)
printed_lines = target_output.split('\n')
printed_lines.reverse()
# iterate over the source and insert commented target output line-by-line
annotated_source = []
for i, line in enumerate(target_source):
annotated_source.append(line)
if print_line_indices and i == print_line_indices[-1]:
annotated_source.append('# ' + printed_lines.pop() + '\n')
print_line_indices.pop()
# return new annotated source as a string
return ''.join(annotated_source)
if __name__ == '__main__':
target_script = 'example.py'
with open('annotated_example.py', 'w') as f:
f.write(annotate_source(target_script))
However, it fails for scripts with print() statements that span multiple lines, as well as for print() statements that aren't at the start of a line. In a best-case scenario, it would even work for print() statements inside a function. Take the following example:
print('''print to multiple lines, first line
second line
third line''')
print('print from partial line, first part') if True else 0
1 if False else print('print from partial line, second part')
print('print from compound statement, first part'); pass
pass; print('print from compound statement, second part')
def foo():
print('bar')
foo()
Ideally, the output would look like this:
print('''print to multiple lines, first line
second line
third line''')
# print to multiple lines, first line
# second line
# third line
print('print from partial line, first part') if True else 0
# print from partial line, first part
1 if False else print('print from partial line, second part')
# print from partial line, second part
print('print from compound statement, first part'); pass
# print from compound statement, first part
pass; print('print from compound statement, second part')
# print from compound statement, second part
def foo():
print('bar')
foo()
# bar
But the script above mangles it like so:
print('''print to multiple lines, first line
# print to multiple lines, first line
second line
third line''')
print('print from partial line, first part') if True else 0
# second line
1 if False else print('print from partial line, second part')
print('print from compound statement, first part'); pass
# third line
pass; print('print from compound statement, second part')
def foo():
print('bar')
foo()
What approach would make this process more robust?

Have you considered using the inspect module? If you are willing to say that you always want the annotations next to the top most call, and the file you are annotating is simple enough, you can get reasonable results. The following is my attempt, which overrides the built in print function and looks at a stack trace to determine where print was called:
import inspect
import sys
from io import StringIO
file_changes = {}
def anno_print(old_print, *args, **kwargs):
(frame, filename, line_number,
function_name, lines, index) = inspect.getouterframes(inspect.currentframe())[-2]
if filename not in file_changes:
file_changes[filename] = {}
if line_number not in file_changes[filename]:
file_changes[filename][line_number] = []
orig_stdout = sys.stdout
capture_stdout = StringIO()
sys.stdout = capture_stdout
old_print(*args, **kwargs)
output = capture_stdout.getvalue()
file_changes[filename][line_number].append(output)
sys.stdout = orig_stdout
return
def make_annotated_file(old_source, new_source):
changes = file_changes[old_source]
old_source_F = open(old_source)
new_source_F = open(new_source, 'w')
content = old_source_F.readlines()
for i in range(len(content)):
line_num = i + 1
new_source_F.write(content[i])
if content[i][-1] != '\n':
new_source_F.write('\n')
if line_num in changes:
for output in changes[line_num]:
output = output[:-1].replace('\n', '\n#') + '\n'
new_source_F.write("#" + output)
new_source_F.close()
if __name__=='__main__':
target_source = "foo.py"
old_print = __builtins__.print
__builtins__.print = lambda *args, **kwargs: anno_print(old_print, *args, **kwargs)
with open(target_source) as f:
code = compile(f.read(), target_source, 'exec')
exec(code)
__builtins__.print = old_print
make_annotated_file(target_source, "foo_annotated.py")
If I run it on the following file "foo.py":
def foo():
print("a")
print("b")
def cool():
foo()
print("c")
def doesnt_print():
a = 2 + 3
print(1+2)
foo()
doesnt_print()
cool()
The output is "foo_annotated.py":
def foo():
print("a")
print("b")
def cool():
foo()
print("c")
def doesnt_print():
a = 2 + 3
print(1+2)
#3
foo()
#a
#b
doesnt_print()
cool()
#a
#b
#c

You can make it a lot easier by using an existing python parser to extract top level statements from your code. The ast module in the standard library for example. However, ast loses some information like comments.
Libraries built with source code transformations (which you are doing) in mind might be more suited here. redbaron is a nice example.
To carry globals to the next exec(), you have to use the second parameter (documentation):
environment = {}
for statement in statements:
exec(statement, environment)

Thanks to feedback from #Lennart, I've almost got it working... It iterates through line-by-line, clumping lines into longer and longer blocks as long as the current block contains a SyntaxError when fed to exec(). Here it is in case it's of use to anyone else:
import sys
from io import StringIO
def intercept_stdout(func):
"redirect stdout from a target function"
def wrapper(*args, **kwargs):
"wrapper function for intercepting stdout"
# save original stdout
original_stdout = sys.stdout
# set up StringIO object to temporarily capture stdout
capture_stdout = StringIO()
sys.stdout = capture_stdout
# execute wrapped function
func(*args, **kwargs)
# assign captured stdout to value
func_output = capture_stdout.getvalue()
# reset stdout
sys.stdout = original_stdout
# return captured value
return func_output
return wrapper
#intercept_stdout
def exec_line(source, block_globals):
"execute a target block of source code and get output"
exec(source, block_globals)
def read_target(name):
"read source code from a target script & return it as a list of lines"
with open(name) as f:
source = f.readlines()
# to properly format last comment, ensure source ends in a newline
if len(source[-1]) >= 1 and source[-1][-1] != '\n':
source[-1] += '\n'
return source
def get_blocks(target, block_globals):
"get outputs for each block of code in source"
outputs = []
lines = 1
#intercept_stdout
def eval_blocks(start_index, end_index, full_source, block_globals):
"work through a group of lines of source code and exec each block"
nonlocal lines
try:
exec(''.join(full_source[start_index:end_index]), block_globals)
except SyntaxError:
lines += 1
eval_blocks(start_index, start_index + lines,
full_source, block_globals)
for i, s in enumerate(target):
if lines > 1:
lines -= 1
continue
outputs.append((eval_blocks(i, i+1, target, block_globals), i, lines))
return [(i[1], i[1] + i[2]) for i in outputs]
def annotate_source(target, block_globals={}):
"given a target script, return the source with comments under each print()"
target_source = read_target(target)
# get each block's start and end indices
outputs = get_blocks(target_source, block_globals)
code_blocks = [''.join(target_source[i[0]:i[1]]) for i in outputs]
# iterate through each
annotated_source = []
for c in code_blocks:
annotated_source.append(c)
printed_lines = exec_line(c, block_globals).split('\n')
if printed_lines and printed_lines[-1] == '':
printed_lines.pop()
for line in printed_lines:
annotated_source.append('# ' + line + '\n')
# return new annotated source as a string
return ''.join(annotated_source)
def main():
### script to format goes here
target_script = 'example.py'
### name of formatted script goes here
new_script = 'annotated_example.py'
new_code = annotate_source(target_script)
with open(new_script, 'w') as f:
f.write(new_code)
if __name__ == '__main__':
main()
It works for each of the two examples above. However, when trying to execute the following:
def foo():
print('bar')
print('baz')
foo()
Instead of giving me the desired output:
def foo():
print('bar')
print('baz')
foo()
# bar
# baz
It fails with a very long traceback:
Traceback (most recent call last):
File "ex.py", line 55, in eval_blocks
exec(''.join(full_source[start_index:end_index]), block_globals)
File "<string>", line 1
print('baz')
^
IndentationError: unexpected indent
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ex.py", line 55, in eval_blocks
exec(''.join(full_source[start_index:end_index]), block_globals)
File "<string>", line 1
print('baz')
^
IndentationError: unexpected indent
During handling of the above exception, another exception occurred:
...
Traceback (most recent call last):
File "ex.py", line 55, in eval_blocks
exec(''.join(full_source[start_index:end_index]), block_globals)
File "<string>", line 1
print('baz')
^
IndentationError: unexpected indent
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ex.py", line 102, in <module>
main()
File "ex.py", line 97, in main
new_code = annotate_source(target_script)
File "ex.py", line 74, in annotate_source
outputs = get_blocks(target_source, block_globals)
File "ex.py", line 65, in get_blocks
outputs.append((eval_blocks(i, i+1, target, block_globals), i, lines))
File "ex.py", line 16, in wrapper
func(*args, **kwargs)
File "ex.py", line 59, in eval_blocks
full_source, block_globals)
File "ex.py", line 16, in wrapper
func(*args, **kwargs)
...
File "ex.py", line 16, in wrapper
func(*args, **kwargs)
File "ex.py", line 55, in eval_blocks
exec(''.join(full_source[start_index:end_index]), block_globals)
RecursionError: maximum recursion depth exceeded while calling a Python object
Looks like this happens due to def foo(): print('bar') being valid code and so print('baz') isn't being included in the function, causing it to fail with an IndentationError. Any ideas as to how to avoid this issue? I suspect it may require diving into ast as suggested above but would love further input or a usage example.

It looks like except SyntaxError isn't a sufficient check for a full function, as it will finish the block a the first line which doesn't create a syntax error. What you want is to make sure the whole function is encompassed in the same block. To accomplish this:
check if the current block is a function. Check if the first line starts with def.
check if the next line in full_source begins with a higher or equal number of spaces as the second line of the function (the one which defines the indent). This will mean that the eval_blocks will check if the next line of the code has higher or equal spacing, and is therefore inside the function.
The code for get_blocks might look something like this:
# function for finding num of spaces at beginning (could be in global spectrum)
def get_front_whitespace(string):
spaces = 0
for char in string:
# end loop at end of spaces
if char not in ('\t', ' '):
break
# a tab is equal to 8 spaces
elif char == '\t':
spaces += 8
# otherwise must be a space
else:
spaces += 1
return spaces
...
def get_blocks(target, block_globals):
"get outputs for each block of code in source"
outputs = []
lines = 1
# variable to check if current block is a function
block_is_func = False
#intercept_stdout
def eval_blocks(start_index, end_index, full_source, block_globals):
"work through a group of lines of source code and exec each block"
nonlocal lines
nonlocal block_is_func
# check if block is a function
block_is_func = ( full_source[start_index][:3] == 'def' )
try:
exec(''.join(full_source[start_index:end_index]), block_globals)
except SyntaxError:
lines += 1
eval_blocks(start_index, start_index + lines,
full_source, block_globals)
else:
# if the block is a function, check for indents
if block_is_func:
# get number of spaces in first indent of function
func_indent= get_front_whitespace( full_source[start_index + 1] )
# get number of spaces in the next index
next_index_spaces = get_front_whitespace( full_source[end_index + 1] )
# if the next line is equally or more indented than the function indent, continue to next recursion layer
if func_indent >= next_index_spaces:
lines += 1
eval_blocks(start_index, start_index + lines,
full_source, block_globals)
for i, s in enumerate(target):
# reset the function variable for next block
if block_is_func: block_is_func = False
if lines > 1:
lines -= 1
continue
outputs.append((eval_blocks(i, i+1, target, block_globals), i, lines))
return [(i[1], i[1] + i[2]) for i in outputs]
This might create an index error if the last line of the function was the end of the file though, due to the forward indexing at end_index_spaces = get_front_whitespace( full_source[end_index + 1] )
This could also be used for selection statements and loops, which may have the same problem: just check for if for and while at the beginning of the start_index line as well as for def. This would cause the comment to be after the indented region, but as printed output inside indented regions are dependent on the variables which are used to call them, I think having the output outside the indent would be necessary in any case.

Try https://github.com/eevleevs/hashequal/
I made this as an attempt to replace Mathcad. Does not act on print statements, but on #= comments, e.g.:
a = 1 + 1 #=
becomes
a = 1 + 1 #= 2

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I process numerous files using numerous regex conditions? - python

Related

How to set value of parent argument to child method?

How to chain file objects in Python?

Object orientation: error with method in Python

How to pickle a dictionary with classes in the hash?

Annotating Python print() output with comments

Categories

Resources