Python regex: reading dictionary variable names from another file

I really don't know how to word this.
I am creating a program that reads through another .py file called code.py; it will find all valid dictionary variable names and print them. Easy enough? But the code I'm trying to run it through is extremely tricky, with examples purposely put in to trick the regex. The test code for code.py is here, and my current code is:
import re

with open("code.py", "r") as myfile:
    data = myfile.read()

potato = re.findall(r' *(\w+)\W*{', data, re.M)
for i in range(len(potato)):
    print(potato[i])
That regex doesn't work 100%; when used on the test code it will print variables that aren't meant to be printed, such as:
# z={}
z="z={}"
print('your mother = {}')
The expected output for the test file is
a0, a, b, c, d, e, etc. all the way down to z, then aa, ab, ac, ad, etc. all the way down to aq,
and anything labeled z in the test code shouldn't print.
I realise that regex isn't amazing for doing this, but I have to use regex, and it can be done.
EDIT: Using the new regex re.findall(r'^ *(\w+)\W*{', data, re.M), the output fails on examples where variables are assigned on one line, such as:
d={
};e={
};

l should print but z shouldn't
potato = re.findall(r'^ *(\w+)\W*{', data, re.M)
This should fix it.
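A quick sanity check of that anchored pattern against the tricky cases (a minimal sketch; the sample string is made up from the examples above):
import re

sample = '''a0={}
# z={}
z="z={}"
print('your mother = {}')'''

# the ^ anchor plus optional leading spaces rejects commented-out and quoted cases
print(re.findall(r'^ *(\w+)\W*{', sample, re.M))  # ['a0']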
EDIT:
".*?(?<!\\)"|'.*?(?<!\\)'|\([^)(]*\)|#[^\n]*\n|[^\'\"\#(\w\n]*(\w+)[^\w]*?{
See the demo: https://regex101.com/r/gP5iH5/6
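The idea is that the alternatives before the last one consume quoted strings, parenthesized text, and comments, so they can never reach the capture group; with re.findall those alternatives come back as empty strings, which you have to filter out. A minimal sketch of using it:
import re

pattern = r'".*?(?<!\\)"|\'.*?(?<!\\)\'|\([^)(]*\)|#[^\n]*\n|[^\'\"\#(\w\n]*(\w+)[^\w]*?{'
with open("code.py") as myfile:
    data = myfile.read()

# matches that hit the string/comment/parenthesis alternatives capture nothing
names = [name for name in re.findall(pattern, data) if name]
print(names)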

Trying to parse a Python file using a regular expression can usually be fooled. I would suggest the following kind of approach: the dis library can be used to disassemble the byte code from the compiled source code, and from the disassembly all of the dictionaries can be picked out.
So assuming a Python source file called code.py:
import code
source_module = code
source_py = "code.py"

import sys, dis, re
from contextlib import contextmanager
from StringIO import StringIO

@contextmanager
def captureStdOut(output):
    stdout = sys.stdout
    sys.stdout = output
    yield
    sys.stdout = stdout

with open(source_py) as f_source:
    source_code = f_source.read()

byte_code = compile(source_code, source_py, "exec")
output = StringIO()

with captureStdOut(output):
    dis.dis(byte_code)       # globals, from the compiled source
    dis.dis(source_module)   # functions, from the imported module

disassembly = output.getvalue()
dictionaries = re.findall(r"(?:BUILD_MAP|STORE_MAP).*?(?:STORE_FAST|STORE_NAME).*?\((.*?)\)", disassembly, re.M + re.S)
print dictionaries
As dis prints to stdout, you need to redirect the output. A regular expression can then be used to spot all of the entries. I do this twice: once by compiling the source to get the globals, and once by importing the module to get the functions. There is probably a better way to do this, but it seems to work.
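If anyone needs this on Python 3: StringIO moved to io, and contextlib ships a ready-made redirector, so a rough equivalent (opcode names vary between interpreter versions) would be:
import dis, io, re
from contextlib import redirect_stdout

with open("code.py") as f_source:
    byte_code = compile(f_source.read(), "code.py", "exec")

output = io.StringIO()
with redirect_stdout(output):   # stands in for the hand-rolled context manager
    dis.dis(byte_code)

# Python 3 builds dicts with BUILD_MAP / BUILD_CONST_KEY_MAP rather than STORE_MAP
dictionaries = re.findall(
    r"(?:BUILD_MAP|BUILD_CONST_KEY_MAP).*?(?:STORE_FAST|STORE_NAME).*?\((.*?)\)",
    output.getvalue(), re.M | re.S)
print(dictionaries)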

Related

Run python files and change strings with 1 file

Let's say I have a folder which contains the following files:
f1.py
f2.py
f3.py
In f1.py I have this code:
#O = "Random string"
print("ABCD")
#P = "Random string"
But in f2.py and f3.py I have this code:
#M = "Random string"
print("EFGH")
#Z = "Random string"
And I want to change the strings in the print call in f2.py and f3.py to the string in the print call in f1.py, and then run all the files in the folder after changing the strings, using f1.py.
It would be best to have more context on why you want to do this.
This is possible, but in 99% of cases it's not a good idea to write self-modifying code, though it can be a lot of fun.
In fact you are not really writing self-modifying code, but rather one piece of code modifying other files. But this is also rarely to be recommended.
What's more usual is that one script analyzes/parses f1.py, extracts the data, and writes it into a file (e.g. a JSON file), and f2.py and f3.py then read the data from that file and print it.
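A minimal sketch of that data-file approach (the file name message.json is an assumption):
# f1.py: publish the data once
import json
with open("message.json", "w") as f:
    json.dump({"message": "ABCD"}, f)

# f2.py / f3.py: read the shared data instead of being rewritten
import json
with open("message.json") as f:
    print(json.load(f)["message"])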
Is there a particular reason you want to have code that modifies other Python files?
If you really want to have f2.py and f3.py modified, then there is another solution, which is called templating (you can for example use Jinja2).
In this case you have two template files, f2.py.template and f3.py.template.
You write a script that parses f1.py, extracts the data, and creates f2.py from f2.py.template and the extracted data (same for f3.py.template and f3.py).
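A minimal sketch with Jinja2 (the template contents and the variable name are assumptions):
# f2.py.template would contain:  print("{{ message }}")
from jinja2 import Template

with open("f2.py.template") as f:
    template = Template(f.read())
with open("f2.py", "w") as f:
    f.write(template.render(message="ABCD"))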
If you're really 100% sure that you want what you ask for:
Yes, it is possible.
You write a script that opens and reads f1.py line by line, looks for the line #O = ..., then memorizes the next line.
Then it reads f2.py line by line and writes it to another file (e.g. next_version_of_f2.py). It reads in a line and writes it out until it encounters the line #M = "Random string" in f2.py. In that case the marker line is written out, the desired print is written out, the print line from f2.py is read and ignored, and then you read and write all the other lines.
Then you close f2.py and next_version_of_f2.py, rename f2.py to f2.py.old, and rename next_version_of_f2.py to f2.py.
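A minimal sketch of that rewrite loop (assuming the marker lines look exactly as in the question):
import os

# grab the print line that follows the '#O = ...' marker in f1.py
with open("f1.py") as f:
    lines = f.readlines()
desired = lines[lines.index('#O = "Random string"\n') + 1]

# copy f2.py, swapping the line that follows the '#M = ...' marker
with open("f2.py") as src, open("next_version_of_f2.py", "w") as dst:
    replace_next = False
    for line in src:
        if replace_next:
            dst.write(desired)   # write the desired print, drop the old one
            replace_next = False
        else:
            dst.write(line)
            if line.startswith("#M = "):
                replace_next = True

os.rename("f2.py", "f2.py.old")
os.rename("next_version_of_f2.py", "f2.py")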
This is certainly possible but probably inadvisable.
Editing code should typically be a separate action from executing it (even when we use a single tool that can do both, like a lot of modern IDEs).
It suggests a poor workflow. If you want f2.py to print("ABCD"), then it should be written that way.
It's confusing. In order to understand what f2.py does, you have to mentally model the entirety of f1.py and f2.py, and there's no indication of this in f2.py.
It invites all kinds of difficult-to-debug situations. What happens if f1.py is run twice at the same time? What if two different versions of f1.py are run at the same time? Or if I happen to be reading f2.py when you run f1.py? Or if I'm editing f2.py and save my changes while you're running f1.py?
It's a security problem. For f1.py to edit f2.py, the user (shell, web-server, or other surface) calling f1.py has to have edit permissions on f2.py. That means that if they can get f1.py to do something besides what you intended (specifically, if they can get their own text in place of "ABCD"), then they can get arbitrary code execution in everyone else's runtime!
Note that it's perfectly fine to have code that generates or edits other code. The problem is when a program (possibly spanning multiple files) edits its own source.
gelonida discusses some options, which are fine and appropriate for certain contexts such as managing user-specific configuration or building documents. That said, if you're familiar with functions, variables, imports, and other basics of computer science, then you may not even need a config.json file or a template engine.
Reconsider the end result you're trying to accomplish. Do some more research/reading, and if you're still stuck start a new question about the bigger-picture task.
But complaints about XY problems are old hat, so here's how to do what you want, even though it's awful and you really shouldn't.
f1.py
import os
import re
import sys

#O = "Random string"
print("ABCD")
#P = "Random string"

selector = re.compile(
    r'#([A-Z]) = \"Random string\"\nprint\(\"([A-Z]{4})\"\)\n#([A-Z]) = \"Random string\"')
template = '''#{} = "Random string"
print("{}")
#{} = "Random string"'''

own_file_name = os.path.abspath(__file__)
own_directory = os.path.dirname(own_file_name)

def read_file(name: str) -> str:
    with open(name, 'r') as f:
        return f.read()

# search(), not match(): the marker block is not at the very start of this file,
# and group(2) is the four-letter string inside print()
find_replacement = selector.search(read_file(own_file_name))
replacement = find_replacement.group(2) if find_replacement else False
if not replacement:
    sys.exit(-1)

def make_replacement(reg_match) -> str:
    return template.format(reg_match.group(1), replacement, reg_match.group(3))

# os.scandir() yields DirEntry objects, which have is_file() and .path
for dir_entry in os.scandir(own_directory):
    if dir_entry.is_file():
        original = read_file(dir_entry.path)
        with open(dir_entry.path, 'w') as out_file:
            out_file.write(selector.sub(make_replacement, original))

# this will cause an infinite loop, but you technically asked for it :)
for dir_entry in os.scandir(own_directory):
    if dir_entry.is_file():
        exec(read_file(dir_entry.path))
I want to be clear that the above is a joke. I haven't tested it, and I desperately hope it won't actually solve any problems for you.

how to import files from command line in python

I am using Python and I am supposed to read a file from the command line for further processing. My input file contains binary code that should be read for further processing. Here is my input file sub.py:
CODE = " \x55\x48\x8b\x05\xb8\x13\x00\x00"
and my main file which should read this is like the following:
import pyvex
import archinfo
import fileinput
import sys

filename = sys.argv[-1]
f = open(filename, "r")
CODE = f.read()
f.close()
print CODE
#CODE = b"\x55\x48\x8b\x05\xb8\x13\x00\x00"

# translate an AMD64 basic block (of nops) at 0x400400 into VEX
irsb = pyvex.IRSB(CODE, 0x1000, archinfo.ArchAMD64())

# pretty-print the basic block
irsb.pp()

# this is the IR Expression of the jump target of the unconditional exit at the end of the basic block
print irsb.next

# this is the type of the unconditional exit (i.e., a call, ret, syscall, etc)
print irsb.jumpkind

# you can also pretty-print it
irsb.next.pp()

# iterate through each statement and print all the statements
for stmt in irsb.statements:
    stmt.pp()

# pretty-print the IR expression representing the data, and the *type* of that IR expression written by every store statement
for stmt in irsb.statements:
    if isinstance(stmt, pyvex.IRStmt.Store):
        print "Data:",
        stmt.data.pp()
        print ""
        print "Type:",
        print stmt.data.result_type
        print ""

# pretty-print the condition and jump target of every conditional exit from the basic block
for stmt in irsb.statements:
    if isinstance(stmt, pyvex.IRStmt.Exit):
        print "Condition:",
        stmt.guard.pp()
        print ""
        print "Target:",
        stmt.dst.pp()
        print ""

# these are the types of every temp in the IRSB
print irsb.tyenv.types

# here is one way to get the type of temp 0
print irsb.tyenv.types[0]
The problem is that when I run python maincode.py sub.py, it reads the code as the content of the file, but its output is completely different from when I directly put CODE into the statement irsb = pyvex.IRSB(CODE, 0x1000, archinfo.ArchAMD64()). Does anyone know what the problem is and how I can solve it? I even tried importing from the input file, but it does not read the text.
Have you considered the __import__ way?
You could do
mod = __import__(sys.argv[-1])
print mod.CODE
and just pass the filename without the .py extension as your command line argument:
python maincode.py sub
EDIT: Apparently using __import__ is discouraged. Instead, you can use the importlib module:
import sys,importlib
mod = importlib.import_module(sys.argv[-1])
print mod.CODE
...and it should work the same as using __import__.
If you need to pass a path to the module, one way is to add an empty file named __init__.py to each of the directories. That will allow Python to interpret the directories as package namespaces, and you can then pass the path in its module form: python maincode.py path.to.subfolder.sub
If for some reason you cannot or don't want to add the directories as namespaces, and don't want to add the __init__.py files everywhere, you could also use imp.find_module. Your maincode.py would instead look like this:
import sys, imp

# find_module returns (file, pathname, description); load_module does the import
found = imp.find_module("sub", ["/path/to/subfolder/"])
mod = imp.load_module("sub", *found)
print mod.CODE
You'll have to write code which breaks your command-line input apart into the module part ("sub") and the folder path ("/path/to/subfolder/"), though. Does that make sense? Once it's ready, you'll call it like you expect: python maincode.py /path/to/subfolder/sub
You're reading the code as text, so escape sequences like \x55 stay as literal characters, while the inline CODE literal contains the actual bytes.
You probably need to convert the text to binary (or vice versa) to make this work:
Binary to String/Text in Python
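Concretely: read from the file, CODE holds the four characters backslash-x-5-5 rather than the single byte 0x55. On Python 2 (which this code is), one way to turn that text into real bytes is the string_escape codec; a sketch (the quote-splitting is naive and assumes sub.py looks exactly as shown above):
import sys

with open(sys.argv[-1], "r") as f:
    contents = f.read()

# take what is between the double quotes, then decode '\x55'-style text to bytes
CODE = contents.split('"')[1].decode("string_escape")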

Running grep through Python - doesn't work

I have some code like this:
f = open("words.txt", "w")
subprocess.call(["grep", p, "/usr/share/dict/words"], stdout=f)
f.close()
I want to grep the macOS dictionary for a certain pattern and write the results to words.txt. For example, if I want to do something like grep '\<a.\>' /usr/share/dict/words, I'd run the above code with p = "'\<a.\>'". However, the subprocess call doesn't seem to work properly and words.txt remains empty. Any thoughts on why that is? Also, is there a way to apply a regex to /usr/share/dict/words without calling a grep subprocess?
EDIT:
When I run grep '\<a.\>' /usr/share/dict/words in my terminal, I get words like:
aa
ad
ae
ah
ai
ak
al
am
an
ar
as
at
aw
ax
ay
as results in the terminal (or a file if I redirect them there). This is what I expect words.txt to contain after I run the subprocess call.
Like @woockashek already commented, you are not getting any results because there are no hits on '\<a.\>' in your input file. You are probably actually hoping to find hits for \<a.\>, but then obviously you need to omit the single quotes, which are messing you up.
Of course, Python knows full well how to look for a regex in a file.
import re

rx = re.compile(r'\ba.\b')
with open('/usr/share/dict/words', 'Ur') as reader, open('words.txt', 'w') as writer:
    for line in reader:
        if rx.search(line):
            print(line, file=writer, end='')
The single quotes here are part of Python's string syntax, just like the single quotes on the command line are part of the shell's syntax. In neither case are they part of the actual regular expression you are searching for.
The subprocess.Popen documentation vaguely alludes to the frequently overlooked fact that the shell's quoting is not necessary or useful when you don't have shell=True (which usually you should avoid anyway, for this and other reasons).
Python unfortunately doesn't support \< and \> as word boundary operators, so we have to use (the functionally equivalent) \b instead.
The standard input and output channels for the process started by call() are bound to the parent's input and output. That means the calling program cannot capture the output of the command. Use check_output() to capture the output for later processing:
import subprocess

output = subprocess.check_output(['grep', p, '/usr/share/dict/words'])
f = open("words.txt", "w")
f.write(output)
print output
f.close()
PS: I hope it works; I can't check the answer because I don't have macOS to try it.

Python script that prints its source

Is it possible (not necessarily using Python introspection) to print the source code of a script?
I want to execute a short Python script that also prints its own source (so I can see which commands are executed).
The script is something like this:
command1()
#command2()
command3()
print some_variable_that_contain_src
The real application is that I want to run a script from IPython with the run -i magic and have the source (i.e. the commands executed) as output. This way I can check which commands are commented out at every execution. Moreover, if executed in a Notebook, it leaves a trace of which commands have been used.
Solution
Using korylprince's solution, I ended up with this one-liner to be put at the beginning of the script:
with open(__file__) as f: print '\n'.join(f.read().split('\n')[1:])
This will print the script source except the first line (which would only be noise). It's also easy to modify the slicing in order to print a different "slice" of the script.
If you want to print the whole file instead, the one-liner simplifies to:
with open(__file__) as f: print f.read()
As long as you're not doing anything crazy with packages, put this at the top of your script:
with open(__file__) as f:
    print f.read()
This will read in the current file and print it out. For Python 3, make sure to use instead:
    print(f.read())
For the most simple answer:
import my_module
print open(my_module.__file__).read()
I also tried using the inspect package.
import inspect
import my_module
source_list = inspect.getsourcelines(my_module)
This gives you a tuple whose first element is a list of strings with the source code (the second element is the starting line number):
for line in source_list[0]:
    print line
This will print out the entire source code in a readable manner.

Python: how to capture output to a text file? (only 25 of 530 lines captured now)

I've done a fair amount of lurking on SO and a fair amount of searching and reading, but I must also confess to being a relative noob at programming in general. I am trying to learn as I go, and so I have been playing with Python's NLTK. In the script below, I can get everything to work, except it only writes what would be the first screen of a multi-screen output, at least that's how I am thinking about it.
Here's the script:
#! /usr/bin/env python
import nltk
# First we have to open and read the file:
thefile = open('all_no_id.txt')
raw = thefile.read()
# Second we have to process it with nltk functions to do what we want
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
# Now we can actually do stuff with it:
concord = text.concordance("cultural")
# Now to save this to a file
fileconcord = open('ccord-cultural.txt', 'w')
fileconcord.writelines(concord)
fileconcord.close()
And here's the beginning of the output file:
Building index...
Displaying 25 of 530 matches:
y .   The Baobab Tree : Stories of Cultural Continuity The continuity evident
regardless of ethnicity , and the cultural legacy of Africa as well . This Af
What am I missing here to get the entire 530 matches written to the file?
text.concordance(self, word, width=79, lines=25) seems to have other parameters, as per the manual.
I see no way to extract the size of the concordance index; however, the concordance printing code seems to have this part: lines = min(lines, len(offsets)). Therefore you can simply pass sys.maxint as the last argument:
concord = text.concordance("cultural", 75, sys.maxint)
Added:
Looking at your original code now, I can't see how it could have worked before. text.concordance does not return anything, but outputs everything to stdout using print. Therefore, the easy option would be redirecting stdout to your file, like this:
import sys
....
# Open the file
fileconcord = open('ccord-cultural.txt', 'w')
# Save old stdout stream
tmpout = sys.stdout
# Redirect all "print" calls to that file
sys.stdout = fileconcord
# Init the method
text.concordance("cultural", 200, sys.maxint)
# Close file
fileconcord.close()
# Reset stdout in case you need something else to print
sys.stdout = tmpout
Another option would be to use the respective classes directly and omit the Text wrapper. Just copy bits from here and combine them with bits from here and you are done.
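For instance, nltk.text.ConcordanceIndex can be driven directly; a rough sketch (the window arithmetic is approximate, and unlike Text.concordance it does not fold case):
from nltk.text import ConcordanceIndex

index = ConcordanceIndex(tokens)
with open('ccord-cultural.txt', 'w') as out:
    for offset in index.offsets('cultural'):   # every hit, not just the first 25
        left = ' '.join(tokens[max(0, offset - 8):offset])
        right = ' '.join(tokens[offset + 1:offset + 9])
        out.write('%s %s %s\n' % (left, tokens[offset], right))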
Update:
I found this thread, "write text.concordance output to a file", on the nltk users group. It's from 2010, and states:
Documentation for the Text class says: "is intended to support
initial exploration of texts (via the interactive console). ... If you
wish to write a program which makes use of these analyses, then you
should bypass the Text class, and use the appropriate analysis
function or class directly instead."
If nothing has changed in the package since then, this may be the source of your problem.
--- previously ---
I don't see a problem with writing to the file using writelines():
file.writelines(sequence)
Write a sequence of strings to the file. The sequence can be any
iterable object producing strings, typically a list of strings. There
is no return value. (The name is intended to match readlines();
writelines() does not add line separators.)
Note the italicized part: did you examine the output file in different editors? Perhaps the data is there, but not being rendered correctly due to missing end-of-line separators?
Are you sure this part is generating the data you want to output?
concord = text.concordance("cultural")
I'm not familiar with nltk, so I'm just asking as part of eliminating possible sources for the problem.
