Simple tokenizer for C++ in Python - python

Struggling to find a Python library of script to tokenize (find specific tokens like function definition names, variable names, keywords etc.).
I have managed to find keywords, whitespaces etc. using something like this but I found it quite a challenge for function/class definition names etc. I was hoping of using a pre-existent script; I explored Pygments with no success. Its lexer seems amazing for what I want but have no idea how to utilize it in Python and to also get positions for each found token.
For example I am looking at doing something like that:
int fac(int n)
{
return (n>1) ? n∗fac(n−1) : 1;
}
from the source code above I would like to get:
function_name: 'fac' at position (x, y)
variable_name: 'n' at position (x, y+8)
EDITED:
Any suggestions will be appreciated since I am in the dark here regarding tokenizations and parsing in C++?

Eli Bendersky is a smart guy, and sometimes active here on SO. He's got a blog post on this issue which I'll refer you directly to: Parsing C++ in Python with Clang.
Because things disappear, here's the takeaway:
Eli Bendersky wrote a C language (not C++) parser in Python, called pycparser. People keep asking him if he's going to add support for C++. He is not. He recommends instead that people use the Python bindings for libclang to get access to "a C API that the Clang team vows to keep relatively stable, allowing the user to examine parsed code at the level of an abstract syntax tree (AST)".
You can find the bindings separately on PyPI here. Note though that you'll have to have clang installed, so you may just want to point your PYTHON_PATH directly at the install location.

You're struggling to find a python library to do what you want because what you want is impossible to do, fundamentally.
I have managed to find keywords, whitespaces etc. using something like this but I found it quite a challenge for function/class definition names etc
You mean like this:
foo = 3
def foo():pass
What is foo? All a tokenizer should/can tell you is that foo is an identifier. It's context tells you whether it's a variable or a function declaration. You need a parser to handle context free grammars. Mathematically, the space of context free grammars is too large for a standard lexer to tackle.
Try a parser: here's one in python
Normally I'd try and provide you links here to distinguish between the topics, but this is too broad to provide a single good link to. If you're interested, start with any standard compiler text. Elsewhere on SE, we see this question pop up as a theoretical question and, in some form, as a famous question about html.
Once you realize that tokenizers are (usually) built (largely) on regular expressions, it becomes more obvious why your task is not going to end happily.
Now that you know the terminology, I think you'll find this SO article useful, which recommends gcc-ml. I don't know how up-to-date it is, but it's the type of program you're looking for.

Related

How can i know what parameters i can pass to a python function and what is the proper values?

For example, when i use matplotlib as plt, a possible statement is like below:
plt.plot(x,y,color='blue')
so how can i get what arguments like 'color' i can pass to the 'plot' function, and what is the proper values for that argument?
Especially when i use some modules.
thanks for any answers.
I'm a little disappointed that this post is being downvoted because I think it's a very legitimate question. In particular, I appreciate that you asked not what the answer was but instead how you could find it for yourself in the future.
Exploring Local Python Documentation
Python has a very robust built-in documentation system as well as a very active and supportive community. At any point in time, you can use the help function to examine a particular object that you want more information on - this will pull up the documentation for that object. In your example, you could do something like:
help(plt.plot)
to pull up the documentation for the matplotlib.pyplot.plot function. Outside of a running Python process, you can use the pydoc command line tool to read and explore that same documentation. Something like:
$ pydoc matplotlib.pyplot.plot
Running that in the shell will display the same documentation as the help command example.
Writing Documentation
As a good citizen of the Python ecosystem, you'll naturally want to document your own code. This is done pretty simply by adding a docstring to the top of a function, class, or module. Docstrings are denoted with a triple quotation mark """, seen in the examples below:
"""This module contains some example classes and functions"""
class MyClass(object):
"""MyClass does some things"""
pass
def my_function(a):
"""Calculates the sum"""
return(a)
There are many different documentation styles that people prefer, so what you choose is up to you. Though I would recommend the official docstring standards outlined in PEP 257.
Finding Online Resources
It's also often useful to take advantage of online documentation and resources. The official Python documentation includes all of the builtin documentation for the standard libraries as well as a tutorial for developers who are new to Python!
As it seems that you're relatively new to the ecosystem yourself, here's some more resources that you might find useful:
Learn Python the Hard Way
CodeAcademy has a Python track
StackOverflow, obviously

Documentation For Specific Functions in Python

What is the optimal way to get documentation about a specific function in Python? I am trying to download stock price data such that I can view it in my Spyder IDE.
The function I am interested in is:
ystockquote.get_historical_prices
How do I know how many inputs the function takes, the types of inputs it accepts, and the date format for example?
Just finding documentation
I suspect this question was super-downvoted because the obvious answer is to look at the documentation. It depends where your function came from, but googling is typically a good way to find it (I found the class here in a few seconds of googling).
It is also very trivialy to just check the source code
In order to import a function, you need to know where the source file it comes from is; open that file: in python, docstrings are what generate the documentation and can be found in triple-quotes beneath the function declaration. The arguments can be inferred from the function signature, but because python is dynamically typed, any type "requirements" are just suggestions. Some good documenters will provide the optimal types, too.
While "how do I google for documentation" is not a suitable question, the question of how to dynamically infer documentation is more reasonable. The answer is
The help function, built in here
The file __doc__ accessible on any python object, as a string
Using inspection
The question is even more reasonable if you are working with python extensions, like from external packages. I don't if the package you specifically asked about has any of those, but they can be tricky to work with if the authors haven't defined docstrings in the module. The problem is that in these cases, the typing can be rigidly inforced. There is no great way to get the tpye requirements in this case, as inspection will fail. If you can get at the source code, though, (perhaps by googling), this is where the documentation would be provided

Regular expression to match python method

Is there a python regex which will generically match a method definition (not just the declaration but also the method body) inside a python code file?
I did my share of googling but only found something similar for Java. Python is different w.r.t. to the way scopes are entered through indentation rather than accolades. What makes this problem hard is the fact that indentation may drop inside the method body (i.e. blank lines, multiline strings, comments).
I also looked for DOM parsers but basically they're all aimed at XML or HTML.
Finally I am looking into introspection (How can I get the source code of a Python function?) but I still wonder if there is a nicer way for code analysis without execution.
EDIT: the question receives a bunch of downvotes but I think it's actually a valid and specific programming question. I elaborated the question a bit.
Err, you don't want to use regexes to parse Python. The 'nicer way for code analysis without execution' is to use the Python standard library parser and/or ast modules. Look under the heading Python Language Services, e.g. https://docs.python.org/2/library/language.html

What is the System() import in Python

Long story short, a piece of code that I'm working with at work has the line:
from System import System
with a later bit of code of:
desc_ = System()
xmlParser = Parser(desc_.getDocument())
# xmlParser.setEntityBase(self.dtdBase)
for featureXMLfile in featureXmlList.split(","):
print featureXMLfile
xmlParser.parse(featureXMLfile)
feat = desc_.get(featureName)
return feat
Parser is an XML parser in Java (it's included in a different import), but I don't get what the desc_ bit is doing. I mean obviously, it somehow holds the feature that we're trying to pull out, but I don't entirely see where. Is System a standard library in Python or Java, or am I looking at something custom?
Unfortunately, everyone else in my group is out for Christmas Eve vacation, so I can't ask them directly. Thank you for your help. I'm still not horribly familiar with Python.
This isn't from the standard library, so you'll need to check your system (Python has plenty of introspection to help you with that).
You can tell as Python modules in the standard library use lowercase names as per PEP-8, or by searching the library reference.
Note as well that Python has it's own XML parsing tools that will be much nicer to work with in Python than Java's.
Edit: As you have noted in the comments you are using Jython, it seems likely this is Java's System package.
millimoose indicated the correct answer in his comment, but neglected to submit it as an answer, so I'm posting to indicate the correct answer. It was indeed a custom module built by my company. I was able to determine this by typing import System; print(System) into the interpreter.

programming language implemented in pure python

i am creating ( researching possibility of ) a highly customizable python client and would like to allow users to actually edit the code in another language to customize the running of program. ( analogous to browser which itself coded in c/c++ and run another language html/js ). so my question is , is there any programming language implemented in pure python which i can see as a reference ( or use directly ? ) -- i need simple language ( simple statements and ifs can do )
edit: sorry if i did not make myself clear but what i want is "a language to customize the running of program" , even though pypi seems a great option, what i am looking for is more simple which i can study and extend myself if need arise. my google searches pointing towards xml based langagues. ( BMEL , XForms etc ).
The question isn't completely clear on scope, but I have a hunch that PyPy, embedding other full languages, and similar solutions might be overkill. It sounds like iamgopal may really be interested in something more like Interpreter Pattern or Little Language.
If the language you want to support is really small (see the Interpreter Pattern link), then hand-coding this yourself in Python won't be too hard. You can write a simple parser (Google around; here's one example), then walk the AST and evaluate user expressions.
However, if you expect this to be used for a long time or by many people, it may be worth throwing a real language at the problem. (I'd recommend Python itself if your users are already familiar with basic Python syntax).
Ren'Py is a modification to Python syntax built on top of Python itself, using the language tools in the stdlib.
For your user's sake, don't use an XML based language - XML is an awful basis for a programming language and your users will hate you for it.
Here is a suggestion. Use a strict subset of Python for your language. Use the compiler module to convert their code into an abstract syntax tree and walk the tree to to validate that the code conforms to your subset before converting the AST into python bytecode.
N.B. I just checked the docs and see that the compiler package is deprecated in 2.6 and removed in Python 3.x. Does anyone know why that is?
Numerous template languages such as Cheetah, Django templates, Genshi, Mako, Mighty might serve as an example.
Why not Python itself? With some care you can use eval to run user code.
One of the good thing about interpreted scripting languages is that you don't need another extra scripting language!
PLY (Python Lex-Yacc)
is something of your interest.
Possibly Common Lisp (or any other Lisp) will be the best choice for that task. Because Lisp make it possible to easily extend host language with powerful macroses and construct DSL (domain specific language).
If all you need is simple if statements and expressions, I'm sure it wouldn't be an awful task to parse each line. Something like
if some flag
activate some feature
deactivate some feature
elif some other flag
activate some feature
activate some feature
else
logout
Just write a class which, while parsing takes the first word, checks if it's "if, elif, else," etc, and if so, check a flag and set a flag saying you either are or are not executing until the next conditional. If it's not a conditional, call a function based on the first keyword that would modify the program state in some way.
The class could store some local execution state (are we in an if statement? If so are we executing this branch?) and have another class containing some global application state (flags that are checkable by if statements, etc).
This is probably the wrong thing to do in your situation (it's very prone to bugs, it's dangerous if you don't treat the data in the scripts correctly), but it's at least a start if you do decide to interpret your own mini-language.
Seriously though, if you try this, be very, very, srs careful. Don't give the scripts any functionality that they don't definitely need, because you are almost certainly opening security holes by doing something like this.
Don't say I didn't warn you.

Categories

Resources