From a C file, I'd like to parse switch statements to identify two things:
a switch with only 2 cases:
switch(Data)
{
    case 0: value = 10; break;
    case 1: value = 20; break;
    default:
        somevar = false;
        value = 0;
        ----
        break;
}
==> for instance would print "section with 2 case"
a switch with many (unlimited) cases:
switch(Data)
{
    case Constant1: value = 10; break;
    case constant2: value = 20; break;
    case constant3: value = 30; break;
    case constant4: value = 40; break;
    default:
        somevar = false;
        value = 0;
        ----
        break;
}
==> would print "section with case : Constant1, Constant2, Constant3, Constant4"
To do that, I've done the following:
import re

cases_dict = {}
line_nb = 0
regex_case = re.compile(r'.*case.*:')  # compiled once, outside the loop
with open(original_filename, "r") as original_file:
    for line in original_file:
        line_nb += 1
        found_case = regex_case.search(line)
        if found_case:
            # relying on the line number is somewhat unreliable, as the C file
            # may have the break on an additional line
            cases_dict[line_nb] = found_case.group()
bool_or_enum(cases_dict)
What would bool_or_enum need to do to test for all the required results?
import operator

def bool_or_enum(in_dict):
    sorted_dict = sorted(in_dict.items(), key=operator.itemgetter(0))
    for index, item in enumerate(sorted_dict):
        ...
Following the comments, I searched and found that two solutions are available:
by using pycparser
pros: it is a Python package, free and open source
cons: not really easy to start with, and it needs additional tools to preprocess the files (gcc, llvm, etc.).
by using an external tool: Understand from SciTools
This tool's GUI lets you build a complete project to parse, so you can get call graphs, metrics, code checking, etc. For this question I've been using the API, which comes with docs and examples.
pros:
the parsing is totally done by the tool from the GUI
I have many source files, re-parsing from a complete directory is a "push-button" solution
cons :
not free
not open sourced
I usually prefer open-source projects, but in this case Understand was the only workable solution. The license is not that expensive anyway, and above all I chose it because it could parse files that couldn't be compiled, since their dependencies (libs and header files) weren't available.
Here is the code I've used:
import understand

understand_file = "C:\\Users\\dlewin\\myproject.udb"

# Create a list with all the cases from the file
def find_cases(file):
    returnList = []
    for lexeme in file.lexer(False, 8, False, True):
        if lexeme.text() == "case":
            returnList.append(lexeme.line_end())  # line nb
            returnList.append(lexeme.text())      # found a case
    return returnList

def find_identifiers(file):
    returnList = []
    # Open the file lexer with macros expanded and inactive code removed
    for lexeme in file.lexer(False, 8, False, True):
        if lexeme.token() == "Identifier":
            returnList.append(lexeme.line_end())  # line nb
            returnList.append(lexeme.text())      # identifier found
    return returnList

db = understand.open(understand_file)  # Open Database
file = db.lookup("mysourcefile.cpp", "file")[0]
print(file.longname())

liste_idents = find_identifiers(file)
liste_cases = find_cases(file)
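With those lists in hand, here is one possible sketch of bool_or_enum. Rather than working on the line-number dict (which, as noted above, is unreliable), it walks the lexer again and inspects the token that follows each case keyword: an identifier means a named constant, anything else is treated as a literal label. This is only a sketch, and it assumes the Understand lexeme.next() method and the "Whitespace"/"Comment"/"Identifier" token names behave as described in the API docs:

def bool_or_enum(file):
    labels = []
    for lexeme in file.lexer(False, 8, False, True):
        if lexeme.text() == "case":
            # step over whitespace/comments to reach the case label itself
            nxt = lexeme.next()
            while nxt and nxt.token() in ("Whitespace", "Comment"):
                nxt = nxt.next()
            if nxt:
                labels.append((nxt.token(), nxt.text()))
    idents = [text for token, text in labels if token == "Identifier"]
    if idents:
        print("section with case : " + ", ".join(idents))
    else:
        print("section with %d case" % len(labels))

bool_or_enum(file)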
I found myself writing some tricky algorithmic code, and I tried to comment it as well as I could, since I really don't know who is going to maintain this part of the code.
Following this idea, I've written quite a lot of block and inline comments, while trying not to over-comment. But still, when I go back to code I wrote a week ago, I find it difficult to read because of the swarming presence of comments, especially the inline ones.
I thought that indenting them (to ~120 chars) could ease readability, but that would obviously make the lines way too long according to style standards.
Here's an example of the original code:
fooDataTableOccurrence = nestedData.find("table=\"public\".")
if 0 > fooDataTableOccurrence:  # selects only tables without tag value "public-"
    otherDataTableOccurrence = nestedData.find("table=")
    dbNamePos = nestedData.find("dbname=") + 7  # 7 is the length of "dbname="
    if -1 < otherDataTableOccurrence:  # selects only tables with tag value "table="
        # database resource case
        resourceName = self.findDB(nestedData, dbNamePos, otherDataTableOccurrence, destinationPartitionName)
        if resourceName:  # if the resource is in a wrong path
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                return True, False, False  # respectively isProjectAlreadyExported, isThereUnexpectedData and wrongPathResources
            wrongPathResources.append("Database table: " + resourceName)
And here's what indenting the inline comments would look like:
fooDataTableOccurrence = nestedData.find("table=\"public\".")
if 0 > fooDataTableOccurrence:                                                            # selects only tables without tag value "public-"
    otherDataTableOccurrence = nestedData.find("table=")
    dbNamePos = nestedData.find("dbname=") + 7                                            # 7 is the length of "dbname="
    if -1 < otherDataTableOccurrence:                                                     # selects only tables with tag value "table="
        # database resource case
        resourceName = self.findDB(nestedData, dbNamePos, otherDataTableOccurrence, destinationPartitionName)
        if resourceName:                                                                  # if the resource is in a wrong path
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                return True, False, False                                                 # respectively isProjectAlreadyExported, isThereUnexpectedData and wrongPathResources
            wrongPathResources.append("Database table: " + resourceName)
The code is in Python (my company's legacy code doesn't strictly follow the PEP 8 standard, so we had to stick with that), but my point is not about the cleanness of the code itself, it's about the comments. I'm looking for a trade-off between readability and easy understanding of the code, and I sometimes find it difficult to achieve both at the same time.
Which of the examples is better? If none, what would be?
Maybe this is an XY problem?
Could the comments be eliminated altogether?
Here is a (quick & dirty) attempt at refactoring the code posted:
dataTableOccurrence_has_tag_public = nestedData.find("table=\"public\".") != -1
if not dataTableOccurrence_has_tag_public:  # keep only tables without the "public" tag, as in the original
    otherDataTableOccurrence = nestedData.find("table=")
    dataTableOccurrence_has_tag_table = otherDataTableOccurrence != -1
    prefix = "dbname="
    dbNamePos = nestedData.find(prefix) + len(prefix)
    if dataTableOccurrence_has_tag_table:
        # database resource case
        resourceName = self.findDB(nestedData,
                                   dbNamePos,
                                   otherDataTableOccurrence,
                                   destinationPartitionName)
        resource_name_in_wrong_path = bool(resourceName)
        if resource_name_in_wrong_path:
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                project_already_exported = True
                unexpected_data = False
                return (project_already_exported,
                        unexpected_data,
                        resource_name_in_wrong_path)
            wrongPathResources.append("Database table: " + resourceName)
Further work could involve extracting functions out of the block of code.
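As a sketch of that next step (the helper names below are hypothetical, not from the original code), each commented condition could become a small, well-named function so the inline comments disappear into the names:

DBNAME_PREFIX = "dbname="

def has_public_table_tag(data):
    # replaces the comment: selects only tables without tag value "public-"
    return data.find('table="public".') != -1

def table_tag_position(data):
    # replaces the comment: selects only tables with tag value "table="
    return data.find("table=")

def dbname_position(data):
    # replaces the comment: 7 is the length of "dbname="
    return data.find(DBNAME_PREFIX) + len(DBNAME_PREFIX)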
I have been trying to extract input addresses from Namecoin transactions using some Python code. This code works for regular transactions (where some namecoins are transferred from one address to another); however, it doesn't work on transactions that contain name operations, such as name_new. Here is some code:
raw = namecoind.getrawtransaction(tx_hash)
data = namecoind.decoderawtransaction(raw)
if 'vin' in data:
    inputs = data['vin']
    for input in inputs:
        input_value = input.get('value')
        if 'scriptSig' in input:
            script_sig_asm = str(input['scriptSig'].get('asm'))
            script_sig_parts = script_sig_asm.split(' ')
            if len(script_sig_parts) > 1 and (len(script_sig_parts[-1]) == 130
                                              or len(script_sig_parts[-1]) == 66):
                public_key_string = script_sig_parts[-1]
                try:
                    recipient_address = NamecoinPublicKey(public_key_string, verify=False).address()
                    print recipient_address
                except Exception, e:
                    print str(e)
                    return
            elif len(script_sig_parts) == 1:
                print "coinbase transaction input"
                return
            #print "Inputs:"
Running this code on a regular transaction works, i.e., we get the recipient address. But running it on a name operation such as this one reports a coinbase transaction input; that is,
len(script_sig_parts) == 1
is True, and so recipient_address is empty.
Can anybody show me how to get the recipient address (in the above transaction it is NCAzVGKq8JrsETxAkgw3MsDPinAEPwsTfn) in a Namecoin transaction that involves a name operation?
Your code should work fine (but I have not tested it) on most name transactions. For instance, if you take 499a1e4c7bb1388347e1bd1142425949971eaf1fa2521af625006f4f49ce85c5 (which is the latest update of d/domob), the relevant script sig of the input is this:
"scriptSig" : {
"asm" : "3045022100c489ac648d08416d83d5e561d222049242242ded5f1c2bfd35ea48bb78b6a90f02203a28dee3d9473755fc2288bcaec9105e973e306071d28e69e593668c94b19fc101 04a85b7360b6b95459f7286111220f7a1eaef23bc9ced8a3b56cd57360374381bcabf7182941be400ccdfb761e26fa62d50b8911358aceb6aa30de9e8df5c46742",
"hex" : "483045022100c489ac648d08416d83d5e561d222049242242ded5f1c2bfd35ea48bb78b6a90f02203a28dee3d9473755fc2288bcaec9105e973e306071d28e69e593668c94b19fc1014104a85b7360b6b95459f7286111220f7a1eaef23bc9ced8a3b56cd57360374381bcabf7182941be400ccdfb761e26fa62d50b8911358aceb6aa30de9e8df5c46742"
},
As you can see, it does have two pieces in the scriptSig (the signature and the pubkey). For your transaction, however, the previous output is not pay-to-pubkeyhash but pay-to-pubkey. Here's the relevant output of the previous transaction 92457dfc2831bdb6439fc03e72dbe3908140d43ec410f4a7396e3d65f5ab605b:
"scriptPubKey" : {
"asm" : "046a77fa46493d61985c1157a6e3e498b3b97c878c9c23e5b4729d354b574eb33a20c0483551308e2bd08295ce238e8ad09a7a2477732eb2e995a3e20455e9d137 OP_CHECKSIG",
"hex" : "41046a77fa46493d61985c1157a6e3e498b3b97c878c9c23e5b4729d354b574eb33a20c0483551308e2bd08295ce238e8ad09a7a2477732eb2e995a3e20455e9d137ac",
"reqSigs" : 1,
"type" : "pubkey",
"addresses" : [
"NCAzVGKq8JrsETxAkgw3MsDPinAEPwsTfn"
]
}
Due to the nature of the pay-to-pubkey script, the script sig no longer contains the pubkey (just the signature). I don't see a way to find the input address from the script sig alone. If you want to handle these (rare) cases, you have to fetch the previous output and look there. (My knowledge of ECDSA is limited -- maybe there is a way to recover the pubkey from the signature.)
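A minimal sketch of that previous-output lookup, reusing the namecoind RPC wrapper from the question and the standard decoderawtransaction layout (the helper name is mine, not from the original code):

def input_address_from_prev_output(namecoind, txin):
    # txin is one entry of data['vin']; fetch the transaction whose output
    # it spends and read the address from that output's scriptPubKey.
    prev_raw = namecoind.getrawtransaction(txin['txid'])
    prev_tx = namecoind.decoderawtransaction(prev_raw)
    prev_out = prev_tx['vout'][txin['vout']]
    addresses = prev_out['scriptPubKey'].get('addresses', [])
    return addresses[0] if addresses else None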
In many of my Python projects, I find myself having to go through a file, match lines against regexes, and then perform some computation based on elements extracted from the line by the regex.
In pseudo-C code, this is pretty easy:
while (read(line))
{
    if (m=matchregex(regex1,line))
    {
        /* munch on the components extracted in regex1 by accessing m */
    }
    else if (m=matchregex(regex2,line))
    {
        /* munch on the components extracted in regex2 by accessing m */
    }
    else if ...
    ...
    else
    {
        error("Unrecognized line format");
    }
}
However, because Python does not allow an assignment in the condition of an if, this can't be done as elegantly. One could first match against all the regexes and then test the resulting match objects one by one, but that is neither elegant nor efficient.
What I find myself doing instead is including code like this at the base level of every project:
import re

im = None
img = None

def imps(p, s):
    global im
    global img
    im = re.search(p, s)
    if im:
        img = im.groups()
        return True
    else:
        img = None
        return False
Then I can work like this:
for line in open(file, 'r').read().splitlines():
    if imps(regex1, line):
        ...  # munch on contents of img
    elif imps(regex2, line):
        ...  # munch on contents of img
    else:
        error('Unrecognised line: {}'.format(line))
That works, is reasonably compact, and is easy to type. But it is hardly beautiful; it uses global variables and is not thread-safe (which has not been an issue for me so far).
But I'm sure others have run across this problem before and come up with an equally compact, but more python-y and generally superior solution. What is it?
Depends on the needs of the code.
A common choice I use is something like this:
# note: order is important here; the first pattern to match wins
parse_regexps = [
    (re.compile(r"^foo"), handle_foo),  # handle_foo/handle_bar are handler functions defined elsewhere
    (re.compile(r"^bar"), handle_bar),
]

for regexp, handler in parse_regexps:
    m = regexp.match(line)
    if m:
        handler(line)  # possibly pass other data too, like m.groups()
        break
else:
    # for/else: this branch runs only if no pattern matched (no break)
    error("Unrecognized format....")
This has the advantage of moving the handling code into clear and obvious functions, which makes testing and changes easy.
You can just use continue:
for line in file:
    m = re.match(re1, line)
    if m:
        ...  # do stuff
        continue
    m = re.match(re2, line)
    if m:
        ...  # do stuff
        continue
    raise BadLine
Another, less obvious, option is to have a function like this:
def match_any(subject, *regexes):
    for n, regex in enumerate(regexes):
        m = re.match(regex, subject)
        if m:
            return n, m
    return -1, None
and then:
for line in file:
    n, m = match_any(line, re1, re2)
    if n == 0:
        ...  # handle a re1 match
    elif n == 1:
        ...  # handle a re2 match
    else:
        raise BadLine
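For completeness, and not part of the original answers: Python 3.8 later added the walrus operator :=, which allows exactly the assignment-in-condition style of the pseudo-C (this sketch requires Python 3.8+):

for line in file:
    if m := re.match(re1, line):
        ...  # munch on m.groups()
    elif m := re.match(re2, line):
        ...  # munch on m.groups()
    else:
        raise BadLine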
I need to parse a file that contains conditional statements, sometimes nested inside one another.
I have a file that stores configuration data, but the configuration data differs slightly depending on user-defined options. I can deal with the conditional statements themselves; they're all just booleans with no operations. What I don't know is how to recursively evaluate the nested conditionals. For instance, a piece of the file might look like:
...
#if CELSIUS
    #if FROM_KELVIN    ; this is a comment about converting kelvin to celsius.
        temp_conversion = 1, 273
    #else
        temp_conversion = 0.556, -32
    #endif
#else
    #if FROM_KELVIN
        temp_conversion = 1.8, -255.3
    #else
        temp_conversion = 1.8, 17.778
    #endif
#endif
...
Also, some conditionals don't have an #else clause, just #if CONDITION statement(s) #endif.
I realize that this would be easy if the file were written in XML or something else with a nice parser to begin with, but this is what I have to work with, so I'm wondering if there's a relatively simple way to parse it. It's similar to parenthesis matching, so I imagined there would be some module for it, but I haven't found anything.
I'm working in Python, but I can switch languages for this function if it's easier to solve in another one.
Here's a simple recursive parser for this syntax:
def parse(lines):
    result = []
    while lines:
        if lines[0].startswith('#if'):
            block = [lines.pop(0).split()[1], parse(lines)]
            if lines[0].startswith('#else'):
                lines.pop(0)
                block.append(parse(lines))
            lines.pop(0)  # the #endif
            result.append(block)
        elif not lines[0].startswith(('#else', '#endif')):
            result.append(lines.pop(0))
        else:
            break
    return result

tree = parse([x.strip() for x in your_code.splitlines() if x.strip()])
From your example it creates the following tree structure:
[['CELSIUS',
  [['FROM_KELVIN',
    ['temp_conversion = 1, 273'],
    ['temp_conversion = 0.556, -32']]],
  [['FROM_KELVIN',
    ['temp_conversion = 1.8, -255.3'],
    ['temp_conversion = 1.8, 17.778']]]]]
which should be easy to evaluate.
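For example, a sketch of that evaluation step (the evaluate helper is mine, taking the set of condition names that are true):

def evaluate(tree, defined):
    # Flatten the parse tree, keeping only branches whose condition holds.
    lines = []
    for node in tree:
        if isinstance(node, str):
            lines.append(node)
        else:
            # node is [condition, if_block] or [condition, if_block, else_block]
            condition, if_block = node[0], node[1]
            if condition in defined:
                lines.extend(evaluate(if_block, defined))
            elif len(node) == 3:
                lines.extend(evaluate(node[2], defined))
    return lines

# evaluate(tree, {'CELSIUS', 'FROM_KELVIN'}) == ['temp_conversion = 1, 273']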
For more advanced parsing consider one of many parsing tools available for Python.
Since all of the conditions are booleans whose values I know in advance (there's no need to evaluate them in order like a programming language would), I was able to do it with a regular expression. This works better for me. It finds the lowest-level conditionals (ones with no nested conditions), evaluates them, and replaces them with the correct contents, then repeats for the next level of conditionals, and so on.
import re

conditions = ['CELSIUS', 'FROM_KELVIN']

def eval_conditional(matchobj):
    # group 0 is the condition name, group 1 is the block body
    statement = matchobj.groups()[1].split('#else')
    statement.append('')  # in case there was no #else clause
    if matchobj.groups()[0] in conditions:
        return statement[0]
    else:
        return statement[1]

def parse(text):
    # matches the innermost #if ... #endif blocks (no nested #if inside)
    pattern = r'#if\s*(\S*)\s*((?:.(?!#if|#endif))*.)#endif'
    regex = re.compile(pattern, re.DOTALL)
    while regex.search(text):
        text = regex.sub(eval_conditional, text)
    return text

if __name__ == '__main__':
    i = open('input.txt', 'r').readlines()
    g = ''.join([x.split(';')[0] for x in i if x.strip()])  # drop comments and blank lines
    o = parse(g)
    open('output.txt', 'w').write(o)
Given the input in the original post, it outputs:
...
temp_conversion = 1, 273
...
which is what I need. Thanks to everyone for their responses, I really appreciate the help!
I'm not sure what to call what I'm looking for, so if I failed to find this question elsewhere, I apologize. In short, I am writing Python code that will interface directly with the Linux kernel. It's easy to get the required values from the include header files and write them into my source:
IFA_UNSPEC = 0
IFA_ADDRESS = 1
IFA_LOCAL = 2
IFA_LABEL = 3
IFA_BROADCAST = 4
IFA_ANYCAST = 5
IFA_CACHEINFO = 6
IFA_MULTICAST = 7
It's easy to use these values when constructing structs to send to the kernel. However, they are of almost no help for resolving the values in the responses from the kernel.
If I put the values into a dict, I presume I would have to scan all the values in the dict to look up the key for each item in each struct from the kernel. There must be a simpler, more efficient way.
How would you do it? (Feel free to retitle the question if it's way off.)
If you want to use two dicts, you can try this to create the inverted dict:
b = {v: k for k, v in a.iteritems()}
Your solution leaves a lot of repetitive work to the person creating the file. That is a source of errors (you actually have to write each name three times). If you have a file where you need to update these from time to time (e.g., when new kernel releases come out), you are destined to introduce an error sooner or later. Actually, that was just a long way of saying: your solution violates DRY.
I would change your solution to something like this:
IFA_UNSPEC = 0
IFA_ADDRESS = 1
IFA_LOCAL = 2
IFA_LABEL = 3
IFA_BROADCAST = 4
IFA_ANYCAST = 5
IFA_CACHEINFO = 6
IFA_MULTICAST = 7
__IFA_MAX = 8
values = {globals()[x]:x for x in dir() if x.startswith('IFA_') or x.startswith('__IFA_')}
This way the values dict is generated automatically. You might want to (or have to) change the condition in the if statement, according to whatever else is in that file. Maybe something like the following. That version takes away the need to list the prefixes in the if statement, but it would fail if you had other stuff in the file.
values = {globals()[x]:x for x in dir() if not x.endswith('__')}
You could of course do something more sophisticated there, e.g. check for accidentally repeated values.
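For example, a repeated-value check could be as simple as this sketch (run right after building values; because the comprehension keys by value, two constants accidentally sharing a value would silently collapse into one entry):

ifa_names = [x for x in dir() if x.startswith(('IFA_', '__IFA_'))]
if len(values) != len(ifa_names):
    raise ValueError("duplicate value among IFA_* constants")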
What I ended up doing is leaving the constant values in the module and creating a dict. The module is if_addr.py (the values are from linux/if_addr.h), so when constructing structs to send to the kernel I can use if_addr.IFA_LABEL, and I resolve responses with if_addr.values[2]. I'm hoping this is the most straightforward approach, so that when I have to look at this again in a year+ it's easy to understand :p
IFA_UNSPEC = 0
IFA_ADDRESS = 1
IFA_LOCAL = 2
IFA_LABEL = 3
IFA_BROADCAST = 4
IFA_ANYCAST = 5
IFA_CACHEINFO = 6
IFA_MULTICAST = 7
__IFA_MAX = 8
values = {
    IFA_UNSPEC    : 'IFA_UNSPEC',
    IFA_ADDRESS   : 'IFA_ADDRESS',
    IFA_LOCAL     : 'IFA_LOCAL',
    IFA_LABEL     : 'IFA_LABEL',
    IFA_BROADCAST : 'IFA_BROADCAST',
    IFA_ANYCAST   : 'IFA_ANYCAST',
    IFA_CACHEINFO : 'IFA_CACHEINFO',
    IFA_MULTICAST : 'IFA_MULTICAST',
    __IFA_MAX     : '__IFA_MAX'
}
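Usage then looks like this (a short sketch assuming the module layout above):

import if_addr

attr_type = if_addr.IFA_LABEL      # 3, for building a request struct
print(if_addr.values[attr_type])   # 'IFA_LABEL', for decoding a response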