I'm currently attempting to write a simple Python program that loops through a bunch of subdirectories, finding Java files and printing some information about the number of times certain keywords are used. I've managed to get this working for the most part. The problem I'm having is printing overall information for the higher directories. For example, my current output is as follows:
testcases/part1/testcase2/root_dir:
0 bytes 0 public 0 private 0 try 0 catch
testcases/part1/testcase2/root_dir/folder1:
12586 bytes 19 public 7 private 8 try 22 catch
testcases/part1/testcase2/root_dir/folder1/folder5:
7609 bytes 9 public 2 private 7 try 11 catch
testcases/part1/testcase2/root_dir/folder4:
0 bytes 0 public 0 private 0 try 0 catch
testcases/part1/testcase2/root_dir/folder4/folder2:
7211 bytes 9 public 2 private 4 try 9 catch
testcases/part1/testcase2/root_dir/folder4/folder3:
0 bytes 0 public 0 private 0 try 0 catch
and I want the output to be:
testcases/part1/testcase2/root_dir :
27406 bytes 37 public 11 private 19 try 42 catch
testcases/part1/testcase2/root_dir/folder1 :
20195 bytes 28 public 9 private 15 try 33 catch
testcases/part1/testcase2/root_dir/folder1/folder5 :
7609 bytes 9 public 2 private 7 try 11 catch
testcases/part1/testcase2/root_dir/folder4 :
7211 bytes 9 public 2 private 4 try 9 catch
testcases/part1/testcase2/root_dir/folder4/folder2 :
7211 bytes 9 public 2 private 4 try 9 catch
testcases/part1/testcase2/root_dir/folder4/folder3 :
0 bytes 0 public 0 private 0 try 0 catch
As you can see, the lower subdirectories directly provide the information to the higher subdirectories. This is the problem I'm running into: how to implement this efficiently. I have considered storing each print as a string in a list and then printing everything at the very end, but I don't think that would work for multiple subdirectories such as in the example provided. This is my code so far:
def lsJava(path):
    print()
    for dirname, dirnames, filenames in os.walk(path):
        size = 0
        public = 0
        private = 0
        tryCount = 0
        catch = 0
        # Get stats for the current directory.
        tempStats = os.stat(dirname)
        # Print current directory information.
        print(dirname + ":")
        # Scan the .java files of this directory.
        for filename in filenames:
            if filename.endswith(".java"):
                fileTempStats = os.stat(dirname + "/" + filename)
                size += fileTempStats.st_size
                tempFile = open(dirname + "/" + filename)
                tempString = tempFile.read()
                tempFile.close()
                tempString = removeComments(tempString)
                public += tempString.count("public")
                private += tempString.count("private")
                tryCount += tempString.count("try")
                catch += tempString.count("catch")
        print(" ", size, " bytes ", public, " public ",
              private, " private ", tryCount, " try ", catch,
              " catch")
The removeComments function simply removes all comments from the Java files using a regular expression pattern. Thank you in advance for any help.
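For reference, here is a minimal sketch of what such a removeComments function might look like (the actual implementation is not shown in the question, and this simplified version ignores comment markers inside string literals):

import re

def removeComments(text):
    # Strip /* ... */ block comments (non-greedy, spanning lines).
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    # Strip // line comments.
    text = re.sub(r"//[^\n]*", "", text)
    return text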
EDIT:
The following code was added at the beginning of the for loop:
current_dirpath = dirname
if dirname != current_dirpath:
    size = 0
    public = 0
    private = 0
    tryCount = 0
    catch = 0
The output is now as follows:
testcases/part1/testcase2/root_dir/folder1/folder5:
7609 bytes 9 public 2 private 7 try 11 catch
testcases/part1/testcase2/root_dir/folder1:
20195 bytes 28 public 9 private 15 try 33 catch
testcases/part1/testcase2/root_dir/folder4/folder2:
27406 bytes 37 public 11 private 19 try 42 catch
testcases/part1/testcase2/root_dir/folder4/folder3:
27406 bytes 37 public 11 private 19 try 42 catch
testcases/part1/testcase2/root_dir/folder4:
27406 bytes 37 public 11 private 19 try 42 catch
testcases/part1/testcase2/root_dir:
27406 bytes 37 public 11 private 19 try 42 catch
os.walk() takes an optional topdown argument. If you use os.walk(path, topdown=False) it will instead traverse directories bottom-up.
When you first start the loop save off the first element of the tuple (dirpath) as a variable like current_dirpath. As you continue through the loop you can keep a running total of the file sizes in that directory. Then just add a check like if dirpath != current_dirpath, at which point you know you've gone up a directory level, and can reset the totals.
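For illustration, a quick way to see the difference in traversal order (the path is the one from the question):

import os

# With the default topdown=True, root_dir is yielded before its children;
# with topdown=False, the deepest directories (e.g. .../folder1/folder5)
# are yielded first, so child totals can be computed before their parents.
for dirname, dirnames, filenames in os.walk("testcases/part1/testcase2/root_dir",
                                            topdown=False):
    print(dirname)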
I don't believe you can do this with a single counter, even bottom-up: If a directory A has subdirectories B and C, when you're done with B you need to zero the counter before you descend into C; but when it's time to do A, you need to add the sizes of B and C (but B's count is long gone).
Instead of maintaining a single counter, build up a dictionary mapping each directory (key) to the associated counts (in a tuple or whatever). As you iterate (bottom-up), whenever you are ready to print output for a directory, you can look up all its subdirectories (from the dirname argument returned by os.walk()) and add their counts together.
Since you don't discard the data, this approach can be extended to maintain separate deep and shallow counts, so that at the end of the scan you can sort your directories by shallow count, report the 10 largest counts, etc.
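A minimal sketch of that idea, assuming removeComments is defined as in the question and reusing its output format:

import os

def lsJava(path):
    totals = {}  # maps each directory to [size, public, private, try, catch]
    for dirname, dirnames, filenames in os.walk(path, topdown=False):
        counts = [0, 0, 0, 0, 0]
        for filename in filenames:
            if filename.endswith(".java"):
                filepath = os.path.join(dirname, filename)
                counts[0] += os.stat(filepath).st_size
                with open(filepath) as f:
                    text = removeComments(f.read())
                for i, keyword in enumerate(("public", "private", "try", "catch"), 1):
                    counts[i] += text.count(keyword)
        # Bottom-up traversal guarantees each subdirectory was already visited,
        # so its cumulative counts are available in the dictionary.
        for sub in dirnames:
            sub_counts = totals.get(os.path.join(dirname, sub), [0] * 5)
            counts = [a + b for a, b in zip(counts, sub_counts)]
        totals[dirname] = counts
    # Print top-down, as in the desired output (parents sort before children).
    for dirname in sorted(totals):
        size, public, private, tryCount, catch = totals[dirname]
        print(dirname + ":")
        print(" ", size, "bytes", public, "public", private,
              "private", tryCount, "try", catch, "catch")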
Related
I ask a measurement device to give me some data. First it tells me how many bytes of data are in storage; it is always 14. Then it gives me the data, which I have to encode into hex. This is Python 2.7; I can't use newer versions. Lines 6 to 10 tell the device to give me the measured data.
Lines 12 to 14 do the encoding to hex. In other programs this works, but when I print result (line 14) I get a hex number with 13 bytes plus one extra, which cannot be correct because it has an L at the end. I guess it is some long or whatever, and I don't need the last byte, but I think it changes the data too, which is picked out from line 15 onward: first in hex, then converted to int.
Does the L have an effect on the data or not?
How can I fix it?
ap.write(b"ML\0")                              # line 1
rmemb = ap.read(2)
print(rmemb)
rmemb = int(rmemb)+1
rmem = rmemb                                   # line 5: must be and is 14 bytes
addmem = ("MR:%s\0" % rmem)
# addmem = ("MR:14\0")
ap.write(addmem.encode())
time.sleep(1)                                  # line 10
test = ap.read(rmem)
result = hex(int(test.encode('hex'), 16))
print(result)
ftflash = result[12:20]                        # line 15
ftbg = result[20:28]
print(ftflash)
print(ftbg)
ftflash = int(ftflash, 16)
# print(ftflash)                               # line 20
ftbg = int(ftbg, 16)
# print(ftbg)
OUTPUT:
14
0x11bd5084c0b000001ce00000093L
b000001c
e0000009
Python 2 has two built-in integer types, int and long. hex returns a string representing a Python hexadecimal literal, and in Python 2, that means that longs get an L at the end, to signify that it's a long.
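The value itself is unchanged by the L; it is only notation in the string that hex() returns. A quick Python 2.7 illustration, including two ways to get the digits without the suffix:

n = 10 ** 20                   # large enough to be a long on any platform
print(hex(n))                  # ends with 'L' in Python 2
print(format(n, 'x'))          # same hex digits, no '0x' prefix and no 'L'
print(hex(n).rstrip('L'))      # or simply strip the suffix from hex() output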
I have a class like this:
class Detector:
    ...
    def detect(self):
        sniff(iface='eth6', filter='vlan or not vlan and udp port 53',
              prn=self.spawnThread, store=0)

    def spawnThread(self, pkt):
        t = threading.Thread(target=self.predict, args=(pkt,))
        t.start()

    def predict(self, pkt):
        # do something
        # write log file with logging module
where sniff() is a function from scapy; for every packet it captures, it passes the packet to spawnThread, and in spawnThread I want to create a separate thread to run the predict method.
But there seems to be a memory leak. I checked with Heapy and got this output:
Partition of a set of 623561 objects. Total size = 87355208 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 236145 38 26871176 31 26871176 31 str
1 139658 22 13805832 16 40677008 47 tuple
2 6565 1 7366648 8 48043656 55 dict (no owner)
3 1408 0 6592768 8 54636424 63 dict of module
4 25764 4 3297792 4 57934216 66 types.CodeType
5 17737 3 3223240 4 61157456 70 list
6 24878 4 2985360 3 64142816 73 function
7 14367 2 2577384 3 66720200 76 unicode
8 2445 0 2206320 3 68926520 79 type
9 2445 0 2173752 2 71100272 81 dict of type
The count and size of tuple objects keep growing; I think that's what causes the memory leak, but I don't know where or why. Thanks for any feedback!
Update: if I call predict directly from sniff without using threads, there is no memory leak. Also, there are no other tuple objects anywhere else in the class; in __init__ I just initialized some strings like paths and names.
class Detector:
    ...
    def detect(self):
        sniff(iface='eth6', filter='vlan or not vlan and udp port 53',
              prn=self.predict, store=0)

    def predict(self, pkt):
        # do something with pkt
        # write log file with logging module
Imagine this simple function, which creates a modified value of the global variable default and stores it in the local variable modified:
default = 0

def modify():
    modified = default + 1
    print(modified)  # replace with OS call, I can't see the output

modify()   # 1
default    # 0
disassembled:
import dis
dis.dis(modify)
2 0 LOAD_GLOBAL 0 (default)
3 LOAD_CONST 1 (1)
6 BINARY_ADD
7 STORE_FAST 0 (modified)
3 10 LOAD_GLOBAL 1 (print)
13 LOAD_FAST 0 (modified)
16 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
19 POP_TOP
20 LOAD_CONST 0 (None)
23 RETURN_VALUE
I can't change the function modify(), but I know what's in it, either directly (I can see the code) or indirectly (disassembly). What I need is to get the value of the modified variable, so I thought maybe there is a way to remove specific parts (print(modified)) of the function through the dis module, but I didn't find anything.
Is there any way to remove everything after 16 CALL_FUNCTION except RETURN_VALUE and replace it with, e.g., return modified? Or is there any other way to pull a local variable out without actually executing the last line(s)?
As a possible solution I see three ways:
pulling out the disassembled code and creating my own function from it (or patching it in place), removing the code I don't want (everything after 16 ...)
modifying the function's return value so that it returns modified (which unfortunately calls the OS function)
manually recreating the function according to the source code
I'd like to avoid the second way, which is probably easier than the first one, but I must avoid the third way, so... is there any way to solve my problem?
There is a 4th option: replace the print() global:
printed = []
print = lambda *args: printed.extend(args)
modify()
del print
modified = printed[0]
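If modify() really writes via print() rather than a lower-level OS call, a related standard-library alternative is contextlib.redirect_stdout, which captures anything written through sys.stdout; a sketch:

import contextlib
import io

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    modify()
modified = int(buf.getvalue().strip())  # 1 (captured as text, so convert back)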
It is otherwise possible to produce modified bytecode, but this can easily lead to bugs that blow up the interpreter (there is zero protection from invalid bytecode), so be warned.
You can create a new function object with a new code object with updated bytecode; based on the offsets in the dis you showed, I manually created new bytecode that would return the local variable at index 0:
>>> altered_bytecode = modify.__code__.co_code[:8] + bytes(
...     [dis.opmap['LOAD_FAST'], 0,    # load local variable 0 onto the stack
...      dis.opmap['RETURN_VALUE']])   # and return it.
>>> dis.dis(altered_bytecode)
0 LOAD_GLOBAL 0 (0)
2 LOAD_CONST 1 (1)
4 BINARY_ADD
6 STORE_FAST 0 (0)
8 LOAD_FAST 0 (0)
10 RETURN_VALUE
RETURN_VALUE returns the object at the top of the stack; all I did was inject a LOAD_FAST opcode to load what modified references onto the stack.
You'd have to create a new code object, then a new function object wrapping the code object, to make this callable:
>>> code = type(modify.__code__)
>>> function = type(modify)
>>> ocode = modify.__code__
>>> new_modify = function(
... code(ocode.co_argcount, ocode.co_kwonlyargcount, ocode.co_nlocals, ocode.co_stacksize,
... ocode.co_flags, altered_bytecode,
... ocode.co_consts, ocode.co_names, ocode.co_varnames, ocode.co_filename,
... 'new_modify', ocode.co_firstlineno, ocode.co_lnotab, ocode.co_freevars,
... ocode.co_cellvars),
... modify.__globals__, 'new_modify', modify.__defaults__, modify.__closure__)
>>> new_modify()
1
This does, obviously, require some understanding of how Python bytecode works in the first place; the dis module does contain descriptions of the various codes, and the dis.opmap dictionary lets you map back to byte values.
There are a few modules out there that try to make this easier; take a look at byteplay, the bytecode module of the pwnypack project or several others, if you want to explore this further.
I can also heartily recommend you watch the Playing with Python Bytecode presentation given by Scott Sanderson and Joe Jevnik at PyCon 2016, and play with their codetransformer module. Highly entertaining and very informative.
I have a 64-bit data structure as follows:
HHHHHHHHHHHHHHHHGGGGGGGGGGGGFFFEEEEDDDDCCCCCCCCCCCCBAAAAAAAAAAAA
A: 12 bits (unsigned)
B: 1 bit
C: 12 bits (unsigned)
D: 4 bits (unsigned)
E: 4 bits (unsigned)
F: 3 bits (unsigned)
G: 12 bits (unsigned)
H: 16 bits (unsigned)
Using Python, I am trying to determine which module (preferably native Python 3.x) I should be using. I am looking at BitVector but having trouble figuring some things out.
For ease of use, I want to be able to do something like the following:
# To set the `A` bits, use a mapped mask 'objectId'
bv = BitVector(size=64)
bv['objectId'] = 1
I'm not sure that BitVector actually works the way I want it to. Whatever module I end up using, the data structure will be encapsulated in a class that reads and writes to the structure via property getters/setters.
I will also be using constants (or enums) for some of the bit values and it would be convenient to be able to set a mapped mask using something like:
# To set the 'B' bit, use the constant flag to set the mapped mask 'visibility'
bv['visibility'] = PUBLIC
print(bv['visibility']) # output: 1
# To set the 'G' bits, use a mapped mask 'imageId'
bv['imageId'] = 238
Is there a Python module in 3.x that will help me achieve this goal? If BitVector will (or should) work, some helpful hints (e.g. examples) would be appreciated. It seems that BitVector wants to force everything to an 8-bit format which is not ideal for my application (IMHO).
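For what it's worth, the property-based wrapper described above can also be sketched with plain integer masking and shifting, with no third-party module; field names such as objectId and visibility are illustrative, taken from the layout at the top:

class BlockVector:
    """Sketch of a mask/shift wrapper over a plain 64-bit int."""

    def __init__(self, value=0):
        self._value = value

    @property
    def objectId(self):  # the A field: bits 0-11 (least significant)
        return self._value & 0xFFF

    @objectId.setter
    def objectId(self, v):
        self._value = (self._value & ~0xFFF) | (v & 0xFFF)

    @property
    def visibility(self):  # the B field: bit 12
        return (self._value >> 12) & 0x1

    @visibility.setter
    def visibility(self, v):
        self._value = (self._value & ~(0x1 << 12)) | ((v & 0x1) << 12)

Usage would then look like bv = BlockVector(); bv.objectId = 1, and properties for the remaining fields follow the same mask/shift pattern.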
Based on the recommendation to use bitarray, I have come up with the following implementation, with two utility methods:
def test_bitvector_set_block_id_slice(self):
    bv = bitvector(VECTOR_SIZE)
    bv.setall(False)
    print("BitVector[{len}]: {bv}".format(len=bv.length(), bv=bv.to01()))
    print("set block id: current {bid} --> {nbid}".format(
        bid=bv[52:VECTOR_SIZE].to01(),
        nbid=inttobitvector(12, 1).to01()))
    # set blockVector.blockId (last 12 bits)
    bv[52:VECTOR_SIZE] = inttobitvector(12, 1)
    block_id = bv[52:VECTOR_SIZE]
    self.assertTrue(bitvectortoint(block_id) == 1)
    print("BitVector[{len}] set block id: {bin} [{val}]".format(
        len=bv.length(), bin=block_id.to01(), val=bitvectortoint(block_id)))
    print("BitVector[{len}]: {bv}".format(len=bv.length(), bv=bv.to01()))
    print()
# utility methods
def bitvectortoint(bitvec):
    out = 0
    for bit in bitvec:
        out = (out << 1) | bit
    return out

def inttobitvector(size, n):
    bits = "{0:0{size}b}".format(n, size=size)
    print("int[{n}] --> binary: {bits}".format(n=n, bits=bits))
    return bitvector(bits)
The output is as follows:
BitVector[64]: 0000000000000000000000000000000000000000000000000000000000000000
int[1] --> binary: 000000000001
set block id: current 000000000000 --> 000000000001
int[1] --> binary: 000000000001
BitVector[64] set block id: 000000000001 [1]
BitVector[64]: 0000000000000000000000000000000000000000000000000000000000000001
If there are improvements to the utility methods, I am more than willing to take some advice.
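One possible simplification, assuming a reasonably recent bitarray release (its bitarray.util module has provided ba2int and int2ba since version 1.2), collapses both helpers to one-liners:

from bitarray.util import ba2int, int2ba

def bitvectortoint(bitvec):
    return ba2int(bitvec)          # big-endian interpretation, like the loop above

def inttobitvector(size, n):
    return int2ba(n, length=size)  # pads with leading zeros to the given length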
I have a few USB disks inserted in my system. I would like to list all of them, like:
/dev/sdb
/dev/sdc
... and so on.
Please remember that I don't want to list the partitions in them, like /dev/sdb1. I am looking for a solution under Linux. I tried cat /proc/partitions:
major minor #blocks name
8 0 488386584 sda
8 1 52428800 sda1
8 2 52428711 sda2
8 3 1 sda3
8 5 52428800 sda5
8 6 15728516 sda6
8 7 157683712 sda7
8 8 157682688 sda8
11 0 1074400 sr0
11 1 47602 sr1
8 32 3778852 sdc
8 33 1 sdc1
8 37 3773440 sdc5
But it lists all the disks, and I am unable to figure out which ones are USB storage disks. I am looking for a solution that does not require an additional package installation.
You can convert Klaus D.'s suggestion into Python code like this:
#!/usr/bin/env python
import os

basedir = '/dev/disk/by-path/'

print 'All USB disks'
for d in os.listdir(basedir):
    # Only show USB disks, not partitions
    if 'usb' in d and 'part' not in d:
        path = os.path.join(basedir, d)
        link = os.readlink(path)
        print '/dev/' + os.path.basename(link)
path contains info in this format:
/dev/disk/by-path/pci-0000:00:1d.7-usb-0:5:1.0-scsi-0:0:0:0
which is a symbolic link, so we can get the pseudo-scsi device name using os.readlink().
But that returns info in this format:
../../sdc
so we use os.path.basename() to clean it up.
Instead of using
'/dev/' + os.path.basename(link)
you can produce a string in the '/dev/sdc' format by using
os.path.normpath(os.path.join(os.path.dirname(path), link))
but I think you'll agree that the former technique is simpler. :)
List the right path in /dev:
ls -l /dev/disk/by-path/*-usb-* | fgrep -v part
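A Python equivalent of that one-liner, using glob against the same by-path symlinks:

import glob
import os

for path in glob.glob('/dev/disk/by-path/*-usb-*'):
    if 'part' not in os.path.basename(path):
        # Each match is a symlink like ../../sdc; resolve it to /dev/sdc.
        print('/dev/' + os.path.basename(os.readlink(path)))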