Extracting text from a binary PE file in python

I am trying to extract strings from PE files (.exe and .dll) using the pefile library, but I have been stuck for a while because this data format is new to me. I've read many questions similar to mine, but I have not been able to adapt their code to fit my needs.
I have a following code:
import pefile
# path to random pe file
p = 'dfghdsfhrtkl54165hs.exe'
pe = pefile.PE(p)
# Extract the file's metadata
print('Machine: ', pe.FILE_HEADER.Machine)
print('Number of sections: ', pe.FILE_HEADER.NumberOfSections)
print('Timestamp: ', pe.FILE_HEADER.TimeDateStamp)
print('Entry point: ', pe.OPTIONAL_HEADER.AddressOfEntryPoint)
# Machine: 332
# Number of sections: 3
# Timestamp: 1441263997
# Entry point: 5432
As I understand it, there is a .text section whose contents can be used to classify the file as benign or malicious, so I've tried the following:
for section in pe.sections:
    if section.Name.decode().strip('\x00') == '.text':
        text_section = section
        break
text_section
Which returns
<Structure: [IMAGE_SECTION_HEADER] 0x1B0 0x0
Name: .text 0x1B8 0x8
Misc: 0xF53C 0x1B8 0x8
Misc_PhysicalAddress: 0xF53C 0x1B8 0x8
Misc_VirtualSize: 0xF53C 0x1BC 0xC
VirtualAddress: 0x1000 0x1C0 0x10
SizeOfRawData: 0x10000 0x1C4 0x14
PointerToRawData: 0x1000 0x1C8 0x18
PointerToRelocations: 0x0 0x1CC 0x1C
PointerToLinenumbers: 0x0 0x1D0 0x20
NumberOfRelocations: 0x0 0x1D2 0x22
NumberOfLinenumbers: 0x0 0x1D4 0x24
Characteristics: 0x60000020>
But I am unsure how to proceed extracting printable strings from this or if this is even the right way.
I've read several related answers.
My end goal is to extract text from PE files that I can use in my ML model as features.
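One possible way to get printable strings is sketched below. It is a minimal sketch, not a definitive implementation: `extract_strings` and `section_strings` are hypothetical helper names, the minimum length of 4 is an arbitrary assumption, and only `pefile.PE`, `pe.sections`, `section.Name` and `section.get_data()` come from pefile itself. The regular expressions look for runs of printable ASCII, both as plain bytes and as UTF-16LE.

```python
import re

def extract_strings(data, min_len=4):
    """Return printable ASCII and UTF-16LE strings found in raw bytes."""
    # Runs of printable ASCII characters (space through tilde).
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # The same characters stored as UTF-16LE (each followed by a NUL byte).
    wide_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide_pat.finditer(data)]
    return found

def section_strings(path, min_len=4):
    """Map each section name of a PE file to the strings found in it."""
    import pefile  # third-party; imported lazily so extract_strings works alone
    pe = pefile.PE(path)
    result = {}
    for section in pe.sections:
        name = section.Name.rstrip(b"\x00").decode(errors="replace")
        result[name] = extract_strings(section.get_data(), min_len)
    return result

# Demo on made-up bytes (not taken from a real PE file).
sample = b"\x00\x01Hello, world\x00\xff\x10abc\x01K\x00e\x00r\x00n\x00e\x00l\x00\x00\x00"
demo = extract_strings(sample)
print(demo)
```

Calling `section_strings(p)` with the path from the question would give one list of strings per section. Human-readable strings often sit in data sections rather than in .text, so scanning every section (or the whole file) may yield more useful features than .text alone.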

Related

pyserial read variable byte array

I am writing code that can flexibly read a variable-length byte array (either 4 bytes or 7 bytes).
I don't think it's a good idea to always read 7 bytes and then judge the frame by the byte count sent in the 2nd byte.
Reading an array that is sent as 4 bytes as if it were 7 bytes results in slow reception.
import time
import serial
serial_port = serial.Serial(
    port="/dev/ttyUSB0",
    baudrate=9600,
    bytesize=serial.EIGHTBITS,
    parity=serial.PARITY_NONE,
    stopbits=serial.STOPBITS_ONE,
)
if __name__ == '__main__':
    while True:
        time.sleep(1)
        if serial_port.readable():
            serial_port.flushInput()
            serial_port.timeout = None
            data = serial_port.read(7)
            l_data = list(data)
            print("Header : ", str.join("", ("0x%02X " % i for i in l_data)))
Results received from the connected equipment
output:
Header : 0x2B 0x04 0x53 0x7C 0x2B 0x04 0x53 #0x04 is the actual total number of bytes
Header : 0x2b 0x07 0x51 0x06 0x07 0x46 0x3a #0x07 is the actual total number of bytes
If I replace read() with readline(), nothing comes up.
The same happens with read_until().
(Actually, it is because of my lack of skills.)
I tried to find the end of the array (\n or \r) and truncate it there, but it didn't work.
Is there a way to flexibly receive both 4-byte and 7-byte arrays?
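Since the output above shows that the 2nd byte carries the total frame length, one approach is to read the 2-byte header first and then read only the bytes that remain. The sketch below assumes exactly that framing; `read_frame` is a hypothetical helper, and an `io.BytesIO` object stands in for the serial port so the example runs without hardware. Any object with a blocking `read(n)` method, such as a `serial.Serial` instance, could be passed instead.

```python
import io

def read_frame(port):
    """Read one frame whose 2nd byte holds the total frame length.

    Returns the whole frame as bytes, or None if the stream ended early.
    """
    header = port.read(2)
    if len(header) < 2:
        return None
    total_len = header[1]
    if total_len < 2:
        raise ValueError("length byte smaller than the header itself")
    rest = port.read(total_len - 2)
    if len(rest) < total_len - 2:
        return None
    return header + rest

# Stand-in for the serial port: a 4-byte frame followed by a 7-byte frame,
# using the byte values shown in the output above.
fake_port = io.BytesIO(bytes.fromhex("2B 04 53 7C 2B 07 51 06 07 46 3A"))
first = read_frame(fake_port)
second = read_frame(fake_port)
third = read_frame(fake_port)
print(first.hex(), second.hex(), third)
```

With a real port this would not call flushInput() before every read, because discarding the input buffer can throw away the start of a frame and leave the reader out of sync.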

How to extract particular lines in text file using heading in python and return it in function?

I am trying to extract particular information from a text file using the headings in the file. There are different headings, each with its own content. I need to extract the content under a particular heading and return it from a function. I have tried many approaches but have not been able to achieve this.
Here is the description of file:
INTERFACE 2: XYZ =====================================
bLength : 0x9 (9 bytes)
bDescriptorType : 0x4 Interface
bInterfaceNumber : 0x2
bAlternateSetting : 0x0
INTERFACE 2, 1: ABC ==================================
bLength : 0x9 (9 bytes)
bDescriptorType : 0x4 Interface
bInterfaceNumber : 0x2
bAlternateSetting : 0x1
bNumEndpoints : 0x1
ENDPOINT 0x1: Isochronous ========================
bLength : 0x9 (7 bytes)
bDescriptorType : 0x5 Endpoint
bEndpointAddress : 0x1 OUT
I need to extract information such as the ABC content and the XYZ content, but I have not been able to.
My question is:
1) How can we extract the content under a particular heading and return it from a function?

Not sure whether this is what you want, but this will give you a dictionary with headings and content:
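content_dict = dict()
heading_line = None
# f is an already opened file object for the description file
for l in f.readlines():
    # recognize heading lines
    if "======" in l:  # if this doesn't cover, you might need to define a regex
        # drop the trailing run of '=' so the key is just the heading text
        heading_line = l.strip().rstrip('=').strip()
        content_dict[heading_line] = dict()
    elif heading_line is not None and ':' in l:
        # split on the first ':' only; skips blank lines and text before the first heading
        field_name, _, field_val = l.partition(':')
        content_dict[heading_line][field_name.strip()] = field_val.strip()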
Using the content_dict above, each heading maps to a dict of its fields' string values, from which you can extract whatever you need.
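To return the content of one heading from a function, as the question asks, the same idea can be wrapped in two small helpers. This is a sketch under assumptions: `parse_sections` and `find_section` are hypothetical names, headings are recognised by a run of `=` characters, and a heading is selected by a plain substring match on a keyword such as "ABC".

```python
def parse_sections(lines):
    """Group 'name : value' lines under the heading line that precedes them."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "======" in line:
            # Heading line: keep the text, drop the trailing '=' run.
            current = line.rstrip("=").strip()
            sections[current] = {}
        elif current is not None and ":" in line:
            name, _, value = line.partition(":")
            sections[current][name.strip()] = value.strip()
    return sections

def find_section(sections, keyword):
    """Return the fields of the first heading containing keyword, else None."""
    for heading, fields in sections.items():
        if keyword in heading:
            return fields
    return None

text = """\
INTERFACE 2: XYZ =====================================
bLength : 0x9 (9 bytes)
bAlternateSetting : 0x0
INTERFACE 2, 1: ABC ==================================
bLength : 0x9 (9 bytes)
bAlternateSetting : 0x1
bNumEndpoints : 0x1
"""
sections = parse_sections(text.splitlines())
abc = find_section(sections, "ABC")
print(abc)
```

For a real file, `parse_sections(open(path))` would work the same way, since a file object yields its lines when iterated.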

Python parsing serial hex string with fixed format

I am successfully communicating with a simple device over serial: I send a specific request packet and receive a fixed-format return packet. I'm starting with Python so I can use this on multiple devices; I'm only really used to PHP/C.
For example, I send the following as hex:
12 05 0b 03 1f
and in return I get
12 05 0b 03 1f 12 1D A0 03 18 00 22 00 00 CA D4 4F 00 00 22 D6 99 18 00 70 80 00 80 00 06 06 00 00 D9
I know how the packet is constructed: the first 5 bytes are the data that was sent. The next 3 bytes are an ID, the packet length, and a response code. It's commented in my code here:
import serial, time
ser = serial.Serial(port='COM1', baudrate=9600, timeout=0, parity=serial.PARITY_EVEN, stopbits=serial.STOPBITS_ONE, bytesize=serial.EIGHTBITS)
while True:
    # Send The Request - 0x12 0x05 0x0B 0x03 0x1F
    ser.write("\x12\x05\x0B\x03\x1F")
    # Clear first 5 bytes (Original Request is Returned)
    ser.read(5)
    # Response Header (3 bytes)
    # - ID (Always 12)
    # - Packet Length (inc these three)
    # - General Response (a0 = Success, a1 = Busy, ff = Bad Command)
    ResponseHeader = ser.read(3).encode('hex')
    PacketLength = int(ResponseHeader[2:4],16)
    if ResponseHeader[4:6] == "a0":
        # Response Success
        ResponseData = ser.read(PacketLength).encode('hex')
        # Read First Two Bytes
        Data1 = int(ResponseData[0:4],16)
        print Data1
    else:
        # Clear The Buffer
        RemainingBuffer = ser.inWaiting()
        ser.read(RemainingBuffer)
    time.sleep(0.12)
To keep it simple for now, I was just trying to read the first two bytes of the actual response (ResponseData), which should give me the hex 0318. I then want to output that as a decimal =792. The program is meant to run in a continuous loop.
Some of the variables in the packet are one byte, some are two bytes. Although, up to now I'm just getting an error:
ValueError: invalid literal for int() with base 16: ''
I'm guessing this is due to the format of the data/variables I have set, so not sure if I'm even going about this the right way. I just want to read the returned HEX data in byte form and be able to access them on an individual level, so I can format/output them as required.
Is there a better way to do this? Many thanks.

I recommend using the struct module to read binary data, instead of recoding it using string functions to hex and trying to parse the hex strings.

As your code stands now, you send binary (not hex) data over the wire, and receive binary (not hex) data back from the device. Then you convert the binary data to hex, only to convert it again to Python variables.
Let's skip the extra conversion step by using struct.unpack:
# UNTESTED
import struct
...
while True:
    # Send The Request - 0x12 0x05 0x0B 0x03 0x1F
    ser.write(b"\x12\x05\x0B\x03\x1F")
    # Clear first 5 bytes (Original Request is Returned)
    ser.read(5)
    # Response Header (3 bytes)
    # - ID (Always 12)
    # - Packet Length (inc these three)
    # - General Response (a0 = Success, a1 = Busy, ff = Bad Command)
    ResponseHeader = ser.read(3)
    ID, PacketLength, Response = struct.unpack("!BBB", ResponseHeader)
    if Response == 0xa0:
        # Response Success
        # PacketLength includes the 3 header bytes that were already read
        ResponseData = ser.read(PacketLength - 3)
        # Read First Two Bytes (unpack returns a tuple, so take element 0)
        Data1 = struct.unpack("!H", ResponseData[0:2])[0]
        print(Data1)
    else:
        # Clear The Buffer
        RemainingBuffer = ser.inWaiting()
        ser.read(RemainingBuffer)
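The unpacking step can be checked without the device by running it on the sample response quoted in the question. This is a sketch: `parse_response` is a hypothetical helper, and it assumes the layout the question describes (5 echoed request bytes, then ID, a packet length that includes the 3 header bytes, and a response code).

```python
import struct

def parse_response(raw, echo_len=5):
    """Split a raw reply into (device id, status code, payload bytes)."""
    body = raw[echo_len:]  # skip the echoed request
    dev_id, length, status = struct.unpack("!BBB", body[:3])
    payload = body[3:length]  # length counts the 3 header bytes too
    return dev_id, status, payload

# Sample reply copied from the question.
raw = bytes.fromhex(
    "12 05 0b 03 1f 12 1D A0 03 18 00 22 00 00 CA D4 4F 00 00 22 "
    "D6 99 18 00 70 80 00 80 00 06 06 00 00 D9"
)
dev_id, status, payload = parse_response(raw)
# First two payload bytes as one big-endian unsigned 16-bit value.
data1 = struct.unpack("!H", payload[:2])[0]
print(hex(dev_id), hex(status), len(payload), data1)
```

One-byte fields can be read the same way with the `B` format code, and several fields can be unpacked in one call by combining codes, for example `"!HBH"`.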

parsing ERF capture files in python

What is the best way of parsing ERF (Endace) capture files in Python? I found a libpcap wrapper for Python, but I do not think that libpcap supports the ERF format.
Thanks!

Here's a simplistic ERF record parser which returns a dict per packet (I just hacked it together, so not extremely well tested. Not all flag fields are decoded, but the ones that aren't, aren't widely applicable):
NB:
ERF record types: 1 = HDLC, 2 = Ethernet, 3 = ATM, 4 = Reassembled AAL5, 5-7 multichannel variants with extra headers not processed here.
rlen can be less than wlen+len(header) if the snaplength is too short.
The interstitial loss counter is the number of packets lost between this packet and the previous captured packet as noted by the Dag packet processor when its input queue overflows.
Comment out the two scapy lines if you don't want to use scapy.
Code:
import struct
import scapy.layers.all as sl
def erf_records( f ):
    """
    Generator which parses ERF records from file-like ``f``
    """
    while True:
        # The ERF header is fixed length 16 bytes
        hdr = f.read( 16 )
        if hdr:
            rec = {}
            # The timestamp is in Intel byte-order
            rec['ts'] = struct.unpack( '<Q', hdr[:8] )[0]
            # The rest is in network byte-order
            rec.update( zip( ('type',  # ERF record type
                              'flags', # Raw flags bit field
                              'rlen',  # Length of entire record
                              'lctr',  # Interstitial loss counter
                              'wlen'), # Length of packet on wire
                             struct.unpack( '>BBHHH', hdr[8:] ) ) )
            rec['iface'] = rec['flags'] & 0x03
            rec['rx_err'] = rec['flags'] & 0x10 != 0
            rec['pkt'] = f.read( rec['rlen'] - 16 )
            if rec['type'] == 2:
                # ERF Ethernet has an extra two bytes of pad between ERF header
                # and beginning of MAC header so that IP-layer data are DWORD
                # aligned. From memory, none of the other types have pad.
                rec['pkt'] = rec['pkt'][2:]
                rec['pkt'] = sl.Ether( rec['pkt'] )
            yield rec
        else:
            return

ERF records can contain optional Extension Headers which are appended to the 16 byte ERF record header. The high bit of the 'type' field indicates the presence of an Extension Header. I've added a test for the Extension Header to strix's example, along with a decode of the Extension Header itself. Note that the test for an Ethernet frame also needs to change slightly if an Extension Header is present.
Caveat: I believe that ERF records can contain multiple Extension Headers, but I don't know how to test for these. The Extension Header structure is not particularly well documented and the only records I have in captivity just contain a single extension.
import struct
import scapy.layers.all as sl
def erf_records( f ):
    """
    Generator which parses ERF records from file-like ``f``
    """
    while True:
        # The ERF header is fixed length 16 bytes
        hdr = f.read( 16 )
        if hdr:
            rec = {}
            # The timestamp is in Intel byte-order
            rec['ts'] = struct.unpack( '<Q', hdr[:8] )[0]
            # The rest is in network byte-order
            rec.update( zip( ('type',  # ERF record type
                              'flags', # Raw flags bit field
                              'rlen',  # Length of entire record
                              'lctr',  # Interstitial loss counter
                              'wlen'), # Length of packet on wire
                             struct.unpack( '>BBHHH', hdr[8:] ) ) )
            rec['iface'] = rec['flags'] & 0x03
            rec['rx_err'] = rec['flags'] & 0x10 != 0
            #- Check if ERF Extension Header present.
            #  Each Extension Header is 8 bytes.
            if rec['type'] & 0x80:
                ext_hdr = f.read( 8 )
                rec.update( zip( (
                    'ext_hdr_signature',    # 1 byte
                    'ext_hdr_payload_hash', # 3 bytes
                    'ext_hdr_filter_color', # 1 byte
                    'ext_hdr_flow_hash'),   # 3 bytes
                    struct.unpack( '>B3sB3s', ext_hdr ) ) )
                #- get remaining payload, less ext_hdr
                rec['pkt'] = f.read( rec['rlen'] - 24 )
            else:
                rec['pkt'] = f.read( rec['rlen'] - 16 )
            # Mask off the Extension Header bit before comparing, so that only
            # type 2 (Ethernet) matches; a plain "& 0x02" would also match type 3.
            if rec['type'] & 0x7F == 2:
                # ERF Ethernet has an extra two bytes of pad between ERF header
                # and beginning of MAC header so that IP-layer data are DWORD
                # aligned. From memory, none of the other types have pad.
                rec['pkt'] = rec['pkt'][2:]
                rec['pkt'] = sl.Ether( rec['pkt'] )
            yield rec
        else:
            return
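The header layout used by both parsers can be tried without a capture file or scapy by packing a synthetic 16-byte header and unpacking it again. This is only a sketch: `parse_erf_header` is a hypothetical helper and the field values are made up. It also assumes the usual ERF timestamp convention, a 64-bit fixed-point number with whole seconds in the upper 32 bits and a binary fraction of a second in the lower 32 bits.

```python
import struct

def parse_erf_header(hdr):
    """Decode the fixed 16-byte ERF record header into a dict."""
    (ts,) = struct.unpack("<Q", hdr[:8])  # little-endian timestamp
    rtype, flags, rlen, lctr, wlen = struct.unpack(">BBHHH", hdr[8:16])
    return {
        # Upper 32 bits: whole seconds; lower 32 bits: fraction of a second.
        "seconds": (ts >> 32) + (ts & 0xFFFFFFFF) / 2.0 ** 32,
        "type": rtype & 0x7F,              # record type without the top bit
        "has_ext_hdr": bool(rtype & 0x80), # top bit flags an Extension Header
        "flags": flags,
        "rlen": rlen,
        "lctr": lctr,
        "wlen": wlen,
    }

# Synthetic header: Ethernet record (type 2) with the Extension Header bit set.
ts = (1441263997 << 32) | (1 << 31)  # half a second past the whole second
hdr = struct.pack("<Q", ts) + struct.pack(">BBHHH", 0x82, 0x04, 88, 0, 64)
info = parse_erf_header(hdr)
print(info)
```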

How to Read-in Binary of a File in Python

In Python, when I try to read in an executable file with 'rb', instead of getting the binary values I expected (0010001 etc.), I'm getting a series of letters and symbols that I do not know what to do with.
Ex: ???}????l?S??????V?d?\?hG???8?O=(A).e??????B??$????????: ???Z?C'???|lP#.\P?!??9KRI??{F?AB???5!qtWI??8𜐮???!ᢉ?]?zъeF?̀z??/?n??
How would I access the binary numbers of a file in Python?
Any suggestions or help would be appreciated. Thank you in advance.

That is the binary data. It is stored as bytes, and when you print it, the bytes are interpreted as ASCII characters.
You can use the bin() function to see the actual binary codes. Wrapping the data in bytearray() yields integer byte values on both Python 2 and Python 3, so ord() is not needed:
for value in bytearray(data):
    print(bin(value))

Byte sequences in Python are represented using strings. The series of letters and symbols that you see when you print out a byte sequence is merely a printable representation of bytes that the string contains. To make use of this data, you usually manipulate it in some way to obtain a more useful representation.
You can use ord(x) or bin(x) to obtain decimal and binary representations, respectively:
>>> f = open('/tmp/IMG_5982.JPG', 'rb')
>>> data = f.read(10)
>>> data
'\x00\x00II*\x00\x08\x00\x00\x00'
>>> data[2]
'I'
>>> ord(data[2])
73
>>> hex(ord(data[2]))
'0x49'
>>> bin(ord(data[2]))
'0b1001001'
>>> f.close()
The 'b' flag that you pass to open() does not tell Python anything about how to represent the file contents. From the docs:
Append 'b' to the mode to open the file in binary mode, on systems that differentiate between binary and text files; on systems that don’t have this distinction, adding the 'b' has no effect.
Unless you just want to look at what the binary data from the file looks like, Mark Pilgrim's book, Dive Into Python, has an example of working with binary file formats. The example shows how you can read ID3v1 tags from an MP3 file. The book's website seems to be down, so I'm linking to a mirror.

Each character in the string is the ASCII representation of a binary byte. If you want it as a string of zeros and ones then you can convert each byte to an integer, format it as 8 binary digits and join everything together:
>>> s = "hello world"
>>> ''.join("{0:08b}".format(ord(x)) for x in s)
'0110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'
Depending on if you really need to analyse / manipulate things at the binary level an external module such as bitstring could be helpful. Check out the docs; to just get the binary interpretation use something like:
>>> f = open('somefile', 'rb')
>>> b = bitstring.Bits(f)
>>> b.bin
0100100101001001...

Use ord(x) to get the integer value of each byte.
>>> with open('settings.dat', 'rb') as file:
...     data = file.read()
...
>>> for index, value in enumerate(data):
...     print '0x%08x 0x%02x' % (index, ord(value))
...
0x00000000 0x28
0x00000001 0x64
0x00000002 0x70
0x00000003 0x30
0x00000004 0x0d
0x00000005 0x0a
0x00000006 0x53
0x00000007 0x27
0x00000008 0x4d
0x00000009 0x41
0x0000000a 0x49
0x0000000b 0x4e
0x0000000c 0x5f
0x0000000d 0x57
0x0000000e 0x49
0x0000000f 0x4e

If you really want to convert the binary bytes to a stream of bits, you have to remove the first two chars ('0b') from the output of bin() and reverse the result, which gives each byte's bits least-significant first:
with open("settings.dat", "rb") as fp:
    print "".join( (bin(ord(c))[2:][::-1]).ljust(8,"0") for c in fp.read() )
If you use Python prior to 2.6, you have no bin() function.
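The answers above were written for Python 2, where reading a file yields a str. In Python 3 the same read returns a bytes object, and iterating over it gives integers directly, so ord() is no longer needed. A minimal sketch, with `to_bits` as a hypothetical helper name:

```python
def to_bits(data, sep=" "):
    """Format each byte of a bytes object as 8 binary digits, MSB first."""
    return sep.join(format(byte, "08b") for byte in data)

bits = to_bits(b"hi")
print(bits)
```

For a file, `to_bits(open(path, "rb").read())` would produce the same kind of output for its whole contents.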
