Help me fix this code. My script sorts the coordinates in a list into even and odd by their second value, but it only works when the list is in decimal format. I need it to work with a list in hexadecimal (HEX) format.
I don't know Python well; I think I need to add something like hex(str).
Here is an example of the list, List.txt:
(0x52DF625,0x47A406E)
(0x3555F30,0x3323041)
(0x326A573,0x5A5E578)
(0x48F8EF7,0x98A4EF3)
(0x578FE62,0x331DF3E)
(0x3520CAD,0x1719BBB)
(0x506FC9F,0x40CF4A6)
Code:
with open('List.txt') as fin,\
     open('Save+even.txt', 'a') as foutch,\
     open('Save-odd.txt', 'a') as foutnch:
    data = [line.strip() for line in fin]
    nch = [foutnch.write(str(i) + '\n')
           for i in data if int(i[1:-1].split(',')[1]) % 2]
    ch = [foutch.write(str(i) + '\n')
          for i in data if int(i[1:-1].split(',')[1]) % 2 != 1]
This may work for you. (I used StringIO instead of real files, but added a comment showing how to use the same loop with real files.)
from io import StringIO
in_file = StringIO("""(0x52DF625,0x47A406E)
(0x3555F30,0x3323041)
(0x326A573,0x5A5E578)
(0x48F8EF7,0x98A4EF3)
(0x578FE62,0x331DF3E)
(0x3520CAD,0x1719BBB)
(0x506FC9F,0x40CF4A6)
""")
even_file = StringIO()
odd_file = StringIO()
# with open( "List.txt") as in_file, open("Save-even.txt", "w") as even_file, open("Save-odd.txt", "w") as odd_file:
for line in in_file:
    x_str, y_str = line.strip()[1:-1].split(",")
    x, y = int(x_str, 0), int(y_str, 0)
    if y & 1:  # y is odd
        odd_file.write(line)
    else:
        even_file.write(line)
print("odd")
print(odd_file.getvalue())
print("even")
print(even_file.getvalue())
it outputs:
odd
(0x3555F30,0x3323041)
(0x48F8EF7,0x98A4EF3)
(0x3520CAD,0x1719BBB)
even
(0x52DF625,0x47A406E)
(0x326A573,0x5A5E578)
(0x578FE62,0x331DF3E)
(0x506FC9F,0x40CF4A6)
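The trick is to use base 0 when converting a hex string to int: int(x_str, 0). With base 0, int() infers the base from the string's prefix, so a leading 0x is read as hexadecimal. See this answer.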
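To see what base 0 does in isolation, here is a minimal sketch. The helper name parse_pair is made up for illustration and is not part of the answer above.

```python
def parse_pair(line):
    """Parse a line such as '(0x52DF625,0x47A406E)' into two ints."""
    x_str, y_str = line.strip()[1:-1].split(",")
    # Base 0 lets int() infer the base from the prefix:
    # 0x is hex, 0o is octal, 0b is binary, no prefix is decimal.
    return int(x_str, 0), int(y_str, 0)

x, y = parse_pair("(0x3555F30,0x3323041)")
print(y % 2 == 1)      # True: the second value is odd
print(int("0x1F", 0))  # 31
print(int("1F", 16))   # 31, base 16 accepts the digits with or without 0x
print(int("31", 0))    # 31, no prefix means decimal
```

If every value in the file is known to be hexadecimal, int(s, 16) works as well; base 0 is simply more forgiving when decimal and hex values are mixed.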
I am looking to convert a file to binary for a project, preferably using Python as I am most comfortable with it, though if walked-through, I could probably use another language.
Basically, I need this for a project I am working on where we want to store data using a DNA strand and thus need to store files in binary ('A's and 'T's = 0, 'G's and 'C's = 1)
Any idea how I could proceed? I did find that one could encode in base64 and then decode it, but that seems a bit inefficient, and the code that I have doesn't seem to work...
import base64
import tkinter as tk
from tkinter import filedialog
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
with open(file_path) as f:
    encoded = base64.b64encode(f.readlines())
    print(encoded)
Also, I already have a program to do that simply with text. Any tips on how to improve it would also be appreciated!
import binascii
t = bytearray(str(input("Texte?")), 'utf8')
h = binascii.hexlify(t)
b = bin(int(h, 16)).replace('b','')
#removing the b that appears in the end for some reason
g = b.replace('1','G').replace('0','A')
print(g)
For example, for the text-to-DNA program:
I input 'test' and expect the DNA sequence that comes from its binary representation,
the binary being 01110100011001010111001101110100. (I also print every intermediate conversion in the example so that it is easier to follow.)
>>>Texte?test #Asks the text
>>>b'74657374' #converts to hex
>>>01110100011001010111001101110100 #converts to binary
>>>AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA #converts 0 to A and 1 to G
So, thanks to #jonrshape and Sergey Vturin, I finally was able to achieve what I wanted!
My program asks for a file, turns it into binary, which then gives me its equivalent in "DNA code" using pairs of binary numbers (00 = A, 01 = T, 10 = G, 11 = C)
import binascii
from tkinter import filedialog
file_path = filedialog.askopenfilename()
x = ""
with open(file_path, 'rb') as f:
    for chunk in iter(lambda: f.read(32), b''):
        # decode() drops the b'...' wrapper; stripping every "b" would also delete the hex digit b
        x += binascii.hexlify(chunk).decode('ascii')
# zero-pad to 4 bits per hex digit so that leading zero bits are kept
b = format(int(x, 16), '0{}b'.format(len(x) * 4))
g = [b[i:i+2] for i in range(0, len(b), 2)]
dna = ""
for i in g:
    if i == "00":
        dna += "A"
    elif i == "01":
        dna += "T"
    elif i == "10":
        dna += "G"
    elif i == "11":
        dna += "C"
print(x) #hexdump
print(b) #converted to binary
print(dna) #converted to "DNA"
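The same two-bits-per-base mapping (00 = A, 01 = T, 10 = G, 11 = C) can also be sketched without the hex detour, working on the bytes directly. The function and table names below are illustrative assumptions, not part of the program above; the decoder is included only to show that this mapping is reversible.

```python
# Mapping taken from the post: 00 = A, 01 = T, 10 = G, 11 = C.
PAIR_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}

def bytes_to_dna(data):
    """Encode bytes as a DNA string, two bits per base."""
    bits = "".join(format(byte, "08b") for byte in data)  # 8 bits per byte, zero-padded
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(dna):
    """Decode a DNA string produced by bytes_to_dna back into bytes."""
    bits = "".join(BASE_TO_PAIR[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

encoded = bytes_to_dna(b"test")
print(encoded)                # TCTATGTTTCACTCTA
print(dna_to_bytes(encoded))  # b'test'
```

Because every byte is formatted to exactly eight bits, leading zero bits survive and each byte always becomes four bases.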
Of course it is inefficient!
base64 is designed to store binary data as text, so the converted block is larger than the original.
By the way, what kind of efficiency do you want? Compactness?
If so, your second sample is much nearer to what you want.
Also note that in your task you lose information (A and T both map to 0, G and C both map to 1). Are you aware of this?
Here is a sample of how to store and restore.
It stores the data in an easy-to-understand hex-in-text format, just for the sake of a demo. If you want compactness, you can easily modify the code to store into a binary file; if you want a 00011001 view, that modification is easy too.
import math
# make a long test string
import numpy as np
s=''.join((str(x) for x in np.random.randint(4,size=33)))\
    .replace('0','A').replace('1','T').replace('2','G').replace('3','C')
def store_(s):
    size=len(s) # the length is padded up to a multiple of 8, so remember the true size and store it with the data
    s2=s.replace('A','0').replace('T','0').replace('G','1').replace('C','1')\
        .ljust( int(math.ceil(size/8.)*8),'0') # pad with '0' on the right up to a multiple of 8
    a=(hex( int(s2[i*8:i*8+8], 2) )[2:].rjust(2,'0') for i in range(len(s2)//8))
    return ''.join(a),size
yourDataAsHexInText,sizeToStore=store_(s)
print(yourDataAsHexInText,sizeToStore)
def restore_(s,size=None):
    if size is None: size=len(s)*4 # each hex digit carries 4 symbols
    a=( bin(int(s[i*2:i*2+2], 16))[2:].rjust(8,'0') for i in range(len(s)//2))
    # you lose information, remember? So it's only A or G
    return (''.join(a).replace('1','G').replace('0','A') )[:size]
restore_(yourDataAsHexInText,sizeToStore)
print("so check it")
print(s ,"(input)")
print(store_(s))
print(s.replace('C','G').replace('T','A') ,"to compare with information loss")
print(restore_(*store_(s)),"restored")
print(s.replace('C','G').replace('T','A') == restore_(*store_(s)))
result in my test:
63c9308a00 33
so check it
AGCAATGCCGATGTTCATCGTATACTTTGACTA (input)
('63c9308a00', 33)
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA to compare with information loss
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA restored
True
import csv
import datetime
with open('soundTransit1_remote_rawMeasurements_15m.txt','r') as infile, open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile,delimiter='\t')
    #ouw = csv.writer(outfile,delimiter=' ')
    for row in inr:
        d = datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S')
        s = 1
        p = int(row[5])
        nr = [format(s,'02')+format(d.year,'04')+format(d.month,'02')+format(d.day,'02')+format(d.hour,'02')+format(d.minute,'02')+format(int(p*0.2),'04')]
        outfile.writelines(nr+'/n')
Using the above script, I have read in a .txt file and reformatted it as 'nr' so it looks like this:
['012015072314000000']
['012015072313450000']
['012015072313300000']
['012015072313150000']
['012015072313000000']
['012015072312450000']
['012015072312300000']
['012015072312150000']
..etc.
I need to now print it onto my new .txt file, but Python is not allowing me to print 'nr' with line breaks after each entry, I think because the data is in strings. I get this error:
TypeError: can only concatenate list (not "str") to list
Is there another way to do this?
You are trying to combine a list with a string, which cannot work. Simply don't create a list in nr.
import csv
import datetime
with open('soundTransit1_remote_rawMeasurements_15m.txt','r') as infile, open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile,delimiter='\t')
    #ouw = csv.writer(outfile,delimiter=' ')
    for row in inr:
        d = datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S')
        s = 1
        p = int(row[5])
        nr = "{:02d}{:%Y%m%d%H%M}{:04d}\n".format(s,d,int(p*0.2))
        outfile.write(nr)
There is no need to put your string into a list; just use outfile.write() here and build a string without a list:
nr = format(s,'02') + format(d.year,'04') + format(d.month, '02') + format(d.day, '02') + format(d.hour, '02') + format(d.minute, '02') + format(int(p*0.2), '04')
outfile.write(nr + '\n')
Rather than use 7 separate format() calls, use str.format():
nr = '{:02}{:%Y%m%d%H%M}{:04}\n'.format(s, d, int(p * 0.2))
outfile.write(nr)
Note that I formatted the datetime object with one formatting operation, and I included the newline into the string format.
You appear to have hard-coded the s value; you may as well put that into the format directly:
nr = '01{:%Y%m%d%H%M}{:04}\n'.format(d, int(p * 0.2))
outfile.write(nr)
Together, that updates your script to:
with open('soundTransit1_remote_rawMeasurements_15m.txt', 'r') as infile,\
     open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile, delimiter='\t')
    for row in inr:
        d = datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
        p = int(int(row[5]) * 0.2)
        nr = '01{:%Y%m%d%H%M}{:04}\n'.format(d, p)
        outfile.write(nr)
Take into account that the csv module works better if you follow the guidelines about opening files; in Python 2 you need to open the file in binary mode ('rb'), in Python 3 you need to set the newline parameter to ''. That way the module can control newlines correctly and supports including newlines in column values.
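Putting the format string and the Python 3 newline='' advice together, a self-contained sketch could look like this. The function name reformat_rows and the two sample rows are assumptions made up for the demo; only the first and sixth columns are used, as in the script above.

```python
import csv
import datetime
import os
import tempfile

def reformat_rows(in_path, out_path):
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(in_path, 'r', newline='') as infile, open(out_path, 'w') as outfile:
        for row in csv.reader(infile, delimiter='\t'):
            d = datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
            p = int(int(row[5]) * 0.2)
            # One format call: literal '01', the datetime, then a zero-padded count.
            outfile.write('01{:%Y%m%d%H%M}{:04}\n'.format(d, p))

# Hypothetical sample: columns 1-4 are filler, column 5 holds the count.
sample = ("2015-07-23 14:00:00\ta\tb\tc\td\t0\n"
          "2015-07-23 13:45:00\ta\tb\tc\td\t17\n")
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w') as f:
        f.write(sample)
    reformat_rows(src, dst)
    with open(dst) as f:
        lines = f.read().splitlines()
print(lines)  # ['012015072314000000', '012015072313450003']
```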
I am reading the body of a .gpx file and need to change the format of the data so it can be read as a .kml.
.kml has lat and lon in the opposite order from .gpx, so I need a way to successively take the values between two sub-strings and temporarily store them before writing them out in a different order. .kml also separates <time> from <coords>, but that is pretty much the same type of task.
I have looked at many resources, including:
"Python: Reading part of a text file", but I'm not just after one value; I need it for LOTS of data points.
I also tried ElementTree, but couldn't get it to fly.
I tried
lat = re.search('<trkpt lat="(.*)" lon="', x)
lon = re.search('" lon="(.*)">', x)
which obviously doesn't work for multiple values in the original file. My code probably isn't very Pythonic (yet). The code:
def convert(fileName):
    f = open(fileName, "r")
    x = f.read()
    x = re.sub(r'<trkpt lat="', ' <gx:coord>', x)
    x = re.sub(r'" lon="', ' ', x)
    x = re.sub(r'"><ele>', ' ', x)
    x = re.sub(r'</ele>', '</gx:coord>\n', x)
    x = re.sub(r'<speed>.*?</speed>', '', x)
    return x
is getting me close to the format that's needed. But I can't work out how to successively pass the multiple values, swap them around a bit and progressively rewrite them.
I'm new to Python... please send help. Thanks!
EDIT
examples of each file type follows (for clarity i have taken off the header text of each)
.gpx looks like this and has time and coordinates concurrent. As you can see, each data point exists between <trkpt and </trkpt> (.gpx also has speed and sometimes other stuff that needs cleaning out too):
<trkseg>
<trkpt lat="-33.8598" lon="151.17912"><ele>7.8</ele><speed>0.9013878</speed><time>2012-09-25T07:38:42Z</time></trkpt><trkpt lat="-33.859936" lon="151.17906"><ele>20.8</ele><speed>2.25</speed><time>2012-09-25T07:38:43Z</time></trkpt><trkpt lat="-33.859818" lon="151.17934"><ele>-3.4</ele><speed>1.5</speed><time>2012-09-25T07:38:45Z</time></trkpt>
<trkpt lat="-33.859947" lon="151.17914"><ele>16.2</ele><speed>1.5</speed><time>2012-09-25T07:38:49Z</time></trkpt><trkpt lat="-33.860016" lon="151.1792"><ele>18.0</ele><speed>1.75</speed><time>2012-09-25T07:38:52Z</time></trkpt><trkpt lat="-33.86008" lon="151.17923"><ele>18.4</ele><speed>1.5811388</speed><time>2012-09-25T07:38:57Z</time></trkpt><trkpt lat="-33.86013" lon="151.17932"><ele>18.1</ele><speed>1.75</speed><time>2012-09-25T07:39:03Z</time></trkpt>
OK... and this is the equivalent .kml, which separates <when> from the coordinates (<gx:coord>). Of course there is always the same number of each. You can see that the elevation (<ele> in the .gpx) is an untagged number in the coordinates, after the position data.
<when>2012-09-25T07:38:42Z</when>
<when>2012-09-25T07:38:43Z</when>
<when>2012-09-25T07:38:45Z</when>
<when>2012-09-25T07:38:49Z</when>
<when>2012-09-25T07:38:52Z</when>
<when>2012-09-25T07:38:57Z</when>
<when>2012-09-25T07:39:03Z</when>
<gx:coord>151.17912 -33.8598 7.8</gx:coord>
<gx:coord>151.17906 -33.859936 20.8</gx:coord>
<gx:coord>151.17934 -33.859818 -3.4</gx:coord>
<gx:coord>151.17914 -33.859947 16.2</gx:coord>
<gx:coord>151.1792 -33.860016 18</gx:coord>
<gx:coord>151.17923 -33.86008 18.4</gx:coord>
<gx:coord>151.17932 -33.86013 18.1</gx:coord>
This is working but is SLOW. For a small .gpx of 477k, it writes a .kml of 207k and takes 198 seconds to complete. My hunch is that it is the StringIO.StringIO(x) that's so slow. Any ideas on how to speed it up would be fantastic.
Here are ONLY the key snippets of what I have done:
f = open(fileName, "r")
x = f.read()
# flags must be passed by keyword: the fourth positional argument of re.sub is count
x = re.sub(r'\n', '', x, flags=re.S) #remove any newline returns
name = re.search('<name>(.*)</name>', x, re.S)
print("Attachment name (as recorded from GPS device): " + name.group(1))
x = re.sub(r'<(.*)<trkseg>', '', x, flags=re.S) #strip header
x = x.replace("</trkseg></trk></gpx>","") #strip footer
x = x.replace("<trkpt","\n<trkpt") #make the file in lines
x = re.sub(r'<speed>(.*?)</speed>', '', x, flags=re.S) #strip speed
x = re.sub(r'<extensions>(.*?)</extensions>', '', x, flags=re.S) # strip out extensions
then
#.kml header goes here
kmlTrack = """<?xml version="1.0" encoding="UTF-8"?><kml xmlns="http://www.ope......etc etc
then
# assumes checkSumA and checkSumB were set to 0 earlier
buf = StringIO.StringIO(x)
for line in buf:
    if line is not None:
        timm = re.search('time>(.*?)</time', line, re.S)
        if timm is not None:
            kmlTrack += (" <when>"+ timm.group(1)+"</when>\n")
            checkSumA += 1  # was "=+ 1", which assigns 1 instead of counting
buf = StringIO.StringIO(x)
for line in buf:
    if line is not None:
        lat = re.search('lat="(.*?)" lo', line, re.S)
        lon = re.search('lon="(.*?)"><ele>', line, re.S)
        ele = re.search('<ele>(.*?)</ele>', line, re.S)
        if lat is not None:
            kmlTrack += (" <gx:coord>"+ lon.group(1) + " " + lat.group(1) + " " + ele.group(1) + "</gx:coord>\n")
            checkSumB += 1
if checkSumA == checkSumB:
    #put a footer on
    kmlTrack += """ </gx:Track></Placemark></Document></kml>"""
else:
    print ("checksum error")
    return None
with open("Realbush2.kml", "a") as myfile:
    myfile.write(kmlTrack)
return ("successful .kml file-write completed in :" + str(c.seconds) + " seconds.")
Once again, this is working but it is very slow. If anyone can see how to speed this up, please let me know! Thanks
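One possible direction, offered as an untested-for-speed sketch rather than a measured fix: pull every track point out in a single pass with one compiled pattern, collect the pieces in lists, and join them once at the end instead of growing a string with +=. The pattern and the function name gpx_to_kml_lines are assumptions for illustration; the pattern assumes each <trkpt> carries lat, lon, <ele> and <time> in the order shown in the sample data above.

```python
import re

# Assumes every <trkpt> has lat, lon, <ele> and <time>, in that order.
TRKPT = re.compile(
    r'<trkpt lat="(?P<lat>[^"]*)" lon="(?P<lon>[^"]*)">'
    r'<ele>(?P<ele>[^<]*)</ele>.*?'
    r'<time>(?P<time>[^<]*)</time>',
    re.S,
)

def gpx_to_kml_lines(gpx_text):
    """Return (<when> lines, <gx:coord> lines) for all track points in gpx_text."""
    whens, coords = [], []
    for m in TRKPT.finditer(gpx_text):
        whens.append("<when>{}</when>".format(m.group("time")))
        # KML wants lon first, then lat, then elevation.
        coords.append("<gx:coord>{} {} {}</gx:coord>".format(
            m.group("lon"), m.group("lat"), m.group("ele")))
    return whens, coords

sample = ('<trkpt lat="-33.8598" lon="151.17912"><ele>7.8</ele>'
          '<speed>0.9013878</speed><time>2012-09-25T07:38:42Z</time></trkpt>'
          '<trkpt lat="-33.859936" lon="151.17906"><ele>20.8</ele>'
          '<speed>2.25</speed><time>2012-09-25T07:38:43Z</time></trkpt>')
whens, coords = gpx_to_kml_lines(sample)
print("\n".join(whens + coords))
```

Since both lists are filled from the same match, they always have the same length, so a separate checksum comparison is not needed in this variant.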
Currently, I'm using this to calculate the time between two messages and listing the times if they are above 20 seconds.
def time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))  # use the parameter, not the global INFILE
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "OuchMsg out: [O]":
            ts[e[8]] = e[0]
        elif " ".join(e[2:5]) == "OuchMsg in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)
            yield (float(out_ts),ref_id[1:-1],(float(in_ts)*10000 - float(out_ts)*10000))
            n = (float(in_ts)*10000 - float(out_ts)*10000)
            if n> 20:
                print float(out_ts),ref_id[1:-1], n
INFILE = 'C:/Users/klee/Documents/text.txt'
import csv
with open('output_file1.csv', 'w') as f:
    csv.writer(f).writerows(time_deltas(INFILE))
However, there are two major errors. First, Python drops the leading zero when the time is before 10, e.g. 0900. Second, it drops trailing zeros, which makes the time difference inaccurate.
It looks like:
130203.08766
when it should be:
130203.087660
You are yielding floats, so the csv writer turns those floats into strings as it pleases.
If you want your output values to be a certain format, yield a string in that format.
Perhaps something like this?
print "%04.0f" % (900) # prints 0900
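To keep both the leading and the trailing zeros, the value can be turned into a fixed-width string before it is yielded. This is a sketch only: the helper name fmt_timestamp is made up, and the layout (six digits, a dot, six decimals) is an assumption based on the 130203.087660 example in the question.

```python
def fmt_timestamp(value):
    # Width 13 = 6 digits + '.' + 6 decimals, zero-padded on the left.
    return "{:013.6f}".format(value)

print(fmt_timestamp(130203.08766))  # 130203.087660
print(fmt_timestamp(90203.08766))   # 090203.087660
print("%04.0f" % 900)               # 0900
```

If the timestamps are never used in arithmetic, another option is to keep the original strings from the file and write those to the CSV unchanged.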
I wrote a python script to create a binary file of integers.
import struct
pos = [7623, 3015, 3231, 3829]
inh = open('test.bin', 'wb')
for e in pos:
    inh.write(struct.pack('i', e))
inh.close()
It worked well, then I tried to read the 'test.bin' file using the below code.
import struct
inh = open('test.bin', 'rb')
for rec in inh:
    pos = struct.unpack('i', rec)
    print pos
inh.close()
But it failed with an error message:
Traceback (most recent call last):
File "readbinary.py", line 10, in <module>
pos = struct.unpack('i', rec)
File "/usr/lib/python2.5/struct.py", line 87, in unpack
return o.unpack(s)
struct.error: unpack requires a string argument of length 4
I would like to know how I can read these file using struct.unpack.
Many thanks in advance,
Vipin
for rec in inh: reads one line at a time -- not what you want for a binary file. Read 4 bytes at a time (with a while loop and inh.read(4)) instead (or read everything into memory with a single .read() call, then unpack successive 4-byte slices). The second approach is simplest and most practical as long as the amount of data involved isn't huge:
import struct
with open('test.bin', 'rb') as inh:
    data = inh.read()
for i in range(0, len(data), 4):
    pos = struct.unpack('i', data[i:i+4])
    print(pos)
If you do fear potentially huge amounts of data (which would take more memory than you have available), a simple generator offers an elegant alternative:
import struct
def by4(f):
    rec = 'x'  # placeholder for the `while`
    while rec:
        rec = f.read(4)
        if rec: yield rec
with open('test.bin', 'rb') as inh:
    for rec in by4(inh):
        pos = struct.unpack('i', rec)
        print(pos)
A key advantage to this second approach is that the by4 generator can easily be tweaked (while maintaining the specs: return a binary file's data 4 bytes at a time) to use a different implementation strategy for buffering, all the way to the first approach (read everything then parcel it out) which can be seen as "infinite buffering" and coded:
def by4(f):
    data = f.read()
    for i in range(0, len(data), 4):
        yield data[i:i+4]
while leaving the "application logic" (what to do with that stream of 4-byte chunks) intact and independent of the I/O layer (which gets encapsulated within the generator).
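The generator approach can be tried end to end without touching the disk. In this sketch io.BytesIO stands in for the real file, and the names RECORD and records are illustrative; the record size comes from the struct format rather than a hard-coded 4.

```python
import io
import struct

RECORD = struct.Struct('i')  # one native int; 4 bytes on common platforms

def records(f, size=RECORD.size):
    """Yield successive fixed-size chunks from a binary file object."""
    while True:
        rec = f.read(size)
        if len(rec) < size:  # end of file, or a truncated trailing record
            break
        yield rec

# Write the four integers from the question into an in-memory "file".
buf = io.BytesIO()
for e in [7623, 3015, 3231, 3829]:
    buf.write(RECORD.pack(e))
buf.seek(0)

values = [RECORD.unpack(rec)[0] for rec in records(buf)]
print(values)  # [7623, 3015, 3231, 3829]
```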
I think "for rec in inh" is supposed to read 'lines', not bytes. What you want is:
while True:
    rec = inh.read(4)  # Or inh.read(struct.calcsize('i'))
    if len(rec) != 4:
        break
    (pos,) = struct.unpack('i', rec)
    print(pos)
Or as others have mentioned:
while True:
    try:
        (pos,) = struct.unpack_from('i', inh)
    except (some_exception...):
        break
Check the size of the packed integers:
>>> pos
[7623, 3015, 3231, 3829]
>>> [struct.pack('i',e) for e in pos]
['\xc7\x1d\x00\x00', '\xc7\x0b\x00\x00', '\x9f\x0c\x00\x00', '\xf5\x0e\x00\x00']
We see 4-byte strings, it means that reading should be 4 bytes at a time:
>>> inh=open('test.bin','rb')
>>> b1=inh.read(4)
>>> b1
'\xc7\x1d\x00\x00'
>>> struct.unpack('i',b1)
(7623,)
>>>
This is the original int! Extending this into a reading loop is left as an exercise.
You can probably use array as well if you want:
import array
pos = array.array('i', [7623, 3015, 3231, 3829])
inh = open('test.bin', 'wb')
pos.tofile(inh)  # array.write() is the old Python 2 spelling; tofile() works in both
inh.close()
Then use array.array.fromfile or frombytes (fromstring in Python 2) to read it back.
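A round trip with the array module might look like the sketch below. It writes to a temporary directory so it can run anywhere; the item count passed to fromfile is derived from the file size, which is one reasonable choice and an assumption of this example.

```python
import array
import os
import tempfile

pos = array.array('i', [7623, 3015, 3231, 3829])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.bin')
    with open(path, 'wb') as out:
        pos.tofile(out)
    restored = array.array('i')
    with open(path, 'rb') as inh:
        # fromfile needs the number of items; derive it from the file size.
        restored.fromfile(inh, os.path.getsize(path) // restored.itemsize)
print(restored.tolist())  # [7623, 3015, 3231, 3829]
```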
This function reads all bytes from file
import array
import os
def read_binary_file(filename):
    try:
        f = open(filename, 'rb')
        n = os.path.getsize(filename)
        data = array.array('B')
        data.fromfile(f, n)  # array.read() is the old Python 2 spelling
        f.close()
        fsize = len(data)
        return (fsize, data)
    except IOError:
        return (-1, [])
# somewhere in your code
t = read_binary_file(FILENAME)
fsize = t[0]
if (fsize > 0):
    data = t[1]
    # work with data
else:
    print('Error reading file')
Your iterator isn't reading 4 bytes at a time so I imagine it's rather confused. Like SilentGhost mentioned, it'd probably be best to use unpack_from().
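Note that struct.unpack_from works on a bytes buffer plus an offset, not on a file object, so the data has to be read first. A minimal sketch, with struct.pack standing in for the contents of test.bin:

```python
import struct

# Stand-in for the result of inh.read() on test.bin.
data = struct.pack('4i', 7623, 3015, 3231, 3829)
size = struct.calcsize('i')

# Walk the buffer one record at a time by offset.
values = [struct.unpack_from('i', data, offset)[0]
          for offset in range(0, len(data), size)]
print(values)  # [7623, 3015, 3231, 3829]

# struct.iter_unpack (Python 3.4+) does the same walk for you.
same = [v for (v,) in struct.iter_unpack('i', data)]
print(same == values)  # True
```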