Python read data from file and convert to double precision - python

I've been reading an ASCII data file using Python and then converting the data into a NumPy array.
However, I've noticed that the numbers are being rounded.
E.g. my original value from the file is: 2368999.932089
which Python has rounded to: 2368999.93209
Here is an example of my code:
import numpy as np
datafil = open("test.txt",'r')
tempvar = []
header = datafil.readline()
for line in datafil:
    word = line.split()
    char = word[0] # take the first element word[0] of the list
    word.pop() # remove the last element from the list "word"
    if char[0:3] >= '224' and char[0:3] < '225':
        tempvar.append(word)
strvar = np.array(tempvar,dtype = np.longdouble) # Here I want to read all data as double
print(strvar.shape)
var = strvar[:,0:23]
print(var[0,22]) # here it prints 2368999.93209 but the actual value is 2368999.932089
Any ideas guys?
Abedin

I don't think this is a problem with your code. It's how Python displays floating-point numbers. See
https://docs.python.org/2/tutorial/floatingpoint.html
When you print the value, print formats it with str(), which in Python 2 keeps only 12 significant digits. repr() shows the full value, and the stored number itself is not rounded:
In [1]: a=2368999.932089
In [2]: print a
2368999.93209
In [3]: str(a)
Out[3]: '2368999.93209'
In [4]: repr(a)
Out[4]: '2368999.932089'
In [5]: a-2368999.93209
Out[5]: -9.997747838497162e-07
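The same effect can be reproduced in Python 3, where str() and repr() of a float agree, by formatting explicitly. This is only a minimal sketch, assuming that Python 2's str() kept 12 significant digits (the equivalent of "%.12g"); the variable names are illustrative.

```python
a = 2368999.932089

# Python 2's str() kept 12 significant digits, which is what print showed
short = "%.12g" % a
# repr() keeps enough digits to round-trip the stored double
full = repr(a)

print(short)  # 2368999.93209
print(full)   # 2368999.932089
# The stored value was never rounded; only its short display form was
print(float(full) == a)
```

So the rounding is a display matter: asking for more digits when formatting recovers the value read from the file.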

I'm not totally sure what you're trying to do, but simplified with test.txt containing only
asdf
2368999.932089
and then the code:
import numpy as np
datafil = open("test.txt",'r')
tempvar = []
header = datafil.readline()
for line in datafil:
    tempvar.append(line)
print(tempvar)
strvar = np.array(tempvar, dtype=float)  # the np.float alias was removed in NumPy 1.24; it was the built-in float
print(strvar.shape)
print(strvar)
I get the following output:
$ python3 so.py
['2368999.932089']
(1,)
[ 2368999.932089]
which seems to be working fine.
Edit: Updated with your provided line, so test.txt is
asdf
t JD a e incl lasc aper truean rdnnode RA Dec RArate Decrate metdr1 metddr1 metra1 metdec1 metbeta1 metdv1 metsl1 metarrJD1 beta JDej name 223.187263 2450520.619348 3.12966 0.61835 70.7196 282.97 171.324 -96.2738 1.19968 325.317 35.8075 0.662368 0.364967 0.215336 3.21729 -133.586 46.4884 59.7421 37.7195 282.821 2450681.900221 0 2368999.932089 EH2003
and the code
import numpy as np
datafil = open("test.txt",'r')
tempvar = []
header = datafil.readline()
for line in datafil:
    tempvar.append(line.split(' '))
print(tempvar)
strvar = np.array(tempvar[0][-2], dtype=float)  # the np.float alias was removed in NumPy 1.24; it was the built-in float
print(strvar)
The last print still outputs 2368999.932089 for me, so I'm guessing this is a platform issue. What happens if you force dtype=np.float64 or, where your platform provides it, dtype=np.float128? Some other sanity checks: have you tried printing the text before it is converted to a float? And what do you get from doing something like:
>>> np.array('2368999.932089')
array('2368999.932089',
dtype='<U14')
>>> float('2368999.932089')
2368999.932089
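One way to run the sanity check suggested above without NumPy is to convert a row with plain float() and print the result with an explicit number of decimals. This is a rough sketch: parse_row is a hypothetical helper, and the row is an abbreviated version of the sample line (only a few of its columns are kept).

```python
def parse_row(line):
    """Split a whitespace-separated row; convert all but the last column to float."""
    fields = line.split()
    name = fields.pop()  # trailing name column, e.g. "EH2003"
    return [float(f) for f in fields], name

# Abbreviated row: only a few columns of the sample line are kept here
row = "223.187263 2450520.619348 3.12966 0 2368999.932089 EH2003"
values, name = parse_row(row)

print(repr(values[-1]))       # full round-trip representation
print("%.6f" % values[-1])    # explicit six decimals
```

If both prints show 2368999.932089, the value was read correctly and only the default display was shortening it.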

Related

Python: How to remove $ character from list after CSV import

I am attempting to import a CSV file into Python. After importing the CSV, I want to take an average of every ['Spent Past 6 Months'] value, however the "$" symbol that the CSV includes in front of that value is causing me problems. I've tried a number of things to get rid of that symbol and I'm honestly lost at this point!
I'm really new to Python, so I apologize if there is something very simple here that I am missing.
What I have coded is listed below. My output is listed first:
  File "customer_regex2.py", line 24, in <module>
    top20Cust = top20P(data)
  File "customer_regex2.py", line 15, in top20P
    data1 += data1 + int(a[i]['Spent Past 6 Months'])
ValueError: invalid literal for int() with base 10: '$2099.83'
import csv
import re
data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)
def top20P(a):
    outputList=[]
    data1=0
    for i in range(0,len(a)):
        data1 += data1 + int(a[i]['Spent Past 6 Months'])
    top20val= int(data1*0.8)
    for j in range(0,len(a)):
        if data[j]['Spent Past 6 Months'] >= top20val:
            outputList.append('a[j]')
    return outputList
top20Cust = top20P(data)
print(outputList)
It looks like a datatype issue.
You could strip the $ characters like so:
someString = '$2099.83'
someString = someString.strip('$')
print(someString)
2099.83
Now the last step is to wrap in float() since you have decimal values.
print(type(someString))
<class 'str'>
someFloat = float(someString)
print(type(someFloat))
<class 'float'>
Hope that helps.
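Applied to rows like the ones csv.DictReader returns, the strip-then-float idea might look like the sketch below. parse_amount is a hypothetical helper, the second row is invented for illustration, and the handling of thousands separators is an assumption about what the CSV may contain.

```python
def parse_amount(text):
    """Turn a currency string such as '$2099.83' into a float."""
    # Drop surrounding whitespace, a leading '$', and any thousands separators
    return float(text.strip().lstrip('$').replace(',', ''))

rows = [
    {'Spent Past 6 Months': '$2099.83'},   # value taken from the traceback
    {'Spent Past 6 Months': '$1,250.00'},  # hypothetical second row
]

amounts = [parse_amount(row['Spent Past 6 Months']) for row in rows]
total = sum(amounts)

print(amounts)
print(round(total, 2))
```

Converting once, up front, keeps the later comparisons numeric instead of comparing strings with numbers.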

Reading Text File From Webpage by Python3

import re
import urllib
hand=urllib.request.urlopen("http://www.pythonlearn.com/code/mbox-short.txt")
qq=hand.read().decode('utf-8')
numlist=[]
for line in qq:
    line.rstrip()
    stuff=re.findall("^X-DSPAM-Confidence: ([0-9.]+)",line)
    if len(stuff)!=1:
        continue
    num=float(stuff[0])
    numlist.append(num)
print('Maximum:',max(numlist))
The variable qq contains all the strings from the text file. However, the for loop doesn't work and numlist is still empty.
When I download the text file as a local file then read it, everything is ok.
You are iterating over a string, so the loop goes character by character rather than line by line, and findall is being called on single characters. Instead, apply the regex to the whole of qq with the multiline flag re.M:
In [18]: re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M)
Out [18]: ['0.8475', '0.6178', '0.6961', '0.7565', '0.7626', '0.7556', '0.7002', '0.7615', '0.7601', '0.7605', '0.6959', '0.7606', '0.7559', '0.7605', '0.6932', '0.7558', '0.6526', '0.6948', '0.6528', '0.7002', '0.7554', '0.6956', '0.6959', '0.7556', '0.9846', '0.8509', '0.9907']
What you are doing is equivalent to:
In [13]: s = "foo\nbar"
In [14]: for c in s:
   ....:     stuff=re.findall("^X-DSPAM-Confidence: ([0-9.]+)",c)
   ....:     print(c)
   ....:
f
o
o
b
a
r
If you want floats, you can cast with map:
list(map(float,re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M)))
But if you just want the max, you can pass a key to max:
In [22]: max(re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M),key=float)
Out[22]: '0.9907'
So all you need is three lines:
In [28]: hand=urllib.request.urlopen("http://www.pythonlearn.com/code/mbox-short.txt")
In [29]: qq = hand.read().decode('utf-8')
In [30]: max(re.findall("^X-DSPAM-Confidence: ([0-9.]+)",qq, re.M),key=float)
Out[30]: '0.9907'
If you wanted to go line by line, iterate directly over hand :
import re
import urllib.request  # a plain "import urllib" does not load the request submodule
hand = urllib.request.urlopen("http://www.pythonlearn.com/code/mbox-short.txt")
numlist = []
# iterate over each line like a file object
for line in hand:
    stuff = re.search("^X-DSPAM-Confidence: ([0-9.]+)", line.decode("utf-8"))
    if stuff:
        numlist.append(float(stuff.group(1)))
print('Maximum:', max(numlist))
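If the whole text is already held in one decoded string, as qq is in the question, another option is to split it into lines first. The sketch below assumes a small in-memory sample instead of the downloaded file; confidences is a hypothetical helper, and the two confidence values are taken from the output shown above.

```python
import re

PATTERN = re.compile(r"^X-DSPAM-Confidence: ([0-9.]+)")

def confidences(text):
    """Collect every X-DSPAM-Confidence value found in a block of text."""
    values = []
    # splitlines() yields whole lines; iterating the string itself yields characters
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            values.append(float(match.group(1)))
    return values

# Stand-in for the decoded download; the header lines are illustrative
sample = (
    "From: someone@example.com\n"
    "X-DSPAM-Confidence: 0.8475\n"
    "Subject: test\n"
    "X-DSPAM-Confidence: 0.9907\n"
)

values = confidences(sample)
print(values)
print('Maximum:', max(values))
```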

Want to convert string value to float in spark python

Hello, subject experts, please have a look and help me; I am stuck here.
I have two files and I combined them using the union function in Spark. I am getting output like this.
file1 contains:
(u'[12345, 90604080', 0.0)
(u'[67890, 70806080', 320.0)
file2 contains:
(u'[12345, 90604080', 0.0)
(u'[67890, 70806080', 0.0)
This is the combined output:
[u"(u'[12345", u" 90604080'", u' 0.0)']
[u"(u'[67890", u" 70806080'", u' 320.0)']
Here [12345", u" 90604080'" is my key and 0.0 is its value. I want to aggregate the values by key and store the output in a third file, like '12345, 90604080',0.0 and 67890, 70806080', 320.0. This is my code, but I am getting the following error:
ValueError: invalid literal for float(): 70.0)
from pyspark import SparkContext
import os
import sys
sc = SparkContext("local", "aggregate")
file1 = sc.textFile("hdfs://localhost:9000/data//part-00000")
file2 = sc.textFile("hdfs://localhost:9000/data/second/part-00000")
file3 = file1.union(file2).coalesce(1).map(lambda line: line.split(','))
result = file3.map(lambda x: ((x[0]+', '+x[1],float(x[2])))).reduceByKey(lambda a,b:a+b).coalesce(1)
result.saveAsTextFile("hdfs://localhost:9000/Test1")
thanks for the help
It looks like you have an extra closing parenthesis in your string. Try:
result = file3.map(lambda x: ((x[0]+', '+x[1],float(x[2][:-1])))).reduceByKey(lambda a,b:a+b).coalesce(1)
Clarification:
The error-message tells us that the float-conversion got 70.0) as argument. What we want is 70.0. So we just need to omit the last character of the string which we can do with index slicing:
>>> a = "70.0)"
>>> a = a[:-1]
>>> print a
70.0
The slice a[:-1] runs from index 0 up to, but not including, index -1, where -1 refers to the last character (index len(a)-1).
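Slicing off one character works only when the closing parenthesis is always present. A slightly more forgiving variant, shown here without Spark, splits on the last comma and strips the parenthesis only if it is there. parse_record is a hypothetical helper; the input lines are the ones quoted in the question.

```python
from collections import defaultdict

def parse_record(line):
    """Split "(u'[key...', value)" into (key text, float value)."""
    # Only the last comma separates the key from the value
    key_part, value_part = line.rsplit(',', 1)
    value = float(value_part.strip().rstrip(')'))
    return key_part.strip(), value

lines = [
    "(u'[12345, 90604080', 0.0)",
    "(u'[67890, 70806080', 320.0)",
    "(u'[12345, 90604080', 0.0)",
    "(u'[67890, 70806080', 0.0)",
]

# Plain-Python stand-in for reduceByKey(lambda a, b: a + b)
totals = defaultdict(float)
for line in lines:
    key, value = parse_record(line)
    totals[key] += value

print(dict(totals))
```

In the Spark job, a function like this could presumably be mapped over the raw lines (before the split(',')) and followed by the same reduceByKey call.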

Convert file to binary code in Python

I am looking to convert a file to binary for a project, preferably using Python as I am most comfortable with it, though if walked-through, I could probably use another language.
Basically, I need this for a project I am working on where we want to store data using a DNA strand and thus need to store files in binary ('A's and 'T's = 0, 'G's and 'C's = 1)
Any idea how I could proceed? I did find that I could encode in base64 and then decode it, but that seems a bit inefficient, and the code that I have doesn't seem to work...
import base64
import tkinter as tk
from tkinter import filedialog
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
with open(file_path) as f:
    encoded = base64.b64encode(f.readlines())
print(encoded)
Also, I already have a program to do that simply with text. Any tips on how to improve it would also be appreciated!
import binascii
t = bytearray(str(input("Texte?")), 'utf8')
h = binascii.hexlify(t)
b = bin(int(h, 16)).replace('b','')
#removing the b that appears in the end for some reason
g = b.replace('1','G').replace('0','A')
print(g)
For example, for text to DNA:
I input 'test' and expect the DNA sequence that comes from the binary,
the binary being: 01110100011001010111001101110100 (I also print every conversion in the example so that it is easier to follow)
>>>Texte?test #Asks the text
>>>b'74657374' #converts to hex
>>>01110100011001010111001101110100 #converts to binary
>>>AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA #converts 0 to A and 1 to G
So, thanks to @jonrshape and Sergey Vturin, I finally was able to achieve what I wanted!
My program asks for a file, turns it into binary, which then gives me its equivalent in "DNA code" using pairs of binary numbers (00 = A, 01 = T, 10 = G, 11 = C)
import binascii
from tkinter import filedialog
file_path = filedialog.askopenfilename()
x = ""
with open(file_path, 'rb') as f:
    for chunk in iter(lambda: f.read(32), b''):
        x += str(binascii.hexlify(chunk)).replace("b","").replace("'","")
b = bin(int(x, 16)).replace('b','')
g = [b[i:i+2] for i in range(0, len(b), 2)]
dna = ""
for i in g:
    if i == "00":
        dna += "A"
    elif i == "01":
        dna += "T"
    elif i == "10":
        dna += "G"
    elif i == "11":
        dna += "C"
print(x) #hexdump
print(b) #converted to binary
print(dna) #converted to "DNA"
Of course it is inefficient!
base64 is designed to store binary data as text, so the block gets bigger after conversion.
By the way, what kind of efficiency do you want? Compactness?
If so, your second sample is much nearer to what you want.
Also note that in your task you lose information (A and T both become 0, G and C both become 1)! Are you aware of this?
Here is a sample showing how to store and restore.
It stores the data in an easy-to-understand hex-in-text format, just for the sake of a demo. If you want compactness, you can easily modify the code to store to a binary file, or if you want a 00011001 view, that modification will be easy too.
import math
import numpy as np
# make a long test string (ported to Python 3: range, //, print(), int() instead of eval)
s = ''.join(str(x) for x in np.random.randint(4, size=33)) \
    .replace('0','A').replace('1','T').replace('2','G').replace('3','C')
def store_(s):
    size = len(s)  # the data is padded to a multiple of 8, so remember the true size and store it with the data
    s2 = s.replace('A','0').replace('T','0').replace('G','1').replace('C','1') \
        .ljust(int(math.ceil(size/8.)*8), '0')  # pad with '0' on the right to a multiple of 8
    a = (hex(int(s2[i*8:i*8+8], 2))[2:].rjust(2,'0') for i in range(len(s2)//8))
    return ''.join(a), size
yourDataAsHexInText, sizeToStore = store_(s)
print(yourDataAsHexInText, sizeToStore)
def restore_(s, size=None):
    if size is None: size = len(s)*4  # two hex digits encode 8 bases
    a = (bin(int(s[i*2:i*2+2], 16))[2:].rjust(8,'0') for i in range(len(s)//2))
    # you lose information, remember? so the result contains only A or G
    return (''.join(a).replace('1','G').replace('0','A'))[:size]
restore_(yourDataAsHexInText, sizeToStore)
print("so check it")
print(s, "(input)")
print(store_(s))
print(s.replace('C','G').replace('T','A'), "to compare with information loss")
print(restore_(*store_(s)), "restored")
print(s.replace('C','G').replace('T','A') == restore_(*store_(s)))
result in my test:
63c9308a00 33
so check it
AGCAATGCCGATGTTCATCGTATACTTTGACTA (input)
('63c9308a00', 33)
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA to compare with information loss
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA restored
True
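For the two-bit mapping described in the question (00 = A, 01 = T, 10 = G, 11 = C), no information is lost, and the conversion can work on bytes directly, which also avoids dropping leading zero bits when going through int() and bin(). This is a minimal sketch; the function names bytes_to_dna and dna_to_bytes are made up for illustration.

```python
BASES = "ATGC"  # index 0b00 = A, 0b01 = T, 0b10 = G, 0b11 = C

def bytes_to_dna(data):
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return ''.join(out)

def dna_to_bytes(dna):
    """Inverse of bytes_to_dna; the length must be a multiple of four."""
    if len(dna) % 4:
        raise ValueError("DNA string length must be a multiple of 4")
    result = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        result.append(byte)
    return bytes(result)

encoded = bytes_to_dna(b"test")
print(encoded)
print(dna_to_bytes(encoded))
```

For a file, the bytes would come from open(file_path, 'rb').read(), and the round trip through dna_to_bytes gives back the original content.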

How to convert a word in string to binary

I was working on a module (to be imported) that would help convert words in a string to hex and binary (and octal if possible). I finished the hex part, but now I am struggling with the binary part. I don't know where to start or what to do. What I want to do is simple: it would take an input string such as 'test', and the function inside the module would convert it to binary.
What I have done so far is given below:
def string_hex(string): # Converts a word to hex
    keyword = string.encode()
    import binascii
    hexadecimal=str(binascii.hexlify(keyword), 'ascii')
    formatted_hex=':'.join(hexadecimal[i:i+2] for i in range(0, len(hexadecimal), 2))
    return formatted_hex
def hex_string(hexa):
    # hexa(Given this name because there is a built-in function hex()) should be written as string.For accuracy on words avoid symbols(, . !)
    string = bytes.fromhex(hexa)
    formatted_string = string.decode()
    return formatted_string
I saved it as experiment.py in the directory where Python is installed. This is how I call it:
>>> from experiment import string_hex
>>> string_hex('test')
'74:65:73:74'
Just like that I am able to convert it back also like this:
>>> from experiment import hex_string
>>> hex_string('74657374')
'test'
In the same way, I want to convert words in strings to binary. One more thing: I am using Python 3.4.2. Please help me.
You can do it as follows. You don't even have to import binascii.
def string_hex(string):
    return ':'.join(format(ord(c), 'x') for c in string)
def hex_string(hexa):
    hexgen = (hexa[i:i+2] for i in range(0, len(hexa), 2))
    return ''.join(chr(int(n, 16)) for n in hexgen)  # int(n, 16) avoids eval
def string_bin(string):
    return ':'.join(format(ord(c), 'b') for c in string)
def bin_string(binary):
    # assumes every character was written as exactly 7 bits, with no separators
    bingen = (binary[i:i+7] for i in range(0, len(binary), 7))
    return ''.join(chr(int(n, 2)) for n in bingen)
And here is the output:
>>> string_hex('test')
'74:65:73:74'
>>> hex_string('74657374')
'test'
>>> string_bin('test')
'1110100:1100101:1110011:1110100'
>>> bin_string('1110100110010111100111110100')
'test'
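The 7-bit grouping above works for 'test' because each of its letters happens to need exactly seven bits. A variant that pads every byte to eight bits round-trips any text; this is only a sketch, and the choice of UTF-8 and of ':' as an optional separator are assumptions rather than part of the answer above.

```python
def string_bin(text, sep=':'):
    """Encode text as UTF-8 and write each byte as eight binary digits."""
    return sep.join(format(byte, '08b') for byte in text.encode('utf-8'))

def bin_string(binary, sep=':'):
    """Inverse of string_bin; accepts input with or without the separator."""
    digits = binary.replace(sep, '')
    data = bytes(int(digits[i:i + 8], 2) for i in range(0, len(digits), 8))
    return data.decode('utf-8')

encoded = string_bin('test')
print(encoded)
print(bin_string(encoded))
```

Fixed-width groups mean the decoder never has to guess where one character ends and the next begins.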
