Want to convert string value to float in Spark (Python)

Hello subject experts, please have a look; I'm stuck here.
I have two files that I combined using Spark's union function.
file1 contains:
(u'[12345, 90604080', 0.0)
(u'[67890, 70806080', 320.0)
file2 contains:
(u'[12345, 90604080', 0.0)
(u'[67890, 70806080', 0.0)
The combined output after the union and split looks like:
[u"(u'[12345", u" 90604080'", u' 0.0)']
[u"(u'[67890", u" 70806080'", u' 320.0)']
Here '12345, 90604080' is the key and 0.0 is its value. I want to aggregate the values by key, like ('12345, 90604080', 0.0) and ('67890, 70806080', 320.0), and store the output in a third file. This is my code, but I am getting the following error:
ValueError: invalid literal for float(): 70.0)
from pyspark import SparkContext
import os
import sys

sc = SparkContext("local", "aggregate")
file1 = sc.textFile("hdfs://localhost:9000/data//part-00000")
file2 = sc.textFile("hdfs://localhost:9000/data/second/part-00000")
# union the two files and split each line on commas
file3 = file1.union(file2).coalesce(1).map(lambda line: line.split(','))
# rebuild the key from the first two fields and sum the values per key
result = file3.map(lambda x: (x[0] + ', ' + x[1], float(x[2]))).reduceByKey(lambda a, b: a + b).coalesce(1)
result.saveAsTextFile("hdfs://localhost:9000/Test1")
Thanks for the help.

It looks like you have an extra closing parenthesis in your string. Try:
result = file3.map(lambda x: (x[0] + ', ' + x[1], float(x[2][:-1]))).reduceByKey(lambda a, b: a + b).coalesce(1)
Clarification:
The error message tells us that the float conversion received 70.0) as its argument. What we want is 70.0, so we just need to drop the last character of the string, which we can do with index slicing:
>>> a = "70.0)"
>>> a = a[:-1]
>>> print a
70.0
The slice a[:-1] reads as "a from the start up to, but not including, index -1", where -1 is equivalent to len(a) - 1, i.e. the last character.
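If the stray characters vary from line to line (quotes, brackets, spaces), a slightly more defensive option is to strip them instead of slicing at a fixed position. A sketch; the to_float helper is hypothetical, not part of the original code:
def to_float(token):
    # strip whitespace, quotes, parentheses and brackets around the number
    return float(token.strip(" \t'\"()[]"))

result = (file3
          .map(lambda x: (x[0] + ', ' + x[1], to_float(x[2])))
          .reduceByKey(lambda a, b: a + b)
          .coalesce(1))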

Related

Python: How to remove $ character from list after CSV import

I am attempting to import a CSV file into Python. After importing the CSV, I want to take the sum of every ['Spent Past 6 Months'] value, but the "$" symbol that the CSV includes in front of each value is causing me problems. I've tried a number of things to get rid of that symbol and I'm honestly lost at this point!
I'm really new to Python, so I apologize if there is something very simple here that I am missing.
What I have coded is listed below. My output is listed first:
File "customer_regex2.py", line 24, in <module>
top20Cust = top20P(data)
File "customer_regex2.py", line 15, in top20P
data1 += data1 + int(a[i]['Spent Past 6 Months'])
ValueError: invalid literal for int() with base 10: '$2099.83'
import csv
import re

data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

def top20P(a):
    outputList = []
    data1 = 0
    for i in range(0, len(a)):
        data1 += data1 + int(a[i]['Spent Past 6 Months'])
    top20val = int(data1 * 0.8)
    for j in range(0, len(a)):
        if data[j]['Spent Past 6 Months'] >= top20val:
            outputList.append('a[j]')
    return outputList

top20Cust = top20P(data)
print(outputList)
It looks like a datatype issue.
You could strip the $ characters like so:
someString = '$2099.83'
someString = someString.strip('$')
print(someString)
2099.83
Now the last step is to wrap in float() since you have decimal values.
print(type(someString))
<class 'str'>
someFloat = float(someString)
print(type(someFloat))
<class 'float'>
Hope that helps.
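Putting it together, here is a sketch of the question's top20P with the conversion fixed. It mirrors the apparent intent of the original (sum, 80% threshold, then filter), swapping the int() call for strip('$') plus float() and comparing numbers rather than strings:
def top20P(rows):
    # total spend, with the leading '$' stripped before conversion
    total = sum(float(r['Spent Past 6 Months'].strip('$')) for r in rows)
    top20val = total * 0.8
    # keep the rows whose numeric value clears the threshold
    return [r for r in rows
            if float(r['Spent Past 6 Months'].strip('$')) >= top20val]

top20Cust = top20P(data)
print(top20Cust)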

How to solve a decoding problem caused by wrongly formatted JSON

Hi everyone, I need help opening and reading a file.
I have this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary:
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written into a txt file using json.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
and it is all one string.
I need to open it, check for repeated IDs, delete the duplicates, and save the result again.
But json.loads raises ValueError: Extra data.
I tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
but I still get that error, just in a different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()
    before_log = before_log.replace('}{', ', ').split(', ')
mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/JSONs in a row.
Maybe there is a better way to do this?
Your file size is 9.5 MB, so it'll take you a while to open and debug it manually.
So, using the head and tail tools (normally found in any GNU/Linux distribution) you'll see:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best fix is to separate }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'

data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with-statement part can be replaced by
    # using sed under your OS, like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}

# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key has not been seen, keep it for further manipulation;
        # otherwise ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file's structure and valid JSON is the missing commas between objects, and that the objects are not enclosed within [ and ]. So the same can be achieved with the code snippet below:
import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Convert into a single-line string
b = ''.join(a.splitlines())
# Add a comma after each object
b = b.replace("}", "},")
# Add the opening and closing brackets and drop the trailing comma added in the previous step
b = '[' + b[:-1] + ']'
x = json.loads(b)
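Alternatively, the standard library can parse concatenated JSON objects directly, with no string surgery at all. Here is a minimal sketch using json.JSONDecoder.raw_decode, which returns each parsed object together with the index where parsing stopped; the iter_json_objects helper is my own name, not from either answer:
import json

def iter_json_objects(text):
    # yield each top-level object from a string like {...}{...}{...}
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # skip any whitespace between objects
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx == len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

with open('111111111.txt', encoding='utf8') as f:
    merged = {}
    for chunk in iter_json_objects(f.read()):
        for key, value in chunk.items():
            merged.setdefault(key, value)  # keep the first value per key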

Summing a column in csv using Python

I work with large csv files and wanted to test if we can sum a numeric column using Python. I generated a random data set:
id,first_name,last_name,email,gender,money
1,Clifford,Casterou,ccasterou0#dropbox.com,Male,53
2,Ethyl,Millichap,emillichap1#miitbeian.gov.cn,Female,58
3,Jessy,Stert,jstert2#gnu.org,Female,
4,Doy,Beviss,dbeviss3#dedecms.com,Male,80
5,Josee,Rust,jrust4#epa.gov,Female,13
6,Hedvige,Ahlf,hahlf5#vkontakte.ru,Female,67
On line 3 you will notice that the value is missing (I removed that data on purpose, to test).
I wrote the code :
import csv

with open("mock_7.txt", "r+", encoding='utf8') as fin:
    headerline = fin.readline()
    amount = 0
    debit = 0
    value = 0
    for row in csv.reader(fin):
        # var = row.rstrip()
        value = row[5].replace('', 0)
        value = float(value)
        debit += value
print(debit)
I got the error:
Traceback (most recent call last):
  File "sum_csv1_v2.py", line 11, in <module>
    value += float(value)
TypeError: must be str, not float
As I am new to Python, my plan was to convert the empty cells to zero, but I think I am missing something here. Also, my script assumes comma-separated files, and I'm sure it won't work for other delimiters. Can you help me improve this code?
The original exception, now lost in the edit history,
TypeError: replace() argument 2 must be str, not int
is the result of str.replace() expecting string arguments, but you're passing an integer zero. Instead of replace you could simply check for empty string before conversion:
value = row[5]
value = float(value) if value else 0.0
Another option is to catch the potential ValueError:
try:
    value = float(row[5])
except ValueError:
    value = 0.0
This might hide the fact that the column contains "invalid" values other than just missing values.
Note that had you passed string arguments the end result would probably not have been what you expected:
In [2]: '123'.replace('', '0')
Out[2]: '0102030'
In [3]: float(_)
Out[3]: 102030.0
As you can see, an empty string as the "needle" ends up inserting the replacement around each and every character of the string.
The latest exception in the question, after fixing the other errors, is the result of the float(value) conversion working and
value += float(value)
being equal to:
value = value + float(value)
and as the exception states, strings and floats don't mix.
The problem with your code is that you're calling replace() without checking whether row[5] is empty.
Fixed code:
import csv

with open("mock_7.txt", "r+", encoding='utf8') as fin:
    headerline = fin.readline()
    debit = 0
    for row in csv.reader(fin):
        # treat an empty cell as zero before converting
        if row[5].strip() == '':
            row[5] = 0
        value = float(row[5])
        debit += value
print(debit)
output:
271.0
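For comparison, a short sketch using csv.DictReader, so the column can be addressed by name instead of index (this assumes the header row shown in the question):
import csv

def sum_money(path):
    # sum the "money" column, treating blank cells as zero
    total = 0.0
    with open(path, newline='', encoding='utf8') as f:
        for row in csv.DictReader(f):
            cell = row['money'].strip()
            total += float(cell) if cell else 0.0
    return total

print(sum_money('mock_7.txt'))  # 271.0 for the sample data above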

parsing large dataset with python

I have a large matrix in a gzip that looks something like this:
locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0,0.5536,0.9177,0.2929,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0,0.9536,0.8177,0.2827,0.0,0.0
So, each row starts with two descriptors, followed by 10 values.
I simply want to parse out the first 5 values of this row, such that I have a matrix like this:
locus_1 mark1 0.0,0.0,0.0,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0
I have made the following python script to parse this, but to no avail:
import gzip
import numpy as np

inFile = gzip.open('/home/anish/data.gz')
inFile.next()
for line in inFile:
    cols = line.strip().replace('nan', '0').split('\t')
    data = cols[2:]
    data = map(float, data)
    gfpVals = data[:5]
    print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str, gfpVals))
I simply get the error:
data = map(float,data)
ValueError: could not convert string to float:
You are using only tabs as delimiters, while the values are also delimited by commas.
As a result
locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
is split to
locus_1 || mark1 || 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
and you are passing to float the string
"0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"
which is an invalid literal.
You should replace:
data = cols[2:]
with
data = cols[2].split(',')
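For reference, a sketch of the full corrected loop (written in Python 3, unlike the Python 2 code in the question; the path is the one from the question):
import gzip

with gzip.open('/home/anish/data.gz', 'rt') as in_file:
    next(in_file)  # skip the header line
    for line in in_file:
        cols = line.strip().replace('nan', '0').split('\t')
        # the third tab-separated field holds the ten comma-separated values
        values = [float(v) for v in cols[2].split(',')]
        gfp_vals = values[:5]  # keep only the first five values
        print('\t'.join(cols[:2]) + '\t' + ','.join(map(str, gfp_vals)))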

Python read data from file and convert to double precision

I've been reading an ASCII data file using Python. Then I convert the data into a numpy array.
However, I've noticed that the numbers are being rounded.
E.g. My original value from the file is: 2368999.932089
which python has rounded to: 2368999.93209
here is an example of my code:
import numpy as np

datafil = open("test.txt", 'r')
tempvar = []
header = datafil.readline()
for line in datafil:
    word = line.split()
    char = word[0]  # take the first element word[0] of the list
    word.pop()      # remove the last element from the list "word"
    if char[0:3] >= '224' and char[0:3] < '225':
        tempvar.append(word)
strvar = np.array(tempvar, dtype=np.longdouble)  # here I want to read all data as double
print(strvar.shape)
var = strvar[:, 0:23]
print(var[0, 22])  # here it prints 2368999.93209 but the actual value is 2368999.932089
Any ideas guys?
Abedin
I think this is not a problem with your code; it's the usual floating-point representation in Python. See
https://docs.python.org/2/tutorial/floatingpoint.html
When you print the value, print has already formatted your number via str:
In [1]: a=2368999.932089
In [2]: print a
2368999.93209
In [3]: str(a)
Out[3]: '2368999.93209'
In [4]: repr(a)
Out[4]: '2368999.932089'
In [5]: a-2368999.93209
Out[5]: -9.997747838497162e-07
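In other words, only Python 2's default str()/print display rounds to 12 significant digits; the stored double still holds the parsed value, and explicit formatting shows it. A quick check (Python 3 syntax, where str() and repr() both use the full shortest round-trip form):
a = 2368999.932089
print(repr(a))              # 2368999.932089
print('{:.6f}'.format(a))   # 2368999.932089
print('{:.13g}'.format(a))  # 2368999.932089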
I'm not totally sure what you're trying to do, but simplified with test.txt containing only
asdf
2368999.932089
and then the code:
import numpy as np

datafil = open("test.txt", 'r')
tempvar = []
header = datafil.readline()
for line in datafil:
    tempvar.append(line)
print(tempvar)
strvar = np.array(tempvar, dtype=np.float)
print(strvar.shape)
print(strvar)
I get the following output:
$ python3 so.py
['2368999.932089']
(1,)
[ 2368999.932089]
which seems to be working fine.
Edit: Updated with your provided line, so test.txt is
asdf
t JD a e incl lasc aper truean rdnnode RA Dec RArate Decrate metdr1 metddr1 metra1 metdec1 metbeta1 metdv1 metsl1 metarrJD1 beta JDej name 223.187263 2450520.619348 3.12966 0.61835 70.7196 282.97 171.324 -96.2738 1.19968 325.317 35.8075 0.662368 0.364967 0.215336 3.21729 -133.586 46.4884 59.7421 37.7195 282.821 2450681.900221 0 2368999.932089 EH2003
and the code
import numpy as np

datafil = open("test.txt", 'r')
tempvar = []
header = datafil.readline()
for line in datafil:
    tempvar.append(line.split(' '))
print(tempvar)
strvar = np.array(tempvar[0][-2], dtype=np.float)
print(strvar)
the last print still outputs 2368999.932089 for me. So I'm guessing this is a platform issue? What happens if you force dtype=np.float64 or dtype=np.float128? Some other sanity checks: have you tried spitting out the text before it is converted to a float? And what do you get from doing something like:
>>> np.array('2368999.932089')
array('2368999.932089',
dtype='<U14')
>>> float('2368999.932089')
2368999.932089
