I have a large matrix in a gzipped file that looks something like this:
locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0,0.5536,0.9177,0.2929,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0,0.9536,0.8177,0.2827,0.0,0.0
So, each row starts with two descriptors, followed by 10 values.
I simply want to parse out the first 5 values of each row, so that I have a matrix like this:
locus_1 mark1 0.0,0.0,0.0,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0
I have written the following Python script to parse this, but to no avail:
import gzip
import numpy as np

inFile = gzip.open('/home/anish/data.gz')
inFile.next()
for line in inFile:
    cols = line.strip().replace('nan','0').split('\t')
    data = cols[2:]
    data = map(float,data)
    gfpVals = data[:5]
    print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str,gfpVals))
I simply get the error:
data = map(float,data)
ValueError: could not convert string to float:
You are using only tabs as delimiters, while the values themselves are separated by commas.
As a result
locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
is split to
locus_1 || mark1 || 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
and you are passing to float the string
"0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"
which is an invalid literal.
You should replace:
data = cols[2:]
with
data = cols[2].split(',')
(cols[2:] is a list, which has no split() method; the comma-separated values all sit in the single third column, cols[2].)
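Put together, a minimal sketch of the corrected loop (Python 2, like the posted script; it assumes the two descriptors are tab-separated and the ten comma-separated values sit in the third column) could look like this:
import gzip

inFile = gzip.open('/home/anish/data.gz')
inFile.next()  # skip the header line, as in the original script
for line in inFile:
    cols = line.strip().replace('nan', '0').split('\t')
    vals = cols[2].split(',')        # the comma-separated value column
    gfpVals = map(float, vals)[:5]   # keep only the first 5 values
    # print the two descriptors, then the values re-joined with commas,
    # matching the desired output shown above
    print '\t'.join(cols[:2]) + '\t' + ','.join(map(str, gfpVals))
If you need the values exactly as they appear in the file (rather than float-formatted), keep vals[:5] instead of converting to float and back to str.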
[screenshot of the csv file]
Hi (sorry if this is a dumb question). I have a data set as a CSV file. Every row contains 44 columns, and every cell contains 44 float numbers separated by two spaces, like in the screenshot. I tried csv readline/s plus numpy and none of them worked.
I want to read every row as a list of 1936 values (44*44),
and then combine the whole data set into a 2D array: my_data[n_of_samples][1936].
So, as stated by user ybl, this is not a CSV. It's not even close to being one.
This means that you have to implement some processing to turn this into something usable. I put the screenshot through an OCR to extract the actual text values, but next time please provide the input file; screenshots of data are annoying to work with.
The processing you need to do is to find the start and end of each row, using the [ and ] characters respectively. Then you split that data with the basic string.split(), which doesn't care about the number of spaces.
Try the code below and see if that works for the input file.
rows = []
current_row = ""
with open("somefile.txt") as infile:
    for line in infile.readlines():
        cleaned = line.replace('"', '').replace("\n", " ")
        if "]" in cleaned:
            current_row = f"{current_row} {cleaned.split(']')[0]}"
            rows.append(current_row.split())
            current_row = ""
            cleaned = cleaned.split(']')[1]
        if "[" in cleaned:
            cleaned = cleaned.split("[")[1]
        current_row = f"{current_row} {cleaned}"

for row in rows:
    print(len(row))
output
44
44
44
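If you then want the parsed rows as floats in a NumPy array, here is a minimal sketch (assuming, as the question states, that every 44 consecutive bracketed blocks belong to one original CSV row of 44 columns, so each sample has 44*44 = 1936 values; the three-block demo input below is of course too small for that reshape):
import numpy as np

arr = np.array(rows, dtype=float)   # shape (n_blocks, 44)
my_data = arr.reshape(-1, 44 * 44)  # shape (n_of_samples, 1936)
print(my_data.shape)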
input file:
"[ 1.79619717e+04 1.09988207e+02 4.13270009e+01 1.72227906e+01
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]","[-6.12189619e+02 1.03584744e+04 2.34417495e+02 7.01761526e+01
3.92495170e+01 1.81609738e+01 2.58114624e+01 1.52275550e+01
8.59676934e+00 9.45036161e-01 7.71943506e+00 4.17516432e+00
1.27920413e+00 3.68862368e+00 1.99582544e+00 3.82999035e+00
2.96068511e-01 9.06341796e-01 2.35621065e+00 1.52094079e+00
8.64565916e-01 5.34605108e-01 4.35456793e-01 4.99450615e-01
4.57778770e-01 3.10324997e-01 9.90860520e-02 3.68281889e-02
-2.29532895e-01 2.56108491e-01 2.20284123e-01 1.47727878e-01
1.77724506e-01 1.52350751e-01 7.07318164e-02 -7.26252404e-02
1.55364050e-01 4.21222079e-02 6.39113311e-02 1.02558665e-02
-7.74736016e-03 -3.20368093e-02 -2.51241082e-02 1.21653512e-12]","[-5.03959282e+02 -5.64452044e+02 7.90433958e+03 1.94146598e+02
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]"
One option is this:
import numpy as np
import csv

c = np.array([n_of_samples])  # n_of_samples is assumed to be defined earlier
with open('cocacola_sick.csv') as f:
    p = csv.reader(f)  # read file as csv
    for s in p:
        a = ','.join(s)  # concatenate all lines into one line
        a = a.replace("\n", "")  # remove line breaks
        b = np.array(np.mat(a))
        my_data = np.vstack((c,b))
print(my_data)
I am attempting to import a CSV file into Python. After importing the CSV, I want to work with every ['Spent Past 6 Months'] value, but the "$" symbol that the CSV puts in front of that value is causing me problems. I've tried a number of things to get rid of that symbol and I'm honestly lost at this point!
I'm really new to Python, so I apologize if there is something very simple here that I am missing.
What I have coded is listed below. My output is listed first:
File "customer_regex2.py", line 24, in <module>
top20Cust = top20P(data)
File "customer_regex2.py", line 15, in top20P
data1 += data1 + int(a[i]['Spent Past 6 Months'])
ValueError: invalid literal for int() with base 10: '$2099.83'
import csv
import re

data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

def top20P(a):
    outputList=[]
    data1=0
    for i in range(0,len(a)):
        data1 += data1 + int(a[i]['Spent Past 6 Months'])
    top20val= int(data1*0.8)
    for j in range(0,len(a)):
        if data[j]['Spent Past 6 Months'] >= top20val:
            outputList.append('a[j]')
    return outputList

top20Cust = top20P(data)
print(outputList)
It looks like a datatype issue.
You could strip the $ characters like so:
someString = '$2099.83'
someString = someString.strip('$')
print(someString)
2099.83
Now the last step is to wrap in float() since you have decimal values.
print(type(someString))
<class 'str'>
someFloat = float(someString)
print(type(someFloat))
<class 'float'>
Hope that helps.
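For context, here is a sketch of how that strip('$') / float() conversion could be folded into the posted top20P function (the question's overall logic is kept, apart from two assumed fixes noted in the comments):
def top20P(a):
    outputList = []
    total = 0.0
    for row in a:
        # strip the leading '$' and convert to float, as shown above
        # (a plain sum is used here; data1 += data1 + ... in the question
        # doubles the running total and looks unintended)
        total += float(row['Spent Past 6 Months'].strip('$'))
    top20val = total * 0.8
    for row in a:
        if float(row['Spent Past 6 Months'].strip('$')) >= top20val:
            outputList.append(row)  # append the row itself, not the string 'a[j]'
    return outputList

top20Cust = top20P(data)
print(top20Cust)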
Hello subject experts, please have a look and help; I am stuck here.
I have two files and I combined them using the union function in Spark, and I am getting output like this.
file1 contains: (u'[12345, 90604080', 0.0)
(u'[67890, 70806080', 320.0)
file2 contains: (u'[12345, 90604080', 0.0)
(u'[67890, 70806080', 0.0)
[u"(u'[12345", u" 90604080'", u' 0.0)']
[u"(u'[67890", u" 70806080'", u' 320.0)']
This is the combined output. [12345", u" 90604080'" etc. are my keys and 0.0 etc. are their values. I want to aggregate the values by key, like '12345, 90604080', 0.0 and '67890, 70806080', 320.0, and store the result in a third file. This is my code,
but I am getting the following error:
ValueError: invalid literal for float(): 70.0)
from pyspark import SparkContext
import os
import sys
sc = SparkContext("local", "aggregate")
file1 = sc.textFile("hdfs://localhost:9000/data//part-00000")
file2 = sc.textFile("hdfs://localhost:9000/data/second/part-00000")
file3 = file1.union(file2).coalesce(1).map(lambda line: line.split(','))
result = file3.map(lambda x: ((x[0]+', '+x[1],float(x[2])))).reduceByKey(lambda a,b:a+b).coalesce(1)
result.saveAsTextFile("hdfs://localhost:9000/Test1")
Thanks for the help.
It looks like you have an extra closing parenthesis in your string. Try:
result = file3.map(lambda x: ((x[0]+', '+x[1],float(x[2][:-1])))).reduceByKey(lambda a,b:a+b).coalesce(1)
Clarification:
The error message tells us that the float conversion got 70.0) as its argument. What we want is 70.0, so we just need to drop the last character of the string, which we can do with index slicing:
>>> a = "70.0)"
>>> a = a[:-1]
>>> print a
70.0
The slice a[:-1] can be read as "a from index 0 up to (but not including) index -1", where -1 is equivalent to len(a) - 1.
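As an aside (not something from the original answer), str.rstrip(')') does the same job and is forgiving when the trailing parenthesis happens to be missing:
>>> "70.0)".rstrip(')')
'70.0'
>>> "70.0".rstrip(')')
'70.0'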
import csv
import datetime

with open('soundTransit1_remote_rawMeasurements_15m.txt','r') as infile, open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile,delimiter='\t')
    #ouw = csv.writer(outfile,delimiter=' ')
    for row in inr:
        d = datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S')
        s = 1
        p = int(row[5])
        nr = [format(s,'02')+format(d.year,'04')+format(d.month,'02')+format(d.day,'02')+format(d.hour,'02')+format(d.minute,'02')+format(int(p*0.2),'04')]
        outfile.writelines(nr+'/n')
Using the above script, I have read in a .txt file and reformatted it as 'nr' so it looks like this:
['012015072314000000']
['012015072313450000']
['012015072313300000']
['012015072313150000']
['012015072313000000']
['012015072312450000']
['012015072312300000']
['012015072312150000']
..etc.
I now need to write it to my new .txt file, but Python is not letting me write 'nr' with a line break after each entry, I think because the data is in a list of strings. I get this error:
TypeError: can only concatenate list (not "str") to list
Is there another way to do this?
You are trying to combine a list with a string, which cannot work. Simply don't create a list in nr.
import csv
import datetime

with open('soundTransit1_remote_rawMeasurements_15m.txt','r') as infile, open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile,delimiter='\t')
    #ouw = csv.writer(outfile,delimiter=' ')
    for row in inr:
        d = datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S')
        s = 1
        p = int(row[5])
        nr = "{:02d}{:%Y%m%d%H%M}{:04d}\n".format(s,d,int(p*0.2))
        outfile.write(nr)
There is no need to put your string into a list; just use outfile.write() here and build a string without a list:
nr = format(s,'02') + format(d.year,'04') + format(d.month, '02') + format(d.day, '02') + format(d.hour, '02') + format(d.minute, '02') + format(int(p*0.2), '04')
outfile.write(nr + '\n')
Rather than use 7 separate format() calls, use str.format():
nr = '{:02}{:%Y%m%d%H%M}{:04}\n'.format(s, d, int(p * 0.2))
outfile.write(nr)
Note that I formatted the datetime object with one formatting operation, and I included the newline into the string format.
You appear to have hard-coded the s value; you may as well put that into the format directly:
nr = '01{:%Y%m%d%H%M}{:04}\n'.format(d, int(p * 0.2))
outfile.write(nr)
Together, that updates your script to:
import csv
import datetime

with open('soundTransit1_remote_rawMeasurements_15m.txt', 'r') as infile,\
        open('soundTransit1.txt', 'w') as outfile:
    inr = csv.reader(infile, delimiter='\t')
    for row in inr:
        d = datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
        p = int(int(row[5]) * 0.2)
        nr = '01{:%Y%m%d%H%M}{:04}\n'.format(d, p)
        outfile.write(nr)
Take into account that the csv module works better if you follow the guidelines about opening files: in Python 2 you need to open the file in binary mode ('rb'); in Python 3 you need to set the newline parameter to ''. That way the module can control newlines correctly and supports newlines embedded in column values.
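For illustration, the opening lines would then look roughly like this (a sketch; in Python 2 you would pass 'rb' instead of the newline argument):
import csv

with open('soundTransit1_remote_rawMeasurements_15m.txt', newline='') as infile:
    inr = csv.reader(infile, delimiter='\t')
    for row in inr:
        ...  # process rows as above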
Currently, I'm using this to calculate the time between two messages and listing the times if they are above 20 seconds.
def time_deltas(infile):
    entries = (line.split() for line in open(INFILE, "r"))
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "OuchMsg out: [O]":
            ts[e[8]] = e[0]
        elif " ".join(e[2:5]) == "OuchMsg in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)
            yield (float(out_ts),ref_id[1:-1],(float(in_ts)*10000 - float(out_ts)*10000))
            n = (float(in_ts)*10000 - float(out_ts)*10000)
            if n > 20:
                print float(out_ts),ref_id[1:-1], n

INFILE = 'C:/Users/klee/Documents/text.txt'

import csv
with open('output_file1.csv', 'w') as f:
    csv.writer(f).writerows(time_deltas(INFILE))
However, there are two major errors. First of all, Python drops leading zeros when the time is before 1000, i.e. 0900. It also drops trailing zeros, making the time difference inaccurate.
It looks like:
130203.08766
when it should be:
130203.087660
You are yielding floats, so the csv writer turns those floats into strings as it pleases.
If you want your output values to be a certain format, yield a string in that format.
Perhaps something like this?
print "%04.0f" % (900) # prints 0900