Dealing with strings amidst int in csv file, None value - python

I'm reading in data from a csv file where some of the values are "None". The values that are read in are then stored in a list.
The list is then passed to a function which requires all values within the list to be in int() format.
However I can't apply this with the "None" string value present. I've tried replacing "None" with None, or with "", but that hasn't worked; it results in an error. The data in the list also needs to stay in the same position, so I can't just ignore it altogether.
I could replace all "None" with 0 but None != 0 really.
EDIT: I've added my code so hopefully it'll make a bit more sense. Trying to create a line chart from data in csv file:
import csv
import sys
from collections import Counter
import pygal
from pygal.style import LightSolarizedStyle
from operator import itemgetter

#Read in file to data variable and set header variable
filename = sys.argv[1]
data = []
with open(filename) as f:
    reader = csv.reader(f)
    header = reader.next()
    data = [row for row in reader]

#count rows in spreadsheet (minus header)
row_count = (sum(1 for row in data)) - 1

#extract headers which I want to use
headerlist = []
for x in header[1:]:
    headerlist.append(x)

#initialise line chart in module pygal. set style, title, and x axis labels using headerlist variable
line_chart = pygal.Line(style = LightSolarizedStyle)
line_chart.title = 'Browser usage evolution (in %)'
line_chart.x_labels = map(str, headerlist)

#create lists for data from spreadsheet to be put in to
empty1 = []
empty2 = []

#select which data i want from spreadsheet
for dataline in data:
    empty1.append(dataline[0])
    empty2.append(dataline[1:-1])

#DO SOMETHING TO "NONE" VALUES IN EMPTY TWO SO THEY CAN BE PASSED TO INT CONVERTER ASSIGNED TO EMPTY 3
#convert all items in the lists, that are in the list of empty two to int
empty3 = [[int(x) for x in sublist] for sublist in empty2]

#add data to chart line by line
count = -1
for dataline in data:
    while count < row_count:
        count += 1
        line_chart.add(empty1[count], [x for x in empty3[count]])  #function that only takes int data

line_chart.render_to_file("browser.svg")
There will be a lot of inefficiencies or weird ways of doing things; I'm slowly trying to learn.
The above script gives a chart with all the Nones set to 0, but that doesn't really reflect the fact that Chrome didn't exist before a certain date. Thanks

Without seeing your code, I can only offer limited help.
It sounds like you need to utilize ast.literal_eval().
import ast
import csv

csvread = csv.reader(file)
values = []
for row in csvread:
    values.append(ast.literal_eval(row[0]))
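Applied to the posted script, here is a minimal sketch (reusing the empty2/empty3 names from the question; the sample data is made up) that maps the string "None" to the value None and everything else to int:
# example rows as they come out of the csv module: digit strings plus the string "None"
empty2 = [["None", "None", "10"], ["5", "12", "23"]]

def to_int_or_none(value):
    # the string "None" becomes the value None; everything else is converted to int
    return None if value == "None" else int(value)

empty3 = [[to_int_or_none(x) for x in sublist] for sublist in empty2]
print(empty3)  # [[None, None, 10], [5, 12, 23]]
ast.literal_eval gives the same result for values that are valid Python literals (ast.literal_eval("None") is None, ast.literal_eval("10") is 10). As far as I recall, pygal accepts None inside a series passed to line_chart.add() and simply leaves a gap for it, which reflects the absence of Chrome before a certain date better than forcing the value to 0.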

Related

how to iterate over files in python and export several output files

I have a piece of code and I want to put it in a for loop. I want to feed a series of data files into my code and, for each input, generate an output automatically. At the moment, my code only works for one input file and consequently gives one output. My input file is named model000.msh, but in fact I have a series of these input files named model000.msh, model001.msh, and so on.

In the code I do some calculations on the imported file and finally compare it to a numpy array (my_data) that is generated from another numpy array (ID) with one column and thousands of rows. The ID array is the second variable I want to iterate over; ID builds my_data through an np.concatenate call. I want to use each column of ID to make my_data (my_data = np.concatenate((ID[:,iterator], gr), axis=1)).

So, I want to iterate over several files, extract an array from each file (extracted), then within the same loop build my_data from each column of ID, do the calculations on my_data and extracted, and finally export the result of each iteration with a dynamic name (changed_000, changed_001 and so on). This is my code for one single input and one single my_data array (made from an ID that has only one column), but I want to iterate over several input files and several my_data arrays and end up with several outputs:
from itertools import islice
with open('model000.msh') as lines:
    nodes = np.genfromtxt(islice(lines, 0, 1000))
with open('model000.msh', "r") as f:
    saved_lines = np.array([line.split() for line in f if len(line.split()) == 9])
saved_lines[saved_lines == ''] = 0.0
elem = saved_lines.astype(np.int)
# following lines extract some data from my file
extracted = np.c_[elem[:,:-4], nodes[elem[:,-4]-1, 1:], nodes[elem[:,-3]-1, 1:], nodes[elem[:,-2]-1, 1:], nodes[elem[:,-1]-1, 1:]]
…
extracted = np.concatenate((extracted, avs), axis=1)  # each input file ('model000.msh') will make this numpy array
# another data set, stored as a numpy array, is compared to the data extracted from the file
ID = np.array([[… ..., …, …]])  # for now it has one column, but it should have several columns; on each iteration, one column will make a my_data array
my_data = np.concatenate((ID, gr), axis=1)  # I think it should be something like my_data = np.concatenate((ID[:,iterator], gr), axis=1)
from scipy.spatial import distance
distances = distance.cdist(extracted[:,17:20], my_data[:,1:4])
ind_min_dis = np.argmin(distances, axis=1).reshape(-1,1)
z = np.array([])
for i in ind_min_dis:
    u = my_data[i,0]
    z = np.array([np.append(z, u)]).reshape(-1,1)
final_merged = np.concatenate((extracted, z), axis=1)
new_vol = final_merged[:,-1].reshape(-1,1)
new_elements = np.concatenate((elements, new_vol), axis=1)
new_elements[:,[4,-1]] = new_elements[:,[-1,4]]
# The next block is the output block
chunk_size = 3
buffer = ""
i = 0
relavent_line = 0
with open('changed_00', 'a') as fout:
    with open('model000.msh', 'r') as fin:
        for line in fin:
            if len(line.split()) == 9:
                aux_string = ' '.join([str(num) for num in new_elements[relavent_line]])
                buffer += '%s\n' % aux_string
                relavent_line += 1
            else:
                buffer += line
            i += 1
            if i == chunk_size:
                fout.write(buffer)
                i = 0
                buffer = ""
if buffer:
    fout.write(buffer)
    i = 0
    buffer = ""
I appreciate any help in advance.
I'm not very sure about your question, but it seems like you are asking for something like:
for idx in range(10):
    with open('changed_{:0>2d}'.format(idx), 'a') as fout:
        with open('model0{:0>2d}.msh'.format(idx), 'r') as fin:
            # read something from fin...
            # calculate something...
            # write something to fout...
If so, you could look up str.format() for more details.
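Tying that back to the question's code, a rough sketch of the outer loop might look like the following; the file count and the placeholder ID/gr arrays are assumptions, not part of the original script:
import numpy as np

n_files = 10                        # assumed number of model files / ID columns
ID = np.zeros((1000, n_files))      # placeholder for the real ID array from the question
gr = np.zeros((1000, 3))            # placeholder for the other array used to build my_data

for idx in range(n_files):
    in_name = 'model{:03d}.msh'.format(idx)    # model000.msh, model001.msh, ...
    out_name = 'changed_{:03d}'.format(idx)    # changed_000, changed_001, ...

    # one column of ID per iteration, kept two-dimensional so concatenate works
    my_data = np.concatenate((ID[:, idx:idx + 1], gr), axis=1)

    # the extraction / distance / rewrite steps from the question would go here,
    # reading from in_name and writing to out_name
Each pass picks the matching column of ID, so my_data changes together with the input file.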

How to find max and min values within lists without using maps/SQL?

I'm learning Python and have a data set (csv file). I've been able to split the lines by comma, but now I need to find the max and min values in the third column and output the corresponding value from the first column in the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use Pandas or any external libraries; I think it would have been easier if I used them.
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
    mylist.append(line)

newdata = []
for row in mylist:
    data = row.split(",")
    newdata.append(data)
I'd use the built-in csv library for parsing your CSV file, and then just generate a list with the 3rd column values in it:
import csv

with open("loanData.csv", "r") as loanCsv:
    loanCsvReader = csv.reader(loanCsv)
    # Comment out if no headers
    next(loanCsvReader, None)
    # convert to float so max/min compare numerically rather than as strings
    loan_data = [float(row[2]) for row in loanCsvReader]

max_val = max(loan_data)
min_val = min(loan_data)
print("Max: {}".format(max_val))
print("Min: {}".format(min_val))
I don't know the details of your file, whether it has headers or not, but you can comment out
next(loanCsvReader, None)
if you don't have any headers present.
Something like this might work. The index would start at zero, so the third column should be 2.
min_val = min([row.split(',')[2] for row in mylist])
max_val = max([row.split(',')[2] for row in mylist])
Separately, you could probably read and reformat your data to a list with the following:
with open('loanData.csv', 'r') as f:
    data = f.read()
    mylist = list(data.split('\n'))
This assumes that each row of data ends with a newline (\n), but the line endings might differ depending on the OS the file was written on (for example \r\n on Windows).
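Since the question also asks for the first-column value on the same row as the max/min, a small sketch (assuming a header row and numeric values in the third column, as in the linked file) could compare whole rows with a key function:
import csv

with open("loanData.csv") as f:
    reader = csv.reader(f)
    next(reader, None)             # skip the header row, if there is one
    rows = [row for row in reader]

# compare rows by the numeric value in the third column (index 2)
max_row = max(rows, key=lambda row: float(row[2]))
min_row = min(rows, key=lambda row: float(row[2]))

print("Max: {} -> first column: {}".format(max_row[2], max_row[0]))
print("Min: {} -> first column: {}".format(min_row[2], min_row[0]))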

Reading Data into Lists

I'm trying to open a CSV file that contains 100 columns and 2 rows. I want to read the file and put the data in the first column into one list (my x_coordinates) and the data in the second column into another list (my y_coordinates)
X = []
Y = []
data = open("data.csv")
headers = data.readline()
readMyDocument = data.read()
for data in readMyDocument:
    X = readMyDocument[0]
    Y = readMyDocument[1]
print(X)
print(Y)
I'm looking to get two lists, but instead the output is simply a list of 2's.
Any suggestions on how I can change it or where my logic is wrong?
You can do something like:
import csv

# No need to initialize your lists here
X = []
Y = []

with open('data.csv', 'r') as f:
    data = list(csv.reader(f))
    X = data[0]
    Y = data[1]

print(X)
print(Y)
See if that works.
You can use pandas:
import pandas as pd
XY = pd.read_csv(path_to_file)
X = XY.iloc[:,0]
Y = XY.iloc[:,1]
or you can
X = []
Y = []
with open(path_to_file) as f:
    for line in f:
        xy = line.strip().split(',')
        X.append(xy[0])
        Y.append(xy[1])
First things first: you are not closing your file.
A good practice is to use with when opening files, so the file gets closed even if the code breaks.
Then, if you want just one column, you can split your lines on the column separator and use just the column you want.
But this is mostly for learning; in a real situation you may want to use a library like the built-in csv module or, even better, pandas.
X = []
Y = []
with open("data.csv") as data:
    lines = data.read().split('\n')
    # headers is not being used in this snippet
    headers = lines[0]
    lines = lines[1:]
    for line in lines:
        # split on the column separator and keep the first two columns
        columns = line.split(',')
        X.append(columns[0])
        Y.append(columns[1])
print(X)
print(Y)
P.S.: I'm ignoring some variables that you used but never declared in your code snippet. They could be a problem too.
Using numpy's genfromtxt , read the docs here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
Some assumptions:
The delimiter is ","
You obviously don't want the headers in the lists, which is why the header row is skipped.
You can read the docs and use other keywords as well.
import numpy as np

X = list(np.genfromtxt('data.csv', delimiter=",", skip_header=1)[:, 0])
Y = list(np.genfromtxt('data.csv', delimiter=",", skip_header=1)[:, 1])
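If the values are laid out as columns rather than rows, zip(*rows) transposes the rows read by the csv module into columns; a short sketch, assuming data.csv has a header row:
import csv

with open("data.csv") as f:
    rows = list(csv.reader(f))

headers = rows[0]                 # keep or discard the header row as needed
columns = list(zip(*rows[1:]))    # transpose: columns[0] is the first column, columns[1] the second

X = list(columns[0])
Y = list(columns[1])
print(X)
print(Y)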

'float' object is not iterable typerror

I've written a script that takes a large Excel spreadsheet of data, strips away unwanted columns and rows that contain zero values in particular columns, and then saves out to a csv. The piece I'm stuck on is that I'm also trying to remove rows that have missing cells. The way I was trying this was:
for each_row in row_list:
    if not all(map(len, each_row)):
        continue
    else:
        UICData.append(row_list)
But this isn't working correctly as I'm getting the error:
File "/Users/kenmarold/PycharmProjects/sweetCrude/Work/sweetCrude.py", line 56, in PrepareRawData
    if not all(map(len, each_row)):
TypeError: 'float' object is not iterable
I'm not exactly sure how to resolve this, what's the way forward on this? I've also attached the full script below.
#!/usr/bin/env python3
import os
import sqlite3
import csv
import unicodecsv
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook

orig_xls = 'data/all_uic_wells_jun_2016.xls'
temp_xls = 'data/temp.xls'
new_csv = 'data/gh_ready_uic_well_data.csv'
temp_csv = 'data/temp.csv'
input_worksheet_index = 0  # XLS Sheet Number
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('Sweet Crude')
lat_col_index = 13
long_col_index = 14

#### SELECT AND FORMAT DATA
def PrepareRawData(inputFile, tempXLSFile, tempCSVFile, outputFile):
    # 0 = API#            # 7 = Approval Date
    # 1 = Operator        # 13 = Latitude
    # 2 = Operator ID     # 14 = Longitude
    # 3 = Well Type       # 15 = Zone
    keep_columns = [0, 1, 2, 3, 7, 13, 14, 15]
    with open_workbook(inputFile) as rawUICData:
        UICSheet = rawUICData.sheet_by_index(input_worksheet_index)
        UICData = []
        for each_row_index in range(1, UICSheet.nrows - 1, 1):
            row_list = []
            lat_num = UICSheet.cell_value(each_row_index, lat_col_index)    # Get Lat Values
            long_num = UICSheet.cell_value(each_row_index, long_col_index)  # Get Long Values
            if lat_num != 0.0 and long_num != 0.0:  # Find Zero Lat/Long Values
                for each_column_index in keep_columns:
                    cell_value = UICSheet.cell_value(each_row_index, each_column_index)
                    cell_type = UICSheet.cell_type(each_row_index, each_column_index)
                    if cell_type == 3:
                        date_cell = xldate_as_tuple(cell_value, rawUICData.datemode)
                        date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
                        row_list.append(date_cell)
                    else:
                        row_list.append(cell_value)
            for each_row in row_list:
                if not all(map(len, each_row)):
                    continue
                else:
                    UICData.append(row_list)
            # CreateDB(row_list) # Send row data to Database
        for each_list_index, output_list in enumerate(UICData):
            for each_element_index, element in enumerate(output_list):
                output_worksheet.write(each_list_index, each_element_index, element)
        output_workbook.save(tempXLSFile)

    #### RUN XLS-CSV CONVERSION
    workbook = open_workbook(tempXLSFile)
    sheet = workbook.sheet_by_index(input_worksheet_index)
    fh = open(outputFile, 'wb')
    csv_out = unicodecsv.writer(fh, encoding='utf-8')
    for each_row_number in range(sheet.nrows):
        csv_out.writerow(sheet.row_values(each_row_number))
    fh.close()

    #### KILL TEMP FILES
    filesToRemove = [tempXLSFile]
    for each_file in filesToRemove:
        os.remove(each_file)
    print("Raw Data Conversion Ready for Grasshopper")

# ---------------------------------------------------
PrepareRawData(orig_xls, temp_xls, temp_csv, new_csv)
# ---------------------------------------------------
This is a dirty patch.
for each_row in row_list:
    if not isinstance(each_row, list):
        each_row = [each_row]
    if not any(map(len, each_row)):
        continue
    UICData.append(row_list)
EDIT: If the any/map/len combination still raises, then I would try a different route to check whether it's empty.
Also I'm not sure why you are appending the entire row_list and not the current row. I changed it to appending each_row.
Option 1
for each_row in row_list:
    if not each_row:
        continue
    UICData.append(each_row)
Option 2
keep_data = [arow for arow in row_list if arow]  # Or w/e logic. This will be faster.
UICData.append(keep_data)
Your row_list contains a set of values, for example:
[1.01, 75, 3.56, ...]
When you call for each_row in row_list:, you assign a float value to each_row for every iteration of the loop.
You then try to do this:
if not all(map(len, each_row)):
Python's map function expects an iterable as the second argument, and tries to iterate over it to apply the function len to each item. You can't iterate over a float.
I'm not entirely sure what you are trying to do here, but if you are wanting to check that none of the items in your row_list are None or an empty string, then you could do:
if None not in row_list and '' not in row_list:
    UICData.append(row_list)
Your overall objective appears to be to copy selected columns from all rows of one sheet of an Excel XLS file to a CSV file. Each output row must contain only valid cells, for some definition of "valid".
As you have seen, using map() is not a good idea; it's only applicable if all the fields are text. You should apply tests depending generally on the datatype and specifically on the individual column.
Once you have validated the items in the row, you are in a position to output the data. You have chosen a path which (1) builds a list of all output rows (2) uses xlwt to write to a temp XLS file (3) uses xlrd to read the temp file and unicodecsv to write a CSV file. Please consider avoiding all that; instead just use unicodecsv.writer.writerow(row_list)
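Following that suggestion, a condensed sketch of the direct route; the date formatting from the original script is left out for brevity, and the "no empty cells" test is just one possible definition of a valid row:
import unicodecsv
from xlrd import open_workbook

keep_columns = [0, 1, 2, 3, 7, 13, 14, 15]

book = open_workbook('data/all_uic_wells_jun_2016.xls')
sheet = book.sheet_by_index(0)

with open('data/gh_ready_uic_well_data.csv', 'wb') as fh:
    csv_out = unicodecsv.writer(fh, encoding='utf-8')
    for row_index in range(1, sheet.nrows):    # skip the header row
        row_list = [sheet.cell_value(row_index, col) for col in keep_columns]
        # one possible definition of "valid": no empty cells in the kept columns
        if '' in row_list:
            continue
        csv_out.writerow(row_list)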

Import CSV and create one list for each column in Python

I am processing a CSV file in Python that's delimited by a comma (,).
Each column is a sampled parameter; for instance, column 0 is time, sampled once a second, column 1 is altitude, sampled 4 times a second, etc.
So columns will look like as below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file not just one, so the number of columns can vary.
Normally if every file was consistent I would do something like:
import csv

time = []
alt = []
dct = {}
with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1])  # etc for all columns
I am pretty new to Python. Is this a good way to tackle this? If not, what is a better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame based on the columns. It's roughly a dictionary of lists.
You can also use the .tolist() functionality of pandas to convert it to a list if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print dict_of_lists
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension might work faster.
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was to interpolate to the highest sampling rate. So here is what I came up with... Please let me know if I can do anything more efficiently. I used A LOT of searching on this site to help build this. Again, I am new to Python (about 2-3 weeks, but some former programming experience).
import csv

header = []
# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile:  # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list that consists of all content in column A

for x in range(0, len(header) - 1):  # go through entire column
    if header[x].isdigit() and header[x+1] == "":  # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x+1].isdigit() and header[x] == "":  # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x+1])
        if temp_f > temp_i:  # calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin + 1) - loc_int)
            for y in range(loc_int, loc_fin + 1):
                header[y] = temp_i + interp * (y - loc_int)

print header

with open("output.csv", 'wb') as g:  # write to new file
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header", with its interpolated values, back in place of column A of my old file, test2.csv.
Anywho thank you very much for looking...
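On that last point, a rough sketch of writing the interpolated list back as column A while keeping the other columns of test2.csv; the placeholder header values and the file names are assumptions, following the Python 2-style code above:
import csv

# read every row of the original file
with open('test2.csv', 'r') as f:
    rows = list(csv.reader(f))

# placeholder for the interpolated column A produced by the script above
header = [float(i) for i in range(len(rows))]

# overwrite column A row by row, keeping the other columns untouched
for row, new_value in zip(rows, header):
    row[0] = new_value

# 'wb' matches the Python 2 style of the code above; on Python 3 use open('output.csv', 'w', newline='')
with open('output.csv', 'wb') as g:
    writer = csv.writer(g)
    writer.writerows(rows)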
