Reading Data into Lists

I'm trying to open a CSV file that contains 100 columns and 2 rows. I want to read the file and put the data in the first column into one list (my x_coordinates) and the data in the second column into another list (my y_coordinates).
X = []
Y = []
data = open("data.csv")
headers = data.readline()
readMyDocument = data.read()
for data in readMyDocument:
    X = readMyDocument[0]
    Y = readMyDocument[1]
    print(X)
    print(Y)
I'm looking to get two lists but instead the output is simply a list of 2's.
Any suggestions on how I can change it, or where my logic is wrong?

You can do something like:
import csv
# No need to initialize your lists beforehand
with open('data.csv', 'r') as f:
    data = list(csv.reader(f))
# data[0] and data[1] are the first two rows of the file
X = data[0]
Y = data[1]
print(X)
print(Y)
See if that works.

You can use pandas:
import pandas as pd
XY = pd.read_csv(path_to_file)
X = XY.iloc[:,0]
Y = XY.iloc[:,1]
or you can read the file line by line:
X = []
Y = []
with open(path_to_file) as f:
    for line in f:
        xy = line.strip().split(',')
        X.append(xy[0])
        Y.append(xy[1])

First things first: you are not closing your file.
A good practice is to open files using with, so that the file is closed even if the code raises an error.
Then, if you want just one column, you can split each line on the column separator and use only the column you want.
But this is mostly for learning; in a real situation you may want to use a library like the built-in csv module or, even better, pandas.
X = []
Y = []
with open("data.csv") as data:
    lines = data.read().splitlines()
# headers is not used in this snippet
headers = lines[0]
lines = lines[1:]
# changing the variable name for better readability
for line in lines:
    # split on the column separator before picking a column
    fields = line.split(',')
    X.append(fields[0])
    Y.append(fields[1])
print(X)
print(Y)
P.S.: I'm ignoring some variables that you used but did not declare in your code snippet. They could be a problem too.
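The built-in csv module mentioned above can do the splitting for you. A minimal sketch, assuming a two-column file with one header row; the function name read_xy and the sample text are made up for illustration and stand in for data.csv:

```python
import csv
import io

def read_xy(lines):
    """Return the first two columns of CSV input as two lists of strings."""
    reader = csv.reader(lines)
    next(reader, None)            # skip the header row, if there is one
    X, Y = [], []
    for row in reader:
        if len(row) < 2:          # ignore blank or short lines
            continue
        X.append(row[0])
        Y.append(row[1])
    return X, Y

# Hypothetical sample standing in for the contents of data.csv
sample = io.StringIO("x,y\n1,10\n2,20\n3,30\n")
X, Y = read_xy(sample)
print(X)
print(Y)
```

With a real file, you would pass the handle from with open("data.csv", newline="") as f: instead of the StringIO object. The values stay strings; wrap them in float() if you need numbers.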

Use numpy's genfromtxt; read the docs here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
Some assumptions:
The delimiter is ",".
You obviously don't want the headers in the lists, which is why the header row is skipped.
You can read the docs and use other keywords as well.
import numpy as np
# read the file once, then slice out each column
data = np.genfromtxt('data.csv', delimiter=",", skip_header=1)
X = list(data[:, 0])
Y = list(data[:, 1])

Related

How to get around a NumPy error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

The code below is used to analyze a CSV file. At the moment I'm trying to remove the columns of the array that are not in my check_list. It only checks the first row: if the first-row entry of a column doesn't belong to check_list, the entire column is removed. But this error keeps getting thrown and I'm not sure how to avoid it.
import numpy as np
def load_metrics(filename):
    """opens a csv file and returns stuff"""
    check_list = ["created_at","tweet_ID","valence_intensity","anger_intensity","fear_intensity","sadness_intensity","joy_intensity","sentiment_category","emotion_category"]
    file = open(filename)
    data = []
    for lin in file:
        lin = lin.strip()
        lin = lin.split(",")
        data.append(lin)
    for col in range(len(data[0])):
        if np.any(data[0][col] not in check_list) == True:
            data[0] = np.delete(np.array(data), col, 1)
            print(col)
    return np.array(data)
The below test is being used on the code too:
data = load_metrics("covid_sentiment_metrics.csv")
print(data[0])
Test results:
['created_at' 'tweet_ID' 'valence_intensity' 'anger_intensity'
 'fear_intensity' 'sadness_intensity' 'joy_intensity' 'sentiment_category'
 'emotion_category']

Change your load_metrics function to:
import numpy as np
def load_metrics(filename):
    check_list = ["created_at","tweet_ID","valence_intensity","anger_intensity",
                  "fear_intensity","sadness_intensity","joy_intensity","sentiment_category",
                  "emotion_category"]
    data = []
    with open(filename, 'r') as file:
        for lin in file:
            lin = lin.strip()
            lin = lin.split(",")
            data.append(lin)
    arr = np.array(data)
    # True for every column whose header is in check_list
    colFilter = []
    for col in arr[0]:
        colFilter.append(col in check_list)
    return arr[:, colFilter]
I introduced the following corrections:
Use with to automatically close the input file (your code fails to close it).
Create a "full" NumPy array (all columns) after the data has been read.
Compute the colFilter list, which marks the columns that are in check_list.
Return only the filtered columns.
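As for where the error most likely comes from in the question's loop: after the first np.delete, data[0] holds a whole 2-D array, so data[0][col] is a row rather than a single string, and a membership test on an array cannot be reduced to one True or False. A small reproduction under that assumption, with made-up column names, followed by the same boolean-mask idea written with np.isin:

```python
import numpy as np

check_list = ["hello", "world"]

# A row of an array, which is what data[0][col] becomes in the question's loop
row = np.array(["hello", "some"])
try:
    row not in check_list      # compares the whole row with each list item
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)

# Boolean mask over the header row, then column selection in one step
arr = np.array([["hello", "some", "world"],
                ["1", "2", "3"]])
mask = np.isin(arr[0], check_list)   # one True/False per column
filtered = arr[:, mask]
print(filtered)
```

The mask has one entry per column, so no column is deleted while the loop is still indexing into the array.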

Read columns by checklist
This code does not include checks related to reading a file or a broken data structure, so that the main idea is more or less clear. So, here I assume that a csv-file exists and has at least 2 lines:
import numpy as np
def load_metrics(filename, check_list):
    """open a csv file and return data as numpy.array
    with columns from a check list"""
    data = []
    with open(filename) as file:
        headers = file.readline().rstrip("\n").split(",")
        for line in file:
            data.append(line.rstrip("\n").split(","))
    col_to_remove = []
    for col in reversed(range(len(headers))):
        if headers[col] not in check_list:
            col_to_remove.append(col)
            headers.pop(col)
    data = np.delete(np.array(data), col_to_remove, 1)
    return data, headers
Quick testing:
test_data = """\
hello,some,other,world
1,2,3,4
5,6,7,8
"""
with open("test.csv", 'w') as f:
    f.write(test_data)
check_list = ["hello","world"]
d, h = load_metrics("test.csv", check_list)
print(d, h)
Expected output:
[['1' '4']
 ['5' '8']] ['hello', 'world']
Some details:
Instead of np.any(data[0][col] not in check_list) == True, a plain data[0][col] not in check_list is enough.
Stripping with default parameters is risky, because it can delete meaningful whitespace.
Do not delete anything while looping forward. But we can do it (with some reservations) while looping backward.
check_list is better as a parameter.
Separate data and headers, as they may have different types.
In your case it is better to use pandas.read_csv.
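A minimal sketch of the pandas.read_csv route, assuming the same small test file as above (here held in a string so the example is self-contained). The usecols parameter accepts a callable that is evaluated against each column name:

```python
import io
import pandas as pd

check_list = ["hello", "world"]
csv_text = "hello,some,other,world\n1,2,3,4\n5,6,7,8\n"

# Keep only the columns whose header appears in check_list
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name in check_list)
print(df)
```

For a file on disk, pass the path instead of the StringIO object. Unlike the NumPy version, pandas infers a type per column, so the numbers arrive as integers rather than strings.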

Removing certain separators from csv file with pandas or csv

I've got multiple csv files, which I received in the following line format:
-8,000E-04,2,8E+1,
The first and the third comma are meant to be decimal separators, the second comma is a column delimiter, and I think the last one is supposed to indicate a new line. So the CSV should only consist of two columns, and I have to prepare the data in order to plot it. Therefore I need to specify the two columns as x and y to plot the data. I tried removing or replacing the separators in every line, but by doing that I'm no longer able to specify the two columns. Is there a way to remove certain separators from every line of the CSV?

You can split the string returned by reading each line, as follows:
line = "-8,000E-04,2,8E+1,"
list_string = line.split(',')
x = float(list_string[0] + "." + list_string[1])
y = float(list_string[2] + "." + list_string[3])
print(x, y)
The result is:
-0.0008 28.0
You can also arrange x and y in columns, or whatever you want.

Here is a short Python program to convert your CSV files:
import csv
f1 = "in_test.csv"
f2 = "out_test.csv"
with open(f1, newline='') as csv_reader:
    reader = csv.reader(csv_reader, delimiter=',')
    with open(f2, mode='w', newline='') as csv_writer:
        writer = csv.writer(csv_writer, delimiter=";")
        for row in reader:
            # rejoin the two halves of each number with a decimal point
            out_row = [row[0] + '.' + row[1], row[2] + '.' + row[3]]
            writer.writerow(out_row)
Sample input:
-8,000E-04,2,8E+1,
-2,000E-03,2,7E+2,
Sample output:
-8.000E-04;2.8E+1
-2.000E-03;2.7E+2

I think you should replace the second comma using regex. Well, I'm definitely not an expert at it, but I've managed to come up with this:
import re
s = "-8,000E-04,2,8E+1,"
pattern = "^([^,]*,[^,]*),(.*),$"
grps = re.search(pattern, s).groups()
res = [float(s.replace(",", ".")) for s in grps]
print(res)
# [-0.0008, 28.0]
Sample csv file:
-8,000E-04,2,8E+1,
6,0E-6,-45E+2,
-5,550E-6,-6,2E+1,
And you can do something like this:
x = []
y = []
regex = re.compile("^([^,]*,[^,]*),(.*),$")
with open("a.csv") as f:
    for line in f:
        result = regex.search(line).groups()
        x.append(float(result[0].replace(",", ".")))
        y.append(float(result[1].replace(",", ".")))
The result is:
print(x, y)
# [-0.0008, 6e-06, -5.55e-06] [28.0, -4500.0, -62.0]
I'm not sure this is the most efficient way, but it works.

How to find max and min values within lists without using maps/SQL?

I'm learning Python and have a data set (CSV file). I've been able to split the lines by comma, but now I need to find the max and min values in the third column and output the corresponding value in the first column of the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use pandas or any external libraries; I think it would have been easier if I could use them.
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
    mylist.append(line)
newdata = []
for row in mylist:
    data = row.split(",")
    newdata.append(data)

I'd use the built-in csv library for parsing your CSV file, and then just generate a list with the third-column values in it:
import csv
with open("loanData.csv", "r") as loanCsv:
    loanCsvReader = csv.reader(loanCsv)
    # Comment out if no headers
    next(loanCsvReader, None)
    # float() assumes the third column is numeric; without it the
    # values would be compared as strings
    loan_data = [float(row[2]) for row in loanCsvReader]
max_val = max(loan_data)
min_val = min(loan_data)
print("Max: {}".format(max_val))
print("Min: {}".format(min_val))
I don't know the details of your file, or whether it has headers, but you can comment out
next(loanCsvReader, None)
if there are no headers present.
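The question also asks for the first-column value in the same row as the max and min. One way to get it with only the standard library is to take the max and min over whole rows, using the third column as the key. A minimal sketch, assuming a header row and a numeric third column; the function name min_max_rows and the sample data are made up, since the real file is not shown:

```python
import csv
import io

def min_max_rows(lines, value_col=2):
    """Return the rows holding the smallest and largest value in value_col."""
    reader = csv.reader(lines)
    next(reader, None)                       # skip the header row
    rows = [row for row in reader if len(row) > value_col]
    # Compare numerically: as strings, "900" would sort above "12000"
    lowest = min(rows, key=lambda row: float(row[value_col]))
    highest = max(rows, key=lambda row: float(row[value_col]))
    return lowest, highest

# Hypothetical sample standing in for loanData.csv
sample = io.StringIO("id,term,amount\nA1,36,5000\nB2,60,12000\nC3,36,900\n")
lowest, highest = min_max_rows(sample)
print("Min:", lowest[2], "in row", lowest[0])
print("Max:", highest[2], "in row", highest[0])
```

Because min and max return the whole row, the first-column value is available as lowest[0] and highest[0] without a second pass over the data.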

Something like this might work. The index starts at zero, so the third column is index 2.
# note: the values are compared as strings unless you convert them with float()
min_val = min([row.split(',')[2] for row in mylist])
max_val = max([row.split(',')[2] for row in mylist])
Separately, you could probably read and reformat your data to a list with the following:
with open('loanData.csv', 'r') as f:
    data = f.read()
mylist = data.splitlines()
This assumes that each row of data ends with a newline (\n). In text mode Python translates platform-specific line endings to \n when reading, so this should not depend on the OS you're using.

How to add one list's items to another list, one by one?

I have a csv file with some contents as shown below:
name,x,y
N1,30.2356,12.5263
N2,30.2452,12.5300
...and it goes on.
This is what I tried: I read them from the .csv and added them separately to different lists.
import csv
nn = []
xkoor = []
ykoor = []
coord = []
with open('C:/Users/Mert/Desktop/py/transformation/1.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        nn.append(row[0].split(','))
        xkoor.append(row[1].split(','))
        ykoor.append(row[2].split(','))
j = 1
for i in range(len(xkoor)):
    for j in range(len(ykoor)):
I'm trying to make a list as:
coord = [30.2356,12.5263],[30.2452,12.5300],....
and I couldn't understand how to do it. Any ideas?

The csv reader splits rows on commas for you by default:
import csv
with open('somefile.csv') as fh:
    reader = csv.reader(fh)
    for row in reader:
        print(row)
# outputs
['name', 'x', 'y']
['N1', '30.2356', '12.5263']
['N2', '30.2452', '12.5300']
With this in mind, if you are just looking to loop over coords, you can use unpacking to get your x and y, then build your list by appending tuples:
import csv
coords = []
with open('somefile.csv') as fh:
    reader = csv.reader(fh)
    next(reader)  # skips the headers
    for row in reader:
        name, x, y = row
        coords.append((float(x), float(y)))
# then you can iterate over that list like so
for x, y in coords:
    pass  # do something with x and y
Coords will then look like:
[(30.2356, 12.5263), (30.2452, 12.53)]

You should not split the strings by commas yourself, since csv.reader already does it for you. Simply iterate over the csv.reader object and unpack the columns as desired:
reader = csv.reader(f)
next(reader)
coord = [[float(x), float(y)] for _, x, y in reader]

Seems like you're over-complicating things.
If all you're trying to do is create a list of coordinates containing only the X and Y values, this is how you would accomplish that:
import csv
coord = []
with open('C:/Users/Mert/Desktop/py/transformation/1.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # csv.reader already returns each row as a list of fields
        coord.append(row[1:3])
print(coord)
All you need to do is take a slice of each row and append it to your coord list. No need to call split on each field, or to create separate lists for each axis.
K.I.S.S!
(Also, a word of advice - keep PII out of your questions. No need to use your whole windows file path, just indicate that it's a CSV file. I didn't need to know your name to answer the question!)

Why not pandas?!
read_csv will read your file and convert it to a DataFrame
iterate over the rows and access columns x and y
combine them into a list of lists
and it is easier to use
import pandas as pd
df = pd.read_csv('1.csv', header=0)
[[r.x, r.y] for _, r in df.iterrows()]
Result:
[[30.2356, 12.5263], [30.2452, 12.53]]

I'd go about it something like this:
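import csv
# coordinates as strings
with open('some.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    cord = [a for _, *a in reader]
# coordinates as floats
with open('some.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, which float() cannot convert
    cord = [[float(x), float(y)] for _, x, y in reader]
for xy in cord:
    print(xy)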

If you are into oneliners:
data = """name,x,y
N1,30.2356,12.5263
N2,30.2452,12.5300"""
coords = [[x,y]
          for line in data.split("\n")[1:]
          for _,x,y in [line.split(",")]]
print(coords)
This yields
[['30.2356', '12.5263'], ['30.2452', '12.5300']]

Dealing with strings amidst int in csv file, None value

I'm reading in data from a csv file where some of the values are "None". The values that are being read in are then contained in a list.
The list is then passed to a function which requires all values within the list to be in int() format.
However, I can't apply this while the "None" string value is present. I've tried replacing the "None" with None, or with "", but that hasn't worked; it results in an error. The data in the list also needs to stay in the same position, so I can't just ignore it altogether.
I could replace all "None" with 0 but None != 0 really.
EDIT: I've added my code so hopefully it'll make a bit more sense. Trying to create a line chart from data in csv file:
import csv
import sys
from collections import Counter
import pygal
from pygal.style import LightSolarizedStyle
from operator import itemgetter
# Read in file to data variable and set header variable
filename = sys.argv[1]
data = []
with open(filename) as f:
    reader = csv.reader(f)
    header = next(reader)
    data = [row for row in reader]
# count rows in spreadsheet (minus header)
row_count = (sum(1 for row in data)) - 1
# extract headers which I want to use
headerlist = []
for x in header[1:]:
    headerlist.append(x)
# initialise line chart in module pygal. set style, title, and x axis labels using headerlist variable
line_chart = pygal.Line(style=LightSolarizedStyle)
line_chart.title = 'Browser usage evolution (in %)'
line_chart.x_labels = map(str, headerlist)
# create lists for data from spreadsheet to be put in to
empty1 = []
empty2 = []
# select which data I want from spreadsheet
for dataline in data:
    empty1.append(dataline[0])
    empty2.append(dataline[1:-1])
# DO SOMETHING TO "NONE" VALUES IN EMPTY TWO SO THEY CAN BE PASSED TO INT CONVERTER ASSIGNED TO EMPTY 3
# convert all items in the lists, that are in the list of empty two to int
empty3 = [[int(x) for x in sublist] for sublist in empty2]
# add data to chart line by line
count = -1
for dataline in data:
    while count < row_count:
        count += 1
        # function that only takes int data
        line_chart.add(empty1[count], [x for x in empty3[count]])
line_chart.render_to_file("browser.svg")
There will be a lot of inefficiencies or weird ways of doing things; I'm trying to learn slowly.
With all the Nones set as 0, the above script gives a chart, but this doesn't really reflect that Chrome did not exist before a certain date. Thanks

Without seeing your code, I can only offer limited help.
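It sounds like you need to use ast.literal_eval().
import ast
import csv
values = []
with open(filename) as f:  # filename: path to your CSV file
    csvread = csv.reader(f)
    for row in csvread:
        # literal_eval turns "None" into None and "42" into 42
        values.append(ast.literal_eval(row[0]))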
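If the fields are only ever integers or the literal text "None", a plain conditional conversion keeps every value in its original position without evaluating arbitrary literals. A minimal sketch; the helper name to_int_or_none and the sample rows are made up, and whether the charting call accepts None as a missing point is an assumption to check against its documentation:

```python
def to_int_or_none(value):
    """Convert a CSV field to int; map "None" or an empty field to None."""
    value = value.strip()
    if value in ("None", ""):
        return None
    return int(value)

# Hypothetical rows in the shape of empty2 from the question
empty2 = [["10", "None", "30"], ["None", "5", "7"]]
empty3 = [[to_int_or_none(x) for x in sublist] for sublist in empty2]
print(empty3)
```

This replaces the list comprehension that builds empty3 in the question, so the rest of the script can stay as it is.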
