How to convert csv to multiple arrays without pandas? - python

I have a csv file like this:
student_id,event_id,score
1,1,20
3,1,20
4,1,18
5,1,13
6,1,18
7,1,14
8,1,14
9,1,11
10,1,19
...
and I need to convert it into multiple arrays/lists like I did using pandas here:
scores = pd.read_csv("/content/score.csv", encoding='utf-8',
                     index_col=[])
student_id = scores['student_id'].values
event_id = scores['event_id'].values
score = scores['score'].values
print(scores.head())
As you can see, I get three arrays, which I need in order to run the data analysis. How can I do this with Python's csv library? I have to do it without pandas. Also, how can I export data from multiple new arrays into a csv file when I am done with this data? Again, I used pandas to do this:
avg = avgScore
max = maxScore
min = minScore
sum = sumScore
id = student_id_data
dict = {'avg(score)': avg, 'max(score)': max, 'min(score)': min, 'sum(score)': sum, 'student_id': id}
df = pd.DataFrame(dict)
df.to_csv(r'/content/AnalyzedData.csv', index=False)
Those first 5 are arrays if you are wondering.

Here's a partial answer which will produce a separate list for each column in the CSV file.
import csv

csv_filepath = "score.csv"

with open(csv_filepath, "r", newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    lists = {column: [] for column in columns}  # Lists for each column.
    for row in reader:
        for column in columns:
            lists[column].append(int(row[column]))

for column_name, column in lists.items():
    print(f'{column_name}: {column}')
Sample output:
student_id: [1, 3, 4, 5, 6, 7, 8, 9, 10]
event_id: [1, 1, 1, 1, 1, 1, 1, 1, 1]
score: [20, 20, 18, 13, 18, 14, 14, 11, 19]
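If you want the same three variable names you had with pandas, you can simply unpack the dictionary afterwards (a small sketch building on the lists dictionary above):
student_id = lists['student_id']  # Plain Python lists, one per column.
event_id = lists['event_id']
score = lists['score']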
You also asked how to do the reverse of this. Here's an example I hope is self-explanatory:
# Dummy sample analysis data
length = len(lists['student_id'])
avgScore = list(range(length))
maxScore = list(range(length))
minScore = list(range(length))
sumScore = list(range(length))
student_ids = lists['student_id']

csv_output_filepath = 'analysis.csv'
fieldnames = ('avg(score)', 'max(score)', 'min(score)', 'sum(score)', 'student_id')

with open(csv_output_filepath, 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames)
    writer.writeheader()
    for values in zip(avgScore, maxScore, minScore, sumScore, student_ids):
        row = dict(zip(fieldnames, values))  # Combine into a dictionary.
        writer.writerow(row)
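Since the columns are already equal-length lists, a plain csv.writer would also work in fewer lines (a sketch assuming the same fieldnames and lists as above):
with open(csv_output_filepath, 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(fieldnames)  # Header row.
    writer.writerows(zip(avgScore, maxScore, minScore, sumScore, student_ids))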

What you want to do does not require the csv module; it's just three lines of code (one of them admittedly dense):
splitted_lines = (line.strip().split(',') for line in open('/path/to/your/data.csv'))
labels = next(splitted_lines)
arr = dict(zip(labels, zip(*((int(i) for i in ii) for ii in splitted_lines))))
splitted_lines is a generator that iterates over your data file one line at a time and yields a list with the three (in your example) items in each line, line by line.
next(splitted_lines) returns the list with the (split) content of the first line, that is, our three labels.
We fit our data into a dictionary; using the class init method (i.e., by invoking dict) it is possible to initialize it from a generator of 2-tuples, here the value of a zip:
zip's 1st argument is labels, so the keys of the dictionary will be the labels of the columns;
the 2nd argument is the result of evaluating an inner zip, used here because zipping the starred form of a sequence of sequences has the effect of transposing it... so the value associated with each key will be the transpose of what follows the *;
what follows the * is simply (the generator equivalent of) a list of lists with (in your example) 9 rows of three integer values, so that
the second argument to the 1st zip is consequently a sequence of three sequences of nine integers, which are then coupled to the corresponding three keys/labels.
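As a quick illustration of the zip-transpose idiom (a minimal sketch, not part of the original answer):
>>> rows = [[1, 1, 20], [3, 1, 20], [4, 1, 18]]
>>> list(zip(*rows))  # Columns become rows: the transpose.
[(1, 3, 4), (1, 1, 1), (20, 20, 18)]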
Here is an example of using the data collected by the previous three lines of code:
In [119]: print("\n".join("%15s:%s"%(l,','.join("%3d"%i for i in arr[l])) for l in labels))
...:
student_id: 1, 3, 4, 5, 6, 7, 8, 9, 10
event_id: 1, 1, 1, 1, 1, 1, 1, 1, 1
score: 20, 20, 18, 13, 18, 14, 14, 11, 19
In [120]: print(*arr['score'])
20 20 18 13 18 14 14 11 19
PS: If the question were about an assignment in a sort of Python 101 course, it's unlikely that my solution would be deemed acceptable.

Related

How to automate the process to select the clusters using the labels

So I'm new to using Python and I'm working on the analysis of some data. I'm using an extremely manual process to find the clusters. First I get the labels using the method from the library:
labels = optics_model.labels_[optics_model.ordering_]
then I use np.argwhere to find the index values that have that label:
cluster_0 = np.argwhere(labels == 0)
then I print this data, use another site to clean it, and use it to select from the dataframe the rows that belong to this cluster:
index_0 = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
cluster_0 = df.iloc[index_0]
Can someone help me automate this process?
So after some looking and testing, I made it work by adding a column with the labels to the dataframe:
df_copy = df.assign(labels=labels)
then I calculated the number of clusters using this:
max = 0
for i in range(len(labels)):
    if max < labels[i]:
        max = labels[i]
then I made the necessary number of empty dataframes:
cluster = {}
for i in range(max):
    cluster[i] = pd.DataFrame()
then I just copy the data I want from the dataframe:
for i in range(0, max):
    cluster[i] = df_copy.loc[df_copy['labels'] == i]
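A more compact variant of the same idea (a sketch, not from the original answers) is to let pandas' groupby split df_copy by label in one pass; this avoids counting clusters by hand (note that range(max) stops before the cluster numbered max) and lets you skip the OPTICS noise label -1:
import pandas as pd

df_copy = df.assign(labels=labels)
clusters = {
    label: group
    for label, group in df_copy.groupby('labels')
    if label != -1  # scikit-learn's OPTICS marks noise points with -1.
}
# clusters[0] now holds the same rows as df_copy.loc[df_copy['labels'] == 0].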

Mismatched Names to Values from lists in written csv file

I'm trying to write out a CSV file of average readings, maximum readings and outlier readings. I have the readings, but they are not in order. Here is my output when I try to print the outlier readings, printing the sensor name beforehand. This happens for every reading type in the functions I have made. I would like it to print from 1 to 25, instead of starting at 22 and then coming out in a seemingly random order.
sensor_22
[5]
sensor_23
[5, 6]
sensor_20
[5, 6, 5]
sensor_21
[5, 6, 5, 1]
sensor_24
[5, 6, 5, 1, 1]
sensor_25
[5, 6, 5, 1, 1, 6]
sensor_9
[5, 6, 5, 1, 1, 6, 8]
sensor_8
[5, 6, 5, 1, 1, 6, 8, 5]
sensor_3
[5, 6, 5, 1, 1, 6, 8, 5, 0]
sensor_2
[5, 6, 5, 1, 1, 6, 8, 5, 0, 6]
sensor_1
[5, 6, 5, 1, 1, 6, 8, 5, 0, 6, 9]...etc
The values are mismatched to the sensor name in my final CSV file.
Here is some CSV file data. It is a fairly large file.
sensor_1,2018-01-02,115
sensor_1,2018-01-03,51
sensor_1,2018-01-04,30
sensor_1,2018-01-05,198
Here is my current code. I cannot use pandas, numpy, etc.
import csv

options = []
options_readings = {}
max_readings = []
avg_readings = []
outlier_readings = []

with open('sensor_data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        if row[0] not in options:
            options.append(row[0])
            options_readings[row[0]] = []
        sensor_name = row[0]
        datastamp = row[1]
        readings = float(row[2])
        options_readings[row[0]].append(readings)

def calculateNumberOfOutlierReadings():
    for option in options_readings:
        print(option)
        readings = options_readings[option]
        count = 0
        for row in readings:
            if (row > 180 or row < 0):
                count += 1
        outlier_readings.append(count)
        print(outlier_readings)

def calculateMaxReadings():
    for options in options_readings:
        max_readings.append(max(options_readings[options]))

def calculateAverageReadings():
    for option in options_readings:
        readings = options_readings[option]
        avg_readings.append(sum(readings)/len(readings))

calculateMaxReadings()
calculateAverageReadings()
calculateNumberOfOutlierReadings()

zip(options, avg_readings, max_readings, outlier_readings)

with open('output.csv', 'wb') as out:
    writer = csv.writer(out, delimiter=',')
    header = ["Sensor Name", "Average Reading", "Maximum Reading", "Number of Outlier Readings"]
    writer.writerow([i for i in header])
    writer.writerows(zip(options, avg_readings, max_readings, outlier_readings))
I believe my issue is using options_readings as a dictionary and trying to sort those keys. Any help would be appreciated.
At the end of your solution, sort the data before you write it to a file.
....
data = sorted(zip(options, avg_readings, max_readings, outlier_readings))
with open('output.csv', 'w') as out:
    writer = csv.writer(out, delimiter=',')
    header = ["Sensor Name", "Average Reading", "Maximum Reading", "Number of Outlier Readings"]
    writer.writerow([i for i in header])
    writer.writerows(data)
Suggested improvements:
Rewrite the functions so that they return a result. Many (most?) people prefer functions to return value(s) instead of mutating things external to the function.
def calculateNumberOfOutlierReadings(readings):
    count = 0
    for reading in readings:
        if reading < 0 or reading > 180:
            count += 1
    return count

def calculateMaxReadings(readings):
    return max(readings)

def calculateAverageReadings(readings):
    return sum(readings) / len(readings)
Keep all the data from the csv in a dictionary - {sensor_name:[value,value,...]}.
import collections

data = collections.defaultdict(list)
with open('foo.csv') as f:
    reader = csv.reader(f)
    for (sensor, date, value) in reader:
        data[sensor].append(float(value))
Iterate over the data; perform calcs on each item's values; put the calcs in a tuple using the sensor name for the first item; store the tuple in a list.
results = []
for sensor, readings in data.items():
    number_of_outliers = calculateNumberOfOutlierReadings(readings)
    maximum = calculateMaxReadings(readings)
    average = calculateAverageReadings(readings)
    results.append((sensor, average, maximum, number_of_outliers))
Sort the results
results.sort()
For a csv that looks like
sensor_4,2018-01-05,198
sensor_2,2018-01-02,115
sensor_4,2018-01-03,51
sensor_2,2018-01-04,30
sensor_1,2018-01-02,115
sensor_4,2018-01-04,30
sensor_2,2018-01-05,198
sensor_1,2018-01-04,30
sensor_3,2018-01-04,30
sensor_1,2018-01-05,198
sensor_4,2018-01-02,115
sensor_2,2018-01-05,198
sensor_3,2018-01-02,115
sensor_1,2018-01-03,51
sensor_3,2018-01-03,51
sensor_2,2018-01-03,51
sensor_3,2018-01-05,198
results ends up being in the order you want:
In [16]: results
Out[16]:
[('sensor_1', 98.5, 198.0, 1),
('sensor_2', 118.4, 198.0, 2),
('sensor_3', 98.5, 198.0, 1),
('sensor_4', 98.5, 198.0, 1)]
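One caveat: a plain sort() on the sensor-name strings is lexicographic, so with 25 sensors 'sensor_10' would sort before 'sensor_2'. If you need true 1-to-25 ordering, a numeric sort key is a small addition (a sketch, assuming the names always end in the sensor number):
results.sort(key=lambda row: int(row[0].split('_')[1]))  # Sort by the number after the underscore.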

creating nested dictionary to loop over my text files and folders to create multiple key dictionary

I have counts.txt files in 50 folders, each related to one sample. There are two columns in counts.txt: the first one is a string, and the other is a number. I am trying to make a nested dictionary from them. The goal is to use the first column of counts.txt and the folder names as keys of the dictionary, and the second column of counts.txt as the value. Unfortunately, the loop over the list of folders that should give me the proper shape is not working, and I am facing a problem!
import os
from natsort import natsorted

path1 = "/home/ali/Desktop/SAMPLES/"
data_ali = {}
samples_name = natsorted(os.listdir(path1))
for i in samples_name:
    with open(path1+i[0:]+"/counts.txt", "rt") as fin:
        for l in fin.readlines():
            l = l.strip().split()
            if l[0][:4] == 'ENSG':
                gene = l[0]
                data_ali[gene] = {}
                reads = int(l[1])
                data_ali[gene][samples_name] = reads
print(data_ali)
I expect the output to look like this:
'ENSG00000120659': {
'Sample_1-Leish_011_v2': 14,
'Sample_2-leish_011_v3': 7,
'Sample_3-leish_012_v2': 6,
'Sample_4-leish_012_v3': 1,
'Sample_5-leish_015_v2': 9,
'Sample_6-leish_015_v3': 3,
'Sample_7-leish_016_v2': 4,
'Sample_8-leish_016_v3': 8,
'Sample_9-leish_017_v2': 8,
'Sample_10-leish_017_v3': 2,
'Sample_11-leish_018_v2': 4,
'Sample_12-leish_018_v3': 4,
'Sample_13-leish_019_v2': 7,
'Sample_14-leish_019_v3': 4,
'Sample_15-leish_021_v2': 12,
'Sample_16-leish_021_v3': 5,
'Sample_17-leish_022_v2': 4,
'Sample_18-leish_022_v3': 2,
'Sample_19-leish_023_v2': 9,
'Sample_20-leish_023_v3': 6,
'Sample_21-leish_024_v2': 22,
'Sample_22-leish_024_v3': 10,
'Sample_23-leish026_v2': 9,
'Sample_24-leish026_v3': 5,
'Sample_25-leish027_v2': 4,
'Sample_26-leish027_v3': 1,
'Sample_27-leish028_v2': 7,
'Sample_28-leish028_v3': 5,
'Sample_29-leish032_v2': 8,
'Sample_30-leish032_v3': 2
}
Try this:
if l[0][:4] == 'ENSG':
    gene = l[0]
    reads = int(l[1])
    data_ali.setdefault(gene, {})[i] = reads
Two important changes:
Your code data_ali[gene]={} always cleared what was previously there and made a new empty dictionary instead. setdefault only creates the dictionary if the key gene is not already present.
The second key should be i, not the list samples_name.
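To make the setdefault difference concrete, here is a tiny sketch (illustrative keys, not from the original answer):
counts = {}
counts.setdefault('ENSG00000120659', {})['Sample_1'] = 14  # Creates the inner dict.
counts.setdefault('ENSG00000120659', {})['Sample_2'] = 7   # Reuses it instead of replacing it.
print(counts)  # {'ENSG00000120659': {'Sample_1': 14, 'Sample_2': 7}}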
Full code cleanup:
import os
from natsort import natsorted

root = "/home/ali/Desktop/SAMPLES/"
data_ali = {}
for sample_name in natsorted(os.listdir(root)):
    with open(os.path.join(root, sample_name, "counts.txt"), "r") as fin:
        for line in fin.readlines():
            gene, reads = line.split()
            reads = int(reads)
            if gene.startswith('ENSG'):
                data_ali.setdefault(gene, {})[sample_name] = reads
print(data_ali)
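If you prefer, collections.defaultdict does the same job as setdefault without repeating the default on every call (a sketch under the same assumptions as the cleanup above):
import os
from collections import defaultdict
from natsort import natsorted

root = "/home/ali/Desktop/SAMPLES/"
data_ali = defaultdict(dict)  # A missing gene key gets an empty inner dict automatically.
for sample_name in natsorted(os.listdir(root)):
    with open(os.path.join(root, sample_name, "counts.txt")) as fin:
        for line in fin:
            gene, reads = line.split()
            if gene.startswith('ENSG'):
                data_ali[gene][sample_name] = int(reads)
print(dict(data_ali))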

Python make second array from first

I'm working with python and I read out a column from a CSV file. I save the values in an array by grouping them. This array looks something like this:
[1, 5, 10, 15, 7, 3]
I want to create a second array where each element is the sum of that element and all the previous values. So in this case I would like to have the following output:
[1, 6, 16, 31, 38, 41]
My code is as follows:
import csv
import itertools

with open("c:/test", 'rb') as f:
    reader = csv.reader(f, delimiter=';')
    data1 = []
    data2 = []
    for column in reader:
        data1.append(column[2])

results = data1
results = [int(i) for i in results]
results = [len(list(v)) for _, v in itertools.groupby(results)]
print results

data2.append(results[0])
data2.append(results[0]+results[1])
data2.append(results[0]+results[1]+results[2])
print data2
So I can make the array by doing it manually, but this costs a lot of time and is probably not the best way to do it. So what is the best way to do something like this?
You are looking for the cumulative sum of a list. The easiest way is to let numpy do it.
>>> import numpy as np
>>> np.cumsum([1, 5, 10, 15, 7, 3])
array([ 1, 6, 16, 31, 38, 41])
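If you would rather stay in the standard library (you already import itertools), itertools.accumulate gives the same cumulative sum; a minimal sketch:
from itertools import accumulate

data = [1, 5, 10, 15, 7, 3]
print(list(accumulate(data)))  # [1, 6, 16, 31, 38, 41]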
a = [1, 5, 10, 15, 7, 3]
b = [a[0]]
for i in range(1, len(a)):
    b.append(b[-1] + a[i])
a is your column from the .csv. b is a list that already has one value in it, which is the first item of a. Then we loop through a starting from its second item, add each value to the last item of b, and append the result to b.
Using your code's objects, what you are looking for would be something like:
from __future__ import print_function
import csv
import itertools

"""
with open("c:/test", 'rb') as f:
    reader = csv.reader(f, delimiter=';')
    for column in reader:
        data1.append(column[2])
"""

data1 = [1, 5, 10, 15, 7, 3]
results = [data1[0]]
for i in range(1, len(data1)):
    results.append(results[i-1] + data1[i])
print(data1, results)

Python csv: merge rows with same field

I am attempting to merge several rows of csv data into one long row, given two cells contain the same data. For instance, take the following csv:
one, two, three
1, 2, 3
4, 5, 6
7, 8, 9
1, 1, 1
4, 4, 4
If two rows share the same value at row[0], I want the second row appended to the first. So my end product should look like this:
one, two, three
1, 2, 3, 1, 1, 1
4, 5, 6, 4, 4, 4
7, 8, 9
Here is my attempt so far:
import csv

uniqueNum = []
uniqueMaster = []
count = -1

with open("Test.csv", "rb") as source:
    reader = csv.reader(source)
    header = next(reader)
    for row in reader:
        if row[0] not in uniqueNum:
            uniqueMaster.append(row)
            uniqueNum.append(row[0])
            count = count + 1
        for row in reader:
            if row[0] in uniqueNum:
                uniqueMaster[count].append(row)

with open("holding.csv", "wb") as result:
    writer = csv.writer(result)
    writer.writerow(header)
    for row in uniqueMaster:
        writer.writerow(row)
Things LOOK ok to me, but my script only outputs the following:
one, two, three
1, 2, 3, ['1', '1', '1']
This is obviously wrong for two reasons. First, it doesn't iterate through the entire csv, and second, the appended values are being squeezed into one cell, rather than individual cells. If anyone had any advice on getting this to work right I'd highly appreciate it!
Use a dictionary instead. Starting from the middle of your code (assume I have declared a dict called my_dict):
for row in reader:
    if row[0] in my_dict.keys():
        my_dict[row[0]].extend(row)
    else:
        my_dict[row[0]] = row

# ...now we are at the bottom of your code, writing to the csv
for v in my_dict.values():
    writer.writerow(v)
import csv

csv_dict = {}

with open("Test.csv", "r") as source:
    reader = csv.reader(source)
    header = next(reader)
    for row in reader:
        if row[0] in csv_dict:
            csv_dict[row[0]] += row
        else:
            csv_dict[row[0]] = row
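This second snippet stops after building csv_dict; writing it back out follows the same pattern as in your question (a short sketch, assuming Python 3 and the same holding.csv output file):
with open("holding.csv", "w", newline="") as result:
    writer = csv.writer(result)
    writer.writerow(header)
    for row in csv_dict.values():  # One merged row per unique first-column value.
        writer.writerow(row)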
