I'm working with python and I read out a column from a CSV file. I save the values in an array by grouping them. This array looks something like this:
[1, 5, 10, 15, 7, 3]
I want to create a second array in which each element is the sum of the corresponding element of the first array and all the previous ones. So in this case I would like to have the following output:
[1, 6, 16, 31, 38, 41]
My code is as follows:
import csv
import itertools
with open("c:/test", 'rb') as f:
    reader = csv.reader(f, delimiter=';')
    data1 = []
    data2 = []
    for column in reader:
        data1.append(column[2])
    results = data1
    results = [int(i) for i in results]
    results = [len(list(v)) for _, v in itertools.groupby(results)]
    print results
    data2.append(results[0])
    data2.append(results[0]+results[1])
    data2.append(results[0]+results[1]+results[2])
    print data2
So I can make the array by doing it manually, but this costs a lot of time and is probably not the best way to do it. So what is the best way to do something like this?
You are looking for the cumulative sum of a list. The easiest way is to let numpy do it.
>>> import numpy as np
>>> np.cumsum([1, 5, 10, 15, 7, 3])
array([ 1, 6, 16, 31, 38, 41])
a = [1, 5, 10, 15, 7, 3]
b = [a[0]]
for i in range(1, len(a)):
    b.append(b[-1] + a[i])
a is your column from the .csv file. b is a list that starts with one value in it, the first item of a. Then we loop through a starting from its second item, add each subsequent value to the last item of b, and append the result to b.
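If numpy is not available, the standard library can produce the same running total. A minimal sketch, assuming Python 3.2 or later (where itertools.accumulate exists):

```python
from itertools import accumulate

a = [1, 5, 10, 15, 7, 3]

# accumulate() yields the running totals lazily; list() collects them.
b = list(accumulate(a))
print(b)  # [1, 6, 16, 31, 38, 41]
```

This does the same work as the explicit loop, without the manual indexing.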
Using your code objects, what you look for would be something like:
from __future__ import print_function
import csv
import itertools
"""
with open("c:/test", 'rb') as f:
    reader = csv.reader(f, delimiter=';')
    for column in reader:
        data1.append(column[2])
"""
data1 = [1, 5, 10, 15, 7, 3]
results = [data1[0]]
for i in range(1, len(data1)):
    results.append(results[i-1] + data1[i])
print(data1, results)
I have a CSV file like this:
student_id,event_id,score
1,1,20
3,1,20
4,1,18
5,1,13
6,1,18
7,1,14
8,1,14
9,1,11
10,1,19
...
and I need to convert it into multiple arrays/lists like I did using pandas here:
scores = pd.read_csv("/content/score.csv", encoding = 'utf-8',
index_col = [])
student_id = scores['student_id'].values
event_id = scores['event_id'].values
score = scores['score'].values
print(scores.head())
As you can see, I get three arrays, which I need in order to run the data analysis. How can I do this using Python's csv module? I have to do this without pandas. Also, how can I export data from multiple new arrays into a CSV file when I am done with this data? Again, I used pandas to do this:
avg = avgScore
max = maxScore
min = minScore
sum = sumScore
id = student_id_data
dict = {'avg(score)': avg, 'max(score)': max, 'min(score)': min, 'sum(score)': sum, 'student_id': id}
df = pd.DataFrame(dict)
df.to_csv(r'/content/AnalyzedData.csv', index=False)
Those first 5 are arrays if you are wondering.
Here's a partial answer which will produce a separate list for each column in the CSV file.
import csv
csv_filepath = "score.csv"
with open(csv_filepath, "r", newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    lists = {column: [] for column in columns}  # Lists for each column.
    for row in reader:
        for column in columns:
            lists[column].append(int(row[column]))
    for column_name, column in lists.items():
        print(f'{column_name}: {column}')
Sample output:
student_id: [1, 3, 4, 5, 6, 7, 8, 9, 10]
event_id: [1, 1, 1, 1, 1, 1, 1, 1, 1]
score: [20, 20, 18, 13, 18, 14, 14, 11, 19]
You also asked how to do the reverse of this. Here's an example that I hope is self-explanatory:
# Dummy sample analysis data
length = len(lists['student_id'])
avgScore = list(range(length))
maxScore = list(range(length))
minScore = list(range(length))
sumScore = list(range(length))
student_ids = lists['student_id']
csv_output_filepath = 'analysis.csv'
fieldnames = ('avg(score)', 'max(score)', 'min(score)', 'sum(score)', 'student_id')
with open(csv_output_filepath, 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames)
    writer.writeheader()
    for values in zip(avgScore, maxScore, minScore, sumScore, student_ids):
        row = dict(zip(fieldnames, values))  # Combine into dictionary.
        writer.writerow(row)
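Because the column lists are already in the same order, the rows can also be produced with csv.writer and zip directly, without building a dictionary per row. A minimal sketch, assuming Python 3; it writes to an in-memory io.StringIO buffer instead of a file so it can be run anywhere, and the sample numbers are made up:

```python
import csv
import io

# Made-up analysis results, one entry per student (illustrative only).
avg_score = [15.0, 12.5, 18.0]
max_score = [20, 14, 19]
student_ids = [1, 3, 4]

buffer = io.StringIO()  # stands in for open('analysis.csv', 'w', newline='')
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['avg(score)', 'max(score)', 'student_id'])
# zip() pairs up the i-th item of every list, giving one row per student.
writer.writerows(zip(avg_score, max_score, student_ids))

print(buffer.getvalue())
```

When writing to a real file, open it with newline='' as in the example above so the csv module controls the line endings.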
What you want to do does not require the csv module; it's just three lines of code (one of them admittedly dense):
splitted_lines = (line.strip().split(',') for line in open('/path/to/you/data.csv'))
labels = next(splitted_lines)
arr = dict(zip(labels,zip(*((int(i) for i in ii) for ii in splitted_lines))))
splitted_lines is a generator that iterates over your data file one line at a time and yields, for each line, a list of its (in your example, three) items.
next(splitted_lines) returns the list with the split content of the first line, that is, our three labels.
We fit our data in a dictionary; by calling dict it is possible to initialize it from an iterable of 2-tuples, here the result of a zip:
zip's 1st argument is labels, so the keys of the dictionary will be the labels of the columns;
the 2nd argument is the result of an inner zip, used here because zipping the starred form of a sequence of sequences has the effect of transposing it, so the value associated with each key will be the transpose of what follows the *;
what follows the * is simply (the generator equivalent of) a list of lists with (in your example) 9 rows of three integer values, so that
the second argument to the 1st zip is a sequence of three sequences of nine integers, which are paired with the corresponding three keys/labels.
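The transposition step is easier to see on a small in-memory example. A minimal sketch, assuming Python 3 and using a short list of strings in place of the open file:

```python
# Three lines standing in for the file's contents (header plus two rows).
lines = ["student_id,event_id,score\n", "1,1,20\n", "3,1,20\n"]

split_lines = (line.strip().split(',') for line in lines)
labels = next(split_lines)  # ['student_id', 'event_id', 'score']

# Convert every field to int, row by row.
rows = [[int(field) for field in fields] for fields in split_lines]
# rows == [[1, 1, 20], [3, 1, 20]]

# zip(*rows) transposes: one tuple per column instead of one list per row.
columns = list(zip(*rows))
# columns == [(1, 3), (1, 1), (20, 20)]

arr = dict(zip(labels, columns))
print(arr)
```

Each key in arr now maps to a tuple holding that column's values.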
Here I have an example of using the data collected by the previous three lines of code
In [119]: print("\n".join("%15s:%s"%(l,','.join("%3d"%i for i in arr[l])) for l in labels))
...:
student_id: 1, 3, 4, 5, 6, 7, 8, 9, 10
event_id: 1, 1, 1, 1, 1, 1, 1, 1, 1
score: 20, 20, 18, 13, 18, 14, 14, 11, 19
In [120]: print(*arr['score'])
20 20 18 13 18 14 14 11 19
PS: If the question were an assignment for some sort of Python 101 course, it's unlikely that my solution would be deemed acceptable.
I'm trying to write out a CSV file by printing out an output of average readings, maximum readings and outlier readings. I have the readings, but they are not in order. Here is my output when I try to print out outlier readings, printing out the sensor name beforehand. This happens for every reading I try in the functions I have made. I would like it to print from 1 to 25, instead of starting at 22 and then becoming randomized.
sensor_22
[5]
sensor_23
[5, 6]
sensor_20
[5, 6, 5]
sensor_21
[5, 6, 5, 1]
sensor_24
[5, 6, 5, 1, 1]
sensor_25
[5, 6, 5, 1, 1, 6]
sensor_9
[5, 6, 5, 1, 1, 6, 8]
sensor_8
[5, 6, 5, 1, 1, 6, 8, 5]
sensor_3
[5, 6, 5, 1, 1, 6, 8, 5, 0]
sensor_2
[5, 6, 5, 1, 1, 6, 8, 5, 0, 6]
sensor_1
[5, 6, 5, 1, 1, 6, 8, 5, 0, 6, 9]...etc
The values are mismatched to the sensor name in my final CSV file.
Here is some CSV file data. It is a fairly large file.
sensor_1,2018-01-02,115
sensor_1,2018-01-03,51
sensor_1,2018-01-04,30
sensor_1,2018-01-05,198
Here is my current code. I cannot use pandas, numpy, etc.
import csv
options = []
options_readings = {}
max_readings= []
avg_readings = []
outlier_readings = []
with open('sensor_data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter = ',')
    for row in reader:
        if row[0] not in options:
            options.append(row[0])
            options_readings[row[0]] = []
        sensor_name = row[0]
        datastamp = row[1]
        readings = float(row[2])
        options_readings[row[0]].append(readings)
def calculateNumberOfOutlierReadings():
    for option in options_readings:
        print(option)
        readings = options_readings[option]
        count = 0
        for row in readings:
            if(row > 180 or row <0):
                count +=1
        outlier_readings.append(count)
        print(outlier_readings)
def calculateMaxReadings():
    for options in options_readings:
        max_readings.append(max(options_readings[options]))
def calculateAverageReadings():
    for option in options_readings:
        readings = options_readings[option]
        avg_readings.append(sum(readings)/len(readings))
calculateMaxReadings()
calculateAverageReadings()
calculateNumberOfOutlierReadings()
zip(options, avg_readings, max_readings, outlier_readings)
with open('output.csv', 'wb') as out:
    writer = csv.writer(out, delimiter = ',')
    header = ["Sensor Name", "Average Reading", "Maximum Reading", "Number of Outlier Readings"]
    writer.writerow([i for i in header])
    writer.writerows(zip(options, avg_readings, max_readings, outlier_readings))
I believe using options_readings as a dictionary and trying to sort those keys is my issue. Any help would be appreciated.
At the end of your solution, sort the data before you write it to a file.
....
data = sorted(zip(options, avg_readings, max_readings, outlier_readings) )
with open('output.csv', 'w') as out:
    writer = csv.writer(out, delimiter = ',')
    header = ["Sensor Name", "Average Reading", "Maximum Reading", "Number of Outlier Readings"]
    writer.writerow([i for i in header])
    writer.writerows(data)
Suggested improvements:
Rewrite the functions so that they return a result. Many (most?) people prefer functions to return value(s) instead of mutating things external to the function.
def calculateNumberOfOutlierReadings(readings):
    count = 0
    for reading in readings:
        if reading < 0 or reading > 180:
            count += 1
    return count
def calculateMaxReadings(readings):
    return max(readings)
def calculateAverageReadings(readings):
    return sum(readings) / len(readings)
Keep all the data from the csv in a dictionary - {sensor_name:[value,value,...]}.
import collections
import csv
data = collections.defaultdict(list)
with open('foo.csv') as f:
    reader = csv.reader(f)
    for (sensor,date,value) in reader:
        data[sensor].append(float(value))
Iterate over the data; perform calcs on each item's values; put the calcs in a tuple using the sensor name for the first item; store the tuple in a list.
results = []
for sensor,readings in data.items():
    number_of_outliers = calculateNumberOfOutlierReadings(readings)
    maximum = calculateMaxReadings(readings)
    average = calculateAverageReadings(readings)
    results.append((sensor,average,maximum,number_of_outliers))
Sort the results
results.sort()
For a csv that looks like
sensor_4,2018-01-05,198
sensor_2,2018-01-02,115
sensor_4,2018-01-03,51
sensor_2,2018-01-04,30
sensor_1,2018-01-02,115
sensor_4,2018-01-04,30
sensor_2,2018-01-05,198
sensor_1,2018-01-04,30
sensor_3,2018-01-04,30
sensor_1,2018-01-05,198
sensor_4,2018-01-02,115
sensor_2,2018-01-05,198
sensor_3,2018-01-02,115
sensor_1,2018-01-03,51
sensor_3,2018-01-03,51
sensor_2,2018-01-03,51
sensor_3,2018-01-05,198
results ends up being in the order you want:
In [16]: results
Out[16]:
[('sensor_1', 98.5, 198.0, 1),
('sensor_2', 118.4, 198.0, 2),
('sensor_3', 98.5, 198.0, 1),
('sensor_4', 98.5, 198.0, 1)]
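One caveat: sorting the tuples compares the sensor names as strings, so once there are ten or more sensors, 'sensor_10' sorts before 'sensor_2'. A minimal sketch of a numeric sort, assuming every name ends in an underscore followed by digits (sensor_number is a hypothetical helper, and the tuples are made-up sample values):

```python
def sensor_number(result):
    """Return the numeric suffix of the sensor name in a result tuple."""
    name = result[0]
    return int(name.rsplit('_', 1)[1])

# Made-up result tuples: (sensor, average, maximum, number_of_outliers).
results = [
    ('sensor_10', 90.0, 150.0, 0),
    ('sensor_2', 118.4, 198.0, 2),
    ('sensor_1', 98.5, 198.0, 1),
]

# Sort by the number after the underscore rather than by the raw string.
results.sort(key=sensor_number)
print([r[0] for r in results])  # ['sensor_1', 'sensor_2', 'sensor_10']
```

With 25 sensors, as in the question, this keeps the output in the order 1 to 25.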
I have counts.txt files in 50 folders, each of which corresponds to one sample. There are two columns in counts.txt: the first is a string and the second is a number. I am trying to build a nested dictionary from them. The goal is to use the first column of counts.txt and the folder names as the keys of the dictionary and the second column of counts.txt as the values. Unfortunately, the loop over the list of folders does not give me the structure I want, and I run into a problem!
import os
from natsort import natsorted
path1 = "/home/ali/Desktop/SAMPLES/"
data_ali = {}
samples_name=natsorted(os.listdir(path1))
for i in samples_name:
    with open(path1+i[0:]+"/counts.txt","rt") as fin:
        for l in fin.readlines():
            l=l.strip().split()
            if l[0][:4]=='ENSG':
                gene=l[0]
                data_ali[gene]={}
                reads=int(l[1])
                data_ali[gene][samples_name]=reads
print(data_ali)
I expect output like this:
'ENSG00000120659': {
'Sample_1-Leish_011_v2': 14,
'Sample_2-leish_011_v3': 7,
'Sample_3-leish_012_v2': 6,
'Sample_4-leish_012_v3': 1,
'Sample_5-leish_015_v2': 9,
'Sample_6-leish_015_v3': 3,
'Sample_7-leish_016_v2': 4,
'Sample_8-leish_016_v3': 8,
'Sample_9-leish_017_v2': 8,
'Sample_10-leish_017_v3': 2,
'Sample_11-leish_018_v2': 4,
'Sample_12-leish_018_v3': 4,
'Sample_13-leish_019_v2': 7,
'Sample_14-leish_019_v3': 4,
'Sample_15-leish_021_v2': 12,
'Sample_16-leish_021_v3': 5,
'Sample_17-leish_022_v2': 4,
'Sample_18-leish_022_v3': 2,
'Sample_19-leish_023_v2': 9,
'Sample_20-leish_023_v3': 6,
'Sample_21-leish_024_v2': 22,
'Sample_22-leish_024_v3': 10,
'Sample_23-leish026_v2': 9,
'Sample_24-leish026_v3': 5,
'Sample_25-leish027_v2': 4,
'Sample_26-leish027_v3': 1,
'Sample_27-leish028_v2': 7,
'Sample_28-leish028_v3': 5,
'Sample_29-leish032_v2': 8,
'Sample_30-leish032_v3': 2
}
Try this:
if l[0][:4] == 'ENSG':
    gene = l[0]
    reads = int(l[1])
    data_ali.setdefault(gene, {})[i] = reads
Two important changes:
Your code data_ali[gene]={} always cleared what was previously there and made a new empty dictionary instead. setdefault only creates the dictionary if the key gene is not already present.
The second key should be i, not the list samples_name.
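The difference between the two approaches can be seen on a tiny example. A minimal sketch with made-up gene and sample names:

```python
# Made-up (gene, sample, reads) records, illustrative only.
counts = [("ENSG0001", "Sample_1", 14), ("ENSG0001", "Sample_2", 7)]

# Plain assignment: the inner dict is recreated for every record,
# so only the last sample survives.
overwritten = {}
for gene, sample, reads in counts:
    overwritten[gene] = {}
    overwritten[gene][sample] = reads

# setdefault: the inner dict is created once and then reused.
kept = {}
for gene, sample, reads in counts:
    kept.setdefault(gene, {})[sample] = reads

print(overwritten)  # {'ENSG0001': {'Sample_2': 7}}
print(kept)         # {'ENSG0001': {'Sample_1': 14, 'Sample_2': 7}}
```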
Full code cleanup:
import os
from natsort import natsorted
root = "/home/ali/Desktop/SAMPLES/"
data_ali = {}
for sample_name in natsorted(os.listdir(root)):
    with open(os.path.join(root, sample_name, "counts.txt"), "r") as fin:
        for line in fin.readlines():
            gene, reads = line.split()
            reads = int(reads)
            if gene.startswith('ENSG'):
                data_ali.setdefault(gene, {})[sample_name] = reads
print(data_ali)
I would like to write a piece of code which calculates the sum of the elements in each row of a list and returns a new list of the row sums. For example
def row_sums(square):
    pass  # body to be written
square = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
]
print(row_sums(square))
This would give the output [10, 26, 42, 58], as the sum of the first row equals 10, the sum of the second row equals 26, and so on. However, I would like to NOT use the built-in sum function to do this. How would I go about doing this? Thanks in advance.
A simple piece of code for calculating the sum of the elements in each row of a list.
square = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
]
su = [sum(i) for i in square]
print(su)
Output:
[10, 26, 42, 58]
If you really can't use the sum() function, here is a function to sum the rows. I've written it explicitly to show the steps, but it would be worth looking at list comprehensions once you understand what is going on:
def row_sums(square):
    # list to store sums
    output = []
    # go through each row in square
    for row in square:
        # variable to store row total
        total = 0
        # go through each item in row and add to total
        for item in row:
            total += item
        # append the row's total to the output list
        output.append(total)
    # return the output list
    return output
This can then be used as such:
square = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
]
row_totals = row_sums(square)
EDIT:
In answer to your comment, I'd do something like this:
def sum_columns(square):
    # as before have a list to store the totals
    output = []
    # assuming your square will have the same row length for each row
    # get the number of columns
    num_of_columns = len(square[0])
    # iterate over the columns (range works in both Python 2 and 3)
    for i in range(0, num_of_columns):
        # store the total for the column (same as before)
        total = 0
        # for each row, get the value for the column and add to the column total
        # (value at index i)
        for row in square:
            total += row[i]
        # append the total to the output list
        output.append(total)
    # return the list of totals
    return output
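An alternative for the columns is to transpose the square with zip(*square) and reuse the same row-wise loop. A minimal sketch, assuming Python 3 and rows of equal length (sum_columns_zip is a hypothetical name):

```python
def sum_columns_zip(square):
    """Sum each column without calling the built-in sum()."""
    output = []
    # zip(*square) yields one tuple per column.
    for column in zip(*square):
        total = 0
        for item in column:
            total += item
        output.append(total)
    return output

square = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(sum_columns_zip(square))  # [28, 32, 36, 40]
```

Note that zip stops at the shortest row, so ragged input would be silently truncated rather than raising an error.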
Write your own sum function...
The functools module has a useful reduce function that you can use to write your own sum function. If you are comfortable with lambda functions, take a list such as:
lst = [0,1,2,3,4,5]
for which sum(lst) gives 15. Your own sum function built on reduce may look something like:
from functools import reduce
reduce(lambda x,y: x + y, lst)
which would also give 15. You should be able to write the rest yourself (i.e. apply it to each row and collect the results in another list).
You can also do it with comprehension and reduce:
[reduce(lambda x, y: x + y, item) for item in square]
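Wrapped up as the row_sums function from the question, this could look as follows. A minimal sketch, assuming Python 3 (where reduce must be imported from functools); operator.add is used in place of the lambda, and the initial value 0 makes an empty row sum to 0 instead of raising an error:

```python
from functools import reduce
from operator import add

def row_sums(square):
    """Return the sum of each row without calling the built-in sum()."""
    return [reduce(add, row, 0) for row in square]

square = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(row_sums(square))  # [10, 26, 42, 58]
```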
You can add these lines to your existing function:
result = []
for row in square:  # iterate through each row
    line = 0  # stores the total of the current row
    for num in row:  # go through every number in the row
        line += num  # add that number to the row's total
    result.append(line)  # append the row's total to the result list
print(result)  # finally, print the list of row totals
Well, at a certain point you have to use a sum function (either sum() or "+"), but you could use map like this:
list(map(sum, square))
I am attempting to take an RDD containing pairs of integer ranges, and transform it so that each pair has a third term which iterates through the possible values in the range. Basically, I've got this:
[[1,10], [11,20], [21,30]]
And I'd like to end up with this:
[[1,1,10], [2,1,10], [3,1,10], [4,1,10], [5,1,10]...]
The file I'd like to transform is very large, which is why I'm looking to do this with PySpark rather than just Python on a local machine (I've got a way to do it locally on a CSV file, but the process takes several hours given the file's size). So far, I've got this:
a = [[1,10], [11,20], [21,30]]
b = sc.parallelize(a)
c = b.map(lambda x: [range(x[0], x[1]+1), x[0], x[1]])
c.collect()
Which yields:
>>> c.collect()
[[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1, 10], [[11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 11, 20], [[21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 21, 30]]
I can't figure out what the next step needs to be from here, to iterate over the expanded range, and pair each of those with the range delimiters.
Any ideas?
EDIT 5/8/2017 3:00PM
The local Python technique that works on a CSV input is:
import csv
import gzip
csvfile_expanded = gzip.open(r'C:\output.csv', 'wb')
ranges_expanded = csv.writer(csvfile_expanded, delimiter=',', quotechar='"')
csvfile = open(r'C:\input.csv', 'rb')
ranges = csv.reader(csvfile, delimiter=',', quotechar='"')
for row in ranges:
    for i in range(int(row[0]),int(row[1])+1):
        ranges_expanded.writerow([i,row[0],row[1]])
The PySpark script I'm questioning begins with the CSV file already having been loaded into HDFS and cast as an RDD.
Try this:
c = b.flatMap(lambda x: ([y, x[0], x[1]] for y in xrange(x[0], x[1]+1)))
The flatMap() ensures that you get one output record per element of the range. Note also the outer ( ) in conjunction with the xrange: this is a generator expression that avoids materialising the entire range in the executor's memory.
Note: xrange() is Python 2. If you are running Python 3, use range().
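The effect of flatMap can be tried out locally without Spark. The following is only a plain-Python analogy, assuming Python 3: itertools.chain.from_iterable plays the role of the flattening step, and expand is a hypothetical name for the lambda above.

```python
from itertools import chain

def expand(pair):
    """Yield [value, start, end] for every value in the inclusive range."""
    start, end = pair
    return ([y, start, end] for y in range(start, end + 1))

# Small made-up ranges so the output stays short.
a = [[1, 3], [11, 12]]

# map() would give one generator per input pair; flattening them
# gives one record per value, which is what flatMap does on an RDD.
c = list(chain.from_iterable(expand(pair) for pair in a))
print(c)
# [[1, 1, 3], [2, 1, 3], [3, 1, 3], [11, 11, 12], [12, 11, 12]]
```

In PySpark itself the flattening is distributed across executors, so this sketch shows only the shape of the result, not the performance behaviour.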