Averaging columns in a text file with row and column headers - python

I'm new to the group, and to Python. I have a very specific type of input file that I'm working with: a text file with one header row of text. In addition, there is a column of text, which makes things more annoying. What I want to do is read in this file and then perform operations on the columns of numbers (like average, stdev, etc.), but reading in the file and parsing out the text column is giving me trouble.
I've played with many different approaches and got close, but figured I'd reach out to the group here. If this were MATLAB I'd have had it done hours ago. As of now, if I use fixed widths to define my columns, I think it will work, but there is likely a more efficient way to read in the lines and skip over the strings properly.
Here is the file format. As you can see, row one is the header, so it can be ignored.
Column 1 contains text.
postraw.txt
....I think I figured it out. My code is probably very crude, but it works for now:
CTlist = []
CLlist = []
CDlist = []
CMZlist = []
LDelist = []
loopout = {'a1': CTlist, 'a2': CLlist, 'a3': CDlist, 'a4': CMZlist, 'a5': LDelist}

# Specify number of header lines
headerlines = 1
# set initial line index to 0
i = 0

# begin loop to process input file, skipping any header lines
with open('post.out', 'r') as file:
    for row in file:
        if i > (headerlines - 1):
            rowvars = row.split()
            for j in range(2, len(rowvars)):  # j, not i, so the line counter isn't clobbered
                # print(rowvars[j])  # JUST A CHECK/DEBUG LINE
                loopout['a{0}'.format(j - 1)].append(float(rowvars[j]))
        i = i + 1
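For reference, a more compact sketch of the same job (assuming whitespace-delimited columns and a single header row, as in the sample file): the text column is dropped with a slice, and the numeric columns are collected and summarised with the statistics module.

```python
import statistics

def column_stats(path, headerlines=1):
    """Return (mean, stdev) pairs for each numeric column,
    skipping the leading text column."""
    columns = []
    with open(path) as f:
        for _ in range(headerlines):
            next(f)  # skip the header row(s)
        for line in f:
            fields = line.split()[1:]  # drop the leading text column
            if not columns:
                columns = [[] for _ in fields]
            for col, value in zip(columns, fields):
                col.append(float(value))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]
```

This avoids the manual line counter and the string-keyed dict of lists entirely.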


Related

How can I calculate the sum of the values in a field less than a certain value

I have a CSV file separated by commas. I need to read the file and determine the sum of the values in the [reading] field that are less than a certain value (say 406.2).
My code so far is as follows:
myfile = open('3517315a.csv', 'r')
myfilecount = 0
linecount = 0
firstline = True
for line in myfile:
    if firstline:
        firstline = False
        continue
    fields = line.split(',')
    linecount += 1
    count = int(fields[0])
    colour = str(fields[1])
    channels = int(fields[2])
    code = str(fields[3])
    correct = str(fields[4])
    reading = float(fields[5])
How can i set this condition?
Use np.genfromtxt to read the CSV.
import numpy as np

# data = np.genfromtxt('3517315a.csv', delimiter=',')
data = np.random.random(10).reshape(5, 2) * 600  # exemplary data, since I don't have your CSV

threshold = 406.2
print(np.sum(data * (data < threshold)))
I haven't tested this (I don't have example data or your file), but this should do it:
import numpy as np

# import data from file, give each column a name
data = np.genfromtxt('3517315a.csv', delimiter=',',
                     names=['count', 'colour', 'channels', 'code', 'correct', 'reading'])

# move to a normal array to make it easier to follow (not necessary)
readingdata = data['reading']

# find the values less than your limit (np.where()),
# extract only those values (readingdata[]),
# then sum the extracted values (np.sum())
total = np.sum(readingdata[np.where(readingdata < 406.2)])
You can write an iterator that extracts the reading field and casts it to a float. Wrap that in another iterator that tests your condition and sum the result.
import csv

with open('3517315a.csv', newline='') as fp:
    next(fp)  # discard the header row
    reading_sum = sum(reading for reading in
                      (float(row[5]) for row in csv.reader(fp))
                      if reading < 406.2)

Get number of rows from .csv file

I am writing a Python module where I read a .csv file with 2 columns and a random amount of rows. I then go through these rows until column 1 > x. At this point I need the data from the current row and the previous row to do some calculations.
Currently, I am using 'for i in range(rows)', but each csv file will have a different number of rows, so this won't work.
The code can be seen below:
rows = 73
for i in range(rows):
    c_level = Strapping_Table[Tank_Number][i, 0]   # Current level
    c_volume = Strapping_Table[Tank_Number][i, 1]  # Current volume
    if c_level > level:
        p_level = Strapping_Table[Tank_Number][i-1, 0]   # Previous level
        p_volume = Strapping_Table[Tank_Number][i-1, 1]  # Previous volume
        x = level - p_level  # Intermediate values
        if x < 0:
            x = 0
        y = c_level - p_level
        z = c_volume - p_volume
        volume = p_volume + ((x / y) * z)
        return volume
When playing around with arrays, I used:
for row in Tank_data:
    print row[c]  # print column c
    time.sleep(1)
This goes through all the rows, but I cannot access the previous row's data with this method.
I have thought about storing the previous row and current row on every loop, but before I do this I was wondering if there is a simple way to get the number of rows in a csv.
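Incidentally, the loop in the question is doing linear interpolation between the two strapping-table rows that bracket the target level. As a sketch (the names here are assumptions, not the real API), it can be factored into a standalone function, which also makes it easy to test:

```python
def interpolate_volume(table, level):
    """Linearly interpolate a volume from (level, volume) rows
    sorted by ascending level."""
    for i in range(1, len(table)):
        c_level, c_volume = table[i]
        if c_level > level:
            p_level, p_volume = table[i - 1]
            x = max(level - p_level, 0.0)  # clamp below-table levels to 0
            y = c_level - p_level
            z = c_volume - p_volume
            return p_volume + (x / y) * z
    return None  # level is above the table's range
```

Because the function iterates over the table itself, the row count never needs to be known in advance.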
Store the previous line
with open("myfile.txt", "r") as file:
    previous_line = next(file)
    for line in file:
        print(previous_line, line)
        previous_line = line
Or you can use it with generators
def prev_curr(file_name):
    with open(file_name, "r") as file:
        previous_line = next(file)
        for line in file:
            yield previous_line, line
            previous_line = line

# usage
for prev, curr in prev_curr("myfile"):
    do_your_thing()
You should use enumerate (note that at i == 0, tank_data[i-1] refers to the last row):
for i, row in enumerate(tank_data):
    print row[c], tank_data[i-1][c]
Since the number of rows in the csv is unknown until it's read, you'll have to do an initial pass through if you want to find the number of rows, e.g.:
numberOfRows = sum(1 for row in file)
However that would mean your code will read the csv twice, which if it's very big you may not want to do - the simple option of storing the previous row into a global variable each iteration may be the best option in that case.
An alternative route could be to read the file in and analyse it as e.g. a pandas DataFrame (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), but again this could be slow if your csv is very big.
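As a sketch of that pandas route (the file name and column names below are made up, since the original layout isn't shown), the row count comes straight from the DataFrame, and shift() gives access to the previous row without manual bookkeeping:

```python
import pandas as pd

def row_count_and_prev(path):
    """Return the number of data rows plus a shifted frame that holds
    each row's predecessor (NaN in the first row)."""
    # the column names are an assumption; adjust them to the real file
    df = pd.read_csv(path, names=['level', 'volume'])
    return len(df), df.shift(1)
```

The shifted frame lines up index-for-index with the original, so current and previous rows can be compared column-wise in one vectorised expression.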

Dealing with strings amidst int in csv file, None value

I'm reading in data from a csv file where some of the values are "None". The values that are read in are then contained in a list.
The list is then passed to a function which requires all values within the list to be in int() format.
However, I can't apply this with the "None" string value present. I've tried replacing "None" with None, or with "", but that results in an error. The data in the list also needs to stay in the same position, so I can't just ignore it altogether.
I could replace all "None" with 0, but None != 0 really.
EDIT: I've added my code so hopefully it'll make a bit more sense. Trying to create a line chart from data in csv file:
import csv
import sys
from collections import Counter
import pygal
from pygal.style import LightSolarizedStyle
from operator import itemgetter

# Read in file to data variable and set header variable
filename = sys.argv[1]
data = []
with open(filename) as f:
    reader = csv.reader(f)
    header = next(reader)
    data = [row for row in reader]

# count rows in spreadsheet (minus header)
row_count = (sum(1 for row in data)) - 1

# extract the headers which I want to use
headerlist = []
for x in header[1:]:
    headerlist.append(x)

# initialise line chart in module pygal; set style, title, and
# x-axis labels using the headerlist variable
line_chart = pygal.Line(style=LightSolarizedStyle)
line_chart.title = 'Browser usage evolution (in %)'
line_chart.x_labels = map(str, headerlist)

# create lists for the data from the spreadsheet to be put into
empty1 = []
empty2 = []

# select which data I want from the spreadsheet
for dataline in data:
    empty1.append(dataline[0])
    empty2.append(dataline[1:-1])

# DO SOMETHING TO "NONE" VALUES IN EMPTY2 SO THEY CAN BE PASSED TO
# THE INT CONVERSION ASSIGNED TO EMPTY3

# convert all items in the sublists of empty2 to int
empty3 = [[int(x) for x in sublist] for sublist in empty2]

# add data to the chart line by line
count = -1
for dataline in data:
    while count < row_count:
        count += 1
        # line_chart.add only takes int data
        line_chart.add(empty1[count], [x for x in empty3[count]])

line_chart.render_to_file("browser.svg")
There will be a lot of inefficiencies and weird ways of doing things; I'm slowly learning.
The above script gives the chart with all the Nones set as 0, but this doesn't really reflect that Chrome didn't exist before a certain date. Thanks
Without seeing your code, I can only offer limited help.
It sounds like you need to utilize ast.literal_eval(), which turns the string "None" into the Python value None:
import ast
import csv

csvread = csv.reader(file)
values = []
for row in csvread:
    values.append(ast.literal_eval(row[0]))
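Applied to the chart script above, a sketch of that conversion step (assuming each cell holds either a digit string or the literal string "None"): pygal accepts None in a data series and draws it as a gap, which reflects a browser not existing yet better than 0 does.

```python
def to_int_or_none(cell):
    """Convert a cell to int, mapping the string "None" to None."""
    return None if cell == 'None' else int(cell)

# sample rows shaped like empty2 in the question's script
empty2 = [['None', '10', '25'], ['5', 'None', '40']]
empty3 = [[to_int_or_none(x) for x in sublist] for sublist in empty2]
```

This keeps every value in its original position, so the series still line up with the x-axis labels.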

Selecting rows in a csv file and writing them to another csv file

I have a csv file with 2 columns (titles are value, image). The value list contains values in ascending order (0,25,30...), and the image list contains pathway to images (e.g. X.jpg). Total lines are 81 including the titles (that is, there are 80 values and 80 images)
What I want is to divide this list 4 ways. Basically the idea is to have a spread of pairs of images.
For the first group I took the image part of every two adjacent rows (2+3, 4+5, ...) and wrote them to a new csv file, with each image in a different column. Here's the code:
import csv
f = open('random_sorted.csv')
csv_f = csv.reader(f)
i = 0
prev = ""

# open csv file for writing
with open('first_group.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    for row in csv_f:
        if i % 2 == 0 and i != 0:
            # print prev + "," + row[1]
            csv_writer.writerow([prev] + [row[1]])
        else:
            prev = row[1]
        i = i + 1
Here's the output of this:
I want to keep the concept similar for the remaining 3 groups (write the paired images into a new csv file with two columns), but just increase the spread. That is, pair together every 5 rows (i.e. 2+7 etc.), every 7 rows (i.e. 2+9 etc.), and every 9 rows. I would love to get some directions on how to execute it. I got lucky with the first group (I had just learned about the remainder/divisor operators in the CodeAcademy course), but can't think of ideas for the other groups.
First collect all the rows in the csv file in a list:
with open('random_sorted.csv') as csvfile:
    csv_reader = csv.reader(csvfile)
    headers = next(csv_reader)
    rows = [row for row in csv_reader]
Then set your required step size (5, 7 or 9) and identify the rows on the basis of their index in the list of rows:
with open('first_group.csv', 'wb') as test_file:
    csv_writer = csv.writer(test_file)
    csv_writer.writerow(["image1"] + ["image2"])
    step_size = 7  # set step size here
    seen = set()   # here we remember images we've already seen
    for x in range(0, len(rows) - step_size):
        img1 = rows[x][1]
        img2 = rows[x + step_size][1]
        if not (img1 in seen or img2 in seen):
            csv_writer.writerow([img1, img2])
            seen.add(img1)
            seen.add(img2)
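To produce the remaining groups, the same logic can be wrapped in a function and called once per spread; the output file names and sample rows below are made up for illustration:

```python
import csv

def write_pairs(rows, step_size, out_name):
    """Pair each row's image with the image step_size rows later,
    skipping images that were already paired."""
    seen = set()
    with open(out_name, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(["image1", "image2"])
        for x in range(len(rows) - step_size):
            img1, img2 = rows[x][1], rows[x + step_size][1]
            if img1 not in seen and img2 not in seen:
                writer.writerow([img1, img2])
                seen.update((img1, img2))

# example rows shaped like the csv above: (value, image)
rows = [(str(v), '%d.jpg' % v) for v in range(12)]
for step in (5, 7, 9):  # one output file per spread
    write_pairs(rows, step, 'group_step_%d.csv' % step)
```

Each call produces its own two-column file, so the three spreads stay independent of one another.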

Matching line numbers with strings in a table

I have a file with a list of columns describing particular parameters:
size magnitude luminosity
I need only particular data (in particular lines and columns) from this file. So far I have code in Python where I have appended the necessary line numbers. I just need to know how I can match them to get the right string in the text file, along with just the variables in the columns (magnitude) and (luminosity). Any suggestions on how I could approach this?
Here is a sample of my code (#comments describe what I have done and what I want to do):
temp_ListMatch = (point[5]).strip()
if temp_ListMatch:
    ListMatchaddress = (point[5]).strip()
    ListMatchaddress = re.sub(r'\s', '_', ListMatchaddress)
    ListMatch_dirname = '/projects/XRB_Web/apmanuel/499/Lists/' + ListMatchaddress
    # print ListMatch_dirname + "\n"
    try:
        file5 = open(ListMatch_dirname, 'r')
    except IOError:
        print 'Cannot open: ' + ListMatch_dirname
    Optparline = []
    for line in file5:
        point5 = line.split()
        j = int(point5[1])
        Optparline.append(j)
    # Basically file5 contains the line numbers I need,
    # and I have appended these numbers to the list Optparline.
temp_others = (point[4]).strip()
if temp_others:
    othersaddress = (point[4]).strip()
    othersaddress = re.sub(r'\s', '_', othersaddress)
    othersbase_dirname = '/projects/XRB_Web/apmanuel/499/Lists/' + othersaddress
    try:
        file6 = open(othersbase_dirname, 'r')
    except IOError:
        print 'Cannot open: ' + othersbase_dirname
    gmag = []
    z = []
    rh = []
    gz = []
    for line in file6:
        point6 = line.split()
        f = float(point6[2])
        g = float(point6[4])
        h = float(point6[6])
        i = float(point6[9])
    # So now I have opened file6 where this list of data is, and have
    # identified the columns of elements that I need.
    # I only need the particular rows (given by line number)
    # with these elements chosen. That is where I'm stuck!
Load the whole data file in to a pandas DataFrame (assuming that the data file has a header from which we can get the column names)
import pandas as pd
df = pd.read_csv('/path/to/file')
Load the file of line numbers into a pandas Series (assuming there's one per line):
# squeeze = True makes the function return a series
row_numbers = pd.read_csv('/path/to/rows_file', squeeze = True)
Return only those rows which are in the row-number file, and only the columns magnitude and luminosity (this assumes that the first row is numbered 0):
relevant_rows = df.loc[row_numbers, ['magnitude', 'luminosity']]
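A self-contained version of the same selection, with a small made-up frame standing in for the data file:

```python
import pandas as pd

# hypothetical data, standing in for the real file's contents
df = pd.DataFrame({
    'size':       [1.0, 2.0, 3.0, 4.0],
    'magnitude':  [10.1, 11.2, 12.3, 13.4],
    'luminosity': [0.5, 0.6, 0.7, 0.8],
})

row_numbers = [1, 3]  # as if read from the line-number file
relevant = df.loc[row_numbers, ['magnitude', 'luminosity']]
```

With a default RangeIndex the row numbers double as index labels, so .loc selects rows and columns in a single step.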
