My csv file looks like this:
Test Number,Score
1,100 2,40 3,80 4,90.
I have been trying to figure out how to write code that ignores the header and the first column and focuses on the scores, because the assignment is to find the average of the test scores and print it as a float (for those particular numbers the output should be 77.5). I've looked online and found pieces that I think would work, but I'm getting errors every time. We're learning about read, readlines, split, rstrip and \n, if that helps! I'm sure the answer is simple, but I'm new to coding and I have no idea what I'm doing. Thank you!
def calculateTestAverage(fileName):
    myFile = open(fileName, "r")
    column = myFile.readline().rstrip("\n")
    for column in myFile:
        scoreColumn = column.split(",")
        print(scoreColumn[1])
This is my code so far. My professor wanted us to define a function and go from there using the material we learned in lecture. I'm stuck because it prints all the scores I need on separate lines, yet I am not able to sum them without getting an error. Thanks for all your help, but I don't think I can use any of the suggestions because we never went over them. If anyone has an idea of how to take those test scores that print out vertically as a column and sum them, that would help me a ton!
You can use the csv library. This code should do the work:
import csv
with open('csvfile.txt', 'r') as f:
    # delimiter=' ' assumes all the "number,score" pairs sit on one line, as shown in the question
    reader = csv.reader(f, delimiter=' ')
    next(reader)  # this line lets you skip the header line
    for row_number, row in enumerate(reader, start=1):
        total_score = 0.0
        for element in row:
            test_number, score = element.split(',')
            total_score += float(score)  # score is read as a string, so convert it first
        average_score = total_score / len(row)
        print("Average score for row #%d is: %.1f" % (row_number, average_score))
The output should look like this:
Average score for row #1 is: 77.5
I always approach this with a pandas data frame. Specifically the read_csv() function. You don’t need to ignore the header, just state that it is in row 0 (for example) and then also the same with the row labels.
So for example:
import pandas as pd
import numpy as np
df = pd.read_csv("filename", header=0, index_col=0)
scores=df.values
print(np.average(scores))
I will break it down for you.
Since you're dealing with .csv files, I recommend using the csv library. You can import it with:
import csv
Now we need to open() the file. One common way is to use with:
with open('test.csv') as file:
Which is a context manager that avoids having to close the file at the end. The other option is to open and close normally:
file = open('test.csv')
# Do your stuff here
file.close()
Now you need to wrap the opened file with csv.reader(), which allows you to read .csv files and do things with them:
csv_reader = csv.reader(file)
To skip the headers, you can use next():
next(csv_reader)
Now for the average calculation. One simple way is to keep two variables, score_sum and total: add each score to score_sum and add 1 to total for every row read. Here is an example snippet:
score_sum = 0
total = 0
for number, score in csv_reader:
    score_sum += int(score)
    total += 1
Here's how to do it with indexing also:
score_sum = 0
total = 0
for line in csv_reader:
    score_sum += int(line[1])
    total += 1
Now that we have our score and totals calculated, getting the average is simply:
score_sum / total
The above code combined will then result in an average of 77.5.
Of course, this all assumes that your .csv file is actually in this format:
Test Number,Score
1,100
2,40
3,80
4,90
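Putting the steps above together, here is a minimal sketch rather than a definitive solution. The function names average_with_csv and average_with_split are made up for illustration, and both assume one "number,score" pair per line with integer scores. The second function does the same job using only readline, rstrip and split, which the question says were covered in lecture.

```python
import csv


def average_with_csv(file_name):
    """Average the second column of a CSV file, skipping the header row."""
    score_sum = 0
    total = 0
    with open(file_name, newline="") as file:
        csv_reader = csv.reader(file)
        next(csv_reader)  # skip the "Test Number,Score" header
        for number, score in csv_reader:
            score_sum += int(score)
            total += 1
    return score_sum / total


def average_with_split(file_name):
    """Same calculation using only readline, rstrip and split."""
    score_sum = 0
    total = 0
    with open(file_name) as file:
        file.readline()  # read and discard the header line
        for line in file:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            columns = line.split(",")
            score_sum += int(columns[1])  # the score is the second field
            total += 1
    return score_sum / total
```

Both functions return the average instead of printing it, so the caller can write print(average_with_split("test.csv")). Neither handles a file with no score rows; in that case the division raises ZeroDivisionError.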
There is a CSV with 9 columns and 1.5 million rows. The question asks us to compute the spending for each account. There are 7700 account numbers that I was able to extract. Here is a sample from the file, since someone asked (it is a link since I don't have enough clout on here to post photos, apparently):
sample of the file
I'm especially confused given that you need to add the extra step of multiplying quantity and price since the transactions in the table are for individual items.
Oh, and we are not allowed to use pandas. And all of this is string data.
I have not tried much because I'm pretty stumped beyond simply getting the list of all of the account ids. Even that was a challenge for me, so any help is appreciated. Below is simply the code I used to get the list of IDs, and I'm pretty sure I wasn't even supposed to use import csv for that but oh well.
import csv
f_file = open('myfile.csv')
csv_f_file = csv.reader(f_file)
account_id = []
for row in csv_f_file:
    account_id.append(row[4])
account_id = set(account_id)
account_id_list = list(account_id)
print(account_id_list)
The result should look like this (but imagine it 7000 times):
account: SID600
spending: 87.500
Thank you to anyone who can help!!
You could make it readable by using DictReader and DictWriter, but it's mandatory that you have the CSV with a header. Also you could save the results in another CSV for persistence.
Since in your input there may be a different product per entry for the same account (for example for SID600 there could be entries for chair, table and some other table, with different prices and quantities), there is a need to gather all the spendings in lists for each account and then sum them to a total.
Sample CSV input:
date,trans,item,account,quantity,price
0409,h65009,chair,SID600,12.5,7
0409,h65009,table,SID600,40,2
0409,h65009,table,SID600,22,10
0409,h65009,chair,SID601,30,11
0409,h65009,table,SID601,30,11
0409,h65009,table,SID602,4,9
The code:
import csv
from collections import defaultdict
inpf = open("accounts.csv", "r")
outpf = open("accounts_spending.csv", "w")
incsv = csv.DictReader(inpf)
outcsv = csv.DictWriter(outpf, fieldnames=['account', 'spending'])
outcsv.writeheader()
spending = defaultdict(list)
# calculate spendings for all entries
for row in incsv:
    spending[row["account"]].append(float(row["quantity"]) * float(row["price"]))
# sum the spendings for all accounts
for account in spending:
    spending[account] = sum(spending[account])
# output the spending to a CSV
for account, total_spending in spending.items():
    outcsv.writerow({
        "account": account,
        "spending": total_spending
    })
inpf.close()
outpf.close()
The output for the sample input will be:
account,spending
SID600,387.5
SID601,660.0
SID602,36.0
You can try this:
import csv
with open('myfile.csv') as f:
    csv_f_file = csv.reader(f)
    data = list(csv_f_file)
res = {}
for row in data:
    res[row[3]] = res.get(row[3], 0.0)
    res[row[3]] += float(row[4]) * float(row[5])
print(res)
import csv
with open('myfile.csv') as f_file:
    # keep the rows in a list so they can be scanned more than once
    rows = list(csv.reader(f_file))
# the column positions (3 = account, 4 = quantity, 5 = price) are a guess; adjust them to your file
account_id_list = list(set(row[3] for row in rows))
# dictionary that stores each account id with its total amount
totals = {}
for id in account_id_list:
    total_amount = 0.0
    for row in rows:
        if row[3] == id:
            total_amount += float(row[4]) * float(row[5])
    totals[id] = total_amount
I have not tested it, but this is what I understood from the question.
Give pandas a try. Use the groupby method with a lambda.
If your CSV file has its features arranged row-wise, take the transpose and then use groupby.
Refer only to the official pandas documentation.
I'm trying to create a leaderboard in python, where a player will get a score from playing a game, which will write to a .csv file. I then need to read from this leaderboard, sorted from largest at the top to smallest at the bottom.
Is the sorting something that should be done when the values are written to the file, when I read the file, or somewhere in between?
my code:
writefile=open("leaderboard.csv","a")
writefile.write(name+", "+str(points)+"\n")
writefile.close()
readfile=open("leaderboard.csv","r")
I'm hoping to display the top 5 scores and the accompanying names.
It is this point that I have hit a brick wall. Thanks for any help.
Edit: getting the error 'list index out of range'
import csv
name = 'Test'
score = 3
with open('scores.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow([name, score])
with open('scores.csv') as f:
    reader = csv.reader(f)
    scores = sorted(reader, key=lambda row: (float(row[1]), row[0]))
    top5 = scores[-5:]
csv file:
test1 3
test2 3
test3 3
Python has a csv module in the standard library. It's very simple to use:
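import csv
# newline='' is what the csv documentation recommends when opening files for csv.reader/csv.writer
with open('scores.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([name, score])
with open('scores.csv', newline='') as f:
    reader = csv.reader(f)
    scores = sorted(reader, key=lambda row: (float(row[1]), row[0]))
    top5 = scores[-5:]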
Is the sorting something that should be done when the values are written to the file, when i read the file, or somewhere in between?
Both approaches have their benefits. If you sort when you read, you can simply append to the file, which makes writing new scores faster. Reading takes longer because you have to sort before you can find the highest score, but for a local leaderboard file this is unlikely to be a problem: you are unlikely to have more than a few thousand lines, so sorting when reading will probably be fine.
If you sort while writing, you have to rewrite the entire file every time a new score is added. However, it is also easier to clean up the leaderboard file this way: you can simply drop old or low scores that you no longer care about while writing.
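To make the sort-when-reading option concrete, here is a minimal sketch. The helper names add_score and top_scores are hypothetical, and the sketch assumes every row has the form name,score. It sorts in descending order so the largest score comes first, and it skips empty rows, which are one likely cause of the 'list index out of range' error mentioned in the question (a blank line is read as an empty list, so row[1] fails).

```python
import csv


def add_score(path, name, score):
    """Append one name,score row to the leaderboard file."""
    # newline="" stops the csv module from writing extra blank lines on Windows
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([name, score])


def top_scores(path, count=5):
    """Return up to `count` (name, score) pairs, highest score first."""
    with open(path, newline="") as f:
        # "if row" drops empty rows produced by blank lines in the file
        rows = [row for row in csv.reader(f) if row]
    rows.sort(key=lambda row: float(row[1]), reverse=True)
    return [(row[0], float(row[1])) for row in rows[:count]]
```

A call such as top_scores("leaderboard.csv") then gives the five entries to display, and the file itself stays append-only.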
Try this:
import pandas as pd
df = pd.read_csv('myfullfilepath.csv', sep=',', names=['name', 'score'])
df = df.sort_values(['score'], ascending=False)
first_ten = df.head(10)
first_ten.to_csv('myfullpath.csv', index=False)
I named the columns like that, following the structure that you suggested.
I am looking to tally up all of the Baseline numbers in the Isc, Voc, Imp, Vmp, FF and Pmp columns individually and take the average for each column. Below is the file that I am reading into my program (test_results.csv).
Here is my code.
from MyClasses import TestResult
def main():
    test = "test_results.csv"
    inputFile = open(test, 'r')
    user = TestResult()
    counter = 0.0
    hold = 0.0
    for i in range (4,10):
        for l in inputFile.readlines()[1:]:
            split = l.split(",")
            if user.getTestSeq(split[1]) == "Baseline":
                num = float(user.getIsc(split[i]))
                hold += num
                counter += 1
        print counter
        print hold
        total = hold/counter
        print total
main()
I used the line
num = float(user.getIsc(split[i]))
with the hope that I could iterate through with the i, totaling one column, taking the average and moving to the next column. But I am not able to move to the next column. I just print out the same Isc column multiple times. Any ideas as to why? I am also looking to put the Test Sequences items in a list that I could iterate through in the same way for line
if user.getTestSeq(split[1]) == "Baseline":
so that I can tally up all the columns for Baseline, then move on to tally up all columns for TC200, Hotspot and so on. Is this a good approach? I wanted to solve the first iteration issue before moving on to this one.
Thank you
You should use either DictReader from the csv module or read_csv from the pandas module.
I recommend the pandas module, as you can also perform operations on your data with it.
import pandas as pd
df = pd.read_csv("test_results.csv")
df will contain your CSV table as is; read_csv infers numeric column types, so there is no need to convert to floats or integers yourself.
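For the DictReader route, here is a rough sketch under stated assumptions. The question does not show the file's header row, so the header name "Test Sequence" and the function name column_averages are hypothetical; only the measurement column names come from the question. The file is read once, which also avoids the problem in the original loop: readlines() consumes the whole file on the first pass, so every later pass over the same open file gets no lines.

```python
import csv
from collections import defaultdict

# Measurement columns named in the question
COLUMNS = ["Isc", "Voc", "Imp", "Vmp", "FF", "Pmp"]


def column_averages(path, sequence, sequence_field="Test Sequence"):
    """Average each measurement column over the rows of one test sequence.

    sequence_field is an assumed header name; change it to match the file.
    """
    sums = defaultdict(float)
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row[sequence_field] != sequence:
                continue  # not the sequence we are tallying
            count += 1
            for column in COLUMNS:
                sums[column] += float(row[column])
    if count == 0:
        return {}  # no rows matched, so there is nothing to average
    return {column: sums[column] / count for column in COLUMNS}
```

Calling it in a loop, for example for sequence in ["Baseline", "TC200", "Hotspot"], covers the second part of the question at the cost of rereading the file once per sequence.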
So I've seen this done in other questions asked here, but I'm still a little confused. I've been learning Python 3 for the last few days and figured I'd start working on a project to really get my hands dirty. I need to loop through a certain number of CSV files and make edits to those files. I'm having trouble with going to a specific column and also with for loops in Python in general. I'm used to the convention (int i = 0; i < expression; i++), but in Python it's a little different. Here's my code so far, and I'll explain where my issue is.
import os
import csv
pathName = os.getcwd()
numFiles = []
fileNames = os.listdir(pathName)
for fileNames in fileNames:
    if fileNames.endswith(".csv"):
        numFiles.append(fileNames)
for i in numFiles:
    file = open(os.path.join(pathName, i), "rU")
    reader = csv.reader(file, delimiter=',')
    for column in reader:
        print(column[4])
My issue falls on this line:
for column in reader:
    print(column[4])
So in the Docs it says column is the variable and reader is what I'm looping through. But when I write 4 I get this error:
IndexError: list index out of range
What does this mean? If I write 0 instead of 4 it prints out all of the values in column 0 cell 0 of each CSV file. I basically need it to go through the first row of each CSV file and find a specific value and then go through that entire column. Thanks in advance!
It could be that you don't have 5 columns in your .csv file.
Python uses zero-based indexing, which means it starts counting at 0, so the first column would be column[0], the second would be column[1], and column[4] is the fifth.
Also you may want to change your
for column in reader:
to
for row in reader:
because reader iterates through the rows, not the columns.
This code loops through each row and then each column in that row, allowing you to view the contents of each cell.
for i in numFiles:
    # the "rU" mode was removed in Python 3.11; "r" with newline="" is what the csv docs recommend
    file = open(os.path.join(pathName, i), "r", newline="")
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        for column in row:
            print(column)
            if column == "SPECIFIC VALUE":
                pass  # do stuff
    file.close()
Welcome to Python! I suggest you print some debugging messages.
You could add this to your printing loop:
for row in reader:
    try:
        print(row[4])
    except IndexError as ex:
        print("ERROR: %s in file %s doesn't contain 5 columns" % (row, i))
This will print bad lines (as lists, because that is how csv.reader represents them) so you can fix the CSV files.
Some notes:
It is common to use snake_case in Python and not camelCase
Name your variables appropriately (csv_filename instead of i, row instead of column etc.)
Use the with clause to handle files
Enjoy!
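Applying those notes to the original goal (find a specific value in the first row of each CSV file, then walk down that column) could look roughly like this. It is a sketch, not the asker's final program: column_values and scan_directory are made-up names, and rows that are too short to contain the column are silently skipped instead of raising IndexError.

```python
import csv
import os


def column_values(csv_filename, header_value):
    """Return the values under header_value, or None if the header is absent."""
    with open(csv_filename, newline="") as csv_file:
        reader = csv.reader(csv_file)
        header = next(reader, [])  # first row; [] if the file is empty
        if header_value not in header:
            return None
        column_index = header.index(header_value)
        # keep only rows long enough to contain the wanted column
        return [row[column_index] for row in reader if len(row) > column_index]


def scan_directory(path_name, header_value):
    """Collect the matching column from every .csv file in path_name."""
    results = {}
    for file_name in sorted(os.listdir(path_name)):
        if file_name.endswith(".csv"):
            values = column_values(os.path.join(path_name, file_name), header_value)
            if values is not None:
                results[file_name] = values
    return results
```

For example, scan_directory(os.getcwd(), "SPECIFIC VALUE") returns a dictionary that maps each file name to the list of cells found under that header.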
I got the following problem: I would like to read a data textfile which consists of two columns, year and temperature, and be able to calculate the minimum temperature etc. for each year. The whole file starts like this:
1995.0012 -1.34231
1995.3030 -3.52533
1995.4030 -7.54334
and so on, until year 2013. I had the following idea:
f=open('munich_temperatures_average.txt', 'r')
for line in f:
    line = line.strip()
    columns = line.split()
    year = float(columns[0])
    temperature=columns[1]
    if year-1995<1 and year-1995>0:
        print 1995, min(temperature)
With this I get only the year 1995 data which is what I want in a first step. In a second step I would like to calculate the minimal temperature of the whole dataset in year 1995. By using the script above, I however get the minimum temperature for every line in the datafile. I tried building a list and then appending the temperature but I run into trouble if I want to transform the year into an integer or the temperature into a float etc.
I feel like I am missing the right idea how to calculate the minimum value of a set of values in a column (but not of the whole column).
Any ideas how I could approach this problem? I am trying to learn Python but am still at a beginner's stage, so if there is a way to do the whole thing without using "advanced" commands, I'd be ecstatic!
You could do this with a regular expression:
import re
from collections import defaultdict
REGEX = re.compile(ur"(\d{4})\.\d+ ([0-9\-\.\+]+)")
f = open('munich_temperatures_average.txt', 'r')
data = defaultdict(list)
for line in f:
    year, temperature = REGEX.findall(line)[0]
    temperature = float(temperature)
    data[year].append(temperature)
print min(data["1995"])
You could use the csv module which would make it very easy to read and manipulate each row of your file:
import csv
with open('munich_temperatures_average.txt', 'r') as temperatures:
    for row in csv.reader(temperatures, delimiter=' '):
        print "year", row[0], "temp", row[1]
Afterwards it is just a matter of finding the min temperature in the rows. See the csv module documentation.
If you just want the years and the temps:
years,temp =[],[]
with open("f.txt") as f:
    for line in f:
        spl = line.rstrip().split()
        years.append(int(spl[0].split(".")[0]))
        temp.append(float(spl[1]))
print years,temp
[1995, 1995, 1995] [-1.34231, -3.52533, -7.54334]
I previously submitted another approach, using the numpy library, that could be confusing considering that you are new to Python. Sorry for that. As you mentioned yourself, you need to keep some kind of record for the year 1995, but you don't need a list for that:
mintemp1995 = None
for line in f:
    line = line.strip()
    columns = line.split()
    year = int(float(columns[0]))
    temp = float(columns[1])
    if year == 1995 and (mintemp1995 is None or temp < mintemp1995):
        mintemp1995 = temp
print "1995:", mintemp1995
Note the cast of the year to int, so you can compare it directly to 1995, and the condition after it:
If the variable mintemp1995 has never been set before (it is None, so this is the first 1995 entry of the dataset), or the current temperature is lower than the stored one, the current temperature replaces it, so you keep a record of only the lowest temperature.
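The same idea extends from 1995 to every year by keeping one running minimum per year in a dictionary. The sketch below is written for Python 3 (unlike the Python 2 snippets in this thread), and the function name min_temperature_by_year is made up for illustration. It takes any iterable of lines, so an open file object works as well as the list used here.

```python
def min_temperature_by_year(lines):
    """Map each year to the lowest temperature seen for that year."""
    minimums = {}
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # skip blank or incomplete lines
        year = int(float(columns[0]))  # 1995.0012 -> 1995
        temp = float(columns[1])
        # first entry for this year, or a new record low
        if year not in minimums or temp < minimums[year]:
            minimums[year] = temp
    return minimums


# The three sample lines from the question
sample = ["1995.0012 -1.34231", "1995.3030 -3.52533", "1995.4030 -7.54334"]
print(min_temperature_by_year(sample))  # {1995: -7.54334}
```

With the real data, something like with open('munich_temperatures_average.txt') as f: print(min_temperature_by_year(f)) would give the minimum for each year from 1995 to 2013.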