Bit of an involved setup to this question, but bear with me!
(Copy and pasting the below block into an editor works well)
I am using clevercsv to load my data from a financial website's csv file.
Each row is stored as an item in a list.
data = clevercsv.wrappers.read_csv(in_file_name)
After some account info lines, the stock data begins:
stock_data = data[8:]
I wish to remove the columns Market and Loan Value through Day High (inclusive),
and keep Symbol, Description through % of Positions (inclusive), 52-wk Low, and 52-wk High.
Each stock has this data associated with it on the relevant line.
Any best practices for removing this data? I have been trying and seem to be having logic errors.
As of Date,2020-04-29 18:44:29
Account,TD Direct Investing - HAHAHA
Cash,123.12
Investments,1234.12
Total Value,12345.12
Margin,123456.12,
,
Symbol,Market,Description,Quantity,Average Cost,Price,Book Cost,Market Value,Unrealized $,Unrealized %,% of Positions,Loan Value,Change Today $,Change Today %,Bid,Bid Lots,Ask,Ask Lots,Volume,Day Low,Day High,52-wk Low,52-wk High
AFL,US,"AFLAC INC",500,43.79,39.23,21895.79,19615.00,-2280.79,-10.42,7.26,,1.4399986,3.81,39.19,1,40.2,1,3001288,38.31,39.48,23.07,57.18
AKTS,US,"AKOUSTIS TECHNOLOGIES INC",2500,5.04,8.94,12609.87,22350.00,9740.13,77.24,8.27,,0.35999966,4.20,8.68,1,9.2,10,1161566,8.65,9.25,3.76,9.25
And here is my code so far:
import clevercsv
data = clevercsv.wrappers.read_csv(in_file_name)
# store the earlier lines for later use, all rows 8 and later are stock data
cash = data[2]
investments = data[3]
tot_value = data[4]
margin = data[5]
full_header = data[7]
stock_data = data[8:]
new_header = []
new_stock_data = []
# I have found the index positions I wish to save, append their data to the new_ lists:
for i in range(len(full_header)):
    if i == 0:
        new_header.append(full_header[i])
    if (i >= 2 and i <= 10):
        new_header.append(full_header[i])
    if i == 21:
        new_header.append(full_header[i])
    if i == 22:
        new_header.append(full_header[i])
# I have found the index positions I wish to save, append their data to the new_ lists:
for i in range(len(stock_data)):
    if i == 0:
        new_stock_data.append(stock_data[i])
    if (i >= 2 and i <= 10):
        new_stock_data.append(stock_data[i])
    if i == 21:
        new_stock_data.append(stock_data[i])
    if i == 22:
        new_stock_data.append(stock_data[i])
with open(os.path.join(folder_path,out_file_name),'w') as out_file:
    writer = clevercsv.writer(out_file)
    writer.writerow(cash)
    writer.writerow(investments)
    writer.writerow(tot_value)
    writer.writerow(margin)
    writer.writerow(new_header)
    for row in new_stock_data:
        writer.writerow(row)
If this is too involved I understand, and if someone has a better library to use, or a better way to use the csv library, that will be plenty of help on its own.
If you already know the column indices and the header length, you can do something like this:
import csv
with open('input.csv', 'r', newline='') as input_file, open('output.csv','w', newline='') as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    for line_number, row in enumerate(reader, start=0): # Avoid range(len(x))
        if line_number < 7:
            writer.writerow(row) # Write cash, investments, etc
        else:
            shortened_row = row[0:1] + row[2:11] + row[21:] # Slice only the columns you need
            writer.writerow(shortened_row)
Whenever you find yourself writing range(len(something)), that's a good sign that you probably want to use enumerate(), which will loop through your data and automatically keep track of the current index.
For parsing each row from the header onward, you can use the slice notation row[start:end] and add slices together to get a new list, which you can then write to a file. Keep in mind that row[start:end] does not include the item at index end, which can be counterintuitive.
Finally, I always add newline='' when working with CSVs, since you can get unexpected line breaks otherwise, but this might be something clevercsv handles for you.
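To make the slice arithmetic concrete, here is a small sketch that applies the same three slices to the header row from the question. The variable names are only for illustration.

```python
# Header row copied from the question's sample file.
header = ("Symbol,Market,Description,Quantity,Average Cost,Price,Book Cost,"
          "Market Value,Unrealized $,Unrealized %,% of Positions,Loan Value,"
          "Change Today $,Change Today %,Bid,Bid Lots,Ask,Ask Lots,Volume,"
          "Day Low,Day High,52-wk Low,52-wk High").split(",")

# header[0:1] keeps index 0, header[2:11] keeps indices 2 through 10,
# and header[21:] keeps index 21 to the end of the list.
kept = header[0:1] + header[2:11] + header[21:]
print(kept)
```

Note that the end index of each slice is one past the last column you want to keep.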
In Python, I would recommend using Pandas for this sort of operation.
First isolate the CSV data. Then treat it as a stream. I am dropping part of your sample in as x:
# This is python3 code
# first treat string as though it is a file
import io
x = io.StringIO("""Symbol,Market,Description,Quantity,Average Cost,Price,Book Cost,Market Value,Unrealized $,Unrealized %,% of Positions,Loan Value,Change Today $,Change Today %,Bid,Bid Lots,Ask,Ask Lots,Volume,Day Low,Day High,52-wk Low,52-wk High
AFL,US,"AFLAC INC",500,43.79,39.23,21895.79,19615.00,-2280.79,-10.42,7.26,,1.4399986,3.81,39.19,1,40.2,1,3001288,38.31,39.48,23.07,57.18
AKTS,US,"AKOUSTIS TECHNOLOGIES INC",2500,5.04,8.94,12609.87,22350.00,9740.13,77.24,8.27,,0.35999966,4.20,8.68,1,9.2,10,1161566,8.65,9.25,3.76,9.25""")
Then use pandas to read the string as CSV, treating the first row as headers by default:
import pandas as pd
df = pd.read_csv(x)
Then select the columns you want by passing a list of column names to the data frame:
new_df = df[['Book Cost', 'Market Value', 'Unrealized $', 'Unrealized %','% of Positions','52-wk Low', '52-wk High']]
Book Cost Market Value Unrealized $ Unrealized % % of Positions \
0 21895.79 19615.0 -2280.79 -10.42 7.26
1 12609.87 22350.0 9740.13 77.24 8.27
52-wk Low 52-wk High
0 23.07 57.18
1 3.76 9.25
Finally you can save it:
new_df.to_csv('test.csv', index=False) # Turn off indexing
And you are set:
Book Cost,Market Value,Unrealized $,Unrealized %,% of Positions,52-wk Low,52-wk High
21895.79,19615.0,-2280.79,-10.42,7.26,23.07,57.18
12609.87,22350.0,9740.13,77.24,8.27,3.76,9.25
(Full disclosure, I'm the author of CleverCSV.)
If you'd like to use CleverCSV for this task, and your data is small enough to fit into memory, you could use clevercsv.read_csv to load the data and clevercsv.write_table to save the data. By using these functions you don't have to worry about CSV dialects etc. You could also find the index of the header row automatically. It could go something like this:
from clevercsv import read_csv, write_table
# Load the table with CleverCSV
table = read_csv(in_file_name)
# Find the index of the header row and get the header
header_idx = next((i for i, r in enumerate(table) if r[0] == 'Symbol'), None)
header = table[header_idx]
# Extract the data as a separate table
data = table[header_idx+1:]
# Create a list of header names that you want to keep
keep = ["Symbol", "Description", "Quantity","Average Cost","Price","Book Cost","Market Value","Unrealized $","Unrealized %","% of Positions","52-wk Low", "52-wk High"]
# Turn that list into column indices (and ensure all exist)
keep_idx = [header.index(k) for k in keep]
# Then create a new table by adding the header and the sliced rows
new_table = [keep]
for row in data:
    new_row = [row[i] for i in keep_idx]
    new_table.append(new_row)
# Finally, write the table to a new csv file
write_table(new_table, out_file_name)
I'm trying to read in a .csv file and extract specific columns so that I can output a single table that essentially performs a 'GROUP BY' on a particular column and aggregates certain other columns of interest (similar to how you would in SQL) but I'm not too familiar how to do this easily in Python.
The csv file is in the following form:
age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
I've tried to import and read in the csv files to parse the three columns I care about and iterate through them to put them into three separate array lists. I'm not too familiar with packages and how to get these into a data frame or matrix with 3 columns so that I can then iterate through them mutate or perform all of the aggregated output field (see below expected results).
import csv
with open('loans.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter = ',')
    next(readCSV) ##skips header row
    education = []
    balance = []
    loan_approved = []
    for row in readCSV:
        educat = row[1]
        bal = row[2]
        approve = row[3]
        education.append(educat)
        balance.append(bal)
        loan_approved.append(approve)
    print(education)
    print(balance)
    print(loan_approved)
The output would be a 4x7 table of four rows (grouped by education level) and the following headers:
Education|#Applicants|Min Bal|Max Bal|#Approved|#Rejected|%Apps Approved
Primary ...
Secondary ...
Tertiary ...
This is much simpler with pandas. For instance, you can read only the columns you care about instead of all of them:
import pandas as pd
df = pd.read_csv('loans.csv', usecols=['education', 'balance', 'approved'])
Now, to group by education level, you can find all the unique entries for that column and group them:
groupby_education = {}
for level in list(set(df['education'])):
    groupby_education[level] = df.loc[df['education'] == level]
print(groupby_education)
I hope this helped. Let me know if you still need help.
Cheers!
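For comparison, the aggregated table from the question can also be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: it assumes the approved column holds exactly "yes" or "no", reads the sample rows from an in-memory string instead of loans.csv, and uses illustrative names (groups, summary).

```python
import csv
import io
from collections import defaultdict

# Sample rows from the question; in practice you would open 'loans.csv' instead.
sample = io.StringIO(
    "age,education,balance,approved\n"
    "30,primary,1850,yes\n"
    "54,secondary,800,no\n"
    "24,tertiary,240,yes\n"
)

# Collect (balance, was_approved) pairs for each education level.
groups = defaultdict(list)
for row in csv.DictReader(sample):
    groups[row["education"]].append((float(row["balance"]), row["approved"] == "yes"))

# Build one summary record per education level.
summary = []
for level, entries in sorted(groups.items()):
    balances = [bal for bal, _ in entries]
    approved = sum(1 for _, ok in entries if ok)
    summary.append({
        "Education": level,
        "#Applicants": len(entries),
        "Min Bal": min(balances),
        "Max Bal": max(balances),
        "#Approved": approved,
        "#Rejected": len(entries) - approved,
        "%Apps Approved": 100.0 * approved / len(entries),
    })

for line in summary:
    print(line)
```

Each dictionary in summary corresponds to one row of the expected output table, so it could be written out with csv.DictWriter if a file is needed.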
My csv file looks like this:
Test Number,Score
1,100 2,40 3,80 4,90.
I have been trying to figure out how to write code that ignores the header and the first column and focuses on the scores, because the assignment was to find the average of the test scores and print it out as a float (for those particular numbers the output should be 77.5). I've looked online and found pieces that I think would work, but I'm getting errors every time. We're learning about read, readlines, split, rstrip and \n if that helps! I'm sure the answer is so simple, but I'm new to coding and I have no idea what I'm doing. Thank you!
def calculateTestAverage(fileName):
    myFile = open(fileName, "r")
    column = myFile.readline().rstrip("\n")
    for column in myFile:
        scoreColumn = column.split(",")
        print(scoreColumn[1])
This is my code so far. My professor wanted us to define a function and go from there using the stuff we learned in lecture. I'm stuck because it's printing out all the scores I need on separate lines, yet I am not able to sum those without getting an error. Thanks for all your help. I don't think I would be able to use any of the suggestions because we never went over them. If anyone has an idea of how to take those test scores that printed out vertically as a column and sum them, that would help me a ton!
You can use the csv library. This code should do the work:
import csv
with open('csvfile.txt', 'r', newline='') as csv_file:
    reader = csv.reader(csv_file, delimiter=' ')
    next(reader)  # this line lets you skip the header line
    for row_number, row in enumerate(reader, start=1):
        total_score = 0.0
        for element in row:
            test_number, score = element.split(',')
            total_score += float(score)  # convert the text to a number before adding
        average_score = total_score / len(row)
        print("Average score for row #%d is: %.1f" % (row_number, average_score))
The output should look like this:
Average score for row #1 is: 77.5
I always approach this with a pandas data frame. Specifically the read_csv() function. You don’t need to ignore the header, just state that it is in row 0 (for example) and then also the same with the row labels.
So for example:
import pandas as pd
import numpy as np
df = pd.read_csv("filename", header=0, index_col=0)
scores=df.values
print(np.average(scores))
I will break it down for you.
Since you're dealing with .csv files, I recommend using the csv library. You can import it with:
import csv
Now we need to open() the file. One common way is to use with:
with open('test.csv') as file:
Which is a context manager that avoids having to close the file at the end. The other option is to open and close normally:
file = open('test.csv')
# Do your stuff here
file.close()
Now you need to wrap the opened file with csv.reader(), which allows you to read .csv files and do things with them:
csv_reader = csv.reader(file)
To skip the headers, you can use next():
next(csv_reader)
Now for the average calculation part. One simple way is to keep two variables, score_sum and total: add each score to score_sum and increment total once per row. Here is an example snippet:
score_sum = 0
total = 0
for number, score in csv_reader:
    score_sum += int(score)
    total += 1
Here's how to do it with indexing also:
score_sum = 0
total = 0
for line in csv_reader:
    score_sum += int(line[1])
    total += 1
Now that we have our score and totals calculated, getting the average is simply:
score_sum / total
The above code combined will then result in an average of 77.5.
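Put together, the steps above might look like the following sketch. To keep it runnable on its own, it reads the sample data from an in-memory string via io.StringIO rather than from test.csv; that substitution is an assumption made here for illustration.

```python
import csv
import io

# Stand-in for open('test.csv'); assumes one "number,score" pair per line.
file = io.StringIO("Test Number,Score\n1,100\n2,40\n3,80\n4,90\n")

csv_reader = csv.reader(file)
next(csv_reader)  # skip the header row

score_sum = 0
total = 0
for number, score in csv_reader:
    score_sum += int(score)  # scores arrive as text, so convert before adding
    total += 1

average = score_sum / total
print(average)
```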
Of course, this all assumes that your .csv file is actually in this format:
Test Number,Score
1,100
2,40
3,80
4,90
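If you are limited to readline, split, and rstrip, as the question mentions, the same calculation can be sketched without the csv module. This assumes the one-pair-per-line format shown above and at least one data row; the temporary file exists only so the example runs on its own.

```python
import os
import tempfile

def calculateTestAverage(fileName):
    myFile = open(fileName, "r")
    myFile.readline()  # read and discard the header line
    total = 0.0
    count = 0
    for line in myFile:
        line = line.rstrip("\n")
        if line == "":
            continue  # ignore blank lines
        columns = line.split(",")
        total += float(columns[1])  # second column holds the score
        count += 1
    myFile.close()
    return total / count

# Write the sample data to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("Test Number,Score\n1,100\n2,40\n3,80\n4,90\n")
    path = tmp.name

average = calculateTestAverage(path)
print(average)
os.remove(path)
```

The key step is converting each score with float() before adding it, since split() returns strings.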
I'm new to SO, new at programming, and even newer to Python haha.
I'm trying to read CSV files (which will contain different data types) and store specific values ("coordinates") as variables.
CSV file example (sorry for using code format, text didn't want to stay quiet):
$id,name,last_name,age,phone_number,addrstr,addrnum
1,Constance,Harm,37,555-1234,Ocean_view,1
2,Homer,Simpson,40,555-1235,Evergreen_Terrace,742
3,John,Doe,35,555-1236,Fake_Street,123
4,Moe,Tavern,20,7648-4377,Walnut_Street,126
I want to know if there is some easy way to store a specific value using the rows as index, for example: "take row 2 and store 2nd value in variable Name, 3rd value in variable Lastname" and the "row" for each storage will vary.
Not sure if this will help because my coding level is very crappy:
row = #this value will be taken from ANOTHER csv file
people = open('people.csv', 'r')
linepeople = csv.reader(people)
data = list(linepeople)
name = int(data[row][1])
lastname = int(data[row][2])
age = int(data[row][3])
phone = int(data[row][4])
addrstr = int(data[row][5])
addrnum = int(data[row][6])
I haven't found anything similar enough to guide me to a solution. (I have been reading about dictionaries, maybe that will help me?)
EDIT (please let me know if it's not allowed to edit questions): Thanks for the solutions, I'm starting to understand the possibilities, but let me give more info about my expected output:
I'm trying to create a "universal" function to get only one value at a given row/col and to store that single value into a variable, not the whole row nor the whole column.
Example: Need to store the phone number of John Doe (column 5, row 4) into a variable so that when printing that variable the output will be: 555-1236
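You can iterate line by line. Watch out for your example code: you are trying to cast the names of people into integers. Note that indexing a row by column name requires csv.DictReader rather than csv.reader:
linepeople = csv.DictReader(people)  # rows become dictionaries keyed by the header names
for row in linepeople:
    name = row['name']
    age = int(row['age'])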
If you are going to do more complicated stuff, I recommend pandas. For starters, it will infer numeric types for numerical columns, and you can access columns with attribute notation.
import pandas as pd
import numpy as np
people = pd.read_table('people.csv', sep=',')
people.name # all the names
people.loc[0:2] # first three rows (.loc slices include the end label)
You can use the CSV DictReader which will automatically assign dictionary names based on your CSV column names on a per row basis as follows:
import csv
with open("input.csv", "r") as f_input:
    csv_input = csv.DictReader(f_input)
    for row in csv_input:
        id = row['$id']
        name = row['name']
        last_name = row['last_name']
        age = row['age']
        phone_number = row['phone_number']
        addrstr = row['addrstr']
        addrnum = row['addrnum']
        print(id, name, last_name, age, phone_number, addrstr, addrnum)
This would print out your CSV entries as follows:
1 Constance Harm 37 555-1234 Ocean_view 1
2 Homer Simpson 40 555-1235 Evergreen_Terrace 742
3 John Doe 35 555-1236 Fake_Street 123
4 Moe Tavern 20 7648-4377 Walnut_Street 126
If you wanted a list of just the names, you could build them as follows:
with open("input.csv", "r") as f_input:
    csv_input = csv.DictReader(f_input)
    names = []
    for row in csv_input:
        names.append(row['name'])
print(names)
Giving:
['Constance', 'Homer', 'John', 'Moe']
As the question has changed, a rather different approach would be needed. A simple get row/col type function would work but would be very inefficient. The file would need to be read in each time. A better approach would be to use a class. This would load the file in once and then you could get as many entries as you need. This can be done as follows:
import csv
class ContactDetails():
    def __init__(self, filename):
        # Read the whole file once and keep every row (including the header) in memory
        with open(filename, "r", newline="") as f_input:
            csv_input = csv.reader(f_input)
            self.details = list(csv_input)
    def get_col_row(self, col, row):
        # col and row are 1-based, so subtract 1 for Python's 0-based lists
        return self.details[row-1][col-1]
data = ContactDetails("input.csv")
phone_number = data.get_col_row(5, 4)
name = data.get_col_row(2,4)
last_name = data.get_col_row(3,4)
print("%s %s: %s" % (name, last_name, phone_number))
By using the class, the file is only read in once. This would print the following:
John Doe: 555-1236
Note, Python numbers indexes from 0, so your 5,4 has to be converted to 4,3 for Python.
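If counting columns by position feels error-prone, one alternative is to look a value up by the record's $id and the column name. The following is only a sketch under assumptions: get_value and people are hypothetical names introduced here, the $id values are assumed unique, and the sample data is read from an in-memory string rather than people.csv.

```python
import csv
import io

# Sample data from the question; a real script would open 'people.csv' instead.
sample = io.StringIO(
    "$id,name,last_name,age,phone_number,addrstr,addrnum\n"
    "1,Constance,Harm,37,555-1234,Ocean_view,1\n"
    "2,Homer,Simpson,40,555-1235,Evergreen_Terrace,742\n"
    "3,John,Doe,35,555-1236,Fake_Street,123\n"
    "4,Moe,Tavern,20,7648-4377,Walnut_Street,126\n"
)

# Index every row by its $id so a single cell can be fetched by (id, column name).
people = {row["$id"]: row for row in csv.DictReader(sample)}

def get_value(person_id, column):
    # Hypothetical helper: returns the raw string stored in that cell.
    return people[str(person_id)][column]

phone = get_value(3, "phone_number")
print(phone)
```

As with the class above, the file is read only once. Values come back as strings, so convert them with int() only where the column really is numeric.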