I am attempting to use the CSV module of python to modify a CSV file. The file represents a stock and lists (as columns) the date, open price, high price, low price, close price, and volume for the day. What I would like to do is create multiple new columns by performing algebra on the existing data. For instance, I would like to create a column for the percentage from the open price to the high price for any given day and another for the percentage change from yesterday's close to today's close (no end in sight here, as of now thinking of about 10 columns to add).
Is there a compact way to do this? As of now, I open the original file and read the values of interest into one or more lists. Then I write the modified values, computed from those lists, to a temp file. Then I write to a new file with a for loop, combining the rows from both spreadsheets. Finally I write the entire contents of that new file back over the original csv, since I would like to keep the csv's name (ticker.csv).
Hopefully I have made my issue clear. If you would like any clarification or further details, please do not hesitate to ask.
edit: I have included a snippet of the code for one function below. The function seeks to create a new column that has the percent change from yesterday's close to today's close.
import csv

def add_col_pchange(ticker):
    """
    Add column with percent change in closing price.
    """
    original = open('file1', 'rb')
    reader = csv.reader(original)
    reader.next()  # skip the header row
    close = list()
    for row in reader:
        # build list of close values; entries from left to right are reverse chronological
        # index 4 corresponds to "Close" column
        close.append(float(row[4]))
    original.close()

    new = open('file2', 'wb')
    writer = csv.writer(new)
    writer.writerow(["Percent Change"])
    pchange = list()
    for i in range(len(close) - 1):
        x = (close[i] - close[i+1]) / close[i+1]
        pchange.append(x)
        writer.writerow([x])
    new.close()
    # open original and new csv's as read, write out to some new file.
    # later, copy that entire file to original csv in order to maintain
    # original csv's name and include new data
Hope this helps
import csv

def add_col_pchange(ticker):
    """
    Add column with percent change in closing price.
    """
    # always use with to transparently manage opening/closing files
    with open('ticker.csv', 'rb') as original:
        spam = csv.reader(original)
        headers = spam.next()  # get header row
        # get all of the data at one time, then transpose it using zip
        data = zip(*[row for row in spam])

    # build list of close values; entries from left to right are reverse chronological
    # index 4 corresponds to "Close" column
    close = data[4]  # the 5th column has close values

    # use map to process the whole column at one time
    f_pchange = lambda close0, close1: 100 * (float(close0) - float(close1)) / float(close1)
    Ndays = len(close)  # length of table
    pchange = map(f_pchange, close[:-1], close[1:])  # list of percent changes
    pchange = tuple(pchange) + (None,)  # the oldest (last) row has no previous close

    headers.append("Percent Change")  # add column name to headers
    data.append(pchange)
    data = zip(*data)  # transpose back to rows

    with open('ticker.csv', 'wb') as new:
        spam = csv.writer(new)
        spam.writerow(headers)  # write headers
        for row in data:
            spam.writerow(row)
    # open original and new csv's as read, write out to some new file.
    # later, copy that entire file to original csv in order to maintain
    # original csv's name and include new data
You should check out numpy; you could use loadtxt() and vector math. But @lightalchemist is right, pandas was designed just for this.
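For reference, here is a minimal pandas sketch of the same idea. It assumes the csv has header names like Open, High and Close and that rows are newest-first, as in your description; adjust the column names to whatever your file actually uses.

import pandas as pd

df = pd.read_csv('ticker.csv')

# percent change from the open to the high on the same day
df['Open to High %'] = 100 * (df['High'] - df['Open']) / df['Open']

# percent change from yesterday's close to today's close;
# rows are newest-first, so "yesterday" is the next row down
prev_close = df['Close'].shift(-1)
df['Close % Change'] = 100 * (df['Close'] - prev_close) / prev_close

df.to_csv('ticker.csv', index=False)

Each new column is one line of arithmetic, so adding your other ten columns stays compact, and the file keeps its name because you write straight back to ticker.csv.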
I have an xlsx file (a calculations file) with a number of sheets where the values of a few parameters are calculated using macros. These calculations are based on the values of a particular column in SHEET1; that column's values are copied in repeatedly from a set of different files (input files), and after a short while the parameter values are copied over to another file. I am trying to automate this process with pandas. I was able to copy the column data from the input files to the calculations file, but the macros don't seem to be doing anything, as I cannot see any changed values even after waiting for a minute. Usually these macros take only a few seconds. On the other hand, if I print the column from the calculations dataframe, it shows the values copied from the input. Not sure what I'm missing here.
Here is the code I am using for the process:
import os
import time
import pandas as pd

input_file = os.path.join('./data', configs.input_file)
calculations_file = os.path.join('./resources', configs.calculations_file)
input_df = pd.read_csv(input_file)
calcs_df = pd.read_excel(calculations_file)
# the first column is what is used in calculating the parameter values,
# so copy the input data into this column
calcs_df[calcs_df.columns[0]] = input_df[input_df.columns[0]]
# wait for some time so that the macros can run and derive the parameter values
time.sleep(90)
# check the parameter columns for their updated values; 10, 11 and 12 are
# the columns for the parameters
print(calcs_df[calcs_df.columns[0]][1], calcs_df[calcs_df.columns[10]][0],
      calcs_df[calcs_df.columns[11]][0], calcs_df[calcs_df.columns[12]][0])
I would appreciate it if anyone can help me understand whether this sort of process is even possible.
xlwings does this well for me.
Instead of using pandas I used xlwings as follows, and it worked:
import time
import xlwings as xw

nwbook = xw.Book(n_file)
nwsheet = nwbook.sheets[0]
rwbook = xw.Book(r_file)
isheet = rwbook.sheets['Test']
ndata = nwsheet.range('A:A').options(ndim=2, expand='table', transpose=True).value

# calculations file edits
cwbook = xw.Book(c_file)
csheet = cwbook.sheets['Data']

for no, data in enumerate(ndata, start=1):  # start at 1 so the ranges line up with Excel's 1-based rows
    # this is where the calculations are done;
    # get the result in column orientation by setting
    # transpose to true
    csheet.range('A:A').options(ndim=2, expand='table', transpose=True).value = data
    time.sleep(1)
    # by now the macros in the calculations sheet have run
    # and the results are ready;
    # read the values and pump them into the report
    isheet.range('A' + str(no)).value = data[0]
    isheet.range('C' + str(no)).value = csheet.range('E2').value
    isheet.range('D' + str(no)).value = csheet.range('F2').value
    isheet.range('E' + str(no)).value = csheet.range('G2').value
I have the header name of a column from a series of massive csv files with 50+ fields. Across the files, the index of the column I need is not always the same.
I have written code that finds the index number of the column in each file. Now I'd like to add only this column as the key in a dictionary where the value counts the number of unique strings in this column.
Because these csv files are massive and I'm trying to use best-practices for efficient data engineering, I'm looking for a solution that uses minimal memory. Every solution I find for writing a csv to a dictionary involves writing all of the data in the csv to the dictionary and I don't think this is necessary. It seems that the best solution involves only reading in the data from this one column and adding this column to the dictionary key.
So, let's take this as sample data:
FOODS;CALS
"PIZZA";600
"PIZZA";600
"BURGERS";500
"PIZZA";600
"PASTA";400
"PIZZA";600
"SALAD";100
"CHICKEN WINGS";300
"PIZZA";600
"PIZZA";600
The result I want:
food_dict = {'PIZZA': 6, 'PASTA': 1, 'BURGERS': 1, 'SALAD': 1, 'CHICKEN WINGS': 1}
Now let's say that I want the data from only the FOODS column and in this case, I have set the index value as the variable food_index.
Here's what I have tried, the problem being that the columns are not always in the same index location across the different files, so this solution won't work:
from itertools import islice

food_dict = {}

with open(input_data_txt, "r") as file:
    # This enables skipping the header line.
    skipped = islice(file, 1, None)
    for i, line in enumerate(skipped, 2):
        try:
            food, cals = line.split(";")
        except ValueError:
            pass
        if food not in food_dict:
            food_dict[food] = 1
        else:
            food_dict[food] += 1
This solution works for only this sample -- but only if I know the location of the columns ahead of time -- and again, a reminder that I have upwards of 50 columns and the index position of the column I need is different across files.
Is it possible to do this? Again, built-ins only -- no Pandas or Numpy or other such packages.
The important part here is that you do not skip the header line! You need to split that line and find the indices of the columns you need! Since you know the column headers for the information you need, put those into a reference list:
wanted_headers = ["FOODS", "RECYCLING"]

with open(input_data_txt, "r") as infile:
    header = infile.readline().strip().split(';')
    wanted_cols = [header.index(label) for label in wanted_headers if label in header]
    # wanted_cols is now a list of column numbers you want
    for line in infile.readlines():  # iterate through the remaining file
        fields = line.strip().split(';')
        data = [fields[col] for col in wanted_cols]
You now have the data in the same order as your existing headers; you can match it up or rearrange as needed.
Does that solve your blocking point? I've left plenty of implementation for you ...
Use Counter and csv:
from collections import Counter
import csv
with open(filename) as f:
    reader = csv.reader(f, delimiter=";")  # the sample data is ;-delimited
    next(reader, None)  # skips header
    histogram = Counter(line[0] for line in reader)
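To handle the varying column position from the question, you can combine this with a header lookup; a sketch, assuming the files are semicolon-delimited like the sample and the header label is exactly FOODS:

from collections import Counter
import csv

def count_column(filename, wanted_header="FOODS", delimiter=";"):
    with open(filename) as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)              # read the header row once
        col = header.index(wanted_header)  # position varies per file
        return Counter(row[col] for row in reader if row)

food_dict = dict(count_column(input_data_txt))

Only one row is in memory at a time, and Counter does the per-value counting for you.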
CSV Sample
So I have a csv file (sample in the link above), with variable names in row 7 and values in row 8. The variables all have units after them, and the values are just numbers, like this:
Velocity (ft/s)    Volumetric (Mgal/d)    Mass Flow (klb/d)    Sound Speed (ft/s)
-0.121             1.232                  1.4533434            1.233423
There are a lot more variables, but basically I need some way to search the csv file for specific unit groups and then append the associated values to a list. For example, search for the text "(ft/s)" and then make a dictionary with Velocity and Sound Speed as keys and their associated values. I am unable to do this because the csv is formatted like an Excel spreadsheet, and each cell contains the whole variable name with its unit.
In the end I will have a dictionary for each unit group. I need to do it this way because the unit groups change with each generated csv file (ft/s becomes m/s). I also can't use an Excel reader, because it doesn't work in IronPython.
You can use the csv module to read the appropriate lines into lists. defaultdict is a good choice for data aggregation, while variable names and units can easily be separated by splitting on '('.
import csv
import collections

with open(csv_file_name) as fp:
    reader = csv.reader(fp)
    for k in range(6):  # skip the first 6 lines
        next(reader)
    varnames = next(reader)  # 7th line
    values = next(reader)    # 8th line

groups = collections.defaultdict(dict)
for i, (col, value) in enumerate(zip(varnames, values)):
    if i < 2:
        continue  # skip the first two columns
    name, units = map(str.strip, col.strip(')').split('(', 1))
    groups[units][name] = float(value)
Edit: added the code to skip first two columns
I'll help with the part I think you're stuck on, which is trying to extract the units from the category. Given your data, your best bet may be to use regex, the following should work:
import re

f = open('data.csv')
# I assume the first row has the header you listed in your question
header = f.readline().split(',')  # since you said it's a csv
for item in header:
    print re.search(r'\(.+\)', item).group()
    print re.sub(r'\(.+\)', '', item)
That should print the following for you:
(ft/s)
Velocity
(Mgal/d)
Volumetric
(klb/d)
Mass Flow
(ft/s)
Sound Speed
You can modify the above to store these in a list, then iterate through them to find duplicates and merge the appropriate strings to dictionaries or whatnot.
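A sketch of that last step, turning the regex output into one dict per unit group. Here header is the list of column names split as above, and values is assumed to be the corresponding row of numbers read from line 8 of the file:

import re
from collections import defaultdict

groups = defaultdict(dict)
for col, value in zip(header, values):
    match = re.search(r'\((.+)\)', col)
    if not match:
        continue                               # column has no unit, skip it
    units = match.group(1)                     # e.g. 'ft/s'
    name = re.sub(r'\(.+\)', '', col).strip()  # e.g. 'Velocity'
    groups[units][name] = float(value)

# groups['ft/s'] would then map 'Velocity' and 'Sound Speed' to their values

This keeps the grouping driven by whatever unit text appears in the file, so it still works when ft/s becomes m/s.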
I am writing a program that:
Reads the content of an Excel sheet row by row (90,000 rows in total)
Compares the content with each row of another Excel sheet (600,000 rows in total)
If a match occurs, writes the matching entry into a new Excel sheet
I have written the script and everything works fine. However, the computational time is HUGE. In an hour, it has processed just 200 rows from the first sheet, resulting in 200 different files being written.
I was wondering if there is a better way to save the matches, since I am going to use them later on. Is there any way to save them in a matrix or something?
import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime
# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")
# select the working sheet, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Incidents = book_1.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Features_Numbers = book_3.sheet_by_name('Sheet1')
#return the total number of rows for the traps sheet
Total_Number_of_Rows_Traps = Traps.nrows
# return the total number of rows for the incident sheet
Total_Number_of_Rows_Incidents = Incidents.nrows
# open a file to write down the non-matching incidents' numbers
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')
# for loop to iterate over all the rows of the incident sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
    # Store content for the comparable cell for incident sheet
    Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
    # Store content for the comparable cell for incident sheet
    Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
    # Convert Excel date type into python type
    Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
    # extract the year, month and day
    Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
    # Store content for the comparable cell for incident sheet
    Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
    # extract the incident number
    Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
    # Create a workbook for the selected incident
    Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
    # Create sheet name for the created workbook
    Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
    # insert the first row that contains the features
    Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
    Insert_Row_to_Incident_Sheet = 0
    # for loop to iterate over all the rows of the traps sheet
    for Rows_Traps in range(Total_Number_of_Rows_Traps):
        # Store content for the comparable cell for traps sheet
        Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
        # Store content for the comparable cell for traps sheet
        Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
        # extract the date temporarily
        Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
        # Store content for the comparable cell for traps sheet
        Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')
        # If the content matches partially or fully
        if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
           str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
           len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
           str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
           len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
           Traps_Content_Date <= Incidents_Content_Date:
            # counter for writing inside the new incident sheet
            Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
            # Write the Incident information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
            # Write the Traps information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))
    Incident_Name_Book.close()
Thanks
What you're doing is seeking/reading a little bit of data for each cell. This is very inefficient.
Try reading all the information in one go into as basic a Python data structure as is sensible (lists, dicts, etc.), make your comparisons/operations on this data set in memory, and write all the results in one go. If not all the data fits into memory, try to partition it into sub-tasks.
Having to read the data set 10 times to extract a tenth of the data each time will likely still be hugely faster than reading each cell independently.
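For example, a rough sketch of that with your sheets, using xlrd as in your script and keeping only the node-name check for brevity; the column numbers are taken from your code:

# read each sheet into plain Python lists of rows, once
incident_rows = [Incidents.row_values(r) for r in range(Incidents.nrows)]
trap_rows = [Traps.row_values(r) for r in range(Traps.nrows)]

# every comparison below touches only in-memory lists, not the workbook
matches = []
for inc in incident_rows:
    affected = str(inc[47]).lower()
    for trap in trap_rows:
        node = str(trap[3]).lower()
        if node and node in affected:
            matches.append((inc, trap))

The double loop is still O(N*M), but removing the per-cell sheet access alone usually gives a large speed-up, and matches can then be written out in one go.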
I don't see how your code could work as posted; the second loop works on variables which change for every row in the first loop, so it has to be nested inside the first one.
That said, comparing files in this way has a complexity of O(N*M) which means that the runtime explodes quickly. In your case you try to execute 54'000'000'000 (54 billion) loops.
If you run into these kinds of problems, the solution is always a three-step process:
Transform the data to make it easier to process
Put the data into efficient structures (sorted lists, dict)
Search the data with the efficient structures
You have to find a way to get rid of the find(). Try to get rid of all the junk in the cells that you want to compare so that you can use a plain == comparison. Once you have that, you can put the rows into a dict to find matches. Or you could load the data into a SQL database and use SQL queries (don't forget to add indexes!).
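A sketch of the dict idea, assuming the rows have already been read into plain lists (for example with xlrd's sheet.row_values) and that a cleaned-up node name can serve as an exact key:

# build the lookup once: key -> list of trap rows with that key
traps_by_key = {}
for trap in trap_rows:
    key = str(trap[3]).strip().lower()
    traps_by_key.setdefault(key, []).append(trap)

# each incident now costs one dict lookup instead of 600,000 comparisons
matches = []
for inc in incident_rows:
    key = str(inc[47]).strip().lower()
    for trap in traps_by_key.get(key, []):
        matches.append((inc, trap))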
One last trick is to use sorted lists. If you can sort the data in the same way, then you can use two lists:
Sort the data from the two sheets into two lists
Use two row counters (one per list)
If the current item from the first list is less than the current one from the second list, then there is no match and you have to advance the first row counter.
If the current item from the first list is bigger than the current one from the second list, then there is no match and you have to advance the second row counter.
If the items are the same, you have a match. Process the match and advance both counters.
This allows you to process all the items in one go.
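A minimal sketch of that two-pointer merge, assuming both lists are already sorted on the same comparable key:

def merge_matches(left, right, key):
    """Yield (l, r) pairs whose keys are equal; both lists must be sorted by key."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1    # left[i] can never match, advance it
        elif lk > rk:
            j += 1    # right[j] can never match, advance it
        else:
            yield left[i], right[j]
            i += 1
            j += 1

# e.g. pairs = list(merge_matches(incident_rows, trap_rows, key=lambda row: row[0]))

With duplicate keys you would have to collect runs of equal items instead of advancing both counters, but the single pass over each list is the important part.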
I would suggest that you use pandas. This module provides a huge number of functions to compare datasets. It also has very fast import/export algorithms for Excel files.
IMHO you should use the merge function and provide the arguments how='inner' and on=[list of your columns to compare]. That will create a new dataset containing only the rows that occur in both tables (having the same values in the defined columns). This new dataset you can then export to your Excel file.
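A sketch of that approach; the column names in on= are made up here and have to be replaced with columns that exist, with identical values, in both sheets:

import pandas as pd

incidents = pd.read_excel('Incidents.xlsx', sheet_name='Sheet1')
traps = pd.read_excel('Traps.xlsx', sheet_name='Sheet1')

# keep only the rows whose values match in the listed columns
matched = incidents.merge(traps, how='inner', on=['NodeName', 'EventType'])

matched.to_excel('Matches.xlsx', index=False)

Note that merge only does exact matching, so the substring-style comparison from the original script would first have to be turned into an exact key, as described in the other answer.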
I asked a question about two hours ago regarding reading and writing data from a website. I've spent the two hours since then trying to find a way to read the maximum date value from column 'A' of the output, compare that value against the refreshed website data, and append any new data to the csv file without overwriting the old rows or creating duplicates.
The code that is currently 100% working is this:
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
f.write(data.text)
I've tried various ways of finding the maximum value of column 'A': a bunch of different approaches using "Dict" and other methods of sorting/finding the max, and even the pandas and numpy libs, none of which seem to work. Could someone point me in the direction of a decent way to find the maximum of a column in the .csv file? Thanks!
If you have it in a pandas DataFrame, you can get the max of any column like this:
>>> max(data['time'])
'2012-01-18 15:52:26'
where data is the variable name for the DataFrame and time is the name of the column
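For completeness, a minimal sketch of getting such a DataFrame from the file written in the question's snippet; whether the file has a header row, and what the columns are called, is an assumption you would need to check:

import pandas as pd

data = pd.read_csv("trades_mtgoxUSD.csv")  # pass names=[...] if the file has no header row
print(data[data.columns[0]].max())         # max of the first ("A") column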
I'll give you two answers, one that just returns the max value, and one that returns the row from the CSV that includes the max value.
import csv
import operator as op
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
csv_file = "trades_{}.csv".format(symbol)
data = requests.get(url)
with open(csv_file, "w") as f:
    f.write(data.text)

with open(csv_file) as f:
    next(f)  # discard first row from file -- see notes
    max_value = max(row[0] for row in csv.reader(f))

with open(csv_file) as f:
    next(f)  # discard first row from file -- see notes
    max_row = max(csv.reader(f), key=op.itemgetter(0))
Notes:
max() can directly consume an iterator, and csv.reader() gives us an iterator, so we can just pass that in. I'm assuming you might need to throw away a header line so I showed how to do that. If you had multiple header lines to discard, you might want to use islice() from the itertools module.
In the first one, we use a "generator expression" to select a single value from each row, and find the max. This is very similar to a "list comprehension" but it doesn't build a whole list, it just lets us iterate over the resulting values. Then max() consumes the iterable and we get the max value.
max() can use a key= argument where you specify a "key function". It will use the key function to get a value and use that value to figure the max... but the value returned by max() will be the unmodified original value (in this case, a row value from the CSV). In this case, the key function is manufactured for you by operator.itemgetter()... you pass in which column you want, and operator.itemgetter() builds a function for you that gets that column.
The resulting function is the equivalent of:
def get_col_0(row):
    return row[0]

max_row = max(csv.reader(f), key=get_col_0)
Or, people will use lambda for this:
max_row = max(csv.reader(f), key=lambda row: row[0])
But I think operator.itemgetter() is convenient and nice to read. And it's fast.
I showed saving the data in a file, then pulling from the file again. If you want to go through the data without saving it anywhere, you just need to iterate over it by lines.
Perhaps something like:
text = data.text
rows = [line.split(',') for line in text.split("\n") if line]
rows.pop(0) # get rid of first row from data
max_value = max(row[0] for row in rows)
max_row = max(rows, key=op.itemgetter(0))
I don't know which column you want... column "A" might be column 0 so I used 0 in the above. Replace the column number as you like.
It seems like something like this should work:
import requests
import csv
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
all_values = list(csv.reader(f))
max_value = max([int(row[2]) for row in all_values[1:]])
(write-out-the-value?)
EDITS: I used "row[2]" because that was the sample column I was taking max of in my csv. Also, I had to strip off the column headers, which were all text, which was why I looked at "all_values[1:]" from the second row to the end of the file.