Transforming a text file into column vectors - python

I have a text file that I would like to break up into column vectors:
dtstamp ozone ozone_8hr_avg
06/18/2015 14:00:00 0.071 0.059
06/18/2015 13:00:00 0.071 0.053
How do I produce output in the following format?
dtstamp = [06/18/2015 14:00:00, 06/18/2015 13:00:00]
ozone = [0.071, 0.071]
etc.

import datetime

dtstamp = []  # initialize the dtstamp list
ozone = []    # initialize the ozone list
with open('file.txt', 'r') as f:
    next(f)  # skip the title line
    for line in f:  # iterate through the file
        if not line.strip():
            continue  # skip blank lines
        date, time, value, _ = line.split()  # split up the line
        dtstamp.append(datetime.datetime.strptime(' '.join((date, time)),
                                                  '%m/%d/%Y %H:%M:%S'))  # add a date
        ozone.append(float(value))  # add a value
You can then combine these lists with zip to work with corresponding dates/values:
for date, value in zip(dtstamp, ozone):
    print(date, value)  # just an example

A few of the other answers seem to give errors when run.
Try this; it should work like a charm!
dtstmp = []
ozone = []
ozone_8hr_avg = []
with open('file.txt', 'r') as file:
    next(file)
    for line in file:
        if (line == "\n") or (not line):  # if a blank line occurs
            continue
        words = line.split()  # extract the words
        dtstmp.append(' '.join(words[0:2]))  # join the date and time
        ozone.append(words[2])  # add ozone
        ozone_8hr_avg.append(words[3])  # add the 8-hour average
print "dtstmp =", dtstmp
print "ozone =", ozone
print "ozone_8hr_avg =", ozone_8hr_avg

I would check out pandas (http://pandas.pydata.org) or the csv module. With csv you'll have to build the columns yourself, since it gives you rows:
import csv

rows = [row for row in csv.reader(file, delimiter='\t')]  # get the rows
col0 = [row[0] for row in rows]  # construct a column from element 0 of each row
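As a quick stdlib illustration of turning rows into columns, zip(*rows) transposes a list of rows in one step. This is only a sketch: it uses a small inline, tab-delimited sample with hypothetical values in place of the real file.

```python
import csv
import io

# Hypothetical inline sample standing in for the real tab-delimited file.
sample = "dtstamp\tozone\tozone_8hr_avg\nA\t0.071\t0.059\nB\t0.071\t0.053\n"

rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))
header, data = rows[0], rows[1:]

# zip(*data) transposes the list of rows into a tuple of columns,
# which pairs up with the header names to give named column vectors.
columns = dict(zip(header, zip(*data)))
print(columns['ozone'])  # ('0.071', '0.071')
```

The same dict-of-columns shape then supports the per-column lookups the question asks for.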

Try this my friend:
# -*- coding: utf8 -*-
with open("./file.txt") as file:
    lines = file.readlines()[1:]  # skip the header line
data = []
data_hour = []
ozone = []
ozone_8hr_avg = []
for i_line in lines:
    words = i_line.split()
    data.append(words[0:2])
    data_hour.append(' '.join(data[-1]))
    ozone.append(words[2])
    ozone_8hr_avg.append(words[3])
#print(data)
print(data_hour)
print(ozone)
print(ozone_8hr_avg)
If that helps you, remember to accept the answer.

Related

Trying to read in data from a CSV file, list index out of range?

I have no idea why this isn't working, I've done this before with CSV files and it worked. The file has no blank lines or blank values and the data is separated by commas. This is what I have:
#Create empty lists for data.
Date = []
Open = []
High = []
Low = []
Close = []
Adj_Close = []
Volume = []
#Fill lists with data.
with open("AAPL_train.csv", "r") as infile:
    for lines in infile:
        lines = lines.split(",")
        Date.append(lines[0])
        Open.append(lines[1])
        High.append(lines[2])
        Low.append(lines[3])
        Close.append(lines[4])
        Adj_Close.append(lines[5])
        Volume.append(lines[6])
The traceback says the index goes out of range at the Open.append(lines[1]) line.
Then here is a sample of the data to show you what it looks like.
Any ideas? Thank you.
Edited to add: The error I'm getting is IndexError: list index out of range on line 18, and when I add a print(line) inside the loop, I get nothing but the error.
import csv

#Create empty lists for data.
Date = []
Open = []
High = []
Low = []
Close = []
Adj_Close = []
Volume = []

#Fill lists with data.
with open("AAPL_train.csv", "r") as infile:
    csv_reader = csv.reader(infile, delimiter=',')
    # skip header
    next(csv_reader)
    for row in csv_reader:
        Date.append(row[0])
        Open.append(row[1])
        High.append(row[2])
        Low.append(row[3])
        Close.append(row[4])
        Adj_Close.append(row[5])
        Volume.append(row[6])
I've updated your code to use the built-in Python csv module. It makes life easier when dealing with CSV files, and its generator should help prevent any memory problems down the line!
I also added the next(csv_reader) call to skip the header in your CSV data, on the assumption that this is data you may not actually want to record.
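As a minimal, self-contained sketch of that header-skip pattern (using an in-memory sample with made-up prices instead of AAPL_train.csv):

```python
import csv
import io

# Hypothetical two-column stand-in for the real CSV file.
sample = io.StringIO("Date,Open\n2020-01-02,74.06\n2020-01-03,74.29\n")

csv_reader = csv.reader(sample, delimiter=',')
next(csv_reader)            # skip the header row

dates, opens = [], []
for row in csv_reader:      # iterate the reader, not the raw file object
    dates.append(row[0])
    opens.append(float(row[1]))

print(dates)  # ['2020-01-02', '2020-01-03']
```

The key detail is looping over csv_reader rather than the file: the reader yields pre-split rows, and next() has already consumed the header.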
Try using pandas:
import pandas as pd

df = pd.read_csv('AAPL_train.csv')
Date = df['Date'].tolist()
Open = df['Open'].tolist()
High = df['High'].tolist()
Low = df['Low'].tolist()
Close = df['Close'].tolist()
Adj_Close = df['Adj_Close'].tolist()
Volume = df['Volume'].tolist()

Unable to access the values from the .csv file using Python3?

Using the following Python3 code, I am able to access the first column values but unable to access subsequent columns. The error is:
IndexError: list index out of range
with open('smallSample.txt', 'r') as file:
    listOfLines = file.readlines()
for line in listOfLines:
    print(line.strip())
header = listOfLines[0] #with all the labels
print(header.strip().split(','))
for row in listOfLines[1:]:
    values = row.strip().split(',')
    print(values[0]) #Able to access 1st row elements
    print(values[1]) #ERROR Unable to access the Second Column Values
'''IndexError: list index out of range'''
The smallSample.txt data stored is:
Date,SP500,Dividend,Earnings,Consumer Price Index,Long Interest Rate,Real Price,Real Dividend,Real Earnings,PE10
1/1/2016,1918.6,43.55,86.5,236.92,2.09,2023.23,45.93,91.22,24.21
2/1/2016,1904.42,43.72,86.47,237.11,1.78,2006.62,46.06,91.11,24
3/1/2016,2021.95,43.88,86.44,238.13,1.89,2121.32,46.04,90.69,25.37
Actually, your values does not accumulate the rows; it is re-initialized on every iteration of the for loop. Use this code:
with open('data.txt', 'r') as file:
    listOfLines = file.readlines()
for line in listOfLines:
    print(line.strip())
header = listOfLines[0] #with all the labels
print(header.strip().split(','))
values = [] # <= look at here
for row in listOfLines[1:]:
    values.append(row.strip().split(',')) # <= look at here
print(values[0]) # <= outside the for loop
print(values[1])
with open('SP500.txt', 'r') as file:
    lines = file.readlines()
#for line in lines:
#    print(line)
#header = lines[0]
#labels = header.strip().split(',')
#print(labels)
listOfData = []
totalSP = 0.0
for line in lines[6:18]:
    values = line.strip().split(',')
    #print(values[0], values[1], values[5])
    totalSP = totalSP + float(values[1])
    listOfData.append(float(values[5]))
mean_SP = totalSP/12.0
#print(listOfData)
max_interest = listOfData[0]
for i in listOfData:
    if i > max_interest:
        max_interest = i

How to find minimum value from CSV file row in Python?

I'm a beginner at Python and I am using it for my project.
I want to extract the minimum value from column[4] of a CSV file and am not sure how.
I can print the whole of column[4], but am not sure how to print just its minimum value.
CSV File: https://www.emcsg.com/marketdata/priceinformation
I'm downloading the Uniform Singapore Energy Price & Demand Forecast for 9 Sep.
Thank you in advance.
This is my current code:
import csv
import operator

with open('sep.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    header = next(readCSV)
    data = []
    for row in readCSV:
        Date = row[0]
        Time = row[1]
        Demand = row[2]
        RCL = row[3]
        USEP = row[4]
        EHEUR = row[5]
        LCP = row[6]
        Regulations = row[7]
        Primary = row[8]
        Secondary = row[9]
        Contingency = row[10]
        Last_Updated = row[11]
print header[4]
print row[4]
Not sure how you are reading the values; however, you can add all the values to a list and then:
values_list = []
<loop to extract values>
values_list.insert(index, value)
min_value = min(values_list)
Note: index is the 'place' where the value gets inserted. (The name values_list avoids shadowing the built-in list.)
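A concrete version of that sketch, with hypothetical USEP values standing in for the parsed column (and a name other than list, so the builtin isn't shadowed):

```python
usep_values = []

# <loop to extract values> -- hypothetical literals stand in for the parsed CSV
for index, value in enumerate(('81.92', '80.04', '79.83')):
    usep_values.insert(index, float(value))  # inserting at the end, i.e. appending

min_value = min(usep_values)
print(min_value)  # 79.83
```

Since the index always equals the current length, insert() here behaves exactly like append(); min() then scans the whole list once.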
Your phrasing is a bit ambiguous. At first I thought you meant the minimum of the fourth row, but looking at the data you seem to be wanting the minimum of the fourth column (USEP($/MWh)). For that, (assuming that "Realtime_Sep-2017.csv" is the filename) you can do:
import pandas as pd
df = pd.read_csv("Realtime_Sep-2017.csv")
print(min(df["USEP($/MWh)"]))
Other options include df.min()["USEP($/MWh)"], df.min()[4], and min(df.iloc[:,4])
EDIT 2 :
Solution for a column without pandas module:
with open("Realtime_Sep-2017.csv") as file:
    lines = file.read().split("\n") #Read lines
num_list = []
for line in lines:
    try:
        item = line.split(",")[4][1:-1] #Choose 4th column and delete ""
        num_list.append(float(item)) #Try to parse
    except (IndexError, ValueError):
        pass #If it can't parse, the string is not a number
print(max(num_list)) #Prints maximum value
print(min(num_list)) #Prints minimum value
Output:
81.92
79.83
EDIT :
Here is the solution for a column:
import pandas as pd

df = pd.read_csv("Realtime_Sep-2017.csv")
row_count = df.shape[0]
column_list = []
for i in range(row_count):
    item = df.at[i, df.columns.values[4]] #4th column
    column_list.append(float(item)) #parse float and append to list
print(max(column_list)) #Prints maximum value
print(min(column_list)) #Prints minimum value
BEFORE EDIT :
(solution for a row)
Here is a simple code block:
with open("Realtime_Sep-2017.csv") as file:
    lines = file.read().split("\n") #Reading lines
num_list = []
line = lines[3] #Choosing 4th row.
for item in line.split(","):
    try:
        num_list.append(float(item[1:-1])) #Try to parse
    except ValueError:
        pass #If it can't parse, the string is not a number
print(max(num_list)) #Prints maximum value
print(min(num_list)) #Prints minimum value

Split a row into multiple cells and keep the maximum value of second value for each gene

I am new to Python, and I prepared a script that should modify the following csv file as follows:
1) Each row that contains multiple Gene entries separated by the /// such as:
C16orf52 /// LOC102725138 1.00551
should be transformed to:
C16orf52 1.00551
LOC102725138 1.00551
2) The same gene may have different ratio values
AASDHPPT 0.860705
AASDHPPT 0.983691
and we want to keep only the pair with the highest ratio value (delete the pair AASDHPPT 0.860705)
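The keep-the-highest rule can be sketched as a one-pass dictionary update, using the pairs quoted above as sample data:

```python
# (gene, ratio) pairs taken from the examples in the question
pairs = [("AASDHPPT", 0.860705), ("AASDHPPT", 0.983691), ("C16orf52", 1.00551)]

best = {}
for gene, ratio in pairs:
    # keep the larger ratio seen so far for each gene
    if gene not in best or ratio > best[gene]:
        best[gene] = ratio

print(best["AASDHPPT"])  # 0.983691
```

Splitting the '///' entries first and then feeding every (gene, ratio) pair through this loop handles both requirements at once.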
Here is the script I wrote but it does not assign the correct ratio values to the genes:
import csv
import pandas as pd

with open('2column.csv','rb') as f:
    reader = csv.reader(f)
    a = list(reader)
gene = []
ratio = []
for t in range(len(a)):
    if '///' in a[t][0]:
        s = a[t][0].split('///')
        gene.append(s[0])
        gene.append(s[1])
        ratio.append(a[t][1])
        ratio.append(a[t][1])
    else:
        gene.append(a[t][0])
        ratio.append(a[t][1])
    gene[t] = gene[t].strip()
newgene = []
newratio = []
for i in range(len(gene)):
    g = gene[i]
    r = ratio[i]
    if g not in newgene:
        newgene.append(g)
        for j in range(i+1, len(gene)):
            if g == gene[j]:
                if ratio[j] > r:
                    r = ratio[j]
        newratio.append(r)
for i in range(len(newgene)):
    print newgene[i] + '\t' + newratio[i]
if len(newgene) > len(set(newgene)):
    print 'missionfailed'
Thank you very much for any help or suggestion.
Try this:
with open('2column.csv') as f:
    lines = f.read().splitlines()
new_lines = {}
for line in lines:
    cols = line.split(',')
    for part in cols[0].split('///'):
        part = part.strip()
        if part not in new_lines:
            new_lines[part] = cols[1]
        elif float(cols[1]) > float(new_lines[part]):
            new_lines[part] = cols[1]

import csv
with open('clean_2column.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in new_lines.items():
        writer.writerow([k, v])
First of all, if you're importing Pandas, know that you have I/O Tools to read CSV files.
So first, let's read it in that way:
df = pd.read_csv('2column.csv')
Then, you can extract the indexes where you have your '///' pattern:
l = list(df[df['Gene Symbol'].str.contains('///')].index)
Then, you can create your new rows:
for i in l:
    for sub in df['Gene Symbol'][i].split('///'):
        df = df.append(pd.DataFrame([[sub, df['Ratio(ifna vs. ctrl)'][i]]], columns=df.columns))
Then, drop the old ones:
df = df.drop(df.index[l])
Then, I'll do a little trick to remove your lowest duplicate values. First, I'll sort them by 'Ratio(ifna vs. ctrl)', then I'll drop all the duplicates but the first one:
df = df.sort_values('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')
If you want to keep your sorting by Gene Symbol and reset indexes to have simpler ones, simply do:
df = df.sort_values('Gene Symbol').reset_index(drop=True)
If you want to re-export your modified data to your csv, do :
df.to_csv('2column.csv')
EDIT : I edited my answer to correct syntax errors, I've tested this solution with your csv and it worked perfectly :)
This should work.
It uses the dictionary suggestion of Peter.
import csv

with open('2column.csv','r') as f:
    reader = csv.reader(f)
    original_file = list(reader)

# gets rid of the header
original_file = original_file[1:]

# create an empty dictionary
genes_ratio = {}

# loop over every row in the original file
for row in original_file:
    gene_name = row[0]
    gene_ratio = row[1]
    # check if /// is in the string; if so, split the string
    if '///' in gene_name:
        gene_names = gene_name.split('///')
        # loop over all the resulting components
        for gene in gene_names:
            # if the component is not in the dictionary, set its value to gene_ratio
            if gene not in genes_ratio:
                genes_ratio[gene] = gene_ratio
            # if it is, compare the values as floats and overwrite if smaller
            elif float(genes_ratio[gene]) < float(gene_ratio):
                genes_ratio[gene] = gene_ratio
    else:
        if gene_name not in genes_ratio:
            genes_ratio[gene_name] = gene_ratio
        elif float(genes_ratio[gene_name]) < float(gene_ratio):
            genes_ratio[gene_name] = gene_ratio

#loop over the dictionary and print gene names and their ratio values
for key in genes_ratio:
    print key, genes_ratio[key]

Python CSV writer

I have a csv that looks like this:
HA-MASTER,CategoryID
38231-S04-A00,14
39790-S10-A03,14
38231-S04-A00,15
39790-S10-A03,15
38231-S04-A00,16
39790-S10-A03,16
38231-S04-A00,17
39790-S10-A03,17
38231-S04-A00,18
39790-S10-A03,18
38231-S04-A00,19
39795-ST7-000,75
57019-SN7-000,75
38251-SV4-911,75
57119-SN7-003,75
57017-SV4-A02,75
39795-ST7-000,76
57019-SN7-000,76
38251-SV4-911,76
57119-SN7-003,76
57017-SV4-A02,76
What I would like to do is reformat this data so that there is only one line for each categoryID for example:
14,38231-S04-A00,39790-S10-A03
76,39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02
I have not found a way in Excel to accomplish this programmatically. I have over 100,000 lines. Is there a way using Python's CSV reader and writer to do something like this?
Yes there is a way:
import csv

def addRowToDict(row):
    global myDict
    key = row[1]
    if key in myDict.keys():
        #append values if entry already exists
        myDict[key].append(row[0])
    else:
        #create entry
        myDict[key] = [row[1], row[0]]

global myDict
myDict = dict()
inFile = 'C:/Users/xxx/Desktop/pythons/test.csv'
outFile = 'C:/Users/xxx/Desktop/pythons/testOut.csv'
with open(inFile, 'r') as f:
    reader = csv.reader(f)
    ignore = True
    for row in reader:
        if ignore:
            #ignore first row
            ignore = False
        else:
            #add entry to dict
            addRowToDict(row)
with open(outFile, 'w') as f:
    writer = csv.writer(f)
    #write everything to file
    writer.writerows(myDict.itervalues())
Just edit inFile and outFile
This is pretty trivial using a dictionary of lists (Python 2.7 solution):
#!/usr/bin/env python
import fileinput

categories = {}
for line in fileinput.input():
    # Skip the first line in the file (assuming it is a header).
    if fileinput.isfirstline():
        continue
    # Split the input line into two fields.
    ha_master, cat_id = line.strip().split(',')
    # If the given category id is NOT already in the dictionary,
    # add a new empty list.
    if cat_id not in categories:
        categories[cat_id] = []
    # Append a new value to the category.
    categories[cat_id].append(ha_master)

# Iterate over all category IDs and lists. Use ','.join() to
# output a comma-separated list from a Python list.
for k, v in categories.iteritems():
    print '%s,%s' % (k, ','.join(v))
I would read in the entire file, create a dictionary where the key is the ID and the value is a list of the other data.
data = {}
with open("test.csv", "r") as f:
    for line in f:
        temp = line.rstrip().split(',')
        if len(temp[0].split('-')) == 3: # => specific format that ignores the header...
            if temp[1] in data:
                data[temp[1]].append(temp[0])
            else:
                data[temp[1]] = [temp[0]]

with open("output.csv", "w+") as f:
    for id, datum in data.iteritems():
        f.write("{},{}\n".format(id, ','.join(datum)))
Use pandas!
import pandas

csv_data = pandas.read_csv('path/to/csv/file')
use_this = csv_data.groupby('CategoryID')['HA-MASTER'].apply(list)
You will get a Series mapping each CategoryID to the list of its values; now you just have to format it.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Cheers.
I see many beautiful answers have come up while I was trying it, but I'll post mine as well.
import re

csvIN = open('your csv file','r')
csvOUT = open('out.csv','w')
cat = dict()
for line in csvIN:
    line = line.rstrip()
    if not re.search('^[0-9]+', line):
        continue
    ham, cid = line.split(',')
    if cat.get(cid, False):
        cat[cid] = cat[cid] + ',' + ham
    else:
        cat[cid] = ham
for i in sorted(cat):
    csvOUT.write(i + ',' + cat[i] + '\n')
Pandas approach:
import pandas as pd
df = pd.read_csv('data.csv')
#new = df.groupby('CategoryID')['HA-MASTER'].apply(lambda row: '%s' % ','.join(row))
new = df.groupby('CategoryID')['HA-MASTER'].agg(','.join)
new.to_csv('out.csv')
out.csv:
14,"38231-S04-A00,39790-S10-A03"
15,"38231-S04-A00,39790-S10-A03"
16,"38231-S04-A00,39790-S10-A03"
17,"38231-S04-A00,39790-S10-A03"
18,"38231-S04-A00,39790-S10-A03"
19,38231-S04-A00
75,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
76,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
This was an interesting question. My solution was to append each new item for a given key to a single string in the value, along with a comma to delimit the columns.
set_vals = {}
with open('Input01.csv') as input_file:
    file_lines = [item.strip() for item in input_file.readlines()]
for item in iter([i.split(',') for i in file_lines]):
    if item[1] in set_vals:
        set_vals[item[1]] = set_vals[item[1]] + ',' + item[0]
    else:
        set_vals[item[1]] = item[0]
with open('Results01.csv','w') as output_file:
    for i in sorted(set_vals.keys()):
        output_file.write('{},{}\n'.format(i, set_vals[i]))
MaxU's pandas implementation has good potential and looks really elegant, but because each joined string is double-quoted in the output, reading it back places all the values in one cell. For example, the line for category 18, "38231-S04-A00,39790-S10-A03", puts both values in the second column.
import csv
from collections import defaultdict

inpath = '' # Path to input CSV
outpath = '' # Path to output CSV

output = defaultdict(list) # To hold {category: [serial_numbers]}
for row in csv.DictReader(open(inpath)):
    output[row['CategoryID']].append(row['HA-MASTER'])

with open(outpath, 'w') as f:
    f.write('CategoryID,HA-MASTER\n')
    for category, serial_numbers in output.items():
        row = '%s,%s\n' % (category, ','.join(serial_numbers))
        f.write(row)
