Python: Reading Data from Multiple CSV Files to Lists

I'm using Python 3.5 to move through directories and subdirectories to access csv files and fill arrays with data from those files. The first csv file the code encounters looks like this:
The code I have is below:
import matplotlib.pyplot as plt
import numpy as np
import os, csv, datetime, time, glob

gpheight = []
RH = []
dewpt = []
temp = []
windspd = []
winddir = []

dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))
count2 = 0
for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.datetime.strptime(dirname[:8], '%m%d%Y')
            csv_folder = os.path.join(dirpath, dirname)
            for csv_file2 in glob.glob(os.path.join(csv_folder, 'figs', '0*.csv')):
                if os.stat(csv_file2).st_size == 0:
                    continue
                # create new arrays for each case
                gpheight.append([])
                RH.append([])
                temp.append([])
                dewpt.append([])
                windspd.append([])
                winddir.append([])
                with open(csv_file2, newline='') as f2_input:
                    csv_input2 = csv.reader(f2_input, delimiter=' ')
                    for j, row2 in enumerate(csv_input2):
                        if j == 0:
                            continue  # skip header row
                        # fill arrays created above
                        windspd[count2].append(float(row2[5]))
                        winddir[count2].append(float(row2[6]))
                        gpheight[count2].append(float(row2[1]))
                        RH[count2].append(float(row2[4]))
                        temp[count2].append(float(row2[2]))
                        dewpt[count2].append(float(row2[3]))
                count2 = count2 + 1
        except ValueError as e:
            pass
I have it set up to create a new array for each new csv file. However, when I print the third (temperature) column,
for n in range(0, len(temp)):
    print(temp[0][n])
it only partially prints that column of data:
-70.949997
-68.149994
-60.449997
-63.649994
-57.449997
-51.049988
-45.349991
-40.249985
-35.549988
-31.249985
-27.149994
-24.549988
-22.149994
-19.449997
-16.349976
-13.25
-11.049988
-8.949982
-6.75
-4.449982
-2.25
-0.049988
In addition, I believe a related problem is that when I simply do,
print(temp)
it prints
with the highlighted section being the part that belongs to this one csv file, and which should therefore be in one array. There are also additional empty arrays at the end that should not be there.
I have (not shown) a section of code before this that does the same thing but with different csv files, and that works as expected, separating each file's data into a new array, with no empty arrays. I appreciate any help!

The issue was my use of try with a bare pass. All the files matched my criteria, but some of them had issues with how their contents were read, which caused the errors I was seeing later in the code. For anyone looking to use try with pass: make sure you can safely pass on any exception that block of code may raise. Otherwise, it can cause problems later. You may still get an error if you don't pass on it, but that will force you to fix it properly instead of ignoring it.
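As a hedged sketch of that advice (the function and data below are illustrative, not from the original code): one way to keep a loop running while still surfacing problems is to log each exception instead of silently passing.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger(__name__)

def parse_rows(rows):
    """Parse rows of numeric strings, logging (not hiding) any bad ones."""
    parsed = []
    for i, row in enumerate(rows):
        try:
            parsed.append([float(x) for x in row])
        except ValueError:
            # Record which row failed instead of a bare `pass`,
            # so problems surface immediately rather than later.
            log.warning("skipping malformed row %d: %r", i, row)
    return parsed

good = parse_rows([["1.5", "2.0"], ["oops", "3"], ["4", "5"]])
```

The loop still completes for the good rows, but every skipped row leaves a trace in the log rather than vanishing.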

Related

Operation between element in the same list of list generate from imported .dat files

I'm writing a program that takes .dat files from directory one at a time, verifies some condition, and if verification is okay copies the files to another directory.
The code below shows how I import the files and create a list of lists. I'm having trouble with the verification step. I tried with a for loop, but when I set the if condition, operating on elements of the list of lists seems impossible.
In particular I need the difference between consecutive elements matrix[i][3] and matrix[i+1][3] to be less than 5.
for filename in glob.glob(os.path.join(folder_path, '*.dat')):
    with open(filename, 'r') as f:
        matrix = []
        data = f.readlines()
        for raw_line in data:
            split_line1 = raw_line.replace(":", ";")
            split_line2 = split_line1.replace("\n", "")
            split_line3 = split_line2.strip().split(";")
            matrix.append(split_line3)
Hello and welcome to Stack Overflow.
You did not provide a sample of your data files. After looking at your code, I assume your data looks like this:
9;9;7;5;0;9;5;8;4;2
9;1;1;5;1;3;4;1;8;7
2;8;4;5;5;2;1;4;6;4
6;4;1;5;5;8;1;4;6;1
0;1;0;5;7;1;7;4;1;9
4;9;6;5;3;2;6;2;9;6
8;0;6;0;8;9;3;1;6;6
A few general remarks:
For parsing a csv file, use the csv module. It is easy to use and less error-prone than writing your own parser.
If you do a lot of data-processing and matrix calculations, you want to have a look at the pandas and numpy libraries. Processing matrices line by line in plain Python is slower by some orders of magnitude.
I understand your description of the verification step as follows:
A matrix matches if all consecutive elements
matrix[i][3] and matrix[i+1][3] differ by less than 5.
My suggested code looks like this:
import csv
from glob import glob
from pathlib import Path

def read_matrix(fn):
    with open(fn) as f:
        c = csv.reader(f, delimiter=";")
        m = [[float(c) for c in row] for row in c]
    return m

def verify_condition(matrix):
    col = 3
    pairs_of_consecutive_rows = zip(matrix[:-1], matrix[1:])
    for row_i, row_j in pairs_of_consecutive_rows:
        if abs(row_i[col] - row_j[col]) >= 5:
            return False
    return True

if __name__ == '__main__':
    folder_path = Path("../data")
    for filename in glob(str(folder_path / '*.dat')):
        print(f"processing {filename}")
        matrix = read_matrix(filename)
        matches = verify_condition(matrix)
        if matches:
            print("match")
            # copy_file(filename, target_folder)
I am not going into detail about the function read_matrix. Just note that I convert the strings to float with the statement float(c) in order to be able to do numerical calculations later on.
I iterate over all consecutive rows by iterating over matrix[:-1] and matrix[1:] at the same time using zip. See the effect of zip in this example:
>>> list(zip("ABC", "XYZ"))
[('A', 'X'), ('B', 'Y'), ('C', 'Z')]
And the effect of the [:-1] and [1:] indices here:
>>> "ABC"[:-1], "ABC"[1:]
('AB', 'BC')
When verify_condition finds the first two consecutive rows that differ by at least 5, it returns False.
I am confident that this code should help you get going.
PS: I could not resist using the pathlib library because I really prefer to see code like folder / subfolder / "filename.txt" instead of path.join(folder, subfolder, "filename.txt") in my scripts.
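For readers unfamiliar with pathlib, here is a minimal sketch of the joining style mentioned above (the paths are made up for illustration):

```python
from pathlib import Path

folder = Path("..") / "data"           # same result as os.path.join("..", "data")
target = folder / "sub" / "file.dat"   # the / operator chains naturally

# Path objects expose the pieces the os.path functions would compute:
assert target.name == "file.dat"
assert target.suffix == ".dat"

# Globbing is a method on the folder itself; this replaces
# glob.glob(os.path.join(folder, "*.dat")) and yields nothing
# (rather than raising) if the folder does not exist.
dat_files = sorted(folder.glob("*.dat"))
```

Everything here is in the standard library from Python 3.4 onward.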

Python exit code -9

I'm writing a Python code that processes thousands of files, puts the data of each file in a data frame, and each data frame gets appended in an array. Afterwards, it takes this array and concatenates it so that the end result is one matrix containing all the data of all the data frames.
Here is the code to illustrate:
for root, dirs, filenames in os.walk(folder_name):
    for f in filenames:
        if f == '.DS_Store':
            continue
        fullpath = os.path.join(folder_name, f)
        book = open(fullpath, 'r')
        data = {u[0]: u[1] for u in json.load(book)}
        books.append(pd.DataFrame(data=[data], index=[f]))
df = pd.concat(books, axis=0).fillna(0).sort_index()
M = df.as_matrix()
I encounter no issue in the processing part; the for loop works perfectly. However, when I try to concatenate, the code keeps running for 20 minutes or so then the script stops with an "exit code -9". Any idea what that could mean and/or how this could be fixed?
Any suggestion would be very appreciated !
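One clue worth knowing: a negative exit code is how Python reports death by signal, so -9 means the process was killed with signal 9 (SIGKILL), which on Linux is typically the kernel's out-of-memory killer terminating a process that grew too large. The sketch below (POSIX-only, purely illustrative) shows where the -9 comes from by killing a child process ourselves:

```python
import signal
import subprocess
import sys

# Start a child that would sleep for a minute, then deliver SIGKILL.
# Python reports the signal as a negative return code: this is the
# same "-9" that the OOM killer produces on memory-exhausted scripts.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
proc.send_signal(signal.SIGKILL)
proc.wait()
print(proc.returncode)
```

If memory is indeed the culprit here, reducing the peak footprint (e.g. not holding all per-file DataFrames alive at once) is the usual direction to explore.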

number of columns in xlwt worksheet

I can't seem to find a way to return the value of the number of columns in a worksheet in xlwt.Workbook(). The idea is to take a wad of .xls files in a directory and combine them into one. One problem I am having is changing the column position when writing the next file. This is what I'm working with thus far:
import xlwt, xlrd, os

def cbc(rd_sheet, wt_sheet, rlo=0, rhi=None,
        rshift=0, clo=0, chi=None, cshift=0):
    if rhi is None: rhi = rd_sheet.nrows
    if chi is None: chi = 2  # only first two cols needed
    for row_index in xrange(rlo, rhi):
        for col_index in xrange(clo, chi):
            cell = rd_sheet.cell(row_index, col_index)
            wt_sheet.write(row_index + rshift, col_index + cshift, cell.value)

Dir = '/home/gerg/Desktop/ex_files'
ext = '.xls'
list_xls = [file for file in os.listdir(Dir) if file.endswith(ext)]
files = [Dir + '/%s' % n for n in list_xls]
output = '/home/gerg/Desktop/ex_files/copy_test.xls'
wbook = xlwt.Workbook()
wsheet = wbook.add_sheet('Summary', cell_overwrite_ok=True)  # overwrite just for the repeated testing
for XLS in files:
    rbook = xlrd.open_workbook(XLS)
    rsheet = rbook.sheet_by_index(0)
    cbc(rsheet, wsheet, cshift=0)
wbook.save(output)
list_xls returns:
['file2.xls', 'file3.xls', 'file1.xls', 'copy_test.xls']
files returns:
['/home/gerg/Desktop/ex_files/file2.xls', '/home/gerg/Desktop/ex_files/file3.xls', '/home/gerg/Desktop/ex_files/file1.xls', '/home/gerg/Desktop/ex_files/copy_test.xls']
My question is how to scoot each file written into xlwt.workbook over by 2 each time. This code gives me the first file saved to .../copy_test.xls. Is there a problem with the file listing as well? I have a feeling there may be.
This is Python2.6 and I bounce between windows and linux.
Thank you for your help,
GM
You are using only the first two columns in each input spreadsheet. You don't need "the number of columns in a worksheet in xlwt.Workbook()". You already have the cshift mechanism in your code, but you are not using it. All you need to do is change the loop in your outer block, like this:
for file_index, file_name in enumerate(files):
    rbook = xlrd.open_workbook(file_name)
    rsheet = rbook.sheet_by_index(0)
    cbc(rsheet, wsheet, chi=2, cshift=file_index * 2)
For generality, change the line
if chi is None: chi = 2
in your function to
if chi is None: chi = rsheet.ncols
and pass chi=2 in as an arg as I have done in the above code.
I don't understand your rationale for overriding the overwrite check ... surely in your application, overwriting an existing cell value is incorrect?
You say "This code gives me the first file saved to .../copy_test.xls". First in input order is file2.xls. The code that you have shown is overwriting previous input and will give you the LAST file (in input order), not the first ... perhaps you are mistaken. Note: the last input file, 'copy_test.xls', is quite likely to be a previous OUTPUT file; perhaps your output file should be put in a separate folder.

Optimize python file comparison script

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv

fileAin = open('old.csv', 'rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv', 'rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv', 'wb')
fNewUpdate = csv.writer(fileCout)

old = []
new = []
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []
x = len(new)
i = 0
num = 0
while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1

fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is, for what I am trying to do is this there a faster, more efficient, way to process the two files to create the third file containing only new and updated records? I don't really have a target time, just mostly wanting to understand if there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time because old is a list so you have to iterate through the entire list. Using a dictionary is much much faster.
import csv

fileAin = open('old.csv', 'rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv', 'rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv', 'wb')
fNewUpdate = csv.writer(fileCout)

old = {row[0]: row[1:] for row in fOld}
new = {row[0]: row[1:] for row in fNew}
fileAin.close()
fileBin.close()

output = {}
for row_id in new:
    if row_id not in old or not old[row_id] == new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])
fileCout.close()
difflib is quite efficient: http://docs.python.org/library/difflib.html
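As a small illustration of difflib (the member rows below are made up), unified_diff reports only the changed regions between two line sequences, which is exactly the "new and updated records" shape this question asks for:

```python
import difflib

old = ["123,DOE,JOHN,A\n", "124,ROE,JANE,B\n", "125,POE,EDGAR,C\n"]
new = ["123,DOE,JOHN,A\n", "124,ROE,JANE,D\n", "125,POE,EDGAR,C\n"]

# unified_diff yields diff lines lazily: context lines start with a
# space, removals with '-', additions with '+', plus file headers.
diff = list(difflib.unified_diff(old, new, fromfile="old.csv", tofile="new.csv"))

# Keep only the actual changes, skipping the '---'/'+++' header lines.
changed = [line for line in diff
           if line.startswith(("-", "+")) and not line.startswith(("---", "+++"))]
```

Note that difflib compares whole lines, so a record counts as "changed" if any field differs; for keyed comparison by ID, the dictionary approach above is the better fit.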
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort
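A minimal sketch of that merge-style pass, assuming each record has been reduced to an (id, payload) tuple and both lists are sorted by id (the record layout is illustrative):

```python
def merge_diff(old, new):
    """Walk two id-sorted lists once and classify the differences.

    Runs in O(len(old) + len(new)), like the merge step of merge sort;
    unchanged records produce no output.
    """
    out, i, j = [], 0, 0
    while i < len(old) and j < len(new):
        if old[i][0] == new[j][0]:
            if old[i][1] != new[j][1]:
                out.append(("modified", new[j]))
            i += 1
            j += 1
        elif old[i][0] < new[j][0]:
            out.append(("removed", old[i]))
            i += 1
        else:
            out.append(("added", new[j]))
            j += 1
    # Whatever remains in either list was removed from or added to it.
    out.extend(("removed", r) for r in old[i:])
    out.extend(("added", r) for r in new[j:])
    return out

result = merge_diff([(1, "a"), (2, "b"), (4, "d")],
                    [(1, "a"), (2, "x"), (3, "c")])
```

Because each list is consumed front-to-back, this also works when the files are streamed row by row instead of loaded whole, which keeps memory flat for very large inputs.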

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300Mb files easily fit into memory.
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}
with open('file1.csv') as handle:
    for row in csv.reader(handle):
        # store the payload as a tuple so it can be concatenated
        # with the key tuple when writing out below
        file1_dict[tuple(row[:2])] = tuple(row[2:])
with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])
with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.iteritems():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution but it should get you ~95% of the way there. #S.Lott is right, 2 300mb files will easily fit in memory ... if your files get into the 1-2gb range then this may have to be modified with the assumption of sorted data.
Something like this is close ... although you may have to change the comparisons around for the added and modified cases to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime
from itertools import imap

def str2date(s):  # `in` is a reserved word, so the argument is renamed
    return datetime.date(*map(int, s.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key + gen1val + ('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key + gen2val + ('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key > gen2key:
                        writer.writerow(gen2key + gen2val + ('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key + gen1val + ('Removed',))
