Reading .csv files into Python lists - python

I have a lot of .csv files in a directory and I'd like to open each of them in a loop within Python such that the first .csv is read into list[0] and the second .csv is read into list[1] and so on.
Unfortunately, while my code loops through all of the .csv files, it puts all the .csv files into list[0]. How can I modify my code so that I can achieve my goal above? Many thanks.
John
Here's the code:
def create_data_lists():
i=0
for symbol in symbols:
with open(symbols[i]+'.csv', 'r') as f:
print i
reader = csv.reader(f)
reader.next()
for row in reader:
rowdata.append(row)
data_by_symbol.append(rowdata)
i=i+1

inside the for loop, near the top, you have to refresh the list rowdata. otherwise you are adding to that one forever. have something like rowdata = [] right after print i
def create_data_lists():
for symbol in symbols:
with open(symbol+'.csv', 'r') as f:
print symbol
rowdata = []
reader = csv.reader(f)
reader.next()
for row in reader:
rowdata.append(row)
data_by_symbol.append(rowdata)
EDIT got rid of i, as i am really not using it

why not store the readers themselves in a list?
list_of_csv_files = []
for f in filenames:
list_of_csv_files.append(csv.DictReader(open(f)))
This will store the reader itself in a list, allowing you to later on do something such as:
for row in list_of_csv_files[0]:
# do some processing on the row
The biggest advantage of this method is taht you can then do stuff like filter columns easily, using methods such as:
one_row = [row["name of column heading"] for row in list_of_csv_files[0]]
two_rows = [[row["name col 2"], row["name col 2"]] for row in list_of_csv_files[0]]
which I suspect would be more helpful to your program than storing pre-read (and thus de-structured) csv files.
but if you really want to have all the CSV files read in and stored in a list, you will need a list of lists, I don't recommend this, it will be very memory intensive:
list_of_csv_files = [[]]
for f in filenames:
list_of_csv_files.append([row.values() for row in csv.DictReader(open(f))])

Related

how to select a specific column of a csv file in python

I am a beginner of Python and would like to have your opinion..
I wrote this code that reads the only column in a file on my pc and puts it in a list.
I have difficulties understanding how I could modify the same code with a file that has multiple columns and select only the column of my interest.
Can you help me?
list = []
with open(r'C:\Users\Desktop\mydoc.csv') as file:
for line in file:
item = int(line)
list.append(item)
results = []
for i in range(0,1086):
a = list[i-1]
b = list[i]
c = list[i+1]
results.append(b)
print(results)
You can use pandas.read_csv() method very simply like this:
import pandas as pd
my_data_frame = pd.read_csv('path/to/your/data')
results = my_data_frame['name_of_your_wanted_column'].values.tolist()
A useful module for the kind of work you are doing is the imaginatively named csv module.
Many csv files have a "header" at the top, this by convention is a useful way of labeling the columns of your file. Assuming you can insert a line at the top of your csv file with comma delimited fieldnames, then you could replace your program with something like:
import csv
with open(r'C:\Users\Desktop\mydoc.csv') as myfile:
csv_reader = csv.DictReader(myfile)
for row in csv_reader:
print ( row['column_name_of_interest'])
The above will print to the terminal all the values that match your specific 'column_name_of_interest' after you edit it to match your particular file.
It's normal to work with lots of columns at once, so that dictionary method of packing a whole row into a single object, addressable by column-name can be very convenient later on.
To a pure python implementation, you should use the package csv.
data.csv
Project1,folder1/file1,data
Project1,folder1/file2,data
Project1,folder1/file3,data
Project1,folder1/file4,data
Project1,folder2/file11,data
Project1,folder2/file42a,data
Project1,folder2/file42b,data
Project1,folder2/file42c,data
Project1,folder2/file42d,data
Project1,folder3/filec,data
Project1,folder3/fileb,data
Project1,folder3/filea,data
Your python program should read it by line
import csv
a = []
with open('data.csv') as csv_file:
reader = csv.reader(csv_file, delimiter=',')
for row in reader:
print(row)
# ['Project1', 'folder1/file1', 'data']
If you print the row element you will see it is a list like that
['Project1', 'folder1/file1', 'data']
If I would like to put in my list all elements in column 1, I need to put that element in my list, doing:
a.append(row[1])
Now in list a I will have a list like:
['folder1/file1', 'folder1/file2', 'folder1/file3', 'folder1/file4', 'folder2/file11', 'folder2/file42a', 'folder2/file42b', 'folder2/file42c', 'folder2/file42d', 'folder3/filec', 'folder3/fileb', 'folder3/filea']
Here is the complete code:
import csv
a = []
with open('data.csv') as csv_file:
reader = csv.reader(csv_file, delimiter=',')
for row in reader:
a.append(row[1])

Python Looping through CSV files and their columns

so I've seen this done is other questions asked here but I'm still a little confused. I've been learning python3 for the last few days and figured I'd start working on a project to really get my hands dirty. I need to loop through a certain amount of CSV files and make edits to those files. I'm having trouble with going to a specific column and also for loops in python in general. I'm used to the convention (int i = 0; i < expression; i++), but in python it's a little different. Here's my code so far and I'll explain where my issue is.
import os
import csv
pathName = os.getcwd()
numFiles = []
fileNames = os.listdir(pathName)
for fileNames in fileNames:
if fileNames.endswith(".csv"):
numFiles.append(fileNames)
for i in numFiles:
file = open(os.path.join(pathName, i), "rU")
reader = csv.reader(file, delimiter=',')
for column in reader:
print(column[4])
My issue falls on this line:
for column in reader:
print(column[4])
So in the Docs it says column is the variable and reader is what I'm looping through. But when I write 4 I get this error:
IndexError: list index out of range
What does this mean? If I write 0 instead of 4 it prints out all of the values in column 0 cell 0 of each CSV file. I basically need it to go through the first row of each CSV file and find a specific value and then go through that entire column. Thanks in advance!
It could be that you don't have 5 columns in your .csv file.
Python is base0 which means it starts counting at 0 so the first column would be column[0], the second would be column[1].
Also you may want to change your
for column in reader:
to
for row in reader:
because reader iterates through the rows, not the columns.
This code loops through each row and then each column in that row allowing you to view the contents of each cell.
for i in numFiles:
file = open(os.path.join(pathName, i), "rU")
reader = csv.reader(file, delimiter=',')
for row in reader:
for column in row:
print(column)
if column=="SPECIFIC VALUE":
#do stuff
Welcome to Python! I suggest you to print some debugging messages.
You could add this to you printing loop:
for row in reader:
try:
print(row[4])
except IndexError as ex:
print("ERROR: %s in file %s doesn't contain 5 colums" % (row, i))
This will print bad lines (as lists because this is how they are represented in CSVReader) so you could fix the CSV files.
Some notes:
It is common to use snake_case in Python and not camelCase
Name your variables appropriately (csv_filename instead of i, row instead of column etc.)
Use the with close to handle files (read more)
Enjoy!

Remove columns + keep certain rows in multiple large .csv files using python

Hello I'm really new here as well as in the world of python.
I have some (~1000) .csv files, including ~ 1800000 rows of information each. The files are in the following form:
5302730,131841,-0.29999999999999999,NULL,2013-12-31 22:00:46.773
5303072,188420,28.199999999999999,NULL,2013-12-31 22:27:46.863
5350066,131841,0.29999999999999999,NULL,2014-01-01 00:37:21.023
5385220,-268368577,4.5,NULL,2014-01-01 03:12:14.163
5305752,-268368587,5.1900000000000004,NULL,2014-01-01 03:11:55.207
So, i would like for all of the files:
(1) to remove the 4th (NULL) column
(2) to keep in every file only certain rows (depending on the value of the first column i.e.5302730, keep only the rows that containing that value)
I don't know if this is even possible, so any answer is appreciated!
Thanks in advance.
Have a look at the csv module
One can use the csv.reader function to generate an iterator of lines, with each lines cells as a list.
for line in csv.reader(open("filename.csv")):
# Remove 4th column, remember python starts counting at 0
line = line[:3] + line[4:]
if line[0] == "thevalueforthefirstcolumn":
dosomethingwith(line)
If you wish to do this sort of operation with CSV files more than once and want to use different parameters regarding column to skip, column to use as key and what to filter on, you can use something like this:
import csv
def read_csv(filename, column_to_skip=None, key_column=0, key_filter=None):
data_from_csv = []
with open(filename) as csvfile:
csv_reader = csv.reader(csvfile)
for row in csv_reader:
# Skip data in specific column
if column_to_skip is not None:
del row[column_to_skip]
# Filter out rows where the key doesn't match
if key_filter is not None:
key = row[key_column]
if key_filter != key:
continue
data_from_csv.append(row)
return data_from_csv
def write_csv(filename, data_to_write):
with open(filename, 'w') as csvfile:
csv_writer = csv.writer(csvfile)
for row in data_to_write:
csv_writer.writerow(row)
data = read_csv('data.csv', column_to_skip=3, key_filter='5302730')
write_csv('data2.csv', data)

Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
with open(fpath,"rb") as f:
csvfile = csv.reader(f)
for row in csvfile:
allRows.append(row)
allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you so choose. Just to note, you mention the 20th column, but row[20] will be the 21st column...
import csv
matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
with open(fname) as fin:
csvin = csv.reader(fin)
next(csvin) # <--- if you want to skip header row
for row in csvin:
if row[20] == 'value':
matching20.append(row) # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
This should work, you don't need to make another list to have access to the columns.
import csv
import sys
def getCSV(fpath):
with open(fpath) as ifile:
csvfile = csv.reader(ifile)
rows = list(csvfile)
value_20 = [x for x in rows if x[20] == 'value']
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos, then:
testlist = []
pos = 20
for row in allRows:
testlist.append([element if index != pos else 'value' for index, element in enumerate(row)])
(I haven't tested this, but let me now if that works).

Python- Import Multiple Files to a single .csv file

I have 125 data files containing two columns and 21 rows of data and I'd like to import them into a single .csv file (as 125 pairs of columns and only 21 rows).
This is what my data files look like:
I am fairly new to python but I have come up with the following code:
import glob
Results = glob.glob('./*.data')
fout='c:/Results/res.csv'
fout=open ("res.csv", 'w')
for file in Results:
g = open( file, "r" )
fout.write(g.read())
g.close()
fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob
files = [open(f) for f in glob.glob('./*.data')] #Make list of open files
fout = open("res.csv", 'w')
for row in range(21):
for f in files:
fout.write( f.readline().strip() ) # strip removes trailing newline
fout.write(',')
fout.write('\n')
fout.close()
Note that this method will probably fail if you try a large number of files, I believe the default limit in Python is 256.
You may want to try the python CSV module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 python lists as your rows and then appending data to each row as you loop through your files.
something like:
import csv
rows = []
for i in range(0,21):
row = []
rows.append(row)
#not sure the structure of your input files or how they are delimited, but for each one, as you have it open and iterate through the rows, you would want to append the values in each row to the end of the corresponding list contained within the rows list.
#then, write each row to the new csv:
writer = csv.writer(open('output.csv', 'wb'), delimiter=',')
for row in rows:
writer.writerow(row)
(Sorry, I cannot add comments, yet.)
[Edited later, the following statement is wrong!!!] "The davesnitty's generating the rows loop can be replaced by rows = [[]] * 21." It is wrong because this would create the list of empty lists, but the empty lists would be a single empty list shared by all elements of the outer list.
My +1 to using the standard csv module. But the file should be always closed -- especially when you open that much of them. Also, there is a bug. The row read from the file via the -- even though you only write the result here. The solution is actually missing. Basically, the row read from the file should be appended to the sublist related to the line number. The line number should be obtained via enumerate(reader) where reader is csv.reader(fin, ...).
[added later] Try the following code, fix the paths for your puprose:
import csv
import glob
import os
datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
os.makedirs(resultpath)
# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []
# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
with open(fname, 'rb') as f:
reader = csv.reader(f)
for n, row in enumerate(reader):
if len(rows) < n+1:
rows.append([]) # add another row
rows[n].extend(row) # append the elements from the file
# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
writer = csv.writer(f)
for row in rows:
writer.writerow(row)

Categories

Resources