I have 125 data files containing two columns and 21 rows of data and I'd like to import them into a single .csv file (as 125 pairs of columns and only 21 rows).
This is what my data files look like:
I am fairly new to Python, but I have come up with the following code:
import glob
Results = glob.glob('./*.data')
fout = 'c:/Results/res.csv'
fout = open("res.csv", 'w')
for file in Results:
    g = open(file, "r")
    fout.write(g.read())
    g.close()
fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob
files = [open(f) for f in glob.glob('./*.data')]  # make a list of open files
fout = open("res.csv", 'w')
for row in range(21):
    # strip() removes the trailing newline; join() avoids a trailing comma
    fout.write(','.join(f.readline().strip() for f in files))
    fout.write('\n')
fout.close()
for f in files:
    f.close()
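Note that this method will probably fail if you try a large number of files, because all of them are open at the same time and the number of simultaneously open files is limited (I believe the default limit is 256 on some systems).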
You may want to try the Python csv module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 Python lists as your rows and then appending data to each row as you loop through your files.
Something like:
import csv
rows = []
for i in range(0, 21):
    row = []
    rows.append(row)
# Not sure of the structure of your input files or how they are delimited,
# but for each one, as you have it open and iterate through its rows, you
# would want to append the values in each row to the end of the
# corresponding list contained within the rows list.
# Then, write each row to the new csv
# (Python 3 shown; on Python 2 use open('output.csv', 'wb') instead):
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for row in rows:
        writer.writerow(row)
(Sorry, I cannot add comments, yet.)
[Edited later: the following statement is wrong!] "davesnitty's loop that generates the rows can be replaced by rows = [[]] * 21." It is wrong because this creates a list whose 21 elements all refer to one and the same empty list, so appending to one row appends to all of them.
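To make the aliasing problem concrete, here is a small sketch (using 3 rows instead of 21 to keep the output short; the variable names are just for illustration):

```python
# Multiplying a list copies references, so every "row" is the same list object.
shared = [[]] * 3
shared[0].append("x")
print(shared)        # [['x'], ['x'], ['x']]

# A list comprehension builds a separate list for each row.
independent = [[] for _ in range(3)]
independent[0].append("x")
print(independent)   # [['x'], [], []]
```

So if you want a one-liner for creating the rows, the list comprehension is the safe form.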
My +1 for using the standard csv module. But the files should always be closed, especially when you open that many of them. Also, there is a bug: the solution is actually missing the part that reads the rows from the input files; only the writing of the result is shown. Basically, each row read from a file should be appended to the sublist for its line number. The line number can be obtained via enumerate(reader), where reader is csv.reader(fin, ...).
[Added later] Try the following code, and fix the paths for your purpose:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    # Python 3: text mode with newline=''; on Python 2 use mode 'rb'.
    with open(fname, newline='') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n+1:
                rows.append([])  # add another row
            rows[n].extend(row)  # append the elements from the file

# Write the data from memory to the result file.
# Python 3: mode 'w' with newline=''; on Python 2 use mode 'wb'.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
I am a beginner in Python and would like to have your opinion.
I wrote this code that reads the only column in a file on my PC and puts it in a list.
I have difficulty understanding how I could modify the same code for a file that has multiple columns, so that it selects only the column I am interested in.
Can you help me?
list = []
with open(r'C:\Users\Desktop\mydoc.csv') as file:
    for line in file:
        item = int(line)
        list.append(item)
results = []
for i in range(0,1086):
    a = list[i-1]
    b = list[i]
    c = list[i+1]
    results.append(b)
print(results)
You can use the pandas.read_csv() function very simply like this:
import pandas as pd
my_data_frame = pd.read_csv('path/to/your/data')
results = my_data_frame['name_of_your_wanted_column'].values.tolist()
A useful module for the kind of work you are doing is the imaginatively named csv module.
Many csv files have a "header" line at the top, which by convention labels the columns of your file. Assuming you can insert a line at the top of your csv file with comma-delimited field names, you could replace your program with something like:
import csv
with open(r'C:\Users\Desktop\mydoc.csv') as myfile:
    csv_reader = csv.DictReader(myfile)
    for row in csv_reader:
        print(row['column_name_of_interest'])
The above will print to the terminal all the values that match your specific 'column_name_of_interest' after you edit it to match your particular file.
It's normal to work with lots of columns at once, so that dictionary method of packing a whole row into a single object, addressable by column-name can be very convenient later on.
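As a minimal sketch of collecting one named column into a list (rather than printing it), assuming a header line is present: the column names and values below are made up for illustration, and io.StringIO stands in for a real file.

```python
import csv
import io

# Hypothetical file contents; in practice you would open a real file instead.
sample = io.StringIO(
    "id,temperature,city\n"
    "1,21,Oslo\n"
    "2,25,Rome\n"
)

reader = csv.DictReader(sample)
# csv always returns strings, so convert the values explicitly.
temperatures = [int(row["temperature"]) for row in reader]
print(temperatures)  # [21, 25]
```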
For a pure Python implementation, you should use the csv module.
data.csv
Project1,folder1/file1,data
Project1,folder1/file2,data
Project1,folder1/file3,data
Project1,folder1/file4,data
Project1,folder2/file11,data
Project1,folder2/file42a,data
Project1,folder2/file42b,data
Project1,folder2/file42c,data
Project1,folder2/file42d,data
Project1,folder3/filec,data
Project1,folder3/fileb,data
Project1,folder3/filea,data
Your Python program should read it line by line:
import csv
a = []
with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print(row)
        # ['Project1', 'folder1/file1', 'data']
If you print the row element, you will see that it is a list like this:
['Project1', 'folder1/file1', 'data']
To collect all the elements of column 1 (the second column, since indexing starts at 0) in my list, I append that element inside the loop:
a.append(row[1])
Now the list a will contain:
['folder1/file1', 'folder1/file2', 'folder1/file3', 'folder1/file4', 'folder2/file11', 'folder2/file42a', 'folder2/file42b', 'folder2/file42c', 'folder2/file42d', 'folder3/filec', 'folder3/fileb', 'folder3/filea']
Here is the complete code:
import csv
a = []
with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        a.append(row[1])
I'm trying to combine a bunch of csv files. Each csv file has a different number of columns. This is not a problem, I can easily loop through the files and pull in all the column headers, pasting them into an empty file to use as a base.
The problem I'm having is that the column headers are on different rows in each file.
For example:
Table1
Random Text
!,Header1,Header2,Header3
*,123,124,5235
*,124,15,23624
*,135,677,234
Table2
Random Text
Random Text
!,Header1,Header2,Header4
*,124,2156,7478
*,126,12357,547
*,237,12,267
Output:
Table,Header1,Header2,Header3,Header4
Table1,123,124,5235
Table1,124,15,23624
Table1,135,677,234
Table2,124,2156,7478
Table2,126,12357,547
Table2,237,12,267
My existing code looks something like this:
import csv
import glob

files = glob.glob(r'//Directory/*.csv')
# This block goes through each file and works out which variables exist
variablelist = []
for f in files:
    with open(f, 'r') as csvfile:
        read_rows = csv.reader(csvfile)
        for row in read_rows:
            if row[0] != "*":  # The last row with no * in column 1 is the header row
                rowlist = row
        variablelist.extend(x for x in rowlist if x not in variablelist)
variablelist.sort()
I use the fact that the header row is the last row without a * in the first column. I work out which row the headers are on and then store the header names in a list - combining the same list from all files.
I then try to combine the files using this code, which I found by searching this website:
with open("out.csv", "w", newline="") as f_out:  # Comment 2 below
    writer = csv.DictWriter(f_out, fieldnames=variablelist)
    for f in files:
        with open(f, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)  # Uses the field names in this file
            for line in reader:
                # Comment 3 below
                writer.writerow(line)
The problem is that I don't know how to deal with the headers being on different lines. I tried using code to find the header row number, but I don't know how to work this into the code above. (Can DictReader skip a dynamic number of rows before finding the headers?)
with open(f,'r') as csvfile:
    read_rows = csv.reader(csvfile)
    header_row_number = 0
    for row in read_rows:
        if row[0]!="*":
            header_row_number=read_rows.line_num
Any help would be much appreciated
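One possible approach, sketched under the assumption that every data row starts with * and that the header is the last non-* row before the data (as in the samples above): skip csv.DictReader and build the dictionaries by hand with a plain csv.reader. The helper name read_table is made up for this sketch, io.StringIO stands in for the real files, and the Table name column is left out for brevity.

```python
import csv
import io

def read_table(lines):
    """Yield one dict per data row, keyed by the most recent header row."""
    header = None
    for row in csv.reader(lines):
        if not row:
            continue  # ignore blank lines
        if row[0] == "*":
            if header is not None:
                # drop the leading "!" / "*" markers before pairing up
                yield dict(zip(header[1:], row[1:]))
        else:
            header = row  # keep overwriting until the data rows start

# Stand-in for one input file; in practice pass an open file object.
sample = io.StringIO(
    "Random Text\n"
    "Random Text\n"
    "!,Header1,Header2,Header4\n"
    "*,124,2156,7478\n"
    "*,126,12357,547\n"
)
rows = list(read_table(sample))

out = io.StringIO()
fieldnames = ["Header1", "Header2", "Header3", "Header4"]
# restval fills in columns that a given file does not have
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Because the header is tracked while iterating, there is no need to know the header row number in advance.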
Hello, I'm really new here as well as in the world of Python.
I have some (~1000) .csv files, each containing ~1,800,000 rows of information. The files are in the following form:
5302730,131841,-0.29999999999999999,NULL,2013-12-31 22:00:46.773
5303072,188420,28.199999999999999,NULL,2013-12-31 22:27:46.863
5350066,131841,0.29999999999999999,NULL,2014-01-01 00:37:21.023
5385220,-268368577,4.5,NULL,2014-01-01 03:12:14.163
5305752,-268368587,5.1900000000000004,NULL,2014-01-01 03:11:55.207
So, for all of the files, I would like:
(1) to remove the 4th (NULL) column
(2) to keep in every file only certain rows, depending on the value of the first column (e.g. for 5302730, keep only the rows that contain that value)
I don't know if this is even possible, so any answer is appreciated!
Thanks in advance.
Have a look at the csv module.
You can use the csv.reader function to get an iterator over the lines, with each line's cells as a list.
import csv

with open("filename.csv", newline="") as f:
    for line in csv.reader(f):
        # Remove the 4th column; remember Python starts counting at 0
        line = line[:3] + line[4:]
        if line[0] == "thevalueforthefirstcolumn":
            print(line)  # placeholder: do something with the line here
If you wish to do this sort of operation with CSV files more than once and want to use different parameters regarding column to skip, column to use as key and what to filter on, you can use something like this:
import csv

def read_csv(filename, column_to_skip=None, key_column=0, key_filter=None):
    data_from_csv = []
    with open(filename, newline='') as csvfile:
        csv_reader = csv.reader(csvfile)
        for row in csv_reader:
            # Skip data in specific column
            if column_to_skip is not None:
                del row[column_to_skip]
            # Filter out rows where the key doesn't match
            if key_filter is not None:
                key = row[key_column]
                if key_filter != key:
                    continue
            data_from_csv.append(row)
    return data_from_csv

def write_csv(filename, data_to_write):
    # newline='' stops the csv module from writing blank lines on Windows
    with open(filename, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        for row in data_to_write:
            csv_writer.writerow(row)

data = read_csv('data.csv', column_to_skip=3, key_filter='5302730')
write_csv('data2.csv', data)
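Since the files in the question have around 1,800,000 rows each, it may be worth streaming the rows instead of collecting them in a list first. Here is a rough sketch of that variation; filter_rows is a made-up name, io.StringIO stands in for real files, and the sample values are shortened and adjusted for illustration.

```python
import csv
import io

def filter_rows(src, dst, key, key_column=0, column_to_skip=3):
    """Copy rows whose key_column equals key from src to dst, dropping one column."""
    writer = csv.writer(dst)
    kept = 0
    for row in csv.reader(src):
        if row and row[key_column] == key:
            writer.writerow(row[:column_to_skip] + row[column_to_skip + 1:])
            kept += 1
    return kept

# Illustrative stand-ins for an input and an output file.
src = io.StringIO(
    "5302730,131841,-0.3,NULL,2013-12-31 22:00:46.773\n"
    "5303072,188420,28.2,NULL,2013-12-31 22:27:46.863\n"
    "5302730,131841,0.3,NULL,2014-01-01 00:37:21.023\n"
)
dst = io.StringIO()
count = filter_rows(src, dst, key="5302730")
print(count)
print(dst.getvalue())
```

With real files you would pass two objects opened with newline='' and loop over the file names, so that only one row is held in memory at a time.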
I'm working with an online survey application that allows me to download survey results into a csv file. However, the format of the downloaded csv puts each survey question and answer in a new column, whereas, I need the csv file to be formatted with each survey question and answer on a new row. There is also a lot of data in the downloaded csv file that I want to ignore completely.
How can I parse out the desired rows and columns of the downloaded csv file and write them to a new csv file in a specific format?
For example, I download data and it looks like this:
V1,V2,V3,Q1,Q2,Q3,Q4....
null,null,null,item,item,item,item....
0,0,0,4,5,4,5....
0,0,0,2,3,2,3....
The first row contains the 'keys' that I will need except V1-V3 must be excluded. Row 2 must be excluded altogether. Row 3 is my first subject so I need the values 4,5,4,5 to be paired with the keys Q1,Q2,Q3,Q4. And row 4 is a new subject which needs to be excluded as well since my program only handles one subject at a time.
The csv file that I need to create in order for my script to function properly looks like this:
Q1,4
Q2,5
Q3,4
Q4,5
I've tried using this izip to pivot the data, but I don't know how to specifically select the rows and columns I need:
from itertools import izip
a = izip(*csv.reader(open("CDI.csv", "rb")))
csv.writer(open("CDI_test.csv", "wb")).writerows(a)
Here is a simple Python script that should do the job for you. It takes arguments from the command line that designate the number of entries you want to skip at the beginning of each line, the position at which to stop before the entries you want to skip at the end of the line, the input file and the output file. So for example, the command would look like
python question.py 3:7 input.txt output.txt
You can also hard-code the values within the script (replace sys.argv[1] with 3, sys.argv[2] with "input.txt" and so on) if you don't want to state the arguments every time.
Text file version:
import sys
inputFile = open(sys.argv[2], "r")
outputFile = open(sys.argv[3], "w")
# this version expects a single number (e.g. 3), not a start:end pair
leadingRemoved = int(sys.argv[1])
# strips extra whitespace from each line in the file, then splits by ","
lines = [x.strip().split(",") for x in inputFile.readlines()]
# zips all but the first leadingRemoved elements of the first and third rows
zipped = zip(lines[0][leadingRemoved:], lines[2][leadingRemoved:])
for pair in zipped:
    # writes each question/number pair on its own line
    outputFile.write(",".join(pair) + "\n")
inputFile.close()
outputFile.close()
# input from command line: python questions.py leadingRemoved pathToInput pathToOutput
CSV file version:
import sys
import csv

# Python 3 version; the first argument has the form start:end, e.g. 3:7
leadingRemoved, endingRemoved = [int(x) for x in sys.argv[1].split(":")]
with open(sys.argv[2], newline="") as inputFile:
    # removes null bytes; the reader splits on tabs
    reader = csv.reader((line.replace('\0', '') for line in inputFile), delimiter="\t")
    # creates a 2d list of all the elements for each row
    lines = [x for x in reader]
print(lines)
# zips the selected slice of the first and third rows
zipped = list(zip(lines[0][leadingRemoved:endingRemoved], lines[2][leadingRemoved:endingRemoved]))
with open(sys.argv[3], "w", newline="") as outputFile:
    writer = csv.writer(outputFile)
    writer.writerows(zipped)
print(zipped)
I did something similar using multiple values per key, but it could be changed to single values.
#!/usr/bin/env python
import csv

def dict_from_csv(filename):
    '''
    (file)->list of dictionaries
    Function to read a csv file and format it to a list of dictionaries.
    The headers are the keys with all other data becoming values
    The format of the csv file and the headers included need to be known to extract the email addresses
    '''
    # open the file and read it using csv.reader()
    # read the file. for each row that has content add it to list mf
    # the keys for our user dict are the first content line of the file mf[0]
    # the values to our user dict are the other lines in the file mf[1:]
    mf = []
    with open(filename, 'r') as f:
        my_file = csv.reader(f)
        for row in my_file:
            if any(row):
                mf.append(row)
    file_keys = mf[0]
    file_values = mf[1:]  # choose row/rows you want
    # Combine the two lists, turning into a list of dictionaries, using the keys list as the key and the people list as the values
    my_list = []
    for value in file_values:
        # pair the keys with this one row (not with the whole list of rows)
        my_list.append(dict(zip(file_keys, value)))
    # return the list of dictionaries
    return my_list
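For comparison, the standard library's csv.DictReader produces a similar list of dictionaries, using the first row as the keys and skipping completely blank lines. A minimal sketch, with made-up names and addresses and io.StringIO standing in for a real file:

```python
import csv
import io

# Hypothetical contents, including a blank line between records.
sample = io.StringIO(
    "name,email\n"
    "Ann,ann@example.com\n"
    "\n"
    "Bob,bob@example.com\n"
)
records = list(csv.DictReader(sample))
print(len(records))          # 2
print(records[1]["email"])   # bob@example.com
```

One difference to be aware of: a row that contains only empty fields (such as ",") is kept by DictReader, whereas the any(row) check above drops it.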
I suggest you read up on pandas for this type of activity:
http://pandas.pydata.org/pandas-docs/stable/io.html
import pandas
input_dataframe = pandas.read_csv("input.csv")
transposed_df = input_dataframe.transpose()
# delete rows and edit data easily using pandas dataframe
# this is a good library to get some experience working with
transposed_df.to_csv("output.csv")
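If you would rather not depend on pandas, the same selection and pivot can be sketched with just the csv module. This assumes the layout from the question (keys in row 1, the subject in row 3, three leading columns to drop); io.StringIO stands in for the downloaded file and the output file.

```python
import csv
import io

raw = io.StringIO(
    "V1,V2,V3,Q1,Q2,Q3,Q4\n"
    "null,null,null,item,item,item,item\n"
    "0,0,0,4,5,4,5\n"
    "0,0,0,2,3,2,3\n"
)
rows = list(csv.reader(raw))

skip = 3  # number of leading V columns to ignore
# Pair each key in row 1 with the matching value in row 3 (index 2).
pairs = list(zip(rows[0][skip:], rows[2][skip:]))

out = io.StringIO()
csv.writer(out).writerows(pairs)
print(out.getvalue())
```

In Python 3 the built-in zip replaces itertools.izip, which no longer exists.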
The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
    with open(fpath,"rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)
    allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you so choose. Just to note, you mention the 20th column, but row[20] will be the 21st column...
import csv
matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
This should work, you don't need to make another list to have access to the columns.
import csv

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    # keep only the rows whose column at index 20 equals 'value'
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20
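To illustrate the point about not needing a second list for the columns, here is a tiny sketch with made-up rows (three columns instead of twenty-one) that compares direct indexing with the zip(*rows) transposition from the question:

```python
rows = [
    ["a", "1", "x"],
    ["b", "2", "y"],
    ["c", "3", "x"],
]

# One column by index, straight from the rows.
third_column = [row[2] for row in rows]

# All columns at once; in Python 3 zip is lazy, so build the lists explicitly.
columns = [list(col) for col in zip(*rows)]

# Whole rows whose third column equals "x".
matching = [row for row in rows if row[2] == "x"]

print(third_column)
print(columns[2])
print(matching)
```

The transposed copy is only worth building if you need every column; for a single column, indexing each row is enough.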
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
testlist = []
pos = 20
for row in allRows:
    if row[pos] == 'value':
        testlist.append(row)
(I haven't tested this, but let me know if that works.)