I have 125 data files, each containing two columns and 21 rows of data (please see the image below), and I'd like to import them into a single .csv file (as 250 columns and 21 rows).
I am fairly new to Python, but this is the code I have been advised to use:
import glob

Results = [open(f) for f in glob.glob("*.data")]
fout = open("res.csv", 'w')
for row in range(21):
    for f in Results:
        fout.write( f.readline().strip() )
        fout.write(',')
    fout.write('\n')
fout.close()
However, there is a slight problem with the code: I only get 125 columns (i.e. the force and displacement values are written into a single column). Please refer to the image below:
I'd very much appreciate it if anyone could help me with this!
import glob

results = [open(f) for f in glob.glob("*.data")]
sep = ","
# Uncomment if your Excel formats decimal numbers like 3,14 instead of 3.14
# sep = ";"
with open("res.csv", 'w') as fout:
    for row in range(21):
        iterator = (f.readline().strip().replace("\t", sep) for f in results)
        line = sep.join(iterator)
        fout.write("{0}\n".format(line))
To explain what went wrong with your code: your source files use a tab as the field separator, but your code joins the values it reads from those files with commas. If your Excel uses a period as the decimal separator, it uses a comma as the default field separator, and a tab is then treated as plain whitespace rather than as a separator (unless the field is enclosed in quotes). So each file's two tab-separated values stay together in a single cell, which is the result you see.
If you use the text import feature of Excel (Data ribbon => From Text) you can ask it to consider both comma and tab as valid field separators, and then I'm pretty sure your original output would work too.
In contrast, the above code should produce a file that will open correctly when double clicked.
You don't need to write your own program to do this, in python or otherwise. You can use an existing unix command (if you are in that environment):
paste *.data > res.csv
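Note that paste uses a tab as its default delimiter, so the res.csv produced above will really be tab-separated; that is fine if you import it into Excel as tab-delimited text. If you would rather have commas between the files, you can pass the delimiter explicitly (the tabs inside each file's own two columns will still be tabs either way):
paste -d, *.data > res.csv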
Try this:
import glob, csv
from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while pending:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

Results = [open(f).readlines() for f in glob.glob("*.data")]

with open("res.csv", 'w', newline='') as out:
    fout = csv.writer(out, dialect="excel")
    row = []
    # cycle(range(len(Results))) tells us which file the current line came from
    for line, c in zip(roundrobin(*Results), cycle(range(len(Results)))):
        # split the tab-separated line into its individual fields
        row.extend(line.split())
        # once every file has contributed to this row, write it out
        if c == len(Results) - 1:
            fout.writerow(row)
            row = []
It loops over the lines of your input files, stitches the values from all files together into one row, and lets the csv library write that row in the listed dialect.
I suggest getting used to the csv module. The reason is that if the data is not this simple (plain strings in the headings and then numbers only), it is difficult to reimplement all of the parsing yourself. Try the following:
import csv
import glob
import os

datapath = './data'
resultpath = './result'

if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n+1:
                rows.append([])      # add another row
            rows[n].extend(row)      # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
The with construct for opening a file can be replaced by the pair:
f = open(fname, 'wb')
...
f.close()
The csv.reader and csv.writer are simply wrappers that parse or compose the lines of the file. The (Python 2) docs say they require the file to be opened in binary mode.
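If you are on Python 3, the binary-mode advice no longer applies; the csv docs there recommend opening the files in text mode with newline='' instead. A minimal sketch of the same read/extend/write idea under that assumption:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
os.makedirs(resultpath, exist_ok=True)

rows = []
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, newline='') as f:          # Python 3: text mode, newline=''
        for n, row in enumerate(csv.reader(f)):
            if len(rows) < n + 1:
                rows.append([])
            rows[n].extend(row)

with open(os.path.join(resultpath, 'result.csv'), 'w', newline='') as f:
    csv.writer(f).writerows(rows)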
This should be simple but... I have a botched-up csv with commas used within fields. Fortunately, though, this csv only has three columns and the surplus commas are all in the middle column - so if I managed to delete all commas but the first and the last in each line, I should be fine. How would I get csv reader to do this?
with open('bad.csv') as f, open('good.csv', 'w') as fout:
    for line in f:
        first, *middle, last = line.split(',')
        fout.write(f'{first},"{",".join(middle)}",{last}')
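For example, a line like a,b1,b2,b3,z becomes a,"b1,b2,b3",z, so a standard csv reader will then see exactly three fields, with the quoted middle part read back as a single value.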
Sometimes, you want a pass-through solution that fixes files on the fly while reading, without generating "fixed" files, e.g. if you want to read the data directly using e.g. pandas.read_csv(...). In that case, you can do this:
import io
import re

import pandas as pd

def fix_commas(csv_file):
    with open(csv_file) as f:
        buf = f.read()
    # collapse any run of two or more commas into a single comma
    buf = '\n'.join([re.sub(r',,+', ',', s) for s in buf.splitlines()])
    return io.StringIO(buf)

# and then
df = pd.read_csv(fix_commas(filename), ...)
Example:
txt = """
first,second,third
a,,b,bbbb
c,,,,,d,,,,,,,e
f,g,h
"""
with open('test.csv', 'w') as f:
    f.write(txt)
# now test:
df = pd.read_csv(fix_commas('test.csv'))
Result (in df):
first second third
0 a b bbbb
1 c d e
2 f g h
I have been trying to transpose my table of 2000000+ rows and 300+ columns on a cluster, but it seems that my Python script is getting killed due to lack of memory. I would just like to know if anyone has any suggestions on a more efficient way to store my table data other than using the array, as shown in my code below?
import sys

Seperator = "\t"
m = []

f = open(sys.argv[1], 'r')
data = f.read()
lines = data.split("\n")[:-1]

for line in lines:
    m.append(line.strip().split("\t"))

for i in zip(*m):
    for j in range(len(i)):
        if j != len(i):
            print(i[j] + Seperator)
        else:
            print(i[j])
    print("\n")
Thanks very much.
The first thing to note is that you've been careless with your variables. You're loading a large file into memory as a single string, then as a list of strings, then as a list of lists of strings, before finally transposing said list. This will result in you storing all the data in the file three times before you even begin to transpose it.
If each individual string in the file is only about 10 characters long then you're going to need 18GB of memory just to store that (2e6 rows * 300 columns * 10 bytes * 3 duplicates). This is before you factor in all the overhead of python objects (~27 bytes per string object).
You have a couple of options here.
create each new transposed row incrementally by reading over the whole file once for each new row and appending its cells one at a time (sacrifices time efficiency).
create one file for each new row and combine these row files at the end (sacrifices disk space efficiency, possibly problematic if you have a lot of columns in the initial file due to the limit on the number of open files a process may have).
Transposing with a limited number of open files
delimiter = ','
input_filename = 'file.csv'
output_filename = 'out.csv'

# find out the number of columns in the file
with open(input_filename) as input:
    old_cols = input.readline().count(delimiter) + 1

temp_files = [
    'temp-file-{}.csv'.format(i)
    for i in range(old_cols)
]

# create temp files
for temp_filename in temp_files:
    with open(temp_filename, 'w') as output:
        output.truncate()

with open(input_filename) as input:
    for line in input:
        parts = line.rstrip().split(delimiter)
        assert len(parts) == len(temp_files), 'not enough or too many columns'
        for temp_filename, cell in zip(temp_files, parts):
            with open(temp_filename, 'a') as output:
                output.write(cell)
                output.write(',')

# combine temp files
with open(output_filename, 'w') as output:
    for temp_filename in temp_files:
        with open(temp_filename) as input:
            line = input.read().rstrip()[:-1] + '\n'
            output.write(line)
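One optional cleanup step, not part of the original answer: once the combined file is written, the temp files can be removed.
import os

# delete the per-column temp files after combining them
for temp_filename in temp_files:
    os.remove(temp_filename)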
As the number of columns is far smaller than the number of rows, I would consider writing each column to a separate file and then combining them together.
import sys

Separator = "\t"

f = open(sys.argv[1], 'r')
for line in f:
    for i, c in enumerate(line.strip().split("\t")):
        dest = column_file[i]  # you should open 300+ file handles, one for each column
        dest.write(c)
        dest.write(Separator)
# all you need to do after that is combine the content of your "row" files
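A minimal sketch of what that might look like end to end, assuming tab-separated input from sys.argv[1] and hypothetical per-column file names col-0.txt, col-1.txt, ... plus a transposed.txt output:
import sys

Separator = "\t"

# read the first line only to learn how many columns there are
with open(sys.argv[1], 'r') as f:
    n_cols = len(next(f).rstrip("\n").split(Separator))

# one output handle per column (hypothetical file names)
column_file = [open('col-{}.txt'.format(i), 'w') for i in range(n_cols)]

with open(sys.argv[1], 'r') as f:
    for line in f:
        for i, c in enumerate(line.strip().split(Separator)):
            column_file[i].write(c)
            column_file[i].write(Separator)

for dest in column_file:
    dest.close()

# combine the per-column files: each one becomes a row of the transposed output
with open('transposed.txt', 'w') as out:
    for i in range(n_cols):
        with open('col-{}.txt'.format(i)) as src:
            out.write(src.read().rstrip(Separator) + "\n")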
If you cannot store your whole file in memory, you can read it n times (once per column):
import sys

column_number = 4  # if necessary, read the first line of the file to calculate it
separator = '\t'
filename = sys.argv[1]

def get_nth_column(filename, n):
    with open(filename, 'r') as file:
        for line in file:
            if line:  # remove empty lines
                yield line.strip().split('\t')[n]

for column in range(column_number):
    print(separator.join(get_nth_column(filename, column)))
Note that an exception will be raised if the file does not have the right format. You could catch it if necessary.
When reading files, use the with construct to ensure that your file will be closed, and iterate directly over the file instead of reading the whole content first. It is more readable and more efficient.
I have a dataset of about 10 CSV files. I want to combine those files row-wise into a single CSV file.
What I tried:
import csv
fout = open("claaassA.csv","a")
# first file:
writer = csv.writer(fout)
for line in open("a01.ihr.60.ann.csv"):
print line
writer.writerow(line)
# now the rest:
for num in range(2, 10):
print num
f = open("a0"+str(num)+".ihr.60.ann.csv")
#f.next() # skip the header
for line in f:
print line
writer.writerow(line)
#f.close() # not really needed
fout.close()
Definitely need more details in the question (ideally examples of the inputs and expected output).
Given the little information provided, I will assume that you know that all files are valid CSV and they all have the same number of lines (rows). I'll also assume that memory is not a concern (i.e. they are "small" files that fit together in memory). Furthermore, I assume that line endings are newlines (\n).
If all these assumptions are valid, then you can do something like this:
input_files = ['file1.csv', 'file2.csv', 'file3.csv']
output_file = 'output.csv'

output = None
for infile in input_files:
    with open(infile, 'r') as fh:
        if output:
            for i, l in enumerate(fh.readlines()):
                output[i] = "{},{}".format(output[i].rstrip('\n'), l)
        else:
            output = fh.readlines()

with open(output_file, 'w') as fh:
    for line in output:
        fh.write(line)
There are probably more efficient ways, but this is a quick and dirty way to achieve what I think you are asking for.
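For example, one possible tidier variant (a sketch under the same assumptions, not part of the original answer) is to open all of the files at once and zip their lines together, so no row is rebuilt string by string:
input_files = ['file1.csv', 'file2.csv', 'file3.csv']
output_file = 'output.csv'

# open every input once, pair up the i-th lines of all files, and join them with commas
handles = [open(f, 'r') for f in input_files]
with open(output_file, 'w') as out:
    for lines in zip(*handles):
        out.write(','.join(line.rstrip('\n') for line in lines) + '\n')
for h in handles:
    h.close()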
The previous answer implicitly assumes we need to do this in Python. If bash is an option, then you could use the paste command. For example:
paste -d, file1.csv file2.csv file3.csv > output.csv
I don't fully understand why you use the csv library. Actually, it's enough to fill the output file with the lines from the given files (if they have the same column names and order).
input_path_list = [
    "a01.ihr.60.ann.csv",
    "a02.ihr.60.ann.csv",
    "a03.ihr.60.ann.csv",
    "a04.ihr.60.ann.csv",
    "a05.ihr.60.ann.csv",
    "a06.ihr.60.ann.csv",
    "a07.ihr.60.ann.csv",
    "a08.ihr.60.ann.csv",
    "a09.ihr.60.ann.csv",
]
output_path = "claaassA.csv"

with open(output_path, "w") as fout:
    header_written = False
    for input_path in input_path_list:
        with open(input_path) as fin:
            header = next(fin)
            # it adds the header at the beginning and skips other headers
            if not header_written:
                fout.write(header)
                header_written = True
            # it adds all rows
            for line in fin:
                fout.write(line)
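If the files all follow the same naming pattern, the hard-coded list could also be built with glob (a small sketch, assuming the pattern below matches exactly the files you want and that alphabetical order is the right order):
import glob

input_path_list = sorted(glob.glob("a0?.ihr.60.ann.csv"))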
I am starting out in Python, and I am looking at csv files.
Basically my situation is this:
I have coordinates X, Y, Z in a csv.
X Y Z
1 1 1
2 2 2
3 3 3
and I want to go through and add a user-defined offset value to all Z values and make a new file with the edited Z values.
Here is my code so far, which I think is right:
import csv

# list of lists we store all data in
allCoords = []

# get offset from user
offset = int(input("Enter an offset value: "))

# read all values into memory
with open('in.csv', 'r') as inFile:  # input csv file
    reader = csv.reader(inFile, delimiter=',')
    for row in reader:
        # do not add the first row to the list
        if row[0] != "X":
            # create a new coord list
            coord = []
            # get a row and put it into new list
            coord.append(int(row[0]))
            coord.append(int(row[1]))
            coord.append(int(row[2]) + offset)
            # add list to list of lists
            allCoords.append(coord)

# write all values into new csv file
with open("in.out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    firstRow = ['X', 'Y', 'Z']
    allCoords.insert(0, firstRow)
    writer.writerows(allCoords)
But now comes the hard part. How would I go about going through a bunch of csv files (all in the same location) and producing a new file for each of them?
I am hoping to have something like this: "filename.csv" turns into "filename_offset.csv", using the original file name as a starter for the new filename and appending "_offset" to the end.
I think I need to use "os." functions, but I am not sure how to, so any explanation would be much appreciated along with the code! :)
Sorry if I didn't make much sense, let me know if I need to explain more clearly. :)
Thanks a bunch! :)
shutil.copy2(src, dst)
Similar to shutil.copy(), but metadata is copied as well.
shutil
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell. No tilde expansion is
done, but *, ?, and character ranges expressed with [] will be correctly matched
glob
import glob
import os
import shutil
from shutil import copy2

cvs_DIR = 'cvs_DIR'  # directory that holds the csv files
files = glob.glob(os.path.join(cvs_DIR, '*.csv'))

for file in files:
    try:
        # glob already returns the full path, so no extra join is needed
        oldName = file
        # replace the ".csv" extension with "_offset.csv"
        newName = os.path.splitext(file)[0] + '_offset.csv'
        copy2(oldName, newName)
    except shutil.Error as e:
        print('Error: {}'.format(e))
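copy2 only copies each file unchanged; to actually apply the offset, one possible sketch (not from the original answer, and assuming every file has the same X, Y, Z layout as in.csv above) combines glob and os.path.splitext with the loop you already wrote:
import csv
import glob
import os

offset = int(input("Enter an offset value: "))

for inName in glob.glob('*.csv'):
    root, ext = os.path.splitext(inName)
    if root.endswith('_offset'):
        continue  # skip files already produced on a previous run
    outName = root + '_offset' + ext  # e.g. filename.csv -> filename_offset.csv

    allCoords = [['X', 'Y', 'Z']]
    with open(inName, 'r', newline='') as inFile:
        for row in csv.reader(inFile, delimiter=','):
            if row[0] != "X":
                allCoords.append([int(row[0]), int(row[1]), int(row[2]) + offset])

    with open(outName, 'w', newline='') as outFile:
        csv.writer(outFile).writerows(allCoords)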
BTW, you can write ...
for row in reader:
    if row[0] == "X":
        break
for row in reader:
    coord = []
    ...

... instead of ...

for row in reader:
    if row[0] != "X":
        coord = []
        ...
This stops checking for 'X'es after the first line.
It works because you don't work with a real list here but with a self-consuming iterator, which you can stop and restart.
See also: Detecting if an iterator will be consumed.
I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn this into a tab-separated format like the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The number of TestAttributes varies from test to test. For some tests there are only 3 values, for some others 7, etc. Also, as in the TestName4 example, some tests are executed more than once and hence each execution has its own TestAttributeValue line (in the example, TestName4 is executed 3 times, hence we have 3 value lines).
I am new to Python and do not have much knowledge, but I would like to parse the csv file with Python. I checked the 'csv' library of Python and could not be sure whether it will be enough for me or whether I should write my own string parser. Could you please help me?
Best
I'd use a solution based on the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby

with open('my_data.csv') as ifile, open('my_out_data.csv', 'wb') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')

    # Skip info line
    next(reader)

    # Group datasets by the condition if len(row) > 0 or not, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
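One note if you are on Python 3: open the output file with open('my_out_data.csv', 'w', newline='') instead of 'wb', since csv.writer expects a text-mode file there.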
The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find maximum column length
        max_len = max([len(col) for col in section])
        # build and print each row
        for i in range(max_len):
            f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
        f_out.write('\n')

with open(in_path, 'r') as csv_in, open(out_path, 'w') as f_out:
    f_in = csv.reader(csv_in)
    # skip the info line at the top of the file
    next(f_in)
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        else:
            # add line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.
The basic idea here is that we import the file into a list of lists, then write that list of lists back out using a list comprehension to transpose it (as well as adding in blank elements when the columns are uneven).
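For instance, a minimal standalone sketch of that padding transpose (made-up values, not taken from the data above):
section = [['a1', 'a2', 'a3'], ['b1', 'b2'], ['c1', 'c2', 'c3', 'c4']]
max_len = max(len(col) for col in section)
rows = ['\t'.join(col[i] if len(col) > i else '' for col in section)
        for i in range(max_len)]
# rows[0] == 'a1\tb1\tc1'
# rows[3] == '\t\tc4'   (shorter columns are padded with empty strings)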