I have a large number of text files containing data arranged into a fixed number of rows and columns, the columns being separated by spaces (like a .csv, but using spaces as the delimiter). I want to extract a given column from each of these files and write it into a new text file.
So far I have tried:
results_combined = open('ResultsCombined.txt', 'wb')

def combine_results():
    for num in range(2, 10):
        f = open("result_0." + str(num) + "_.txt", 'rb')  # all the text files have similar filename styles
        lines = f.readlines()   # read in the data
        no_lines = len(lines)   # get the number of lines
        for i in range(0, no_lines):
            column = lines[i].strip().split(" ")
            results_combined.write(column[5] + " " + '\r\n')
        f.close()

if __name__ == "__main__":
    combine_results()
This produces a text file containing the data I want from the separate files, but as a single column (i.e. I've managed to 'stack' the columns on top of each other rather than have them side by side as separate columns). I feel I've missed something obvious.
In another attempt, I managed to write all the separate files to a single file, but without picking out the columns that I want.
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'wb')

for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip())
        fout.write(' ')
    fout.write('\n')
fout.close()
What I basically want is to copy column 5 from each file (it is always the same column) and write them all to a single file.
If you don't know the maximum number of rows in the files and if the files can fit into memory, then the following solution would work:
import glob

files = [open(f) for f in glob.glob("*.txt")]

# Given a file, read the 6th column in each line
def readcol5(f):
    return [line.split(' ')[5] for line in f]

filecols = [readcol5(f) for f in files]
maxrows = len(max(filecols, key=len))

# Given an array, make sure it has maxrows number of elements.
def extendmin(arr):
    diff = maxrows - len(arr)
    arr.extend([''] * diff)
    return arr

filecols = map(extendmin, filecols)
lines = zip(*filecols)
lines = map(lambda x: ','.join(x), lines)
lines = '\n'.join(lines)

fout = open('output.csv', 'wb')
fout.write(lines)
fout.close()
Or this option (following your second approach):
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'w')

for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip().split(' ')[5])
        fout.write(' ')
    fout.write('\n')
fout.close()
... which uses a fixed number of rows per file but will work for very large numbers of rows because it is not storing the intermediate values in memory. For moderate numbers of rows, I'd expect the first answer's solution to run more quickly.
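A possible middle ground (my own sketch, assuming Python 3) streams the files in parallel with itertools.zip_longest, so neither the row count nor the full contents need to be known in advance:

import glob
from itertools import zip_longest

files = [open(f) for f in glob.glob("result_*.txt")]

with open("ResultsCombined.txt", "w") as fout:
    # zip_longest keeps going until the longest file is exhausted,
    # padding the shorter ones with empty strings.
    for rows in zip_longest(*files, fillvalue=""):
        cols = [row.strip().split(" ")[5] if row.strip() else "" for row in rows]
        fout.write(" ".join(cols) + "\n")

for f in files:
    f.close()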
Why not read all the entries from each file's fifth column into a list, and after reading in all the files, write them all to the output file?
data = [
    [],  # entries from the first file
    [],  # entries from the second file
    ...
]

for i in range(number_of_rows):
    outputline = []
    for vals in data:
        outputline.append(vals[i])
    outfile.write(" ".join(outputline) + "\n")
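A concrete sketch of that idea, assuming space-delimited files matching result_*.txt and column index 5 as in the question:

import glob

data = []  # one list of column-5 entries per file
for name in sorted(glob.glob("result_*.txt")):
    with open(name) as f:
        data.append([line.strip().split(" ")[5] for line in f])

number_of_rows = min(len(entries) for entries in data)
with open("ResultsCombined.txt", "w") as outfile:
    for i in range(number_of_rows):
        outfile.write(" ".join(entries[i] for entries in data) + "\n")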
I have 10 tab-delimited .txt files in a folder. Each has three columns (numbers only) preceded by a 21-line header (text and numbers). In order to process them further, I would like to:
Choose the second column from all the text files (starting after the 21-line header; see the attached figure with the arrow), convert the decimal comma to a decimal point, and stack these columns from the 10 files into a new tab-delimited/CSV file.
I know very little scripting. I have RStudio and Python and have tried to fiddle around a bit, but I really have no clue what to do. Since I have to process multiple folders, my work would be greatly simplified if this were possible.
Reference figure
From your requirements it sounds like this Python code should do the trick:
import os
import glob

DIR = "path/to/your/directory"
OUTPUT_FILE = "path/to/your/output.csv"
HEADER_SIZE = 21

input_files = glob.glob(os.path.join(DIR, "*.txt"))

for input_file in input_files:
    print("Now processing", input_file)

    # read the file
    with open(input_file, "r") as h:
        contents = h.readlines()

    # drop the header
    contents = contents[HEADER_SIZE:]

    # grab the 2nd column
    column = []
    for row in contents:
        # stop at the footer
        if "####" in row:
            break
        split = row.split("\t")
        if len(split) >= 2:
            column.append(split[1])

    # replace the comma with a decimal point
    column_replaced = [row.replace(",", ".") for row in column]

    # append to the output file
    with open(OUTPUT_FILE, "a") as h:
        h.write("\n".join(column_replaced))
        h.write("\n")  # end on a newline
Note that this discards everything that wasn't part of the second column; only that column ends up in the output file.
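If pandas happens to be available, a shorter sketch of the same idea is possible. This is my own alternative, not part of the code above; it assumes the same DIR and OUTPUT_FILE placeholders, a tab separator, and that the footer lines begin with '#':

import glob
import os

import pandas as pd

DIR = "path/to/your/directory"
OUTPUT_FILE = "path/to/your/output.csv"

columns = []
for path in sorted(glob.glob(os.path.join(DIR, "*.txt"))):
    # skiprows drops the 21-line header; decimal="," turns 3,14 into 3.14;
    # comment="#" ignores the footer lines.
    frame = pd.read_csv(path, sep="\t", skiprows=21, header=None,
                        decimal=",", comment="#", usecols=[1])
    columns.append(frame[1])

# Append the columns one after another, as the loop above does.
pd.concat(columns, ignore_index=True).to_csv(OUTPUT_FILE, index=False, header=False)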
The code below is not an exact solution but if you follow along its lines you will be close to what you need.
output <- "NewFileName.txt"
old_dir <- setwd("your/folder")
files <- list.files(pattern = "\\.txt$")
df_list <- lapply(files, read.table, skip = 21, sep = "\t", stringsAsFactors = FALSE)
x <- lapply(df_list, '[[', 2)
x <- gsub(",", ".", unlist(x))
write.table(x, output, quote = FALSE, row.names = FALSE, col.names = FALSE)
setwd(old_dir)
lines = []
filename = "my_text"

# read the file and replace each decimal comma with a point
with open(filename, "r") as infile:
    for line in infile:
        res = line.replace(",", ".")
        lines.append(res)
        print(res)

# write the modified lines back to the same file
with open(filename, "w") as outfile:
    for item in lines:
        outfile.write(item)
I have been trying to transpose my table of 2000000+ rows and 300+ columns on a cluster, but it seems that my Python script is getting killed due to lack of memory. I would just like to know if anyone has any suggestions on a more efficient way to store my table data other than using the array, as shown in my code below?
import sys

Seperator = "\t"
m = []

f = open(sys.argv[1], 'r')
data = f.read()
lines = data.split("\n")[:-1]

for line in lines:
    m.append(line.strip().split("\t"))

for i in zip(*m):
    for j in range(len(i)):
        if j != len(i) - 1:
            print(i[j] + Seperator)
        else:
            print(i[j])
    print("\n")
Thanks very much.
The first thing to note is that you've been careless with your variables. You're loading a large file into memory as a single string, then as a list of strings, then as a list of lists of strings, before finally transposing said list. This will result in you storing all the data in the file three times before you even begin to transpose it.
If each individual string in the file is only about 10 characters long, then you're going to need 18 GB of memory just to store that (2e6 rows * 300 columns * 10 bytes * 3 duplicates). This is before you factor in all the overhead of Python objects (~27 bytes per string object).
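A quick back-of-envelope check of those figures (the numbers are the ones quoted above, not measurements):

rows, cols, chars_per_cell, copies = 2000000, 300, 10, 3

raw_bytes = rows * cols * chars_per_cell * copies
print(raw_bytes / 1e9, "GB of character data alone")            # 18.0

overhead_per_string = 27  # rough per-string-object overhead quoted above
total = raw_bytes + rows * cols * copies * overhead_per_string
print(total / 1e9, "GB once per-object overhead is included")   # ~66.6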
You have a couple of options here.
create each new transposed row incrementally, by reading over the whole file once for each new row (i.e. once per column of the original) and appending it to the output (sacrifices time efficiency).
create one file for each new row and combine these row files at the end (sacrifices disk space efficiency, and is possibly problematic if you have a lot of columns in the initial file due to the limit on the number of open files a process may have).
Transposing with a limited number of open files
delimiter = ','
input_filename = 'file.csv'
output_filename = 'out.csv'

# find out the number of columns in the file
with open(input_filename) as input:
    old_cols = input.readline().count(delimiter) + 1

temp_files = [
    'temp-file-{}.csv'.format(i)
    for i in range(old_cols)
]

# create (and empty) the temp files
for temp_filename in temp_files:
    with open(temp_filename, 'w') as output:
        output.truncate()

# append each cell to the temp file for its column
with open(input_filename) as input:
    for line in input:
        parts = line.rstrip().split(delimiter)
        assert len(parts) == len(temp_files), 'not enough or too many columns'
        for temp_filename, cell in zip(temp_files, parts):
            with open(temp_filename, 'a') as output:
                output.write(cell)
                output.write(',')

# combine the temp files into the output file
with open(output_filename, 'w') as output:
    for temp_filename in temp_files:
        with open(temp_filename) as input:
            line = input.read().rstrip()[:-1] + '\n'
            output.write(line)
As the number of columns is far smaller than the number of rows, I would consider writing each column to a separate file and then combining them together.
import sys

Separator = "\t"
f = open(sys.argv[1], 'r')

for line in f:
    for i, c in enumerate(line.strip().split("\t")):
        dest = column_file[i]  # you should open 300+ file handles, one for each column
        dest.write(c)
        dest.write(Separator)
# all you need to do after that is combine the content of your "row" files
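The sketch above leaves two pieces implicit: opening the per-column handles and the final combining step. A possible way to fill both in (the column_<i>.txt file names are my own placeholders, not from the answer):

import sys

Separator = "\t"

with open(sys.argv[1], 'r') as f:
    # Peek at the first line to learn the column count, then open
    # one handle per column of the original table.
    n_cols = len(f.readline().strip().split(Separator))
    column_file = [open("column_{}.txt".format(i), 'w') for i in range(n_cols)]
    f.seek(0)
    for line in f:
        for i, c in enumerate(line.strip().split(Separator)):
            column_file[i].write(c + Separator)
    for handle in column_file:
        handle.close()

# Combine: each per-column file becomes one row of the transposed output.
with open("transposed.txt", "w") as out:
    for i in range(n_cols):
        with open("column_{}.txt".format(i)) as col:
            out.write(col.read().rstrip(Separator) + "\n")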
If you cannot store all of your file into memory, you can read it n times:
import sys

column_number = 4  # if necessary, read the first line of the file to calculate it
separator = '\t'
filename = sys.argv[1]

def get_nth_column(filename, n):
    with open(filename, 'r') as file:
        for line in file:
            if line.strip():  # skip empty lines
                yield line.strip().split('\t')[n]

for column in range(column_number):
    print(separator.join(get_nth_column(filename, column)))
Note that an exception will be raised if the file does not have the right format. You could catch it if necessary.
When reading files, use the with construct to ensure that your file will be closed, and iterate directly over the file instead of reading the whole content first. It is more readable and more memory-efficient.
I have a dataset of about 10 CSV files. I want to combine those files row-wise into a single CSV file.
What I tried:
import csv

fout = open("claaassA.csv", "a")

# first file:
writer = csv.writer(fout)
for line in open("a01.ihr.60.ann.csv"):
    print line
    writer.writerow(line)

# now the rest:
for num in range(2, 10):
    print num
    f = open("a0" + str(num) + ".ihr.60.ann.csv")
    #f.next()  # skip the header
    for line in f:
        print line
        writer.writerow(line)
    #f.close()  # not really needed
fout.close()
More details are definitely needed in the question (ideally examples of the inputs and the expected output).
Given the little information provided, I will assume that all files are valid CSV and that they all have the same number of lines (rows). I'll also assume that memory is not a concern (i.e. they are "small" files that fit together in memory). Furthermore, I assume that line endings are newlines (\n).
If all these assumptions are valid, then you can do something like this:
input_files = ['file1.csv', 'file2.csv', 'file3.csv']
output_file = 'output.csv'

output = None
for infile in input_files:
    with open(infile, 'r') as fh:
        if output:
            for i, l in enumerate(fh.readlines()):
                output[i] = "{},{}".format(output[i].rstrip('\n'), l)
        else:
            output = fh.readlines()

with open(output_file, 'w') as fh:
    for line in output:
        fh.write(line)
There are probably more efficient ways, but this is a quick and dirty way to achieve what I think you are asking for.
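One such alternative, a minimal sketch under the same assumptions (equal line counts, and the same placeholder file names as above): stream the files in parallel with zip so that only one line per file is held in memory at a time.

input_files = ['file1.csv', 'file2.csv', 'file3.csv']
output_file = 'output.csv'

# Open every input once and walk them in lockstep with zip().
handles = [open(name) for name in input_files]
with open(output_file, 'w') as out:
    for rows in zip(*handles):
        out.write(",".join(row.rstrip('\n') for row in rows) + "\n")
for h in handles:
    h.close()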
The previous answer implicitly assumes we need to do this in Python. If bash is an option, you could use the paste command. For example:
paste -d, file1.csv file2.csv file3.csv > output.csv
I don't fully understand why you use the csv library. Actually, it's enough to fill the output file with the lines from the given files (if they have the same column names and order).
input_path_list = [
    "a01.ihr.60.ann.csv",
    "a02.ihr.60.ann.csv",
    "a03.ihr.60.ann.csv",
    "a04.ihr.60.ann.csv",
    "a05.ihr.60.ann.csv",
    "a06.ihr.60.ann.csv",
    "a07.ihr.60.ann.csv",
    "a08.ihr.60.ann.csv",
    "a09.ihr.60.ann.csv",
]
output_path = "claaassA.csv"

with open(output_path, "w") as fout:
    header_written = False
    for input_path in input_path_list:
        with open(input_path) as fin:
            header = next(fin)
            # write the header once, at the beginning, and skip the others
            if not header_written:
                fout.write(header)
                header_written = True
            # copy all remaining rows
            for line in fin:
                fout.write(line)
My problem is this: I have one file with 3000 lines and 8 columns (space-delimited). The important thing is that the first column is a number ranging from 1 to 22, so following the principle of divide and conquer I split the original file into 22 subfiles based on the first column's value.
I also have some result files: 15 sources, each containing one result file. But the result files are too big, so I applied divide and conquer once more to split each of the 15 results into 22 subfiles.
The file structure is as follows:

Original_file              Studies
  split_1                    study1
  split_2                      split_1, split_2, ...
  split_3                    study2
  ...                          split_1, split_2, ...
  split_22                   ...
                             study15
                               split_1, split_2, ...
So by doing this we pay a slight overhead at the beginning, but all of these split files will be deleted at the end, so it doesn't really matter.
I need my final output to be the original file with some values from the studies appended to it.
So, my take so far is this:
Algorithm:

for i in range(1, 22):
    for j in range(1, 15):
        compare (split_i of the original file) with the jth study's split_i
        if one value in a specific column matches:
            create a list with the needed columns from both files,
            build the row with ' '.join(list) and write the result to the outfile.
Is there a better way to go about this problem? The study files range from 300 MB to 1.5 GB in size.
and here's my Python code so far:
folders = ['study1', 'study2', ..., 'study15']

with open("Effects_final.txt", "w") as outfile:
    for i in range(1, 23):
        chr = i
        small_file = "split_" + str(chr) + ".txt"
        with open(small_file, 'r') as sf:
            for sline in sf:  # small-file lines
                sf_parts = sline.split(' ')
                for f in folders:
                    file_to_compare_with = f + "split_" + str(chr) + ".txt"
                    with open(file_to_compare_with, 'r') as cf:  # comparison file
                        for cline in cf:
                            cf_parts = cline.split(' ')
                            if cf_parts[0] == sf_parts[1]:
                                to_write = ' '.join(cf_parts + sf_parts)
                                outfile.write(to_write)
But this code uses four nested loops, which is overkill, yet you have to do it since you need to read the lines from the two files being compared at the same time. This is my concern...
I found one solution that seems to run in a reasonable amount of time. The code is the following:
with open("output_file", 'w') as outfile:
    for i in range(1, 23):
        dict1 = {}  # use a dictionary to map values from the initial file
        with open("split_i", 'r') as split:  # the i-th split of the original file
            next(split)  # skip the header
            for line in split:
                line_list = line.split(delimiter)
                dict1[line_list[whatever_key_u_use_as_id]] = line_list

        compare_dict = {}
        for f in folders:
            with open("each folder", 'r') as comp:  # that study's split_i file
                next(comp)  # skip the header
                for cline in comp:
                    cparts = cline.split(delimiter)
                    compare_dict[cparts[whatever_key_u_use_as_id]] = cparts

        for key in dict1:
            if key in compare_dict:
                outfile.write("write your data")
With this approach, I'm able to process this dataset in ~10 minutes. Surely there is room for improvement. One idea is to take the time to sort the datasets; that way, later searches will be quicker and we might save time!
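For what it's worth, a minimal sketch of that sorting idea (a classic merge join), assuming both inputs are already sorted by the join key with the same ordering and that the keys are unique; the column indices are placeholders, not taken from the code above:

def merge_join(sorted_a, sorted_b, key_a=1, key_b=0):
    """Yield pairs of matching rows from two iterables of split rows,
    both sorted by their join key (keys assumed unique)."""
    iter_a, iter_b = iter(sorted_a), iter(sorted_b)
    row_a, row_b = next(iter_a, None), next(iter_b, None)
    while row_a is not None and row_b is not None:
        if row_a[key_a] == row_b[key_b]:
            yield row_a, row_b
            row_a, row_b = next(iter_a, None), next(iter_b, None)
        elif row_a[key_a] < row_b[key_b]:
            row_a = next(iter_a, None)
        else:
            row_b = next(iter_b, None)

With both files key-sorted up front (for example with the Unix sort command), neither side has to be held in a dictionary at all.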
I have 125 data files, each containing two columns (force and displacement) and 21 rows of data.
I'd like to import them into a single .csv file (as 250 columns and 21 rows).
I am fairly new to Python, but this is what I have been advised, code-wise:
import glob

Results = [open(f) for f in glob.glob("*.data")]
fout = open("res.csv", 'w')

for row in range(21):
    for f in Results:
        fout.write(f.readline().strip())
        fout.write(',')
    fout.write('\n')
fout.close()
However, there is a slight problem with the code, as I only get 125 columns (i.e. the force and displacement columns are written into one column).
I'd very much appreciate it if anyone could help me with this!
import glob

results = [open(f) for f in glob.glob("*.data")]

sep = ","
# Uncomment if your Excel formats decimal numbers like 3,14 instead of 3.14
# sep = ";"

with open("res.csv", 'w') as fout:
    for row in range(21):
        iterator = (f.readline().strip().replace("\t", sep) for f in results)
        line = sep.join(iterator)
        fout.write("{0}\n".format(line))
To explain what went wrong with your code: your source files use tab as the field separator, but your code uses a comma to separate the values it reads from those files. If your Excel uses a period as the decimal separator, it uses a comma as the default field separator. Whitespace is ignored unless enclosed in quotes, and you see the result.
If you use the text import feature of Excel (Data ribbon => From Text) you can ask it to consider both comma and tab as valid field separators, and then I'm pretty sure your original output would work too.
In contrast, the above code should produce a file that will open correctly when double clicked.
You don't need to write your own program to do this, in Python or otherwise. You can use an existing Unix command (if you are in that environment):
paste *.data > res.csv
Try this:
import glob, csv
from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    nexts = cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

Results = [open(f).readlines() for f in glob.glob("*.data")]

out = open("res.csv", 'wb')
fout = csv.writer(out, dialect="excel")

row = []
# roundrobin yields line 1 of every file, then line 2 of every file, and so on,
# so collect one line per file, write the assembled row, and start over.
for n, line in enumerate(roundrobin(*Results), 1):
    row.extend(line.split())
    if n % len(Results) == 0:
        fout.writerow(row)
        row = []

out.close()
It loops over the lines of your input files round-robin, stitching one line from each file together into a single row, which the csv library writes in the listed dialect.
I suggest getting used to the csv module. The reason is that if the data is not that simple (simple strings in the headings, and then numbers only), it is difficult to implement all the parsing yourself. Try the following:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files into the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n + 1:
                rows.append([])      # add another row
            rows[n].extend(row)      # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
The with construct for opening a file can be replaced by the pair:

f = open(fname, 'wb')
...
f.close()

csv.reader and csv.writer are simply wrappers that parse or compose the lines of the file. The documentation (for Python 2) says the file should be opened in binary mode.
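For reference, a minimal sketch of the Python 3 equivalent, where the csv documentation asks for text mode with newline='' instead of binary mode:

import csv

# Python 3: open in text mode and pass newline='' so the csv module
# handles line endings itself.
with open('result.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b', 'c'])

with open('result.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)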