Output comes in dictionary sorted order in python - python

I have a collection of CSV files with names like 2.csv, 3.csv, ..., 999.csv. Each file has 91 rows. I want to create a set of new files, each collecting a particular row from all the files. E.g. row1.csv should contain the first row of all 998 files, and similarly row35.csv should contain the 35th row of all 998 files. So after my script completes I should have 91 files in total (one for each row), each with 998 rows (one for each original file).
I use the following code for the task:
import glob
import os

for i in range(2, 92):
    outfile = open("row_%i.csv" % i, 'w')
    for filename in glob.glob('DataSet-MediaEval/devFeatures/*.csv'):
        with open(filename, 'r') as infile:
            lineno = 0
            for line in infile:
                lineno += 1
                if lineno == i:
                    outfile.write(line)
    outfile.close()
Now, in any of the outfiles row_i.csv, my data is arranged in dictionary (string) sorted order. Example:
The first row in row_50.csv is the 50th row of 10.csv.
In other words, in any row_i.csv the rows come from 10.csv, 100.csv, 101.csv and so on.
I wanted to know why that is happening, and whether there is a way to ensure that my row_i.csv is ordered the way the files are numbered, i.e. 2.csv, 3.csv and so on.
Thanks for the time spent reading this.

Not sure if this works out or if there are more problems, but it seems like glob either returns the filenames sorted as strings or in arbitrary order. In both cases, you will have to extract the number from each filename and sort by that number.
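For illustration (not part of the original answer), string sorting compares the names character by character, so "10.csv" ends up before "2.csv":
# lexicographic (string) sorting puts "10.csv" and "100.csv" before "2.csv"
print(sorted(['2.csv', '10.csv', '100.csv', '3.csv']))
# -> ['10.csv', '100.csv', '2.csv', '3.csv']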
Try something like this:
import re
import glob

p = re.compile(r"/(\d+)\.csv")
filenames = glob.glob(...)
for filename in sorted(filenames, key=lambda s: int(p.search(s).group(1))):
    ...
Also, it seems like you are opening, looping over and closing each of those ~1000 input files again and again, once for every outfile! It might be better to open all of the outfiles at once and store them in a dictionary mapping line numbers to files. That way you only have to loop over the input files once.
Something like this (totally not tested):
import re
import glob

outfiles = {i: open("row_%i.csv" % i, 'w') for i in range(2, 92)}
p = re.compile(r"/(\d+)\.csv")
filenames = glob.glob('DataSet-MediaEval/devFeatures/*.csv')
for filename in sorted(filenames, key=lambda s: int(p.search(s).group(1))):
    with open(filename, 'r') as infile:
        for lineno, line in enumerate(infile, start=1):
            if lineno in outfiles:
                outfiles[lineno].write(line)
for outfile in outfiles.values():
    outfile.close()

You need to sort the filename list before starting to iterate over it. This can help you:
import re
import glob

filename_list = glob.glob('DataSet-MediaEval/devFeatures/*.csv')

def splitByNumbers(x):
    r = re.compile(r'(\d+)')
    l = r.split(x)
    return [int(y) if y.isdigit() else y for y in l]

filenames = sorted(filename_list, key=splitByNumbers)
Then, instead of
for filename in glob.glob('DataSet-MediaEval/devFeatures/*.csv'):
you can use
for filename in filenames:
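To see what that key function produces for one of the paths (an illustrative check, not part of the original answer):
parts = splitByNumbers('DataSet-MediaEval/devFeatures/10.csv')
print(parts)
# -> ['DataSet-MediaEval/devFeatures/', 10, '.csv']
# sorted() compares these lists element by element, so 2 vs. 10 is decided numerically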

Related

Python: concatenating text files

Using Python, I'm seeking to iteratively combine two sets of txt files to create a third set of txt files.
I have a directory of txt files in two categories:
text_[number].txt (eg: text_0.txt, text_1.txt, text_2.txt....text_20.txt)
comments_[number].txt (eg: comments_0.txt, comments_1.txt, comments_2.txt...comments_20.txt).
I'd like to iteratively combine the text_[number] files with the matching comments_[number] files into a new file category feedback_[number].txt. The script would combine text_0.txt and comments_0.txt into feedback_0.txt, and continue through each pair in the directory. The number of text and comments files will always match, but the total number of text and comment files is variable depending on preceding scripts.
I can combine one pair using the code below with a list of the two file names:
filenames = ['text_0.txt', 'comments_0.txt']
with open("feedback_0.txt", "w") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
However, I'm uncertain how to structure iteration for the rest of the files. I'm also curious how to generate lists from the contents of the file directory. Any advice or assistance on moving forward is greatly appreciated.
It would be far simpler (and possibly faster) to just fork a cat process:
import subprocess

n = ...  # number of file pairs
for i in range(n):
    with open(f'feedback_{i}.txt', 'w') as f:
        subprocess.run(['cat', f'text_{i}.txt', f'comments_{i}.txt'], stdout=f)
Or, if you already have lists of the file names:
for text, comment, feedback in zip(text_files, comment_files, feedback_files):
    with open(feedback, 'w') as f:
        subprocess.run(['cat', text, comment], stdout=f)
Unless these are all extremely small files, the cost of reading and writing the bytes will outweigh the cost of forking a new process for each pair.
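If you would rather stay in pure Python, roughly the same streaming copy can be done with shutil.copyfileobj; a sketch, reusing the f-string file names from above:
import shutil

n = 10  # illustrative number of file pairs
for i in range(n):
    with open(f'feedback_{i}.txt', 'wb') as out:
        for src in (f'text_{i}.txt', f'comments_{i}.txt'):
            with open(src, 'rb') as part:
                shutil.copyfileobj(part, out)  # copies in chunks, no full read into memory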
Maybe not the most elegant but...
length = 10
txt = [f"text_{n}.txt" for n in range(length)]
com = [f"comments_{n}.txt" for n in range(length)]
feed = [f"feedback_{n}.txt" for n in range(length)]

for f, t, c in zip(feed, txt, com):
    with open(f, "w") as outfile:
        with open(t) as infile1:
            contents = infile1.read()
            outfile.write(contents)
        with open(c) as infile2:
            contents = infile2.read()
            outfile.write(contents)
There are many ways to achieve this, but I don't seem to see any solution that's both beginner-friendly and takes into account the structure of the files you described.
You can iterate through the files, and for every text_[num].txt, fetch the corresponding comments_[num].txt and write to feedback_[num].txt as shown below. There's no need to add any counters or make any other assumptions about the files that might not always be true:
import os

srcpath = 'path/to/files'
for f in os.listdir(srcpath):
    if f.startswith('text'):
        index = f[5:-4]  # extract the [num] part
        # Build the paths to the text, comment and feedback files
        txt_path = os.path.join(srcpath, f)
        cmnt_path = os.path.join(srcpath, f'comments_{index}.txt')
        fb_path = os.path.join(srcpath, f'feedback_{index}.txt')
        # Write to the output, reading in byte mode following chepner's advice
        with open(fb_path, 'wb') as outfile:
            outfile.write(open(txt_path, 'rb').read())
            outfile.write(open(cmnt_path, 'rb').read())
The simplest way would probably be to just iterate from 1 onwards, stopping at the first missing file. This works assuming that your files are numbered in increasing order and with no gaps (e.g. you have 1, 2, 3 and not 1, 3).
import os
from itertools import count

for i in count(1):
    t = f'text_{i}.txt'
    c = f'comments_{i}.txt'
    if not os.path.isfile(t) or not os.path.isfile(c):
        break
    with open(f'feedback_{i}.txt', 'wb') as outfile:
        outfile.write(open(t, 'rb').read())
        outfile.write(open(c, 'rb').read())
You can try this:
filenames = ['text_0.txt', 'comments_0.txt', 'text_1.txt', 'comments_1.txt',
             'text_2.txt', 'comments_2.txt', 'text_3.txt', 'comments_3.txt']

for i, j in enumerate(zip(filenames[::2], filenames[1::2])):
    with open(f'feedback_{i}.txt', 'a+') as outfile:
        for k in j:
            with open(k, 'r') as f:
                contents = f.read()
                outfile.write(contents)
I have taken a list here. Instead, you can do
import os
filenames=os.listdir('path/to/folder')
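Note that os.listdir returns names in arbitrary order and includes every file in the folder, so the slicing trick above would need the list filtered and sorted first. A sketch (the helper pair_key is illustrative, not part of the original answer):
import os
import re

def pair_key(name):
    # sort by the numeric suffix, e.g. text_10.txt -> 10
    return int(re.search(r'_(\d+)\.txt$', name).group(1))

names = os.listdir('path/to/folder')
texts = sorted((n for n in names if n.startswith('text_')), key=pair_key)
comments = sorted((n for n in names if n.startswith('comments_')), key=pair_key)
# interleave back into the text, comment, text, comment, ... order the slices expect
filenames = [name for pair in zip(texts, comments) for name in pair]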

Most efficient way to compare multiple files in python

My problem is this. I have one file with 3000 lines and 8 columns (space-delimited). The important thing is that the first column is a number ranging from 1 to 22. So, in the spirit of divide and conquer, I split the original file into 22 subfiles based on the value of the first column.
I also have some result files from 15 sources, each source containing 1 result file. The result files are too big, so I applied divide and conquer once more and split each of the 15 results into 22 subfiles.
The file structure is as follows:
Original_file
    split_1, split_2, ..., split_22
Studies
    study1
        split_1, split_2, ...
    study2
        split_1, split_2, ...
    ...
    study15
        split_1, split_2, ...
So by doing this, we pay a slight overhead in the beginning, but all of these split files will be deleted at the end. so it doesn't really matter.
I need my final output to be the original file with some values from the studies appended to it.
So, my take so far is this:
Algorithm:
for i in range(1, 23):
    for j in range(1, 16):
        compare (split_i of the original file) with the j-th study's split_i
        if one value in a specific column matches:
            create a list with the needed columns from both files,
            join the row with ' '.join(list) and write the result to the outfile
Is there a better way to approach this problem? The study files range from 300 MB to 1.5 GB in size.
and here's my Python code so far:
folders = ['study1', 'study2', ..., 'study15']

with open("Effects_final.txt", "w") as outfile:
    for i in range(1, 23):
        chr = i
        small_file = "split_" + str(chr) + ".txt"
        with open(small_file, 'r') as sf:
            for sline in sf:  # small files
                sf_parts = sline.split(' ')
                for f in folders:
                    file_to_compare_with = f + "split_" + str(chr) + ".txt"
                    with open(file_to_compare_with, 'r') as cf:  # comparison files
                        for cline in cf:
                            cf_parts = cline.split(' ')
                            if cf_parts[0] == sf_parts[1]:
                                to_write = ' '.join(cf_parts + sf_parts)
                                outfile.write(to_write)
But this code uses 4 nested loops, which is overkill, yet you have to do it since you need to read lines from the 2 files being compared at the same time. This is my concern...
I found one solution that seems to work in a good amount of time. The code is the following:
with open("output_file", 'w') as outfile:
    for i in range(1, 23):
        dict1 = {}  # use a dictionary to map values from the initial file
        with open("split_i", 'r') as split:
            next(split)  # skip the header
            for line in split:
                line_list = line.split(delimiter)
                dict1[line_list[whatever_key_u_use_as_id]] = line_list
        compare_dict = {}
        for f in folders:
            with open("each folder", 'r') as comp:
                next(comp)  # skip the header
                for cline in comp:
                    cparts = cline.split(delimiter)
                    compare_dict[cparts[whatever_key_u_use_as_id]] = cparts
        for key in dict1:
            if key in compare_dict:
                outfile.write("write your data")
With this approach, I'm able to process this dataset in ~10 minutes. Surely there is room for improvement. One idea is to take the time to sort the datasets; that way, later searches will be quicker and we might save time overall!
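A rough sketch of that sorting idea (illustrative only, reusing the dict1, compare_dict and outfile names from above): once both sides are sorted by the join key, a single linear pass finds all the matches:
left = sorted(dict1.items())          # (key, columns) pairs from the split file
right = sorted(compare_dict.items())  # (key, columns) pairs from the study files
i = j = 0
while i < len(left) and j < len(right):
    if left[i][0] == right[j][0]:
        # matching key: combine the columns however the output needs
        outfile.write(' '.join(right[j][1] + left[i][1]))
        i += 1
        j += 1
    elif left[i][0] < right[j][0]:
        i += 1
    else:
        j += 1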

Multiple editing of CSV files

I'm having a small problem working with CSV files in Python (3.5). Previously I was working with single files and there was no problem, but right now I have >100 files in one folder.
So, my goal is:
1. Parse all *.csv files in the directory.
2. From each file, delete the first 6 rows; the files consist of the following data:
"nu(Ep), 2.6.8"
"Date: 2/10/16, 11:18:21 AM"
19
Ep,nu
0.0952645,0.123776,
0.119036,0.157720,
...
0.992060,0.374300,
3. Save each file separately (for example adding "_edited"), so that only the numbers are kept.
4. As an option: I have data subdivided into two parts for one material, for example Ag(0-1_s).csv and Ag(1-4)_s.csv (after steps 1-3 they should be like Ag(*)_edited.csv). How can I merge these two files by appending the data from (1-4) to the end of (0-1) and saving the result in a third file?
My code so far is the following:
import os, sys
import csv
import re
import glob
import fileinput

def get_all_files(directory, extension='.csv'):
    dir_list = os.listdir(directory)
    csv_files = []
    for i in dir_list:
        if i.endswith(extension):
            csv_files.append(os.path.realpath(i))
    return csv_files

csv_files = get_all_files('/Directory/Path/Here')

# Here is the problem with csv's, I don't know how to scan files
# which are in the list "csv_files".
for n in csv_files:
    #print(n)
    lines = []  # empty, because I don't know how to write it properly per each file
    input = open(n, 'r')
    reader = csv.reader(n)
    temp = []
    for i in range(5):
        next(reader)
    # a for loop here regarding rows?
    # for row in n: ???
    #     ???
    input.close()
    # newfilename = "".join(n.split(".csv")) + "edited.csv"
    # newfilename can be used within open() below:
    with open(n + '_edited.csv', 'w') as nf:
        writer = csv.writer(nf)
        writer.writerows(lines)
This is the fastest way I can think of. If you have a solid-state drive, you could throw multiprocessing at this for more of a performance boost
import glob
import os

for fpath in glob.glob('path/to/directory/*.csv'):
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    outpath = os.path.join('path/to/dir', fname + "_edited" + os.path.extsep + 'csv')
    with open(fpath) as infile, open(outpath, 'w') as outfile:
        for _ in range(6):
            infile.readline()
        for line in infile:
            outfile.write(line)
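A possible shape for the multiprocessing idea mentioned above (the worker function and pool size here are assumptions, not part of the original answer):
import glob
import os
from multiprocessing import Pool

def strip_header(fpath):
    # same per-file work as above: drop the first 6 lines, copy the rest
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    outpath = os.path.join('path/to/dir', fname + '_edited' + os.path.extsep + 'csv')
    with open(fpath) as infile, open(outpath, 'w') as outfile:
        for _ in range(6):
            infile.readline()
        for line in infile:
            outfile.write(line)

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(strip_header, glob.glob('path/to/directory/*.csv'))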

Copy columns from multiple text files in Python

I have a large number of text files containing data arranged into a fixed number of rows and columns, the columns being separated by spaces (like a .csv but using spaces as the delimiter). I want to extract a given column from each of these files and write it into a new text file.
So far I have tried:
results_combined = open('ResultsCombined.txt', 'wb')

def combine_results():
    for num in range(2, 10):
        f = open("result_0." + str(num) + "_.txt", 'rb')  # all the text files have similar filename styles
        lines = f.readlines()  # read in the data
        no_lines = len(lines)  # get the number of lines
        for i in range(0, no_lines):
            column = lines[i].strip().split(" ")
            results_combined.write(column[5] + " " + '\r\n')
        f.close()

if __name__ == "__main__":
    combine_results()
This produces a text file containing the data I want from the separate files, but as a single column. (i.e. I've managed to 'stack' the columns on top of each other, rather than have them all side by side as separate columns). I feel I've missed something obvious.
In another attempt, I managed to write all the separate files to a single file, but without picking out the columns that I want.
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'wb')
for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip())
        fout.write(' ')
    fout.write('\n')
fout.close()
What I basically want is to copy column 5 from each file (it is always the same column) and write them all to a single file.
If you don't know the maximum number of rows in the files and if the files can fit into memory, then the following solution would work:
import glob

files = [open(f) for f in glob.glob("*.txt")]

# Given a file, read the 6th column from each line
def readcol5(f):
    return [line.split(' ')[5] for line in f]

filecols = [readcol5(f) for f in files]
maxrows = len(max(filecols, key=len))

# Given an array, make sure it has maxrows number of elements.
def extendmin(arr):
    diff = maxrows - len(arr)
    arr.extend([''] * diff)
    return arr

filecols = map(extendmin, filecols)
lines = zip(*filecols)
lines = map(lambda x: ','.join(x), lines)
lines = '\n'.join(lines)
fout = open('output.csv', 'wb')
fout.write(lines)
fout.close()
Or this option (following your second approach):
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'w')
for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip().split(' ')[5])
        fout.write(' ')
    fout.write('\n')
fout.close()
... which uses a fixed number of rows per file but will work for very large numbers of rows because it does not store the intermediate values in memory. For moderate numbers of rows, I'd expect the first solution above to run more quickly.
Why not read all the entries from the 5th column of each file into a list and, after reading in all the files, write them all to the output file?
data = [
    [],  # entries from first file
    [],  # entries from second file
    ...
]

for i in range(number_of_rows):
    outputline = []
    for vals in data:
        outputline.append(vals[i])
    outfile.write(" ".join(outputline))
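A more concrete sketch of that idea (the file pattern and column index are taken from the question; everything else here is illustrative):
import glob

# collect the 6th field (index 5) from every matching file
data = []
for path in sorted(glob.glob("result_*.txt")):
    with open(path) as f:
        data.append([line.strip().split(" ")[5] for line in f])

number_of_rows = min(len(cols) for cols in data)  # assumes the files have equal row counts
with open("ResultsCombined.txt", "w") as outfile:
    for i in range(number_of_rows):
        outfile.write(" ".join(cols[i] for cols in data) + "\n")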

Python- Read from Multiple Files

I have 125 data files, each containing two columns and 21 rows of data, and I'd like to import them into a single .csv file (as 250 columns and 21 rows).
I am fairly new to Python, but this is what I have been advised, code-wise:
import glob

Results = [open(f) for f in glob.glob("*.data")]
fout = open("res.csv", 'w')
for row in range(21):
    for f in Results:
        fout.write(f.readline().strip())
        fout.write(',')
    fout.write('\n')
fout.close()
However, there is a slight problem with the code: I only get 125 columns, i.e. the force and displacement columns of each file are written into a single column.
I'd very much appreciate it if anyone could help me with this!
import glob

results = [open(f) for f in glob.glob("*.data")]
sep = ","
# Uncomment if your Excel formats decimal numbers like 3,14 instead of 3.14
# sep = ";"
with open("res.csv", 'w') as fout:
    for row in range(21):
        iterator = (f.readline().strip().replace("\t", sep) for f in results)
        line = sep.join(iterator)
        fout.write("{0}\n".format(line))
So, to explain what went wrong with your code: your source files use tab as the field separator, but your code uses a comma to join the values it reads from those files. If your Excel uses a period as the decimal separator, it uses comma as its default field separator; the tab whitespace is ignored unless enclosed in quotes, and you see the result.
If you use the text import feature of Excel (Data ribbon => From Text) you can ask it to treat both comma and tab as valid field separators, and then I'm pretty sure your original output would work too.
In contrast, the code above should produce a file that opens correctly when double-clicked.
You don't need to write your own program to do this, in python or otherwise. You can use an existing unix command (if you are in that environment):
paste *.data > res.csv
Try this:
import glob, csv
from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while pending:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

Results = [open(f).readlines() for f in glob.glob("*.data")]
fout = csv.writer(open("res.csv", 'w', newline=''), dialect="excel")
row = []
for line, c in zip(roundrobin(*Results), cycle(range(len(Results)))):
    row.extend(line.split())       # add this file's fields to the current row
    if c == len(Results) - 1:      # last file for this row: write it out
        fout.writerow(row)
        row = []
del fout
It should loop over each line of your input files and stitch them together as one row, which the csv library will write in the listed dialect.
I suggest getting used to the csv module. The reason is that if the data is not that simple (simple strings in the headings, and then numbers only), it is difficult to implement everything again yourself. Try the following:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n + 1:
                rows.append([])      # add another row
            rows[n].extend(row)      # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
The with construct for opening a file can be replaced by the pair:
f = open(fname, 'wb')
...
f.close()
The csv.reader and csv.writer are simply wrappers that parse or compose the lines of the file. The documentation says they require the file to be opened in binary mode.
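That binary-mode requirement applies to the Python 2 csv module; under Python 3 you open the files in text mode with newline='' instead, e.g.:
import csv

# Python 3 style: text mode plus newline='' so the csv module controls line endings itself
with open('result.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Ep', 'nu'])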
