So I'm writing a script to take large csv files and divide them into chunks. These files each have lines formatted accordingly:
01/07/2003,1545,12.47,12.48,12.43,12.44,137423
Where the first field is the date. The next field to the right is a time value. These data points are at minute granularity. My goal is to fill files with 8 days worth of data, so I want to write all the lines from a file for 8 days worth into a new file.
Right now, I'm only seeing the program write one line per "chunk," rather than all the lines. Code shown below and screenshots included showing how the chunk directories are made and the file as well as its contents.
For reference, day 8 shown and 1559 means it stored the last line right before the mod operator became true. So I'm thinking that everything is getting overwritten somehow since only the last values are being stored.
import os
import time
CWD = os.getcwd()
WRITEDIR = CWD+"/Divided Data/"
if not os.path.exists(WRITEDIR):
os.makedirs(WRITEDIR)
FILEDIR = CWD+"/SP500"
os.chdir(FILEDIR)
valid_files = []
filelist = open("filelist.txt", 'r')
for file in filelist:
cur_file = open(file.rstrip()+".csv", 'r')
cur_file.readline() #skip first line
prev_day = ""
count = 0
chunk_count = 1
for line in cur_file:
day = line[3:5]
WDIR = WRITEDIR + "Chunk"
cur_dir = os.getcwd()
path = WDIR + " "+ str(chunk_count)
if not os.path.exists(path):
os.makedirs(path)
if(day != prev_day):
# print(day)
prev_day = day
count += 1
#Create new directory
if(count % 8 == 0):
chunk_count += 1
PATH = WDIR + " " + str(chunk_count)
if not os.path.exists(PATH):
os.makedirs(PATH)
print("Chunk count: " + str(chunk_count))
print("Global count: " + str(count))
temp_path = WDIR +" "+str(chunk_count)
os.chdir(temp_path)
fname = file.rstrip()+str(chunk_count)+".csv"
with open(fname, 'w') as f:
try:
f.write(line + '\n')
except:
print("Could not write to file. \n")
os.chdir(cur_dir)
if(chunk_count >= 406):
continue
cur_file.close()
# count += 1
The answer is in the comment but let me give it here so that your question is answered.
You're opening your file in 'w' mode which overwrites all the previously written content. You need to open it in the 'a' (append) mode:
fname = file.rstrip()+str(chunk_count)+".csv"
with open(fname, 'a') as f:
See more on open function and modes in Python documentation. It specifically mentions about 'w' mode:
note that 'w+' truncates the file
Related
I know there's a lot of content about reading & writing out there, but I'm still not quite finding what I need specifically.
I have 5 files (i.e. in1.txt, in2.txt, in3.txt....), and I want to open/read, run the data through a function I have, and then output the new returned value to corresponding new files (i.e. out1.txt, out2.txt, out3.txt....)
I want to do this in one program run. I'm not sure how to write the loop to process all the numbered files in one run.
If you want them to be processed serially, you can use a for loop as follows:
inpPrefix = "in"
outPrefix = "out"
for i in range(1, 6):
inFile = inPrefix + str(i) + ".txt"
with open(inFile, 'r') as f:
fileLines = f.readlines()
# process content of each file
processedOutput = process(fileLines)
#write to file
outFile = outPrefix + str(i) + ".txt"
with open(outFile, 'w') as f:
f.write(processedOutput)
Note: This assumes that the input and output files are in the same directory as the script is in.
If you are looking just for running one by one separately you can do:
import os
count = 0
directory = "dir/where/your/files/are/"
for filename in os.listdir(directory):
if filename.endswith(".txt"):
count += 1
with open(directory + filename, "r") as read_file:
return_of_your_function = do_something_with_data()
with open(directory + count + filename, "w") as write_file:
write_file.write(return_of_your_function)
Here, you go! I would do something like this:
(Assuming all the input .txt files are in the same input folder)
input_path = '/path/to/input/folder/'
output_path = '/path/to/output/folder/'
for count in range(1,6):
input_file = input_path + 'in' + str(count) + '.txt'
output_file = output_path + 'out' + str(count) + '.txt'
with open(input_file, 'r') as f:
content = f.readlines()
output = process_input(content)
with open(output_file, 'w') as f:
w.write(output)
My code currently unzips one zip folder and finds the file called file.txt and extracts it. Now I need to unzip multiple folders that have the extension .zip. I have tried to use code similar to what I need it to do but the problem is that now I have to find a file called file.txt in each of those .zip folders and extract that file only . Also to store file.txt into a separate folder that has the same name where it came from. Thank you in advance for your time.
import re
import os
from zipfile import ZipFile
def pain():
print("\t\t\tinput_files.zip has been unzipped")
with ZipFile('input_files.zip', 'r') as zipObj:
zipObj.extractall()
listOfFileNames = zipObj.namelist()
for fileName in listOfFileNames:
if fileName.endswith('.txt'):
zipObj.extract(fileName, 'storage')
outfile = "output2.txt" #this will be the filename that the code will write to
baconFile = open(outfile,"wt")
file_name1 = "file.txt"
print('Filename\tLine\tnumber of numbers\tstring separated by a comma\twhite space found\ttab found\tcarriage return found\n') #This prints the master column in the python shell and this is the way the code should collect the data
baconFile.write('Filename\tLine\tnumber of numbers\tstring separated by a comma\twhite space found\ttab found\tcarriage return found\n') #This prints the master column in the output file and this is the way the code should collect the data
#for filename in os.listdir(os.getcwd() + "/input_files"):
for filename in os.listdir('C:\Users\M29858\Desktop\TestPy\Version10\input_files'):
with open("input_files/" + filename, 'r') as f:
if file_name1 in filename:
output_contents(filename, f, baconFile)
baconFile.close() #closes the for loop that the code is writing to
def output_contents(filename, f, baconFile): #using open() function to open the file inside the directory
index = 0
for line in f:
#create a list of all of the numerical values in our line
content = line.split(',') #this will be used to count the amount numbers before and after comma
whitespace_found = False
tab_found = False
false_string = "False (end of file)"
carriage_found = false_string
sigfigs = ""
index += 1 #adds 1 for every line if it finds what the command wants
if " " in line: #checking for whitespace
whitespace_found = True
if "\t" in line: #checking for tabs return
tab_found = True
if '\n' in line: #checking if there is a newline after the end of each line
carriage_found = True
sigfigs = (','.join(str(len(g)) for g in re.findall(r'\d+\.?(\d+)?', line ))) #counts the sigsfigs after decimal point
print(filename + "\t{0:<4}\t{1:<17}\t{2:<27}\t{3:17}\t{4:9}\t{5:21}"
.format(index, len(content), sigfigs, str(whitespace_found), str(tab_found), str(carriage_found))) #whatever is inside the .format() is the way it the data is stored into
baconFile.write('\n')
baconFile.write( filename + "\t{0:<4}\t{1:<17}\t{2:<27}\t{3:17}\t{4:9}\t{5:21}"
.format(index, len(content), sigfigs, str(whitespace_found), str(tab_found), str(carriage_found)))
if __name__ == '__main__':
pain()
#THIS WORKS
import glob
import os
from zipfile import ZipFile
def main():
for fname in glob.glob("*.zip"): # get all the zip files
with ZipFile(fname) as archive:
# if there's no file.txt, ignore and go on to the next zip file
if 'file.txt' not in archive.namelist(): continue
# make a new directory named after the zip file
dirname = fname.rsplit('.',1)[0]
os.mkdir(dirname)
extract file.txt into the directory you just created
archive.extract('file.txt', path=dirname)
I have a directory contains 50 files I want to read them one by one and compare wit the other files - that is fixed. I am using glob.blob. But it didn't work.
Here how I am reading all files. Instead, path = '*.rbd' if I give the file name like path = run-01.rbd it works.
path = '*.rbd'
path = folder + path
files=sorted(glob.glob(path))
complete code
import glob
from itertools import islice
import linecache
num_lines_nonbram = 1891427
bits_perline = 32
total_bit_flips = 0
num_bit_diff_flip_zero = 0
num_bit_diff_flip_ones = 0
folder = "files/"
path = '*.rbd'
path = folder + path
files=sorted(glob.glob(path))
original=open('files/mull-original-readback.rbd','r')
#source1 = open(file1, "r")
for filename in files:
del_lines = 101
with open(filename,'r') as f:
i=1
while i <= del_lines:
line1 = f.readline()
lineoriginal=original.readline()
i+=1
i=0
num_bit_diff_flip_zero = 0
num_bit_diff_flip_ones = 0
num_lines_diff =0
i=0
j=0
k=0
a_write2 = ""
while i < (num_lines_nonbram-del_lines):
line1 = f.readline()
lineoriginal = original.readline()
while k < bits_perline:
if ((lineoriginal[k] == line1[k])):
a_write2 += " "
else:
if (lineoriginal[k]=="0"):
#if ((line1[k]=="0" and line1[k]=="1")):
num_bit_diff_flip_zero += 1
if (lineoriginal[k]=="1"):
#if ((line1[k]=="0" and line1[k]=="1")):
num_bit_diff_flip_ones += 1
#if ((line1[k]==1 and line1[k]==0)):
#a_write_file2 = str(i+1) + " " + str(31-k) + "\n" + a_write_file2
#a_write2 += "^"
#num_bit_diff_flip_one += 1
# else:
# a_write2 += " "
k+=1
total_bit_flips=num_bit_diff_flip_zero+num_bit_diff_flip_ones
i+=1
k=0
i = 0
print files
print "Number of bits flip zero= %d" %num_bit_diff_flip_zero +"\n" +"Number of bits flip one= %d" %num_bit_diff_flip_ones +"\n" "Total bit flips = %d " %total_bit_flips
f.close()
original.close()
You could use the os module to first list everything in a directory (both files and modules) then use a python generator to filter out only the files. You could then use a second python generator to filter out files with a specific extension. There is probably a more efficient way of doing it but this works:
import os
def main():
path = './' # The path to current directory
# Go through all items in the directory and filter out files
files = [file for file in os.listdir(path) if
os.path.isfile(os.path.join(path, file))]
# Go through all files and filter out files with .txt (for example)
specificExtensionFiles = [file for file in files if ".txt" in file]
# Now specificExtensionFiles is a generator for .txt files in current
# directory which you can use in a for loop
print (specificExtensionFiles)
if __name__ == '__main__':
main()
For further reference:
How do I list all files of a directory?
The problem is that you're not going back to the beginning of originalfile whenever you start comparing with the next file in the for filename in files: loop. The simplest solution is to put:
original.seek(0)
at the beginning of that loop.
You could also read the whole file into a list just once before the loop, and use that instead of reading the file repeatedly.
And if you only want to process part of the files, you can read the file into a list, and then use a list slice to get the lines you want.
You also shouldn't be setting num_bit_diff_flip_zero and num_bit_diff_flip_one to 0 each time through the loop, since these are supposed to be the total across all files.
with open('files/mull-original-readback.rbd','r') as original:
original_lines = list(original)[del_lines:num_lines_nonbram]
for filename in files:
with open(file, 'r') as f:
lines = list(f)[del_lines:num_lines_nonbram]
for lineoriginal, line1 in zip(original_lines, lines):
for k in range(bits_perline):
if lineoriginal[k] == line1[k]:
a_write2 += " "
elif lineoriginal[k] == "0"
num_bit_diff_flip_zero += 1
else:
num_bit_diff_flip_ones += 1
total_bit_flips = num_bit_diff_flip_zero + num_bit_diff_flip_ones
I need to compare text files with two other files, and then get the result as an output. So I taught myself enough to write the following script which works fine and compares all of the files in a specific directory, however I have multiple directories with text files inside. What I need is to compare all of the text files in all of the directories and have an output file for each directory. Is there a way to improve the code below to do that:
import glob
import os
import sys
sys.stdout = open("citation.txt", "w")
for filename in glob.glob('journal*.txt'):
f1 = open(filename,'r')
f1data = f1.readlines()
f2 = open('chem.txt')
f2data = f2.readlines()
f3 = open('bio.txt')
f3data = f3.readlines()
chem = 0
bio = 0
total = 0
for line1 in f1data:
i = 0
for line2 in f2data:
if line1 in line2:
i+=1
total+=1
chem+=1
if i > 0:
print 'chem ' + line1 + "\n"
for line3 in f3data:
if line1 in line3:
i+=1
total+=1
bio+=1
if i > 0:
print 'bio ' + line1 + "\n"
print filename
print total
print 'bio ' + str(bio)
print 'chem ' + str(kimya)
Thanks in advance!
Just use a list of directories and a for loop
directories = ['folder1','folder2',...]
for i,folder in enumerate(directories):
sys.stdout = open("citation{}.txt".format(i), "w")
...
[put the rest of your code here]
This will name different output files as citation0.txt but you can do other formats if you want, just by changing how that name is declared.
And if you want each citation.txt to go into the actual directory, just change your code to this:
for folder in directories:
citation = os.path.join(folder, "citation.txt")
sys.stdout = open(citation, "w")
This will create a path for a new citation.txt file with each directory as the loop runs. Make sure to import os at the start of your file, if you haven't already.
I have written a script in python, which works on a single file. I couldn't find an answer to make it run on multiple files and to give output for each file separately.
out = open('/home/directory/a.out','w')
infile = open('/home/directory/a.sam','r')
for line in infile:
if not line.startswith('#'):
samlist = line.strip().split()
if 'I' or 'D' in samlist[5]:
match = re.findall(r'(\d+)I', samlist[5]) # remember to chang I and D here aswell
intlist = [int(x) for x in match]
## if len(intlist) < 10:
for indel in intlist:
if indel >= 10:
## print indel
###intlist contains lengths of insertions in for each read
#print intlist
read_aln_start = int(samlist[3])
indel_positions = []
for num1, i_or_d, num2, m in re.findall('(\d+)([ID])(\d+)?([A-Za-z])?', samlist[5]):
if num1:
read_aln_start += int(num1)
if num2:
read_aln_start += int(num2)
indel_positions.append(read_aln_start)
#print indel_positions
out.write(str(read_aln_start)+'\t'+str(i_or_d) + '\t'+str(samlist[2])+ '\t' + str(indel) +'\n')
out.close()
I would like my script to take multiple files with names like a.sam, b.sam, c.sam and for each file give me the output : aout.sam, bout.sam, cout.sam
Can you please pass me either a solution or a hint.
Regards,
Irek
Loop over filenames.
input_filenames = ['a.sam', 'b.sam', 'c.sam']
output_filenames = ['aout.sam', 'bout.sam', 'cout.sam']
for infn, outfn in zip(input_filenames, output_filenames):
out = open('/home/directory/{}'.format(outfn), 'w')
infile = open('/home/directory/{}'.format(infn), 'r')
...
UPDATE
Following code generate output_filenames from given input_filenames.
import os
def get_output_filename(fn):
filename, ext = os.path.splitext(fn)
return filename + 'out' + ext
input_filenames = ['a.sam', 'b.sam', 'c.sam'] # or glob.glob('*.sam')
output_filenames = map(get_output_filename, input_filenames)
I'd recommend wrapping that script in a function, using the def keyword, and passing the names of the input and output files as parameters to that function.
def do_stuff_with_files(infile, outfile):
out = open(infile,'w')
infile = open(outfile,'r')
# the rest of your script
Now you can call this function for any combination of input and output file names.
do_stuff_with_files('/home/directory/a.sam', '/home/directory/a.out')
If you want to do this for all files in a certain directory, use the glob library. To generate the output filenames, just replace the last three characters ("sam") with "out".
import glob
indir, outdir = '/home/directory/', '/home/directory/out/'
files = glob.glob1(indir, '*.sam')
infiles = [indir + f for f in files]
outfiles = [outdir + f[:-3] + "out" for f in files]
for infile, outfile in zip(infiles, outfiles):
do_stuff_with_files(infile, outfile)
The following script allows working with an input and output file. It will loop over all files in the given directory with the ".sam" extension, perform the specified operation on them, and output the results to a separate file.
Import os
# Define the directory containing the files you are working with
path = '/home/directory'
# Get all the files in that directory with the desired
# extension (in this case ".sam")
files = [f for f in os.listdir(path) if f.endswith('.sam')]
# Loop over the files with that extension
for file in files:
# Open the input file
with open(path + '/' + file, 'r') as infile:
# Open the output file
with open(path + '/' + file.split('.')[0] + 'out.' +
file.split('.')[1], 'a') as outfile:
# Loop over the lines in the input file
for line in infile:
# If a line in the input file can be characterized in a
# certain way, write a different line to the output file.
# Otherwise write the original line (from the input file)
# to the output file
if line.startswith('Something'):
outfile.write('A different kind of something')
else:
outfile.write(line)
# Note the absence of either a infile.close() or an outfile.close()
# statement. The with-statement handles that for you