Using a loop to open and process txt files with csv - python

I have data (mixed text and numbers in txt files) and I'd like to write a for loop that creates a list of lists, such that I can process the data from all the files using fewer lines.
So far I have written this:
import csv

path = (some path...)
files = [path + 'file1.txt', path + 'file2.txt', path + 'file3.txt', ...]

for i in files:
    with open(i, 'r') as j:
        Reader = csv.reader(j)
        List = [List for List in Reader]
I think I am overwriting List instead of building a nested list, since Reader ends up with size 1 and List only has the dimensions of one of the files.
My questions:
Given that the files may contain different numbers of lines, is this the right approach for saving some lines of code? (What could be done better?)
I think the problem is in [List for List in Reader]; is there a way to change it so I don't overwrite List, something like adding to List?

You can use the list append() method to add to an existing list. Since csv.reader instances are iterable, you can convert one to a list and append that, as shown below:
import csv
from pathlib import Path

path = Path('./')
filenames = ['in_file1.txt', 'in_file2.txt']  # etc ...

List = []
for filename in filenames:
    with open(path / filename, 'r', newline='') as file:
        List.append(list(csv.reader(file)))

print(List)
Update
An even more succinct way to do it would be to use something called a "list comprehension":
import csv
from pathlib import Path

path = Path('./')
filenames = ['in_file1.txt', 'in_file2.txt']  # etc ...

List = [list(csv.reader(open(path / filename, 'r', newline='')))
        for filename in filenames]

print(List)
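One trade-off of the comprehension version is that the file objects it opens are never explicitly closed and are left to the garbage collector. If that matters, a small helper function keeps the with statement while the comprehension stays short; this is just a sketch using the same assumed filenames:

import csv
from pathlib import Path

path = Path('./')
filenames = ['in_file1.txt', 'in_file2.txt']  # etc ...

def read_rows(p):
    # Open each file inside a with block so it is closed promptly.
    with open(p, 'r', newline='') as file:
        return list(csv.reader(file))

List = [read_rows(path / filename) for filename in filenames]
print(List)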

Yes, use .append():
import numpy as np
import matplotlib.pyplot as plt
import csv

path = (some path...)
files = [path + x for x in ['FILESLIST']]

List = []  # initialize the outer list before appending to it
for i in files:
    with open(i, 'r') as j:
        Reader = csv.reader(j)
        List.append([L for L in Reader])
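As a side note, [L for L in Reader] is equivalent to list(Reader), so the loop body can be shortened to List.append(list(Reader)).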

Related

Sorting hash numbers in using matrices in python

I'm working on a project to check for copies between two drives and I got stuck on sorting.
The output I have now is [Filename, Hash, Location] in two lists called drive1 and drive2.
I'd like to end up with two text files, each listing the files that aren't on the other drive.
import os
import os.path
import hashlib
from os import path

drive1 = []
drive2 = []

file1 = input("Directory 1 location : ")
file2 = input("Directory 2 location : ")

AFile = open('skrar.txt', 'w')
AFile.close

def hash_file(filename):
    if path.isfile(filename) is False:
        pass
    # make a hash object
    md5_h = hashlib.md5()
    # open file for reading in binary mode
    with open(filename, 'rb') as file:
        # read file in chunks and update hash
        chunk = 0
        while chunk != b'':
            chunk = file.read(1024)
            md5_h.update(chunk)
    # return the hex digest
    return md5_h.hexdigest()

with open('Drive1.txt', 'w') :
    AFile.write(hashlib.sha224(b"FILENAME").hexdigest())
    for folderName, subfolders, filenames in os.walk(file1):
        os.chdir(folderName)
        for filename in filenames:
            AFile.write(filename+";"+hash_file(filename)+";"+os.getcwd()+";"+os.path.join(os.getcwd(),filename)+'\n')

with open('Drive2.txt', 'w') :
    AFile.write(hashlib.sha224(b"FILENAME").hexdigest())
    for folderName, subfolders, filenames in os.walk(file2):
        os.chdir(folderName)
        for filename in filenames:
            AFile.write(filename+";"+hash_file(filename)+";"+os.getcwd()+";"+os.path.join(os.getcwd(),filename)+'\n')

with open('Drive1.txt', 'r') as file:
    for line in file:
        drive1.append(line.split(";"))

with open('Drive2.txt', 'r') as file:
    for line in file:
        drive2.append(line.split(";"))
I'm not sure how to go about this; maybe I should use dictionaries?
As I understand it, drive1 and drive2 are both lists of lists of length 3. The simplest approach would be the following:
# filter() returns an iterator over the files that are not present in the
# other drive (wrap it in list() if an actual list is needed)
files_only_in_drive1 = filter(lambda x: x not in drive2, drive1)
files_only_in_drive2 = filter(lambda x: x not in drive1, drive2)
This isn't the fastest solution (since search in an unordered list takes linear time). A more performant solution would take advantage of hashing and the set difference operator:
# Use tuple() for hashability.
drive1_file_set = set([tuple(file) for file in drive1])
drive2_file_set = set([tuple(file) for file in drive2])

# Now remove files that are in the other drive using the set difference operator.
# In case it is necessary, the extra syntax below converts the 3-tuples back to
# lists and casts each set back into a list.
files_only_in_drive_1 = [list(file) for file in drive1_file_set.difference(drive2_file_set)]
files_only_in_drive_2 = [list(file) for file in drive2_file_set.difference(drive1_file_set)]
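To end up with the two text files described in the question, the filtered entries can then be written back out; a minimal sketch, with the output filenames being assumptions:

# Write each "only on this drive" list to its own text file,
# rebuilding the semicolon-separated layout of the input lines.
with open('only_in_drive1.txt', 'w') as out1:
    for entry in files_only_in_drive_1:
        out1.write(";".join(entry).rstrip("\n") + "\n")

with open('only_in_drive2.txt', 'w') as out2:
    for entry in files_only_in_drive_2:
        out2.write(";".join(entry).rstrip("\n") + "\n")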

remove the unwanted columns in data

I have 500 txt files in a folder
data example:
1-1,0-7,10.1023,
1-2,7-8,/,
1-3,8-9,A,
1-4,9-10,:,
1-5,10-23,1020940830716,
I would like to delete the last "," in each line, so it becomes:
1-1,0-7,10.1023
1-2,7-8,/
1-3,8-9,A
1-4,9-10,:
1-5,10-23,1020940830716
How do I do that with a for loop so the trailing commas are deleted from all 500 files?
Try using this code:
import glob

# Build the list of files to process; adjust the folder path and pattern as needed.
filenames = glob.glob('path/to/folder/*.txt')

for fname in filenames:
    with open(fname, 'r') as f:
        string = f.read().replace(',\n', '\n')
    with open(fname, 'w') as w:
        w.write(string)
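Note that replace(',\n', '\n') only removes a comma that sits immediately before a newline, so a final line without a trailing newline keeps its comma. A line-by-line variant (same assumed file list) handles that case as well:

import glob

for fname in glob.glob('path/to/folder/*.txt'):
    with open(fname, 'r') as f:
        # Strip the newline, then any trailing comma, from every line.
        lines = [line.rstrip('\n').rstrip(',') for line in f]
    with open(fname, 'w') as w:
        w.write('\n'.join(lines) + '\n')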
I usually do something like this.
Change the folder_path variable
Change the filename_pattern variable. This is just extra in case you have specific file patterns in your folder that you want to consider; you can simply set this variable to an empty string if it is irrelevant.
Also, the * takes anything that matches the pattern, i.e. Book1, Book2, etc. Before running the code, print(files) to make sure you have all of the correct files.
import glob
import os
import pandas as pd

# read files in
folder_path = 'Documents'
filename_pattern = 'Book'

files = glob.glob(f'{folder_path}/{filename_pattern}*.txt')

# Concatenate every file into one DataFrame, tagging each row with its source file.
df = (pd.concat([pd.read_csv(f, header=None)
                   .assign(filename=os.path.basename(f))
                 for f in files]))

# write files back out
for file, data in df.groupby('filename'):
    # iloc[:, :-2] drops the trailing empty column (from the trailing comma)
    # and the added filename column before writing each file back out.
    data.iloc[:, :-2].to_csv(f'{folder_path}/{file}',
                             index=False,
                             header=False)

Combining columns of multiple files into one - Python

I am trying to write a simple script that would import a specific column from multiple data files (.csv-like files but with no extension) and export it all to one file with the filenames in each column header. I tried this solution (also the code below by shaktimaan), which seems to do almost exactly the same thing; however, I ran into some difficulties. Firstly, I am still getting an "expected str, bytes or os.PathLike object, not list" error and I am not really sure what I am doing wrong. I am also not sure whether the file_names variable should contain file names or file paths, and whether I should use a different function to import the files, because my files don't have a .csv extension in the name.
Thank you for your help,
Šimon
import csv

# List of your files
file_names = ['file1', 'file2']

# Output list of generator objects
o_data = []

# Open files in the succession and
# store the file_name as the first
# element followed by the elements of
# the third column.
for afile in file_names:
    file_h = open(afile)
    a_list = []
    a_list.append(afile)
    csv_reader = csv.reader(file_h, delimiter=' ')
    for row in csv_reader:
        a_list.append(row[2])
    # Convert the list to a generator object
    o_data.append((n for n in a_list))
    file_h.close()

# Use zip and csv writer to iterate
# through the generator objects and
# write out to the output file
with open('output', 'w') as op_file:
    csv_writer = csv.writer(op_file, delimiter=' ')
    for row in list(zip(*o_data)):
        csv_writer.writerow(row)
op_file.close()
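The "expected str, bytes or os.PathLike object, not list" error usually means a whole list was passed to open() (for example open(file_names) instead of open(afile)), so each call must receive a single name; extension-less names are fine as long as they match the actual files. A minimal sketch of the intended result, with the file names, the space delimiter, and the third-column index all taken as assumptions from the code above:

import csv

file_names = ['file1', 'file2']  # one string per file; paths work too
columns = []

for afile in file_names:
    with open(afile, newline='') as file_h:
        reader = csv.reader(file_h, delimiter=' ')
        # First entry is the filename header, then the third column of every row.
        columns.append([afile] + [row[2] for row in reader])

with open('output', 'w', newline='') as op_file:
    writer = csv.writer(op_file, delimiter=' ')
    # zip(*columns) transposes the per-file columns into output rows.
    for row in zip(*columns):
        writer.writerow(row)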

Multiple editing of CSV files

I have hit a small snag operating on CSV files in Python (3.5). Previously I was working with single files and there was no problem, but right now I have >100 files in one folder.
So, my goal is:
To parse all *.csv files in the directory
From each file delete the first 6 rows; the files consist of the following data:
"nu(Ep), 2.6.8"
"Date: 2/10/16, 11:18:21 AM"
19
Ep,nu
0.0952645,0.123776,
0.119036,0.157720,
...
0.992060,0.374300,
Save each file separately (for example adding "_edited" to the name), so that only the numbers are saved.
As an option: I have the data subdivided into two parts for one material, for example Ag(0-1_s).csv and Ag(1-4)_s.csv (after steps 1-3 they should look like Ag(*)_edited.csv). How can I merge these two files by appending the data from (1-4) to the end of (0-1) and saving it in a third file?
My code so far is the following:
import os, sys
import csv
import re
import glob
import fileinput

def get_all_files(directory, extension='.csv'):
    dir_list = os.listdir(directory)
    csv_files = []
    for i in dir_list:
        if i.endswith(extension):
            csv_files.append(os.path.realpath(i))
    return csv_files

csv_files = get_all_files('/Directory/Path/Here')

#Here is the problem with csv's, I dont know how to scan files
#which are in the list "csv_files".
for n in csv_files:
    #print(n)
    lines = []  #empty, because I dont know how to write it properly per
                #each file
    input = open(n, 'r')
    reader = csv.reader(n)
    temp = []
    for i in range(5):
        next(reader)
    #a for loop for here regarding rows?
    #for row in n: ???
    # ???
    input.close()
    #newfilename = "".join(n.split(".csv")) + "edited.csv"
    #newfilename can be used within open() below:
    with open(n + '_edited.csv', 'w') as nf:
        writer = csv.writer(nf)
        writer.writerows(lines)
This is the fastest way I can think of. If you have a solid-state drive, you could throw multiprocessing at this for more of a performance boost:
import glob
import os

for fpath in glob.glob('path/to/directory/*.csv'):
    # File name without its extension, used to build the "_edited" output name.
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    with open(fpath) as infile, open(os.path.join('path/to/dir', fname + "_edited" + os.path.extsep + 'csv'), 'w') as outfile:
        # Skip the first 6 header lines, then copy the rest unchanged.
        for _ in range(6): infile.readline()
        for line in infile: outfile.write(line)
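For the optional merging step, the second file's contents can simply be appended to the first after both have been cleaned; a minimal sketch, with the edited filenames and the combined output name taken as assumptions:

# Append the cleaned (1-4) data to the end of the cleaned (0-1) data in a third file.
part_files = ['Ag(0-1_s)_edited.csv', 'Ag(1-4)_s_edited.csv']  # assumed names of the cleaned files

with open('Ag(0-4)_merged.csv', 'w') as merged:  # assumed name for the combined file
    for part in part_files:
        with open(part) as infile:
            for line in infile:
                merged.write(line)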

Saving Filenames with Condition

I'm trying to save the names of files that fulfill a certain condition.
I think the easiest way to do this would be to make a short Python program that imports and reads the files, checks whether the condition is met, and (assuming it is) saves the names of the files.
I have data files with just two columns and four rows, something like this:
a: 5
b: 5
c: 6
de: 7
I want to save the names of the data files (or part of the names, if that's a simple fix; otherwise I can just sed the file afterwards) whose 4th number ([3:1]) is greater than 8. I tried importing the files with numpy, but it said it couldn't import the letters in the first column.
Another way I was considering was doing it from the command line, something along the lines of cat *.dat >> something.txt, but I couldn't figure out how to do that.
The code I've tried to write up to get this to work is:
import fileinput
import glob
import numpy as np

#Filter to find value > 8

#Globbing value datafiles
file_list = glob.glob("/path/to/*.dat")

#Creating output file containing
f = open('list.txt', 'w')

#Looping over files
for file in file_list:
    #For each file in the directory, isolating the filename
    filename = file.split('/')[-1]
    #Opening the files, checking if value is greater than 8
    a = np.loadtxt("file", delimiter=' ', usecols=1)
    if a[3:0] > 8:
        print >> f, filename
f.close()
When I do this, I get an error that says TypeError: 'int' object is not iterable, but I don't know what that's referring to.
I ended up using
import fileinput
import glob
import numpy as np

#Filter to find value > 8

#Globbing datafiles
file_list = glob.glob("/path/to/*.dat")

#Creating output file containing
f = open('list.txt', 'w')

#Looping over files
for file in file_list:
    #For each file in the directory, isolating the filename
    filename = file.split('/')[-1]
    #Opening the files, checking if value is greater than 8
    a = np.genfromtxt(file)
    if a[3,1] > 8:
        f.write(filename + "\n")
f.close()
It is hard to tell exactly what you want, but maybe something like this:
from glob import glob
from re import findall

fpattern = "/path/to/*.dat"

def test(fname):
    with open(fname) as f:
        try:
            # The 4th integer in the file (index 3) must be greater than 8.
            return int(findall(r"\d+", f.read())[3]) > 8
        except IndexError:
            pass

matches = [fname for fname in glob(fpattern) if test(fname)]
print(matches)
