remove the unwanted columns in data - python

I have 500 txt files in a folder.
Data example:
1-1,0-7,10.1023,
1-2,7-8,/,
1-3,8-9,A,
1-4,9-10,:,
1-5,10-23,1020940830716,
I would like to delete the last "," in each line, to get:
1-1,0-7,10.1023
1-2,7-8,/
1-3,8-9,A
1-4,9-10,:
1-5,10-23,1020940830716
How do I do that with a for loop, deleting them from all 500 files?

Try using this code:
for fname in filenames:
    # read the whole file, dropping the comma that precedes each newline
    with open(fname, 'r') as f:
        string = f.read().replace(',\n', '\n')
    # write the cleaned text back to the same file
    with open(fname, 'w') as w:
        w.write(string)
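The filenames list above is not defined in the snippet; a minimal sketch for building it with glob, assuming the 500 files live in a hypothetical data folder:
import glob

# hypothetical folder holding the .txt files; adjust to your path
filenames = glob.glob('data/*.txt')
print(len(filenames))  # sanity check: should report 500
Note that the replace(',\n', '\n') trick assumes every line, including the last one, ends with a newline.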

I usually do something like this.
Change the folder_path variable.
Change the filename_pattern variable. This is just extra in case there are specific file patterns in your folder that you want to consider. You can simply set this variable to '' (blank) if it's irrelevant.
Also, the * takes anything that matches the pattern, i.e. Book1, Book2, etc. Before running the code, print(files) to make sure you have picked up all of the correct files.
import glob
import os
import pandas as pd

# read files in
folder_path = 'Documents'
filename_pattern = 'Book'
files = glob.glob(f'{folder_path}/{filename_pattern}*.txt')

df = pd.concat([pd.read_csv(f, header=None)
                  .assign(filename=os.path.basename(f))
                for f in files])

# write files out
for file, data in df.groupby('filename'):
    # drop the last two columns: the empty one and the filename helper
    data.iloc[:, :-2].to_csv(f'{folder_path}/{file}',
                             index=False,
                             header=False)
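Why iloc[:, :-2] works here: because every line ends with a comma, pd.read_csv parses an extra, empty final column, and .assign() then appends the filename as the very last column. Dropping the last two columns therefore leaves exactly the original data, and to_csv writes it back without the trailing comma.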

Related

Using a loop to open and process txt files with csv

I have data (mixed text and numbers in txt files) and I'd like to write a for loop that creates a list of lists, such that I can process the data from all the files using fewer lines.
So far I have written this:
import csv
path = (some path...)
files = [path + 'file1.txt', path + 'file2.txt', path + 'file3.txt', ...]
for i in files:
    with open(i, 'r') as j:
        Reader = csv.reader(j)
        List = [List for List in Reader]
I think I overwrite List instead of creating a nested list, since I end up with a Reader of size 1 and a list whose dimensions match only one of the files.
My questions:
Given that the files may contain different numbers of lines, is this the right approach to save some lines of code? (What could be done better?)
I think the problem is in [List for List in Reader]; is there a way to change it so I don't overwrite List? Something like adding to List?
You can use the list append() method to add to an existing list. Since csv.reader instances are iterable objects, you can just pass one of them to the method as shown below:
import csv
from pathlib import Path

path = Path('./')
filenames = ['in_file1.txt', 'in_file2.txt']  # etc ...

List = []
for filename in filenames:
    with open(path / filename, 'r', newline='') as file:
        # each file contributes one sub-list of rows
        List.append(list(csv.reader(file)))

print(List)
Update
An even more succinct way to do it would be to use something called a "list comprehension":
import csv
from pathlib import Path

path = Path('./')
filenames = ['in_file1.txt', 'in_file2.txt']  # etc ...

List = [list(csv.reader(open(path / filename, 'r', newline='')))
        for filename in filenames]

print(List)
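Note that the comprehension version leaves closing the opened files to the garbage collector, whereas the explicit with-statement loop above closes them deterministically, which matters when processing many files.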
Yes, use .append():
import csv

path = (some path...)
files = [path + x for x in ['FILESLIST']]

List = []  # initialize the list before appending to it
for i in files:
    with open(i, 'r') as j:
        Reader = csv.reader(j)
        List.append([L for L in Reader])

Combining csv headers with corresponding file paths into new file

I am not sure how to "crack" the following Python-nut. So I was hoping that some of you more experienced Python'ers could push me in the right direction.
What I got:
Several directories containing many csv files
For instance:
/home/Date/Data1
/home/Date/Data2
/home/Date/Data3/sub1
/home/Date/Data3/sub2
What I want:
A file containing the split path for each file, followed by the variables (= row/headers) of the corresponding file. Like this:
home \t Date \t Data1 \t "variable1" "variable2" "variable3" ...
home \t Date \t Data2 \t "variable1" "variable2" "variable3" ...
home \t Date \t Data3 \t sub1 \t "variable1" "variable2" "variable3" ...
home \t Date \t Data3 \t sub2 \t "variable1" "variable2" "variable3" ...
Where am I right now?: The first step was to figure out how to print out the first row (the variables) of a single csv file (I used a test.txt file for testing)
# print out variables of a single file:
import csv

with open("test.txt") as f:
    reader = csv.reader(f)
    i = next(reader)
print(i)
The second step was to figure out how to print the paths of the csv files in the directories, including subfolders. This is what I ended up with:
import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f=files
for r, d, f in os.walk(thisdir):
    for file in f:
        if ".csv" in file:
            print(os.path.join(r, file))
Prints:
/home/Date/Data1/file1.csv
/home/Date/Data1/file2.csv
/home/Date/Data2/file1.csv
/home/Date/Data2/file2.csv
/home/Date/Data2/file3.csv
/home/Date/Data3/sub1/file1.csv
/home/Date/Data3/sub2/file1.csv
/home/Date/Data3/sub2/file2.csv
Where am I stuck? I am struggling to figure out how to proceed from here; any ideas, approaches etc. pointing in the right direction are greatly appreciated!
Cheers, B
##### UPDATE #####
Inspired by Tim Pietzcker's useful comments I have gotten a long way (Thanks Tim!).
But I could not get the output.write & join part to work, so the code is slightly different. The new issue is to "merge" the two lists as two separate columns with a comma as delimiter (I want to create a csv file). Since I am stuck yet again, I wanted to see if the experienced python'ers in here have any good suggestions.
#!/usr/bin/python
import os
import csv

thisdir = os.getcwd()

# Extract file-paths and append them to "csvfiles"
csvfiles = []
for r, d, f in os.walk(thisdir):  # r=root, d=directories, f=files
    for file in f:
        if ".csv" in file:
            csvfiles.append(os.path.join(r, file))

# get each file-path on a new line + convert to list of str
filepath = "\n".join(["".join(sub) for sub in csvfiles])
filepath = filepath.replace(".csv", "")  # remove .csv
filepath = filepath.replace("/", ",")    # replace / with ,
Results in:
,home,Date,Data1,file1
,home,Date,Data1,file2
,home,Date,Data1,file3
... and so on
Then on to the headers:
# Create header-extraction function:
def get_csv_headers(filename):
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        return next(reader)

# Create empty list for headers
headers = []

# Extract headers with the function and append them to the "headers" list
for l in csvfiles:
    headers.append(get_csv_headers(l))

# Create file with headers
headers = "\n".join(["".join(sublst) for sublst in headers])  # new lines + str conversion
headers = headers.replace(";", ",")  # replace ; with ,
Results in:
variable1,variable2,variable3
variable1,variable2,variable3,variable4,variable5,variable6
variable1,variable2,variable3,variable4
and so on..
What I want now: a csv like this:
home,Date,Data1,file1,variable1,variable2,variable3
home,Date,Data1,file2,variable1,variable2,variable3,variable4,variable5,variable6
home,Date,Data1,file3, variable1,variable2,variable3,variable4
For instance:
with open('text.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(zip(filepath, headers))
resulted in:
",",v
h,a
o,r
m,i,
e,a
and so on..
Any ideas and pushes in the right direction are very welcome!
About your edit: I would recommend against transforming everything into strings that early in the process. It makes much more sense to keep the data in a structured format and let the modules designed to handle structured data do the rest. So your program might look something like this:
#!/usr/bin/python
import os
import csv

thisdir = os.getcwd()

# Extract file-paths and append them to "csvfiles"
csvfiles = []
for r, d, f in os.walk(thisdir):  # r=root, d=directories, f=files
    for file in f:
        if ".csv" in file:
            csvfiles.append(os.path.join(r, file))
This (taken directly from your question) leaves you with a list of CSV filenames.
Now let's read those files. From the script in your question it seems that your CSV files are actually semicolon-separated, not comma-separated. This is common in Europe (because the comma is needed as a decimal point), but Python needs to be told that:
# Create header-extraction function:
def get_csv_headers(filename):
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter=";")  # semicolon-separated files!
        return next(reader)

# Create empty list for headers
headers = []

# Extract headers with the function and append them to the "headers" list
for l in csvfiles:
    headers.append(get_csv_headers(l))
Now headers is a list containing many sub-lists (which contain all the headers as separate items, just as we need them).
Let's not try to put everything on a single line; better keep it readable:
with open('text.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=',')  # maybe use semicolon again??
    for path, header in zip(csvfiles, headers):
        writer.writerow(list(path.split("\\")) + header)
If all your paths start with \, you could also use
writer.writerow(list(path.split("\\")[1:]) + header)
to avoid the empty field at the start of each line.
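A side note beyond the original answer: split("\\") assumes Windows-style backslash separators, while the paths shown in the question use forward slashes; splitting on os.sep instead, e.g. path.split(os.sep), works on either platform.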
This looks promising; you've already done most of the work.
What I would do is
Collect all your CSV filenames in a list. So instead of printing the filenames, create an empty list (csvfiles=[]) before the os.walk() loop and do something like csvfiles.append(os.path.join(r, file)).
Then, iterate over those filenames, passing each to the routine that's currently used to read test.txt. If you place that in a function, it could look like this:
def get_csv_headers(filename):
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        return next(reader)
Now, you can write the split filename to a new file and add the headers. I'm questioning your file format a bit - why separate part of the line by tabs and the rest by spaces (and quotes)? If you insist on doing it like this, you could use something like
output.write("\t".join(filename.split("\\")))
output.write("\t")
output.write(" ".join('"{}"'.format(header) for header in get_csv_headers(filename)))
but you might want to rethink this approach. A standard format like JSON might be more readable and portable.
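To sketch the JSON alternative mentioned above (reusing the csvfiles list and get_csv_headers() function from earlier; the output filename is an assumption):
import json

# map each file path to its list of headers
mapping = {path: get_csv_headers(path) for path in csvfiles}

with open('headers.json', 'w') as out:
    json.dump(mapping, out, indent=2)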

Multiple editing of CSV files

I have hit a small snag operating on CSV files in Python (3.5). Previously I was working with single files and there was no problem, but right now I have >100 files in one folder.
So, my goal is:
To parse all *.csv files in the directory.
From each file, delete the first 6 rows; the files consist of the following data:
"nu(Ep), 2.6.8"
"Date: 2/10/16, 11:18:21 AM"
19
Ep,nu
0.0952645,0.123776,
0.119036,0.157720,
...
0.992060,0.374300,
Save each file separately (for example adding "_edited"), so only the numbers are kept.
As an option - I have data subdivided into two parts for one material. For example: Ag(0-1)_s.csv and Ag(1-4)_s.csv (after steps 1-3 they should be like Ag(*)_edited.csv). How can I merge these two files, appending the data from (1-4) to the end of (0-1) and saving the result in a third file?
My code so far is the following:
import os, sys
import csv
import re
import glob
import fileinput

def get_all_files(directory, extension='.csv'):
    dir_list = os.listdir(directory)
    csv_files = []
    for i in dir_list:
        if i.endswith(extension):
            csv_files.append(os.path.realpath(i))
    return csv_files

csv_files = get_all_files('/Directory/Path/Here')

# Here is the problem with csv's, I don't know how to scan files
# which are in the list "csv_files".
for n in csv_files:
    #print(n)
    lines = []  # empty, because I don't know how to write it properly per each file
    input = open(n, 'r')
    reader = csv.reader(n)
    temp = []
    for i in range(5):
        next(reader)
    # a for loop for here regarding rows?
    # for row in n: ???
    #     ???
    input.close()
    #newfilename = "".join(n.split(".csv")) + "edited.csv"
    #newfilename can be used within open() below:
    with open(n + '_edited.csv', 'w') as nf:
        writer = csv.writer(nf)
        writer.writerows(lines)
This is the fastest way I can think of. If you have a solid-state drive, you could throw multiprocessing at this for more of a performance boost:
import glob
import os

for fpath in glob.glob('path/to/directory/*.csv'):
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    outpath = os.path.join('path/to/dir', fname + "_edited" + os.path.extsep + 'csv')
    with open(fpath) as infile, open(outpath, 'w') as outfile:
        # skip the first six lines
        for _ in range(6):
            infile.readline()
        # stream the rest straight to the new file
        for line in infile:
            outfile.write(line)
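A minimal sketch of the multiprocessing idea mentioned above (the process_one helper and the paths are illustrative assumptions, not part of the original answer):
import glob
import multiprocessing
import os

def process_one(fpath):
    # same per-file logic as above: drop six lines, save an "_edited" copy
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    outpath = os.path.join('path/to/dir', fname + "_edited" + os.path.extsep + 'csv')
    with open(fpath) as infile, open(outpath, 'w') as outfile:
        for _ in range(6):
            infile.readline()
        for line in infile:
            outfile.write(line)

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        pool.map(process_one, glob.glob('path/to/directory/*.csv'))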

Saving Filenames with Condition

I'm trying to save the names of files that fulfill a certain condition.
I think the easiest way to do this would be to make a short Python program that imports and reads the files, checks if the condition is met, and (assuming it is met) then saves the names of the files.
I have data files with just two columns and four rows, something like this:
a: 5
b: 5
c: 6
de: 7
I want to save the names of the files (or part of the name of the files, if that's a simple fix; otherwise I can just sed the file afterwards) of the data files that have the 4th number ([3:1]) greater than 8. I tried importing the files with numpy, but it said it couldn't import the letters in the first column.
Another way I was considering was doing it from the command line, something along the lines of cat *.dat >> something.txt, but I couldn't figure out how to do that.
The code I've tried writing to get this to work is:
import fileinput
import glob
import numpy as np

#Filter to find value > 8
#Globbing value datafiles
file_list = glob.glob("/path/to/*.dat")

#Creating output file containing
f = open('list.txt', 'w')

#Looping over files
for file in file_list:
    #For each file in the directory, isolating the filename
    filename = file.split('/')[-1]
    #Opening the files, checking if value is greater than 8
    a = np.loadtxt("file", delimiter=' ', usecols=1)
    if a[3:0] > 8:
        print >> f, filename
f.close()
When I do this, I get an error that says TypeError: 'int' object is not iterable, but I don't know what that's referring to.
I ended up using
import fileinput
import glob
import numpy as np

#Filter to find value > 8
#Globbing datafiles
file_list = glob.glob("/path/to/*.dat")

#Creating output file containing
f = open('list.txt', 'w')

#Looping over files
for file in file_list:
    #For each file in the directory, isolating the filename
    filename = file.split('/')[-1]
    #Opening the files, checking if value is greater than 8
    a = np.genfromtxt(file)
    if a[3,1] > 8:
        f.write(filename + "\n")
f.close()
It is hard to tell exactly what you want, but maybe something like this:
from glob import glob
from re import findall

fpattern = "/path/to/*.dat"

def test(fname):
    with open(fname) as f:
        try:
            # the fourth number in the file must be greater than 8
            return int(findall(r"\d+", f.read())[3]) > 8
        except IndexError:
            pass

matches = [fname for fname in glob(fpattern) if test(fname)]
print(matches)

How to read in multiple files separately from multiple directories in python

I have x directories named Star_{v}, with v = 0 to x.
I have 2 csv files in each directory: one with the word "epoch" in it, one without.
If one of the csv files has the word "epoch" in it, it needs to be sent through one set of code, else through another.
I think dictionaries are probably the way to go, but this section of the code is a bit of a mess:
directory_dict = {}
for var in range(0, len(subdirectory)):
    # var refers to the number by which the subdirectories are labelled: Star_0, Star_1 etc.
    directory_dict['Star_{v}'.format(v=var)] = directory\\Star_{var}
    # directory_dict['Star_0'], directory_dict['Star_1'] etc.
    read_csv(f) for f in os.listdir('directory_dict[Star_{var}') if f.endswith(".csv")
    # reads in all the files in the directories (Star_{v}) ending in csv.
    if 'epoch' in open(read_csv[0]).read():
        # if the word epoch is in the csv file then it is
        directory_dict[Star_{var}][read] = csv.reader(read_csv[0])
        directory_dict[Star_{var}][read1] = csv.reader(read_csv[1])
    else:
        directory_dict[Star_{var}][read] = csv.reader(read_csv[1])
        directory_dict[Star_{var}][read1] = csv.reader(read_csv[0])
When dealing with csvs you should use the csv module, and for your particular case you can use a DictReader and parse the headers to check for the column you're looking for:
import csv
import os

directory = os.path.abspath(os.path.dirname(__file__))  # change this to your directory
csv_list = [os.path.join(directory, c) for c in os.listdir(directory)
            if os.path.splitext(c)[1] == '.csv']  # splitext returns (root, ext)

def parse_csv_file():
    """ open each CSV and check the headers """
    for c in csv_list:
        with open(c, mode='r') as open_csv:
            reader = csv.DictReader(open_csv)
            if 'epoch' in reader.fieldnames:
                pass  # do whatever you want here
            else:
                pass  # do whatever else
Then you can extract what you need from the DictReader's CSV header and do whatever you want with it.
Also, your Python looks invalid.
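Tying that back to the Star_{v} layout in the question, a minimal sketch (the parent folder name, the loop bound, and the two handler functions are illustrative assumptions):
import csv
import os

base = 'directory'  # hypothetical parent folder holding Star_0, Star_1, ...

def handle_epoch(path):
    pass  # code path for files whose headers contain 'epoch'

def handle_other(path):
    pass  # code path for the other files

for var in range(0, 5):  # however many Star_{v} directories exist
    subdir = os.path.join(base, 'Star_{v}'.format(v=var))
    for fname in os.listdir(subdir):
        if fname.endswith('.csv'):
            fpath = os.path.join(subdir, fname)
            with open(fpath, newline='') as f:
                # DictReader reads the header row lazily via .fieldnames
                if 'epoch' in csv.DictReader(f).fieldnames:
                    handle_epoch(fpath)
                else:
                    handle_other(fpath)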
