Batch editing of csv files with Python

I need to edit several csv files. Actually, most of the files are fine as they are, it's just the last (41st) column that needs to be changed. For every occurrence of a particular string in that column, I need it to be replaced by a different string; specifically, every occurrence of 'S-D' needs to be replaced by 'S'. I've tried to accomplish this using Python, but I think I need to write the csv files and I'm not quite sure how to do this:
import os
import csv

path = os.getcwd()
filenames = os.listdir(path)
for filename in filenames:
    if filename.endswith('.csv'):
        r = csv.reader(open(filename))
        for row in r:
            if row[40] == "S-D":
                row[40] = "S"
Any help? Also, if anyone has a quick, elegant way of doing this with a shell script, that would probably be very helpful to me as well.

Try something along these lines, using the glob module (as mentioned by @SaulloCastro) and the csv module.
import glob
import csv

for item in glob.glob("*.csv"):  # the pattern needs the "*" wildcard; ".csv" alone matches nothing
    r = list(csv.reader(open(item, "r")))
    for row in r:
        row[-1] = row[-1].replace("S-D", "S")
    w = csv.writer(open(item, "w"))
    w.writerows(r)
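Note that row[-1].replace("S-D", "S") will also rewrite cells that merely contain 'S-D' as a substring. Since the question checks for an exact match (row[40] == "S-D"), here is a minimal sketch of a stricter in-place rewrite, assuming Python 3 and comma-delimited files:
import csv
import glob

for item in glob.glob("*.csv"):
    # read everything first, because the same file is rewritten afterwards
    with open(item, newline='') as f:
        rows = list(csv.reader(f))
    for row in rows:
        # replace only exact matches in the last (41st) column
        if row and row[-1] == "S-D":
            row[-1] = "S"
    with open(item, "w", newline='') as f:
        csv.writer(f).writerows(rows)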

Be sure to read up on the Python documentation for CSV File Reading and Writing; there is a lot to learn there. Here is a basic example based on your question: it only modifies the data in the last column and writes out a modified file with "_edited" in the name.
import os
import csv

path = os.getcwd()
filenames = os.listdir(path)
for filename in filenames:
    if filename.endswith('.csv'):
        r = csv.reader(open(filename))
        new_data = []
        for row in r:
            row[-1] = row[-1].replace("S-D", "S")
            new_data.append(row)
        # turn e.g. "data.csv" into "data_edited.csv"
        newfilename = "".join(filename.split(".csv")) + "_edited.csv"
        with open(newfilename, "w") as f:
            writer = csv.writer(f)
            writer.writerows(new_data)

Related

Python script to convert multiple txt files into a single one

I'm quite new to Python and have encountered a problem: I want to write a script that starts in a base directory containing several folders, which all have the same structure in their subdirectories and are numbered with a control variable (scan00, scan01, ...).
I read out the names of the folders in the directory and store them in a variable called foldernames.
Then, the script should go from each of these folders into a subdirectory where multiple txt files are stored; I store them in the variable called "myFiles".
These txt files consist of 3 columns with float values which are separated by tabs, and each of the txt files has 3371 rows (they are all the same in terms of rows and columns).
Now my issue: I want the script to copy only the third column of all txt files and put it into a new txt or csv file. The only exception is the first txt file: there it is important that all three columns are copied to the new file.
In the other files, every third column of the txt files should be copied in an adjacent column in the new txt/csv file.
So I would like to end up with x columns in the generated txt/csv file, where x is the number of original txt files. If possible, I would like to write the corresponding file names in the first line of the new txt/csv file (here defined as column_names).
In the end, each folder should contain a txt/csv file that combines all of the single (297) txt files.
import os
import glob

foldernames1 = []
for foldernames in os.listdir("W:/certaindirectory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certaindirectory/"+foldernames1[i]+"/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X','Y']+myFiles[1:len(myFiles)]

    files = [open(f) for f in glob.glob('*.txt')]
    fout = open("ResultsCombined.txt", 'w')
    for row in range(1, 3371):  # len(files)):
        for f in files:
            fout.write(f.readline().strip().split('\t')[2])
            fout.write('\t')
        fout.write('\t')
    fout.close()
As an alternative, I also tried to do it via a csv file, but I wasn't able to solve my problem:
import os
import glob
import csv

foldernames1 = []
for foldernames in os.listdir("W:/certain directory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certain directory/"+foldernames1[i]+"/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X','Y']+myFiles[0:len(myFiles)]
    # print(column_names)

    with open(""+foldernames1[i]+".csv", 'w', newline='') as target:
        writer = csv.DictWriter(target, fieldnames=column_names)
        writer.writeheader()  # if you want a header
        for path in glob.glob('*.txt'):
            with open(path, newline='') as source:
                reader = csv.DictReader(source, delimiter='\t', fieldnames=column_names)
                writer.writerows(reader)
Can anyone help me? Neither version delivers what I want. They read out something, but not the values I am interested in. I also have the feeling my code has some issues with float numbers.
Many thanks and best regards,
quester
pathlib and pandas should make the solution here relatively simple even without knowing the specific file names:
import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")
# recursively search for .txt files inside all sub directories
txt_files = list(p.rglob("*.txt"))  # use p.glob("*.txt") instead for a non-recursive search

df = pd.DataFrame()
for path in txt_files:
    # use tab separator, read only the 3rd column, name the column after the file, read as floats
    current = pd.read_csv(path,
                          sep="\t",
                          usecols=[2],
                          names=[path.name],
                          dtype="float64")
    # add header=0 to pd.read_csv if there's a header row in the .txt files
    df = pd.concat([df, current], axis=1)  # assign the result back; concat does not work in place

df.to_csv("W:/certain directory/floats_third_column.csv", index=False)
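One detail from the question that the snippet above skips is that the first txt file should contribute all three columns, not just the third. Here is a hedged variation on the same approach; the 'X' and 'Y' column labels come from the question, everything else is an assumption about the file layout:
import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")
txt_files = sorted(p.rglob("*.txt"))  # sort so that "the first file" is well defined

parts = []
for i, path in enumerate(txt_files):
    if i == 0:
        # keep all three columns of the first file, labelling the first two X and Y
        part = pd.read_csv(path, sep="\t", usecols=[0, 1, 2],
                           names=["X", "Y", path.name], dtype="float64")
    else:
        # every other file contributes only its third column
        part = pd.read_csv(path, sep="\t", usecols=[2],
                           names=[path.name], dtype="float64")
    parts.append(part)

pd.concat(parts, axis=1).to_csv(p / "floats_third_column.csv", index=False)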
Hope this helps!

Sort multiple csv files within a directory by date - Python

I have multiple .csv files in the same directory, called Original, and I want to sort all of these files by date, ascending (the oldest first), based on the second column. Every file should be sorted and overwrite the original file. Is that possible? If not, the sorted files should be saved in another directory called Sorted. Can somebody help me?
Original csv
ORA.PA,13/04/2021,10.35,10.35,10.14,10.21,4299528
ORA.PA,27/02/2019,13.36,13.48,13.29,13.3,6929606
ORA.PA,26/02/2019,13.46,13.52,13.35,13.4,6031759
ORA.PA,05/11/2018,13.94,14.21,13.9,14.16,7692439
ORA.PA,02/11/2018,14.1,14.1,13.9,13.96,6867565
ORA.PA,15/04/2011,7.84,7.89,7.7,7.75,8277622
ORA.PA,14/08/2001,20.22,20.22,19.74,19.85,9221300
The desired Output
ORA.PA,14/08/2001,20.22,20.22,19.74,19.85,9221300
ORA.PA,15/04/2011,7.84,7.89,7.7,7.75,8277622
ORA.PA,02/11/2018,14.1,14.1,13.9,13.96,6867565
ORA.PA,05/11/2018,13.94,14.21,13.9,14.16,7692439
ORA.PA,26/02/2019,13.46,13.52,13.35,13.4,6031759
ORA.PA,27/02/2019,13.36,13.48,13.29,13.3,6929606
ORA.PA,13/04/2021,10.35,10.35,10.14,10.21,4299528
This is the code that I used, but it didn't work; I am getting nothing:
import csv
import operator
import glob

data = dict()
path = "/Original/*.csv"
files = glob.glob(path)
for filename in files:
    with open(filename, 'r') as f:
        lists = [row for row in csv.reader(f, delimiter=',')]
        data[filename] = sorted(lists, operator.itemgetter(1), reverse=True)
Thanks!
There are multiple issues with your code. Mainly, you are only reading from the file, not writing to it; you would need to open the file with the 'r+' option in order to write to the file as well as read it. (You are also passing operator.itemgetter(1) positionally, while sorted() only accepts the key function as the key= keyword argument.) Additionally, the sorting won't work that way, as the dates have the format dd/mm/yyyy; for sorting they should be reversed to yyyy-mm-dd. I also changed the inline data reading from your code to an actual for loop, as the inline solution didn't work, though I don't know why that is...
All in all the following code should solve your problems, please let me know if it works...
import csv
import glob

path = "./Original/*.csv"
files = glob.glob(path)
for filename in files:
    with open(filename, "r+", newline='') as f:
        reader = csv.reader(f, delimiter=',')
        lists = []
        for row in reader:
            lists.append(row)
        # reverse the date parts (dd/mm/yyyy -> yyyy-mm-dd) so that they sort correctly
        lists.sort(key=lambda a: '-'.join(a[1].split('/')[::-1]))
        # delete the content of the file
        f.truncate(0)
        f.seek(0)
        # write the sorted data back to the file
        writer = csv.writer(f)
        writer.writerows(lists)
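The string-reversal key works because the dates are zero-padded dd/mm/yyyy. As a more explicit alternative, here is a sketch that parses the second column as a real date with datetime, under the same dd/mm/yyyy assumption:
import csv
import glob
from datetime import datetime

for filename in glob.glob("./Original/*.csv"):
    with open(filename, newline='') as f:
        rows = list(csv.reader(f))
    # sort on the parsed date instead of a rearranged string
    rows.sort(key=lambda row: datetime.strptime(row[1], "%d/%m/%Y"))
    with open(filename, "w", newline='') as f:
        csv.writer(f).writerows(rows)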

How to open and iterate through a list of CSV files - Python

I have a list of pathnames to CSV files. I need to open each CSV file, take the data without the header and then merge it all together into a new CSV file.
I have this code which gets me the list of CSV file pathnames:
file_list = []
for folder_name, sub_folders, file_names in os.walk(wd):
    for file_name in file_names:
        file_extention = folder_name + '\\' + file_name
        if file_name.endswith('csv'):
            file_list.append(file_extention)
An example of my list is:
['C:\\Users\\Documents\\GPS_data\\West_coast\\Westland\\GPS_data1.csv',
'C:\\Users\\Documents\\GPS_data\\West_coast\\Westland\\GPS_data2.csv',
'C:\\Users\\Documents\\GPS_data\\West_coast\\Westland\\GPS_data3.csv']
I am struggling to figure out what to do; any help would be greatly appreciated. Thanks.
The main idea is to read in each line of a file and write it to the new file, remembering to skip the first line, which has the column headers in it. I previously recommended the csv module; however, it doesn't seem like that is necessary, since this task does not require analyzing the data.
file_list = ['data1.csv', 'data2.csv']
with open('new.csv', 'w') as newfile:  # create a new file
    for filename in file_list:
        with open(filename) as csvfile:
            next(csvfile)  # skip the header row
            for line in csvfile:
                newfile.write(line)  # write to the new csv file
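If the files did need real CSV parsing (quoted fields containing newlines, for example), a sketch with the csv module would look like the following; the file names are just the placeholders from the answer above:
import csv

file_list = ['data1.csv', 'data2.csv']
with open('new.csv', 'w', newline='') as newfile:
    writer = csv.writer(newfile)
    for filename in file_list:
        with open(filename, newline='') as csvfile:
            reader = csv.reader(csvfile)
            next(reader)  # skip the header row
            writer.writerows(reader)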
Edit: clarified my answer.

Python iterate over multiple files

I have a series of files that are in the following format:
file_1991.xlsx
file_1992.xlsx
# there are some gaps in the file numbering sequence
file_1995.xlsx
file_1996.xlsx
file_1997.xlsx
For each file I want to do something like:
import pandas as pd
data_1995 = pd.read_excel(open(directory + 'file_1995', 'rb'), sheetname = 'Sheet1')
do some work on the data, and save it as another file:
output_1995 = pd.ExcelWriter('output_1995.xlsx')
data_1995.to_excel(output_1995,'Sheet1')
Instead of doing all of this for every single file, how can I iterate through the files and repeat the same operation on each one? In other words, I would like to iterate over all the files (they mostly follow a numerical sequence in their names, but there are some gaps in the sequence).
Thanks for the help in advance.
You can use os.listdir or the glob module to list all files in a directory.
With os.listdir, you can use fnmatch to filter files like this (a regex works too):
import fnmatch
import os

for file in os.listdir('my_directory'):
    if fnmatch.fnmatch(file, '*.xlsx'):
        pd.read_excel(open(file, 'rb'), sheetname='Sheet1')
        """ Do your thing to file """
Or with the glob module (which is a shortcut for fnmatch + listdir) you can do the same like this (again, a regex also works):
import glob

for file in glob.glob("/my_directory/*.xlsx"):
    pd.read_excel(open(file, 'rb'), sheetname='Sheet1')
    """ Do your thing to file """
You should use Python's glob module: https://docs.python.org/3/library/glob.html
For example:
import glob

for path in glob.iglob(directory + "file_*.xlsx"):
    pd.read_excel(path)
    # ...
I would recommend glob.
Doing glob.glob('file_*') returns a list which you can iterate on and do work.
Doing glob.iglob('file_*') returns a generator object which is an iterator.
The first one will give you something like:
['file_1991.xlsx','file_1992.xlsx','file_1995.xlsx','file_1996.xlsx']
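Tying this back to the read/work/save pattern from the question, a sketch of the full loop could look like this; the output naming simply mirrors the question's output_1995.xlsx convention, and sheet_name is the current spelling of the keyword (older pandas versions used sheetname):
import glob
import pandas as pd

for path in glob.iglob('file_*.xlsx'):
    data = pd.read_excel(path, sheet_name='Sheet1')
    # ... do some work on data ...
    # derive e.g. output_1991.xlsx from file_1991.xlsx
    out_path = path.replace('file_', 'output_')
    data.to_excel(out_path, sheet_name='Sheet1')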
If you know how your file names are constructed, you might try to open each file with the 'r' mode, so that open(..., 'r') fails if the file does not exist.
yearly_data = {}
for year in range(1990, 2018):
    try:
        f = open('file_%4.4d.xlsx' % year, 'r')
    except FileNotFoundError:
        continue  # to the next year
    yearly_data[year] = ...
    f.close()

Python script not combining csv files

I am trying to combine over 100,000 CSV files (all in the same format) in a folder using the script below. Each CSV file is on average 3-6KB in size. When I run this script, it opens and combines exactly 47 .csv files. When I re-run it, it combines only those same 47 .csv files, not all of them. I don't understand why it does that.
import os
import glob

os.chdir("D:\Users\Bop\csv")
want_header = True
out_filename = "combined.files.csv"
if os.path.exists(out_filename):
    os.remove(out_filename)
read_files = glob.glob("*.csv")
with open(out_filename, "w") as outfile:
    for filename in read_files:
        with open(filename) as infile:
            if want_header:
                outfile.write('{},Filename\n'.format(next(infile).strip()))
                want_header = False
            else:
                next(infile)
            for line in infile:
                outfile.write('{},{}\n'.format(line.strip(), filename))
Firstly, check the length of read_files:
read_files = glob.glob("*.csv")
print(len(read_files))
Note that glob isn't necessarily recursive as described in this SO question.
Otherwise your code looks fine. You may want to consider using the CSV library but note that you need to adjust the field size limit with really large files.
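If the remaining files sit in subfolders, a recursive pattern (Python 3.5+) will pick them up; a quick check:
import glob

# "**" together with recursive=True descends into subdirectories as well
read_files = glob.glob("**/*.csv", recursive=True)
print(len(read_files))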
Are you sure all your filenames end with .csv? If every file in this directory contains what you need, then open all of them without filtering:
glob.glob('*')
