I am trying to combine over 100,000 CSV files (all in the same format) in a folder using the script below. Each CSV file is 3-6 KB on average. When I run this script, it opens exactly 47 .csv files and combines them. When I re-run it, it combines the same 47 files again, not all of them. I don't understand why it is doing that.
import os
import glob
# Raw string, so the backslashes in the Windows path are not read as escapes
os.chdir(r"D:\Users\Bop\csv")
want_header = True
out_filename = "combined.files.csv"
if os.path.exists(out_filename):
    os.remove(out_filename)
read_files = glob.glob("*.csv")
with open(out_filename, "w") as outfile:
    for filename in read_files:
        with open(filename) as infile:
            if want_header:
                # Copy the header once and add the extra column name
                outfile.write('{},Filename\n'.format(next(infile).strip()))
                want_header = False
            else:
                # Skip the header of every later file
                next(infile)
            for line in infile:
                outfile.write('{},{}\n'.format(line.strip(), filename))
First, check the length of read_files:
read_files = glob.glob("*.csv")
print(len(read_files))
Note that glob is not recursive by default, as described in this SO question, so files in subfolders are not matched by "*.csv".
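To see whether subfolders explain the missing files, you could compare a plain count with a recursive one. This is only a minimal sketch: the helper name count_csv_files is made up for illustration, and the "**" pattern with recursive=True assumes Python 3.5 or later.

```python
import glob
import os


def count_csv_files(folder):
    """Return (top_level_count, recursive_count) of .csv files under folder.

    Hypothetical helper, for illustration only.
    """
    # Matches only files that sit directly inside `folder`.
    top_level = glob.glob(os.path.join(folder, "*.csv"))
    # With recursive=True, `**` matches zero or more nested directories,
    # so this list also includes the top-level files.
    nested = glob.glob(os.path.join(folder, "**", "*.csv"), recursive=True)
    return len(top_level), len(nested)
```

If the two numbers differ, some of the files live in subfolders and the original pattern never sees them.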
Otherwise your code looks fine. You may want to consider using the csv library, but note that you may need to raise its field size limit (csv.field_size_limit) if the files contain very large fields.
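For what it's worth, here is a rough sketch of the same merge done with the csv module. The function name combine_csv and the chosen limit value are my own assumptions, not part of the question; raising the limit is only needed if some fields really are that large.

```python
import csv
import glob
import os


def combine_csv(folder, out_name="combined.files.csv"):
    """Merge every .csv in folder into out_name, adding a Filename column.

    Returns the number of input files found. Sketch only.
    """
    # Assumption: some fields may exceed the csv module's default limit.
    csv.field_size_limit(2**31 - 1)
    out_path = os.path.join(folder, out_name)
    # Build the input list before opening the output, and skip the output
    # file itself so a re-run does not merge its own earlier result.
    inputs = sorted(
        p for p in glob.glob(os.path.join(folder, "*.csv"))
        if os.path.abspath(p) != os.path.abspath(out_path)
    )
    header_written = False
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in inputs:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # empty file, nothing to copy
                if not header_written:
                    writer.writerow(header + ["Filename"])
                    header_written = True
                for row in reader:
                    writer.writerow(row + [os.path.basename(path)])
    return len(inputs)
```

Printing the return value gives the same check as len(read_files) above, so you can tell straight away whether the pattern found all of the files.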
Are you sure all your filenames end with .csv? If every file in this directory contains what you need, then open all of them without filtering:
glob.glob('*')
Related
My first post on StackOverflow, so please be nice. In other words, a super beginner to Python.
So I want to read multiple files from a folder, divide the text and save the output as a new file. I currently have figured out this part of the code, but it only works on one file at a time. I have tried googling but can't figure out a way to use this code on multiple text files in a folder and save it as "output" + a number, for each file in the folder. Is this something that's doable?
with open("file_path") as fReader:
    corpus = fReader.read()
loc = corpus.find("\n\n")
print(corpus[:loc], file=open("output.txt","a"))
Possibly work with a list, like:
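from pathlib import Path
source_dir = Path("./")  # path to the directory
# Collect only regular files from the source directory
files = [x for x in source_dir.iterdir() if x.is_file()]
for i, file in enumerate(files):
    # One numbered output file per input, keeping the original extension
    outfile = "output_" + str(i) + file.suffix
    with open(file) as fReader, open(outfile, "w") as fOut:
        corpus = fReader.read()
        # find() returns -1 if there is no blank line; the slice then drops only the last character
        loc = corpus.find("\n\n")
        fOut.write(corpus[:loc])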
Welcome to the site. Yes, what you are asking above is completely doable and you are on the right track. You will need to do a little research and practice with the os module, which is highly useful when working with files. The two functions that you will want to research a bit are:
os.path.join()
os.listdir()
I would suggest you put two folders next to your Python file, one called data and the other called output to catch the results. Start by seeing whether you can make the code list all the files in your data directory, then keep building on that loop. Something like this should list all the files:
# folder file lister/test writer
import os
source_folder_name = 'data'  # the folder to be read that is in the SAME directory as this file
output_folder_name = 'output'  # will be used later...
files = os.listdir(source_folder_name)
# get this working first
for f in files:
    print(f)
# make output folder names and just write a 1-liner into each file...
for f in files:
    output_filename = f.split('.')[0]  # the part before the period
    output_filename += '_output.csv'
    output_path = os.path.join(output_folder_name, output_filename)
    with open(output_path, 'w') as writer:
        writer.write('some data')
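Once the listing works, the loop can be joined up with the "text before the first blank line" step from the question. The sketch below is one way to do that; the function names first_paragraph and process_folder are invented for this example, and it assumes the inputs are plain text files.

```python
import os


def first_paragraph(text):
    """Return the text before the first blank line, or all of it if there is none."""
    loc = text.find("\n\n")
    return text if loc == -1 else text[:loc]


def process_folder(source_folder, output_folder):
    """Write the first paragraph of each file in source_folder to output_folder.

    Returns the list of output paths. Hypothetical helper, sketch only.
    """
    os.makedirs(output_folder, exist_ok=True)
    written = []
    for name in sorted(os.listdir(source_folder)):
        source_path = os.path.join(source_folder, name)
        if not os.path.isfile(source_path):
            continue  # skip subfolders
        stem, ext = os.path.splitext(name)
        output_path = os.path.join(output_folder, stem + "_output" + ext)
        with open(source_path) as reader, open(output_path, "w") as writer:
            writer.write(first_paragraph(reader.read()))
        written.append(output_path)
    return written
```

Calling process_folder('data', 'output') would then produce one output file per input file, named after the original rather than numbered.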
While running Databricks code that prepares CSV files and loads them into ADLS2, the output is split into many CSV files when it is loaded into ADLS2.
Is there a way to merge these CSV files in ADLS2 through PySpark?
Thanks
Is there a way to merge these CSV files in ADLS2 thru pyspark.
As far as I know, a Spark DataFrame writes its output as separate part files. In theory, you could read them back with the spark.read.csv method, which accepts either a single path or a list of paths.
>>> df = spark.read.csv('path')
Then use df.toPandas().to_csv() to convert the result to a pandas DataFrame and write it out as a single CSV file. You could take some clues from this case: Azure Data-bricks : How to read part files and save it as one file to blob?.
However, I'm afraid this process may not cope with the memory consumption, because toPandas() collects all the data on the driver. So I'd suggest just using the os package to do the merge directly. I tested the two snippets below for your reference.
1st:
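import os
path = '/dbfs/mnt/test/'
file_suffix = '.csv'
# List the folder first; skip the output left by a previous run
files = os.listdir(path)
filtered_files = [file for file in files if file.endswith(file_suffix) and file != 'final.csv']
print(filtered_files)
with open(path + 'final.csv', 'w') as final_file:
    for file in filtered_files:
        with open(path + file) as f:
            lines = f.readlines()
            # lines[1:] drops the header row of every file
            final_file.writelines(lines[1:])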
2nd:
import os
path = '/dbfs/mnt/test/'
file_suffix = '.csv'
# Collect matching files from the folder and all of its subfolders
filtered_files = [os.path.join(root, name) for root, dirs, files in os.walk(top=path, topdown=False) for name in files if name.endswith(file_suffix)]
print(filtered_files)
with open(path + 'final2.csv', 'w') as final_file:
    for file in filtered_files:
        with open(file) as f:
            lines = f.readlines()
            # lines[1:] drops the header row of every file
            final_file.writelines(lines[1:])
The second one also handles nested folder hierarchies.
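Both snippets drop the header of every file and read each file fully into memory. If you want to keep one header row and copy line by line instead, a variation along these lines might do. It is only a sketch: the name merge_csv_tree is made up, and it assumes every part file starts with the same header row.

```python
import os


def _with_newline(line):
    """Make sure a line ends with a newline so files do not run together."""
    return line if line.endswith("\n") else line + "\n"


def merge_csv_tree(top, out_path, suffix=".csv"):
    """Append every matching file under top to out_path, keeping one header.

    Returns the number of input files found. Hypothetical helper.
    """
    out_abs = os.path.abspath(out_path)
    # Collect the inputs before the output is created, and never include
    # the output file itself (it may sit inside `top`).
    inputs = sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(top)
        for name in files
        if name.endswith(suffix)
        and os.path.abspath(os.path.join(root, name)) != out_abs
    )
    header_written = False
    with open(out_path, "w") as final_file:
        for path in inputs:
            with open(path) as f:
                header = f.readline()
                if not header:
                    continue  # empty file
                if not header_written:
                    final_file.write(_with_newline(header))
                    header_written = True
                # Stream the remaining lines instead of calling readlines().
                for line in f:
                    final_file.write(_with_newline(line))
    return len(inputs)
```

For the paths used above, the call would look like merge_csv_tree('/dbfs/mnt/test/', '/dbfs/mnt/test/final.csv').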
In addition, here is another way: use an ADF copy activity to merge multiple CSV files into one file in ADLS Gen2.
Please refer to this doc and configure the folder path in the ADLS Gen2 source dataset. Then set the copyBehavior property to MergeFiles. (Besides, you could use wildcardFileName with a pattern like *.csv to exclude files in that folder that you don't want to touch.)
Merges all files from the source folder to one file. If the file name
is specified, the merged file name is the specified name. Otherwise,
it's an autogenerated file name.
I am not well versed in Python. Based on my knowledge and some browsing, I wrote the script below. It reads every file in the C:\temp\dats folder and writes the contents to C:\temp\datsOutput\output.txt. For some reason my code runs terribly slowly. Can anyone advise how to improve its performance?
import os
a = open(r"C:\temp\datsOutput\output.txt", "w")
path = r'C:\temp\dats'
for filename in os.listdir(path):
    fullPath = path+"\\"+filename
    with open(fullPath, "r") as ins:
        for line in ins:
            a.write(line)
Two speedups. First, copy the whole file at once. Second, treat the files as binary (add a "b" after the "r" or "w" when opening a file).
Combined, these run about 10x faster.
The final code looks like this:
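import os
path = r'C:\temp\dats'
# Open both sides in binary mode; the with block also closes the output file
with open(r"C:\temp\datsOutput\output.txt", "wb") as a:
    for filename in os.listdir(path):
        fullPath = os.path.join(path, filename)
        with open(fullPath, "rb") as ins:
            # Copy the whole file in a single read/write
            a.write(ins.read())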
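If some of the input files are too large to read in one go, the same idea works in fixed-size chunks via shutil.copyfileobj. A minimal sketch, assuming you are happy to wrap it in a function (concatenate_files and the 1 MB chunk size are my own choices, not something I have benchmarked):

```python
import os
import shutil


def concatenate_files(source_dir, out_path, chunk_size=1024 * 1024):
    """Append every file in source_dir to out_path as raw bytes.

    Returns the number of files copied. Hypothetical helper, sketch only.
    """
    out_abs = os.path.abspath(out_path)
    count = 0
    with open(out_path, "wb") as out_file:
        for name in sorted(os.listdir(source_dir)):
            full_path = os.path.join(source_dir, name)
            # Skip subfolders, and the output file if it lives in source_dir.
            if not os.path.isfile(full_path) or os.path.abspath(full_path) == out_abs:
                continue
            with open(full_path, "rb") as in_file:
                # Copies in chunks, so memory use stays bounded.
                shutil.copyfileobj(in_file, out_file, chunk_size)
            count += 1
    return count
```

For the folders in the question this would be called as concatenate_files(r'C:\temp\dats', r'C:\temp\datsOutput\output.txt').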
I need to make a script that executes a script one time in each folder of a directory.
Script in question:
f = open('OrderEXAMPLE.txt', 'r')
data = f.readlines()
mystr = ",".join([line.strip() for line in data])
with open('CSV.csv', 'w') as f2:
    f2.write(mystr)
This script turns a list of customer data into CSV form.
Each order form has its own folder, so my initial thought was to put the same script into each folder and then write another script that executes each of them simultaneously.
Folder structure is like so:
Order_forms
--Order_123
-----Order_form
--Order_124
-----Order_form
Amateur at python, so advice is needed and appreciated.
Just walk the directory structure with one script. This will write a separate CSV for each file, named <original_filename>_CSV.csv. Without more clarity on the desired output or what the data looks like, I can't help much more. You should be able to tweak this for whatever you need.
import os
parent_folder = 'Order_forms'
for root, dirs, files in os.walk(parent_folder):
    for f in files:
        with open(os.path.join(root, f), 'r') as f1:
            data = f1.readlines()
        mystr = ",".join([line.strip() for line in data])
        with open(os.path.join(root, '{}_CSV.csv'.format(f)), 'w') as f2:
            f2.write(mystr)
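One thing to watch with ",".join is that a line which itself contains a comma (an address, say) will be split into two columns when the CSV is read back. Letting csv.writer do the quoting avoids that. The sketch below assumes the order forms are .txt files, as in 'OrderEXAMPLE.txt'; the function name convert_order_forms is made up.

```python
import csv
import os


def convert_order_forms(parent_folder, suffix=".txt"):
    """Write <name>_CSV.csv next to every order form under parent_folder.

    Returns the list of CSV paths written. Hypothetical helper, sketch only.
    """
    written = []
    for root, _dirs, files in os.walk(parent_folder):
        for name in files:
            if not name.endswith(suffix):
                continue  # also skips CSVs produced by an earlier run
            with open(os.path.join(root, name)) as source:
                fields = [line.strip() for line in source]
            target_path = os.path.join(root, os.path.splitext(name)[0] + "_CSV.csv")
            with open(target_path, "w", newline="") as target:
                # writerow quotes any field that contains a comma.
                csv.writer(target).writerow(fields)
            written.append(target_path)
    return written
```

Because it only picks up files with the given suffix, running it a second time does not try to convert its own output.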
I need to edit several csv files. Actually, most of the files are fine as they are, it's just the last (41st) column that needs to be changed. For every occurrence of a particular string in that column, I need it to be replaced by a different string; specifically, every occurrence of 'S-D' needs to be replaced by 'S'. I've tried to accomplish this using Python, but I think I need to write the csv files and I'm not quite sure how to do this:
import os
import csv
path = os.getcwd()
filenames = os.listdir(path)
for filename in filenames:
    if filename.endswith('.csv'):
        r = csv.reader(open(filename))
        for row in r:
            if row[40] == "S-D":
                row[40] = "S"
Any help? Also, if anyone has a quick, elegant way of doing this with a shell script, that would probably be very helpful to me as well.
Try something along these lines. It now uses the glob module, as mentioned by @SaulloCastro, and the csv module.
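import glob
import csv
for item in glob.glob("*.csv"):
    # Read all rows first so the same file can be reopened for writing
    with open(item, "r", newline="") as f:
        r = list(csv.reader(f))
    for row in r:
        row[-1] = row[-1].replace("S-D", "S")
    with open(item, "w", newline="") as f:
        w = csv.writer(f)
        w.writerows(r)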
Be sure to read the Python documentation on CSV File Reading and Writing. Lots to learn there. Here is a basic example based on your question. It modifies only the data in the last column and writes out a modified file with "_edited" in the name.
import os
import csv
path = os.getcwd()
filenames = os.listdir(path)
for filename in filenames:
    if filename.endswith('.csv'):
        # newline="" is what the csv module expects for files it reads or writes
        with open(filename, newline="") as f:
            r = csv.reader(f)
            new_data = []
            for row in r:
                row[-1] = row[-1].replace("S-D", "S")
                new_data.append(row)
        newfilename = "".join(filename.split(".csv")) + "_edited.csv"
        with open(newfilename, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(new_data)
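Both answers use str.replace, which also changes values that merely contain 'S-D'. If you only want exact matches, as in the question's own check, and would rather update the files in place, something like the following might work. It is a sketch with invented names (replace_in_last_column); it writes to a temporary file first and swaps it in with os.replace, so a crash midway should not leave a half-written CSV.

```python
import csv
import glob
import os
import tempfile


def replace_in_last_column(folder, old="S-D", new="S"):
    """Replace exact matches of old in the last column of every .csv in folder.

    Returns the number of cells changed. Hypothetical helper, sketch only.
    """
    changed = 0
    # The list of inputs is built before any temporary file is created.
    for path in glob.glob(os.path.join(folder, "*.csv")):
        fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
        with os.fdopen(fd, "w", newline="") as tmp_file, \
                open(path, newline="") as source:
            writer = csv.writer(tmp_file)
            for row in csv.reader(source):
                # Only touch non-empty rows whose last cell matches exactly.
                if row and row[-1] == old:
                    row[-1] = new
                    changed += 1
                writer.writerow(row)
        # Swap the rewritten file into place once it is complete.
        os.replace(tmp_path, path)
    return changed
```

Note that csv.writer ends lines with "\r\n" by default, so the rewritten files may differ from the originals in line endings even where no value changed.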