Process files then modify filenames - python

I am processing files in a directory and after processing a file I want to save it using the original name but also add xx to the file name. My purpose is to identify which files have been processed.
Basic suggestions on how to proceed are appreciated.

If the only purpose is to flag the file so that you know which files have been processed, I would try another strategy (adding file metadata or something similar). But from your question, I infer that all you need is to rename the file after it has been processed. You can use os.rename:
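import os

filename = "example.txt"
flag_suffix = ".xx"

# "rb+" opens the existing file for reading and writing; "wb+" would truncate it first.
with open(filename, "rb+") as f:
    # process file
    ...

# Rename only after the file has been closed.
os.rename(filename, f"{filename}{flag_suffix}")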
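If you would rather keep the extension intact (example_xx.txt instead of example.txt.xx) and skip files that are already flagged, a minimal sketch with pathlib could look like the following. The "_xx" marker and the function names are assumptions made for illustration, not anything from the question.

```python
from pathlib import Path

FLAG = "_xx"  # hypothetical marker; pick one that cannot occur in unprocessed names


def is_processed(path: Path) -> bool:
    """Return True if the file name already carries the marker."""
    return path.stem.endswith(FLAG)


def mark_processed(path: Path) -> Path:
    """Rename 'report.txt' to 'report_xx.txt' and return the new path."""
    target = path.with_name(f"{path.stem}{FLAG}{path.suffix}")
    return path.rename(target)


def process_directory(directory: Path) -> list[Path]:
    """Process every unmarked file in `directory` once, then mark it."""
    done = []
    # sorted() materialises the listing first, so renames do not disturb the loop.
    for path in sorted(directory.iterdir()):
        if path.is_file() and not is_processed(path):
            # ... process the file here ...
            done.append(mark_processed(path))
    return done
```

Running process_directory a second time on the same directory should then do nothing, because every file already carries the marker.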

Related

Opening several files with python threading

I have several files I want to open at the same time and pull data from and write them each to their own files. I'm currently doing something like this:
files = os.listdir(FilePath)
for file in files:
    with open(os.path.join(FilePath, file), 'r') as LFILE:
        LFILE.read()
Will I need to somehow read all the files, put them in a list, and then have each thread work down the list and remove a file once it has been read? Or is there a better way to open the files without reading the same one more than once?
I don't have the reputation to ask for further clarification in a comment, but if you're not aggregating the data in such a way that info needs to be shared across files, I would just map the list of files to a pool, like so:
import multiprocessing
import os

def analyze_file(filename: str):
    with open(filename, 'r') as LFILE:
        # analyze the file how you'd like
        # store results in a string
        results = LFILE.read()  # placeholder: replace with your analysis
    with open(filename + 'analyzed.txt', 'w') as result_fh:
        result_fh.write(results)

if __name__ == "__main__":
    # FilePath is the directory from the question
    with multiprocessing.Pool(4) as p:
        p.map(
            analyze_file,
            [os.path.join(FilePath, file) for file in os.listdir(FilePath)]
        )
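Since the question asks about threads specifically, the same mapping idea can be sketched with concurrent.futures.ThreadPoolExecutor, which may be enough when the work is mostly I/O. The line-counting function below is only a stand-in for whatever analysis you need, and the function names are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_lines(path: str) -> tuple[str, int]:
    """Placeholder analysis: return the path and its number of lines."""
    with open(path, "r") as fh:
        return path, sum(1 for _ in fh)


def analyze_directory(directory: str, workers: int = 4) -> dict[str, int]:
    """Analyze every regular file in `directory` using a pool of threads."""
    paths = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]
    # map() hands each path to exactly one worker, so no file is read twice.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(count_lines, paths))
```

No shared list needs to be managed by hand here, because the executor distributes the paths itself.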

Merge CSV files in ADLS2 that are prepared through DataBricks

While running DataBricks code and preparing CSV files and loading them into ADLS2, the CSV files are split into many CSV files and are being loaded into ADLS2.
Is there a way to merge these CSV files in ADLS2 through pyspark?
Thanks
As far as I know, Spark writes a DataFrame out as separate part files. In theory, you could use the spark.read.csv method, which also accepts a list of path strings.
>>> df = spark.read.csv('path')
Then use df.toPandas().to_csv() to convert the result to a pandas DataFrame and write it out as a single CSV file. For some pointers, see this case: Azure Data-bricks : How to read part files and save it as one file to blob?.
However, I'm afraid this approach may not cope with the memory consumption, since toPandas() collects all the data on the driver. So I'd suggest simply using the os package to do the merge directly. I tested the two snippets below for your reference.
1st:
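import os

path = '/dbfs/mnt/test/'
file_suffix = '.csv'
# List the CSV files directly under path (the original snippet left `files` undefined).
filtered_files = [os.path.join(path, file) for file in os.listdir(path) if file.endswith(file_suffix)]
print(filtered_files)
with open(path + 'final.csv', 'w') as final_file:
    for file in filtered_files:
        with open(file) as f:
            lines = f.readlines()
            # lines[1:] skips the header row of each part file
            final_file.writelines(lines[1:])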
2nd:
import os

path = '/dbfs/mnt/test/'
file_suffix = '.csv'
# Walk the whole tree under path and collect every CSV file.
filtered_files = [os.path.join(root, name) for root, dirs, files in os.walk(top=path, topdown=False) for name in files if name.endswith(file_suffix)]
print(filtered_files)
with open(path + 'final2.csv', 'w') as final_file:
    for file in filtered_files:
        with open(file) as f:
            lines = f.readlines()
            # lines[1:] skips the header row of each part file
            final_file.writelines(lines[1:])
The second one also handles nested directory hierarchies.
In addition, here is another approach: use an ADF copy activity to merge multiple CSV files into one file in ADLS Gen2.
Please refer to this doc and configure the folder path in the ADLS Gen2 source dataset. Then set the copyBehavior property to MergeFiles. (You can also use wildcardFileName with a pattern like *.csv to restrict the copy to the files you want and leave the other files in the folder untouched.)
Merges all files from the source folder to one file. If the file name
is specified, the merged file name is the specified name. Otherwise,
it's an autogenerated file name.
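Going back to the os-based snippets: they drop the header row of every part and load each part fully with readlines(). A variation that keeps the header once and streams the remaining rows could be sketched as follows. It assumes all parts share the same header and each part ends with a newline; the function name and the default output name are hypothetical.

```python
import os
import shutil


def merge_csv_files(folder: str, output_name: str = "merged.csv") -> str:
    """Concatenate the CSV files in `folder`, writing the header only once."""
    output_path = os.path.join(folder, output_name)
    # Collect the parts before creating the output, and never include the output itself.
    parts = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(".csv") and name != output_name
    )
    with open(output_path, "w", newline="") as out:
        for index, part in enumerate(parts):
            with open(part, "r", newline="") as src:
                header = src.readline()
                if index == 0:
                    out.write(header)  # keep the header from the first part only
                shutil.copyfileobj(src, out)  # stream the remaining rows
    return output_path
```

On Databricks the folder would be something like the /dbfs/mnt/test/ path used above.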

Determine Filename of Unzipped File

Say you unzip a file called file123.zip with zipfile.ZipFile, which yields an unzipped file saved to a known path. However, this unzipped file has a completely random name. How do you determine this completely random filename? Or is there some way to control what the name of the unzipped file is?
I am trying to implement this in python.
By "random" I assume that you mean that the files are named arbitrarily.
You can use ZipFile.read() which unzips the file and returns its contents as a string of bytes. You can then write that string to a named file of your choice.
from zipfile import ZipFile

with ZipFile('file123.zip') as zf:
    for i, name in enumerate(zf.namelist()):
        with open('outfile_{}'.format(i), 'wb') as f:
            f.write(zf.read(name))
This will write each file from the archive to a file named outfile_n in the current directory. The names of the files contained in the archive are obtained with ZipFile.namelist(). I've used enumerate() as a simple way of generating the file names, but you could substitute whatever naming scheme you require.
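If you only need to learn the names rather than change them, note that ZipFile.extract() returns the path of the file it created, so the "random" name can be captured at extraction time. A minimal sketch (the function name is an assumption):

```python
import zipfile


def extract_and_list(archive: str, dest: str) -> list[str]:
    """Extract every member of `archive` into `dest` and return the paths written."""
    with zipfile.ZipFile(archive) as zf:
        # namelist() reports the stored names before anything is extracted;
        # extract() returns the path it actually created under `dest`.
        return [zf.extract(name, path=dest) for name in zf.namelist()]
```

For example, extract_and_list('file123.zip', 'unzipped') would return one path per archive member.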
If the filename is completely random you can first check for all filenames in a particular directory using os.listdir(). Now you know the filename and can do whatever you want with it :)
See this topic for more information.

Load files from a list of file paths in python

I have a text file with a couple hundred file paths to text files which I would like to open, write / cut up pieces from it and save under a new name.
I've been Googling how to do this and found the module glob, but I can't figure out exactly how to use this.
Could you guys point me in the right direction?
If you have specific paths to files, you won't need the glob module. The glob module is useful when you want to use a pattern like /user/home/someone/pictures/*.jpg. From what I understand, you have a file with ordinary paths.
You can use this code as a start:
with open('file_with_paths', 'r') as paths_list:
    for file_path in paths_list:
        # strip the trailing newline that each line carries
        with open(file_path.strip(), 'r') as file:
            # Do what you want with one of the files here.
            pass
You can just traverse the file line by line and take what you need from each path, then create and save the new file. The sample code below might help:
import os

with open('file_name') as f:
    for file_path in f:
        file_path = file_path.strip()  # drop the trailing newline
        file_name = os.path.basename(file_path)
        absolute_path = os.path.dirname(file_path)
        # change whatever you want to with above two and save the file
        # os.makedirs to create the directory
        # open() in write mode to create the file
Let me know if it helps you
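Putting both answers together, a small sketch that reads the list of paths and saves a cut-down copy of each file under a new name might look like this. The "_cut" suffix, the 10-line limit, and the function names are assumptions chosen purely for illustration.

```python
from pathlib import Path


def read_paths(list_file: str) -> list[Path]:
    """Return the non-empty lines of `list_file` as paths, without newlines."""
    with open(list_file, "r") as fh:
        return [Path(line.strip()) for line in fh if line.strip()]


def save_excerpt(source: Path, suffix: str = "_cut", max_lines: int = 10) -> Path:
    """Write the first `max_lines` lines of `source` next to it under a new name."""
    target = source.with_name(f"{source.stem}{suffix}{source.suffix}")
    with open(source, "r") as src, open(target, "w") as dst:
        for number, line in enumerate(src):
            if number >= max_lines:
                break
            dst.write(line)
    return target
```

Usage would be along the lines of: for path in read_paths('file_with_paths'): save_excerpt(path).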

Python csv : writing to a different directory

I'm downloading files from a site and I need to save the original file, then open it and then add the url that the file was downloaded from and the date of the download to the file before saving the file to a different directory.
I've used this answer to amend the csv: how to Add New column in beginning of CSV file by Python
but I'm struggling to redirect the file to a different directory before the write() function is called.
Is the best answer to write the file and then move it, or is there a way to write the file to a different directory within the open() function?
if fileName in fileList:
    print "already got file "+ fileName
else:
    # download the file
    urllib.urlretrieve(csvUrl, os.path.basename(fileName))
    #print "Saving to 1_Downloaded "+ fileName
    # open the file and then add the extra columns
    with open(fileName, 'rb') as inf, open("out_"+fileName, 'wb') as outf:
        csvreader = csv.DictReader(inf)
        # add column names to beginning
        fieldnames = ['url_source','downloaded_at'] + csvreader.fieldnames
        csvwriter = csv.DictWriter(outf, fieldnames)
        csvwriter.writeheader()
        for node, row in enumerate(csvreader, 1):
            csvwriter.writerow(dict(row, url_source=csvUrl, downloaded_at=today))
I believe both would work.
To me it seems the neatest way to do it would be to append to the file and relocate it afterwards.
Have a look at:
shutil.move
I believe rewriting the entire file would be less efficient.
It's not necessary to rebuild the file. Try using the time module to create a timestamp string for your file name, and os.rename to move the file.
Example - this just moves the file to your specified location:
os.rename('filename.csv', 'NEW_dir/filename.csv')
Hope this helps.
Went with an additional routine using shutil in the end:
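# move and rename the 'out_' files to the right dir
source = os.listdir(downloaded)
for files in source:
    if files.startswith('out_'):
        newName = files.replace('out_', '', 1)  # strip only the leading prefix
        newPath = os.path.join(renamed, newName)
        # join with the source dir so the move works from any working directory
        shutil.move(os.path.join(downloaded, files), newPath)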
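On the original question of writing into another directory directly: open() accepts any path, so the output file can be created in the target directory in the first place and no move is needed afterwards. A minimal Python 3 sketch of that idea follows; the function name and parameters are assumptions, and the question's own code is Python 2.

```python
import csv
import os


def add_columns(src_path: str, dest_dir: str, url: str, downloaded_at: str) -> str:
    """Copy a CSV into `dest_dir`, prepending url_source and downloaded_at columns."""
    os.makedirs(dest_dir, exist_ok=True)
    # Build the output path inside the target directory and open it there.
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    with open(src_path, newline="") as inf, open(dest_path, "w", newline="") as outf:
        reader = csv.DictReader(inf)
        fieldnames = ["url_source", "downloaded_at"] + list(reader.fieldnames or [])
        writer = csv.DictWriter(outf, fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(dict(row, url_source=url, downloaded_at=downloaded_at))
    return dest_path
```

The original download stays untouched in its own directory, which matches the requirement to keep the original file.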
