Python: Iterating over every file in a directory

Using Python, I am trying to take a list of emails in .txt format and delete everything after a specific keyword, "Original message", in order to remove the part that I sent. All of the emails are currently saved to a directory via a VBScript in Outlook. Each email is its own .txt file, and I would like to cycle through all of them in that one directory.
If there is a way to replace all text between two keywords, that would also work, as I have a program that combines the emails into one long .txt file.
I apologize if I left out any important information; this is my first time posting here.

You can use os.listdir() to iterate over the files in your directory:
import os

folder = "Path/to/directory/storing/emails"
files = [i for i in os.listdir(folder) if i.endswith(".txt")]

for file in files:
    path = os.path.join(folder, file)
    with open(path) as f:
        lines = [line.strip('\n') for line in f]
    # this list slicing removes "Original message" and everything below it
    final_email = lines[:lines.index("Original message")]
    with open(path, 'w') as f:
        f.write('\n'.join(final_email))
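If you would rather take the two-keyword approach on the combined file, a regular-expression sketch along these lines could work; the keywords and filename here are placeholders for your own:
import re

# Hypothetical keywords and filename; replace with your own
start_kw = "Original message"
end_kw = "End of message"

with open("combined_emails.txt") as f:
    text = f.read()

# Remove everything between the two keywords, keywords included
cleaned = re.sub(re.escape(start_kw) + r".*?" + re.escape(end_kw), "", text, flags=re.DOTALL)

with open("combined_emails.txt", "w") as f:
    f.write(cleaned)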

Related

How do I make a list with values taken from different text files?

I have a folder, which I want to select manually, with a number of .txt files. I want to make a program that allows me to run it -> select my folder with files -> cycle through all files in the folder and take a value from a set place in each one.
I have already made a piece of code that allows me to take the value from the .txt file:
mylines = []
with open('test1.txt', 'rt') as myfile:
    for myline in myfile:
        mylines.append(myline)

subline = mylines[58]
sub = subline.split(' ')
print(sub[5])
EDIT: I also have a piece of code that makes a list of paths to all the files I want to use this on:
import glob

path = r'C:/Users/Etienne/.spyder-py3/test/*.UIT'
files = glob.glob(path)
print(files)
How can I use the first piece of code on every file in the list from the second piece of code, so I end up with a list of values?
I have never worked with code before, but this would make my work a lot faster, so I want to pick up Python.
If I understood the problem correctly, the os module might be helpful for you.
The os.listdir() method in Python is used to get the list of all files and directories in the specified directory. For example:
import os
# Get the list of all files and directories
# in the root directory, you can change your directory
path = "/"
dir_list = os.listdir(path)
print("Files and directories in '", path, "' :")
# print the list
print(dir_list)
With this list you can iterate over your .txt files.
For additional information, see: How can I iterate over files in a given directory?
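To tie your two pieces together, a sketch like the one below applies the same extraction to every file that glob finds and collects the results in a list; the line index 58 and field index 5 are taken from your example and may need adjusting:
import glob

path = r'C:/Users/Etienne/.spyder-py3/test/*.UIT'
values = []

for filename in glob.glob(path):
    with open(filename, 'rt') as myfile:
        mylines = myfile.readlines()
    # same extraction as in the single-file example
    subline = mylines[58]
    sub = subline.split(' ')
    values.append(sub[5])

print(values)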

In Python, how do I create a list of files based on specified file extensions?

Let's say I have a folder with a bunch of files (with different file extensions). I want to create a list of files from this folder. However, I want to create a list of files with SPECIFIC file extensions.
These file extensions are categorized into groups.
File Extensions: .jpg, .png, .gif, .pdf, .raw, .docx, .pptx, .xlsx, .js, .html, .css
Group "image" contains .jpg, .png, .gif.
Group "adobe" contains .pdf, .raw. (yes, I'm listing '.raw' as an adobe file for this example :P)
Group "microsoft" contains .docx, .pptx, .xlsx.
Group "webdev" contains .js, .html, .css.
I want to be able to add these files types to a list. That list will be generated in a ".txt" file and would contain ALL files with the chosen file extensions.
So if my folder has 5 image files, 10 adobe files, 5 microsoft files, 3 webdev files and I select the "image" and "microsoft" groups, this application in Python would create a .txt file that contains a list of filenames with file extensions that belong only in the image and microsoft groups.
The text file would have a list like below:
picture1.jpg
picture2.png
picture3.gif
picture4.jpg
picture5.jpg
powerpoint.pptx
powerpoint2.pptx
spreadsheet.xlsx
worddocument.docx
worddocument2.docx
As of right now, my code creates a text file that generates a list of ALL files in a specified folder.
I could use an "if" statement to get a specific file extension, but I don't think this achieves the results I want. In this case, I would have to create a function for each group (i.e. a function for the image, adobe, microsoft and webdev groups). I want to be able to combine these groups freely (i.e. image and microsoft files in one list).
Example of an "if" statement:
for elem in os.listdir(filepath):
    if elem.endswith('.jpg'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    if elem.endswith('.png'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    if elem.endswith('.gif'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    else:
        continue
Full code without the "if" statement (generates a .txt file with all filenames from a specified folder):
import os

def enterFilePath():
    global filepath
    filepath = input("Please enter your file path. ")
    os.chdir(filepath)

enterFilePath()

def enterFileName():
    global name
    global listName
    name = str(input("Name the txt file. "))
    listName = name + ".txt"

enterFileName()

def listGenerator():
    for filename in os.listdir(filepath):
        listItem = filename + ' \n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()

listGenerator()
A pointer before getting into the answer: avoid using global in favor of function parameters and return values. It will make debugging significantly less of a headache and make it easier to follow data flow through your program.
nostradamus is correct in his comment: a dict is the ideal way to solve your problem here. I've solved similar problems before using itertools.chain.from_iterable and pathlib.Path, which I'll be using here.
First, the dict:
groups = {
    'image': {'jpg', 'png', 'gif'},
    'adobe': {'pdf', 'raw'},
    'microsoft': {'docx', 'pptx', 'xlsx'},
    'webdev': {'js', 'html', 'css'}
}
This sets up your extension groups that you mentioned, which you can then access easily with groups['image'], groups['adobe'], etc.
Then, using the Path.glob method, itertools.chain.from_iterable, and a comprehension, you can get your list of desired files in a single statement (or function).
from itertools import chain
from pathlib import Path

# Set up some variables
target_groups = ['adobe', 'webdev']

# Initialize generator
files = chain.from_iterable(
    # Glob pattern for the current extension
    Path(filepath).glob(f'*.{ext}')
    # Each group in target_groups
    for group in target_groups
    # Each extension in current group
    for ext in groups[group]
)

# Then, just iterate the files
for fpath in files:
    # Do stuff with each file here
    print(fpath.name)
My test directory has one file of each extension you listed, named a, b, etc for each group. Using the above code, my output is:
a.pdf
b.raw
a.js
b.html
c.css
The way the file list/generator is set up means that the list of files will be sorted by extension-group, then by extension, and then by name. If you want to change what groups are being listed, just add/remove values in the target_groups list above (works with a single option as well).
You'll also want to consider parameterizing your targets, such as through input or script arguments, as well as handling cases where a requested group doesn't exist in the groups dictionary. The code above would probably also be more useful as a function, but I'll leave that implementation up to you :)
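As one possible shape for that function (a sketch only, including the group validation mentioned above):
from itertools import chain
from pathlib import Path

def find_files(filepath, target_groups, groups):
    """Yield Paths matching any extension in the requested groups."""
    # Fail early if a requested group name is not defined
    unknown = set(target_groups) - groups.keys()
    if unknown:
        raise ValueError("Unknown group(s): " + ", ".join(sorted(unknown)))
    return chain.from_iterable(
        Path(filepath).glob(f'*.{ext}')
        for group in target_groups
        for ext in groups[group]
    )

# Example usage, assuming the groups dict from above:
# for fpath in find_files('.', ['image', 'microsoft'], groups):
#     print(fpath.name)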

Check if any part of string is contained within Text File

Here is my code:
for r, d, f in os.walk(path):
    for folder in d:
        folders.append(os.path.join(r, folder))

for f in folders:
    with open('search.txt') as file:
        contents = file.read()
        if f in contents:
            print('word found')
I'm having the script search through a given directory and compare the names of the paths to a text file full of virus definitions. I'm trying to get the script to recognise whether the name of the file is contained within said text file.
The problem I've found, as you can see from the code, is that it will only work with a complete match, and since it takes the whole path as the string (for example, "C:/test/virus.bat") it will never match.
Is it possible to adjust this code so that any part of the path can be matched against the text file?
Not sure if that makes sense, any suggestions welcome or please say if not clear.
EDIT:
To be more clear, here is a logic version of what I am trying to achieve:
List all Files in Directory
Get file name within path name ("virus" within "C:/test/virus")
Check if file name is contained within Text File
You can use the following to get all files in a given path:
files = []
for r, d, f in os.walk(path):
    # f holds the files that live directly in r, so join r with each file name
    for file in f:
        files.append(os.path.join(r, file).replace('\\', '/'))
Then slightly modify your code to achieve the result. You want to check whether the content of the file is in the given file path, not vice versa: if content in f.
for f in files:
    with open('search.txt', 'r') as file:
        content = file.read()
        if content in f:
            print('Found')
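Note that if content in f only matches when search.txt holds a single definition with no extra whitespace. If the file contains one definition per line, a sketch along these lines (the variable names are illustrative) compares each file name against every definition:
import os

# Load one virus definition per line, skipping blanks
with open('search.txt') as fh:
    definitions = [line.strip() for line in fh if line.strip()]

for path in files:
    # e.g. "virus" from "C:/test/virus.bat"
    name = os.path.splitext(os.path.basename(path))[0]
    if any(definition in name for definition in definitions):
        print('word found:', path)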

Why are my `binaryFiles` empty when I collect them in pyspark?

I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/.
I pass that to binaryFiles in PySpark:
zips = sc.binaryFiles('/user/path-to-folder-with-zips/')
I'm trying to unzip the zip files and do things to the text files in them, so I tried to just see what the content will be when I try to deal with the RDD. I did it like this:
zips_collected = zips.collect()
But, when I do that, it gives an empty list:
>> zips_collected
[]
I know that the zips are not empty; they contain text files. The documentation here says:
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
What am I doing wrong here? I know I can't view the contents of the file because it is zipped and therefore binary. But, I should at least be able to see SOMETHING. Why does it not return anything?
There can be more than one file per zip file, but the contents are always something like this:
rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
I'm assuming that each zip file contains a single text file (the code is easily changed for multiple text files). You need to read the contents of the zip file first via io.BytesIO before processing it line by line. The solution is loosely based on https://stackoverflow.com/a/36511190/234233.
import io
import gzip

def zip_extract(x):
    """Extract a *.gz file in memory for Spark; x is a (path, bytes) pair."""
    file_obj = gzip.GzipFile(fileobj=io.BytesIO(x[1]), mode="r")
    # decode the bytes so the result can be split into lines
    return file_obj.read().decode("utf-8")

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')

# parse_line is your own function for splitting the "rownum|data|..." records
results = zip_data.map(zip_extract) \
                  .flatMap(lambda zip_file: zip_file.split("\n")) \
                  .map(lambda line: parse_line(line)) \
                  .collect()
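The question describes .zip archives rather than .gz files, so if gzip.GzipFile cannot open them, the standard zipfile module may be needed instead. A rough sketch under that assumption (parse_line is again your own parser for the pipe-delimited rows, shown here as a simple split):
import io
import zipfile

def unzip_extract(x):
    """Return the decoded text of every member of a zip archive; x is a (path, bytes) pair."""
    with zipfile.ZipFile(io.BytesIO(x[1])) as zf:
        return "\n".join(zf.read(name).decode("utf-8") for name in zf.namelist())

results = (sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
             .map(unzip_extract)
             .flatMap(lambda text: text.split("\n"))
             .map(lambda line: line.split("|"))  # stand-in for parse_line
             .collect())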

Combine a .txt's name and second interior string to create a dictionary in Python

I've written a batch script, deployed to our network via Chocolatey and FOG, that acquires the serial number of the machine and then writes it to a .txt file bearing the name of the PC that the serial number belongs to:
net use y: \\192.168.4.104\shared
wmic bios get serialnumber > Y:\IT\SSoftware\Serial_Numbers\%COMPUTERNAME%.txt
net use y: /delete
The folder Serial_Numbers is subsequently filled with .txt files bearing the names of every computer on campus. With this in mind, I'd like to write a Python script that goes through and grabs every .txt file's name and its second interior string to form a dictionary, so you can look up a PC's name and have the serial number returned.
I'm aware of how I'd create the dictionary and call from it, but I'm having trouble figuring out how to properly grab the .txt's name and second interior string; any help would be greatly appreciated.
Format of .txt documents:
SerialNumber
#############
You can use os.listdir to list the directory files and a list comprehension to filter them.
Use glob to list the files in your directory.
You can simply skip the header line, read the serial number, and stop using the file while populating the dictionary, and you're done:
import glob

d = {}
# loop over '.txt' files only
for filename in glob.glob('/path_to_Serial_Numbers_folder/*.txt'):
    with open(filename, 'r') as f:
        f.readline()  # skip the "SerialNumber" header line
        file_name_no_extension = '.'.join(filename.split('.')[:-1])
        d[file_name_no_extension] = f.readline().strip()

print(d)
import glob

data = {}
for fnm in glob.glob('*.txt'):
    data[fnm[:-4]] = open(fnm).readlines()[1].strip()
or, more succinctly
import glob
data = {f[:-4]:open(f).readlines()[1].strip() for f in glob.glob('*.txt')}
In the dictionary comprehension above, f[:-4] is the filename without the last four characters (i.e., ".txt"), open(f).readlines()[1].strip() is the second line of the file, stripped of whitespace, and f is an element of the list of filenames returned by glob.glob().
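Looking up a machine afterwards is then just a dictionary access; the PC name below is made up. Note also that output redirected from wmic is often UTF-16 encoded, so you may need to open the files with encoding='utf-16' for the reads above to work:
# Hypothetical lookup; replace 'LAB-PC-01' with a real computer name
print(data['LAB-PC-01'])

# If the wmic-generated files are UTF-16, read them like this instead:
# open(fnm, encoding='utf-16').readlines()[1].strip()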
