searching html files in directories using python


I have an HTML parser function that needs the absolute path of each file.
How do I search a directory, find only the files ending with .html, and return the absolute path of each one?

Have you considered using the Python os module? It provides the listdir(path) function:
os.listdir(path)
Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries '.' and '..' even if they are present in the directory.
Availability: Unix, Windows.
Use it to get all the file names in your directory, filter out the non-HTML files, and then join the directory path to each remaining name to get its absolute path.
https://docs.python.org/2/library/os.html
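The steps described above can be sketched roughly as follows. The function name find_html_files is hypothetical, and the case-insensitive match on the extension is an assumption about what counts as an HTML file; this version does not descend into subdirectories.

```python
import os


def find_html_files(directory):
    """Return sorted absolute paths of the .html files directly inside directory."""
    # Resolve the directory once so every joined path is absolute.
    directory = os.path.abspath(directory)
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        # Keep regular files only, matching the extension case-insensitively.
        if name.lower().endswith(".html")
        and os.path.isfile(os.path.join(directory, name))
    )
```

Each returned path can then be handed to the parser function one at a time.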

Related

How can I add a path to the CSV files created?

I'm splitting a CSV file based on the column "ColumnName". How can I make all the CSV files that are created save into a specified path?
data = pd.read_csv(r'C:\Users\...\Output.csv')
for (ColumnName), group in data.groupby(['ColumnName']):
    group.to_csv('{ColumnName}.csv', index=False)
Thanks

pandas.DataFrame.to_csv() takes a string path as input to write to said path.
With your current code group.to_csv('{ColumnName}.csv', index=False), '{ColumnName}.csv' is treated as a plain string literal, so no substitution happens. If you want variable substitution, Python has several methods; two of them are:
f-strings (introduced in Python 3.6)
group.to_csv(f'{ColumnName}.csv', index=False)
str.format
group.to_csv('{}.csv'.format(ColumnName), index=False)
Specifying path
Following this. If you're looking to specify more than just the file name, you are able to specify the full file path or the file path relative to the current directory.
Providing full file path
Full file paths describe the location starting from the root. On Windows this means providing a path such as rf'C:\Users\mycsvfolder\{ColumnName}.csv' (the r prefix keeps the backslashes from being read as escape sequences). Providing the full path to to_csv() will have the file written there.
Note: on Linux, the root is /. For example, /Users/myuser/mycsvfolder/file.csv would be a full file path.
Providing a relative file path
Relative file paths are resolved against the current working directory. For example, to write to a folder inside the current folder, specify f'mycsvfolder/{ColumnName}.csv' and the file will be written to that folder. This is also why f'{ColumnName}.csv' writes the file to the current directory: paths are relative to it unless otherwise specified.
Note when writing to folder
You will need to create folders before writing to them in most cases. Some write functions do provide folder creation functionality however.
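As a minimal sketch of creating the folder first, a small helper can make sure the target directory exists and return the path to write to. The helper name csv_path_for is hypothetical, and the folder name used in the usage comment is only an example.

```python
import os


def csv_path_for(folder, column_value):
    """Create folder if needed and return the CSV path for one group."""
    # exist_ok=True makes this safe to call once per group.
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, f"{column_value}.csv")


# Possible use inside the loop from the question:
#     group.to_csv(csv_path_for('mycsvfolder', ColumnName), index=False)
```

Because the folder is created on demand, the same call works for both relative and full paths.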
Additional material regarding paths, specifically in Python.

Glob doesn't return list of files from specified directory

I've found two ways of listing files from a specified directory in other posts here on Stack Overflow, but I can't seem to get them working. The first one returns only the path, and the second returns the files I'm looking for but with the path included. I have tried several things, like renaming the target directory and files, but that doesn't seem to do the trick.
The code in question:
import glob
jpgFilenamesList = glob.glob(r"C:\Users\viodo\PycharmProjects\pythonProject")
print(jpgFilenamesList)
mydir = r"C:\Users\viodo\PycharmProjects\pythonProject"
file_list = glob.glob(mydir + "/*.jpg")
print(file_list)
what I get:
['C:\\Users\\viodo\\PycharmProjects\\pythonProject']
['C:\\Users\\viodo\\PycharmProjects\\pythonProject\\dngjknfjkg.jpg', 'C:\\Users\\viodo\\PycharmProjects\\pythonProject\\fjkdnfkl.jpg', 'C:\\Users\\viodo\\PycharmProjects\\pythonProject\\skdklenfkd.jpg']
Solution found in another thread: Python glob multiple filetypes
Some tweaking got it running smoothly. Thanks for the help!

Glob returns a list of pathnames relative to the root directory. That root directory is assumed to be your current working directory unless the glob pattern specified is an absolute path. In short, because your pattern is an absolute path pattern, the returned files will not be relative, but absolute, including the entire path.
When not using an absolute path pattern, in some cases, you could get just a file name if a file name matches in the current working directory. That file name would of course be relative to the current working directory.
In Python 3.10, you should be able to change the assumed root directory without using an absolute pattern via a new root_dir parameter, but this is not currently available in 3.9 and below: https://docs.python.org/3.10/library/glob.html.
In your case, as mentioned in the comments by others, os.path.basename should be able to get just the file name if that is what you are after. Alternatively, you could change the current working directory via os.chdir and provide a glob pattern of simply *.jpg to get just the file names relative to that current working directory. Both are reasonable solutions.
Extracting the base name:
import glob
import os
mydir = r"C:\Users\viodo\PycharmProjects\pythonProject"
file_list = [os.path.basename(f) for f in glob.glob(mydir + "/*.jpg")]
or returning the files relative to an arbitrary "current working directory":
os.chdir(r"C:\Users\viodo\PycharmProjects\pythonProject")
file_list = glob.glob("*.jpg")
Depending on your requirements, one solution may be better than the other.
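For completeness, here is a hedged sketch that uses the root_dir parameter mentioned above when the interpreter supports it and falls back to os.path.basename otherwise. The function name jpg_names is hypothetical, and the results are sorted only to make the output order predictable.

```python
import glob
import os
import sys


def jpg_names(directory):
    """Return the bare names of the .jpg files in directory, without changing the cwd."""
    if sys.version_info >= (3, 10):
        # root_dir makes the pattern relative to directory instead of the cwd.
        return sorted(glob.glob("*.jpg", root_dir=directory))
    # Older versions: match with the full path, then strip the directory part.
    pattern = os.path.join(directory, "*.jpg")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))
```

Unlike the os.chdir approach, this leaves the working directory of the rest of the program untouched.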

Confused about path format when using os.walk

I have a folder that contains both zipped and non-zipped files, and I want to migrate data in a given format from one place to another locally.
I am new to Python, so I tried the os module for the first time. To access all the files in the folder, I am using os.walk, which gives me the root path, the directory list, and the file list. It shows the zipped files as files rather than directories, while the normal folders are shown as directories.
When I print the root path, it is shown this way:
Desktop/shin/archieve
Desktop/shin/archieve\New folder
My code is:
for path, dir_list, file_list in os.walk('Desktop/shin/archieve'):
    print(path)
I am confused about the difference between "Desktop/shin/archieve\New folder" and "Desktop/shin/archieve/New folder", and also about why zipped files are treated as files only.
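One possible explanation, offered as a sketch rather than a definitive answer: os.walk builds each sub-path with os.path.join, which on Windows inserts a backslash, so a start path written with forward slashes ends up with mixed separators. Windows accepts both forms, and os.path.normpath can make them uniform. A .zip archive is an ordinary file on disk, so os.walk lists it in file_list and does not descend into it; the zipfile module is what reads its contents. The function name describe_tree below is hypothetical.

```python
import os
import zipfile


def describe_tree(top):
    """Return (normalized_path, is_zip) for every file found below top."""
    rows = []
    for path, dir_list, file_list in os.walk(top):
        for name in file_list:
            # normpath rewrites the path with the platform's own separator.
            full = os.path.normpath(os.path.join(path, name))
            # is_zipfile inspects the file contents, not the extension.
            rows.append((full, zipfile.is_zipfile(full)))
    return rows
```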

recursive script to rename folders ending with a space or period

We just switched our storage server over to a new file system. The old file system allowed users to name folders with a period or space at the end. The new system considers this an illegal character. How can I write a Python script to recursively loop through all directories and rename any folder that has a period or space at the end?

Use os.walk. Give it a root directory path and it will recursively iterate over it. Do something like
for root, dirs, files in os.walk('root path'):
    for dir in dirs:
        if dir.endswith(' ') or dir.endswith('.'):
            os.rename(...)
EDIT:
We should actually rename the leaf directories first - here is the workaround:
alldirs = []
for root, dirs, files in os.walk('root path'):
    for dir in dirs:
        alldirs.append(os.path.join(root, dir))
# the following two lines make sure that leaf directories are renamed first
alldirs.sort()
alldirs.reverse()
for dir in alldirs:
    if ...:
        os.rename(...)
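An alternative to collecting and reverse-sorting the paths is to ask os.walk for a bottom-up traversal with topdown=False, so that leaf directories are visited before their parents. The sketch below is one way to fill in the elided parts; the function name strip_trailing is hypothetical, and skipping a rename when the cleaned name already exists is an assumption about how collisions should be handled.

```python
import os


def strip_trailing(root_path):
    """Rename folders ending in a space or period, deepest folders first."""
    renamed = []
    # topdown=False yields each directory only after all of its subdirectories.
    for root, dirs, files in os.walk(root_path, topdown=False):
        for name in dirs:
            cleaned = name.rstrip(" .")
            if not cleaned or cleaned == name:
                continue  # nothing to fix, or nothing would be left of the name
            src = os.path.join(root, name)
            dst = os.path.join(root, cleaned)
            if os.path.exists(dst):
                continue  # assumption: leave name collisions for manual review
            os.rename(src, dst)
            renamed.append((src, dst))
    return renamed
```

The returned list of (old, new) pairs can be logged before anything else depends on the new names.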

You can use os.listdir to list the folders and files on some path. This returns a list that you can iterate through. For each list entry, use os.path.join to combine the file/folder name with the parent path and then use os.path.isdir to check if it is a folder. If it is a folder then check the last character's validity and, if it is invalid, change the folder name using os.rename. Once the folder name has been corrected, you can repeat the whole process with that folder's full path as the base path. I would put the whole process into a recursive function.
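The recursive approach described above might look like the following sketch. The function name fix_names is hypothetical; skipping symbolic links and leaving a folder untouched when the cleaned name already exists are assumptions, not part of the original description.

```python
import os


def fix_names(base_path):
    """Recursively rename folders under base_path that end in a space or period."""
    for entry in os.listdir(base_path):
        full = os.path.join(base_path, entry)
        # Only real folders are of interest; links are skipped to avoid loops.
        if not os.path.isdir(full) or os.path.islink(full):
            continue
        cleaned = entry.rstrip(" .")
        if cleaned and cleaned != entry:
            target = os.path.join(base_path, cleaned)
            if not os.path.exists(target):
                os.rename(full, target)
                full = target  # continue the recursion from the corrected path
        fix_names(full)
```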

Change to a known directory name but unknown absolute path in Python

I would like to change the cwd to a specific folder.
The folder name is known; however, the path to it will vary.
I am attempting the following but cannot seem to get what I am looking for:
absolute_path = os.path.abspath(folder_name)
directory_path = os.path.dirname(absolute_path)
os.chdir(directory_path)
This does not do what I'm looking for because it is keeping the original cwd to where the .py file is run from. I've tried adding os.chdir(os.path.expanduser("~")) prior to the first code block; however, it just creates the absolute_path to /home/user/folder_name.
Of course if there is a simple import that I could use, I'll be open to anything.
What would be the correct way to get the paths of all folders with a specific name?

import os
def find_folders(start_path, needle):
    # walk the tree below start_path and yield every folder named needle
    for cwd, folders, files in os.walk(start_path):
        if needle in folders:
            yield os.path.join(cwd, needle)
for path in find_folders("/", "a_folder_named_x"):
    print(path)
All this does is walk down your directory structure from a given start path and find all occurrences of a folder named needle.
In the example it starts at the root folder of the system and looks for a folder named "a_folder_named_x". Be forewarned that this could take a while to run if you need to search the whole system.

You need to understand that abspath accepts a relative pathname (which might just be a filename), and gives you the equivalent absolute (full) pathname. A relative pathname is one that begins in your current directory; no searching is involved, and so it always points to one place (which may or may not exist).
What you actually need is to search down a directory tree, starting at ~ or whatever directory makes sense in your case, until you find a folder with the requested name. That's what @Joran's code does.
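To tie this back to the original goal of changing the cwd, a sketch along these lines stops at the first match. The function name find_first_folder is hypothetical, and which match comes first when several folders share the name depends on the order in which os.walk visits them.

```python
import os


def find_first_folder(start_path, needle):
    """Return the path of the first folder named needle below start_path, or None."""
    for cwd, folders, files in os.walk(start_path):
        if needle in folders:
            return os.path.join(cwd, needle)
    return None


# Possible use, starting the search at the home directory:
#     target = find_first_folder(os.path.expanduser("~"), "folder_name")
#     if target is not None:
#         os.chdir(target)
```

Checking for None before calling os.chdir avoids an error when the folder does not exist anywhere under the start path.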
