Python files pdf rename - python

I have a file .pdf in a folder and I have a .xls with two-column. In the first column I have the filename without extension .pdf and in the second column, I have a value.
I need to open file .xls, match the value in the first column with all filenames in the folder and rename each file .pdf with the value in the second column.
Is it possible?
Thank you for your support
Angelo

You'll want to use the pandas library within python. It has a function called pandas.read_excel that is very useful for reading excel files. This will return a dataframe, which will allow you to use iloc or other methods of accessing the values in the first and second columns. From there, I'd recommend using os.rename(old_name, new_name), where old_name and new_name are the paths to where your .pdf files are kept. A full example of the renaming part looks like this:
import os
# Absolute path of a file
old_name = r"E:\demos\files\reports\details.txt"
new_name = r"E:\demos\files\reports\new_details.txt"
# Renaming the file
os.rename(old_name, new_name)
I've purposely left out a full explanation because you simply asked if it is possible to achieve your task, so hopefully this points you in the right direction! I'd recommend asking questions with specific reproducible code in the future, in accordance with stackoverflow guidelines.

I would encourage you to do this with a .csv file instead of a xls, as is a much easier format (requires 0 formatting of borders, colors, etc.).
You can use the os.listdir() function to list all files and folders in a certain directory. Check os built-in library docs for that. Then grab the string name of each file, remove the .pdf, and read your .csv file with the names and values, and the rename the file.
All the utilities needed are built-in python. Most are the os lib, other are just from csv lib and normal opening of files:
with open(filename) as f:
#anything you have to do with the file here
#you may need to specify what permits are you opening the file with in the open function

Related

Python: Excel file (xlsx) export with variable as the file path, using pandas

I defined an .xlsx file path as the variable output:
print(output)
r'C:\Users\Kev\Documents\Python code.xlsx'
I want to export a pandas dataframe as an .xlxs file, but need to have the file path as the output variable.
I can get the code to work with the file path. I've tried about a dozen ways (copying and/or piecing code together from documentation, stack overflow, blogs, etc.) and getting a variety of errors. None worked. Here is one that worked with the file path:
df = pd.DataFrame(file_list)
df.to_excel(r'C:\Users\Kev\Documents\Python code.xlsx', index=False)
I would want something like:
df.to_excel(output, index=False)
In any form or package, as long as it produces the same xlsx file and won’t need to be edited to change the file path and name (that would be done where the variable output is defined.
I've attempted several iterations on the XlsxWriter site, the openpyxl site, the pandas site, etc.
(with the appropriate python packages). Working in Jupyter Notebook, Python 3.8.
Any resources, packages, or code that will help me to use a variable in place of a file path for an xlsx export from a pandas dataframe?
Why I want it like this is a long story, but basically I'll have several places at the top of the code where myself and other (inexperienced) coders can quickly put file paths in and search for keywords (rather than hunt through code to find where to replace paths). The data itself is file paths that I'll iteratively search through (this is the beginning of a larger project).
try to put the path this way
output = "C://Users//Kev//Documents//Python code.xlsx"
df.to_excel(output , index=False)
Always worked for me
or you can also do like
output = "C://Users//Kev//Documents//"
df.to_excel(output +"Python code.xlsx" , index=False)
os module would be the most useful here:
from os import path
output = path.abspath("your_excel_file.xlsx")
print(output)
this will return the current working directory path plus the file name you've put into the abspath function as a parameter. Also for those interested about why some people use backslash "\" and not forwardslash "/" while writing file paths here is a good stackoverflow answer for it So what IS the right direction of the path's slash (/ or \) under Windows?
You can use format strings with python3
import pandas as pd
df = pd.DataFrame({"a":"b"}, {"c": "d"})
file_name = "filename.xlsx"
df.to_excel(f"/your/path/to/file/{file_name}", index=False)
Assuming that OP's dataframe is df, that OP is using Windows and wants to store the file in the Desktop, OP's username is cowboykevin05, and the filename that one wants is 0001.xlsx, one can use os.path as follows
from os import path
df.to_excel(path.join('C:\\Users\\cowboykevin05\\Desktop', '0001.xlsx'), index=False)

Duplicate in list created from filenames (python)

I'm trying to create a list of excel files that are saved to a specific directory, but I'm having an issue where when the list is generated it creates a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path =r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
you'll note 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates I was hoping to figure out why it's happening to begin with.
I'm using python 3.7 on windows 10 pro
If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)) you'd see that you still have two filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~ (probably an auto-backup).
Whenever you open an excel file it will create a ghost copy that works as a temporary backup copy for that specific file. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial1.xlsx
~$ Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial1.xlsx
This means that the file is open by some software and it's showing you that backup inside(usually that file is hidden from the explorer as well)
Just search for the program and close it. Other actions, such as adding validation so the "~$.*.xlsx" type of file is ignored should be also implemented if this is something you want to avoid.
You can use os.path.splittext to get the file extension and loop through the directory using os.listdir . The open excel files can be skipped using the following code:
filenames = []
for file in os.listdir('D:\larvalSchooling\data'):
filename, file_extension = os.path.splitext(file)
if file_extension == '.xlsx':
if not file.startswith('~$'):
filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)

How to open files in a particular folder with randomly generated names?

How to open files in a particular folder with randomly generated names? I have a folder named 2018 and the files within that folder are named randomly. I want to iterate through all of the files and open them up.
I will post three names of the files as an example but note that there are over a thousand files in this folder so it has to work on a large scale without any hard coding.
0a2ec2da-628d-417d-9520-b0889886e2ac_1.xml
00a6b260-951d-46b5-ab27-b2e8729e664d_1.xml
00a6b260-951d-46b5-ab27-b2e8729e664d_2.xml
You're looking for os.walk().
In general, if you want to do something with files, it's worth glancing at the os, os.path, pathlib and other built-in modules. They're all documented.
You could also use glob expansion to expand "folder/*" into a list of all the filenames, but os.walk is probably better.
With os.listdir() or os.walk(), depending on whether you want to do it recursively or not.
You can go through the python doc
https://docs.python.org/3/library/os.html#os.walk
https://docs.python.org/3/library/os.html#os.listdir
One you have list of files you can read it simply -
for file in files:
with open(file, "r") as f:
# perform file operations

Evaluating File Paths in Excel

I have an ever increasing list of file paths (i have around 5000 records now) in Excel. More specifically, I have a certain unique identifier in column A and in Column B, I have a file path that leads to a picture for that unique identifier.
The process of adding the file paths is very manual and sometimes mistakes occur. So, I wanted to create a code that goes through each one of this file paths and if file path doesn't open/returns an error, to store these values in a list so that I can go directly to those and fix the file path.
I was thinking of writing a Python code that checks the File Path in Google Chrome URL (I have found it to work better than directly clicking the Hyperlink in Excel), but it's been a while since I have used Python and don't know where to start.
Any recommendation/ideas of how to achieve this?
Thank you,
Ricardo G.
To read excel files, I prefer to use the pandas library, specifically the read_excel function. You can also check if a filepath is a valid, existing file in your filesystem using the os.path module. os.path.isfile returns True if the provided path points to an actual file, so you want to use a list comprehension with a filter to only have filepaths where that is not the case.
import pandas as pd
import os
df = pd.read_excel('path/to/excel')
bad_files = [fp for fp in df['filepath_column'] if !os.path.isfile(path)]
I'm not sure what you mean by check with google chrome, but if you're talking about local files, this should work well for you.

Determine Filename of Unzipped File

Say you unzip a file called file123.zip with zipfile.ZipFile, which yields an unzipped file saved to a known path. However, this unzipped file has a completely random name. How do you determine this completely random filename? Or is there some way to control what the name of the unzipped file is?
I am trying to implement this in python.
By "random" I assume that you mean that the files are named arbitrarily.
You can use ZipFile.read() which unzips the file and returns its contents as a string of bytes. You can then write that string to a named file of your choice.
from zipfile import ZipFile
with ZipFile('file123.zip') as zf:
for i, name in enumerate(zf.namelist()):
with open('outfile_{}'.format(i), 'wb') as f:
f.write(zf.read(name))
This will write each file from the archive to a file named output_n in the current directory. The names of the files contained in the archive are obtained with ZipFile.namelist(). I've used enumerate() as a simple method of generating the file names, however, you could substitute that with whatever naming scheme you require.
If the filename is completely random you can first check for all filenames in a particular directory using os.listdir(). Now you know the filename and can do whatever you want with it :)
See this topic for more information.

Categories

Resources