I have an ever-increasing list of file paths (I have around 5000 records now) in Excel. More specifically, I have a certain unique identifier in column A and, in column B, a file path that leads to a picture for that unique identifier.
The process of adding the file paths is very manual and sometimes mistakes occur. So I wanted to write some code that goes through each one of these file paths and, if a file path doesn't open or returns an error, stores those values in a list so that I can go directly to them and fix the file path.
I was thinking of writing Python code that checks the file path as a URL in Google Chrome (I have found that to work better than directly clicking the hyperlink in Excel), but it's been a while since I have used Python and I don't know where to start.
Any recommendation/ideas of how to achieve this?
Thank you,
Ricardo G.
To read Excel files, I prefer to use the pandas library, specifically the read_excel function. You can also check whether a file path is a valid, existing file in your filesystem using the os.path module. os.path.isfile returns True if the provided path points to an actual file, so you want to use a list comprehension with a filter to keep only the file paths where that is not the case.
import pandas as pd
import os
# Load the spreadsheet; 'filepath_column' is whatever your path column is named
df = pd.read_excel('path/to/excel')
# Keep only the paths that do not point to an existing file
bad_files = [fp for fp in df['filepath_column'] if not os.path.isfile(fp)]
I'm not sure what you mean by checking with Google Chrome, but if you're talking about local files, this should work well for you.
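If you also want to know which identifier each bad path belongs to, here is a minimal sketch along the same lines (the column names 'identifier' and 'filepath_column' are assumptions; use whatever your sheet actually calls them):
import os

import pandas as pd

df = pd.read_excel('path/to/excel')

# Pair each identifier (column A) with its path (column B) and keep only the broken ones
bad_rows = [
    (row['identifier'], row['filepath_column'])
    for _, row in df.iterrows()
    if not os.path.isfile(row['filepath_column'])
]

for identifier, path in bad_rows:
    print(identifier, path)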
I have .pdf files in a folder and an .xls file with two columns. In the first column I have the filename without the .pdf extension, and in the second column I have a value.
I need to open the .xls file, match the value in the first column against the filenames in the folder, and rename each .pdf file to the value in the second column.
Is it possible?
Thank you for your support
Angelo
You'll want to use the pandas library within Python. It has a function called pandas.read_excel that is very useful for reading Excel files. This returns a DataFrame, which lets you use iloc or other methods to access the values in the first and second columns. From there, I'd recommend using os.rename(old_name, new_name), where old_name and new_name are the paths to where your .pdf files are kept. A full example of the renaming part looks like this:
import os
# Absolute path of a file
old_name = r"E:\demos\files\reports\details.txt"
new_name = r"E:\demos\files\reports\new_details.txt"
# Renaming the file
os.rename(old_name, new_name)
I've purposely left out a full explanation because you simply asked if it is possible to achieve your task, so hopefully this points you in the right direction! I'd recommend asking questions with specific reproducible code in the future, in accordance with Stack Overflow guidelines.
I would encourage you to do this with a .csv file instead of an .xls, as it is a much easier format (it requires no formatting of borders, colors, etc.).
You can use the os.listdir() function to list all files and folders in a certain directory; check the built-in os library docs for that. Then grab the string name of each file, strip the .pdf extension, read your .csv file with the names and values, and then rename the file. A sketch tying these steps together is shown after the snippet below.
All the utilities needed are built into Python. Most are in the os lib; the others are just from the csv lib and normal opening of files:
with open(filename) as f:
    # anything you have to do with the file goes here
    # you may need to specify the mode you are opening the file with in the open() call
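As a rough sketch of how those pieces could fit together (the folder path, the CSV location, and the column order are assumptions about your setup):
import csv
import os

folder = 'path/to/pdf/folder'       # assumed location of the .pdf files
mapping_file = 'path/to/names.csv'  # assumed CSV: old name (no extension), new value

# Build a lookup from the old filename (without .pdf) to the new value
with open(mapping_file, newline='') as f:
    mapping = {row[0]: row[1] for row in csv.reader(f)}

for filename in os.listdir(folder):
    name, ext = os.path.splitext(filename)
    if ext == '.pdf' and name in mapping:
        os.rename(
            os.path.join(folder, filename),
            os.path.join(folder, mapping[name] + '.pdf'),
        )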
I defined an .xlsx file path as the variable output:
output = r'C:\Users\Kev\Documents\Python code.xlsx'
I want to export a pandas DataFrame as an .xlsx file, but I need to pass the file path via the output variable.
I can get the code to work with the file path written out directly. I've tried about a dozen ways of using the variable instead (copying and/or piecing code together from documentation, Stack Overflow, blogs, etc.) and got a variety of errors; none worked. Here is one that worked with the file path written out:
df = pd.DataFrame(file_list)
df.to_excel(r'C:\Users\Kev\Documents\Python code.xlsx', index=False)
I would want something like:
df.to_excel(output, index=False)
In any form or package, as long as it produces the same .xlsx file and won't need to be edited to change the file path and name (that would be done where the variable output is defined).
I've attempted several of the examples from the XlsxWriter site, the openpyxl site, the pandas site, etc.
(with the appropriate Python packages). I'm working in Jupyter Notebook with Python 3.8.
Any resources, packages, or code that will help me to use a variable in place of a file path for an xlsx export from a pandas dataframe?
Why I want it like this is a long story, but basically I'll have several places at the top of the code where I and other (inexperienced) coders can quickly put file paths in and search for keywords (rather than hunting through the code to find where to replace paths). The data itself is file paths that I'll iteratively search through (this is the beginning of a larger project).
Try writing the path this way:
output = "C:/Users/Kev/Documents/Python code.xlsx"
df.to_excel(output, index=False)
That has always worked for me.
Or you can also do it like this:
output = "C:/Users/Kev/Documents/"
df.to_excel(output + "Python code.xlsx", index=False)
The os module would be the most useful here:
from os import path
output = path.abspath("your_excel_file.xlsx")
print(output)
This will return the current working directory path plus the file name you've passed into the abspath function as a parameter. Also, for those interested in why some people use a backslash "\" and not a forward slash "/" when writing file paths, here is a good Stack Overflow answer on it: So what IS the right direction of the path's slash (/ or \) under Windows?
You can use format strings (f-strings) with Python 3:
import pandas as pd
df = pd.DataFrame({"a":"b"}, {"c": "d"})
file_name = "filename.xlsx"
df.to_excel(f"/your/path/to/file/{file_name}", index=False)
Assuming that OP's dataframe is df, that OP is using Windows and wants to store the file on the Desktop, that OP's username is cowboykevin05, and that the desired filename is 0001.xlsx, one can use os.path as follows:
from os import path
df.to_excel(path.join('C:\\Users\\cowboykevin05\\Desktop', '0001.xlsx'), index=False)
I'm trying to create a list of Excel files that are saved to a specific directory, but I'm having an issue where, when the list is generated, it contains a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path = r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
You'll note 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates I was hoping to figure out why it's happening to begin with.
I'm using python 3.7 on windows 10 pro
If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)) you'd see that you still have two filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~ (probably an auto-backup).
Whenever you open an Excel file, Excel creates a ghost copy that works as a temporary backup for that specific file. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial1.xlsx
~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial1.xlsx
This means that the file is open in some software, and glob is picking up that backup as well (usually that file is hidden in Explorer too).
Just find the program and close it. Other measures, such as adding validation so that "~$*.xlsx"-style files are ignored, should also be implemented if this is something you want to avoid; a small sketch of such a filter follows.
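For example, a minimal sketch of that kind of validation on top of the glob call from the question (the path is the one from the question; adjust as needed):
import glob
import os

path = r'D:\larvalSchooling\data'

# Keep only workbooks whose names do not start with the Office lock-file prefix "~$"
filenames = [
    fp for fp in glob.glob(path + "/*.xlsx")
    if not os.path.basename(fp).startswith('~$')
]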
You can use os.path.splitext to get the file extension and loop through the directory using os.listdir. The lock copies of open Excel files can be skipped using the following code:
import os

filenames = []
for file in os.listdir(r'D:\larvalSchooling\data'):
    filename, file_extension = os.path.splitext(file)
    if file_extension == '.xlsx':
        if not file.startswith('~$'):
            filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)
I am trying to rename a file that is auto-generated by some local modules, but I was wondering if using os.listdir is the only way for me to filter/narrow down this file.
This file will always be generated before it is removed, and the code will generate the next one (still in the same directory) based on the next item in the list.
Basically, whenever this file is generated, it comes in the following file path:
/user_data/.tmp/tempLibFiles/2015_03_16_192212_182096.con
I only want to rename the 2015_03_16_192212_182096 part to connectionFile while keeping the rest the same.
You can also use the glob module to narrow down the list of files to the one that matches a particular pattern. For example:
import glob
files = glob.glob('/user_data/.tmp/tempLibFiles/*.con')
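From there, a rough sketch of the rename itself, assuming there is exactly one matching .con file in that directory at a time (as the question suggests):
import glob
import os

files = glob.glob('/user_data/.tmp/tempLibFiles/*.con')

if files:
    # e.g. .../2015_03_16_192212_182096.con -> .../connectionFile.con
    old_path = files[0]
    new_path = os.path.join(os.path.dirname(old_path), 'connectionFile.con')
    os.rename(old_path, new_path)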
I would like to check for the existence of a .gz file on my Linux machine while running Python. If I do this for a text file, the following code works:
import os.path
os.path.isfile('bob.asc')
However, if bob.asc is in gzip format (bob.asc.gz), Python does not recognize it. Ideally, I would like to use os.path.isfile or something very concise (without writing new functions). Is it possible to make the file recognizable either in Python or by changing something in my system configuration?
Unfortunately I can't change the data format or the file names as they are being given to me in batch by a corporation.
After fooling around for a bit, the most concise way I could get the job done was
subprocess.call(['ls','bob.asc.gz']) == 0
which returns True if the file exists in the directory. This is the behavior I would expect from
os.path.isfile('bob.asc.gz')
but for some reason Python won't accept files with extension .gz as files when passed to os.path.isfile.
I don't feel like my solution is very elegant, but it is concise. If someone has a more elegant solution, I'd love to see it. Thanks.
Of course it doesn't; they are completely different files. You need to test it separately:
os.path.isfile('bob.asc.gz')
This would return True if that exact file was present in the current working directory.
Although a workaround could be:
from os import listdir, getcwd
from os.path import splitext, basename
any(splitext(basename(f))[0] == 'bob.asc' for f in listdir(getcwd()))
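An arguably more concise variant of the same idea might use glob, which matches 'bob.asc' followed by anything, including nothing (this is a sketch of the approach, not the original poster's code):
import glob

# True if 'bob.asc' exists in any form in the current directory
# ('bob.asc' itself, 'bob.asc.gz', etc.)
exists = bool(glob.glob('bob.asc*'))
print(exists)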
You need to test each file. For example:
if any(map(os.path.isfile, ['bob.asc', 'bob.asc.gz'])):
    print('yay')