Python find broken symlinks caused by Git - python

I'm currently working on a script that goes through a cloned Git repo and removes any existing symlinks (broken or not) before recreating them for the user, to ease project initialization.
The solved problems:
Finding symbolic links (broken or unbroken) in Python is very well documented.
GitHub/GitLab breaking symlinks upon downloading/cloning/updating repositories is also very well documented as is how to fix this problem. Tl;dr: Git will download symlinks within the repo as plain text files (with no extension) containing only the symlink path if certain config flags are not set properly.
The unsolved problem:
My problem is that developers may download this repo without realizing the issues with Git, and end up with the symbolic links "checked out as small plain files that contain the link text" which is completely undetectable (as far as I can tell) when parsing the cloned files/directories (via existing base libraries). Running os.stat() on one of these broken symlink files returns information as though it were a normal text file:
os.stat_result(st_mode=33206, st_ino=14073748835637675, st_dev=2149440935, st_nlink=1, st_uid=0, st_gid=0, st_size=42, st_atime=1671662717, st_mtime=1671662699, st_ctime=1671662699)
The st_mode information only indicates that it is a normal text file- 100666 (the first 3 digits are the file type and the last 3 are the UNIX-style permissions). It should show up as 120000.
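The mode arithmetic above can be checked with the standard library's stat module. This is only a small sketch using the st_mode value quoted in the question. Note that os.stat() follows symlinks, so even for a real symlink the link type bits are only visible through os.lstat().

```python
import stat

mode = 33206  # st_mode reported by os.stat() in the question

print(oct(mode))                # 0o100666
print(oct(stat.S_IFMT(mode)))   # 0o100000 -> file type bits: regular file
print(oct(stat.S_IMODE(mode)))  # 0o666    -> permission bits
print(stat.S_ISREG(mode))       # True
print(stat.S_ISLNK(mode))       # False
print(stat.S_ISLNK(0o120000))   # True: the mode Git records for symlinks
```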
The os.path.islink() function only ever returns False.
THE CONFUSING PART is that when I run git ls-files -s ./microservices/service/symlink_file it gives 120000 as the mode bits which, according to the documentation, indicates that this file is a symlink. However, I cannot figure out how to see this information from within Python.
I've tried a bunch of things to try and find and delete existing symlinks. Here's the base method that just finds symlink directories and then deletes them:
def clearsymlinks(cwd: str = ""):
    """
    Crawls starting from root directory and deletes all symlinks
    within the directory tree.
    """
    if not cwd:
        cwd = os.getcwd()
    print(f"Clearing symlinks - starting from {cwd}")
    # Create a queue
    cwd_dirs: list[str] = [cwd]
    while len(cwd_dirs) > 0:
        processing_dir: str = cwd_dirs.pop()
        # print(f"Processing {processing_dir}") # Only when debugging - else it's too much output spam
        for child_dir in os.listdir(processing_dir):
            child_dir_path = os.path.join(processing_dir, child_dir)
            # check if current item is a directory
            if not os.path.isdir(child_dir_path):
                if os.path.islink(child_dir_path):
                    print(f"-- Found symbolic link file {child_dir_path} - removing.\n")
                    os.remove(child_dir_path)
                # skip the dir checking
                continue
            # Check if the child dir is a symlink
            if os.path.islink(child_dir_path):
                print(f"-- Found symlink directory {child_dir_path} - removing.")
                os.remove(child_dir_path)
            else:
                # Add the child dir to the queue
                cwd_dirs.append(child_dir_path)
After deleting symlinks I run os.symlink(symlink_src, symlink_dst) and generally run into the following error:
Traceback (most recent call last):
  File "C:\Users\me\my_repo\remakesymlinks.py", line 123, in main
    os.symlink(symlink_src, symlink_dst)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\me\\my_repo\\SharedPy' -> 'C:\\Users\\me\\my_repo\\microservices\\service\\symlink_file'
A workaround for this specific error (inside the create symlink method) is:
try:
    os.symlink(symlink_src, symlink_dst)
except FileExistsError:
    os.remove(symlink_dst)
    os.symlink(symlink_src, symlink_dst)
But this is not ideal because it doesn't prevent a huge list of defunct/broken symlinks from piling up in the directory. I should be able to find any symlinks (working, broken, non-existent, etc.) and then delete them.
I have a list of the symlinks that should be created by my script, but extracting the targets from this list is just another workaround, and it also causes a 'symlink-leak'. Below is how I'm currently finding the broken symlinks, purely for testing purposes.
if not os.path.isdir(child_dir_path):
    if os.path.basename(child_dir_path) in [s.symlink_install_to for s in dirs_to_process]:
        print(f"-- Found symlink file {child_dir_path} - removing.")
        os.remove(child_dir_path)
    # skip the dir checking
    continue
A rudimentary solution is to filter for only 'text/plain' files with exactly 1 line (since checking anything else is pointless) and try to determine whether that single line is just a file path (this seems excessive though):
# Check if Git downloaded the symlink as a plain text file (undetectable broken symlink)
if not os.path.isdir(child_dir_path):
    try:
        if magic.Magic(mime = True, uncompress = True).from_file(child_dir_path) == 'text/plain':
            with open(child_dir_path, 'r') as file:
                for i, line in enumerate(file):
                    if i >= 1:
                        raise StopIteration
                    else:
                        # this function is directly copied from https://stackoverflow.com/a/34102855/8705841
                        if is_path_exists_or_creatable(line):
                            print(f"-- Found broken Git link file '{child_dir_path}' - removing.")
                            print(f"\tContents: \"{line}\"")
                            # os.remove(child_dir_path)
                            raise StopIteration
    except (StopIteration, UnicodeDecodeError, magic.MagicException):
        file.close()
    continue
Clearly this solution would require a lot of refactoring (9 indents is pretty ridiculous) if it's the only viable option. The problem with this solution (currently) is that it also tries to delete any single-line file whose content does not break pathname requirements, e.g. import _virtualenv, some random test string, project-name. Filtering out files whose line contains spaces, or lacks slashes, could potentially work, but this still feels like chasing edge cases instead of solving the actual problem.
I could potentially rewrite this script in Bash wherein I could, in addition to existing symlink search/destroy code, parse the results from git ls-files -s ... and then delete any files with the 120000 tag. But is this method feasible and/or reliable? There should be a way to do this from within Python since Bash isn't going to run on every OS.
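The git ls-files idea does not require Bash: the same listing can be read from Python with subprocess. Below is a minimal sketch, assuming git is on PATH and repo_root lies inside the working tree; the function names are hypothetical. It uses -z so that entries are NUL-separated and paths are not quoted, and it relies on the `<mode> <object> <stage><TAB><path>` layout of git ls-files -s output.

```python
import os
import subprocess

SYMLINK_MODE = "120000"  # mode Git stores in the index for symlinks


def parse_symlink_entries(ls_files_output: str) -> list[str]:
    """Return the paths with mode 120000 from `git ls-files -s -z` output."""
    paths = []
    for entry in ls_files_output.split("\0"):
        if not entry:
            continue  # skip the empty piece after the final NUL
        meta, _, path = entry.partition("\t")
        mode = meta.split(" ", 1)[0]
        if mode == SYMLINK_MODE:
            paths.append(path)
    return paths


def find_git_symlinks(repo_root: str) -> list[str]:
    """Ask Git which tracked files are symlinks (hypothetical helper)."""
    result = subprocess.run(
        ["git", "ls-files", "-s", "-z"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    )
    # Paths are reported relative to the directory git was run in
    return [os.path.join(repo_root, p) for p in parse_symlink_entries(result.stdout)]
```

Because the index records 120000 for a symlink regardless of how it was checked out, the returned list covers both real symlinks and the plain-text placeholders. Each path could then be removed with os.remove() before the links are recreated.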
Note: file names have been redacted/changed after copy-paste for privacy, they shouldn't matter anyways since they are generated dynamically by the path searching functions

Related

How to access DVC-controlled files from Oracle?

I have been storing my large files in CLOBs within Oracle, but I am thinking of storing my large files in a shared drive, then having a column in Oracle contain pointers to the files. This would use DVC.
When I do this,
(a) are the paths in Oracle paths that point to the files in my shared drive, as in, the actual files themselves?
(b) or do the paths in Oracle point somehow to the DVC metafile?
Any insight would help me out!
Thanks :)
Justin
EDIT to provide more clarity:
I checked here (https://dvc.org/doc/api-reference/open), and it helped, but I'm not fully there yet ...
I want to pull a file from a remote dvc repository using python (which I have connected to the Oracle database). So, if we can make that work, I think I will be good. But, I am confused. If I specify 'remote' below, then how do I name the file (e.g., 'activity.log') when the remote files are all encoded?
with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
(NOTE: For testing purposes, my "remote" DVC directory is just another folder on my MacBook.)
I feel like I'm missing a key concept about getting remote files ...
I hope that adds more clarity. Any help figuring out remote file access is appreciated! :)
Justin
EDIT to get insights on 'rev' parameter:
Before my question, some background/my setup:
(a) I have a repo on my MacBook called 'basics'.
(b) I copied into 'basics' a directory of 501 files (called 'surface_files') that I subsequently pushed to a remote storage folder called 'gss'. After the push, 'gss' contains 220 hash directories.
The steps I used to get here are as follows:
> cd ~/Desktop/Work/basics
> git init
> dvc init
> dvc add ~/Desktop/Work/basics/surface_files
> git add .gitignore surface_files.dvc
> git commit -m "Add raw data"
> dvc remote add -d remote_storage ~/Desktop/Work/gss
> git commit .dvc/config -m "Configure remote storage"
> dvc push
> rm -rf ./.dvc/cache
> rm -rf ./surface_files
Next, I ran the following Python code to take one of my surface files, named surface_100141.dat, and used dvc.api.get_url() to get the corresponding remote storage file name. I then copied this remote storage file into my desktop under the file's original name, i.e., surface_100141.dat.
The code that does all this is as follows, but FIRST, MY QUESTION --- when I run the code as it is shown below, no problems; but when I uncomment the 'rev=' line, it fails. I am not sure why this is happening. I used git log and cat .git/refs/heads/master to make sure that I was getting the right hash. WHY IS THIS FAILING? That is my question.
(In full disclosure, my git knowledge is not too strong yet. I'm getting there, but it's still a work in progress! :))
import dvc.api
import os.path
from os import path
import shutil
filename = 'surface_100141.dat' # This file name would be stored in my Oracle database
home_dir = os.path.expanduser('~')+'/' # This simply expands '~' into '/Users/ricej/'
resource_url = dvc.api.get_url(
    path=f'surface_files/{filename}', # Works when 'surface_files.dvc' exists, even when 'surface_files' directory and .dvc/cache do not
    repo=f'{home_dir}Desktop/Work/basics',
    # rev='5c92710e68c045d75865fa24f1b56a0a486a8a45', # Commit hash, found using 'git log' or 'cat .git/refs/heads/master'
    remote='remote_storage')
resource_url = home_dir+resource_url
print(f'Remote file: {resource_url}')
new_dir = f'{home_dir}Desktop/' # Will copy fetched file to desktop, for demonstration
new_file = new_dir+filename
print(f'Remote file copy: {new_file}')
if path.exists(new_file):
    os.remove(new_file)
dest = shutil.copy(resource_url, new_file) # Check your desktop after this to see remote file copy
I'm not 100% sure that I understand the question (it would be great to expand it a bit on the actual use case you are trying to solve with this database), but I can share a few thoughts.
When we talk about DVC, I think you need to specify a few things to identify the file/directory:
Git commit + path (an actual path like data/data.xml). The commit (or, to be precise, any Git revision) is needed to identify the version of the data file.
Or the path in the DVC storage (/mnt/shared/storage/00/198493ef2343ao ...) + the actual name of this file. This way you would be saving the info that .dvc files have.
I would say that the second way is not recommended since, to some extent, it is an implementation detail of how DVC stores files internally. The public interface to DVC-organized data storage is its repository URL + commit + file name.
Edit (example):
with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
The path location/of/dvc/project must point to an actual Git repo. This repo should have a .dvc or dvc.lock file that contains the name activity.log plus its hash in the remote storage:
outs:
- md5: a304afb96060aad90176268345e10355
  path: activity.log
By reading this Git repo and analyzing, let's say, activity.log.dvc, DVC will be able to create the right path: s3://my-bucket/storage/a3/04afb96060aad90176268345e10355
The remote='my-s3-bucket' argument is optional. By default it will use the one that is defined in the repo itself.
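To make the hash-to-path step concrete, here is a small sketch of the mapping shown in the example above: the first two characters of the md5 become a directory and the rest becomes the file name. The function name is hypothetical and the layout is only what this answer's example shows. It is an implementation detail that may differ between DVC versions, so dvc.api.get_url() remains the safer way to obtain the location.

```python
def remote_object_path(remote_url: str, md5: str) -> str:
    """Build the storage path for a hash, following the layout in the example above."""
    # Assumption: <remote>/<first 2 hex chars>/<remaining hex chars>
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


print(remote_object_path("s3://my-bucket/storage", "a304afb96060aad90176268345e10355"))
# s3://my-bucket/storage/a3/04afb96060aad90176268345e10355
```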
Let's take another real example:
with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
In https://github.com/iterative/dataset-registry you can find the .dvc file, which is enough for DVC to build the path to the file once it has also analyzed the repo's config:
https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
You could run wget on this URL to download the file.

How do I access a file for reading/writing in a different (non-current) directory?

I am working on the listener portion of a backdoor program (for an ETHICAL hacking course) and I would like to be able to read files from any part of my linux system and not just from within the directory where my listener python script is located - however, this has not proven to be as simple as specifying a typical absolute path such as "~/Desktop/test.txt"
So far my code is able to read files and upload them to the virtual machine where my reverse backdoor script is actively running. But this is only when I read and upload files that are in the same directory as my listener script (aptly named listener.py). Code shown below.
def read_file(self, path):
    with open(path, "rb") as file:
        return base64.b64encode(file.read())
As I've mentioned previously, the above function only works if I try to open and read a file that is in the same directory as the script that the above code belongs to, meaning that path in the above content is a simple file name such as "picture.jpg"
I would like to be able to read a file from any part of my filesystem while maintaining the same functionality.
For example, I would love to be able to specify "~/Desktop/another_picture.jpg" as the path so that the contents of "another_picture.jpg" from my "~/Desktop" directory are base64 encoded for further processing and eventual upload.
Any and all help is much appreciated.
Edit 1:
My script where all the code is contained, "listener.py", is located in /root/PycharmProjects/virus_related/reverse_backdoor/. Within this directory is a file that, for simplicity's sake, we can call "picture.jpg". The same file, "picture.jpg", is also located on my desktop, at the absolute path "/root/Desktop/picture.jpg".
When I try read_file("picture.jpg"), there are no problems, the file is read.
When I try read_file("/root/Desktop/picture.jpg"), the file is not read and my terminal becomes stuck.
Edit 2:
I forgot to note that I am using the latest version of Kali Linux and Pycharm.
I have run "realpath picture.jpg" and it has yielded the path "/root/Desktop/picture.jpg"
Upon running read_file("/root/Desktop/picture.jpg"), I encounter the same problem where my terminal becomes stuck.
[FINAL EDIT aka Problem solved]:
Based on the answer suggesting reading a file like "../file", I realized that the code was fully functional, because read_file("../file") worked without any flaws, indicating that my Python script had no trouble locating the given path. Once the file was read, it was uploaded to the machine running my backdoor where, curiously, it ended up in the parent directory of the script. It was then that I realized that the problem lay in the handling of paths in the backdoor script rather than in my listener.py.
Credit is also due to the commenter who pointed out that "~" does not count as a valid path element. Once I reached the conclusion mentioned just above, I attempted read_file("~/Desktop/picture.jpg"), which failed. But with a quick modification, read_file("/root/Desktop/picture.jpg") executed successfully, and the file was uploaded to the same directory as my backdoor script on my target machine once I implemented some quick-fix code.
My apologies for not being so specific; efforts to aid were certainly confounded by the unmentioned complexity of my situation and I would like to personally thank everyone who chipped in.
This was my first whole-hearted attempt to reach out to the stackoverflow community for help and I have not been disappointed. Cheers!
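On the "~" point from the final edit: expanding "~" is something the shell does, and open() does not do it on its own. A minimal sketch, assuming read_file can be written as a plain function (the original is a method), uses os.path.expanduser() to do the expansion explicitly before opening the file.

```python
import base64
import os


def read_file(path: str) -> bytes:
    """Read a file from anywhere on the filesystem and return it base64-encoded."""
    # Expand a leading "~" to the home directory, then make the path absolute
    full_path = os.path.abspath(os.path.expanduser(path))
    with open(full_path, "rb") as file:
        return base64.b64encode(file.read())
```

With this change, read_file("~/Desktop/picture.jpg") and read_file("/root/Desktop/picture.jpg") would resolve to the same file for the root user.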
A solution I found is putting "../" before the filename if the file is in the directory directly above the one you run the script from.
test.py (in a directory directly inside "Desktop", i.e. /Desktop/test):
with open("../test.txt", "r") as test:
    print(test.readlines())
test.txt (in the directory "/Desktop")
Hi!
Hello!
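Result (readlines() keeps each trailing newline, so the exact output depends on whether test.txt ends with one):
['Hi!\n', 'Hello!']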
This is likely the simplest solution. I found this solution because I always use "cd ../" on the terminal.
This not only allows you to modify the current file, but all other files in the same directory as the one you are reading/writing to.
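import os
from PIL import Image  # Pillow

path = os.path.dirname(os.path.abspath(__file__))
for filename in os.listdir(path):
    full_path = os.path.join(path, filename)
    if not os.path.isfile(full_path):
        continue  # skip subdirectories
    # Read as bytes so non-text files do not raise decoding errors
    with open(full_path, 'rb') as f:
        content = f.read()
    print(filename, len(content))
    try:
        im = Image.open(full_path)
        im.show()
    except IOError:
        print('The following file is not an image type:', filename)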

Python - extract and modify a file path in all files in a directory in linux

I have .sh files and .json files containing file paths that point to a specific directory, but I have to keep changing those paths depending on where my Python script is run.
For example, one of my .sh files contains
"cd /home/aswany/BotStudioInstallation/databricks/platform/databricksastro"
and I need to change the path via Python code, because the prefix
"/home/aswany/BotStudioInstallation/" keeps changing depending on where databricks is located.
I tried the following code:
replaceAll(str(self.currentdirectory) +
           "/databricks/platform/devsettings.json",
           "/home/holmes/BotStudioInstallation", self.currentdirectory)
and function replaceAll is:
def replaceAll(self, file, searchExp, replaceExp):
    for line in fileinput.input(file, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)
But the above code only replaces the exact string "/home/holmes/BotStudioInstallation" with the directory I am currently in. I cannot be sure that "/home/holmes/BotStudioInstallation" is the only possibility, since it keeps changing, e.g. "/home/aswany/BotStudioInstallation", "/home/dev3/BotStudioInstallation", etc. I thought of using a regular expression for this.
Please help me.
Not sure I 100% understood your issue, but maybe I can help nonetheless.
As pointed out by J.F. Sebastian, you can use relative paths and remove the base part of the path. Using ./databricks/platform/devsettings.json might be enough. This is by far the most elegant solution.
If for any reason it is not, you can store only the part of the directory you need to access, then append it to the base directory whenever you need it. That should let you deal with changes in the base directory. However, if the files will also be used by applications other than your own, that might not be an option.
dir = get_dir_from_json()
dir_with_base = self.currentdirectory + dir
Alternatively, though it is not an elegant solution, you can avoid regex by using a placeholder "pattern" that you always replace.
{
    "directory": "<<_replace_me_>>/databricks/platform"
}
Then you know you can always replace "<<_replace_me_>>" with the base directory.
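The regular-expression route mentioned in the question can be sketched as follows. This is only an illustration under one assumption: the base directory always has the form /home/<user>/BotStudioInstallation. The names BASE_DIR_PATTERN and rebase_paths are hypothetical.

```python
import re

# Assumption: the base directory always looks like /home/<user>/BotStudioInstallation
BASE_DIR_PATTERN = re.compile(r"/home/[^/\s\"']+/BotStudioInstallation")


def rebase_paths(text: str, new_base: str) -> str:
    """Replace every /home/<user>/BotStudioInstallation prefix with new_base."""
    # A function replacement keeps re.sub from interpreting backslashes in new_base
    return BASE_DIR_PATTERN.sub(lambda match: new_base, text)


line = "cd /home/aswany/BotStudioInstallation/databricks/platform/databricksastro"
print(rebase_paths(line, "/opt/BotStudioInstallation"))
# cd /opt/BotStudioInstallation/databricks/platform/databricksastro
```

Inside replaceAll, the same substitution could be applied to each line in place of the plain str.replace() call.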

Get full path of currently open files

I'm trying to code a simple application that must read all currently open files within a certain directory.
More specifically, I want to get a list of files open anywhere inside my Documents folder,
but I don't want only the processes' IDs or process name, I want the full path of the open file.
The thing is I haven't quite found anything to do that.
I could do it neither in the Linux shell (using the ps and lsof commands) nor with Python's psutil library. Neither gives me the information I need, which is only the paths of currently open files in a directory.
Any advice?
P.S: I'm tagging this as python question (besides os related tags) because it would be a plus if it could be done using some python library.
This seems to work (on Linux):
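import subprocess
import shlex

cmd = shlex.split('lsof -F n +d .')
try:
    # text=True makes check_output return str instead of bytes (Python 3.7+)
    output = subprocess.check_output(cmd, text=True).splitlines()
except subprocess.CalledProcessError as err:
    # lsof may exit with a non-zero status; use whatever output it produced
    output = err.output.splitlines()
output = [line[3:] for line in output if line.startswith('n./')]
# Out[3]: ['file.tmp']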
It reads open files from the current directory, non-recursively.
For a recursive search, use the +D option. Keep in mind that this is vulnerable to a race condition: by the time you get your output, the situation might have changed already. It is always best to try to do something (open the file) and check for failure, e.g. open the file and catch the exception, or check for a null FILE value in C.
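If calling lsof is not an option, a different technique is to read the /proc filesystem directly. This is a Linux-only sketch using just the standard library, with a hypothetical function name, and it is not what the answer above does. It only sees processes whose /proc/<pid>/fd directory the current user may read, and it is subject to the same race condition.

```python
import os


def open_files_under(directory: str) -> set[str]:
    """Return full paths of files under `directory` that some process has open."""
    root = os.path.realpath(directory)
    found = set()
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                # Each entry is a symlink to whatever the descriptor refers to
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.startswith(root + os.sep):
                found.add(target)
    return found
```

A call such as open_files_under(os.path.expanduser("~/Documents")) would search recursively, since any path below the directory shares its prefix.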

Store user defined data after inputted

I am making a Python program, and I want to check whether it is the user's first time running the program (firstTime == True). After it has run, however, I want to permanently change firstTime to False. (There are other variables that I want to take input for on the first run and keep afterwards, but that should be solved the same way.)
Is there a better way than just reading from a file that contains the data? If not, how can I find where the file is being run from (so the data will be in the same dir)?
If you want to persist data, it will "eventually" be to disk files (though there might be intermediate steps, e.g. via a network or database system, eventually if the data is to be persistent it will be somewhere in disk files).
To "find out where you are",
import os
print(os.path.dirname(os.path.abspath(__file__)))
There are variants, but this is the basic idea. __file__ in any .py script or module gives the file path in which that file resides (won't work on the interactive command line, of course, since there's no file involved then;-).
The os.path module in Python's standard library has many useful functions to manipulate path strings -- here, we're using two: abspath to give an absolute (not relative) version of the file's path, so you don't have to care about what your current working directory is; and dirname to extract just the directory name (actually, the whole directory path;-) and drop the filename proper (you don't care if the module's name is foo.py or bar.py, only in what directory it is;-).
It is enough to just create a file in the same directory when the program is run for the first time (of course, that file can be deleted to trigger the first-run behavior again, which can sometimes be useful):
import os

firstrunfile = 'config.dat'
if not os.path.exists(firstrunfile):
    ## configuration here
    open(firstrunfile, 'w').close()  ## .write(configuration)
    print('First run')
    firstTime = True
else:
    print('Not first run')
    ## read configuration
    firstTime = False
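The two answers can be combined so that the marker file also stores the user's input. A minimal sketch, assuming JSON is an acceptable format; the function name load_settings and the file name config.json are hypothetical.

```python
import json
import os


def load_settings(config_path, defaults):
    """Return (settings, first_time); create the file from defaults on first run."""
    if os.path.exists(config_path):
        with open(config_path, "r", encoding="utf-8") as fh:
            return json.load(fh), False
    # First run: persist the defaults (or the values collected from the user)
    with open(config_path, "w", encoding="utf-8") as fh:
        json.dump(defaults, fh)
    return dict(defaults), True


# Typical use, keeping the file next to the script:
# here = os.path.dirname(os.path.abspath(__file__))
# settings, firstTime = load_settings(os.path.join(here, "config.json"), {"name": "guest"})
```

Anchoring the path to the script's directory avoids depending on the current working directory, which is the concern raised in the question.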
