How to push an existing file onto a GitLab repository using Python

Is there a way to push an existing file onto a GitLab project repository in Python, the way the git commit and git push commands do, instead of creating a new file?
I'm currently using the python-gitlab package, and as far as I can tell it only supports files.create, which creates a new file from supplied string contents. In my case that results in slightly different file content.
I'm looking for a way to push the file onto the repo untouched from Python. Can anyone help?

The Dec. 2013 0.5 version of gitlab/python-gitlab does mention:
Project: add methods for create/update/delete files (commit ba39e88)
So there should be a way to update an existing file, instead of creating a new one.
def update_file(self, path, branch, content, message):
    url = "/projects/%s/repository/files" % self.id
    url += "?file_path=%s&branch_name=%s&content=%s&commit_message=%s" % \
           (path, branch, content, message)
    r = self.gitlab.rawPut(url)
    if r.status_code != 200:
        raise GitlabUpdateError
In May 2016, for the 0.13 version, the file_* methods were deprecated in favor of the files manager.
warnings.warn("`update_file` is deprecated, "
              "use `files.update()` instead",
              DeprecationWarning)
That was documented in 0.15, Aug. 2016.
See docs/gl_objects/projects.rst
Update a file.
The entire content must be uploaded, as plain text or as base64 encoded text:
f.content = 'new content'
f.save(branch='master', commit_message='Update testfile')
# or for binary data
# Note: decode() is required with python 3 for data serialization. You can omit
# it with python 2
f.content = base64.b64encode(open('image.png', 'rb').read()).decode()
f.save(branch='master', commit_message='Update testfile', encoding='base64')
What I am looking for is to push an "existing local file" to an empty GitLab project repository
To create a new file:
f = project.files.create({'file_path': 'testfile.txt',
                          'branch': 'master',
                          'content': file_content,
                          'author_email': 'test@example.com',
                          'author_name': 'yourname',
                          'encoding': 'text',
                          'commit_message': 'Create testfile'})
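If the goal, as in the question, is to push an existing local file byte-for-byte, a minimal sketch of the same call, assuming the file is read as raw bytes and uploaded base64-encoded so line endings are preserved (the file and branch names are placeholders):

import base64

# Read the local file as raw bytes (no newline translation), then upload
# it base64-encoded so the content reaches GitLab unchanged.
with open('testfile.txt', 'rb') as local_file:
    raw = local_file.read()

created = project.files.create({'file_path': 'testfile.txt',
                                'branch': 'master',
                                'content': base64.b64encode(raw).decode(),
                                'encoding': 'base64',
                                'commit_message': 'Create testfile'})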
You can check the differences between a file created on GitLab (and cloned) with your own local file with
git diff --no-index --color --ws-error-highlight=new,old
I mentioned it in 2015 for better whitespace detection.
The OP Linightz confirms in the comments:
The file created by python-gitlab is missing a character (0x0D, the carriage return) at every line ending.
So I guess you're right.
However, I tried adding the core.autocrlf setting, adding newline='' to my file open statement, and reading in binary and decoding with different encodings; none of the above worked.
I decided to just use shell commands in Python to push the file and avoid all these troubles.
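For reference, a minimal sketch of that shell-command fallback with subprocess (the clone path, file name and branch are placeholders):

import subprocess

# Hypothetical fallback: drive the regular git CLI from Python so the file
# is committed and pushed exactly as it exists on disk.
repo_dir = '/path/to/local/clone'
subprocess.run(['git', 'add', 'testfile.txt'], cwd=repo_dir, check=True)
subprocess.run(['git', 'commit', '-m', 'Add testfile.txt'], cwd=repo_dir, check=True)
subprocess.run(['git', 'push', 'origin', 'master'], cwd=repo_dir, check=True)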

Related

Python find broken symlinks caused by Git

I'm currently working on a script that is supposed to go through a cloned Git repo and remove any existing symlinks (broken or not) before recreating them for the user, to ease project initialization.
The solved problems:
Finding symbolic links (broken or unbroken) in Python is very well documented.
GitHub/GitLab breaking symlinks upon downloading/cloning/updating repositories is also very well documented, as is how to fix this problem. Tl;dr: Git will download symlinks within the repo as plain text files (with no extension) containing only the symlink path if certain config flags are not set properly.
The unsolved problem:
My problem is that developers may download this repo without realizing the issues with Git, and end up with the symbolic links "checked out as small plain files that contain the link text" which is completely undetectable (as far as I can tell) when parsing the cloned files/directories (via existing base libraries). Running os.stat() on one of these broken symlink files returns information as though it were a normal text file:
os.stat_result(st_mode=33206, st_ino=14073748835637675, st_dev=2149440935, st_nlink=1, st_uid=0, st_gid=0, st_size=42, st_atime=1671662717, st_mtime=1671662699, st_ctime=1671662699)
The st_mode information only indicates that it is a normal text file: 100666 (the first three digits are the file type, the last three the UNIX-style permissions). For a symlink it should show up as 120000.
The os.path.islink() function only ever returns False.
THE CONFUSING PART is that when I run git ls-files -s ./microservices/service/symlink_file it gives 120000 as the mode bits, which, according to the documentation, indicates that this file is a symlink. However, I cannot figure out how to see this information from within Python.
I've tried a bunch of things to try and find and delete existing symlinks. Here's the base method that just finds symlink directories and then deletes them:
def clearsymlinks(cwd: str = ""):
    """
    Crawls starting from root directory and deletes all symlinks
    within the directory tree.
    """
    if not cwd:
        cwd = os.getcwd()
    print(f"Clearing symlinks - starting from {cwd}")
    # Create a queue
    cwd_dirs: list[str] = [cwd]
    while len(cwd_dirs) > 0:
        processing_dir: str = cwd_dirs.pop()
        # print(f"Processing {processing_dir}")  # Only when debugging - else it's too much output spam
        for child_dir in os.listdir(processing_dir):
            child_dir_path = os.path.join(processing_dir, child_dir)
            # check if current item is a directory
            if not os.path.isdir(child_dir_path):
                if os.path.islink(child_dir_path):
                    print(f"-- Found symbolic link file {child_dir_path} - removing.\n")
                    os.remove(child_dir_path)
                # skip the dir checking
                continue
            # Check if the child dir is a symlink
            if os.path.islink(child_dir_path):
                print(f"-- Found symlink directory {child_dir_path} - removing.")
                os.remove(child_dir_path)
            else:
                # Add the child dir to the queue
                cwd_dirs.append(child_dir_path)
After deleting symlinks I run os.symlink(symlink_src, symlink_dst) and generally run into the following error:
Traceback (most recent call last):
File "C:\Users\me\my_repo\remakesymlinks.py", line 123, in main
os.symlink(symlink_src, symlink_dst)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\me\\my_repo\\SharedPy' -> 'C:\\Users\\me\\my_repo\\microservices\\service\\symlink_file'
A workaround specifically for this error (inside the create-symlink method) is:
try:
    os.symlink(symlink_src, symlink_dst)
except FileExistsError:
    os.remove(symlink_dst)
    os.symlink(symlink_src, symlink_dst)
But this is not ideal because it doesn't prevent a huge list of defunct/broken symlinks from piling up in the directory. I should be able to find any symlinks (working, broken, non-existent, etc.) and then delete them.
I have a list of the symlinks that my script should create, but extracting the list of targets from it is another workaround that still causes a 'symlink leak'. Below is how I'm currently finding the broken symlinks, purely for testing purposes.
if not os.path.isdir(child_dir_path):
    if os.path.basename(child_dir_path) in [s.symlink_install_to for s in dirs_to_process]:
        print(f"-- Found symlink file {child_dir_path} - removing.")
        os.remove(child_dir_path)
    # skip the dir checking
    continue
A rudimentary solution is to filter for only 'text/plain' files with exactly 1 line (since checking anything else is pointless) and try to determine whether that single line is just a file path (this seems excessive though):
# Check if Git downloaded the symlink as a plain text file (undetectable broken symlink)
if not os.path.isdir(child_dir_path):
    try:
        if magic.Magic(mime=True, uncompress=True).from_file(child_dir_path) == 'text/plain':
            with open(child_dir_path, 'r') as file:
                for i, line in enumerate(file):
                    if i >= 1:
                        raise StopIteration
                    else:
                        # this function is directly copied from https://stackoverflow.com/a/34102855/8705841
                        if is_path_exists_or_creatable(line):
                            print(f"-- Found broken Git link file '{child_dir_path}' - removing.")
                            print(f"\tContents: \"{line}\"")
                            # os.remove(child_dir_path)
                            raise StopIteration
    except (StopIteration, UnicodeDecodeError, magic.MagicException):
        file.close()
        continue
Clearly this solution would require a lot of refactoring (9 indents is pretty ridiculous) if it's the only viable option. The problem with this solution (currently) is that it also tries to delete any single-line file containing a string that does not break pathname requirements, e.g. import _virtualenv, some random test string, project-name. Filtering out files with spaces, or files without slashes, could potentially work, but this still feels like chasing edge cases instead of solving the actual problem.
I could potentially rewrite this script in Bash wherein I could, in addition to existing symlink search/destroy code, parse the results from git ls-files -s ... and then delete any files with the 120000 tag. But is this method feasible and/or reliable? There should be a way to do this from within Python since Bash isn't going to run on every OS.
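For what it's worth, a hedged sketch of that idea from Python, shelling out to git ls-files -s and collecting everything Git records with mode 120000 (it assumes git is on PATH and repo_dir is inside the clone):

import subprocess

def find_git_symlinks(repo_dir):
    """Return repo-relative paths that git stores with mode 120000,
    even when they were checked out as plain text files."""
    out = subprocess.run(['git', 'ls-files', '-s'],
                         cwd=repo_dir, capture_output=True, text=True,
                         check=True).stdout
    links = []
    for line in out.splitlines():
        # each line looks like: "<mode> <object hash> <stage>\t<path>"
        meta, _, path = line.partition('\t')
        if meta.split()[0] == '120000':
            links.append(path)
    return links

Each returned path could then be removed with os.remove(), just like the detected symlinks above.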
Note: file names have been redacted/changed after copy-paste for privacy, they shouldn't matter anyways since they are generated dynamically by the path searching functions

How to copy and paste a file to a new folder in google drive using python? [duplicate]

I wrote a short function in Google Apps script that can make a copy of a specific file that is stored on Google Drive. The purpose of it is that this file is a template and every time I want to create a new document for work I make a copy of this template and just change the title of the document. The code that I wrote to make a copy of the file and store it in the specific folder that I want is very simple:
function copyFile() {
  var file = DriveApp.getFileById("############################################");
  var folder = DriveApp.getFolderById("############################");
  var filename = "Copy of Template";
  file.makeCopy(filename, folder);
}
This function takes a specific file and a specific folder, both by ID, and puts the copy, entitled "Copy of Template", into that folder.
I have been searching all over and I cannot seem to find this. Is there a way to do the exact same thing, but using Python instead? Or, at the very least, is there a way to have Python somehow call that Apps Script function? I need this to be done in Python because I am writing a script that performs many tasks at once whenever I start a new project for work, such as creating a new document from a template in Google Drive, as well as other things that are not related to Google Drive at all and therefore cannot be done in Google Apps Script.
There are a few tutorials around the web that give partial answers. Here is a step-by-step guide of what you need to do.
Open Command prompt and type (without the quotes) "pip install PyDrive"
Follow step one of the instructions at https://developers.google.com/drive/v3/web/quickstart/python to set up an account.
When that is done, click on Download JSON and a file will be downloaded. Make sure to rename that to client_secrets.json, not client_secret.json as the Quick Start says to do.
Next, make sure to put that file in the same directory as your python script. If you are running the script from a console, that directory might be your username directory.
I assume that you already know the folder id that you are placing this file in and file id that you are copying. If you don't know it, there are tutorials of how to find it using python or you can open it up in Docs and it will be in the URL of the file. Basically enter the ID of the folder and the ID of the file and when you run this script it will make a copy of the chosen file and place it in the chosen folder.
One thing to note is that while running, your browser window will open up and ask for permission, just click accept and then the script will complete.
In order for this to work you might have to enable the Google Drive API, which is in the API's section.
Python Script:
## Create a new Document in Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

folder = "########"
title = "Copy of my other file"
file = "############"

drive.auth.service.files().copy(fileId=file,
                                body={"parents": [{"kind": "drive#fileLink",
                                                   "id": folder}],
                                      'title': title}).execute()
From https://developers.google.com/drive/v2/reference/files/copy
from apiclient import errors
# ...

def copy_file(service, origin_file_id, copy_title):
    """Copy an existing file.

    Args:
      service: Drive API service instance.
      origin_file_id: ID of the origin file to copy.
      copy_title: Title of the copy.

    Returns:
      The copied file if successful, None otherwise.
    """
    copied_file = {'title': copy_title}
    try:
        return service.files().copy(
            fileId=origin_file_id, body=copied_file).execute()
    except errors.HttpError, error:
        print 'An error occurred: %s' % error
        return None
With API v3:
Copy a file to a directory with a different name.
service.files().copy(fileId='PutFileIDHere', body={"parents": ['ParentFolderID'], 'name': 'NewFileName'} ).execute()
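If you are using google-api-python-client directly rather than PyDrive, a hedged sketch of the same v3 call; here creds stands for an already-authorized credentials object obtained elsewhere, and the IDs are placeholders:

from googleapiclient.discovery import build

# 'creds' is assumed to come from your own OAuth / service-account setup.
service = build('drive', 'v3', credentials=creds)
copied = service.files().copy(
    fileId='PutFileIDHere',
    body={'name': 'NewFileName', 'parents': ['ParentFolderID']},
).execute()
print(copied.get('id'))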
For me, the answer of @Rashi worked with a small modification.
instead of:
'name': 'NewFileName'
this worked:
'title': 'NewFileName'

How to access DVC-controlled files from Oracle?

I have been storing my large files in CLOBs within Oracle, but I am thinking of storing my large files in a shared drive, then having a column in Oracle contain pointers to the files. This would use DVC.
When I do this,
(a) are the paths in Oracle paths that point to the files in my shared drive, as in, the actual files themselves?
(b) or do the paths in Oracle point somehow to the DVC metafile?
Any insight would help me out!
Thanks :)
Justin
EDIT to provide more clarity:
I checked here (https://dvc.org/doc/api-reference/open), and it helped, but I'm not fully there yet ...
I want to pull a file from a remote dvc repository using python (which I have connected to the Oracle database). So, if we can make that work, I think I will be good. But, I am confused. If I specify 'remote' below, then how do I name the file (e.g., 'activity.log') when the remote files are all encoded?
with dvc.api.open(
        'activity.log',
        repo='location/of/dvc/project',
        remote='my-s3-bucket') as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
(NOTE: For testing purposes, my "remote" DVC directory is just another folder on my MacBook.)
I feel like I'm missing a key concept about getting remote files ...
I hope that adds more clarity. Any help figuring out remote file access is appreciated! :)
Justin
EDIT to get insights on 'rev' parameter:
Before my question, some background/my setup:
(a) I have a repo on my MacBook called 'basics'.
(b) I copied into 'basics' a directory of 501 files (called 'surface_files') that I subsequently pushed to a remote storage folder called 'gss'. After the push, 'gss' contains 220 hash directories.
The steps I used to get here are as follows:
> cd ~/Desktop/Work/basics
> git init
> dvc init
> dvc add ~/Desktop/Work/basics/surface_files
> git add .gitignore surface_files.dvc
> git commit -m "Add raw data"
> dvc remote add -d remote_storage ~/Desktop/Work/gss
> git commit .dvc/config -m "Configure remote storage"
> dvc push
> rm -rf ./.dvc/cache
> rm -rf ./surface_files
Next, I ran the following Python code to take one of my surface files, named surface_100141.dat, and used dvc.api.get_url() to get the corresponding remote storage file name. I then copied this remote storage file into my desktop under the file's original name, i.e., surface_100141.dat.
The code that does all this is as follows, but FIRST, MY QUESTION --- when I run the code as it is shown below, no problems; but when I uncomment the 'rev=' line, it fails. I am not sure why this is happening. I used git log and cat .git/refs/heads/master to make sure that I was getting the right hash. WHY IS THIS FAILING? That is my question.
(In full disclosure, my git knowledge is not too strong yet. I'm getting there, but it's still a work in progress! :))
import dvc.api
import os.path
from os import path
import shutil

filename = 'surface_100141.dat'  # This file name would be stored in my Oracle database
home_dir = os.path.expanduser('~') + '/'  # This simply expands '~' into '/Users/ricej/'

resource_url = dvc.api.get_url(
    path=f'surface_files/{filename}',  # Works when 'surface_files.dvc' exists, even when 'surface_files' directory and .dvc/cache do not
    repo=f'{home_dir}Desktop/Work/basics',
    # rev='5c92710e68c045d75865fa24f1b56a0a486a8a45',  # Commit hash, found using 'git log' or 'cat .git/refs/heads/master'
    remote='remote_storage')
resource_url = home_dir + resource_url
print(f'Remote file: {resource_url}')

new_dir = f'{home_dir}Desktop/'  # Will copy fetched file to desktop, for demonstration
new_file = new_dir + filename
print(f'Remote file copy: {new_file}')

if path.exists(new_file):
    os.remove(new_file)
dest = shutil.copy(resource_url, new_file)  # Check your desktop after this to see remote file copy
I'm not 100% sure that I understand the question (it would be great to expand it a bit on the actual use case you are trying to solve with this database), but I can share a few thoughts.
When we talk about DVC, I think you need to specify a few things to identify the file/directory:
Git commit + path (actual path like data/data/xml). Commit (or to be precise any Git revision) is needed to identify the version of the data file.
Or path in the DVC storage (/mnt/shared/storage/00/198493ef2343ao ...) + the actual name of this file. This way you would be saving the info that .dvc files have.
I would say the second way is not recommended since, to some extent, it's an implementation detail (how DVC stores files internally). The public interface to DVC-organized data storage is its repository URL + commit + file name.
Edit (example):
with dvc.api.open(
        'activity.log',
        repo='location/of/dvc/project',
        remote='my-s3-bucket') as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
location/of/dvc/project: this path must point to an actual Git repo. That repo should have a .dvc or dvc.lock file that contains the activity.log name plus its hash in the remote storage:
outs:
- md5: a304afb96060aad90176268345e10355
  path: activity.log
By reading this Git repo and analyzing, let's say, activity.log.dvc, DVC will be able to construct the right path: s3://my-bucket/storage/a3/04afb96060aad90176268345e10355
The remote='my-s3-bucket' argument is optional. By default DVC will use the remote that is defined in the repo itself.
Let's take another real example:
with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry') as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
In https://github.com/iterative/dataset-registry you can find the .dvc file which, together with the repo's config, is enough for DVC to construct the path to the file:
https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
You can run wget on that URL to download the file.
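The same lookup can be done from Python with dvc.api.get_url, which resolves a tracked path to its storage URL without downloading it; rev is optional and defaults to the repository's default branch:

import dvc.api

url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry',
    # rev='master',  # any Git revision: commit hash, branch or tag
)
print(url)  # e.g. https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355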

Python API for Github, getting contents in specific directory for specific branch not returning all content

Using the PyGithub API, I am attempting to retrieve all contents from a specific folder from a specific branch of a repository hosted with Github. I can't share the actual repository or specifics regarding the data, but the code I am using is this:
import github
import json
import requests
import base64
from collections import namedtuple

Package = namedtuple('Package', 'name version')

# Parameters
gh_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
header = {"Authorization": f"token {gh_token}"}
gh_hostname = 'devtopia.xxx.com'
gh = github.Github(base_url=f'https://{gh_hostname}/api/v3', login_or_token=gh_token)
repo_name = "xxxxxxxxx/SupportFiles"
conda_meta = "xxxxxxx/bin/Python/envs/xxxxxx-xx/conda-meta"
repo = gh.get_repo(repo_name)

def parse_conda_meta(branch):
    package_list = []
    meta_contents = repo.get_contents(conda_meta, ref=branch)  # << Returns less files than expected for
                                                               #    a specified branch "xxx/release/3.2.0",
                                                               #    returns expected number of files for
                                                               #    "master" branch.
    for i, pkg in enumerate(meta_contents):
        if ".json" in pkg.name:  # filter for JSON files
            print(i, pkg.name)
            # Need to use GitHub Data API (REST) blobs instead of easier
            # `github` with `pkg.decoded_content` here because that method
            # only works with files <= 1MB whereas Data API allows for
            # reading files <= 100MB.
            resp = requests.get(f"https://devtopia.xxxx.com/api/v3/repos/xxxxxxxxx/SupportFiles/git/blobs/{pkg.sha}?ref={branch}", headers=header)
            pkg_cont = json.loads(base64.b64decode(json.loads(resp.content)["content"]))
            package_list.append(Package(pkg_cont['name'], pkg_cont['version']))
        else:
            print('>>', i, pkg.name)
    return package_list

if __name__ == "__main__":
    pkgs = parse_conda_meta("xxx/release/3.2.0")
    print(pkgs)
    print(len(pkgs))
For some reason that I can't get to the bottom of, I am not getting the correct number of files returned by repo.get_contents(conda_meta, ref=branch). For the branch that I am specifying, when that branch is checked out I am seeing 186 files in the conda-meta folder. However, repo.get_contents(conda_meta, ref=branch) returns only 182, I am missing four JSON files.
Is there some limitation to repo.get_contents that I'm not aware of? I've been reading the docs but can't find anything that hints at the problem I am having. There is one bit about it only handling files up to 1 MB, but I am seeing files larger than this returned (e.g. python is 1.204 MB and is returned in the list of files). I believe this just applies to reading file content over 1 MB, which I deal with by using the GitHub Data API (REST) further downstream. Is there something I'm doing wrong here?
Thanks for reading, any help with this is much appreciated!
Update with solution!
The Problem:
After some more digging, I have found the problem's cause. It's not to do with the code above or repo.get_contents(conda_meta, ref=branch) specifically. It is actually a unix/windows clash that was mistakenly introduced into our repository for this specific branch "xxx/release/3.2.0" but not present in others.
So what was the problem? NTFS (and Windows more broadly) is case-insensitive by default, but Git is from a Unix world and is case-sensitive by default.
We inadvertently created two folders for Python in the bin directory of the conda_meta path (xxxxxx/bin/): one called "Python" and one called "python" (note the lower case). When pulling the repository locally, only the "Python" folder shows up, containing all 186 files. On GitHub, however, the path with "Python" contains 182 files while the path with "python" contains the remaining 4 files.
The Solution:
The solution is to add a conda_meta_folders parameter that takes a list of paths to parse_conda_meta and to search each directory. There might be a slicker solution though; I'm looking into whether it is possible to do something like git config core.ignorecase true with the PyGithub API. Does anyone know if PyGithub can honor this or be configured for it?
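A hedged sketch of that conda_meta_folders idea, reusing the globals from the script above (the second folder path is a hypothetical lower-case variant):

def parse_conda_meta(branch, conda_meta_folders=None):
    # Hypothetical default: search both case variants that ended up on GitHub.
    if conda_meta_folders is None:
        conda_meta_folders = [
            "xxxxxxx/bin/Python/envs/xxxxxx-xx/conda-meta",
            "xxxxxxx/bin/python/envs/xxxxxx-xx/conda-meta",
        ]
    package_list = []
    for folder in conda_meta_folders:
        for pkg in repo.get_contents(folder, ref=branch):
            if ".json" not in pkg.name:
                continue
            resp = requests.get(
                f"https://devtopia.xxxx.com/api/v3/repos/xxxxxxxxx/SupportFiles/git/blobs/{pkg.sha}?ref={branch}",
                headers=header)
            pkg_cont = json.loads(base64.b64decode(json.loads(resp.content)["content"]))
            package_list.append(Package(pkg_cont['name'], pkg_cont['version']))
    return package_list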

committing a single file using python-hglib

I am trying to implement a rudimentary scm using python-hglib.
So far I have managed to connect to a repo (local) and I want to commit a single file among many.
I am not sure how to do this. Consider the following:
client = hglib.open(my_mercurial_repo)
root_repo=hglib.client.hgclient.root(client)
print "%s root" % root_repo
rev, node =client.commit('Simple commit message', addremove=False, user='user1')
This connects to my_mercurial_repo successfully, but when I get to the commit line I get this error:
'hglib.error.CommandError'>, CommandError('commit', '-m',
'Checkpoint', '-u', 'myself', '--debug')
However if I change it to:
rev, node =client.commit('Simple commit message', addremove=True,
user='user1')
It works fine. Looking at the documentation, addremove=True would mark new/missing files as added/removed before committing.
So I guess my question is: how do I commit a single file in a repository of n files using python-hglib?
Just a quick update: thanks to @kAlmAcetA's response I updated my code as suggested to include
client.add('/tmp/repo/somefile')
rev, node =client.commit('Simple commit message', addremove=False, user='user1')
When I do this, the error goes away the FIRST time the commit is executed.
If I execute the code again on the same file that I had opened, I still get the error.
So maybe what I am looking to do is to
Open a file (i'm ok with this)
Add some text to the file (i'm ok with this)
Commit the file
Add more text to the same file (i'm ok with this)
Commit the file
I am now struggling to do the commit-->edit-->commit loop for a single file.
Regards
Committing a single file using hglib.commit is not supported, but you can use hglib.rawcommand, which uses the same syntax as hg on the command line:
repo = hglib.open(path_to_your_repository)
repo.rawcommand(args=['commit', 'filename'])
The first element in args must be an hg command name. The remaining elements are any options you'd use with that command. In my actual code, I have:
repo.rawcommand(['commit', '-m', commit_msg, file_name])
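A short usage sketch of the commit --> edit --> commit loop with rawcommand; the repository path and file name are placeholders, and depending on your hglib version the arguments may need to be bytes rather than str:

import hglib

repo = hglib.open('/tmp/repo')                  # placeholder path
for i in range(2):
    with open('/tmp/repo/somefile', 'a') as f:  # edit the file
        f.write(f'change {i}\n')
    # commit only that one file, like `hg commit -m "..." somefile`
    repo.rawcommand(['commit', '-m', f'checkpoint {i}', 'somefile'])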
You must first add the file using the client's add method.
...
client.add('/tmp/repo/somefile')
rev, node =client.commit('Simple commit message', addremove=False, user='user1')
...
You only have to add the file the very first time (on success you will get True, otherwise False); subsequent modifications require only a commit.
Note: if you try to add the same file again, unfortunately you will get True as well, and the commit will then fail with an exception:
hglib.error.CommandError: (1, 'nothing changed', '')
It's probably good to wrap the commit in a try...except.
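For example, a minimal sketch of that wrapping (the repository and file paths are placeholders):

import hglib
from hglib.error import CommandError

client = hglib.open('/tmp/repo')     # placeholder path
client.add('/tmp/repo/somefile')      # returns True the first time
try:
    rev, node = client.commit('Simple commit message',
                              addremove=False, user='user1')
except CommandError as err:
    # e.g. (1, 'nothing changed', '') when nothing was modified
    print('commit failed:', err)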
If I understand you correctly, you want to commit a single file, no matter how many files in your working copy might have changed, just as you would on the command line:
hg commit ${file}
hglib’s commit method does not seem to provide for this:
def commit(self, message=None, logfile=None, addremove=False, closebranch=False,
           date=None, user=None, include=None, exclude=None):
    """
    Commit changes reported by status into the repository.

    message - the commit message
    logfile - read commit message from file
    addremove - mark new/missing files as added/removed before committing
    closebranch - mark a branch as closed, hiding it from the branch list
    date - record the specified date as commit date
    user - record the specified user as committer
    include - include names matching the given patterns
    exclude - exclude names matching the given patterns
    """
Committing just a single file doesn’t seem to be supported. hglib is supposed to wrap Mercurial’s commandline in some fashion, and the hg command can commit a single changed file, so this is weird. Then again, the documentation is scarce to the point of being non-existent, so it’s unclear if that’s by design or not.
I’d file a bug for this.
