Iterate over files in Databricks Repos - Python

I would like to iterate over some files in a folder that lives in Databricks Repos.
How would one do this? I don't seem to be able to access the files in Repos.
I have added a picture that shows the folders I would like to access (the dbrks & sql folders).
Thanks :)
Image of the repo folder hierarchy

You can read files from repo folders. The path is /mnt/repos/, which is the top-level folder you see when opening the Repos window. You can then iterate over these files yourself.
Whenever you find the file you want, you can read it with (for example) Spark. For example, to read a CSV file:
spark.read.format("csv").load(
    path, header=True, inferSchema=True, delimiter=";"
)
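For the iteration step, a minimal sketch using os.walk, assuming the repo contents are reachable from the driver's local file system at the path shown in the Repos window (the repo name and file extension below are hypothetical, stand-ins for the dbrks & sql folders from the question):

import os

repo_root = "/mnt/repos/my-repo"  # hypothetical; use the path your Repos window shows

# Collect all SQL files anywhere under the repo
sql_files = []
for dirpath, dirnames, filenames in os.walk(repo_root):
    for name in filenames:
        if name.endswith(".sql"):
            sql_files.append(os.path.join(dirpath, name))

for p in sql_files:
    print(p)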

If you just want to list files in the repositories, then you can use the list command of the Workspace REST API. Using it you can implement recursive listing of files. The actual implementation will differ based on your requirements, for example whether you need a flat list of full paths or a list grouped by subdirectory. It could look something like this (not tested):
import requests

my_pat = "generated personal access token"
workspace_url = "https://name-of-workspace"

def list_files(base_path: str):
    # List the objects directly under base_path via the Workspace API
    lst = requests.request(method='get',
                           url=f"{workspace_url}/api/2.0/workspace/list",
                           headers={"Authorization": f"Bearer {my_pat}"},
                           json={"path": base_path}).json()["objects"]
    results = []
    for i in lst:
        # Recurse into directories and repos, collect everything else
        if i["object_type"] == "DIRECTORY" or i["object_type"] == "REPO":
            results.extend(list_files(i["path"]))
        else:
            results.append(i["path"])
    return results

all_files = list_files("/Repos/<my-initial-folder>")
But if you want to read the content of the files in the repository, then you need to use the so-called Arbitrary Files support that is available since DBR 8.4.
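With that support enabled, files in the repo can be read with plain Python file APIs. A minimal sketch (the path below is a hypothetical placeholder; on recent runtimes repo files typically appear under /Workspace/Repos/<user>/<repo>):

# Requires Arbitrary Files support (DBR 8.4+); the path is a hypothetical placeholder
repo_file = "/Workspace/Repos/<user>/<repo>/sql/example.sql"

with open(repo_file, "r") as f:
    content = f.read()

print(content)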

Related

Listing files on Microsoft Azure Databricks

I'm working in Microsoft Azure Databricks. Using the ls command, I found that there is a CSV file present in the directory (first snippet below). But when I try to pick up the CSV file into a list using glob, it returns an empty list (second snippet below).
How can I list the contents of a directory in Databricks?
%fs
ls /FileStore/tables/26AS_report/normalised_consol_file_record_level/part1/customer_pan=AAACD3312M/
path = "/FileStore/tables/26AS_report/normalised_consol_file_record_level/part1/customer_pan=AAACD3312M/"
result = glob.glob(path+'/**/*.csv', recursive=True)
print(result)
glob is a local file-level operation that doesn't know anything about DBFS. If you want to use it, then you need to prepend /dbfs to your path:
path = "/dbfs/FileStore/tables/26AS_report/....."
I don't think you can use standard Python file system functions from the os.path or glob modules.
Instead, you should use the Databricks file system utility (dbutils.fs). See documentation.
Given your example code, you should do something like:
dbutils.fs.ls(path)
or
dbutils.fs.ls('dbfs:' + path)
This should give a list of files that you may have to filter yourself to only get the *.csv files.
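For the filtering step, a minimal sketch (dbutils is available only inside a Databricks notebook or job context):

# Keep only the CSV files from the dbutils.fs.ls listing
csv_files = [f.path for f in dbutils.fs.ls(path) if f.path.endswith(".csv")]
print(csv_files)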

Python API for Github, getting contents in specific directory for specific branch not returning all content

Using the PyGithub API, I am attempting to retrieve all contents from a specific folder in a specific branch of a repository hosted on GitHub. I can't share the actual repository or specifics regarding the data, but the code I am using is this:
import github
import json
import requests
import base64
from collections import namedtuple

Package = namedtuple('Package', 'name version')

# Parameters
gh_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
header = {"Authorization": f"token {gh_token}"}
gh_hostname = 'devtopia.xxx.com'
gh = github.Github(base_url=f'https://{gh_hostname}/api/v3', login_or_token=gh_token)
repo_name = "xxxxxxxxx/SupportFiles"
conda_meta = "xxxxxxx/bin/Python/envs/xxxxxx-xx/conda-meta"
repo = gh.get_repo(repo_name)

def parse_conda_meta(branch):
    package_list = []
    meta_contents = repo.get_contents(conda_meta, ref=branch)  # << Returns less files than expected for
                                                               #    a specified branch "xxx/release/3.2.0",
                                                               #    returns expected number of files for
                                                               #    "master" branch.
    for i, pkg in enumerate(meta_contents):
        if ".json" in pkg.name:  # filter for JSON files
            print(i, pkg.name)
            # Need to use GitHub Data API (REST) blobs instead of easier
            # `github` with `pkg.decoded_content` here because that method
            # only works with files <= 1MB whereas Data API allows for
            # reading files <= 100MB.
            resp = requests.get(f"https://devtopia.xxxx.com/api/v3/repos/xxxxxxxxx/SupportFiles/git/blobs/{pkg.sha}?ref={branch}", headers=header)
            pkg_cont = json.loads(base64.b64decode(json.loads(resp.content)["content"]))
            package_list.append(Package(pkg_cont['name'], pkg_cont['version']))
        else:
            print('>>', i, pkg.name)
    return package_list

if __name__ == "__main__":
    pkgs = parse_conda_meta("xxx/release/3.2.0")
    print(pkgs)
    print(len(pkgs))
For some reason that I can't get to the bottom of, I am not getting the correct number of files returned by repo.get_contents(conda_meta, ref=branch). For the branch that I am specifying, when that branch is checked out I see 186 files in the conda-meta folder. However, repo.get_contents(conda_meta, ref=branch) returns only 182; I am missing four JSON files.
Is there some limitation to repo.get_contents that I'm not aware of? I've been reading the docs but can't find anything that hints at the problem I am having. There is one bit about it only handling files up to 1 MB, but I am seeing files larger than this returned (e.g. python is 1.204 MB and is returned in the list of files). I believe this just applies to reading file content over 1 MB, which I deal with by using the GitHub Data API (REST) further downstream. Is there something I'm doing wrong here?
Thanks for reading, any help with this is much appreciated!
Update with solution!
The Problem:
After some more digging, I have found the cause of the problem. It's not to do with the code above or repo.get_contents(conda_meta, ref=branch) specifically. It is actually a Unix/Windows clash that was mistakenly introduced into our repository for this specific branch "xxx/release/3.2.0" but is not present in others.
So what was the problem? NTFS (and Windows more broadly) is case-insensitive by default, but Git comes from the Unix world and is case-sensitive by default.
We inadvertently created two folders for Python in the bin directory of the conda_meta path (xxxxxx/bin/), one called "Python" and one called "python" (note the lower case). When pulling the repository locally, only the "Python" folder shows up, containing all 186 files. On GitHub, however, the path with "Python" contains 182 files while the path with "python" contains the remaining 4 files.
The Solution:
The solution is to add a conda_meta_folders parameter to parse_conda_meta that takes a list of paths, and to search each directory. There might be a slicker solution, though; I'm looking into whether it is possible to do something like git config core.ignorecase true with the PyGithub API. Does anyone know if PyGithub can honor this or be configured for it?
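A minimal sketch of that workaround, reusing the names from the snippet above (the second path is hypothetical, standing in for the lower-case duplicate folder; the blob download and decoding are elided since they are unchanged):

def parse_conda_meta(branch, conda_meta_folders):
    package_list = []
    for folder in conda_meta_folders:
        # Same per-file handling as the original function, just repeated per folder
        meta_contents = repo.get_contents(folder, ref=branch)
        for i, pkg in enumerate(meta_contents):
            if ".json" in pkg.name:
                print(i, pkg.name)
                # ... same blob download and decoding as in the original function,
                #     appending Package(...) to package_list ...
            else:
                print('>>', i, pkg.name)
    return package_list

# Hypothetical: both case-variant folders that ended up in the branch
folders = [
    "xxxxxxx/bin/Python/envs/xxxxxx-xx/conda-meta",
    "xxxxxxx/bin/python/envs/xxxxxx-xx/conda-meta",
]
pkgs = parse_conda_meta("xxx/release/3.2.0", folders)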

How to find the sub folder id in Google Drive using pydrive in Python?

The directory structure on Google Drive is as follows:
Inside mydrive/BTP/BTP-4
I need to get the folder ID for BTP-4 so that I can transfer a specific file from the folder. How do I do it?
fileList = GoogleDrive(self.driveConn).ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file in fileList:
    if file['title'] == "BTP-4":
        fileID = file['id']
        print(remoteFile, fileID)
        return fileID
Will I be able to give a path like /MyDrive/BTP/BTP-4 and a filename like "test.csv" and then directly download the file?
Answer:
Unfortunately, this is not possible.
More Information:
Google Drive supports creating multiple files or folders with the same name in the same location.
As a result, in some cases providing a file path isn't enough to identify a file or folder uniquely - for example, mydrive/Parent folder/Child folder/Child doc could point to two different files, and mydrive/Parent folder/Child folder/Child folder could point to five different folders.
You have to either search for the folder directly by its ID, or, to get a folder/file's ID, search for children recursively through the folders as you are already doing.
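A minimal sketch of that recursive lookup with pydrive, resolving the path one segment at a time (here drive is assumed to be an authenticated GoogleDrive instance, and the path is the one from the question; if duplicate names exist, this simply picks the first match):

def get_folder_id(drive, path):
    # Walk the path segment by segment, starting from the Drive root
    parent_id = 'root'
    for name in path.strip('/').split('/'):
        query = (f"'{parent_id}' in parents and title = '{name}' "
                 "and mimeType = 'application/vnd.google-apps.folder' and trashed=false")
        matches = drive.ListFile({'q': query}).GetList()
        if not matches:
            raise FileNotFoundError(f"No folder named {name!r} under parent {parent_id}")
        parent_id = matches[0]['id']
    return parent_id

folder_id = get_folder_id(drive, "BTP/BTP-4")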

PyDrive get only my files and shared with me files. Python

Sorry for my English. I use pydrive to work with the Google Drive API. I want to get a list of files. I do it like this:
return self.g_drive.ListFile({'q': 'trashed=false'}).GetList()
This returns me a list of files. But the list contains deleted files. I thought 'q': 'trashed=false' would get only existing files, not the ones in the bucket.
How can I get only existing files plus the files shared with me?
Remove the trashed=false; the query to get shared files is:
sharedWithMe
Also, there is no concept of a bucket in Google Drive.
Query to use:
{'q': 'sharedWithMe'}
EDIT
I still believe trashed=false should work
Workaround:
There must be a better way, but a trick is to do the following:
list_of_trash_files = drive.ListFile({'q': 'trashed=true'}).GetList()
list_of_all_files = drive.ListFile().GetList()
# GoogleDriveFile objects aren't reliably comparable, so diff by file id
trash_ids = {f['id'] for f in list_of_trash_files}
final_required_list = [f for f in list_of_all_files if f['id'] not in trash_ids]
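If trashed=false does work in your setup, a single query can cover both cases the asker wants. This is a sketch using standard Drive v2 query terms ('me' in owners for your own files, sharedWithMe for shared ones); the boolean combination is an assumption about the intent:

# Own files plus files shared with me, excluding anything in the trash
query = "trashed=false and ('me' in owners or sharedWithMe)"
files = drive.ListFile({'q': query}).GetList()
for f in files:
    print(f['title'], f['id'])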

How to delete multiple files at once using Google Drive API

I'm developing a Python script that will upload files to a specific folder in my Drive. As I've come to notice, the Drive API provides an excellent implementation for that, but I encountered one problem: how do I delete multiple files at once?
I tried grabbing the files I want from the Drive and organizing their IDs, but no luck there... (a snippet below)
dir_id = "my folder Id"
file_id = "avoid deleting this file"
dFiles = []
query = ""
#will return a list of all the files in the folder
children = service.files().list(q="'"+dir_id+"' in parents").execute()
for i in children["items"]:
print "appending "+i["title"]
if i["id"] != file_id:
#two format options I tried..
dFiles.append(i["id"]) # will show as array of id's ["id1","id2"...]
query +=i["id"]+", " #will show in this format "id1, id2,..."
query = query[:-2] #to remove the finished ',' in the string
#tried both the query and str(dFiles) as arg but no luck...
service.files().delete(fileId=query).execute()
Is it possible to delete selected files (I don't see why it wouldn't be possible, after all, it's a basic operation)?
Thanks in advance!
You can batch multiple Drive API requests together. Something like this should work using the Python API Client Library:
def delete_file(request_id, response, exception):
    if exception is not None:
        # Do something with the exception
        pass
    else:
        # Do something with the response
        pass

batch = service.new_batch_http_request(callback=delete_file)
for file in children["items"]:
    batch.add(service.files().delete(fileId=file["id"]))
batch.execute(http=http)
If you delete or trash a folder, it will recursively delete/trash all of the files contained in that folder. Therefore, your code can be vastly simplified:
dir_id = "my folder Id"
file_id = "avoid deleting this file"
service.files().update(fileId=file_id, addParents="root", removeParents=dir_id).execute()
service.files().delete(fileId=dir_id).execute()
This will first move the file you want to keep out of the folder (and into "My Drive") and then delete the folder.
Beware: if you call delete() instead of trash(), the folder and all the files within it will be permanently deleted and there is no way to recover them! So be very careful when using this method with a folder...
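For reference, a sketch of the safer variant hinted at above: move the file you want to keep, then send the folder to the trash instead of permanently deleting it (Drive API v2, matching the snippets in this question):

# Move the file to keep out of the folder, then trash the folder (recoverable from the Trash)
service.files().update(fileId=file_id, addParents="root", removeParents=dir_id).execute()
service.files().trash(fileId=dir_id).execute()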
