Listing files on Microsoft Azure Databricks


I'm working in Microsoft Azure Databricks. Using the ls command, I found that there is a CSV file in the directory (see first screenshot). But when I tried to collect the CSV file into a list using glob, it returned an empty list (see second screenshot).
How can I list the contents of a directory in Databricks?
%fs
ls /FileStore/tables/26AS_report/normalised_consol_file_record_level/part1/customer_pan=AAACD3312M/
import glob

path = "/FileStore/tables/26AS_report/normalised_consol_file_record_level/part1/customer_pan=AAACD3312M/"
result = glob.glob(path+'/**/*.csv', recursive=True)
print(result)

glob is a local file-level operation that doesn't know about DBFS. If you want to use it, then you need to prepend a /dbfs to your path:
path = "/dbfs/FileStore/tables/26AS_report/....."
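A minimal sketch of that idea, assuming DBFS is exposed on the driver through the local /dbfs mount. The helper name to_local_path is hypothetical (it is not a Databricks API); it only rewrites the path string so that glob can see the files.

```python
import glob

def to_local_path(dbfs_path: str) -> str:
    """Map a DBFS path to its local /dbfs mount path (hypothetical helper)."""
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    if dbfs_path.startswith("/dbfs/"):
        return dbfs_path  # already a local mount path
    return "/dbfs/" + dbfs_path.lstrip("/")

path = "/FileStore/tables/26AS_report/normalised_consol_file_record_level/part1/customer_pan=AAACD3312M/"
local_path = to_local_path(path)

# On a machine without a /dbfs mount this simply returns an empty list.
result = glob.glob(local_path + "**/*.csv", recursive=True)
print(result)
```

The same helper also accepts paths written with the dbfs: scheme, so both notations from this thread end up under /dbfs.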

I don't think you can use standard Python file system functions from the os.path or glob modules.
Instead, you should use the Databricks file system utility (dbutils.fs). See documentation.
Given your example code, you should do something like:
dbutils.fs.ls(path)
or
dbutils.fs.ls('dbfs:' + path)
This should give a list of files that you may have to filter yourself to only get the *.csv files.
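The filtering step could look roughly like this. It assumes each entry returned by dbutils.fs.ls exposes path and name attributes; the FileInfo stand-in and the csv_paths helper are hypothetical and exist only so the sketch also runs outside a notebook.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls
# (assumption: each entry has .path and .name attributes).
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

def csv_paths(entries):
    """Return the paths of all entries whose name ends in .csv."""
    return [e.path for e in entries if e.name.lower().endswith(".csv")]

# In a Databricks notebook you would pass the real listing instead:
#   csv_files = csv_paths(dbutils.fs.ls(path))
sample = [
    FileInfo("dbfs:/FileStore/tables/a.csv", "a.csv", 10),
    FileInfo("dbfs:/FileStore/tables/sub/", "sub/", 0),
    FileInfo("dbfs:/FileStore/tables/b.txt", "b.txt", 5),
]
csv_files = csv_paths(sample)
print(csv_files)
```

Note that dbutils.fs.ls lists a single directory level, so nested folders would need a recursive call on each directory entry.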

Related

How can I import a csv from another folder in python?

I have a script in Python and I want to import a CSV from another folder. How can I do this? (For example, my .py file is in one folder and I want to reach data on the desktop.)

First of all, you need to understand how relative and absolute paths work.
Here is an example using relative paths. I have two folders on the desktop: scripts, which contains the Python files, and csvs, which contains the CSV files. So the code would be:
import pandas as pd
df = pd.read_csv('../csvs/file.csv')
The path means:
.. (the parent folder, in this case the desktop folder).
/csvs (the csvs folder).
/file.csv (the CSV file).
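One caveat: a relative path such as ../csvs/file.csv is resolved against the current working directory, not against the folder the script lives in. A small sketch of anchoring the path to the script's location with pathlib follows; the helper name sibling_csv and the file names are illustrative assumptions.

```python
from pathlib import Path

def sibling_csv(script_file, folder="csvs", name="file.csv"):
    """Build <parent of the script's folder>/<folder>/<name>."""
    script_dir = Path(script_file).parent      # e.g. Desktop/scripts
    return script_dir.parent / folder / name   # e.g. Desktop/csvs/file.csv

# Inside a real script you would pass __file__, for example:
#   df = pd.read_csv(sibling_csv(__file__))
example = sibling_csv("Desktop/scripts/main.py")
print(example)
```

Built this way, the path stays valid no matter which directory the script is launched from.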

If you are on Windows:
Right-click on the file on your desktop, and go to its properties.
You should see a Location: tag that has a structure similar to this: C:\Users\<user_name>\Desktop
Then you can define the file path as a variable in Python as:
file_path = r'C:\Users\<your_user_name>\Desktop\<your_file_name>.csv'
To read it:
df = pd.read_csv(file_path)
Where possible, prefer relative paths over absolute paths like this in your code. Investing some time in learning the pathlib module would help a lot.

Iterate over files in Databricks Repos

I would like to iterate over some files in a folder whose path is in Databricks Repos.
How would one do this? I don't seem to be able to access the files in Repos.
I have added a picture that shows which folders I would like to access (the dbrks & sql folders).
Thanks :)
Image of the repo folder hierarchy

You can read files from repo folders. The path is /mnt/repos/, which is the top folder shown when you open the repo window. You can then iterate over these files yourself.
Once you find the file you want, you can read it with Spark, for example. To read a CSV file:
spark.read.format("csv").load(
    path, header=True, inferSchema=True, delimiter=";"
)

If you just want to list files in the repositories, you can use the list command of the Workspace REST API. With it you can implement a recursive listing of files. The actual implementation would differ based on your requirements, for example whether you need a list of full paths or a list with subdirectories. It could be something like this (not tested):
import requests

my_pat = "generated personal access token"
workspace_url = "https://name-of-workspace"

def list_files(base_path: str):
    # Ask the Workspace API for the objects directly under base_path
    response = requests.get(
        f"{workspace_url}/api/2.0/workspace/list",
        headers={"Authorization": f"Bearer {my_pat}"},
        json={"path": base_path},
    )
    response.raise_for_status()
    results = []
    # Use .get() so a response without an "objects" key yields no entries
    for i in response.json().get("objects", []):
        if i["object_type"] in ("DIRECTORY", "REPO"):
            # Recurse into folders and repos
            results.extend(list_files(i["path"]))
        else:
            results.append(i["path"])
    return results

all_files = list_files("/Repos/<my-initial-folder>")
But if you want to read the contents of the files in the repository, you need to use the so-called Arbitrary Files support, which has been available since DBR 8.4.
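Assuming the repo folder is reachable as an ordinary local directory once that support is enabled, iterating over it could be done with the standard library alone. In the sketch below a temporary folder stands in for the repo checkout; the iter_files helper, the file names, and the suffix filter are illustrative assumptions (only the dbrks and sql folder names come from the question).

```python
import os
import tempfile

def iter_files(root, suffixes=(".sql", ".py")):
    """Yield paths of files under root whose names end with one of suffixes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.endswith(suffixes):
                yield os.path.join(dirpath, filename)

# Demo on a temporary folder standing in for the repo checkout.
with tempfile.TemporaryDirectory() as repo_root:
    for folder, name in [("dbrks", "job.py"), ("sql", "query.sql"), ("sql", "notes.md")]:
        os.makedirs(os.path.join(repo_root, folder), exist_ok=True)
        with open(os.path.join(repo_root, folder, name), "w") as fh:
            fh.write("-- placeholder\n")
    found = sorted(os.path.relpath(p, repo_root) for p in iter_files(repo_root))

print(found)
```

In a notebook you would replace the temporary folder with the actual path of the repo folder.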

Is there a way to be able to use a variable path using os

The goal is to run through a half stable and half variable path.
I am trying to run through a path (go to lowest folder which is called Archive) and fill a list with files that have a certain ending. This works quite well for a stable path such as this.
fileInPath='\\server123456789\provider\COUNTRY\CATEGORY\Archive'
My code runs through the path (recursive) and lists all files that have a certain ending. This works well. For simplicity I will just print the file name in the following code.
import csv
import os

fileInPath = '\\\\server123456789\\provider\\COUNTRY\\CATEGORY\\Archive'
fileOutPath = 'some path'  # placeholder for the output path
csvSeparator = ';'
fileList = []
# Walk the tree recursively and print every file name ending in PAR
for subdir, dirs, files in os.walk(fileInPath):
    for file in files:
        if file[-3:].upper() == 'PAR':
            print(file)
The problem is that I cannot manage to make country and category variable, e.g. by using *.

The standard library module pathlib provides a simple way to do this.
Your file list can be obtained with
from pathlib import Path
list(Path("//server123456789/provider/").glob("*/*/Archive/*.PAR"))
Note that I'm using / instead of \\; pathlib handles the conversion for you on Windows.
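The pattern above only matches files directly inside Archive and, on case-sensitive file systems, only the exact suffix .PAR. Here is a hedged sketch that also descends into subfolders of Archive and ignores case, run against a temporary folder with made-up country and category names; find_par_files is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def find_par_files(base):
    """Return .PAR files (any case) anywhere below <base>/<country>/<category>/Archive."""
    candidates = Path(base).glob("*/*/Archive/**/*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.upper() == ".PAR")

# Demo tree: two wildcard levels (country/category), then Archive.
with tempfile.TemporaryDirectory() as base:
    for rel in ["DE/A/Archive/a.PAR", "FR/B/Archive/old/b.par",
                "FR/B/Archive/c.txt", "FR/B/Other/d.PAR"]:
        target = Path(base, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = [p.relative_to(base).as_posix() for p in find_par_files(base)]

print(found)
```

The two * components stand for the variable country and category folders, and filtering on the suffix in Python mirrors the file[-3:].upper() check from the question.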

Search for file names that contain words from a list and have a certain file extension

Beginner at Python. I'm trying to search users' folders for illegal content. I want to find all files whose names contain one or more words from the list below and that also have one of the listed extensions.
I can search the files using file.endswith but don't know how to add the word condition.
I've looked through the site and have only come across how to search for a single word, not a list of words.
Thank you in advance
import os

L = ['720p','aac','ac3','bdrip','brrip','demonoid','disc','hdtv','dvdrip',
     'edition','sample','torrent','www','x264','xvid']
# Raw string without a trailing backslash, so the path literal is valid
for root, dirs, files in os.walk(r"Y:\User Folders"):
    for file in files:
        # endswith takes a tuple of plain suffixes (no wildcards)
        if file.endswith(('.7z','.3gp','.alb','.ape','.avi','.cbr','.cbz','.cue','.divx','.epub','.flac',
                          '.flv','.idx','.iso','.m2ts','.m2v','.m3u','.m4a','.m4b','.m4p','.m4v','.md5',
                          '.mkv','.mobi','.mov','.mp3','.mp4','.mpeg','.mpg','.mta','.nfo','.ogg','.ogm',
                          '.pla','.rar','.rm','.rmvb','.sfap0','.sfk','.sfv','.sls','.smfmf','.srt','.sub',
                          '.torrent','.vob','.wav','.wma','.wmv','.wpl','.zip')):
            print(os.path.join(root, file))

Perhaps it would be better to do a reverse search and display a warning about files that DON'T match the file types you want. For instance, you could do this:
if file.endswith((".txt", ".py")):
    print("File is ok!")
else:
    print("File is not ok!")
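For the word condition the question asks about, one possible approach is to combine endswith with any(). The sketch below uses shortened versions of the question's lists; the names is_flagged and flagged_files are made up for illustration.

```python
import os

WORDS = ["720p", "bdrip", "hdtv", "torrent", "x264"]  # shortened from the question's list
EXTENSIONS = (".avi", ".mkv", ".mp4", ".zip")         # shortened; endswith needs a tuple

def is_flagged(filename, words=WORDS, extensions=EXTENSIONS):
    """True if the name has a listed extension and contains at least one listed word."""
    name = filename.lower()
    return name.endswith(extensions) and any(word in name for word in words)

def flagged_files(top):
    """Walk top recursively and yield the full paths of flagged files."""
    for root, _dirs, files in os.walk(top):
        for file in files:
            if is_flagged(file):
                yield os.path.join(root, file)

print(is_flagged("Holiday.720p.x264.mkv"))
print(is_flagged("holiday_photos.zip"))
```

This is plain substring matching, so a short word such as disc would also match discussion; stricter matching would need word boundaries.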

Using py.path.local from py package
The py package (install with $ pip install py) offers a very nice interface for working with files.
from py.path import local

def isbadname(path):
    bad_extensions = [".pyc", ".txt"]  # path.ext includes the leading dot
    bad_names = ["code", "xml"]
    return (path.ext in bad_extensions) or (path.purebasename in bad_names)

for path in local(".").visit(isbadname):
    print(path.strpath)
Explained:
Import
from py.path import local
py.path.local function creates "objectified" file names. To keep my code short, I import
it this way to use only local for objectifying file name strings.
Create objectified path to local directory:
local(".")
Created object is not a string, but an object, which has many interesting properties and methods.
Listing all files within some directory:
local(".").visit("*.txt")
returns a generator that yields the paths of all files with the extension ".txt".
An alternative way to select files is to provide a function that takes the argument path
(an objectified file name) and returns True if the file is to be used, False otherwise.
The function isbadname serves exactly this purpose.
If you want to google for more information, use py path local (the name py is not giving good hits).
For more see https://py.readthedocs.io/en/latest/path.html
Note, that if you use pytest package, the py is installed with it (for good
reason - it makes tests related to file names much more readable and shorter).

Python operating on files in a folder - 'for file in folder'

I know a folder's path, and for every file in the folder I would like to do some operations. So essentially what I'm looking for is a for file in folder type of code that gives me access to the files in variables.
What is the Python way of doing this?
Thanks
EDIT - example: my folder will contain a bunch of XML files, and I have a python routine already to parse them into variables I need.

This will allow you to access and print all the file names in your current directory:
import os
for filename in os.listdir('.'):
    print(filename)
The os module contains much more information about the various functions available. The os.listdir() function can also take any other paths you want to specify.

Does the glob library look helpful?
It will perform some pattern matching, and accepts both absolute and relative addresses.
>>> import glob
>>> for file in glob.glob("*.xml"): # only loops over XML documents
...     print(file)

For people coming to this with Python 3.5 or later, there is now os.scandir(), which can be considerably faster than os.listdir(), especially when you also need to know whether each entry is a file or a directory.
For more information about the improvements/benefits, check out https://benhoyt.com/writings/scandir/
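A short sketch of os.scandir() applied to the XML example from the question, run on a temporary folder with made-up file names; xml_files is a hypothetical helper.

```python
import os
import tempfile

def xml_files(folder):
    """Return the sorted names of the .xml files directly inside folder."""
    with os.scandir(folder) as entries:
        return sorted(
            e.name for e in entries
            if e.is_file() and e.name.lower().endswith(".xml")
        )

with tempfile.TemporaryDirectory() as folder:
    for name in ["a.xml", "b.XML", "readme.txt"]:
        open(os.path.join(folder, name), "w").close()
    os.mkdir(os.path.join(folder, "nested.xml"))  # a directory, so it is skipped
    names = xml_files(folder)

print(names)
```

Each entry is an os.DirEntry, so is_file() can usually be answered without an extra system call per file, which is where much of the speed-up comes from.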
