I'm a newbie with Python ... trying to read parquet files from Databricks, but when the file is empty it throws an error. How can I check the file size before reading it into a DataFrame? Code below:
%python
# Check if the file is empty ???
# If not empty, read it
# else do something else
try:
    parquetDF = spark.read.parquet("wasbs://XXXXX@XXXX.blob.core.windows.net/XXXX/2019-10-11/account.parquet")
except:
    print('File is Empty !!!')
For now I am handling this as shown below:
%python
import pandas as pd

data = {
    'Dummy': ['Dummy'],
}
parquetDF = pd.DataFrame(data)
try:
    parquetDF = spark.read.parquet("wasbs://XXXXX@XXXXX.blob.core.windows.net/XXXXX/2019-10-11/account.parquet")
except:
    print('Empty File!!!')
if parquetDF.columns[0] == 'Dummy':
    print('Do Nothing !!!!')
else:
    print('Do Something !!!')
I create a dummy DataFrame, then try to overwrite it with the parquet data. If an exception occurs (or the source file is empty), the DataFrame is not overwritten. I then check whether the DataFrame was loaded and process accordingly.
I also tried to read the file size, but I get the exception 'No such file or directory':
%python
import os

# os.stat only sees the local filesystem, so a wasbs:// URL fails here
statinfo = os.stat("wasbs://XXXXX@XXXXX.blob.core.windows.net/XXXXX/2019-10-11/account.parquet")
statinfo
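For what it's worth, here is a minimal sketch of one way to do the size check on Databricks. It assumes the dbutils utility available in Databricks notebooks, which (unlike os.stat) can list wasbs:// paths; the account/container names are placeholders.

def is_non_empty_path(path):
    try:
        # dbutils.fs.ls raises an exception if the path does not exist;
        # each FileInfo entry it returns carries a size attribute in bytes
        return sum(f.size for f in dbutils.fs.ls(path)) > 0
    except Exception:
        return False

path = "wasbs://XXXXX@XXXXX.blob.core.windows.net/XXXXX/2019-10-11/account.parquet"
if is_non_empty_path(path):
    parquetDF = spark.read.parquet(path)
else:
    print('File is Empty !!!')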
I have a few CSV files in my Azure File share which I am accessing as text with the following code:
from azure.storage.file import FileService
storageAccount='...'
accountKey='...'
file_service = FileService(account_name=storageAccount, account_key=accountKey)
share_name = '...'
directory_name = '...'
file_name = 'Name.csv'
file = file_service.get_file_to_text(share_name, directory_name, file_name)
print(file.content)
The contents of the CSV files are displayed, but I need to load them into a DataFrame, which I am not able to do. Can anyone tell me how to read file.content as a pandas DataFrame?
After reproducing this on my end, I was able to read a CSV file into a DataFrame from the file contents with the code below.
import pandas as pd
from io import StringIO

generator = file_service.list_directories_and_files('fileshare/')
for file_or_dir in generator:
    print(file_or_dir.name)
    file = file_service.get_file_to_text('fileshare', '', file_or_dir.name)
    df = pd.read_csv(StringIO(file.content), sep=',')
    print(df)
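As a side note, if the files are not plain UTF-8 text, a variant that fetches raw bytes may be more robust. This is a sketch assuming the same legacy azure.storage.file SDK, where get_file_to_bytes is the binary counterpart of get_file_to_text:

import pandas as pd
from io import BytesIO

# Fetch the raw bytes and let pandas handle the decoding
file = file_service.get_file_to_bytes(share_name, directory_name, file_name)
df = pd.read_csv(BytesIO(file.content), sep=',')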
I have a problem with converting an .xlsx file to .csv using the pandas library.
Here is the code:
import pandas as pd
# If pandas is not installed: pip install pandas

class Program:
    def __init__(self):
        # file = input("Insert file name (without extension): ")
        file = "Daty"
        self.namexlsx = "D:\\" + file + ".xlsx"
        self.namecsv = "D:\\" + file + ".csv"
        Program.export(self.namexlsx, self.namecsv)

    def export(namexlsx, namecsv):
        try:
            read_file = pd.read_excel(namexlsx, sheet_name='Sheet1', index_col=0)
            read_file.to_csv(namecsv, index=False, sep=',')
            print("Conversion to .csv file has been successful.")
        except FileNotFoundError:
            print("File not found, check file name again.")
            print("Conversion to .csv file has failed.")

Program()
After running the code, the console shows the error ValueError: File is not a recognized excel file.
The file I have in that directory is "Daty.xlsx". I tried a couple of things, like looking at the documentation and other examples around the internet, but most had similar code.
Edit & Update:
What I intend to do afterwards is use the created CSV file for conversion to a .db file, so in the end the import pipeline will go .xlsx -> .csv -> .db. The idea for such a program came up as a training exercise, but I can't get past the point described above.
You can use it like this:
import pandas as pd
data_xls = pd.read_excel('excelfile.xlsx', 'Sheet1', index_col=None)
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)
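For the follow-up .xlsx -> .csv -> .db goal, here is a minimal sketch of the last step using the standard-library sqlite3 module, assuming SQLite is the intended .db format (the output file and table names are placeholders):

import sqlite3
import pandas as pd

# Load the CSV produced above and write it into a SQLite table
df = pd.read_csv('csvfile.csv')
con = sqlite3.connect('datafile.db')
df.to_sql('data', con, if_exists='replace', index=False)
con.close()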
I checked the .xlsx file itself, and apparently it was corrupted for some reason, with the columns in the initial file being merged into one column. After opening the file and correcting the cells, everything runs smoothly.
Thank you for your time, and apologies for the inconvenience.
Issue: uploading a large file to Streamlit -> need a workaround for file-size-related issues.
Is there a way to create a pandas DataFrame from just a SharePoint file URL?
I solved it for a Google Drive URL but cannot figure out SharePoint.
Potential solution: create a URL from SharePoint and load the Excel/CSV file in as a pandas DataFrame.
import pandas as pd

url = 'google drive url'
# Turn a Drive share link into a direct-download link
path = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
df = pd.read_csv(path)
Yes, you can use https://github.com/vgrem/Office365-REST-Python-Client:
import os
import tempfile

download_path = os.path.join(tempfile.mkdtemp(), os.path.basename(FILE_URL))
with open(download_path, "wb") as local_file:
    ctx.web.get_file_by_server_relative_url(FILE_URL).download(local_file).execute_query()
Then read the download_path:
df = pd.read_csv(download_path)
Don't forget to delete the temp file!
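For example, a sketch of that cleanup step:

import os

os.remove(download_path)  # delete the temp file once the DataFrame is loaded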
The library is amazing; you can also read the SharePoint file directly as bytes.
Ex:
from io import StringIO
from office365.sharepoint.files.file import File

def read_csv(ctx, relative_url, pandas=False):
    # relative_url = "/sites/myLib/Folder/test.csv"  # TEST
    # ctx = auth()
    response = File.open_binary(ctx, relative_url)
    bytes_data = response.content
    try:
        s = str(bytes_data, 'utf8')
    except Exception as e:
        print('utf8 encoding error')
        print(relative_url, e)
        try:
            s = str(bytes_data, 'cp1252')
        except Exception as e:
            print('CRITICAL ERROR: cp1252 encoding error')
            print(relative_url, e)
    if pandas == False:
        return s
    else:
        data = StringIO(s)
        return data
I use the pandas parameter because my final code looks like:
df = pd.read_csv(read_csv(ctx=ctx, relative_url=FILE_URL, pandas=True), dtype=str, keep_default_na=False)  # read master qrd db
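For completeness, a hedged sketch of how the ctx object might be built with this library (the site URL and credentials are placeholders; adjust to your tenant's authentication method):

from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext

site_url = "https://yourtenant.sharepoint.com/sites/myLib"  # placeholder
ctx = ClientContext(site_url).with_credentials(UserCredential("user@example.com", "password"))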
I am working on a project in which I want to load a DataFrame from a CSV file and check whether that file is empty. If the CSV file is empty, then as soon as the statement
df = pd.read_csv("file.csv")
is encountered, I get the error
pandas.errors.EmptyDataError: No columns to parse from file
Please help me.
# custom error class defined correctly
try:
    # file.csv is an empty csv file
    df = pd.read_csv("file.csv")
    if not df.empty:  # a bare `if df:` would raise ValueError for a DataFrame
        print("Dataframe loaded successfully!!")
    else:
        raise Empty_csv_file_Error("The csv file is empty!!")
except Empty_csv_file_Error as e:
    print(e.msg)
Error encountered while loading the DataFrame from an empty CSV file:
pandas.errors.EmptyDataError: No columns to parse from file
The pandas error is telling you that the file is empty, so just catch it:
import pandas as pd

try:
    # file.csv is an empty csv file
    df = pd.read_csv("file.csv")
except pd.errors.EmptyDataError:
    print("The CSV file is empty")
else:
    print("Dataframe loaded successfully!!")
I read all the files in one folder one by one into a pandas.DataFrame and then I check them for some conditions. There are a few thousand files, and I would love to make pandas raise an exception when a file is empty, so that my reader function would skip this file.
I have something like:
class StructureReader(FileList):
    def __init__(self, dirname, filename):
        self.dirname = dirname
        self.filename = str(self.dirname + "/" + filename)

    def read(self):
        self.data = pd.read_csv(self.filename, header=None, sep=",")
        if len(self.data) == 0:
            raise ValueError

class Run(object):
    def __init__(self, dirname):
        self.dirname = dirname
        self.file__list = FileList(dirname)
        self.result = Result()

    def run(self):
        for k in self.file__list.file_list[:]:
            self.b = StructureReader(self.dirname, k)
            try:
                self.b.read()
                self.b.find_interesting_bonds(self.result)
                self.b.find_same_direction_chain(self.result)
            except ValueError:
                pass
A regular file that I'm searching for some condition looks like:
"A/C/24","A/G/14","WW_cis",,
"B/C/24","A/G/15","WW_cis",,
"C/C/24","A/F/11","WW_cis",,
"d/C/24","A/G/12","WW_cis",,
But somehow I never get a ValueError raised, and my functions end up searching empty files, which gives me a lot of "Empty DataFrame ..." lines in my results file. How can I skip empty files?
I'd first check whether the file is empty, and only use it with pandas if it isn't. Following this link https://stackoverflow.com/a/15924160/5088142 you can find a nice way to check whether a file is empty:
import os

def is_non_zero_file(fpath):
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0
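A usage sketch with a placeholder folder name:

import glob

# Keep only files that exist and are larger than zero bytes
non_empty_files = [f for f in glob.glob("some_folder/*.csv") if is_non_zero_file(f)]

Note that a file containing only whitespace passes this check but can still raise pandas.errors.EmptyDataError, so you may want to combine both guards.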
You should not use pandas for this, but the Python standard library directly. The answer is here: python how to check file empty or not
You can get your work done with the following code; just set the path variable to your CSV folder and run it. You should get an object raw_data, which is a pandas DataFrame.
import os, pandas as pd, glob

path = "/home/username/data_folder"
files_list = glob.glob(os.path.join(path, "*.csv"))

for i in range(0, len(files_list)):
    try:
        raw_data = pd.read_csv(files_list[i])
    except pd.errors.EmptyDataError:
        print(files_list[i], "is empty and has been skipped.")
How about this:
import os, glob, pandas as pd

files = glob.glob('*.csv')
files = list(filter(lambda file: os.stat(file).st_size > 0, files))
# pd.read_csv takes one path at a time, so read each file and concatenate
data = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)