How to write a pandas dataframe into Databricks dbfs/FileStore? - python

I'm new to Databricks and need help writing a pandas dataframe to the Databricks local file system.
I searched Google but could not find any case similar to this, and I also tried the help guide provided by Databricks (attached), but that did not work either. I attempted the variations below to try my luck; the commands run just fine, but the file never gets written to the directory (I expect a wrtdftodbfs.txt file to be created)
df.to_csv("/dbfs/FileStore/NJ/wrtdftodbfs.txt")
Result: throws the below error
FileNotFoundError: [Errno 2] No such file or directory:
'/dbfs/FileStore/NJ/wrtdftodbfs.txt'
df.to_csv("\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv(path ="\\dbfs\\FileStore\\NJ\\",file="wrtdftodbfs.txt")
Result: TypeError: to_csv() got an unexpected keyword argument 'path'
df.to_csv("dbfs:\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs:\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
The directory exists and files created manually show up, but pandas to_csv never writes anything and never errors out.
dbutils.fs.put("/dbfs/FileStore/NJ/tst.txt","Testing file creation and existence")
dbutils.fs.ls("dbfs/FileStore/NJ")
Out[186]: [FileInfo(path='dbfs:/dbfs/FileStore/NJ/tst.txt',
name='tst.txt', size=35)]
Appreciate your time and pardon me if the enclosed details are not clear enough.

Try this in your Databricks notebook:
import pandas as pd
from io import StringIO
data = """
CODE,L,PS
5d8A,N,P60490
5d8b,H,P80377
5d8C,O,P60491
"""
df = pd.read_csv(StringIO(data), sep=',')
#print(df)
df.to_csv('/dbfs/FileStore/NJ/file1.txt')
pandas_df = pd.read_csv("/dbfs/FileStore/NJ/file1.txt", header='infer')
print(pandas_df)
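To confirm the file actually landed in DBFS, a quick check (not part of the original answer) can reuse the same dbutils call as in the question:
dbutils.fs.ls("dbfs:/FileStore/NJ")
# should list file1.txt alongside any files created manually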

This worked out for me:
outname = 'pre-processed.csv'
outdir = '/dbfs/FileStore/'
dfPandas.to_csv(outdir+outname, index=False, encoding="utf-8")
To download the file, add files/filename to your notebook URL (before the question mark ?):
https://community.cloud.databricks.com/files/pre-processed.csv?o=189989883924552#
(you need to edit your home URL; for me it is:
https://community.cloud.databricks.com/?o=189989883924552#)

Related

Modifying the xlsx file using openpyxl in databricks directly without pandas/dataframe

import openpyxl
input_workbook1 = openpyxl.load_workbook('/dbfs/FileStore/Test/my_excel.xlsx')
sheet_1 = input_workbook1.active
sheet_1['A2'] = 'A2'
input_workbook1.save('/dbfs/FileStore/Test/Output.xlsx')
OSError: [Errno 95] Operation not supported
I tried reading the Excel file directly using openpyxl in Databricks. I am able to read and modify it directly without pandas/dataframes, but when I try to save it (the last line in the code above) I run into the issue. I tried exactly the same way but keep facing the above error; can anyone help me, please?
I tried the same procedure and it gave me the same error, OSError: [Errno 95] Operation not supported. The reason is a limitation: random writes do not work on the local file system, and the official Microsoft documentation (Local File API limitations) refers to this issue.
So, instead of trying to write to the DBFS path directly, write the file to the /databricks/driver/ path and then copy/move it to the required directory.
Modify your code as following:
import openpyxl
input_workbook1 = openpyxl.load_workbook('/dbfs/FileStore/Test/my_excel.xlsx')
sheet_1 = input_workbook1.active
sheet_1['A2'] = 'A2'
input_workbook1.save('Output.xlsx')
#will be saved to '/databricks/driver/'.
#Use dbutils.fs.ls('/databricks/driver/') to view.
from shutil import move
move('/databricks/driver/Output.xlsx', '/dbfs/FileStore/Test/')
wb1 = openpyxl.load_workbook('/dbfs/FileStore/Test/Output.xlsx')
ws1 = wb1.active
for row in ws1.iter_rows():
    print([col.value for col in row])
The above code will successfully move your file to the required path without any errors.
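An alternative sketch (not part of the original answer): the same copy step can be done with dbutils instead of shutil, addressing the driver's local disk with the file:/ scheme and the DBFS target with dbfs:/ :
# Assumes the workbook was saved to the driver's local disk as above.
dbutils.fs.cp("file:/databricks/driver/Output.xlsx", "dbfs:/FileStore/Test/Output.xlsx")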

Unable to read a parquet file

I am breaking my head over this right now. I am new to parquet files, and I am running into a LOT of issues with them.
I get an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a df from it.
I've tried this:
pq.read_pandas(r'E:\datasets\proj\train\train.parquet').to_pandas()
AND
od = pd.read_parquet(r'E:\datasets\proj\train\train.parquet', engine='pyarrow')
I also changed the drive letter of the drive the dataset resides on, and it's the SAME THING!
It's the same with all engines.
PLEASE HELP!
This might be a problem with Arrow's file path handling. You could instead pass in an already opened file:
import pandas as pd
with open(r'E:\datasets\proj\train\train.parquet', 'rb') as f:
    df = pd.read_parquet(f, engine='pyarrow')
Try using fastparquet as the engine; it worked for me:
od = pd.read_parquet(r'E:\datasets\proj\train\train.parquet', engine='fastparquet')

pandas.read_csv can't find my path error

So I tried to run this code.
import pandas
i = input("hi input a csv file..")
df = pandas.read_csv(i)
and I got an error saying
FileNotFoundError: File b'"C:\\Users\\thomas.swenson\\Downloads\\hi.csv"' does not exist
but then if I hard code that path that 'doesn't exist' into my program it works fine.
import pandas
df = pandas.read_csv("C:\\Users\\thomas.swenson\\Downloads\\hi.csv")
it works just fine.
Anyone know why this may be happening?
I'm running python 3.6 and using a virtualenv
Looks like the input was carrying another set of quotes around the path (note the extra quotes in the error message), so I'll just have to remove them and it works fine.
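A minimal sketch of that fix, assuming the extra quotes come from pasting a quoted Windows path (e.g. via "Copy as path"):
import pandas
# Strip any surrounding quotes pasted along with the path.
i = input("hi input a csv file..").strip('"').strip("'")
df = pandas.read_csv(i)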

Error: Line magic function

I'm trying to read a file using python and I keep getting this error
ERROR: Line magic function `%user_vars` not found.
My code is very basic just
names = read_csv('Combined data.csv')
names.head()
I get this any time I try to read or open a file. I tried using this thread for help:
ERROR: Line magic function `%matplotlib` not found
I'm using Enthought Canopy and I have IPython version 2.4.1. I made sure to update using the IPython installation page for help. I'm not sure what's wrong, because it should be very simple to open/read files. I even get this error when opening text files.
EDIT:
I imported traceback and used
print(traceback.format_exc())
But all I get is None printed. I'm not sure what that means.
Looks like you are using pandas. Try the following (assuming your csv file is in the same directory as your script), and enter it one line at a time if you are using the IPython shell:
import pandas as pd
names = pd.read_csv('Combined data.csv')
names.head()

Read a .csv into pandas from F: drive on Windows 7

I have a .csv file on my F: drive on Windows 7 64-bit that I'd like to read into pandas and manipulate.
None of the examples I see read from anything other than a simple file name (e.g. 'foo.csv').
When I try this I get error messages that aren't making the problem clear to me:
import pandas as pd
trainFile = "F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv"
trainData = pd.read_csv(trainFile)
The error message says:
IOError: Initializing from file failed
I'm missing something simple here. Can anyone see it?
Update:
I did get more information like this:
import csv

if __name__ == '__main__':
    trainPath = 'F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv'
    trainData = []
    with open(trainPath, 'r') as trainCsv:
        trainReader = csv.reader(trainCsv, delimiter=',', quotechar='"')
        for row in trainReader:
            trainData.append(row)
    print(trainData)
I got a permission error on read. When I checked the properties of the file, I saw that it was read-only. I was able to read 892 lines successfully after unchecking it.
Now pandas is working as well. No need to move the file or amend the path. Thanks for looking.
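As a small diagnostic sketch (not from the original post), checking the path and trying a plain open() often gives a clearer error than pandas' IOError:
import os
trainFile = "F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv"
print(os.path.exists(trainFile))  # does the path exist at all?
with open(trainFile, 'r') as f:   # raises a clear PermissionError/OSError if not readable
    f.readline()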
I cannot promise that this will work, but it's worth a shot:
import pandas as pd
import os

trainFile = "F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv"
pwd = os.getcwd()                     # remember the current working directory
os.chdir(os.path.dirname(trainFile))  # change into the file's directory
trainData = pd.read_csv(os.path.basename(trainFile))  # read by bare file name
os.chdir(pwd)                         # change back
A better solution is to use raw strings like r'pathname\filename' rather than 'pathname\filename'. See Lexical Analysis in the Python documentation for more details.
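For example (illustrative, using the path from the question in Windows backslash form), a raw string stops the backslashes from being treated as escape sequences:
import pandas as pd
trainData = pd.read_csv(r"F:\Projects\Python\coursera\intro-to-data-science\kaggle\data\train.csv")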
I also got the same issue and got it resolved.
Check the path to your file carefully.
I initially had the path like
dfTrain = pd.read_csv("D:\\Kaggle\\labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
This returned an error because the path was wrong. Then I changed the path as below, and it works fine:
dfTrain = pd.read_csv("D:\\Kaggle\\labeledTrainData.tsv\\labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
This was because my earlier path was not correct. Hope you get it resolved.
This happens to me quite often. Usually I open the csv file in Excel, save it as an xlsx file, and it works.
So instead of
df = pd.read_csv(r"...\file.csv")
Use:
df = pd.read_excel(r"...\file.xlsx")
If you're sure the path is correct, make sure no other programs have the file open. I got that error once, and closing the Excel file made the error go away.
Try this:
import os
import pandas as pd
trainFile = os.path.join('F:', os.sep, 'Projects', 'Python', 'coursera', 'intro-to-data-science', 'train.csv')
trainData = pd.read_csv(trainFile)
