Not able to create table from CSV in Databricks - python

I am trying to create a table from a CSV file stored in an Azure Storage Account, using the code below. I am using Azure Databricks and the notebook is in Python.
%sql
drop table if exists customer;
create table customer
using csv
options ( path "/mnt/datalake/data/Customer.csv", header "True", mode "FAILFAST", inferSchema "True");
I am getting the below error.
Unable to infer schema for CSV. It must be specified manually.
Does anyone have any idea how to resolve this error?

I have reproduced the above and got the results below.
This is my CSV file in the data container.
This is my mounting:
After mounting it, when I tried to create a table from the CSV file, I got the same error.
The error arises when the path given in the CSV options is not correct. In the file path, after /mnt give the mount point (onemount in my case), not the container name, because the mount already points at the container. A corrected sketch is shown below.
Result: the table is created successfully once the path uses the mount point.
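A minimal sketch of the corrected statement, assuming the container is mounted at /mnt/onemount and Customer.csv sits under a data folder inside that mount (both names are illustrative, taken from the answer's repro rather than the asker's setup):
%python
# Hypothetical corrected path: mount point after /mnt, not the container name.
spark.sql("drop table if exists customer")
spark.sql("""
    create table customer
    using csv
    options (
        path "/mnt/onemount/data/Customer.csv",
        header "True",
        mode "FAILFAST",
        inferSchema "True"
    )
""")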

Related

to_csv "No Such File or Directory" But the directory does exist - Databricks on ADLS

I've seen many iterations of this question but cannot seem to understand/fix this behavior.
I am on Azure Databricks (DBR 10.4 LTS, Spark 3.2.1, Scala 2.12) trying to write a single CSV file to blob storage so that it can be dropped to an SFTP server. I could not use spark-sftp because I am on Scala 2.12, unfortunately, and could not get the library to work.
Given this is a small dataframe, I am converting it to pandas and then attempting to_csv.
to_export = df.toPandas()
to_export.to_csv(pathToFile, index = False)
I get the error: [Errno 2] No such file or directory: '/dbfs/mnt/adls/Sandbox/user/project_name/testfile.csv'
Based on the information in other threads, I created the directory with dbutils.fs.mkdirs("/dbfs/mnt/adls/Sandbox/user/project_name/"), which returns Out[40]: True.
The response is true and the directory exists, yet I still get the same error. I'm convinced it is something obvious and I've been staring at it for too long to notice. Does anyone see what my error may be?
Python's pandas library recognizes the path only when it is in File API format (since you are using a mount), whereas dbutils.fs.mkdirs uses Spark API format, which is different from File API format.
Because you are creating the directory using dbutils.fs.mkdirs with the path /dbfs/mnt/adls/Sandbox/user/project_name/, that path is actually interpreted as dbfs:/dbfs/mnt/adls/Sandbox/user/project_name/. Hence, the directory is created within DBFS rather than on the mount.
dbutils.fs.mkdirs('/dbfs/mnt/repro/Sandbox/user/project_name/')
So, modify the code that creates the directory to the following:
dbutils.fs.mkdirs('/mnt/repro/Sandbox/user/project_name/')
#OR
#dbutils.fs.mkdirs('dbfs:/mnt/repro/Sandbox/user/project_name/')
Writing to the folder would now work without any issue.
pdf.to_csv('/dbfs/mnt/repro/Sandbox/user/project_name/testfile.csv', index=False)
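If it helps to confirm that the two path formats point at the same place, a small sketch using the same repro path as above:
# Spark API format: no /dbfs prefix, used by dbutils.fs
display(dbutils.fs.ls('/mnt/repro/Sandbox/user/project_name/'))
# File API format: the same directory, seen through the /dbfs FUSE mount
import os
print(os.listdir('/dbfs/mnt/repro/Sandbox/user/project_name/'))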

[DATABRICKS] How to store SQL query result data to local disk?

I am a newbie to Databricks and I'm trying to write results into an Excel/CSV file using the command below, but I'm getting 'DataFrame' object has no attribute 'to_csv' errors while executing.
I am using a notebook to execute my SQL queries and now want to store the results in a CSV or Excel file.
%python
df = spark.sql("""select * from customer""")
Now I want to store the query results in an Excel/CSV file. I have tried the code below but it's not working.
df.coalesce(1).write.option("header","true").option("sep",",").mode("overwrite").csv("file:///C:/New folder/mycsv.csv")
AND
df.write.option("header", "true").csv("file:///C:/New folder/mycsv.csv")
There is no direct way to write a dataframe to your local machine; Databricks does not recognize paths on your local machine.
The df.write.option("header", "true").csv("file:///C:/New folder/mycsv.csv") call runs successfully because file:/ is a valid path inside Databricks (the driver's local filesystem). According to Databricks, it does not refer to your local machine. You can use display(dbutils.fs.ls("file:/C:/")) to see its contents.
The best ways to download the results to your local machine are the following:
1. Using UI from display():
Use the following code.
%python
display(df)
Your dataframe will be displayed with a few UI options. You can use the download symbol to download the results.
2. Using Filestore:
First, we have to enable the DBFS file browser. Navigate to Settings -> Admin Console -> Workspace Settings. Under the advanced section, there is an option called DBFS File Browser. Enable this and reload. Now you can browse the DBFS FileStore.
Write the dataframe to this location using the following code.
df.coalesce(1).write.option("header","true").option("sep",",").mode("overwrite").csv("dbfs:/FileStore/tables/Output")
#must use coalesce
Now use displayHTML in the following way inside a Python cell:
%python
displayHTML("<a href='/FileStore/tables/Output/opfile.csv' download>Download CSV </a>")
#I renamed my file to opfile.csv.
#You can find your file by navigating to Data -> DBFS -> FileStore -> tables -> Output.
#Right click to rename
These are some of the ways to download a dataframe to your local machine, but there is no way to write it to the local machine directly.
UPDATE:
Using a pandas dataframe also does not work; the file will be saved inside Databricks itself:
pd = df.toPandas()
pd.to_csv('C://New folder/myoutput.csv', sep=',', header=True, index=False)
#successful
import os
print(os.listdir("/"))
The output confirms that the file is being written inside Databricks (under a C:/ directory on the driver), not to the local machine.

How to read an Excel file using pandas pd.read_excel in Databricks from the /FileStore/tables/ directory?

Hi, I am trying to read an Excel file that was uploaded to the Databricks FileStore from the UI. I can see that the file is available under the /FileStore/tables directory, and I am trying to create a pandas dataframe using the code below.
import pandas as pd
df = pd.read_excel("/dbfs/FileStore/tables/abc.xlsx")
display(df)
I am getting the error below
FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/abc.xlsx'
I understand that the path is not relative to my current working directory. I would like to know how I can point to the file in the FileStore using Python.
things I have tried:
I have used /FileStore/tables/abc.xlsx in the path and it didn't work
I know the Scala code with the spark-excel jar works, but I can't execute Scala commands because my org doesn't, and won't, provide me access to execute Scala commands.
any ideas how to get this working?
The file is not stored as an Excel file when you create a table; you access the data via the Spark API.
You could also read the table into a koalas dataframe and then convert it to pandas if you don't want to work in koalas.
If you don't want to use Spark or koalas, then upload the raw file to /dbfs/FileStore and use read_excel on the file in that location.
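A minimal sketch of the last two options, assuming the raw .xlsx really was uploaded to /FileStore/tables/abc.xlsx, that openpyxl is available on the cluster, and that the registered table is named abc (all three are assumptions, not details confirmed by the question):
import pandas as pd
# Option A: the raw .xlsx file itself was uploaded to FileStore; pandas and other
# local-file libraries see FileStore under the /dbfs prefix.
pdf = pd.read_excel("/dbfs/FileStore/tables/abc.xlsx", engine="openpyxl")
# Option B: the upload was registered as a table; read it back via Spark and convert.
# pdf = spark.table("abc").toPandas()
display(pdf)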

Python, pandas.read_csv on large csv file with 10 Million rows from Google Drive file

I extracted a .csv file from Google BigQuery with 2 columns and 10 million rows.
I downloaded the file locally as a .csv with a size of 170 MB, then uploaded it to Google Drive, and I want to use the pandas.read_csv() function to read it into a pandas DataFrame in my Jupyter Notebook.
Here is the code I used, with specific fileID that I wanna read.
# read into pandasDF from .csv stored on Google Drive.
follow_network_df = pd.read_csv("https://drive.google.com/uc?export=download&id=1WqHWdgMVLPKVbFzIIprBBhe3I9faq4HA")
Here is what I got: it seems the 170 MB csv file is read back as an HTML page instead of the actual data.
When I tried the same code with another csv file of 40 MB, it worked perfectly:
# another csv file of 40Mb.
user_behavior_df = pd.read_csv("https://drive.google.com/uc?export=download&id=1NT3HZmrrbgUVBz5o6z_JwW5A5vRXOgJo")
Can anyone give me a hint about the root cause of the difference?
Any ideas on how to read a csv file of 10 million rows and 170 MB from online storage? I know it's possible to read the 10 million rows into a pandas DataFrame using the BigQuery interface or from my local machine, but I have to include this as part of my submission, so I can only read from an online source.
The problem is that your first file is too large for Google Drive to scan for viruses, so there's a user prompt that gets displayed instead of the actual file. You can see this if you access the first file's link.
I'd say click through the user prompt and use the direct-download URL it gives you with pd.read_csv.
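The answer's actual URL was not preserved here. As a rough sketch of the idea: the prompt's download button typically resolves to the same uc?export=download link with an extra confirm token appended. The confirm value below is purely illustrative and may not work as-is; only the file id comes from the question.
import pandas as pd
# Hypothetical direct-download URL: the id is the one from the question,
# but the confirm token is illustrative and normally comes from the prompt page.
url = ("https://drive.google.com/uc?export=download"
       "&confirm=t"
       "&id=1WqHWdgMVLPKVbFzIIprBBhe3I9faq4HA")
follow_network_df = pd.read_csv(url)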

python pandas dataframe to_sql : how to set up sqlite db file created directory

conn = sqlite3.connect('test.db')
essential_df.to_sql('collection',conn, if_exists='append')
conn.close()
I'm currently working on a kind of DB collector for a website. I want to download an Excel file from the website, modify the data as I want, and store it in a database, which I chose to be SQLite.
I succeeded in downloading the Excel file with Selenium and finally modified the data as I wanted with pandas. The data is stored as a pandas dataframe.
As I wanted to store this in the DB, I tried to work with cursor() and the execute method but kept failing, so I chose to use the pandas to_sql method, and it worked with if_exists='append' as I wanted.
============Here comes my real question!============
When I worked on the data storing with pandas only, the DB file was saved in the directory where the .py file is located. However, when I added the Selenium function to the code, it strangely began to create the file in the Windows download folder, where the Excel file is downloaded.
Does anyone know why this happened, and can you guide me on how to save the DB file into the current directory (where the .py file is located)?
When you run your code via a .py file, the current working directory will be the directory where the .py file is present.
However, it seems to get changed after adding the Selenium code.
The current working directory for both cases can be checked using os.getcwd().
Solution:
Add the code below in the .py file, and use the resulting absolute path when connecting (a sketch follows after the snippet).
>>> a=os.path.join(os.getcwd(),'test.db')
>>> a
'C:\\Users\\punddin\\test.db'
>>>
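Applied to the original snippet, a minimal sketch that follows the answer's os.path.join(os.getcwd(), ...) approach (essential_df is the dataframe from the question):
import os
import sqlite3
# Build an explicit absolute path so it is clear where the db file will be created.
db_path = os.path.join(os.getcwd(), 'test.db')
conn = sqlite3.connect(db_path)
essential_df.to_sql('collection', conn, if_exists='append')  # essential_df comes from the earlier pandas step
conn.close()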
