Unable to read a parquet file - python

I am breaking my head over this right now. I am new to Parquet files, and I am running into a LOT of issues with them.
I get an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a DataFrame from the file.
I've tried this:
pq.read_pandas(r'E:\datasets\proj\train\train.parquet').to_pandas()
AND
od = pd.read_parquet(r'E:\datasets\proj\train\train.parquet', engine='pyarrow')
I also changed the drive letter of the drive the dataset resides on, and it's the SAME THING!
It's the same with all engines.
PLEASE HELP!

This might be a problem with Arrow's file path handling. You could instead pass in an already opened file:
import pandas as pd
with open(r'E:\datasets\proj\train\train.parquet', 'rb') as f:
    df = pd.read_parquet(f, engine='pyarrow')

Try using fastparquet as the engine; it worked for me.
engine = "fastparquet"
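Before switching engines, it may help to check what Python itself sees at that location, since the error message complains about a "non-file path". This is only a minimal sketch using the standard library; `describe_path` is a hypothetical helper name, not part of pandas or pyarrow.

```python
from pathlib import Path


def describe_path(path):
    """Report what the OS sees at `path` (hypothetical helper)."""
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.is_dir():
        return "directory"
    if p.is_file():
        return "file"
    return "other"
```

For example, `describe_path(r'E:\datasets\proj\train\train.parquet')` returning anything other than "file" would at least be consistent with the reader rejecting the path.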

Related

Access file and read/write from shared/network folder using python

I want to read and write data from a network folder. So far I have tried
os.open("\u drive path") , open("\u drive path")
but it says access or permission denied,
but when I use
os.startfile("\u drive path")
I always use raw (r) strings when connecting to a network drive, especially with pandas. Try this to load the file into a DataFrame:
import pandas as pd
desired_file = r'\\networkdrive\folder\file.csv'
df = pd.read_csv(desired_file, encoding='utf-8')
The raw string is easier for people to read, and if you run
print(desired_file)
you can see that Python holds the path in the form pandas needs.
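To see why the r prefix matters, here is a small sketch comparing three spellings of the same UNC path from the answer above. Nothing here touches the network; it only compares strings.

```python
# The same UNC path written three ways (share name taken from the answer above).
raw = r'\\networkdrive\folder\file.csv'         # raw string: backslashes kept as typed
escaped = '\\\\networkdrive\\folder\\file.csv'  # every backslash doubled by hand
unescaped = '\\networkdrive\folder\file.csv'    # '\\' collapses to one backslash, '\f' becomes a form feed

print(raw == escaped)     # True: both spell the same path
print('\f' in unescaped)  # True: this spelling silently contains form-feed characters
print(repr(unescaped))    # shows the damaged path
```

The third spelling raises no error when it is defined, which is why this kind of mistake tends to show up later as a "file not found" or "permission denied" message.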

How to write pandas dataframe into Databricks dbfs/FileStore?

I'm new to Databricks and need help writing a pandas DataFrame to the Databricks local file system.
I searched Google but could not find a similar case, and I also tried the help guide provided by Databricks (attached), but that did not work either. I attempted the changes below; the commands run fine, but the file is not written to the directory (I expect wrtdftodbfs.txt to be created).
df.to_csv("/dbfs/FileStore/NJ/wrtdftodbfs.txt")
Result: throws the below error
FileNotFoundError: [Errno 2] No such file or directory:
'/dbfs/FileStore/NJ/wrtdftodbfs.txt'
df.to_csv("\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv(path ="\\dbfs\\FileStore\\NJ\\",file="wrtdftodbfs.txt")
Result: TypeError: to_csv() got an unexpected keyword argument 'path'
df.to_csv("dbfs:\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs:\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
The directory exists and files created manually show up, but pandas to_csv neither writes the file nor raises an error.
dbutils.fs.put("/dbfs/FileStore/NJ/tst.txt","Testing file creation and existence")
dbutils.fs.ls("dbfs/FileStore/NJ")
Out[186]: [FileInfo(path='dbfs:/dbfs/FileStore/NJ/tst.txt',
name='tst.txt', size=35)]
Appreciate your time and pardon me if the enclosed details are not clear enough.
Try this in your Databricks notebook:
import pandas as pd
from io import StringIO
data = """
CODE,L,PS
5d8A,N,P60490
5d8b,H,P80377
5d8C,O,P60491
"""
df = pd.read_csv(StringIO(data), sep=',')
#print(df)
df.to_csv('/dbfs/FileStore/NJ/file1.txt')
pandas_df = pd.read_csv("/dbfs/FileStore/NJ/file1.txt", header='infer')
print(pandas_df)
This worked out for me:
outname = 'pre-processed.csv'
outdir = '/dbfs/FileStore/'
dfPandas.to_csv(outdir+outname, index=False, encoding="utf-8")
To download the file, add files/filename to your notebook URL (before the question mark ?):
https://community.cloud.databricks.com/files/pre-processed.csv?o=189989883924552#
(you need to edit your home URL, which for me is:
https://community.cloud.databricks.com/?o=189989883924552#)
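Both working answers write through a path that starts with /dbfs/. A rough sketch of the mapping, assuming the usual Databricks convention that pandas and other local file APIs reach DBFS through the /dbfs/ mount while dbutils.fs works with dbfs:/ paths. The two helper names are hypothetical and only manipulate strings.

```python
def dbfs_uri_to_local(path):
    """Map a 'dbfs:/...' path to the '/dbfs/...' mount path (assumed convention)."""
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


def local_to_dbfs_uri(path):
    """Map a '/dbfs/...' mount path back to a 'dbfs:/...' path (assumed convention)."""
    prefix = "/dbfs/"
    if path.startswith(prefix):
        return "dbfs:/" + path[len(prefix):]
    return path


print(dbfs_uri_to_local("dbfs:/FileStore/NJ/file1.txt"))
print(local_to_dbfs_uri("/dbfs/FileStore/pre-processed.csv"))
```

Under that assumption, the dbutils.fs.put call in the question, which was given "/dbfs/FileStore/NJ/tst.txt", ended up at dbfs:/dbfs/FileStore/NJ/tst.txt (as its own output shows), so it may not prove that the directory pandas was writing to exists.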

Failing to open an Excel file with Python

I'm on a Debian GNU/Linux computer, working with Python 2.7.9.
As part of my job, I have been writing Python scripts that read inputs in various formats (e.g. Excel, CSV, TXT) and parse the information into more standardized files. It's not my first time opening or working with Excel files.
There's a particular file which is giving me problems, I just can't open it. When I tried with xlrd (version 0.9.3), it gave me the following error:
xlrd.open_workbook('sample.xls')
XLRDError: Unsupported format, or corrupt file: BOF not
workbook/worksheet: op=0x0009 vers=0x0002 strm=0x000a build=0 year=0
-> BIFF21
I tried to investigate the matter on my own, found a couple of answers in StackOverflow but I couldn't open it anyway. This particular answer I found may be the problem (the second explanation), but it doesn't include a workaround: https://stackoverflow.com/a/16518707/4345659
A tool that could convert the file to csv/txt would also solve the problem.
I already tried with:
xlrd
openpyxl
xlsx2csv (the shell tool)
A sample file is available here:
https://ufile.io/r4m6j
As a side note, I can open it with LibreOffice Calc and MS Excel, so I could eventually change it to csv that way. The thing is, I need to do it all with a python script.
Thanks in advance!
It seems like an MS problem. The xls file is very strange; maybe you should contact xlrd support.
But I have a crazy workaround for you: xls2ods. It works for me even though xls2csv doesn't (sic!).
So, install catdoc first:
$ sudo apt-get install catdoc
Then convert your xls file to ods and open the ods file using pyexcel_ods or whatever you prefer. To use pyexcel_ods, install it first with pip install pyexcel_ods.
import subprocess
import sys

from pyexcel_ods import get_data

file_basename = 'sample'
# Convert sample.xls to sample.ods with the external xls2ods tool
returncode = subprocess.call(['xls2ods', '{}.xls'.format(file_basename)])
if returncode > 0:
    # consider using subprocess.Popen if you need more control over stderr
    sys.exit(returncode)
data = get_data('{}.ods'.format(file_basename))
print(data)
I'm getting the following output:
OrderedDict([(u'sample',
[[u'labo',
u'codfarm',
u'farmacia',
u'direccion',
u'localidad',
u'nom_medico',
u'matricula',
u'troquel',
u'producto',
u'cant_total']])])
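If you want the converter's error output as well as its exit code, something along these lines might do. It is a sketch for Python 3.7+ (the question uses 2.7, where subprocess.Popen would be needed instead), and `run_converter` is a hypothetical helper name.

```python
import subprocess


def run_converter(command):
    """Run an external command and return (returncode, stderr text).

    Returns (None, message) when the executable itself is not installed.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return None, "command not found: {}".format(command[0])
    return result.returncode, result.stderr
```

It would be called as `run_converter(['xls2ods', 'sample.xls'])`; a non-zero code together with the captured stderr text usually says more than the bare exit status.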
Here is a kludge I would use:
Assuming you have LibreOffice on Debian, you could either convert all your *.xls files into *.csv using:
import os
os.system("libreoffice --headless --convert-to csv *.xls")
# or use subprocess.call
... and then work consistently with csv.
Or you could convert only the corrupted file(s) when needed using a try/except block:
import os
import xlrd
from xlrd import XLRDError

try:
    xlrd.open_workbook('sample.xls')
except XLRDError:
    # Fall back to LibreOffice when xlrd cannot parse the file
    os.system("libreoffice --headless --convert-to csv sample.xls")
    # mycsv = open("sample.csv", "r")
    # for line in mycsv.readlines():
    #     ...
Note: keep LibreOffice closed while running the script.
Alternatively there are other tools out there to do the conversion. Here is one (which I have not tested): https://github.com/dilshod/xlsx2csv
If you are targeting windows, if you have Excel installed, and if you are familiar with Excel VBA, you will have a quick solution using the comtypes package:
http://pythonhosted.org/comtypes/
You will have direct access to Excel by its COM interfaces.
This code opens an xls file and saves it as a csv file, using the comtypes package:
import comtypes.client as cl
progId = "Excel.Application.15"
xl = cl.CreateObject(progId)
wb = xl.Workbooks.Open(r"C:\Users\aUser\Desktop\thermoList.xls")
xl.DisplayAlerts = False  # suppress prompts (e.g. overwrite confirmation) before saving
wb.SaveAs(r"C:\Users\aUser\Desktop\thermoList.csv", FileFormat=6)  # 6 = xlCSV
xl.Quit()
I could not test it with "sample.xls", which is corrupt.
You could try with another file.
You might need to adjust the progId according to your version of Excel.
It's a file format issue. I'm not sure what file type it is, but it's not Excel. I just opened the file, saved it under the name sample2.xls, and compared the types:
How are you creating this file?
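One way to check the file type from Python is to look at the first few bytes. This is a minimal sketch: the OLE2 and zip signatures are the well-known container signatures for .xls and .xlsx, while the BIFF2 check is only a guess based on the op=0x0009 record reported in the xlrd error. `sniff_spreadsheet` is a hypothetical helper name.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound-file container used by ordinary .xls
ZIP_MAGIC = b"PK\x03\x04"                          # zip container used by .xlsx


def sniff_spreadsheet(path):
    """Guess the container format of a spreadsheet from its leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head[:2] == b"\x09\x00":
        # Assumption: a bare BOF record with opcode 0x0009, as in the error above
        return "biff2?"
    return "unknown"
```

Running `sniff_spreadsheet('sample.xls')` and `sniff_spreadsheet('sample2.xls')` side by side would show whether re-saving changed the container.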
If you need to get the words as a list of strings:
text_file = open("sample.xls", "r")
lines = text_file.read().replace(chr(200), '').replace(chr(0), '').replace(chr(1), '').replace(chr(5), '').replace(chr(2), '').replace(chr(3), '').replace(chr(4), '').replace(chr(6), '').replace(chr(7), '').replace(chr(8), '').replace(chr(9), '').replace(chr(10), '').replace(chr(12), '').replace(chr(15), '').replace(chr(16), '').replace(chr(17), '').replace(chr(18), '').replace(chr(49), '').replace('Arial', '')
for line in lines.split(chr(128)):
    print(line)
the output:
The file you provided is corrupted, so other responders cannot test it and recommend a good solution. The exception you posted confirms that.
As a next step you can try to debug a few things; see the steps below:
You mentioned you tried the xlrd library. Check whether your xlrd module is up to date by executing this:
Python 2.7.9
>>> import xlrd
>>> xlrd.__VERSION__
Update to the latest official version if needed.
Try to open any other *.xls file and see whether it works with the Python version and library you are currently using.
Check the module documentation; it's pretty good, and it describes how to use the module on various platforms (Windows vs. Linux): http://xlrd.readthedocs.io/en/latest/dates.html
You can always reach out to the community (there is still a chance that you are hitting some weird state or bug); the link is here: https://github.com/python-excel/xlrd/issues
Hope that helps.
I was unable to open your Excel file either. Just as yadayada said, I think the problem lies in the data source. If you really want to figure out the reason, I suggest you ask about Excel rather than Python.
This always works for me with any xls or xlsx file:
import unicodecsv
import xlrd

def csv_from_excel(filename_xls, filename_csv):
    # Replace 'cp1251' with the encoding of your own file
    wb = xlrd.open_workbook(filename_xls, encoding_override='cp1251')
    sh = wb.sheet_by_index(0)
    your_csv_file = open(filename_csv, 'wb')
    wr = unicodecsv.writer(your_csv_file)
    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    your_csv_file.close()
So I don't work with Excel files directly; I convert them to csv first. Maybe it will help you.

Opening a file that has been uploaded in Flask

I'm trying to modify a csv that is uploaded into my flask application. I have the logic that works just fine when I don't upload it through flask.
import pandas as pd
import StringIO
with open('example.csv') as f:
    data = f.read()
data = data.replace(',"', ",'")
data = data.replace('",', "',")
df = pd.read_csv(StringIO.StringIO(data), header=None, sep=',', quotechar="'")
print df.head(10)
I upload it to flask and access it using
f = request.files['data_file']
When I run it through the code above, replacing open('example.csv') with open(f), I get the following error
coercing to Unicode: need string or buffer, FileStorage found
I have figured out that the problem is the file type here. I can't use open on my file because open is looking for a file name and when the file is uploaded to flask it is the instance of the file that is being passed to the open command. However, I don't know how to make this work. I've tried skipping the open command and just using data = f.read() but that doesn't work. Any suggestions?
Thanks
FileStorage is a file-like wrapper around the incoming data. You can pass it directly to read_csv.
pd.read_csv(request.files['data_file'])
You most likely should not be performing those replace calls on the data, as the CSV module should handle that and the naive replacement can corrupt data in quoted columns. However, if you still need to, you can read the data out just like you were before.
data = request.files['data_file'].read()
If your data has a mix of quoting styles, you should fix the source of your data.
Answering my own question in case someone else needs this.
FileStorage objects have a .stream attribute, which is a file-like object (for example an io.BytesIO):
f = request.files['data_file']
df = pd.read_csv(f.stream)
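The general point is that an upload is already an open file-like object, so it only needs .read() and never open(). A small sketch using just the standard library, with io.BytesIO standing in for the uploaded file; `rows_from_upload` is a hypothetical helper name.

```python
import csv
import io


def rows_from_upload(file_like, encoding="utf-8"):
    """Parse CSV rows from any binary object that has .read(), without open()."""
    text = file_like.read().decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


# io.BytesIO stands in for request.files['data_file'] here
fake_upload = io.BytesIO(b'a,"b,c",d\n1,2,3\n')
print(rows_from_upload(fake_upload))  # [['a', 'b,c', 'd'], ['1', '2', '3']]
```

Note how the csv module keeps the quoted "b,c" field in one piece, which is the reason given above for not rewriting the quotes by hand.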

Read a .csv into pandas from F: drive on Windows 7

I have a .csv file on my F: drive on Windows 7 64-bit that I'd like to read into pandas and manipulate.
None of the examples I see read from anything other than a simple file name (e.g. 'foo.csv').
When I try this I get error messages that aren't making the problem clear to me:
import pandas as pd
trainFile = "F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv"
trainData = pd.read_csv(trainFile)
The error message says:
IOError: Initializing from file failed
I'm missing something simple here. Can anyone see it?
Update:
I did get more information like this:
import csv
if __name__ == '__main__':
    trainPath = 'F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv'
    trainData = []
    with open(trainPath, 'r') as trainCsv:
        trainReader = csv.reader(trainCsv, delimiter=',', quotechar='"')
        for row in trainReader:
            trainData.append(row)
    print trainData
I got a permission error on read. When I checked the properties of the file, I saw that it was read-only. I was able to read 892 lines successfully after unchecking it.
Now pandas is working as well. No need to move the file or amend the path. Thanks for looking.
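For anyone hitting the same vague "Initializing from file failed" message, opening the file with plain open() tends to surface the underlying OS error, as happened above. A minimal sketch in Python 3 syntax; `probe_read` is a hypothetical helper name.

```python
def probe_read(path):
    """Return None if the first byte can be read, else the OS error as text."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
    except OSError as exc:
        return "{}: {}".format(type(exc).__name__, exc)
    return None
```

Calling `probe_read(trainFile)` before `pd.read_csv(trainFile)` would report a permission or missing-file problem by name instead of the generic pandas message.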
I cannot promise that this will work, but it's worth a shot:
import pandas as pd
import os
trainFile = "F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv"
pwd = os.getcwd()
os.chdir(os.path.dirname(trainFile))
trainData = pd.read_csv(os.path.basename(trainFile))
os.chdir(pwd)
A better solution is to use raw strings like r'pathname\filename' rather than 'pathname\filename'. See Lexical Analysis for more details.
I had the same issue and resolved it.
Check that the path to your file is correct.
I initially had the path like this:
dfTrain = pd.read_csv("D:\\Kaggle\\labeledTrainData.tsv",header=0,delimiter="\t",quoting=3)
This returned an error because the path was wrong. I then changed the path as below, and it works fine:
dfTrain = pd.read_csv("D:\\Kaggle\\labeledTrainData.tsv\\labeledTrainData.tsv",header=0,delimiter="\t",quoting=3)
This is because my earlier path was not correct. Hope you get it resolved.
This happens to me quite often. Usually I open the csv file in Excel, and save it as an xlsx file, and it works.
so instead of
df = pd.read_csv(r"...\file.csv")
Use:
df = pd.read_excel(r"...\file.xlsx")
If you're sure the path is correct, make sure no other programs have the file open. I got that error once, and closing the Excel file made the error go away.
Try this:
import os
import pandas as pd
trainFile = os.path.join('F:',os.sep,'Projects','Python','coursera','intro-to-data-science','train.csv' )
trainData = pd.read_csv(trainFile)
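The os.sep argument in that join is doing real work. A short sketch using ntpath (the Windows flavour of os.path, importable on any platform) and pathlib, with the path shortened to three parts for illustration:

```python
import ntpath
from pathlib import PureWindowsPath

# Without a separator after the drive, the result is relative to the
# current directory on F: rather than the root of the drive.
drive_relative = ntpath.join('F:', 'Projects', 'train.csv')

# Adding the separator, as the answer above does with os.sep, anchors it at the root.
absolute = ntpath.join('F:', '\\', 'Projects', 'train.csv')

# pathlib builds the same absolute path when the drive is given as 'F:/'.
with_pathlib = str(PureWindowsPath('F:/', 'Projects', 'train.csv'))

print(drive_relative)  # F:Projects\train.csv
print(absolute)        # F:\Projects\train.csv
print(with_pathlib)    # F:\Projects\train.csv
```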
