Trying to parse a Parquet file into a pandas DataFrame - Python

As stated above, I am trying to parse a Parquet file into a pandas DataFrame, but I always get the error shown in the screenshot below. I also switched from VS Code to Sublime because VS Code did not accept the pyarrow import even though the package was installed (see picture). The line above also gives the same error.
Thanks in advance, guys.
Edit: I have now tried the following, which led to the following error (screenshot).

This could resolve your problem:
import pandas as pd
df = pd.read_parquet(path=your_file_path)
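The screenshots are not reproduced here, so the actual cause is unknown. One common reason for an editor rejecting an import that is installed is that the editor runs a different Python interpreter than the one the package was installed into. The following is a minimal sketch for checking that, assuming this is the issue; the helper name `parquet_engine_report` is made up for illustration.

```python
import importlib.util
import sys


def parquet_engine_report():
    """Report which optional Parquet engines this interpreter can import."""
    # pandas.read_parquet needs one of these packages to be installed
    # in the same environment as the interpreter that runs the script.
    engines = ("pyarrow", "fastparquet")
    return {name: importlib.util.find_spec(name) is not None for name in engines}


# Show which interpreter is actually running, then what it can see.
print("Interpreter:", sys.executable)
for engine, available in parquet_engine_report().items():
    print(f"{engine}: {'found' if available else 'not found'}")
```

If the interpreter path printed here differs from the one where pyarrow was installed, selecting the matching interpreter in the editor may be enough.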

Related

Jupyter Notebook - Pandas

I am new to Jupyter, NumPy, and pandas. I looked for a solution online but could not find anything that solves the error.
I am trying to load a CSV file, but I get an error with every solution I try. I also tried uploading the file to Jupyter Notebook to use it directly, but my system responds that the file is not there. I converted the file from .txt to .csv, assuming that was the problem, but I still can't load it directly. So I decided to use the long form (the full path), but I still have problems.
data = pd.read_csv(r'C:/Users/kharm/Dropbox/Jupyter/Assignment/AutoInsurSweden.csv', header=None)
data.head()
I got the error:
ParserError: Error tokenizing data. C error: Expected 1 field in line 12, saw 2
If I modify to:
data = pd.read_csv(r'C:/Users/kharm/Dropbox/Jupyter/Assignment/AutoInsurSweden.csv', header=None, error_bad_lines=False )
data.head()
or
data = pd.read_csv(r'C:/Users/kharm/Dropbox/Jupyter/Assignment/AutoInsurSweden.csv', header=None, sep='\n')
data.head()
That error suggests the problem is with the data file itself, not your code. It seems that line 12 of the CSV has an extra data field compared with the lines before it.
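To see which lines disagree, it can help to count the fields on each row before handing the file to pandas. This is a rough sketch using only the standard library; `field_counts` and `odd_lines` are hypothetical helper names, and the comma delimiter and UTF-8 encoding are assumptions about the file.

```python
import csv
from collections import Counter


def field_counts(path, delimiter=",", encoding="utf-8"):
    """Return (row_number, field_count) for every row in the file."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return [(number, len(row)) for number, row in enumerate(reader, start=1)]


def odd_lines(counts):
    """Return the rows whose field count differs from the most common one."""
    if not counts:
        return []
    expected = Counter(count for _, count in counts).most_common(1)[0][0]
    return [(number, count) for number, count in counts if count != expected]
```

Calling `odd_lines(field_counts(path))` on the file lists the rows that stand out. If those turn out to be free-text description lines at the top of the file, passing `skiprows` to `pd.read_csv` may be enough to get past them.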

I am getting an error expected <class 'openpyxl.styles.fills.Fill'> reading an excel file with pandas read_excel

I am trying to read an excel file with pandas read_excel function, but I keep getting the following error:
expected <class 'openpyxl.styles.fills.Fill'>
The exact code I typed is:
corrosion_df=pd.read_excel('Corrosion.xlsx')
I already double checked the filename and it is correct. The file is also saved in the correct directory. I don't know what's going wrong because I used this method many times and until now it has always worked. Thank you very much in advance.
I had the same issue, but I found that when I made some changes to the spreadsheet and resaved it, the problem stopped.
I think the answer here is the most helpful:
Error when trying to use module load_workbook from openpyxl
My data was also being autogenerated by another site, so I'm assuming there is some slight corruption in their process. I'm adding a CSV option to my project just to give an alternative.
The only way was to manually open it, save it and load it.
My workaround is to convert the file using LibreOffice.
I ran this command line in my Jupyter notebook:
!libreoffice --convert-to xls 'my_file.xlsx'
This creates a new file named my_file.xls, which can now be opened with pandas.
import pandas as pd
df = pd.read_excel('my_file.xls')
I had the same problem. I just resaved the excel file.
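The LibreOffice workaround above can also be run from a plain script rather than a notebook cell. This is only a sketch under assumptions: LibreOffice is installed and on PATH as `libreoffice` or `soffice`, and the helper names `convert_command` and `convert` are invented here. The `--headless` flag is added so no window opens.

```python
import shutil
import subprocess


def convert_command(source, target_format="xls", binary="libreoffice"):
    """Build the LibreOffice command line for a headless conversion."""
    return [binary, "--headless", "--convert-to", target_format, source]


def convert(source, target_format="xls"):
    """Run the conversion; the output file is written by LibreOffice."""
    binary = shutil.which("libreoffice") or shutil.which("soffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice was not found on PATH")
    subprocess.run(convert_command(source, target_format, binary), check=True)
```

After `convert('my_file.xlsx')`, the resulting my_file.xls can be read with `pd.read_excel` as shown above.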

How can I show my CSV data file in Jupyter Notebook using PySpark

I am working on a big-data CSV dataset. I need to read it in Jupyter Notebook using PySpark. My data is about 4+ million records (540000 rows and 7 columns). What can I do to print my whole dataset?
I tried to use a pandas DataFrame, but it shows the error in the attached screenshot. Then I tried to change the encoding type, and it gives SyntaxError: unexpected EOF while parsing. Can you please help me?
For the last screenshot, I think you are missing how files are read in Python using the with statement. If your data is in a JSON file, you can read it as follows:
import json
with open('data_file.json', encoding='utf-8') as data_file:
    data = json.loads(data_file.read())
Note that the file name is a quoted string, 'data_file.json', and not the bare name data_file.json. The same logic holds for the CSV example.
If it is in a CSV file, that's pretty straightforward:
file = pd.read_csv('data_file.csv')
Try removing the encoding parameter in your CSV reading step.
I would not recommend using a notebook to read such a huge file, even if you are using PySpark for it. Consider using a portion of the file for visualization in a notebook and then switching to another platform.
Hope it helps
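To look at only a portion of a large CSV, as suggested above, one option is to stop reading after the first few rows. The sketch below uses the standard library instead of PySpark or pandas; `preview_rows` is a hypothetical helper name, and the UTF-8 default is an assumption about the file.

```python
import csv
from itertools import islice


def preview_rows(path, limit=5, encoding="utf-8"):
    """Read at most `limit` rows without loading the whole file."""
    with open(path, newline="", encoding=encoding) as handle:
        # islice stops the reader early, so the rest of the file is never parsed.
        return list(islice(csv.reader(handle), limit))
```

With pandas, passing `nrows` to `pd.read_csv` gives a similar preview as a DataFrame.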

Pandas excel reading buffer error (python 3)

I am having a problem reading an Excel file from a download link using pandas. The excelString below loads correctly and looks like an Excel file, but when I try to open it as Excel using pandas, it says the file name is too long. Any assistance would be appreciated. This is a useful generic problem to solve for anyone accessing iShares index membership info.
import urllib.request
import pandas as pd
f = urllib.request.urlopen('https://www.ishares.com/us/239714/fund-download.dl')
excelString = f.read().decode('utf-8')
pd.ExcelFile(excelString)
The error returned is OSError: [Errno 36] File name too long
Works fine for me using Python3 and pandas 0.16.2 - do you have the latest version?
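One likely explanation for the error: when pd.ExcelFile receives a str, it treats it as a file path, so the decoded file contents end up being used as a file name. A possible way around this is to keep the download as bytes and wrap it in an in-memory file object. The sketch below assumes the download is a workbook format pandas can parse, which has not been verified here; the helper names are made up.

```python
import io
import urllib.request


def download_bytes(url):
    """Fetch the response body as raw bytes, without decoding it to text."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def as_file_object(raw):
    """Wrap raw bytes in an in-memory binary file object."""
    return io.BytesIO(raw)


def open_workbook(url):
    """Download a workbook and hand it to pandas as a file object."""
    import pandas as pd  # imported here so the helpers above work without pandas

    return pd.ExcelFile(as_file_object(download_bytes(url)))
```

For the question above, that would be `open_workbook('https://www.ishares.com/us/239714/fund-download.dl')`.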

Python crashing when using error_bad_lines=False in Pandas DataFrame

Whenever I load data from a CSV into a pandas DataFrame and use:
error_bad_lines=False
it gives a Segmentation fault: 11 error and crashes every time.
Here it is:
df = pandas.read_csv(filename, error_bad_lines=False)
I got the problem fixed. Somehow the file format had changed, so the file was not being parsed properly, which caused the crash.
Sorry guys
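For readers on newer pandas: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0, with on_bad_lines as its replacement. The sketch below assumes pandas 1.3 or later and uses made-up inline data; choosing engine="python" is one way to sidestep the C parser, not a confirmed fix for the crash described above.

```python
import io

import pandas as pd

# Made-up sample: the third line has one field too many.
sample = io.StringIO("a,b\n1,2\n3,4,5\n6,7\n")

# on_bad_lines="skip" drops rows with too many fields instead of raising.
df = pd.read_csv(sample, engine="python", on_bad_lines="skip")
print(df)
```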
