I have been trying to export a table to a CSV file. The table is copied to the clipboard and is ready to be pasted into a CSV (at least manually).
I have seen that pandas can read whatever is in the clipboard into a DataFrame, so I tried this code:
df = pd.read_clipboard()
df
df.to_csv('data.csv')
However, I got this error:
pandas.errors.ParserError: Expected 10 fields in line 5, saw 16. Error could possibly be due to
quotes being ignored when a multi-char delimiter is used.
I have been looking for a solution or an alternative but failed.
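The closest lead I found is that the default separator of read_clipboard is the regex '\s+', which is a multi-character delimiter, matching what the error message says. One thing I am considering is forcing a single tab instead, since read_clipboard forwards its keyword arguments to read_csv (an untested sketch; the actual delimiter of my copied table is a guess on my part):
import pandas as pd

# read_clipboard forwards keyword arguments to read_csv; the default
# sep='\s+' is a multi-character regex, which the error complains about.
# A single tab is a guess at the actual delimiter of the copied table.
df = pd.read_clipboard(sep='\t')
df.to_csv('data.csv')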
Thanks in advance!
Related
Simple problem that has me completely dumbfounded. I am trying to read an Excel document with pandas but I am stuck with this error:
ValueError: Worksheet index 0 is invalid, 0 worksheets found
My code snippet works well for all but one Excel document linked below. Is this an issue with my Excel document (which definitely has sheets when I open it in Excel) or am I missing something completely obvious?
Excel Document
EDIT - Forgot the code. It is quite simply:
import pandas as pd
df = pd.read_excel(FOLDER + 'omx30.xlsx')
FOLDER is the absolute path to the folder in which the file is located.
Your file is saved as Strict Open XML Spreadsheet (*.xlsx). Because it shares the same extension as Excel Workbook, it isn't obvious that the format is different. Open the file in Excel and Save As. If the selected option is Strict Open XML Spreadsheet (*.xlsx), change it to Excel Workbook (*.xlsx), save it and try loading it again with pandas.
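Once it is re-saved, loading should work again; as an optional sanity check you can make the engine explicit (a minimal sketch, reusing the FOLDER path from your question; if I understand correctly, the engine pandas uses cannot parse the Strict variant, which would explain why it reports 0 worksheets):
import pandas as pd

# After re-saving as a regular Excel Workbook (*.xlsx) in Excel,
# this should load; making the engine explicit is optional.
df = pd.read_excel(FOLDER + 'omx30.xlsx', engine='openpyxl')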
EDIT: with the info that you have the original .csv, re-do your cleaning and save it as a .csv from Excel; or, if you prefer, pd.read_csv the original and do your cleaning directly in pandas.
It may be that the first sheet (index 0) was deleted from your Excel file, so the remaining sheets no longer start at index 0; since the sheet_name parameter of pd.read_excel defaults to 0, the error is raised.
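If that is the case, you can check by listing the sheets pandas actually sees and loading one by name instead of by index (a sketch, reusing the FOLDER path from the question):
import pandas as pd

# Inspect which worksheets pandas can see, then load the first one
# by name rather than relying on the default sheet_name=0.
xls = pd.ExcelFile(FOLDER + 'omx30.xlsx')
print(xls.sheet_names)
df = pd.read_excel(xls, sheet_name=xls.sheet_names[0])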
It seems there is indeed a problem with my Excel file; we have not been able to figure out what, though. For now the path of least resistance is simply saving it as a .csv in Excel and reading that with pd.read_csv instead.
I am working on a big CSV dataset. I need to read it in a Jupyter notebook using pyspark. My data is about 4 million values (540,000 rows and 7 columns). What can I do to display my entire dataset?
I tried to use a pandas DataFrame, but it shows an error (see the attached screenshot). I then tried to change the encoding type, and it gives SyntaxError: unexpected EOF while parsing. Can you please help me?
For the last screenshot, I think you are missing how files are read in Python using the with context manager. If your data is in a JSON file, you can read it as follows:
import json

# The with block closes the file automatically once reading is done.
with open('data_file.json', encoding='utf-8') as data_file:
    data = json.loads(data_file.read())
Note that it is the string 'data_file.json' and not the bare name data_file.json. The same logic holds for the CSV example.
If it is in a CSV file, that's pretty straightforward:
file = pd.read_csv('data_file.csv')
Try removing the encoding parameter in your CSV reading step.
I would not recommend using a notebook for reading such a huge file, even if you are using pyspark. Consider using a portion of the file for visualization in a notebook, and then switch to another platform.
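If you do stay in the notebook with pyspark, a minimal sketch of reading the CSV and showing only a slice (the file name and header row are assumptions based on your description):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-preview").getOrCreate()

# Spark reads lazily, so nothing is pulled into the notebook yet.
df = spark.read.csv("data_file.csv", header=True, inferSchema=True)

# Show only a slice instead of printing all 540,000 rows.
df.show(50)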
Hope it helps
I have an 8GB CSV file that contains information about companies created in France.
When I try to load it in Python using pandas.read_csv, I get various types of errors; I believe a combination of 3 factors causes the problem:
The size of the file (8GB)
The French characters in the cells (like “é”)
The fact that this CSV file is organized like an Excel file; the fields are split into columns, just as in an XLS file
When I tried to import the file using:
import pandas as pd
df = pd.read_csv(r'C:\..\data.csv')
I got the following error: OSError: Initializing from file failed
Then, to rule out the size issue, I copied the file (data.csv) and kept only the first 25 rows, saving it as a much lighter file (data2.csv):
df = pd.read_csv(r'C:\..\data2.csv')
I get the same OSError: Initializing from file failed error.
After some research, I tried the following code with data2.csv:
df = pd.read_csv(r'C:\..\data2.csv', sep="\t", encoding="latin")
This time, the import works, but in a weird format, like this: https://imgur.com/a/y6WJHC5. All fields end up in the same column.
So even with the size problem eliminated, it doesn't properly read the CSV file. And I still need to work with the main file, data.csv, so I tried the same code on it:
df = pd.read_csv(r'C:\..\data.csv', sep="\t", encoding="latin")
I get: ParserError: Error tokenizing data. C error: out of memory
What is the proper code to read data.csv correctly?
Thank you,
From your image it looks like the file is separated by semi-colons (;). Try using ";" as the sep in the read_csv function.
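A sketch on the small file, keeping the latin encoding you already found (the path is the one from your question):
import pandas as pd

# Semicolon separator instead of tab; latin encoding handles the
# French accented characters.
df = pd.read_csv(r'C:\..\data2.csv', sep=';', encoding='latin')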
Pandas reads the CSV into RAM; an 8GB file could easily exhaust this. Try reading the file in chunks. See this answer.
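A minimal sketch of chunked reading (sep follows the previous answer's semicolon suggestion, encoding is from the question, and process() is a hypothetical placeholder for whatever you do with each piece):
import pandas as pd

# chunksize is the number of rows per chunk; read_csv returns an
# iterator, so only one chunk sits in memory at a time.
reader = pd.read_csv(r'C:\..\data.csv', sep=';', encoding='latin',
                     chunksize=100_000)
for chunk in reader:
    process(chunk)  # hypothetical per-chunk processing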
I'm using the below code to write to a CSV file.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("nullValue"," ").save("/home/user/test_table/")
When I execute it, I'm getting the following error:
java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Could anyone please help?
I had the same problem (though I was not using the nullValue option) and I solved it by using the fillna method.
I also realised that fillna was not working with _corrupt_record, so I dropped that column since I didn't need it.
df = df.drop('_corrupt_record')
df = df.fillna("")
df.write.option('header', 'true').format('csv').save('file_csv')
When I try to load my excel file with:
pd.read_excel(filename)
It does not create a DataFrame unless I delete the column with the French text (e.g. Vérité, café). I tried setting encoding='utf-8', but it didn't work.
Please help.
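One thing I have been meaning to try, though I am not sure it applies, is forcing the openpyxl engine, since from what I have read the encoding argument does not apply to .xlsx files at all:
import pandas as pd

# engine='openpyxl' is a guess on my part; .xlsx files store their
# text as UTF-8 XML internally, so an encoding argument should not
# be needed for the French characters.
df = pd.read_excel(filename, engine='openpyxl')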