ParserError in read_csv() - python

I'm trying to read 100 CSVs and collate data from all into a single CSV.
I made use of:
all_files = pd.DataFrame()
for file in files:
    all_files = all_files.append(pd.read_csv(file, encoding='unicode_escape')).reset_index(drop=True)
where files is a list of the filepaths of the 100 CSVs.
Now each CSV may have a different number of columns, and within a single CSV each row may have a different number of columns too.
I want to match the column header names, put the data from all the CSVs into the correct columns, and keep adding new columns to my final DataFrame as I go.
The above code works fine for 30-40 CSVs, then breaks with the following error:
ParserError: Error tokenizing data. C error: Expected 16 fields in line 78, saw 17
Any help will be much appreciated!

There are a couple of ways to read variable-length CSV files.
First, you can specify the column names beforehand. If you are not sure of the number of columns, you can give a reasonably large number:
df = pd.read_csv('filename.csv', header=None, names=list(range(10)))
The other option is to read the entire file into a single column, using a delimiter that does not occur in the data, and then split on commas:
df = pd.read_csv('filename.csv', header=None, sep='\t')
df = df[0].str.split(',', expand=True)

It's because you are trying to read all the CSV files into a single DataFrame. When the first file is read, the number of columns for the DataFrame is decided, and it then results in an error when a different number of columns is fed in. If you really want to concat them, you should read them all in Python, adjust their columns, and then concat them.
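A minimal sketch of that approach, assuming 20 columns is enough to cover the widest row in any file (the path pattern is also just an example):
import glob
import pandas as pd

files = glob.glob('path/to/csvs/*.csv')
frames = []
for f in files:
    # A generous fixed set of column names lets ragged rows parse without a ParserError
    frames.append(pd.read_csv(f, encoding='unicode_escape', header=None, names=list(range(20))))
# concat aligns on column names and fills missing columns with NaN
all_files = pd.concat(frames, ignore_index=True, sort=False)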

Related

Python - Extract mutual column of multiple csv files into one DataFrame for statistics/plotting

I want to analyze 26 csv files from lab experiments, all in the same format.
However, I just want to extract column #5 of each csv file and put all of the #5 columns into one DataFrame, with each column named after its source csv file.
The final df should contain 26 columns plus one header row.
In simpler words: load multiple csv files, extract the same column from each, and put all the extracted columns into a new DataFrame with column names = filename_n, filename_n+1, ...
I could partly find some lines of code to get the df, but I'm not able to adjust it for the final goal...
import glob
import pandas as pd

path = '*' # use your path
files = glob.glob(path + "/*.csv")
get_df = lambda f: pd.read_csv(f, header=None)
dodf = {f: get_df(f) for f in files}  # dict of DataFrames, keyed by filename
dodf
If someone has an idea, I'd highly appreciate it!
Thanks,
Niels
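
One way to finish this, sketched from the dict-of-DataFrames code above (assuming column #5 means positional index 4 and the files have no header row):
import glob
import pandas as pd

files = glob.glob('*' + "/*.csv")
# Pull column #5 (positional index 4) from each file, keyed by filename
cols = {f: pd.read_csv(f, header=None).iloc[:, 4] for f in files}
# One column per file, named after the file; shorter columns are padded with NaN
result = pd.DataFrame(cols)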

How to add columns and delete columns in a JSON file, then save it as a CSV

I have tried to use a DataFrame to add columns and values to the JSON file, but it seems that after I tried to delete some columns it returned to the original data. I was also not able to save it as a CSV file. So I was wondering, maybe I cannot use a DataFrame for this?
The file is like a list divided into different columns (around 30 rows in total). There are some columns I would like to delete, such as the route and URL columns, while adding three columns: length, maxcal, mincal (all the values for these three columns are found in the route column).
This is what I have done so far, and where I got stuck:
import pandas as pd
import json
data = pd.read_json('fitness.json')  # fitness.json is the filename of the json file
fitness2 = pd.DataFrame(fitness2)    # fitness2 holds the three new columns (length, maxcal, mincal)
fitness2
data.join(fitness2, lsuffix="_left")  # to join the three columns into the data table
I am not sure how I can delete the route, 'MapURL', 'MapURL_tc', and 'MapURL_sc' columns and then finally save the result as a CSV like the output shown.
Thank you.
You can drop the columns and then concat the two DataFrames:
data.drop(['MapURL', 'MapURL_tc', 'MapURL_sc'], inplace=True, axis=1)
result = pd.concat([data, fitness2], axis=1)  # join the three new columns into the data table
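Since the question also asks about saving, the combined frame can then be written out with to_csv (the output filename here is just an example):
result.to_csv('fitness_output.csv', index=False)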

How to efficiently remove junk above headers in an .xls file

I have a number of .xls datasheets which I am looking to clean and merge.
Each datasheet is generated by a larger system which cannot be changed.
The method that generates the datasheets displays the selected parameters for the data set (E.G 1). I am looking to automate the removal of these.
The number of rows that this takes up varies, so I am unable to blanket-remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution, where I convert the file to a CSV, read it as a string, and remove everything before the first column header:
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
    s = f.read() + '\n'
a = s[s.index("Col1"):]
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
- Multiple format conversions
- Reading every line in the file when the only rows being altered occur within the first ~20
- The DataFrame ends up with column headers shifted over by one and must be re-aligned (less of a concern)
With some of the files being around 20mb, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here is an idea to speed up your process by doing some operations directly on your DataFrame. Given that you know your first column name to be Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as the header
df.columns = df.iloc[column_row]
# Clear the columns' name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
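For the merging step mentioned in the question, the cleanup above can be wrapped in a function and applied to the whole batch (a sketch; the file pattern and the Col1 title are assumptions taken from the question):
import glob
import pandas as pd

def clean_sheet(path, first_col='Col1'):
    # Read without assuming a header, then locate the real header row
    df = pd.read_excel(path, header=None)
    header_row = df.index[df.iloc[:, 0] == first_col][0]
    df.columns = df.iloc[header_row]
    df.columns.name = None
    return df.iloc[header_row + 1:].reset_index(drop=True)

merged = pd.concat((clean_sheet(f) for f in glob.glob('*.xls')), ignore_index=True)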

Input multiple csv files and get one output?

I have 30 csv files. I want to give them as input to a for loop in pandas.
The files have names such as fileaa, fileab, fileac, filead, ...
And I would like to receive one output.
Usually I use read_csv, but due to a memory error, read_csv doesn't work.
f = "./file.csv"
df = pd.read_csv(f, sep="/", header=0, dtype=str)
P.S. Only the first file has the column titles, but the number of columns is the same in all the files.
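
Since reading everything at once runs out of memory, one option is to stream each file in chunks and append to a single output CSV, taking the header from the first file only. A sketch under those assumptions (the chunk size and output name are arbitrary):
import glob
import pandas as pd

files = sorted(glob.glob('./file*'))  # fileaa, fileab, ...
# Grab the column names from the first file, which is the only one with a header
cols = pd.read_csv(files[0], sep='/', header=0, dtype=str, nrows=0).columns

wrote_header = False
for i, f in enumerate(files):
    header = 0 if i == 0 else None   # only the first file carries a header row
    names = None if i == 0 else cols
    for chunk in pd.read_csv(f, sep='/', header=header, names=names, dtype=str, chunksize=100000):
        chunk.to_csv('combined.csv', mode='a', header=not wrote_header, index=False)
        wrote_header = True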

pandas.read_csv not partitioning data at semicolon delimiter

I'm having a tough time correctly loading a CSV file into a pandas DataFrame. The file is a CSV saved from MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success. I have also tried read_table.)
What I want to see in my DataFrame is each semicolon-separated value in a separate column (I understand that read_csv works this way?).
Unfortunately, I always end up with the whole row placed in the first column of the DataFrame. So basically, after loading I have many rows but only one column (two if I also count the index).
I have placed a sample here:
datafile
Any ideas are welcome.
Add quoting=3, which stands for QUOTE_NONE:
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=";", quoting=3)
This will give a [7 rows x 23 columns] DataFrame.
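Equivalently, the named constant from the standard csv module can be passed instead of the bare 3, which reads a little more clearly:
import csv
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter=";", quoting=csv.QUOTE_NONE)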
The problem is the enclosing quote characters, which can be escaped with a \ character:
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter='\;')  # with the python engine, a multi-character sep is treated as a regex
