Reading data with more columns than expected into a dataframe - python

I have a number of .csv files that I download into a directory.
Each .csv is supposed to have 3 columns of information. The head of one of these files looks like:
17/07/2014,637580,10.755
18/07/2014,61996,10.8497
21/07/2014,126758,10.8208
22/07/2014,520926,10.8201
23/07/2014,370843,9.2883
The code that I am using to read the .csv into a dataframe (df) is:
df = pd.read_csv(adj_directory+'\\'+filename, error_bad_lines=False, names=['DATE', 'PX', 'RAW'])
Where I name the three columns (DATE, PX and RAW).
This works fine when the file is formatted correctly. However, I have noticed that sometimes the .csv has a slightly different format, for example:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
where a column value is missing and an extra comma appears in its place. This means that the file fails to load into the dataframe (the df dataframe is empty).
Is there a way to read the data into a dataframe with the extra comma (ignoring the offending row) so the df would look like:
09/07/2014,26268315,NaN
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,NaN
15/07/2014,205019,10.8607

It's probably best to fix the file upstream so that missing values aren't written as an extra comma. But if necessary you can correct the file in Python by replacing ,, with just , line by line. Taking your bad file as test.csv:
import re
import csv

patt = re.compile(r",,")

# Collapse each ",," to "," as lines are read, then write the cleaned rows out.
with open('test.csv') as f, open('corrected.csv', 'w') as f2:
    for line in csv.reader(patt.sub(',', s) for s in f):
        f2.write(','.join(str(x) for x in line))
        f2.write('\n')
Output: corrected.csv
09/07/2014,26268315,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,
15/07/2014,205019,10.8607
Then you should be able to read this file in without issue:
import pandas as pd
df = pd.read_csv('corrected.csv', names=['DATE', 'PX', 'RAW'])
         DATE        PX      RAW
0  09/07/2014  26268315      NaN
1  10/07/2014   6601181  16.3857
2  11/07/2014    916651  12.5879
3  14/07/2014    213357      NaN
4  15/07/2014    205019  10.8607
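A pandas-only alternative, if the malformed rows never carry more than one extra trailing comma, is to declare a throw-away fourth column: the 4-field rows then parse cleanly, the 3-field rows are padded with NaN, and you drop the extra column afterwards. A minimal sketch (the EXTRA name is mine, not from the question):
import pandas as pd

# Rows with a trailing comma fill EXTRA; normal rows get NaN there.
df = pd.read_csv(adj_directory + '\\' + filename,
                 names=['DATE', 'PX', 'RAW', 'EXTRA'])
df = df.drop(columns='EXTRA')  # discard the throw-away column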

Had this problem yesterday.
Have you tried:
pd.read_csv(adj_directory+'\\'+filename,
            error_bad_lines=False, names=['DATE', 'PX', 'RAW'],
            keep_default_na=False,
            na_values=[''])
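A version note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 in favor of on_bad_lines. If you're on a recent pandas, the equivalent call should look roughly like this (same names and NA handling as above):
import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False in pandas >= 1.3
df = pd.read_csv(adj_directory + '\\' + filename,
                 names=['DATE', 'PX', 'RAW'],
                 on_bad_lines='skip',
                 keep_default_na=False,
                 na_values=[''])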

Related

How to convert .dat to .csv using Python? The data is being expressed in one column

Hi, I'm trying to convert a .dat file to a .csv file, but I have a problem with it.
I have a .dat file whose header (column names) looks like:
region GPS name ID stop1 stop2 stopname1 stopname2 time1 time2 stopgps1 stopgps2
Its delimiter is a tab.
I want to convert the .dat file to a .csv file, but the data keeps coming out in one column.
I tried the following code:
import pandas as pd

with open('file.dat', 'r') as f:
    df = pd.DataFrame([l.rstrip() for l in f.read().split()])
and
import csv

with open('file.dat', 'r') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('\t').split()
        newLines.append(newLine)

with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
But all the data is still being expressed in one column.
(I want 15 columns and 80,000 rows, but it comes out as 1 column and 1,200,000 rows.)
I want to convert this into a csv file with the original data structure.
Where is my mistake?
Please help me... It's my first time dealing with data in Python.
If you're already using pandas, you can just use pd.read_csv() with another delimiter:
df = pd.read_csv("file.dat", sep="\t")
df.to_csv("file.csv")
See also the documentation for read_csv and to_csv
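One small caveat worth knowing: by default to_csv also writes the dataframe's integer index as an extra first column. If you want the csv to contain only the original columns, pass index=False:
import pandas as pd

df = pd.read_csv("file.dat", sep="\t")
df.to_csv("file.csv", index=False)  # index=False keeps pandas' row index out of the file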

Pandas error reading csv with double quotes

I've read all related topics - like this, this and this - but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, the commas inside the dictionary column are treated as regular separators even though they sit inside quotation marks, so it raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce desired result? Thanks.
The problem is that the commas in the Content column are interpreted as separators. You can solve this by using pd.read_fwf to split on fixed character positions instead:
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
     ItemId                                                                           Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I don't think you'll be able to read it normally with pandas, because the delimiter is used multiple times within a single value; however, by reading it with plain Python and doing some processing, you can convert it to a pandas dataframe:
import pandas as pd

def splitValues(x):
    # Split each row on the first comma only.
    index = x.find(',')
    return x[:index], x[index+1:].strip()

data = open('file.csv')
columns = next(data)
columns = columns.strip().split(',')
df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))
OUTPUT:
     ItemId                                                                           Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
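Since the Content values in this example happen to be valid JSON, a possible follow-up, assuming you actually want Title/Year/Rated as real columns, is to parse them with json.loads and flatten the result:
import json
import pandas as pd

# df as built above, with the JSON strings in the 'Content' column
records = [json.loads(s) for s in df['Content']]
expanded = pd.json_normalize(records)  # columns: Title, Year, Rated
df = pd.concat([df[['ItemId']], expanded], axis=1)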

Wrong row count for CSV file in python

I am processing a csv file, and before that I am getting the row count using the code below.
total_rows=sum(1 for row in open(csv_file,"r",encoding="utf-8"))
The code has been written with the help given in this link.
However, the total_rows doesn't match the actual number of rows in the csv file. I have found an alternative way to do it, but would like to know why this is not working correctly.
In the CSV file, there are cells with huge text and I have to use the encoding to avoid errors reading the csv file.
Any help is appreciated!
Let's assume you have a csv file in which some cell contains multi-line text.
$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"
Which, by the look of it, has three lines, and wc -l agrees:
$ wc -l example.csv
3 example.csv
And so does open with sum:
sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3
But now if you read it with a csv parser such as pandas.read_csv:
import pandas as pd
df = pd.read_csv('./example.csv')
df
   colA                                    colB
0     1  Hi. This is Line 1.\nAnd this is Line2
An alternative way to fetch the correct number of rows is given below:
import csv

with open(csv_file, "r", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)
Excluding the header, the csv contains 1 row, which I believe is what you expect.
This is because colB's first cell (a.k.a. huge text block) is now properly handled with the quotes wrapping the entire text.
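If the file is large, a small variation of the same idea avoids holding every row in memory: feed the csv.reader straight into sum:
import csv

with open(csv_file, "r", encoding="utf-8", newline="") as f:
    row_count = sum(1 for row in csv.reader(f))  # counts logical rows, not physical lines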
I think the problem here is that you are not counting rows, but counting newlines (either \r\n on Windows or \n on Linux). The problem arises when you have a cell containing text with newline characters, for example:
1, "my huge text\n with many lines\n"
2, "other text"
Your method on the data above will return 4 when actually there are only 2 rows.
Try using pandas or another library for reading CSV files. Example:
import pandas as pd

df = pd.read_csv(pathToCsv, sep=',', header=None)
number_of_rows = len(df.index)  # or df[0].count()
Note that len(df.index) and df[0].count() are not interchangeable as count excludes NaNs.

How to manage a problem reading a csv that is a semicolon-separated file where some strings contain semi-colons?

The problem I have can be illustrated by showing a couple of sample rows in my csv (semicolon-separated) file, which look like this:
4;1;"COFFEE; COMPANY";4
3;2;SALVATION ARMY;4
Notice that in one row, a string is in quotation marks and has a semicolon inside of it (none of the columns have quotation marks around them in my input file except for the ones containing semicolons).
These rows with the quotation marks and semicolons are causing a problem: my code treats the semicolon inside the quoted field as a delimiter, making it seem like the row has an extra field/column.
The desired output would look like this, with no quotation marks around "coffee company" and no semicolon between 'coffee' and 'company':
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
Actually, this column with "coffee company" is totally useless to me, so the final file could look like this too:
4;1;xxxxxxxxxxx;4
3;2;xxxxxxxxxxx;4
How can I get rid of just the semi-colons inside of this one particular column, but without getting rid of all of the other semi-colons?
The csv module makes it relatively easy to handle a situation like this:
# Contents of input_file.csv
# 4;1;"COFFEE; COMPANY";4
# 3;2;SALVATION ARMY;4
import csv

input_file = 'input_file.csv'  # Contents as shown in your question.

with open(input_file, 'r', newline='') as inp:
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        # If you don't care about what's in the column, use this instead:
        # row[2] = 'xyz'  # Value not needed.
        print(';'.join(row))
Printed output:
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
Follow-on question: How to write this data to a new csv file?
import csv

input_file = 'input_file.csv'  # Contents as shown in your question.
output_file = 'output_file.csv'

with open(input_file, 'r', newline='') as inp, \
     open(output_file, 'w', newline='') as outp:
    writer = csv.writer(outp, delimiter=';')
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        writer.writerow(row)
Here's an alternative approach using the pandas library, which spares you from writing explicit loops:
import pandas as pd

# Read csv into dataframe df
df = pd.read_csv('data.csv', sep=';', header=None)

# Remove semicolons in column 2
df[2] = df[2].apply(lambda x: x.replace(';', ''))
This gives the following dataframe df:
   0  1               2  3
0  4  1  COFFEE COMPANY  4
1  3  2  SALVATION ARMY  4
Pandas provides several built-in functions to help you manipulate data or draw statistical conclusions. Having the data in a tabular format can also make working with it more intuitive.
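If you then want the cleaned data back on disk, to_csv finishes the job; header=False and index=False are my assumption that the output file should mirror the input layout:
# Write the cleaned frame back out, semicolon-separated, without index or header
df.to_csv('data_clean.csv', sep=';', header=False, index=False)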

Pandas, read CSV ignoring extra commas

I am reading a CSV file with 8 columns into Pandas data frame. The final column contains an error message, some of which contain commas. This causes the file read to fail with the error ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9
Is there a way to ignore all commas after the 8th field, rather than having to go through the file and remove excess commas?
Code to read file:
import pandas as pd
df = pd.read_csv('C:\\somepath\\output.csv')
Line that works:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,some message
Line that fails:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,longer message, with commas
You can use the parameter usecols in the read_csv function to limit what columns you read in. For example:
import pandas as pd
pd.read_csv(path, usecols=range(8))
if you only want to read the first 8 columns.
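Note that usecols=range(8) drops whatever followed the eighth comma, so the long messages get truncated at their first embedded comma. If you need the whole message, pandas 1.4+ also accepts a callable for on_bad_lines (python engine only) that can repair the row instead of discarding it; a sketch under that assumption:
import pandas as pd

def rejoin_extras(bad_line):
    # Called only for rows with too many fields: glue everything
    # from the 8th field onward back into one message column.
    return bad_line[:7] + [','.join(bad_line[7:])]

df = pd.read_csv('C:\\somepath\\output.csv',
                 engine='python', on_bad_lines=rejoin_extras)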
You can use re.sub to replace the first few commas with, say, '|', save the intermediate result in a StringIO, then process that.
import pandas as pd
from io import StringIO
import re

for_pd = StringIO()
with open('MikeS159.csv') as mike:
    for line in mike:
        # Replace only the first 7 commas in each line with '|'
        new_line = re.sub(r',', '|', line.rstrip(), count=7)
        print(new_line, file=for_pd)

for_pd.seek(0)
df = pd.read_csv(for_pd, sep='|', header=None)
print(df)
I put the two lines from your question into a file to get this output.
       0       1  2                    3  4  5   6  \
0  061AE  Active  1  2017_02_24 15_18_01  6  1  13
1  061AE  Active  1  2017_02_24 15_18_01  6  1  13

                             7
0                 some message
1  longer message, with commas
You can take a shot at this workaround posted on the Pandas issues page:
import csv
import pandas as pd

df = pd.read_csv('filename.csv', parse_dates=True, dtype=object,
                 delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
You can also preprocess the data, basically changing the first 7 commas (0th to 6th, both inclusive) to semicolons and leaving the ones after that as commas,* using something like:
to_write = []
counter = 0
with open("sampleCSV.csv", "r") as f:
    for line in f:
        # Swap the first 7 commas of each line for semicolons.
        while counter < 7 and "," in line:
            line = list(line)
            line[line.index(",")] = ";"
            counter += 1
        counter = 0
        to_write.append("".join(line))
You can now read this to_write list into a pandas object like
data = pd.DataFrame(to_write)
data = pd.DataFrame(data[0].str.split(";").values.tolist())
or write it back into a csv and read using pandas with a semicolon delimiter such as read_csv(csv_path, sep=';').
I drafted this quickly without rigorous testing, but it should give you some ideas to try. Please comment if it does or doesn't help, and I'll edit it.
*Another option is to delete all commas after the 7th and keep using the comma separator. Either way the point is to differentiate the first 7 delimiters from the subsequent punctuation.
To add to @Tblaz's answer: if you use Google Colab you can use this solution. In my case the extra comma was in column 24, so I only had to read 23 columns:
import io
import pandas as pd
from google.colab import files

uploaded = files.upload()
x_train = pd.read_csv(io.StringIO(uploaded['x_train.csv'].decode('utf-8')),
                      skiprows=1, usecols=range(23), header=None)
