failed to read inch symbol in pandas read_csv - python

I have a csv with the details below:
Name,Desc,Year,Location
Jhon,12" Main Third ,2012,GR
Lew,"291" Line (12,596,3)",2012,GR
,All, 1992,FR
...
It is a very long file; I just showed the problematic lines. I am confused about how to read it into a pandas DataFrame. I tried the quotechar, quoting, and sep parameters of pandas read_csv, but still no success.
I have no control over how the csv is designed.

You can do something like this. Try it and see if it works for you:
import pandas as pd
import re

l1 = []
with open('/home/yusuf/Desktop/c1') as f:
    headers = f.readline().strip('\n').split(',')
    for a in f.readlines():
        if a:
            q = re.findall(r"^(\w*),(.*),\s?(\d+),(\w+)", a)
            if q:
                l1.append(q)

l2 = [list(b[0]) for b in l1]
df = pd.DataFrame(data=l2, columns=headers)
df
Output:
   Name                    Desc  Year Location
0  Jhon          12" Main Third  2012       GR
1   Lew  "291" Line (12,596,3)"  2012       GR
2                           All  1992       FR
Regex Demo: https://regex101.com/r/AU2WcO/1

You can't have the separator character inside a field.
For example, in
Lew,"291" Line (12,596,3)",2012,GR
pandas will assume you have 6 fields because you have 5 commas, even though two of them sit between quotes (the quotes themselves are malformed, so the quoting logic can't protect them). You would need to do some pre-processing of the text file to get rid of this issue, or ask for a different separator character (# or | seem to work well in my experience).
Pandas has no problems reading the other lines:
import pandas as pd
print(pd.read_csv('untitled.txt'))

   Name            Desc  Year Location
0  Jhon  12" Main Third  2012       GR
1   NaN             All  1992       FR
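
If only the Desc field can ever contain stray commas and quotes, one pre-processing idea is to split on the first comma and the last two commas only, so everything in between stays intact. A minimal sketch, assuming the file is named data.csv and always has exactly four logical columns:
import pandas as pd

rows = []
with open('data.csv') as f:
    header = f.readline().strip().split(',')
    for line in f:
        line = line.strip()
        if not line:
            continue
        # the first field ends at the first comma; the last two fields
        # start at the last two commas; everything in between is Desc,
        # commas, quotes and all
        name, rest = line.split(',', 1)
        desc, year, location = rest.rsplit(',', 2)
        rows.append([name, desc, year, location])

df = pd.DataFrame(rows, columns=header)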

Related

Find out the number of a specific character in each line using Python & store it in a dataframe

I have a text file containing many lines, each with some number of '*' characters. I tried read(), readline(), readlines(), and splitlines() at f.read() or p.splitlines(), but none of them are working: most of the time the count is zero, or seven (the total number of '*' in the whole file) for every line.
Please tell me where I am making a mistake.
LLL*LLL*LL
AA*AAAA
NN**NNN
My current code:
import pandas as pd

with open('Test.txt', 'r') as f:
    p = f.read()
print(p)

df12 = []
for l in p.splitlines():
    x = p.count('*')  # bug: this counts '*' in the whole file p, not in the current line l
    df12.append(x)
print(pd.DataFrame(df12))
pandas is probably overkill here, but if you want, you can read in each line by specifying '\n' as the separator; then you just want Series.str.count to count the character (it needs to be escaped as '\*', since '*' is a regex special character). squeeze=True forces the result to be a Series, since we know we should only have a single field.
s = pd.read_csv('Test.txt', header=None, sep='\n', squeeze=True)
s.str.count('\*')
0    2
1    1
2    2
Name: 0, dtype: int64
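Note that both of those options have aged: the squeeze parameter was deprecated in pandas 1.4 and removed in 2.0, and newer parsers may reject '\n' as a separator. A version-proof sketch that keeps the same counting idea:
import pandas as pd

# read the lines with plain Python and let pandas hold only the counts
with open('Test.txt') as f:
    counts = pd.Series([line.count('*') for line in f])
print(counts)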

Reading data with more columns than expected into a dataframe

I have a number of .csv files that I download into a directory.
Each .csv is supposed to have 3 columns of information. The head of one of these files looks like:
17/07/2014,637580,10.755
18/07/2014,61996,10.8497
21/07/2014,126758,10.8208
22/07/2014,520926,10.8201
23/07/2014,370843,9.2883
The code that I am using to read the .csv into a dataframe (df) is:
df = pd.read_csv(adj_directory + '\\' + filename, error_bad_lines=False, names=['DATE', 'PX', 'RAW'])
Where I name the three columns (DATE, PX and RAW).
This works fine when the file is formatted correctly. However I have noticed that sometimes the .csv has a slightly different format and can look like for example:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
where a column value is missing and an extra comma appears in its place. This means that the file fails to load into the dataframe (the df dataframe ends up empty).
Is there a way to read the data into a dataframe with the extra comma (ignoring the offending row) so the df would look like:
09/07/2014,26268315,NaN
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,NaN
15/07/2014,205019,10.8607
Probably best to fix the file upstream so that missing values don't leave a double comma behind. But if necessary you can correct the file in Python, by replacing ,, with just , (line by line). Taking your bad file as test.csv:
import csv
import re

patt = re.compile(r",,")

with open('test.csv') as f, open('corrected.csv', 'w') as f2:
    # substitute ',,' with ',' on each line before csv.reader tokenizes it;
    # the with statement closes both files, so no explicit close() is needed
    for line in csv.reader(map(lambda s: patt.sub(',', s), f)):
        f2.write(','.join(str(x) for x in line))
        f2.write('\n')
Output: corrected.csv
09/07/2014,26268315,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,
15/07/2014,205019,10.8607
Then you should be able to read this file in without issue:
import pandas as pd
df = pd.read_csv('corrected.csv', names=['DATE', 'PX', 'RAW'])
         DATE        PX      RAW
0  09/07/2014  26268315      NaN
1  10/07/2014   6601181  16.3857
2  11/07/2014    916651  12.5879
3  14/07/2014    213357      NaN
4  15/07/2014    205019  10.8607
Had this problem yesterday.
Have you tried:
pd.read_csv(adj_directory + '\\' + filename,
            error_bad_lines=False, names=['DATE', 'PX', 'RAW'],
            keep_default_na=False,
            na_values=[''])
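On pandas 1.3 and later, error_bad_lines is deprecated in favour of on_bad_lines; a sketch of the equivalent call (same assumed directory and filename as above):
import pandas as pd

# pandas >= 1.3: on_bad_lines='skip' replaces error_bad_lines=False
df = pd.read_csv(adj_directory + '\\' + filename,
                 on_bad_lines='skip', names=['DATE', 'PX', 'RAW'],
                 keep_default_na=False, na_values=[''])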

Pandas, read CSV ignoring extra commas

I am reading a CSV file with 8 columns into Pandas data frame. The final column contains an error message, some of which contain commas. This causes the file read to fail with the error ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9
Is there a way to ignore all commas after the 8th field, rather than having to go through the file and remove excess commas?
Code to read file:
import pandas as pd
df = pd.read_csv('C:\\somepath\\output.csv')
Line that works:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,some message
Line that fails:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,longer message, with commas
You can use the parameter usecols in the read_csv function to limit what columns you read in. For example:
import pandas as pd
pd.read_csv(path, usecols=range(8))
if you only want to read the first 8 columns.
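Applied to the two sample lines via an in-memory buffer, this would look like the quick sketch below (header=None because the sample has no header row; note that whether usecols tolerates the ragged ninth field has varied across pandas versions):
import pandas as pd
from io import StringIO

sample = (
    "061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,some message\n"
    "061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,longer message, with commas\n"
)
# usecols=range(8) keeps only the first 8 fields; the stray comma in the
# second message would otherwise make pandas see 9 fields on that line
df = pd.read_csv(StringIO(sample), header=None, usecols=range(8))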
You can use re.sub to replace the first few commas with, say, '|', save the intermediate result in a StringIO, and then process that.
import re
from io import StringIO

import pandas as pd

for_pd = StringIO()
with open('MikeS159.csv') as mike:
    for line in mike:
        # replace only the first 7 commas, leaving any commas in the message intact
        new_line = re.sub(r',', '|', line.rstrip(), count=7)
        print(new_line, file=for_pd)

for_pd.seek(0)
df = pd.read_csv(for_pd, sep='|', header=None)
print(df)
I put the two lines from your question into a file to get this output.
       0       1  2                    3  4  5   6                            7
0  061AE  Active  1  2017_02_24 15_18_01  6  1  13                 some message
1  061AE  Active  1  2017_02_24 15_18_01  6  1  13  longer message, with commas
You can take a shot at this workaround posted on the Pandas issues page:
import csv
import pandas as pd

df = pd.read_csv('filename.csv', parse_dates=True, dtype=object,
                 delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
You can also preprocess the data, basically changing the first 7 commas (0th to 6th, both inclusive) to semicolons and leaving the ones after that as commas*, using something like:
to_write = []
counter = 0
with open("sampleCSV.csv", "r") as f:
    for line in f:
        line = list(line)
        # swap the first 7 commas for semicolons; assumes every line
        # actually contains at least 7 commas
        while counter < 7:
            line[line.index(",")] = ";"
            counter += 1
        counter = 0
        to_write.append("".join(line))
You can now read this to_write list as a Pandas object like:
data = pd.DataFrame(to_write)
data = pd.DataFrame(data[0].str.split(";").values.tolist())
or write it back out to a csv and read it with pandas using a semicolon delimiter, such as read_csv(csv_path, sep=';').
I drafted this quickly without rigorous testing, but it should give you some ideas to try. Please comment on whether it does or doesn't help, and I'll edit it.
*Another option is to delete all commas after the 7th and keep using the comma separator; a sketch of that variant follows below. Either way, the point is to differentiate the first 7 delimiters from the subsequent punctuation.
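A minimal sketch of that delete-the-later-commas variant (drop_extra_commas is a hypothetical helper name):
def drop_extra_commas(line, n=7):
    # keep the first n commas as separators and glue the remaining
    # fields back together, so the message column loses its commas
    parts = line.rstrip('\n').split(',')
    if len(parts) <= n + 1:
        return line.rstrip('\n')
    return ','.join(parts[:n] + [''.join(parts[n:])])

# drop_extra_commas('a,b,c,d,e,f,g,msg, with commas')
# -> 'a,b,c,d,e,f,g,msg with commas'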
To add to @Tblaz's answer: if you use Google Colab you can use this solution. In my case the extra comma was in column 24, so I only had to read 23 columns:
import io

import pandas as pd
from google.colab import files

uploaded = files.upload()
x_train = pd.read_csv(io.StringIO(uploaded['x_train.csv'].decode('utf-8')),
                      skiprows=1, usecols=range(23), header=None)

Reading CSV files with python (pandas) when there is HTML escaped string in there

I'm trying to read a CSV file with pandas read_csv. The data looks like this (example)
thing;weight;price;colour
apple;1;2;red
m &amp; m's;0;10;several
cherry;0,5;2;dark red
Because of the HTML-escaped ampersand, the second row would contain 5 fields according to pandas. How can I make sure that this field gets read correctly?
The example here is pretty much what my data looks like: the separator is ';', there are no string quotes, and the encoding is cp1251.
The data I receive is pretty big, and reading it must run in one step (meaning no preprocessing outside of Python).
I didn't find any reference in the pandas docs (I'm using pandas 0.19 with Python 3.5.1). Any suggestions? Thanks in advance.
Unescape the html character references:
import html
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    content = html.unescape(f.read())
    g.write(content)
print(content)
# thing;weight;price;colour
# apple;1;2;red
# m & m's;0;10;several
# cherry;0,5;2;dark red
Then load the csv in the usual way:
import pandas as pd
df = pd.read_csv('data-fixed.csv', sep=';')
print(df)
yields
     thing weight  price    colour
0    apple      1      2       red
1  m & m's      0     10   several
2   cherry    0,5      2  dark red
Although the data file is "pretty big", you appear to have enough memory to read it into a DataFrame. Therefore you should also have enough memory to read the file into a single string: f.read(). Converting the HTML with one call to html.unescape is more performant than calling html.unescape on many smaller strings. This is why I suggest using
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    content = html.unescape(f.read())
    g.write(content)
instead of something like
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w') as g:
    for line in f:
        g.write(html.unescape(line))
If you need to read this data file more than once, then it pays to fix it (and save it to disk) so you don't need to call html.unescape every time you wish to parse the data. That's why I suggest writing the unescaped contents to data-fixed.csv.
If reading this data is a one-off task and you wish to avoid the performance or resource cost of writing to disk, then you could use a StringIO (in-memory file-like object):
from io import StringIO
import html
import pandas as pd
with open('data.csv', 'r', encoding='cp1251') as f:
    content = html.unescape(f.read())
df = pd.read_csv(StringIO(content), sep=';')
print(df)
You can use a regex as separator for pandas.read_csv
In your specific case you can try:
pd.read_csv("test.csv", sep=r"(?<!&amp);")
#        thing weight price   colour
#0       apple      1     2      red
#1 m &amp; m's      0    10  several
#2      cherry    0,5     2 dark red
This selects all the ';' not preceded by '&amp'; the same idea can be extended to other escaped characters.
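One caveat: a regex separator is only supported by the python parsing engine (the C engine handles single-character separators), so it is cleaner to request that engine explicitly:
import pandas as pd

# engine='python' is required for regex separators; without it pandas
# falls back to the python engine anyway and emits a warning
df = pd.read_csv("test.csv", sep=r"(?<!&amp);", engine="python")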

What function and parameters are available in Pandas in order to open a tab delimited text file?

I have a text file as follows:
Movie_names Rating
"A" 10
"B" 6.5
The text file is tab delimited. Some movie titles are enclosed in double quotes. How can I read it into a pandas dataframe with the quotes removed from the movie names?
I tried using the following code:
import pandas as pd
data = pd.read_csv("movie.txt")
However, it says there is a Unicode decode error. What should be done?
First, you can read tab-delimited files using either read_table or read_csv. The former uses a tab delimiter by default; for the latter you need to specify it:
import pandas as pd
df = pd.read_csv('yourfile.txt', sep='\t')
Or:
import pandas as pd
df = pd.read_table('yourfile.txt')
If you are receiving encoding errors, it is because read_table doesn't understand the text encoding of the file. You can solve this by specifying the encoding directly, for example for UTF-8:
import pandas as pd
df = pd.read_table('yourfile.txt', encoding='utf8')
If your file uses a different encoding, you will need to specify that instead.
First you'll want to read the file:
import pandas as pd
df = pd.read_csv("file.csv")
Then get rid of the double quotes with:
df2 = df['columnwithquotes'].apply(lambda x: x.replace('"', ''))
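A vectorized equivalent (same result, slightly more idiomatic) uses pandas' .str accessor instead of apply:
# strip the literal double quotes from every value in the column
df2 = df['columnwithquotes'].str.replace('"', '', regex=False)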
You can use read_table, as its quotechar parameter is set to '"' by default and will therefore strip the double quotes:
import pandas as pd
from io import StringIO
the_data = """
A B C D
ABC 2016-6-9 0:00 95 "foo foo"
ABC 2016-6-10 0:00 0 "bar bar"
"""
df = pd.read_table(StringIO(the_data))
print(df)
# A B C D
# 0 ABC 2016-6-9 0:00 95 foo foo
# 1 ABC 2016-6-10 0:00 0 bar bar
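
For what it's worth, read_csv behaves the same once you pass sep='\t', since quotechar='"' is the shared default. A small self-contained sketch using the question's own sample:
import pandas as pd
from io import StringIO

data = 'Movie_names\tRating\n"A"\t10\n"B"\t6.5\n'
df = pd.read_csv(StringIO(data), sep='\t')
print(df)
#   Movie_names  Rating
# 0           A    10.0
# 1           B     6.5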
