Pandas Read CSV with string delimiters via regex - python

I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:
LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13
LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08
First I tried the following:
df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])
This seemed to work fine; however, it got messed up when it hit the second example line above, where there is no whitespace after the LOADEFFECT string. I got a result like:
632 1 2 1 200 10 3.474 5075. 3643. 0.13
633 1 3 1 200 10 3.515 LMAX= COV= NaN
Then I decided to use a regular expression to define my delimiters. After many trial and error runs (I am no expert in regex), I managed to get close with the following line:
df = pd.read_csv('beta.txt', header=None, sep='\s+|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')
This almost works, but creates a NaN column for some reason at the very beginning:
632 NaN 1 2 1 200 10 3.474 5075 3643 0.13
633 NaN 1 3 1 200 10 3.515 10009 9732 0.08
At this point I think I can just delete that first column, and get away with it. However I wonder what would be the correct way to set up the regex to correctly parse this text file in one shot. Any ideas? Other than that, I am sure there is a smarter way to parse this text file. I would be glad to hear your recommendations.
Thanks!

import re
import pandas as pd
import csv

csvfile = open("parsing.txt")  # open text file
reader = csv.reader(csvfile)
new_list = []
for line in reader:
    for i in line:
        new_list.append(re.findall(r'(\d*\.\d+|\d+)', i))
table = pd.DataFrame(new_list)
table  # output will be a pandas DataFrame with the numeric values
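As for why the regex attempt produces a NaN column at the very beginning: every line starts with one of the delimiters ("LOADED LANE"), so the first split field is empty. A minimal sketch of a one-shot parse, assuming the file layout shown above (the column names at the end are made up for illustration):

import pandas as pd

labels = r'\s*(?:LOADED LANE|MAT\. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=)\s*'
df = pd.read_csv('beta.txt', header=None, sep=labels, engine='python')
df = df.drop(columns=0)  # empty field produced by the leading "LOADED LANE"
df.columns = ['lane', 'mat_type', 'leffect', 'span', 'space',
              'beta', 'loadeffect', 'lmax', 'cov']

The non-capturing group keeps the python engine's re.split from returning the matched labels as extra columns, and the surrounding \s* absorbs stray spaces so that "LOADEFFECT 5075." and "LOADEFFECT10009." both come out as plain numbers.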

Related

Python automatically converting specific data to date format and losing data

I'm cleaning a csv file with pandas, mainly removing special characters such as '/', '#', etc. The file has 7 columns (none of which are dates).
In some of the columns, there's just numerical data such as '11/6/1980'.
I've noticed that directly after reading the csv file,
df = pd.read_csv ('report16.csv', encoding ='ANSI')
this data becomes '11/6/80', after cleaning it becomes '11 6 80' (it's the same result in the output file). So wherever the data has ' / ', it's being interpreted as a date and python is eliminating the first 2 digits from the data.
Data        Expected result   Actual Result
11/6/1980   11 6 1980         11 6 80
12/8/1983   12 8 1983         12 8 83
Both of the above results are wrong because in the Actual Result column, I'm losing 2 digits towards the end.
The data looks like this
Org Name   Code        Code copy
ABC        11/6/1980   11/6/1980
DEF        12/8/1983   12/8/1983
GH         11/5/1987   11/5/1987
OrgName, Code, Code copy
ABC, 11/6/1980, 11/6/1980
DEF, 12/8/1983, 12/8/1983
GH, 11/5/1987, 11/5/1987
KC, 9000494, 9000494
It's worth mentioning that the column contains other data such as '900490', strings, etc but in these instances, there aren't any problems.
What could be done to not allow this conversion?
Not an answer, but comments do not allow well-presented code and data.
Here is what I call a minimal reproducible example:
Content of the sample.csv file:
Data,Expected result,Actual Result
11/6/1980,11 6 1980,11 6 80
12/8/1983,12 8 1983,12 8 83
Code:
df = pd.read_csv('sample.csv')
print(df)
s = df['Data'].str.replace('/', ' ')
print((df['Expected result'] == s).all())
It gives:
Data Expected result Actual Result
0 11/6/1980 11 6 1980 11 6 80
1 12/8/1983 12 8 1983 12 8 83
True
This proves that read_csv has correctly read the file and has not changed anything.
PLEASE SHOW THE CONTENT OF YOUR CSV FILE AS TEXT, along with enough code to reproduce your problem.
How about trying a string operation? First select the column that you would like to modify and replace "/" or "#" with whitespace: column.str.replace("/", " "). I hope this works!
The behavior of converting dates is not strictly a Python issue; you are using pandas read_csv.
Try to declare the separator explicitly. If sep is not declared, read_csv makes guesses.
df = pd.read_csv ('report16.csv', encoding ='ANSI', sep =',')
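If the conversion really does happen inside read_csv, one way to rule it out entirely is to force every column to be read as a plain string and do the cleanup afterwards. A minimal sketch (file name, encoding and column name are taken from the question; the regex cleanup is an assumption about which characters need removing):

import pandas as pd

df = pd.read_csv('report16.csv', encoding='ANSI', sep=',', dtype=str)
df['Code'] = df['Code'].str.replace(r'[/#]', ' ', regex=True)  # strip '/' and '#'

With dtype=str nothing is reinterpreted as a date or a number, so '11/6/1980' stays intact until you explicitly transform it.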

Formatting issues using Regex and Pandas

I don't exactly know how to describe the issue I'm having, so I'll just show it.
I have 2 data tables, and I'm using regex to search through and extract values in those tables based on if it matches with the correct word. I'll put the whole script for reference.
import re
import os
import pandas as pd
import numpy as np

os.chdir('C:/Users/Sams PC/Desktop')
f = open('test5.txt', 'w')

NHSQC = pd.read_csv('NHSQC.txt', sep='\s+', header=None)
NHSQC.columns = ['Column_1', 'Column_2', 'Column_3']
HNCA = pd.read_csv('HNCA.txt', sep='\s+', header=None)
HNCA.columns = ['Column_1', 'Column_2', 'Column_3', 'Column_4']

x = re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]', str(NHSQC))
y = re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]-[H][N]', str(HNCA))
print(NHSQC)
print(HNCA)
print(x)
print(y)

data = []
label = []
for i in range(0, 6):
    if x[i] in str(NHSQC):
        data2 = NHSQC.set_index('Column_1', drop=False)
        data3 = (data2.loc[str(x[i]), 'Column_2':'Column_3'])
        data.extend(list(data3))
        a = [x[i]]
        label.extend(a)
        label.extend(a)
        if y[i] in str(HNCA):
            data2 = HNCA.set_index('Column_1', drop=False)
            data3 = (data2.loc[str(y[i]), 'Column_3'])
            data.append(data3)
            a = [y[i]]
            label.extend(a)
        else:
            print('Not Found')
    else:
        print('Not Found')

data6 = [label, data]
matrix = data6
data5 = np.transpose(matrix)
print(data5)
f.write(str(data5))
f.close()
This script does exactly what I want it to do, and it works as intended when I run my test data files, but it fails when I run my actual data files. I don't know how to explain the issue, so I'll just show it. This is the output:
Column_1 Column_2 Column_3
0 S31N-HN 114.424 7.390
1 Y32N-HN 121.981 7.468
2 Q33N-HN 120.740 8.578
3 A34N-HN 118.317 7.561
4 G35N-HN 106.764 7.870
.. ... ... ...
89 R170N-HN 118.078 7.992
90 S171N-HN 110.960 7.930
91 R172N-HN 119.112 7.268
92 999_XN-HN 116.703 8.096
93 1000_XN-HN 117.530 8.040
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
.. ... ... ... ...
173 R170N-CA-HN 118.016 60.302 7.999
174 S171N-R170CA-S171HN 110.960 60.239 7.932
175 S171N-CA-HN 110.960 60.946 7.931
176 R172N-S171CA-R172HN 119.112 60.895 7.264
177 R172N-CA-HN 119.112 55.093 7.265
[178 rows x 4 columns]
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN']
Traceback (most recent call last):
File "test.py", line 29, in <module>
if y[i] in str(HNCA):
IndexError: list index out of range
As you can see, there is an issue because my regex for y isn't finding all the values. Furthermore, there is an issue with how many matches my x regex is finding (only 5 instead of the hundreds it should find). Initially I thought this was just a display thing (it wasn't showing the hundreds of matches since that would take too long), and I also thought the ... in the middle of the printed table was just for display purposes. However, if I copy part of my HNCA.txt data and save it as a separate file, the issue goes away.
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
5 Y32N-S31CA-Y32HN 121.981 61.674 7.467
6 Y32N-CA-HN 121.981 60.789 7.469
7 Q33N-Y32CA-Q33HN 120.770 60.775 8.582
8 Q33N-CA-HN 120.701 58.706 8.585
9 A34N-Q33CA-A34HN 118.317 58.740 7.559
10 A34N-CA-HN 118.317 52.260 7.565
11 G35N-A34CA-G35HN 106.764 52.195 7.868
12 G35N-CA-HN 106.764 46.507 7.868
13 R36N-G35CA-R36HN 117.833 46.414 8.111
14 R36N-CA-HN 117.833 54.858 8.112
15 G37N-R36CA-G37HN 110.365 54.808 8.482
16 G37N-CA-HN 110.365 44.901 8.484
17 I55N-CA-HN 118.132 65.360 7.935
18 Y56N-I55CA-Y56HN 123.025 65.464 8.088
19 Y56N-CA-HN 123.025 62.195 8.082
20 A57N-Y56CA-A57HN 120.470 62.159 7.978
21 A57N-CA-HN 120.447 55.522 7.980
22 S72N-K71CA-S72HN 117.239 55.390 8.368
23 S72N-CA-HN 117.259 58.583 8.362
24 C73N-S72CA-C73HN 128.142 58.569 9.690
25 C73N-CA-HN 128.142 61.410 9.677
26 G74N-C73CA-G74HN 116.187 61.439 9.439
27 G74N-CA-HN 116.194 46.528 9.437
28 H75N-G74CA-H75HN 122.640 46.307 9.642
29 H75N-CA-HN 122.621 56.784 9.644
30 C76N-H75CA-C76HN 122.775 56.741 7.152
31 C76N-CA-HN 122.738 57.527 7.146
32 R77N-C76CA-R77HN 120.104 57.532 8.724
33 R77N-CA-HN 120.135 59.674 8.731
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN', 'Y32N-CA-HN', 'Q33N-CA-HN', 'A34N-CA-HN', 'G35N-CA-HN', 'R36N-CA-HN', 'G37N-CA-HN', 'I55N-CA-HN', 'Y56N-CA-HN', 'A57N-CA-HN', 'S72N-CA-HN', 'C73N-CA-HN', 'G74N-CA-HN', 'H75N-CA-HN', 'C76N-CA-HN', 'R77N-CA-HN']
[['S31N-HN' '114.42399999999999']
I won't post the whole output, but as you can see, it now finds all the proper matches. It's also now displaying the entire table, instead of printing ... and only showing the top and bottom halves. I don't exactly understand where this issue is coming from, though. Why does it display only the top and bottom half of my table, yet show the entire thing if I copy and paste the data to another file? And why doesn't the regex search the entire table even when it isn't all displayed? The fact that it shows the top and bottom halves makes me think the entire table is there and only the display is simplified, but why would what is displayed affect what the regex searches?
Why is python only displaying the top and bottom portions of your table?
Python classes can define two "magic" methods:
__repr__(), which is supposed to produce a "representation" of the object as a string, and which has a pretty useless default implementation for most objects; and
__str__(), which is supposed to produce a readable "string" of the object, and which falls back to __repr__().
When the line x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC)) is run, that last str(NHSQC) bit tells Python to call NHSQC.__str__(), which falls back to NHSQC.__repr__(), which you can read about here.
The developers of the pandas library implemented DataFrame.__repr__() in such a way that, depending on the values of certain global display options, it produces a string that does not fully represent the underlying data. The defaults truncate the DataFrame to show only the first 5 and last 5 rows, with ellipses (...) telling you that there are bits missing. Thus, as you suspected, you are only calling re.findall on the first 5 and last 5 rows of the DataFrame.
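A small illustration of this (not from the original post; the 100-row frame is just a toy example):

import pandas as pd

df = pd.DataFrame({'a': range(100)})
print(str(df).count('\n'))  # truncated: only head and tail rows plus '...'

# Raising the display limit makes str(df) include every row, which is why the
# original re.findall approach only "sees" what the display options allow.
with pd.option_context('display.max_rows', None):
    print(str(df).count('\n'))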
What should you do instead?
Using str(NHSQC) is probably not what you intend to do. This converts the entire DataFrame into an (incomplete) string representation, then runs the regular expression search over that entire string. That's extremely inefficient, so why not use the Series.str methods instead?
For instance, you appear to be lining up Column_2 and Column_3 of rows from DataFrame NHSQC where the value of Column_1 matches the first regex in order with Column_3 of rows from DataFrame HNCA where the value of Column_1 matches the second regex, right?
df1 = NHSQC.loc[NHSQC["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-HN"))]
df2 = HNCA.loc[HNCA["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-CA-HN")), ["Column_1", "Column_3"]]
Those lines will select the requisite rows and columns from the two DataFrames using Series.str.match on Column_1.
long1 = df1.melt(id_vars=["Column_1"]).drop("variable", axis="columns")
long2 = df2.rename(columns={"Column_3": "value"})
The first line uses DataFrame.melt to turn the three columns of df1 into a "longer" version with columns Column_1 as an identifier, variable as either the strings "Column_2" or "Column_3", and value, containing the thing you actually care about and are printing at the end of your program. You don't use the column name anymore, so it is dropped. The DataFrame df2 doesn't need to be converted to a longer format because it only has two columns, so we just rename Column_3 to value.
extra_long = pd.concat([long1, long2])
print(extra_long.to_numpy())
This just concatenates the two long DataFrames together, turns them into a numpy array, then prints them out.
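Put together, a minimal end-to-end sketch of that approach (assuming the two input files and column names from the question) might look like:

import pandas as pd

NHSQC = pd.read_csv('NHSQC.txt', sep=r'\s+', header=None,
                    names=['Column_1', 'Column_2', 'Column_3'])
HNCA = pd.read_csv('HNCA.txt', sep=r'\s+', header=None,
                   names=['Column_1', 'Column_2', 'Column_3', 'Column_4'])

# Keep only rows whose Column_1 matches each pattern
df1 = NHSQC.loc[NHSQC['Column_1'].str.match(r'[A-Z][0-9][0-9][A-Z]-HN')]
df2 = HNCA.loc[HNCA['Column_1'].str.match(r'[A-Z][0-9][0-9][A-Z]-CA-HN'),
               ['Column_1', 'Column_3']]

# Reshape to long form, stack the two results, and print as an array
long1 = df1.melt(id_vars=['Column_1']).drop('variable', axis='columns')
long2 = df2.rename(columns={'Column_3': 'value'})
print(pd.concat([long1, long2]).to_numpy())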

Delete or remove headers from text files being read in

I am trying to remove or delete headers of data I am reading in using pandas. One file has a header and the other doesn't but I want to be able to check for headers and then remove it.
So far, I have tried using header=None in the read_csv function
from pathlib import Path
import pandas as pd
def _reader(fname):
    return pd.read_csv(fname, sep="\t", header=None)

folder = Path("C:\\Me\\Project1")
data = pd.concat([
    _reader(txt)
    for txt in folder.glob("*.txt")
])
I get the following error:
TypeError: must be str, not int
My two files look like this:
File1.txt
ISIN AVL_QTY
BAD 90000
AAB 8550000
BAD 173688
BAD 360000
BAD 90000
BAD 810000
BAD 900000
BAD 900000
File2.txt
TEST 543
HELLO 555
STOCK 900
CODE 785
First, you need a check whether the first line is a header. E.g. you can check whether any of the first row's entries begins with a number, as this would not be typical for a column header.
In fact, without knowing your thousands of files, header detection is ultimately guesswork - but that's not really the point of your code.
To make use of header detection, go with a normal loop instead of a list comprehension, so that in each iteration you can: 1. check for a header, 2. read the file and append the data to a DataFrame:
df = pd.DataFrame()
for f in folder.glob("*.txt"):
    with open(f) as fin:
        chk_lst = next(fin).split()
    is_h = not any(v[0].isdecimal() for v in chk_lst)
    df = pd.concat([df, pd.read_csv(f, sep='\s+', header=(None, 0)[is_h])], axis=1)
# ISIN AVL_QTY 0 1
# 0 BAD 90000 TEST 543.775
# 1 AAB 8550000 HELLO 555.000
# 2 BAD 173688 STOCK 900.000
# 3 BAD 360000 CODE 785.000
# 4 BAD 90000 NaN NaN
# 5 BAD 810000 NaN NaN
# 6 BAD 900000 NaN NaN
# 7 BAD 900000 NaN NaN
Edit:
For concatenating row wise, you can use
df = pd.concat([df, pd.read_csv(f, sep='\s+', header=None, skiprows=(0, 1)[is_h])], axis=0, ignore_index=True)
# 0 1
# 0 BAD 90000
# 1 AAB 8550000
# 2 BAD 173688
# 3 BAD 360000
# 4 BAD 90000
# 5 BAD 810000
# 6 BAD 900000
# 7 BAD 900000
# 8 TEST 543
# 9 HELLO 555
# 10 STOCK 900
# 11 CODE 785
File2.txt does not have a header, right? But in _reader you set header to None.
Add a header to File2.txt and see what happens.
There are a couple ways to check if a csv file has a header
using the csv library
import csv

with open('example.csv', 'r') as csvfile:
    sniffer = csv.Sniffer()
    has_header = sniffer.has_header(csvfile.read(2048))
    csvfile.seek(0)
    # ...
my source
or if you know your data, checking if there are any digits in the first row
is_header = not any(cell.isdigit() for cell in csv_table[0])
my source
or with pandas itself, if you know what the header might be called
df = (pd.read_csv(filename, header=None, names=cols)
      [lambda x: np.ones(len(x)).astype(bool)
       if (x.iloc[0] != cols).all()
       else np.concatenate([[False], np.ones(len(x)-1).astype(bool)])]
      )
my source
and of course if you want to preprocess the files with the command line first, it'll probably be faster....
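Combining the digit heuristic with the skiprows idea from the edit above, a minimal sketch that keeps the question's list comprehension (assuming whitespace-separated .txt files as shown) could be:

from pathlib import Path
import pandas as pd

def _reader(fname):
    # Peek at the first line; if no field starts with a digit, treat it as a header.
    with open(fname) as fh:
        first_row = fh.readline().split()
    has_header = not any(tok[0].isdigit() for tok in first_row)
    return pd.read_csv(fname, sep=r'\s+', header=None,
                       skiprows=1 if has_header else 0)

folder = Path("C:\\Me\\Project1")
data = pd.concat([_reader(txt) for txt in folder.glob("*.txt")],
                 ignore_index=True)

Every file then comes back with the same positional columns (0, 1), so the frames concatenate row-wise without the NaN padding shown earlier.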

Ignoring bad rows of data in pandas.read_csv() that break header= keyword

I have a series of very messy *.csv files that are being read in by pandas. An example csv is:
Instrument 35392
"Log File Name : station"
"Setup Date (MMDDYY) : 031114"
"Setup Time (HHMMSS) : 073648"
"Starting Date (MMDDYY) : 031114"
"Starting Time (HHMMSS) : 090000"
"Stopping Date (MMDDYY) : 031115"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000200"
"Circltr warmup (HHMMSS) : 000200"
"Date","Time","","Temp","","SpCond","","Sal","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts",""
"Random message here 031114 073721 to 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"**
I have been using this code to import the *.csv file, process the double headers, pull out the empty columns, and then strip the offending rows with bad data:
DF = pd.read_csv(BADFILE, parse_dates={'Datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]
DF.head()
Datetime_(ascii) (Temp, øC) (SpCond, mS/cm) (Sal, ppt) (IBatt, Volts)
0 03/11/14 09:00:00 15.85 1.408 0.74 6.2
1 03/11/14 10:00:00 15.99 1.960 1.05 6.3
2 03/11/14 11:00:00 14.20 40.800 26.12 6.2
3 03/11/14 12:00:01 14.20 41.700 26.77 6.2
4 03/11/14 13:00:00 14.50 41.300 26.52 6.2
This was working fine and dandy until I hit a file that has an erroneous one-row line after the header: "Random message here 031114 073721 to 031114 083200"
The error I receive is:
C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554         data, names = _process_date_conversion(
   1555             data, self._date_conv, self.parse_dates, self.index_col,
-> 1556             self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558         return names, data

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975     if not keep_date_col:
   2976         for c in list(date_cols):
-> 2977             data_dict.pop(c)
   2978             new_cols.remove(c)
   2979

KeyError: ('Time', 'HHMMSS')
If I remove that line, the code works fine. Similarly, if I remove the header= line, the code works fine. However, I want to be able to preserve this because I am reading in hundreds of these files.
Difficulty: I would prefer not to open each file before the call to pandas.read_csv(), as these files can be rather large - thus I don't want to read and save multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve opening the file first as a StringIO buffer to remove offending lines.
Here's one approach, making use of the fact that skiprows accepts a callable. The function receives only the row index being considered, which is a built-in limitation of that parameter.
As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding row to see whether its contents match.
The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up to the row index it's currently evaluating. It also assumes that the bad line always begins with the same string (in the example case, "foo"), but that seems a safe assumption given the OP's description.
# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""
import pandas as pd

def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if row index matches problem line in file
    # for efficiency, quit after we pass row index in file
    f = open(fn, "r")
    data = f.read()
    for i, line in enumerate(data.splitlines()):
        if (i == r) & line.startswith(fail_on):
            return True
        elif i > r:
            break
    return False

fname = "foo.csv"
fail_str = "foo"
known_skip = [2]
pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))
# output
uid a b c
0 0 1 2 3
1 1 11 22 33
2 2 111 222 333
If you know exactly which line the random message will appear on when it does appear, then this will be much faster, as you can just tell it not to inspect the file contents for any index past the potential offending line.
After some tinkering yesterday I found a solution and what the potential issue may be.
I tried the skip_test() function answer above, but I was still getting errors with the size of the table:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11
So after playing around with skiprows= I discovered that I was just not getting the behavior I wanted when using the engine='c'. read_csv() was still determining the size of the file from those first few rows, and some of those single column rows were still being passed. It may be that I have a few more bad single column rows in my csv set that I did not plan on.
Instead, I create an arbitrary sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows.
For example, I know that the largest table that I will encounter with my data will be 10 rows long. So my call to pandas is:
DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0, 10)))
I then use these two lines to drop the NaN rows and columns from the DataFrame:
#drop the null columns created by double deliminators
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2) # drop if we don't have at least 2 cells with real values
If anyone comes across this question in the future: pandas has now implemented the on_bad_lines argument, so you can solve this problem by passing on_bad_lines="skip".
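A minimal sketch of that newer option applied to the original call (BADFILE as in the question; it assumes pandas 1.3 or later, and exactly which malformed lines get dropped still depends on the file's layout):

DF = pd.read_csv(BADFILE, sep=',', header=[10, 11], skiprows=[10],
                 na_values=['', 'na', 'nan nan'], encoding='cp1252',
                 on_bad_lines='skip')  # silently drop rows the tokenizer cannot fit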

Change CSV before importing it to pandas

I have an issue with a CSV files I am trying to import in pandas. The structure of the file is as follow:
first character of the file is a single quote;
last character of the file is a single quote;
every line of the CSV starts with a double quote and ends with a double quote followed by \n.
So I have issues importing it with pandas.read_csv. Ideally I would like pandas to just ignore the single and double quotes when importing (not taking them into account for the structure of the data frame, and not importing them as characters).
I do not really know whether I should manipulate the CSV file before using pandas.read_csv, or whether there is an option for simply ignoring these characters.
The pd.read_csv method's first argument is either a file name or a stream.
You can read the file manually and manipulate the stream before handing it to pandas.
from io import StringIO
import pandas as pd

sio = StringIO("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 $$$Theawsomestuff$$$### 166.0
Thus, by subclassing StringIO, you can change the behavior of the read method:
class StreamChanger(StringIO):
    def read(self, **kwargs):
        data = super().read(**kwargs)
        data = data.replace("$", "")
        data = data.replace("#", "")
        return data

sio = StreamChanger("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 Theawsomestuff 166.0
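Applied to the file described in the question, the same read-then-clean idea can also skip the subclass. A minimal sketch ('data.csv' is a placeholder name; it assumes the stray quote characters carry no meaning anywhere in the data):

from io import StringIO
import pandas as pd

with open('data.csv') as fh:
    cleaned = fh.read().replace("'", "").replace('"', '')  # drop all quote characters
df = pd.read_csv(StringIO(cleaned))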
Use the parameter quoting and pass the value 3 (csv.QUOTE_NONE) to read_csv. Once you have the dataframe created, you should take care of the quotes in the data and headers.
import pandas as pd

df = pd.read_csv('check.txt', doublequote=True, delimiter=',', quoting=3)
df = df.replace({'"': '', '\'': ''}, regex=True)
df.columns = ['Id1', 'StartTime', 'start_lat', 'start_long', 'StartGeohash']
print(df)
Sample File
'Id1,StartTime,start_lat,start_long,StartGeohash
"113,2016-11-01 10:50:28.063,-33.139507,-100.226715,9vbsx2"
"113,2016-11-02 10:49:24.063,-33.139507,-100.226715,9vbsx2"
"115,2016-11-03 10:55:20.063,-36.197660,-101.186050,9y2jcm"'
output
Id1 StartTime start_lat start_long StartGeohash
0 113 2016-11-01 10:50:28.063 -33.139507 -100.226715 9vbsx2
1 113 2016-11-02 10:49:24.063 -33.139507 -100.226715 9vbsx2
2 115 2016-11-03 10:55:20.063 -36.197660 -101.186050 9y2jcm
