Python pandas separate dataframe by commas, except commas inside of quotations

I have a very large csv file (2+ million rows) which is separated by commas. However, a few fields, such as the company name and a value field, can themselves contain commas inside quotation marks.
E.g:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
Note that when the value field after the datetime is >= 1000, the file adds a thousands comma and wraps the field in quotes. As a result of the above, when I try:
pd.read_csv(r'C:\example_file.csv', delimiter=",", quoting=csv.QUOTE_NONE, quotechar='"', encoding='utf-8')
It throws a pandas.errors.ParserError: Error tokenizing data. C error: Expected 24 fields in line x, saw 25
I've managed a workaround to get it into a dataframe using this:
import pandas as pd
import csv

# read everything into a single column by using a delimiter that never occurs in the file
file = pd.read_csv(r'C:\example_file.csv', delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
df = pd.DataFrame(file)
column_list = df.columns
column_string = str(column_list[0])
# column_list[0] is the single header cell containing all 24 comma-joined column names
column_string_split = column_string.split(",")
df.rename(columns={df.columns[0]: 'fill'}, inplace=True)
new_df = pd.DataFrame(df['fill'].str.split(',').tolist(), columns=column_string_split)
I understand what I've tried so far isn't optimal, and ideally I would separate the data when I read it in, but I'm at a loss on how to progress. I've been looking into regular expressions to exclude commas inside quotations (such as Python, split a string at commas, except within quotes, ignoring whitespace) but nothing has worked. Any ideas on how to str.split commas except within quotes?

Considering the .csv you gave:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
First, as explained in the comments by @Chris and @Quang Hoang, a standard pandas.read_csv already treats the commas inside the quotes as part of the field rather than as delimiters.
Second, you can pass the thousands parameter to pandas.read_csv to get rid of the comma in all the numbers > 999.
Try this:
df = pd.read_csv(r'C:\example_file.csv', encoding='utf-8', thousands=",")
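If you want to sanity-check this before running it on the 2-million-row file, here is a minimal sketch using the two sample rows above; the column names are invented for illustration since the real header wasn't shown:

import io
import pandas as pd

# made-up header; the question's file has 24 real columns
sample = (
    'product,datetime,qty,price,company\n'
    'product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"\n'
    'product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"\n'
)

df = pd.read_csv(io.StringIO(sample), thousands=",")
print(df)
print(df.dtypes)  # qty comes out as an integer column, not a string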

Related

Split column in several columns by delimiter '\' in pandas

I have a txt file which I read into a pandas dataframe. The problem is that inside this file my text data is recorded with the delimiter '\'. I need to split the information in one column into several columns, but it does not work because of this delimiter.
I found this post on stackoverflow dealing with just one string, but I don't understand how to apply it once I have a whole dataframe: Split string at delimiter '\' in python
After reading my txt file into df it looks something like this
df
column1\tcolumn2\tcolumn3
0.1\t0.2\t0.3
0.4\t0.5\t0.6
0.7\t0.8\t0.9
Basically what I am doing now is the following:
df = pd.read_fwf('my_file.txt', skiprows=8)  # skiprows because there is irrelevant text at the top
df['column1\tcolumn2\tcolumn3'] = "r'" + df['column1\tcolumn2\tcolumn3'] + "'"  # try to make it a raw string as the post suggested, but it does not really work
df['column1\tcolumn2\tcolumn3'].str.split('\\', expand=True)
and what I get is just the following (just displayed like text inside a data frame)
r'0.1\t0.2\t0.3'
r'0.4\t0.5\t0.6'
r'0.7\t0.8\t0.9'
I am not very good with regular expressions and it seems a bit hard. How can I tackle this problem?
It looks like your file is tab-delimited, because of the "\t". This may work:
pd.read_csv('file.txt', sep='\t', skiprows=8)
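One caveat, since the file itself isn't shown: if it contains the literal two-character sequence \t rather than real tab characters (which would explain why splitting on '\\' seemed relevant), a regex separator matching that sequence is a possible fallback:

import pandas as pd

# if the delimiter is the literal characters backslash + t, not a tab;
# a multi-character (regex) separator requires the python engine
df = pd.read_csv('my_file.txt', sep=r'\\t', engine='python', skiprows=8)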

Pandas won't separate columns in my comma separated .txt file

I'm aware this has been asked so many times, but it's left me really scratching my head. I have a .txt file which looks like:
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
and so on for multiple rows.
So that's 64+64 = 128 columns separated by commas, while each row is enclosed in double quotes.
I have used the commands:
#Used this initially
df = pd.read_csv('test_data.txt')
#Used this after reading more stackoverflow answers
df = pd.read_csv('test_data.txt', header = None, sep=",", delimiter=',', quotechar='"', index_col = None)
I know sep and delimiter are the same parameter, but I tried both out anyway. I shouldn't have to specify either of them, since pandas uses the comma by default.
After this I'm just using:
df.head()
And it outputs:
0
0 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
1 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
2 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
3 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
4 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
It just reads it all as one column. Please advise on how I can read all 128 columns.
This will get you to the desired result:
df = pd.read_csv('test_data.txt', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())
So this will read your file, which has each row wrapped in quote marks, and pack it into a single column.
Then you split that column on the comma and construct a new dataframe from the results.
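One follow-up worth noting: after the string split, every column holds strings, not numbers. If you need numeric values (the sample data looks numeric), a conversion along these lines should work:

import pandas as pd

df = pd.read_csv('test_data.txt', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())

# the split produces string columns; convert each column to numbers
df = df.apply(pd.to_numeric)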

reading a multi-indexed CSV in pandas with multiple delimiters

I'm trying to create a very human-readable script that will be multi-indexed. It looks like this:
A
    one : some data
    two : some other data
B
    one : foo
    three : bar
I'd like to use pandas' read_csv to automatically read this in as a multi-indexed file with both \t and : used as delimiters so that I can easily slice by section (i.e., A and B). I understand that something like header=[0,1] and perhaps tupleize_cols may be used to this end, but I can't get that far, since it doesn't seem to want to read both the tabs and colons properly. If I use sep='[\t:]', it consumes the leading tabs. If I don't use the regexp and read with sep='\t', it gets the tabs right, but doesn't handle the colons. Is this possible using read_csv? I could do it line by line, but there must be an easier way :)
This is the output I had in mind. I added labels to the indices and column, which could hopefully be applied when reading it in:
                 value
index_1 index_2
A       one      some data
        two      some other data
B       one      foo
        three    bar
EDIT: I used part of Ben.T's answer to get what I needed. I'm not in love with my solution since I'm writing to a temp file, but it does work:
with open('temp.csv', 'w') as outfile:
    for line in open(reader.filename, 'r'):
        if line[0] != '\t' or not line.strip():
            index1 = line.split('\n')[0]
        else:
            outfile.write(index1 + ':' + re.sub('[\t]+', '', line))

pd.read_csv('temp.csv', sep=':', header=None,
            names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])
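For what it's worth, the temp file can be avoided by writing to an in-memory buffer instead; a sketch under the same assumptions (reader.filename is whatever path the snippet above reads from):

import io
import re
import pandas as pd

path = 'data.txt'  # hypothetical stand-in for reader.filename

buf = io.StringIO()
with open(path, 'r') as fh:
    for line in fh:
        if line[0] != '\t' or not line.strip():
            index1 = line.split('\n')[0]   # section label, e.g. 'A' or 'B'
        else:
            buf.write(index1 + ':' + re.sub('[\t]+', '', line))

buf.seek(0)
df = (pd.read_csv(buf, sep=':', header=None,
                  names=['index_1', 'index_2', 'Value'])
        .set_index(['index_1', 'index_2']))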
You can use two delimiters in read_csv such as:
pd.read_csv( path_file, sep=':|\t', engine='python')
Note the engine='python' to prevent a warning.
EDIT: with your input format it seems difficult, but with input like:
A one : some data
A two : some other data
B one : foo
B three : bar
with a \t as delimiter after A or B, then you get a multiindex by:
pd.read_csv(path_file, sep=':|\t', header=None, engine='python',
            names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])
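If stray spaces around the colon end up inside the values (e.g. 'one ' rather than 'one'), a regex separator that also consumes the surrounding whitespace may help; this is a sketch of the same idea, not tested against the exact file:

import pandas as pd

# '\s*:\s*' swallows spaces around the colon; '\t' still splits the label
df = pd.read_csv(path_file, sep=r'\s*:\s*|\t', header=None, engine='python',
                 names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])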

pandas.read_csv not partitioning data at semicolon delimiter

I'm having a tough time correctly loading a csv file into a pandas dataframe. The file is a csv saved in MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success. I have also tried read_table.)
What I want is for each semicolon-separated value to be placed in a separate column in my dataframe (I understand that read_csv works this way?).
Unfortunately, I always end up with the whole row being placed in the first column of the dataframe. So basically after loading I have many rows, but only one column (two if I also count the index).
I have placed a sample here:
datafile
Any ideas welcome.
Add quoting=3; 3 stands for QUOTE_NONE (see the csv module documentation).
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give [7 rows x 23 columns] dataframe
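Equivalently, and arguably clearer, the named constant from the csv module can be passed instead of the magic number 3 (filep as defined in the question):

import csv
import pandas as pd

# csv.QUOTE_NONE == 3: treat quote characters as ordinary data
raw_data = pd.read_csv(filep, engine="python", index_col=False,
                       header=None, delimiter=";", quoting=csv.QUOTE_NONE)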
The problem is the enclosing characters, which can be ignored by escaping the delimiter with the \ character.
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter='\;')

python pandas read_csv unable to read character double quoted twice

I'm trying to read a 2-column csv file (error.csv) with a semicolon separator which contains double-quoted semicolons:
col1;col2
2016-04-17_22:34:25.126;"Linux; Android"
2016-04-17_22:34:25.260;"{"g":2}iPhone; iPhone"
And I'm trying:
logs = pd.read_csv('error.csv', na_values="null", sep=';',
                   quotechar='"', quoting=0)
I understand that the problem comes from having a double-quoted "g" inside my double quotes in line 3, but I can't figure out how to deal with it. Any ideas?
You will probably need to pre-process the data so that it conforms to the expected CSV format. I doubt pandas will handle this just by changing a parameter or two.
If there are only two columns, and the first never contains a semi-colon, then you could split the lines on the first semi-colon:
import pandas

records = []
with open('error.csv', 'r') as fh:
    # first row is a header
    header = next(fh).strip().split(';')
    for rec in fh:
        # split only on the first semi-colon
        date, dat = rec.strip().split(';', maxsplit=1)
        # assemble records, removing quotes from the second column
        records.append((date, dat.strip('"')))

# create a data frame
df = pandas.DataFrame.from_records(records, columns=header)
You will have to manually parse the dates yourself with the datetime module if you want the first column to contain proper dates and not strings.
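Alternatively, pandas can parse them directly with pd.to_datetime; a sketch assuming df from the block above and the underscore-separated timestamp format shown in the sample (col1 is the header name from the question's file):

import pandas as pd

# '2016-04-17_22:34:25.126' -> datetime64 column
df['col1'] = pd.to_datetime(df['col1'], format='%Y-%m-%d_%H:%M:%S.%f')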
