I'm trying to read a 2-column CSV file (error.csv) with a semicolon separator which contains double-quoted semicolons:
col1;col2
2016-04-17_22:34:25.126;"Linux; Android"
2016-04-17_22:34:25.260;"{"g":2}iPhone; iPhone"
And I'm trying:
logs = pd.read_csv('error.csv', na_values="null", sep=';',
quotechar='"', quoting=0)
I understand that the problem comes from having a double-quoted "g" inside my double quotes in line 3, but I can't figure out how to deal with it. Any ideas?
You will probably need to pre-process the data so that it conforms to the expected CSV format. I doubt pandas will handle this just by changing a parameter or two.
If there are only two columns, and the first never contains a semi-colon, then you could split the lines on the first semi-colon:
import pandas

records = []
with open('error.csv', 'r') as fh:
    # first row is a header
    header = next(fh).strip().split(';')
    for rec in fh:
        # split only on the first semi-colon
        date, dat = rec.strip().split(';', maxsplit=1)
        # assemble records, removing quotes from the second column
        records.append((date, dat.strip('"')))

# create a data frame
df = pandas.DataFrame.from_records(records, columns=header)
You will have to manually parse the dates yourself with the datetime module if you want the first column to contain proper dates and not strings.
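For example, pd.to_datetime with an explicit format string (matching the timestamp shape shown in the question) handles that conversion; the format below is an assumption based on the sample rows:

```python
import pandas as pd

# hypothetical column matching the timestamps in the question
dates = pd.Series(["2016-04-17_22:34:25.126", "2016-04-17_22:34:25.260"])
# %f parses the fractional seconds after the dot
parsed = pd.to_datetime(dates, format="%Y-%m-%d_%H:%M:%S.%f")
```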
Related
I have a very large csv file (2+ million rows) which is separated by commas. Except there are a few entries such as Company Name and a value field where fields can contain commas inside quotations.
E.g:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
Note that when the value field after the datetime is >= 1000 it adds the comma and quotations. As a result of the above when I try:
pd.read_csv(r'C:\example_file.csv', delimiter=",", quoting=csv.QUOTE_NONE, quotechar='"', encoding='utf-8')
It throws a pandas.errors.ParserError: Error tokenizing data. C error: Expected 24 fields in line x, saw 25
I've managed a workaround to get it into a dataframe using this:
import pandas as pd
import csv

file = pd.read_csv(r'C:\example_file.csv', delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
df = pd.DataFrame(file)
column_list = df.columns
column_string = str(column_list[0])
# column_list[0] returns the single header string holding all 24 column names
column_string_split = column_string.split(",")
df.rename(columns={df.columns[0]: 'fill'}, inplace=True)
new_df = pd.DataFrame(df['fill'].str.split(',').tolist(), columns=column_string_split)
I understand what I've tried so far isn't optimal, and ideally I would separate the data as I read it in, but I'm at a loss on how to progress. I've been looking into regular expressions to exclude commas inside quotations (such as Python, split a string at commas, except within quotes, ignoring whitespace) but nothing has worked. Any ideas on how to str.split on commas except within quotes?
Considering the .csv you gave:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
First, as explained in the comments by @Chris and @Quang Hoang, a standard pandas.read_csv will ignore the comma inside the quotes.
Second, you can pass the thousands parameter to pandas.read_csv to get rid of the comma for all the numbers > 999.
Try this:
df = pd.read_csv(r'C:\example_file.csv', encoding='utf-8', thousands=",")
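A quick sanity check with the two sample rows inlined via io.StringIO (header=None since the sample has no header row) shows both behaviors: the quoted "Company, A" stays in one column, and "1,000" is parsed as the number 1000:

```python
import io
import pandas as pd

# the two sample rows from the question, inlined for a quick check
csv_text = (
    'product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"\n'
    'product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"\n'
)
# thousands="," strips the separator inside quoted numbers like "1,000"
df = pd.read_csv(io.StringIO(csv_text), header=None, thousands=",")
```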
I have a txt file which I read into a pandas dataframe. The problem is that inside this file my text data is recorded with the delimiter '\'. I need to split the information in one column into several columns, but it does not work because of this delimiter.
I found this post on stackoverflow just with one string, but I don't understand how to apply it once I have a whole dataframe: Split string at delimiter '\' in python
After reading my txt file into df it looks something like this
df
column1\tcolumn2\tcolumn3
0.1\t0.2\t0.3
0.4\t0.5\t0.6
0.7\t0.8\t0.9
Basically what I am doing now is the following:
df = pd.read_fwf('my_file.txt', skiprows = 8) #I use skip rows because there is irrelevant text
df['column1\tcolumn2\tcolumn3'] = "r'" + df['column1\tcolumn2\tcolumn3'] + "'"  # I try to make it a raw string as the post suggested, but it does not really work
df['column1\tcolumn2\tcolumn3'].str.split('\\',expand=True)
and what I get is just the following (just displayed like text inside a data frame)
r'0.1\t0.2\t0.3'
r'0.4\t0.5\t0.6'
r'0.7\t0.8\t0.9'
I am not very good with regular expressions and it seems a bit hard; how can I tackle this problem?
It looks like your file is tab-delimited, because of the "\t". This may work:
pd.read_csv('file.txt', sep='\t', skiprows=8)
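If the file turns out to contain the literal two-character sequence \t rather than real tab characters (which would explain why read_fwf displayed them as text), a regex separator with the python engine handles that case; a sketch with the data inlined:

```python
import io
import pandas as pd

# simulate a file holding literal backslash-t sequences instead of real tabs
text = "column1\\tcolumn2\\tcolumn3\n0.1\\t0.2\\t0.3\n0.4\\t0.5\\t0.6\n"
# a multi-character sep is treated as a regex by the python engine;
# r"\\t" matches a literal backslash followed by the letter t
df = pd.read_csv(io.StringIO(text), sep=r"\\t", engine="python")
```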
I am trying to load a semicolon-separated txt file and there are a few instances where escape characters are in the data. These are typically &lt ; (space added so it isn't converted to <), which adds a semicolon. This obviously messes up my data and, since dtypes are important, causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?
I tried deleting the char from the file and it works now, but given that I want an automated process on millions of rows this is not sustainable.
df = pd.read_csv(file_loc.csv,
header=None,
names=column_names,
usecols=counters,
dtype=dtypes,
delimiter=';',
low_memory=False)
ValueError: could not convert string to float:
As my first column is a string and the second is a float, but if the first is split at the semicolon of the &lt; entity, its tail spills into the 2nd column too.
Is there a way to tell pandas to ignore these or efficiently remove before loading?
Given the following example csv file so57732330.csv:
col1;col2
1&lt;2;a
3;
we read it using StringIO after unescaping named and numeric html5 character references:
import pandas as pd
import io
import html
with open('so57732330.csv') as f:
    s = f.read()

f = io.StringIO(html.unescape(s))
df = pd.read_csv(f, sep=';')
Result:
col1 col2
0 1<2 a
1 3 NaN
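For reference, here is the same approach with the sample file contents inlined via io.StringIO, so it can be run without the file on disk:

```python
import html
import io
import pandas as pd

# the sample file contents, inlined (note the &lt; character reference)
raw = "col1;col2\n1&lt;2;a\n3;\n"
# html.unescape turns &lt; back into < before pandas parses the text
df = pd.read_csv(io.StringIO(html.unescape(raw)), sep=";")
```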
I'm aware this has been asked so many times, but it's left me really scratching my head. I have a .txt file which looks like:
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
and so on for multiple rows.
So that's 64+64 = 128 columns separated by commas, while each row is enclosed in double quotes.
I have used the commands:
#Used this initially
df = pd.read_csv('test_data.txt')
#Used this after reading more stackoverflow answers
df = pd.read_csv('test_data.txt', header = None, sep=",", delimiter=',', quotechar='"', index_col = None)
I know sep and delimiter are the same parameter, but I tried both out anyway. I shouldn't even have to specify them, because pandas uses commas by default.
After this I'm just using:
df.head()
And it outputs:
0
0 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
1 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
2 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
3 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
4 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
It just reads it all as the one column, please advise on how I can read all 128 columns.
This will get you to the desired result:
df = pd.read_csv('test_data.txt', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())
This reads your file, which has each row wrapped in quote marks, packing everything into a single column.
Then you split that column on commas and construct a new dataframe from the result.
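An equivalent sketch using str.split(..., expand=True), with a numeric cast added at the end (this assumes every value is numeric, as in the question's data); the rows here are short stand-ins for the 128-value lines in test_data.txt:

```python
import io
import pandas as pd

# two short quoted rows standing in for the rows of test_data.txt
text = '"0,1,2,3"\n"4,5,6,7"\n'
# each quoted row is read as a single string in column 0
df = pd.read_csv(io.StringIO(text), header=None)
# split on commas into separate columns, then cast to float
df = df[0].str.split(",", expand=True).astype(float)
```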
I'm having a tough time correctly loading a csv file into a pandas dataframe. The file is a csv saved in MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success... I have also tried read_table.)
What I want to see in my dataframe is each semicolon-separated value in a separate column (I understand that read_csv works this way(?)).
Unfortunately, I always end up with the whole row placed in the first column of the dataframe. So basically after loading I have many rows, but only one column (two if I also count the index).
I have placed sample here:
datafile
Any idea welcomed.
Add quoting = 3; 3 stands for csv.QUOTE_NONE (see the csv module documentation).
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give a [7 rows x 23 columns] dataframe.
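A quick check with the sample row from the question (trimmed to five fields for brevity), showing that quoting=3 puts each semicolon-separated value in its own column, with the quote characters left in the data as literal text:

```python
import io
import pandas as pd

# one sample row from the question, trimmed to five fields
row = 'Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7"\n'
# quoting=3 (csv.QUOTE_NONE) disables quote handling entirely,
# so the double quotes stay in the parsed strings
df = pd.read_csv(io.StringIO(row), engine="python", index_col=False,
                 header=None, delimiter=";", quoting=3)
```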
The problem is the enclosing quote characters, which can be ignored by escaping the delimiter with a \ character:
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter='\;')