I am trying to load a semicolon-separated txt file, and there are a few instances where escaped characters are in the data. These are typically the HTML entity &lt; (which itself ends in a semicolon), so it adds an extra delimiter. This obviously messes up my data, and since dtypes are important it causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?
I tried deleting the char from the file and it works now, but given that I want an automated process on millions of rows this is not sustainable.
df = pd.read_csv('file_loc.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)
ValueError: could not convert string to float:
My first column is a string and the second is a float, but if the first column is split at the semicolon ending the &lt; entity, part of the string spills into the second column.
Is there a way to tell pandas to ignore these or efficiently remove before loading?
Given the following example csv file so57732330.csv:
col1;col2
1&lt;2;a
3;
We read it using StringIO after unescaping named and numeric HTML5 character references:
import pandas as pd
import io
import html
with open('so57732330.csv') as f:
s = f.read()
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f,sep=';')
Result:
col1 col2
0 1<2 a
1 3 NaN
I have a very large csv file (2+ million rows) which is separated by commas, except there are a few fields, such as Company Name and a value field, that can contain commas inside quotation marks.
E.g:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
Note that when the value field after the datetime is >= 1000 it adds the comma and quotations. As a result of the above when I try:
pd.read_csv(r'C:\example_file.csv', delimiter=",", quoting=csv.QUOTE_NONE, quotechar='"', encoding='utf-8')
It throws a pandas.errors.ParserError: Error tokenizing data. C error: Expected 24 fields in line x, saw 25
I've managed a workaround to get it into a dataframe using this:
import pandas as pd
import csv

file = pd.read_csv(r'C:\example_file.csv', delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
df = pd.DataFrame(file)
column_list = df.columns
# column_list[0] returns the index entry holding all 24 column names
column_string = str(column_list[0])
column_string_split = column_string.split(",")
df.rename(columns={df.columns[0]: 'fill'}, inplace=True)
new_df = pd.DataFrame(df['fill'].str.split(',').tolist(), columns=column_string_split)
I understand what I've tried so far isn't optimal, and ideally I would separate the data when I read it in, but I'm at a loss on how to progress. I've been looking into regex expressions to exclude commas inside quotations (such as Python, split a string at commas, except within quotes, ignoring whitespace) but nothing has worked. Any ideas on how to str.split commas except within quotes?
Considering the .csv you gave :
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
First, as explained in the comments by @Chris and @Quang Hoang, a standard pandas.read_csv will ignore the comma inside the quotes.
Second, you can pass the thousands parameter to pandas.read_csv to get rid of the comma for all the numbers > 999.
Try this :
df = pd.read_csv(r'C:\example_file.csv', encoding='utf-8', thousands=",")
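As a quick check, here is a minimal sketch of that call on the two sample rows from the question (read with header=None, since no header row is shown):

```python
import io

import pandas as pd

# The two sample rows from the question; no header row is shown
data = '''product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"'''

# Default quoting keeps "Company, A" together; thousands="," turns "1,000" into 1000
df = pd.read_csv(io.StringIO(data), header=None, thousands=",")
print(df[2].tolist())  # [50, 1000]
```

Note that quoting=csv.QUOTE_NONE is exactly what breaks the original attempt: the default quoting behavior is what protects the embedded commas.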
I'm trying to import CSV-style data from software designed in Europe into a df for analysis.
The data uses two characters to delimit the data in the files, 'DC4' and 'SI' ("Shift In" I believe). I'm currently concatenating the files and delimiting them by the 'DC4' character using read_csv into a df. Then I use a regex line to replace all the 'SI' characters into ';' in the df. I skip every other line in the code to remove the identifiers I don't need next. If I open the data at this point everything is split by the 'DC4' and all 'SI' are converted to ;.
What would you suggest to further split the df by the ; character now? I've tried to split the df by series.string but got type errors. I've exported to csv and reimported it using ; as the delimiter, but it doesn't split the existing columns that were already split with the first import for some reason? I also get parser errors on some rows way down the df so I think there are dirty rows (this is just information I've found. If not helpful please ignore it). I can ignore these lines without affecting the data I need.
The size of the df is around 60-70 columns and usually less than 75K rows when I pull a full report. I'm using PyCharm and Python 3.8. Thank you all for any help on this, I very much appreciate it. Here is my code so far:
path = file directory location
# sep is the DC4 control character ('\x14'); SI ("Shift In") is '\x0f'
df = pd.concat([pd.read_csv(f, sep='\x14', comment=" ", na_values='Nothing', header=None, index_col=False)
                for f in glob.glob(path + ".file extension")], ignore_index=True)
df = df.replace('\x0f', ';', regex=True)
df = df.iloc[::2]
df.to_csv(r'new_file_location', index=False, encoding='utf-8-sig')
So you have a CSV (technically not a CSV I guess) that's separated by two different values (DC4 and SI) and you want to read it into a dataframe?
You can do so directly with pandas: the read_csv function allows you to specify regex delimiters, so you could use "\x0f|\x14" (SI is '\x0f', DC4 is '\x14') to treat either character as a separator: pd.read_csv(path, sep="\x0f|\x14")
An example with readable characters:
The csv contains:
col1,col2;col3
val1,val2,val3
val4;val5;val6
Which can be read as follows:
import pandas as pd
df = pd.read_csv(path, sep=",|;")
which results in df being:
col1 col2 col3
0 val1 val2 val3
1 val4 val5 val6
I have a csv file with multiple columns that contain numbers and natural language text. The columns have a special separator (say "%#") so that one can immediately see the boundary between columns. For example:
55%#This is an example text%#24
32%#This is another example text%#1649
However, sometimes the text contains newlines:
212%#This is a possible
occurrence of
a text in the csv file%#3399
My problem is how to read the csv file with pandas such that the newline is not interpreted as another csv row. When I use df = pd.read_csv(filename, sep='%#') I get for example:
Col1 Col2 Col3
212 This is a possible None
occurrence of None None
a text in the csv file 3399
How could I solve this problem?
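One possible pre-processing approach (a sketch, not from the original thread): because the '%#' fields are unquoted, wrapped records can be stitched back together by buffering physical lines until the buffer contains the expected number of separators, then handing the repaired text to read_csv:

```python
import io

import pandas as pd

# Sample data from the question, including one record wrapped across three lines
raw = """55%#This is an example text%#24
32%#This is another example text%#1649
212%#This is a possible
occurrence of
a text in the csv file%#3399"""

records, buf = [], ""
for line in raw.splitlines():
    # glue continuation lines back onto the current record
    buf = line if not buf else buf + " " + line
    # a complete record has exactly two '%#' separators (three columns)
    if buf.count("%#") == 2:
        records.append(buf)
        buf = ""

# Multi-character separators require the python engine
df = pd.read_csv(io.StringIO("\n".join(records)), sep="%#",
                 engine="python", header=None,
                 names=["Col1", "Col2", "Col3"])
print(df["Col3"].tolist())  # [24, 1649, 3399]
```

This assumes every record has exactly two separators and that a literal "%#" never appears inside the text fields; embedded newlines are replaced with spaces.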
I am trying to read a data file with a header. The data file is attached and I am using the following code:
import pandas as pd
data=pd.read_csv('TestData.out', sep=' ', skiprows=1, header=None)
The issue is that I have 20 columns in my data file, while I am getting 32 columns in the variable data. How can I resolve this issue? I am very new to Python and still learning.
Data_File
Your text file has two spaces together in front of any value that does not have a minus sign. If sep=' ', pandas sees this as two delimiters with nothing (NaN) in between.
This will fix it:
data = pd.read_csv('TestData.out', sep='\s+', skiprows=1, header=None)
In this case the sep is interpreted as a regex, which looks for "one or more spaces" as the delimiter, and returns columns 0 through 19.
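A small illustration of the effect (the values here are made up, since the attached file isn't shown):

```python
import io

import pandas as pd

# A line where a double space precedes the value without a minus sign
line = "1.0 -2.0  3.0"

# sep=' ' treats the double space as two delimiters, creating an empty (NaN) field
bad = pd.read_csv(io.StringIO(line), sep=' ', header=None)
# sep=r'\s+' collapses each run of whitespace into a single delimiter
good = pd.read_csv(io.StringIO(line), sep=r'\s+', header=None)
print(bad.shape[1], good.shape[1])  # 4 3
```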
Your data file has inconsistent space delimitation. So, you just have to skip the subsequent space after the delimiter. This simple code works:
data= pd.read_csv('TestData.out',sep=' ',skiprows=1,skipinitialspace=True)
I'm trying to read a 2-column csv file (error.csv) with a semicolon separator which contains double-quoted semicolons:
col1;col2
2016-04-17_22:34:25.126;"Linux; Android"
2016-04-17_22:34:25.260;"{"g":2}iPhone; iPhone"
And I'm trying:
logs = pd.read_csv('error.csv', na_values="null", sep=';',
                   quotechar='"', quoting=0)
I understand that the problem comes from having a quoted "g" inside the double-quoted field on line 3, but I can't figure out how to deal with it. Any ideas?
You will probably need to pre-process the data so that it conforms to the expected CSV format. I doubt pandas will handle this just by changing a parameter or two.
If there are only two columns, and the first never contains a semi-colon, then you could split the lines on the first semi-colon:
import pandas

records = []
with open('error.csv', 'r') as fh:
    # first row is a header
    header = next(fh).strip().split(';')
    for rec in fh:
        # split only on the first semi-colon
        date, dat = rec.strip().split(';', maxsplit=1)
        # assemble records, removing quotes from the second column
        records.append((date, dat.strip('"')))

# create a data frame
df = pandas.DataFrame.from_records(records, columns=header)
You will have to manually parse the dates yourself with the datetime module if you want the first column to contain proper dates and not strings.
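For example, a sketch of that parsing step with datetime.strptime (the format string is inferred from the sample rows, so treat it as an assumption):

```python
from datetime import datetime

# Format inferred from the sample rows, e.g. 2016-04-17_22:34:25.126
ts = datetime.strptime("2016-04-17_22:34:25.126", "%Y-%m-%d_%H:%M:%S.%f")
print(ts)  # 2016-04-17 22:34:25.126000
```

The same format string could be mapped over the first column of the data frame once it is built.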