I am trying to load a semicolon-separated txt file, and there are a few instances where escaped characters are in the data. These are typically the HTML entity &lt; (which itself ends in a semicolon), so it adds an extra delimiter. This obviously messes up my data, and since dtypes are important it causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?
I tried deleting the char from the file and it works now, but given that I want an automated process on millions of rows this is not sustainable.
df = pd.read_csv('file_loc.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)
ValueError: could not convert string to float:
My first column is a string and the second is a float, but if the first column is split at the semicolon ending the &lt; entity, part of the string spills into the second column.
Is there a way to tell pandas to ignore these or efficiently remove before loading?
Given the following example csv file so57732330.csv:
col1;col2
1&lt;2;a
3;
We read it using StringIO after unescaping named and numeric HTML5 character references:
import pandas as pd
import io
import html
with open('so57732330.csv') as f:
s = f.read()
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f,sep=';')
Result:
col1 col2
0 1<2 a
1 3 NaN
I have a very large csv file (2+ million rows) which is separated by commas, except there are a few fields, such as Company Name and a value field, that can contain commas inside quotation marks.
E.g:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
Note that when the value field after the datetime is >= 1000 it adds the comma and quotations. As a result of the above when I try:
pd.read_csv(r'C:\example_file.csv', delimiter=",", quoting=csv.QUOTE_NONE, quotechar='"', encoding='utf-8')
It throws a pandas.errors.ParserError: Error tokenizing data. C error: Expected 24 fields in line x, saw 25
I've managed a workaround to get it into a dataframe using this:
import pandas as pd
import csv

file = pd.read_csv(r'C:\example_file.csv', delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
df = pd.DataFrame(file)
column_list = df.columns
# column_list[0] returns the index entry holding all 24 column names
column_string = str(column_list[0])
column_string_split = column_string.split(",")
df.rename(columns={df.columns[0]: 'fill'}, inplace=True)
new_df = pd.DataFrame(df['fill'].str.split(',').tolist(), columns=column_string_split)
I understand what I've tried so far isn't optimal, and ideally I would separate the data when I read it in, but I'm at a loss on how to progress. I've been looking into regex expressions to exclude commas inside quotations (such as Python, split a string at commas, except within quotes, ignoring whitespace) but nothing has worked. Any ideas on how to str.split commas except within quotes?
Considering the .csv you gave :
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
First, as explained in the comments by @Chris and @Quang Hoang, a standard pandas.read_csv will ignore the comma inside the quotes.
Second, you can pass the thousands parameter to pandas.read_csv to get rid of the comma for all the numbers > 999.
Try this :
df = pd.read_csv(r'C:\example_file.csv', encoding='utf-8', thousands=",")
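As a quick check, here is a minimal sketch of that call on the two sample rows from the question (read with header=None, since no header row is shown):

```python
import io

import pandas as pd

# The two sample rows from the question; no header row is shown
data = '''product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"'''

# Default quoting keeps "Company, A" together; thousands="," turns "1,000" into 1000
df = pd.read_csv(io.StringIO(data), header=None, thousands=",")
print(df[2].tolist())  # [50, 1000]
```

Note that quoting=csv.QUOTE_NONE is exactly what breaks the original attempt: the default quoting behavior is what protects the embedded commas.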
I'm trying to import CSV-style data from software designed in Europe into a df for analysis.
The data uses two characters to delimit the data in the files, 'DC4' and 'SI' ("Shift In" I believe). I'm currently concatenating the files and delimiting them by the 'DC4' character using read_csv into a df. Then I use a regex line to replace all the 'SI' characters into ';' in the df. I skip every other line in the code to remove the identifiers I don't need next. If I open the data at this point everything is split by the 'DC4' and all 'SI' are converted to ;.
What would you suggest to further split the df by the ; character now? I've tried to split the df by series.string but got type errors. I've exported to csv and reimported it using ; as the delimiter, but it doesn't split the existing columns that were already split with the first import for some reason? I also get parser errors on some rows way down the df so I think there are dirty rows (this is just information I've found. If not helpful please ignore it). I can ignore these lines without affecting the data I need.
The size of the df is around 60-70 columns and usually less than 75K rows when I pull a full report. I'm using PyCharm and Python 3.8. Thank you all for any help on this, I very much appreciate it. Here is my code so far:
path = file directory location
# sep is the DC4 control character ('\x14'); SI ("Shift In") is '\x0f'
df = pd.concat([pd.read_csv(f, sep='\x14', comment=" ", na_values='Nothing', header=None, index_col=False)
                for f in glob.glob(path + ".file extension")], ignore_index=True)
df = df.replace('\x0f', ';', regex=True)
df = df.iloc[::2]
df.to_csv(r'new_file_location', index=False, encoding='utf-8-sig')
So you have a CSV (technically not a CSV I guess) that's separated by two different values (DC4 and SI) and you want to read it into a dataframe?
You can do so directly with pandas: the read_csv function allows you to specify regex delimiters, so you could use "\x0f|\x14" (SI is '\x0f', DC4 is '\x14') to treat either character as a separator: pd.read_csv(path, sep="\x0f|\x14")
An example with readable characters:
The csv contains:
col1,col2;col3
val1,val2,val3
val4;val5;val6
Which can be read as follows:
import pandas as pd
df = pd.read_csv(path, sep=",|;")
which results in df being:
col1 col2 col3
0 val1 val2 val3
1 val4 val5 val6
I have a csv file with multiple columns that contain numbers and natural language text. The columns have a special separator (say "%#") so that one can immediately see the boundary between columns. For example:
55%#This is an example text%#24
32%#This is another example text%#1649
However, sometimes the text contains newlines:
212%#This is a possible
occurrence of
a text in the csv file%#3399
My problem is how to read the csv file with pandas such that the newline is not interpreted as another csv row. When I use df = pd.read_csv(filename, sep='%#') I get for example:
Col1 Col2 Col3
212 This is a possible None
occurrence of None None
a text in the csv file 3399
How could I solve this problem?
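One possible pre-processing approach (a sketch, not from the original thread): because the '%#' fields are unquoted, wrapped records can be stitched back together by buffering physical lines until the buffer contains the expected number of separators, then handing the repaired text to read_csv:

```python
import io

import pandas as pd

# Sample data from the question, including one record wrapped across three lines
raw = """55%#This is an example text%#24
32%#This is another example text%#1649
212%#This is a possible
occurrence of
a text in the csv file%#3399"""

records, buf = [], ""
for line in raw.splitlines():
    # glue continuation lines back onto the current record
    buf = line if not buf else buf + " " + line
    # a complete record has exactly two '%#' separators (three columns)
    if buf.count("%#") == 2:
        records.append(buf)
        buf = ""

# Multi-character separators require the python engine
df = pd.read_csv(io.StringIO("\n".join(records)), sep="%#",
                 engine="python", header=None,
                 names=["Col1", "Col2", "Col3"])
print(df["Col3"].tolist())  # [24, 1649, 3399]
```

This assumes every record has exactly two separators and that a literal "%#" never appears inside the text fields; embedded newlines are replaced with spaces.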
I am trying to read a data file with a header. The data file is attached and I am using the following code:
import pandas as pd
data=pd.read_csv('TestData.out', sep=' ', skiprows=1, header=None)
The issue is that I have 20 columns in my data file, while I am getting 32 columns in the variable data. How can I resolve this issue? I am very new to Python and still learning.
Data_File
Your text file has two spaces together in front of any value that does not have a minus sign. If sep=' ', pandas sees this as two delimiters with nothing (NaN) in between.
This will fix it:
data = pd.read_csv('TestData.out', sep='\s+', skiprows=1, header=None)
In this case the sep is interpreted as a regex, which looks for "one or more spaces" as the delimiter, and returns columns 0 through 19.
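A small illustration of the effect (the values here are made up, since the attached file isn't shown):

```python
import io

import pandas as pd

# A line where a double space precedes the value without a minus sign
line = "1.0 -2.0  3.0"

# sep=' ' treats the double space as two delimiters, creating an empty (NaN) field
bad = pd.read_csv(io.StringIO(line), sep=' ', header=None)
# sep=r'\s+' collapses each run of whitespace into a single delimiter
good = pd.read_csv(io.StringIO(line), sep=r'\s+', header=None)
print(bad.shape[1], good.shape[1])  # 4 3
```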
Your data file has inconsistent space delimitation. So, you just have to skip the subsequent space after the delimiter. This simple code works:
data= pd.read_csv('TestData.out',sep=' ',skiprows=1,skipinitialspace=True)
I'm trying to read a 2-column csv file (error.csv) with a semicolon separator which contains double-quoted semicolons:
col1;col2
2016-04-17_22:34:25.126;"Linux; Android"
2016-04-17_22:34:25.260;"{"g":2}iPhone; iPhone"
And I'm trying:
logs = pd.read_csv('error.csv', na_values="null", sep=';',
                   quotechar='"', quoting=0)
I understand that the problem comes from having a quoted "g" inside the double-quoted field on line 3, but I can't figure out how to deal with it. Any ideas?
You will probably need to pre-process the data so that it conforms to the expected CSV format. I doubt pandas will handle this just by changing a parameter or two.
If there are only two columns, and the first never contains a semi-colon, then you could split the lines on the first semi-colon:
import pandas

records = []
with open('error.csv', 'r') as fh:
    # first row is a header
    header = next(fh).strip().split(';')
    for rec in fh:
        # split only on the first semi-colon
        date, dat = rec.strip().split(';', maxsplit=1)
        # assemble records, removing quotes from the second column
        records.append((date, dat.strip('"')))

# create a data frame
df = pandas.DataFrame.from_records(records, columns=header)
You will have to manually parse the dates yourself with the datetime module if you want the first column to contain proper dates and not strings.
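For example, a sketch of that parsing step with datetime.strptime (the format string is inferred from the sample rows, so treat it as an assumption):

```python
from datetime import datetime

# Format inferred from the sample rows, e.g. 2016-04-17_22:34:25.126
ts = datetime.strptime("2016-04-17_22:34:25.126", "%Y-%m-%d_%H:%M:%S.%f")
print(ts)  # 2016-04-17 22:34:25.126000
```

The same format string could be mapped over the first column of the data frame once it is built.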