When importing a CSV file, some rows have a double quotation mark directly after the separator ','.
See the two rows highlighted by the red border:
As you can see, my code has trouble reading these lines. A closer look at the first problematic row:
documents/edf-2021-universal-registration-document.pdf,In which year should carbon neutrality be achieved?,"[\' 2050\', \' 2022\', \' 2050\', \' 2050\', \' 2050\']","[98.251474, 94.238061, 91.08210199999999, 88.809156, 80.074859]"
Is there an efficient way to solve this?
import pandas as pd
df = pd.read_csv("tmp1.csv", encoding="ISO-8859-1")
I have a very large CSV file (2+ million rows) which is separated by commas, except that a few fields, such as Company Name and a value field, can contain commas inside quotation marks.
E.g:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
Note that when the value field after the datetime is >= 1000 it adds the comma and quotations. As a result of the above when I try:
pd.read_csv(r'C:\example_file.csv', delimiter=",", quoting=csv.QUOTE_NONE, quotechar='"', encoding='utf-8')
It throws a pandas.errors.ParserError: Error tokenizing data. C error: Expected 24 fields in line x, saw 25
I've managed a workaround to get it into a dataframe using this:
import csv
import pandas as pd
# Read every line as a single field by using a delimiter that does not occur in the file
file = pd.read_csv(r'C:\example_file.csv', delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
df = pd.DataFrame(file)
column_list = df.columns
column_string = str(column_list[0])
# column_list[0] is the single header string holding all 24 column names
column_string_split = column_string.split(",")
df.rename(columns={df.columns[0]: 'fill'}, inplace=True)
new_df = pd.DataFrame(df['fill'].str.split(',').tolist(), columns=column_string_split)
I understand that what I've tried so far isn't optimal, and ideally I would separate the data when I read it in, but I'm at a loss on how to progress. I've been looking into regular expressions to exclude commas inside quotations (such as Python, split a string at commas, except within quotes, ignoring whitespace) but nothing has worked. Any ideas on how to str.split on commas except within quotes?
Considering the .csv you gave:
product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"
product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"
First, as explained in the comments by @Chris and @Quang Hoang, a standard pandas.read_csv call treats a comma inside quotes as part of the field, not as a delimiter.
Second, you can pass the thousands parameter to pandas.read_csv to get rid of the comma in all numbers > 999.
Try this:
df = pd.read_csv(r'C:\example_file.csv', encoding='utf-8', thousands=",")
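To check this without the original file, here is a minimal sketch that feeds the two sample rows through io.StringIO. The header row and its column names (product, timestamp, value, price, company) are assumptions added for illustration; the real file has 24 columns.

```python
import io

import pandas as pd

# The two sample rows from the question, with a hypothetical header row added
sample = (
    "product,timestamp,value,price,company\n"
    'product1,"23-12-2021 21:37:22.567",50,18.32,"Company, A"\n'
    'product1,"23-12-2021 21:37:24.237","1,000",17.34,"Company, C"\n'
)

# Default quoting keeps commas inside quotes within one field;
# thousands="," lets "1,000" be parsed as the number 1000
df = pd.read_csv(io.StringIO(sample), thousands=",")

print(df.shape)
print(df["value"].tolist())
print(df["company"].tolist())
```

Note that the call in the question passed quoting=csv.QUOTE_NONE, which tells the parser to treat quote characters as ordinary text, so every comma splits a field. Leaving quoting at its default is what makes the quoted commas safe.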
I have a txt file which I read into a pandas DataFrame. The problem is that inside this file my text data is recorded with the delimiter '\'. I need to split the information in one column into several columns, but it does not work because of this delimiter.
I found this post on Stack Overflow that deals with just one string, but I don't understand how to apply it to a whole DataFrame: Split string at delimiter '\' in python
After reading my txt file into df it looks something like this
df
column1\tcolumn2\tcolumn3
0.1\t0.2\t0.3
0.4\t0.5\t0.6
0.7\t0.8\t0.9
Basically what I am doing now is the following:
df = pd.read_fwf('my_file.txt', skiprows = 8) #I use skip rows because there is irrelevant text
df['column1\tcolumn2\tcolumn3'] = "r'" + df['column1\tcolumn2\tcolumn3'] + "'" # I try to make it a raw string, as the post suggested, but it does not really work
df['column1\tcolumn2\tcolumn3'].str.split('\\',expand=True)
and what I get is just the following (just displayed like text inside a data frame)
r'0.1\t0.2\t0.3'
r'0.4\t0.5\t0.6'
r'0.7\t0.8\t0.9'
I am not very good with regular expressions and it seems a bit hard. How can I tackle this problem?
It looks like your file is tab-delimited, because of the "\t". This may work:
pd.read_csv('file.txt', sep='\t', skiprows=8)
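A small self-contained sketch of that suggestion, using an in-memory string instead of the real file. The two "irrelevant" preamble lines are an assumption standing in for the eight lines skipped in the question.

```python
import io

import pandas as pd

# Hypothetical file content: two preamble lines, then tab-separated data
raw = (
    "irrelevant line 1\n"
    "irrelevant line 2\n"
    "column1\tcolumn2\tcolumn3\n"
    "0.1\t0.2\t0.3\n"
    "0.4\t0.5\t0.6\n"
    "0.7\t0.8\t0.9\n"
)

# sep="\t" splits on real tab characters; skiprows drops the preamble
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=2)

print(df.columns.tolist())
print(df)
```

read_fwf expects fixed-width columns, so it keeps each tab-separated line as one field; read_csv with a tab separator splits the columns while reading, and no string manipulation is needed afterwards.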
I'm trying to import csv style data from a software designed in Europe into a df for analysis.
The data uses two characters to delimit the data in the files, 'DC4' and 'SI' ("Shift In" I believe). I'm currently concatenating the files and delimiting them by the 'DC4' character using read_csv into a df. Then I use a regex line to replace all the 'SI' characters with ';' in the df. Next, I skip every other line in the code to remove the identifiers I don't need. If I open the data at this point, everything is split by the 'DC4' and all 'SI' are converted to ;.
What would you suggest to further split the df by the ; character now? I've tried to split the df with the Series string methods but got type errors. I've exported to csv and reimported it using ; as the delimiter, but for some reason it doesn't split the existing columns that were already split by the first import. I also get parser errors on some rows way down the df, so I think there are dirty rows (this is just information I've found; if not helpful please ignore it). I can ignore these lines without affecting the data I need.
The size of the df is around 60-70 columns and usually less than 75K rows when I pull a full report. I'm using PyCharm and Python 3.8. Thank you all for any help on this, I very much appreciate it. Here is my code so far:
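import glob
import pandas as pd

path = 'file directory location'  # placeholder, as in the original post
# DC4 is the ASCII control character 0x14; SI (Shift In) is 0x0F
df = pd.concat([pd.read_csv(f, sep='\x14', comment=" ", na_values='Nothing', header=None, index_col=False)
                for f in glob.glob(path + ".file extension")], ignore_index=True)
df = df.replace('\x0f', ';', regex=True)
df = df.iloc[::2]  # keep every other row to drop the identifier lines
df.to_csv(r'new_file_location', index=False, encoding='utf-8-sig')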
So you have a CSV (technically not a CSV I guess) that's separated by two different values (DC4 and SI) and you want to read it into a dataframe?
You can do so directly with pandas: the read_csv function allows you to specify regex delimiters, so you could use "\x0f|\x14" to treat either SI (0x0F) or DC4 (0x14) as the separator: pd.read_csv(path, sep="\x0f|\x14")
An example with readable characters:
The csv contains:
col1,col2;col3
val1,val2,val3
val4;val5;val6
Which can be read as follows:
import pandas as pd
df = pd.read_csv(path, sep=",|;")
which results in df being:
col1 col2 col3
0 val1 val2 val3
1 val4 val5 val6
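The same idea with the actual control characters, as a minimal sketch. It assumes the standard ASCII codes (DC4 is 0x14, SI "Shift In" is 0x0F) and uses made-up sample content. A regex separator is handled by the Python parsing engine, so engine="python" is passed explicitly to avoid the fallback warning.

```python
import io

import pandas as pd

DC4 = "\x14"  # ASCII Device Control 4
SI = "\x0f"   # ASCII Shift In

# Hypothetical sample that mixes both delimiters, mirroring the readable example
raw = (
    f"col1{DC4}col2{SI}col3\n"
    f"val1{DC4}val2{DC4}val3\n"
    f"val4{SI}val5{SI}val6\n"
)

# A character class matches either control character as the separator
df = pd.read_csv(io.StringIO(raw), sep=f"[{SI}{DC4}]", engine="python")

print(df)
```

Splitting on both characters while reading avoids the later replace-and-reimport step, because every field ends up in its own column from the start.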
I'm trying to work with pandas and I cannot change a DataFrame column name because it contains a double quotation mark; df.rename() only changes the columns that don't have a double quotation mark. (The code and output were posted as a picture.)
It would be helpful if you would put your code in text format.
You can set/rename columns using df.columns = [column_list]
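Since the original code was only posted as a picture, the following is a guess at the situation: a column whose name literally includes the quote characters. The frame and the column names ('"name"', value) are hypothetical.

```python
import pandas as pd

# Hypothetical frame: the first column name contains literal double quotes
df = pd.DataFrame({'"name"': ["a", "b"], "value": [1, 2]})

# Option 1: rename works when the key includes the quote characters exactly
renamed = df.rename(columns={'"name"': "name"})

# Option 2: strip surrounding quotes from every column name at once
df.columns = df.columns.str.strip('"')

print(renamed.columns.tolist())
print(df.columns.tolist())
```

df.rename silently ignores keys that do not match an existing column name, which would explain why only the unquoted columns appeared to change. Printing df.columns.tolist() shows the exact strings to use as keys.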
I am trying to load a semicolon-separated txt file and there are a few instances where escape chars are in the data. These are typically &lt ; (space removed so it isn't converted to <), which adds a semicolon. This obviously messes up my data and, since dtypes are important, causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?
I tried deleting the char from the file and it works now, but given that I want an automated process on millions of rows this is not sustainable.
df = pd.read_csv("file_loc.csv",
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)
ValueError: could not convert string to float:
My first column is a string and the second is a float, but if the first is split by the &lt ; then part of it spills into the second.
Is there a way to tell pandas to ignore these or efficiently remove before loading?
Given the following example CSV file so57732330.csv:
col1;col2
1&lt;2;a
3;
we read it using StringIO after unescaping named and numeric html5 character references:
import pandas as pd
import io
import html
with open('so57732330.csv') as f:
    s = f.read()
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f, sep=';')
Result:
col1 col2
0 1<2 a
1 3 NaN
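For a quick check without creating the file, the same steps can be run on an in-memory string. This is only a sketch: the content mirrors the example file, with the first data row holding the HTML entity for < (written as &lt; followed directly by 2) before unescaping.

```python
import html
import io

import pandas as pd

# Same content as the example file; the entity contains a semicolon
raw = "col1;col2\n1&lt;2;a\n3;\n"

# Unescape HTML character references first, then parse with ';' as separator
df = pd.read_csv(io.StringIO(html.unescape(raw)), sep=";")

print(df)
```

html.unescape replaces the entity with a plain < before pandas sees the text, so the semicolon that belonged to the entity no longer counts as a delimiter. For very large files this reads the whole content into memory once, which may or may not be acceptable depending on the file size.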