Pandas or csv read: remove html character encoding - python

I read a CSV file which includes HTML entities / HTML character encoding, like this: &amp;amp; for the ampersand symbol (just an example); there are other such characters as well.
In the end my goal is to read multiple CSV files, combine them into a single CSV, and change the encoding to UTF-8 so that such symbols are gone.
Currently I do it like this:
import pandas as pd

list_ = []
for file in files:
    df = pd.read_csv(file, sep=';', index_col=None, header=0, encoding='utf-8-sig')
    list_.append(df)
df_total = pd.concat(list_)
df_total.to_csv('test.csv', sep=';', encoding='utf-8-sig', index=False)
which is very slow. Worse, it does not seem to change the encoding to UTF-8.
So a) is there a quick way to get rid of these characters, and b) is there a better way to concat CSV files, maybe with the built-in csv library, AND get rid of the unwanted characters / change their encoding?
Thank you in advance
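A minimal sketch of one possible approach (my own addition, not from the thread): HTML entities such as &amp;amp; are not an encoding problem, so they survive any UTF-8 round trip; they have to be unescaped explicitly, for example with html.unescape. The file list and column handling below are assumptions for illustration.

import html
import pandas as pd

# Assumed list of input files; adjust to your own paths.
files = ['part1.csv', 'part2.csv']

frames = []
for file in files:
    df = pd.read_csv(file, sep=';', header=0, encoding='utf-8-sig')
    # Unescape HTML entities (&amp;, &quot;, ...) in every string column.
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].map(lambda v: html.unescape(v) if isinstance(v, str) else v)
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv('combined.csv', sep=';',
                                            encoding='utf-8-sig', index=False)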

Related

Trouble reading CSV file using pandas

I'm working on a data analysis project and I wanted to read data from CSV files using pandas. I read the first CSV file and it was fine, but the second one gave me a UTF-8 encoding error. I exported the file to CSV and encoded it as UTF-8 in the Numbers spreadsheet app. However, the data frame is not in the expected format. Any idea why?
(screenshot: the original CSV file in Numbers)
It looks like your file is semicolon-separated, not comma-separated.
To fix this you need to pass the sep=';' parameter to the pd.read_csv function.
pd.read_csv("mitb.csv", sep=';')
Try adding the correct delimiter, in this case ";", to read the csv.
mitb = pd.read_csv('mitb.csv', sep=";")
The file is semicolon-separated, and the decimal separator is a comma, not a dot:
df = pd.read_csv('mitb.csv', sep=';', decimal=',')
And please do not upload images of code/data/errors.

May I use either tab or comma as delimiter when reading from pandas csv?

I have csv files. Some are comma delimited, and some are tab delimited.
df = pd.read_csv(data_file, sep='\t')
Is there a way to specify either tab or comma as the delimiter when using pd.read_csv()? Or, is there a way to automatically detect whether the file is tab or comma delimited? If I know that, I can use different sep='' parameters when reading the file.
Recently I had a similar problem; I ended up using a different method, but I explored using the Sniffer class from the csv standard library.
I haven't used this in production, only to help find what the file formats are for testing/prototyping, so use at your own risk!
From the documentation:
"Sniffs" the format of a CSV file (i.e. delimiter, quotechar). Returns a Dialect object.
You can return the dialect object and then pass dialect.delimiter to the sep arg in pd.read_csv.
'text_a.csv'
cola|colb|col
A|B|C
E|F|G
A|B|C
E|F|G
'text_b.csv'
cola\tcolb\tcol
A\tB\tC
E\tF\tG
A\tB\tC
E\tF\tG
A\tB\tC
from csv import Sniffer

sniffer = Sniffer()

def detect_delim(file, num_rows, sniffer):
    with open(file, 'r') as f:
        for row in range(num_rows):
            line = next(f).strip()
            delim = sniffer.sniff(line)
            print(delim.delimiter)  # ideally you should return the dialect object - just being lazy.
            # return delim.delimiter

detect_delim(file='text_a.csv', num_rows=5, sniffer=sniffer)
'|'
detect_delim(file='text_b.csv', num_rows=5, sniffer=sniffer)
'\t'
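A small follow-up sketch (my own addition, not from the original answer): if detect_delim is changed to return delim.delimiter instead of printing it, the result can be passed straight to pandas.

import pandas as pd

# Assumes detect_delim() returns the sniffed delimiter rather than printing it.
sep = detect_delim(file='text_a.csv', num_rows=5, sniffer=sniffer)
df = pd.read_csv('text_a.csv', sep=sep)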
I'd just read the first row and see which gives you more columns:
import pandas as pd

tab = pd.read_csv(data_file, nrows=1, sep='\t').shape[1]
com = pd.read_csv(data_file, nrows=1, sep=',').shape[1]

if tab > com:
    df = pd.read_csv(data_file, sep='\t')
else:
    df = pd.read_csv(data_file, sep=',')
Is this useful? You can use the Python regex parser with read_csv and specify multiple possible delimiters.
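A minimal sketch of that idea (my own example, not from the original answer): a regular-expression separator requires the Python parsing engine.

import pandas as pd

# sep is a regex matching either a tab or a comma; engine='python' is required for regex separators.
df = pd.read_csv(data_file, sep=r'\t|,', engine='python')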
Ask the user to specify how the file is formatted if you don't expect to be able to determine it from the file contents itself.
E.g. a flag of some sort such as --tab-delimited-file=true, and then you flip the separator based on their input, as in the sketch below.
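A rough sketch of that suggestion (the flag name and script layout are my own illustration):

import argparse
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument('path', help='path to the input file')
parser.add_argument('--tab-delimited-file', action='store_true',
                    help='treat the file as tab-delimited instead of comma-delimited')
args = parser.parse_args()

# Flip the separator based on the user's flag.
sep = '\t' if args.tab_delimited_file else ','
df = pd.read_csv(args.path, sep=sep)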

Unable to convert csv file to text tab delimited file in Python

Instead of manually converting a CSV file to a text tab-delimited file using Excel,
I would like to automate this process using Python.
However, the following code
import csv

with open('endnote_csv.csv', 'r') as fin:
    with open('endnote_deliminated.txt', 'w', newline='') as fout:
        reader = csv.DictReader(fin, delimiter=',')
        writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
        writer.writeheader()
        writer.writerows(reader)
raises an error of
ValueError: dict contains fields not in fieldnames: None
May I know where I went wrong?
The csv file is accessible via the following link
Thanks in advance for any insight.
You can use the Python package called pandas to do this:
import pandas as pd
fname = 'endnote_csv'
pd.read_csv(f'{fname}.csv').to_csv(f'{fname}.tsv', sep='\t', index=False)
Here's how it works:
pd.read_csv(fname) - reads a CSV file and stores it as a pd.DataFrame object (not important for this example)
.to_csv(fname) - writes a pd.DataFrame to a CSV file given by fname
sep='\t' - replaces the ',' used in CSVs with a tab character
index=False - use this to remove the row numbers
If you want to be a bit more advanced and use the command line only, you can do this:
# csv-to-tsv.py
import sys
import pandas as pd

fnames = sys.argv[1:]

for fname in fnames:
    main_name = '.'.join(fname.split('.')[:-1])
    pd.read_csv(f'{main_name}.csv').to_csv(f'{main_name}.tsv', sep='\t', index=False)
This will allow you to run a command like this from the command line and change all .csv files to .tsv files in one go:
python csv-to-tsv.py *.csv
It is erroring out on comma-separated author names. It appears that the number of columns in the underlying rows exceeds the number of headers.
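One possible workaround sketch (my own suggestion, not from the thread): csv.DictReader puts any overflow fields under a None key by default, which is exactly what triggers the ValueError in csv.DictWriter; giving the reader an explicit restkey and dropping that key before writing avoids the crash. The filenames are taken from the question, and the tab delimiter matches its title.

import csv

with open('endnote_csv.csv', 'r', newline='') as fin, \
        open('endnote_deliminated.txt', 'w', newline='') as fout:
    # Rows with more columns than the header get their extras under 'overflow' instead of None.
    reader = csv.DictReader(fin, delimiter=',', restkey='overflow')
    writer = csv.DictWriter(fout, reader.fieldnames, delimiter='\t')
    writer.writeheader()
    for row in reader:
        row.pop('overflow', None)  # drop the overflow columns
        writer.writerow(row)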

Strange character while reading a CSV file

I try to read a CSV file in Python, but the first element in the first row is read with a strange character prepended to the 0, while that character isn't in the file; it's just a simple 0. Here is the code I used:
import csv

matriceDist = []
file = csv.reader(open("distanceComm.csv", "r"), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)
I had this same issue. Save your Excel file as CSV (MS-DOS) instead of UTF-8 and those odd characters should be gone.
Specifying an encoding that handles the byte order mark when opening the file, as follows, solved my issue:
open('inputfilename.csv', 'r', encoding='utf-8-sig')
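For example, combined with csv.reader (my own sketch, using the filename from the question above):

import csv

with open('distanceComm.csv', 'r', encoding='utf-8-sig') as f:
    rows = list(csv.reader(f, delimiter=';'))
print(rows)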
Just using pandas together with some encoding (utf-8 for example) is going to be easier:
import pandas as pd
df = pd.read_csv('distanceComm.csv', header=None, encoding = 'utf8', delimiter=';')
print(df)
I don't know what your input file is. But since it has a Byte Order Mark for UTF-8, you can use something like this:
import csv
import codecs

matriceDist = []
file = csv.reader(codecs.open('distanceComm.csv', encoding='utf-8'), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)

Auto-detect the delimiter in a CSV file using pd.read_csv

Is there a way for read_csv to auto-detect the delimiter? numpy's genfromtxt does this.
My files have data with single space, double space and a tab as delimiters. genfromtxt() solves it, but is slower than pandas' read_csv.
Any ideas?
Another option is to use the built-in csv Sniffer. I combine it with reading only a certain number of bytes, in case the CSV file is large.
import csv

def get_delimiter(file_path, bytes=4096):
    sniffer = csv.Sniffer()
    data = open(file_path, "r").read(bytes)
    delimiter = sniffer.sniff(data).delimiter
    return delimiter
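Then, as a usage sketch (my own addition), the detected delimiter can be fed straight to pandas:

import pandas as pd

# 'file.csv' is a placeholder path for illustration.
sep = get_delimiter('file.csv')
df = pd.read_csv('file.csv', sep=sep)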
Option 1
Using delim_whitespace=True
df = pd.read_csv('file.csv', delim_whitespace=True)
Option 2
Pass a regular expression to the sep parameter:
df = pd.read_csv('file.csv', sep=r'\s+')
This is equivalent to the first option.
Documentation for pd.read_csv.
For better control, I use a Python module called detect_delimiter from PyPI. See https://pypi.org/project/detect-delimiter/. It has been around for some time. As with all code, you should test it with your interpreter prior to deployment. I have tested it up to Python version 3.8.5.
See the code example below, where the delimiter is automatically detected and the variable delimiter is defined from the method's output. The code then reads the CSV file with sep=delimiter. I have tested with the following delimiters, although others should work: ; , |
It does not work with multi-character delimiters, such as ","
CAUTION! This method will do nothing to detect a malformed CSV file. In the case
where the input file contains both ; and , the method returns , as the detected delimiter.
from detect_delimiter import detect
import pandas as pd

filename = "some_csv.csv"
with open(filename) as f:
    firstline = f.readline()
delimiter = detect(firstline)
records = pd.read_csv(filename, sep=delimiter)
