detect strange chars in csv python

I have a CSV with a million lines or so, and some of the lines are mixed with characters like these (meaning the line can be read but is mixed with baloney):
ªïÜÜµ+>&\ôowó¨ñø4(½;!|Nòdd¼Õõ¿¨W[¦¡¿p\,¶êÕMÜÙ;!ÂeÃ£YÃ3SÂ®Â´Ã¸ÂÃ
The input file is ISO-8859-1; each line is filtered and written to a new UTF-8 file.
Can this be filtered? And how?
Here is what it looks like (the entire line):
Foo;Bar;24/01/2019-13:06;24/01/2019-12:55.01;;!
ù:#ªïÜÜµ+>&\ôowó¨ñø4(½;!|Nòdd¼Õõ¿¨W[¦¡¿p\,¶êÕMÜÙ;!
ÂeÃ£YÃ3ÃSÂ®Â´Ã¸ÂÃÃ§~ÂÂÂÂÃÃ½ÃÂ¬ÂÂ¯Ã£m;!ÃvÂ
´Ã¼ÂÂ9Â¬uÂ»/"ÂFÃ|b`ÃÃÃµÃ Â±ÃÃÂ8ÃÂ;Baz
This is how I read it. The encoding for fileobject is ISO-8859-1:
def tee(self, rules=None):
    _good = open(self.good, "a")
    _bad = open(self.bad, "a")
    with open(self.temp, encoding=rules["encoding"]) as fileobject:
        cpt = 0
        _csv = csv.reader(fileobject, **self.dialect)
        for row in _csv:
            _len = len(row)
            _reconstructed = ";".join(row)
            self.count['original'] += 1
            if _len == rules["columns"]:
                _good.write("{}\n".format(_reconstructed))
            else:
                _bad.write("{}\n".format(_reconstructed))
                # print("[{}] {}:{}{}{}".format(_len, cpt, fg("red"), _reconstructed, attr("reset")))
                self.count['errors'] += 1
            cpt += 1
    _good.close()
    _bad.close()
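One possible way to filter such lines is a simple heuristic. Because every byte is valid in ISO-8859-1, decoding never fails, so the garbage has to be recognised by its content. The sketch below assumes that legitimate lines consist mostly of printable ASCII; the helper names `suspicious_ratio` and `looks_garbled` and the threshold of 0.3 are hypothetical and would need tuning against real data.

```python
def suspicious_ratio(text):
    """Return the fraction of characters outside printable ASCII."""
    if not text:
        return 0.0
    odd = sum(1 for ch in text if not (" " <= ch <= "~"))
    return odd / len(text)


def looks_garbled(text, threshold=0.3):
    """Heuristic: flag text in which too many characters are non-ASCII or control characters."""
    return suspicious_ratio(text) > threshold


clean = "Foo;Bar;24/01/2019-13:06;24/01/2019-12:55.01;;!"
noisy = "ªïÜÜµ+>&\\ôowó¨ñø4(½;!|Nòdd¼Õõ¿¨W[¦¡¿p\\,¶êÕMÜÙ;!"
print(looks_garbled(clean))  # False
print(looks_garbled(noisy))  # True
```

A check like `looks_garbled(_reconstructed)` could be combined with the column-count test, sending flagged rows to the "bad" file. Lines that legitimately contain many accented characters would need a higher threshold.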

Related

Python: Separating txt file to multiple files using a recurring symbol

I have a .txt file of amino acids separated by ">node" like this:
Filename.txt :
>NODE_1
MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI
KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY
GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*
>NODE_2
MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD
MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV
PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*
I want to separate this file into two (or as many as there are nodes) files:
Filename1.txt :
>NODE
MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI
KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY
GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*
Filename2.txt :
>NODE
MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD
MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV
PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*
With a number after the filename
This code works, however it deletes the ">NODE" line and does not create a file for the last node (the one without a '>' afterwards).
with open('FilePathway') as fo:
    op = ''
    start = 0
    cntr = 1
    for x in fo.read().split("\n"):
        if x.startswith('>'):
            if start == 1:
                with open (str(cntr) + '.fasta','w') as opf:
                    opf.write(op)
                    opf.close()
                    op = ''
                    cntr += 1
            else:
                start = 1
        else:
            if op == '':
                op = x
            else:
                op = op + '\n' + x
fo.close()
I can't seem to find the mistake. I would be thankful if you could point it out to me.
Thank you for your help!
Hi again! Thank you for all the comments. With your help, I managed to get it to work perfectly. For anyone with similar problems, this is my final code:
import os
import glob
folder_path = 'FilePathway'
for filename in glob.glob(os.path.join(folder_path, '*.fasta')):
    with open(filename) as fo:
        for line in fo.readlines():
            if line.startswith('>'):
                original = line
    content = [original]
    fileno = 1
    filename = filename
    y = filename.replace(".fasta","_")
    def writefasta():
        global content, fileno
        if len(content) > 1:
            with open(f'{y}{fileno}.fasta', 'w') as fout:
                fout.write(''.join(content))
            content = [line]
            fileno += 1
    with open('FilePathway') as fin:
        for line in fin:
            if line.startswith('>NODE'):
                writefasta()
            else:
                content.append(line)
        writefasta()

You could do it like this:
def writefasta(d):
    if len(d['content']) > 1:
        with open(f'Filename{d["fileno"]}.fasta', 'w') as fout:
            fout.write(''.join(d['content']))
        d['content'] = ['>NODE\n']
        d['fileno'] += 1
with open('test.fasta') as fin:
    D = {'content': ['>NODE\n'], 'fileno': 1}
    for line in fin:
        if line.startswith('>NODE'):
            writefasta(D)
        else:
            D['content'].append(line)
    writefasta(D)
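The same idea can be sketched without shared state: buffer lines until the next header appears, and flush once more after the loop so the last record is not lost. This is only an illustration; `split_fasta`, its `prefix` parameter and the output naming are hypothetical, and the header line is kept verbatim (for example `>NODE_1`) rather than rewritten to `>NODE`.

```python
import os
import tempfile


def split_fasta(path, out_dir, prefix="Filename"):
    """Write each '>'-headed record of `path` to its own numbered file."""
    written = []
    record = []

    def flush():
        # Write the buffered record, if any, to the next numbered file.
        if record:
            out_path = os.path.join(out_dir, f"{prefix}{len(written) + 1}.fasta")
            with open(out_path, "w") as fout:
                fout.writelines(record)
            written.append(out_path)
            record.clear()

    with open(path) as fin:
        for line in fin:
            if line.startswith(">"):
                flush()          # a new header closes the previous record
            record.append(line)  # the header line itself is kept
    flush()                      # the last record has no header after it
    return written


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.fasta")
    with open(src, "w") as f:
        f.write(">NODE_1\nMSETL\nKFFLG*\n>NODE_2\nMSTWH*\n")
    print(len(split_fasta(src, tmp)))  # 2
```

Flushing after the loop is what fixes the missing last file, and appending the header line before the sequence lines is what keeps the ">NODE" line in each output.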

This would be a better way. It writes a file only on odd iterations, so the ">NODE" lines are skipped and files are created only for the real content. Note that this relies on each record being exactly one header line followed by one content line.
with open('filename.txt') as fo:
    cntr=1
    for i,content in enumerate(fo.read().split("\n")):
        if i%2 == 1:
            with open (str(cntr) + '.txt','w') as opf:
                opf.write(content)
                cntr += 1
By the way, since you are using a context manager, you don't need to close the file:
Context managers allow you to allocate and release resources precisely
when you want to. It opens the file, writes some data to it and then
closes it.
Please check: https://book.pythontips.com/en/latest/context_managers.html
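A quick way to see this behaviour is to inspect the file object's `closed` attribute inside and after the `with` block. This is a minimal sketch; the scratch file location is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file
with open(path, "w") as opf:
    opf.write("some data\n")
    inside = opf.closed   # False: the file is still open inside the block
after = opf.closed        # True: the file was closed automatically on exit
print(inside, after)
```

An explicit `opf.close()` inside or after the block is therefore redundant.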

with open('FileName') as fo:
    cntr = 1
    for line in fo.readlines():
        with open (f'{str(cntr)}.fasta','w') as opf:
            opf.write(line)
            opf.close()
            op = ''
            cntr += 1
fo.close()

Pick line in csv file to pull data

I have code that is working for me in a Cinema 4D project. It reads 4 different data points and passes them as outputs to the main project. Currently it reads all of the lines of the CSV file, and I would like to pick one line and pull the data from that line only.
import c4d
import csv
def main():
    path = Spreadsheet #Spreadsheet is an input filename path
    with open(path, 'rb') as csv_file:
        readed = csv.DictReader(csv_file,delimiter=',')
        for i, row in enumerate(readed):
            try:
                Xcord = float(row["hc_x"])
                Ycord = float(row["hc_y"])
                Langle = float(row["launch_angle"])
                Lspeed = float(row["launch_speed"])
            except:
                print "Error while reading - {}".format(row)
                continue
            global Output1
            global Output2
            global Output3
            global Output4
            Output1 = Xcord
            Output2 = Ycord
            Output3 = Langle
            Output4 = Lspeed
This is about the first thing I have tried to code. So thanks.

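In Python 3, the csv module expects the file to be opened in text mode with newline="" so that it can parse the file correctly. Binary mode ('rb') is the Python 2 convention; in Python 3 it cannot be combined with a newline argument.
with open(path, 'r', newline="") as csv_file:
    readed = csv.DictReader(csv_file,delimiter=',')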
You also don't have any condition to stop reading the file.
row_to_stop = 5
for i, row in enumerate(readed):
    if i == row_to_stop:
        Xcord = float(row["hc_x"])
        Ycord = float(row["hc_y"])
        Langle = float(row["launch_angle"])
        Lspeed = float(row["launch_speed"])
        break
If you only care about one line, don't look up and type cast values until you reach the line you care about.

I would like to pick one line and pull the data from that line only
The code below returns a specific line (by index). You will have to split it and extract the data yourself.
def get_interesting_line(file_name: str, idx: int):
    cnt = 0
    with open(file_name) as f:
        while cnt != idx:
            f.readline()
            cnt += 1
        return f.readline().strip()
# usage example below
print(get_interesting_line('data.txt',7))
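The two answers above can be combined so that the chosen row still arrives as a dictionary keyed by column name. The following is a sketch for Python 3 only (the question's code uses the Python 2 print statement); `read_row` is a hypothetical helper and the sample values are made up for illustration.

```python
import csv
import io
from itertools import islice


def read_row(csv_file, index):
    """Return the data row at `index` (0-based, header excluded) as a dict, or None."""
    reader = csv.DictReader(csv_file)
    # islice skips the rows before `index` without converting any values.
    return next(islice(reader, index, index + 1), None)


# Made-up sample data using the column names from the question.
sample = io.StringIO(
    "hc_x,hc_y,launch_angle,launch_speed\n"
    "1.0,2.0,30.5,95.1\n"
    "3.0,4.0,12.0,101.3\n"
)
row = read_row(sample, 1)
if row is not None:
    Xcord = float(row["hc_x"])
    Lspeed = float(row["launch_speed"])
    print(Xcord, Lspeed)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`. Returning `None` for an out-of-range index lets the caller decide how to handle a missing row.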

How to search for a string in part of text?

I am trying to search multiple text files for the text "1-2","2-3","3-H" which occur in the last field of the lines of text that start with "play".
An example of the text file is shown below:
id,ARI201803290
version,2
info,visteam,COL
info,hometeam,ARI
info,site,PHO01
play,1,0,lemad001,22,CFBBX,HR/78/F
play,1,0,arenn001,20,BBX,S7/L+
play,1,0,stort001,12,SBCFC,K
play,1,0,gonzc001,02,SS>S,K
play,1,1,perad001,32,BTBBCX,S9/G
play,1,1,polla001,02,CSX,S7/L+.1-2
play,1,1,goldp001,32,SBFBBB,W.2-3;1-2
play,1,1,lambj001,00,X,D9/F+.3-H;2-H;1-3
play,1,1,avila001,31,BC*BBX,31/G.3-H;2-3
play,2,0,grayj003,12,CC*BS,K
play,2,1,dysoj001,31,BBCBX,43/G
play,2,1,corbp001,31,CBBBX,43/G
play,4,1,avila001,02,SC1>X,S8/L.1-2
For the text file above, I would like the output to be '4' since there are 4 occurrences of "1-2","2-3" and "3-H" in total.
The code I have got so far is below, however I'm not sure where to start with writing a line of code to do this function.
import os
input_folder = 'files' # path of folder containing the multiple text files
# create a list with file names
data_files = [os.path.join(input_folder, file) for file in
              os.listdir(input_folder)]
# open csv file for writing
csv = open('myoutput.csv', 'w')
def write_to_csv(line):
    print(line)
    csv.write(line)
j=0 # initialise as 0
count_of_plate_appearances=0 # initialise as 0
for file in data_files:
    with open(file, 'r') as f: # use context manager to open files
        for line in f:
            lines = f.readlines()
            i=0
            while i < len(lines):
                temp_array = lines[i].rstrip().split(",")
                if temp_array[0] == "id":
                    j=0
                    count_of_plate_appearances=0
                    game_id = temp_array[1]
                    awayteam = lines[i+2].rstrip().split(",")[2]
                    hometeam = lines[i+3].rstrip().split(",")[2]
                    date = lines[i+5].rstrip().split(",")[2]
                    for j in range(i+46,i+120,1): #only check for plate appearances this when temp_array[0] == "id"
                        temp_array2 = lines[j].rstrip().split(",") #create new array to check for plate apperances
                        if temp_array2[0] == "play" and temp_array2[2] == "1": # plate apperance occurs when these are true
                            count_of_plate_appearances=count_of_plate_appearances+1
                            #print(count_of_plate_appearances)
                    output_for_csv2=(game_id,date,hometeam, awayteam,str(count_of_plate_appearances))
                    print(output_for_csv2)
                    csv.write(','.join(output_for_csv2) + '\n')
                    i=i+1
                else:
                    i=i+1
                    j=0
                    count_of_plate_appearances=0
#quit()
csv.close()
Any suggestions on how I can do this? Thanks in advance!

You can use a regex. I put your text in a file called file.txt.
import re
a = ['1-2', '2-3', '3-H'] # What you want to count
find_this = re.compile('|'.join(a)) # Make search string
count = 0
with open('file.txt', 'r') as f:
    for line in f.readlines():
        count += len(find_this.findall(line)) # Each findall returns the list of things found
print(count) # 7
Or a shorter solution (credit to wjandrea for hinting at the use of a generator):
import re
a = ['1-2', '2-3', '3-H'] # What you want to count
find_this = re.compile('|'.join(a)) # Make search string
with open('file.txt', 'r') as f:
    count = sum(len(find_this.findall(line)) for line in f)
print(count) # 7
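The question restricts the search to the last field of lines that start with "play", whereas the regex above scans whole lines. A variant that applies that restriction could look like the sketch below; `count_advances` is a hypothetical name, and the sample is an excerpt of the lines shown in the question.

```python
import re

# The three patterns from the question; none needs escaping.
PATTERN = re.compile(r"1-2|2-3|3-H")


def count_advances(lines):
    """Count pattern matches in the last comma-separated field of 'play' lines."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields[0] == "play":
            total += len(PATTERN.findall(fields[-1]))
    return total


sample = [
    "info,hometeam,ARI\n",
    "play,1,0,lemad001,22,CFBBX,HR/78/F\n",
    "play,1,1,polla001,02,CSX,S7/L+.1-2\n",
    "play,1,1,goldp001,32,SBFBBB,W.2-3;1-2\n",
    "play,1,1,lambj001,00,X,D9/F+.3-H;2-H;1-3\n",
    "play,1,1,avila001,31,BC*BBX,31/G.3-H;2-3\n",
    "play,4,1,avila001,02,SC1>X,S8/L.1-2\n",
]
print(count_advances(sample))  # 7
```

For a file, the list can be replaced by the open file object, since it also yields one line at a time.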

read line by `for` loop and rewrite, creates more rows by splitting them by comma

I am re-uploading this question after some editing.
This file consists of one column and 81,021 rows.
What I am trying to do is read each row and then rewrite the file.
After reading each row, I want to count the number of letters, special characters, and white spaces in each row.
First, here is my code to read, count the number of letters, and rewrite.
file = "C:/" # File I want to read
with open("C:/",'w',encoding='cp949',newline='') as testfile: # New file
    csv_writer=csv.writer(testfile)
    with open(file,'r') as fi:
        for each in fi:
            file=each
            linecount=count_letters(file)
            lst=[file]+[linecount]
            csv_writer.writerow(lst)
The problem here is that the number of rows increased from 81021 to 86000. Records that contain a comma were separated into multiple rows. Here is how I edited the code.
input_fileName = ""
output_fileName = ""
f = open(input_fileName, 'r')
out_list = []
buf = ''
flg = 0
for line in f:
    if line.count('"')%2 == 1:
        if flg == 0: flg = 1
        else: flg = 0
    if flg == 1: buf += line.strip(' \n')
    elif flg == 0 and len(buf) > 0:
        buf += line.strip(' \n')
        buf = buf.strip(' "')
        out_list.append([buf,len(buf)])
        buf = ''
    else:
        line = line.strip(' \n')
        out_list.append([line,len(line)])
f.close()
of = open(output_fileName, 'w')
for each in out_list:
    print(each[0]+','+str(each[1]), file=of)
of.close()
In this case, the number of rows does not change.
But now the output has more columns, and records are separated into multiple columns instead of rows.
How should I fix this problem?
I can't delete the commas in my file, since some rows legitimately contain them.
The cell that says Nationality caused an error. It contained both Korean and English, with a line break between the two words.
국적
Nationality
성별
합계
And now it turned into 4 rows and there are quotation marks.
"국적
Nationality"
성별
합계
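One approach worth trying is to let the csv module do both the reading and the writing: `csv.reader` treats a quoted cell that spans several lines as one record, and `csv.writer` quotes any cell containing a comma or a line break, so neither extra rows nor extra columns appear. The sketch below assumes a single-column input; `rewrite_with_counts` is a hypothetical name, and `len` stands in for the question's `count_letters`, which is not shown.

```python
import csv
import io


def rewrite_with_counts(src, dst, count_letters=len):
    """Copy single-column records from src to dst, appending a count column."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    rows = 0
    for row in reader:
        if not row:
            continue  # skip completely empty lines
        # Single-column file: re-join in case an unquoted comma split the cell.
        text = ",".join(row)
        writer.writerow([text, count_letters(text)])
        rows += 1
    return rows


# In-memory demonstration; with real files, open both with newline="".
source = io.StringIO('"국적\nNationality"\n성별\n"a, b"\n')
target = io.StringIO()
print(rewrite_with_counts(source, target))  # 3
```

Reading the output back with `csv.reader` gives three rows of two columns each, with the line break preserved inside the first cell.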

Python: Write encrypted data to file

I've made a chat app for school, and some people just write directly into the database. So my new project is to encrypt the resources, and I've written an encrypt function.
It works fine, but when I try to write the encrypted data to a file, I get this error message:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x94' in position 0:
character maps to <undefined>
How to fix that problem?
complete code:
def encrypts(data, step):
    newdata = ""
    i = 0
    while (len(data) > len(step)):
        step += step[i]
        i += 1
    if (len(data) < len(step)):
        step = step[:len(data)]
    for i in range(len(data)):
        a = ord(data[i])
        b = ord(step[i])
        newdata += chr(a+b)
    return newdata
file = open("C:/Users/David/Desktop/file.msg","wb")
file.write(encrypts("12345","code"))
Now, I finally solved my problem. The created ASCII Characters didn't exist. So I changed my functions:
def encrypts(data, step):
    newdata = ""
    i = 0
    while (len(data) > len(step)):
        step += step[i]
        i += 1
    if (len(data) < len(step)):
        step = step[:len(data)]
    for i in range(len(data)):
        a = ord(data[i])
        b = ord(step[i])
        newdata += chr(a+b-100) #The "-100" fixed the problem.
    return newdata

When opening a file for writing or saving, try adding the 'b' character to the open mode. So instead of :
open("encryptedFile.txt", 'w')
use
open("encryptedFile.txt", 'wb')
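This will open the file as binary, which is necessary when you modify the characters the way you do, because you are sometimes setting those characters to values outside of the ASCII range. Note that in Python 3 a file opened in binary mode accepts only bytes, so the string has to be encoded first (for example with .encode('utf-8')).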

Your problem is in the encoding of the file.
Try this:
import codecs
inputFile = codecs.open('input.txt', 'rb', 'cp1251')
outFile = codecs.open('output.txt', 'wb', 'cp1251')
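Another option, sketched here for Python 3, is to keep text mode but pass an explicit encoding that can represent every code point, such as UTF-8. The traceback shows the platform default cp1252 being used, and cp1252 has no mapping for '\x94'. The helper `shift_text` is a hypothetical compact stand-in for the question's `encrypts`, and the file location is a scratch path.

```python
import os
import tempfile
from itertools import cycle


def shift_text(data, key):
    """Add the code points of `key` (repeated as needed) to those of `data`."""
    return "".join(chr(ord(d) + ord(k)) for d, k in zip(data, cycle(key)))


encrypted = shift_text("12345", "code")  # first character is '\x94'

path = os.path.join(tempfile.mkdtemp(), "file.msg")  # hypothetical scratch location
# An explicit encoding avoids the platform default (cp1252 in the traceback).
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(encrypted)
with open(path, "r", encoding="utf-8", newline="") as f:
    restored = f.read()
print(restored == encrypted)
```

The same encoding has to be used when reading the file back, otherwise the shifted characters cannot be decrypted correctly.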
