Replacing NULL bytes in a line with a '0' value - python

Columns with NULL bytes: how can I replace the NULL bytes with a '0' value after opening the CSV file?
My code is as follows, but it doesn't work:
try:
    # open source file
    with open(dataFile, 'r') as csvfile:
        sourceDF = csv.reader(csvfile)
        replaced = [sourceDF.replace(b'\0', b'0') for sourceDF in replaced]
        print(replaced)
        first_line = True
        selHeaders = []
        # read each row in source file
        for dataRow in sourceDF:
            # check if first line of file
            if first_line == True:
                first_line = False
                first_row = dataRow
                # check if first file in compile list
                if first_run == 0:
                    result.append(list())
Here is a hyperlink to the CSV files for your reference:
https://drive.google.com/drive/folders/1bPbE3hnO7ZAQEVTQ4prHUkEkqlDa7QTi?usp=share_link
Best regards
I tried the following code, but it doesn't work:
replaced = [sourceDF.replace(b'\0',b'0') for sourceDF in replaced]
print(replaced)

There are multiple problems in these two lines:
sourceDF = csv.reader(csvfile)
replaced = [sourceDF.replace(b'\0',b'0') for sourceDF in replaced]
The first line creates a CSV reader over the file; replaced has not been defined at this point.
The second line then tries to iterate over replaced, which does not exist yet, so it fails with a NameError. But even if it did work, the comprehension would immediately rebind sourceDF to the items it iterates over, making the CSV reader disappear.
I'm not sure whether replacing \0 with 0 is a good idea. In my 25 years of coding, I have never replaced \0 with a visible character; I have only replaced it with nothing, a space, or \n. But I can't really judge in your case, because I don't understand what kind of data you have and what the numbers mean.
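One way to get there, as a minimal sketch built on the question's own variables (dataFile is the path from the question): csv.reader accepts any iterable of strings, so a generator expression that strips the NULL bytes first works:

import csv

# Sketch: replace NULL bytes with '0' in each line before the csv module
# parses it, since csv cannot handle lines containing NULL bytes.
with open(dataFile, 'r') as csvfile:
    cleaned = (line.replace('\0', '0') for line in csvfile)
    for dataRow in csv.reader(cleaned):
        print(dataRow)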

Related

Concatenate lines with previous line based on number of letters in first column

I'm new to coding and trying to figure out how to fix a broken CSV file so that I can work with it properly.
The file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and when exporting the CSV the tooling does not add quotation marks to mark the note as a single string within the field.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
The script shows no output and just finishes silently. Any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub(r"\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and a ;, take off the newline and stuff the line into all_the_data. That way you don't have to play with looking back 'up' the file. Again, lots of ways to do this. If you would rather use the logic of 'starts with 3 letters and a ;' and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic: if the current line doesn't start correctly, take the previous line's newline out of all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the file (even the continuation lines happen to start with three letters), so a valid match is always found. The if condition is never true, and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(';'.join(join_words))
            join_words = [line]  # start a new record
        else:
            join_words.append(line)
    print(";".join(join_words))
I've tried not to use regex here, to keep it a little clearer if possible. But regex is the better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) in its 4th column. Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget the last line!
You could then use:
import csv

with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')  # the data is semicolon-separated
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time next() is applied to it, so a generator is appropriate.
But this is only a workaround; the correct fix would be for the previous step to produce a valid CSV file directly.
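For reference, a valid export would quote fields that contain newlines, which Python's csv module does automatically. A minimal sketch (the output filename is hypothetical; semicolon delimiter as in the question's data):

import csv

# Sketch: csv.writer quotes the note field because it contains a newline,
# so csv.reader can round-trip the row safely.
with open('fixed.csv', 'w', newline='') as f:
    w = csv.writer(f, delimiter=';')
    w.writerow(['tnn', '125', '3', 'I am writing a comment\nthat contains new lines', '2017-11-28'])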

Printing to a file via Python

Hopefully this is an easy fix. I'm trying to edit one field of a file we use for import; however, when I run the following code it leaves the file blank and 0 kB. Could anyone advise what I'm doing wrong?
import re  # import regex so we can use the commands

name = raw_input("Enter filename:")  # prompt for file name, press enter to just open test.nhi
if len(name) < 1: name = "test.nhi"
count = 0
fhand = open(name, 'w+')
for line in fhand:
    words = line.split(',')  # obtain individual words by using split
    words[34] = re.sub(r'\D', "", words[34])  # remove non-numeric chars from string using regex
    if len(words[34]) < 1: continue  # if the 34th field is blank go to the next line
    elif len(words[34]) == 2: "{0:0>3}".format([words[34]])  # add leading zeroes depending on the length of the field
    elif len(words[34]) == 3: "{0:0>2}".format([words[34]])
    elif len(words[34]) == 4: "{0:0>1}".format([words[34]])
    fhand.write(words)  # write the line
fhand.close()  # close the file after the loop ends
I have taken the text below in 'a.txt' as input and modified your code. Please check whether it works for you.
#Intial Content of a.txt
This,program,is,Java,program
This,program,is,12Python,programs
The modified code is as follows:
import re

# Reading from file and updating values
fhand = open('a.txt', 'r')
tmp_list = []
for line in fhand:
    # Split line using ','
    words = line.split(',')
    # Remove non-numeric chars from the 3rd string using regex
    words[3] = re.sub(r'\D', "", words[3])
    # Update the 3rd string
    # If the 3rd field is blank, go to the next line
    if len(words[3]) < 1:
        # Removed continue from here; we need to reconstruct the original line and write it to the file
        print "Field empty. Continue..."
    elif len(words[3]) >= 1 and len(words[3]) < 5:
        # format won't add leading zeros; zfill(5) adds the required number of leading zeros depending on the length of words[3]
        words[3] = words[3].zfill(5)
    # After updating the 3rd value in the words list, build a line out of it again
    tmp_str = ",".join(words)
    tmp_list.append(tmp_str)
fhand.close()

# Writing to the same file
whand = open("a.txt", 'w')
for val in tmp_list:
    whand.write(val)
whand.close()
File content after running the code:
This,program,is,,program
This,program,is,00012,programs
The file mode 'w+' truncates your file to 0 bytes, so you'll only be able to read lines that you've written.
Look at 'Confused by python file mode "w+"' for more information.
An idea would be to read the whole file first, close it, and re-open it for writing.
Not sure which OS you're on, but I think reading from and writing to the same file has undefined behaviour.
I guess internally the file object holds the position (try fhand.tell() to see where it is). You could probably adjust it back and forth as you go using fhand.seek(last_read_position), but really that's asking for trouble.
Also, I'm not sure how the script would ever end, as it would end up reading the stuff it had just written (a sort of infinite loop).
Best bet is to read the entire file first:
with open(name, 'r') as f:
    lines = f.read().splitlines()

with open(name, 'w') as f:
    for l in lines:
        # ....
        f.write(something)
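To make that concrete, here is a hedged sketch of the same read-then-write pattern applied to the question's task. It assumes the 35th field (index 34) should be zero-padded to five digits, borrowing zfill from the other answer:

import re

name = "test.nhi"  # default filename from the question

with open(name, 'r') as f:
    lines = f.read().splitlines()

with open(name, 'w') as f:
    for line in lines:
        words = line.split(',')
        words[34] = re.sub(r'\D', "", words[34])  # keep digits only
        if words[34]:                             # leave blank fields alone
            words[34] = words[34].zfill(5)        # assumed target width of 5
        f.write(','.join(words) + '\n')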
For 'Printing to a file via Python' you can use:
ofile = open("test.txt", "w")  # must be a writable mode, not "r"
print("Some text...", file=ofile)
ofile.close()

Adding each item in list to end of specific lines in FASTA file

I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down. Also, an important note: you don't close your file handles. This can result in errors, specifically when writing to a file; either way, it's bad practice. Code:
#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])  # tab as separator
        return annos

# replaces extra info on each header with the correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)  # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]  # keep only the first whitespace-separated word (the ">seq#" id)
                line_split.append(annos.pop(0))  # append the annotation of interest to the current id
                output.write(' '.join(line_split) + '\n')  # join and write to file with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
This is not perfect, but it cleans things up a bit. I might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
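If order can't be guaranteed, one option is to key the annotations by sequence ID instead. A hedged sketch, assuming (hypothetically) that column 0 of the annotation file holds the sequence ID (e.g. seq1) and column 5 holds the annotation, as in the code above:

# Sketch: build an id -> annotation map instead of relying on pop() order.
# ASSUMPTION: column 0 holds the sequence id, column 5 the annotation;
# adjust the indices to your actual file layout.
def get_anno_map(infile):
    anno_map = {}
    with open(infile) as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')
            anno_map[cols[0]] = cols[5]
    return anno_map

# Inside add_annos, look up by id instead of popping:
#   seq_id = line.split()[0].lstrip('>')
#   output.write('>' + seq_id + ' ' + anno_map.get(seq_id, '') + '\n')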
There is a great Python library for FASTA and other DNA file parsing: Biopython. It is very helpful in bioinformatics, and you can manipulate the data however you need.
Here is a simple example taken from the library's website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
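Applied to the question, here is a hedged sketch of the same library doing the header rewrite (untested against the real files; it assumes Biopython is installed and that the annotation list lines up with the records):

from Bio import SeqIO

# Sketch: replace each record's description with the next annotation
# and write a new FASTA file. mylist is the list from the question.
mylist = ['things1', '', 'things3']
records = []
for rec, anno in zip(SeqIO.parse("testseq.fasta", "fasta"), mylist):
    rec.description = (rec.id + ' ' + anno).strip()  # becomes the ">id anno" header
    records.append(rec)
SeqIO.write(records, "out.fasta", "fasta")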
***********EDIT*********
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time / a lot of memory.
#!/usr/bin/python
# Script takes an unedited FASTA file, removes seq length and
# other header info, and adds the annotation after the sequence name.
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)  # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]
add_annos(anno, seq, out)
get_annos(anno)

Extracting oddly arranged data from csv and converting to another csv file using python

I have an odd CSV file that has header values and their corresponding data arranged as below:
,,,Completed Milling Job,,,,,, # row 1
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
I need to extract the data from this structure and create another CSV file with the structure below:
Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row
If you notice, there is a new header column added called 'Status', but its value sits in the first row of the original CSV file. The rest of the column names in the output file are extracted from the original file.
Any thoughts will be greatly appreciated - thanks
Assuming the files are all exactly like that (at least in terms of capitalization), this should work, though I can only guarantee it on the exact data you have supplied:
#!/usr/bin/python
import glob
from sys import argv

g = open(argv[2], 'w')
g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
for fname in glob.glob(argv[1]):
    with open(fname) as f:
        status = f.readline().strip().strip(',')
        f.readline()  # extended report not needed
        f.readline()  # job spec numerical control not needed
        s = f.readline()
        job_no = s.split('Job Number,')[1].split(',')[0]
        op_id = s.split('Operator Id,')[1].strip().strip(',')
        s = f.readline()
        machine_name = s.split('Coder Machine Name,')[1].split(',')[0]
        start_t = s.split('Job Start time,')[1].strip().strip(',')
        s = f.readline()
        machine_type = s.split('Machine type,')[1].split(',')[0]
        end_t = s.split('Job end time,')[1].strip().strip(',')
        g.write(",".join([status, job_no, machine_name, machine_type, op_id, start_t, end_t]) + "\n")
g.close()
It takes a glob argument (like Job*.data) and an output filename and should construct what you need. Just save it as 'so.py' or something and run it as python so.py <data_files_wildcarded> output.csv
Here is a solution that should work on any CSV files that follow the same pattern as what you showed. That is a seriously nasty format.
I got interested in the problem and worked on it during my lunch break. Here's the code:
COMMA = ','
NEWLINE = '\n'

def _kvpairs_from_line(line):
    line = line.strip()
    values = [item.strip() for item in line.split(COMMA)]
    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values
            yield (values[i], values[i+1])
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines):
    """
    Given a series of lines, where each line is comma-separated values
    organized as key/value pairs like so:

        key_1,value_1,key_n+1,value_n+1,...
        key_2,value_2,key_n+2,value_n+2,...
        ...
        key_n,value_n,key_n+n,value_n+n,...

    yield up key/value pairs taken from the first column, then from the
    second column, and so on.
    """
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True

STATUS = "Status"
columns = [STATUS]
d = {}

with open("data.csv", "rt") as f:
    # get an iterator that lets us pull lines conveniently from the file
    itr = iter(f)
    # pull first line and collect status
    line = next(itr)
    lst = line.split(COMMA)
    d[STATUS] = lst[3]
    # pull next lines and make sure the file is what we expected
    line = next(itr)
    assert "Extended Report" in line
    line = next(itr)
    assert "Job Spec numerical control" in line
    # pull all remaining lines and save in a list
    lines = [line.strip() for line in itr]

for key, value in kvpairs_by_column_then_row(lines):
    columns.append(key)
    d[key] = value

with open("output.csv", "wt") as f:
    # write column headers line
    line = COMMA.join(columns)
    f.write(line + NEWLINE)
    # write data row
    line = COMMA.join(d[key] for key in columns)
    f.write(line + NEWLINE)
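Run against the sample file shown in the question, this should emit the same header row and data row as the desired output above: Status comes first, then the keys in column-then-row order (Job Number, Coder Machine Name, Machine type, Operator Id, Job Start time, Job end time).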

Remove special characters from csv file using python

There seems to be something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * etc.) with '_' and write to a new file. Here's what I have so far:
import csv

input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)

conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
    newtext += '_' if c in conversion else c
    writer.writerow(c)

input.close()
output.close()
All this succeeds in doing is writing everything to the output file as a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!
I might do something like
import csv

with open("special.csv", "rb") as infile, open("repaired.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    conversion = set('_"/.$')
    for row in reader:
        newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
        writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading the entire file in as one big string:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)
This doesn't seem to need to deal with CSVs in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
    lines = input.readlines()

conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
    temp = line[:]
    for c in conversion:
        temp = temp.replace(c, newtext)
    outputLines.append(temp)

with open('C:/Temp/Data_out1.csv', 'w') as output:
    for line in outputLines:
        output.write(line)  # lines from readlines() keep their '\n', so don't add another
In addition to the bug pointed out by @Nisan.H and the valid point made by @dckrooney that you may not need to treat the file as a CSV in this case (but see my comment below):
writer.writerow() should take a sequence of strings, each of which would be written out separated by commas (see here). In your case you are writing a single string.
The code sets up to read from 'C:/Temp/Data.csv' in two ways, through input and through lines, but it only ever reads from input (and therefore does not deal with the file as a CSV file anyway).
The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.
Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the CSV file. In that case, it would be necessary to process each field of the CSV file individually, then write each row out to the new CSV file.
Maybe try
s = open('myfile.cv', 'r').read()
chars = ('$', '%', '^', '*')  # etc
for c in chars:
    s = '_'.join(s.split(c))

out_file = open('myfile_new.cv', 'w')
out_file.write(s)
out_file.close()
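A slightly more direct variant of the same idea, using str.translate in Python 3 (same hypothetical filenames and character set as above; assumes the file fits in memory, like the version above):

# Sketch: map each unwanted character to '_' in one pass.
with open('myfile.cv', 'r') as f:
    s = f.read()
table = str.maketrans({c: '_' for c in '$%^*'})
with open('myfile_new.cv', 'w') as f:
    f.write(s.translate(table))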
