Remove special characters from csv file using python

There seems to be something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * etc.) with '_' and write to a new file. Here's what I have so far:
import csv
input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)
conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
    newtext += '_' if c in conversion else c
    writer.writerow(c)
input.close()
output.close()
All this succeeds in doing is to write everything to the output file as a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!

I might do something like
import csv
with open("special.csv", "rb") as infile, open("repaired.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    conversion = set('_"/.$')
    for row in reader:
        newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
        writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading in the entire text into one big line:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)

This doesn't seem to need to deal with CSVs in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
    lines = input.readlines()

conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
    temp = line[:]
    for c in conversion:
        temp = temp.replace(c, newtext)
    outputLines.append(temp)

with open('C:/Temp/Data_out1.csv', 'w') as output:
    for line in outputLines:
        output.write(line)  # lines from readlines() already end with '\n'

In addition to the bug pointed out by @Nisan.H and the valid point made by @dckrooney that you may not need to treat the file in a special way in this case just because it is a CSV file (but see my comment below):
writer.writerow() should take a sequence of strings, each of which would be written out separated by commas (see the csv module documentation). In your case you are writing a single string.
This code is setting up to read from 'C:/Temp/Data.csv' in two ways, through input and through lines, but it only actually reads from input (therefore the code does not deal with the file as a CSV file anyway).
The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.
Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the CSV file. In that case, it would be necessary to process each field of the CSV file individually, then write each row out to the new CSV file.
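As a small illustration of that last point (made-up data, not from the original answers): csv.reader strips the protective quotes while parsing, and csv.writer puts them back on any field that still contains a comma, so per-field replacement never breaks the quoting:
import csv, io

# Hypothetical one-row example: the quoted comma survives a per-field replace.
buf = io.StringIO()
writer = csv.writer(buf)
for row in csv.reader(['some,"keep, this comma",3.5$']):
    writer.writerow([field.replace('$', '_') for field in row])
print(buf.getvalue())  # some,"keep, this comma",3.5_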

Maybe try
s = open('myfile.cv','r').read()
chars = ('$','%','^','*') # etc
for c in chars:
    s = '_'.join( s.split(c) )
out_file = open('myfile_new.cv','w')
out_file.write(s)
out_file.close()


left padding with python

I have the following data, dn and link combinations, with 100000 entries:
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:32546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
I am trying to write a program in Python 2.7 to add left-padding zeros if the value of link is less than 10 digits.
E.g.:
545214569 --> 0545214569
32546897 --> 0032546897
Can you please guide me on what I am doing wrong with the following program:
with open("test.txt", "r") as f:
    line=f.readline()
    line1=f.readline()
    wordcheck = "link"
    wordcheck1= "dn"
    for wordcheck1 in line1:
        with open("pad-link.txt", "a") as ff:
            for wordcheck in line:
                with open("pad-link.txt", "a") as ff:
                    key, val = line.strip().split(":")
                    val1 = val.strip().rjust(10,'0')
                    line = line.replace(val,val1)
                    print (line)
                    print (line1)
                    ff.write(line1 + "\n")
                    ff.write('%s:%s \n' % (key, val1))
The usual Pythonic way to pad values is with string formatting and the Format Specification Mini-Language:
link = 545214569
print('{:0>10}'.format(link))
Your for wordcheck1 in line1: and for wordcheck in line: loops aren't doing what you think. They iterate one character at a time over the lines and assign that character to the loop variable.
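For example, a small demonstration of what such a loop actually yields:
line1 = "dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com"
for wordcheck1 in line1:
    print(wordcheck1)  # prints 'd', then 'n', then ':', ... one character per iteration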
If you only want to change the input file to have leading zeroes, this can be simplified as:
import re

# Read the whole file into memory
with open('input.txt') as f:
    data = f.read()

# Replace all instances of "link:<digits>", passing the digits to a function that
# formats the replacement as a width-10 field, right-justified with zeros as padding.
data = re.sub(r'link:(\d+)', lambda m: 'link:{:0>10}'.format(m.group(1)), data)

with open('output.txt','w') as f:
    f.write(data)
output.txt:
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:0545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:0032546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
I don't know why you have to open the file so many times. Anyway, open it once, then for each line split on ':'; the last element in the list is the number. You know what length the digits should consistently be (here, 10), so use zfill to pad the zeros, then put the line back together with join:
with open('test.txt') as f:
    for line in f.readlines():
        words = line.rstrip('\n').split(':')
        if words[0] == 'link':  # only pad the link lines
            words[-1] = words[-1].zfill(10)  # zfill pads to a total width, here 10
        newline = ':'.join(words)
        # write this line to file

How do I write a custom CSV Reader in python without using csv import?

I'm trying to solve a problem from the pyschools website that asks to write a script that reads a CSV file with commas "," as a delimiter and returns a list of records. When running my script on their website it returns as incorrect, using a test case of:
csvReader('books.csv')[0] thus returning:
['"Pete,Zelle","Intro to HTML, CSS",2011']
when the expected result is:
['Pete,Zelle', 'Intro to HTML, CSS', '2011']
I've noticed that the problem has to do with the quotations " & ' but still haven't come up with the right answer; using replace('"','') on the line variable to remove the double quotes does not fix it, as it returns:
['Pete,Zelle,Intro to HTML, CSS,2011']
where it removes the last quotation mark from some of the words, e.g. Zelle, instead of Zelle',.
Below I'll provide a link to the exercise, the problem, and my current script. Any explanation or help is greatly appreciated.
link:
http://www.pyschools.com/quiz/view_question/s13-q8
problem:
Write a function to read a CSV file with ',' as delimiter and returns a list of records.
The function must be able to ignore ',' that are within a pair of double quotes '"'.
script:
def csvReader(filename):
    records = []
    for line in open(filename):
        line = line.rstrip() # strip '\n'
        if line=='","':
            continue # ignore empty line
        records.append([line.replace('"','')])
    return records
It would help to see the CSV file you are trying to read. It sounds as though you need to separate the fields whilst ignoring any delimiters that fall between quotation marks.
In this case I would recommend the CSV library and setting the quotation character.
import csv
record = '"Pete,Zelle","Intro to HTML, CSS",2011'
newStr = [ '"{}"'.format(x) for x in list(csv.reader([record], delimiter=',', quotechar='"'))[0] ]
print(newStr)
Will return ['"Pete,Zelle"', '"Intro to HTML, CSS"', '"2011"']
In your function you could incorporate this as below
import csv

def csvReader(filename):
    records = []
    for line in open(filename):
        line = line.rstrip() # strip '\n'
        if line=='","':
            continue # ignore empty line
        newLine = [ '"{}"'.format(x) for x in list(csv.reader([line], delimiter=',', quotechar='"'))[0] ]
        records.append(newLine)
    return records
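Note that the expected result shown in the question has no surrounding double quotes, so presumably the '"{}"'.format(x) wrapping can be dropped and the fields used exactly as csv.reader returns them:
# Presumably matches the expected ['Pete,Zelle', 'Intro to HTML, CSS', '2011']
newLine = list(csv.reader([line], delimiter=',', quotechar='"'))[0]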
Batteries are included, as usual, with python. Here's using the standard lib csv module:
import csv
with open(path, "r") as f:
    csv_reader = csv.reader(f, delimiter=",")
    for row_number, row in enumerate(csv_reader):
        print(f"{row_number} => {row}")
If the stdlib isn't available for some strange reason, you'll need to tokenize each line into 'delimiters', 'separators', and 'cell values'. Again, this would be trivial with the stdlib (import re). Let's pretend you have no batteries at all, just plain python.
You'll need to realize that how you treat each character of each line depends on a "context", and that that context is built up by all previous characters. Using a stack is advised here: you push and pop states (aka contexts) from the stack depending on what the current context is (the top of your stack) and the current character you're handling. Now, given a context, you may process each character differently depending on that context:
class State:
    IN_NON_DELIMITED_CELL = 1
    IN_DELIMITED_CELL = 2

def get_cell_values(line, quotechar='"', separator=','):
    stack = []
    stack.append(State.IN_NON_DELIMITED_CELL)
    cell_values = [""]
    for character in line:
        current_state = stack[-1]
        if current_state == State.IN_NON_DELIMITED_CELL:
            if character == quotechar:
                stack.append(State.IN_DELIMITED_CELL)
            elif character == separator:
                cell_values.append("")
            else:
                cell_values[-1] += character
        elif current_state == State.IN_DELIMITED_CELL:
            if character == quotechar:
                stack.pop()
            else:
                cell_values[-1] += character
    return cell_values

with open(path, "r") as f:
    for line in f:
        # strip the trailing newline so it doesn't end up in the last cell
        cell_values = get_cell_values(line.rstrip('\n'), quotechar='"', separator=',')
        print(cell_values)
This is a good starting point:
print(get_cell_values('"this","is",an,example,of,"doing things, the hard way?"'))
# prints:
['this', 'is', 'an', 'example', 'of', 'doing things, the hard way?']
For taking this (MUCH) further, look into these topics: tokenizing strings, LL+LR parsers, recursive descent, shift-reduce parsers.

identify csv in python

I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?
As an example, the data file looks something like this:
Session:1
Data collection date: 09-09-2016
Related questions:
Question 1: parta, partb, partc,
Question 2: parta, partb, partc
"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"
I need to extract the csv part.
Caveats,
the text in the beginning is NOT limited to 4 - 5 lines.
the additional text is NOT just in the beginning of the file
I saw this post that suggests using re.split and/or csv.Sniffer; however, my attempt was not fruitful.
with open("untitled.csv") as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
print(dialect.__dict__)
csvstarts = False
csvdump = []
for ln in csvfile.readlines():
toks = re.split(r'[,]', ln)
print(toks)
if toks[0] == '"field1"' and not csvstarts: # identify by the header line
csvstarts = True
continue
if csvstarts:
if toks[0] == '"field1"': # identify the start of subsequent csv data
csvstarts = False
continue
csvdump.append(ln) # record the current line
print(csvdump)
For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.
Is there anything better I can do?
How about this:
import re
my_pattern = re.compile(r'("\w+",)+')

with open('<your_file>', 'rb') as fi:
    for f in fi:
        result = my_pattern.match(f)
        if result:
            print f
Assuming the csv data can be differentiated from the rest by having no special characters in them (we only accept each element to have letters or numbers surrounded by double quotes and a comma to separate from the next element)
If your csv lines and only those lines start with \", then you can do this:
import csv
data = list(csv.reader(open("test.csv", 'rb'), quotechar='¬'))
# for quotechar - use something that won't turn up in data

def importCSV(data):
    # outputs list of lists with required data
    # works on the assumption that all required data starts with \"
    # and that no other text starts with \"
    out = []
    for line in data:
        if (line != []) and (line[0][0] == "\""):
            line = [el.replace("\"", "") for el in line]
            out.append(line)
    return out

useful = importCSV(data)
Can you not read each line and do a regex to see whether or not to pull the data?
Maybe something like:
^(["][\w]+["][,])+["][\w]+["]$
My regex is not the best and there may likely be a better way, but that seemed to work for me.
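A minimal sketch of how that per-line check might be applied, assuming a hypothetical dump.txt and the quoted-field format shown in the question:
import re

# Matches lines made up only of quoted, comma-separated word fields,
# e.g. "data11","data12","data13","data14"
csv_line = re.compile(r'^("\w+",)+"\w+"$')

with open('dump.txt') as f:  # hypothetical file name
    for line in f:
        if csv_line.match(line.rstrip('\n')):
            print(line, end='')  # keep only the CSV-looking lines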

Reading and writing to files in python, comparing characters in different files to each other

I want to read 2 files in python, and based off those 2 create another file. The first file contains regular English (e.g. "hello") and the second file contains "cipher text" (two 5-letter random strings, e.g. "aiwld" and "pqmcx"). I want to match up the letter 'h' with the first letter in the cipher text and store it in the third file (the one that we created).
def cipher():
    file = english.txt
    file2 = secret.txt
    file3 = cipher.txt
    outputFile = open(file, 'r')
    outputFile = open(file2, 'r')
So I have opened, for reading, file and file2, and I want to match the first letter in english.txt with the first letter in secret.txt and then write that letter to the cipher.txt file. I am completely lost on where to start and any help would be great.
Do I need to open both files, read from both, somehow compare and then write to the file?
I guess I am really unsure on how to compare individual letters in each file with other individual letters in a different file.
I think I would want something like set english.txt[0] == secret.txt[0] but I am not really sure.
The key thing you're looking at here is how to iterate over a file character by character (rather than the line by line you get more simply).
The simplest solution to this is to read the two files entirely into memory and iterate over them together. This can be done with the file.read() call and the zip() built-in. This suffers because large files would cause us to run out of memory.
Writing out the result is just a normal file.write() call.
For example:
with open('plaintext.text') as ptf:
    plaintext = ptf.read()
with open('key.txt') as keyf:
    key = keyf.read()

with open('output.txt', 'w') as f:  # open for writing
    for plaintext_char, key_char in zip(plaintext, key):
        # Do something to combine the characters
        f.write(new_char)
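The "combine the characters" step is deliberately left open above. As a purely illustrative assumption (the question doesn't specify the cipher), a Vigenère-style letter shift could look like this:
def combine(p, k):
    # Illustrative assumption only: shift the plaintext letter forward
    # by the key letter's alphabet position, leaving other characters alone.
    if p.isalpha() and k.isalpha():
        base = ord('a') if p.islower() else ord('A')
        shift = ord(k.lower()) - ord('a')
        return chr((ord(p) - base + shift) % 26 + base)
    return p

new_char = combine('h', 'p')  # 'h' shifted by 'p' (15) -> 'w'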
So this might be overly complicated but
def cipher(file1 = 'english.txt',
           file2 = 'secret.txt',
           file3 = 'cipher.txt'):
    fh1 = open(file1, 'r') # open the files
    fh2 = open(file2, 'r')
    fh3 = open(file3, 'w+') # write this file if it doesn't exist
    ls1 = list() # initiate lists
    ls2 = list()
    for line in fh1: # add the characters to the list
        for char in line:
            ls1.append(char)
    for line in fh2:
        for char in line:
            ls2.append(char)
    if ' ' in ls1: # remove blank spaces
        ls1.remove(' ')
    if ' ' in ls2:
        ls2.remove(' ')
    print(ls1, ls2)
    for i in range(len(ls1)): # traverse through the lists and write things! :)
        fh3.write(ls1[i] + ' ' + ls2[i] + '\n')

Python: How do I delete periods occurring alone in a CSV file?

I have a bunch of CSV files. In some of them, missing data are represented by empty cells, but in others there is a period. I want to loop over all my files, open them, delete any periods that occur alone, and then save and close the file.
I've read a bunch of other questions about doing whole-word-only searches using re.sub(). That is what I want to do (delete . when it occurs alone but not the . in 3.5), but I can't get the syntax right for a whole-word-only search where the whole word is a special character ('.'). Also, I'm worried those answers might be a little different in the case where a whole word can be distinguished by tabs and newlines too. That is, does \b work in my CSV file case?
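As a quick illustration (not from the original post): \b needs a word character on one side, so it never anchors a dot that sits between commas, while \B does:
import re

print(re.sub(r'\b\.\b', '', ',.,'))  # ',.,' -- no word chars around the dot, \b never matches
print(re.sub(r'\B\.\B', '', ',.,'))  # ',,'  -- \B matches between two non-word characters
print(re.sub(r'\B\.\B', '', '3.5'))  # '3.5' -- digits are word chars, so the dot is untouched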
UPDATE: Here is a function I wound up writing after seeing the help below. Maybe it will be useful to someone else.
import csv, re
def clean(infile, outfile, chars):
    '''
    Open a file, remove all specified special characters used to represent missing data, and save.\n\n
    infile:\tAn input file path\n
    outfile:\tAn output file path\n
    chars:\tA list of strings representing missing values to get rid of
    '''
    in_temp = open(infile)
    out_temp = open(outfile, 'wb')
    csvin = csv.reader(in_temp)
    csvout = csv.writer(out_temp)
    for row in csvin:
        row = re.split('\t', row[0])
        for colno, col in enumerate(row):
            for char in chars:
                if col.strip() == char:
                    row[colno] = ''
        csvout.writerow(row)
    in_temp.close()
    out_temp.close()
Something like this should do the trick... This data wouldn't happen to be coming out of SAS, would it? IIRC, SAS quite often used '.' as missing for numeric values.
import csv
with open('input.csv') as fin, open('output.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    for row in csvin:
        for colno, col in enumerate(row):
            if col.strip() == '.':
                row[colno] = ''
        csvout.writerow(row)
Why not just use the csv module?
#!/usr/bin/env python
import csv

with open(somefile) as infile:
    r = csv.reader(infile)
    rows = []
    for row in r:
        rows.append(['' if f == "." else f for f in row])

with open(newfile, 'w') as outfile:
    w = csv.writer(outfile)
    w.writerows(rows)
The safest way would be to use the CSV module to process the file, then identify any fields that only contain ., delete those and write the new CSV file back to disk.
A brittle workaround would be to search and replace a dot that is not surrounded by alphanumerics: \B\.\B is the regex that would find those dots. But that might also find other dots like the middle dot in "...".
So, to find a dot that is surrounded by commas, you could search for (?<=,)\.(?=,).
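For instance, a small comparison of the two patterns on a made-up line:
import re

line = 'a,.,b,3.5,c...d'
print(re.sub(r'\B\.\B', '', line))         # a,,b,3.5,c..d  -- also eats the middle dot of 'c...d'
print(re.sub(r'(?<=,)\.(?=,)', '', line))  # a,,b,3.5,c...d -- removes only the comma-delimited dot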
