I have a set of .csv files delimited by ;. There are certain junk values in the data that I need to replace with blanks. A sample problem row is:
103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006
Desired row after find and replace is:
103273;CAN D MAT;;;;03-Apr-2006
In the above example I'm replacing ;B.C.; with ;;
I cannot replace just B.C. as I need to match the entire cell value for this particular error case. The code that I am using is:
import os, fnmatch

def findReplace(directory, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)
            with open(filepath) as f:
                s = f.read()
            for find, replace in zip([';#DIV/0!;', ';B.C.;'], [';;', ';;']):
                s = s.replace(find, replace)
            with open(filepath, "w") as f:
                f.write(s)

findReplace(Path, "*.csv")
The output that I'm instead getting is:
103273;CAN D MAT;;B.C.;;03-Apr-2006
Can someone please help with this issue?
Thanks in advance!
The [find, replacement] pairs are not well-suited for your purpose.
Replacing ; + value + ; with ;; is really just a complicated way of saying that you want to remove columns with value.
So instead of using the [find, replacement] pairs, it is more natural and straightforward to split the line on ; into fields, replace the values that are considered junk with an empty string, and then join the fields again:
JUNK = frozenset(['#DIV/0!', 'B.C.'])

def clean(s):
    return ';'.join(map(lambda x: '' if x in JUNK else x, s.split(';')))
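For example, with the sample row from the question:

>>> clean('103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006')
'103273;CAN D MAT;;;;03-Apr-2006'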
You could use this function in your implementation (or copy it inline):
def findReplace(directory, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)
            cleaned_lines = []
            with open(filepath) as f:
                for line in f.read().splitlines():
                    cleaned_lines.append(clean(line))
            with open(filepath, "w") as f:
                f.write('\n'.join(cleaned_lines))
str.replace, once it has made one replacement, continues scanning from the character after the text it just replaced. So when two occurrences of ;B.C.; overlap (they share a ;), it will not replace both.
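You can see this with your sample row; only the first and third occurrences match during the scan, which is exactly the output you are getting:

>>> s = "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"
>>> s.replace(';B.C.;', ';;')
'103273;CAN D MAT;;B.C.;;03-Apr-2006'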
You can use the re module to replace B.C. only when it occurs between two ;, using lookahead and lookbehind assertions:
>>> import re
>>> s = "103273;CAN D MAT;B.C.;B.C.;B.C.;03-Apr-2006"
>>> re.sub(r'(?<=;)B[.]C[.](?=;)', "", s)
'103273;CAN D MAT;;;;03-Apr-2006'
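To cover both junk values without hand-writing a regex for each, one option (a sketch, using re.escape to escape the literal strings) is to build the pattern from a list:

>>> junk = ['#DIV/0!', 'B.C.']
>>> pattern = '(?<=;)(?:' + '|'.join(map(re.escape, junk)) + ')(?=;)'
>>> re.sub(pattern, '', '1;#DIV/0!;B.C.;2')
'1;;;2'

Because the lookarounds are zero-width, the shared ; between adjacent junk fields satisfies both matches.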
But in this case it may be better to split the line into fields on ;, replace the fields that match the strings you want to erase, and join the fields together again.
>>> fields = s.split(';')
>>> for i, f in enumerate(fields):
... if f in ('B.C.', '#DIV/0!'):
... fields[i] = ''
...
>>> ';'.join(fields)
'103273;CAN D MAT;;;;03-Apr-2006'
This has two main advantages: you don't have to write a fairly complex regular expression for each replaced string; and it will still work if one of the fields is at the beginning or end of the line.
For any CSV parsing more complicated than this (for example, if any fields can contain quoted ; characters, or if the file has a header that should be skipped), look into the csv module.
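For illustration, here is a minimal sketch with the csv module (assuming plain unquoted fields, and reusing the filepath variable from the loop in the question to rewrite each file in place):

import csv

JUNK = frozenset(['#DIV/0!', 'B.C.'])

with open(filepath, newline='') as f:
    rows = [['' if field in JUNK else field for field in row]
            for row in csv.reader(f, delimiter=';')]

with open(filepath, 'w', newline='') as f:
    csv.writer(f, delimiter=';').writerows(rows)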
Background
I have a two column CSV file like this:
Find     Replace
is       was
A        one
b        two
etc.
First column is text to find and second is text to replace.
I have second file with some text like this:
"This is A paragraph in a text file." (Please note the case sensitivity)
My requirement:
I want to use that CSV file to search and replace in the text file, with three conditions:
whole-word replacement,
case-sensitive replacement,
replacing all instances of each entry in the CSV.
Script tried:
import csv
import re

with open('CSV_file.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict = {(r'\b' + rows[0] + r'\b'): (r'\b' + rows[1] + r'\b') for rows in reader}  # <-- Requires Attention
    print(mydict)

with open('find.txt') as infile, open(r'resul_out.txt', 'w') as outfile:
    for line in infile:
        for src, target in mydict.items():
            line = re.sub(src, target, line)  # <-- Requires Attention
            # line = line.replace(src, target)
        outfile.write(line)
Description of script
I have loaded my csv into a python dictionary and use regex to find whole words.
Problems
I used r'\b' to create a word boundary for whole-word replacement, but the printed dictionary shows "\\b" instead of '\b'??
Using the replace function instead gives output like:
"Thwas was one paragraph in a text file."
Secondly, I don't know how to make the replacement case-sensitive in the regex pattern.
Does anyone know a better solution than this script, or a way to improve it?
Thanks for any help.
Here's a more verbose approach (more code) that is easier to read and does not rely on regular expressions. In fact, given the very simple nature of your CSV control file, I wouldn't normally bother using the csv module at all:
import csv

with open('temp.csv', newline='') as c:
    reader = csv.DictReader(c, delimiter=' ')
    D = {}
    for row in reader:
        D[row['Find']] = row['Replace']

with open('input.txt', newline='') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            tokens = line.split()
            for i, t in enumerate(tokens):
                if t in D:
                    tokens[i] = D[t]
            outfile.write(' '.join(tokens) + '\n')
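Splitting the sample line shows why this gives whole-word matching: each token is compared to the dictionary as a unit. (Note that punctuation stays attached, so a Find entry of "file" would not match the token "file.".)

>>> "This is A paragraph in a text file.".split()
['This', 'is', 'A', 'paragraph', 'in', 'a', 'text', 'file.']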
I'd just put pure strings into mydict so it looks like
{'is': 'was', 'A': 'one', ...}
and replace this line:
# line = re.sub(src, target, line) # old
line = re.sub(r'\b' + src + r'\b', target, line) # new
Note that you don't need \b in the replacement pattern. Regarding your other questions,
regular expressions are case-sensitive by default;
seeing "\\b" instead of '\b' is expected: r'\b' is the same two-character string as '\\b', which is exactly what the r'' prefix is for. You can omit the r and write '\\b', but that quickly gets ugly with more complex regexes.
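A quick interpreter check of both points:

>>> import re
>>> r'\b' == '\\b'  # the r'' prefix gives the same two-character string
True
>>> re.sub(r'\bis\b', 'was', 'This is it.')  # case-sensitive: 'This' is untouched
'This was it.'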
I need a program to find a string (S) in a file (P) and return the number of times it appears in the file. To do this I decided to create a function:
def file_reading(P, S):
    file1 = open(P, 'r')
    pattern = S
    match1 = "re.findall(pattern, P)"
        if match1 != None:
            print(pattern)
I know it doesn't look very good, but for some reason it's not outputting anything, let alone the right answer.
There are multiple problems with your code.
First of all, calling open() returns a file object. It does not read the contents of the file. For that you need to use read() or iterate through the file object.
Secondly, if your goal is to count the number of matches of a string, you don't need regular expressions. You can use the string method count(). Even so, it doesn't make sense to put the regular expression call in quotes.
match1 = "re.findall(pattern, file1.read())"
Assigns the string "re.findall(pattern, file1.read())" to the variable match1.
Here is a version that should work for you:
def file_reading(file_name, search_string):
    # this will put the contents of the file into a string
    file1 = open(file_name, 'r')
    file_contents = file1.read()
    file1.close()  # close the file
    # return the number of times the string was found
    return file_contents.count(search_string)
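Hypothetical usage (assuming a file named sample.txt exists in the working directory):

print(file_reading('sample.txt', 'the'))  # prints how many times 'the' occurs

Note that str.count counts substring occurrences, so searching for 'the' also counts the 'the' inside 'theme'.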
You can read line by line instead of reading the entire file, find the number of times the pattern appears in each line, and add it to the running total c:
def file_reading(file_name, pattern):
    c = 0
    with open(file_name, 'r') as f:
        for line in f:
            c += line.count(pattern)
    if c:
        print(c)
There are a few errors; let's go through them one by one:
Anything in quotes is a string. Putting "re.findall(pattern, file1.read())" in quotes just makes a string. If you actually want to call the re.findall function, no quotes are needed :)
You check whether match1 is None or not, which is really great, but then you should print (or return) those matches, not the initial pattern.
The if-statement should not be indented.
Also:
Always close a file once you have opened it! Since most people forget to do this, it is better to use the with open(filename, action) syntax.
So, taken together, it would look like this (I've changed some variable names for clarity):
import re

def file_reading(input_file, pattern):
    with open(input_file, 'r') as text_file:
        data = text_file.read()
    matches = re.findall(pattern, data)
    if matches:
        print(matches)  # prints a list of all strings found
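Since the question asked for the number of times the string appears, that count is just the length of the list re.findall returns:

>>> import re
>>> len(re.findall('aa', 'aa bb aa'))
2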
I have a large number of entries in a file. Let me call it file A.
File A:
('aaa.dat', 'aaa.dat', 'aaa.dat')
('aaa.dat', 'aaa.dat', 'bbb.dat')
('aaa.dat', 'aaa.dat', 'ccc.dat')
I want to use these entries, line by line, in a program that would iteratively pick an entry from file A and concatenate the files in this way:
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']  # entry number 3
with open('out.dat', 'w') as outfile:  # the name has to be aaa-aaa-ccc.dat
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read().strip())
All I need to do is to substitute the filenames iteratively and create an output in a "aaa-aaa-aaa.dat" format. I would appreciate any help-- feeling a bit lost!
Many thanks!!!
You can retrieve and modify the file names in the following way:
import re

pattern = re.compile(r'\W')

with open('fnames.txt', 'r') as infile:
    for line in infile:
        line = re.sub(pattern, ' ', line).split()
        # Old filenames - to concatenate contents
        content = [x + '.dat' for x in line[::2]]
        # New filename
        new_name = '-'.join(line[::2]) + '.dat'
        # Write the concatenated content to the new
        # file (first read the content all at once)
        with open(new_name, 'w') as outfile:
            for con in content:
                with open(con, 'r') as old:
                    new_content = old.read()
                outfile.write(new_content)
This program reads your input file, here named fnames.txt, with the exact structure from your post, line by line. For each line it splits the entries using a precompiled regex (precompiling the regex is suitable here and should make things faster). This assumes that your filenames contain only alphanumeric characters, since the regex substitutes all non-alphanumeric characters with a space.
It keeps the name parts ('aaa' and so on) and the dat extensions as a list of strings for each line, and forms the new name by joining every second entry (starting from index 0) and adding a .dat extension. It joins using a -, as in your post.
It then builds the list content of individual file names whose contents will be read, again by selecting every second entry from line.
Finally, it reads each of the files in content and writes them to the common file new_name. It reads each of them all at once, which may be a problem if these files are big, and in general there may be more efficient ways of doing all this. Also, if you are planning to do more things with the content from the old files before writing, consider moving the old-file-specific operations to a separate function for readability and easier debugging.
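If the data files are large, one alternative for the inner loop (a sketch using shutil.copyfileobj, which copies in chunks instead of reading a whole file into memory, and reuses the content and new_name variables from above) would be:

import shutil

with open(new_name, 'w') as outfile:
    for con in content:
        with open(con, 'r') as old:
            shutil.copyfileobj(old, outfile)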
Something like this:
with open(fname) as infile, open('out.dat', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if line:  # not empty
            filenames = eval(line)  # read tuple
            filenames = [f[:-4] for f in filenames]  # remove extension
            filename = '-'.join(filenames) + '.dat'  # make filename
            outfile.write(filename + '\n')  # write
If your problem is just calculating the new filenames, how about using os.path.splitext?
'-'.join([
    f[0] for f in [os.path.splitext(path) for path in filenames]
]) + '.dat'
This can probably be better understood if you see it like this:
import os

clean_fnames = []
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']

for fname in filenames:
    name, extension = os.path.splitext(fname)
    clean_fnames.append(name)

name_without_ext = '-'.join(clean_fnames)
name_with_ext = name_without_ext + '.dat'
print(name_with_ext)
HOWEVER: if your issue is that you cannot get the filenames into a list by reading the file line by line, keep in mind that when you read files you get text (strings), NOT Python structures. You need to rebuild a list from a text like "('aaa.dat', 'aaa.dat', 'aaa.dat')\n".
You could take a look at ast.literal_eval, or try to rebuild the list yourself. The code below outputs a lot of messages to show what's happening:
import pprint

collected_fnames = []
with open('./fileA.txt') as f:
    for line in f:
        print("Read this (literal) line: %s" % repr(line))
        line_without_whitespaces_on_the_sides = line.strip()
        if not line_without_whitespaces_on_the_sides:
            print("line is empty... skipping")
            continue
        else:
            line_without_parenthesis = (
                line_without_whitespaces_on_the_sides
                .lstrip('(')
                .rstrip(')')
            )
            print("Cleaned parenthesis: %s" % line_without_parenthesis)
            chunks = line_without_parenthesis.split(', ')
            print("Collected %s chunks in a %s: %s" % (len(chunks), type(chunks), chunks))
            chunks_without_quotations = [chunk.replace("'", "") for chunk in chunks]
            print("Now we don't have quotations: %s" % chunks_without_quotations)
            collected_fnames.append(chunks_without_quotations)

print("collected %s lines with filenames:\n%s" %
      (len(collected_fnames), pprint.pformat(collected_fnames)))
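For comparison, a much shorter sketch of the ast.literal_eval route (assuming the same fileA.txt layout; it safely evaluates the tuple literal on each non-empty line):

import ast

collected_fnames = []
with open('./fileA.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            collected_fnames.append(list(ast.literal_eval(line)))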
I have a folder filled with text files that all have names similar to these:
2014521RNC Reax to Obama on VA.txt
2014520W.H. Evades Questions On When Obama.txt
2012517Updated Research/ Obama Vets Roll Out.txt
So: digits, then letters and/or other characters. In each text file there are words. I'm trying to write a script that will take the first string of digits and add that to a csv in a column titled "date". Then it should take the letters and/or characters after the digits and put those in a column titled "title". And then it should take the text inside the file and add that to a column titled "content". I got kind of far, but not the whole cigar: when I run the script below, date = -1 and title = -1 for every file. What have I done wrong?
import csv
import os

f = open('RNC.csv', 'w')
names = ['date', 'title', 'content']
dw = csv.DictWriter(f, names)
dw.writerow({k: k for k in names})
for root, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if not filename.endswith('.txt'):
            continue
        title = filename.find(r'\D*')
        date = filename.find(r'^\d*')
        open_doc = open(root + '/' + filename, 'r')
        content = open_doc.read().rstrip()
        open_doc.close()
        dw.writerow({'date': date, 'title': title, 'content': content})
f.close()
The problem is that filename.find(s) returns the position of the substring s in filename. It returns -1 when the substring isn't found.
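You can see both effects in the interpreter: the regex strings are searched for as literal text and never found, while a plain substring is located by position:

>>> "2014521RNC Reax to Obama on VA.txt".find(r'^\d*')
-1
>>> "2014521RNC Reax to Obama on VA.txt".find('RNC')
7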
You can use a regex to perform the matching instead:
import re

for filename in filenames:
    m = re.match(r"\A(\d+)(.*)\.txt\Z", filename)
    if m:
        date = m.group(1)
        title = m.group(2)
        ...
You can't supply regular expressions as parameters to the str.find method; it interprets them as literal substrings to search for in the filename. Probably what you need to do is something like this (after adding import re to the top of your script):
match = re.search(r'^(\d+)', filename)
date = match.group(1) if match else 'None'
match = re.search(r'(\D+)', filename)
title = match.group(1) if match else 'None'
I have a file with a bunch of text that I want to tear through, match a bunch of things and then write these items to separate lines in a new file.
This is the basics of the code I have put together:
import re

f = open('this.txt', 'r')
g = open('that.txt', 'w')
text = f.read()
matches = re.findall('', text)  # do some re matching here
for i in matches:
    a = i[0] + '\n'
    g.write(a)
f.close()
g.close()
My issue is I want each matched item on a new line (hence the '\n') but I don't want a blank line at the end of the file.
I guess I need to not have the last item in the file being trailed by a new line character.
What is the Pythonic way of sorting this out? Also, is the way I have set this up in my code the best way of doing this, or the most Pythonic?
If you want to write out a sequence of lines with newlines between them, but no newline at the end, I'd use str.join. That is, replace your for loop with this:
output = "\n".join(i[0] for i in matches)
g.write(output)
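str.join puts the separator only between items, never after the last one, so no trailing newline is written:

>>> "\n".join(['first', 'second', 'third'])
'first\nsecond\nthird'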
In order to avoid having to close your files explicitly, especially if your code might be interrupted by exceptions, you can use the with statement to make things simpler. The following code replaces the entire code in your question:
with open('this.txt') as f, open('that.txt', 'w') as g:
    text = f.read()
    matches = re.findall('', text)  # do some re matching here
    g.write("\n".join(i[0] for i in matches))
or, since you don't need both files open at the same time:
with open('this.txt') as f:
    text = f.read()
matches = re.findall('', text)  # do some re matching here
with open('that.txt', 'w') as g:
    g.write("\n".join(i[0] for i in matches))