Python: replacing data in a CSV file

Hello, I am attempting to adjust a CSV file using Python, but my output is a little off and I can't figure out why.
in_file = open(out, "rb")
fout = "DomainWatchlist.csv"
fin_out_file = open(fout, "wb")
csv_writer2 = csv.writer(fin_out_file, quoting=csv.QUOTE_MINIMAL)
for item in in_file:
    if "[.]" in item:
        csv_writer2.writerow([item.replace("[.]", ".")])
    elif "[dot]" in item:
        csv_writer2.writerow([item.replace("[dot]", ".")])
    else:
        csv_writer2.writerow([item])
in_file.close
fin_out_file.close
The input file contains data that looks like this:
bluecreatureoftheseas.com
12rafvwe[dot]co[dot]cc
12rafvwe[dot]co[dot]cc
404page[dot]co[dot]cc
abalamahala[dot]co[dot]cc
abtarataha[dot]co[dot]cc
adoraath[dot]cz[dot]cc
adoranaya[dot]cz[dot]cc
afnffnjq[dot]co[dot]cc
aftermorningstar[dot]co[dot]cc
I am attempting to fix this data but it comes out looking like this:
"12rafvwe.co.cc
"
"12rafvwe.co.cc
"
"404page.co.cc
"
"abalamahala.co.cc
"
"abtarataha.co.cc
"
"adoraath.cz.cc
"
"adoranaya.cz.cc
"
"afnffnjq.co.cc
"
"aftermorningstar.co.cc
"
"aftrafsudalitf.co.cc
"
"agamafym.cz.cc
"
"agamakus.vv.cc
Why does this create the extra quotes and then add a carriage return?

The reason you're getting a newline is that for item in in_file: iterates over each line in in_file, without stripping the newline. You don't strip the newline anywhere. So it's still there in the single string in the list you pass to writerow.
The reason you're getting quotes is that in CSV, strings with special characters—like newlines—have to be either escaped or quoted. There are different "dialect options" you can set to control that, but by default, it tries to use quoting instead of escaping.
So, the solution is something like this:
for item in in_file:
    item = item.rstrip()
    # rest of your code
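To see the quoting rule in isolation, here is a minimal sketch. It assumes Python 3 and writes to an in-memory buffer instead of the real file; the helper name csv_line is made up for illustration. It writes the same domain once with its trailing newline and once stripped:

```python
import csv
import io

def csv_line(field):
    """Return the exact text csv.writer emits for a one-column row."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow([field])
    return buf.getvalue()

raw = csv_line("12rafvwe.co.cc\n")             # field still carries its newline
clean = csv_line("12rafvwe.co.cc\n".rstrip())  # newline stripped first

print(repr(raw))    # field is wrapped in quotes because it contains a newline
print(repr(clean))  # plain field followed by the writer's own line terminator
```

Only the unstripped field picks up the surrounding quotes, which matches the output in the question.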
There are some other problems with your code, as well as some ways you're making things more complicated than they need to be.
First, in_file.close does not close the file. You're not calling the function, just referring to it as a function object. You need parentheses to call a function in Python.
But an even simpler way to handle closing files is to use a with statement.
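The difference between referring to close and calling it can be checked with a short sketch. It assumes Python 3 and uses a throwaway temporary file, so the path is not part of the original code:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # throwaway file for the demo

f = open(path, "w")
f.close                      # no parentheses: only looks up the method, file stays open
still_open = not f.closed
f.close()                    # the actual call closes the file

with open(path, "w") as g:   # the with statement closes g when the block ends
    g.write("example\n")
closed_by_with = g.closed

print(still_open, closed_by_with)
```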
You only have a single column, so there is no need to use the csv module at all. Just fin_out_file.write would work just fine.
You also probably don't want to use binary mode here. If you have a good reason for doing so, that's fine, but if you don't know why you're using it, don't use it.
You don't need to check whether a substring exists before replace-ing it. If you call 'abc'.replace('n', 'N'), it will just harmlessly return 'abc'. All you're doing is writing twice as much code, and making Python search each string twice in a row.
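As a small sketch of that point, the two replace calls can simply be chained. The function name clean_domain and the sample "example[.]org" are made up for illustration; the other samples come from the question's input:

```python
def clean_domain(text):
    """Undo both obfuscation styles; replace() is a no-op when the marker is absent."""
    return text.replace("[.]", ".").replace("[dot]", ".")

samples = ["bluecreatureoftheseas.com", "12rafvwe[dot]co[dot]cc", "example[.]org"]
cleaned = [clean_domain(s) for s in samples]
print(cleaned)
```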
Putting this all together, here's the whole thing in three lines:
with open(out) as in_file, open(fout, 'w') as out_file:
    for line in in_file:
        out_file.write(line.replace("[.]", ".").replace("[dot]", "."))

A bit off-topic, but Perl was built for this:
$ perl -i -ple 's/\[dot\]/./g' filename
will do the job for the [dot] markers, including saving the result back to the original filename.

Related

Write a single poly-linear string to multiple lines in .txt

I have encountered a strange problem which I am struggling to resolve. When I run re.findall() over a .txt file and then try to print and write the results, all of the results I would expect appear, but they do so in different formats.
The code (modified from a similar thread I found earlier):
import re
with open ('test.txt') as text:
    text = text.read()
    match = re.findall(r'[\w\.-]+#[\w\.-]+', text)
    for i in match:
        with open ('list.txt', 'a') as dest:
            i = str(i)
            print(i)
            dest.write(i)
The interpreter then produces the result:
a#a
b#b
c#c
which is exactly what I would expect it to do, given the contents of test.txt.
However, list.txt reads:
(generic existing text goes here)
a#ab#bc#c
while I want it to (and believe it should) read
(generic existing text goes here)
a#a
b#b
c#c
I've tried using file.writelines() in place of file.write(), but this was not helpful. What differences between print() and file.write() are causing this discrepancy, and how would one go about avoiding it?
N.B. I am 99% sure that line 8 i = str(i) serves no purpose, but I've left it in because it's what I've been doing. Not really sure why...
I'll start with your last comment. What str(i) does is it converts i to its string representation (which is defined in i's class's __str__ method). If you call str(4) you get '4', for example. This is unnecessary in this case because re.findall returns a list of strings as per the documentation.
As for your actual issue: you're missing the newlines. I would also prefer to open the file fewer times than you are.
Perhaps try:
import re
with open ('test.txt') as text:
    text = text.read()
match = re.findall(r'[\w\.-]+#[\w\.-]+', text)
with open('list.txt', 'a') as dest:
    for i in match:
        print(i)
        dest.write(i + '\n')
(You can also remove the print(i) line if you don't want to see the output in the console every time a write is done.)
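The difference between the two calls can be seen side by side in a minimal sketch. It assumes Python 3 and uses in-memory buffers in place of list.txt:

```python
import io

matches = ["a#a", "b#b", "c#c"]

written = io.StringIO()
for m in matches:
    written.write(m)          # write() emits exactly the string it is given

printed = io.StringIO()
for m in matches:
    print(m, file=printed)    # print() appends end="\n" by default

print(repr(written.getvalue()))
print(repr(printed.getvalue()))
```

So print() was adding the newlines all along, and write() needs them supplied explicitly.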

How can I successfully capture all possible cases to create a python list from a text file

This public gist creates a simple scenario where you can turn a text file into a python list line by line.
with open('test.txt', 'r') as listFile:
    lines = listFile.read().split("\n")
out = []
for item in lines:
    if '"' in item:
        out.append('("""' + item + '"""),')
    else:
        out.append('("' + item + '"),')
with open('out.py', 'a') as outFile:
    outFile.write("out = [\n")
    for item in out:
        outFile.write("\t" + item + "\n")
    outFile.write("]")
In text.txt the sixth and seventh lines
'"""'
""
are the ones that produce invalid output. Perhaps you can think of some other examples that would fail to work.
EDIT:
Valid output would look something like this:
out = [
    "line1",
    "line2",
    """ line 3 has """ and "" and " in it """, # but it is a valid string
    "last line",
]
The ( and ) characters were an oversight on my part; they are not needed or wanted.
EDIT: Oh god I'm getting overwhelmed. I'm going to take 5 minutes and post the question again in a better form.
Using a newline character besides \n would also cause the program to fail. On Windows it's common to use \r\n, and classic Mac OS used \r.
@abarnert's comment shows a better way to read lines.
A text file is already an iterable of lines.
As with any other iterable, you can convert it to a list by just passing it to the list constructor:
with open('text.txt') as f:
    lines = list(f)
Or, if you don't want the newlines on the end of each line:
with open('text.txt') as f:
    lines = [line.rstrip('\n') for line in f]
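If you want to handle classic Mac and Windows line endings as well as Unix, open the file in universal-newlines mode (the 'rU' flag is for Python 2; Python 3 text mode translates line endings by default, and the flag was removed in 3.11):
with open('text.txt', 'rU') as f:
    lines = [line.rstrip('\n') for line in f]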
… or use the Python 3-style io classes (but note that this will give you unicode strings, not byte strings, which will repr with u prefixes—they're still valid Python literals that way, but they won't look as pretty):
import io
with io.open('text.txt') as f:
    lines = [line.rstrip('\n') for line in f]
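For readers on Python 3, here is a minimal sketch of the same idea, assuming the default text mode (newline=None), which already translates all three line-ending styles. The temporary file and its contents are made up for the demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mixed.txt")
with open(path, "wb") as f:
    f.write(b"unix\nwindows\r\nclassic mac\rlast")  # three different line endings

with open(path) as f:   # Python 3 text mode, newline=None by default
    lines = [line.rstrip("\n") for line in f]

print(lines)
```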
Now, it's hard to tell from code that doesn't work and no explanation of what's wrong with it, but it looks like you're trying to figure out how to write that list out as a Python-source-format list display, wrapping it in brackets, adding quotes, escaping any internal quotes, etc. But there's a much easier way to do that too:
with open('out.py', 'a') as f:
    f.write(repr(lines))
If you're trying to pretty-print it, there's a pprint module in the stdlib for exactly that purpose, and various bigger/better alternatives on PyPI. Here's an example of the output of pprint.pprint(lines, width=60) with (what I think is) the same input you used for your desired output:
['line1',
 'line2',
 ' line 3 has """ and "" and " in it ',
 'last line']
Not exactly the same as your desired output—but, unlike your output, it's a valid Python list display that evaluates to the original input, and it looks pretty readable to me.
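The claim that this output evaluates back to the original input can be checked with a short sketch. It assumes Python 3; the sample list is a guess at the question's input, including the two lines reported as troublesome:

```python
import ast
import pprint

lines = ['line1', 'line2', ' line 3 has """ and "" and " in it ',
         "'\"\"\"'", '""', 'last line']

source = repr(lines)                       # one-line list display
pretty = pprint.pformat(lines, width=60)   # same list, wrapped for readability

# Parse both forms back without executing arbitrary code.
from_source = ast.literal_eval(source)
from_pretty = ast.literal_eval(pretty)
print(from_source == lines, from_pretty == lines)
```

ast.literal_eval is used here only to verify the round trip; writing "out = " followed by either string to out.py would produce an importable module.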

Don't write final new line character to a file

I have looked around StackOverflow and couldn't find an answer to my specific question so forgive me if I have missed something.
import re
target = open('output.txt', 'w')
for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        match_text = match.group()
        target.write(match_text + '\n')
    else:
        continue
target.close()
The file I am parsing is huge so need to process it line by line.
This (of course) leaves an additional newline at the end of the file.
How should I best change this code so that on the final iteration of the 'if match' loop it doesn't put the extra newline character at the end of the file. Should it look through the file again at the end and remove the last line (seems a bit inefficient though)?
The existing StackOverflow questions I have found cover removing all new lines from a file.
If there is a more pythonic / efficient way to write this code I would welcome suggestions for my own learning also.
Thanks for the help!
Another thing you can do is truncate the file. .tell() gives us the current byte offset in the file. We then subtract one and truncate there to remove the trailing newline.
with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell()-1)
On Linux and macOS, the -1 is correct, but on Windows it needs to be -2. A more Pythonic way of determining which is to check os.linesep.
import os
remove_chars = len(os.linesep)
with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell() - remove_chars)
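A variation that sidesteps the platform difference, offered only as a sketch: assuming Python 3, opening the file in binary mode means '\n' is always written as a single byte, so subtracting one is correct everywhere. The temporary path is made up for the demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "a.txt")

with open(path, "wb") as f:          # binary mode: no newline translation
    f.write(b"abc\n")
    f.write(b"def\n")
    f.truncate(f.tell() - 1)         # drop the final newline byte

with open(path, "rb") as f:
    content = f.read()
print(content)
```

The trade-off is that every string has to be encoded to bytes before it is written.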
kindall's answer is also valid, with the exception that you said it's a large file. This method will let you handle a terabyte-sized file with a gigabyte of RAM.
Write the newline of each line at the beginning of the next line. To avoid writing a newline at the beginning of the first line, use a variable that is initialized to an empty string and then set to a newline in the loop.
import re
with open('input.txt') as source, open('output.txt', 'w') as target:
    newline = ''
    for line in source:
        match = re.search(r'Stuff', line)
        if match:
            target.write(newline + match.group())
            newline = '\n'
I also restructured your code a bit (the else: continue is not needed, because what else is the loop going to do?) and changed it to use the with statement so the files are automatically closed.
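The same separator-first idea can be wrapped in a function and tried on in-memory data. This is a sketch assuming Python 3; the name write_matches and the sample text are made up for illustration:

```python
import io
import re

def write_matches(source, target, pattern=r'Stuff'):
    """Write each match on its own line, with no newline after the last one."""
    prog = re.compile(pattern)
    separator = ''                      # nothing before the first match
    for line in source:                 # one line at a time, so memory use stays flat
        match = prog.search(line)
        if match:
            target.write(separator + match.group())
            separator = '\n'            # every later match is preceded by a newline

source = io.StringIO("Stuff one\nnothing here\nmore Stuff\n")
target = io.StringIO()
write_matches(source, target)
print(repr(target.getvalue()))
```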
The shortest path from what you have to what you want is probably to store the results in a list, then join the list with newlines and write that to the file.
import re
target = open('output.txt', 'w')
results = []
for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        results.append(match.group())
target.write("\n".join(results))
target.close()
Voilà, no extra newline at the beginning or end. It might not scale very well if the resulting list is huge. (And like kindall, I left out the else.)
Since you're performing the same regex over and over, you'd probably want to compile it beforehand.
import re
prog = re.compile(r'Stuff')
I tend to input from and output to stdin and stdout for simplicity. But that's a matter of taste (and specs).
from sys import stdin, stdout
Ignoring the specific requirement about removing the final EOL[1], and just addressing the bit about your own learning, the whole thing could be written like this:
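# prog.search mirrors the re.search in the question; the generator stays lazy on Python 2 and 3
matches = (prog.search(line) for line in stdin)
stdout.writelines(match.group() + '\n' for match in matches if match)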
[1] As others have commented, this is a Bad Thing, and it's extremely annoying when someone does this.

Writelines writes lines without newline, Just fills the file

I have a program that writes a list to a file.
The list is a list of pipe delimited lines and the lines should be written to the file like this:
123|GSV|Weather_Mean|hello|joe|43.45
122|GEV|temp_Mean|hello|joe|23.45
124|GSI|Weather_Mean|hello|Mike|47.45
BUT it wrote them like this, ahhhh:
123|GSV|Weather_Mean|hello|joe|43.45122|GEV|temp_Mean|hello|joe|23.45124|GSI|Weather_Mean|hello|Mike|47.45
This program wrote all the lines onto one line without any line breaks. This hurts me a lot and I have to figure out how to reverse it, but anyway, where is my program wrong? I thought writelines would write separate lines to the file rather than putting everything on one line.
fr = open(sys.argv[1], 'r') # source file
fw = open(sys.argv[2]+"/masked_"+sys.argv[1], 'w') # Target Directory Location
for line in fr:
    line = line.strip()
    if line == "":
        continue
    columns = line.strip().split('|')
    if columns[0].find("#") > 1:
        looking_for = columns[0] # this is what we need to search
    else:
        looking_for = "Dummy#dummy.com"
    if looking_for in d:
        # by default, iterating over a dictionary will return keys
        new_line = d[looking_for]+'|'+'|'.join(columns[1:])
        line_list.append(new_line)
    else:
        new_idx = str(len(d)+1)
        d[looking_for] = new_idx
        kv = open(sys.argv[3], 'a')
        kv.write(looking_for+" "+new_idx+'\n')
        kv.close()
        new_line = d[looking_for]+'|'+'|'.join(columns[1:])
        line_list.append(new_line)
fw.writelines(line_list)
This is actually a pretty common problem for newcomers to Python—especially since, across the standard library and popular third-party libraries, some reading functions strip out newlines, but almost no writing functions (except the log-related stuff) add them.
So, there's a lot of Python code out there that does things like:
fw.write('\n'.join(line_list) + '\n')
(writing a single string) or
fw.writelines(line + '\n' for line in line_list)
Either one is correct, and of course you could even write your own writelinesWithNewlines function that wraps it up…
But you should only do this if you can't avoid it.
It's better if you can create/keep the newlines in the first place—as in Greg Hewgill's suggestions:
line_list.append(new_line + "\n")
And it's even better if you can work at a higher level than raw lines of text, e.g., by using the csv module in the standard library, as esuaro suggests.
For example, right after defining fw, you might do this:
cw = csv.writer(fw, delimiter='|')
Then, instead of this:
new_line = d[looking_for]+'|'+'|'.join(columns[1:])
line_list.append(new_line)
You do this:
row_list.append([d[looking_for]] + columns[1:])
And at the end, instead of this:
fw.writelines(line_list)
You do this:
cw.writerows(row_list)
Finally, your design is "open a file, then build up a list of lines to add to the file, then write them all at once". If you're going to open the file up top, why not just write the lines one by one? Whether you're using simple writes or a csv.writer, it'll make your life simpler, and your code easier to read. (Sometimes there can be simplicity, efficiency, or correctness reasons to write a file all at once—but once you've moved the open all the way to the opposite end of the program from the write, you've pretty much lost any benefits of all-at-once.)
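A minimal sketch of the write-as-you-go approach with csv.writer follows. It assumes Python 3, uses an in-memory buffer in place of fw, and sets lineterminator to '\n' (the writer's default is '\r\n'). The rows are taken from the question's expected output:

```python
import csv
import io

rows = [
    ["123", "GSV", "Weather_Mean", "hello", "joe", "43.45"],
    ["122", "GEV", "temp_Mean", "hello", "joe", "23.45"],
]

out = io.StringIO()   # stands in for fw
cw = csv.writer(out, delimiter="|", lineterminator="\n")
for row in rows:
    cw.writerow(row)  # each call emits one complete, terminated line

print(out.getvalue())
```

With a real file in Python 3, the csv documentation recommends opening it with newline='' so the writer controls the line endings.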
The documentation for writelines() states:
writelines() does not add line separators
So you'll need to add them yourself. For example:
line_list.append(new_line + "\n")
whenever you append a new item to line_list.
As others have noted, writelines is a misnomer (it ridiculously does not add newlines to the end of each line).
To do that, explicitly add it to each line:
with open(dst_filename, 'w') as f:
    f.writelines(s + '\n' for s in lines)
writelines() does not add line separators. You can alter the list of strings by using map() to add a new \n (line break) at the end of each string.
items = ['abc', '123', '!##']
items = map(lambda x: x + '\n', items)
w.writelines(items)
As others have mentioned, and counter to what the method name would imply, writelines does not add line separators. This is a textbook case for a generator. Here is a contrived example:
def item_generator(things):
    for item in things:
        yield item
        yield '\n'
def write_things_to_file(things):
    # text mode, because the generator yields str values rather than bytes
    with open('path_to_file.txt', 'w') as f:
        f.writelines(item_generator(things))
Benefits: adds newlines explicitly without modifying the input or output values or doing any messy string concatenation. And, critically, does not create any new data structures in memory. IO (writing to a file) is when that kind of thing tends to actually matter. Hope this helps someone!
Credits to Brent Faust.
Python >= 3.6 with format string:
with open(dst_filename, 'w') as f:
    f.writelines(f'{s}\n' for s in lines)
lines can be a set.
If you are oldschool (like me) you may add f.write('\n') below the second line.
As we have well established here, writelines does not append the newlines for you. But what everyone seems to be missing is that it doesn't have to when it is used as a direct "counterpart" to readlines() and the initial read preserved the newlines!
When you open a file for reading in binary mode (via 'rb'), then use readlines() to fetch the file contents into memory, split by line, the newlines remain attached to the end of your lines! So, if you subsequently write them back, you likely don't want writelines to append anything!
So, if you do something like:
with open('test.txt','rb') as f: lines=f.readlines()
with open('test.txt','wb') as f: f.writelines(lines)
You should end up with the same file content you started with.
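That round trip can be verified with a short sketch, assuming Python 3 and a throwaway temporary file in place of test.txt:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
original = b"first\nsecond\nthird\n"

with open(path, "wb") as f:
    f.write(original)

with open(path, "rb") as f:
    lines = f.readlines()        # each item keeps its trailing newline

with open(path, "wb") as f:
    f.writelines(lines)          # so nothing needs to be added on the way back

with open(path, "rb") as f:
    result = f.read()
print(lines, result == original)
```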
As we only want to separate lines, and the writelines function in Python does not support adding a separator between lines, I have written the simple code below, which suits this problem:
sep = "\n" # defining the separator
new_lines = sep.join(lines) # lines is an iterable of line strings
and finally:
with open("file_name", 'w') as file:
    file.write(new_lines) # new_lines is a single string, so write() is the right call
and you are done.

Write strings to another file

The Problem - Update:
I could get the script to print its results, but had a hard time figuring out how to send that output to a file instead of the screen. The script below works for printing results to the screen. I posted the solution right after this code; scroll to the [ solution ] at the bottom.
First post:
I'm using Python 2.7.3. I am trying to extract the last words of a text file after the colon (:) and write them into another txt file. So far I am able to print the results on the screen and it works perfectly, but when I try to write the results to a new file it gives me str has no attribute write/writeline. Here is the code snippet:
# the txt file I'm trying to extract last words from and write strings into a file
#Hello:there:buddy
#How:areyou:doing
#I:amFine:thanks
#thats:good:I:guess
x = raw_input("Enter the full path + file name + file extension you wish to use: ")
def ripple(x):
    with open(x) as file:
        for line in file:
            for word in line.split():
                if ':' in word:
                    try:
                        print word.split(':')[-1]
                    except (IndexError):
                        pass
ripple(x)
The code above works perfectly when printing to the screen. However I have spent hours reading Python's documentation and can't seem to find a way to have the results written to a file. I know how to open a file and write to it with writeline, readline, etc, but it doesn't seem to work with strings.
Any suggestions on how to achieve this?
PS: I didn't add the code that caused the write error, because I figured this would be easier to look at.
End of First Post
The Solution - Update:
Managed to get python to extract and save it into another file with the code below.
The Code:
inputFile = open ('c:/folder/Thefile.txt', 'r')
outputFile = open ('c:/folder/ExtractedFile.txt', 'w')
tempStore = outputFile
for line in inputFile:
    for word in line.split():
        if ':' in word:
            splitting = word.split(':')[-1]
            tempStore.writelines(splitting +'\n')
            print splitting
inputFile.close()
outputFile.close()
Update:
Check out droogans' code instead of mine; it is more efficient.
Try this:
with open('workfile', 'w') as f:
    f.write(word.split(':')[-1] + '\n')
If you really want to use the print method, you can:
from __future__ import print_function
print("hi there", file=f)
according to "Correct way to write line to file in Python". You need the __future__ import if you are using Python 2; in Python 3, print is already a function.
I think your question is good, and when you're done, you should head over to code review and get your code looked at for other things I've noticed:
# the txt file I'm trying to extract last words from and write strings into a file
#Hello:there:buddy
#How:areyou:doing
#I:amFine:thanks
#thats:good:I:guess
First off, thanks for putting example file contents at the top of your question.
x = raw_input("Enter the full path + file name + file extension you wish to use: ")
I don't think this part is necessary. You can just create a better parameter for ripple than x. I think file_loc is a pretty standard one.
def ripple(x):
    with open(x) as file:
With open, you are able to mark the operation happening to the file. I also like to name my file object according to its job. In other words, with open(file_loc, 'r') as r: reminds me that r.foo is going to be my file that is being read from.
        for line in file:
            for word in line.split():
                if ':' in word:
First off, your for word in line.split() statement does nothing but put the "Hello:there:buddy" string into a list: ["Hello:there:buddy"]. A better idea would be to pass split an argument, which does more or less what you're trying to do here. For example, "Hello:there:buddy".split(":") would output ['Hello', 'there', 'buddy'], making your search for colons an accomplished task.
                    try:
                        print word.split(':')[-1]
                    except (IndexError):
                        pass
Another advantage is that you won't need to check for an IndexError, since you'll have, at least, an empty string, which when split comes back as a list holding one empty string, so [-1] always works. In other words, it'll write nothing for that line.
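The behaviour of split described above can be checked directly. A small sketch, using the first sample line from the question:

```python
line = "Hello:there:buddy"

whitespace_split = line.split()      # no whitespace in the line, so a single item
colon_split = line.split(":")        # splits on the colons instead
last_word = colon_split[-1]
empty_case = "".split(":")           # never an empty list, so [-1] cannot raise IndexError

print(whitespace_split, colon_split, last_word, empty_case)
```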
ripple(x)
For ripple(x), you would instead call ripple('/home/user/sometext.txt').
So, try looking over this, and explore code review. There's a guy named Winston who does really awesome work with Python and self-described newbies. I always pick up new tricks from that guy.
Here is my take on it, re-written out:
import os #for renaming the output file
def ripple(file_loc='/typical/location/while/developing.txt'):
    outfile = "output.".join(os.path.basename(file_loc).split('.'))
    with open(outfile, 'w') as w:
        lines = open(file_loc, 'r').readlines() #everything is one giant list
        # strip each line first so its own '\n' is not doubled by the join
        w.write('\n'.join([line.strip().split(':')[-1] for line in lines]))
ripple()
Try breaking this down, line by line, and changing things around. It's pretty condensed, but once you pick up comprehensions and using lists, it'll be more natural to read code this way.
You are trying to call .write() on a string object.
You either got your arguments mixed up (you'll need to call fileobject.write(yourdata), not yourdata.write(fileobject)) or you accidentally re-used the same variable for both your open destination file object and storing a string.
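The first of those two mistakes can be reproduced in a short sketch. It assumes Python 3 and uses an in-memory buffer in place of the destination file; the variable names are made up for illustration:

```python
import io

out = io.StringIO()      # stands in for the open destination file
data = "buddy"

try:
    data.write(out)      # arguments the wrong way round: str has no write method
except AttributeError as exc:
    error = str(exc)

out.write(data + "\n")   # correct order: fileobject.write(yourdata)
print(error)
print(repr(out.getvalue()))
```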
