Python: can't delete unwanted commas in a text file - python

I have a CSV file that contains information about people:
name,age,height
Maria,25,172
George,45,180,
Peter,23,179,
The problem is that some lines have an extra comma at the end and some don't (the data was fetched from the internet with urlopen by another Python script that processes the raw data).
I tried to write some code to fix this, but I couldn't get it to work. Here is what I wrote:
import re
data = open('file.csv').read()
new_data = re.sub('\W$', '', data)
print(new_data)
But this code removes only the last comma in the whole document. I also tried to write a loop that goes through the file and analyses each line, but I couldn't get that to work either. Please tell me what I'm doing wrong.

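The problem is that the whole file is handled as a single string, and by default $ matches only at the end of the string (or just before a final newline).
You would do better with re.sub(r'\W\n', '\n', data), or with re.sub(r',$', '', data, flags=re.MULTILINE), which makes $ match at the end of every line.
You can also do this without a regexp: new_data = data.replace(',\n', '\n'), which is probably faster.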
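A minimal sketch of both suggestions, using an in-memory string in place of the contents of file.csv (the `sample` text and the variable names are illustrative, not from the question):

```python
import re

# Sample data standing in for the contents of file.csv (illustrative only).
sample = "name,age,height\nMaria,25,172\nGeorge,45,180,\nPeter,23,179,\n"

# With re.MULTILINE, $ matches at the end of every line, not only at the end of the string.
fixed_regex = re.sub(r",$", "", sample, flags=re.MULTILINE)

# Plain string replacement; it only matches a comma that is followed by a newline.
fixed_replace = sample.replace(",\n", "\n")

print(fixed_regex)
```

Note that the replace version would miss a trailing comma on the last line if the file does not end with a newline, while the MULTILINE regex would still catch it.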

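This is simple enough that you don't really need regex (and it's probably faster not to use it).
Here's what I would do:
with open("file.csv", 'r') as f:
    # Drop the line endings first; otherwise endswith(",") is never true for a line that ends in ",\n".
    lines = f.read().splitlines()
newLines = [line[:-1] if line.endswith(",") else line for line in lines]
Then all you need to do is write the lines back to the file, adding a newline after each one.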
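Writing the result back could look roughly like this. It is only a sketch: the function name `strip_trailing_commas` is made up for illustration, and it assumes the file is small enough to read into memory.

```python
def strip_trailing_commas(path):
    """Remove one trailing comma from each line of the file at `path`, in place."""
    with open(path, "r") as f:
        lines = f.read().splitlines()  # line endings are dropped here

    cleaned = [line[:-1] if line.endswith(",") else line for line in lines]

    with open(path, "w") as f:
        # Put the newline back after every line.
        f.writelines(line + "\n" for line in cleaned)
```

Calling strip_trailing_commas('file.csv') would then rewrite the file from the question without the extra commas.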

Related

Appending string to list adds a '\r' to every element

I am trying to append a string to a list (seqs). The strings look something like random letters which I am reading line by line from a file. Printing out the string shows no sign of a '\r'. But printing out the list does.
seqs.append(seq)
print('seq is', seq)
gives: seq is CSMKMTIGSGTKTLHRWAFNPTQQTCVTFVYTGAAGNQNNFLTRNDCVNTC
print(seqs)
gives: 'CSMKMTIGSGTKTLHRWAFNPTQQTCVTFVYTGAAGNQNNFLTRNDCVNTC\r'
I have tried changing the file a few different ways, even writing an example file of just lines of text so I am fairly certain I am not adding a '\r' to the file. Any help would be much appreciated.
Edit: I printed the sequence differently and I do see the carriage return now. Is there any way I can remove it? For now, when I parse through the sequences, I just skip anything that is '\r'.
It's reading a gzip file and opening it with:
with gzip.open(filepath, 'rb') as fin:
    file_content = fin.read().decode(encoding)
lines = file_content.split('\n')
for line in lines:  # separate them into a list
Second edit: I just used file_content.split('\r\n') and that seemed to fix the problem. Not sure why I did not think of it sooner. Thanks everyone!
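The '\r' was invisible with print(seq) because printing a string sends the carriage return straight to the terminal, while printing a list shows the repr of each element. As a small sketch (the sample text is made up), str.splitlines() avoids having to know which line ending the file uses:

```python
# Text with Windows-style line endings, standing in for the decoded file content.
file_content = "CSMKMTIG\r\nAFNPTQQT\r\n"

# Splitting on '\n' leaves the '\r' on each element, plus an empty string at the end.
by_newline = file_content.split("\n")

# splitlines() handles \n, \r\n and \r, and adds no trailing empty string.
by_splitlines = file_content.splitlines()

print(by_newline)
print(by_splitlines)
```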

Trying to understand how to get import re to work in pycharm

I'm going through a course at work for Python. We're using Pycharm, and I'm not sure if that's what the problem is.
Basically, I have to read in a text file, scrub it, then count the frequency of specific words. The counting is not an issue. (I looped through a scrubbed list, checked the scrubbed list for the specific words, then added the specific words to a dictionary as I looped through the list. It works fine).
My issue is really about scrubbing the data. I ended up doing successive scrubs to get to a final clean list. But when I read the documentation, I should be able to use regex or re and scrub my file with one line of code. No matter what I do, importing re, or regex I get errors that stop my code.
How can I write the below code pythonically?
# Open the file in read mode
with open('chocolate.txt', 'r') as file:
    input_col = file.read().replace(',', '')
text3 = input_col.replace('.', '')
text2 = text3.replace('"', '')
text = text2.split()
You could try using a regular expression, something like this:
import re
result = re.sub(r'[".,]', "", text)  # a character class; an unescaped . would match every character
print(result)
Here text is the string read from the text file.
Hope this helps!
Alternatively, with the period escaped inside an alternation:
x = re.sub(r'("|\.|,)', "", text)
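Put together, the scrub-and-split step from the question can be sketched as a single helper. The function name `scrub` and the sample sentence are illustrative only:

```python
import re

def scrub(text):
    """Remove double quotes, periods and commas, then split on whitespace."""
    return re.sub(r'[".,]', "", text).split()

words = scrub('He said, "Chocolate is good." Chocolate, indeed.')
print(words)
```

Inside a character class the period needs no escaping, which is why this pattern does not strip every character the way an unescaped `.` in an alternation would.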

Searching a text file and grabbing all lines that do not include ## in python

I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list object that contains one line of required text per element. You could split this into individual words using string.split() which would give you a nested list per original list element which you could easily index (assuming your text has whitespaces of course).
clean_text[4][7]
would return the 5th line, 8th word.
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []
with open("file.txt", "r") as f:  # "file.txt" is a placeholder for your file; "r" = read
    for line in f:
        if line[:2] != "##":  # compare only the first two characters
            listoflines.append(line)
print(listoflines)
If you're feeling brave, you can also replace the loop inside the with block with a list comprehension (credit goes to Alex Thornton):
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially teaching the .startswith function, but I think this is the more pythonic way and also has the advantage of automatically closing the file as soon as you're done with it.
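For a large results file, a generator keeps only one line in memory at a time. This is a sketch; the name `useful_lines` and the sample data are made up for illustration:

```python
def useful_lines(lines):
    """Yield lines that do not start with '##', with line endings removed."""
    for line in lines:
        if not line.startswith("##"):
            yield line.rstrip("\r\n")

# Any iterable of lines works, including an open file object.
sample = ["## header\n", "1.0 2.0 3.0\n", "## note\n", "4.0 5.0 6.0\n"]
kept = list(useful_lines(sample))
print(kept)
```

Unlike grep -v '##', this only drops lines that begin with ##, which matches the file layout described in the question.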

Don't write final new line character to a file

I have looked around StackOverflow and couldn't find an answer to my specific question so forgive me if I have missed something.
import re
target = open('output.txt', 'w')
for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        match_text = match.group()
        target.write(match_text + '\n')
    else:
        continue
target.close()
The file I am parsing is huge so need to process it line by line.
This (of course) leaves an additional newline at the end of the file.
How should I best change this code so that on the final iteration of the 'if match' loop it doesn't put the extra newline character at the end of the file. Should it look through the file again at the end and remove the last line (seems a bit inefficient though)?
The existing StackOverflow questions I have found cover removing all new lines from a file.
If there is a more pythonic / efficient way to write this code I would welcome suggestions for my own learning also.
Thanks for the help!
Another thing you can do is truncate the file. .tell() gives the current byte position in the file. We then subtract one and truncate there to remove the trailing newline.
with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell()-1)
On Linux and macOS the -1 is correct, but on Windows it needs to be -2, because a text-mode '\n' is written as '\r\n'. A more Pythonic way to determine which applies is to check os.linesep.
import os
remove_chars = len(os.linesep)
with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell() - remove_chars)
kindall's answer is also valid, except that you said it's a large file. This method will let you handle a terabyte-sized file with a gigabyte of RAM.
Write the newline of each line at the beginning of the next line. To avoid writing a newline at the beginning of the first line, use a variable that is initialized to an empty string and then set to a newline in the loop.
import re
with open('input.txt') as source, open('output.txt', 'w') as target:
    newline = ''
    for line in source:
        match = re.search(r'Stuff', line)
        if match:
            target.write(newline + match.group())
            newline = '\n'
I also restructured your code a bit (the else: continue is not needed, because what else is the loop going to do?) and changed it to use the with statement so the files are automatically closed.
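The same separator-first idea can be sketched as a reusable function that works on any pair of file-like objects. The function name `write_matches` and the in-memory sample are illustrative; io.StringIO stands in for the real input and output files.

```python
import io
import re

def write_matches(source, target, pattern=r"Stuff"):
    """Write each regex match from `source` to `target`, newline-separated, with no trailing newline."""
    prog = re.compile(pattern)
    separator = ""  # empty before the first match, a newline afterwards
    for line in source:
        match = prog.search(line)
        if match:
            target.write(separator + match.group())
            separator = "\n"

out = io.StringIO()
write_matches(io.StringIO("Stuff one\nnothing here\nmore Stuff\n"), out)
print(repr(out.getvalue()))
```

Because the input is consumed one line at a time, memory use stays flat regardless of file size.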
The shortest path from what you have to what you want is probably to store the results in a list, then join the list with newlines and write that to the file.
import re
target = open('output.txt', 'w')
results = []
for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        results.append(match.group())
target.write("\n".join(results))
target.close()
Voilà, no extra newline at the beginning or end. This might not scale very well if the resulting list is huge. (And like kindall, I left out the else.)
Since you're performing the same regex over and over, you'd probably want to compile it beforehand.
import re
prog = re.compile(r'Stuff')
I tend to input from and output to stdin and stdout for simplicity. But that's a matter of taste (and specs).
from sys import stdin, stdout
Ignoring the specific requirement about removing the final EOL[1], and just addressing the bit about your own learning, the whole thing could be written like this:
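# On Python 3 the built-in map is lazy, so itertools.imap (Python 2 only) is not needed.
# search, unlike match, finds the pattern anywhere in the line, as re.search does in the question.
stdout.writelines(match.group() + '\n' for match in map(prog.search, stdin) if match)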
[1] As others have commented, this is a Bad Thing, and it's extremely annoying when someone does this.

Write strings to another file

The Problem - Update:
I could get the script to print its results but had a hard time figuring out how to send the output to a file instead of the screen. The script below prints the results to the screen. I posted the solution after this code; scroll to [ solution ] at the bottom.
First post:
I'm using Python 2.7.3. I am trying to extract the last words of a text file after the colon (:) and write them into another txt file. So far I am able to print the results on the screen and it works perfectly, but when I try to write the results to a new file it tells me that str has no attribute write/writeline. Here is the code snippet:
# the txt file I'm trying to extract last words from and write strings into a file
#Hello:there:buddy
#How:areyou:doing
#I:amFine:thanks
#thats:good:I:guess
x = raw_input("Enter the full path + file name + file extension you wish to use: ")
def ripple(x):
    with open(x) as file:
        for line in file:
            for word in line.split():
                if ':' in word:
                    try:
                        print word.split(':')[-1]
                    except (IndexError):
                        pass
ripple(x)
The code above works perfectly when printing to the screen. However I have spent hours reading Python's documentation and can't seem to find a way to have the results written to a file. I know how to open a file and write to it with writeline, readline, etc, but it doesn't seem to work with strings.
Any suggestions on how to achieve this?
PS: I didn't add the code that caused the write error, because I figured this would be easier to look at.
End of First Post
The Solution - Update:
Managed to get python to extract and save it into another file with the code below.
The Code:
inputFile = open ('c:/folder/Thefile.txt', 'r')
outputFile = open ('c:/folder/ExtractedFile.txt', 'w')
tempStore = outputFile
for line in inputFile:
    for word in line.split():
        if ':' in word:
            splitting = word.split(':')[-1]
            tempStore.writelines(splitting +'\n')
            print splitting
inputFile.close()
outputFile.close()
Update:
Check out droogans' code rather than mine; it is more efficient.
Try this:
with open('workfile', 'w') as f:
    f.write(word.split(':')[-1] + '\n')
If you really want to use the print method, you can:
from __future__ import print_function
print("hi there", file=f)
according to the question "Correct way to write line to file in Python". You need the __future__ import on Python 2; on Python 3 the print function is already built in.
I think your question is good, and when you're done, you should head over to code review and get your code looked at for other things I've noticed:
# the txt file I'm trying to extract last words from and write strings into a file
#Hello:there:buddy
#How:areyou:doing
#I:amFine:thanks
#thats:good:I:guess
First off, thanks for putting example file contents at the top of your question.
x = raw_input("Enter the full path + file name + file extension you wish to use: ")
I don't think this part is necessary. You can just create a better parameter name for ripple than x. I think file_loc is a pretty standard one.
def ripple(x):
    with open(x) as file:
With open, you are able to mark the operation happening to the file. I also like to name my file object according to its job. In other words, with open(file_loc, 'r') as r: reminds me that r.foo is going to be my file that is being read from.
for line in file:
    for word in line.split():
        if ':' in word:
First off, your for word in line.split() statement does nothing but put the "Hello:there:buddy" string into a list: ["Hello:there:buddy"]. A better idea would be to pass split an argument, which does more or less what you're trying to do here. For example, "Hello:there:buddy".split(":") would output ['Hello', 'there', 'buddy'], making your search for colons an accomplished task.
try:
    print word.split(':')[-1]
except (IndexError):
    pass
Another advantage is that you won't need to check for an IndexError: split always returns at least one element, so even an empty string splits into [''], and [-1] gives you an empty string. In other words, it'll write nothing for that line.
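A quick sketch of how split behaves in the cases discussed here, using the sample line from the question (the variable names are illustrative):

```python
line = "Hello:there:buddy"

# With no argument, split() separates on whitespace, so the whole line is one item.
whitespace_parts = line.split()

# With ':' as the separator, each field becomes its own item.
colon_parts = line.split(":")

# Splitting an empty string on ':' still returns one element, so [-1] never raises IndexError.
empty_parts = "".split(":")

print(whitespace_parts, colon_parts, empty_parts)
```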
ripple(x)
For ripple(x), you would instead call ripple('/home/user/sometext.txt').
So, try looking over this, and explore code review. There's a guy named Winston who does really awesome work with Python and self-described newbies. I always pick up new tricks from that guy.
Here is my take on it, re-written out:
import os  # for deriving the output file name
def ripple(file_loc='/typical/location/while/developing.txt'):
    outfile = "output.".join(os.path.basename(file_loc).split('.'))
    with open(file_loc, 'r') as r, open(outfile, 'w') as w:
        # everything is one giant list; splitlines() drops the newlines so join does not double them
        lines = r.read().splitlines()
        w.write('\n'.join([line.split(':')[-1] for line in lines]))
ripple()
Try breaking this down, line by line, and changing things around. It's pretty condensed, but once you pick up comprehensions and using lists, it'll be more natural to read code this way.
You are trying to call .write() on a string object.
You either got your arguments mixed up (you'll need to call fileobject.write(yourdata), not yourdata.write(fileobject)) or you accidentally re-used the same variable for both your open destination file object and storing a string.
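As a sketch of the correct call order, the string goes in as the argument and write is called on the file object. The function names `last_fields` and `write_last_fields` are made up for illustration and assume one colon-separated record per line, as in the sample data from the question.

```python
def last_fields(lines):
    """Return the text after the last colon on each line."""
    return [line.rstrip("\r\n").split(":")[-1] for line in lines]

def write_last_fields(in_path, out_path):
    """Copy the last colon-separated field of every input line to the output file."""
    with open(in_path) as source, open(out_path, "w") as target:
        for field in last_fields(source):
            # file_object.write(string), not string.write(file_object)
            target.write(field + "\n")

print(last_fields(["Hello:there:buddy\n", "thats:good:I:guess\n"]))
```

Calling .write on the string itself raises AttributeError, because str objects have no write method; that is the error described in the question.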
