I have a file with a large number of random strings contained within it. There are certain patterns that I want to remove, so I decided to use regex to check for them. So far this code does exactly what I want it to:
#!/usr/bin/python
import csv
import re
import sys
import pdb
f=open('output.csv', 'w')
with open('retweet.csv', 'rb') as inputfile:
    read=csv.reader(inputfile, delimiter=',')
    for row in read:
        f.write(re.sub(r'#\s\w+', ' ', row[0]))
        f.write("\n")
f.close()

f=open('output2.csv', 'w')
with open('output.csv', 'rb') as inputfile2:
    read2=csv.reader(inputfile2, delimiter='\n')
    for row in read2:
        a= re.sub('[^a-zA-Z0-9]', ' ', row[0])
        b= str.split(a)
        c= "+".join(b)
        f.write("http://www.google.com/webhp#q="+c+"&btnI\n")
f.close()
The problem is that I would like to avoid having to open and close files repeatedly, as this can get messy if I need to check for more patterns. How can I perform multiple re.sub() calls on the same file and write the result out to a new file with all the substitutions applied?
Thanks for any help!
Apply all your substitutions in one go on the current line:
with open('retweet.csv', 'rb') as inputfile:
    read=csv.reader(inputfile, delimiter=',')
    for row in read:
        text = row[0]
        text = re.sub(r'#\s\w+', ' ', text)
        text = re.sub(another_expression, another_replacement, text)
        # etc.
        f.write(text + '\n')
Note that opening a file with csv.reader(..., delimiter='\n') sounds very much as if you are treating that file as a sequence of lines; you could just loop over the file:
with open('output.csv', 'rb') as inputfile2:
    for line in inputfile2:
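Going one step further than the two-pass approach in the question, the intermediate output.csv could be dropped entirely by keeping the substitutions in a list and applying them in a single pass. This is only a sketch of that idea, reusing the file names and patterns from the question; anything beyond those is an assumption:
import csv
import re

# (pattern, replacement) pairs; extend this list as more patterns turn up
substitutions = [
    (r'#\s\w+', ' '),        # drop "# word" fragments, as in the first pass
    (r'[^a-zA-Z0-9]', ' '),  # keep only alphanumerics, as in the second pass
]

with open('retweet.csv', 'rb') as inputfile, open('output2.csv', 'w') as outfile:
    for row in csv.reader(inputfile, delimiter=','):
        text = row[0]
        for pattern, replacement in substitutions:
            text = re.sub(pattern, replacement, text)
        query = '+'.join(text.split())
        outfile.write('http://www.google.com/webhp#q=' + query + '&btnI\n')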
So I have this crazy long text file made by my crawler, and for some reason it added some spaces in between the links, like this:
https://example.com/asdf.html (note the spaces)
https://example.com/johndoe.php (again)
I want to get rid of that, but keep the newlines. Keep in mind that the text file is 4,000+ lines long. I tried to do it myself but realized that I have no idea how to loop through the lines in a file.
It seems you can't directly edit a file in place in Python, so here is my suggestion:
# first get all lines from file
with open('file.txt', 'r') as f:
    lines = f.readlines()

# remove spaces
lines = [line.replace(' ', '') for line in lines]

# finally, write lines in the file
with open('file.txt', 'w') as f:
    f.writelines(lines)
You can open the file, read it line by line, and remove the whitespace:
Python 3.x:
with open('filename') as f:
    for line in f:
        print(line.strip())
Python 2.x:
with open('filename') as f:
    for line in f:
        print line.strip()
This removes the leading and trailing whitespace from each line and prints it.
Hope it helps!
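If the goal is to write the cleaned lines back out instead of just printing them, the same loop can feed an output file. A small sketch, assuming every space inside each line should go and that writing to a separate file (here called output.txt) is acceptable:
with open('file.txt') as f, open('output.txt', 'w') as out:
    for line in f:
        # remove every space; the trailing newline is not a space, so it survives
        out.write(line.replace(' ', ''))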
Read text from file, remove spaces, write text to file:
with open('file.txt', 'r') as f:
    txt = f.read().replace(' ', '')

with open('file.txt', 'w') as f:
    f.write(txt)
In @Leonardo Chirivì's solution it's unnecessary to create a list to store the file contents when a string is sufficient and more memory efficient. The .replace(' ', '') operation is also called only once on the whole string, which is more efficient than iterating through a list and calling replace on each line individually.
To avoid opening the file twice:
with open('file.txt', 'r+') as f:
    txt = f.read().replace(' ', '')
    f.seek(0)
    f.write(txt)
    f.truncate()
It would be more efficient to only open the file once. This requires moving the file pointer back to the start of the file after reading, as well as truncating any content possibly left over after you write back to the file. A drawback of this solution, however, is that it is not as easily readable.
I had something similar that I'd been dealing with.
This is what worked for me (note: this converts runs of 2+ whitespace characters into a comma, but below the code block I explain how you can get rid of ALL whitespace):
import re

# read the file
with open('C:\\path\\to\\test_file.txt') as f:
    read_file = f.read()

print(type(read_file))  # to confirm that it's a string

read_file = re.sub(r'\s{2,}', ',', read_file)  # find/convert 2+ whitespace into ','

# write the file
with open('C:\\path\\to\\test_file.txt', 'w') as f:
    f.write(read_file)  # write the modified string, not the literal string 'read_file'
This helped me send the updated data on to a CSV, which suited my need, but it can help you as well: instead of converting to a comma (','), you can convert to an empty string (''), or use read_file.replace(' ', '') if you don't need any whitespace at all.
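For the original question (remove every space but keep the newlines), the same read/modify/write pattern works with an empty replacement; note this sketch deliberately uses [ \t]+ rather than \s{2,} so that newlines are preserved, and the file path is the same hypothetical one as above:
import re

# read the file
with open('C:\\path\\to\\test_file.txt') as f:
    read_file = f.read()

# remove runs of spaces/tabs, but leave newlines alone
read_file = re.sub(r'[ \t]+', '', read_file)

# write the file back
with open('C:\\path\\to\\test_file.txt', 'w') as f:
    f.write(read_file)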
Let's not forget to add back the \n to go to the next row.
The complete code would be:
with open(str_path, 'r') as file:
    str_lines = file.readlines()

# remove spaces
if bl_right is True:
    str_lines = [line.rstrip() + '\n' for line in str_lines]
elif bl_left is True:
    # strip the old newline as well, so it is not doubled when '\n' is added back
    str_lines = [line.lstrip().rstrip('\n') + '\n' for line in str_lines]
else:
    str_lines = [line.strip() + '\n' for line in str_lines]

# Write the file out again
with open(str_path, 'w') as file:
    file.writelines(str_lines)
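Wrapped up as an actual function with the bl_right/bl_left switches as parameters, it could look roughly like this sketch (the function name and default values are assumptions, not part of the original answer):
def strip_file_whitespace(str_path, bl_right=False, bl_left=False):
    """Strip whitespace from every line of the file at str_path, rewriting it in place."""
    with open(str_path, 'r') as file:
        str_lines = file.readlines()

    if bl_right:
        str_lines = [line.rstrip() + '\n' for line in str_lines]
    elif bl_left:
        str_lines = [line.lstrip().rstrip('\n') + '\n' for line in str_lines]
    else:
        str_lines = [line.strip() + '\n' for line in str_lines]

    with open(str_path, 'w') as file:
        file.writelines(str_lines)


strip_file_whitespace('file.txt', bl_right=True)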
I have raw data in a .txt file and would like to convert it to a .csv file.
This is a sample of the data from the txt file:
(L2-CR666 Reception Counter) L2-CR666 Reception Counter has been forced.
(L7-CR126 Handicapped Toilet) L7-CR126 Handicapped Toilet has been forced.
I would like to achieve the following result:
L2-CR666 Reception Counter, forced
L7-CR126 Handicapped Toilet, forced
I have tried the following code but was unable to achieve the stated result. Where did I go wrong?
import csv
with open('Converted Detection\\Testing 01\\2019-02-21.txt') as infile, open('Converted Detection\\Converted CSV\\log.csv', 'w') as outfile:
    for line in infile:
        outfile.write(infile.read().replace("(", ""))
    for line in infile:
        outfile.write(', '.join(infile.read().split(')')))
outfile.close()
You can try this:
with open('Converted Detection\\Testing 01\\2019-02-21.txt') as infile, open('Converted Detection\\Converted CSV\\log.csv', 'w') as outfile:
    for line in infile:
        # Get text inside ()
        text = line[line.find("(")+1:line.find(")")]
        # Remove \r\n
        line = line.rstrip("\r\n")
        # Get last word
        forcedText = line.split(" ")[len(line.split(" "))-1]
        # Remove . char
        forcedText = forcedText[:len(forcedText)-1]
        outfile.write(text+", "+forcedText+"\n")
outfile.close()
Best
You could use .partition() to drop everything up to and including the ), and then simply replace the parts you do not want accordingly. Also, you do not have to close the file when using the with statement, as it automatically closes it for you, and you do not have to import the csv library just to save a file with the .csv extension.
The following code outputs your wanted result:
infile_path = "Converted Detection\\Testing 01\\2019-02-21.txt"
outfile_path = "Converted Detection\\Converted CSV\\log.csv"
with open(infile_path, "r") as infile, open(outfile_path, "+w") as outfile:
    for line in infile:
        line = line.partition(")")[2].replace(" has been forced.", ", forced").strip()
        outfile.write(line + "\n")
The first for loop already reads infile, so there is no need to re-read infile or add a second loop.
Also, the with block will take care of closing the files.
for line in infile:
    line = line.replace("(", "")
    outfile.write(', '.join(line.split(')')))
I would suggest using:
lineout = ', '.join(linein.replace('(','').replace(')','').split(' has been '))
where:
linein = line.strip()
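Wrapped in a read/write loop (with the file paths from the question), that suggestion might look like the sketch below; note that, as written, it keeps the duplicated name before the parenthesised text and the trailing '.', so some extra cleanup may still be needed:
with open('Converted Detection\\Testing 01\\2019-02-21.txt') as infile, \
        open('Converted Detection\\Converted CSV\\log.csv', 'w') as outfile:
    for line in infile:
        linein = line.strip()
        lineout = ', '.join(linein.replace('(', '').replace(')', '').split(' has been '))
        outfile.write(lineout + '\n')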
Problem
I need to re-format a text from comma (,) separated values to pipe (|) separated values. Pipe characters within the values of the original (comma separated) text shall be replaced by a space for representation in the (pipe separated) result text.
The pipe separated result text shall be written back to the same file from which the original comma separated text has been read.
I am using Python 2.6.
Possible Solution
I could read the file first, replace all pipes with spaces in it, and later replace each (,) with (|).
Is there a better way to achieve this?
Don't reinvent the value-separated file parsing wheel. Use the csv module to do the parsing and the writing for you.
The csv module will add "..." quotes around values that contain the separator, so in principle you don't need to replace the | pipe symbols in the values. To replace the original file, write to a new (temporary) outputfile then move that back into place.
import csv
import os

outputfile = inputfile + '.tmp'

with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    writer.writerows(reader)

os.remove(inputfile)
os.rename(outputfile, inputfile)
For an input file containing:
foo,bar|baz,spam
this produces
foo|"bar|baz"|spam
Note that the middle column is wrapped in quotes.
If you do need to replace the | characters in the values, you can do so as you copy the rows:
outputfile = inputfile + '.tmp'

with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    for row in reader:
        writer.writerow([col.replace('|', ' ') for col in row])

os.remove(inputfile)
os.rename(outputfile, inputfile)
Now the output for my example becomes:
foo|bar baz|spam
Sounds like you're trying to work with a variation of CSV - in that case, Python's CSV library might be just what you need. You can use it with custom delimiters and it will handle escaping for you automatically (this example was yanked from the manual and modified):
import csv
with open('eggs.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|')
    spamwriter.writerow(['One', 'Two', 'Three'])
There are also ways to modify quoting and escaping and other options. Reading works similarly.
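For completeness, reading a pipe-delimited file back with the same module could look like this sketch (eggs.csv is carried over from the example above, in Python 2 style to match the question):
import csv

with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='|')
    for row in spamreader:
        print row  # each row comes back as a list of column values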
You can create a temporary file from the original that has the pipe characters replaced, and then replace the original file with it when the processing is done:
import csv
import tempfile
import os

filepath = 'C:/Path/InputFile.csv'

with open(filepath, 'rb') as fin:
    reader = csv.DictReader(fin)
    fout = tempfile.NamedTemporaryFile(dir=os.path.dirname(filepath),
                                       delete=False)
    temp_filepath = fout.name
    writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
    # writer.writeheader() # requires Python 2.7
    header = dict(zip(reader.fieldnames, reader.fieldnames))
    writer.writerow(header)
    for row in reader:
        for k, v in row.items():
            row[k] = v.replace('|', ' ')
        writer.writerow(row)
    fout.close()

os.remove(filepath)
os.rename(temp_filepath, filepath)
I have a txt file with this format:
something text1 pm,bla1,bla1
something text2 pm,bla2,bla2
something text3 am,bla3,bla3
something text4 pm,bla4,bla4
and in a new file I want to hold:
bla1,bla1
bla2,bla2
bla3,bla3
bla4,bla4
I have this, which for example keeps the first 10 characters of every line. Can I adapt this, or is there some other idea?
with open('example1.txt', 'r') as input_handle:
    with open('example2.txt', 'w') as output_handle:
        for line in input_handle:
            output_handle.write(line[:10] + '\n')
This is what the csv module was made for.
import csv
reader = csv.reader(open('file.csv'))
for row in reader: print(row[1])
You can then just redirect the script's output to the new file using your shell, or you can do something like this instead of the last line:
with open('out.csv', 'w+') as f:  # open once, outside the loop, so earlier rows are not overwritten
    for row in reader:
        f.write(row[1] + '\n')
To remove the first ","-separated column from the file:
first, sep, rest = line.partition(",")
if rest:  # don't write lines with less than 2 columns
    output_handle.write(rest)
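In context, that snippet slots straight into the asker's original loop; a minimal sketch using the file names from the question:
with open('example1.txt', 'r') as input_handle:
    with open('example2.txt', 'w') as output_handle:
        for line in input_handle:
            first, sep, rest = line.partition(",")
            if rest:  # don't write lines with less than 2 columns
                output_handle.write(rest)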
If the format is fixed:
with open('example1.txt', 'r') as input_handle:
    with open('example2.txt', 'w') as output_handle:
        for line in input_handle:
            if line:  # and maybe some other format check
                od = line.split(',', 1)
                output_handle.write(od[1] + "\n")
Here is how I would write it.
Python 2.7
import csv
with open('example1.txt', 'rb') as f_in, open('example2.txt', 'wb') as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        writer.writerow(row[-2:])  # keeps the last two columns
Python 3.x (note the differences in arguments to open)
import csv
with open('example1.txt', 'r', newline='') as f_in:
    with open('example2.txt', 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            writer.writerow(row[-2:])  # keeps the last two columns
Try:
output_handle.write(line.split(",", 1)[1])
From the docs:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
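For completeness, here is how that one-liner might sit inside the loop from the question (a small sketch reusing the asker's file handles; it assumes every line contains at least one comma):
with open('example1.txt', 'r') as input_handle:
    with open('example2.txt', 'w') as output_handle:
        for line in input_handle:
            # split at the first comma only and keep everything after it
            output_handle.write(line.split(",", 1)[1])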