Iterative regex in Python: find and replace - python

This is one of those "I'd know how to do it in C" type questions. :p
I'm asking this as similar questions in SO don't have a particular aspect I'm looking for.
I'm essentially looking to find and replace items that also have possessive forms. So if there is a "rabbit" in the list, and also a "rabbit's", then replace "rabbit" with a series of asterisks.
Something along the lines of:
#!/usr/bin/env python
import re
text = open("list.txt", "w")
for line in text:
test = line
if re.match(test+"'", line) or re.match(test+"'s", line):
line = "****"
However, this clearly won't work as the for each mechanism makes line be used for both iteration and pattern matching.

with open('file.txt') as f:
# Remove the \n characters at the end of each line
all_lines = [x.strip() for x in f.readlines()]
for line in all_lines:
# Check for presence of word' or word's
if line+"'" in all_lines or line+"'s" in all_lines:
print('****')
else:
print(line)
It's worth noting that this is quite a brute force way of doing and for huge lists will take a bit longer (it loads the file into memory) but should give you an idea.

you can use str.endswith:
text = open("list.txt", "r")
for line in text:
test = line.strip()
if test.endswith("'s"):
line = "****"
Here i have explained why your code is not going to work:
replace this:
test = line
to:
test = line.strip() # to remove new line character
so your test will be rabbit\n', if you don't remove newline character
you also need to open file on read mode
text = open("list.txt",'r')
you match is not going to work, think of it:
suppose test="rabbit's"
test+"'" will give you `rabbit's'`

Related

Load a text file paragraph into a string without libraries

sorry if this question may look a bit dumb for some of you but i'm totally a beginner at programming in Python so i'm quite bad and got a still got a lot to learn.
So basically I have this long text file separated by paragraphs, sometimes the newline can be double or triple to make the task more hard for us so i added a little check and looks like it's working fine so i have a variable called "paragraph" that tells me in which paragraph i am currently.
Now basically i need to scan this text file and search for some sequences of words in it but the newline character is the worst enemy here, for example if i have the string = "dummy text" and i'm looking into this:
"random questions about files with a dummy
text and strings
hey look a new paragraph here"
As you can see there is a newline between dummy and text so reading the file line by line doesn't work. So i was wondering to load directly the entire paragraph to a string so this way i can even remove punctuation and stuff more easly and check directly if those sequences of words are contained in it.
All this must be done without libraries.
However my piece of code of paragraph counter works while the file is being read, so if uploading a whole paragraph in a string is possible i should basically use something like "".join until the paragraph increases by 1 because we're on the next paragraph? Any idea?
This should do the trick. It is very short and elegant:
with open('dummy text.txt') as file:
data = file.read().replace('\n', '')
print(data)#prints out the file
The output is:
"random questions about files with a dummy text and strings hey look a new paragraph here"
I think you do not need to think it in a difficult way. Here is a very commonly used pattern for this kind of problems.
paragraphs = []
lines = []
for line in open('text.txt'):
if not line.strip(): # empty line
if lines:
paragraphs.append("".join(lines))
lines = []
else:
lines.append(line)
if lines:
paragraphs.append("".join(lines))
If a stripped line is empty, you encounter the second \n and it means that you have to join previous lines to a paragraph.
If you encounter the 3rd \n, you must not join again so remove your previous lines (lines = []). In this way, you will not join the same paragraph again.
To check the last line, try this pattern.
f = open('text.txt')
line0 = f.readline()
while True:
# do what you have to do with the previous line, `line0`
line = f.readline()
if not line: # `line0` was the last line
# do what you have to do with the last line
break
line0 = line
You can strip the newline character. Here is an example from a different problem.
data = open('resources.txt', 'r')
book_list = []
for line in data:
new_line = line.rstrip('\n')
book_list.append(new_line)

Matching a simple string with regex not working?

I have a large txt-file and want to extract all strings with these patterns:
/m/meet_the_crr
/m/commune
/m/hann_2
Here is what I tried:
import re
with open("testfile.txt", "r") as text_file:
contents = text_file.read().replace("\n", "")
print(re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents))
The result I get is a simple "None". What am I doing wrong here?
You need to not remove lineends and use the re.MULTILINE flag so you get multiple results from a bigger text returned:
# write a demo file
with open("t.txt","w") as f:
f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")
Program:
import re
regex = r"^\/m\/[a-zA-Z0-9_-]+$"
with open("t.txt","r") as f:
contents = f.read()
found_all = re.findall(regex,contents,re.M)
print(found_all)
print("-")
print(open("t.txt").read())
Output:
['/m/meet_the_crr', '/m/commune', '/m/hann_2']
Filecontent:
/m/meet_the_crr
/m/commune
/m/hann_2
# your text looks like this after .read().replace("\n","")
/m/meet_the_crr/m/commune/m/hann_2
This is about what Wiktor Stribiżew did tell you in his comment - although he suggested to use a better pattern as well: r'^/m/[\w-]+$'
There is nothing logically wrong with your code, and in fact your pattern will match the inputs you describe:
result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
print(result.groups()) # this line is reached, as there is a match
Since you did not specify any capture groups, you will see () being printed to the console. You could capture the entire input, and then it would be available, e.g.
result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
print(result.groups(1)[0])
/m/meet_the_crr
You are reading a whole file into a variable (into memory) using .read(). With .replace("\n", ""), you re,ove all newlines in the string. The re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) tries to match the string that entirely matches the \/m\/[a-zA-Z0-9_-]+ pattern, and it is impossible after all the previous manipulations.
There are at least two ways out. Either remove .replace("\n", "") (to prevent newline removal) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (re.M option will enable matching whole lines rather than the whole text), or read the file line by line and use your re.match version to check each line for a match, and if it matches add to the final list.
Example:
import re
with open("testfile.txt", "r") as text_file:
contents = text_file.read()
print(re.findall(r'^/m/[\w-]+$', contents, re.M))
Or
import re
with open("testfile.txt", "r") as text_file:
for line in text_file:
if re.match(r'/m/[\w-]+\s*$', line):
print(line.rstrip())
Note I used \w to make the pattern somewhat shorter, but if you are working in Python 3 and only want to match ASCII letters and digits, use also re.ASCII option.
Also, / is not a special char in Python regex patterns, there is no need escaping it.

List the first words per line from a text file in Python

I need to select the first word on each line and make a list from them from a text file:
I would copy the text but it's the formatting is quite screwed up. will try
All the other text is unnecessary.
I have tried
string=[]
for line in f:
String.append(line.split(None, 1)[0]) # add only first word
from another solution, but it keeps returning a "Index out of bounds" error.
I can get the first word from the first line using string=text.partition(' ')[0]
but I do not know how to repeat this for the other lines.
I am still new to python and to the site, I hope my formatting is bearable! (when opened, I encode the text to accept symbols, like so
wikitxt=open('racinesPrefixesSuffixes.txt', 'r', encoding='utf-8')
could this be the issue?)
The reason it's raising an IndexError is because the specific line is empty.
You can do this:
words = []
for line in f:
if line.strip():
words.append(line.split(maxsplit=1)[0])
Here line.strip() is checking if the line consists of only whitespace. If it does only consist of whitespace, it will simply skip the line.
Or, if you like list comprehension:
words = [line.split(maxsplit=1)[0] for line in f if line.strip()]

Searching a text file and grabbing all lines that do not include ## in python

I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
clean_text = []
for line in raw_text:
if not line.startswith("##"):
clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
if not line.startswith("##"):
clean_text.append(line.rstrip("\r\n"))
You would be left with a list object that contains one line of required text per element. You could split this into individual words using string.split() which would give you a nested list per original list element which you could easily index (assuming your text has whitespaces of course).
clean_text[4][7]
would return the 5th line, 8th word.
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = [ ]
with open(.txt, "r") as f: # .txt = file, "r" = read
for line in f:
if line[:2] != "##": #Read until the second character
listoflines.append(line)
print listoflines
If you're feeling brave, you can also do the following, CREDITS GO TO ALEX THORNTON:
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially teaching the .startswith function, but I think this is the more pythonic way and also has the advantage of automatically closing the file as soon as you're done with it.

Deleting a specific word from a file in python

I am quite new to python and have just started importing text files. I have a text file which contains a list of words, I want to be able to enter a word and this word to be deleted from the text file. Can anyone explain how I can do this?
text_file=open('FILE.txt', 'r')
ListText = text_file.read().split(',')
DeletedWord=input('Enter the word you would like to delete:')
NewList=(ListText.remove(DeletedWord))
I have this so far which takes the file and imports it into a list, I can then delete a word from the new list but want to delete the word also from the text file.
Here's what I would recommend since its fairly simple and I don't think you're concerned with performance.:
f = open("file.txt",'r')
lines = f.readlines()
f.close()
excludedWord = "whatever you want to get rid of"
newLines = []
for line in lines:
newLines.append(' '.join([word for word in line.split() if word != excludedWord]))
f = open("file.txt", 'w')
for line in lines:
f.write("{}\n".format(line))
f.close()
This allows for a line to have multiple words on it, but it will work just as well if there is only one word per line
In response to the updated question:
You cannot directly edit the file (or at least I dont know how), but must instead get all the contents in Python, edit them, and then re-write the file with the altered contents
Another thing to note, lst.remove(item) will throw out the first instance of item in lst, and only the first one. So the second instance of item will be safe from .remove(). This is why my solution uses a list comprehension to exclude all instances of excludedWord from the list. If you really want to use .remove() you can do something like this:
while excludedWord in lst:
lst.remove(excludedWord)
But I would discourage this in favor for the equivalent list comprehension
We can replace strings in files (some imports needed;)):
import os
import sys
import fileinput
for line in fileinput.input('file.txt', inplace=1):
sys.stdout.write(line.replace('old_string', 'new_string'))
Find this (maybe) here: http://effbot.org/librarybook/fileinput.htm
If 'new_string' change to '', then this would be the same as to delete 'old_string'.
So I was trying something similar, here are some points to people whom might end up reading this thread. The only way you can replace the modified contents is by opening the same file in "w" mode. Then python just overwrites the existing file.
I tried this using "re" and sub():
import re
f = open("inputfile.txt", "rt")
inputfilecontents = f.read()
newline = re.sub("trial","",inputfilecontents)
f = open("inputfile.txt","w")
f.write(newline)
#Wnnmaw your code is a little bit wrong there it should go like this
f = open("file.txt",'r')
lines = f.readlines()
f.close()
excludedWord = "whatever you want to get rid of"
newLines = []
for line in newLines:
newLines.append(' '.join([word for word in line.split() if word != excludedWord]))
f = open("file.txt", 'w')
for line in lines:
f.write("{}\n".format(line))
f.close()

Categories

Resources