My Python Regex code isn't finding consecutive sets of characters - python

I'm trying to code a program to find if there are three consecutive sets of double letters in a .txt file (e.g. bookkeeper). So far I have:
import re
text = open(r'C:\Users\Jimbo.Wimbo\Desktop\List.txt')
for line in text:
    x = re.finditer(r'((\w)\2)+', line)
    if True:
        print("Yes")
    else:
        print("No")
List.txt has 5 words. There is one word with three consecutive sets of double letters right at the end, but it prints 5 "Yes"'s. What can I do to fix it using re and os?

You don't need re.finditer(), you can just use re.search().
Your regexp is wrong, it will match at least 1 set of duplicate characters, not 3.
if True: doesn't do anything useful. It doesn't mean "if the last assignment was a truthy value". You need to test the result of the regexp search.
Use any() to test if the condition matches any line in the file. Your code will print Yes or No for each line in the file.
if any(re.search(r'((\w)\2){3}', line) for line in text):
    print('Yes')
else:
    print('No')
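To see how the {3} quantifier behaves, here is a minimal sketch that runs the pattern over a few words. The word list is a hypothetical stand-in for List.txt, and has_triple_double is an illustrative helper name, not something from the question.

```python
import re

# (\w)\2 is one doubled character; repeating the outer group {3} times
# requires three doubled characters directly after one another.
TRIPLE_DOUBLE = re.compile(r'((\w)\2){3}')

def has_triple_double(word):
    """Return True if word contains three doubled characters back to back."""
    return TRIPLE_DOUBLE.search(word) is not None

# Hypothetical sample words standing in for the contents of List.txt.
sample_words = ["balloon", "committee", "Mississippi", "letter", "bookkeeper"]
matches = [w for w in sample_words if has_triple_double(w)]
print(matches)  # ['bookkeeper']
```

Note that \w also matches digits and underscores, so a string like "112233" would count as well; use [a-zA-Z] in place of \w if only letters should qualify.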

I think your regex is incorrect.
A good way to check your regex is to use an online regex checker, and you can test your regex against any number of strings you provide.
Here is one possible solution to your query:
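import re
text = open(r'C:\Users\Jimbo.Wimbo\Desktop\List.txt')
for line in text:
    word = line.strip()  # drop the trailing newline before printing
    # Counts doubled characters anywhere in the word, not only consecutive ones
    x = len(re.findall(r'(.)\1', word))
    if x == 3:
        print(f"Found a word with 3 duplicate letters : {word}")
    else:
        print(f"Word: {word}, Duplicate letters : {x}")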
Hope this helps.

Related

substring with a small change

I'm trying to solve a problem where I'm given a string and have to count how many times a certain word, like 'code', appears in it. The program should also count any variant where the 'd' changes, like 'coze', but something like 'coz' doesn't count. This is what I made:
def count(word):
    count=0
    for i in range(len(word)):
        lo=word[i:i+4]
        if lo=='co': # this is what gives me trouble
            count+=1
    return count
Test if the first two characters match co and the 4th character matches e.
def count(word):
    count=0
    for i in range(len(word)-3):
        if word[i:i+2] == 'co' and word[i+3] == 'e':
            count+=1
    return count
The loop only goes up to len(word)-3 so that word[i+3] won't go out of range.
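As a quick check of the slicing approach, here is a small runnable sketch. The name count_variants and the sample string are only illustrative; the idea is that each 4-character window must start with 'co' and end with 'e', with any character in between.

```python
def count_variants(text):
    """Count 4-character windows of the form 'co?e' (any third character)."""
    total = 0
    # Stop 3 short of the end so text[i + 3] is always a valid index.
    for i in range(len(text) - 3):
        if text[i:i + 2] == 'co' and text[i + 3] == 'e':
            total += 1
    return total

print(count_variants("code coze coz"))  # 2
```

The trailing 'coz' is never examined as a full window because the loop stops before it, which is why it is not counted.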
You could use regex for this, through the re module.
import re
string = 'this is a string containing the words code, coze, and coz'
re.findall(r'co.e', string)
['code', 'coze']
From there you could write a function such as:
def count(string, word):
    return len(re.findall(word, string))
Regex is the answer to your question, as mentioned above, but you need a more refined pattern. Since you are looking for whole words, you need to anchor the pattern with word boundaries. Your pattern should be something like this:
pattern = r'\bco.e\b'
This way your search will not match inside words like testcodetest or cozetest; it only matches standalone words such as code, coze, or coke, with no leading or trailing characters.
If you are going to test multiple times, it's better to use a compiled pattern, so the pattern is compiled once and reused.
In [1]: import re
In [2]: string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
In [3]: pattern = re.compile(r'\bco.e\b')
In [4]: pattern.findall(string)
Out[4]: ['code', 'coze']
Hope that helps.

Matching exact strings in python

How can I match exact strings in Python which can dynamically catch/ignore the following cases?
For example, I want to look up value2 and get the output "Y" from a file that is formatted like this:
...
--value1 X
---value2 Y
----value3 Z
....
How can I search for the exact match "value2" while ignoring the preceding "---"? These characters prevent exact string matching with the "==" operator when searching each line.
You could strip leading dashes, then split the result to get the first word without the dashes.
Let's say you iterate over the lines:
for line in lines:
    words = line.lstrip("-").split()
    if words and words[0] == "value2":
        print("found")
Regex can help too, with a word boundary on the right:
if re.match(r"^-*value2\b",line):
You can remove the extra characters at the start of a string s using s.lstrip('-') before using an exact match. There are other ways to handle this, but this is the fastest and strictest way without using regular expressions.
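The strip-then-compare idea can be sketched as a small lookup helper. The function name lookup and the sample list are assumptions made for illustration; the sample mirrors the file layout shown in the question.

```python
def lookup(lines, key):
    """Return the value that follows `key` on its line, or None if absent."""
    for line in lines:
        # Drop the leading dashes, then split on whitespace.
        parts = line.lstrip("-").split()
        # Exact comparison on the first token, so "value2" != "value22".
        if len(parts) >= 2 and parts[0] == key:
            return parts[1]
    return None

sample = ["...", "--value1 X", "---value2 Y", "----value3 Z", "...."]
print(lookup(sample, "value2"))  # Y
```

Because the comparison uses == on the first token, partial names such as "value" do not match, unlike a plain substring test.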
Can you guarantee that all of the valid words will have a dash before them and a space afterward? If so, you could write that like:
for line in lines:
    if '-value2 ' in line:
        print(line.split()[1])
The simplest way that I know is:
for line in lines:
    if 'value2' in line:
        ...
Another way (if you need to know position):
for line in lines:
    pos = line.find('value2')
    if pos >= 0:
        ...
More complex things can be done as well, such as a regular expression, if necessary, but that depends on what validation you need. The two ways above are, I feel, the simplest.
UPDATE (addressing comment):
(Trying to keep it simple, this requires a space after the number)
for line in lines:
    for token in line.split():
        if 'value2' in token:
            ...

How to remove values in list which contain alphabetical characters?

I am reading a .dat file and the first few lines are just metadata before it gets to the actual data. A shortened example of the .dat file is below.
&SRS
SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,
fc.fcY=0.9000
&END
energy rc ai2
8945.016 301.32 6.7959
8955.497 301.18 6.8382
8955.989 301.18 6.8407
8956.990 301.16 6.8469
Or as the list:
[' &SRS\n', ' SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,\n', 'fc.fcY=0.9000\n', '\n', ' &END\n', 'energy\trc\tai2\n', '8945.016\t301.32\t6.7959\n', '8955.497\t301.18\t6.8382\n', '8955.989\t301.18\t6.8407\n', '8956.990\t301.16\t6.8469\n']
I tried this previously:
def import_absorptionscan(file_path,start,end):
    for i in range(start,end):
        lines=[]
        f=open(file_path+str(i)+'.dat', 'r')
        for line in f:
            lines.append(line)
        for line in lines:
            for c in line:
                if c.isalpha():
                    lines.remove(line)
        print lines
But I get this error: ValueError: list.remove(x): x not in list
I started looking through Stack Overflow, but most of what came up was how to strip alphabetical characters from a string, so I made this question.
This produces a list of strings, with each string making up one line in the file. I want to remove any string which contains any alphabet characters as this should remove all the metadata and leave just the data. Any help would be appreciated thank you.
I have a suspicion you will want a more robust rule than "does the string contain a letter?", but you can use a regular expression to check:
re.search("[a-zA-Z]", line)
You'll probably want to take a look at the regular expression docs.
Additionally, you can use any() to check for letters. Inside your loop over the lines, add:
if any(word.isalpha() for word in line.split()):
Notice that isalpha() is only true when every character of a word is a letter, so this treats "ver9" as if it were a number. If this is a problem, just replace it with:
line_is_meta = False
for word in line.split():
    if any(letter.isalpha() for letter in word):
        line_is_meta = True
        break
if not line_is_meta:
    lines.append(line)
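The ValueError in the question most likely comes from calling lines.remove(line) once for every letter in a line: the first call removes the line and the next call can no longer find it. Building a new list avoids both that and the pitfalls of modifying a list while iterating over it. Below is a minimal sketch using the sample list from the question; the helper name keep_numeric_lines is an assumption, and the extra blank-line check is an addition that the question did not ask for.

```python
def keep_numeric_lines(lines):
    """Return the lines that are non-blank and contain no alphabetic characters."""
    return [
        line for line in lines
        if line.strip() and not any(c.isalpha() for c in line)
    ]

raw = [' &SRS\n', ' SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,\n',
       'fc.fcY=0.9000\n', '\n', ' &END\n', 'energy\trc\tai2\n',
       '8945.016\t301.32\t6.7959\n', '8955.497\t301.18\t6.8382\n',
       '8955.989\t301.18\t6.8407\n', '8956.990\t301.16\t6.8469\n']

data = keep_numeric_lines(raw)
print(len(data))  # 4
```

One caveat: a value written in scientific notation, such as 1.0e-5, contains a letter and would be dropped by this rule, so a stricter check may be needed for such files.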

searching for specific words in a text python

I'm trying to make a function that will take an argument that's a word (or set of characters) as well as the speech, and return a boolean expression saying whether the word is there or not, as a function.
speech2 = open("Obama_DNC.txt", "r")
speech2_words = speech2.read()
def search(word):
    if word in speech2_words:
        if len(word) == len(word in speech2_words):
            print(True)
        elif len(word) != len(word in speech2_words):
            print(False)
    elif not word in speech2_words:
        print(False)
word = input("search?")
search(word)
I want to make it so that the word that the program searches for in the text matches exactly as the input and that are not a part of another word ("America" in "American"). I thought of using the len() function but it doesn't seem to work and I am stuck. If anyone helps me figure this out that would be very helpful. Thank you in advance
You can also use mmap; see the mmap documentation for more information.
mmap in Python 3 is treated differently than in Python 2.7.
The code below is for 2.7; it looks for a string in the text file.
#!/usr/bin/python
import mmap
f = open('Obama_DNC.txt')
s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
if s.find('blabla') != -1:
    print 'true'
Why mmap doesn't work with large files.
One option may be to use the findall() function in the re module, which can be used to find all occurrences of a specific string.
Optionally, you could include list.count() to check how many times the searched string occurs in the text:
import re
def search(word):
    # re.escape keeps characters such as '.' in the input from acting as regex syntax
    found = re.findall('\\b' + re.escape(word) + '\\b', speech2_words)
    if found:
        print(True, '{word} occurs {counts} time'.format(word=word, counts=found.count(word)))
    else:
        print(False)
output:
search?America
(True, 'America occurs 28 time')
search?American
(True, 'American occurs 12 time')
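If only a True/False answer is needed, as the question asks, a whole-word test can be sketched like this in Python 3. The function name contains_word and the sample sentence are hypothetical; the sentence stands in for the contents of the speech file.

```python
import re

def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word."""
    # \b marks a word boundary; re.escape treats the input literally.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

# Hypothetical stand-in for the contents of the speech file.
speech = "The American people believe in America."
print(contains_word(speech, "America"))  # True
print(contains_word(speech, "Americ"))   # False
```

Here "America" is found because of the standalone word at the end of the sentence, not because it is a prefix of "American".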

Simple Filter Python script for Text

I am trying to create what must be a simple filter function which runs a regex against a text file and returns all words containing that particular regex.
So, for example, if I wanted to find all words that contained "abc", and I had the list: abcde, bce, xyz and zyxabc, the script would return abcde and zyxabc.
I have a script below; however, I am not sure if it is just the regex I am failing at or not. It just returns abc twice rather than the full word. Thanks.
import re
text = open("test.txt", "r")
regex = re.compile(r'(abc)')
for line in text:
    target = regex.findall(line)
    for word in target:
        print word
I think you don't need regex for such a task. You can simply split your lines to create a list of words, then loop over the words and use the in operator:
with open("test.txt") as f:
    for line in f:
        for w in line.split():
            if 'abc' in w:
                print w
Your methodology is correct; however, you can change your regex to r'.*abc.*', that is:
regex = re.compile(r'.*abc.*')
This will match all the lines with abc in them. The wildcards .* will match all the other characters in the line.
A small demo with that particular line changed would print
abcde
zyxabc
Note: as Kasra mentions, it is better to use the in operator in such cases.
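If several words can share one line, a pattern that expands around abc to the edges of the surrounding word may be closer to what the question asks for. This is a sketch under that assumption, using \w* in place of .* (which is a swap from the pattern suggested above); the sample line reuses the words from the question.

```python
import re

# \w* on both sides extends the match to the rest of the surrounding word.
WORD_WITH_ABC = re.compile(r'\w*abc\w*')

line = "abcde, bce, xyz and zyxabc"
print(WORD_WITH_ABC.findall(line))  # ['abcde', 'zyxabc']
```

Since the pattern has no capturing group, findall returns the full matched words rather than just the abc part.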
