Python regex string match from file

I have a text file that looks like this:
alpha alphabet alphameric
I would like to match only the first string, 'alpha', and nothing else.
The following code attempts to match just the alpha string and report its line number:
findWord = re.findall('\\ba\\b', "alpha")
with open(file) as myFile:
    for num, line in enumerate(myFile, 1):
        if findWord in line:
            print 'Found at line: ', num
However, I get the following error:
TypeError: 'in <string>' requires string as left operand, not list

Issues in your code
re.findall('\\ba\\b', "alpha") returns a list of matches, but if findWord in line then uses that list where a string is required. That is the cause of the error.
Also, findWord = re.findall('\\ba\\b', "alpha") searches for the standalone word a inside the string "alpha". There is no such match, so the list is empty.
Try this:
import re
#findWord = re.findall('\\ba\\b', "alpha")
#print findWord
with open("data.txt") as myFile:
    for num, line in enumerate(myFile):
        if re.findall('\\balpha\\b', line):
            print 'Found at line: ', num+1

You may modify your code a bit:
with open(file, 'r') as myFile:
    for num, line in enumerate(myFile, 1):
        if 'alpha' in line.split():
            print 'Found at line', num
Output:
Found at line 1
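
The two approaches above (a word-boundary regex and a whitespace split) can be combined into one small helper. This is only a sketch in Python 3 syntax; the function name find_word_lines and the sample lines are made up for illustration and are not part of the original question.

```python
import re

def find_word_lines(lines, word):
    """Return 1-based line numbers where `word` occurs as a whole word."""
    # \b marks a word boundary, so "alpha" does not match inside "alphabet"
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [num for num, line in enumerate(lines, 1) if pattern.search(line)]

# Hypothetical sample data standing in for the lines of a file
sample = ["alphabet soup", "alpha alphabet alphameric", "alphameric only"]
print(find_word_lines(sample, "alpha"))
```

Any iterable of strings works as the first argument, so an open file object can be passed directly instead of the sample list.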

You can try this:
import re
s = "alpha alphabet alphameric"
data = re.findall(r"alpha(?=\s)", s)[0]
Output:
"alpha"

Related

Swap quoted word in random position with last word in Python

I have a txt file with lines of text like this, and I want to swap the word in
quotations with the last word, which is separated from the sentence by a tab.
It looks like this:
This "is" a person are
She was not "here" right
"The" pencil is not sharpened a
Desired output:
This "are" a person is
She was not "right" here
Some ideas:
#1: Use Numpy
Separate all the words by whitespace with numpy -> ['This','"is"','a','person',\t,'are']
Problems:
How do I tell Python the position of the quoted word?
How do I convert the list back to normal text? Concatenate everything?
#2: Use Regex
Use regex to find the word in "":
with open('readme.txt','r') as x:
    x = x.readlines()
swap = x[-1]
re.findall(r'"(\w+)"', swap)
Problems:
I don't know how to read the txt file with regex. Most examples I see here assign the entire sentence to a variable.
Is it something like this?
with open('readme.txt') as f:
    lines = f.readlines()
lines.findall(....)
Thanks guys

You don't really need re for something this trivial.
Assuming you want to rewrite the file:
with open('foo.txt', 'r+') as txt:
    lines = txt.readlines()
    for k, line in enumerate(lines):
        words = line.split()
        for i, word in enumerate(words[:-1]):
            if word[0] == '"' and word[-1] == '"':
                words[i] = f'"{words[-1]}"'
                words[-1] = word[1:-1]
                break
        lines[k] = ' '.join(words[:-1]) + f'\t{words[-1]}'
    txt.seek(0)
    print(*lines, sep='\n', file=txt)
    txt.truncate()

This is my solution:
import re
regex = r'"[\s\S]*"'
file1 = open('test.txt', 'r')
while True:
    # Get next line from file
    line = file1.readline()
    # An empty string means the end of the file is reached
    if not line:
        break
    get_tab = line.strip().split('\t')[1]
    print("original: {} mod ----> {}".format(line.strip(), re.sub(regex, get_tab, line.strip().split('\t')[0])))
file1.close()
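
For comparison, here is a minimal sketch that swaps by position rather than by text replacement, so a quoted word such as "is" cannot accidentally be replaced inside another word like "This". The helper name swap_quoted_with_last is hypothetical, and the sketch assumes each line holds at most one quoted word and that the last word follows a tab.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')

def swap_quoted_with_last(line):
    """Swap the first quoted word with the tab-separated last field."""
    head, sep, last = line.rstrip("\n").rpartition("\t")
    if not sep:
        return line.rstrip("\n")  # no tab on this line: leave it unchanged
    match = QUOTED.search(head)
    if match is None:
        return head + "\t" + last  # nothing quoted: leave it unchanged
    # Splice by index so only the text between the quotes is replaced
    new_head = head[:match.start(1)] + last + head[match.end(1):]
    return new_head + "\t" + match.group(1)

print(swap_quoted_with_last('This "is" a person\tare'))
```

Applied to every line of a file, the returned strings can be written back out with a trailing newline each.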

Try:
import re
pat = re.compile(r'"([^"]*)"(.*\t)(.*)')
with open("your_file.txt", "r") as f_in:
    for line in f_in:
        print(pat.sub(r'"\3"\2\1', line.rstrip()))
Prints:
This "are" a person is
She was not "right" here
"a" pencil is not sharpened The

I guess this is also a way to solve it:
Input readme.txt contents:
This "is" a person are
She was not "here" right
"The" pencil is not sharpened a
Code:
import re
changed_text = []
with open('readme.txt') as x:
    for line in x:
        splitted_text = line.strip().split("\t") # ['This "is" a person', 'are'] etc.
        if re.search(r'\".*\"', line.strip()): # If a quote is found
            qouted_text = re.search(r'\"(.*)\"', line.strip()).group(1)
            changed_text.append(splitted_text[0].replace(qouted_text, splitted_text[1])+"\t"+qouted_text)
with open('readme.txt.modified', 'w') as x:
    for line in changed_text:
        print(line)
        x.write(line+"\n")
Result (readme.txt.modified):
Thare "are" a person is
She was not "right" here
"a" pencil is not sharpened The

Why replacing strings from dictionary produce empty file

I'm trying to replace some strings with other strings in a text file,
but the code produces an empty file (file size is 0).
What am I missing?
emotion_list = {":-)" : "happy-similey",
                ":-(": "sad-similey"}
for line in fileinput.input(file_name, inplace=True):
    if not line:
        continue
    for f_key, f_value in emotion_list.items():
        if f_key in line:
            line = line.replace(f_key, f_value)

You are missing the print statement that sends each replaced line back to your file:
for line in fileinput.input(file_name, inplace=True):
    if not line:
        continue
    for f_key, f_value in emotion_list.items():
        if f_key in line:
            line = line.replace(f_key, f_value)
    print(line, end="") # line already ends with a newline, so don't add another

In your code you are matching a whole file line against a word. Split the line on spaces to get the words instead. (If you match a word against an entire line, you have to pass the number of occurrences to replace, and that number is dynamic: in a real scenario you don't know how many occurrences the file will contain.)
emotion_list = {":-)" : "happy-similey",
                ":-(": "sad-similey"}
file="I am not really :-) but I am not :-( too "
for line in file.split():
    for f_key, f_value in emotion_list.items():
        if f_key == line:
            file=file.replace(line, f_value,1)
print(file)
Output:
I am not really happy-similey but I am not sad-similey too

This is basically the problem you are facing:
lst = ["abc", "acd", "ade"]
for x in lst:
    x = x.replace("a", "x")
print(lst) # ["abc", "acd", "ade"]
Instead, you should replace the ith element of the list:
lst = ["abc", "acd", "ade"]
for i, x in enumerate(lst):
    lst[i] = x.replace("a", "x")
print(lst) # ['xbc', 'xcd', 'xde']
This happens because strings are immutable in Python: replace() returns a new string, and rebinding the loop variable x does not change the list.
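
To see the in-place rewrite end to end, the following sketch runs the fileinput approach against a throwaway temporary file. The helper name replace_in_file and the demo text are assumptions made for illustration; the dictionary values are copied from the question.

```python
import fileinput
import os
import tempfile

emotion_list = {":-)": "happy-similey", ":-(": "sad-similey"}

def replace_in_file(path, mapping):
    """Rewrite `path` in place, applying every replacement in `mapping`."""
    with fileinput.input(files=[path], inplace=True) as stream:
        for line in stream:
            for old, new in mapping.items():
                line = line.replace(old, new)
            # With inplace=True, stdout is redirected into the file,
            # so every line must be printed or it is lost.
            print(line, end="")

# Demo on a temporary file (hypothetical content)
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("I am :-)\nnot :-(\n")
replace_in_file(demo_path, emotion_list)
with open(demo_path) as handle:
    result = handle.read()
os.remove(demo_path)
print(result)
```

Leaving out the print call reproduces the empty file described in the question, because nothing is written back while the original content is being replaced.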

Read only the numbers from a txt file python

I have a text file that contains some words and a number written with a decimal point. For example:
hello!
54.123
Now I only want the number 54.123 to be extracted and converted so that the outcome is 54123.
The code I tried is:
import re
exp = re.compile(r'^[\+]?[0-9]')
my_list = []
with open('file.txt') as f:
    lines = f.readlines()
    for line in lines:
        if re.match(exp, line.strip()):
            my_list.append(int(line.strip()))
#convert to a string
listToStr = ' '.join([str(elem) for elem in my_list])
print(listToStr)
But this returns the error: ValueError: invalid literal for int() with base 10: '54.123'
Does anyone know a solution for this?

You can try to convert the current line to a float. If the line does not contain a valid float, float() raises a ValueError, which you can catch and ignore. If no exception is raised, split the line at the dot, join the two parts, convert the result to int, and append it to the list.
my_list = []
with open('file.txt') as f:
    lines = f.readlines()
    for line in lines:
        try:
            tmp = float(line)  # raises ValueError if the line is not a number
            num = int(''.join(line.split(".")))
            my_list.append(num)
        except ValueError:
            pass
#convert to a string
listToStr = ' '.join([str(elem) for elem in my_list])
print(listToStr)

You can check whether a given line is a string representing a number using the str.isdigit() method.
From what I can tell, you just need to check whether the line contains a digit, because isdigit() returns True only when every character is a digit (a float contains ".", which is not a digit, so it returns False).
For example:
def numCheck(string):
    # Checks if the input string contains at least one digit
    return any(i.isdigit() for i in string)
string = '54.123'
print(numCheck(string)) # True
string = 'hello'
print(numCheck(string)) # False
Note: if your data contains things like 123ab56 then this won't be good for you.
To convert 54.123 to 54123 you could use the str.replace(old, new) method.
For example:
string = '54.123'
new_string = string.replace('.', '') # replace . with nothing
print(new_string) # 54123

This may help. With it I am now getting the numbers from the file. I guess you meant to use split rather than strip.
import re
exp = re.compile(r'[0-9]')
my_list = []
with open('file.txt') as f:
    lines = f.readlines()
    for line in lines:
        for numbers in line.split():
            if re.match(exp, numbers):
                my_list.append(numbers)
#convert to a string
listToStr = ' '.join([str(elem) for elem in my_list])
print(listToStr)
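
A stricter variant, sketched here under the assumption that each number sits alone on its line, is to accept only lines that are entirely a decimal number and then drop the point. The function name digits_from_lines and the sample lines are hypothetical.

```python
import re

# An optional sign, digits, and an optional fractional part; nothing else
DECIMAL = re.compile(r"[+-]?\d+(?:\.\d+)?")

def digits_from_lines(lines):
    """Collect numeric lines as ints with the decimal point removed."""
    found = []
    for line in lines:
        text = line.strip()
        # fullmatch rejects mixed content such as "123ab56" or "hello!"
        if DECIMAL.fullmatch(text):
            found.append(int(text.replace(".", "")))
    return found

print(digits_from_lines(["hello!\n", "54.123\n", "42\n", "123ab56\n"]))
```

Because the whole line has to match, mixed tokens are skipped instead of raising a ValueError inside int().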

filtering lines based on the presence of 2 short sequences in python

I have a text file like this example:
>chr9:128683-128744
GGATTTCTTCTTAGTTTGGATCCATTGCTGGTGAGCTAGTGGGATTTTTTGGGGGGTGTTA
>chr16:134222-134283
AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG
>chr16:134226-134287
GGAAGCAGCGTGGGAATCACAGAATGGACGGCCGATTAAAGGCTTTGCTTGGCCTGGATTT
>chr1:134723-134784
AAGTGATTCACCCTGCCTTTCCGACCTTCCCCAGAACAGAACACGTTGATCGTGGGCGATA
>chr16:135770-135831
GCCTGAGCAAAGGGCCTGCCCAGACAAGATTTTTTAATTGTTTAAAAACCGAATAAATGTT
This file is divided into parts, and every part has 2 rows. The 1st row starts with > (this row is called the ID) and the 2nd row is the sequence of letters.
I want to search for 2 short motifs (AATAAA and GGAC) in the sequence of letters, and if a sequence contains both motifs, I want to get the ID and sequence of that part.
The point is that AATAAA must come first and GGAC must come after it. There is a distance between them, and this distance can be 2 letters or more.
Expected output:
>chr16:134222-134283
AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG
I am trying to do that in Python using the following code:
infile = open('infile.txt', 'r')
mot1 = 'AATAAA'
mot2 = 'GGAC'
new = []
for line in range(len(infile)):
    if not infile[line].startswith('>'):
        for match in pattern.finder(mot1) and pattern.finder(mot2):
            new.append(infile[line-1])
with open('outfile.txt', "w") as f:
    for item in new:
        f.write("%s\n" % item)
This code does not return what I want. Do you know how to fix it?

You can group the ID with sequence, and then utilize re.findall:
import re
data = [i.strip('\n') for i in open('filename.txt')]
new_data = [[data[i], data[i+1]] for i in range(0, len(data), 2)]
final_result = [[a, b] for a, b in new_data if re.findall(r'AATAAA\w{2,}GGAC', b)]
Output:
[['>chr16:134222-134283', 'AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG']]

I'm not sure I understood the requirement that the distance can be 2 letters or more, or whether it has to be checked, but the following code gives the desired output for your example. Note that it only tests that both motifs are present, not their order or distance:
mot1 = 'AATAAA'
mot2 = 'GGAC'
with open('infile.txt', 'r') as inp:
    last_id = None
    for line in inp:
        if line.startswith('>'):
            last_id = line
        else:
            if mot1 in line and mot2 in line:
                print(last_id)
                print(line)
You can redirect the output to a file if you want.

You can use a regex and a dictionary comprehension:
import re
with open('test.txt', 'r') as f:
    lines = f.readlines()
data = dict(zip(lines[::2],lines[1::2]))
{k.strip(): v.strip() for k,v in data.items() if re.findall(r'AATAAA\w{2,}GGAC', v)}
Returns:
{'>chr16:134222-134283': 'AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG'}

If mot1 is found, you can slice off everything up to and including it (plus the 2-letter minimum gap) and look for mot2 in the rest. Here's a way to do it:
infile = open('infile.txt', 'r')
text = [line.strip() for line in infile]
infile.close()
mot1 = 'AATAAA'
mot2 = 'GGAC'
# pair each ID line (even index) with the sequence line that follows it
check = [(text[x], text[x+1]) for x in range(0, len(text) - 1, 2)]
result = [(x + '\n' + y) for (x, y) in check if mot1 in y and mot2 in y[(y.find(mot1)+len(mot1)+2):]]
with open('outfile.txt', "w") as f:
    for item in result:
        f.write("%s\n" % item)
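
The order and minimum-distance requirement can also be written as a single compiled pattern applied while streaming through the file. This is a rough sketch: the generator name filter_records, its parameters, and the short sample sequences are invented for illustration, and it assumes sequences contain only the letters A, C, G and T.

```python
import re

def filter_records(lines, first="AATAAA", second="GGAC", min_gap=2):
    """Yield (header, sequence) pairs where `second` follows `first`
    with at least `min_gap` letters in between."""
    pattern = re.compile(
        re.escape(first) + "[ACGT]{" + str(min_gap) + ",}" + re.escape(second)
    )
    header = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            header = line  # remember the ID for the sequence that follows
        elif header is not None and pattern.search(line):
            yield header, line

# Hypothetical records: correct order, wrong order, gap too short
sample = [
    ">a", "TTAATAAACCGGACTT",
    ">b", "GGACTTAATAAA",
    ">c", "AATAAACGGAC",
]
for header, sequence in filter_records(sample):
    print(header)
    print(sequence)
```

Since the function only iterates over its input, an open file object can be passed in place of the sample list without loading the whole file first.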

If the file is not too big, you can read it at once, and use re.findall():
import re
with open("infile.txt") as finp:
    data=finp.read()
with open('outfile.txt', "w") as f:
    for item in re.findall(r">.+?[\r\n\f][AGTC]*?AATAAA[AGTC]{2,}GGAC[AGTC]*", data):
        f.write(item+"\n")
"""
+? and *? mean non-greedy matching;
>.+?[\r\n\f] matches a line starting with '>' followed by any characters up to the end of the line;
[AGTC]*?AATAAA matches any number of A,G,T,C characters, followed by the AATAAA pattern;
[AGTC]{2,} matches two or more A,G,T,C characters;
GGAC matches the GGAC pattern;
[AGTC]* matches the empty string or any number of A,G,T,C characters.
"""

Reading file multiple ways in Python

I am trying to set up a system for running various statistics on a text file. In this endeavor I need to open a file in Python (v2.7.10) and read it both as lines, and as a string, for the statistical functions to work.
So far I have this:
import csv, json, re
from textstat.textstat import textstat
file = "Data/Test.txt"
data = open(file, "r")
string = data.read().replace('\n', '')
lines = 0
blanklines = 0
word_list = []
cf_dict = {}
word_dict = {}
punctuations = [",", ".", "!", "?", ";", ":"]
sentences = 0
This sets up the file and the preliminary variables. At this point, print textstat.syllable_count(string) returns a number. Further, I have:
for line in data:
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    word_list.extend(line.split())
    for char in line.lower():
        cf_dict[char] = cf_dict.get(char, 0) + 1
for word in word_list:
    lastchar = word[-1]
    if lastchar in punctuations:
        word = word.rstrip(lastchar)
    word = word.lower()
    word_dict[word] = word_dict.get(word, 0) + 1
for key in cf_dict.keys():
    if key in '.!?':
        sentences += cf_dict[key]
number_words = len(word_list)
num = float(number_words)
avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num
mcw = sorted([(v, k) for k, v in word_dict.items()], reverse=True)
print( "Total lines: %d" % lines )
print( "Blank lines: %d" % blanklines )
print( "Sentences: %d" % sentences )
print( "Words: %d" % number_words )
print('-' * 30)
print( "Average word length: %0.2f" % avg_wordsize )
print( "30 most common words: %s" % mcw[:30] )
But this fails: the line avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num raises ZeroDivisionError: float division by zero. However, if I comment out string = data.read().replace('\n', '') in the first piece of code, the second piece runs without problems and gives the expected output.
Basically, how do I set this up so that I can run the second piece of code on data, as well as textstat on string?

The call to data.read() leaves the file pointer at the end of the file, so there is nothing more to read at that point. You either have to close and reopen the file or, more simply, reset the pointer to the beginning using data.seek(0).

First, look at this line:
string = data.read().replace('\n', '')
You read from data once. Now the cursor is at the end of data.
Then look at this line:
for line in data:
You are trying to read it again, but there is nothing left in data because you are already at the end of it, so len(word_list) returns 0.
You then divide by it and get the error:
ZeroDivisionError: float division by zero.
When you comment the first line out, you read the file only once, which is valid, so the second portion of your code works.
Clear now?
So, what to do now?
Use data.seek(0) after data.read()
Demo:
>>> a = open('file.txt')
>>> a.read()
#output
>>> a.read()
#nothing
>>> a.seek(0)
>>> a.read()
#output again

Here is a simple fix. Replace the line for line in data: with:
data.seek(0)
for line in data.readlines():
    ...
It moves the pointer back to the beginning of the file and reads it again line by line.
While this should work, you may want to simplify the code and read the file only once. Something like:
with open(file, "r") as fin:
    lines = fin.readlines()
string = ''.join(lines).replace('\n', '')
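
The effect of the file pointer can be reproduced without touching the disk. The sketch below uses io.StringIO as a stand-in for an open text file; the helper name read_both_ways and the demo text are made up for illustration.

```python
import io

def read_both_ways(handle):
    """Return (text_without_newlines, list_of_lines) from one open handle."""
    text = handle.read().replace("\n", "")
    handle.seek(0)  # rewind; without this, readlines() returns an empty list
    lines = handle.readlines()
    return text, lines

# io.StringIO behaves like a text file opened for reading
demo = io.StringIO("first line\n\nthird line\n")
text, lines = read_both_ways(demo)
print(len(lines), repr(text))
```

Removing the seek(0) call leaves lines empty, which is the situation that led to the division by zero in the question.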
