Python error with lambda and alphanumeric words - python

I am really new to Python, and I am getting an error that I do not understand.
I have 3 lists of words. The lists contain numeric words, literal words, and alphanumeric words. Each list is saved in a txt file. Each file can contain words from the other lists or new words.
Now I would like to compare these lists and copy all words, without duplicates, into a new list, so that I end up with one big list containing all words but no duplicates.
This is my script:
file_a = raw_input("File 1?: ")
file_b = raw_input("File 2?: ")
file_c = raw_input("File_3?: ")
file_new = raw_input("Neue Datei: ")
def compare_files():
    with open(file_a, 'r') as a:
        with open(file_b, 'r') as b:
            with open(file_c, 'r') as c:
                with open(file_new, 'w') as new:
                    difference = set(a).symmetric_difference(b).symmetric_difference(c)
                    difference.discard('\n')
                    sortiert = sorted(difference, key=lambda item: (int(item.partition(' ')[0])
                                      if item[0].isdigit() else float('inf'), item))
                    for line in sortiert:
                        new.write(line)
k = compare_files()
When I run the script I get the following error message:
Traceback (most recent call last):
  File "TestProject1.py", line 19, in <module>
    k = compare_files()
  File "TestProject1.py", line 13, in compare_files
    sortiert = sorted(difference, key=lambda item: (int(item.partition(' ')[0])
  File "TestProject1.py", line 14, in <lambda>
    if item[0].isdigit() else float('inf'), item))
ValueError: invalid literal for int() with base 10: '12234thl\n'
Does anyone have an idea what is wrong in my script?
Thank you for your help :)

Calling partition on ' ' (or on any other fixed string, for that matter) will not extract the numeric part of the string unless you know which character immediately follows the number, which is very unlikely here.
You can instead use a regular expression to extract the leading numeric part of the string:
import re
p = re.compile(r'^\d+')
def compare_files():
    with open(file_a, 'r') as a, open(file_b, 'r') as b, \
            open(file_c, 'r') as c, open(file_new, 'w') as new:
        difference = set(a).symmetric_difference(b).symmetric_difference(c)
        difference.discard('\n')
        sortiert = sorted(difference,
                          key=lambda item: (int(p.match(item).group(0)) \
                                            if item[0].isdigit() \
                                            else float('inf'), item))
        for line in sortiert:
            new.write(line)
The pattern '^\d+' matches all digits at the start of the string; p.match(item).group(0) then returns that numeral as a string, which can be cast to an integer.
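To see the sort key in isolation, here is a minimal sketch that runs on an in-memory list instead of files. The names LEADING_DIGITS and sort_key and the sample words are made up for illustration. It tests the regex match itself rather than item[0].isdigit(), which is a small variation on the answer above, not the answer's exact code.

```python
import re

# Hypothetical helper names; only the '\d+' idea comes from the answer above.
LEADING_DIGITS = re.compile(r'\d+')

def sort_key(item):
    """Sort lines with a leading number numerically, all other lines after them."""
    m = LEADING_DIGITS.match(item)  # match() only looks at the start of the string
    if m:
        return (int(m.group(0)), item)
    return (float('inf'), item)

# Made-up sample lines, including the one from the traceback
words = ['12234thl\n', 'abc\n', '7\n', '100 x\n', '9z\n']
result = sorted(words, key=sort_key)
print(result)
```

Separately, symmetric_difference keeps only the lines that occur in an odd number of the sets, so if the goal is every word exactly once, something like set(a).union(b, c) may be closer to what the question describes.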

Related

How do I fix this ValueError?

I am trying to compute an average from a text file using a function. I am trying to convert the list read from the text file with int(). Instead of converting, it gives me the error: "ValueError: invalid literal for int() with base 10: '5, 5, 6, 7'". The "5, 5, 6, 7" is a line from my .txt file. Here is the code:
def getNumberList(filename):
    with open(filename,'r') as f:
        lyst = f.read().split('\n')
        numberList = [int(num) for num in lyst]
    return numberList
def getAverage(filename, func):
    numbers = func(filename)
    return sum(numbers)/len(numbers)
def main():
    filename = input("Input the file name: ")
    average = getAverage(filename, getNumberList)
    print(average)
if __name__ == "__main__":
    main()
You are splitting by line but you are not splitting by commas, so you are trying to convert 5,5,6,7 to an integer, which is impossible. You need to also split by commas after you split by line, and then combine them into one list, if you want to average all the numbers in the file. The following should work:
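def getNumberList(filename):
    with open(filename,'r') as f:
        lines = f.readlines()
    # In a nested comprehension the outer loop (over lines) must come first;
    # blank lines are skipped because int('') would raise a ValueError
    numberList = [int(num) for line in lines if line.strip() for num in line.split(',')]
    return numberList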
It looks like you need to split each element of lyst on ",", because right now the code tries to convert a whole line such as "1,2,3" at once.
So change it like this and try again:
def getNumberList(filename):
    with open(filename,'r') as f:
        lyst = []
        temp = f.read().strip().split('\n')
        for i in temp:
            lyst += i.strip().split(',')
        numberList = [int(num) for num in lyst]
    return numberList
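Both answers come down to splitting twice: once per line, then once per comma. A minimal sketch of that idea, run on an in-memory list so it needs no file, could look like this. The function name parse_numbers and the sample lines are assumptions for illustration, not part of the original code.

```python
def parse_numbers(lines):
    """Collect integers from lines of comma-separated values, skipping blanks."""
    numbers = []
    for line in lines:
        for field in line.split(','):
            field = field.strip()
            if field:  # ignore empty fields and blank lines
                numbers.append(int(field))
    return numbers

# Made-up sample input, shaped like the lines f.readlines() would return
sample = ["5, 5, 6, 7\n", "\n", "1,2,3\n"]
numbers = parse_numbers(sample)
average = sum(numbers) / len(numbers)
print(numbers, average)
```

Skipping empty fields is a design choice here: it keeps a trailing blank line from raising the same ValueError again.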

Read only the numbers from a txt file python

I have a text file that contains some words and a number written with a point in it. For example:
hello!
54.123
Now I only want the number 54.123 to be extracted and converted, so that the outcome is 54123.
The code I tried is
import re
exp = re.compile(r'^[\+]?[0-9]')
my_list = []
with open('file.txt') as f:
    lines = f.readlines()
    for line in lines:
        if re.match(exp, line.strip()):
            my_list.append(int(line.strip()))
#convert to a string
listToStr = ' '.join([str(elem) for elem in my_list])
print(listToStr)
But this returns the error: ValueError: invalid literal for int() with base 10: '54.123'
Does anyone know a solution for this?
You can try to convert the current line to a float. If the line does not contain a valid float, float() raises a ValueError, which you can catch and ignore. If no exception is raised, split the line at the dot, join the two parts, convert the result to int, and append it to the list.
my_list = []
with open('file.txt') as f:
    lines = f.readlines()
    for line in lines:
        try:
            tmp = float(line)
            num = int(''.join(line.split(".")))
            my_list.append(num)
        except ValueError:
            pass
#convert to a string
listToStr = ' '.join([str(elem) for elem in my_list])
print(listToStr)
You can check whether a given line is a string representing a number using the str.isdigit() method.
From what I can tell, you only need to check whether the line contains a digit, because isdigit() returns True only when every character is a digit; a float string contains ".", which is not a digit, so it returns False.
For example:
def numCheck(string):
    # Checks if the input string contains numbers
    return any(i.isdigit() for i in string)
string = '54.123'
print(numCheck(string)) # True
string = 'hello'
print(numCheck(string)) # False
Note: if your data contains things like 123ab56 then this won't be good for you.
To convert 54.123 to 54123 you could use the str.replace(old, new) method.
For example:
string = '54.123' # must be a str; a float has no replace() method
new_string = string.replace('.', '') # replace . with nothing
print(new_string) # 54123
This may help. With it I am now getting the numbers from the file. I guess you meant to use split instead of strip:
import re
exp = re.compile(r'[0-9]')
my_list = []
with open('file.txt') as f:
    lines = f.readlines()
    for line in lines:
        for numbers in line.split():
            if re.match(exp, numbers):
                my_list.append(numbers)
#convert to a string
listToStr = ' '.join([str(elem) for elem in my_list])
print(listToStr)
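The approaches above can be combined into one small check: accept a line only if the whole line is a plain decimal number, then drop the point. The following is a sketch under that assumption. The names NUMBER and digits_only and the sample lines are invented for illustration, and the pattern deliberately rejects mixed tokens such as 123ab56.

```python
import re

# Optional sign, digits, optional fractional part (an assumed definition of "number")
NUMBER = re.compile(r'[+-]?\d+(\.\d+)?')

def digits_only(lines):
    """Return ints for lines that are plain decimal numbers, with the point removed."""
    result = []
    for line in lines:
        text = line.strip()
        if NUMBER.fullmatch(text):  # the whole line must be a number
            result.append(int(text.replace('.', '')))
    return result

# Made-up sample input based on the file shown in the question
sample = ["hello!\n", "54.123\n", "123ab56\n", "42\n"]
print(digits_only(sample))
```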

filtering lines based on the presence of 2 short sequences in python

I have a text file like this example:
>chr9:128683-128744
GGATTTCTTCTTAGTTTGGATCCATTGCTGGTGAGCTAGTGGGATTTTTTGGGGGGTGTTA
>chr16:134222-134283
AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG
>chr16:134226-134287
GGAAGCAGCGTGGGAATCACAGAATGGACGGCCGATTAAAGGCTTTGCTTGGCCTGGATTT
>chr1:134723-134784
AAGTGATTCACCCTGCCTTTCCGACCTTCCCCAGAACAGAACACGTTGATCGTGGGCGATA
>chr16:135770-135831
GCCTGAGCAAAGGGCCTGCCCAGACAAGATTTTTTAATTGTTTAAAAACCGAATAAATGTT
This file is divided into parts, and every part has 2 rows. The 1st row starts with > (this row is called the ID) and the 2nd row is the sequence of letters.
I want to search for 2 short motifs (AATAAA and GGAC) in the sequence of letters, and if a sequence contains these motifs, I want to get the ID and sequence of that part.
The point is that AATAAA should come first and GGAC should come after it. There is a distance between them, and this distance can be 2 letters or more.
expected output:
>chr16:134222-134283
AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG
I am trying to do that in Python using the following code:
infile = open('infile.txt', 'r')
mot1 = 'AATAAA'
mot2 = 'GGAC'
new = []
for line in range(len(infile)):
    if not infile[line].startswith('>'):
        for match in pattern.finder(mot1) and pattern.finder(mot2):
            new.append(infile[line-1])
with open('outfile.txt', "w") as f:
    for item in new:
        f.write("%s\n" % item)
This code does not return what I want. Do you know how to fix it?
You can group the ID with sequence, and then utilize re.findall:
import re
data = [i.strip('\n') for i in open('filename.txt')]
new_data = [[data[i], data[i+1]] for i in range(0, len(data), 2)]
final_result = [[a, b] for a, b in new_data if re.findall(r'AATAAA\w{2,}GGAC', b)]
Output:
[['>chr16:134222-134283', 'AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG']]
I am not sure I understood the requirement that the distance can be 2 letters or more, or whether it is obligatory to check it, but the following code gives the desired output:
mot1 = 'AATAAA'
mot2 = 'GGAC'
with open('infile.txt', 'r') as inp:
    last_id = None
    for line in inp:
        if line.startswith('>'):
            last_id = line
        else:
            if mot1 in line and mot2 in line:
                print(last_id)
                print(line)
You can redirect output to a file if you want
You can use a regex and a dictionary comprehension:
import re
with open('test.txt', 'r') as f:
    lines = f.readlines()
data = dict(zip(lines[::2],lines[1::2]))
{k.strip(): v.strip() for k,v in data.items() if re.findall(r'AATAAA\w{2,}GGAC', v)}
Returns:
{'>chr16:134222-134283': 'AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG'}
You may slice the irrelevant part of the string if mot1 is found in it. Here's a way to do it:
infile = open('infile.txt', 'r')
text = [line.strip() for line in infile]  # drop the trailing newlines
infile.close()
mot1 = 'AATAAA'
mot2 = 'GGAC'
# Pair each ID line (even index) with the sequence line that follows it
check = [(text[x], text[x+1]) for x in range(0, len(text) - 1, 2)]
result = [(x + '\n' + y) for (x, y) in check if mot1 in y and mot2 in y[(y.find(mot1)+len(mot1)+2):]]
with open('outfile.txt', "w") as f:
    for item in result:
        f.write("%s\n" % item)
If the file is not too big, you can read it at once, and use re.findall():
import re
with open("infile.txt") as finp:
    data=finp.read()
with open('outfile.txt', "w") as f:
    for item in re.findall(r">.+?[\r\n\f][AGTC]*?AATAAA[AGTC]{2,}GGAC[AGTC]*", data):
        f.write(item+"\n")
"""
+? and *? make the match non-greedy;
>.+?[\r\n\f] matches a line starting with '>' followed by any characters up to the end of the line;
[AGTC]*?AATAAA matches any number of A,G,T,C characters, followed by the AATAAA pattern;
[AGTC]{2,} matches two or more A,G,T,C characters;
GGAC matches the GGAC pattern;
[AGTC]* matches the empty string or any number of A,G,T,C characters.
"""

Python search for patterns in all lines, export only lines with results

I would like to search for strings that match a pattern in a text file and export only the matched strings:
k=''
regex = re.compile(r'[a-zA-Z]{2}\d{8}')
with open(file, 'r') as f:
    for line in f:
        line = line.replace(',', '')
        line = line.replace('.', '')
        k = regex.findall(line)
        #k.append(line)
        if not k=='':
            position=True
        else:
            position=False
        if position==True:
            print(k)
Somehow my code doesn't work; it always returns the following output:
[] [] [] [] [] [] [] ['AI13933231'] [] [] [] [] []
I want the output to contain only the matched strings. Thank you!
The empty lists [] appear because each of those lines exists in the file but is either empty (containing just \n) or does not match the regex '[a-zA-Z]{2}\d{8}'. Note that regex.findall(line) returns a list, so if the regex finds no match, the result is an empty list.
Your main error is in this condition: if not k=='':. Note that k is a list, so it is never equal to the empty string and the condition is always true.
Consider this code:
import re
k=''
regex = re.compile(r'[a-zA-Z]{2}\d{8}')
with open("omg.txt", 'r') as f:
    for line in f:
        line = line.replace(',', '')
        line = line.replace('.', '')
        k = regex.findall(line)
        #k.append(line)
        position = False
        if k: # An empty list is falsy, so this is true only when there is a match
            position=True
            print(k)
        else:
            position=False
Given the file (text after # is ignored and is not part of the file):
AZ23153133
# Empty line
AB12355342
gz # No match
XY93312344
The output would be
['AZ23153133']
['AB12355342']
['XY93312344']
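If a single flat list of matches is more convenient than one list per line, the same idea can be sketched as follows. The names CODE and find_codes are made up, and the sample lines mirror the example file above rather than a real file.

```python
import re

CODE = re.compile(r'[a-zA-Z]{2}\d{8}')  # two letters followed by eight digits

def find_codes(lines):
    """Return all matches from all lines as one flat list."""
    matches = []
    for line in lines:
        cleaned = line.replace(',', '').replace('.', '')
        # findall() returns [] for lines without a match, so extend() adds nothing
        matches.extend(CODE.findall(cleaned))
    return matches

# Made-up sample mirroring the example file
sample = ["AZ23153133\n", "\n", "AB12355342\n", "gz\n", "XY93312344\n"]
print(find_codes(sample))
```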

Python regex string match from file

I have a text file that resembles
alpha alphabet alphameric
I would like to match just the first string 'alpha', nothing else.
I have the following code that attempts to match just the alpha string and get its line number:
findWord = re.findall('\\ba\\b', "alpha")
with open(file) as myFile:
    for num, line in enumerate(myFile, 1):
        if findWord in line:
            print 'Found at line: ', num
However I get the following error:
TypeError: 'in <string>' requires string as left operand, not list
Issues in your code:
re.findall('\\ba\\b', "alpha") returns a list of matches, but if findWord in line then uses that list where a string is required. That is the error you are getting.
With findWord = re.findall('\\ba\\b', "alpha") you are searching for the standalone word a in the string alpha, which does not exist, so the list is empty.
Try this:
import re
#findWord = re.findall('\\ba\\b', "alpha")
#print findWord
with open("data.txt") as myFile:
    for num,line in enumerate(myFile):
        if re.findall('\\balpha\\b', line):
            print 'Found at line: ', num+1
You may modify your code a bit
with open(file, 'r') as myFile:
    for num, line in enumerate(myFile, 1):
        if 'alpha' in line.split():
            print 'Found at line', num
Output:
Found at line 1
You can try this:
import re
s = "alpha alphabet alphameric"
data = re.findall(r"alpha(?=\s)", s)[0]
Output:
"alpha"
