Finding the index number for a string of words - python

I'm creating a program in python that will go through a list of sentences and find the words in capitals within the sentences. I've used a findall function to acquire the capitals at the moment.
Here is an example of the output I am receiving at the minute:
line 0: the dog_SUBJ bit_VERB the cat_OBJ
['S'] ['U'] ['B'] ['J'] [] ['V'] ['E'] ['R'] ['B'] [] ['O'] ['B'] ['J']
However, I want the output to be whole words, like so:
['SUBJ'] [] ['VERB'] [] ['OBJ']
I also want the indices of the words, like so:
['SUBJ'] [0]
['VERB'] [1]
['OBJ'] [2]
Is it possible to do this? I've seen the above done before in the terminal, and I think that 'index' or something similar is used.
Here's my code below (as far as I have got):
import re, sys
f = open('findallEX.txt', 'r')
lines = f.readlines()
ii=0
for l in lines:
    sys.stdout.write('line %s: %s' %(ii, l))
    ii = ii + 1
    results = []
    for s in l:
        results.append(re.findall('[A-Z]+', s))
Thanks! Any help would be greatly appreciated!

Something like:
>>> s = 'the dog_SUBJ bit_VERB the cat_OBJ'
>>> import re
>>> from itertools import count
>>> zip(re.findall('[A-Z]+', s), count())
[('SUBJ', 0), ('VERB', 1), ('OBJ', 2)]
Format as appropriate...
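The session above shows Python 2 output; in Python 3, zip returns a lazy iterator, so it would need wrapping in list() to print the pairs. A minimal sketch of the same idea using enumerate, which runs on either version (the function name tagged_words is only illustrative):

```python
import re

def tagged_words(line):
    """Return (word, index) pairs for each run of capital letters in line."""
    # findall is applied to the whole line, not to one character at a time,
    # so each run of capitals comes back as a single word.
    return [(word, i) for i, word in enumerate(re.findall(r'[A-Z]+', line))]

pairs = tagged_words('the dog_SUBJ bit_VERB the cat_OBJ')
```

Calling findall on the whole line, rather than on each character, is what turns ['S'] ['U'] ['B'] ['J'] into a single 'SUBJ'.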

Related

Get sequences from a file and store them into a list in python

Here is the code (i took it from this discussion Translation DNA to Protein, but here i'm using RNA instead of DNA file):
from itertools import takewhile
def translate_rna(sequence, d, stop_codons=('UAA', 'UGA', 'UAG')):
    start = sequence.find('AUG')
    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]
    # Split it into triplets
    codons = [trimmed_sequence[i:i + 3] for i in range(0, len(trimmed_sequence), 3)]
    # Take all codons until first stop codon
    coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3, codons)
    # Translate and join into string, using the codon table passed in as d
    protein_sequence = ''.join([d[codon] for codon in coding_sequence])
    # This line assumes there is always stop codon in the sequence
    return "{0}".format(protein_sequence)
Calling the translate_rna function:
sequence = ''
for line in open("to_rna", "r"):
    sequence += line.strip()
translate_rna(sequence, d)
My to_rna file looks like:
CCGCCCCUCUGCCCCAGUCACUGAGCCGCCGCCGAGGAUUCAGCAGCCUCCCCCUUGAGCCCCCUCGCUU
CCCGACGUUCCGUUCCCCCCUGCCCGCCUUCUCCCGCCACCGCCGCCGCCGCCUUCCGCAGGCCGUUUCC
ACCGAGGAAAAGGAAUCGUAUCGUAUGUCCGCUAUCCAG.........
The function translates only the first protein (from the first AUG to the first stop codon).
I think the problem is in this line:
# Take all codons until first stop codon
coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)
My question is: how can I tell Python, after it finds the first AUG and stores the coding sequence, to search for the next AUG in the RNA file and store that sequence in the next position of a list?
As a result, I want a list like this:
['here_is_the_1st_coding_sequence', 'here_is_the_2nd_coding_sequence', ...]
PS: This is homework, so I can't use Biopython.
EDIT:
A simple way to describe the problem:
From this code:
from itertools import takewhile
lst = ['N', 'A', 'B', 'Z', 'C', 'A', 'V', 'V', 'Z', 'X']
ch = ''.join(lst)
stop = 'Z'
start = ch.find('A')
seq = takewhile(lambda x: x not in stop, ch)
I want to get this:
['AB', 'AVV']
EDIT 2:
For instance, from this string:
UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA
I should get as result:
['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
Looking at your basic code (I couldn't quite follow the main code), it looks like you just want to take the substring that starts at the first occurrence of one string and split it on every occurrence of another. If that is wrong, please tell me and I will update accordingly.
To achieve this, Python has a built-in str.split(sub), which splits a string at every occurrence of sub, and str.index(sub), which returns the index of the first occurrence of sub. Example:
>>> ch = 'NABZCAVZX'
>>> ch[ch.index('A'):].split('Z')
['AB', 'CAV', 'X']
You can also specify substrings that are longer than one character:
>>> ch = 'NACBABQZCVEZTZCGE'
>>> ch[ch.index('AB'):].split('ZC')
['ABQ', 'VEZT', 'GE']
Using multiple delimiters:
>>> import re
>>> stop_codons = ['UAA','UGA','UAG']
>>> delim = re.compile('|'.join(stop_codons))
>>> ch = 'CCHAUAABEGTAUAAVEGTUGAVKEGUAABEGEUGABRLVBUAGCGGA'
>>> delim.split(ch)
['CCHA', 'BEGTA', 'VEGT', 'VKEG', 'BEGE', 'BRLVB', 'CGGA']
Note that there is no order of preference among the delimiters: if a UGA appears ahead of a UAA, the string is still split on the UGA. I am not sure whether that is what you want, but that is how it behaves.
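Splitting on the stop codons ignores the reading frame, so it does not by itself reproduce the list asked for in EDIT 2. A minimal sketch of one way to get that list, assuming each coding sequence starts at an AUG, is read in triplets up to the first in-frame stop codon, and the search for the next AUG resumes just after that stop codon (the name find_coding_sequences is hypothetical):

```python
STOP_CODONS = ('UAA', 'UGA', 'UAG')

def find_coding_sequences(rna, start_codon='AUG', stop_codons=STOP_CODONS):
    """Return every start-to-stop stretch of rna, without the stop codon."""
    sequences = []
    pos = rna.find(start_codon)
    while pos != -1:
        end = pos
        # Walk forward one full triplet at a time until a stop codon
        # or the end of the string is reached.
        while end + 3 <= len(rna) and rna[end:end + 3] not in stop_codons:
            end += 3
        sequences.append(rna[pos:end])
        # Resume the search just past the stop codon.
        pos = rna.find(start_codon, end + 3)
    return sequences

result = find_coding_sequences('UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA')
# result == ['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
```

Each entry can then be translated codon by codon with the codon table. If the final sequence has no stop codon, this sketch keeps whatever full triplets remain.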

Split string into tuple (Upper,lower) 'ABCDefgh' . Python 2.7.6

my_string = 'ABCDefgh'
desired = ('ABCD','efgh')
The only way I can think of doing this is to write a for loop that scans the string, checks each character individually, appends it to one of two strings, and then builds the tuple. Is there a more efficient way to do this?
The input will always be in the format UPPERlower.
print re.split("([A-Z]+)",my_string)[1:]
Simple way (two passes):
>>> import itertools
>>> my_string = 'ABCDefgh'
>>> desired = (''.join(itertools.takewhile(lambda c:c.isupper(), my_string)), ''.join(itertools.dropwhile(lambda c:c.isupper(), my_string)))
>>> desired
('ABCD', 'efgh')
Efficient way (one pass):
>>> my_string = 'ABCDefgh'
>>> uppers = []
>>> done = False
>>> i = 0
>>> while not done:
...     c = my_string[i]
...     if c.isupper():
...         uppers.append(c)
...         i += 1
...     else:
...         done = True
...
>>> lowers = my_string[i:]
>>> desired = (''.join(uppers), lowers)
>>> desired
('ABCD', 'efgh')
Because I throw itertools.groupby at everything:
>>> my_string = 'ABCDefgh'
>>> from itertools import groupby
>>> [''.join(g) for k,g in groupby(my_string, str.isupper)]
['ABCD', 'efgh']
(A little overpowered here, but scales up to more complicated problems nicely.)
my_string='ABCDefg'
import re
desired = (re.search('[A-Z]+',my_string).group(0),re.search('[a-z]+',my_string).group(0))
print desired
A more robust approach without using re
>>> import string
>>> txt = "ABCeUiioualfjNLkdD"
>>> tup = (''.join([char for char in txt if char in string.ascii_uppercase]),
...        ''.join([char for char in txt if char not in string.ascii_uppercase]))
>>> tup
('ABCUNLD', 'eiioualfjkd')
Testing char not in string.ascii_uppercase, rather than char in string.ascii_lowercase, means you never lose data when the string contains non-letters. That avoids errors that would otherwise only surface 20 function calls later, when the incomplete input is finally rejected.
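Since the input is stated to always have the form UPPERlower, a single regular-expression match can also do the split in one pass. A minimal sketch, assuming ASCII capitals only (the function name split_upper_lower is illustrative):

```python
import re

def split_upper_lower(text):
    """Split text into (leading run of ASCII capitals, everything after it)."""
    # The first group grabs the leading capitals; the second keeps the rest
    # unchanged, so no characters are dropped. Both groups may be empty,
    # which means the pattern always matches.
    return re.match(r'([A-Z]*)(.*)', text, re.DOTALL).groups()

desired = split_upper_lower('ABCDefgh')
# desired == ('ABCD', 'efgh')
```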

How to read a text file into separate lists python

Say I have a text file formatted like this:
100 20 the birds are flying
and I wanted to read the int(s) into their own lists and the string into its own list...how would I go about this in python. I tried
data.append(map(int, line.split()))
that didn't work...any help?
Essentially, I'm reading the file line by line, and splitting them. I first check to see if I can turn them into an integer, and if I fail, treat them as strings.
def separate(filename):
    all_integers = []
    all_strings = []
    with open(filename) as myfile:
        for line in myfile:
            # split() with no argument also drops the trailing newline
            for item in line.split():
                try:
                    # Try converting the item to an integer
                    value = int(item, 10)
                    all_integers.append(value)
                except ValueError:
                    # if it fails, it's a string.
                    all_strings.append(item)
    return all_integers, all_strings
Then, given the file ('myfile.txt')
100 20 the birds are flying
200 3 banana
hello 4
...doing the following on the command line returns...
>>> myints, mystrings = separate(r'myfile.txt')
>>> print myints
[100, 20, 200, 3, 4]
>>> print mystrings
['the', 'birds', 'are', 'flying', 'banana', 'hello']
If i understand your question correctly:
import re
def splitList(items):
    ints = []
    words = []
    for item in items:
        # A raw string keeps \d from being read as a string escape
        if re.match(r'^\d+$', item):
            ints.append(int(item))
        else:
            words.append(item)
    return ints, words
intList, wordList = splitList(line.split())
Will give you two lists: [100, 20] and ['the', 'birds', 'are', 'flying']
Here's a simple solution. Note that it might not be as efficient as the others for very large files, because it iterates over the words twice for each line.
words = line.split()
intList = [int(x) for x in words if x.isdigit()]
strList = [x for x in words if not x.isdigit()]
pop removes the element from the list and returns it:
words = line.split()
first = int(words.pop(0))
second = int(words.pop(0))
This is of course assuming your format is always int int word word word ....
And then join the rest of the string:
words = ' '.join(words)
And in Python 3 you can even do this:
first, second, *words = line.split()
Which is pretty neat, although you would still have to convert first and second to ints.
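Putting the star-unpacking and the int conversion together, a minimal Python 3 sketch, still assuming every line has the form int int word word ... (the name parse_line is illustrative):

```python
def parse_line(line):
    """Return (first_int, second_int, words) for a line like '100 20 the birds'."""
    first, second, *words = line.split()
    # int() raises ValueError if either of the first two fields is not a number.
    return int(first), int(second), words

first, second, words = parse_line('100 20 the birds are flying')
# first == 100, second == 20, words == ['the', 'birds', 'are', 'flying']
```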

restoring list from unified_diff output

I need to generate the diff between two arrays of strings:
a=['1','2']
b=['1','2','3']
To achieve this I'm using the difflib library in Python (2.6):
c=difflib.unified_diff(a,b)
and I save the content of
d=list(c)
which is something like:
['--- \n', '+++ \n', '@@ -1,2 +1,3 @@\n', ' 1', ' 2', '+3']
How can I build the second array from the first using the output of the unified_diff function?
The behavior that I'm looking for is something like:
>>> merge(a,d)
['1','2','3']
P.S. the array can have duplicate entries and the order in which each entry appears is important for my application. Moreover, from one iteration to another there could be changes both in the middle/begin of the array, as well as new entries added at the end.
Not sure that my sample is a good style, but you can use something like this:
from collections import Counter
a=['1','2']
b=['1','2','3']
a.extend(b)
[k for k,v in Counter(a).items() if v == 1]
OR if your lists could have only unique items:
list(set(a) ^ set(b))
OR:
missed_in_a = [x for x in a if x not in b]
missed_in_b = [x for x in b if x not in a]
OR:
a=['1','2']
b=['1','2','3']
c = [x for x in a]
c.extend(b)
diff = [x for x in c if a.count(x)+b.count(x) == 1]
One last attempt (I hope I understand you correctly now; sorry if not):
a = ['1','2','3','4']
b = ['2','2','3','6','5']
from difflib import unified_diff
def merge(a,b):
    output = []
    for line in list(unified_diff(a,b))[3:]:
        if '+' in line:
            output.append(line.strip('+'))
        elif '-' not in line:
            output.append(line.strip())
    return output
print merge(a,b)
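The function above recomputes the diff from both lists. To rebuild the second list from only the first list and a stored diff, as the question asks, the hunks can be replayed against the original. A minimal sketch, assuming the diff came from difflib.unified_diff with its default lineterm and that the list items contain no newlines (the name apply_unified_diff is hypothetical, not part of difflib):

```python
import re
from difflib import unified_diff

# Hunk headers look like "@@ -1,2 +1,3 @@"; a length of 1 is omitted.
HUNK_HEADER = re.compile(r'^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@')

def apply_unified_diff(original, diff_lines):
    """Rebuild the second list from the first list plus a unified diff."""
    result = []
    pos = 0          # index of the next item of original not yet consumed
    in_hunk = False  # the '---' and '+++' file header lines come first
    for line in diff_lines:
        header = HUNK_HEADER.match(line)
        if header:
            start = int(header.group(1))
            length = 1 if header.group(2) is None else int(header.group(2))
            # For an empty range, the start is the number of items before it.
            hunk_start = start - 1 if length else start
            result.extend(original[pos:hunk_start])  # untouched items
            pos = hunk_start
            in_hunk = True
        elif not in_hunk:
            continue
        elif line.startswith('+'):
            result.append(line[1:])       # added item
        elif line.startswith('-'):
            pos += 1                      # removed item
        else:
            result.append(original[pos])  # context item, kept as-is
            pos += 1
    result.extend(original[pos:])         # items after the last hunk
    return result

a = ['1', '2']
b = ['1', '2', '3']
d = list(unified_diff(a, b))
restored = apply_unified_diff(a, d)
# restored == ['1', '2', '3']
```

Because items are consumed by position rather than matched by value, duplicate entries and changes in the middle of the list are handled the same way as additions at the end.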

How would you make a comma-separated string from a list of strings?

What would be your preferred way to concatenate strings from a sequence such that a comma is added between every two consecutive elements? That is, how do you map, for instance, ['a', 'b', 'c'] to 'a,b,c'? (The cases ['s'] and [] should be mapped to 's' and '', respectively.)
I usually end up using something like ''.join(map(lambda x: x+',',l))[:-1], but I feel somewhat unsatisfied with it.
my_list = ['a', 'b', 'c', 'd']
my_string = ','.join(my_list)
'a,b,c,d'
This won't work if the list contains integers
And if the list contains non-string types (such as integers, floats, bools, None) then do:
my_string = ','.join(map(str, my_list))
Why the map/lambda magic? Doesn't this work?
>>> foo = ['a', 'b', 'c']
>>> print(','.join(foo))
a,b,c
>>> print(','.join([]))
>>> print(','.join(['a']))
a
In case if there are numbers in the list, you could use list comprehension:
>>> ','.join([str(x) for x in foo])
or a generator expression:
>>> ','.join(str(x) for x in foo)
",".join(l) will not work for all cases. I'd suggest using the csv module with StringIO
import StringIO
import csv
l = ['list','of','["""crazy"quotes"and\'',123,'other things']
line = StringIO.StringIO()
writer = csv.writer(line)
writer.writerow(l)
csvcontent = line.getvalue()
# 'list,of,"[""""""crazy""quotes""and\'",123,other things\r\n'
Here is an alternative solution in Python 3.0 which allows non-string list items:
>>> alist = ['a', 1, (2, 'b')]
a standard way
>>> ", ".join(map(str, alist))
"a, 1, (2, 'b')"
the alternative solution
>>> import io
>>> s = io.StringIO()
>>> print(*alist, file=s, sep=', ', end='')
>>> s.getvalue()
"a, 1, (2, 'b')"
NOTE: The space after comma is intentional.
@Peter Hoffmann
Using generator expressions has the benefit of also producing an iterator but saves importing itertools. Furthermore, list comprehensions are generally preferred to map, thus, I'd expect generator expressions to be preferred to imap.
>>> l = [1, "foo", 4 ,"bar"]
>>> ",".join(str(bit) for bit in l)
'1,foo,4,bar'
Don't you just want:
",".join(l)
Obviously it gets more complicated if you need to quote/escape commas etc in the values. In that case I would suggest looking at the csv module in the standard library:
https://docs.python.org/library/csv.html
>>> my_list = ['A', '', '', 'D', 'E',]
>>> ",".join([str(i) for i in my_list if i])
'A,D,E'
my_list may contain any type of variables. This avoid the result 'A,,,D,E'.
l=['a', 1, 'b', 2]
print str(l)[1:-1]
Output: "'a', 1, 'b', 2"
@jmanning2k Using a list comprehension has the downside of creating a new temporary list. A better solution would be to use itertools.imap, which returns an iterator:
from itertools import imap
l = [1, "foo", 4 ,"bar"]
",".join(imap(str, l))
Here is an example with list
>>> myList = [['Apple'],['Orange']]
>>> myList = ','.join(map(str, [i[0] for i in myList]))
>>> print "Output:", myList
Output: Apple,Orange
More Accurate:-
>>> myList = [['Apple'],['Orange']]
>>> myList = ','.join(map(str, [type(i) == list and i[0] for i in myList]))
>>> print "Output:", myList
Output: Apple,Orange
Example 2:-
myList = ['Apple','Orange']
myList = ','.join(map(str, myList))
print "Output:", myList
Output: Apple,Orange
If you want to do the shortcut way :) :
','.join([str(word) for word in wordList])
But if you want to show off with logic :) :
wordList = ['USD', 'EUR', 'JPY', 'NZD', 'CHF', 'CAD']
stringText = ''
for word in wordList:
    stringText += word + ','
stringText = stringText[:-1] # get rid of last comma
print(stringText)
Unless I'm missing something, ','.join(foo) should do what you're asking for.
>>> ','.join([''])
''
>>> ','.join(['s'])
's'
>>> ','.join(['a','b','c'])
'a,b,c'
(edit: and as jmanning2k points out,
','.join([str(x) for x in foo])
is safer and quite Pythonic, though the resulting string will be difficult to parse if the elements can contain commas -- at that point, you need the full power of the csv module, as Douglas points out in his answer.)
I would say the csv library is the only sensible option here, as it was built to cope with all csv use cases such as commas in a string, etc.
To output a list l to a .csv file:
import csv
with open('some.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(l) # this will output l as a single row.
It is also possible to use writer.writerows(iterable) to output multiple rows to csv.
This example is compatible with Python 3, as the other answer here used StringIO which is Python 2.
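If the goal is a string rather than a file, the same writer can target an in-memory buffer. A minimal Python 3 sketch using io.StringIO (the helper name to_csv_line is illustrative):

```python
import csv
import io

def to_csv_line(values):
    """Return values as one CSV-formatted line, without the line terminator."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    # writerow appends '\r\n' by default; strip only that terminator.
    return buffer.getvalue().rstrip('\r\n')

line = to_csv_line(['a', 'b,c', 1])
# line == 'a,"b,c",1'
```

Values that contain a comma are quoted by the writer, which is the case a plain ','.join cannot handle.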
For an SQL IN clause, you may also need every item quoted:
l = ["foo" , "baar" , 6]
where_clause = "..... IN ("+(','.join([ f"'{x}'" for x in l]))+")"
>> "..... IN ('foo','baar','6')"
Enjoy!
My two cents. I like simple one-line code in Python:
>>> from itertools import imap, ifilter
>>> l = ['a', '', 'b', 1, None]
>>> ','.join(imap(str, ifilter(lambda x: x, l)))
'a,b,1'
>>> m = ['a', '', None]
>>> ','.join(imap(str, ifilter(lambda x: x, m)))
'a'
It's Pythonic and works for strings, numbers, None and the empty string. It's short and satisfies the requirements. If the list is not going to contain numbers, we can use this simpler variation:
>>> ','.join(ifilter(lambda x: x, l))
Also, this solution doesn't create a new list but uses an iterator, as @Peter Hoffmann pointed out (thanks).
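itertools.imap and itertools.ifilter exist only in Python 2. A minimal sketch of the same filtering join as a generator expression, which runs on Python 3 and also avoids building a temporary list (the name join_truthy is illustrative):

```python
def join_truthy(values, sep=','):
    """Join the string form of every truthy item in values with sep."""
    # Like ifilter(lambda x: x, ...), this drops '', None and also 0.
    return sep.join(str(v) for v in values if v)

joined = join_truthy(['a', '', 'b', 1, None])
# joined == 'a,b,1'
```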
