Can you help me with my algorithm in Python to parse a list, please?
List = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']
In this list, I have to get the last two words (PARENT_CHILD). For example for PPPP_TOTO_TATA_TITI_TUTU, I only get TITI_TUTU
In the case where there are duplicates, that is to say that in my list, I have: PPPP_TOTO_TATA_TITI_TUTU and PPPP_TOTO_EHEH_TITI_TUTU, I would have two times TITI_TUTU, I then want to recover the GRANDPARENT for each of them, that is: TATA_TITI_TUTU and EHEH_TITI_TUTU
As long as the names are duplicated, we take the level above.
But in this case, if I add the GRANDPARENT for EHEH_TITI_TUTU, I also want it added for all the entries that have EHEH in their name, so instead of OOOO_AAAAA I would like to have EHEH_OOOO_AAAAA, and likewise EHEH_IIII_SSSS_RRRR.
My final list =
['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']
Thank you in advance.
Here is the code I started to write:
json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA']
cols_name = []
for path in json_paths:
    acc=2
    col_name = '_'.join(path.split('_')[-acc:])
    tmp = cols_name
    while col_name in tmp:
        acc += 1
        idx = tmp.index(col_name)
        cols_name[idx] = '_'.join(json_paths[idx].split('_')[-acc:])
        col_name = '_'.join(path.split('_')[-acc:])
        tmp = ['_'.join(item.split('_')[-acc:]) for item in json_paths].pop()
    cols_name.append(col_name)
    print(cols_name.index(col_name), col_name)
cols_name
help ... with ... algorithm
use a dictionary for the initial container while iterating
keys will be PARENT_CHILD's and values will be lists containing grandparents.
>>> import collections
>>> s = 'PPPP_TOTO_TATA_TITI_TUTU'
>>> d = collections.defaultdict(list)
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA']})
>>> s = 'PPPP_TOTO_EHEH_TITI_TUTU'
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA', 'EHEH']})
>>>
after iteration determine if there are multiple grandparents in a value
if there are, join/append the parent_child to each grandparent
additionally, find all the parent_child's with these grandparents and prepend their grandparents. To facilitate this, build a second dictionary during iteration - {grandparent:[list_of_children],...}.
if the parent_child only has one grandparent use as-is
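The steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions that are not part of the answer: the function name shorten_paths is made up for illustration, every path is assumed to have at least three '_'-separated parts, and only one extra level (the grandparent) is ever prepended. Deeper cases such as EHEH_IIII_SSSS_RRRR from the question are therefore not handled here.

```python
from collections import defaultdict


def shorten_paths(paths):
    """Hypothetical helper: one-level sketch of the dictionary approach."""
    # PARENT_CHILD -> list of grandparents seen with that pair
    grandparents_by_pair = defaultdict(list)
    split_paths = []
    for path in paths:
        # Assumption: every path has at least three '_'-separated parts.
        *_, grandparent, parent, child = path.rsplit('_', maxsplit=3)
        pair = '_'.join([parent, child])
        grandparents_by_pair[pair].append(grandparent)
        split_paths.append((grandparent, pair))

    # Grandparents that are needed to tell duplicated pairs apart.
    promoted = set()
    for grandparents in grandparents_by_pair.values():
        if len(grandparents) > 1:
            promoted.update(grandparents)

    # Prepend the grandparent wherever it was promoted, else keep the pair.
    result = []
    for grandparent, pair in split_paths:
        if grandparent in promoted:
            result.append('_'.join([grandparent, pair]))
        else:
            result.append(pair)
    return result


json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
              'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU',
              'PPPP_TOTO_EHEH_OOOO_AAAAA']
print(shorten_paths(json_paths))
```

Using a set for the promoted grandparents plays the role of the second dictionary mentioned above, since only membership is needed in this simplified version.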
Instead of splitting each string the info could be extracted with a regular expression.
pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'
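As a rough illustration of how that pattern could be used, here is a small sketch. The helper name split_with_regex is hypothetical, and it returns None when a string has too few parts for the pattern to match (the pattern needs at least four '_'-separated parts).

```python
import re

# Same pattern as above: group 1 is the grandparent, group 2 is PARENT_CHILD.
PATTERN = re.compile(r'^.*?_([^_]*)_([^_]*_[^_]*)$')


def split_with_regex(path):
    """Hypothetical helper: return (grandparent, parent_child) or None."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return match.groups()


print(split_with_regex('PPPP_TOTO_TATA_TITI_TUTU'))  # ('TATA', 'TITI_TUTU')
```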
So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice, so both instances should be removed, but 12345 appears once, so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
    if(lines.count(appId) > 1): #if element count is not unique remove both elements
        lines.remove(appId) #first instance removed
        lines.remove(appId) #second instance removed
writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
    writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950 elements, and I'm still getting 23000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?
Edit: I FORGOT TO MENTION. An element can only appear twice MAX.
Use Counter from built in collections:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I will do it like this:
from collections import defaultdict
out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1
with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1: #here you use logic suitable for what you want
            f.write(k + '\n')
Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
Instead, I suggest creating a filtered copy of the list using a list comprehension: rather than removing elements that appear more than once, you keep only the elements that appear exactly once:
with open("test.txt", 'r') as file:
    lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) == 1] # keep lines that appear exactly once
with open("duplicatesRemoved.txt", "w") as writefile:
    writefile.writelines(line + "\n" for line in unique_lines) # splitlines() dropped the newlines
You could also easily modify the condition to keep other counts (for example lines.count(line) <= 2 for lines that appear twice or less). Note that lines.count() rescans the whole list for every line, so on a 70,000-line file a single counting pass with a dictionary or Counter will be much faster.
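To make the skipping concrete, here is a small sketch with made-up sample data. The function names are hypothetical; remove_while_iterating mirrors the loop from the question, and keep_singletons is one possible replacement based on collections.Counter.

```python
from collections import Counter


def remove_while_iterating(items):
    """Mirrors the question's loop: mutates the list it is iterating over."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if items.count(item) > 1:
            items.remove(item)  # first instance removed
            items.remove(item)  # second instance removed
    return items


def keep_singletons(items):
    """Keep only the items that occur exactly once, in their original order."""
    counts = Counter(items)
    return [item for item in items if counts[item] == 1]


sample = ["123", "1234", "12345", "123", "1234", "12345", "123456"]
print(remove_while_iterating(sample))  # ['1234', '1234', '123456']
print(keep_singletons(sample))         # ['123456']
```

In the first function, removing both copies of "123" shifts the remaining elements left, so the iterator jumps over "1234" and never examines it.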
You can count all of the elements and store them in a dictionary:
dic = {a:lines.count(a) for a in lines}
Then remove all the duplicated ones from the list:
for k in dic:
    if dic[k]>1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is needed because lines.remove(k) removes only the first occurrence of k from the list, so it must be repeated until no k remains in the list.
If the for loop is complicated, you can use the dictionary in another way to get rid of duplicated values:
lines = [k for k, v in dic.items() if v==1]
For the Data Structures and Algorithms class at Tilburg University, I got this question in an in-class test:
Build a dictionary from testfile.txt with only unique entries, where if a product appears again, its number should be added to the total sum of that product class.
the text file looked like this, it was not a .csv file:
apples,1
pears,15
oranges,777
apples,-4
oranges,222
pears,1
bananas,3
So apples will be -3 and the output would be {"apples": -3, "oranges": 999...}
In the exams I am not allowed to import any external packages besides the normal ones (pcinput, math, etc.). I am also not allowed to use the internet.
I have no idea how to accomplish this, and it seems to be a big gap in my development of Python skills: this kind of question is not covered in a 'dictionaries in Python' video on YouTube (too hard, maybe), but also not in an expert course, because there it would be too simple.
Hope you guys can help!
from collections import Counter
from sys import exit
from os.path import exists, isfile
## I did not finish it, but what I wanted to achieve was to build a list of the
## strings and their belonging integers, then use the Counter method to add
## them together
## by splitting the string by marking the comma as the split point.
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    exit()
keys = []
values = []
with open(filename) as f:
    xs = f.read().split()
    for i in xs:
        keys.append([i])
print(keys)
my_dict = {}
for i in range(len(xs)):
    my_dict[xs[i]] = xs.count(xs[i])
print(my_dict)
word_and_integers_dict = dict(zip(keys, values))
print(word_and_integers_dict)
values2 = my_dict.split(",")
for j in values2:
    print( value2 )
The output is this:
[['schijndel,-3'], ['amsterdam,0'], ['tokyo,5'], ['tilburg,777'], ['zaandam,5']]
{'zaandam,5': 1, 'tilburg,777': 1, 'amsterdam,0': 1, 'tokyo,5': 1, 'schijndel,-3': 1}
{}
so i got the dictionary from it, but i did not separate the values.
the error message is this:
28 values2 = my_dict.split(",") <-- here was the error
29 for j in values2:
30 print( value2 )
AttributeError: 'dict' object has no attribute 'split'
I don't understand what your code is actually doing, I think you don't know what your variables are containing, but this is an easy problem to solve in Python. Split into a list, split each item again, and count:
>>> input = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"
>>> parts = input.split()
>>> parts
['apples,1', 'pears,15', 'oranges,777', 'apples,-4', 'oranges,222', 'pears,1', 'bananas,3']
Then split again. Behold the list comprehension. This is an idiomatic way to transform a list to another in python. Note that the numbers are strings, not ints yet.
>>> pairs = [s.split(',') for s in parts]
>>> pairs
[['apples', '1'], ['pears', '15'], ['oranges', '777'], ['apples', '-4'], ['oranges', '222'], ['pears', '1'], ['bananas', '3']]
Now you want to iterate over pairs, and sum all the same fruits. This calls for a dict:
>>> result = {}
>>> for fruit, countstr in pairs:
...     if fruit not in result:
...         result[fruit] = 0
...     result[fruit] += int(countstr)
...
>>> result
{'pears': 16, 'apples': -3, 'oranges': 999, 'bananas': 3}
This pattern of adding an element if it doesn't exist comes up frequently. You should check out defaultdict in the collections module. If you use that, you don't even need the if.
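For illustration, a defaultdict version of the same loop might look like the sketch below. The function name sum_counts is made up here; collections is part of the standard library, but whether it counts as an allowed import in the exam is something only the course rules can answer.

```python
from collections import defaultdict


def sum_counts(text):
    """Hypothetical helper: sum the numbers per name in 'name,number' items."""
    totals = defaultdict(int)  # missing keys start at int() == 0
    for part in text.split():
        fruit, count_str = part.split(',')
        totals[fruit] += int(count_str)
    return dict(totals)


data = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"
print(sum_counts(data))
```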
Let's walk through what you need to do. First, check if the file exists and read the contents to a variable. Second, parse each line: split the line on the comma, convert the number from a string to an integer, and then pass the values to a dictionary. In this case I would recommend using defaultdict from collections, but we can also do it with a standard dictionary, as shown here.
from os.path import isfile
from sys import exit
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    exit()
# this reads the file to a list, removing newline characters
# and skipping blank lines
with open(filename) as f:
    line_list = [x.strip() for x in f if x.strip()]
# create a dictionary
my_dict = {}
# update the value in the dictionary if it already exists,
# otherwise add it to the dictionary
for line in line_list:
    k, v_str = line.split(',')
    if k in my_dict:
        my_dict[k] += int(v_str)
    else:
        my_dict[k] = int(v_str)
# print the dictionary
table_str = '{:<30}{}'
print(table_str.format('Item','Count'))
print('='*35)
for k,v in sorted(my_dict.items()):
    print(table_str.format(k,v))
I'm trying to write a program for school. I'm a biotech major and this is a required course, but I'm not a programmer. So, this is probably easy for many, but difficult for me. Anyway, I have a text file with about 30 lines. Each line has a movie name listed first and actors who appeared in the movie, separated by commas following. Here's what I have so far:
InputName = input('What is the name of the file? ')
File = open(InputName, 'r+').readlines()
ActorLst = []
for line in File:
    MovieActLst = line.split(',')
    Movie = MovieActLst[0]
    Actors = MovieActLst[1:]
    for actor in Actors:
        if actor not in ActorLst:
            ActorLst.append(actor)
    MovieDict = {Movie: Actors for x in MovieActLst}
    print (MovieDict)
    print(len(MovieDict))
Output(shortened):
What is the name of the file? Movies.txt
{"Ocean's Eleven": ['George Clooney', 'Brad Pitt', 'Elliot Gould', 'Casey Affleck', 'Carl Reiner', 'Julia Roberts', 'Angie Dickinson', 'Steve Lawrence', 'Wayne Newton\n']}
1
{'Up in the Air': ['George Clooney', 'Sam Elliott', 'Jason Bateman\n']}
1
{'Iron Man': ['Robert Downey Jr', 'Jeff Bridges', 'Gwyneth Paltrow\n']}
1
{'The Big Lebowski': ['Jeff Bridges', 'John Goodman', 'Julianne Moore', 'Sam Elliott\n']}
1
I have created a dictionary (MovieDict) that contains a movie name for the key and a list of actors for the values. There are about 30 movie names (keys). I need to figure out how to iterate through this dictionary to essentially reverse it. I want a dictionary that contains an actor as a key and the movies they play in as the values.
However, I think I have created a list of dictionaries as well instead of one dictionary and now I have really confused myself! Any suggestions?
Trivial using collections.defaultdict:
from collections import defaultdict
reverse = defaultdict(list)
for movie, actors in MovieDict.items():
    for actor in actors:
        reverse[actor].append(movie)
The defaultdict class differs from dict in that when you try to access a key that does not exist, it creates the key and sets its value to an item created by the factory passed to the constructor (list in the above code). This avoids catching the KeyError or checking whether the key is in the dictionary.
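As a small self-contained sketch of that inversion, using two of the movies from the question's output as sample data (the function name invert and the variable names are made up for illustration):

```python
from collections import defaultdict


def invert(movie_to_actors):
    """Hypothetical helper: turn {movie: [actors]} into {actor: [movies]}."""
    actor_to_movies = defaultdict(list)
    for movie, actors in movie_to_actors.items():
        for actor in actors:
            # A missing actor key is created as an empty list on first access.
            actor_to_movies[actor].append(movie)
    return dict(actor_to_movies)


sample = {
    "Up in the Air": ["George Clooney", "Sam Elliott", "Jason Bateman"],
    "The Big Lebowski": ["Jeff Bridges", "John Goodman", "Julianne Moore", "Sam Elliott"],
}
print(invert(sample)["Sam Elliott"])  # ['Up in the Air', 'The Big Lebowski']
```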
Putting this with Steven Rumbalski's loop results in:
from collections import defaultdict
in_fname = input('What is the name of the file? ')
in_file = open(in_fname, 'r+')
movie_to_actors = {}
actors_to_movie = defaultdict(list)
for line in in_file:
    #assumes python3:
    movie, *actors = line.strip().split(',')
    #python2 you can do actors=line.strip().split(',');movie=actors.pop(0)
    movie_to_actors[movie] = list(actors)
    for actor in actors:
        actors_to_movie[actor].append(movie)
Some explanations about the code above.
Iterating over the lines of a file
File objects are iterable, and thus support iteration.
This means you can do:
for line in open('filename'):
instead of:
for line in open('filename').readlines():
(Also, the latter reads the whole file into a list of lines first, while iterating over the file object reads it lazily, so you may save a lot of RAM with big files.)
Tuple unpacking
To "unpack" a sequence into different variables you can use the "tuple unpacking" syntax:
>>> a,b = (0,1)
>>> a
0
>>> b
1
The syntax was extended to allow gathering of a variable number of values into a variable.
For example:
>>> head, *tail = (1, 2, 3, 4, 5)
>>> head
1
>>> tail
[2, 3, 4, 5]
>>> first, *mid, last = (0, 1, 2, 3, 4, 5)
>>> first
0
>>> mid
[1, 2, 3, 4]
>>> last
5
You can have only one "starred expression", so this does not work:
>>> first, *mid, center, *mid2, last =(0,1,2,3,4,5)
File "<stdin>", line 1
SyntaxError: two starred expressions in assignment
So basically, when you have a star on the left-hand side, Python puts there everything that it wasn't able to put in the other variables. Notice that this means the variable may refer to an empty list:
>>> first, *mid, last = (0,1)
>>> first
0
>>> mid
[]
>>> last
1
Using defaultdict
The defaultdict allows you to give a default value to non existent keys.
The class accepts a callable (e.g. a function or class) as a parameter and calls it to build a default value every time one is required:
>>> def factory():
...     print("Called!")
...     return None
...
>>> mydict = defaultdict(factory)
>>> mydict['test']
Called!
reverse={}
keys=MovieDict.keys()
for key in keys:
    val=MovieDict[key]
    for actor in val:
        try:
            reverse[actor].append(key)  # append() returns None, so do not assign its result
        except KeyError:
            reverse[actor]=[key]  # first movie seen for this actor
print(reverse)  # Python 3 print syntax
That should do it.
Programming is about abstracting things, so try to write code in a way that doesn't depend on the specific problem. For example:
def csv_to_dict(seq, separator=','):
    dct = {}
    for item in seq:
        data = [x.strip() for x in item.split(separator)]
        if len(data) > 1:
            dct[data[0]] = data[1:]
    return dct
def flip_dict(dct):
    rev = {}
    for key, vals in dct.items():
        for val in vals:
            if val not in rev:
                rev[val] = []
            rev[val].append(key)
    return rev
Note how these two functions don't "know" anything about "input files", "actors", "movies" and so on, but still are able to solve your problem with two lines of code:
with open("movies.txt") as fp:
    print(flip_dict(csv_to_dict(fp)))
InputName = input('What is the name of the file? ')
with open(InputName, 'r') as f:
    actors_by_movie = {}
    movies_by_actor = {}
    for line in f:
        movie, *actors = line.strip().split(',')
        actors_by_movie[movie] = actors
        for actor in actors:
            movies_by_actor.setdefault(actor, []).append(movie)
Per your naming conventions:
from collections import defaultdict
InputName = input('What is the name of the file? ')
File = open(InputName, 'rt').readlines()
ActorLst = []
ActMovieDct = defaultdict(list)
for line in File:
    MovieActLst = line.strip().split(',')
    Movie = MovieActLst[0]
    Actors = MovieActLst[1:]
    for actor in Actors:
        ActMovieDct[actor].append(Movie)
# print results
for actor, movies in ActMovieDct.items():
    print(actor, movies)
fh=open('asd.txt')
data=fh.read()
fh.close()
name=data.split('\n')[0][1:]
seq=''.join(data.split('\n')[1:])
print name
print seq
In this code, the 3rd line means "take only first line with first character removed" while the 4th line means "leave the first line and join the next remaining lines".
I cannot get the logic of these two lines.
Can anyone explain me how these two slice operators ([0][1:]) are used together?
Thanks
Edited: renamed the file variable (file is also a built-in name in Python 2) to data.
Think of it like this: file.split('\n') gives you a list of strings. So the first indexing operation, [0], gives you the first string in the list. Now, that string itself is a "list" of characters, so you can then do [1:] to get every character after the first. It's just like starting with a two-dimensional list (a list of lists) and indexing it twice.
When confused by a complex expression, do it in steps.
>>> data.split('\n')[0][1:]
>>> data
>>> data.split('\n')
>>> data.split('\n')[0]
>>> data.split('\n')[0][1:]
That should help.
Let's do it in steps (I think I know what name and seq are):
>>> file = ">Protein kinase\nADVTDADTSCVIN\nASHRGDTYERPLK"  # that's what you get reading your (FASTA) file
>>> lines = file.split('\n')  # make a list of lines
>>> line_0 = lines[0]  # take first line (line numbers start at 0)
>>> name = line_0[1:]  # give me line items [x:y] (from x to y)
>>> name
'Protein kinase'
>>>
>>> file = ">Protein kinase\nADVTDADTSCVIN\nASHRGDTYERPLK"
>>> lines = file.split('\n')
>>> seqs = lines[1:]  # give me lines [x:y] (from x to y)
>>> seq = ''.join(seqs)
>>> seq
'ADVTDADTSCVINASHRGDTYERPLK'
>>>
In a slice [x:y], x is included and y is not. When you want to go to the end of the list, just leave y out: [x:] (from the item at index x to the end).
Each set of [] just operates on the result of the expression to its left: the first one indexes the list that split returns, and the resulting list or string is then used directly, without assigning it to another variable first.
Break down the third line like this:
lines = file.split('\n')
first_line = lines[0]
name = first_line[1:]
Break down the fourth line like this:
lines = file.split('\n')
all_but_first_line = lines[1:]
seq = ''.join(all_but_first_line)
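Putting both breakdowns together, a minimal sketch could look like this. The function name parse_fasta_record and the sample string are assumptions for illustration (the sample follows the one-record FASTA layout the question seems to describe); the point is only that the chained form and the step-by-step form give the same result.

```python
def parse_fasta_record(data):
    """Hypothetical helper: split a one-record FASTA string into (name, seq)."""
    lines = data.split('\n')          # list of lines
    first_line = lines[0]             # first indexing: pick one string
    name = first_line[1:]             # slicing that string drops the leading '>'
    all_but_first_line = lines[1:]    # slicing the list drops the header line
    seq = ''.join(all_but_first_line)
    return name, seq


data = ">Protein kinase\nADVTDADTSCVIN\nASHRGDTYERPLK"
name, seq = parse_fasta_record(data)
print(name)
print(seq)

# The chained expressions from the question are equivalent:
assert name == data.split('\n')[0][1:]
assert seq == ''.join(data.split('\n')[1:])
```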
Take this as an example:
myl = [["hello","world","of","python"],["python","is","good"]]
Here myl is a list of lists. So myl[0] means the first element of the list, which is ['hello', 'world', 'of', 'python']. When you use myl[0][1:], you first select that first element (myl[0]) and then, from the resulting list, select every element except the first one ([1:]). So the output is ['world', 'of', 'python'].