Extract rows from list given two criteria - python

What I have:
I have a list of lists, nodes. Each list has the following structure:
nodes = [[ID, number 1, number 2, number 3],[...]]
I also have two other lists of lists, sampleID and sampleID2. Each inner list usually holds a single ID, and these IDs are a subset of the IDs contained in nodes:
sampleID = [[IDa],[...]]
sampleID2 = [[IDb],[...]], len(sampleID) + len(sampleID2) <= len(nodes)
In some cases an inner list can hold several IDs:
sampleID = [[IDa1,IDa2, IDa3,...],[...]]
What I want:
Given these three lists, I'd like a fast way to build a fourth list containing the rows of nodes whose ID matches an ID in sampleID or sampleID2 (IDi == ID, i = a, b):
extractedlist = [[ID, number 1, number 2, number 3],[...]], len(extractedlist) = len(sampleID) + len(sampleID2)
My code:
Very basic, it works but it takes a lot of time to compute:
import itertools
for line in nodes[:]:
    for line2,line3 in itertools.izip(sampleID[:],sampleID2[:]):
        for i in range(0,len(line2)):
            if line2[i]==line[0]:
                extractedlist.append([line[0], line[1], line[2], line[3]])
        for j in range(0,len(line3)):
            if line3[j]==line[0]:
                extractedlist.append([line[0], line[1], line[2], line[3]])

I did not fully understand your problem, but this is how I read it:
nodes = [ .... ]
sampleID = [ .... ]
sampleID2 = [ .... ]
# flatten both sample lists into one list of wanted IDs
final_ids = []
for list_item in sampleID:
    final_ids.extend(list_item)
for list_item in sampleID2:
    final_ids.extend(list_item)
extractedlist = []
for line in nodes:
    if line[0] in final_ids:
        extractedlist.append(line)
I hope this is what you need.
Otherwise, please add a sample input list and the expected result list to the question so I can understand what you want to do :)
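Since the question asks for speed, it may help to keep the wanted IDs in a set: a test like `line[0] in final_ids` scans a list on every row, while a set lookup takes roughly constant time. The following is only a sketch under assumptions; the sample values and the name wanted_ids are made up for illustration.

```python
from itertools import chain

# Hypothetical sample data in the shape described in the question.
nodes = [[1, 0.1, 0.2, 0.3], [2, 1.1, 1.2, 1.3], [3, 2.1, 2.2, 2.3], [4, 3.1, 3.2, 3.3]]
sampleID = [[1], [3]]
sampleID2 = [[4]]

# Flatten both sample lists into one set of wanted IDs.
# This also handles inner lists that hold several IDs.
wanted_ids = set(chain.from_iterable(sampleID + sampleID2))

# Keep only the rows whose first element is a wanted ID, in the order of nodes.
extractedlist = [row for row in nodes if row[0] in wanted_ids]
print(extractedlist)
```

The result follows the order of nodes, not the order of the sample lists.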

Related

Incorrect number of values in output from zip?

I've been working on a problem that involves taking multiple number pairs, and creating some form of sum loop that adds each pair together.
I am not getting the correct number of outputs, e.g. 15 pairs of numbers are inputted and only 8 are coming out.
Here's my code so far...
data = "917128 607663\
907859 281478\
880236 180499\
138147 764933\
120281 410091\
27737 932325\
540724 934920\
428397 637913\
879249 469640\
104749 325216\
113555 304966\
941166 925887\
46286 299745\
319716 662161\
853092 455361"
data_list = data.split(" ") # creating a list of strings
data_list_numbers = [] # converting list of strings to list of integers
for d in data_list:
    data_list_numbers.append(int(d))
# splitting the list into two with every other integer (basically to get the pairs again)
list_one = data_list_numbers[::2]
list_two = data_list_numbers[1::2]
zipped_list = zip(list_one, list_two) #zipping lists
sum = [x+y for x,y in zip(list_one, list_two)] # finding the sum of each pair
print(sum)
What am I missing?
Quote the input string like so: """...""", remove the backslashes, and use re.split to split on whitespace. Note that using backslashes without spaces, as you did, causes the numbers in data to smash into each other. That is, this:
"607663\
907859"
is the same as: "607663907859".
import re
data = """917128 607663
907859 281478
880236 180499
138147 764933
120281 410091
27737 932325
540724 934920
428397 637913
879249 469640
104749 325216
113555 304966
941166 925887
46286 299745
319716 662161
853092 455361"""
data_list = re.split(r'\s+', data) # creating a list of strings
data_list_numbers = [] # converting list of strings to list of integers
for d in data_list:
    data_list_numbers.append(int(d))
# splitting the list into two with every other integer (basically to get the pairs again)
list_one = data_list_numbers[::2]
list_two = data_list_numbers[1::2]
zipped_list = zip(list_one, list_two) #zipping lists
sum = [x+y for x,y in zip(list_one, list_two)] # finding the sum of each pair
print(sum)
# [1524791, 1189337, 1060735, 903080, 530372, 960062, 1475644, 1066310, 1348889, 429965, 418521, 1867053, 346031, 981877, 1308453]
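As a smaller illustration of both points, here is a sketch that uses only the first three pairs from the question. It also relies on str.split() with no argument, which splits on any run of whitespace, so re is not strictly required. The names joined, numbers and pair_sums are made up for this example.

```python
# A backslash at the end of a line inside a string literal joins the
# two lines with nothing in between.
joined = "607663\
907859"
print(joined)  # 607663907859

# A triple-quoted string keeps the newlines, and split() with no argument
# splits on spaces and newlines alike.
data = """917128 607663
907859 281478
880236 180499"""
numbers = [int(token) for token in data.split()]

# Pair up consecutive numbers and add each pair.
pair_sums = [x + y for x, y in zip(numbers[::2], numbers[1::2])]
print(pair_sums)  # [1524791, 1189337, 1060735]
```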

associate 2 lists based on ID

I'm trying to merge data from 2 lists by an ID:
list_a = [
(u'65d92438497c', u'compute-0'),
(u'051df48db621', u'compute-4'),
(u'd6160db0cbcd', u'compute-3'),
(u'23fc20b59bd6', u'compute-1'),
(u'0db2e733520d', u'controller-1'),
(u'89334dac8a59', u'compute-2'),
(u'51cf9d50b02e', u'compute-5'),
(u'f4fe106eaeab', u'controller-2'),
(u'06cc124662dc', u'controller-0')
]
list_b = [
(u'65d92438497c', u'p06619'),
(u'051df48db621', u'p06618'),
(u'd6160db0cbcd', u'p06620'),
(u'23fc20b59bd6', u'p06622'),
(u'0db2e733520d', u'p06612'),
(u'89334dac8a59', u'p06621'),
(u'51cf9d50b02e', u'p06623'),
(u'f4fe106eaeab', u'p06611'),
(u'06cc124662dc', u'p06613')
]
list_ab = [
(u'65d92438497c', u'p06619', u'compute-0'),
(u'051df48db621', u'p06618', u'compute-4'),
(u'd6160db0cbcd', u'p06620', u'compute-3'),
(u'23fc20b59bd6', u'p06622', u'compute-1'),
(u'0db2e733520d', u'p06612', u'controller-1'),
(u'89334dac8a59', u'p06621', u'compute-2'),
(u'51cf9d50b02e', u'p06623', u'compute-5'),
(u'f4fe106eaeab', u'p06611', u'controller-2'),
(u'06cc124662dc', u'p06613', u'controller-0')
]
As you can see, the first field is an ID that is identical between list_a and list_b, and I need to merge on this value.
I'm not sure what type of data I need for result_ab.
The purpose of this is to find 'compute-0' from 'p06619', so maybe there is a better way than a merge.
You are using a one-dimensional list containing a single tuple, which may not be needed. Anyway, to obtain the output you require:
list_a = [(u'65d92438497c', u'compute-0')]
list_b = [(u'65d92438497c', u'p-06619')]
result_ab = None
if list_a[0][0] == list_b[0][0]:
    result_ab = [tuple(list(list_a[0]) + list(list_b[0][1:]))]
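Here is my solution (it assumes both lists are in the same order):
merge = []
for i in range(0,len(list_a)):
    if list_a[i][0] == list_b[i][0]:
        # append the merged tuple itself, not a list wrapping it
        merge.append(tuple(list(list_a[i]) + list(list_b[i][1:])))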
The idea is to create a dictionary with the keys as the first element of both the lists and values as the list object with all the elements matching that key.
Next, just iterate over the dictionary and create the required new list object:
from collections import defaultdict
res = defaultdict(list)
for elt in list_a:
    res[elt[0]].extend(elt[1:])
for elt in list_b:
    res[elt[0]].extend(elt[1:])
list_ab = []
for key, value in res.items():
    elt = (key, *value)
    list_ab.append(elt)
print(list_ab)
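If the real goal is only to find 'compute-0' from 'p06619', as the question says, a lookup dictionary may be simpler than a merged list. This is a sketch, not a definitive answer; it uses two entries from the question, and the names host_by_id and host_by_p are made up for illustration.

```python
# Two entries copied from the question's lists, kept short for illustration.
list_a = [('65d92438497c', 'compute-0'), ('051df48db621', 'compute-4')]
list_b = [('65d92438497c', 'p06619'), ('051df48db621', 'p06618')]

# ID -> host name, built directly from the (ID, host) tuples in list_a.
host_by_id = dict(list_a)

# p-number -> host name, joined through the shared ID.
# IDs that are missing from list_a are skipped.
host_by_p = {p: host_by_id[node_id] for node_id, p in list_b if node_id in host_by_id}

print(host_by_p['p06619'])  # compute-0
```

Unlike an index-based merge, this does not depend on the two lists being in the same order.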

Python get keys from ordered dict

python noob here. So I'm making a program that will take a JSON file from a url and parse the information and put it into a database. I have the JSON working, thankfully, but now I am stuck, I'll explain it through my code.
playerDict = {
"zenyatta" : 0,
"mei" : 0,
"tracer" : 0,
"soldier76" : 0,
"ana" : 0,
...}
So this is my original dictionary, which I then fill with the player's data for each hero.
topHeroes = sorted(playerDict.items(),key = operator.itemgetter(1),reverse = True)
I then sort the dictionary's items, which returns the heroes with the most hours played first.
topHeroesDict = topHeroes[0:3]
playerDict['tophero'] = topHeroesDict[0]
I then get the top three heroes. The second line here stores a whole tuple, so the entry prints like so:
'secondhero': ('mercy', 6.0)
Whereas I want the output to be:
'secondhero': 'mercy'
I would appreciate any help. I have tried the code below, with and without list.
list(topHeroes.keys())[0]
So thanks in advance and apologies for the amount of code!
You could take an approach with enumerate, if instead of "firsthero" you are OK with "Top 1" and so on. With enumerate you can iterate over the list and keep a running count (started at 1 here), which is used to name the key in this dictionary comprehension. j[0] is the name of the hero, which is the first element of the tuple.
topHeroes = sorted(playerDict.items(),key = operator.itemgetter(1),reverse = True)
topHeroesDict = {"Top "+str(i): j[0] for i, j in enumerate(topHeroes[0:3], 1)}
Alternatively, you could use a dictionary which maps the index to first like this:
topHeroes = sorted(playerDict.items(),key = operator.itemgetter(1),reverse = True)
top = {0: "first", 1: "second", 2: "third"}
topHeroesDict = {top[i]+"hero": j[0] for i, j in enumerate(topHeroes[0:3])}
You do not need any imports to achieve this. Without itemgetter, you can do it in one line like this:
top = {0: "first", 1: "second", 2: "third"}
topHeroesDict = {top[i]+"hero": j[0] for i, j in enumerate(sorted(playerDict.items(), key = lambda x: x[1], reverse = True)[0:3])}
You're sorting an iterable of tuples returned by the items method of the dict, so each item in the sorted list is a tuple containing the hero and their score.
You can avoid using sorted and dict.items altogether and get the leading heroes (without their score) by simply using collections.Counter and then getting the most_common 3 heroes.
from collections import Counter
player_dict = Counter(playerDict)
leading_heroes = [hero for hero, _ in player_dict.most_common(3)]
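To get back to the 'secondhero': 'mercy' shape the question asks for, the most_common result can be zipped with a list of key names. This is a sketch under assumptions: the hours are invented sample values, and "thirdhero" is a guessed key name that does not appear in the question.

```python
from collections import Counter

# Hypothetical hours played per hero; the real values come from the JSON.
playerDict = {"zenyatta": 2.5, "mei": 0, "tracer": 4.0, "soldier76": 1.0, "ana": 0, "mercy": 6.0}

# most_common(3) returns (hero, hours) tuples, highest hours first.
top_three = Counter(playerDict).most_common(3)

# Keep only the hero name (the first element of each tuple) under each key.
labels = ["tophero", "secondhero", "thirdhero"]  # "thirdhero" is an assumed name
top_heroes = {label: hero for label, (hero, _hours) in zip(labels, top_three)}
print(top_heroes)
```

Storing the result in a separate dictionary keeps these name entries from mixing with the numeric hours in playerDict.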

How to create a frequency matrix?

I just started using Python and I just came across the following problem:
Imagine I have the following list of lists:
list = [["Word1","Word2","Word2","Word4566"],["Word2", "Word3", "Word4"], ...]
The result (matrix) I want to get should look like this:
The displayed columns and rows are all the words that appear (no matter in which list).
What I want is a program that counts the appearances of words in each list (list by list).
The picture is the result after the first list.
Is there an easy way to achieve something like this or something similar?
EDIT:
Basically I want a List/Matrix that tells me how many times words 2-4566 appeared when word 1 was also in the list, and so on.
So I would get a list for each word that displays the absolute frequency of all other 4555 words in relationship with this word.
So I would need an algorithm that iterates through all these lists of words and builds the result lists.
As far as I understand you want to create a matrix that shows the number of lists where two words are located together for each pair of words.
First of all we should fix the set of unique words:
lst = [["Word1","Word2","Word2","Word4566"],["Word2", "Word3", "Word4"], ...] # list is the name of a built-in type in python, don't use it as a variable name
words = set()
for sublst in lst:
    words |= set(sublst)
words = list(words)
Second we should define a matrix with zeros:
result = [[0] * len(words) for _ in words] # zeros matrix N x N, each row is a separate list
And finally we fill the matrix going through the given list:
for sublst in lst:
    sublst = list(set(sublst)) # selecting unique words only
    for i in range(len(sublst)):
        for j in range(i + 1, len(sublst)):
            index1 = words.index(sublst[i])
            index2 = words.index(sublst[j])
            result[index1][index2] += 1
            result[index2][index1] += 1
print(result)
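A compact variant of the same pair-counting idea can be sketched with itertools.combinations and a word-to-index dictionary, which avoids calling words.index repeatedly. The input below is a small made-up example, and the name index is an assumption of this sketch. Note that the matrix is built row by row: writing [[0] * n] * n would repeat one row object n times, so every update would show up in all rows.

```python
from itertools import combinations

# Hypothetical small input in the shape used in the question.
lst = [["Word1", "Word2", "Word2"], ["Word2", "Word3", "Word4"], ["Word2", "Word3"]]

# Sorted vocabulary and a word -> row/column index map.
words = sorted({w for sublst in lst for w in sublst})
index = {w: i for i, w in enumerate(words)}

# N x N zero matrix, one independent list per row.
result = [[0] * len(words) for _ in words]

for sublst in lst:
    # Each unordered pair of distinct words in this list counts once.
    for a, b in combinations(sorted(set(sublst)), 2):
        result[index[a]][index[b]] += 1
        result[index[b]][index[a]] += 1

for word, row in zip(words, result):
    print(word, row)
```

Each cell then holds the number of lists in which the two words occur together, and the diagonal stays at zero.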
I find it really hard to understand what you're really asking for, but I'll try by making some assumptions:
(1) You have a list (A), containing other lists (b) of multiple words (w).
(2) For each b-list in A-list
(3) For each w in b:
(3.1) count the total number of appearances of w in all of the b-lists
(3.2) count how many of the b-lists, in which w appears only once
If these assumptions are correct, then the table doesn't correspond correctly to the list you've provided. If my assumptions are wrong, then I still believe my solution may give you inspiration or some ideas on how to solve it correctly. Finally, I do not claim my solution to be optimal with respect to speed or similar.
Note: I use Python's built-in dictionaries. Their lookups are hash-based and stay fast even with thousands of words, so the running time here is dominated by the loops over the lists. Have a look at: https://docs.python.org/2/tutorial/datastructures.html#dictionaries
frq_dict = {} # num of appearances / frequency
uqe_dict = {} # unique
for list_b in list_A:
    temp_dict = {}
    for word in list_b:
        if word in temp_dict:
            temp_dict[word] += 1
        else:
            temp_dict[word] = 1
    # frq is the number of appearances
    for word, frq in temp_dict.items():
        if frq > 1:
            if word in frq_dict:
                frq_dict[word] += frq
            else:
                frq_dict[word] = frq
        else:
            if word in uqe_dict:
                uqe_dict[word] += 1
            else:
                uqe_dict[word] = 1
I managed to come up with the right answer to my own question:
list = [["Word1","Word2","Word2"],["Word2", "Word3", "Word4"],["Word2","Word3"]]
#Names of all dicts
all_words = sorted(set([w for sublist in list for w in sublist]))
#Creating the dicts
dicts = []
for i in all_words:
    dicts.append([i, dict.fromkeys([w for w in all_words if w != i],0)])
#Updating the dicts
for l in list:
    for word in sorted(set(l)):
        tmpL = [w for w in l if w != word]
        ind = ([w[0] for w in dicts].index(word))
        for w in dicts[ind][1]:
            dicts[ind][1][w] += l.count(w)
print(dicts)
Gets the result:
[['Word1', {'Word4': 0, 'Word3': 0, 'Word2': 2}], ['Word2', {'Word4': 1, 'Word1': 1, 'Word3': 2}], ['Word3', {'Word4': 1, 'Word1': 0, 'Word2': 2}], ['Word4', {'Word1': 0, 'Word3': 1, 'Word2': 1}]]

Editing script to account for every combination of two lists

I have what I think is a difficult problem to solve. I have a script that cycles through a CSV to count the number of occurrences of data in different columns. This script works well and is included below for reference:
Original script
import csv
import datetime
import copy
from collections import defaultdict
with open(r"C:\Temp\test2.csv") as i, open(r"C:\Temp\results2.csv", "wb") as o:
    rdr = csv.reader(i)
    wrt = csv.writer(o)
    data, currdate = defaultdict(lambda:[0, 0, 0, 0]), None
    for line in rdr:
        date = datetime.datetime.strptime(line[0], '%d/%m/%Y')
        name = (line[7], line[9])
        if date != currdate or not currdate:
            for v in data.itervalues(): v[:2] = v[2:]
            currdate = date
        wrt.writerow(line + data[name][:2])
        data[name][3] += 1
        if line[6] == "1": data[name][2] += 1
I have edited this script to add a percentage column, and I can make it handle several different column pairings manually, e.g. columns 7/9 here, or 7/10, all in one script. However, I do not know which function or method would do what I actually need. Essentially, I need it to go through each calclist in the script below and output the same numbers for every combination of the column references in the calclists, i.e. for 6/7, 6/19, 6/23 and so on.
Because the calclists in my real script are much longer than in this example, it would also be nice if this edit could attach titles to the new columns; I don't have a suitable method or mechanism to do this. If there were a list of titles for the calclists, it might be possible to create titles in this format (remembering there are three for each run of the script): "title1-title2-x","title1-title2-y","title1-title2-z"
import csv
import datetime
import copy
from collections import defaultdict
with open(r"dualparametermatch_test.csv") as i, open(r"dualparametermatch_test_edit.csv", "wb") as o:
    rdr = csv.reader(i)
    wrt = csv.writer(o)
    data, currdate = defaultdict(lambda:[0, 0, 0, 0]), None
    # Identical calclists
    calclist = [6, 7, 19, 23, 25, 26, 35, 62, 64]
    calclist2 = [6, 7, 19, 23, 25, 26, 35, 62, 64]
    for counter, line in enumerate(rdr):
        if counter == 0:
            #Titles, there are three for each item in the calclist
            titles = ["titleX", "titleY", "titleZ"] # ... etc
            wrt.writerow(line + titles)
        else:
            extra_cols = []
            for calc in calclist:
                date = datetime.datetime.strptime(line[0], '%d/%m/%Y')
                name = (line[calclist], line[calclist2])
                if date != currdate or not currdate:
                    for v in data.itervalues(): v[:2] = v[2:]
                    currdate = date
                ### Adds the percentage calculation column
                top,bottom = data[name][0:2]
                try:
                    quotient = '{0:0.5f}'.format(float(top)/bottom).rstrip("0")
                except ZeroDivisionError:
                    quotient = 0
                extra_cols.extend(data[name][:2]+ [quotient])
                data[name][3] += 1
                if line[6] == "1": data[name][2] += 1
            wrt.writerow(line + data[name][:2])
I appreciate this could be a difficult problem to solve and if anyone out there can help with this then first and foremost Kudos to you! If more detail is required or anything is unclear please get back to me. I can provide example data and output for the original script if required. Thanks in advance AEA
I'm not sure I understand the question correctly, so this could be completely off base, but if you simply want all unique combinations of the numbers that appear in all calclists, you could approach it like this:
calclists = [[1,2,3], [4,5,6], [7,8,9]] # calclists is a list of calclists
calcset = set()
for calclist in calclists:
    for x in calclist:
        calcset.add(x)
unique_calclist = list(calcset)
for i, x in enumerate(unique_calclist):
    # in your example you didn't use combinations of duplicate values,
    # so x is paired only with the values after it: each pair
    # appears once and x is never paired with itself
    for y in unique_calclist[i + 1:]:
        print((x, y))
Here is the same thing using itertools (note this approach assumes that no value appears more than once across the lists in calclists).
import itertools
calclists = [[1,2,3], [4,5,6], [7,8,9]]
comb_itr = itertools.combinations(itertools.chain.from_iterable(calclists), 2)
for comb in comb_itr:
    print(comb)
If you can't assume that all of the values in each list are unique you could combine the above two approaches like so:
import itertools
calclists = [[1,2,3], [4,5,6], [1, 2, 3]]
calcset = set()
for calclist in calclists:
    for x in calclist:
        calcset.add(x)
comb_itr = itertools.combinations(calcset, 2)
for comb in comb_itr:
    print(comb)
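For the titles the question asks about, the same combinations call could drive both the column pairs and the "title1-title2-x" style headers. This is only a sketch: the calclist is shortened, and the column_titles mapping and its values are hypothetical, since the question does not give the real titles.

```python
from itertools import combinations

# Column indexes from the question's calclist, shortened for illustration.
calclist = [6, 7, 19]

# Hypothetical titles for those columns; the real ones are not in the question.
column_titles = {6: "title6", 7: "title7", 19: "title19"}

# Every unordered pair of columns, e.g. 6/7, 6/19, 7/19.
pairs = list(combinations(calclist, 2))

# Three headers per pair, in the "title1-title2-x" format from the question.
headers = [
    "{}-{}-{}".format(column_titles[a], column_titles[b], suffix)
    for a, b in pairs
    for suffix in ("x", "y", "z")
]

print(pairs)
print(headers)
```

Inside the row loop, each pair (a, b) would then give a key such as (line[a], line[b]) in place of the fixed (line[7], line[9]) of the original script.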
As others are saying, it is a bit hard to follow, but going off of what you're saying and off the code, I think csv's DictReader and DictWriter are what you're looking for.
http://docs.python.org/2/library/csv.html#csv.DictReader
http://docs.python.org/2/library/csv.html#csv.DictWriter
Using these, if you need to manipulate, say, only three of x columns of your csv, and you know those column names, you can call the corresponding columns of each row by its column name (I'd recommend making those column names constants). Basically, each row becomes a dictionary where each key is a column name and the value is the element of that row.
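A minimal sketch of that idea follows, written for Python 3 (the scripts in the question are Python 2) and using in-memory text instead of real files. The column names, the sample rows and the added "is_winner" column are all made up for illustration.

```python
import csv
import io

# Hypothetical CSV content; a real script would open a file instead.
source = io.StringIO(
    "date,name,city,won\n"
    "01/02/2013,alpha,Leeds,1\n"
    "02/02/2013,beta,York,0\n"
)

WON = "won"  # column name kept as a constant, as suggested above

reader = csv.DictReader(source)
out = io.StringIO()
# Output columns: the input columns plus one new, hypothetical column.
writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["is_winner"], lineterminator="\n")
writer.writeheader()

rows = []
for row in reader:
    # Each row is a dictionary keyed by column name.
    row["is_winner"] = "yes" if row[WON] == "1" else "no"
    writer.writerow(row)
    rows.append(row)

print(out.getvalue())
```

Because rows are addressed by column name, the code keeps working if columns are reordered in the input file.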
