I have an array of lists containing lemmatized words. When I print several of them at once, this is the output:
print(data[:3])
[list(['#', 'switchfoot', 'http', ':', '//twitpic.com/2y1zl', '-', 'Awww', ',', 'that', "'s", 'a', 'bummer', '.', 'You', 'shoulda', 'got', 'David', 'Carr', 'of', 'Third', 'Day', 'to', 'do', 'it', '.', ';', 'D'])
list(['is', 'upset', 'that', 'he', 'ca', "n't", 'update', 'his', 'Facebook', 'by', 'texting', 'it', '...', 'and', 'might', 'cry', 'a', 'a', 'result', 'School', 'today', 'also', '.', 'Blah', '!'])
list(['#', 'Kenichan', 'I', 'dived', 'many', 'time', 'for', 'the', 'ball', '.', 'Managed', 'to', 'save', '50', '%', 'The', 'rest', 'go', 'out', 'of', 'bound'])]
I tried many things to get rid of it, but nothing worked. However, when I tried:
a = [[i for i in range(5)] for _ in range(5)]
print(np.array(a))
the output does not have list() around every list:
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
Does it mean they are different kinds of lists? Does it only happen with lists of strings? How can I get rid of it, if that is even necessary? Thanks for your time.
You could potentially loop through the 3 lists and print them with the * symbol in front.
for i in data[:3]:
    print(*i)
This would in normal cases remove the brackets and commas of the list and just print the items separated by spaces. I must admit, though, that I do not understand how you got your output, so this is just my 2 cents. Hope it helps :)
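To illustrate what the * does here, a small standalone example (the words list below is made up):

words = ['hello', 'world', 'i', 'am']
print(*words)             # hello world i am
print(*words, sep=' | ')  # hello | world | i | am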
print(data[:3].tolist())
Convert the array to a list. This will use the list-of-lists display, as opposed to the display of an array of lists.
But as hashed out in the comments, there is a significant difference between an array of lists and a 2d array.
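A quick way to see that difference (a minimal sketch; the names a2d and ragged are mine, not from the question):

import numpy as np

# A true 2d array: every row has the same length, a numeric dtype, shape (2, 3)
a2d = np.array([[0, 1, 2], [3, 4, 5]])
print(a2d.shape, a2d.dtype)        # (2, 3) int64 (exact integer type is platform dependent)

# An object array of lists: the "rows" are ordinary Python lists, shape (2,)
ragged = np.empty(2, dtype=object)
ragged[0] = [0, 1, 2]
ragged[1] = [3, 4, 5, 6]
print(ragged.shape, ragged.dtype)  # (2,) object
print(ragged)                      # [list([0, 1, 2]) list([3, 4, 5, 6])]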
Including list in the display is a relatively recent change in numpy. It was added, I think, to clarify the underlying nature of the elements of an object dtype array.
Consider, for example, an array with a variety of element types:
In [532]: x=np.empty(5,object)
In [533]: x[0]=[1,2,3]; x[1]=(1,2,3); x[2]=np.array([1,2,3]); x[3]=np.matrix([1,2,3]); x[4]={0:1}
In [534]: x
Out[534]:
array([list([1, 2, 3]),
       (1, 2, 3),
       array([1, 2, 3]),
       matrix([[1, 2, 3]]),
       {0: 1}],
      dtype=object)
I tweaked the layout for clarity. But note that without the words, the list and array elements would look a lot alike.
Converting the array to a list, we get the default formatting of a list:
In [537]: x.tolist()
Out[537]: [[1, 2, 3], (1, 2, 3), array([1, 2, 3]), matrix([[1, 2, 3]]), {0: 1}]
The elements of the array and of the list are the same.
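For example (continuing with the x above; the In/Out prompts are illustrative), you can check that the first element comes through unchanged:

In [538]: y = x.tolist()

In [539]: y[0] == x[0]
Out[539]: True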
In Python, how do I create variable-length combinations or permutations?
For example, for the list 'letters', I want to produce 'symbols' as below:
['ab', 'abc', 'abcd', 'bc', 'bcd', 'cd']
My current code is below:
letters = ['a', 'b', 'c', 'd']
letter_len = len(letters)
symbols = []
for i in range(letter_len):
    for j in range(i + 1, letter_len):
        sub = letters[i:j + 1]
        sub = ''.join(sub)
        symbols.append(sub)
print(symbols)
If you only want adjacent values, it's probably most efficient to slide a window across the input, slicing out the substrings you want rather than splitting and joining them.
As a bonus, the same logic works for lists:
def multi_adjacent(letters, min_gap=2):
    # further opportunity for dynamic maximum length and bounds-checking
    for idx_start in range(len(letters) - (min_gap - 1)):
        for idx_end in range(idx_start + min_gap, len(letters) + 1):
            yield letters[idx_start:idx_end]  # slice for substring
>>> list(multi_adjacent("abcd"))
['ab', 'abc', 'abcd', 'bc', 'bcd', 'cd']
>>> list(multi_adjacent("abcd", 3))
['abc', 'abcd', 'bcd']
>>> list(multi_adjacent([1,2,3,4]))
[[1, 2], [1, 2, 3], [1, 2, 3, 4], [2, 3], [2, 3, 4], [3, 4]]
If you're really after every grouping (not just adjacent ones), itertools.combinations can provide this:
from itertools import combinations

def multi_combination(letters):
    for grouping_size in range(2, len(letters) + 1):
        for grouping in combinations(letters, grouping_size):
            yield "".join(grouping)
>>> sorted(multi_combination("abcd"))
['ab', 'abc', 'abcd', 'abd', 'ac', 'acd', 'ad', 'bc', 'bcd', 'bd', 'cd']
I have a NumPy ndarray that has been converted to a list of lists with array = [list(ele) for ele in array]. I also have a list indexes that is passed to my function to_float as a parameter; it specifies the indices of the elements, within every sublist, that must not be affected by the function. The function has to convert to float every element of every sublist whose index is not in indexes.
For example, my array (converted to lists) and indexes could be:
array = [['Hi', 'how', 'are', 'you', '4.65', '5.789', 'eat', '9.021'], ['its', 'not', 'why', 'you', '6.75', '5.89', 'how', '2.10'],
['On', 'woah', 'right', 'on', '7.45', '9.99', 'teeth', '2.11']]
indexes = [0, 1, 2, 3, 6]
I now have to convert to float all elements in all my lists whose indices are not specified in indexes.
Desired output:
[['Hi', 'how', 'are', 'you', 4.65, 5.789, 'eat', 9.021], ['its', 'not', 'why', 'you', 6.75, 5.89, 'how', 2.10],
['On', 'woah', 'right', 'on', 7.45, 9.99, 'teeth', 2.11]]
As you can see, elements with indices 4, 5, 7 in all lists were converted to float as their indices were not in indexes.
So how could I do this?
You can use enumerate and a list comprehension:
array = [
    ["Hi", "how", "are", "you", "4.65", "5.789", "eat", "9.021"],
    ["its", "not", "why", "you", "6.75", "5.89", "how", "2.10"],
    ["On", "woah", "right", "on", "7.45", "9.99", "teeth", "2.11"],
]

indexes = [0, 1, 2, 3, 6]

array = [
    [float(val) if i not in indexes else val for i, val in enumerate(subl)]
    for subl in array
]

print(array)
Prints:
[['Hi', 'how', 'are', 'you', 4.65, 5.789, 'eat', 9.021],
['its', 'not', 'why', 'you', 6.75, 5.89, 'how', 2.1],
['On', 'woah', 'right', 'on', 7.45, 9.99, 'teeth', 2.11]]
Note: to speed things up, you can convert indexes to a set.
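For example, a one-line change before the comprehension (reusing the same indexes as above):

indexes = set(indexes)  # 'i not in indexes' is now O(1) on average instead of scanning a list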
I have a set of unique words called h_unique. I also have a 2D list of documents called h_tokenized_doc which has a structure like:
[ ['hello', 'world', 'i', 'am'],
['hello', 'stackoverflow', 'i', 'am'],
['hello', 'world', 'i', 'am', 'mr'],
['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]
and h_unique as:
('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
What I want is to find the occurrences of the unique words in the tokenized documents list.
So far I have come up with this code, but it seems to be VERY slow. Is there a more efficient way to do this?
term_id = []
for term in h_unique:
    print term
    for doc_id, doc in enumerate(h_tokenized_doc):
        term_id.append([doc_id for t in doc if t == term])
In my case I have a document list of 7000 documents, structured like:
[ [doc1], [doc2], [doc3], ..... ]
It'll be slow because you're running through your entire document list once for every unique word. Why not try storing the unique words in a dictionary and appending to it for each word found?
unique_dict = {term: [] for term in h_unique}

for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        try:
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything
            pass
This runs through all the documents only once.
From your above example, unique_dict would be:
{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}
(Of course assuming the typo 'pycahrm' in your example was deliberate)
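If you only need the document ids per term (as in hello -> [0, 1, 2, 3]), one possible follow-up, sketched here, is to collapse the (doc_id, term_id) pairs afterwards:

# Reduce each term's (doc_id, position) pairs to a sorted list of unique doc ids
doc_ids = {term: sorted({doc_id for doc_id, _ in hits})
           for term, hits in unique_dict.items()}
# doc_ids['hello'] -> [0, 1, 2, 3], doc_ids['world'] -> [0, 2], doc_ids['pycharm'] -> []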
term_id.append([doc_id for t in doc if t == term])
This will not append one doc_id for each matching term; it will append an entire list of potentially many identical values of doc_id. Surely you did not mean to do this.
Based on your sample code, term_id ends up as this:
[[0], [1], [2], [3], [0], [], [2], [], [0], [1], [2], [3], [0], [1], [2], [3], [], [1], [], [3], [], [], [2], [], [], [], [], []]
Is this really what you intended?
If I understood correctly, and based on your comment to the question where you say
yes because a single term may appear in multiple docs like in the above case for hello the result is [0,1, 2, 3] and for world it is [0, 2]
it looks like what you want to do is: for each of the words in the h_unique list (which, as mentioned, should be a set, or the keys of a dict, both of which have O(1) lookup), go through all the lists contained in the h_tokenized_doc variable and find the indexes of the lists in which the word appears.
If that's actually what you want to do, you could do something like the following:
#!/usr/bin/env python

h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']

# Initialize a dict with empty lists as the value and the items
# in h_unique the keys
results = {k: [] for k in h_unique}

for i, line in enumerate(h_tokenized_doc):
    for k in results:
        if k in line:
            results[k].append(i)

print results
Which outputs:
{'pycharm': [], 'i': [0, 1, 2, 3], 'stackoverflow': [1, 3],
'am': [0, 1, 2, 3], 'mr': [2], 'world': [0, 2],
'hello': [0, 1, 2, 3]}
The idea is using the h_unique list as keys in a dictionary (the results = {k: [] for k in h_unique} part).
Keys in dictionaries have the advantage of a constant lookup time (if results were a plain list, each lookup with in would take O(n)). We then check whether the word (the key k) appears in each line with if k in line:; if it does, we append the index of that list within the matrix to the results dictionary.
I'm not certain this is what you want to achieve, though.
You can optimize your code to do the trick with:
just a single for loop, and
generators plus a dictionary for constant lookup time, as suggested previously. Generators are faster than explicit for loops because they produce values on the fly.
In [75]: h_tokenized_doc = [ ['hello', 'world', 'i', 'am'],
...: ['hello', 'stackoverflow', 'i', 'am'],
...: ['hello', 'world', 'i', 'am', 'mr'],
...: ['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]
In [76]: h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
In [77]: term_id = {k: [] for k in h_unique}
In [78]: for term in h_unique:
    ...:     term_id[term].extend(i for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i])
which yields the output
{'am': [0, 1, 2, 3],
'hello': [0, 1, 2, 3],
'i': [0, 1, 2, 3],
'mr': [2],
'pycharm': [],
'stackoverflow': [1, 3],
'world': [0, 2]}
A more descriptive solution would be
In [79]: for term in h_unique:
    ...:     term_id[term].extend([(i,h_tokenized_doc[i].index(term)) for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i]])
In [80]: term_id
Out[80]:
{'am': [(0, 3), (1, 3), (2, 3), (3, 3)],
'hello': [(0, 0), (1, 0), (2, 0), (3, 0)],
'i': [(0, 2), (1, 2), (2, 2), (3, 2)],
'mr': [(2, 4)],
'pycharm': [],
'stackoverflow': [(1, 1), (3, 1)],
'world': [(0, 1), (2, 1)]}
Let's say I have the following list x in Python:
[['a',6,'aa'],
 ['d',7,'bb'],
 ['c',1,'cc'],
 ['a',4,'dd'],
 ['d',2,'ee']]
and I want to sort its elements in order to obtain the following result
[['a',4,'dd'],
 ['a',6,'aa'],
 ['c',1,'cc'],
 ['d',2,'ee'],
 ['d',7,'bb']]
that is, I want to sort it by two columns: the first one (the most important) and the second one (the less important). This is probably a duplicate question, but I haven't been able to find the solution...
The following sorts the list (named a here) by the first element, then by the second element:
>>> sorted(a, key=lambda x:(x[0], x[1]))
[['a', 4, 'dd'], ['a', 6, 'aa'], ['c', 1, 'cc'], ['d', 2, 'ee'], ['d', 7, 'bb']]
Since it does not matter whether you also sort by the third column, you can use a plain sort here and get the same result:
>>> sorted(a)
[['a', 4, 'dd'], ['a', 6, 'aa'], ['c', 1, 'cc'], ['d', 2, 'ee'], ['d', 7, 'bb']]
This is because lists are compared left to right and sorted in lexicographical order.
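A quick illustration of that rule (standalone values, not from the question):

>>> ['a', 6, 'aa'] < ['a', 7, 'zz']   # the first elements tie, so 6 < 7 decides
True
>>> ['c', 1] < ['d', 0]               # 'c' < 'd', so the second elements are never compared
True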
If you did want to order by arbitrary column order, you should use operator.itemgetter, which is faster than using a lambda function for the key.
>>> import operator
>>> sorted(a, key=operator.itemgetter(1, 0)) # order by column 1 first, then 0.
[['c', 1, 'cc'], ['d', 2, 'ee'], ['a', 4, 'dd'], ['a', 6, 'aa'], ['d', 7, 'bb']]
You can simply use sorted, since sorting compares the sublists element by element, from the first element to the last.
a = [['a',6,'aa'],['d',7,'bb'],['c',1,'cc'],['a',4,'dd'],['d',2,'ee']]
sorted(a)
[['a', 4, 'dd'],
 ['a', 6, 'aa'],
 ['c', 1, 'cc'],
 ['d', 2, 'ee'],
 ['d', 7, 'bb']]
I've been teaching myself Python and so far it hasn't been too bad. I reckon the easiest way to learn is to just start coding, so I've come up with relatively simple tasks to help me. Now I'm stuck, and it's getting frustrating not being able to figure out what I'm doing wrong.
{0: [[1, '.', 0, '-', 8]],
1: [['.', 3, '.', 2, 0, 0, 1, '-']],
2: [],
3: [['.', '.']],
4: [],
5: [[2, 0, 1, 2, '-', 0, 1, '-', 1, 9]],
6: [[1, '.', 0, 0, 9, 5]], etc...
So I have a dictionary where each key has a value (the values are exchange rates for various currency pairs). I've been trying to turn each value into a single string, so for instance:
0: [[1, '.', 0, '-', 8]] would become 0: '1.0-8'
6: [[1, '.', 0, 0, 9, 5]] would be 6 : '1.0095'
etc...
After numerous hours of trying various methods and googling, I've come to the conclusion that I have no idea how to accomplish this. I've tried a replace, and various complicated loops that turn the dictionary into a list of lists (with each key/value pair as a list) and then iterate over it with a join function, etc., and it has all accomplished absolutely nothing!
It seems like it should be simple to do, but I give up; I have no idea how to do this, so hopefully someone here can give me a hand.
Thanks a lot!
One-liner...
>>> {i:(''.join(str(x) for x in j[0]) if j else '') for i,j in d.items()}
{0: '1.0-8', 1: '.3.2001-', 2: '', 3: '..', 4: '', 5: '2012-01-19', 6: '1.0095'}
"writing it in a noob friendly way", as asked:
d_new = {}
for i,j in d.items():
    if j:
        d_new[i] = ''.join(str(x) for x in j[0])
    else:
        d_new[i] = ''
d = {0: [[1, '.', 0, '-', 8]], 6: [[1, '.', 0, 0, 9, 5]]}
for k,v in d.iteritems():
    print str(k)+':'+''.join(str(el) for el in v[0])
>>> from itertools import chain
>>> d = {0: [[1, '.', 0, '-', 8]], 1: [['.', 3, '.', 2, 0, 0, 1, '-']], 2: [],
3: [['.', '.']], 4: [], 5: [[2, 0, 1, 2, '-', 0, 1, '-', 1, 9]], 6: [[1, '.', 0, 0, 9, 5]]}
>>> {k:''.join(str(el) for el in chain.from_iterable(v)) for k,v in d.items()}
{0: '1.0-8', 1: '.3.2001-', 2: '', 3: '..', 4: '', 5: '2012-01-19', 6: '1.0095'}
Another approach. Here chain takes care of unwrapping the inner list and of handling empty lists.
>>> {k:"".join(map(str,chain(*v))) for k,v in spam.items()}
{0: '1.0-8', 1: '.3.2001-', 2: '', 3: '..', 4: '', 5: '2012-01-19', 6: '1.0095'}
A few intuitive building blocks:
To unwrap a list like [[1, '.', 0, '-', 8]] using itertools.chain, use
>>> list(chain(*[[1, '.', 0, '-', 8]]))
[1, '.', 0, '-', 8]
To convert a list of ints to strings with map, use the following (note that it also works with other iterables):
>>> map(str,[1, '.', 0, '-', 8])
['1', '.', '0', '-', '8']
To join a list of strings, use the following (again, it also works with other iterables):
>>> "".join(['1', '.', '0', '-', '8'])
'1.0-8'
Finally, a dictionary comprehension puts these pieces together, producing one joined string per key (as in the solutions above).
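For instance, a minimal dict comprehension on made-up data, just to show the shape of the expression:

>>> {k: "".join(map(str, v)) for k, v in {0: [1, '.', 0], 1: []}.items()}
{0: '1.0', 1: ''}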