I have a table like this:

Column A    Column B
a           [1, 2, 3]
b           [4, 1, 2]
And I want to create a dictionary like this using NumPy:
{1: [a, b],
2: [a, b],
3: [a],
4: [b]}
Is there a more or less simple way to do this?
Let us try explode with groupby (assuming the columns are named col1 and col2):
d = df.explode('col2').groupby('col2')['col1'].agg(list).to_dict()
Out[206]: {1: ['a', 'b'], 2: ['a', 'b'], 3: ['a'], 4: ['b']}
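A self-contained version of the same idea, with the sample data built up front (the column names col1/col2 are assumptions, matching the one-liner above):

```python
import pandas as pd

# Sample frame matching the question's table
df = pd.DataFrame({'col1': ['a', 'b'], 'col2': [[1, 2, 3], [4, 1, 2]]})

# explode turns each list element into its own row, then we group by value
d = df.explode('col2').groupby('col2')['col1'].agg(list).to_dict()
print(d)  # {1: ['a', 'b'], 2: ['a', 'b'], 3: ['a'], 4: ['b']}
```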
As far as I know, NumPy doesn't support dictionaries; it works with arrays (numpy.ndarray).
But there are many ways to build a dict from a pandas DataFrame. Looping over a DataFrame row by row is not good practice, as you can see in this answer, so we can use DataFrame.to_numpy as follows:
import pandas as pd
import numpy as np

d = {'col1': ['a', 'b'], 'col2': [[1, 2, 3], [4, 1, 2]]}
df = pd.DataFrame(data=d)

my_dict = {}
np_array = df.to_numpy()
for row in np_array:
    my_dict.update({row[0]: row[1]})
Output:
my_dict: {'a': [1, 2, 3], 'b': [4, 1, 2]}
This is different from the output you wished for, but I didn't see the pattern in it. Could you clarify?
UPDATED
To achieve the output you want, one possible way is to iterate over each row and then over the values in its list, like this:
for row in np_array:
    for item in row[1]:
        if item in my_dict:
            my_dict[item].append(row[0])
        else:
            my_dict[item] = [row[0]]
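The same inversion can be written a little more compactly with collections.defaultdict, which removes the key-existence check; a sketch assuming the same col1/col2 layout as above:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b'], 'col2': [[1, 2, 3], [4, 1, 2]]})

inverted = defaultdict(list)
for label, values in zip(df['col1'], df['col2']):
    for v in values:
        inverted[v].append(label)  # missing keys start as empty lists

print(dict(inverted))  # {1: ['a', 'b'], 2: ['a', 'b'], 3: ['a'], 4: ['b']}
```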
Related
I have a large dictionary of which I want to sort the list values based on one list. For a simple dictionary I would do it like this:
d = {'a': [2, 3, 1], 'b': [103, 101, 102]}
d['a'], d['b'] = [list(i) for i in zip(*sorted(zip(d['a'], d['b'])))]
print(d)
Output:
{'a': [1, 2, 3], 'b': [102, 103, 101]}
My actual dictionary has many keys with list values, so unpacking the zip tuple like above becomes impractical. Is there any way to go over the keys and values without specifying them all? Something like:
d.values() = [list(i) for i in zip(*sorted(zip(d.values())))]
Using d.values() results in SyntaxError: can't assign function call, but I'm looking for something like this.
If you have many keys (and they all have equal length list values) using pandas sort_values would be an efficient way of sorting:
import pandas as pd

d = {'a': [2, 3, 1], 'b': [103, 101, 102], 'c': [4, 5, 6]}
d = pd.DataFrame(d).sort_values(by='a').to_dict('list')
Output:
{'a': [1, 2, 3], 'b': [102, 103, 101], 'c': [6, 4, 5]}
If memory is an issue, you can sort in place, however since that means sort_values returns None, you can no longer chain the operations:
df = pd.DataFrame(d)
df.sort_values(by='a', inplace=True)
d = df.to_dict('list')
The output is the same as above.
As far as I understand your question, you could try simple looping. Note that the reference list must be copied first, because d['a'] itself gets sorted during the loop:
ref = d['a'][:]
for k in d:
    d[k] = [v for _, v in sorted(zip(ref, d[k]))]
where d['a'] stores the list which others should be compared to. However, using dicts in this way seems slow and messy. Since every entry in your dictionary - presumably - is a list of the same length, a simple fix would be to store the data in a numpy array and call an argsort method to sort by ith column:
a = np.array( --your data here-- )
a[a[:, i].argsort()]
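For instance, with the question's sample lists stacked as columns (this particular layout is an assumption for illustration):

```python
import numpy as np

# Rows pair up the 'a' and 'b' values; column 0 holds 'a'
a = np.array([[2, 103],
              [3, 101],
              [1, 102]])

# argsort on column 0 gives the row order that sorts by 'a'
sorted_rows = a[a[:, 0].argsort()]
print(sorted_rows.tolist())  # [[1, 102], [2, 103], [3, 101]]
```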
Finally, the most clear approach would be to use a pandas DataFrame, which is designed to store large amounts of data using a dict-like syntax. In this way, you could just sort by contents of a named column 'a':
df = pd.DataFrame( --your data here-- )
df.sort_values(by='a')
For further references, please see the links below:
Sorting arrays in NumPy by column
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
For the given input data and the required output then this will suffice:
from operator import itemgetter

d = {'a': [2, 3, 1], 'b': [103, 101, 102]}

def sort_dict(dict_, refkey):
    reflist = sorted([(v, i) for i, v in enumerate(dict_[refkey])], key=itemgetter(0))
    for v in dict_.values():
        v_ = v[:]
        for i, (_, p) in enumerate(reflist):
            v[i] = v_[p]

sort_dict(d, 'a')
print(d)
Output:
{'a': [1, 2, 3], 'b': [102, 103, 101]}
I have a huge Pandas data frame with the structure follows as an example below:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'C', 'C'], 'col2': [1, 2, 5, 2, 4, 6]})
df
  col1  col2
0    A     1
1    A     2
2    B     5
3    C     2
4    C     4
5    C     6
The task is to build a dictionary with elements in col1 as keys and corresponding elements in col2 as values. For the example above the output should be:
A -> [1, 2]
B -> [5]
C -> [2, 4, 6]
Although I wrote a solution as
from collections import defaultdict

dd = defaultdict(list)
for row in df.itertuples():
    dd[row.col1].append(row.col2)
I wonder if somebody is aware of a more "Python-native" solution, using in-build pandas functions.
Without apply, we can do it with a dict comprehension over groupby:
{x: y.tolist() for x, y in df.col2.groupby(df.col1)}
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
Use GroupBy.apply with list to get a Series of lists, and then Series.to_dict:
d = df.groupby('col1')['col2'].apply(list).to_dict()
print(d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
I have a nested list that I got from data that will be broken into sections to be uploaded to a database but have to be separate entries.
My nested list:
Master_List = ['a', [1, 2, 3], [1, 2], 'd']
But I need to make it into six separate lists such as this:
list1 = [a, 1, 1, d]
list2 = [a, 1, 2, d]
list3 = [a, 2, 1, d]
list4 = [a, 2, 2, d]
etc.
I've tried iterating through the lists by the values of the list, but I'm confused since not all the master list indices will have more than one value. How can I build these separate lists?
When I tried just iterating through the lists and creating separate lists it became a convoluted mess.
Edit: Coldspeed's solution is exactly what I needed. Now I just use the dictionary to access the list I want.
The easiest way to do this would be using itertools.product.
from itertools import product
out = {'list{}'.format(i) : list(l) for i, l in enumerate(product(*Master_List), 1)}
print(out)
{'list1': ['a', 1, 1, 'd'],
'list2': ['a', 1, 2, 'd'],
'list3': ['a', 2, 1, 'd'],
'list4': ['a', 2, 2, 'd'],
'list5': ['a', 3, 1, 'd'],
'list6': ['a', 3, 2, 'd']}
Unfortunately, you cannot create a variable number of variables without the use of a dictionary. If you want to access listX, you'll access the out dictionary, like this:
out['listX']
A one-liner would be (note the result ordering differs from the product example above, since the inner loop here runs over Master_List[1]):
lists = sum([[[Master_List[0], i, j, Master_List[3]] for i in Master_List[1]] for j in Master_List[2]], [])
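If the list1..list6 names aren't actually needed, the product result can also be kept as a plain list of lists; a sketch using the quoted Master_List:

```python
from itertools import product

Master_List = ['a', [1, 2, 3], [1, 2], 'd']

# product treats 'a' and 'd' as one-element iterables (single characters)
lists = [list(combo) for combo in product(*Master_List)]
print(lists[0])    # ['a', 1, 1, 'd']
print(len(lists))  # 6
```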
Instead of the index labels, I'd like to obtain the row positions, so I can use the result later with df.iloc[row_positions].
This is the example:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
print(df[df['a'] >= 2].index)
# Int64Index([2, 7], dtype='int64')
# How do I convert the index list [2, 7] to [1, 2] (the row position)
# I managed to do this for 1 index element, but how can I do this for the entire selection/index list?
df.index.get_loc(2)
Update
I could use a list comprehension to apply the selected result on the get_loc function, but perhaps there's some Pandas-built-in function.
You can use where from numpy:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
np.where(df.a >= 2)
returns row indices:
(array([1, 2], dtype=int64),)
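Those positions can be fed straight back into iloc; a short sketch (note that np.where returns a tuple, so take its first element):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])

rows = np.where(df.a >= 2)[0]  # positional indices as a plain array
print(df.iloc[rows])
```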
#ssm's answer is what I would normally use. However, to answer your specific query of how to select multiple rows, try this:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
indices = df[df['a'] >= 2].index
print(df.loc[indices])
More information on the .loc label-based indexing scheme is in the pandas documentation; the older .ix indexer has been removed from modern pandas.
[EDIT to answer the specific query]
How do I convert the index list [2, 7] to [1, 2] (the row position)
df[df['a']>=2].reset_index().index
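There is also a pandas built-in for the multi-label case: Index.get_indexer maps a list of labels to their positions in one call, avoiding a list comprehension over get_loc. A sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])

labels = df[df['a'] >= 2].index            # index labels [2, 7]
positions = df.index.get_indexer(labels)   # row positions [1, 2]
print(df.iloc[positions])
```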
How do I build a dict using list comprehension?
I have two lists.
series = [1,2,3,4,5]
categories = ['A', 'B', 'A', 'C','B']
I want to build a dict where the categories are the keys.
Thanks for your answers; I'm looking to produce:
{'A': [1, 3], 'B': [2, 5], 'C': [4]}
(since keys can't exist twice, the values for a repeated category should be collected into a list).
You have to have a list of tuples. The tuples are key/value pairs. You don't need a comprehension in this case, just zip:
dict(zip(categories, series))
Produces {'A': 3, 'B': 5, 'C': 4} (as pointed out by comments)
Edit: After looking at the keys, note that you can't have duplicate keys in a dictionary. So without further clarifying what you want, I'm not sure what solution you're looking for.
Edit: To get what you want, it's probably easiest to just do a for loop with either setdefault or a defaultdict.
categoriesMap = {}
for k, v in zip(categories, series):
categoriesMap.setdefault(k, []).append(v)
That should produce {'A': [1, 3], 'B': [2, 5], 'C': [4]}
from collections import defaultdict

series = [1, 2, 3, 4, 5]
categories = ['A', 'B', 'A', 'C', 'B']

result = defaultdict(list)
for key, val in zip(categories, series):
    result[key].append(val)
Rather than being clever (I have an itertools solution I'm fond of) there's nothing wrong with a good, old-fashioned for loop:
>>> from collections import defaultdict
>>>
>>> series = [1,2,3,4,5]
>>> categories = ['A', 'B', 'A', 'C','B']
>>>
>>> d = defaultdict(list)
>>> for c, s in zip(categories, series):
...     d[c].append(s)
...
>>> d
defaultdict(<type 'list'>, {'A': [1, 3], 'C': [4], 'B': [2, 5]})
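For reference, an itertools-style version (a sketch, not necessarily the solution alluded to above) sorts the pairs first and then groups consecutive equal keys:

```python
from itertools import groupby
from operator import itemgetter

series = [1, 2, 3, 4, 5]
categories = ['A', 'B', 'A', 'C', 'B']

# groupby only groups consecutive items, so sort by key first
pairs = sorted(zip(categories, series), key=itemgetter(0))
d = {k: [s for _, s in group] for k, group in groupby(pairs, key=itemgetter(0))}
print(d)  # {'A': [1, 3], 'B': [2, 5], 'C': [4]}
```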
This doesn't use a list comprehension because a list comprehension is the wrong way to do it. But since you seem to really want one for some reason: how about:
>>> dict([(c0, [s for (c, s) in zip(categories, series) if c == c0]) for c0 in categories])
{'A': [1, 3], 'C': [4], 'B': [2, 5]}
That has not one but two list comprehensions, and is very inefficient to boot.
In principle you can do as Kris suggested: dict(zip(categories, series)); just be aware that categories must not contain duplicates (which your sample code does).
EDIT :
Now that you've clarified what you intended, this will work as expected:
from collections import defaultdict

d = defaultdict(list)
for k, v in zip(categories, series):
    d[k].append(v)
d = {k: [] for k in categories}
list(map(lambda k, v: d[k].append(v), categories, series))  # list() forces evaluation; map is lazy in Python 3
result:
d is now = {'A': [1, 3], 'C': [4], 'B': [2, 5]}
or (equivalent) using setdefault (thanks Kris R.)
d = {}
list(map(lambda k, v: d.setdefault(k, []).append(v), categories, series))