Pandas dictionary creation optimization [duplicate] - python

I have an excel sheet that looks like so:
Column1 Column2 Column3
0 23 1
1 5 2
1 2 3
1 19 5
2 56 1
2 22 2
3 2 4
3 14 5
4 59 1
5 44 1
5 1 2
5 87 3
And I'm looking to extract that data, group it by column 1, and add it to a dictionary so it appears like this:
{0: [1],
1: [2,3,5],
2: [1,2],
3: [4,5],
4: [1],
5: [1,2,3]}
This is my code so far
excel = pandas.read_excel(r"e:\test_data.xlsx", sheetname='mySheet', parse_cols='A,C')
myTable = excel.groupby("Column1").groups
print myTable
However, my output looks like this:
{0: [0L], 1: [1L, 2L, 3L], 2: [4L, 5L], 3: [6L, 7L], 4: [8L], 5: [9L, 10L, 11L]}
Thanks!

You could group by Column1, select Column3, apply list to each group, and then call to_dict:
In [81]: df.groupby('Column1')['Column3'].apply(list).to_dict()
Out[81]: {0: [1], 1: [2, 3, 5], 2: [1, 2], 3: [4, 5], 4: [1], 5: [1, 2, 3]}
Or, do
In [433]: {k: list(v) for k, v in df.groupby('Column1')['Column3']}
Out[433]: {0: [1], 1: [2, 3, 5], 2: [1, 2], 3: [4, 5], 4: [1], 5: [1, 2, 3]}

According to the docs, the GroupBy.groups:
is a dict whose keys are the computed unique groups and corresponding
values being the axis labels belonging to each group.
If you want the values themselves, you can groupby 'Column1' and then call apply and pass the list method to apply to each group.
You can then convert it to a dict as desired:
In [5]:
dict(df.groupby('Column1')['Column3'].apply(list))
Out[5]:
{0: [1], 1: [2, 3, 5], 2: [1, 2], 3: [4, 5], 4: [1], 5: [1, 2, 3]}
(Note: the L after each number in your output is how Python 2 marks long integers.)
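To make the difference between .groups and the grouped values concrete, here is a minimal sketch. It assumes the sheet has already been loaded into a DataFrame called df (built by hand here, since the Excel file is not available) and uses only Column1 and Column3 from the question.

```python
import pandas as pd

# Column1 and Column3 from the question, typed in by hand
df = pd.DataFrame({
    "Column1": [0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5],
    "Column3": [1, 2, 3, 5, 1, 2, 4, 5, 1, 1, 2, 3],
})

# .groups maps each key to the row labels of its group, not to the data
row_labels = {k: list(v) for k, v in df.groupby("Column1").groups.items()}

# Selecting Column3 before collecting into lists yields the values themselves
values = df.groupby("Column1")["Column3"].apply(list).to_dict()

print(row_labels)
print(values)
```

The first dictionary reproduces the unexpected output from the question (minus the Python 2 L suffix). The second is the desired result.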

Related

Python - Group(Cluster/Sort) arrays based on ranking information

I have a dataframe looks like this:
A B C D
0 5 4 3 2
1 4 5 3 2
2 3 5 2 1
3 4 2 5 1
4 4 5 2 1
5 4 3 5 1
...
I converted the dataframe into 2D arrays like this:
[[5 4 3 2]
[4 5 3 2]
[3 5 2 1]
[4 2 5 1]
[4 5 2 1]
[4 3 5 1]
...]
Each row holds the scores (1-5) that one person gave to items A, B, C and D. I would like to identify the people who share the same ranking, for example everyone who thinks A > B > C > D, and regroup the arrays by ranking like this:
2DArray1: [[5 4 3 2]]
2DArray2: [[4 5 3 2]
[3 5 2 1]
[4 5 2 1]]
2DArray3: [[4 2 5 1]
[4 3 5 1]]
For example, 2DArray2 holds the people who think B > A > C > D, and 2DArray3 the people who think C > A > B > D. I tried different sort functions in numpy but could not find a suitable one. How should I do this?
NumPy doesn't have a groupby function, because a groupby would return a list of lists of different sizes, whereas NumPy mostly deals with rectangular arrays.
A workaround would be to sort the rows so that similar rows are adjacent, then produce an array of the indices of the beginning of each group.
Since I'm too lazy to do that, here is a solution without numpy instead:
Index by the permutation directly
For each row, we compute the corresponding permutation of 'ABCD'. Then, we add the row to a dict of lists of rows, where the dictionary keys are the corresponding permutations.
from collections import defaultdict
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
(0, 1, 2, 3): [[5, 4, 3, 2]],
(1, 0, 2, 3): [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
(2, 0, 1, 3): [[4, 2, 5, 1], [4, 3, 5, 1]]
})
Note that with this solution, the results might not be what you expect if some users give the same score to two different items: sorted does not keep ties as ties, it breaks them by order of appearance (here, this means ties between two items are broken alphabetically).
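The tie behaviour described above can be seen in a small sketch. order_key is the key used in the answer; rank_key is a hypothetical alternative (not part of the original answer) that gives tied items the same dense rank, so that rows with ties land in their own group.

```python
def order_key(row):
    # Key from the answer: column indices sorted by descending score.
    # sorted() is stable, so tied columns keep their left-to-right order.
    return tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))

def rank_key(row):
    # Hypothetical alternative: dense rank of each score (0 = highest),
    # so tied items share a rank instead of being ordered arbitrarily.
    distinct = sorted(set(row), reverse=True)
    return tuple(distinct.index(score) for score in row)

tied = [4, 4, 5, 1]     # A and B are tied
untied = [4, 3, 5, 1]   # C > A > B > D

same_by_order = order_key(tied) == order_key(untied)  # the tie is hidden
same_by_rank = rank_key(tied) == rank_key(untied)     # the tie is visible

print(order_key(tied), order_key(untied), same_by_order)
print(rank_key(tied), rank_key(untied), same_by_rank)
```

Whether tied rows should join an existing group or form their own depends on what the ranking is supposed to mean, so treat rank_key as one option rather than the right answer.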
Index by the index of the permutation
The permutations of 'ABCD' can be ordered lexicographically: 'ABCD' comes first, then 'ABDC' comes second, then 'ACBD' comes third...
As it turns out, there is an algorithm to compute the index at which a given permutation would come in that sequence! And that algorithm is implemented in python module more_itertools:
more_itertools.permutation_index
So, we can replace our tuple key tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)) by a simple number key permutation_index(row, sorted(row, reverse=True)).
from collections import defaultdict
from more_itertools import permutation_index
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[permutation_index(row, sorted(row, reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
0: [[5, 4, 3, 2]],
6: [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
8: [[4, 2, 5, 1], [4, 3, 5, 1]]
})
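For readers who want to see where the numbers 0, 6 and 8 come from, here is a minimal pure-Python sketch of the ranking idea. perm_rank is a hypothetical helper written for illustration and is not claimed to be how more_itertools implements permutation_index. It counts, position by position, how many unused items of the reference ordering come before the chosen one.

```python
def perm_rank(element, pool):
    """Position of `element` among the permutations of `pool`,
    listed in the order induced by `pool` itself."""
    pool = list(pool)
    rank = 0
    for x in element:
        r = pool.index(x)             # unused items that come before x
        rank = rank * len(pool) + r   # mixed-radix (factorial base) accumulation
        del pool[r]                   # x is now used up
    return rank

rows = [[5, 4, 3, 2], [4, 5, 3, 2], [4, 2, 5, 1]]
ranks = [perm_rank(row, sorted(row, reverse=True)) for row in rows]
print(ranks)
```

For the second row, 4 is the second item of [5, 4, 3, 2], which skips one block of 3! = 6 permutations. Every later choice is the first remaining item, so the rank is 6.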
Mixing permutation_index and pandas
Since the output of permutation_index is a simple number, we can easily include it in a numpy array or a pandas dataframe as a new column:
import pandas as pd
from more_itertools import permutation_index
df = pd.DataFrame({'A': [5,4,3,4,4,4], 'B': [4,5,5,2,5,3], 'C': [3,3,2,5,2,5], 'D': [2,2,1,1,1,1]})
df['perm_idx'] = df.apply(lambda row: permutation_index(row, sorted(row, reverse=True)), axis=1)
print(df)
   A  B  C  D  perm_idx
0  5  4  3  2         0
1  4  5  3  2         6
2  3  5  2  1         6
3  4  2  5  1         8
4  4  5  2  1         6
5  4  3  5  1         8
for idx, sub_df in df.groupby('perm_idx'):
    print(idx)
    print(sub_df)
0
   A  B  C  D  perm_idx
0  5  4  3  2         0
6
   A  B  C  D  perm_idx
1  4  5  3  2         6
2  3  5  2  1         6
4  4  5  2  1         6
8
   A  B  C  D  perm_idx
3  4  2  5  1         8
5  4  3  5  1         8
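The NumPy workaround mentioned at the start of this answer (sort the rows so that similar rows are adjacent, then find where each group begins) could be sketched as follows. This is one possible reading of that description, not code from the answer. It uses argsort to get each row's ranking, lexsort to bring equal rankings together, and split to cut at the group boundaries.

```python
import numpy as np

a = np.array([[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1],
              [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]])

# Column order from highest to lowest score for each row
# (stable sort, so tied columns keep their left-to-right order)
perms = np.argsort(-a, axis=1, kind="stable")

# Sort rows so identical rankings are adjacent; lexsort treats the LAST key
# as the primary one, hence the reversed column order
order = np.lexsort(perms.T[::-1])
sorted_perms = perms[order]

# A new group starts wherever the ranking differs from the previous row
starts = np.flatnonzero(np.any(sorted_perms[1:] != sorted_perms[:-1], axis=1)) + 1
groups = np.split(a[order], starts)

for g in groups:
    print(g)
```

The result is a plain Python list of arrays with different row counts, which is exactly the non-rectangular shape that NumPy cannot hold in a single array.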
You can
(i) transpose df and convert it to a dictionary,
(ii) sort this dictionary by value and get the keys,
(iii) join the sorted keys for each "person" and assign this dict to df['ranks'],
(iv) aggregate ranking points and assign it to df['pref'],
(v) groupby(['ranks']) and create lists from pref
df = pd.DataFrame({'A': {0: 5, 1: 4, 2: 3, 3: 4, 4: 4, 5: 4},
'B': {0: 4, 1: 5, 2: 5, 3: 2, 4: 5, 5: 3},
'C': {0: 3, 1: 3, 2: 2, 3: 5, 4: 2, 5: 5},
'D': {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}})
df['ranks'] = pd.Series({k : ''.join(list(zip(*sorted(v.items(), key=lambda d:d[1],
reverse=True)))[0])
for k,v in df.T.to_dict().items()})
df['pref'] = df.loc[:,'A':'D'].values.tolist()
out = df[['ranks','pref']].groupby('ranks').agg(list).to_dict()['pref']
Output:
{'ABCD': [[5, 4, 3, 2]],
'BACD': [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
'CABD': [[4, 2, 5, 1], [4, 3, 5, 1]]}
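Steps (ii) and (iii) are the dense part of that one-liner, so here they are for a single person in isolation. The scores dictionary below is row 1 of the question's data, and the variable names are chosen for illustration only.

```python
# Scores that one person (row 1 of the data) gave to each item
scores = {"A": 4, "B": 5, "C": 3, "D": 2}

# (ii) sort the (item, score) pairs by score, highest first
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# (iii) join the item names in that order to get the ranking label
label = "".join(item for item, _ in ranked)

print(ranked)
print(label)
```

The answer applies the same two steps to every row via df.T.to_dict(), then groups by the resulting label.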

How to convert part of dataframe to vector

I have a data frame having sentiment scores for user ratings. I need to convert them to vectors to use in another script. How can I do that? Any help is appreciated.
Data:
{'Group': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G2'},
'User': {0: 'User1',
1: 'User2',
2: 'User2',
3: 'User3',
4: 'User3',
5: 'User4'},
'Sentiment': {0: 'positive',
1: 'positive',
2: 'negative',
3: 'positive',
4: 'negative',
5: 'negative'},
'#No of Reviews': {0: 5, 1: 5, 2: 4, 3: 4, 4: 4, 5: 6}}
In the above data, I need to convert G1->User1->positive->5 to G1->User1->[5, 0]. 5 for positive, 0 for negative.
Try this code:
def foo(df):
    result = []
    for i in ['positive', 'negative']:
        sentiment = df.loc[df['Sentiment']==i, '#No of Reviews'].values
        if sentiment.size > 0:
            result.append(sentiment[0])
        else:
            result.append(0)
    return result
df.groupby(['Group', 'User']).apply(foo)
Group User
G1 User1 [5, 0]
User2 [5, 4]
G2 User3 [4, 4]
User4 [0, 6]
I am assuming that you are using pandas (your question has no tag indicating which tool you are using, and the concept of a dataframe isn't unique to it).
Then you could use pivot_table (df: your dataframe):
df = df.pivot_table(
index=["Group", "User"], columns="Sentiment", fill_value=0
).droplevel(0, axis="columns")
df["Vec"] = [[p, n] for p, n in zip(df.positive, df.negative)]
df["Norm"] = [
[1 if p else 0, 1 if n else 0] for p, n in zip(df.positive, df.negative)
]
Result:
Sentiment negative positive Vec Norm
Group User
G1 User1 0 5 [5, 0] [1, 0]
User2 4 5 [5, 4] [1, 1]
G2 User3 4 4 [4, 4] [1, 1]
User4 6 0 [0, 6] [0, 1]
You could replace the list comprehensions with:
df["Vec"] = df[["positive", "negative"]].apply(list, axis=1)
df["Norm"] = df[["positive", "negative"]].astype(bool).astype(int).apply(list, axis=1)
If needed, you can drop the negative/positive-columns and/or reset the index afterwards:
df = df.drop(columns=["negative", "positive"]).reset_index(drop=False)
df.columns.name = None
Group User Vec Norm
0 G1 User1 [5, 0] [1, 0]
1 G1 User2 [5, 4] [1, 1]
2 G2 User3 [4, 4] [1, 1]
3 G2 User4 [0, 6] [0, 1]
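If the vectors are needed as plain Python lists for another script, a dictionary keyed by (Group, User) may be handier than a DataFrame column. The following is a minimal sketch of that variant, assuming each (Group, User, Sentiment) combination occurs at most once, as in the sample data. The names wide and vectors are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["G1", "G1", "G1", "G2", "G2", "G2"],
    "User": ["User1", "User2", "User2", "User3", "User3", "User4"],
    "Sentiment": ["positive", "positive", "negative",
                  "positive", "negative", "negative"],
    "#No of Reviews": [5, 5, 4, 4, 4, 6],
})

# One row per (Group, User), one column per sentiment, 0 where no review exists
wide = (
    df.set_index(["Group", "User", "Sentiment"])["#No of Reviews"]
      .unstack(fill_value=0)
      .reindex(columns=["positive", "negative"], fill_value=0)
)

# Plain dict: (Group, User) -> [positive, negative]
vectors = {key: row.tolist() for key, row in wide.iterrows()}
print(vectors)
```

The reindex call fixes the column order to positive then negative, and it also keeps the vector at length two if one of the sentiments never appears in the data.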

How to perform group by in python?

I need to read data from an Excel file and then perform a group by on it.
The data is structured as follows:
n c
1 2
1 3
1 4
2 3
2 4
2 5
3 1
3 2
3 3
I need to read these data and then generate a list of dictionaries based on the c value.
desired output would be a list of dictionaries with c as keys and values of n as values like this:
[{1:[3]}, {2:[1,3]}, {3:[1,2,3]}, {4:[1,2]}, {5:[2]}]
I use this function to read data and it works fine:
data = pandas.read_excel("pathtofile/filename.xlsx", header=None)
You can try this way:
d1 = df.groupby('c')['n'].agg(list).to_dict()
res = [{k:v} for k,v in d1.items()]
print(res)
Output:
[{1: [3]}, {2: [1, 3]}, {3: [1, 2, 3]}, {4: [1, 2]}, {5: [2]}]
Sample output dict
d=df.groupby('c').n.agg(list).to_dict()
{1: [3], 2: [1, 3], 3: [1, 2, 3], 4: [1, 2], 5: [2]}
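One detail may be worth checking before applying either answer: the question reads the file with header=None, which tells pandas that the file has no header row. The columns are then labelled 0 and 1, and groupby('c') cannot find a column named c. The sketch below illustrates this with read_csv on an in-memory string, because the Excel file is not available. It assumes the header argument behaves the same way in read_excel.

```python
import io
import pandas as pd

raw = "n,c\n1,2\n1,3\n1,4\n2,3\n2,4\n2,5\n3,1\n3,2\n3,3\n"

# header=None: the "n,c" line is treated as data, columns are labelled 0 and 1
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(list(no_header.columns))

# Default header handling: the first line supplies the column names
df = pd.read_csv(io.StringIO(raw))
print(list(df.columns))

# With named columns, the answer above works as written
res = [{k: v} for k, v in df.groupby("c")["n"].agg(list).to_dict().items()]
print(res)
```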

Convert a Pandas DataFrame with repetitive keys to a dictionary

I have a DataFrame with two columns. I want to convert this DataFrame to a python dictionary. I want the elements of first column be keys and the elements of other columns in same row be values. However, entries in the first column are repeated -
Keys Values
1 1
1 6
1 9
2 3
3 1
3 4
The dict I want is - {1: [1,6,9], 2: [3], 3: [1,4]}
I am using the code mydict=df.set_index('Keys').T.to_dict('list'), but the output keeps only one value per key: {1: [9], 2: [3], 3: [4]}
IIUC you can groupby on the 'Keys' column and then apply list and call to_dict:
In[32]:
df.groupby('Keys')['Values'].apply(list).to_dict()
Out[32]: {1: [1, 6, 9], 2: [3], 3: [1, 4]}
Breaking down the above into steps:
In[35]:
# groupby on the 'Keys' and apply list to group values into a list
df.groupby('Keys')['Values'].apply(list)
Out[35]:
Keys
1 [1, 6, 9]
2 [3]
3 [1, 4]
Name: Values, dtype: object
convert to a dict
In[37]:
# make a dict
df.groupby('Keys')['Values'].apply(list).to_dict()
Out[37]: {1: [1, 6, 9], 2: [3], 3: [1, 4]}
Thanks to @P.Tillman for the suggestion that to_frame was unnecessary, kudos to him
try this,
df.groupby('Keys')['Values'].unique().to_dict()
Output:
{1: array([1, 6, 9]), 2: array([3]), 3: array([1, 4])}
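Note that unique() and apply(list) only agree when no value repeats within a key. The sketch below uses a hypothetical variant of the question's data, with a repeated 1 under key 1, to show the difference. It also converts the NumPy arrays that unique() returns into plain lists.

```python
import pandas as pd

# Hypothetical variant of the question's data: value 1 appears twice for key 1
df = pd.DataFrame({"Keys": [1, 1, 1, 2, 3, 3],
                   "Values": [1, 6, 1, 3, 1, 4]})

# apply(list) keeps every value, including repeats
as_lists = df.groupby("Keys")["Values"].apply(list).to_dict()

# unique() drops repeats within each key and returns NumPy arrays;
# tolist() turns them back into plain Python lists
as_unique = {k: v.tolist()
             for k, v in df.groupby("Keys")["Values"].unique().items()}

print(as_lists)
print(as_unique)
```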

merge python dictionary keys based on common elements

Let's say we have a dictionary like this:
{0: [2, 8], 1: [8, 4], 3: [5]}
Then we encounter the key-value pair 2, 8. Since the values 2 and 8 have both already appeared under key 0, I need to merge the first two keys and create a new dictionary like the following:
{0: [2, 8, 4], 3: [5]}
I understand that it's possible to do a lot of looping and deleting. I'm really looking for a more pythonish way.
Thanks in advance.
your coworkers will hate you later but here
>>> import itertools
>>> d = {0: [2, 8], 1: [8, 4], 3: [5]}
>>> x = ((a,b) for a,b in itertools.combinations(d,2) if a in d and b in d and set(d[a]).intersection(d[b]))
>>> for a,b in x:d[min(a,b)].extend([i for i in d[max(a,b)] if i not in d[min(a,b)]]) or d.pop(max(a,b))
[8, 4]
>>> d
{0: [2, 8, 4], 3: [5]}
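d = {0: [2, 8],
     1: [8, 4],
     3: [5]}
# reverse map: value -> a key whose list contains that value
revmap = {}
for k, vals in d.items():
    for v in vals:
        revmap[v] = k
k, v = 2, 8
# merge the list holding v into the list holding k, skipping duplicates
d[revmap[k]].extend([i for i in d[revmap[v]] if i not in d[revmap[k]]])
d.pop(revmap[v])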
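Both answers above handle the single pair from the question. If the goal is to merge every group of keys whose lists overlap, a small loop-based function may be easier to maintain. merge_overlapping below is a hypothetical helper written for illustration. It assumes the keys are sortable and that the smallest key of each merged group should survive, as in the question's expected output.

```python
def merge_overlapping(d):
    """Merge entries of d whose value lists share at least one element.
    The smallest key of each merged group is kept."""
    merged = {}
    for key in sorted(d):
        new_vals = d[key]
        # earlier groups that share a value with the current list
        hits = [k for k, vals in merged.items() if set(vals) & set(new_vals)]
        target = min(hits, default=key)
        combined = merged.setdefault(target, [])
        # fold every other overlapping group, then the current list, into target
        sources = [merged.pop(k) for k in hits if k != target] + [new_vals]
        for vals in sources:
            for v in vals:
                if v not in combined:
                    combined.append(v)
    return merged

result = merge_overlapping({0: [2, 8], 1: [8, 4], 3: [5]})
print(result)
```

Because every overlapping earlier group is folded in, chains are handled as well: a later key that touches two so-far separate groups joins them into one.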
