I have a data frame having sentiment scores for user ratings. I need to convert them to vectors to use in another script. How can I do that? Any help is appreciated.
Data:
{'Group': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G2'},
'User': {0: 'User1',
1: 'User2',
2: 'User2',
3: 'User3',
4: 'User3',
5: 'User4'},
'Sentiment': {0: 'positive',
1: 'positive',
2: 'negative',
3: 'positive',
4: 'negative',
5: 'negative'},
'#No of Reviews': {0: 5, 1: 5, 2: 4, 3: 4, 4: 4, 5: 6}}
In the above data, I need to convert G1->User1->positive->5 to G1->User1->[5, 0]. 5 for positive, 0 for negative.
Try this code:
def foo(df):
    result = []
    for i in ['positive', 'negative']:
        sentiment = df.loc[df['Sentiment']==i, '#No of Reviews'].values
        if sentiment.size > 0:
            result.append(sentiment[0])
        else:
            result.append(0)
    return result
df.groupby(['Group', 'User']).apply(foo)
Group User
G1 User1 [5, 0]
User2 [5, 4]
G2 User3 [4, 4]
User4 [0, 6]
I'm assuming you are using pandas (your question has no tag indicating which tool you are using, and the concept of a dataframe isn't unique to pandas).
Then you could use pivot_table (df: your dataframe):
df = df.pivot_table(
    index=["Group", "User"], columns="Sentiment", fill_value=0
).droplevel(0, axis="columns")
df["Vec"] = [[p, n] for p, n in zip(df.positive, df.negative)]
df["Norm"] = [
    [1 if p else 0, 1 if n else 0] for p, n in zip(df.positive, df.negative)
]
Result:
Sentiment negative positive Vec Norm
Group User
G1 User1 0 5 [5, 0] [1, 0]
User2 4 5 [5, 4] [1, 1]
G2 User3 4 4 [4, 4] [1, 1]
User4 6 0 [0, 6] [0, 1]
You could replace the list comprehensions with:
df["Vec"] = df[["positive", "negative"]].apply(list, axis=1)
df["Norm"] = df[["positive", "negative"]].astype(bool).astype(int).apply(list, axis=1)
If needed, you can drop the negative/positive-columns and/or reset the index afterwards:
df = df.drop(columns=["negative", "positive"]).reset_index(drop=False)
df.columns.name = None
Group User Vec Norm
0 G1 User1 [5, 0] [1, 0]
1 G1 User2 [5, 4] [1, 1]
2 G2 User3 [4, 4] [1, 1]
3 G2 User4 [0, 6] [0, 1]
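If the vectors are meant to be handed to another script, a plain dict keyed by (Group, User) may be easier to pass around than a dataframe. Here is a minimal sketch of that idea using set_index and unstack instead of pivot_table; it assumes each (Group, User, Sentiment) combination appears at most once, and the names wide and vectors are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['G1', 'G1', 'G1', 'G2', 'G2', 'G2'],
    'User': ['User1', 'User2', 'User2', 'User3', 'User3', 'User4'],
    'Sentiment': ['positive', 'positive', 'negative',
                  'positive', 'negative', 'negative'],
    '#No of Reviews': [5, 5, 4, 4, 4, 6],
})

# One row per (Group, User), one column per sentiment; missing pairs become 0.
wide = (
    df.set_index(['Group', 'User', 'Sentiment'])['#No of Reviews']
      .unstack(fill_value=0)
      # Force the column order [positive, negative] used in the question.
      .reindex(columns=['positive', 'negative'], fill_value=0)
)

# Plain Python dict: (Group, User) -> [positive, negative]
vectors = dict(zip(wide.index, wide.to_numpy().tolist()))
print(vectors)
```

The reindex step also guards against a frame in which one of the two sentiments never occurs.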
Related
I have a dataframe that looks like this:
A B C D
0 5 4 3 2
1 4 5 3 2
2 3 5 2 1
3 4 2 5 1
4 4 5 2 1
5 4 3 5 1
...
I converted the dataframe into 2D arrays like this:
[[5 4 3 2]
[4 5 3 2]
[3 5 2 1]
[4 2 5 1]
[4 5 2 1]
[4 3 5 1]
...]
Each row holds the scores (1-5) that one person gave to items A, B, C and D. I would like to identify the people who share the same ranking, for example those who think A > B > C > D, and regroup the arrays based on that ranking like this:
2DArray1: [[5 4 3 2]]
2DArray2: [[4 5 3 2]
[3 5 2 1]
[4 5 2 1]]
2DArray3: [[4 2 5 1]
[4 3 5 1]]
For example, 2DArray2 contains the people who think B > A > C > D, and 2DArray3 the people who think C > A > B > D. I tried different sort functions in numpy but could not find a suitable one. How should I do this?
Numpy doesn't have a groupby function, because a groupby would return a list of lists of different sizes, whereas numpy mostly deals only with "rectangular" arrays.
A workaround would be to sort the rows so that similar rows are adjacent, then produce an array of the indices of the beginning of each group.
Since I'm too lazy to do that, here is a solution without numpy instead:
Index by the permutation directly
For each row, we compute the corresponding permutation of 'ABCD'. Then, we add the row to a dict of lists of rows, where the dictionary keys are the corresponding permutations.
from collections import defaultdict
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
(0, 1, 2, 3): [[5, 4, 3, 2]],
(1, 0, 2, 3): [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
(2, 0, 1, 3): [[4, 2, 5, 1], [4, 3, 5, 1]]
})
Note that with this solution, the results might not be what you expect if some users give the same score to two different items, because sorted doesn't preserve ties; instead it breaks them by order of appearance (in this case, this means ties between two items are broken alphabetically).
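If ties should be kept rather than broken, one possible alternative is to key each row by the dense rank of every item instead of by the sorted item order. This is only a sketch: dense_rank_key is a hypothetical helper, the sample rows are made up to include ties, and the key lists "rank per item" rather than "item per rank", so it is not interchangeable with the tuple key above.

```python
from collections import defaultdict

def dense_rank_key(row):
    # Hypothetical helper: rank 0 is the highest distinct score,
    # and equal scores share the same rank.
    distinct = sorted(set(row), reverse=True)
    return tuple(distinct.index(score) for score in row)

# Made-up rows; the second and third contain a tie between C and D.
a = [[5, 4, 3, 2], [4, 5, 2, 2], [3, 5, 1, 1], [4, 5, 2, 1]]

groups = defaultdict(list)
for row in a:
    groups[dense_rank_key(row)].append(row)

for key, rows in groups.items():
    print(key, rows)
```

With this key, [4, 5, 2, 2] and [3, 5, 1, 1] land in the same group (B > A > C = D), while [4, 5, 2, 1] stays separate.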
Index by the index of the permutation
The permutations of 'ABCD' can be ordered lexicographically: 'ABCD' comes first, then 'ABDC' comes second, then 'ACBD' comes third...
As it turns out, there is an algorithm to compute the index at which a given permutation would come in that sequence! And that algorithm is implemented in the Python module more_itertools:
more_itertools.permutation_index
So, we can replace our tuple key tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)) by a simple number key permutation_index(row, sorted(row, reverse=True)).
from collections import defaultdict
from more_itertools import permutation_index
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[permutation_index(row, sorted(row, reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
0: [[5, 4, 3, 2]],
6: [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
8: [[4, 2, 5, 1], [4, 3, 5, 1]]
})
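For readers who would rather not install more_itertools, the index can be computed by hand. The following is a rough sketch of one way to do it, not the library's actual source: perm_index is a hypothetical name, and it assumes every element of the row also occurs in the pool.

```python
from itertools import permutations

def perm_index(element, iterable):
    # At each position, count how many remaining pool items come before
    # the chosen one, and accumulate in a mixed (factorial) base.
    pool = list(iterable)
    index = 0
    for remaining, x in zip(range(len(pool), 0, -1), element):
        r = pool.index(x)          # position of x among the items still available
        index = index * remaining + r
        del pool[r]                # x is used up
    return index

row = [4, 2, 5, 1]
pool = sorted(row, reverse=True)   # [5, 4, 2, 1]
print(perm_index(row, pool))

# Brute-force cross-check: position of the row among all permutations of the pool.
print(list(permutations(pool)).index(tuple(row)))
```

The brute-force check enumerates all n! permutations, so it is only practical for short rows; the loop itself does a linear pool.index lookup per element.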
Mixing permutation_index and pandas
Since the output of permutation_index is a simple number, we can easily include it in a numpy array or a pandas dataframe as a new column:
import pandas as pd
from more_itertools import permutation_index
df = pd.DataFrame({'A': [5,4,3,4,4,4], 'B': [4,5,5,2,5,3], 'C': [3,3,2,5,2,5], 'D': [2,2,1,1,1,1]})
df['perm_idx'] = df.apply(lambda row: permutation_index(row, sorted(row, reverse=True)), axis=1)
print(df)
A B C D perm_idx
0 5 4 3 2 0
1 4 5 3 2 6
2 3 5 2 1 6
3 4 2 5 1 8
4 4 5 2 1 6
5 4 3 5 1 8
for idx, sub_df in df.groupby('perm_idx'):
    print(idx)
    print(sub_df)
0
A B C D perm_idx
0 5 4 3 2 0
6
A B C D perm_idx
1 4 5 3 2 6
2 3 5 2 1 6
4 4 5 2 1 6
8
A B C D perm_idx
3 4 2 5 1 8
5 4 3 5 1 8
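For completeness, the numpy workaround mentioned at the start of this answer (sort the rows so that similar rows are adjacent, then find the index at which each group begins) could look roughly like the sketch below. It assumes ties are broken by column order, as in the other solutions; the variable names are only illustrative.

```python
import numpy as np

a = np.array([[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1],
              [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]])

# Ranking of each row: column indices ordered from highest to lowest score.
ranks = np.argsort(-a, axis=1, kind="stable")

# Sort rows so that identical rankings are adjacent.
# np.lexsort treats the LAST key as the primary one, hence the reversal.
order = np.lexsort(ranks.T[::-1])
sorted_ranks = ranks[order]

# A new group starts wherever the ranking differs from the previous row.
change = np.any(sorted_ranks[1:] != sorted_ranks[:-1], axis=1)
starts = np.flatnonzero(change) + 1

# List of 2D arrays, one per distinct ranking.
groups = np.split(a[order], starts)
for g in groups:
    print(g)
```

The result is a Python list of arrays of different heights, which is exactly the "non-rectangular" shape that numpy cannot hold in a single array.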
You can
(i) transpose df and convert it to a dictionary,
(ii) sort this dictionary by value and get the keys,
(iii) join the sorted keys for each "person" and assign this dict to df['ranks'],
(iv) aggregate ranking points and assign it to df['pref'],
(v) groupby(['ranks']) and create lists from pref
df = pd.DataFrame({'A': {0: 5, 1: 4, 2: 3, 3: 4, 4: 4, 5: 4},
'B': {0: 4, 1: 5, 2: 5, 3: 2, 4: 5, 5: 3},
'C': {0: 3, 1: 3, 2: 2, 3: 5, 4: 2, 5: 5},
'D': {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}})
df['ranks'] = pd.Series({k : ''.join(list(zip(*sorted(v.items(), key=lambda d:d[1],
                                                     reverse=True)))[0])
                         for k,v in df.T.to_dict().items()})
df['pref'] = df.loc[:,'A':'D'].values.tolist()
out = df[['ranks','pref']].groupby('ranks').agg(list).to_dict()['pref']
Output:
{'ABCD': [[5, 4, 3, 2]],
'BACD': [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
'CABD': [[4, 2, 5, 1], [4, 3, 5, 1]]}
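The ranking strings can also be built without going through a transposed dictionary. The sketch below swaps in numpy's argsort for the sorted call; it assumes single-character column names and that ties should be broken by column order, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 4, 3, 4, 4, 4],
                   'B': [4, 5, 5, 2, 5, 3],
                   'C': [3, 3, 2, 5, 2, 5],
                   'D': [2, 2, 1, 1, 1, 1]})

scores = df.to_numpy()
# Column positions ordered from highest to lowest score, per row.
order = np.argsort(-scores, axis=1, kind='stable')
# Map positions back to column names and join them, e.g. 'BACD'.
labels = df.columns.to_numpy()[order]
ranks = [''.join(names) for names in labels]

# Collect the score rows under their ranking string.
out = {}
for key, row in zip(ranks, scores.tolist()):
    out.setdefault(key, []).append(row)
print(out)
```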
I have an excel sheet that looks like so:
Column1 Column2 Column3
0 23 1
1 5 2
1 2 3
1 19 5
2 56 1
2 22 2
3 2 4
3 14 5
4 59 1
5 44 1
5 1 2
5 87 3
And I'm looking to extract that data, group it by column 1, and add it to a dictionary so it appears like this:
{0: [1],
1: [2,3,5],
2: [1,2],
3: [4,5],
4: [1],
5: [1,2,3]}
This is my code so far
excel = pandas.read_excel(r"e:\test_data.xlsx", sheetname='mySheet', parse_cols='A,C')
myTable = excel.groupby("Column1").groups
print myTable
However, my output looks like this:
{0: [0L], 1: [1L, 2L, 3L], 2: [4L, 5L], 3: [6L, 7L], 4: [8L], 5: [9L, 10L, 11L]}
Thanks!
You could group by Column1, select Column3, apply list to each group, and then call to_dict:
In [81]: df.groupby('Column1')['Column3'].apply(list).to_dict()
Out[81]: {0: [1], 1: [2, 3, 5], 2: [1, 2], 3: [4, 5], 4: [1], 5: [1, 2, 3]}
Or, do
In [433]: {k: list(v) for k, v in df.groupby('Column1')['Column3']}
Out[433]: {0: [1], 1: [2, 3, 5], 2: [1, 2], 3: [4, 5], 4: [1], 5: [1, 2, 3]}
According to the docs, the GroupBy.groups:
is a dict whose keys are the computed unique groups and corresponding
values being the axis labels belonging to each group.
If you want the values themselves, you can group by 'Column1', select 'Column3', and call apply with list so that each group's values are collected into a list.
You can then convert the result to a dict as desired:
In [5]:
dict(df.groupby('Column1')['Column3'].apply(list))
Out[5]:
{0: [1], 1: [2, 3, 5], 2: [1, 2], 3: [4, 5], 4: [1], 5: [1, 2, 3]}
(Note: the numbers are followed by L because Python 2 displays long integers with an L suffix; have a look at this SO question for details.)
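To make the difference concrete, here is a small self-contained sketch that rebuilds the sheet in memory (instead of reading the Excel file) and compares what .groups returns with what apply(list) returns. The column values are copied from the question; the default 0..11 row index is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5],
    'Column3': [1, 2, 3, 5, 1, 2, 4, 5, 1, 1, 2, 3],
})

grouped = df.groupby('Column1')

# .groups maps each key to the ROW LABELS of its members, not to the data.
labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(labels)

# Selecting a column and applying list collects the actual VALUES.
values = grouped['Column3'].apply(list).to_dict()
print(values)
```

The first dict reproduces the unexpected output from the question (minus the L suffixes), and the second is the desired result.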
I need to read data from an Excel file and then perform a group-by on it.
The structure of the data is as follows:
n c
1 2
1 3
1 4
2 3
2 4
2 5
3 1
3 2
3 3
I need to read this data and then generate a list of dictionaries based on the c value.
The desired output is a list of dictionaries with each c value as the key and the matching n values as the value, like this:
[{1:[3]}, {2:[1,3]}, {3:[1,2,3]}, {4:[1,2]}, {5:[2]}]
I use this function to read data and it works fine:
data = pandas.read_excel("pathtofile/filename.xlsx", header=None)
You can try this way:
d1 = df.groupby('c')['n'].agg(list).to_dict()
res = [{k:v} for k,v in d1.items()]
print(res)
Output:
[{1: [3]}, {2: [1, 3]}, {3: [1, 2, 3]}, {4: [1, 2]}, {5: [2]}]
Sample output dict
d=df.groupby('c').n.agg(list).to_dict()
{1: [3], 2: [1, 3], 3: [1, 2, 3], 4: [1, 2], 5: [2]}
import pandas as pd
d = {'A': [1,2,3,4], 'B': [[[1,2],[2,3]],[[3,4],[2,5]],[[5,6],[5,6],[5,6]],[7,8]]}
df = pd.DataFrame(data=d)
C = [1,2,3,4,5,6,7,8]
I have a pandas dataframe and would like to append each element of a C list into each one of the nested lists of B, maintaining the structure, so that the resulting dataframe is:
'A': [1,2,3,4]
'B': [[[1,2,1],[2,3,2]],[[3,4,3],[2,5,4]],[[5,6,5],[5,6,6],[5,6,7]],[7,8,8]]
Maybe there is a more elegant solution, but this works :-)
for i in d['B']:
    for j in i:
        if (isinstance(j, list)):
            j.append(C.pop(0))
        else:
            i.append(C.pop(0))
            break
A more efficient solution based on timgeb's comment (thank you!):
f = iter(C)
for i in d['B']:
    for j in i:
        if (isinstance(j, list)):
            j.append(next(f))
        else:
            i.append(next(f))
            break
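Both loops above modify the lists inside d['B'] in place. If the original data should stay untouched, the same iterator idea can be wrapped in a function that builds new lists instead. This is only a sketch: append_each is a hypothetical name, and it assumes every row is either a list of lists or a flat list, as in the question.

```python
def append_each(rows, extras):
    # Hypothetical helper: returns a new structure, leaving `rows` unmodified.
    it = iter(extras)
    out = []
    for row in rows:
        if row and all(isinstance(item, list) for item in row):
            # List of lists: one extra value per inner list.
            out.append([item + [next(it)] for item in row])
        else:
            # Flat list such as [7, 8]: a single extra value for the whole row.
            out.append(row + [next(it)])
    return out

B = [[[1, 2], [2, 3]], [[3, 4], [2, 5]], [[5, 6], [5, 6], [5, 6]], [7, 8]]
C = [1, 2, 3, 4, 5, 6, 7, 8]

result = append_each(B, C)
print(result)
```

The result could then be assigned back with df['B'] = result. The iterator also avoids the repeated C.pop(0) calls, each of which shifts the remaining elements of the list.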
This is an alternative method using itertools.
The idea is to flatten the list of lists, append your data, then split again via information you have stored on the number of lists in each row.
from itertools import chain, accumulate
import pandas as pd
d = {'A': [1,2,3,4], 'B': [[[1,2],[2,3]],[[3,4],[2,5]],[[5,6],[5,6],[5,6]],[[7,8]]]}
df = pd.DataFrame(data=d)
C = [1,2,3,4,5,6,7,8]
acc = [0] + list(accumulate(map(len, df['B'])))
lst = [j+[C[i]] for i, j in enumerate(chain.from_iterable(df['B']))]
df['B'] = [lst[x:y] for x, y in zip(acc, acc[1:])]
Note I have made an important change to the input: the last element of series B is a list of lists, just like all the other elements. For consistency, I would recommend this in any case.
Result
A B
0 1 [[1, 2, 1], [2, 3, 2]]
1 2 [[3, 4, 3], [2, 5, 4]]
2 3 [[5, 6, 5], [5, 6, 6], [5, 6, 7]]
3 4 [[7, 8, 8]]
d = {'A': [1,2,3,4], 'B': [[[1,2],[2,3]],[[3,4],[2,5]],[[5,6],[5,6],[5,6]],[7,8]]}
df = pd.DataFrame(data=d)
C = [1,2,3,4,5,6,7,8]
df['B_len'] = df.B.apply(len)
df['B_len_cumsum']=df.B_len.cumsum()
df['C'] = df.apply(lambda row: C[row['B_len_cumsum']-row['B_len']:row['B_len_cumsum']], axis=1)
df['B'] = df.B.apply(lambda x: [x] if type(x[0])==int else x)
for x,y in zip(df.B,df.C):
    for xx,yy in zip(x,y):
        xx.append(yy)
df
Output:
A B B_len B_len_cumsum C
0 1 [[1, 2, 1], [2, 3, 2]] 2 2 [1, 2]
1 2 [[3, 4, 3], [2, 5, 4]] 2 4 [3, 4]
2 3 [[5, 6, 5], [5, 6, 6], [5, 6, 7]] 3 7 [5, 6, 7]
3 4 [[7, 8, 8]] 2 9 [8]
import pandas as pd
df1 = pd.DataFrame({'ID':['i1', 'i2', 'i3'],
'A': [2, 3, 1],
'B': [1, 1, 2],
'C': [2, 1, 0],
'D': [3, 1, 2]})
df1 = df1.set_index('ID')
df1.head()
A B C D
ID
i1 2 1 2 3
i2 3 1 1 1
i3 1 2 0 2
df2 = pd.DataFrame({'ID':['i1-i2', 'i1-i3', 'i2-i3'],
'A': [2, 1, 1],
'B': [1, 1, 1],
'C': [1, 0, 0],
'D': [1, 2, 1]})
df2 = df2.set_index('ID')
df2
A B C D
ID
i1-i2 2 1 1 1
i1-i3 1 1 0 2
i2-i3 1 1 0 1
Given a data frame df1, I want to compare every pair of distinct rows, take the smaller value in each column, and output the result to a new data frame like df2.
For example, comparing rows i1 and i2 gives the new row i1-i2 with values 2, 1, 1, 1.
Please advise on the best way to do this in pandas.
Try this:
from itertools import combinations
import numpy as np
v = df1.values
r = pd.DataFrame([np.minimum(v[t[0]], v[t[1]])
                  for t in combinations(np.arange(len(df1)), 2)],
                 columns=df1.columns,
                 index=list(combinations(df1.index, 2)))
Result:
In [72]: r
Out[72]:
A B C D
(i1, i2) 2 1 1 1
(i1, i3) 1 1 0 2
(i2, i3) 1 1 0 1
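If the row labels should read i1-i2 as in the question rather than (i1, i2), the pair labels can be joined before building the frame. Here is a self-contained sketch of that variation; it assumes the IDs are strings and that ID has been set as the index, and the names pairs and rows are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['i1', 'i2', 'i3'],
                    'A': [2, 3, 1],
                    'B': [1, 1, 2],
                    'C': [2, 1, 0],
                    'D': [3, 1, 2]}).set_index('ID')

# Every unordered pair of row labels, e.g. ('i1', 'i2').
pairs = list(combinations(df1.index, 2))

# Element-wise minimum of the two rows in each pair.
rows = [np.minimum(df1.loc[a].to_numpy(), df1.loc[b].to_numpy())
        for a, b in pairs]

df2 = pd.DataFrame(rows, columns=df1.columns,
                   index=['-'.join(p) for p in pairs])
df2.index.name = 'ID'
print(df2)
```

The number of pairs grows quadratically with the number of rows, so this is best suited to small frames.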