How to Replace Pandas Series with Dictionary Values - python

I want to link my dictionary values to a pandas Series object. I have already tried the replace method and the map method, with no luck.
As per this link:
Replace values in pandas Series with dictionary
Still not working. My sample pandas Series looks like:
index  column
0      ESL Literacy
1      Civics Government Team Sports
2      Health Wellness Team Sports
3      Literacy Mathematics
4      Mathematics
Dictionary:
{'civics': 6,
'esl': 5,
'government': 7,
'health': 8,
'literacy': 1,
'mathematics': 4,
'sports': 3,
'team': 2,
'wellness': 9}
Desired Output:
0 [5,1]
1 [6,7,2,3]
2 [8,9,2,3]
3 [1,4]
4 [4]
Any help would be appreciated. Thank you :)
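For reference, the sample data can be reconstructed like this (df, column, and d are the names the answers below assume):
import pandas as pd

df = pd.DataFrame({'column': ['ESL Literacy',
                              'Civics Government Team Sports',
                              'Health Wellness Team Sports',
                              'Literacy Mathematics',
                              'Mathematics']})
d = {'civics': 6, 'esl': 5, 'government': 7, 'health': 8, 'literacy': 1,
     'mathematics': 4, 'sports': 3, 'team': 2, 'wellness': 9}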

A fun solution:
s = df.column.str.get_dummies(' ')
s.dot(s.columns.str.lower().map(d).astype(str) + ',').str[:-1].str.split(',')
Out[413]:
0 [5, 1]
1 [6, 7, 3, 2]
2 [8, 3, 2, 9]
3 [1, 4]
4 [4]
dtype: object
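Two small caveats, not noted in the answer: the element order follows the alphabetical dummy columns rather than the word order in the sentence, and the numbers come back as strings ('5', not 5) because they pass through astype(str). If integers are needed, one more pass converts them (out is a hypothetical name for the expression above):
out = s.dot(s.columns.str.lower().map(d).astype(str) + ',').str[:-1].str.split(',')
out = out.apply(lambda lst: [int(x) for x in lst])  # '5' -> 5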
Or, in pandas 0.25.0 and later, we can use explode:
df.column.str.split().explode().str.lower().map(d).groupby(level=0).agg(list)
Out[420]:
0 [5, 1]
1 [6, 7, 2, 3]
2 [8, 9, 2, 3]
3 [1, 4]
4 [4]
Name: column, dtype: object

Using str.lower, str.split, and a list comprehension (note the raw string for the regex):
u = df['column'].str.lower().str.split(r'\s+')
pd.Series([[d.get(word) for word in row] for row in u])
0 [5, 1]
1 [6, 7, 2, 3]
2 [8, 9, 2, 3]
3 [1, 4]
4 [4]
dtype: object
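One caveat, not raised in the answer: d.get returns None for words missing from d. Passing a default makes the fallback explicit (0 here is an arbitrary placeholder):
# Same comprehension, but unknown words map to 0 instead of None.
pd.Series([[d.get(word, 0) for word in row] for row in u])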

Related

extract elements of tuple from a pandas series

I have a pandas Series whose values are tuples of lists. Each tuple has length exactly 2, and there are a bunch of NaNs. I am trying to split each list in the tuple into its own column.
import pandas as pd
import numpy as np
df = pd.DataFrame({'val': [([1, 2, 3], [4, 5, 6]),
                           ([7, 8, 9], [10, 11, 12]),
                           np.nan]})
Expected Output: two columns, x and y, holding the first and second list of each tuple (NaN rows stay NaN).
If you know the length of the tuples is exactly 2, you can do:
df["x"] = df.val.str[0]
df["y"] = df.val.str[1]
print(df[["x", "y"]])
Prints:
x y
0 [1, 2, 3] [4, 5, 6]
1 [7, 8, 9] [10, 11, 12]
2 NaN NaN
You could also convert the column to a list and pass it to the DataFrame constructor (filling None with np.nan as well):
out = pd.DataFrame(df['val'].tolist(), columns=['x','y']).fillna(np.nan)
Output:
x y
0 [1, 2, 3] [4, 5, 6]
1 [7, 8, 9] [10, 11, 12]
2 NaN NaN
One way using pandas.Series.apply:
new_df = df["val"].apply(pd.Series)
print(new_df)
Output:
0 1
0 [1, 2, 3] [4, 5, 6]
1 [7, 8, 9] [10, 11, 12]
2 NaN NaN
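The apply route labels the new columns 0 and 1; renaming them to match the other answers takes one more line (my addition):
new_df.columns = ['x', 'y']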

Pandas cumsum separated by comma

I have a dataframe with a column of comma-separated strings:
my_column  my_column_two
1,2,3      A
5,6,8      A
9,6,8      B
5,5,8      B
if I do:
data = df.astype(str).groupby('my_column_two').agg(','.join).cumsum()
data.iloc[[0]]['my_column'].apply(print)
data.iloc[[1]]['my_column'].apply(print)
I have:
1,2,3,5,6,8
1,2,3,5,6,89,6,8,5,5,8
how can I have 1,2,3,5,6,8,9,6,8,5,5,8, so that the cumulative join adds a comma when appending the previous row? (Notice 89 should be 8,9.)
Were you after this?
df['new']=df.groupby('my_column_two')['my_column'].apply(lambda x: x.str.split(',').cumsum())
my_column my_column_two new
0 1,2,3 A [1, 2, 3]
1 5,6,8 A [1, 2, 3, 5, 6, 8]
2 9,6,8 B [9, 6, 8]
3 5,5,8 B [9, 6, 8, 5, 5, 8]
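If the result should be comma-separated strings like the original column rather than lists, str.join folds each list back (my addition):
df['new'] = df['new'].str.join(',')
# 0          1,2,3
# 1    1,2,3,5,6,8
# 2          9,6,8
# 3    9,6,8,5,5,8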

Python - Group(Cluster/Sort) arrays based on ranking information

I have a dataframe that looks like this:
A B C D
0 5 4 3 2
1 4 5 3 2
2 3 5 2 1
3 4 2 5 1
4 4 5 2 1
5 4 3 5 1
...
I converted the dataframe into a 2D array like this:
[[5 4 3 2]
[4 5 3 2]
[3 5 2 1]
[4 2 5 1]
[4 5 2 1]
[4 3 5 1]
...]
The scores 1-5 in each row are the ratings a person gives to items A, B, C, and D. I would like to identify the people who share the same ranking, for example the people who think A > B > C > D, and regroup the arrays based on that ranking information, like this:
2DArray1: [[5 4 3 2]]
2DArray2: [[4 5 3 2]
[3 5 2 1]
[4 5 2 1]]
2DArray3: [[4 2 5 1]
[4 3 5 1]]
For example, 2DArray2 holds the people who think B > A > C > D, and 2DArray3 those who think C > A > B > D. I tried different sort functions in numpy but could not find a suitable one. How should I do this?
Numpy doesn't have a groupby function, because a groupby would return a list of lists of different sizes, whereas numpy mostly only deals with "rectangular" arrays.
A workaround would be to sort the rows so that similar rows are adjacent, then produce an array of the indices of the beginning of each group.
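For reference, that workaround might look like the following sketch (not from the answer, which goes a different way below; argsort, lexsort, and split are used for illustration):
import numpy as np

a = np.array([[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1],
              [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]])

# Rank pattern of each row: column indices from highest to lowest score.
ranks = np.argsort(-a, axis=1, kind='stable')

# Reorder rows so that identical rank patterns become adjacent
# (lexsort treats its last key as primary, hence the [::-1]).
order = np.lexsort(ranks.T[::-1])
sorted_ranks = ranks[order]

# Start index of each new run of identical rank patterns.
change = np.any(sorted_ranks[1:] != sorted_ranks[:-1], axis=1)
starts = np.flatnonzero(change) + 1

# Split the reordered score rows into the groups.
for group in np.split(a[order], starts):
    print(group)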
Since I'm too lazy to do that, here is a solution without numpy instead:
Index by the permutation directly
For each row, we compute the corresponding permutation of 'ABCD'. Then, we add the row to a dict of lists of rows, where the dictionary keys are the corresponding permutations.
from collections import defaultdict
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
(0, 1, 2, 3): [[5, 4, 3, 2]],
(1, 0, 2, 3): [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
(2, 0, 1, 3): [[4, 2, 5, 1], [4, 3, 5, 1]]
})
Note that with this solution, the results might not be what you expect if some users give the same score to two different items: sorted doesn't keep ties as ties; it breaks them by order of appearance (here, alphabetically by item).
Index by the index of the permutation
The permutations of 'ABCD' can be ordered lexicographically: 'ABCD' comes first, then 'ABDC' comes second, then 'ACBD' comes third...
As it turns out, there is an algorithm to compute the index at which a given permutation would come in that sequence, and it is implemented in the Python module more_itertools:
more_itertools.permutation_index
So, we can replace our tuple key tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)) by a simple number key permutation_index(row, sorted(row, reverse=True)).
from collections import defaultdict
from more_itertools import permutation_index
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[permutation_index(row, sorted(row, reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
0: [[5, 4, 3, 2]],
6: [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
8: [[4, 2, 5, 1], [4, 3, 5, 1]]
})
Mixing permutation_index and pandas
Since the output of permutation_index is a simple number, we can easily include it in a numpy array or a pandas dataframe as a new column:
import pandas as pd
from more_itertools import permutation_index
df = pd.DataFrame({'A': [5,4,3,4,4,4], 'B': [4,5,5,2,5,3], 'C': [3,3,2,5,2,5], 'D': [2,2,1,1,1,1]})
df['perm_idx'] = df.apply(lambda row: permutation_index(row, sorted(row, reverse=True)), axis=1)
print(df)
A B C D perm_idx
0 5 4 3 2 0
1 4 5 3 2 6
2 3 5 2 1 6
3 4 2 5 1 8
4 4 5 2 1 6
5 4 3 5 1 8
for idx, sub_df in df.groupby('perm_idx'):
    print(idx)
    print(sub_df)
0
A B C D perm_idx
0 5 4 3 2 0
6
A B C D perm_idx
1 4 5 3 2 6
2 3 5 2 1 6
4 4 5 2 1 6
8
A B C D perm_idx
3 4 2 5 1 8
5 4 3 5 1 8
You can
(i) transpose df and convert it to a dictionary,
(ii) sort this dictionary by value and get the keys,
(iii) join the sorted keys for each "person" and assign this dict to df['ranks'],
(iv) aggregate ranking points and assign it to df['pref'],
(v) groupby(['ranks']) and create lists from pref
df = pd.DataFrame({'A': {0: 5, 1: 4, 2: 3, 3: 4, 4: 4, 5: 4},
                   'B': {0: 4, 1: 5, 2: 5, 3: 2, 4: 5, 5: 3},
                   'C': {0: 3, 1: 3, 2: 2, 3: 5, 4: 2, 5: 5},
                   'D': {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}})
df['ranks'] = pd.Series({k: ''.join(list(zip(*sorted(v.items(), key=lambda d: d[1],
                                                     reverse=True)))[0])
                         for k, v in df.T.to_dict().items()})
df['pref'] = df.loc[:, 'A':'D'].values.tolist()
out = df[['ranks', 'pref']].groupby('ranks').agg(list).to_dict()['pref']
Output:
{'ABCD': [[5, 4, 3, 2]],
'BACD': [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
'CABD': [[4, 2, 5, 1], [4, 3, 5, 1]]}

Is there any method to append test data with predicted data?

I have a random array of test data like array = [[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]] and an array of predicted data like array_pred = [10, 3, 4], both of equal length. Now I want to append the predictions like this into one res_array = [[5, 6, 7, 1, 10], [5, 6, 7, 4, 3], [5, 6, 7, 3, 4]]. I don't know what to call this, but I want this type of result in Python. Actually I have to store it in a dataframe and then generate an excel file from this data. Is it possible?
Use numpy.hstack to join the arrays, convert to a Series and then write to Excel:
a = np.hstack((array, np.array(array_pred)[:, None]))
#thank you #Ch3steR
a = np.column_stack([array, array_pred])
print(a)
[[ 5  6  7  1 10]
 [ 5  6  7  4  3]
 [ 5  6  7  3  4]]
s = pd.Series(a.tolist())
print (s)
0 [5, 6, 7, 1, 10]
1 [5, 6, 7, 4, 3]
2 [5, 6, 7, 3, 4]
dtype: object
s.to_excel(file, index=False)
Or, if you need flattened values, convert to a DataFrame and a Series and use concat:
df = pd.concat([pd.DataFrame(array), pd.Series(array_pred)], axis=1, ignore_index=True)
print(df)
0 1 2 3 4
0 5 6 7 1 10
1 5 6 7 4 3
2 5 6 7 3 4
And then:
df.to_excel(file, index=False)
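Putting it together end to end, a minimal sketch (assuming openpyxl is installed; 'result.xlsx' is a hypothetical filename):
import pandas as pd

array = [[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]]
array_pred = [10, 3, 4]

# One row per sample: the test features followed by the prediction.
df = pd.concat([pd.DataFrame(array), pd.Series(array_pred)],
               axis=1, ignore_index=True)
df.to_excel('result.xlsx', index=False)  # placeholder output name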

Use of index in pandas DataFrame for groupby and aggregation

I want to aggregate a single-column DataFrame and count the number of elements. However, I always end up with an empty DataFrame:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[46]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]
If I add a second column, I get the desired result:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5], "B":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[45]:
B
A
1 1
2 1
3 1
4 1
5 3
Can you explain the reason for this?
Give this a shot:
import pandas as pd
print(pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A")["A"].count())
prints
A
1 1
2 1
3 1
4 1
5 3
You have to select the grouped-by column in your result. After groupby("A"), A becomes the group index, so there are no data columns left for count() to count, which is why the first result was empty:
import pandas as pd
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").A.count()
Output:
A
1 1
2 1
3 1
4 1
5 3
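Alternatively, size() counts the rows in each group without needing any leftover columns (a small sketch, not from the answers):
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 5, 5]})
print(df.groupby("A").size())  # same counts, returned as a Series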
