Concatenate two matrices based on matching list in Python? - python

I want to concatenate two matrices based on the matching string values in a specific column. For example, I am trying to combine:
1 2 a
3 4 b
5 6 c
7 8 d
and
13 14 c
15 16 d
9 10 a
11 12 b
Such as:
1 2 9 10 a
3 4 11 12 b
5 6 13 14 c
7 8 15 16 d
Note that the matrices aren't sorted in the same order, and that I want the result to follow the order of the first one.
Thanks!

You don't have a matrix there, since a matrix or array (with NumPy) typically indicates numeric data only. Also, you are looking to merge data rather than concatenate. If you are happy to use a 3rd party library, this is possible with Pandas:
import pandas as pd
df1 = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b'], [5, 6, 'c'], [7, 8, 'd']])
df2 = pd.DataFrame([[13, 14, 'c'], [15, 16, 'd'], [9, 10, 'a'], [11, 12, 'b']])
res = df1.merge(df2, on=2).values.tolist()
print(res)
[[1, 2, 'a', 9, 10],
[3, 4, 'b', 11, 12],
[5, 6, 'c', 13, 14],
[7, 8, 'd', 15, 16]]
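To get the exact column layout from the question (both numeric pairs first, key last), here is a minimal sketch of the same merge with named columns. The names x1, x2, y1, y2 and key are made up for illustration, and how='left' is used on the assumption that every key of the first frame also appears in the second.

```python
import pandas as pd

# Column names are illustrative assumptions, not part of the question.
df1 = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b'], [5, 6, 'c'], [7, 8, 'd']],
                   columns=['x1', 'x2', 'key'])
df2 = pd.DataFrame([[13, 14, 'c'], [15, 16, 'd'], [9, 10, 'a'], [11, 12, 'b']],
                   columns=['y1', 'y2', 'key'])

# A left merge keeps the row order of df1.
merged = df1.merge(df2, on='key', how='left')

# Reorder the columns so the key comes last, as in the desired result.
rows = merged[['x1', 'x2', 'y1', 'y2', 'key']].values.tolist()
print(rows)
```

If a key from df1 were missing in df2, the left merge would fill y1 and y2 with NaN for that row, so it is worth checking for that case first.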

l1 = [[1,2,'a'],[3,4,'b'],[5,6,'c'],[7,8,'d']]
l2 = [[13,14,'c'],[15,16,'d'],[9,10,'a'],[11,12,'b']]
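l3 = sorted(l1, key=lambda x: x[2])
l4 = sorted(l2, key=lambda x: x[2])
# pair rows with equal keys (assumes both lists contain the same keys),
# then keep both numeric parts followed by a single copy of the key
z = [x[:2] + y[:2] + [x[2]] for x, y in zip(l3, l4)]
print(z)
[[1, 2, 9, 10, 'a'], [3, 4, 11, 12, 'b'], [5, 6, 13, 14, 'c'], [7, 8, 15, 16, 'd']]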

Not as elegant as Pandas (jpp's answer), but another way using plain lists and dictionaries:
list_a = [[1, 2, 'a'], [3, 4, 'b'], [5, 6, 'c'], [7, 8, 'd']]
list_b = [[13, 14, 'c'], [15, 16, 'd'], [9, 10, 'a'], [11, 12, 'b']]
# ---------------------------------------
# key -> numeric values from list_a (dicts keep insertion order in Python 3.7+)
dict_result = {val[2]: val[0:2] for val in list_a}
for val in list_b:
    dict_result[val[2]].extend(val[0:2])
# -----------------------------------------
result = []
for key, val in dict_result.items():
    result.append(val + [key])
# ------------------------------------------
print(result)

Related

How to join a dictionary with same key as df index as a new column with values from the dictionary

I have the following data:
A dictionary dict with a key: value structure as tuple(str, str,): list[float]
{
('A', 'B'): [0, 1, 2, 3],
('A', 'C'): [4, 5, 6, 7],
('A', 'D'): [8, 9, 10, 11],
('B', 'A'): [12, 13, 14, 15]
}
And a pandas dataframe df with an index of 2 columns that correspond to the keys in the dictionary:
df = df.set_index(["first", "second"]).sort_index()
print(df.head(4))
==============================================
              tokens
first second
A     B          166
      C          128
      D          160
B     A          475
I want to create a new column, numbers in df with the values provided from dict, whose key corresponds with an index row in df. The example result would be:
print(df.head(4))
========================================================================
              tokens           numbers
first second
A     B          166      [0, 1, 2, 3]
      C          128      [4, 5, 6, 7]
      D          160    [8, 9, 10, 11]
B     A          475  [12, 13, 14, 15]
What is the best way to go about this? Keep performance in mind, as this dataframe may be 10k-100k rows long.
You can create a Series from the dict (called d here), and then assign:
df['numbers'] = pd.Series(d)
Or map the index:
df['numbers'] = df.index.map(d)
Output:
              tokens           numbers
first second
A     B          166      [0, 1, 2, 3]
      C          128      [4, 5, 6, 7]
      D          160    [8, 9, 10, 11]
B     A          475  [12, 13, 14, 15]
Create a Series, then concatenate it with the dataframe:
sr = pd.Series(d, name='numbers')
out = pd.concat([df, sr], axis=1)
print(out)
# Output
     tokens           numbers
A B     166      [0, 1, 2, 3]
  C     128      [4, 5, 6, 7]
  D     160    [8, 9, 10, 11]
B A     475  [12, 13, 14, 15]
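For reference, here is a self-contained sketch that rebuilds the example frame and fills the column with an explicit lookup per index tuple. This is a plain-Python variant rather than either of the approaches above, and the starting frame is reconstructed from the question's printed output, so treat it as an assumption about the real data.

```python
import pandas as pd

d = {
    ('A', 'B'): [0, 1, 2, 3],
    ('A', 'C'): [4, 5, 6, 7],
    ('A', 'D'): [8, 9, 10, 11],
    ('B', 'A'): [12, 13, 14, 15],
}

# Frame rebuilt from the printed example (an assumption about the original data).
df = pd.DataFrame({
    "first": ["A", "A", "A", "B"],
    "second": ["B", "C", "D", "A"],
    "tokens": [166, 128, 160, 475],
})
df = df.set_index(["first", "second"]).sort_index()

# Iterating over a MultiIndex yields one tuple per row.
# d.get returns None for index tuples that are missing from the dict.
df["numbers"] = pd.Series([d.get(key) for key in df.index], index=df.index)
print(df)
```

The lookup is one dict access per row, so it should scale linearly with the number of rows.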

Find top N highest values in a pandas dataframe, and return column name [duplicate]

I have a code with multiple columns and I would like to add two more, one for the highest number on the row, and another one for the second highest. However, instead of the number, I would like to show the column name where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values in each row and assign the top 2 values:
import numpy as np
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
# or reverse the two sorted values to assign them in max1, max2 order
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top 2 column names and the top 2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#columns names in array
cols = np.array(cols)
#get indices that would sort an array in descending order
arr = np.argsort(-vals, axis=1)
#top 2 columns names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
    A   B   C   D   E top1 top2  max1  max2
0   1   2   3  40   5    D    E    40     5
1  50   6   7   8   9    A    E    50     9
2  10  11  12  13  14    E    D    14    13
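Another approach: get the first max, replace it with 0, and take the max again to get the second max. Note that replace matches those values anywhere in the frame, so this assumes all values are positive and that one row's maximum does not also occur in another row.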
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1=df.max(axis=1)
maxcolum1=df.idxmax(axis=1)
max2 = df.replace(np.array(df.max(axis=1)),0).max(axis=1)
maxcolum2=df.replace(np.array(df.max(axis=1)),0).idxmax(axis=1)
df2 =pd.DataFrame({ 'max1': max1, 'max2': max2 ,'maxcol1':maxcolum1,'maxcol2':maxcolum2 })
df.join(df2)
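If only the column names are needed, a row-wise Series.nlargest is a shorter (though probably slower on large frames) alternative. This is a sketch using the sample data from the EDIT above; the output column names top1 and top2 are just illustrative, and ties are resolved by nlargest's default of keeping the first occurrence.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']

# For each row, nlargest(2) returns the two largest values, largest first;
# its index holds the names of the columns they came from.
top2 = df[cols].apply(
    lambda row: pd.Series(row.nlargest(2).index, index=['top1', 'top2']),
    axis=1,
)

out = df.join(top2)
print(out)
```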

Appending columns to other columns in Pandas

Given the dataframe:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
What is the easiest way to append the third column to the first and the fourth column to the second?
The result should look like.
d = {'col1': [1, 2, 3, 4, 7, 7, 8, 12, 1, 11], 'col2': [4, 5, 6, 9, 5, 12, 13, 14, 15, 16]}
I need to use this for a script with different column names, thus referencing columns by name is not possible. I have tried something along the lines of df.iloc[:,x] to achieve this.
You can use:
out = pd.concat([subdf.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.groupby(pd.RangeIndex(df.shape[1]) // 2, axis=1)])
print(out)
# Output
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
You can change the column names and concat:
pd.concat([df[['col1', 'col2']],
           df[['col3', 'col4']].set_axis(['col1', 'col2'], axis=1)])
Add ignore_index=True to reset the index in the process.
Output:
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
Or, using numpy:
N = 2
pd.DataFrame(
    df
    .values.reshape((-1, df.shape[1]//2, N))
    .reshape(-1, N, order='F'),
    columns=df.columns[:N]
)
This may not be the most efficient solution, but you can do it using the pd.concat() function in pandas.
First convert your initial dict d into a pandas DataFrame and then apply the concat function.
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
d_2 = {'col1':pd.concat([df.iloc[:,0],df.iloc[:,2]]),'col2':pd.concat([df.iloc[:,1],df.iloc[:,3]])}
d_2 is your required dict. Convert it to a dataframe if you need to:
df_2 = pd.DataFrame(d_2)
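Since the question asks for a version that does not reference columns by name, here is a purely positional sketch. It assumes the column count is even and that the second half should be stacked under the first half; the variable names are illustrative.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5],
     'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)

half = df.shape[1] // 2          # number of columns in each half
left = df.iloc[:, :half]         # first half, selected by position
right = df.iloc[:, half:]        # second half, selected by position

# Give the second half the labels of the first half so concat stacks them.
right = pd.DataFrame(right.to_numpy(), columns=left.columns)

out = pd.concat([left, right], ignore_index=True)
print(out)
```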

Remove n elements from start of a list in pandas column, where n is the value in another column

Say I have the following DataFrame:
a = [[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]]
b = [3,1,2]
df = pd.DataFrame(zip(a,b), columns = ['a', 'b'])
df:
a b
0 [1, 2, 3, 4, 5] 3
1 [6, 7, 8, 9, 10] 1
2 [11, 12, 13, 14, 15] 2
How can I remove the first n elements from each list in column a, where n is the value in column b.
The result I would expect for the above df is:
a b
0 [4, 5] 3
1 [7, 8, 9, 10] 1
2 [13, 14, 15] 2
I imagine the answer revolves around using .apply() and a lambda function, but I cannot get my head around this one!
Try:
df["a"] = df.apply(lambda x: x["a"][x["b"] :], axis=1)
print(df)
Prints:
a b
0 [4, 5] 3
1 [7, 8, 9, 10] 1
2 [13, 14, 15] 2
Try this:
df['a'] = df.apply(lambda row: row['a'][row['b']:], axis=1)
Output:
a b
0 [4, 5] 3
1 [7, 8, 9, 10] 1
2 [13, 14, 15] 2
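The same slicing can also be written without .apply() by zipping the two columns in a list comprehension. A minimal sketch on the question's data, assuming every value in b is a non-negative integer:

```python
import pandas as pd

a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
b = [3, 1, 2]
df = pd.DataFrame(zip(a, b), columns=['a', 'b'])

# Pair each list with its n and drop the first n elements by slicing.
df['a'] = [lst[n:] for lst, n in zip(df['a'], df['b'])]
print(df)
```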

Getting indices of sequential chunks of a list

I have these lists:
l1 = ["foo","bar","x","y","z","x","y","z","x","y","z"]
l2 = ["foo","bar","w","x","y","z","w","x","y","z","w","x","y","z"]
l3 = ["foo","bar","y","z","y","z","y","z"]
For each of the list above I'd like to get the indices of sequential chunks
from 3rd entry onwards. Yield:
l1_indices = [[2,3,4],[5,6,7],[8,9,10]]
l2_indices = [[2,3,4,5],[6,7,8,9],[10,11,12,13]]
l3_indices = [[2,3],[4,5],[6,7]]
To clarify further, I got l1_indices the following way:
["foo","bar", "x","y","z", "x","y","z", "x","y","z"]
0 1 2 3 4 5 6 7 8 9 10 <-- indices id
---> onwards
---> always in 3 chunks
What's the way to do it in Python?
I tried this but no avail:
In [8]: import itertools as IT
In [9]: import operator
In [11]: [list(zip(*g))[::-1] for k, g in IT.groupby(enumerate(l1[2:]), operator.itemgetter(1))]
Out[11]:
[[('x',), (0,)],
[('y',), (1,)],
[('z',), (2,)],
[('x',), (3,)],
[('y',), (4,)],
[('z',), (5,)],
[('x',), (6,)],
[('y',), (7,)],
[('z',), (8,)]]
If the sequential elements always come in three chunks and always start from the third item, you can simply divide the number of remaining elements by three and generate the index lists.
>>> def get_indices(l):
...     last = len(l) - 2      # number of elements after the first two
...     diff = last // 3       # chunk length (integer division)
...     return [list(range(i, i + diff)) for i in range(2, len(l), diff)]
...
>>> get_indices(l1)
[[2, 3, 4], [5, 6, 7], [8, 9, 10]]
>>> get_indices(l2)
[[2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13]]
>>> get_indices(l3)
[[2, 3], [4, 5], [6, 7]]
As a more general answer, you can first build a sublist of the elements that occur more than twice, then use its length and the length of its set to get the desired indices:
>>> l =['foo', 'bar', 'w', 'x', 'y', 'z', 'w', 'x', 'y', 'z', 'w', 'x', 'y', 'z']
>>> s=[i for i in l if l.count(i)>2]
>>> len_part=len(l)-len(s)
>>> len_set=len(set(s))
>>> [list(range(i,i+len_set)) for i in range(len_part,len(l),len_set)]
[[2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13]]
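If the number of chunks is not known in advance, one option is to look for the shortest repeating block in the tail of the list. This is a sketch under the assumption that everything from position start onwards is an exact repetition of one block; chunk_indices and its start parameter are hypothetical names, not from the answers above.

```python
def chunk_indices(items, start=2):
    """Return index lists for each repetition of the shortest repeating block."""
    tail = items[start:]
    n = len(tail)
    for period in range(1, n + 1):
        # A valid period divides the tail length and rebuilds the tail exactly.
        if n % period == 0 and tail == tail[:period] * (n // period):
            return [list(range(i, i + period))
                    for i in range(start, len(items), period)]
    return []  # only reached when the tail is empty


l1 = ["foo", "bar", "x", "y", "z", "x", "y", "z", "x", "y", "z"]
l3 = ["foo", "bar", "y", "z", "y", "z", "y", "z"]
print(chunk_indices(l1))
print(chunk_indices(l3))
```

If the tail does not repeat at all, the whole tail is treated as a single chunk, since a block as long as the tail always matches.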
