There is a similar question, but its solution does not fully fit my needs, and I do not understand all the details of the solution there, so I am not able to adapt it to my situation.
This is my initial dataframe, where every unique value in the Y column should become a column.
Y P v
0 A X 0
1 A Y 1
2 B X 2
3 B Y 3
4 C X 4
5 C Y 5
The result should look like this, with P as the first column (or as the index). The values of P act as row headings, the values from Y become the column headings, and the values from v fill the cells.
P A B C
0 X 0 2 4
1 Y 1 3 5
Not working approach
This is based on https://stackoverflow.com/a/52082963/4865723
new_index = ['Y', df.groupby('Y').cumcount()]
final = df.set_index(new_index)
final = final['P'].unstack('Y')
print(final)
The problem here is that the index (or first column) does not contain the values from P, and the v column is gone entirely; the cells hold the P values instead.
Y A B C
0 X X X
1 Y Y Y
My own unfinished idea
>>> df.groupby('Y').agg(list)
P v
Y
A [X, Y] [0, 1]
B [X, Y] [2, 3]
C [X, Y] [4, 5]
I do not know whether this helps or how to go further from this point.
The full MWE
#!/usr/bin/env python3
import pandas as pd
# initial data
df = pd.DataFrame({
'Y': ['A', 'A', 'B', 'B', 'C', 'C'],
'P': list('XYXYXY'),
'v': range(6)
})
print(df)
# final result I want
final = pd.DataFrame({
'P': list('XY'),
'A': [0, 1],
'B': [2, 3],
'C': [4, 5]
})
print(final)
# approach based on:
# https://stackoverflow.com/a/52082963/4865723
new_index = ['Y', df.groupby('Y').cumcount()]
final = df.set_index(new_index)
final = final['P'].unstack('Y')
print(final)
You don't need anything complex, this is a simple pivot:
df.pivot(index='P', columns='Y', values='v').reset_index()
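To make the one-liner above concrete, here is a minimal sketch run against the MWE data from the question. The names `wide` and `final` are only illustrative, and the `rename_axis` call is an optional extra (not part of the answer) that removes the leftover `Y` label from the column axis.

```python
import pandas as pd

# Data from the question's MWE
df = pd.DataFrame({
    'Y': ['A', 'A', 'B', 'B', 'C', 'C'],
    'P': list('XYXYXY'),
    'v': range(6),
})

# P becomes the row labels, the unique values of Y become the columns,
# and v fills the cells
wide = df.pivot(index='P', columns='Y', values='v')

# Turn P back into an ordinary column and drop the 'Y' name on the column axis
final = wide.reset_index().rename_axis(columns=None)
print(final)
```

Note that `pivot` raises an error when a (P, Y) pair occurs more than once; in that case `pivot_table` with an explicit aggregation function is probably the better fit.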
First, here is an example of the result I want to obtain. I start with two DataFrames that in general have different column and row labels, and possibly different numbers of rows and columns (in the example below both happen to be 3x3):
Dataframe1 | Dataframe2
A B C | B D F
A x x x | A y y y
D x x x | B y y y
E x x x | E y y y
And I want the following result:
Result
A B C D F
A x y x y y
B - y - y y
D x x x - -
E x y x y y
Note the solution has the following characteristics:
the resulting data frame contains all the rows and columns of both dataframe1 and dataframe2
where values overlap, dataframe2 updates dataframe1 (e.g. in positions [BA] and [BE] of the result data frame, where dataframe1 had x in the same positions, there is now y)
where missing values occur (shown here as a dash -), a default value such as NaN is inserted
the index names are preserved in the result table (but alphabetical sorting is not necessary)
My questions are:
Is it possible to do that with pandas? If yes, how? I've tried many different things but none of them worked 100%, but since I'm very new to pandas I might not know the proper way to do it.
If pandas is not the right or easiest way to do it, is there another way that you would recommend (maybe using matrices, dictionaries, ...)?
Thank you.
try this:
data1 = {'A': {'A': 'x', 'D': 'x', 'E': 'x'},
'B': {'A': 'x', 'D': 'x', 'E': 'x'},
'C': {'A': 'x', 'D': 'x', 'E': 'x'}}
df1 = pd.DataFrame(data1)
print(df1)
>>>
A B C
A x x x
D x x x
E x x x
data2 = {'B': {'A': 'y', 'B': 'y', 'E': 'y'},
'D': {'A': 'y', 'B': 'y', 'E': 'y'},
'F': {'A': 'y', 'B': 'y', 'E': 'y'}}
df2 = pd.DataFrame(data2)
print(df2)
>>>
B D F
A y y y
B y y y
E y y y
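# combine_first gives priority to the calling frame, so call it on df2
# to let df2 override df1 wherever both have a value
res = df2.combine_first(df1)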
print(res)
>>>
A B C D F
A x y x y y
B NaN y NaN y y
D x x x NaN NaN
E x y x y y
try another:
cols = df1.columns.append(df2.columns).unique().sort_values()
idx = df1.index.append(df2.index).unique().sort_values()
res = df1.reindex(index=idx, columns=cols)
res.update(df2)
print(res)
>>>
A B C D F
A x y x y y
B NaN y NaN y y
D x x x NaN NaN
E x y x y y
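Because the direction of the override is easy to get wrong, here is a small self-contained check of `combine_first` on the same data. It relies on the documented behaviour that the calling frame's non-null values win and the other frame only fills the gaps; the variable names are illustrative.

```python
import pandas as pd

# Same shapes and labels as in the question
df1 = pd.DataFrame({c: ['x'] * 3 for c in 'ABC'}, index=list('ADE'))
df2 = pd.DataFrame({c: ['y'] * 3 for c in 'BDF'}, index=list('ABE'))

# df2 is the caller, so its values take priority; df1 fills the rest
res = df2.combine_first(df1)
print(res)

# Overlapping cell: df2 wins
print(res.loc['A', 'B'])
# Row D exists only in df1, so its value survives
print(res.loc['D', 'B'])
```

Swapping the two frames (`df1.combine_first(df2)`) would keep the `x` values in the overlapping cells instead.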
Say I have the following dataframe
c1 c2 c3
p x 1
n x 2
n y 1
p y 2
p y 1
n x 2
etc. I then want this in the following format:
p n x y
4 5 5 4
That is, I want to sum column 3 separately for each label in column 1 and for each label in column 2. I do not want the sums per unique combination of columns 1 and 2, which is what grouping by both columns and summing the third would give. Is there any way to do this using groupby?
As Karan said, just call groupby on each of your label columns separately, then concatenate (and transpose) the results:
import pandas as pd
df = pd.DataFrame([['p', 'x', 1],
['n', 'x', 2],
['n', 'y', 1],
['p', 'y', 2],
['p', 'y', 1],
['n', 'x', 2]])
df.columns = ['c1', 'c2', 'c3']
# select c3 explicitly so the other (string) label column is not summed as well
sums1 = df.groupby('c1')[['c3']].sum()
sums2 = df.groupby('c2')[['c3']].sum()
sums = pd.concat([sums1, sums2]).T
sums
n p x y
c3 5 4 5 4
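As an alternative sketch (not from the answer above), the two label columns can first be stacked into a single column with `melt`, after which one `groupby` is enough. The names `long` and `label` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame([['p', 'x', 1],
                   ['n', 'x', 2],
                   ['n', 'y', 1],
                   ['p', 'y', 2],
                   ['p', 'y', 1],
                   ['n', 'x', 2]],
                  columns=['c1', 'c2', 'c3'])

# Stack c1 and c2 into one 'label' column, repeating c3 for each of them
long = df.melt(id_vars='c3', value_vars=['c1', 'c2'], value_name='label')

# A single groupby now covers the labels from both columns
sums = long.groupby('label')['c3'].sum()
print(sums)
```

This assumes the labels in c1 and c2 do not collide; if the same label appeared in both columns, its sums would be merged.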
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function, and would like to know whether an equivalent of the dictionary method exists for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': [1, 1, 2, 2, 1],
'C': [10, 20, 30, 40, 50],
'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
A B C D
0 1 1 10 X
1 1 1 20 Y
2 1 2 30 X
3 2 2 40 Y
4 2 1 50 Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
A C
0 1 30
1 1 20
2 2 50
3 2 40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A',as_index=False).agg(
{'C': lambda ser: ser.nlargest(2) # something like this
})
Is it possible to use the dictionary here?
If you want a dictionary that maps each value of A to its top 2 values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x:
x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
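If the goal is specifically the dictionary form of `agg` asked about in the question, one possible sketch is to have the lambda return a plain list, so that each group still reduces to a single object, and then call `explode` to get one row per value. This is an assumption about what the asker wants, not something the answer above states; `top2` and `long` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})

# The lambda returns one list per group, which agg accepts as a single value
top2 = df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2).tolist()}
)

# One row per (A, C) pair instead of one list per group
long = top2.explode('C')
print(long)
```

Returning the Series from `nlargest` directly does not work inside `agg`, because an aggregation must produce one value per group; wrapping it in a list sidesteps that.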
So I have a dictionary with letter values and keys and I want to generate an adjacency matrix using digits (0 or 1). But I don't know how to do that.
Here is my dictionary:
g = { "a" : ["c","e","b"],
"b" : ["f","a"]}
And I want an output like this :
import numpy as np
new_dic = {'a':[0,1,1,0,1,0],'b':(1,0,0,0,0,1)}
rows_names = ['a','b'] # I use a list because dictionaries don't memorize the positions
adj_matrix = np.array([new_dic[i] for i in rows_names])
print(adj_matrix)
Output :
[[0 1 1 0 1 0]
[1 0 0 0 0 1]]
So it's an adjacency matrix: column/row 1 represents a, column/row 2 represents b, and so on.
Thank you !
I don't know if it helps but here is how I convert all letters to numbers using ascii :
for key, value in g.items():
    nums = [str(ord(x) - 96) for x in value if x.lower() >= 'a' and x.lower() <= 'z']
    g[key] = nums
print(g)
Output :
{'a': ['3', '5', '2'], 'b': ['6', '1']}
So a == 1, b == 2, and so on.
So my problem is: if I take the key a with its first value "e", how do I make sure the e ends up in column 5 of line 1, and not in column 2 of line 1, with the e replaced by a 1?
Using comprehensions:
g = {'a': ['c', 'e', 'b'], 'b': ['f', 'a']}
vals = 'a b c d e f'.split() # Column values
new_dic = {k: [1 if x in v else 0 for x in vals] for k, v in g.items()}
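Putting the comprehension together with the NumPy step from the question gives the following sketch. The column alphabet `vals` and the row order `rows_names` are fixed by hand here, as in the question, and are assumptions about which nodes exist.

```python
import numpy as np

g = {'a': ['c', 'e', 'b'], 'b': ['f', 'a']}

vals = list('abcdef')      # column labels, in the order they should appear
rows_names = ['a', 'b']    # row labels, in the order they should appear

# For each key, mark 1 in the positions of the letters it points to
new_dic = {k: [1 if x in v else 0 for x in vals] for k, v in g.items()}

adj_matrix = np.array([new_dic[k] for k in rows_names])
print(adj_matrix)
```

Because each cell is decided by membership (`x in v`) rather than by the order of the letters in the list, "e" lands in column 5 regardless of where it appears in the value list.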
I am trying to do the following:
Given a DataFrame of distances, I want to identify the k nearest neighbours of each element.
Example:
A B C D
A 0 1 3 2
B 5 0 2 2
C 3 2 0 1
D 2 3 4 0
If k=2, it should return:
A: B D
B: C D
C: D B
D: A B
Distances are not necessarily symmetric.
I suspect there is something that does this efficiently using pandas DataFrames, but I cannot find anything.
Homemade code is also very welcome! :)
Thank you!
The way I see it, you simply find the n + 1 smallest distances in each row and drop the 0 on the diagonal, which leaves the n nearest neighbours. Keep in mind that this code will not work if any off-diagonal distance is zero: only the diagonal entries may be 0.
import pandas as pd
import numpy as np
X = pd.DataFrame([[0, 1, 3, 2],[5, 0, 2, 2],[3, 2, 0, 1],[2, 3, 4, 0]])
X.columns = ['A', 'B', 'C', 'D']
X.index = ['A', 'B', 'C', 'D']
X = X.T
for i in X.index:
    Y = X.nsmallest(3, i)
    Y = Y.T
    Y = Y[Y.index.str.startswith(i)]
    Y = Y.loc[:, Y.any()]
    for j in Y.index:
        print(i + ": ", list(Y.columns))
This prints out:
A: ['B', 'D']
B: ['C', 'D']
C: ['D', 'B']
D: ['A', 'B']
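A vectorised variant is sketched below as an alternative to the loop above. Instead of dropping zeros afterwards, it masks the diagonal with infinity and takes the k smallest entries of each row with `np.argsort`, so off-diagonal zeros should not be a problem. The names `dist`, `k` and `neighbours` are illustrative, and the sketch assumes the rows and columns are in the same order.

```python
import numpy as np
import pandas as pd

labels = ['A', 'B', 'C', 'D']
dist = pd.DataFrame([[0, 1, 3, 2],
                     [5, 0, 2, 2],
                     [3, 2, 0, 1],
                     [2, 3, 4, 0]], index=labels, columns=labels)
k = 2

# Work on a float copy so the original frame is left untouched
values = dist.to_numpy(dtype=float, copy=True)

# An element is never its own neighbour: push the diagonal to infinity
np.fill_diagonal(values, np.inf)

# Column positions of the k smallest distances per row; a stable sort
# keeps ties in their original column order
order = np.argsort(values, axis=1, kind='stable')[:, :k]

neighbours = {row: list(dist.columns[idx]) for row, idx in zip(dist.index, order)}
for row, cols in neighbours.items():
    print(row + ":", cols)
```

For large matrices, `np.argpartition` could replace the full sort, at the cost of the neighbours no longer being ordered by distance.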