I have the following example dataframe.
c1 c2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
Given a template c1 = [3, 2, 5, 4, 1], I want to change the order of the rows based on the new order of column c1, so it will look like:
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
I found the following thread, but the shuffle there is random. Correct me if I'm wrong.
Shuffle DataFrame rows
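For reference, a minimal construction of the example frame and template (a sketch, assuming the data shown above):

import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5], 'c2': list('abcde')})
c1 = [3, 2, 5, 4, 1]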
If the values are unique in both the list and the c1 column, use reindex:
df = df.set_index('c1').reindex(c1).reset_index()
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
General solution working with duplicates in list and also in column:
c1 = [3, 2, 5, 4, 1, 3, 2, 3]
#create df from list
list_df = pd.DataFrame({'c1':c1})
print (list_df)
c1
0 3
1 2
2 5
3 4
4 1
5 3
6 2
7 3
#helper column to count duplicated values
df['g'] = df.groupby('c1').cumcount()
list_df['g'] = list_df.groupby('c1').cumcount()
#merge on c1 and the helper column g, then drop g
df = list_df.merge(df).drop('g', axis=1)
#the repeated list entries (the second 3, the second 2, the third 3)
#have no matching duplicate in df, so the inner merge drops them
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
merge
You can create a dataframe with the column specified in the wanted order, then merge. An inner merge preserves the order of the left frame's keys, so the template order is kept.
One advantage of this approach is that it gracefully handles duplicates in either df.c1 or the list c1. If duplicates are not wanted, care must be taken to handle them prior to reordering.
d1 = pd.DataFrame({'c1': c1})
d1.merge(df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
searchsorted
This is less robust, but it will work if df.c1 is:
already sorted
a one-to-one mapping (no duplicate values)
df.iloc[df.c1.searchsorted(c1)]
c1 c2
2 3 c
1 2 b
4 5 e
3 4 d
0 1 a
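If df.c1 is unique but not sorted, a similar positional approach (not part of the original answer) uses Index.get_indexer; a minimal sketch:

#works for unsorted (but unique) df.c1
idx = pd.Index(df['c1']).get_indexer(c1)  #position of each template value
df.iloc[idx]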
Env: Python 3.9.6, Pandas 1.3.5
I have a DataFrame and a Series like below
df = pd.DataFrame({"C1" : ["A", "B", "C", "D"]})
sr = pd.Series(data = [1, 2, 3, 4, 5],
index = ["A", "A", "B", "C", "D"])
"""
[DataFrame]
C1
0 A
1 B
2 C
3 D
[Series]
A 1
A 2
B 3
C 4
D 5
"""
What I tried:
df["C2"] = df["C1"].map(sr)
But InvalidIndexError occurred because the series has duplicate keys ("A").
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Is there any method to make a DataFrame like below?
C1 C2
0 A 1
1 A 2
2 B 3
3 C 4
4 D 5
or
C1 C2
0 A 1
1 B 3
2 C 4
3 D 5
4 A 2
Row indices do not matter.
The question was heavily edited and now has a very different meaning.
You want a simple merge:
df.merge(sr.rename('C2'),
left_on='C1', right_index=True)
Output:
C1 C2
0 A 1
0 A 2
1 B 3
2 C 4
3 D 5
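The merge keeps the left index, hence the repeated 0. If a clean 0..n index is wanted, chain reset_index:

df.merge(sr.rename('C2'),
         left_on='C1', right_index=True).reset_index(drop=True)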
old answer
First, I can't reproduce your issue (tested with 3M rows on pandas 1.3.5).
Then, why use slicing and not map? map has the advantage of systematically outputting the correct number of rows (NaN if the key is absent):
Example:
import numpy as np
import pandas as pd

sr = pd.Series({10:"A", 13:"B", 16:"C", 18:"D"})
df = pd.DataFrame({"C1":np.random.randint(10, 20, size=3000000)})
df['C2'] = df['C1'].map(sr)
print(df.head())
output:
C1 C2
0 10 A
1 18 D
2 10 A
3 13 B
4 15 NaN
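Back to the duplicated-index case from the question: map can still be used if keeping only one value per duplicated key is acceptable (an assumption; this drops the A -> 2 pair):

#assumption: the first value per duplicated key is enough
df['C2'] = df['C1'].map(sr[~sr.index.duplicated(keep='first')])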
I have two dataframes that I want to compare, but I only want to keep the rows that are not in both dataframes.
Example:
DF1:
A B C
0 1 2 3
1 4 5 6
DF2:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
So, from this example I want to work with row index 2 and 3 ([7, 8, 9] and [10, 11, 12]).
The code I currently have (it only removes duplicates) is below.
df = pd.concat([di_old, di_new])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
print(df.reindex(idx))
I would do:
df_n = df2[~df2.isin(df1).all(axis=1)]
output
A B C
2 7 8 9
3 10 11 12
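An order-independent alternative (not part of the original answer) is a merge with indicator=True, which does not rely on the row indices lining up:

merged = df2.merge(df1, how='left', indicator=True)
df_n = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')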
It has been a long time since I worked with the pandas library. I searched but could not come up with an efficient way; there may be an existing function in the library for this.
Let's say I have the dataframe below:
df1 = pd.DataFrame({'V1':['A','A','B'],
'V2':['B','C','C'],
'Value':[4, 1, 5]})
df1
I would like to extend this dataset to include every ordering of each category pair, keeping the corresponding value exactly the same.
df2 = pd.DataFrame({'V1':['A','B','A', 'C', 'B', 'C'],
'V2':['B','A','C','A','C','B'],
'Value':[4, 4 , 1, 1, 5, 5]})
df2
In other words, in df1, A and B have a Value of 4, and I also want a row in the second dataframe where B and A have a Value of 4. It is very similar to melting. I also do not want to use a for loop; I am looking for a more efficient way.
Use:
df = pd.concat([df1, df1.rename(columns={'V2':'V1', 'V1':'V2'})]).sort_index().reset_index(drop=True)
Output:
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
Or np.vstack (the np.r_[1:-1:-1, -1] indexer resolves to column positions [1, 0, -1], i.e. V2, V1, Value, which performs the swap):
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns)
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5
>>>
For correct order:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index()
V1 V2 Value
0 A B 4
0 B A 4
1 A C 1
1 C A 1
2 B C 5
2 C B 5
>>>
And index reset:
>>> pd.DataFrame(np.vstack((df1.to_numpy(), df1.iloc[:, np.r_[1:-1:-1, -1]].to_numpy())), columns=df1.columns, index=[*df1.index, *df1.index]).sort_index().reset_index(drop=True)
V1 V2 Value
0 A B 4
1 B A 4
2 A C 1
3 C A 1
4 B C 5
5 C B 5
>>>
You can use the methods assign and append (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
df1.append(df1.assign(V1=df1.V2, V2=df1.V1), ignore_index=True)
Output:
V1 V2 Value
0 A B 4
1 A C 1
2 B C 5
3 B A 4
4 C A 1
5 C B 5
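On pandas 2.0+, where append is gone, the same result can be written with concat:

pd.concat([df1, df1.assign(V1=df1.V2, V2=df1.V1)], ignore_index=True)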
I have this DataFrame
A B C A1 B1 C1
1/1/2021 1 2 7 9 5 7
1/2/2021 4 3 5 3 4 5
1/3/2021 4 6 4 6 7 2
I want to add a new column D that, for each row, returns the mean of two of the columns A, B, C, chosen by the top two values of A1, B1, C1. So if A1 and B1 are larger than C1, then column D equals the mean of columns A and B.
Expected Output:
A B C A1 B1 C1 D
1/1/2021 1 2 7 9 5 7 4 (mean of A and C, since A1 and C1 are the top two)
1/2/2021 4 3 5 3 4 5 4 (mean of B and C, since B1 and C1 are the top two)
1/3/2021 4 6 4 6 7 2 5 (mean of A and B, since A1 and B1 are the top two)
I think I can achieve the result using a function like the one below (I only included the first part as an example), writing out all the combinations. But I want something that can be used with a large, changing number of columns, and ideally I could adjust the TopN number, for example averaging the top 3 or 4 instead of the top 2. The columns would always be structured consistently and in the correct order: for example, 5 columns of data to be used for the average values and 5 columns of data in the same order to be used to determine the max values.
def maxcol(row):
    if row['A1'] >= row['B1'] and row['A1'] >= row['C1'] and row['B1'] >= row['C1']:
        val = (row['A'] + row['B']) / 2
    # elif ... etc etc. for the remaining combinations
    return val
Is there a simple way to accomplish this without the brute force approach above?
UPDATE: I updated the answer with more general code that works for any number of columns and any number of top columns.
import heapq
from statistics import mean

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 4],'B': [2, 3, 6],'C': [7, 5, 4],'A1': [9, 3, 6],'B1': [5, 4, 7],'C1': [7, 5, 2]})
n = 3  #number of value columns (A, B, C)
t = 2  #how many of the top key columns to average over
def helper(row):
    lst = list(row)
    #positions (within A1..C1) of the t largest key values
    order = [lst[n:].index(x) for x in lst[n:] if x in heapq.nlargest(t, lst[n:])]
    #the value columns share those positions, so average them
    return mean(lst[o] for o in order)
df['D'] = df.apply(helper, axis = 1)
print(df)
Here is an approach that defines a helper function to apply to the dataframe:
def helper(row):
    lst = list(row)
    #positions (within A1..C1) of the two values that are not the minimum
    order = [lst[3:].index(x) for x in lst[3:] if x != min(lst[3:])]
    #the value columns A..C share those positions
    return (lst[order[0]] + lst[order[1]]) / 2
df['D'] = df.apply(helper, axis = 1)
print(df)
#output
A B C A1 B1 C1 D
0 1 2 7 9 5 7 4.0
1 4 3 5 3 4 5 4.0
2 4 6 4 6 7 2 5.0
#notice that I did not include the date indexes in this sample dataframe.
Here it is with a DatetimeIndex. The same code works fine:
BEFORE:
A B C A1 B1 C1
2021-01-01 1 2 7 9 5 7
2021-01-02 4 3 5 3 4 5
2021-01-03 4 6 4 6 7 2
AFTER:
A B C A1 B1 C1 D
2021-01-01 1 2 7 9 5 7 4.0
2021-01-02 4 3 5 3 4 5 4.0
2021-01-03 4 6 4 6 7 2 5.0
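For a fully vectorized variant (not from the original answers) that generalizes to any top-N, numpy's argsort and take_along_axis can be used. A sketch, assuming the value and key columns are in matching order:

import numpy as np

vals = df[['A', 'B', 'C']].to_numpy()
keys = df[['A1', 'B1', 'C1']].to_numpy()
top = np.argsort(-keys, axis=1)[:, :2]  #positions of the top-2 keys per row
df['D'] = np.take_along_axis(vals, top, axis=1).mean(axis=1)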
If you have a collection of values and you want to drop the lowest value (or the lowest n values, or the highest n values...), I would just sort a list of them and drop as many as you like from whichever end you like. So if you have a list of arbitrary length and you want to drop the lowest value and then get the average, you can easily do it like so:
>>> somelist = [2, 1, 0, 3]
>>> sorted(somelist)[1:]
[1, 2, 3]
>>> sum(_) / len(_)
2.0
I am not skilled enough to code this fully, but I think it can be done this way:
A B C A1 B1 C1 D
so you have column positions 0,1,2,3,4,5 (and 6 for D)
#I'll create a one-row df for the test
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 7, 9, 5, 7]).T
df
k = df.loc[:, 3:5].idxmin(axis=1) #the position of the minimum in A1 B1 C1
#calculation of the "wrong" column in A B C
wrong = k - 3 #shift back by the 3-position offset
for i in df.index:
    df.iloc[i, wrong[i]] = np.nan #mask this element with NaN
#getting the mean of the remaining values
df['D'] = df.loc[:, 0:2].mean(axis=1)
I believe it can be coded in a much simpler way, but that is the algorithm.
Based on this post, Find the column name which has the maximum value for each row, it is clear how to get the column name with the max value of each row using df.idxmax(axis=1).
The question is, how can I get the 2nd, 3rd and so on maximum value per row?
You need numpy.argsort for the positions and then reorder the column names by indexing:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
arr = np.argsort(-df.values, axis=1)
df1 = pd.DataFrame(df.columns[arr], index=df.index)
print (df1)
0 1 2 3 4
0 A B D E C
1 D B C E A
2 E A B C D
3 C D A E B
4 C A E D B
Verify:
#first column
print (df.idxmax(axis=1))
0 A
1 D
2 E
3 C
4 C
dtype: object
#last column
print (df.idxmin(axis=1))
0 C
1 A
2 D
3 B
4 B
dtype: object
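If the n-th largest values themselves are wanted rather than the labels, a plain sort gives them per row (a small sketch based on the df above):

#second-largest value in each row
np.sort(df.values, axis=1)[:, -2]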
While there is no method to find specific ranks within a row, you can rank elements in a pandas dataframe using the rank method.
For example, for a dataframe like this:
df = pd.DataFrame([[1, 2, 4],[3, 1, 7], [10, 4, 2]], columns=['A','B','C'])
>>> print(df)
A B C
0 1 2 4
1 3 1 7
2 10 4 2
You can get the ranks of each row by doing:
>>> df.rank(axis=1,method='dense', ascending=False)
A B C
0 3.0 2.0 1.0
1 2.0 3.0 1.0
2 1.0 2.0 3.0
By default, applying rank to dataframes and using method='dense' will result in float ranks. This can be easily fixed just by doing:
>>> ranks = df.rank(axis=1,method='dense', ascending=False).astype(int)
>>> ranks
A B C
0 3 2 1
1 2 3 1
2 1 2 3
Finding the indices is a little trickier in pandas, but it reduces to applying a filter on a condition (e.g. ranks==2):
>>> ranks.where(ranks==2)
A B C
0 NaN 2.0 NaN
1 2.0 NaN NaN
2 NaN 2.0 NaN
Applying where returns only the elements matching the condition, with the rest set to NaN. We can retrieve the row and column indices by doing:
>>> ranks.where(ranks==2).notnull().values.nonzero()
(array([0, 1, 2]), array([1, 0, 1]))
And for retrieving the column index (the position within each row), which is the answer to your question:
>>> ranks.where(ranks==2).notnull().values.nonzero()[1]
array([1, 0, 1])
For the third-largest element you just need to change the condition to ranks.where(ranks==3), and so on for other ranks.
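As a compact follow-up (not in the original answer), the column label holding a given rank per row can also be read off with a boolean idxmax:

#label of the 2nd-largest value in each row
#note: idxmax returns the first column when a row has no True,
#so this assumes every row contains the requested rank
ranks.eq(2).idxmax(axis=1)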