Get the index of n maximum values in a column in dataframe - python

I have a data frame and I want to get the index and value of the 4 maximum values in each row. For example, in the following df, the four maximum values in column a are 10, 8, 7, and 6.
import pandas as pd
df = pd.DataFrame()
df['a'] = [10, 2, 3, -1,4,5,6,7,8]
df['id'] = [100, 2, 3, -1,4,5,0,1,2]
df
I want the output to contain both the index and the value of each of those four maxima.

Try nlargest,
df.nlargest(4, 'a').reset_index()
Output:
index a id
0 0 10 100
1 8 8 2
2 7 7 1
3 6 6 0
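If you only need the index labels and the values rather than whole rows, you can pull them from the nlargest result directly (a small sketch using the df above):
top = df.nlargest(4, 'a')
print(top.index.tolist())  # [0, 8, 7, 6]
print(top['a'].tolist())   # [10, 8, 7, 6]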

You can also sort the a column, take the top four rows, and restore the original row order:
out = (df.sort_values('a', ascending=False).iloc[:4]
.sort_index(ascending=True)
.reset_index())
print(out)
index a id
0 0 10 100
1 6 6 0
2 7 7 1
3 8 8 2
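If performance matters on large frames, numpy's argpartition can locate the top four positions without fully sorting the column (a sketch, not from the original answers):
import numpy as np
pos = np.argpartition(df['a'].to_numpy(), -4)[-4:]  # positions of the 4 largest, in no particular order
print(df.iloc[pos])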


pandas dataframe from dictionary where keys are tuples of tuples of row indexes and column indexes resp [duplicate]

I tried to create a data frame df using the code below:
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print df
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the syntax below, I am getting a weird output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print(df)
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why NaN is displayed in the dataframe when both Series are non-empty, and why only two rows are displayed rather than all of them.
Also, what is the correct way to create the data frame shown above using the columns argument of the pandas DataFrame constructor?
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
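As a minor aside, np.c_ here is just column stacking, so np.column_stack is an equivalent spelling:
np.column_stack((s, t))  # same 6x2 array as np.c_[s, t]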
If you remove the columns argument, you get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
Then, if you define columns, any column label that does not exist in the data produces a NaN column:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
Better is to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in the DataFrame documentation.
Another solution uses concat, where the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
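Note that concat aligns the Series on their indexes; mismatched index labels produce NaN, which is the same label-alignment behaviour that produced the NaN columns in the question (a small illustration):
u = pd.Series([1, 2, 3], index=[0, 1, 2])
v = pd.Series([10, 20, 30], index=[2, 3, 4])
print(pd.concat([u, v], axis=1, keys=['u', 'v']))  # NaN wherever a label is missing from one Series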
A pandas.DataFrame takes a data parameter that can be an ndarray, iterable, dict, or DataFrame.
If you pass in a list, it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
You are getting NaN because each Series becomes a row whose labels come from the Series index [0, 1, 2, 3, 4, 5]; those become the column labels, so the requested columns "MUL1" and "MUL2" don't exist and are filled with NaN.
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
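Passing the transposed array to the constructor then gives the intended shape (continuing the example above):
df = pd.DataFrame(data, columns=["Col1", "Col2"])
#    Col1  Col2
# 0     1     2
# 1     2     4
# 2     3     6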
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6

first/count applied to groupby returns empty dataframe

import pandas as pd
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8]} )
dummy = df["A"]
print(dummy)
0 1
1 1
2 2
3 3
4 4
5 5
6 5
7 6
8 7
9 7
10 7
11 8
Name: A, dtype: int64
res = df.groupby(dummy)
print(res.first())
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7, 8]
Why does the last print result in an empty dataframe? I expect each group to be a slice of the original df, where each slice contains as many rows as there are duplicates of a given value in column "A". What am I missing?
My guess is that, by default, A is set as the index before the groupby aggregation (e.g. first) is applied. Since A is the only column, df is essentially empty by the time first runs. If you have another column B:
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8], 'B':range(12)} )
then you would see A as the index and the first values for B in each group with df.groupby(dummy).first():
B
A
1 0
2 2
3 3
4 4
5 5
6 7
7 8
8 11
On another note, if you force as_index=False, groupby will not set A as the index and you get non-empty data:
df.groupby(dummy, as_index=False).first()
gives:
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
Or, you can group by a copy of the column; because the copy is a separate object, pandas treats it as an external key rather than as column A itself, so A stays in the output:
df.groupby(dummy.copy()).first()
and you get:
A
A
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
By default, as_index is True, which means groupby will take the passed column, make it the index, and then group the other columns of the DataFrame accordingly. You need as_index=False to get your desired results.
import pandas as pd
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8]} )
dummy = df["A"]
print(dummy)
res = df.groupby(dummy,as_index=False)
print(res.first())
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
as_index : bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
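Equivalently, you can keep the default and restore A as a column after aggregating (a minimal sketch using df and dummy from above):
print(df.groupby(dummy).first().reset_index())
# same single-column result as the as_index=False output above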

Append Data to Pandas Dataframe

I have the following pandas dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})
I'm working on a for loop that grabs an index and then appends data to the end of that row.
How do I append columns C, D, E for a given index of the table? Let's say on iteration one, the index is 2:
A B C D E
0 1 4 0 0 0
1 2 5 0 0 0
2 3 6 34 12 23
3 7 29 0 0 0
On the next iteration of the for loop, the index might be 1. Then the dataframe would be:
A B C D E
0 1 4 0 0 0
1 2 5 8 11 4
2 3 6 34 12 23
3 7 29 0 0 0
How do I do this?
You can target specific rows by using loc and providing the index.
For example:
df.loc[5, 'D'] = 10
This assigns the value 10 to column D at row index 5.
Your question states that you want to add new columns depending on the row index. That doesn't quite work, because a dataframe is not like a NoSQL document where you can add columns to one row independently of the others.
What you should do is have all your columns already added to your dataframe, then add values as you go.
To add multiple values:
df.loc[5, ['D', 'B']] = 10
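Putting that together for the loop in the question: create C, D, and E once, filled with zeros, then assign row by row (a sketch; the per-iteration values are taken from the question's tables):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})
for col in ['C', 'D', 'E']:
    df[col] = 0                            # add each new column up front, defaulting to 0
df.loc[2, ['C', 'D', 'E']] = [34, 12, 23]  # iteration one: index 2
df.loc[1, ['C', 'D', 'E']] = [8, 11, 4]    # next iteration: index 1
print(df)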

How to perform an IF statement for duplicate values within the same column

I have a DataFrame and want to find duplicate values within a column. If found, I want to create a new column that appends a zero for every duplicate occurrence but leaves the original value unchanged.
Original DataFrame:
Code1
1
2
3
4
5
1
2
1
1
New DataFrame:
Code1 Code2
1 1
2 2
3 3
4 4
5 5
1 10
2 20
1 100
1 1000
Use groupby and cumcount:
df.assign(counts=df.groupby("Code1").cumcount(),
          Code2=lambda x: x["Code1"] * 10 ** x["counts"]
          ).drop("counts", axis=1)
Code1 Code2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 1 10
6 2 20
7 1 100
8 1 1000
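The helper column can also be skipped by using the cumcount result inline (a compact variant of the same idea, assuming df with the Code1 column from the question):
df['Code2'] = df['Code1'] * 10 ** df.groupby('Code1').cumcount()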
There might be a solution using transform (I just don't have time right now to investigate). However, this version is really explicit about what is happening:
import pandas as pd
data = [1, 2, 3, 4, 5, 1, 2, 1, 1]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Code1'])
code2 = []
x = {}  # tracks the last value emitted for each code
for d in data:
    if d not in x:
        # first occurrence: keep the original value
        x[d] = d
    else:
        # duplicate: append a zero by multiplying by 10
        x[d] = x[d] * 10
    code2.append(x[d])
df['Code2'] = code2
print(df)

Efficiently integrate a series into a pandas dataframe

I have a pandas dataframe with index [0, 1, 2...], and a list something like this: [1, 2, 2, 0, 1...].
I'd like to add a 'count' column to the dataframe, that reflects the number of times the digit in the index is referenced in the list.
Given the example lists above, the 'count' column would have the value 2 at index 2, because 2 occurred twice (so far). Is there a more efficient way to do this than iterating over the list?
Well, here is a way of doing it: first load the list into a df, then add the 'occurence' column using value_counts, and then merge this to your orig df:
In [61]:
df = pd.DataFrame({'a':np.arange(10)})
l=[1,2,2,0,1]
df1 = pd.DataFrame(l, columns=['data'])
df1['occurence'] = df1['data'].map(df1['data'].value_counts())
df1
Out[61]:
data occurence
0 1 2
1 2 2
2 2 2
3 0 1
4 1 2
In [65]:
df.merge(df1, left_index=True, right_on='data', how='left').fillna(0).drop_duplicates().reset_index(drop=True)
Out[65]:
a data occurence
0 0 0 1
1 1 1 2
2 2 2 2
3 3 3 0
4 4 4 0
5 5 5 0
6 6 6 0
7 7 7 0
8 8 8 0
9 9 9 0
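For the exact 'count' column the question asks for, you can also map the value counts of the list straight onto the dataframe's index (a sketch, assuming df and l from above):
df['count'] = df.index.map(pd.Series(l).value_counts()).fillna(0).astype(int)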
Counting occurrences of numbers in a dataframe is easy in pandas.
You just use the Series.value_counts method.
Then you join the grouped dataframe with the original one using the pandas.merge function.
Setting up a DataFrame like the one you have:
df = pd.DataFrame({'nomnom':np.random.choice(['cookies', 'biscuits', 'cake', 'lie'], 10)})
df is now a DataFrame with some arbitrary data in it (since you said you had more data in there).
nomnom
0 biscuits
1 lie
2 biscuits
3 cake
4 lie
5 cookies
6 cake
7 cake
8 cake
9 cake
Setting up a list like the one you have:
yourlist = np.random.choice(10, 10)
yourlist is now:
array([2, 9, 2, 3, 4, 8, 5, 8, 6, 8])
The actual code you need (TL;DR):
counts = pd.DataFrame(pd.value_counts(yourlist))
pd.merge(left=df, left_index=True,
right=counts, right_index=True,
how='left').fillna(0)
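The merged count column keeps whatever name value_counts gives it (0 in older pandas, 'count' in pandas 2.x), so you may want to normalize it (a follow-up sketch using the objects above):
result = pd.merge(left=df, left_index=True,
                  right=counts, right_index=True,
                  how='left').fillna(0)
result.columns = ['nomnom', 'count']  # give the merged count column a stable name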
