I have this dataset:
Id query count
001 abc 20
001 bcd 30
001 ccd 100
002 ace 13
002 ahhd 30
002 ahe 28
I want to find the top 2 queries for each Id, based on the count. So I want to see:
Id query count
001 ccd 100
001 bcd 30
002 ahhd 30
002 ahe 28
I tried this line of code: df.groupby('Id')['count'].nlargest(2), but the "query" column is lost in the result, which is not what I want. How do I keep query in my result? The result currently looks like this:
Id count
001 100
001 30
002 30
002 28
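For reference, here is a minimal construction of the sample frame so the answers below can be tried directly (a sketch; Id is kept as strings here to preserve the leading zeros, while some of the outputs below were produced with integer Ids instead):
import pandas as pd

df = pd.DataFrame({'Id': ['001', '001', '001', '002', '002', '002'],
                   'query': ['abc', 'bcd', 'ccd', 'ace', 'ahhd', 'ahe'],
                   'count': [20, 30, 100, 13, 30, 28]})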
Use set_index with the missing column(s):
df = df.set_index('query').groupby('Id')['count'].nlargest(2).reset_index()
print (df)
Id query count
0 001 ccd 100
1 001 bcd 30
2 002 ahhd 30
3 002 ahe 28
I use groupby and apply the method pd.DataFrame.nlargest. This differs from pd.Series.nlargest in that I have to specify a set of columns to consider when choosing my n rows. This solution keeps the original index values attached to the rows, in case that matters to the OP or end user.
df.groupby('Id', group_keys=False).apply(
    pd.DataFrame.nlargest, n=2, columns='count')
Id query count
2 1 ccd 100
1 1 bcd 30
4 2 ahhd 30
5 2 ahe 28
You could still do this with groupby:
df.sort_values('count', ascending = False).groupby('Id').head(2)
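With the sample frame sketched above, this keeps all three columns and should print roughly:
    Id query  count
2  001   ccd    100
1  001   bcd     30
4  002  ahhd     30
5  002   ahe     28
Note that the original index values are preserved, since the rows are only sorted and filtered.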
Related
I have a table like this (see the linked Input Table). I am trying to group by year and display the data. Here is the code:
import pandas as pd
data=pd.read_excel("Book6.xlsx",sheet_name="Sheet6")
df_new = data[['Date1','Name', 'Fruit','Price']]
df_new['Date1'] = pd.to_datetime(df_new['Date1'], dayfirst=True, errors='coerce')
result = df_new.reset_index().groupby([df_new['Date1'].dt.year,df_new['Name'],df_new['Fruit'],df_new['Price']]).agg('sum')
print(result)#.to_string(index=False))
Even after setting index=False in .to_string, the index still gets displayed. Here is the output table; I don't want the index to be displayed.
Output Table
No need for pandas.DataFrame.to_string; just add drop=True to pandas.DataFrame.reset_index:
result = (df_new.reset_index(drop=True)
.groupby([df_new['Date1'].dt.year,df_new['Name'],df_new['Fruit'],df_new['Price']])
.agg('sum'))
Output :
print(result)
Price
Date1 Name Fruit Price
2017 ghi Cat 100 200
2018 abc Ball 20 40
2019 abc Apple 25 75
ghi Apple 25 50
2020 def Apple 25 50
Cat 100 200
ghi Ball 20 40
2021 abc Apple 25 50
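If the goal is only to hide the row index when printing, to_string(index=False) can still be used, but note that the aggregated column is also called Price (the same name as one of the grouping keys), so calling reset_index directly would collide with it. A small sketch, reusing result from above; TotalPrice is just an illustrative name:
flat = result.rename(columns={'Price': 'TotalPrice'}).reset_index()  # turn the group keys back into columns
print(flat.to_string(index=False))                                   # print without the row numbers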
I have a table like shown below
SKU  Stock  Past
ABC      0    45
ABC     20    30
DEF     22     0
DEF      5    67
Basically, I just want to collapse 'Stock' and 'Past' to a single value per SKU, picking the highest, so the result should be:
SKU  Stock  Past
ABC     20    45
DEF     22    67
Is this possible to do in Pandas? Please advise. Thank you very much!
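For reference, a minimal construction of this sample frame (a sketch, using the column names from the table):
import pandas as pd

df = pd.DataFrame({'SKU': ['ABC', 'ABC', 'DEF', 'DEF'],
                   'Stock': [0, 20, 22, 5],
                   'Past': [45, 30, 0, 67]})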
You can group by "SKU" and use the max method to find the maximum value of each column in each group:
out = df.groupby('SKU').max()
Output:
Stock Past
SKU
ABC 20 45
DEF 22 67
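Note that the grouping key becomes the index of out. If you would rather keep SKU as an ordinary column, a small variation is to pass as_index=False:
out = df.groupby('SKU', as_index=False).max()  # SKU stays as a regular column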
colA is what I currently have.
However, I'm trying to generate colB.
I want colB to contain the number 001 for each value. However, if the associated colA value occurs a second time in that column, I want the colB number for that row to be 002, and so on.
Hopefully the example below gives a better idea of what I'm looking for based on the colA values. I've been struggling to put together any real code for this.
EDIT: Struggling to explain this in words, so if you can think of a better way to explain it feel free to update my question.
colA colB
BJ02 001
BJ02 002
CJ02 001
CJ03 001
CJ02 002
DJ01 001
DJ02 001
DJ07 001
DJ07 002
DJ07 003
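For reference, a minimal construction of this sample column (a sketch) that the answers below operate on:
import pandas as pd

df = pd.DataFrame({'colA': ['BJ02', 'BJ02', 'CJ02', 'CJ03', 'CJ02',
                            'DJ01', 'DJ02', 'DJ07', 'DJ07', 'DJ07']})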
Use groupby with cumcount:
df['colB'] = df.groupby('colA').cumcount().add(1)
print(df)
# Output
colA colB
0 BJ02 1
1 BJ02 2
2 CJ02 1
3 CJ03 1
4 CJ02 2
5 DJ01 1
6 DJ02 1
7 DJ07 1
8 DJ07 2
9 DJ07 3
As suggested by @HenryEcker, use zfill to pad with leading zeros:
df['colB'] = df.groupby('colA').cumcount().add(1).astype(str).str.zfill(3)
print(df)
# Output:
colA colB
0 BJ02 001
1 BJ02 002
2 CJ02 001
3 CJ03 001
4 CJ02 002
5 DJ01 001
6 DJ02 001
7 DJ07 001
8 DJ07 002
9 DJ07 003
You can use a Counter() to keep a running count of each value in colA as you iterate over it, then build the list of values for colB from those running counts.
from collections import Counter

def count_value(col_name):
    new_col = []
    values = df[col_name].tolist()
    seen = Counter()  # running count of how many times each value has appeared so far
    for value in values:
        seen[value] += 1
        new_col.append(str(seen[value]).zfill(3))  # zero-pad to three digits, e.g. '001'
    return new_col

df['colB'] = count_value('colA')
I have a dataframe that looks like this:
import pandas as pd
### create toy data set
data = [[1111,'10/1/2021',21,123],
[1111,'10/1/2021',-21,123],
[1111,'10/1/2021',21,123],
[2222,'10/2/2021',15,234],
[2222,'10/2/2021',15,234],
[3333,'10/3/2021',15,234],
[3333,'10/3/2021',15,234]]
df = pd.DataFrame(data,columns = ['Individual','date','number','cc'])
What I want to do is remove rows where Individual, date, and cc are the same, but number is a negative value in one case and a positive in the other case. For example, in the first three rows, I would remove rows 1 and 2 (because 21 and -21 values are equal in absolute terms), but I don't want to remove row 3 (because I have already accounted for the negative value in row 2 by eliminating row 1). Also, I don't want to remove duplicated values if the corresponding number values are positive. I have tried a variety of duplicated() approaches, but just can't get it right.
Expected results would be:
Individual date number cc
0 1111 10/1/2021 21 123
1 2222 10/2/2021 15 234
2 2222 10/2/2021 15 234
3 3333 10/3/2021 15 234
4 3333 10/3/2021 15 234
Thus, the first two rows are removed, but not the third row, since the negative value is already accounted for.
Any assistance would be appreciated. I am trying to avoid loops, but that may be unavoidable. It seems similar to this question, but I can't figure out how to make it work in my case.
You could try the below. Create a separate df called n that contains the rows with a negative 'number' (dropping the number column so the merge matches on Individual, date and cc), and join it back to the original with indicator=True.
n = df.loc[df.number.le(0)].drop('number',axis=1)
df = pd.merge(df,n,'left',indicator=True)
>>> df
Individual date number cc _merge
0 1111 10/1/2021 21 123 both
1 1111 10/1/2021 -21 123 both
2 1111 10/1/2021 21 123 both
3 2222 10/2/2021 15 234 left_only
4 2222 10/2/2021 15 234 left_only
5 3333 10/3/2021 15 234 left_only
6 3333 10/3/2021 15 234 left_only
This will allow us to identify the Individual/date/cc groups that have a -ve 'number' row.
Then you can locate the rows with 'both' in _merge, and only use those to perform a groupby.head(2), concatenating that with the rest of the df:
out = pd.concat([df.loc[df._merge.eq('both')].groupby(['Individual','date','cc']).head(2),
df.loc[df._merge.ne('both')]]).drop('_merge',axis=1)
Which prints:
Individual date number cc
0 1111 10/1/2021 21 123
1 1111 10/1/2021 -21 123
3 2222 10/2/2021 15 234
4 2222 10/2/2021 15 234
5 3333 10/3/2021 15 234
6 3333 10/3/2021 15 234
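If you then want the rows back in their original order with a clean index, a small follow-up (assuming out from above) could be:
out = out.sort_index().reset_index(drop=True)  # restore original row order, renumber the index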
Data Frame:
city Temperature
0 Chandigarh 15
1 Delhi 22
2 Kanpur 20
3 Chennai 26
4 Manali -2
0 Bengalaru 24
1 Coimbatore 35
2 Srirangam 36
3 Pondicherry 39
I need to create another column in the data frame that contains a boolean value for each city, indicating whether it is a union territory or not. Chandigarh, Pondicherry and Delhi are the only 3 union territories here.
I have written below code
import numpy as np
conditions = [df3['city'] == 'Chandigarh',df3['city'] == 'Pondicherry',df3['city'] == 'Delhi']
values =[1,1,1]
df3['territory'] = np.select(conditions, values)
Is there an easier or more efficient way to write this?
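For reference, a minimal construction of this frame (a sketch; the repeated index values suggest it was concatenated from two frames, so the index is set explicitly here):
import pandas as pd

df3 = pd.DataFrame({'city': ['Chandigarh', 'Delhi', 'Kanpur', 'Chennai', 'Manali',
                             'Bengalaru', 'Coimbatore', 'Srirangam', 'Pondicherry'],
                    'Temperature': [15, 22, 20, 26, -2, 24, 35, 36, 39]},
                   index=[0, 1, 2, 3, 4, 0, 1, 2, 3])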
You can use isin:
union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]
df3["territory"] = df3["city"].isin(union_terrs).astype(int)
which checks each entry in the city column and returns True if it is in union_terrs, otherwise False. The astype converts True/False to 1/0,
to get
city Temperature territory
0 Chandigarh 15 1
1 Delhi 22 1
2 Kanpur 20 0
3 Chennai 26 0
4 Manali -2 0
0 Bengalaru 24 0
1 Coimbatore 35 0
2 Srirangam 36 0
3 Pondicherry 39 1
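If you actually want Boolean True/False values in the new column rather than 1/0, simply drop the astype(int) step:
df3["territory"] = df3["city"].isin(union_terrs)  # keeps the result as booleans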