Pandas MultiIndex sort by group - python

I would like to keep the counts in descending order, but I am unable to keep the rows grouped by index level 0 in the following frame. The rows with code 0512 should stay together, with the count in descending order within that code.
code product count
0510 あたたか新潟こしひかり 5kg           1
0511 キッコ−マン 味わいリッチ減塩しょうゆ 450ml 1
7プレミアム 国産果汁使用ゆずぽん酢 200ml  1
0512 キリン 生茶 525ml              1
キリンレモン 450ml              1
コカ・コーラ い・ろ・は・す もも 555ML   1
サントリー なっちゃん オレンジ 425ml    1
サントリー プレミアムボス ブラック 490ml  2
サントリー 天然水南アルプス 2L ケース     1
サントリー 天然水南アルプス 2L ペット     1
サントリー 朝摘みオレンジ&天然水 540ml   1
大塚 ポカリスエット 900ML ペット      1
森永 inゼリー エネルギーレモン 180g    1
綾鷹 525MLペット               2
7プレミアム パイナップルサイダー 500ml   1
7プレミアム フルーツオ・レ 500ml      1
GAクラフトマン ダークモカ 440ml      1
UCC 職人の珈琲 無糖 930ML ペット    1
0513 アサヒ オフ 500ml×6            1
キリン 本麒麟 500ml             1
万上 濃厚熟成本みりん 1L            1
東村山純米酒 720ml              1
0514 ブルボン プチポテトコンソメ味 45g       1
ロッテ ガーナローストミルク 50g        1
ロッテ グリーンガム 9枚             1
My code
data = df.groupby(['code','product']).size().reset_index(name='counts').set_index(['code','product'])
data1 = data.sort_values(by=['counts','code'], ascending=False).groupby(['product','code']).sum()
EDIT:
I can see that the second groupby puts the codes together but messes up the descending order of count within each code, as we can see for 0512.

You should pass a list to the ascending argument in the second line, like this:
data1 = data.sort_values(by=['counts','code'], ascending=[False,False]).groupby(['product','code']).sum()
The list form makes the direction of each sort key explicit and lets you set it per column; a single boolean is applied to every column in by.
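A minimal sketch of another way to get the layout the question describes, assuming the goal is to keep each code block together (codes ascending, as in the frame shown) with counts descending inside each block. The sample rows and product names are made up for illustration. Because groupby sorts its result by the group keys by default, the sort is done last here rather than before a second groupby.

```python
import pandas as pd

# Hypothetical sample data: one row per purchased item.
df = pd.DataFrame({
    "code": ["0512", "0511", "0512", "0512", "0511", "0512", "0510"],
    "product": ["tea", "soy", "tea", "water", "ponzu", "cola", "rice"],
})

# Count rows per (code, product) pair.
counts = df.groupby(["code", "product"]).size().reset_index(name="counts")

# Sort by code first (ascending) so each block stays together,
# then by counts (descending) inside each block.
result = (
    counts.sort_values(by=["code", "counts"], ascending=[True, False])
          .set_index(["code", "product"])
)
print(result)
```

Sorting on code as the first key is what keeps the blocks contiguous; counts only breaks ties inside a block.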

Related

Find "most used items" per "level" in big csv file with Pandas

I have a rather big csv file and I want to find out which items are used the most at a certain player level.
So one column I'm looking at has all the player levels (from 1 to 30), another column has all the item names (e.g. knife_1, knife_2, etc.) and yet another column lists backpacks (backpack_1, backpack_2, etc.).
Now I want to check which is the most used knife and backpack for player level 1, for player level 2, player level 3, etc.
What I've tried was this but when I tried to verify it in Excel (with countifs) the results were different:
import pandas as pd
df = pd.read_csv('filename.csv')
#getting the columns I need:
df = df[["playerLevel", "playerKnife", "playerBackpack"]]
print(df.loc[df["playerLevel"] == 1].mode())
In my head, this should locate all the rows with playerLevel 1 and then only print out the most used items for that level. However, I wanted to double-check and used "countifs" in excel which gave me a different result.
Maybe I'm thinking too simple (or complicated) so I hope you can either verify that my code should be correct or point out the error.
I'm also looking for an easy way to then go through all levels automatically and print out the most used items for each level.
Thanks in advance.
Edit:
Dataframe example. Just imagine there are thousands of players that can range from level 1 to level 30. And especially on higher levels, they have access to a lot of knives and backpacks. So the combinations are limitless.
index playerLevel playerKnife playerBackpack
0 1 knife_1 backpack_1
1 2 knife_2 backpack_1
2 3 knife_1 backpack_2
3 1 knife_2 backpack_1
4 2 knife_3 backpack_2
5 1 knife_1 backpack_1
6 15 knife_13 backpack_12
7 13 knife_10 backpack_9
8 1 knife_1 backpack_2
Try the following:
data = """\
index playerLevel playerKnife playerBackpack
0 1 knife_1 backpack_1
1 2 knife_2 backpack_1
2 3 knife_1 backpack_2
3 1 knife_2 backpack_1
4 2 knife_3 backpack_2
5 1 knife_1 backpack_1
6 15 knife_13 backpack_12
7 13 knife_10 backpack_9
8 1 knife_1 backpack_2
"""
import io
import pandas as pd
stream = io.StringIO(data)
df = pd.read_csv(stream, sep=r'\s+')  # raw string avoids an invalid-escape warning
df = df.drop('index', axis='columns')
print(df.groupby('playerLevel').agg(pd.Series.mode))
yields
playerKnife playerBackpack
playerLevel
1 knife_1 backpack_1
2 [knife_2, knife_3] [backpack_1, backpack_2]
3 knife_1 backpack_2
13 knife_10 backpack_9
15 knife_13 backpack_12
Note that the result of df.groupby('playerLevel').agg(pd.Series.mode) is a DataFrame, so you can assign that result and use it as a normal dataframe.
For data read straight from a CSV file, simply use
df = pd.read_csv('filename.csv')
df = df[['playerLevel', 'playerKnife', 'playerBackpack']] # or whichever columns you want
stats = df.groupby('playerLevel').agg(pd.Series.mode) # stats will be a dataframe as well
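To cross-check the mode against Excel's COUNTIFS, it may help to look at the raw counts per level instead of only the winner. The following is a small sketch using the sample rows from the question; it assumes the column names playerLevel and playerKnife exactly as given there.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "playerLevel": [1, 2, 3, 1, 2, 1, 15, 13, 1],
    "playerKnife": ["knife_1", "knife_2", "knife_1", "knife_2", "knife_3",
                    "knife_1", "knife_13", "knife_10", "knife_1"],
    "playerBackpack": ["backpack_1", "backpack_1", "backpack_2", "backpack_1",
                       "backpack_2", "backpack_1", "backpack_12", "backpack_9",
                       "backpack_2"],
})

# Number of occurrences of every knife at every level
# (the same figure COUNTIFS would give for a level/knife pair).
knife_counts = df.groupby("playerLevel")["playerKnife"].value_counts()

# Counts for level 1 only.
print(knife_counts.loc[1])
```

If these counts agree with Excel but the mode does not, the difference is probably in how ties or blank cells are handled rather than in the grouping itself.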

How to use or command in pandas to categorize my Data

I think this might be a noob question, but I'm new to coding. I used the following code to categorize my data. What I need is a rule that assigns the category even when not all of my conditions are fulfilled together, e.g. when only 4 out of 7 conditions hold. How can I do it? I really appreciate any help you can provide.
c1=df['Stroage Condition'].eq('refrigerate')
c2=df['Profit Per Unit'].between(100,150)
c3=df['Inventory Qty']<20
df['Restock Action']=np.where(c1&c2&c3,'Hold Current stock level','On Sale')
print(df)
Let's say this is your dataframe:
   Stroage Condition  refrigerate  Profit Per Unit  Inventory Qty
0                  0            1                0             20
1                  1            1              102              1
2                  2            2                5              2
3                  3            0              100              8
and the conditions are close to the ones you defined (here c1 compares two columns):
c1=df['Stroage Condition'].eq(df['refrigerate'])
c2=df['Profit Per Unit'].between(100,150)
c3=df['Inventory Qty']<20
Then you can define a helper function and pass its result to np.where(). Inside it you set how many conditions have to be True. In this example I set the threshold to at least two.
def my_select(x,y,z):
    return np.array([x,y,z]).sum(axis=0) >= 2
Finally you run one more line:
df['Restock Action']=np.where(my_select(c1,c2,c3), 'Hold Current stock level', 'On Sale')
print(df)
This prints to the console:
   Stroage Condition  refrigerate  Profit Per Unit  Inventory Qty            Restock Action
0                  0            1                0             20                   On Sale
1                  1            1              102              1  Hold Current stock level
2                  2            2                5              2  Hold Current stock level
3                  3            0              100              8  Hold Current stock level
If you have more conditions or rules, you have to extend the function with as many parameters as you have rules.
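One way to avoid rewriting the function for each new rule is to accept any number of conditions. This is only a sketch: at_least is a hypothetical helper name, not a pandas or NumPy function, and the data is the example frame from this answer.

```python
import numpy as np
import pandas as pd

def at_least(k, *conditions):
    """Return a boolean array that is True where at least k conditions hold."""
    # Stack the boolean conditions row-wise and count the True values per column.
    stacked = np.array([np.asarray(c, dtype=bool) for c in conditions])
    return stacked.sum(axis=0) >= k

df = pd.DataFrame({
    "Stroage Condition": [0, 1, 2, 3],
    "refrigerate": [1, 1, 2, 0],
    "Profit Per Unit": [0, 102, 5, 100],
    "Inventory Qty": [20, 1, 2, 8],
})

c1 = df["Stroage Condition"].eq(df["refrigerate"])
c2 = df["Profit Per Unit"].between(100, 150)
c3 = df["Inventory Qty"] < 20

# "Hold" when at least 2 of the 3 conditions are met.
df["Restock Action"] = np.where(
    at_least(2, c1, c2, c3), "Hold Current stock level", "On Sale"
)
print(df)
```

For the case in the question, the call would be at_least(4, c1, c2, ..., c7) with the seven conditions defined the same way.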

How not to use loop in a df when access previous lines

I use pandas to process transport data and study the ridership of bus lines. I have two columns counting the people getting on and off the bus at each stop, and I want to create a third one that counts the people currently on board. At the moment I loop through the df, and for row n it computes current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index,row in df.iterrows():
    if index == 0:
        df.loc[index,'current']=df.loc[index,'on']
    else :
        df.loc[index,'current']=df.loc[index,'on']-df.loc[index,'off']+df.loc[index-1,'current']
Is there a way to avoid using a loop ?
Thanks for your time !
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6
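If the frame holds several bus trips one after another, the running total presumably has to restart for each trip. The sketch below assumes a hypothetical trip column that identifies each run (it is not in the original question) and that nobody gets off at the first stop of a trip, which matches the first-row case of the loop in the question.

```python
import pandas as pd

# Hypothetical data: two trips stacked in one frame.
df = pd.DataFrame({
    "trip": ["t1", "t1", "t1", "t2", "t2"],
    "on":   [3, 2, 0, 4, 1],
    "off":  [0, 1, 4, 0, 2],
})

# Net change at each stop, accumulated separately within each trip.
df["current"] = (df["on"] - df["off"]).groupby(df["trip"]).cumsum()
print(df)
```

Without the groupby, the passengers left at the end of one trip would carry over into the next one.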

Python Pandas - Get the First Value that Meets Criteria

We have this function:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    df["getfirst"] = np.where(df["USDamt"] > CustomAmt, 1, 0)
    wantedprice = "??"
    print(df)
    print()
    print("Wanted Price:",wantedprice)
    return wantedprice
Calling it using a custom USDamt like this:
GetPricePerCustomAmt(500)
gets this result:
Price USDamt getfirst
0 281.48 104.84 0
1 281.44 5140.77 1
2 281.42 10072.24 1
3 281.39 15773.83 1
4 281.33 19314.54 1
5 281.27 22255.55 1
6 281.20 23427.64 1
7 281.13 23708.77 1
8 281.10 23738.77 1
9 281.08 24019.88 1
10 281.01 25986.95 1
11 281.00 26127.45 1
Wanted Price: ??
We want to return the Price row of the first 1 appearing in the "getfirst" column.
Examples:
GetPricePerCustomAmt(500)
Wanted Price: 281.44
GetPricePerCustomAmt(15000)
Wanted Price: 281.39
GetPricePerCustomAmt(24000)
Wanted Price: 281.08
How do we do it?
(If you know a more efficient way to get the wanted price please do tell too)
Use next with iter so that a default value is returned when nothing matches and the filtered Series is empty; for the filtering itself, use boolean indexing:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    return next(iter(df.loc[df["USDamt"] > CustomAmt, 'Price']), 'no matched')
print(GetPricePerCustomAmt(500))
281.44
print(GetPricePerCustomAmt(15000))
281.39
print(GetPricePerCustomAmt(24000))
281.08
print(GetPricePerCustomAmt(100000))
no matched
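Another way to pick the first match, sketched here under the assumption that the frame is built once and passed in rather than rebuilt on every call. first_price_above is a hypothetical helper name; it relies on idxmax returning the label of the first True in a boolean Series.

```python
import pandas as pd

def first_price_above(df, custom_amt, default=None):
    """Return the Price of the first row whose USDamt exceeds custom_amt."""
    mask = df["USDamt"] > custom_amt
    if not mask.any():
        # No row qualifies, so fall back to the default.
        return default
    # idxmax gives the index label of the first True value in the mask.
    return df.loc[mask.idxmax(), "Price"]

df = pd.DataFrame([
    {"Price": 281.48, "USDamt": 104.84},
    {"Price": 281.44, "USDamt": 5140.77},
    {"Price": 281.42, "USDamt": 10072.24},
    {"Price": 281.39, "USDamt": 15773.83},
    {"Price": 281.08, "USDamt": 24019.88},
])

print(first_price_above(df, 500))
print(first_price_above(df, 100000, default="no matched"))
```

The explicit mask.any() check matters because idxmax on an all-False mask would still return the first label.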

How do you set a specific column with a specific value to a new value in a Pandas DF?

I imported a CSV file that has two columns ID and Bee_type. The bee_type has two types in it - bumblebee and honey bee. I'm trying to convert them to numbers instead of names; i.e. instead of bumblebee it says 1.
However, my code is setting everything to 1. How can I keep the ID column its original value and only change the bee_type column?
# load the labels using pandas
labels = pd.read_csv("bees/train_labels.csv")
#Set bumble_bee to one
for index in range(len(labels)):
    labels[labels['bee_type'] == 'bumble_bee'] = 1
I believe you need map by dictionary if only 2 possible values exist:
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
Another solution is to use numpy.where - set values by condition:
labels['bee_type'] = np.where(labels['bee_type'] == 'bumble_bee', 1, 2)
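Your code runs, but the loop is unnecessary and the assignment overwrites every column of the matching rows, ID included, as the output shows. The loop-free equivalent with loc is below; to change only bee_type, select that column as well with labels.loc[labels['bee_type'] == 'bumble_bee', 'bee_type'] = 1: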
labels.loc[labels['bee_type'] == 'bumble_bee'] = 1
print (labels)
ID bee_type
0 1 1
1 1 honey_bee
2 1 1
3 3 honey_bee
4 1 1
Sample:
labels = pd.DataFrame({
'bee_type': ['bumble_bee','honey_bee','bumble_bee','honey_bee','bumble_bee'],
'ID': list(range(5))
})
print (labels)
ID bee_type
0 0 bumble_bee
1 1 honey_bee
2 2 bumble_bee
3 3 honey_bee
4 4 bumble_bee
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
print (labels)
ID bee_type
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
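One thing to watch with map: values missing from the dictionary become NaN. The sketch below shows a quick check for that while confirming that ID is left alone; the wasp row is a made-up value added only to trigger the case.

```python
import pandas as pd

labels = pd.DataFrame({
    "ID": [10, 11, 12, 13],
    # "wasp" is hypothetical, included to show an unmapped value.
    "bee_type": ["bumble_bee", "honey_bee", "bumble_bee", "wasp"],
})

codes = {"bumble_bee": 1, "honey_bee": 2}
mapped = labels["bee_type"].map(codes)

# Names that had no entry in the dictionary (they turn into NaN).
unmapped = labels.loc[mapped.isna(), "bee_type"].unique().tolist()
print("unmapped:", unmapped)

# Replace only the bee_type column; ID keeps its original values.
recoded = labels.assign(bee_type=mapped)
print(recoded)
```

If unmapped is not empty, either extend the dictionary or decide explicitly what those rows should become before training on the column.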
As far as I understand, you want to convert names to numbers. If that's the case, please try LabelEncoder. Detailed documentation can be found in the scikit-learn LabelEncoder docs.
