Inner joining two columns in Pandas - python

I have a rather basic pandas question, but I've tried merge and join with no success.
-edit: these are in the same dataframe, and that wasn't clear. We are indeed condensing the data.
print df
  product_code_shipped  quantity product_code
0               A12395         1       A12395
1               H53456         4       D78997
2               A13456         3       E78997
3               A12372         8       A13456
4               E28997         1       D83126
5               B78997         2       C64516
6               C78117         9       B78497
7               B78227         1       H53456
8               B78497         2       J12372
I want a single product-code column containing the unique product codes alongside their other data (quantity, and in another column, color): just the product codes of the shipped products. How do I do this within the same dataframe?
So I should get
print df2
  product_code_shipped  quantity product_code   color
0               A12395         1       A12395     red
1               H53456         4       H53456    blue
2               B78497         2       B78497  yellow

I'm a little confused by your question, specifically where "unique product codes" enter in...are we condensing the data? The example does not make that clear. Nonetheless I'll give it a shot:
Many DataFrame methods rely on the indexes to automatically align data. In your case, it seems convenient to set the index of these DataFrames to the product code. So you'd have this:
In [132]: shipped
Out[132]:
                      quantity
product_code_shipped
A                            1
B                            4
C                            2

In [133]: info
Out[133]:
              color
product_code
A               red
B              blue
C            yellow
Now, join requires no extra parameters; it gives you exactly what (I think) you want.
In [134]: info.join(shipped)
Out[134]:
               color  quantity
product_code
A                red         1
B               blue         4
C             yellow         2
If this doesn't answer your question, please clarify it by giving example input including where color comes from and the exact output that would come from that input.
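As a self-contained sketch of the approach above (the short codes A/B/C stand in for the real product codes such as A12395, and the two frames are reconstructed from the question):

```python
import pandas as pd

# Quantities of shipped products, indexed by product code
shipped = pd.DataFrame(
    {"quantity": [1, 4, 2]},
    index=pd.Index(["A", "B", "C"], name="product_code_shipped"))

# Product info (color), also indexed by product code
info = pd.DataFrame(
    {"color": ["red", "blue", "yellow"]},
    index=pd.Index(["A", "B", "C"], name="product_code"))

# join aligns on the index values, so no extra parameters are needed
result = info.join(shipped)
print(result)
```

Note that join matches on index values even though the two index names differ, which is exactly what makes the set-the-index-first approach convenient here.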

Related

Is there a way to associate the value of a row with another row in Excel using Python

I created a df from the data of my Excel sheet, and in a specific column I have a lot of values that are the same, but some of them are different. What I want to do is find in which rows these different values are and associate each one with another value from the same row. I will give an example:
ColA     ColB
'Ship'      5
'Ship'      5
'Car'       3
'Ship'      5
'Plane'     2
Following the example, is there a way to find where the values different from 5 are with the code giving me the respective value from ColA? In this case would be finding 3 and 2, returning for me 'Car' and 'Plane', respectively.
Any help is welcome! :)
It depends on exactly what you want to do, but you could use:
a boolean filter - to test for the value you seek.
.where - to keep only the values where that test is False.
Given the above dataframe the following would work:
df['different'] = df['ColB']==5
df['type'] = df['ColA'].where(df['different']==False)
print(df)
Which returns this:
    ColA  ColB  different   type
0   Ship     5       True    NaN
1   Ship     5       True    NaN
2    Car     3      False    Car
3   Ship     5       True    NaN
4  Plane     2      False  Plane
The fourth column ('type') has what you seek.
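If the helper columns aren't needed, plain boolean indexing gets the same answer in one step (a sketch using the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame({"ColA": ["Ship", "Ship", "Car", "Ship", "Plane"],
                   "ColB": [5, 5, 3, 5, 2]})

# Select ColA wherever ColB differs from the common value 5
different = df.loc[df["ColB"] != 5, "ColA"]
print(different.tolist())  # ['Car', 'Plane']
```

The index of the returned Series (2 and 4 here) also tells you which rows the odd values came from.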

Pandas DataFrame MultiIndex Pivot - Remove Empty Headers and Axis Rows

this is closely related to the question I asked earlier here Python Pandas Dataframe Pivot Table Column and Values Order. Thanks again for the help. Very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the Indexes and/or Axis post-Pivots to enable me to use the .style CSS functions (i.e. creating a Styler Object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the column Sorting Value but not the Row/Header/Axis(?). Additionally, no matter what I try I just can't seem to get rid of the blank row between the headers and the DataTable.
I (think) I need to set the Index = Bucket and Headers = "Name" and "TDY/Change" to use the .style style object functionality properly.
import pandas as pd
import numpy as np
data = [
    ['AAA', 2, 'X', 3, 5, 1],
    ['AAA', 2, 'Y', 1, 10, 2],
    ['AAA', 2, 'Z', 2, 15, 3],
    ['BBB', 3, 'X', 3, 15, 3],
    ['BBB', 3, 'Y', 1, 10, 2],
    ['BBB', 3, 'Z', 2, 5, 1],
    ['CCC', 1, 'X', 3, 10, 2],
    ['CCC', 1, 'Y', 1, 15, 3],
    ['CCC', 1, 'Z', 2, 5, 1],
]
df = pd.DataFrame(data, columns=['Name', 'Name_Rank', 'Bucket',
                                 'Bucket_Rank', 'Price', 'Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (pd.pivot_table(df, values=['Price', 'Change'],
                      index=['Bucket_Rank', 'Bucket'],
                      columns=['Name_Rank', 'Name'], aggfunc=np.mean)
       .swaplevel(1, 0, axis=1)
       .sort_index(level=0, axis=1)
       .reindex(['Price', 'Change'], level=1, axis=1)
       .swaplevel(2, 1, axis=1)
       .rename_axis(columns=[None, None, None])
       .reset_index()
       .drop('Bucket_Rank', axis=1)
       .set_index('Bucket')
       .rename_axis(columns=[None, None, None]))
which looks like this:
            1             2             3
          CCC           AAA           BBB
        Price Change  Price Change  Price Change
Bucket
Y          15      3     10      2     10      2
Z           5      1     15      3      5      1
X          10      2      5      1     15      3
Ok, so...
A) How do I get rid of the Row/Header/Axis(?) that used to be "Name_Rank" (i.e. the integer "Sorting Values" 1, 2, 3)? I figured out a hack where the df is exported to XLS and re-imported with header=(1,2), but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like rename_axis([None]) should do it, but it doesn't seem to work no matter which order I try.
C) Is there a way to set the headers such that both what used to be "Name" and the "Price/Change" rows are headers, so that the .style functionality can format them separately from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon
In pandas 1.4.0 the options for A and B are directly available using the Styler.hide method.

Drop or replace values within duplicate rows in pandas dataframe

I have a data frame df where some rows are duplicates with respect to a subset of columns:
A  B     C
1  Blue  Green
2  Red   Green
3  Red   Green
4  Blue  Orange
5  Blue  Orange
I would like to remove (or replace with a dummy string) values for duplicate rows with respect to B and C, without deleting the whole row, ideally producing:
A  B     C
1  Blue  Green
2  Red   Green
3  NaN   NaN
4  Blue  Orange
5  NaN   NaN
As per this thread: Replace duplicate values across columns in Pandas I've tried using pd.Series.duplicated, however I can't get it to work with duplicates in a subset of columns.
I've also played around with:
is_duplicate = df.loc[df.duplicated(subset=['B','C'])]
df = df.where(is_duplicated==True, 999) # 999 intended as a placeholder that I could find-and-replace later on
However this replaces almost every row with 999 in each column - so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!
df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan seems to work for me.
Edited to include #ALollz and #macaw_9227 correction.
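A runnable sketch of that one-liner against the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

# Blank out B and C on rows that repeat an earlier (B, C) pair,
# keeping the rows themselves (and column A) intact.
df.loc[df.duplicated(subset=["B", "C"]), ["B", "C"]] = np.nan
print(df)
```

The key difference from the attempt in the question is that both the row mask and the column list go inside a single .loc, so only the B and C cells of duplicate rows are assigned.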
Let me share with you how I used to confront these kinds of challenges in the beginning. Obviously, there are quicker ways (a one-liner), but for the sake of the answer, let's do it on a more intuitive level (later, you'll see that you can do it in one line).
So here we go...
df = pd.DataFrame({"B":['Blue','Red','Red','Blue','Blue'],"C":['Green','Green','Green','Orange','Orange']})
which results in:
      B       C
0  Blue   Green
1   Red   Green
2   Red   Green
3  Blue  Orange
4  Blue  Orange
Step 1: identify the duplication:
For this, I'm simply adding another (facilitator) column and asking with True/False if B and C are duplicated.
df['IS_DUPLICATED']= df.duplicated(subset=['B','C'])
Step 2: Identify the indexes of the 'True' IS_DUPLICATED:
dup_index = df[df['IS_DUPLICATED']==True].index
result: Int64Index([2, 4], dtype='int64')
Step 3: mark them as NaN:
df.iloc[dup_index]=np.NaN
Step 4: remove the IS_DUPLICATED column:
df.drop('IS_DUPLICATED',axis=1, inplace=True)
and the desired result:
      B       C
0  Blue   Green
1   Red   Green
2   NaN     NaN
3  Blue  Orange
4   NaN    NaN
I would use mask with duplicated:
df[['B','C']]=df[['B','C']].mask(df.duplicated(['B','C']))
df
Out[141]:
   A     B       C
0  1  Blue   Green
1  2   Red   Green
2  3   NaN     NaN
3  4  Blue  Orange
4  5   NaN     NaN

Pandas - Drop duplicate rows from a DataFrame based on a condition from a Series by keeping prioritized values

Let's say I have the following DataFrame:
ID  Color
1   Red
2   Yellow
1   Green
3   Red
1   Green
2   Red
And let's presume that the priority of the colors is as following:
Green > Yellow > Red
I want to remove rows with duplicate IDs by keeping the one, for which the color has the highest priority. So, for this example I would like to get this result:
ID  Color
1   Green
2   Yellow
3   Red
Any ideas how I can achieve this by using pandas functions? I've done a lot of research on the Internet, including the pandas documentation, but couldn't think of a good approach. Any help would be much appreciated.
You can do this in at least two ways once you have set your colors to a category dtype with an order.
df['Color'] = pd.Categorical(df['Color'], categories=['Red','Yellow','Green'], ordered=True)
Option 1:
df.sort_values('Color', ascending=False).drop_duplicates(['ID'])
Output:
   ID   Color
4   1   Green
1   2  Yellow
3   3     Red
Option 2:
df.groupby('ID')['Color'].max()
Output:
ID
1     Green
2    Yellow
3       Red
Name: Color, dtype: object
Using map, you can create your own order dict, sort by it, and then drop_duplicates keeping the last:
df.iloc[df.Color.map({'Red':0,'Yellow':1,'Green':2}).argsort()].drop_duplicates('ID',keep='last')
Out[607]:
   ID   Color
3   3     Red
1   2  Yellow
4   1   Green
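The ordered-categorical route (Option 1 above) as a self-contained sketch, with a final sort by ID so the result matches the desired output:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 1, 3, 1, 2],
                   "Color": ["Red", "Yellow", "Green", "Red", "Green", "Red"]})

# Give Color an explicit priority order so sorting respects it
df["Color"] = pd.Categorical(df["Color"],
                             categories=["Red", "Yellow", "Green"],
                             ordered=True)

# Sort highest-priority first, then keep the first row seen per ID
result = (df.sort_values("Color", ascending=False)
            .drop_duplicates(["ID"])
            .sort_values("ID"))
print(result)
```

Encoding the priority in the dtype (rather than in an ad-hoc dict) also means later comparisons, max(), and sorts on that column all agree on the same order.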

Sequentially counting repeated entries

I am currently working on a project where I have to measure someone's activity over time on a site, based on whether they edit a page. I have a data frame that looks similar to this:
df = pd.DataFrame({"x": ["a", "b", "c", "b", "b"],
                   "y": ["red", "blue", "green", "yellow", "red"],
                   "z": [1, 2, 3, 4, 5]})
I want to add a column to the dataframe such that it counts the number of repeated values (number of edits, which is column x) there are, using the "z" column as the measure of when the events happened.
E.g. to have an additional column of:
df["activity"] = pd.Series([1,1,1,2,3])
How would I best go about this in Python? Not sure what my best approach here is.
Use groupby and cumcount:
df['activity'] = df.groupby('x').cumcount() + 1
df
   x       y  z  activity
0  a     red  1         1
1  b    blue  2         1
2  c   green  3         1
3  b  yellow  4         2
4  b     red  5         3
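One caveat worth noting: cumcount numbers rows in their current order, so if the frame isn't already ordered by the event column z, sort first (a self-contained sketch of the answer above with that step made explicit):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "c", "b", "b"],
                   "y": ["red", "blue", "green", "yellow", "red"],
                   "z": [1, 2, 3, 4, 5]})

# Ensure rows are in event order before counting per-user edits
df = df.sort_values("z")
df["activity"] = df.groupby("x").cumcount() + 1
print(df)
```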
