Canonical way for Pandas set value based on reference table


I have two dataframes, a reference table and a main table. I want to map the values in the reference table to the main table, overwriting if necessary. In visual form:
import pandas as pd
ref_data = {'Fruit':['Apple','Pear','Orange'],
'Price':[50,60,70]}
reference_table = pd.DataFrame(ref_data)
main_data = {'col1':[1,2,3,4,5],
'col2':[5,5,5,5,5],
'Fruit':['Durian','Pineapple','Apple','Orange','Pear'],
'Price':[40,120,454,12,43]}
main_data = pd.DataFrame(main_data)
This seems like quite a common use case. I found the following question that seems to fit exactly, but the approach feels a bit "hacky". Is there a proper way to do this?
Pandas -- set row values based on values in another table
Thanks!

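We usually use np.where:
import numpy as np
# Reference price for each main-table fruit; NaN where the fruit has no reference entry
s = reference_table.set_index('Fruit')['Price'].reindex(main_data['Fruit']).values
main_data['Price'] = np.where(np.isnan(s), main_data['Price'], s)  # keep the old price on NaN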

You can also merge, assign the combined column, and then drop the unused columns:
main_data = main_data.merge(reference_table, on='Fruit', how='left').assign(Price=lambda x: x['Price_y'].fillna(x['Price_x'])).drop(['Price_x', 'Price_y'], axis=1)
Result
Fruit col1 col2 Price
0 Durian 1 5 40.0
1 Pineapple 2 5 120.0
2 Apple 3 5 50.0
3 Orange 4 5 70.0
4 Pear 5 5 60.0
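
Another common idiom for the same lookup is Series.map followed by fillna. The sketch below is not taken from either answer above; it reuses the question's data and assumes each fruit appears only once in the reference table (the name price_lookup is made up for illustration).

```python
import pandas as pd

reference_table = pd.DataFrame({"Fruit": ["Apple", "Pear", "Orange"],
                                "Price": [50, 60, 70]})
main_data = pd.DataFrame({"col1": [1, 2, 3, 4, 5],
                          "col2": [5, 5, 5, 5, 5],
                          "Fruit": ["Durian", "Pineapple", "Apple", "Orange", "Pear"],
                          "Price": [40, 120, 454, 12, 43]})

# Fruit -> reference price lookup (assumes unique fruits in the reference table)
price_lookup = reference_table.set_index("Fruit")["Price"]

# map() yields NaN for fruits missing from the lookup; fillna() keeps their old price
main_data["Price"] = main_data["Fruit"].map(price_lookup).fillna(main_data["Price"])
print(main_data)
```

Because unmatched rows pass through NaN, the Price column ends up as float, just as in the result shown above.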

Related

Pandas DataFrame MultiIndex Pivot - Remove Empty Headers and Axis Rows

This is closely related to the question I asked earlier, "Python Pandas Dataframe Pivot Table Column and Values Order". Thanks again for the help. Very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the Indexes and/or Axis post-Pivots to enable me to use the .style CSS functions (i.e. creating a Styler Object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the Sorting Value column but not the corresponding Row/Header/Axis(?). Additionally, no matter what I try, I just can't seem to get rid of the blank row between the headers and the data in the table.
I think I need to set the Index = Bucket and Headers = "Name" and "TDY/Change" to use the .style Styler functionality properly.
import pandas as pd
import numpy as np
data = [
['AAA',2,'X',3,5,1],
['AAA',2,'Y',1,10,2],
['AAA',2,'Z',2,15,3],
['BBB',3,'X',3,15,3],
['BBB',3,'Y',1,10,2],
['BBB',3,'Z',2,5,1],
['CCC',1,'X',3,10,2],
['CCC',1,'Y',1,15,3],
['CCC',1,'Z',2,5,1],
]
df = pd.DataFrame(data, columns =
['Name','Name_Rank','Bucket','Bucket_Rank','Price','Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (
    pd.pivot_table(df, values=['Price','Change'], index=['Bucket_Rank','Bucket'],
                   columns=['Name_Rank','Name'], aggfunc=np.mean)
    .swaplevel(1, 0, axis=1)
    .sort_index(level=0, axis=1)
    .reindex(['Price','Change'], level=1, axis=1)
    .swaplevel(2, 1, axis=1)
    .rename_axis(columns=[None, None, None])
    .reset_index()
    .drop('Bucket_Rank', axis=1)
    .set_index('Bucket')
    .rename_axis(columns=[None, None, None])
)
which looks like this:
            1            2            3
          CCC          AAA          BBB
        Price Change Price Change Price Change
Bucket
Y          15      3    10      2    10      2
Z           5      1    15      3     5      1
X          10      2     5      1    15      3
Ok, so...
A) How do I get rid of the Row/Header/Axis(?) that used to be "Name_Rank" (e.g. the integer "Sorting Values" 1,2,3). I figured a hack where the df is exported to XLS/re-imported with Header=(1,2) but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like you should "rename_axis=[None]" but this doesn't seem to work no matter which order I try.
C) Is there a way to set the Header(s) so that both the former "Name" row and the "Price/Change" row are headers, letting the .style functionality format them separately from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon

In pandas 1.4.0 and later, the options for A and B are directly available through the Styler.hide method.
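
As a rough sketch of what that could look like (the answer's own code did not survive on this page, so this is an illustration, not the original): the frame below is a small hand-built stand-in for the pivoted df2, and hide_rank_and_index_name is a made-up helper name. Styler.hide needs pandas 1.4.0 or later, and DataFrame.style needs jinja2 installed. A plain-DataFrame alternative using droplevel and rename_axis is included for comparison.

```python
import pandas as pd

# Hand-built stand-in for the pivoted frame: three column levels
# (Name_Rank, Name, value) and an index named "Bucket".
columns = pd.MultiIndex.from_tuples(
    [(1, "CCC", "Price"), (1, "CCC", "Change"),
     (2, "AAA", "Price"), (2, "AAA", "Change")]
)
df2 = pd.DataFrame(
    [[15, 3, 10, 2], [5, 1, 15, 3], [10, 2, 5, 1]],
    index=pd.Index(["Y", "Z", "X"], name="Bucket"),
    columns=columns,
)


def hide_rank_and_index_name(frame):
    """Return a Styler with the rank level and the index-name row hidden.

    Requires pandas >= 1.4.0 and jinja2 (needed by DataFrame.style).
    """
    return (
        frame.style
        .hide(axis="columns", level=0)   # A) hide the 1, 2, ... sorting level
        .hide(axis="index", names=True)  # B) hide the row that holds "Bucket"
    )


# Alternative that changes the DataFrame itself instead of the rendering:
flat = df2.droplevel(0, axis=1).rename_axis(index=None)
print(flat)
```

Calling hide_rank_and_index_name(df2).to_html() would give the HTML for an email body. The Styler route only hides things at render time, so the underlying frame keeps all three column levels for sorting.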

comparing each value in two columns

How can I compare two columns in a dataframe and create a new column based on the difference of those two columns efficiently?
I have a feature in my table that has a lot of missing values, and I need to backfill that information using other tables in the database that contain the same feature. I have used np.select to compare the feature in my original table with the same feature in the other table, but I feel like there should be an easier method.
Eg: pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
I expect the new column to contain values [1,2,"different",4,np.nan]. Any help will be appreciated!

pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, find those rows and update the values.
mask = df.A.notna() & df.B.notna() & (df.A != df.B)
C = C.astype(object)  # object dtype so the column can hold a string next to the numbers
C[mask] = 'different'
C now looks like:
0 1
1 2
2 different
3 4
4 NaN

Another way is to use pd.DataFrame.iterrows with nunique:
import pandas as pd
df['C'] = [s['A'] if s.nunique()<=1 else 'different' for _, s in df.iterrows()]
Output:
A B C
0 1.0 1.0 1
1 2.0 NaN 2
2 3.0 30.0 different
3 4.0 4.0 4
4 NaN NaN NaN
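
Since the question asks for efficiency, here is a self-contained, vectorized sketch that combines the ideas above without iterating over rows. The extra sixth row (A missing, B present) is an added assumption to show the backfill case; note that the iterrows version returns A's NaN for such a row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, np.nan, np.nan],
                   "B": [1, np.nan, 30, 4, np.nan, 7]})

# Rows where both values exist but disagree
both_present = df["A"].notna() & df["B"].notna()
conflict = both_present & (df["A"] != df["B"])

# Coalesce first, then overwrite the conflicting rows; object dtype lets the
# column hold the string marker next to the numbers.
df["C"] = df["A"].combine_first(df["B"]).astype(object)
df.loc[conflict, "C"] = "different"
print(df)
```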

Python: create a lag (t-1) data structure of multiple elements

I'm having trouble creating a time lag column for my data. It works fine for a dataframe with just one kind of element, but it does not work when I have different elements. For example, my dataset looks something like this:
when using the command suggested:
data1['lag_t'] = data1['total_tax'].shift(1)
I get a result like this:
As you can see, it just shifts all the 'total_tax' values down one row. However, I need to do this lag for EACH ONE of the id_inf (as separate items).
My dataset is really huge, so I need to find a way to solve this issue. So I can get as a result a table like this:

You can group by the index and shift within each group:
import pandas as pd
# An example with made-up data.
data1 = pd.DataFrame({'id': [9,9,9,54,54,54],'total_tax':[5,6,7,1,2,3]}).set_index('id')
# shift() on the grouped column lags each id separately
data1['lag_t'] = data1.groupby(level=0)['total_tax'].shift()
print(data1)
    total_tax  lag_t
id
9           5    NaN
9           6    5.0
9           7    6.0
54          1    NaN
54          2    1.0
54          3    2.0
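
If id_inf is a regular column and not the index, the same idea applies. The sketch below uses made-up data; the 'year' column is an assumption, standing in for whatever column orders the rows in time in the real dataset.

```python
import pandas as pd

# Hypothetical data: 'id_inf' identifies the entity, 'year' orders its rows.
data1 = pd.DataFrame({
    "id_inf": [9, 54, 9, 54, 9, 54],
    "year": [2001, 2001, 2002, 2002, 2003, 2003],
    "total_tax": [5, 1, 6, 2, 7, 3],
})

# Sort so that "previous row" means "previous period" within each id_inf
data1 = data1.sort_values(["id_inf", "year"]).reset_index(drop=True)

# shift(1) is applied separately inside every id_inf group
data1["lag_t"] = data1.groupby("id_inf")["total_tax"].shift(1)
print(data1)
```

GroupBy.shift is a built-in grouped operation, so it avoids calling a Python lambda once per group, which should help on a large dataset.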

How would I pivot this basic table using pandas?

What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know how many atc_1...atc_7...?atc_100 columns there will need to be in advance. I just need to gather all associated atc_codes into one row with each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!

First use cumcount to number the values within each group; these numbers become the columns created by pivot. Then add the missing columns with reindex and change the column names with add_prefix. Finally, call reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
# pivot takes column names, so attach the counter as a helper column first
df = (df.assign(g=g)
        .pivot(index='visit_id', columns='g', values='atc_code')
        .reindex(columns=range(1, 8))
        .add_prefix('atc_')
        .rename_axis(columns=None)
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
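
For a version that runs on its own, here is a sketch with the long-format input reconstructed from the output above (the original input table is not shown, so its exact layout is an assumption). It skips the fixed reindex, so the number of atc_ columns follows the data, and it also builds the combined code the question mentions; the "_" separator is an arbitrary choice.

```python
import pandas as pd

# Long-format input, reconstructed from the wide output shown above
long_df = pd.DataFrame({
    "visit_id": [48944282] * 5 + [48944305] * 4,
    "atc_code": ["A02AG", "J01CA04", "J095AX02", "N02BE01", "R05X",
                 "A02AG", "A03AX13", "N02BE01", "R05X"],
})

# 1-based position of each code within its visit
position = long_df.groupby("visit_id").cumcount() + 1

wide = (
    long_df.assign(position=position)
    .pivot(index="visit_id", columns="position", values="atc_code")
    .add_prefix("atc_")          # 1, 2, ... -> atc_1, atc_2, ...
    .rename_axis(columns=None)   # drop the leftover "position" label
    .reset_index()
)
print(wide)

# If only the combined label is needed, the pivot can be skipped entirely
combined = long_df.groupby("visit_id")["atc_code"].agg("_".join)
print(combined)
```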

Transforming Dataframe Columns in Python

If I have a pandas Dataframe like such
and I want to transform it in a way that it results in
Is there a correct, idiomatic way to achieve this? A good pattern?

Use a pivot table:
pd.pivot_table(df,index='name',columns=['property'],aggfunc='sum').fillna(0)
Output:
price
Property boat dog house
name
Bob 0 5 4
Josh 0 2 0
Sam 3 0 0
Sidenote: Pasting in your df's helps so people can use pd.read_clipboard instead of generating the df themselves.
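
Because the question's input frame is not shown, the sketch below uses assumed data chosen to be consistent with the output above; the column names name, property and price are inferred from that output.

```python
import pandas as pd

# Assumed long-format input: one row per (name, property) purchase
df = pd.DataFrame({
    "name": ["Bob", "Bob", "Josh", "Sam"],
    "property": ["dog", "house", "dog", "boat"],
    "price": [5, 4, 2, 3],
})

# One row per name, one column per property, prices summed per cell;
# fill_value=0 replaces combinations that never occur
result = pd.pivot_table(df, values="price", index="name",
                        columns="property", aggfunc="sum", fill_value=0)
print(result)
```

Passing values="price" keeps the columns single-level, which may be easier to work with than the two-level header in the output above.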
