Pandas: add column with index of matching row from other dataframe [duplicate] - python

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 3 years ago.
Cleaning up sharepoint list for upload to mssql with proper table relationships.
Basically, two dataframes (data, config), both share some common columns (country, business).
What I want to do is to insert a new column in datadf where for each row it contains index of matching row in configdf based on values in columns country and business.
dataframe data:
-----|---------|----------|-----
... | Country | Business | ...
-----|---------|----------|-----
| A | 1 |
-----|---------|----------|-----
| A | 1 |
-----|---------|----------|-----
| A | 2 |
-----|---------|----------|-----
| A | 2 |
-----|---------|----------|-----
| B | 1 |
-----|---------|----------|-----
| B | 1 |
-----|---------|----------|-----
| B | 2 |
-----|---------|----------|-----
| C | 1 |
-----|---------|----------|-----
| C | 2 |
-----|---------|----------|-----
dataframe config (ID = index):
----|---------|----------|-----
ID | Country | Business | ...
----|---------|----------|-----
1 | A | 1 |
----|---------|----------|-----
2 | A | 2 |
----|---------|----------|-----
3 | B | 1 |
----|---------|----------|-----
4 | B | 2 |
----|---------|----------|-----
5 | C | 1 |
----|---------|----------|-----
6 | C | 2 |
----|---------|----------|-----
what I want to add to dataframe data:
-----|---------|----------|-----------|-----
... | Country | Business | config_ID | ...
-----|---------|----------|-----------|-----
| A | 1 | 1 |
-----|---------|----------|-----------|-----
| A | 1 | 1 |
-----|---------|----------|-----------|-----
| A | 2 | 2 |
-----|---------|----------|-----------|-----
| A | 2 | 2 |
-----|---------|----------|-----------|-----
| B | 1 | 3 |
-----|---------|----------|-----------|-----
| B | 1 | 3 |
-----|---------|----------|-----------|-----
| B | 2 | 4 |
-----|---------|----------|-----------|-----
| C | 1 | 5 |
-----|---------|----------|-----------|-----
| C | 2 | 6 |
-----|---------|----------|-----------|-----
----Found something that works----
datadf['config_ID'] = datadf.apply(lambda x: configdf[(configdf.country == x.country) & (configdf.business_unit == x.business_unit)].index[0], axis=1)
It gets the job done, although I am open for other suggestions, especially if it could work with df.insert()

You can use numpy.where function to match the data frames
For example:
datadf = pd.DataFrame([['USA','Business1'],['AUS','Business2'],['UK','Business3'],['IND','Business4']],
columns=['country','business'])
configdf = pd.DataFrame([['AUS','Business2'],['IND','Business4'],['USA','Business1'],['UK','Business3']],
columns=['country','business'])
datadf['new_col'] = datadf.apply(lambda x: (np.where(x == configdf)[0][0]),axis=1)
print(datadf)
Output:
country business new_col
0 USA Business1 2
1 AUS Business2 0
2 UK Business3 3
3 IND Business4 1
EDIT1:
Well, in that case, you can use
datadf['new_col'] = datadf.apply(lambda x: (np.where((x['country'] == configdf['country']) & (x['business'] == configdf['business']))[0][0]),axis=1)
Output based on your sample data frames datadf and configdf:
country business new_col
0 A 1 0
1 A 1 0
2 A 2 1
3 A 2 1
4 B 1 2
5 B 1 2
6 B 2 3
7 C 1 4
8 C 2 5

Here is a solution using pandas merge.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html#pandas.DataFrame.merge
import pandas as pd
# make the two dataframes
data = pd.DataFrame({'Country':['A','A','A','A','B','B','B','C','C'],
'Business':[1,1,2,2,1,1,2,1,2]})
configdf = pd.DataFrame({'Country':['A','A','B','B','C','C'],
'Business':[1,2,1,2,1,2]})
# make a column with the index values
configdf.reset_index(inplace=True)
# merge the two dataframes based on the selected columns.
newdf = data.merge(configdf, on=['Country', 'Business'])

Related

Create a column from a choice of other columns using IF style statement

Given the following table:
+---------+---------+-------------+
| field_a | field_b | which_field |
+---------+---------+-------------+
| 1 | 2 | a |
| 1 | 2 | b |
| 3 | 4 | a |
| 3 | 4 | b |
+---------+---------+-------------+
I'd like to create a column called output where the value for each row is taken from either field_a or field_b based upon the value in which_field. So the resulting table would look like this:
+---------+---------+-------------+--------+
| field_a | field_b | which_field | output |
+---------+---------+-------------+--------+
| 1 | 2 | a | 1 |
| 1 | 2 | b | 2 |
| 3 | 4 | a | 3 |
| 3 | 4 | b | 4 |
+---------+---------+-------------+--------+
I've reviewed a number of examples using loc and np.where but these only seem to be able to handle assigning a fixed value rather than the value from a choice of columns.
This is an MRE - in reality there could be multiple which_field fields so it would be great to get an answer that can cope with multiple conditions.
Thanks in advance!
Use DataFrame.melt with DataFrame.loc:
df1 = df.melt('which_field', ignore_index=False)
df['output'] = df1.loc[('field_' + df1['which_field']).eq(df1['variable']), 'value']
print (df)
field_a field_b which_field output
0 1 2 a 1
1 1 2 b 2
2 3 4 a 3
3 3 4 b 4

Merge columns from two dataframes when value in columns are not equal [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have a Pandas df which looks like this:
| | yyyy_mm_dd | id | product | status | is_50 | cnt |
|---|------------|----|------------|--------|-------|-----|
| | 2002-12-15 | 7 | prod_rs | 2 | 0 | 8 |
| | 2002-12-15 | 16 | prod_go | 2 | 0 | 1 |
| | 2002-12-15 | 16 | prod_mb | 2 | 0 | 3 |
| | 2002-12-15 | 29 | prod_er | 2 | 0 | 2 |
| | 2002-12-15 | 29 | prod_lm | 2 | 0 | 2 |
| | 2002-12-15 | 29 | prod_ops | 2 | 0 | 2 |
I also have a second dataframe which is similar:
| | id | product | cnt |
|---|----|------------|-----|
| | 7 | prod_rs | 8 |
| | 16 | prod_go | 1 |
| | 16 | prod_mb | 3 |
| | 29 | prod_er | 2 |
| | 29 | prod_lm | 2 |
| | 29 | prod_ops | 6 |
How can I create a third dataframe which will only store the rows which do not have an equal count? Based on the above, only the last row would be returned as the cnt for the id / product combination differs. Example output:
| | id | product | cnt_df1 | cnt_df2 |
|---|----|---------|---------|---------|
| | 29 | prod_ops| 2 | 6 |
The second df is one row larger in size so not all id / product combinations may be present in both dataframes.
I've been looking at merge but I'm unsure how to use when the cnt columns are not equal.
You would still use merge and just check whether the count columns are different in a second step
In [40]: df = pd.merge(df1.drop(["yyyy_mm_dd", "", "status", "is_50"], axis=1), df2, on=['id', 'product'], suffixes=['_df1', '_df2'])
In [41]: df
Out[41]:
id product cnt_df1 cnt_df2
0 7 prod_rs 8 8
1 16 prod_go 1 1
2 16 prod_mb 3 3
3 29 prod_er 2 2
4 29 prod_lm 2 2
5 29 prod_ops 2 6
Now you can simply filter out all rows with the same cnt e.g. with query()
In [42]: df.query("cnt_df1 != cnt_df2")
Out[42]:
id product cnt_df1 cnt_df2
5 29 prod_ops 2 6
You can acheive this in two steps like so:
# Merge the DataFrames
df3 = df1.merge(df2, on=["id", "product"])
# Filter for where `cnt` are not equal
df3 = df3[df3["cnt_x"].ne(df3["cnt_y"])]
# yyyy_mm_dd id product status is_50 cnt_x cnt_y
# 5 2002-12-15 29 prod_ops 2 0 2 6
You can use the suffixes parameter on merge if you don't want the to use the default _x and _y.

Count the number of duplicate grouped by ID pandas

I'm not sure if this is a duplicate question, but here it goes.
Assuming I have the following table:
import pandas
lst = [1,1,1,2,2,3,3,4,5]
lst2 = ['A','A','B','D','E','A','A','A','E']
df = pd.DataFrame(list(zip(lst, lst2)),
columns =['ID', 'val'])
will output the following table
+----+-----+
| ID | Val |
+----+-----+
| 1 | A |
+----+-----+
| 1 | A |
+----+-----+
| 1 | B |
+----+-----+
| 2 | D |
+----+-----+
| 2 | E |
+----+-----+
| 3 | A |
+----+-----+
| 3 | A |
+----+-----+
| 4 | A |
+----+-----+
| 5 | E |
+----+-----+
The goal is count the duplicates on VAL grouped by ID:
+----+-----+--------------+
| ID | Val | is_duplicate |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | B | 0 |
+----+-----+--------------+
| 2 | D | 0 |
+----+-----+--------------+
| 2 | E | 0 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 4 | A | 0 |
+----+-----+--------------+
| 5 | E | 0 |
+----+-----+--------------+
I tried the following code but its counting the overall duplicates
df_grouped = df.groupby(['notes']).size().reset_index(name='count')
while the following code does only the duplicate count
df.duplicated(subset=['notes'])
what would be the best approach for this?
Let us try duplicated
df['is_dup']=df.duplicated(subset=['ID','val'],keep=False).astype(int)
df
Out[21]:
ID val is_dup
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0
8 5 E 0
You can use .groupby on the relevant columns and get the count. Then if you add >1 to the end, then that will mean the value for the specified group contains duplicates. The > 1 will create a boolean True/False data type. Finally, to change to 1 or 0, simply use .astype(int) to transform the data type from a boolean data type to an int, which changes True to 1 and False to 0:
df['is_duplicate'] = (df.groupby(['ID','val'])['val'].transform('count') > 1).astype(int)
Out[7]:
ID val is_duplicate
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0

Merge duplicate cells of a column

My Current excel looks like:
----------------
| Type | Val |
|--------------|
| A | 1 |
|--------------|
| A | 2 |
|--------------|
| B | 3 |
|--------------|
| B | 4 |
|--------------|
| B | 5 |
|--------------|
| C | 6 |
----------------
This is the required excel:
----------------------
| Type | Val | Sum |
|--------------------|
| A | 1 | 3 |
| |------| |
| | 2 | |
|--------------------|
| B | 3 | 12 |
| |------| |
| | 4 | |
| |------| |
| | 5 | |
|--------------------|
| C | 6 | 6 |
----------------------
Is it possible in Python using Pandas or any other module?
IIUC use:
df['Sum']=df.groupby('Type').transform('sum')
df.loc[df[['Type','Sum']].duplicated(),['Type','Sum']]=''
print(df)
Type Val Sum
0 A 1 3
1 2
2 B 3 12
3 4
4 5
5 C 6 6
P.s: you can also add this as index:
df=df.set_index(['Type','Sum']) #export to excel without index=False
For merged first 2 levels is possible set all 3 columns to MultiIndex - only order of columns is different:
#specify column name after groupby
df['Sum'] = df.groupby('Type')['Val'].transform('sum')
df = df.set_index(['Type','Sum', 'Val'])
df.to_excel('file.xlsx')
But in my opinion the best is working with duplicated values:
df['Sum'] = df.groupby('Type')['Val'].transform('sum')
print (df)
Type Val Sum
0 A 1 3
1 A 2 3
2 B 3 12
3 B 4 12
4 B 5 12
5 C 6 6
df.to_excel('file.xlsx', index=False)
You can use
import pandas as pd
df = pd.DataFrame({'Type': ['A', 'A','B','B','B','C'], 'Val': [1,2 ,3,4,5,6]})
df_result = df.merge(df.groupby(by='Type', as_index=False).agg({'Val':'sum'}).rename(columns={'Val':'Sum'}), on = 'Type')
which gives the output as
print(df_result)
Type Val Sum
0 A 1 3
1 A 2 3
2 B 3 12
3 B 4 12
4 B 5 12
5 C 6 6
Is this what you are looking for?

Reading data from text file with variable numbers of Column

I am reading data from a text file in python using pandas. There are no header values (column names) assigned to the data in the text file. I want to reshape the data into a readable form. The problem i am facing is variable column lengths
For example in my text file i have
1,2,3,4,5,Hello,7,8
1,2,3,4,5,7,8,
1,2,3,4,5,7,8,
1,2,3,4,5,Hello,7,8,
Now when i create a data frame I want to make sure that in the second row instead of Hello a "NAN" is written as the value for that column is not present. and in the end after giving column names and rearranging the data frame will look like.
1,2,3,4,5,Hello,7,8
1,2,3,4,5,"NA,"7,8,
1,2,3,4,5,"NA",7,8,
1,2,3,4,5,Hello,7,8,
Answer to updated question and also a generalized solution for such case.
focus_col_idx = 5 # The column where you want to bring NaN in expected output
last_idx = df.shape[1] - 1
# Fetching the index of rows which have None in last column
idx = df[df[last_idx].isnull()].index
# Shifting the column values for those rows with index idx
df.iloc[idx,focus_col_idx+1:] = df.iloc[idx,focus_col_idx:last_idx].values
# Putting NaN for second column where row index is idx
df.iloc[idx,focus_col_idx] = np.NaN
df
+---+----+---+---+---+---+-------+---+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+---+---+---+---+-------+---+-----+
| 0 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
| 1 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 2 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 3 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
+---+----+---+---+---+---+-------+---+-----+
Answer to previous data
Assuming only one column is having missing value (say 2nd column as per your previous data). Here's a quick sol -
df = pd.read_table('SO.txt',sep='\,', header=None)
df
+---+---+---+---+---+------+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+------+
| 0 | A | B | C | D | E |
| 1 | A | C | D | E | None |
+---+---+---+---+---+------+
# Fetching the index of rows which have None in last column
idx = df[df[4].isnull()].index
idx
# Int64Index([1], dtype='int64')
# Shifting the column values for those rows with index idx
df.iloc[idx,2:] = df.iloc[idx,1:4].values
df
+---+---+---+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | C | C | D | E | # <- Notice the shifting.
+---+---+---+---+---+---+
# Putting NaN for second column where row index is idx
df.iloc[idx,1] = np.NaN
# Final output
df
+---+---+-----+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+-----+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | NaN | C | D | E |
+---+---+-----+---+---+---+

Categories

Resources