Pandas. Need to merge tables with value mapping - python

merge two tables and assign values where both have the same values
I have two dataframe tables:
Table1=
| Column A | Column B | Column C |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | None |
| Cell 3 | Cell 4 | None |
Table2 =
| Column A | Column B | Column C |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | Value1 |
| Cell 3 | Cell 4 | Value2 |
The rule I want to apply, matching rows by the value in column A for example:
Table1[A][0] == Table2[A][0] -> Table1[C][0] = Table2[C][0]
And so on for all the lines that have a match.
The first table is larger than the second.
The first table and the second table share some values. I want to fill in column C in table 1 for the rows whose values also appear in table 2.
Simply put, if table 1 and table 2 have the same values in the columns, for example, in column A, then the value in column C from the second table will be assigned to column C in the first table.
if Table1[A]['value'] == Table2[A]['value']: Table1[C]['value'] = Table2[C]['value']
Also I tried to merge the tables but the tables didn't merge (Table 1 remained unchanged):
df = Table1['C'].merge(Table2, on=['C'], how='left')

Set the common columns as index then use update on table1 to substitute the values from table2
cols = ['Column A', 'Column B']
Table1 = Table1.set_index(cols)
Table1.update(Table2.set_index(cols))
Table1 = Table1.reset_index()
Result
print(Table1)
Column A Column B Column C
0 Cell 1 Cell 2 Value1
1 Cell 3 Cell 4 Value2
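For reference, here is a minimal self-contained sketch of the same set_index/update approach. The extra third row in Table1 is hypothetical sample data, added only to show what happens to a row that has no match in Table2 (its Column C stays empty).

```python
import pandas as pd

# Hypothetical sample data: Table1 has one extra row with no match in Table2.
Table1 = pd.DataFrame({
    "Column A": ["Cell 1", "Cell 3", "Cell 5"],
    "Column B": ["Cell 2", "Cell 4", "Cell 6"],
    "Column C": [None, None, None],
})
Table2 = pd.DataFrame({
    "Column A": ["Cell 1", "Cell 3"],
    "Column B": ["Cell 2", "Cell 4"],
    "Column C": ["Value1", "Value2"],
})

cols = ["Column A", "Column B"]
Table1 = Table1.set_index(cols)
# update() aligns on the index and overwrites in place,
# using only the non-NA values found in Table2.
Table1.update(Table2.set_index(cols))
Table1 = Table1.reset_index()
print(Table1)
```

Note that update() modifies Table1 in place and returns None, so its result should not be assigned back.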

Provided you do not have any data in Table1['Column C'] that you want to keep, you could drop Column C from the first table and then merge:
Table1 = Table1.drop(['Column C'], axis=1)
Table1 = Table1.merge(Table2, on=['Column A', 'Column B'], how='left')
Note:
If you want a one-liner:
Table1 = Table1.drop(['Column C'], axis=1).merge(Table2, on=['Column A', 'Column B'], how='left')
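The question also mentions matching on column A alone. A minimal sketch of that variant follows, assuming the column names from the question's tables; the third row of Table1 is hypothetical and only demonstrates that a left merge keeps unmatched rows with a missing value.

```python
import pandas as pd

# Hypothetical sample data: the last row of Table1 has no partner in Table2.
Table1 = pd.DataFrame({
    "Column A": ["Cell 1", "Cell 3", "Cell 5"],
    "Column B": ["Cell 2", "Cell 4", "Cell 6"],
    "Column C": [None, None, None],
})
Table2 = pd.DataFrame({
    "Column A": ["Cell 1", "Cell 3"],
    "Column B": ["Cell 2", "Cell 4"],
    "Column C": ["Value1", "Value2"],
})

# Keep only the key and the value column from Table2 so the merge
# does not produce suffixed duplicates of Column B.
lookup = Table2[["Column A", "Column C"]]
merged = Table1.drop(columns=["Column C"]).merge(lookup, on="Column A", how="left")
print(merged)
```

If Table2 contained the same key more than once, a merge would duplicate the matching rows of Table1, so it may be worth dropping duplicate keys from the lookup first.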

Related

Map a column from df2 based on check whether string column value of df1 matches with any column(list type) of df2

I have two dataframes, A and B. I would like to create a new column 'suggested_Vendor' in dataframe B, filled with the corresponding mapping from dataframe A based on these checks:
Take the first vendor (vendor_name) from dataframe A whose 'preferred_fruits' list contains the fruit value from dataframe B.
If there is no match, return 'suggested_Vendor' as 'None' in the dataframe B output.
If a vendor's vendor_capacity is exceeded, use the next best preferred vendor in dataframe A, and so on.
There is no relation between Id and userid in the two dataframes.
Dataframe A
| Id | vendor_name| preferred_fruits |vendor_capacity|
| ---| -----------| --------------------------|---------------|
| 1 | X |['apple','orange','banana']|2 |
| 2 | Y |['kiwi'] |1 |
| 3 | Z |['banana','orange'] |1 |
| 4 | W |['apple'] |1 |
Dataframe B
| userid | fruit |
| --- | -----------|
| 1 | apple |
| 2 | orange |
| 3 | apple |
| 4 | banana |
| 5 | kiwi |
| 6 | strawberry |
Output Dataframe B
| userid | fruit | suggested_Vendor|
| --- | -----------|-----------------|
| 1 | apple | X |
| 2 | orange | X |
| 3 | apple | W |
| 4 | banana | Z |
| 5 | kiwi | Y |
| 6 | strawberry | None |
Is there a pythonic way to do this? I would appreciate some explanation of the code.
Please find the answer below; I have explained the steps in the comments.
I have modified dfA so that rows no longer hold lists in the fruits column: dfA ends up with multiple rows for the same vendor, each with a different fruit (which is also a better database design).
import pandas as pd
# Create Dataframes
dfA = pd.DataFrame()
dfA["vendor_name"] = ["X","Y","Z","W"]
dfA["fruits"] = [['apple','orange','banana'],['kiwi'],['banana','orange'],['apple']]
dfA["cap"] = [2,1,1,1]
dfB = pd.DataFrame()
dfB["userid"] = [1,2,3,4,5,6]
dfB["fruit"] = ["apple","orange","apple","banana","kiwi","strawberry"]
# Split the "fruits" lists so that each row in dfA holds a single fruit.
# (DataFrame.append was removed in pandas 2.0; explode does the split in one step.)
dfA = dfA.explode("fruits").reset_index(drop=True)
# Add current capacity column in dfA
dfA["curr_cap"] = dfA["cap"].copy()
# Add vendor column in dfB
dfB["vendor"] = ""
# Loop over dfB to select vendor
for index, row in dfB.iterrows():
    # get fruit
    fruit = row["fruit"]
    # get available vendors
    df = dfA[(dfA['fruits'] == fruit) & (dfA["curr_cap"] > 0)]
    # if vendors are available
    if len(df):
        if len(df) > 1:
            # if more than 1 vendor available, sort (descending) by current capacity
            df = df.sort_values(by='curr_cap', ascending=False)
        # get vendor name
        selected_vendor = df.iloc[0]["vendor_name"]
        # reduce capacity of the vendor in all rows where vendor exists
        dfA.loc[dfA['vendor_name'] == selected_vendor, 'curr_cap'] -= 1
        # set selected vendor in dfB
        dfB.at[index, "vendor"] = selected_vendor
    # if no vendors available
    else:
        dfB.at[index, "vendor"] = None
print(dfB)
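As an alternative reading of the requirement, "first preferred vendor" can be taken to mean the first vendor in dataframe A's row order that still has capacity. The sketch below is one possible way to do that with plain dictionaries; pick_vendor, remaining and preferences are hypothetical names, and the column names follow the question's tables rather than the answer above.

```python
import pandas as pd

dfA = pd.DataFrame({
    "vendor_name": ["X", "Y", "Z", "W"],
    "preferred_fruits": [["apple", "orange", "banana"], ["kiwi"],
                         ["banana", "orange"], ["apple"]],
    "vendor_capacity": [2, 1, 1, 1],
})
dfB = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5, 6],
    "fruit": ["apple", "orange", "apple", "banana", "kiwi", "strawberry"],
})

# Remaining capacity and fruit lists per vendor; dicts keep dfA's row order.
remaining = dict(zip(dfA["vendor_name"], dfA["vendor_capacity"]))
preferences = dict(zip(dfA["vendor_name"], dfA["preferred_fruits"]))

def pick_vendor(fruit):
    """Return the first vendor (in dfA order) that lists `fruit` and has capacity left."""
    for vendor, fruits in preferences.items():
        if fruit in fruits and remaining[vendor] > 0:
            remaining[vendor] -= 1  # one unit of capacity is used up
            return vendor
    return None  # no vendor offers this fruit, or all are full

# Users are served in dfB's row order.
dfB["suggested_Vendor"] = [pick_vendor(fruit) for fruit in dfB["fruit"]]
print(dfB)
```

With the sample data this assigns X, X, W, Z, Y to the first five users and leaves the last one without a vendor, which agrees with the expected output in the question.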

Python DataFrame - Select dataframe rows based on values in a column of same dataframe

I'm struggling with a dataframe related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
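A self-contained version of the same boolean-indexing idea, rebuilding df2 from the values shown in the question and using tolist() to get a plain Python list:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Col1": [2145779, 8059234, 2145779, 4265093, 2145779, 1728234],
    "Col2": [2, 3, 3, 2, 2, 5],
})

# The comparison yields a boolean Series; .loc keeps only the rows where it is True.
mask = df2["Col2"] == 3
out = df2.loc[mask, "Col1"].tolist()
print(out)
```

If the database driver returned Col2 as strings rather than integers, the comparison would need to be against "3" instead.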

How to replace values in DataFrame with values from second DataFrame with condition that it selects different column?

I have two DataFrames.
df1:
A | B | C
-----|---------|---------|
25zx | b(50gh) | |
50tr | a(70lc) | c(50gh) |
df2:
A | B
-----|-----
25zx | T
50gh | K
50tr | K
70lc | T
I want to replace values in df1. The column that I'm comparing against is df2['A'], but the value that I want to put into df1 is the value from the column df2['B'].
So the final table would look like:
df3:
A | B | C
-----|---------|---------|
T | b(K) | |
K | a(T) | c(K) |
Cast df2 to a dict and use replace on df1:
print(df1.replace(df2.set_index("A")["B"].to_dict(), regex=True))
A B C
0 T b(K) None
1 K a(T) c(K)
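Here is a runnable sketch of this dict-plus-regex replacement, rebuilding both frames from the question. The re.escape step is an added precaution that is not part of the answer above: because regex=True treats every key as a pattern, keys containing characters such as "(" or "." would otherwise be misread.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "A": ["25zx", "50tr"],
    "B": ["b(50gh)", "a(70lc)"],
    "C": [None, "c(50gh)"],
})
df2 = pd.DataFrame({
    "A": ["25zx", "50gh", "50tr", "70lc"],
    "B": ["T", "K", "K", "T"],
})

# Build {pattern: replacement}; escaping keeps each key a literal match.
mapping = {re.escape(key): value for key, value in zip(df2["A"], df2["B"])}

# With regex=True, each pattern is also replaced inside longer strings.
df3 = df1.replace(mapping, regex=True)
print(df3)
```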

Making separate pandas dfs from multiple tables on the same excel sheet

I have an abnormal setup where I have multiple tables on the same excel sheet. I'm trying to make each table (on the same sheet) a separate pandas dataframe. For example, on one excel sheet I might have:
+--------+--------+--------+--------+--------+--------+-----+
| Col | Col | Col | Col | Col | Col | Col |
+--------+--------+--------+--------+--------+--------+-----+
| Table1 | Table1 | | | | | |
| | | Table2 | Table2 | Table2 | Table2 | |
| | Table3 | Table3 | Table3 | | | |
+--------+--------+--------+--------+--------+--------+-----+
What I want is the tables broken out by table type (the example below shows each of the tables as its own pandas df). The top-left header cell is unique for each table:
table1 might have its corner header cell named "Leads",
table2 has its corner header cell named "Sales",
and table3 has its corner header cell named "Products".
+--------+--------+--+
| Leads | Table1 | |
+--------+--------+--+
| pd.Data| pd.Data| |
| | | |
| | | |
+--------+--------+--+
+--------+--------+--------+--------+--+
| Sales | Table2 | Table2 | Table2 | |
+--------+--------+--------+--------+--+
| pd.Data| pd.Data| pd.Data| pd.Data| |
| | | | | |
| | | | | |
+--------+--------+--------+--------+--+
+---------+---------+---------+--+
| Products| Table3 | Table3 | |
+---------+---------+---------+--+
| pd.Data | pd.Data | pd.Data | |
| | | | |
| | | | |
+---------+---------+---------+--+
I know that pandas does well when it can assume the excel sheet is one big table, but with multiple tables I'm stumped on the best way to partition the data into separate df's, especially because I can't index on a fixed row or column: the tables vary in length over time.
This is how far I got before I realized this only works for one table, not three:
import pandas as pd
import string

df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)

letter = list(string.ascii_uppercase)
df1 = pd.read_excel("file.xlsx")

def get_start_column(df):
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index():
            return letter[i]

def get_last_column(df):
    columns = df.columns
    len_column = len(columns)
    for i, column in enumerate(columns):
        if df[column].first_valid_index():
            return letter[len_column - i]

def get_first_row(df):
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"

df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)
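One possible way to approach this is to read the whole sheet without a header (for example with pd.read_excel("file.xlsx", header=None)) and then cut each table out by searching for its unique corner header cell. The sketch below assumes each table is bounded by an empty cell to the right of its header row and by a fully empty row beneath it; extract_table is a hypothetical helper, and the raw frame is stand-in data instead of a real workbook.

```python
import numpy as np
import pandas as pd

def extract_table(raw, corner):
    """Cut out the table whose top-left header cell equals `corner` (hypothetical helper)."""
    # Find the corner cell; np.where scans the grid in row-major order.
    rows, cols = np.where((raw == corner).to_numpy())
    if len(rows) == 0:
        raise ValueError(f"header cell {corner!r} not found")
    r0, c0 = int(rows[0]), int(cols[0])

    # The header row runs to the right until the first empty cell.
    header = raw.iloc[r0, c0:]
    gaps = header.isna().to_numpy()
    width = int(gaps.argmax()) if gaps.any() else len(header)

    # The body runs downward until the first completely empty row.
    body = raw.iloc[r0 + 1:, c0:c0 + width]
    empty = body.isna().all(axis=1).to_numpy()
    height = int(empty.argmax()) if empty.any() else len(body)

    table = body.iloc[:height].reset_index(drop=True)
    table.columns = header.iloc[:width].tolist()
    return table

# Stand-in for a sheet read with header=None: two staggered tables.
raw = pd.DataFrame([
    ["Leads", "Owner", None, None, None],
    ["a", "x", None, None, None],
    ["b", "y", None, "Sales", "Amount"],
    [None, None, None, "north", 10],
    [None, None, None, "south", 20],
])
leads = extract_table(raw, "Leads")
sales = extract_table(raw, "Sales")
print(leads)
print(sales)
```

Because the boundaries are found by scanning for empty cells, the tables can grow or shrink over time without changing the code. Two tables that touch with no empty cell between them would be read as one, so this only fits layouts with a gap.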

End of merged cells in Excel with Python

I am using xlrd package to parse Excel spreadsheets.
I would like to get the end index of a merged cell.
A B C
+---+---+----+
1 | 2 | 2 | 2 |
+ +---+----+
2 | | 7 | 8 |
+ +---+----+
3 | | 0 | 3 |
+ +---+----+
4 | | 4 | 20 |
+---+---+----+
5 | | 2 | 0 |
+---+---+----+
Given the row index and the column index, I would like to know the end index of the merged cell (if it is merged).
In this example, for (row, col) = (0, 0) the end is 3.
You can use merged_cells attribute of the Sheet object: https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966#sheet.Sheet.merged_cells-attribute
It returns the list of address ranges of cells which have been merged.
If you want to get end index only for the vertically merged cells:
def is_merged(row, column):
    for cell_range in sheet.merged_cells:
        row_low, row_high, column_low, column_high = cell_range
        # range() replaces xrange(), which exists only in Python 2
        if row in range(row_low, row_high) and column in range(column_low, column_high):
            return (True, row_high - 1)
    return False
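The same lookup can be sketched as a function that takes the list of ranges as an argument, which makes it easy to try without a workbook. Each range is a (row_low, row_high, column_low, column_high) tuple with exclusive upper bounds, as in xlrd's merged_cells; merged_end_row is a hypothetical name, and the sample list is written by hand to mirror the sheet in the question. As far as I know, xlrd only fills merged_cells when the workbook is opened with formatting_info=True.

```python
def merged_end_row(merged_cells, row, col):
    """Return the last row index of the merged range containing (row, col), or None."""
    for row_low, row_high, col_low, col_high in merged_cells:
        # Upper bounds are exclusive, so the last merged row is row_high - 1.
        if row_low <= row < row_high and col_low <= col < col_high:
            return row_high - 1
    return None  # the cell is not part of any merged range

# Hand-written stand-in for sheet.merged_cells: column A, rows 0..3 merged.
merged_cells = [(0, 4, 0, 1)]
print(merged_end_row(merged_cells, 0, 0))
```

Returning None for unmerged cells keeps the return type consistent, so callers can test the result with `is None` instead of unpacking a tuple in one case and a boolean in the other.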
