I have 2 DataFrames:
DF1 - Master List
ID Item
100 Wood
101 Steel
102 Brick
103 Soil
DF2
ID
100
103
I want my final DataFrame to look like this:
ID Item / ID
100 100 - Wood
103 103 - Soil
The issue I'm having is that DF2 doesn't have the Item column.
I could do it manually with np.select(conditions, choices, default='N/A'), but the full dataset is huge and that would take a lot of time.
I also tried np.select across the two datasets, referencing the columns, but got a "Can only compare identically-labeled DataFrame objects" error.
Is there a way to pull the relevant information from Item so I can join the string values with ID to create Item / ID?
Thanks in advance
dff = pd.merge(df1, df2, how='right', on='ID')  # right join keeps only the IDs present in df2
dff['Item / ID'] = dff['ID'].astype(str) + ' - ' + dff['Item']
dff.drop('Item', axis=1, inplace=True)
print(dff)
output:
ID Item / ID
0 100 100 - Wood
1 103 103 - Soil
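As an alternative, if you'd rather skip the merge, here is a small sketch of the same result using map (rebuilding the example frames from the question):
import pandas as pd
df1 = pd.DataFrame({'ID': [100, 101, 102, 103],
                    'Item': ['Wood', 'Steel', 'Brick', 'Soil']})
df2 = pd.DataFrame({'ID': [100, 103]})
# look up each ID's Item in the master list, then build the combined string
item_lookup = df1.set_index('ID')['Item']
df2['Item / ID'] = df2['ID'].astype(str) + ' - ' + df2['ID'].map(item_lookup)
print(df2)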
I am a new learner of Python. I have two dataframes loaded from xlsx files, and I am trying to get values from table 1 into table 2.
Table 1:
|Product ID|Inventory Receipt Date|Age Days|Quantity|
|:---------|:---------------------|:-------|:-------|
|AA12345678|Jan 21, 2022|120|400|
|AA12345678|Jan 30, 2022|111|100|
|AA12345678|Jan 31, 2022|110|20|
|BB12345678|Jan 21, 2022|120|120|
|BB12345678|Feb 1, 2022|109|100|
Table 2:
|Location Code|Product ID|Required Quantity|
|:------------|:---------|:----------------|
|ABCD001|AA12345678|100|
|ABCD001|AA12345678|401|
|ABCD002|AA12345678|19|
|EFGH001|BB12345678|200|
|EFGH002|BB12345678|20|
Expected Result:
|Location Code|Product ID|Required Quantity|Age days 1|Age days 2|Age days 3|
|:------------|:---------|:----------------|:---------|:---------|:---------|
|ABCD001|AA12345678|100|120|||
|ABCD001|AA12345678|401|120|111|110|
|ABCD002|AA12345678|19|110|||
|EFGH001|BB12345678|200|120|109||
|EFGH002|BB12345678|20|109|||
The rule of product distribution is first come, first served. For example, 'Location Code' ABCD001 requires 100 qty of 'Product ID' 'AA12345678' (row 2 in table 2). It takes those 100 qty from row 1 of table 1 and writes that row's 'Age Days' into table 2. When the 'Quantity' of a row in table 1 is used up, the lookup moves on to the next row with the same 'Product ID' (here row 3, 'AA12345678'). The total 'Quantity' in table 1 is the same as in table 2.
I tried df2.merge(df1[['Product ID', 'Age Days']], 'left'), but 'Age Days' does not merge into df2 the way I need. I also tried the map function (df2['Age Days'] = df2['Product ID'].map(df1.set_index('Product ID')['Age Days'])), but it raises an error about the index values not being unique.
Issue: 'Product ID' is non-unique for lookup/map/merge. How can I get all matching results (or their indices) via lookup/map/merge or some other method? In that case I would also need a flag such as "is_empty": if "is_empty" == yes, I need to take the value from the next matched row.
I know this case is complex; could you suggest a way to solve it? I am confused about which keywords to study for it. Thank you.
After converting your comments into program logic, I ended up with the following:
table 1 file
|Product ID|Inventory Receipt Date|Age Days|Quantity|
|:---------|:---------------------|:-------|:-------|
|AA12345678|Jan 21, 2022|120|400|
|AA12345678|Jan 30, 2022|111|100|
|AA12345678|Jan 31, 2022|110|20|
|BB12345678|Jan 21, 2022|120|120|
|BB12345678|Feb 1, 2022|109|100|
table 2 file
|Location Code|Product ID|Required Quantity|
|-------------|----------|-----------------|
|ABCD001|AA12345678|100|
|ABCD001|AA12345678|401|
|ABCD002|AA12345678|19|
|EFGH001|BB12345678|200|
|EFGH002|BB12345678|20|
code
import pandas as pd
import numpy as np
# convert first table from markdown to dataframe
df1 = pd.read_csv('table1', sep="|")
df1.drop(df1.columns[len(df1.columns)-1], axis=1, inplace=True)
df1.drop(df1.columns[0], axis=1, inplace=True)
df1 = df1.iloc[1:]
df1 = df1.reset_index()
# convert second table from markdown to dataframe
df2 = pd.read_csv('table2', sep="|")
df2.drop(df2.columns[len(df2.columns)-1], axis=1, inplace=True)
df2.drop(df2.columns[0], axis=1, inplace=True)
df2 = df2.iloc[1:]
df2 = df2.reset_index()
# Convert data type of quantities to integer
df1["Quantity"] = pd.to_numeric(df1["Quantity"])
df2["Required Quantity"] = pd.to_numeric(df2["Required Quantity"])
# create an extra column in df2
df2["Age 1"] = np.nan
df2_extra_col = 1
for i2, row2 in df2.iterrows():
    age_col = 1  # current age column
    product_id = row2["Product ID"]
    required_quantity = row2["Required Quantity"]
    # search for this product id in df1
    for i1, row1 in df1.iterrows():
        available_quantity = row1["Quantity"]
        if row1["Product ID"] == product_id:
            if available_quantity == 0:  # skip 0 quantity rows
                continue
            if available_quantity < required_quantity:  # insufficient quantity
                required_quantity -= available_quantity
                df1.loc[i1, "Quantity"] = 0  # take everything
                df2.loc[i2, "Age " + str(age_col)] = row1["Age Days"]
                # add another column to df2 if missing
                age_col += 1
                if age_col > df2_extra_col:
                    df2["Age " + str(age_col)] = np.nan
                    df2_extra_col += 1
                continue
            else:  # single delivery enough
                df2.loc[i2, "Age " + str(age_col)] = row1["Age Days"]
                df1.loc[i1, "Quantity"] -= required_quantity
                break
df2.drop(df2.columns[0], axis=1, inplace=True)
print(df2)
Result
Location Code Product ID Required Quantity Age 1 Age 2 Age 3
0 ABCD001 AA12345678 100 120 NaN NaN
1 ABCD001 AA12345678 401 120 111 110
2 ABCD002 AA12345678 19 110 NaN NaN
3 EFGH001 BB12345678 200 120 109 NaN
4 EFGH002 BB12345678 20 109 NaN NaN
Notes
I highly recommend that you use single-stepping and breakpoints to debug my code and understand what happens on each line. Let me know if anything is unclear.
My file contents were in markdown, so some pre-processing was needed to convert them to a dataframe. If your files are already in CSV, you can load them directly into a dataframe with pd.read_csv.
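For illustration, a minimal sketch of the direct route (the .csv filenames here are assumptions):
import pandas as pd
# plain CSV needs none of the markdown clean-up above (assumed filenames)
df1 = pd.read_csv('table1.csv')  # Product ID, Inventory Receipt Date, Age Days, Quantity
df2 = pd.read_csv('table2.csv')  # Location Code, Product ID, Required Quantity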
I have an example dataframe with yearly granularity:
df = pd.DataFrame({
"date": ["2020-01-01", "2021-01-01", "2022-01-01"],
"cost": [100, 1000, 150],
"person": ["Tom","Jerry","Brian"]
})
I want to create a dataframe with monthly granularity, without any estimation methods (just repeat each row 12 times, once per month of its year). So as a result, from this 3-row dataframe I would like to get exactly 36 rows, like:
2020-01-01 / 100 / Tom
2020-02-01 / 100 / Tom
2020-03-01 / 100 / Tom
2020-04-01 / 100 / Tom
2020-05-01 / 100 / Tom
[...]
2022-10-01 / 150 / Brian
2022-11-01 / 150 / Brian
2022-12-01 / 150 / Brian
I tried
df.resample('M', on = 'date').apply(lambda x:x)
but I can't seem to get it working...
I'm a beginner, so forgive my ignorance.
Thanks for the help in advance!
Here is a way to do that.
count = len(df)
for var in df[['date','cost','person']].values:
    for i in range(2, 13):
        # copy the yearly row, swapping months 02..12 into the date string
        df.loc[count] = [var[0][0:5] + "{:02d}".format(i) + var[0][7:], var[1], var[2]]
        count += 1
df = df.sort_values('date')
The following should also work:
# typecasting
df['date'] = pd.to_datetime(df['date'])
# making a new dataframe with month-start frequency covering the full range
op = pd.DataFrame(pd.date_range(start=df['date'].min(), end=df['date'].max()+pd.offsets.DateOffset(months=11), freq='MS'), columns=['date'])
# merging both dataframes on year (outer join)
res = pd.merge(df, op, left_on=df['date'].apply(lambda x: x.year), right_on=op['date'].apply(lambda x: x.year), how='outer')
# dropping the merge key and the original yearly date column
res.drop(['key_0','date_x'], axis=1, inplace=True)
I have a dataframe df1 consisting of 14,000 person ids. I have another dataframe df2 with 300,000 rows of ids and other attributes. I need to match the 14,000 ids of df1 against the 300,000 ids of df2 and extract the whole row for each of those 14,000 ids.
df1 personUuid
0 99afae32-1486-47db-825e-6695f742eb86
1 bb22ca94-1f4b-435c-98ff-bd6f02a6b42b
2 ecfdc560-cc97-4525-8d1e-e3536793ef6e
3 8fbe1e4f-ae1e-4949-afd9-b120f6ae3762
4 d83dc0c4-26e6-4126-926d-7b84913bca13
... ...
14367 23592455-47a2-47ef-9d21-a283ae50988d
14368 1adecd7e-a0c2-4c35-bef1-75569f3b57fe
14369 e96f6eb4-d823-47b4-bd03-755e8f685e8f
14370 c87156e2-9610-40f4-a75a-17435d9fa91f
14371 70f08fd1-c595-4d01-886d-ed586a77c1d1
personUuid firstName middleName lastName emails urls locations currentTitles currentCompanies education ... count_currentTitles fullName li_clean gh_clean tw_clean fb_clean email_clean email_clean1 email_clean2 email_clean3
0 ab92fa98-2427-461d-87ac-31a440b6e1ae
1 658c57b9-457a-4e97-8b1c-10ab45655518
2 7da5a858-3c20-46c0-b728-23e64352094d
3 9c14f2b6-a81a-49af-85d4-d4cf76001f07
Similarly, the second dataframe has 300K person ids and attributes like fullName, emails, locations, etc.
I need to match those 14K ids against the 300K and display all the attributes for the 14K only.
You need to do a merge with an inner join as given below:
# strip stray whitespace so equal ids actually compare equal
df1['personUuid'] = df1['personUuid'].str.strip()
df2['personUuid'] = df2['personUuid'].str.strip()
# inner join keeps only the ids present in both frames
df = pd.merge(left=df1, right=df2, how='inner', on=['personUuid'])
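If you only need the matched rows rather than a join, an isin filter is an equivalent sketch (same column name assumed):
# boolean mask: True where a df2 id also appears in df1
mask = df2['personUuid'].isin(df1['personUuid'])
df = df2.loc[mask]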
I have df with cols:
date Account invoice category sales
12-01-2019 123 123 exhaust 2200
13-01-2019 124 124 tyres 1300
15-01-2019 234 125 windscreen 4500
16-01-2019 123 134 gearbox 6000
I have grouped by Account and summed sales:
dfres = df.groupby(['Account']).agg({'sales': np.sum})
I received:
sales
account
123 8200
124 3300
I now want to retrieve the original df filtered by my grouped details, i.e. a reduced dataset, but my attempt leaves me with the same number of rows as the original. I only want to retain, for example, the top 5% of sales. How can I remove the unwanted accounts?
I've tried:
index_list = dfres.index.tolist()
newdf = df[df['Account'].isin(index_list)]
Many thanks
If you want to keep the remaining columns, you'll need to tell pandas how to show the rest of the columns once grouped. For example, if you want to keep the information in invoice and category and date as a list of whatever invoices/cats/dates make up that Account sum, then:
dfres = df.groupby(['Account']).agg({'sales': np.sum, 'invoice':list, 'category':list, 'date':list})
You could then reset the index to turn it back into a flat dataframe:
dfres.reset_index()
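For the top-5% part of the question, a hedged sketch on top of the grouped sums (reading "top 5% of sales" as accounts at or above the 95th percentile of summed sales):
# total sales per account
totals = df.groupby('Account')['sales'].sum()
# accounts at or above the 95th percentile of summed sales
top_accounts = totals[totals >= totals.quantile(0.95)].index
# keep only the original rows belonging to those accounts
newdf = df[df['Account'].isin(top_accounts)]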
I have one difficulty here. My goal is to create a list of sales for one shop, using one dataframe that lists prices by product and another that lists all the sales in terms of products and quantities (for one period of time).
DataFrame 1 : prices
prices = pd.DataFrame({'qty_from' : ('0','10','20'), 'qty_to' : ('9','19','29'), 'product_A' :(50,30,10),'product_B' :(24,14,12),'product_C' :(70,50,18)})
DataFrame 2 : sales
sales = pd.DataFrame({'product' : ('product_b','product_b','product_a','product_c','product_b'), 'qty' : ('4','12','21','41','7')})
I would like to get the turnover, line by line within the 'sales' DataFrame, as an extra column like 'TurnOver'.
I used
pd.merge_asof(sales, prices, left_on='qty', right_on='qty_from', direction='backward')
and it gave me the right price for the quantity sold, but how do I get the correct price for each product?
How can I match a value in the 'sales' dataframe like 'product_b' against the column of the same name in the prices dataframe, and then apply a calculation to get the turnover?
Thank you for your help,
Eric
If I understand correctly, you can modify the dataframe prices to be able to use the parameter by in merge_asof, using stack:
#modify price
prices_stack = (prices.set_index(['qty_from','qty_to']).stack()  # products become a column
                .reset_index(name='price').rename(columns={'level_2':'product'}))
# uniform the case
sales['product'] = sales['product'].str.lower()
prices_stack['product'] = prices_stack['product'].str.lower()
# necessary with this data because the quantities are strings, not ints
sales.qty = sales.qty.astype(int)
prices_stack.qty_from = prices_stack.qty_from.astype(int)
#now you can merge_asof adding by parameter
sales_prices = (pd.merge_asof( sales.sort_values('qty'), prices_stack,
left_on='qty', right_on='qty_from',
by = 'product', #first merge on the column product
direction='backward')
.drop(['qty_from','qty_to'], axis=1)) #not necessary columns
print (sales_prices)
product qty price
0 product_b 4 24
1 product_b 7 24
2 product_b 12 14
3 product_a 21 10
4 product_c 41 18
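From there, the 'TurnOver' column the question asks for should just be price times quantity (assuming turnover = unit price * qty):
# assumed formula: turnover = unit price * quantity sold
sales_prices['TurnOver'] = sales_prices['price'] * sales_prices['qty']
print(sales_prices)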