I am trying to fill missing values in one DataFrame from another DataFrame, matching on column values and applying some conditions.
The two DataFrames I'm working with look like this:
The first DataFrame contains values recorded in quarterly periods (Mar., June, Sep., Dec.):
Name  Date     Value_1  Value_2  Value_3
John  2022-03        1        5        9
John  2022-06        2       NA       10
John  2022-09       NA        7       11
John  2022-12        4        8       12
Dave  2021-03       NA        5       10
Dave  2021-06       20       NA       10
Dave  2021-09       10        5        8
Dave  2021-12       NA       NA       30
The second DataFrame contains values recorded in semiannual periods (June and Dec.):
Name  Date     Value_1  Value_2  Value_3
John  2022-06       NA       NA       30
John  2022-12       50       NA       NA
Dave  2021-06       NA       NA       NA
Dave  2021-12       30       20       10
There are two conditions I want to apply when merging these two DataFrames.
(1) If Value_1 or Value_3 is missing in the semiannual DataFrame, fill it with the value from the quarterly row with the same name, year, and month. For example, Value_3 in the second semiannual row (John, 2022-12) becomes 12, since the fourth quarterly row (John, 2022-12) has Value_3 = 12. When filling this way, the March and September rows of the quarterly data are ignored.
(2) If Value_2 is missing in the semiannual DataFrame, fill it with the sum of the two relevant quarterly months. For example, Value_2 in the second semiannual row (2022-12) is replaced by the sum of the 2022-09 and 2022-12 values in the quarterly DataFrame.
The final DataFrame should look like this:
Name  Date     Value_1               Value_2                     Value_3
John  2022-06  2 (from quarter df)   5 (5 + NA from quarter df)  30
John  2022-12  50                    15 (7 + 8 from quarter df)  12 (from quarter df)
Dave  2021-06  20 (from quarter df)  5 (5 + NA from quarter df)  10 (from quarter df)
Dave  2021-12  30                    20                          10
I am trying to develop the code below, but it only satisfies the first condition.
import numpy as np
import pandas as pd

# Use np.nan for missing values so that fillna() recognizes them.
dt_quarter = [['John','2022-03',1,5,9], ['John','2022-06',2,np.nan,10], ['John','2022-09',np.nan,7,11], ['John','2022-12',4,8,12],
              ['Dave','2021-03',np.nan,5,10], ['Dave','2021-06',20,np.nan,10], ['Dave','2021-09',10,5,8], ['Dave','2021-12',np.nan,np.nan,30]]
df_quarter = pd.DataFrame(dt_quarter, columns=['Name','Date','Value_1','Value_2','Value_3'])

dt_semi = [['John','2022-06',np.nan,np.nan,30], ['John','2022-12',50,np.nan,np.nan],
           ['Dave','2021-06',np.nan,np.nan,np.nan], ['Dave','2021-12',30,20,10]]
df_semi = pd.DataFrame(dt_semi, columns=['Name','Date','Value_1','Value_2','Value_3'])

group_cols = ['Name', 'Date']
# Fill missing semiannual values from the quarterly row with the same Name and Date.
df_combined = df_semi.set_index(group_cols).fillna(df_quarter.set_index(group_cols)).reset_index()
Please give advice on this problem.
Thanks
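One possible approach, sketched below under the assumption that the 'NA' strings are first converted to real NaN values (as in the corrected snippet above): condition (1) stays the index-aligned fillna you already have, restricted to Value_1/Value_3, and condition (2) maps each quarterly date onto the half year it belongs to and sums Value_2 per group, where sum(min_count=1) keeps NaN only when both quarterly values are missing. Treat this as a sketch, not the only way to do it.

group_cols = ['Name', 'Date']

# Map each quarterly date onto the half year it falls in:
# Jan-Jun -> YYYY-06, Jul-Dec -> YYYY-12.
q = df_quarter.copy()
months = pd.to_datetime(q['Date'], format='%Y-%m')
q['Date'] = months.dt.year.astype(str) + np.where(months.dt.month <= 6, '-06', '-12')

# Condition (1): take Value_1/Value_3 from the quarterly row with the same Name and Date.
fill_13 = df_quarter.set_index(group_cols)[['Value_1', 'Value_3']]

# Condition (2): Value_2 is the sum of the two quarters in the half year;
# min_count=1 returns NaN only when both quarterly values are missing.
fill_2 = q.groupby(group_cols)['Value_2'].sum(min_count=1)

df_combined = df_semi.set_index(group_cols)
df_combined[['Value_1', 'Value_3']] = df_combined[['Value_1', 'Value_3']].fillna(fill_13)
df_combined['Value_2'] = df_combined['Value_2'].fillna(fill_2)
df_combined = df_combined.reset_index()
print(df_combined)

With the sample data this reproduces the final table above; the March and September quarterly rows never match a semiannual key, so they are ignored for Value_1/Value_3, exactly as condition (1) requires.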
Related
Given the following pandas DataFrame:
       ID country  money  other  money_add
0  832932  France  12131     19      82932
1  217#8#     NaN    NaN    NaN        NaN
2  1329T2     NaN    NaN    NaN        NaN
3  832932  France    NaN     30        NaN
4  31728#     NaN    NaN    NaN        NaN
I would like to make the following modifications for each row:
If the ID column contains any '#' character, the row is left unchanged.
If the ID column contains no '#' character and country is NaN, "Other" is written to the country column and 0 to the other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following table (matching other against other_ID):
   other_ID  money  money_add
0        19   4532     723823
1        50   1213     238232
2        18   1813     273283
3        30   1313      83293
4         0   8932       3920
Example of the resulting table:
       ID country  money  other  money_add
0  832932  France  12131     19      82932
1  217#8#     NaN    NaN    NaN        NaN
2  1329T2   Other   8932      0       3920
3  832932  France   1313     30      83293
4  31728#     NaN    NaN    NaN        NaN
First set values in both columns where both conditions match, then align on the other column (masked to NaN for '#' rows) and update only the matched rows with DataFrame.update:
m1 = df['ID'].str.contains('#')   # rows whose ID contains '#' are left unchanged
m2 = df['country'].isna()         # rows with missing country

# No '#' in ID and country is NaN -> set country to 'Other' and other to 0.
df.loc[~m1 & m2, ['country','other']] = ['Other', 0]

# Align the lookup table on other_ID and the main frame on other,
# masking '#' rows to NaN so they are never matched.
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))

# update(overwrite=False) fills only the NaN cells from the lookup table.
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print(df)
       ID country   money  other  money_add
0  832932  France   12131   19.0    82932.0
1  217#8#     NaN     NaN    NaN        NaN
2  1329T2   Other  8932.0    0.0     3920.0
3  832932  France  1313.0   30.0    83293.0
4  31728#     NaN     NaN    NaN        NaN
I have a sample dataframe/table as below and I would like to do a simple pivot table in Python to calculate the % difference from the previous year.
DataFrame
Year Month Count Amount Retailer
2019 5 10 100 ABC
2019 3 8 80 XYZ
2020 3 8 80 ABC
2020 5 7 70 XYZ
...
Expected Output
     Month   %Diff
ABC      7  -0.200
XYZ      8  -0.125
Thanks,
EDIT: I would like to reiterate that I want to create the expected output table above, not do a join of the two tables.
It looks like you need a groupby, not a pivot:
gdf = df.groupby('Retailer')[['Amount']].pct_change()
Then rename and merge with original df
df = gdf.rename(columns={'Amount': '%Diff'}).dropna().merge(df, how='left', left_index=True, right_index=True)
   %Diff  Year  Month  Count  Amount Retailer
2 -0.200  2020      3      8      80      ABC
3 -0.125  2020      5      7      70      XYZ
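If only the retailer, month, and %Diff columns are wanted, as in the expected output, a minimal variant of the same idea (assuming Year orders the rows within each retailer, since there is no other timestamp):

out = df.sort_values(['Retailer', 'Year']).copy()
# % change of Amount versus the previous year, within each retailer
out['%Diff'] = out.groupby('Retailer')['Amount'].pct_change()
# keep only rows that have a previous year to compare against
print(out.dropna(subset=['%Diff'])[['Retailer', 'Month', '%Diff']])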
My first data frame:
import pandas as pd

product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop', 'New Watch'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics', 'Electronics'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
    'Seller_City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Bengalore', 'New York']
})
My 2nd data frame has transactions:
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic', 'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'age': [20, 25, 15, 10, 30, 65, 35, 18, 23],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
    'Purchased_Product': ['Watch', 'NA', 'Oil', 'NA', 'Shoes', 'Smartphone', 'NA', 'NA', 'Laptop'],
    'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Chennai', 'Delhi', 'Kolkata', 'Delhi', 'Mumbai']
})
I want Price from the 1st data frame to come into the merged dataframe, the common element being 'Product_ID'. Note that against Product_ID 101 there are 2 prices: 299.00 and 9898.00. I want the latter one in the merged data set, i.e. 9898.0 (since this is the latest price).
Currently my code is not giving the right answer; it is giving both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
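Putting the two steps together while keeping every customer row (a sketch based on the asker's original left merge, pulling in only Price):

# Keep the last row per Product_ID, i.e. the latest price.
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')
# Left merge so customers without a matching product keep NaN for Price.
customerpur = pd.merge(customer, latest[['Product_ID', 'Price']], on='Product_ID', how='left')
print(customerpur)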
So I had a dataframe and I had to do some cleansing to minimize the duplicates. To do that, I created a dataframe with only 8 of the original 40 columns. Now I need two more columns from the original dataframe for further analysis, but they would have messed with the desired outcome if I had used them in my previous analysis. Does anyone have an idea how to "extract" these columns based on the new "clean" dataframe I have?
You can merge the new "clean" dataframe with the other two variables by using the indexes. Let me use a practical example. Suppose the "initial" dataframe, called "df", is:
df
name year reports location
0 Jason 2012 4 Cochice
1 Molly 2012 24 Pima
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
4 Amy 2014 3 Yuma
while the "clean" dataframe is:
d1
year location
0 2012 Cochice
2 2013 Santa Cruz
3 2014 Maricopa
The remaining columns are saved in dataframe "d2" (d2 = df[['name','reports']]):
d2
name reports
0 Jason 4
1 Molly 24
2 Tina 31
3 Jake 2
4 Amy 3
By using the inner join on the indexes, d1.merge(d2, how='inner', left_index=True, right_index=True), you get the following result:
name year reports location
0 Jason 2012 4 Cochice
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
You can make a new dataframe with the specified columns:
import pandas as pd

# If your columns are named a, b, c, d, etc.
df1 = df[['a', 'b']]

# This will extract the first two columns (0 and 1) by position
# (remember that pandas indexes columns from zero!)
df2 = df.iloc[:, 0:2]
If you could provide a sample piece of data, that would make it easier for us to help you.
What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 Wheat 5312 Ha 10 20 30
2 Afghanistan 25 Maize 5312 Ha 10 20 30
4 Angola 15 Wheat 7312 Ha 30 40 50
4 Angola 25 Maize 7312 Ha 30 40 50
I want to groupby the column Country and Item_Code and only compute the sum of the rows falling under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 C3 5312 Ha 20 40 60
4 Angola 25 C4 7312 Ha 60 80 100
Right now I am doing this:
df.groupby('Country').sum()
However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?
You can select the columns of a groupby:
In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
Y1961 Y1962 Y1963
Country Item_Code
Afghanistan 15 10 20 30
25 10 20 30
Angola 15 30 40 50
25 30 40 50
Note that the list passed must be a subset of the columns otherwise you'll see a KeyError.
The agg function will do this for you. Pass a dict mapping each column to the function (or list of functions) to apply:
# Example with two output columns from a single input column ('Y1962')
df.groupby(['Country', 'Item_Code']).agg({'Y1961': np.sum, 'Y1962': [np.sum, np.mean]})
This will display only the groupby columns and the specified aggregate columns. In this example I included two agg functions applied to 'Y1962'.
To get exactly what you hoped to see, include the other columns in the groupby and apply sums to the Y variables in the frame:
df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': np.sum, 'Y1962': np.sum, 'Y1963': np.sum})
If you are looking for a more generalized way to apply this to many columns, you can build a list of column names and pass it to the column indexer of the grouped dataframe. In your case, for example:
columns = ['Y' + str(year) for year in range(1967, 2011)]
df.groupby('Country')[columns].agg('sum')
If you want to add a suffix/prefix to the aggregated column names, use add_suffix() / add_prefix().
df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().add_suffix("_total")
If you want to retain Code and Country as columns after aggregation, set as_index=False in groupby() or use reset_index().
df.groupby(["Code", "Country"], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
# df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()
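For the sample frame above, either form should leave Code and Country as regular columns, roughly:

   Code      Country  Y1961  Y1962  Y1963
0     2  Afghanistan     20     40     60
1     4       Angola     60     80    100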