I have two tables in pandas. One is about 10,000+ rows that looks like this:
Table 1
col_1 date state ratio [50 more cols]
A 10/12 NY .5
A 12/05 MA NaN
.........
I have another table that's about 10 rows that looks like this:
Table 2
date state ratio
12/05 MA .9
12/03 MA .8
............
I need to set the ratio in table 1 based on the date and state values from table 2. The ideal solution would be to merge on date and state, but that creates two columns: ratio_x and ratio_y
I need a way to set the ratio in table 1 to the corresponding ratio in table 2 where the date and states both match. The ratios in table 1 can be overwritten.
If this can be done correctly by merging then that works too.
Edit: You can consider table 2 as being meant to map to specific state values (so all the states in table 2 are MA in this example)
You'll need to choose which ratio value to take first. Assuming you want ratios from table 2 to take precedence:
# join in ratio from the other table
table1 = table1.join(table2.set_index(["date", "state"])["ratio"].to_frame("ratio2"), on=["date", "state"])
# take ratio2 first, then the existing ratio value if ratio2 is null
table1["ratio"] = table1["ratio2"].fillna(table1["ratio"])
# delete the ratio2 column
del table1["ratio2"]
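Since the question mentions that merging would also be acceptable, here is a sketch of the same idea done with a merge (same column names as above, and ratio values from table 2 still take precedence):
# merge keeps table1's ratio column name and puts table2's ratio in ratio_t2
merged = table1.merge(table2, on=["date", "state"], how="left", suffixes=("", "_t2"))
merged["ratio"] = merged["ratio_t2"].fillna(merged["ratio"])
table1 = merged.drop(columns="ratio_t2")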
First create a mapping series from df2:
s = df2.set_index(['date', 'state'])['ratio']
Then feed to df1:
mapped = df1.set_index(['date', 'state']).index.map(s.get)
df1['ratio'] = pd.Series(mapped.to_numpy(), index=df1.index).fillna(df1['ratio'])
Precedence is given to ratios in df2.
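As a tiny sanity check using only the sample rows shown in the question (in practice df1 and df2 would already be loaded):
import pandas as pd

df1 = pd.DataFrame({'col_1': ['A', 'A'],
                    'date': ['10/12', '12/05'],
                    'state': ['NY', 'MA'],
                    'ratio': [0.5, None]})
df2 = pd.DataFrame({'date': ['12/05', '12/03'],
                    'state': ['MA', 'MA'],
                    'ratio': [0.9, 0.8]})
s = df2.set_index(['date', 'state'])['ratio']
mapped = df1.set_index(['date', 'state']).index.map(s.get)
df1['ratio'] = pd.Series(mapped.to_numpy(), index=df1.index).fillna(df1['ratio'])
print(df1['ratio'].tolist())   # [0.5, 0.9]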
I have a table of SKUs that need to be placed in locations. The volume of a SKU determines how many locations it needs. There are a limited number of locations, so I need to prioritize based on how much volume will be in a location and then, once in order, assign the locations. When a location is full, its volume should be the location volume; the last location gets the remainder volume.
Current table setup
So the end result should look like this.
I was hoping to iterate based on the number of locations needed, creating a row in a new table each time while decrementing the remaining location count.
Something like this.
rows = int(sum(df['locations_needed']))
new_locs = []
for i in range(rows):
    if df['locations_needed'] > 1:
        new_locs.append(df['SKU'], df['location_amount'])
        df['locations_needed'] - 1
    else:
        new_locs.append(df['SKU'], df['remainder_volume'])
        df['locations_needed'] - 1
Use the repeat method from pd.Index:
out = (df.reindex(df.index.repeat(df['locations_needed'].fillna(0).astype(int)))
         .reset_index(drop=True))
print(out)
# Output
SKU location_amount locations_needed
0 FAKESKU 2300 3.0
1 FAKESKU 2300 3.0
2 FAKESKU 2300 3.0
3 FAKESKU2 2100 2.0
4 FAKESKU2 2100 2.0
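For reference, a minimal input frame, built only from the values shown in the output above (column names assumed from the question), that reproduces this result:
import pandas as pd

df = pd.DataFrame({'SKU': ['FAKESKU', 'FAKESKU2'],
                   'location_amount': [2300, 2100],
                   'locations_needed': [3.0, 2.0]})
out = (df.reindex(df.index.repeat(df['locations_needed'].fillna(0).astype(int)))
         .reset_index(drop=True))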
Building off of using repeat as suggested by Corralien, you then set the value for the last row of each SKU group to the remainder volume, then reorder and reset the index again. So,
# create a row for each potential location per SKU (cast the float counts to int)
df = df.loc[df.index.repeat(df['locations_needed'].fillna(0).astype(int))]
# reset index
df = df.reset_index(drop=True)
# mark every row except the last one of each SKU group
not_last = df['SKU'].duplicated(keep='last')
# fill the last row of each group with the remainder volume
df.loc[~not_last, 'location_amount'] = df['remainder_volume']
# reorder and reset index
df = df.sort_values(by=['location_amount'], ascending=False)
df['locations_needed'] = 1
df = df.reset_index(drop=True)
I am having difficulties merging 2 tables. In fact, I would like to add a column from table B into table A based on one key.
Table A (632 rows) contains the following columns:
part_number / part_designation / AC / AC_program
Table B (4,674 rows) contains the following columns:
part_ref / supplier_id / supplier_name / ac_program
I would like to add the supplier_name values into Table A.
I have succeeded in performing a left join based on the condition tableA.part_number == tableB.part_ref
However, when I look at the resulting table, additional rows were created: I now have 683 rows instead of the initial 632 rows in Table A. How do I keep the same number of rows while including the supplier_name values in Table A?
Table B seems to contain duplicates in part_ref. The join operation creates a new record in your original table for each duplicate key in Table B, which is why the row count grows. You can check this:
import pandas as pd

# if these two numbers differ, part_ref is not unique in Table B
print(len(pd.unique(updated_ref_table.part_ref)))
print(updated_ref_table.shape[0])
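If that is the case, one possible fix (a sketch, assuming the tables are named tableA and tableB as in the description) is to drop duplicate part_ref rows before the left join so the row count of Table A cannot grow:
# keep only one row per part_ref, then bring in supplier_name via a left join
tableB_unique = tableB.drop_duplicates(subset='part_ref')
tableA = (tableA.merge(tableB_unique[['part_ref', 'supplier_name']],
                       left_on='part_number', right_on='part_ref', how='left')
                .drop(columns='part_ref'))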
I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location, while the columns become every unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
import pandas as pd

# Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template

# Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed, I end up with multiple dataframes that all have the same row count but potentially different column counts. I solved this by concatenating the columns and, wherever columns are repeated, summing them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    # sum any activity columns that appear in more than one partial frame
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from roughly 10 s to 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
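For anyone who wants a one-shot version, here is a hedged sketch using pd.wide_to_long plus pivot_table; it assumes the same Activity01-04 / Quantity01-04 column names as above and a default RangeIndex, and it has not been benchmarked against the crosstab approach:
import pandas as pd

# reshape the ActivityXX/QuantityXX pairs into long form, then aggregate
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Activity', 'Quantity'],
                          i='index', j='slot')
finalDf = (long_df.dropna(subset=['Activity'])
                  .pivot_table(index='Location', columns='Activity',
                               values='Quantity', aggfunc='sum', fill_value=0))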
I need to match multiple criteria between two dataframes and then assign an ID.
This is complicated by the fact that one criterion needs to be 'like or similar' rather than exact, as it involves a time reference that is slightly different.
I need the timestamps to match to the second, +/- half a second. I would then like to add a new column in DF2 that records the matching ID:
DF1
TimeStamp ID Size
2018-07-12T03:34:54.228000Z 46236499 0.0013
2018-07-12T03:34:54.301000Z 46236500 0.01119422
DF2
TimeStamp                    Size        ID          #new column
2018-07-12T03:34:54.292Z     0.00        blank       #no match/no data
2018-07-12T03:34:54.300Z     0.01119422  46236500    #size and timestamp match within tolerances
In the example above, the script would look at the timestamp column and look for any timestamp in DF2 that fell within +/- half a second of "2018-07-12T03:34:54" and had the exact same 'Size' value.
This needs to be done like this as there could be multiple 'Size' elements that are the same throughout the dataset.
It would then write the corresponding ID into the newly created 'ID' column in DF2, or, if DF2 were copied to a new dataframe, into a new 'ID' column in DF3.
Depending on which rows you need in the final dataframe, you may choose different join operators. One solution joins the two dataframes on the column Size and then filters the merged rows based on the absolute time difference between the two timestamp columns.
import numpy
import pandas
from datetime import timedelta

# the TimeStamp columns must already be datetime (e.g. via pandas.to_datetime)
df3 = df1.merge(df2, left_on='Size', right_on='Size', how='right')
df3['deltaTime'] = numpy.abs(df3['TimeStamp_x'] - df3['TimeStamp_y'])
df3 = df3[(df3['deltaTime'] < timedelta(milliseconds=500)) | pandas.isnull(df3['deltaTime'])]
Output:
TimeStamp_x ID_x Size TimeStamp_y ID_y deltaTime
0 2018-07-12 03:34:54.301 46236500.0 0.011194 2018-07-12 03:34:54.300 46236500 00:00:00.001000
1 2018-07-12 03:34:54.301 46236500.0 0.011194 2018-07-12 03:34:54.800 46236501 00:00:00.499000
3 NaT NaN 0.000000 2018-07-12 03:34:54.292 blank NaT
If you don't want any non-merged rows, just remove | pandas.isnull(df3['deltaTime']) and use an inner join.
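Another option worth considering, assuming both frames are (or can be) sorted by timestamp and DF2 does not yet have an ID column, is pandas.merge_asof, which matches on the nearest timestamp within a tolerance while requiring an exact match on Size. This is only a sketch and keeps at most one match per DF2 row:
import pandas as pd

df1['TimeStamp'] = pd.to_datetime(df1['TimeStamp'])
df2['TimeStamp'] = pd.to_datetime(df2['TimeStamp'])
df3 = pd.merge_asof(df2.sort_values('TimeStamp'),
                    df1.sort_values('TimeStamp')[['TimeStamp', 'Size', 'ID']],
                    on='TimeStamp', by='Size',
                    tolerance=pd.Timedelta(milliseconds=500),
                    direction='nearest')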
Here is a sample df:
data = {"Brand":{"0":"BrandA","1":"BrandA","2":"BrandB","3":"BrandB","4":"BrandC","5":"BrandC"},"Cost":{"0":18.5,"1":19.5,"2":6,"3":6,"4":17.69,"5":18.19},"IN STOCK":{"0":10,"1":15,"2":5,"3":1,"4":12,"5":12},"Inventory Number":{"0":1,"1":1,"2":2,"3":2,"4":3,"5":3},"Labels":{"0":"Black","1":"Black","2":"White","3":"White","4":"Blue","5":"Blue"},"Maximum Price":{"0":30.0,"1":35.0,"2":50,"3":45.12,"4":76.78,"5":76.78},"Minimum Price":{"0":23.96,"1":25.96,"2":12.12,"3":17.54,"4":33.12,"5":28.29},"Product Name":{"0":"Product A","1":"Product A","2":"ProductB","3":"ProductB","4":"ProductC","5":"ProductC"}}
df = pd.DataFrame(data=data)
My actual data set is much larger, but maintains the same pattern of there being 2 rows that share the same Inventory Number throughout.
My goal is to create a new data frame that contains only the inventory numbers where a cell value is not duplicated across both rows, and for those inventory numbers, only contains the data from the row with the lower index that is different from the other row.
For this example the resulting data frame would need to look like:
data = {"Inventory Number":{"0":1,"1":2,"2":3},"Cost":{"0":18.50,"1":"","2":17.69},"IN STOCK":{"0":10,"1":5,"2":""},"Maximum Price":{"0":30,"1":50,"2":""},"Minimum Price":{"0":23.96,"1":12.12,"2":33.12}}
df = pd.DataFrame(data=data)
The next time this would run, perhaps nothing changed in the "Maximum Price", so that column would need to not be included at all.
I was hoping someone would have a clean solution using groupby, but if not, I imagine the solution would involve dropping all full duplicates, then looping through the remaining inventory numbers and evaluating each column for duplicates.
icol = 'Inventory Number'
# drop rows that are exact duplicates of another row
d0 = df.drop_duplicates(keep=False)
# number the rows within each inventory number (0 = first row, 1 = second row)
i = d0.groupby(icol).cumcount()
# reshape so the two rows of each inventory number sit side by side as columns 0 and 1
d1 = d0.set_index([icol, i]).unstack(icol).T
# keep only the cells where the two rows disagree
d1[1][d1[1] != d1[0]].unstack(0)
Cost IN STOCK Maximum Price Minimum Price
Inventory Number
1 19.5 15 35 25.96
2 None 1 45.12 17.54
3 18.19 None None 28.29
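If you instead want the values from the lower-index row for each inventory number, as in the desired output in the question, the same reshaped frame can be read from the other side:
# take the first row's values wherever the two rows disagree
d1[0][d1[1] != d1[0]].unstack(0)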
Try this:
In [68]: cols = ['Cost','IN STOCK','Inventory Number','Maximum Price','Minimum Price']
In [69]: df[cols].drop_duplicates(subset=['Inventory Number'])
Out[69]:
Cost IN STOCK Inventory Number Maximum Price Minimum Price
0 18.5 10 100566 30.0 23.96