Merge values from one dataframe with column names of another dataframe - Python

I have a difficulty here. My goal is to create a list of sales for one shop from two dataframes: one that lists prices by product and quantity bracket, and another that lists all the sales in terms of products and quantities (for one period of time).
DataFrame 1: prices
import pandas as pd

prices = pd.DataFrame({'qty_from': ('0', '10', '20'),
                       'qty_to': ('9', '19', '29'),
                       'product_A': (50, 30, 10),
                       'product_B': (24, 14, 12),
                       'product_C': (70, 50, 18)})
DataFrame 2: sales
sales = pd.DataFrame({'product': ('product_b', 'product_b', 'product_a', 'product_c', 'product_b'),
                      'qty': ('4', '12', '21', '41', '7')})
I would like to get the turnover, line by line within the 'sales' DataFrame, as an additional column like 'TurnOver'.
I used
pd.merge_asof(sales, prices, left_on='qty', right_on='qty_from', direction='backward')
and it gave me the right price for the quantity sold, but how do I get the correct price for each product?
How can I match a value in the 'sales' dataframe, like 'product_b', with the column of the same name in the 'prices' dataframe, and then apply a calculation to get the turnover?
Thank you for your help,
Eric

If I understand correctly, you can modify the dataframe prices to be able to use the parameter by in merge_asof, using stack:
# reshape prices: products become a column
prices_stack = (prices.set_index(['qty_from','qty_to'])
                      .stack()
                      .reset_index(name='price')
                      .rename(columns={'level_2':'product'}))
# normalize the case of the product names
sales['product'] = sales['product'].str.lower()
prices_stack['product'] = prices_stack['product'].str.lower()
# cast the quantities to int (they are strings in your sample data)
sales.qty = sales.qty.astype(int)
prices_stack.qty_from = prices_stack.qty_from.astype(int)
# now you can merge_asof, adding the by parameter
sales_prices = (pd.merge_asof(sales.sort_values('qty'), prices_stack,
                              left_on='qty', right_on='qty_from',
                              by='product',          # first match on the product column
                              direction='backward')
                  .drop(['qty_from','qty_to'], axis=1))  # drop the helper columns
print(sales_prices)
     product  qty  price
0  product_b    4     24
1  product_b    7     24
2  product_b   12     14
3  product_a   21     10
4  product_c   41     18
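To get the 'TurnOver' column asked for in the question, you can then multiply the matched price by the quantity (a one-line sketch, using the sales_prices dataframe built above):
sales_prices['TurnOver'] = sales_prices['qty'] * sales_prices['price']  # quantity sold * matched unit price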

Related

Create a new column with the number of unique occurrences by date (groupby 2 conditions - in a timeseries)

I've tried various groupby methods. I would like to add a new column, 'store_locations', which calculates the total number of store locations for a specific 'product_code' on a given date. Basically: how many stores is a specific product selling in on any given day? My dataframe should look like this, with 'store_locations' added as a new column.
date        store_location  product_code  store_locations
2017-01-01  Store-A         100           3
2017-01-01  Store-B         100           3
2017-01-01  Store-C         100           3
2017-01-01  Store-D         200           1
2017-01-02  Store-D         200           1
The following, for example, ignores grouping by date and only takes into account the number of unique products:
group = df.groupby(['date','store_location','product_code']).size().groupby(level=2).size()
You can use:
import numpy as np

pvt_coef = df.pivot_table(index=['date','product_code'], aggfunc={'store_location': np.count_nonzero})
pvt_coef.rename(columns={'store_location':'count'}, inplace=True)
pvt_coef = pvt_coef.reset_index()
dfcoef = pd.merge(df, pvt_coef, left_on=['date','product_code'], right_on=['date','product_code'], how='left')
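A shorter alternative sketch (column names assumed from the question) writes the count straight back onto the original dataframe with groupby and transform:
# number of distinct store locations selling each product_code on each date
df['store_locations'] = df.groupby(['date', 'product_code'])['store_location'].transform('nunique')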

How to handle records in a dataframe with the same ID but different values in some columns in Python

I am working with pandas on a dataframe of bank (loan) details for customers. There is a problem: some unique loan IDs have been recorded twice, with different values for some of the features. I am attaching a screenshot to be more specific.
As you can see, this unique Loan ID has been recorded 2 times. I want to drop the second record, the one with NaN values, but I can't do it manually because there are 4900 similar cases. Any idea?
The problem is not the NaN values, the problem is the duplicate records. I want to drop rows with NaN values only for duplicated records, not for the entire dataframe.
Thanks in advance
Count how many rows share the same IDs, and then drop the NaN rows only within those duplicated groups.
df['flag'] = df.groupby(['Loan ID', 'Credit ID'])['Loan ID'].transform('count')
# keep singleton rows as-is; within duplicated IDs keep only the rows without NaNs
df = df[(df['flag'] == 1) | df[['Credit Score', 'Annual Income']].notna().all(axis=1)].drop('flag', axis=1)
Instead of dropping the NaN rows, just keep the rows where Credit Score is not NaN:
df = df[df['Credit Score'].notna()]
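Another common pattern, sketched here under the assumption that 'Loan ID' alone identifies the duplicates: sort so the rows with values come before the NaN rows within each ID, then keep the first row per ID (note that this reorders the dataframe):
df = (df.sort_values(['Loan ID', 'Credit Score'], na_position='last')
        .drop_duplicates(subset='Loan ID', keep='first'))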

How to take the max value of each category in 1 column across multiple rows

I am using Python 3.4 in a Jupyter notebook.
I am looking to select the max of each product type from the table below. I've found the groupby code written below, but I am struggling to figure out how to make the search take into account the max across all boxes (Box_1 and Box_2), and so on for the other product types.
Perhaps best described as some sort of fuzzy matching?
Ideally my output should give me the max in each category:
box_2 18
bottles_3 31
.
.
.
How should I do this?
data = {'Product':['Box_1','Bottles_1','Pen_1','Markers_1','Bottles_2','Pen_2','Markers_2','Bottles_3','Box_2','Markers_2','Markers_3','Pen_3'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','Sales'])
df1
df1.groupby(['Product'])['Sales'].max()
If I understand correctly, you first have to look at the category and then retrieve both the name of the product and the maximum value. Here is how to do that:
df1=pd.DataFrame(data, columns=['Product','Sales'])
df1['Category'] = df1.Product.str.split('_').str.get(0)
df1["rank"] = df1.groupby("Category")["Sales"].rank("dense", ascending=False)
df1[df1["rank"]==1.0][['Product','Sales']]
The rank function will rank the products within each category according to their Sales value. Then, you need to filter out any product that ranks lower than first within its category. That will give you the desired dataframe:
      Product  Sales
2       Pen_1     31
7   Bottles_3     31
8       Box_2     18
10  Markers_3     18
Here you go:
df1['Type'] = df1.Product.str.split('_').str.get(0)
df1.groupby(['Type'])['Sales'].max()
Type
Bottles 31
Box 18
Markers 18
Pen 31
Name: Sales, dtype: int64
You can split the values by _, select the first part by indexing with str[0], pass the result to groupby, and use idxmax to get the Product with the maximal Sales alongside the max Sales itself:
df1 = df1.set_index('Product')
df2 = (df1.groupby(df1.index.str.split('_').str[0])['Sales']
          .agg([('Product','idxmax'), ('Sales','max')])
          .reset_index(drop=True))
print(df2)
     Product  Sales
0  Bottles_3     31
1      Box_2     18
2  Markers_3     18
3      Pen_1     31

Creating different dataframes based on conditions on multiple columns in Excel using Python

I want to make a separate dataframe for each Number (column B) where Main Date > Reported Date (see the image below). If this condition is true for a Number, I have to make another dataframe displaying that Number's data.
Example: take Number (column B) 223311. If any Main Date > Reported Date for that Number, then display all the records of that Number.
Here is a simple solution with pandas. You can separate out dataframes very easily by the values of a particular column. From there, iterate over the new dataframe after resetting its index (if you want to keep the original index, iterate using dataframe.shape instead of the reset index). I appended the dataframes to a list for convenience; they could easily be extracted into labeled dataframes, or combined. The long variable names are there to help comprehension.
df = pd.read_csv('forstack.csv')
list_of_dataframes = []   # a place to store each dataframe; you could also name them as you go
checked_Numbers = []      # simply to avoid building the same dataframe twice
for aNumber in df['Number']:                    # for every number in the column "Number"
    if aNumber not in checked_Numbers:          # while this number has not been processed
        checked_Numbers.append(aNumber)         # mark it as checked
        df_forThisNumber = df[df.Number == aNumber].reset_index(drop=True)  # "make a different dataframe" per request, with a new index
        for index in range(0, len(df_forThisNumber)):  # check each row of this dataframe against the criteria
            if df_forThisNumber.at[index, 'Main Date'] > df_forThisNumber.at[index, 'Reported Date']:
                list_of_dataframes.append(df_forThisNumber)  # if it matches the criteria, append it
                break  # avoid appending the same dataframe more than once
Outputs:
  Main Date  Number Reported Date  Fee  Amount  Cost  Name
0  1/1/2019  223311      1/1/2019  100      12    20    11
1  1/7/2019  223311      1/1/2019  100      12    20    11

  Main Date  Number Reported Date  Fee  Amount  Cost  Name
0  1/2/2019  111111      1/2/2019  100      12    20    11
1  1/6/2019  111111      1/2/2019  100      12    20    11

  Main Date  Number Reported Date  Fee  Amount  Cost  Name
0  1/3/2019  222222      1/3/2019  100      12    20    11
1  1/8/2019  222222      1/3/2019  100      12    20    11
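A vectorized alternative sketch (column names taken from the output above; parsing the dates with pd.to_datetime is an assumption, since read_csv loads them as strings): keep the Numbers that have at least one row with Main Date > Reported Date, then split into one dataframe per Number.
df['Main Date'] = pd.to_datetime(df['Main Date'])
df['Reported Date'] = pd.to_datetime(df['Reported Date'])
# Numbers having at least one row where Main Date > Reported Date
matching_numbers = df.loc[df['Main Date'] > df['Reported Date'], 'Number'].unique()
# one dataframe per matching Number
list_of_dataframes = [group.reset_index(drop=True)
                      for _, group in df[df['Number'].isin(matching_numbers)].groupby('Number')]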

How to convert this to a for-loop with an output to CSV

I'm trying to put together a generic piece of code that would:
Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
            4. close  decile
date
2017-01-03    1158.2       0
2017-01-04    1166.5       1
2017-01-05    1181.4       2
2017-01-06    1175.7       1
...              ...     ...
2018-04-23    1326.0       7
2018-04-24    1333.2       8
2018-04-25    1327.2       7
[374 rows x 2 columns]
Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
Take an average of those 30-day price returns for each decile
level1 = gold.loc[datelist]     # .ix is deprecated; .loc does the same label lookup here
level2 = gold.loc[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1, level2, how='inner', left_index=True, right_index=True)

def ret(one, two):
    return (two - one) / one

pricereturns = result.apply(lambda x: ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3, but only for a single decile; I'm struggling to expand this into looped code that covers all 10 deciles at once with a clean CSV output.
First append the close price at t + 1 month as a new column on the whole dataframe.
# look up the close one calendar month ahead; reindex returns NaN where that date is missing
# (.loc with missing labels raises a KeyError in recent pandas)
gold2_close = gold['close'].reindex(gold.index + pd.DateOffset(months=1))
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
In practice, however, what matters is the number of trading days, i.e. you won't have prices for weekends or holidays. So I'd suggest shifting by a number of rows rather than by a date range, i.e. the next 20 trading days:
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1 and 2 directly so you don't need the additional column (only if you shift by a number of rows, not by a fixed date range, because of the reindexing):
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data, you get something like the dataframe below: close and ret are the averages for each decile. You can create a CSV from a dataframe via pandas.DataFrame.to_csv (see the sketch after the table).
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938
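For that last step, a minimal sketch (the file name is just an example):
# write the per-decile average close and return to a CSV file
gold_grouped.to_csv('decile_returns.csv')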
