Python Pandas Question: Index / Match with Missing Values + Duplicates + Everything In Between

Basically, I have a small table of assets purchased this year and a larger table of assets the company holds. I want to look up the SYMBOL for each CUSIP in the purchases table from the holdings table and merge it into the purchases dataset. If a CUSIP in the purchases table has no match, the code can return blank or NaN. If there are duplicate CUSIPs in the holdings dataset, it should return the first value. I have tried four different ways of merging these tables now without much luck; for some reason I run into a MemoryError.
The equivalent excel code would be:
=IFNA(INDEX(asset_holdings!ADMIN_SYMBOLS,MATCH(asset_purchases!CUSIP_n, asset_holdings!CUSIPs, 0)),"")
Holdings Table

CUSIP      SYMBOL
353187EV5  1A
74727PAY7  3A
80413TAJ8  FE
02765UCR3  3G
000000000  3G
74727PAYA  3E
000000000  4E
Purchase Table

CUSIP      SHARES
353187EV5  10
74727PAY7  67
80413TAJ8  35
02765UCR4  3666
74727PAY7  3613
74727PAYA  13
000000000  14
Desired Result

CUSIP      SHARES  SYMBOL
353187EV5  10      1A
74727PAY7  67      3A
80413TAJ8  35      FE
02765UCR4  3666    ""
74727PAY7  3613    3A
74727PAYA  13      3E
000000000  14      3G
C:\ProgramData\Continuum\Anaconda\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
1140 join_func = _join_functions[how]
1141
-> 1142 return join_func(lkey, rkey, count, **kwargs)
1143
1144
pandas\_libs\join.pyx in pandas._libs.join.left_outer_join()
MemoryError:
What I tried:
dfnew = dfPurchases.merge(dfHoldings[['CUSIP','SYMBOL']],how='left', on='CUSIP')
dfPurchases = dfPurchases.set_index('CUSIP')
dfPurchases['SYMBOL'] = dfHoldings.lookup(dfHoldings['CUSIP'], df1['SYMBOL'])

Let me restate the question a little so you can check whether I have understood it correctly. You want to do a left outer join of the purchase dataset with the holdings dataset. But since your holdings dataset has duplicate CUSIP ids, it will not be a one-to-one join.
Now you have two options:
Accept multiple rows for one row of the purchase dataset
Make CUSIP id unique in the Holdings dataset and then perform the merge
First way:
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
result = pd.merge(left, right, on="CUSIP", how='left')
print(result)
But, as per your question, the result above isn't acceptable, so we make the CUSIP column unique in the right dataset first:
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
# keep='first' is the default, but it is spelled out here for clarity
right_unique = right.drop_duplicates('CUSIP', keep='first')
result = pd.merge(left, right_unique, on="CUSIP", how='left', validate="many_to_one")
print(result)
Bonus: you can also try the validate parameter on the first version and see the validation errors it raises.
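Putting the pieces together on the sample data from the question, and filling the unmatched row with an empty string to mimic Excel's IFNA (a minimal sketch with the frames built inline rather than read from CSV):

```python
import pandas as pd

# Sample data from the question
holdings = pd.DataFrame({
    'CUSIP': ['353187EV5', '74727PAY7', '80413TAJ8', '02765UCR3',
              '000000000', '74727PAYA', '000000000'],
    'SYMBOL': ['1A', '3A', 'FE', '3G', '3G', '3E', '4E'],
})
purchases = pd.DataFrame({
    'CUSIP': ['353187EV5', '74727PAY7', '80413TAJ8', '02765UCR4',
              '74727PAY7', '74727PAYA', '000000000'],
    'SHARES': [10, 67, 35, 3666, 3613, 13, 14],
})

# Keep only the first row per CUSIP, like MATCH(..., 0) does
holdings_unique = holdings.drop_duplicates('CUSIP', keep='first')
result = purchases.merge(holdings_unique, on='CUSIP', how='left')

# IFNA(..., "") equivalent: replace NaN with an empty string
result['SYMBOL'] = result['SYMBOL'].fillna('')
print(result)
```

This reproduces the Desired Result table, including the first-match symbol 3G for the duplicated 000000000 CUSIP.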

Related

dataframe transform partial row data on column

I have one dataframe in the format shown in the image below.
Every three columns represent one type of data. In the given example there is one column for the ticker, the next three columns are one type of data, and columns 5-7 are a second type of data.
Now I want to transform this into columns where every type of data is appended by group.
Expected output is:
Is there any way to do this transformation in pandas using an existing API? I am doing it in a very basic way: creating a new dataframe for one group and then appending it.
Here is one way to do it:
Use pd.melt to unstack the table, then split what used to be columns (and are now rows) on "/" to separate them into two columns (txt, year).
Create the new row value by combining ticker and year, then use pivot to get the desired result set.
df2 = df.melt(id_vars='ticker', var_name='col')  # line missed in earlier solution, updated
df2[['txt', 'year']] = df.melt(id_vars='ticker', var_name='col')['col'].str.split('/', expand=True)
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
Result set
txt ticker2 data1 data2
0 AAPL/2020 0.824676 0.616524
1 AAPL/2021 0.018540 0.046365
2 AAPL/2022 0.222349 0.729845
3 AMZ/2020 0.122288 0.087217
4 AMZ/2021 0.012168 0.734674
5 AMZ/2022 0.923501 0.437676
6 APPL/2020 0.886927 0.520650
7 APPL/2021 0.725515 0.543404
8 APPL/2022 0.211378 0.464898
9 GGL/2020 0.777676 0.052658
10 GGL/2021 0.297292 0.213876
11 GGL/2022 0.894150 0.185207
12 MICO/2020 0.898251 0.882252
13 MICO/2021 0.141342 0.105316
14 MICO/2022 0.440459 0.811005
Based on the code that you posted in a comment: I unfortunately missed a line when posting the solution; it is added now.
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022",
                            "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3 = df2.melt(id_vars='ticker', var_name='col')  # missed line in earlier posting
df3[['txt', 'year']] = df2.melt(id_vars='ticker', var_name='col')['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
txt ticker2 data1 data2
0 APPL/2020 26 9
1 APPL/2021 75 59
2 APPL/2022 20 44
3 MICO/2020 79 90
4 MICO/2021 63 30
5 MICO/2022 73 91
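For columns that follow a fixed name/suffix pattern like these, pd.wide_to_long can also do the reshape in a single call (a sketch against the same toy frame; sep='/' is needed because the default separator is an empty string):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022",
                            "data2/2020", "data2/2021", "data2/2022"])
df2.insert(loc=0, column='ticker', value=['APPL', 'MICO'])

# wide_to_long strips the stubnames and turns the '/year' suffix into an index level
long_df = pd.wide_to_long(df2, stubnames=['data1', 'data2'],
                          i='ticker', j='year', sep='/').reset_index()
print(long_df)
```

This yields one row per (ticker, year) with data1 and data2 as plain columns, which is the same shape as the melt/pivot result without the intermediate string surgery.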

Best way to store pandas df data for recall in analysis

I have a large df in pandas that has a company's product information. Here is a small sample of rows with only the columns I believe are needed to get the information I desire.
df = pd.DataFrame({'Customers': [1, 2, 3, 4, 5, 6] * 3,
                   'Product': ['Beer1', 'Beer2', 'Beer1', 'Beer4', 'Beer3', 'Beer5'] * 3,
                   'Packaging': ['6pk', 'keg', 'big_keg', '12pack', '22 oz bottle', '18pack'] * 3,
                   'Sale_Price': [25, 50, 75, 34, 54, 99] * 3})
I want to be able to pull the sale price:
def get_price(Customer, Product, Packaging):
    abc = df[(df['Customers'] == Customer) & (df['Product'] == Product) & (df['Packaging'] == Packaging)]
    price = abc.iloc[0]['Sale_Price']
    return price
The function I wrote works for getting one value, but I was wondering if there is a better way to get and store pricing information for an entity's products for later use, since I usually use the prices as inputs to a multiplication formula, as in the examples below:
Beer1 Run1: 365 12 packs, 43 big_kegs, 12 kegs
Beer2 Run1: 400 18 packs, 67 kegs
So Ex1 would look something like this: Revenue = (365 * 12 pack price + 43 * big_keg price + 12 * keg price)
My Question(s): How to alter the function above to account for the examples? How best to store all prices for later use?
More direct question based on comment:
I have three arguments (maybe more due to additional pack type possibilities): Customer name, Product Name, Packaging Type, (additional pack type)
I need the sale price, prices for multiple pack types.
So, I have these Beer1, Customer2, 12pack, big_keg: How would my function handle this? Is a function the best way or should I create and store a master pricing dictionary or another storage method?
Will probably need a weighted average at some point, but one question at time.
Thanks in advance for your help.
If what you are looking for is the total revenue each packaging type is bringing, you can simply do groupby.
df.groupby('Packaging')['Sale_Price'].sum()
output:
Packaging
12pack 102
18pack 297
22 oz bottle 162
6pk 75
big_keg 225
keg 150
Name: Sale_Price, dtype: int64
You can do the same for price info with the unique function:
df.groupby('Packaging')['Sale_Price'].unique()
Packaging
12pack [34]
18pack [99]
22 oz bottle [54]
6pk [25]
big_keg [75]
keg [50]
Name: Sale_Price, dtype: object
Which also help in checking if each type of packaging had one unique pricing or different sale price in the dataframe.
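For the lookup side of the question, another option is to store prices once as a Series indexed by (Customer, Product, Packaging); .loc then acts as the three-key lookup and revenue formulas become short sums. A sketch on the sample frame, assuming each key combination carries a single price (duplicates collapse to the first):

```python
import pandas as pd

df = pd.DataFrame({'Customers': [1, 2, 3, 4, 5, 6] * 3,
                   'Product': ['Beer1', 'Beer2', 'Beer1', 'Beer4', 'Beer3', 'Beer5'] * 3,
                   'Packaging': ['6pk', 'keg', 'big_keg', '12pack', '22 oz bottle', '18pack'] * 3,
                   'Sale_Price': [25, 50, 75, 34, 54, 99] * 3})

# One price per (customer, product, packaging) key; duplicate rows collapse to the first
prices = (df.drop_duplicates(['Customers', 'Product', 'Packaging'])
            .set_index(['Customers', 'Product', 'Packaging'])['Sale_Price'])

# Single lookup, equivalent to get_price(3, 'Beer1', 'big_keg')
print(prices.loc[(3, 'Beer1', 'big_keg')])

# Revenue for a production run: quantities times looked-up prices
# (the run dict here is a made-up example, not data from the question)
run = {(3, 'Beer1', 'big_keg'): 43, (1, 'Beer1', '6pk'): 365}
revenue = sum(qty * prices.loc[key] for key, qty in run.items())
print(revenue)
```

Keeping the prices in an indexed Series avoids re-filtering the full dataframe on every call, and the same object can later feed weighted averages via groupby on one or two of the index levels.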

How to append a value in excel based on cell values from multiple columns in Python and/or R

I'm new to openpyxl and other similar excel packages in Python (even Pandas). What I want to achieve is to append the lowest possible price that I can keep for each product, based on the expense formula. The expense formula is in the code below; the data looks like this in excel:
Product  Cost  Price  Lowest_Price
ABC      32    66
XYZ      15    32
DEF      22    44
JML      60    120
I have the code below on Python 3.5, which works; however, it might not be the most optimized solution. I need to know how to append the lowest value to the Lowest_Price column:
cost = 32   # cost of product from the Cost column
price = 66  # price of product from the Price column
Net = cost + 5  # minimum profit required
lowest = 0  # lowest price that I can keep on the website for each product
# iterate over candidate prices and post the lowest value adjacent to each product
for i in range(Net, price):
    expense = (i * 0.15) + 15  # expense formula
    if i - expense >= Net:
        lowest = i
        break
print(lowest)  # this value should be printed adjacent to the price, in the Lowest_Price column
Now if someone can help me doing that in Python and/or R. The reason I want in both Python and R is because I want to compare the time complexity, as I have a huge set of data to deal with.
I'm fine with code that works with any of the excel formats i.e. xls or xlsx as long it is fast
I worked it out this way:
import pandas as pd

df = pd.read_excel('path/file.xls')
row = list(df.index)
for x in row:
    cost = df.loc[x, 'Cost']
    price = df.loc[x, 'Price']
    Net = cost + 5
    df.loc[x, 'Lowest_Price'] = 0
    for i in range(Net, price):
        expense = (i * 0.15) + 15  # expense formula
        if i - expense >= Net:
            df.loc[x, 'Lowest_Price'] = i
            break

# if you want to save it back to an excel file
df.to_excel('path/file.xls', index=False)
It gave this output:
Product Cost Price Lowest_Price
0 ABC 32 66 62.0
1 XYZ 15 32 0.0
2 DEF 22 44 0.0
3 JML 60 120 95.0
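The inner loop can also be eliminated with a little algebra: the condition i - (0.15*i + 15) >= Net rearranges to i >= (Net + 15) / 0.85, so the lowest integer price is just a ceiling, clipped to 0 when it falls outside the searched range. A vectorized sketch over the same four products (worth cross-checking against the loop on your real data before relying on it):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product': ['ABC', 'XYZ', 'DEF', 'JML'],
                   'Cost': [32, 15, 22, 60],
                   'Price': [66, 32, 44, 120]})

net = df['Cost'] + 5                 # minimum required profit
lowest = np.ceil((net + 15) / 0.85)  # smallest integer i with i - (0.15*i + 15) >= net
# the loop searches range(Net, price), which excludes Price itself, hence the strict <
df['Lowest_Price'] = np.where(lowest < df['Price'], lowest, 0)
print(df)
```

This avoids both Python-level loops entirely, which should matter for the large dataset mentioned in the question.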

pandas: column formatting issues causing merge problems

I have the following two dataframes:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values. I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, year. I made a test earlier wherein a df1.catcode with only one value per cell was matched with the values in another df.catcode that had more than one value per cell and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with the regex method, I'm looking at the documentation now.
Cleaning catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
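On pandas 0.25 and later, the apply(pd.Series).stack() trick can be replaced by Series.explode, which does the same "ungrouping" in one step. A small sketch on a made-up two-row frame (the original CSVs are external, so the catcode strings here just imitate their \t-ridden format):

```python
import pandas as pd

df = pd.DataFrame({'GeoName': ['Alabama', 'Alaska'],
                   'catcode': ["E1600',\t'E1620',\t'A4000'", "X3000',\t'X3200'"]})

# Extract the clean five-character codes, then explode the lists
# into one row per code, repeating the other columns
df['catcode'] = df['catcode'].str.findall('[A-Z][0-9]{4}')
df = df.explode('catcode')
print(df)
```

After exploding, the frame has one tidy catcode per row and can be merged on ['catcode', 'state', 'year'] directly.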

Python - time series alignment and "to date" functions

I have a dataset whose first three columns are Basket ID (a unique identifier), Sale amount (dollars), and the date of the transaction. For each row of the dataset I want to calculate the following columns, and I would like to do it in Python:
Previous sale of the same basket (if any); sale count to date for the current basket; mean to date for the current basket (if available); max to date for the current basket (if available).
Basket  Sale  Date        PrevSale  SaleCount  MeanToDate  MaxToDate
88      $15   3/01/2012             1
88      $30   11/02/2012  $15       2          $23         $30
88      $16   16/08/2012  $30       3          $20         $30
123     $90   18/06/2012            1
477     $77   19/08/2012            1
477     $57   11/12/2012  $77       2          $67         $77
566     $90   6/07/2012             1
I'm pretty new with Python, and I really struggle to find anything to do it in a fancy way. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one for each single basket. No clue how to get the MeanToDate and MaxToDate in an efficient way apart from looping... any ideas?
This should do the trick:
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),   # cumulative mean
            'MaxToDate': se.cummax(),           # cumulative max
            'SaleCount': expanding_count(se),   # cumulative count
            'Sale': se,                         # simple copy
            'PrevSale': se.shift(1)             # previous sale
        },
        axis=1
    )
# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating here.
import pandas as pd
pd.__version__ # u'0.24.2'
from pandas import concat
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),   # cumulative mean
            'MaxToDate': se.expanding().max(),     # cumulative max
            'SaleCount': se.expanding().count(),   # cumulative count
            'Sale': se,                            # simple copy
            'PrevSale': se.shift(1)                # previous sale
        },
        axis=1
    )
###########################
from datetime import datetime
df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})
#########
new_df = df.groupby('Basket').apply(handler).reset_index()
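A variant that skips the concat handler entirely and builds the columns with grouped operations on the frame itself, assuming the data is sorted by Basket and Date first (shift and cummax are group-aware here; the expanding mean goes through transform so the result aligns with the original index):

```python
import pandas as pd

df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': pd.to_datetime(['3/01/2012', '11/02/2012', '16/08/2012',
                                           '18/06/2012', '19/08/2012', '11/12/2012',
                                           '6/07/2012'], format='%d/%m/%Y')})
df = df.sort_values(['Basket', 'Date'])

g = df.groupby('Basket')['Sale']
df['PrevSale'] = g.shift(1)                                     # previous sale in the basket
df['SaleCount'] = g.cumcount() + 1                              # running count within basket
df['MeanToDate'] = g.transform(lambda s: s.expanding().mean())  # running mean
df['MaxToDate'] = g.cummax()                                    # running max
print(df)
```

This keeps the original row layout (no multi-level index to reset) and tends to be faster than groupby().apply on larger frames, since most of the work stays in cythonized grouped kernels.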
