How to iterate through a dataframe based on two conditions? - python

I have a sample of companies with financial figures which I would like to compare. My data looks like this:
   Cusip9     Issuer                     IPO Year  Total Assets  Long-Term Debt  Sales     SIC-Code
1  783755101  Ryerson Tull Inc           1996      9322000.0     2632000.0       633000.0  3661
2  826170102  Siebel Sys Inc             1996      995010.0      0.0             50250.0   2456
3  894363100  Travis Boats & Motors Inc  1996      313500.0      43340.0         23830.0   3661
4  159186105  Channell Commercial Corp   1996      426580.0      3380.0          111100.0  7483
5  742580103  Printware Inc              1996      145750.0      0.0             23830.0   8473
For every company I want to calculate a "similarity score" indicating its comparability with other companies, so I want to compare them across different financial figures. Comparability should be expressed as the Euclidean distance, the square root of the sum of the squared differences between the financial figures, to the "closest" company. So I need to calculate the distance to every company that fits the conditions, but only keep the closest score: Assets of Company 1 minus Assets of Company 2, plus Debt of Company 1 minus Debt of Company 2, and so on:
√((x_1 − y_1)^2 + (x_2 − y_2)^2)
This should only be computed for companies with the same SIC code, and the IPO year of the comparable companies should be smaller than that of the company for which the similarity score is computed; I only want to compare these companies with already-listed companies.
Hopefully my point is clear. Does someone have any idea where I can start? I am just starting with programming and am completely lost with this.
Thanks in advance.

I would first create different dataframes according to the SIC code, so every new dataframe only contains companies with the same SIC code. Then, for each of those dataframes, just double-loop over the companies and compute the scores, and store them in a matrix. (So you'll end up with a symmetric matrix of scores.) A rough sketch of this idea follows below.
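A minimal sketch of that approach, assuming the dataframe is named df with the columns shown above; the feature list is an assumption, extend it with Sales or other figures as needed:

import numpy as np

features = ['Total Assets', 'Long-Term Debt']  # assumed feature columns

for sic, group in df.groupby('SIC-Code'):
    vals = group[features].to_numpy()
    n = len(group)
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Euclidean distance between company i and company j
            scores[i, j] = np.sqrt(((vals[i] - vals[j]) ** 2).sum())
    # 'scores' is now a symmetric n x n distance matrix for this SIC code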

Try this. Here I compare each company against companies whose IPO year is equal to or smaller than its own (since you didn't give any company record with a strictly smaller IPO year); you can change it to strictly smaller (<) in the statement Group = df[...].
def closestCompany(companyRecord):
    # Comparable companies: same SIC code, IPO year no later than this
    # company's, and not the company itself
    Group = df[(df['SIC-Code'] == companyRecord['SIC-Code'])
               & (df['IPO Year'] <= companyRecord['IPO Year'])
               & (df['Issuer'] != companyRecord['Issuer'])]
    # Euclidean distance over the financial figures, keeping only the closest
    return (((Group['Total Assets'] - companyRecord['Total Assets'])**2
             + (Group['Long-Term Debt'] - companyRecord['Long-Term Debt'])**2)**0.5).min()

df['Closest Company Similarity Score'] = df.apply(closestCompany, axis=1)
df
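Note that with the strict < variant, any company holding the earliest IPO year within its SIC group has no comparable peers; .min() over the resulting empty selection returns NaN for that row, which is arguably the right signal that no score exists.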

Related

How to solve constraint problem where variables must be one of a discrete set of values in OR-Tools?

My problem looks like this: a movie theatre is showing a set of n films over a 5-day period. Each movie has a corresponding IMDb score. I want to watch one movie per day over the 5-day period, maximising the cumulative IMDb score while making sure that I watch the best movies first (i.e. Monday's movie will have a higher score than Tuesday's, Tuesday's higher than Wednesday's, etc.). An extra constraint is that the theatre doesn't show every movie every day. For example:
Showings:
Monday showings: Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
Tuesday showings: Sleepless in Seattle, Kramer vs Kramer, Jack Reacher
Wednesday showings: The Hobbit, A Star is Born, Joker
etc.
Scores:
Sleepless in Seattle: 7.0
Multiplicity: 10
Jaws: 9.2
The Hobbit: 8.9
A Star is Born: 6.2
Joker: 5.8
Kramer vs Kramer: 8.7
etc.
The way I've thought about this is that each day is a variable (a, b, c, d, e) and we maximise a + b + c + d + e; to make sure I watch the movies in descending order of IMDb rank, I would add the constraint a > b > c > d > e. However, as far as I can tell, with the linear solver you cannot restrict a variable to a set of discrete values, only to a continuous range. In an ideal world the problem would look like: solve for a, b, c, d, e, maximising their cumulative sum while ensuring a > b > c > d > e, where a can take one set of possible values, b another set, and so on. I'm wondering if someone can point me to the OR-Tools solver (or another library) best suited to this problem.
I tried to use the GLOP linear solver to solve this problem but failed. I was expecting it to solve for a, b, c, d and e, but I couldn't express the necessary constraints in that paradigm.
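The question was left unanswered here, but one plausible direction (an assumption, not something from the original post) is CP-SAT, OR-Tools' constraint programming solver, which supports integer variables over arbitrary discrete domains. A minimal sketch, with scores scaled by 10 so they become integers, and the Jack Reacher score invented for illustration:

from ortools.sat.python import cp_model

# IMDb scores scaled by 10 (CP-SAT variables must be integers)
showings = [
    [70, 100, 92, 89],  # Monday: Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
    [70, 87, 64],       # Tuesday: Sleepless in Seattle, Kramer vs Kramer, Jack Reacher (64 is assumed)
    [89, 62, 58],       # Wednesday: The Hobbit, A Star is Born, Joker
]

model = cp_model.CpModel()
# One variable per day, restricted to exactly the scores shown that day
days = [model.NewIntVarFromDomain(cp_model.Domain.FromValues(scores), f'day{i}')
        for i, scores in enumerate(showings)]

# Best movies first: strictly decreasing scores across the days
for earlier, later in zip(days, days[1:]):
    model.Add(earlier > later)

model.Maximize(sum(days))

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(d) for d in days])  # e.g. [100, 87, 62]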

How to resample panel data from daily to monthly with sums and averages?

I am working with a COVID-19 dataset that looks as follows:
Date        City        City ID  State  Estimated Population  Estimated Population_2019  Confirmed Rate  Death Rate  New Confirmed  New Deaths
2020-03-17  Rio Branco  10002    AC     413418                407319                     0.72566         0.01        3              0
2020-03-17  Manaus      12330    AM     555550                555964                     0.65433         0.005       5              3
The date is my index, and I have multiple cities with the same dates, as shown.
Given that I have daily data points, I am trying to resample my data to monthly points. I have tried using the resample command, but I am having trouble because some of my columns should be summed and others averaged. More specifically:
City, City ID, State: will remain the same, as they are IDs.
Estimated Population and Estimated Population_2019: I would like to take the mean of each of these columns, and these will be the new monthly values.
Confirmed Rate and Death Rate: I would like their monthly means as my monthly values, and I would like to create new columns giving the monthly standard deviation of the confirmed rate and the death rate.
New Confirmed and New Deaths: I would like to add these values up, so the monthly points are the sums of new cases and new deaths, in two separate columns.
How can I write code that differentiates which columns to sum and which to average, and how can I create the two new columns for the standard deviations of the confirmed and death rates?
You should explore a combination of groupby with .agg.
Something like this should work:
agg_spec = {
    'Estimated Population': 'mean',
    'Estimated Population_2019': 'mean',
    'Confirmed Rate': ['mean', 'std'],
    'Death Rate': ['mean', 'std'],
    'New Confirmed': 'sum',
    'New Deaths': 'sum',
}
# to_period('M') keeps the year attached to the month, unlike df.index.month,
# so data spanning several years is not merged into a single month bucket
df_grouped = df.groupby([df.index.to_period('M'), 'City ID']).agg(agg_spec)
df_grouped.index.rename(['Month', 'City ID'], inplace=True)
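Because Confirmed Rate and Death Rate each aggregate with two functions, df_grouped ends up with MultiIndex columns such as ('Confirmed Rate', 'std'). If you prefer flat column names, one way (a sketch against the df_grouped above) is:

df_grouped.columns = ['_'.join(col) for col in df_grouped.columns]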

Complex partial string matching in pandas

Given a dataframe with the following structure and values -
json_path                                                                  Reporting Group   Entity/Grouping
data.attributes.total.children.[0]                                         Christian Family  Abraham Family
data.attributes.total.children.[0].children.[0]                            Christian Family  In Estate
data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income
How would I filter on the json_path rows which contain children four times? I.e., I want to filter on index positions 2-3 -
json_path                                                                  Reporting Group   Entity/Grouping
data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income
I know how to obtain a partial match; however, the integers in the square brackets will be inconsistent, so my instinct is telling me to somehow have logic that counts the instances of children (i.e., children appearing 4x) and use that as a basis to filter.
Any suggestions or resources on how I can achieve this?
As you said, a naive approach is to count the occurrences of .children and compare the count with 4, creating a boolean mask that filters the rows:
df[df['json_path'].str.count(r'\.children').eq(4)]
A more robust approach is to check for four consecutive occurrences of .children.[n]:
df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]
(The non-capturing group (?:...) avoids pandas' warning about match groups in str.contains. Note that this pattern matches four or more consecutive occurrences, while the count-based version requires exactly four.)
json_path Reporting Group Entity/Grouping
2 data.attributes.total.children.[0].children.[0].children.[0].children.[0] Christian Family Cash
3 data.attributes.total.children.[0].children.[0].children.[1].children.[0] Christian Family Investment Grade Fixed Income

How to create a search algorithm for sales optimization in Python?

I have a dataset with distances between 4 cities. Each city has a sales store from the same company, and I have the number of sales from the last month for each store in another dataset.
What I want to know is the best possible route between the cities to make the most profit (each product is sold for 5), knowing that I only produce in the first city and then have a truck with a maximum truckload of 5000 supplying the other 3 cities.
I can't find anything similar to my problem; the closest I could find were search algorithms. Can someone tell me what approach to take?
Sorry if my question is a bit confusing.
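No answer was posted, but with only 4 cities an exhaustive search over delivery routes is perfectly feasible. Below is a minimal sketch; the distance matrix, the per-store demand, and the transport cost per distance unit are all invented for illustration, while the price of 5 and the truckload of 5000 come from the question:

from itertools import permutations

PRICE = 5                 # sale price per product (from the question)
CAPACITY = 5000           # maximum truckload (from the question)
COST_PER_UNIT_DIST = 0.1  # assumed transport cost per distance unit

# Assumed symmetric distance matrix; city 0 is the production site
dist = [
    [0, 10, 20, 15],
    [10, 0, 25, 30],
    [20, 25, 0, 12],
    [15, 30, 12, 0],
]
demand = {1: 2000, 2: 1800, 3: 1500}  # assumed last-month sales per store

best_route, best_profit = None, float('-inf')
for route in permutations([1, 2, 3]):
    # Deliver along the route until the truck runs out of stock
    remaining, revenue = CAPACITY, 0
    for city in route:
        delivered = min(demand[city], remaining)
        revenue += delivered * PRICE
        remaining -= delivered
    # Travel cost: start at city 0 and visit the cities in order
    legs = [0, *route]
    travel = sum(dist[a][b] for a, b in zip(legs, legs[1:]))
    profit = revenue - travel * COST_PER_UNIT_DIST
    if profit > best_profit:
        best_route, best_profit = route, profit

print(best_route, best_profit)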

Panel Data Research & Development Capitalisation

I am working with panel data containing many companies' research and development expenses throughout the years.
What I would like to do is to capitalise these expenses as if they were assets. For those not familiar with financial terminology: I am trying to accumulate each year's R&D expense with those of the following years, decaying (or "depreciating") its value every period by the corresponding depreciation rate.
The dataframe looks something like this:
        fyear  tic   rd_tot  rd_dep
0       1979   AMFD  1.345   0.200
1       1980   AMFD  0.789   0.200
...     ...    ...   ...     ...
211339  2017   ACA   3.567   0.340
211340  2018   ACA   2.990   0.340
211341  2018   CTRM  0.054   0.234
Where fyear is the fiscal year, tic is the company specific letter code, rd_tot is the total R&D expenditure for the year and rd_dep is the applicable depreciation rate.
So far I was able to come up with this:
df['r&d_capital'] = [
    (df['rd_tot'].iloc[:i] * (1 - df['rd_dep'].iloc[:i] * np.arange(i)[::-1])).sum()
    for i in range(1, len(df) + 1)
]
However, the problem is that the code runs through the entire column without taking into consideration that the R&D expense needs to be capitalised in a company-specific (or tic-specific) way. I also tried using .groupby(['tic']), but it did not work.
Therefore, I am looking for help with this problem, so that I can get each year's R&D expense capitalisation in a COMPANY SPECIFIC way.
Thank you very much for your help!
This solution breaks the initial dataframe into separate ones (one per 'tic' group), applies the R&D capital formula to each, and finally re-assembles the dataframe with pd.concat:
tic_dfs = [tic_group for _, tic_group in df.groupby('tic')]

for tic_df in tic_dfs:
    # Same formula as before, but now the positional slice .iloc[:i] only
    # ever sees rows belonging to a single company
    tic_df['r&d_capital'] = [
        (tic_df['rd_tot'].iloc[:i] * (1 - tic_df['rd_dep'].iloc[:i] * np.arange(i)[::-1])).sum()
        for i in range(1, len(tic_df) + 1)
    ]

result = pd.concat(tic_dfs).sort_index()
Note: "_" is a placeholder for the group name (e.g. "ACA", "AMFD"), while tic_group is the actual data body.
