I am working with panel data containing many companies' research and development (R&D) expenses over the years.
What I would like to do is capitalise these expenses as if they were assets. For those not familiar with financial terminology: I am trying to accumulate each year's R&D expense with those of the following years, decaying (or "depreciating") each value every period by the corresponding depreciation rate.
The dataframe looks something like this:
fyear tic rd_tot rd_dep
0 1979 AMFD 1.345 0.200
1 1980 AMFD 0.789 0.200
.. .. .. .. ..
211339 2017 ACA 3.567 0.340
211340 2018 ACA 2.990 0.340
211341 2018 CTRM 0.054 0.234
Here fyear is the fiscal year, tic is the company-specific ticker code, rd_tot is the total R&D expenditure for the year, and rd_dep is the applicable depreciation rate.
So far I was able to come up with this:
df['r&d_capital'] = [(df['rd_tot'].iloc[:i] * (1 - df['rd_dep'].iloc[:i] * np.arange(i)[::-1])).sum() for i in range(1, len(df)+1)]
However, the problem is that the code just runs through the entire column without taking into consideration that the R&D expense needs to be capitalised in a company- (or tic-) specific way. I also tried using .groupby(['tic']) but it did not work.
Therefore, I am looking for help to solve this problem so that I can get each year's R&D capitalisation in a COMPANY-SPECIFIC way.
Thank you very much for your help!
This solution breaks the initial dataframe into separate ones (one per 'tic' group) and applies the R&D capital calculation to each of them.
Finally, we re-construct the full dataframe using pd.concat.
import numpy as np
import pandas as pd

tic_dfs = [tic_group for _, tic_group in df.groupby('tic')]
for tic_df in tic_dfs:
    tic_df['r&d_capital'] = [(tic_df['rd_tot'].iloc[:i] * (1 - tic_df['rd_dep'].iloc[:i] * np.arange(i)[::-1])).sum()
                             for i in range(1, len(tic_df) + 1)]
result = pd.concat(tic_dfs).sort_index()
Note: "_" is the mask for the group name e.g. "ACA", "AMFD" etc, while tic_group is the actual data body.
My problem looks like this: A movie theatre is showing a set of n films over a 5 day period. Each movie has a corresponding IMDb score. I want to watch one movie per day over the 5 day period, maximising the cumulative IMDb score while making sure that I watch the best movies first (i.e. Monday's movie will have a higher score than Tuesday's movie, Tuesday's higher than Wednesday's, etc.). An extra constraint is that the theatre doesn't show every movie every day. For example:
Showings:
Monday showings: Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
Tuesday showings: Sleepless in Seattle, Kramer vs Kramer, Jack Reacher
Wednesday showings: The Hobbit, A Star is Born, Joker
etc.
Scores:
Sleepless in Seattle: 7.0
Multiplicity: 10
Jaws: 9.2
The Hobbit: 8.9
A Star is Born: 6.2
Joker: 5.8
Kramer vs Kramer: 8.7
etc.
The way I've thought about this is that each day is a variable: a, b, c, d, e, and we are maximising (a+b+c+d+e). To make sure I watch the movies in descending order of IMDb score, I would add the constraint a > b > c > d > e. However, as far as I can tell, with the linear solver you cannot restrict a variable to a discrete set of values, only to a continuous range (in an ideal world the problem would look like: solve for a, b, c, d, e maximising their cumulative sum, while ensuring a > b > c > d > e, where a can take one of this set of possible values, b one of this set of possible values, etc.). I'm wondering if someone can point me in the right direction as to which OR-Tools solver (or another library) would be best for this problem?
I tried to use the GLOP linear solver but failed: I was expecting it to solve for a, b, c, d and e, but I couldn't express the necessary constraints in that paradigm.
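One direction that fits this kind of discrete choice is OR-Tools' CP-SAT solver rather than GLOP: introduce a boolean "pick" variable per (day, movie) pair, force exactly one pick per day, and constrain the chosen scores to be strictly decreasing. A minimal sketch using the data from the question (scores are scaled by 10 because CP-SAT only handles integers; the Jack Reacher score is made up for illustration):

from ortools.sat.python import cp_model

showings = {0: ['Sleepless in Seattle', 'Multiplicity', 'Jaws', 'The Hobbit'],
            1: ['Sleepless in Seattle', 'Kramer vs Kramer', 'Jack Reacher'],
            2: ['The Hobbit', 'A Star is Born', 'Joker']}
scores = {'Sleepless in Seattle': 70, 'Multiplicity': 100, 'Jaws': 92, 'The Hobbit': 89,
          'A Star is Born': 62, 'Joker': 58, 'Kramer vs Kramer': 87, 'Jack Reacher': 65}

model = cp_model.CpModel()
pick = {(d, m): model.NewBoolVar(f'pick_{d}_{m}') for d, movies in showings.items() for m in movies}
day_score = {d: model.NewIntVar(0, max(scores.values()), f'score_{d}') for d in showings}

for d, movies in showings.items():
    model.Add(sum(pick[d, m] for m in movies) == 1)                        # exactly one movie per day
    model.Add(day_score[d] == sum(scores[m] * pick[d, m] for m in movies))

days = sorted(showings)
for a, b in zip(days, days[1:]):
    model.Add(day_score[a] > day_score[b])                                 # strictly decreasing scores

model.Maximize(sum(day_score.values()))
solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    for d in days:
        chosen = next(m for m in showings[d] if solver.Value(pick[d, m]))
        print(d, chosen, solver.Value(day_score[d]) / 10)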
I am looking at a history of daily min/max temperatures for the past ~40 years for a specific city (the precipitation variable isn't needed).
I imported the CSV file with the aim of calculating an average low and high temperature for each winter (I consider November-March as winter). I suppose a solution could be to loop over the years and create a column holding "winter + year" (for instance, the first of December 2018 falls in winter 2018, and the 23rd of February 2019 also falls in winter 2018). I found plenty of examples of aggregating days/months into seasons, but nothing where the year changes, and that is the bit I'm struggling with.
The structure of the data is the following:
Could anyone point me in the right direction?
Many thanks
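One way to handle the year rollover is to label January-March with the previous year, so both December 2018 and February 2019 end up in "winter 2018". A minimal sketch, assuming a date column called date and temperature columns called tmin and tmax (adjust the names to your file):

import pandas as pd

df = pd.read_csv('temperatures.csv', parse_dates=['date'])
winter = df[df['date'].dt.month.isin([11, 12, 1, 2, 3])].copy()
# January-March belong to the winter that started the previous November
winter['winter_year'] = winter['date'].dt.year.where(winter['date'].dt.month >= 11,
                                                     winter['date'].dt.year - 1)
averages = winter.groupby('winter_year')[['tmin', 'tmax']].mean()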
I have a sample of companies with financial figures which I would like to compare. My data looks like this:
Cusip9 Issuer IPO Year Total Assets Long-Term Debt Sales SIC-Code
1 783755101 Ryerson Tull Inc 1996 9322000.0 2632000.0 633000.0 3661
2 826170102 Siebel Sys Inc 1996 995010.0 0.0 50250.0 2456
3 894363100 Travis Boats & Motors Inc 1996 313500.0 43340.0 23830.0 3661
4 159186105 Channell Commercial Corp 1996 426580.0 3380.0 111100.0 7483
5 742580103 Printware Inc 1996 145750.0 0.0 23830.0 8473
For every company I want to calculate a "similarity score" that indicates how comparable it is to other companies across several financial figures. Comparability should be expressed as the Euclidean distance (the square root of the sum of the squared differences between the financial figures) to the "closest" company. So I need to calculate the distance to every company that fits the conditions, but only keep the smallest score: Assets of Company 1 minus Assets of Company 2, plus Debt of Company 1 minus Debt of Company 2, and so on, each difference squared.
√((x_1-y_1 )^2+(x_2-y_2 )^2)
This should only be computed for companies with the same SIC code, and the IPO year of the comparable companies should be smaller than that of the company for which the similarity score is computed; I only want to compare against companies that are already listed.
Hopefully my point is clear. Does anyone have an idea where I can start? I am just starting with programming and am completely lost with this.
Thanks in advance.
I would first create separate dataframes according to the SIC code, so each new dataframe only contains companies with the same SIC code. Then, for each of those dataframes, double loop over the companies, compute the scores, and store them in a matrix. (You'll end up with a symmetric matrix of scores per group.)
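A rough sketch of that idea, assuming the column names from the question and ignoring the IPO-year condition for brevity:

import numpy as np
import pandas as pd

features = ['Total Assets', 'Long-Term Debt']                 # add 'Sales' etc. as needed
for sic, group in df.groupby('SIC-Code'):
    X = group[features].to_numpy(dtype=float)
    # symmetric matrix of pairwise Euclidean distances within this SIC code
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    matrix = pd.DataFrame(dists, index=group['Issuer'], columns=group['Issuer'])
    print(sic)
    print(matrix)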
Try this. Here I compare against companies with an IPO Year equal to or smaller than the target's (since you didn't give any company record with a strictly smaller IPO year); you can change it to strictly smaller (<) in the Group = df[...] statement.
def closestCompany(companyRecord):
    # same SIC code, already listed (IPO Year <= target), and not the company itself
    Group = df[(df['SIC-Code'] == companyRecord['SIC-Code']) & (df['IPO Year'] <= companyRecord['IPO Year']) & (df['Issuer'] != companyRecord['Issuer'])]
    # Euclidean distance on the chosen figures; keep only the smallest one
    return (((Group['Total Assets'] - companyRecord['Total Assets'])**2 + (Group['Long-Term Debt'] - companyRecord['Long-Term Debt'])**2)**0.5).min()

df['Closest Company Similarity Score'] = df.apply(closestCompany, axis=1)
df
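Note that .min() returns NaN for a company that has no already-listed peer in the same SIC code, so you may want to handle those rows afterwards.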
I'm trying to write a program to give a deeper analysis of stock trading data but am coming up against a wall. I'm pulling all trades for a given timeframe and creating a new CSV file in order to use that file as the input for a predictive neural network.
The dataframe I currently have has three values: (1) the price of the stock; (2) the number of shares sold at that price; and (3) the unix timestamp of that particular trade. I'm having trouble getting an accurate statistical analysis of the data. For example, .median() treats each row as a single observation rather than weighting each price by the number of shares traded at it (the volume column).
As an example, this is the partial trading history for one of the stocks that I'm trying to analyze.
0 227.60 40 1570699811183
1 227.40 27 1570699821641
2 227.59 50 1570699919891
3 227.60 10 1570699919891
4 227.36 100 1570699967691
5 227.35 150 1570699967691 . . .
To better understand the issue, I've also grouped it by price and summed the other columns with groupby('p').sum(). I realize this means the timestamp is useless, but it makes visualization easier.
227.22 2 1570700275307
227.23 100 1570699972526
227.25 100 4712101657427
227.30 105 4712101371199
227.33 50 1570700574172
227.35 4008 40838209836171 . . .
Is there any way to use the trade volume column to weight a statistical analysis of the price column? I've considered creating a new dataframe where each price is repeated as many times as it was traded, but I'm not sure how to do this.
Thanks in advance for any help!
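For the weighted statistics themselves, one option is np.average for the mean and repeating each price by its volume for everything else. A minimal sketch, assuming the price column is named p (as in your groupby('p')) and the volume column is named v:

import numpy as np

vwap = np.average(df['p'], weights=df['v'])      # volume-weighted average price

# repeat each price once per share traded at it, so any ordinary statistic becomes volume-weighted
expanded = df['p'].repeat(df['v'])
weighted_median = expanded.median()
weighted_std = expanded.std()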
I know capturing a date is usually a simple enough RegEx task, but I need this to be so specific that I'm struggling.
1 SUSTAINABLE HARVEST SECTOR | QUOTA LISTING JUN 11 2013
2 QUOTA
3 TRADE ID AVAILABLE STOCK AMOUNT PRICE
4 130196 COD GBW 10,000 $0.60
5 130158 HADDOCK GBE 300 $0.60
That is what the beginning of my Excel spreadsheet looks like, and what hundreds more look like, with the date and the data changing but the format staying the same.
My thought was to capture everything that follows LISTING up until the newline, then place the non-number part (JUN) in my Trade Month column, the first captured number (11) in my Trade Day column, and the last captured number (2013) in my Trade Year column, but I can't figure out how. Here's what I have so far:
pattern = re.compile(r'Listing(.+?)(?=\n)')
df = pd.read_excel(file_path)
print("df is:", df)
a = pattern.findall(str(df))
print("a:", a)
but that returns nothing. Any help solving this problem, which I know is probably super simple, is appreciated. Thanks.
Make your expression case-insensitive (i.e. LISTING != Listing):
pattern = re.compile(r'Listing(.+?)(?=\n)', re.IGNORECASE)
Also, the lookahead for a newline in this situation is equivalent to the simpler expression:
pattern = re.compile(r'Listing(.+)', re.IGNORECASE)
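As a rough sketch of getting the three pieces into separate columns, assuming the heading text lands in the first column when the sheet is read with header=None (adjust to your layout):

import re
import pandas as pd

df = pd.read_excel(file_path, header=None)
text = ' '.join(df[0].astype(str))                # flatten the first column into one string
m = re.search(r'LISTING\s+([A-Za-z]+)\s+(\d{1,2})\s+(\d{4})', text, re.IGNORECASE)
if m:
    trade_month, trade_day, trade_year = m.groups()   # e.g. 'JUN', '11', '2013'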