How to join different dataframes with specific criteria? - python

In my MySQL database stocks, I have 5 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in MySQL first, or should I extract each table as a dataframe and then join with pandas? How should it be done? I don't know how to write the code either.
This is how I want to display: https://www.dropbox.com/s/uv1iik6m0u23gxp/ExpectedoutputFULL.csv?dl=0
So each ticker is a row that contains all of the specific columns from my tables.
Additional info:
I only need the most recent 8 quarters of quarterly data and the most recent 5 years of yearly data to be displayed
The exact dates of quarterly data may differ between tickers. If done by hand, the most recent eight quarters can easily be copied and pasted into the respective columns, but I have no idea how to have a computer determine which quarter a date belongs to and display it in the correct column, as in my example output. (I use the terms q1 through q8 simply as column names for display, so if my most recent data is May 30, q8 is not necessarily the final quarter of the second year.)
If the most recent quarter or year is not available for one ticker (as with "ADUS" in the example) but is available for others (such as "BA"), simply leave that cell blank.
1st table company_info: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data.
2nd table income_statement_q: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data.
3rd table income_statement_y: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data.
4th table earnings_q: https://www.dropbox.com/s/bufh7c2jq7veie9/earnings_q.csv?dl=0 contains quarterly data.
5th table earnings_y: https://www.dropbox.com/s/li0r5n7mwpq28as/earnings_y.csv?dl=0 contains yearly data.

You can use:
import pandas as pd

# df1 = company_info, df2 = income_statement_q, df3 = income_statement_y

# Convert to datetime64 if necessary
df2['date'] = pd.to_datetime(df2['date'])  # quarterly
df3['date'] = pd.to_datetime(df3['date'])  # yearly

# Realign dates to period ends: e.g. 2022-06-30 -> 2022-12-31 for yearly
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)

# Get end dates
qmax = df2['date'].max()
ymax = df3['date'].max()

# Create date ranges (8 periods for Q, 5 periods for Y)
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')

# Filter and reshape dataframes
qdf = (df2[df2['date'].isin(qdti)]
          .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
          .pivot(index='ticker', columns='date', values='netIncome'))
ydf = (df3[df3['date'].isin(ydti)]
          .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
          .pivot(index='ticker', columns='date', values='netIncome'))

# Create the expected dataframe
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()
Output:
>>> out
ticker industry sector pe roe shares ... 2022Q4 2018 2019 2020 2021 2022
0 ADUS Health Care Providers & Services Health Care 38.06 7.56 16110400 ... NaN 1.737700e+07 2.581100e+07 3.313300e+07 4.512600e+07 NaN
1 BA Aerospace & Defense Industrials NaN 0.00 598240000 ... -663000000.0 1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2 CAH Health Care Providers & Services Health Care NaN 0.00 257639000 ... -130000000.0 2.590000e+08 1.365000e+09 -3.691000e+09 6.120000e+08 -9.320000e+08
3 CVRX Health Care Equipment & Supplies Health Care 0.26 -32.50 20633700 ... -10536000.0 NaN NaN NaN -4.307800e+07 -4.142800e+07
4 IMCR Biotechnology Health Care NaN -22.30 47905000 ... NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08 NaN
5 NVEC Semiconductors & Semiconductor Equipment Information Technology 20.09 28.10 4830800 ... 4231324.0 1.391267e+07 1.450794e+07 1.452664e+07 1.169438e+07 1.450750e+07
6 PEPG Biotechnology Health Care NaN -36.80 23631900 ... NaN NaN NaN -1.889000e+06 -2.728100e+07 NaN
7 VRDN Biotechnology Health Care NaN -36.80 40248200 ... NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07 NaN
[8 rows x 20 columns]
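
As for whether to join in MySQL first or in pandas: either works, but pulling each table into its own dataframe keeps all the reshaping logic above in one place. A minimal sketch of loading the five tables, assuming SQLAlchemy with a pymysql driver (the connection string is a placeholder):

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string - adjust user, password, host and database name
engine = create_engine('mysql+pymysql://user:password@localhost/stocks')
df1 = pd.read_sql('SELECT * FROM company_info', engine)
df2 = pd.read_sql('SELECT * FROM income_statement_q', engine)
df3 = pd.read_sql('SELECT * FROM income_statement_y', engine)
df4 = pd.read_sql('SELECT * FROM earnings_q', engine)  # earnings tables, if also needed
df5 = pd.read_sql('SELECT * FROM earnings_y', engine)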

Related

Calculations on a pandas DataFrame column conditional on another column

I have noticed several 'set value of a new column based on the value of another'-type questions, but as far as I can tell, none of them address dividing values in the same column based on conditions set by another column.
The data I have is as the table below, minus the column (variable) 'healthpertotal'.
It shows (in the column 'function') the amount of government spending (aka expenditure) on
a) health (column 'value'), and
b) its total spending (same column 'value'),
along with the associated year of that spending (column 'year').
I want to make a new column that shows the percent of government health spending over its total spending, for a given year, as shown below in the column 'healthpertotal'.
So for instance, in 1995, the value of this variable is (42587(health spending amount)/326420(total spending amount))*100=13.05.
As for the rows showing total spending, 'healthpertotal' could be 'missing', 1, 'not applicable', or the like; I am OK with any of these options.
How would I set up this new column 'healthpertotal' using python?
A proposed DataFrame for what I would like to achieve follows, along with the code that might set it up (artificially 'forced' in the case of the final variable 'healthpertotal'):
import numpy as np
import pandas as pd

data = {'function': ['Health'] * 3 + ['Total'] * 3,
        'year': [1995, 1996, 1997, 1995, 1996, 1997],
        'value': [42587, 44209, 44472, 326420, 333637, 340252],
        'healthpertotal': [13.05, 13.25, 13.07] + [np.nan] * 3}
df = pd.DataFrame(data)
print(df)
Expected outcome:
function year value healthpertotal
0 Health 1995 42587 13.05
1 Health 1996 44209 13.25
2 Health 1997 44472 13.07
3 Total 1995 326420 NaN
4 Total 1996 333637 NaN
5 Total 1997 340252 NaN
You could use groupby + transform('last') to broadcast the total values across the DataFrame; then divide "value" by them using rdiv; then replace 100 with NaN (assuming health spending is never exactly 100%):
df['healthpertotal'] = df.groupby('year')['value'].transform('last').rdiv(df['value']).mul(100).replace(100, np.nan)
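Broken into steps, the same computation reads (a sketch of the identical logic, not a different method):

# total spending per year, broadcast to every row (the Total row is last within each year)
totals = df.groupby('year')['value'].transform('last')
# percent of total; the Total rows divide by themselves and give exactly 100
pct = df['value'] / totals * 100
df['healthpertotal'] = pct.replace(100, np.nan)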
We could also use merge + concat (calculating the percentage in between the two operations):
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year')
tmp['healthpertotal'] = tmp['value_x'] / tmp['value_y'] * 100
msk = tmp.columns.str.contains('_y')
tmp1 = tmp.loc[:, ~msk]
tmp2 = tmp[tmp.columns[msk].tolist() + ['year']]
pd.concat((tmp1.set_axis(tmp1.columns.map(lambda x: x.split('_')[0]), axis=1),
           tmp2.set_axis(tmp2.columns.map(lambda x: x.split('_')[0]), axis=1)))
We could also use merge + wide_to_long (again calculating the percentage in between) and then mask the duplicates:
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year', suffixes=('0','1'))
tmp['healthpertotal'] = tmp['value0'] / tmp['value1'] * 100
out = pd.wide_to_long(tmp, stubnames=['function', 'value'], i=['year','healthpertotal'], j='').droplevel(-1).reset_index()
out['healthpertotal'] = out['healthpertotal'].mask(out['healthpertotal'].duplicated())
Output:
function year value healthpertotal
0 Health 1995 42587 13.046688
1 Health 1996 44209 13.250629
2 Health 1997 44472 13.070313
3 Total 1995 326420 NaN
4 Total 1996 333637 NaN
5 Total 1997 340252 NaN

Python / Pandas: Fill NaN with order - linear interpolation --> ffill --> bfill

I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the NaN values in a specific order: linearly interpolate first, then forward fill, and then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]

def ffbf(x):
    return x.ffill().bfill()

group_with = ['company']
for x in cl_data[f_2_impute]:
    cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally I want a function that first tries to linearly interpolate the missing values, then forward fills them, and then backward fills them.
Any quick ways of achieving this? Thank you in advance.
I believe you first need to convert the column to floats if it contains thousands separators, either while reading the file:
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add DataFrame.interpolate:
def ffbf(x):
    return x.interpolate().ffill().bfill()

How to find a matching value for previous year in YYYY-Qx format?

I have this dataset with ['Sales'] values grouped by ['Fiscal Quarter'] in YYYY-Qx format. I want to compare the value of a quarter with the same quarter from the previous year (2019-Q2 with 2018-Q2, for example).
I'm doing this the manual way, creating a new column Prev FY and shifting values up 4 times to get to the matching value and it's working fine.
x = 4
df['Prev FY'] = df['Sales'].shift(x)
Sometimes some quarter data is missing, so shifting 4 times no longer does the job. I want to improve the code to automatically find the correct row using the ['Fiscal Quarter'] column.
Any help on this issue?
You need a PeriodIndex and then the freq parameter in Series.shift:
df = pd.DataFrame({'Fiscal Quarter': ['2017-Q2', '2018-Q2', '2019-Q1', '2019-Q2'],
                   'Sales': [10, 20, 30, 40]})
df['Fiscal Quarter'] = pd.to_datetime(df['Fiscal Quarter']).dt.to_period('Q')
df = df.set_index('Fiscal Quarter')
df['Prev FY'] = df['Sales'].shift(4, freq='Q')
print (df)
Sales Prev FY
Fiscal Quarter
2017Q2 10 NaN
2018Q2 20 10.0
2019Q1 30 NaN
2019Q2 40 20.0
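
With the Prev FY column in place, a year-over-year comparison is one step away (a sketch extending the frame above):

df['YoY %'] = (df['Sales'] / df['Prev FY'] - 1) * 100  # NaN where the prior-year quarter is missing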

Python Pandas DataFrame - Creating Change Column

I have a data frame with these column names:
timestamp,stockname,total volume traded
There are multiple stock names at each timestamp:
11:00,A,100
11:00,B,500
11:01,A,150
11:01,B,600
11:02,A,200
11:02,B,650
I want to create a ChangeInVol column such that each stock carries its own difference, like:
timestamp,stock,total volume,change in volume
11:00,A,100,NaN
11:00,B,500,NaN
11:01,A,150,50
11:01,B,600,100
11:02,A,200,50
11:02,B,650,50
If it were a single stock, I could have done
df['ChangeVol'] = df['TotalVol'] - df['TotalVol'].shift(1)
but there are multiple stocks
Need sort_values + DataFrameGroupBy.diff:
# sort first if the rows are not already ordered
df = df.sort_values(['timestamp','stockname'])
df['change in volume'] = df.groupby('stockname')['total volume traded'].diff()
print (df)
timestamp stockname total volume traded change in volume
0 11:00 A 100 NaN
1 11:00 B 500 NaN
2 11:01 A 150 50.0
3 11:01 B 600 100.0
4 11:02 A 200 50.0
5 11:02 B 650 50.0
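
If a percentage change per stock is wanted instead of an absolute difference, the same groupby works with pct_change (a sketch):

df['pct change in volume'] = df.groupby('stockname')['total volume traded'].pct_change() * 100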

Pandas changing cell values based on another cell

I am currently formatting data from two different datasets.
One dataset reflects an observation count of people in a room on an hourly basis; the second one is a count of people based on wifi logs generated at 5-minute intervals.
After merging these two dataframes into one, I run into the issue where each hourly row (such as "10:00:00") has the data from the original set, but the 5-minute rows (like "10:47:14") do not include that data.
Here is how the merge dataframe looks:
room time con auth capacity % Count module size
0 B002 Mon Nov 02 10:32:06 23 23 90 NaN NaN NaN NaN
1 B002 Mon Nov 02 10:37:10 25 25 90 NaN NaN NaN NaN
12527 B002 Mon Nov 02 10:00:00 NaN NaN 90 50% 45.0 COMP30520 60
12528 B002 Mon Nov 02 11:00:00 NaN NaN 90 0% 0.0 COMP30520 60
Is there a way for me to go through the dataframe and find all the information regarding "occupancy", "occupancyCount", "module" and "size" from 11:00:00 and write it to all the cells from the same day whose time is between 10:00:00 and 10:59:59?
That would allow me to have all the information on each row and then allow me to gather the min(), max() and median() based on 'day' and 'hour'.
To answer the comment asking for the original dataframes, here they are:
first dataframe:
time room module size
0 Mon Nov 02 09:00:00 B002 COMP30190 29
1 Mon Nov 02 10:00:00 B002 COMP40660 53
second dataframe:
room time con auth capacity % Count
0 B002 Mon Nov 02 20:32:06 0 0 NaN NaN NaN
1 B002 Mon Nov 02 20:37:10 0 0 NaN NaN NaN
2 B002 Mon Nov 02 20:42:12 0 0 NaN NaN NaN
12797 B008 Wed Nov 11 13:00:00 NaN NaN 40 25 10.0
12798 B008 Wed Nov 11 14:00:00 NaN NaN 40 50 20.0
12799 B008 Wed Nov 11 15:00:00 NaN NaN 40 25 10.0
this is how these two dataframes were merged together:
DFinal = pd.merge(DF, d3, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)
Any help with this would be greatly appreciated.
Thanks a lot,
-Romain
Somewhere to start:
b = df[(df['time'] > X) & (df['time'] < Y)]
which selects all the rows with times between X and Y.
Then
df.loc[df['column_name'].isin(b)]
gives you the rows you want (i.e. between X and Y), and you can assign to them as you see fit.
I think you'll want to assign the values of row X to the selected rows?
Hope that helps.
Note that these functions are cut-and-paste jobs from
[1] Filter dataframe rows if value in column is in a set list of values
[2] Select rows from a DataFrame based on values in a column in pandas
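
As a concrete sketch of that selection-and-assign idea on the merged frame (the timestamp literals and the choice of columns are assumptions for illustration):

# the hourly row whose metadata should be propagated
hour_row = df.loc[df['time'] == 'Mon Nov 02 10:00:00'].iloc[0]
# every 5-minute row of the same hour (plain string comparison works here
# because the day prefix is identical)
msk = (df['time'] > 'Mon Nov 02 10:00:00') & (df['time'] < 'Mon Nov 02 11:00:00')
# broadcast the hourly metadata onto those rows
df.loc[msk, ['module', 'size']] = hour_row[['module', 'size']].values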
If I understood correctly, you want to fill all the missing values in your merged dataframe with the corresponding closest data point available in the given hour. I did something similar in essence in the past using a variant of pandas.cut for timeseries, but I can't seem to find it; it wasn't really nice anyway.
While I'm not entirely sure, the fillna method of the pandas dataframe might be what you want.
Let your two dataframes be named df_hour and df_cinq; you merged them like this:
df = pd.merge(df_hour, df_cinq, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)
Then you change your index to time and sort it:
df.set_index('time',inplace=True)
df.sort_index(inplace=True)
The fillna method has an option called 'method' that controls the fill direction:
Method            Action
pad / ffill       Fill values forward
bfill / backfill  Fill values backward
(The 'nearest' option sometimes listed alongside these belongs to reindex; fillna itself only fills forward or backward.)
Using it to do forward filling (i.e. missing values are filled with the preceding value in the frame):
df.fillna(method='ffill', inplace=True)
The problem with this on your data is that all of the missing data in the non-working hours belonging to the 5-minute observations will be filled with outdated data points. You can use the limit option to cap the number of consecutive data points filled, but I don't know whether that is useful to you.
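For instance, a sketch of capping the fill (the limit value is arbitrary):

# fill at most 2 consecutive NaNs forward; anything beyond stays NaN
df.fillna(method='ffill', limit=2, inplace=True)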
Here's a complete script I wrote as a toy example:
import pandas as pd
import random

hourly_count = 8  # work hours
cinq_count = 24 * 12  # 1 day
hour_rng = pd.date_range('1/1/2016-09:00:00', periods=hourly_count, freq='H')
cinq_rng = pd.date_range('1/1/2016-00:02:53', periods=cinq_count, freq='5min')
roomz = 'room0 room1 secretroom'.split()
hourlydata = {'col1': [], 'col2': [], 'room': []}
for i in range(hourly_count):
    hourlydata['room'].append(random.choice(roomz))
    hourlydata['col1'].append(random.random())
    hourlydata['col2'].append(random.randint(0, 100))
cinqdata = {'col3': [], 'col4': [], 'room': []}
frts = 'apples oranges peaches grapefruits whatmore'.split()
vgtbls = 'onion1 onion2 onion3 onion4 onion5 onion0'.split()
for i in range(cinq_count):
    cinqdata['room'].append(random.choice(roomz))
    cinqdata['col3'].append(random.choice(frts))
    cinqdata['col4'].append(random.choice(vgtbls))
hourlydf = pd.DataFrame(hourlydata)
hourlydf['time'] = hour_rng
cinqdf = pd.DataFrame(cinqdata)
cinqdf['time'] = cinq_rng
df = pd.merge(hourlydf, cinqdf, left_on=['room', 'time'], right_on=['room', 'time'],
              how='outer', left_index=False, right_index=False)
df.set_index('time', inplace=True)
df.sort_index(inplace=True)
df.fillna(method='ffill', inplace=True)
print(df['2016-1-1 09:00:00':'2016-1-1 17:00:00'])
Actually I was able to fix this by:
First, using partition on the "time" feature to generate two additional columns: one for the day shown in "time" and one for the hour.
I used lambda functions to get these columns:
df['date'] = df['date'].map(lambda x: x[10:-6])
df['time'] = df['time'].map(lambda x: x[8:-8])
Based on these two new columns, I modified the way the dataframes were merged; here is the code I used to fix it:
dataframeFinal = pd.merge(dataframe1, dataframe2, left_on=["room", "date", "hour"],
                          right_on=["room", "date", "hour"], how="outer",
                          left_index=False, right_index=False, copy=False)
After this merge I ended up with duplicate time columns ('time_x' and 'time_y').
So I replaced the NaN values as follows:
dataframeFinal.time_y.fillna(dataframeFinal.time_x, inplace=True)
Now the column "time_y" contains all the time values, with no more NaNs.
I do not need the "time_x" column, so I drop it from the dataframe:
dataframeFinal = dataframeFinal.drop('time_x', axis=1)
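
The fill-then-drop pair can also be collapsed into one line (a sketch of the same idea):

# take time_y, fall back to time_x where it is NaN, and remove both source columns
dataframeFinal['time'] = dataframeFinal.pop('time_y').fillna(dataframeFinal.pop('time_x'))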
