Python Pandas DataFrame - Creating Change Column

I have a data frame with these columns:
timestamp,stockname,total volume traded
There are multiple stock names at each timestamp:
11:00,A,100
11:00,B,500
11:01,A,150
11:01,B,600
11:02,A,200
11:02,B,650
I want to create a ChangeInVol column so that each stock carries its own difference, like:
timestamp,stock,total volume,change in volume
11:00,A,100,NaN
11:00,B,500,NaN
11:01,A,150,50
11:01,B,600,100
11:02,A,200,50
11:02,B,650,50
If it were a single stock, I could have done
df['ChangeVol'] = df['TotalVol'] - df['TotalVol'].shift(1)
but there are multiple stocks

You need sort_values + DataFrameGroupBy.diff:
# sort first if the rows are not already ordered by timestamp
df = df.sort_values(['timestamp','stockname'])
df['change in volume'] = df.groupby('stockname')['total volume traded'].diff()
print(df)
timestamp stockname total volume traded change in volume
0 11:00 A 100 NaN
1 11:00 B 500 NaN
2 11:01 A 150 50.0
3 11:01 B 600 100.0
4 11:02 A 200 50.0
5 11:02 B 650 50.0
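For reference, a minimal self-contained sketch that rebuilds the sample data above and applies the same sort_values/groupby/diff:
import pandas as pd
# Sample data copied from the question
df = pd.DataFrame({
    'timestamp': ['11:00', '11:00', '11:01', '11:01', '11:02', '11:02'],
    'stockname': ['A', 'B', 'A', 'B', 'A', 'B'],
    'total volume traded': [100, 500, 150, 600, 200, 650],
})
df = df.sort_values(['timestamp', 'stockname'])
df['change in volume'] = df.groupby('stockname')['total volume traded'].diff()
print(df)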

Related

How to join different dataframes with specific criteria?

In my MySQL database stocks, I have 5 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in MySQL first, or should I first extract each table as a dataframe and then join with pandas? How should it be done? I don't know the code either.
This is how I want to display: https://www.dropbox.com/s/uv1iik6m0u23gxp/ExpectedoutputFULL.csv?dl=0
So each ticker is a row that contains all of the specific columns from my tables.
Additional info:
I only need the most recent 8 quarters for quarterly and 5 years for yearly to be displayed
The exact dates of the quarterly data may differ between tickers. If done by hand, the most recent eight quarters can easily be copied and pasted into the respective columns, but I have no idea how to programmatically determine which quarter a row belongs to and show it in the same column as in my example output. (I use the terms q1 through q8 simply as column names to display, so if my most recent data is May 30, q8 is not necessarily the final quarter of the second year.)
If the most recent quarter or year for one ticker is not available (as in "ADUS" in the example), but it is available for other tickers such as "BA" in the example, simply leave that one blank.
1st table company_info: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data:
2nd table income_statement_q: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data:
3rd table income_statement_y: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data:
4th table earnings_q: https://www.dropbox.com/s/bufh7c2jq7veie9/earnings_q.csv?dl=0 contains quarterly data:
5th table earnings_y: https://www.dropbox.com/s/li0r5n7mwpq28as/earnings_y.csv?dl=0 contains yearly data:
You can use:
# Convert as datetime64 if necessary
df2['date'] = pd.to_datetime(df2['date']) # quarterly
df3['date'] = pd.to_datetime(df3['date']) # yearly
# Realign dates to the period end, e.g. 2022-06-30 -> 2022-12-31 for yearly
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)
# Get end dates
qmax = df2['date'].max()
ymax = df3['date'].max()
# Create date range (8 periods for Q, 5 periods for Y)
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')
# Filter and reshape dataframes
qdf = (df2[df2['date'].isin(qdti)]
         .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
         .pivot(index='ticker', columns='date', values='netIncome'))
ydf = (df3[df3['date'].isin(ydti)]
         .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
         .pivot(index='ticker', columns='date', values='netIncome'))
# Create the expected dataframe
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()
Output:
>>> out
ticker industry sector pe roe shares ... 2022Q4 2018 2019 2020 2021 2022
0 ADUS Health Care Providers & Services Health Care 38.06 7.56 16110400 ... NaN 1.737700e+07 2.581100e+07 3.313300e+07 4.512600e+07 NaN
1 BA Aerospace & Defense Industrials NaN 0.00 598240000 ... -663000000.0 1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2 CAH Health Care Providers & Services Health Care NaN 0.00 257639000 ... -130000000.0 2.590000e+08 1.365000e+09 -3.691000e+09 6.120000e+08 -9.320000e+08
3 CVRX Health Care Equipment & Supplies Health Care 0.26 -32.50 20633700 ... -10536000.0 NaN NaN NaN -4.307800e+07 -4.142800e+07
4 IMCR Biotechnology Health Care NaN -22.30 47905000 ... NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08 NaN
5 NVEC Semiconductors & Semiconductor Equipment Information Technology 20.09 28.10 4830800 ... 4231324.0 1.391267e+07 1.450794e+07 1.452664e+07 1.169438e+07 1.450750e+07
6 PEPG Biotechnology Health Care NaN -36.80 23631900 ... NaN NaN NaN -1.889000e+06 -2.728100e+07 NaN
7 VRDN Biotechnology Health Care NaN -36.80 40248200 ... NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07 NaN
[8 rows x 20 columns]
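The snippet above assumes df1, df2 and df3 (company_info, income_statement_q and income_statement_y) are already in memory. If the tables are still in MySQL, one way to pull them into dataframes is pandas.read_sql with a SQLAlchemy engine; the connection string below is a placeholder, and the two earnings tables can be reshaped the same way as df2/df3 if you need them:
import pandas as pd
from sqlalchemy import create_engine
# Placeholder credentials/host; adjust to your MySQL setup
engine = create_engine("mysql+pymysql://user:password@localhost/stocks")
df1 = pd.read_sql("SELECT * FROM company_info", engine)
df2 = pd.read_sql("SELECT * FROM income_statement_q", engine)
df3 = pd.read_sql("SELECT * FROM income_statement_y", engine)
df4 = pd.read_sql("SELECT * FROM earnings_q", engine)
df5 = pd.read_sql("SELECT * FROM earnings_y", engine)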

Yearly data to Daily data in Python

df: (DataFrame)
Open High Close Volume
2020/1/1 1 2 3 323232
2020/1/2 2 3 4 321321
....
2020/12/31 4 5 6 123213
....
2021
The output I need is (Graph No.1):
Open High Close Volume Year_Sum_Volume
2020/1/1 1 2 3 323232 (323232 + 321321 +....+ 123213)
2020/1/2 2 3 4 321321 (323232 + 321321 +....+ 123213)
....
2020/12/31 4 5 6 123213 (323232 + 321321 +....+ 123213)
....
2021 (x+x+x.....x)
I want the sum of Volume for each year (Year_Sum_Volume is the total volume of that year).
This is the code I tried to calculate the sum of volume per year, but how can I add this data back to the daily data? I want to add Year_Sum_Volume to df, like Graph No.1:
df.resample('Y', on='Date')['Volume'].sum()
Thank you for answering.
I believe groupby.sum() and merge should be your friends
import pandas as pd
df = pd.DataFrame({"date":['2021-12-30', '2021-12-31', '2022-01-01'], "a":[1,2.1,3.2]})
df.date = pd.to_datetime(df.date)
df["year"] = df.date.dt.year
df_sums = df.groupby("year")[["a"]].sum().rename(columns={"a": "a_sum"})  # sum only "a" so the date column is not aggregated
df = df.merge(df_sums, right_index=True, left_on="year")
which gives:
                 date    a  year  a_sum
0 2021-12-30 00:00:00  1.0  2021    3.1
1 2021-12-31 00:00:00  2.1  2021    3.1
2 2022-01-01 00:00:00  3.2  2022    3.2
Based on your output, Year_Sum_Volume is the same value for every row and can be calculated using df['Volume'].sum().
Then you join a column built by repeating that scalar:
df.join(pd.DataFrame({'Year_Sum_Volume': [your_sum_val] * len(df['Volume'])}))
Try the code below (after converting the date column with pd.to_datetime):
df.assign(Year_Sum_Volume = df.groupby(df['date'].dt.year)['a'].transform('sum'))
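Using the question's own column names, a minimal sketch of the same transform approach (assuming the dates form the index, as in the sample frame above):
import pandas as pd
# Small stand-in for the question's daily data
df = pd.DataFrame(
    {'Open': [1, 2, 4], 'High': [2, 3, 5], 'Close': [3, 4, 6],
     'Volume': [323232, 321321, 123213]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-12-31']))
# Broadcast the yearly total back onto every daily row
df['Year_Sum_Volume'] = df.groupby(df.index.year)['Volume'].transform('sum')
print(df)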

Appending from one dataframe to another dataframe (with different sizes) when two values match

I have two pandas dataframes whose values partly overlap, and I'd like to append to the original dataframe when the time_hour and origin values are the same.
Here is my original dataframe called flightsDF which is very long, it has the format:
year month origin dep_time dep_delay arr_time time_hour
2001 01 EWR 15:00 15 17:00 2013-01-01T06:00:00Z
I have another dataframe weatherDF (much shorter than flightsDF) with some extra information for some of the values in the original dataframe:
origin temp dewp humid wind_dir wind_speed precip visib time_hour
0 EWR 39.02 26.06 59.37 270.0 10.35702 0.0 10.0 2013-01-01T06:00:00Z
1 EWR 39.02 26.96 61.63 250.0 8.05546 0.0 10.0 2013-01-01T07:00:00Z
2 LGH 39.02 28.04 64.43 240.0 11.50780 0.0 10.0 2013-01-01T08:00:00Z
I'd like to append the extra information (temp, dewp, humid,...) from weatherDF to the original data frame if both the time_hour and origin match with the original dataframe flightsDF
I have tried
for x in weatherDF:
    if x['time_hour'] == flightsDF['time_hour'] & flightsDF['origin'] == 'EWR':
        flights_df.append(x)
and some other similar approaches, but I can't seem to get it working. Can anyone help?
I am planning to append all the corresponding values and then drop any rows from the combined dataframe that don't have those values.
You are probably looking for pd.merge:
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='left')
print(out)
# Output
year month origin dep_time dep_delay arr_time time_hour temp dewp humid wind_dir wind_speed precip visib
0 2001 1 EWR 15:00 15 17:00 2013-01-01T06:00:00Z 39.02 26.06 59.37 270.0 10.35702 0.0 10.0
If I'm right, take the time to read Pandas Merging 101.
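Since you plan to drop rows from the combined dataframe that have no weather values, two hedged options are an inner merge, or dropping missing weather rows after the left merge:
# Option 1: keep only flights with a matching weather row
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='inner')
# Option 2: left merge as above, then drop rows where the weather columns are missing
out = flightsDF.merge(weatherDF, on=['origin', 'time_hour'], how='left')
out = out.dropna(subset=['temp', 'dewp', 'humid'])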

Need to compare two Dataframes iteratively based on year using Pandas

I have 2 DataFrames which need to be compared iteratively, and the mismatched rows have to be stored in a CSV. Since the data has historical dates, the comparison needs to be performed by year. How can this be achieved in pandas?
product_1 price_1 Date of purchase
0 computer 1200 2022-01-02
1 monitor 800 2022-01-03
2 printer 200 2022-01-04
3 desk 350 2022-01-05
product_2 price_2 Date of purchase
0 computer 900 2022-01-02
1 monitor 800 2022-01-03
2 printer 300 2022-01-04
3 desk 350 2022-01-05
I would use a split/merge/where
df1['Date of purchase'] = df1['Date of purchase'].apply(lambda x : x.split('-')[0])
df2['Date of purchase'] = df2['Date of purchase'].apply(lambda x : x.split('-')[0])
From there you can merge the two dataframes using a join or merge.
After that you can use np.where() to flag whether the comparison columns match:
merge_df['Check'] = np.where(merge_df['comp_column'] != merge_df['another_comp_column'], False, True)
From there you can just look for the rows where the comparison columns didn't match:
merge_df.loc[merge_df['Check'] == False]
First, let's solve the problem for any group of dates/years. You could merge your data using the date and product names:
df = df1.merge(df2, left_on=["Date of purchase", "product_1"], right_on=["Date of purchase", "product_2"])
# Bonus points if you rename "product_2" and only use `on` instead of `left_on` and `right_on`
After that, you could simply use .loc to find the rows where prices do not match:
df.loc[df["price_1"] != df["price_2"]]
product_1 price_1 Date of purchase product_2 price_2
0 computer 1200 2022-01-02 computer 900
2 printer 200 2022-01-04 printer 300
Now, you could process each year by iterating a list of years, querying only the data from that year on df1 and df2 and then using the above procedure to find the price mismatches:
# List available years (assumes "Date of purchase" is already datetime; apply pd.to_datetime first if it is a string)
years = pd.concat([df1["Date of purchase"].dt.year, df2["Date of purchase"].dt.year], axis=0).unique()
# Rename columns for those bonus points
df1 = df1.rename(columns={"product_1": "product"})
df2 = df2.rename(columns={"product_2": "product"})
# Accumulate your rows in a new dataframe (starting from a list)
output_rows = list()
for year in years:
    # Find data for this `year`
    df1_year = df1.loc[df1["Date of purchase"].dt.year == year]
    df2_year = df2.loc[df2["Date of purchase"].dt.year == year]
    # Apply the procedure described at the beginning
    df = df1_year.merge(df2_year, on=["Date of purchase", "product"])
    # Find rows where prices do not match
    mismatch_rows = df.loc[df["price_1"] != df["price_2"]]
    output_rows.append(mismatch_rows)
# Now, transform your rows into a single dataframe
output_df = pd.concat(output_rows)
Output:
product price_1 Date of purchase price_2
0 computer 1200 2022-01-02 900
2 printer 200 2022-01-04 300
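To store the mismatched rows in a CSV, as the question asks, a short sketch (the file name is a placeholder; alternatively call to_csv inside the loop above to write one file per year):
output_df.to_csv("price_mismatches.csv", index=False)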

pandas add new column based on iteration through rows

I have a list of transactions for a business.
Example dataframe:
userid date amt start_of_day_balance
123 2017-01-04 10 100.0
123 2017-01-05 20 NaN
123 2017-01-02 30 NaN
123 2017-01-04 40 100.0
The start of day balance is not always retrieved (in that case we receive a NaN). But from the first day for which we know the start of day balance, we can accurately estimate the balance after each subsequent transaction.
In this example the new column should look as follows:
userid date amt start_of_day_balance calculated_balance
123 2017-01-04 10 100.0 110
123 2017-01-05 20 NaN 170
123 2017-01-02 30 NaN NaN
123 2017-01-04 40 100.0 150
Note that there is no way to tell the exact order of the transactions that occurred on the same day - I'm happy to overlook that in this case.
My question is how to create this new column. Something like:
df['calculated_balance'] = df.sort_values(['date']).groupby(['userid'])\
['amt'].cumsum() + df['start_of_day_balance'].min()
wouldn't work because of the NaNs.
I also don't want to filter out any transactions that happened before the first recorded start of day balance.
I came up with a solution that seems to work. I'm not sure how elegant it is.
def calc_estimated_balance(g):
    # find the first date which has a start of day balance
    first_date_with_bal = g.loc[g['start_of_day_balance'].first_valid_index(), 'date']
    # only calculate the balance if date is greater than or equal to the date of the first balance
    g['calculated_balance'] = g[g['date'] >= first_date_with_bal]['amt'].cumsum().add(g['start_of_day_balance'].min())
    return g
df = df.sort_values(['date']).groupby(['userid']).apply(calc_estimated_balance)
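For reference, a minimal self-contained sketch that rebuilds the sample data from the question and runs the function above (recent pandas versions may additionally include the group key in the result index):
import pandas as pd
# Sample data copied from the question
df = pd.DataFrame({
    'userid': [123, 123, 123, 123],
    'date': pd.to_datetime(['2017-01-04', '2017-01-05', '2017-01-02', '2017-01-04']),
    'amt': [10, 20, 30, 40],
    'start_of_day_balance': [100.0, None, None, 100.0],
})
df = df.sort_values(['date']).groupby(['userid']).apply(calc_estimated_balance)
print(df)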
