I've been able to extract data from two separate xlsx files and combine them into a single xlsx sheet using pandas.
I now have a table that looks like this.
Home Start Date Gross Earning Tax Gross Rental Commission Net Rental
3157 2020-03-26 00:00:00 -268.8 -28.8 -383.8 -36 -338.66
3157 2020-03-26 00:00:00 268.8 28.8 153.8 36 108.66
3157 2020-03-24 00:00:00 264.32 28.32 149.32 35.4 104.93
3157 2020-03-13 00:00:00 625.46 67.01 510.46 83.7675 405.4225
3157 2020-03-13 00:00:00 558.45 0 443.45 83.7675 342.9325
3157 2020-03-11 00:00:00 142.5 0 27.5 21.375 1.855
3157 2020-03-11 00:00:00 159.6 17.1 44.6 21.375 17.805
3157 2020-03-03 00:00:00 349.52 0 234.52 52.428 171.612
3157 2020-03-03 00:00:00 391.46 41.94 276.46 52.428 210.722
So if you take a look at the first two rows, the name in the Home column is the same (in this example, 3157 Tocoa), and it is also the same for the next few rows. But in the Start Date column, only the first two entries are the same (in this case 3/26/2020 12:00:00 AM). What I need to do is the following:
If the dates are the same and the Home is the same, then I need the sum of each of the remaining columns.
(In this case, I would need the sum of -268.8 and 268.8, the sum of -28.8 and 28.8, and so on.) It is also important to mention there are instances where more than two rows share the same start date.
I will include the code I have used to get to where I am now. I would like to mention I am fairly new to Python, so I'm sure there is a super simple way to do this that I am just not familiar with.
I am also new to Stack Overflow, so if I am missing something or added something I shouldn't have, please forgive me.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import numpy as np
import matplotlib.pyplot as plt
import os
# class airbnb:
#Gets the location path for the reports that come raw from the channel
airbnb_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with python\Data_to_read\Bnb_feb_report.xlsx')
empty_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with python\Data_to_read\empty.xlsx')
#Defines the data frame
df_airbnb = pd.read_excel(airbnb_excel_file)
df_empty = pd.read_excel(empty_excel_file)
gross_earnings = df_airbnb['Gross Earnings']
tax_amount = df_airbnb['Gross Earnings'] * 0.06
gross_rental = df_airbnb['Gross Earnings'] - df_airbnb['Cleaning Fee']
com = ((gross_rental - tax_amount) + df_airbnb['Cleaning Fee']) * 0.15
net_rental = (gross_rental - (com + df_airbnb['Host Fee']))
house = df_airbnb['Listing']
start_date = df_airbnb['Start Date']
# df = pd.DataFrame(df_empty)
# df_empty.replace('nan', '')
#
# print(net_rental)
df_report = pd.DataFrame(
{'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': tax_amount,
'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})
df_report.loc[(df_report.Home == 'New house, Minutes from Disney & Attraction'), 'Home'] = '3161 Tocoa'
df_report.loc[(df_report.Home == 'Brand-New House, located minutes from Disney 5151'), 'Home'] = '5151 Adelaide'
df_report.loc[(df_report.Home == 'Luxury House, Located Minutes from Disney-World 57'), 'Home'] = '3157 Tocoa'
df_report.loc[(df_report.Home == 'Big house, Located Minutes from Disney-World 55'), 'Home'] = '3155 Tocoa'
df_report.sort_values(by=['Home'], inplace=True)
# writer = ExcelWriter('Final_Report.xlsx')
# df_report.to_excel(writer, 'sheet1', index=False)
# writer.save()
# class homeaway:
homeaway_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with python\Data_to_read\PayoutSummaryReport2020-03-01_2020-03-29.xlsx')
df_homeaway = pd.read_excel(homeaway_excel_file)
cleaning = int(115)
house = df_homeaway['Address']
start_date = df_homeaway['Check-in']
gross_earnings = df_homeaway['Gross booking amount']
taxed_amount = df_homeaway['Lodging Tax Owner Remits']
gross_rental = (gross_earnings - cleaning)
com = ((gross_rental-taxed_amount) + cleaning) * 0.15
net_rental = (gross_rental - (com + df_homeaway['Deductions']))
df_report2 = pd.DataFrame(
{'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': taxed_amount,
'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})
# writer = ExcelWriter('Final_Report2.xlsx')
# df_report2.to_excel(writer, 'sheet1', index=False)
# writer.save()
df_combined = pd.concat([df_report, df_report2])
writer = ExcelWriter('Final_Report_combined.xlsx')
df_combined.to_excel(writer, 'sheet1', index=False)
writer.save()
One possible approach is to group by Home and Start Date and
then compute the sum of the rows involved:
df.groupby(['Home', 'Start Date']).sum()
Fortunately, all the "other" columns are numeric, so no column specification is needed.
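For reference, here is a minimal, self-contained sketch (rebuilding just the first two rows of the table above into a hypothetical df) of what that produces:
import pandas as pd

# Hypothetical reconstruction of the first two rows of the combined table
df = pd.DataFrame({
    'Home': ['3157 Tocoa', '3157 Tocoa'],
    'Start Date': pd.to_datetime(['2020-03-26', '2020-03-26']),
    'Gross Earning': [-268.80, 268.80],
    'Tax': [-28.80, 28.80],
    'Gross Rental': [-383.80, 153.80],
    'Commission': [-36.0, 36.0],
    'Net Rental': [-338.66, 108.66],
})

# One row per Home / Start Date pair, with all numeric columns summed
print(df.groupby(['Home', 'Start Date']).sum())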
But if there are more than 2 rows with the same Home and Start Date
and you want to:
break them into pairs of consecutive rows,
and then compute their sums (for each pair separately),
you should apply a "2-tier" grouping:
first tier - group by Home and Start Date (as before),
second tier - group into pairs,
and compute sums for each second-level group.
In this case the code should be:
df.groupby(['Home', 'Start Date']).apply(
    lambda grp: grp.groupby(np.arange(len(grp.index)) // 2).sum())\
    .reset_index(level=-1, drop=True)
The additional operation required here is to drop the last level of the index
(the reset_index call).
To test this approach, e.g. add the following row to your DataFrame:
1234 Bogus Street,2020-03-26 00:00:00,20.0,2.0,15.0,3,10.0
so that 1234 Bogus Street / 2020-03-26 00:00:00 group now contains
three rows.
When you run the above code, you will get:
Gross Earning Tax Gross Rental Commission Net Rental
Home Start Date
1234 Bogus Street 2020-03-03 00:00:00 740.98 41.94 510.98 104.856 382.334
2020-03-11 00:00:00 302.10 17.10 72.10 42.750 19.660
2020-03-13 00:00:00 1183.91 67.01 953.91 167.535 748.355
2020-03-24 00:00:00 264.32 28.32 149.32 35.400 104.930
2020-03-26 00:00:00 0.00 0.00 -230.00 0.000 -230.000
2020-03-26 00:00:00 20.00 2.00 15.00 3.000 10.000
Note the last row. It contains:
the repeated Start Date (from the previous row),
the values from the added row.
And the second-to-last row contains the sums of only the first two rows
with the respective Home / Start Date.
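If, as described in the question, the goal is simply one summed row per Home / Start Date pair (no splitting into consecutive pairs), a flat frame ready for to_excel can be produced with as_index=False. A small sketch, assuming the combined table from the question's code is in df_combined:
# One summed row per Home / Start Date, kept as ordinary columns
summary = df_combined.groupby(['Home', 'Start Date'], as_index=False).sum()
summary.to_excel('Final_Report_combined.xlsx', sheet_name='sheet1', index=False)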
In my MySQL database stocks, I have 5 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in MySQL first, or should I first extract each table as a DataFrame and then join with pandas? How should it be done? I also don't know the code.
This is how I want to display: https://www.dropbox.com/s/uv1iik6m0u23gxp/ExpectedoutputFULL.csv?dl=0
So each ticker is a row that contains all of the specific columns from my tables.
Additional info:
I only need the most recent 8 quarters of quarterly data and 5 years of yearly data to be displayed.
The exact dates of the quarterly data may differ between tickers. If done by hand, the most recent eight quarters can easily be copied and pasted into the respective columns, but I have no idea how to make the computer determine which quarter a date belongs to and show it in the same column as in my example output. (I use the terms q1 through q8 simply as column names for display, so if my most recent data is from May 30, q8 is not necessarily the final quarter of the second year.)
If the most recent quarter or year is not available for one ticker (as with "ADUS" in the example) but is available for other tickers (such as "BA" in the example), simply leave that cell blank.
1st table company_info: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data:
2nd table income_statement_q: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data:
3rd table income_statement_y: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data:
4th table earnings_q:
https://www.dropbox.com/s/bufh7c2jq7veie9/earnings_q.csv?dl=0 contains quarterly data:
5th table earnings_y:
https://www.dropbox.com/s/li0r5n7mwpq28as/earnings_y.csv?dl=0
contains yearly data:
You can use:
# Convert as datetime64 if necessary
df2['date'] = pd.to_datetime(df2['date']) # quarterly
df3['date'] = pd.to_datetime(df3['date']) # yearly
# Realign dates to the period end: e.g. 2022-06-30 -> 2022-12-31 for yearly
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)
# Get end dates
qmax = df2['date'].max()
ymax = df3['date'].max()
# Create date range (8 periods for Q, 5 periods for Y)
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')
# Filter and reshape dataframes
qdf = (df2[df2['date'].isin(qdti)]
         .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
         .pivot(index='ticker', columns='date', values='netIncome'))
ydf = (df3[df3['date'].isin(ydti)]
         .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
         .pivot(index='ticker', columns='date', values='netIncome'))
# Create the expected dataframe
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()
Output:
>>> out
ticker industry sector pe roe shares ... 2022Q4 2018 2019 2020 2021 2022
0 ADUS Health Care Providers & Services Health Care 38.06 7.56 16110400 ... NaN 1.737700e+07 2.581100e+07 3.313300e+07 4.512600e+07 NaN
1 BA Aerospace & Defense Industrials NaN 0.00 598240000 ... -663000000.0 1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2 CAH Health Care Providers & Services Health Care NaN 0.00 257639000 ... -130000000.0 2.590000e+08 1.365000e+09 -3.691000e+09 6.120000e+08 -9.320000e+08
3 CVRX Health Care Equipment & Supplies Health Care 0.26 -32.50 20633700 ... -10536000.0 NaN NaN NaN -4.307800e+07 -4.142800e+07
4 IMCR Biotechnology Health Care NaN -22.30 47905000 ... NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08 NaN
5 NVEC Semiconductors & Semiconductor Equipment Information Technology 20.09 28.10 4830800 ... 4231324.0 1.391267e+07 1.450794e+07 1.452664e+07 1.169438e+07 1.450750e+07
6 PEPG Biotechnology Health Care NaN -36.80 23631900 ... NaN NaN NaN -1.889000e+06 -2.728100e+07 NaN
7 VRDN Biotechnology Health Care NaN -36.80 40248200 ... NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07 NaN
[8 rows x 20 columns]
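The snippet above assumes df1, df2 and df3 are already DataFrames. If you decide to pull the tables straight out of MySQL rather than join them in SQL, a hedged sketch of the loading step (the connection string, driver and credentials are placeholders for your own setup) could look like:
from sqlalchemy import create_engine
import pandas as pd

# Placeholder connection string - adjust driver, credentials and host for your setup
engine = create_engine('mysql+pymysql://user:password@localhost/stocks')

df1 = pd.read_sql('SELECT * FROM company_info', engine)
df2 = pd.read_sql('SELECT * FROM income_statement_q', engine)
df3 = pd.read_sql('SELECT * FROM income_statement_y', engine)
# earnings_q and earnings_y can be loaded the same way if you need them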
I want to compute a rolling median of the price column over the previous 4 days, with the data grouped by date. So basically I want to take the prices for a given day plus all prices from the 4 days before it and calculate the median of those values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close by combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to keep one row per index value, and given how the median works, I cannot simply merge these rows afterwards to produce one result per row.
The result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then simply printed the price values.
Is it possible to get one lagged / moving median value per date?
You can use rolling with a frequency window of 5 days to get today plus the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date, and ensure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
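If you also want the rolling value attached back to every original row (one value repeated per date) rather than the per-day table above, the daily result can be mapped back. A small sketch on top of df_f, assuming the original frame is still in df:
# map each day's rolling median back onto the original rows
df['date'] = pd.to_datetime(df['date'])
df['rolling_median'] = df['date'].map(df_f.set_index('date')['price'])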
This is a step-by-step process. There are probably more efficient methods of getting what you want. Note: if your dates include time information, you will need to drop it before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through the four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values so the new lists line up with the dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date a median value for past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)
I want to split monthly data into weekly rows and fill each weekly row with the same monthly value that the week refers to.
These variables are the ones that I need to work with.
"starting date" non-null datetime64[ns]
"ending date" non-null datetime64[ns]
import pandas as pd
df = pd.read_excel("file")
import pandas as pd
import math, datetime
d1 = datetime.date(yyyy, mm, dd)
d2 = datetime.date(yyyy, mm, dd)
h = []
while d1 <= d2:
    print(d1)
    d1 = d1 + datetime.timedelta(days=7)
    h.append(d1)
df = pd.Series(h)
print(df)
I have tried the code above, but
I think it is completely useless:
This is what I have in my dataset:
price starting date ending date model
1000 2013-01-01 2013-01-14 blue
598 2013-01-01 2013-01-14 blue
156 2013-01-15 2013-01-28 red
This is what I would like to get:
weekly date price model
2013-01-01 1000 blue
2013-01-01 598 blue
2013-01-08 1000 blue
2013-01-08 598 blue
2013-01-15 156 red
2013-01-22 156 red
Something like below:
Convert the date columns with pd.to_datetime():
df[['starting date','ending date']] = df[['starting date','ending date']].apply(pd.to_datetime)
Create a dictionary from the start time column:
d=dict(zip(df['starting date'],df.data))
#{Timestamp('2013-01-01 00:00:00'): 20, Timestamp('2013-01-15 00:00:00'): 21}
Using pd.date_range() create a dataframe having weekly intervals of the start time:
df_new = pd.DataFrame(pd.date_range(df['starting date'].iloc[0],df['ending date'].iloc[-1],freq='W-TUE'),columns=['StartDate'])
Same for end time:
df_new['EndDate']=pd.date_range(df['starting date'].iloc[0],df['ending date'].iloc[-1],freq='W-MON')
Map the data column based on start time and ffill() till the next start time arrives:
df_new['data']=df_new.StartDate.map(d).ffill()
print(df_new)
StartDate EndDate data
0 2013-01-01 2013-01-07 20.0
1 2013-01-08 2013-01-14 20.0
2 2013-01-15 2013-01-21 21.0
3 2013-01-22 2013-01-28 21.0
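A note on the W-TUE / W-MON anchors used above: they assume the periods always start on a Tuesday, as in the sample data (2013-01-01 is a Tuesday). If that assumption may not hold, the anchor can be derived from the data itself; a hedged sketch:
# derive the weekly anchor from the first starting date instead of hard-coding it
anchor = df['starting date'].iloc[0].day_name()[:3].upper()   # e.g. 'TUE'
weekly_starts = pd.date_range(df['starting date'].iloc[0],
                              df['ending date'].iloc[-1],
                              freq=f'W-{anchor}')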
I'm going to make the assumption that the starting date and the ending date never overlap in your dataset. I'm also going to assume that your example is correct where it contradicts your description: it's not monthly data, but rather two-week periods. This code should work with any frequency.
# creates some sample data
df = pd.DataFrame(data={'starting date':pd.to_datetime(['2019-01-01','2019-01-15','2019-02-01','2019-02-15']),
'data':[1,2,3,4]})
# Hold the start and end dates of the new df
d1 = pd.Timestamp(2019, 1, 1)
d2 = pd.Timestamp(2019, 2, 28)
# Create a new DF to hold results
new_df = pd.DataFrame({'date': pd.date_range(start=d1, end=d2, freq='W')})
# Merge based on the closest start date.
result = pd.merge_asof(new_df,df,left_on='date',right_on='starting date')
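One thing to be aware of: merge_asof keeps a single right-hand row per weekly date, while the expected output in the question keeps several rows that share the same period (the two blue prices). If that matters, a different route is to generate the weekly dates per row and then explode them. A hedged sketch using the sample frame from the question:
import pandas as pd

# sample rows from the question
df = pd.DataFrame({
    'price': [1000, 598, 156],
    'starting date': pd.to_datetime(['2013-01-01', '2013-01-01', '2013-01-15']),
    'ending date': pd.to_datetime(['2013-01-14', '2013-01-14', '2013-01-28']),
    'model': ['blue', 'blue', 'red'],
})

# one list of weekly dates per row, then one output row per weekly date
df['weekly date'] = [list(pd.date_range(s, e, freq='7D'))
                     for s, e in zip(df['starting date'], df['ending date'])]
out = (df.explode('weekly date')[['weekly date', 'price', 'model']]
         .sort_values('weekly date')
         .reset_index(drop=True))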
I have a dataframe df which has a head that looks like:
Shop Opening date
0 London NaT
22 Brighton 01/03/2016
27 Manchester 01/31/2017
54 Bristol 03/31/2017
69 Glasgow 04/09/2017
I also have a variable startPeriod which is set to the date 1/04/2017, and an endPeriod variable with a value of 30/06/17.
I am trying to create a new dataframe based on df that filters out any rows that do not have a date (so removing any rows with an Opening date of NaT) and also filters out any rows with an Opening date between startPeriod and endPeriod. So in the above example I would be left with the following new dataframe:
Shop Opening date
22 Brighton 01/03/2016
69 Glasgow 04/09/2017
I have tried to filter out the 'NaT' using the following:
df1 = df['Opening date '] != 'NaT'
but I am unsure how to also filter out any Opening dates that fall inside the startPeriod/endPeriod range.
You can use between with boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df = df[df['date'].between('2016-03-01', '2017-04-05')]
print (df)
Shop Opening date
2 27 Manchester 2017-01-31
3 54 Bristol 2017-03-31
I think filtering out NaNs is not necessary here, but if you need it, chain a new condition:
df = df[df['date'].between('2016-03-01', '2017-04-05') & df['date'].notnull()]
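Since the question actually wants to drop the rows whose Opening date falls inside the startPeriod/endPeriod window rather than keep them, the same between mask can be inverted with ~. A sketch using the question's own column and variable names (watch for a possible trailing space in 'Opening date'), assuming startPeriod and endPeriod are already Timestamps:
df['Opening date'] = pd.to_datetime(df['Opening date'])
df1 = df[df['Opening date'].notna()
         & ~df['Opening date'].between(startPeriod, endPeriod)]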
First of all, be careful with the space after date in df['Opening date ']
try this solution:
df1 = df[df['Opening date'] != 'NaT']
It would be much better to create a copy of the subset you're making:
df1 = df[df['Opening date'] != 'NaT'].copy()
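One caveat: comparing against the string 'NaT' only works if the column actually holds strings. If Opening date is a real datetime column (as the NaT display suggests), the missing entries are pd.NaT and a missing-value test is more robust. A small sketch:
# works whether the gaps are real NaT values or were read in as text
df['Opening date'] = pd.to_datetime(df['Opening date'], errors='coerce')
df1 = df[df['Opening date'].notna()].copy()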
I have a pandas dataset like this:
Date WaterTemp Discharge AirTemp Precip
0 2012-10-05 00:00 10.9 414.0 39.2 0.0
1 2012-10-05 00:15 10.1 406.0 39.2 0.0
2 2012-10-05 00:45 10.4 406.0 37.4 0.0
...
63661 2016-10-12 14:30 10.5 329.0 15.8 0.0
63662 2016-10-12 14:45 10.6 323.0 19.4 0.0
63663 2016-10-12 15:15 10.8 329.0 23 0.0
I want to extend each row so that I get a dataset that looks like:
Date WaterTemp 00:00 WaterTemp 00:15 .... Discharge 00:00 ...
0 2012-10-05 10.9 10.1 414.0
There will be at most 72 readings for each date, so I should have 288 columns in addition to the date and index columns, and at most 1460 rows (4 years * 365 days per year, minus some possibly missing dates). Eventually, I will use the 288-column dataset in a classification task (I'll be adding the label later), so I need to convert this dataframe to a 2D array (sans datetime) to feed into the classifier, which means I can't simply group by date and then access each group. I did try grouping by date, but I was uncertain how to turn each group into a single row. I also looked at joining; it looks like joining could suit my needs (for example, a join based on (day, month, year)), but I was uncertain how to split things into different pandas dataframes so that the join would work. What is a way to do this?
PS. I do already know how to change the my datetimes in my Date column to dates without the time.
I figured it out. I group the readings by the time of day of the reading. Each group is a dataframe in and of itself, so I then just need to concatenate the dataframes based on date. My code for the whole function is as follows.
import pandas
def readInData(filename):
    # read in the file and remove missing values
    ds = pandas.read_csv(filename)
    ds = ds[ds.AirTemp != 'M']
    # set the index to the date
    ds['Date'] = pandas.to_datetime(ds.Date, yearfirst=True, errors='coerce')
    ds.Date = pandas.DatetimeIndex(ds.Date)
    ds.index = ds.Date
    # group readings by time of day of reading (i.e. all readings at midnight, all at 00:15, ...)
    dg = ds.groupby(ds.index.time)
    # initialize the final dataframe
    df = pandas.DataFrame()
    for name, group in dg:
        # each group is a dataframe
        try:
            # set unique column names except for date
            group.columns = ['Date', 'WaterTemp' + str(name), 'Discharge' + str(name),
                            'AirTemp' + str(name), 'Precip' + str(name)]
            # ensure date is the index
            group.index = group.Date
            # remove time from index
            group.index = group.index.normalize()
            # join based on date
            df = pandas.concat([df, group], axis=1)
        except:  # if the try/except block isn't here, this throws errors (three for my dataset?)
            pass
    # remove duplicate date columns
    df = df.loc[:, ~df.columns.duplicated()]
    # since date is the index, drop the first date column
    df = df.drop('Date', axis=1)
    # return the dataset
    return df
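For reference, a more compact route to the same wide layout is a pivot on date versus time of day, with the resulting two-level column labels flattened afterwards. This is a hedged sketch of an alternative (the function name is just illustrative), not the approach used above; it assumes the same column names as in the function and that duplicate (date, time) readings do not occur:
import pandas

def readInDataPivot(filename):
    ds = pandas.read_csv(filename)
    ds = ds[ds.AirTemp != 'M']
    ds['Date'] = pandas.to_datetime(ds.Date, yearfirst=True, errors='coerce')
    # split the timestamp into a date part (rows) and a time-of-day part (columns)
    ds['day'] = ds['Date'].dt.normalize()
    ds['time'] = ds['Date'].dt.time
    wide = ds.pivot_table(index='day', columns='time',
                          values=['WaterTemp', 'Discharge', 'AirTemp', 'Precip'],
                          aggfunc='first')
    # flatten the (variable, time) column MultiIndex into labels like 'WaterTemp 00:00:00'
    wide.columns = ['{} {}'.format(var, t) for var, t in wide.columns]
    return wide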