merge dataframes on index in for loop - python

I am having a lot of trouble merging these dataframes, which all share the same index and are produced in the same for loop. When I run my code below and print inside the loop, I get two separate dataframes. I want to do some type of dataframe.merge() so the result looks like this.
# price1 when printed in the for loop
Close tic
Date
2010-05-27 31.33 AAPL
2010-05-28 31.77 AAPL
... ... ...
2020-05-22 318.89 AAPL
2020-05-26 316.73 AAPL
[2516 rows x 2 columns]
Close tic
Date
2010-05-27 38.54 TROW
2010-05-28 37.08 TROW
... ... ...
2020-05-22 115.09 TROW
2020-05-26 120.05 TROW
[2516 rows x 2 columns]
Next is what I want it to look like: the two dataframes merged on the index, with the second dataframe's columns appended as new columns.
#what I want it to look like
Close tic Close tic
Date
2010-05-27 31.33 AAPL 38.54 TROW
2010-05-28 31.77 AAPL 37.08 TROW
... ... ...
2020-05-22 318.89 AAPL 115.09 TROW
2020-05-26 316.73 AAPL 120.05 TROW
[2516 rows x 4 columns]
My code is below.
import yfinance as yf
import pandas as pd
import csv

def price(ticker):
    company = yf.Ticker(ticker)
    price = company.history(period="10y")
    price_df = pd.DataFrame(price)
    price_df.drop(price_df.columns[[0, 1, 2, 4, 5, 6]], axis=1, inplace=True)
    price_df['tic'] = ticker
    return price_df

l = ["AAPL", "TROW"]
for ticker in l:
    price1 = price(ticker)
    print(price1)
Thx in advance

Make sure your "date" column is the dataframe index (note that set_index returns a new frame, so assign the result):
df1 = df1.set_index('date')
df2 = df2.set_index('date')
then concatenate the two frames column-wise:
df_merged = pd.concat((df1, df2), axis=1)
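Applied to the original loop, a minimal sketch (reusing the price() function from the question, which already returns Date-indexed frames) could look like:
frames = []
for ticker in ["AAPL", "TROW"]:
    frames.append(price(ticker))    # each frame is indexed by Date
price1 = pd.concat(frames, axis=1)  # align side-by-side on the shared Date index
print(price1)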

According to your sample data, simply
price1 = price('AAPL').join(price('TROW'))
print(price1)
may work fine.
In more complicated cases, comments from CypherX could be considered.
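If join complains about overlapping column names (both frames contain Close and tic), suffixes can be passed, for example:
price1 = price('AAPL').join(price('TROW'), lsuffix='_AAPL', rsuffix='_TROW')
print(price1)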

Related

How to append to another df from inside a for loop

How can you append to an existing df from inside a for loop? For example:
import pandas as pd
from pandas_datareader import data as web
stocks = ['amc', 'aapl']
colnames = ['Datetime', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Name']
df1 = pd.DataFrame(data=None, columns=colnames)
for stock in stocks:
    df = web.DataReader(stock, 'yahoo')
    df['Name'] = stock
What should I do next so that df is appended to df1?
You could try pandas.concat()
df1 = pd.DataFrame(data=None, columns=colnames)
for stock in stocks:
    df = web.DataReader(stock, 'yahoo')
    df['Name'] = stock
    df1 = pd.concat([df1, df], ignore_index=True)
Instead of concatenating the dataframe on each loop iteration, you could also append each dataframe to a list and concatenate once at the end:
dfs = []
for stock in stocks:
    df = web.DataReader(stock, 'yahoo')
    df['Name'] = stock
    dfs.append(df)
df_ = pd.concat(dfs, ignore_index=True)
print(df_)
High Low Open Close Volume Adj Close Name
0 32.049999 31.549999 31.900000 31.549999 1867000.0 24.759203 amc
1 31.799999 30.879999 31.750000 31.000000 1225900.0 24.327585 amc
2 31.000000 30.350000 30.950001 30.799999 932100.0 24.170631 amc
3 30.900000 30.250000 30.700001 30.350000 1099000.0 23.817492 amc
4 30.700001 30.100000 30.549999 30.650000 782500.0 24.052916 amc
... ... ... ... ... ... ... ...
2515 179.009995 176.339996 176.690002 178.960007 100589400.0 178.960007 aapl
2516 179.610001 176.699997 178.550003 177.770004 92633200.0 177.770004 aapl
2517 178.029999 174.399994 177.839996 174.610001 103049300.0 174.610001 aapl
2518 174.880005 171.940002 174.029999 174.309998 78699800.0 174.309998 aapl
2519 174.880005 171.940002 174.029999 174.309998 78751328.0 174.309998 aapl
[2520 rows x 7 columns]
What you're trying to do won't quite work, since the data retrieved by DataReader has several columns and you need that data for several stocks. However, each of those columns is a time series.
So what you probably want is something that looks like this:
Stock amc
Field High Low Open ...
2022-03-30 29.230000 25.350000 ...
2022-03-31 25.920000 23.260000 ...
2022-04-01 25.280001 22.340000 ...
2022-04-01 25.280001 22.340000 ...
...
And you'd be able to access it like df[('amc', 'Low')] to get a time series for that stock, or like df[('amc', 'Low')]['2022-04-01'][0] to get the 'Low' value for 'amc' on April 1st.
This gets you exactly that:
import pandas as pd
from pandas_datareader import data as web

stocks = ['amc', 'aapl']
df = pd.DataFrame()
for stock_name in stocks:
    stock_df = web.DataReader(stock_name, data_source='yahoo')
    for col in stock_df:
        df[(stock_name, col)] = stock_df[col]
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['Stock', 'Field'])

print(f'\nall data:\n{"-"*40}\n', df)
print(f'\none series:\n{"-"*40}\n', df[('aapl', 'Volume')])
print(f'\nsingle value:\n{"-"*40}\n', df[('amc', 'Low')]['2022-04-01'][0])
The solution uses a MultiIndex to achieve what you need. It first loads all the data as retrieved from the API into columns labeled with tuples of stock name and field, and it then converts that into a proper MultiIndex after loading completes.
Output:
all data:
----------------------------------------
Stock amc ... aapl
Field High Low ... Volume Adj Close
Date ...
2017-04-04 32.049999 31.549999 ... 79565600.0 34.171505
2017-04-05 31.799999 30.879999 ... 110871600.0 33.994480
2017-04-06 31.000000 30.350000 ... 84596000.0 33.909496
2017-04-07 30.900000 30.250000 ... 66688800.0 33.833969
2017-04-10 30.700001 30.100000 ... 75733600.0 33.793839
... ... ... ... ... ...
2022-03-29 34.330002 26.410000 ... 100589400.0 178.960007
2022-03-30 29.230000 25.350000 ... 92633200.0 177.770004
2022-03-31 25.920000 23.260000 ... 103049300.0 174.610001
2022-04-01 25.280001 22.340000 ... 78699800.0 174.309998
2022-04-01 25.280001 22.340000 ... 78751328.0 174.309998
[1260 rows x 12 columns]
one series:
----------------------------------------
Date
2017-04-04 79565600.0
2017-04-05 110871600.0
2017-04-06 84596000.0
2017-04-07 66688800.0
2017-04-10 75733600.0
...
2022-03-29 100589400.0
2022-03-30 92633200.0
2022-03-31 103049300.0
2022-04-01 78699800.0
2022-04-01 78751328.0
Name: (aapl, Volume), Length: 1260, dtype: float64
single value:
----------------------------------------
22.34000015258789

How to summarize missing values in time series data in a Pandas Dataframe?

I have a time-series dataset like the following:
As seen, there are three columns for channel values paired against the same set of timestamps.
Each channel has sets of NaN values.
My objective is to create a summary of these NaN values as follows:
My approach (inefficient): use a for loop to go across each channel column, and then a nested for loop to go across each row of that channel. When it stumbles across a set of NaN values, it registers the start timestamp, end timestamp and duration as individual rows (or lists), which I can eventually stack together as the final output.
But my logic seems pretty inefficient and slow, especially considering that my original dataset has 200 channel columns and 10k rows. I'm sure there should be a better approach than this in Python.
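Roughly, the nested-loop version I have in mind looks like this sketch (with a tiny made-up frame and illustrative column names, since the real data has 200 channel columns):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date_time': pd.date_range('2019-09-19 10:44', periods=6, freq='15min'),
    'Channel_1': [1.0, np.nan, np.nan, 4.0, np.nan, 6.0],
    'Channel_2': [2.0, 3.0, np.nan, np.nan, 5.0, 6.0],
})

summary_rows = []
for col in df.columns.drop('date_time'):           # every channel column
    start = None
    for i, value in enumerate(df[col]):
        if pd.isna(value) and start is None:
            start = df['date_time'].iloc[i]        # a NaN run begins here
        elif pd.notna(value) and start is not None:
            end = df['date_time'].iloc[i]          # first reading after the run
            summary_rows.append([col, start, end, end - start])
            start = None
summary = pd.DataFrame(summary_rows, columns=['Channel No.', 'Starting_Timestamp', 'Ending_Timestamp', 'Duration'])
print(summary)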
Can anyone please help me out with an appropriate way to deal with this - using Pandas in Python?
Use DataFrame.melt to reshape the DataFrame, then filter the consecutive groups of missing values (together with the first value after each missing run) and create a new DataFrame by aggregating with min and max:
df['date_time'] = pd.to_datetime(df['date_time'])

df1 = df.melt('date_time', var_name='Channel No.')
m = df1['value'].shift(fill_value=False).notna()
mask = df1['value'].isna() | ~m

df1 = (df1.groupby([m.cumsum()[mask], 'Channel No.'])
          .agg(Starting_Timestamp=('date_time', 'min'),
               Ending_Timestamp=('date_time', 'max'))
          .assign(Duration=lambda x: x['Ending_Timestamp'].sub(x['Starting_Timestamp']))
          .droplevel(0)
          .reset_index())
print(df1)
Channel No. Starting_Timestamp Ending_Timestamp Duration
0 Channel_1 2019-09-19 10:59:00 2019-09-19 14:44:00 0 days 03:45:00
1 Channel_1 2019-09-19 22:14:00 2019-09-19 23:29:00 0 days 01:15:00
2 Channel_2 2019-09-19 13:59:00 2019-09-19 19:44:00 0 days 05:45:00
3 Channel_3 2019-09-19 10:59:00 2019-09-19 12:44:00 0 days 01:45:00
4 Channel_3 2019-09-19 15:14:00 2019-09-19 16:44:00 0 days 01:30:00
Use:
inds = df[df['g'].isna()].index.to_list()
gs = []
s = 0
for i, x in enumerate(inds):
    if i < len(inds) - 1:
        if x + 1 != inds[i + 1]:
            gs.append(inds[s:i + 1])
            s = i + 1
    else:
        gs.append(inds[s:i + 1])
ses = []
for g in gs:
    ses.append([df.iloc[g[0]]['date'], df.iloc[g[-1] + 1]['date']])
res = pd.DataFrame(ses, columns=['st', 'et'])
res['d'] = res['et'] - res['st']
And a more efficient solution:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-12-01', periods=12), 'g': range(12)})
df.loc[0:3, 'g'] = np.nan
df.loc[5:7, 'g'] = np.nan
inds = df[df['g'].isna().astype(int).diff() == -1].index + 1
pd.DataFrame([(x.iloc[0]['date'], x.iloc[-1]['date'])
              for x in np.array_split(df, inds) if np.isnan(x['g'].iloc[0])])

Replacing a for loop with something more efficient when comparing dates to a list

Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates, date_list, and a data frame df which, for now, contains one column named Event Date holding the date that an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] > date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops, as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge the previous results. The fillna(method='ffill') will fill gaps in the middle of the data, while the last fillna(0) handles any gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.fillna(method='ffill').fillna(0)
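As an aside, fillna(method='ffill') is deprecated on recent pandas versions in favor of the dedicated method, so the last line can equivalently be written as:
merged_df = merged_df.ffill().fillna(0)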
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03

Problem with group by max period in dataframe pandas

I'm still a novice with Python and I'm having problems trying to group some data to show the record that has the highest (maximum) date. The dataframe is as follows:
...
I am trying the following:
df_2 = df.max(axis = 0)
df_2 = df.periodo.max()
df_2 = df.loc[df.groupby('periodo').periodo.idxmax()]
And it gives me back:
Timestamp('2020-06-01 00:00:00')
periodo 2020-06-01 00:00:00
valor 3.49136
Although the value for 'periodo' is correct, the one for 'valor' is not, since I need to obtain the corresponding complete record ('periodo' and 'valor'), not the maximum of each column. I have tried other ways but I can't get what I want...
What do I need to do?
Thank you in advance, I will be attentive to your answers!
Regards!
# import packages we need, seed random number generator
import pandas as pd
import datetime
import random
random.seed(1)
Create example dataframe
start_date = datetime.date(2020, 1, 1)  # example range inferred from the output below
day_count = 10
dates = [single_date for single_date in (start_date + datetime.timedelta(n) for n in range(day_count))]
values = [random.randint(1, 1000) for _ in dates]
df = pd.DataFrame(zip(dates, values), columns=['dates', 'values'])
ie df will be:
dates values
0 2020-01-01 389
1 2020-01-02 808
2 2020-01-03 215
3 2020-01-04 97
4 2020-01-05 500
5 2020-01-06 30
6 2020-01-07 915
7 2020-01-08 856
8 2020-01-09 400
9 2020-01-10 444
Select rows with highest entry in each column
You can do:
df[df['dates'] == df['dates'].max()]
(Or, if you want to use idxmax, you can do: df.loc[[df['dates'].idxmax()]])
Returning:
dates values
9 2020-01-10 444
ie this is the row with the latest date
&
df[df['values'] == df['values'].max()]
(Or, if you want to use idxmax again, you can do: df.loc[[df['values'].idxmax()]] - as in Scott Boston's answer.)
and
dates values
6 2020-01-07 915
ie this is the row with the highest value in the values column.
I think you need something like:
df.loc[[df['valor'].idxmax()]]
Where you use idxmax on the 'valor' column. Then use that index to select that row.
MVCE:
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'periodo': pd.date_range('2018-07-01', periods=600, freq='d'),
                   'valor': np.random.random(600) + 3})
df.loc[[df['valor'].idxmax()]]
Output:
periodo valor
474 2019-10-18 3.998918

Write to a dataframe or excel/csv file without overlapping in loop

Basically my algorithm creates rows such as:
[1 rows x 84 columns]
Date 1990-12-31 1991-09-30 1991-12-31 1992-03-31 1992-06-30 ... 2017-06-30 2018-12-31 2019-09-30 2019-12-31 2020-03-31
AEP 28.0 30.625 34.25 30.75 31.875 ... 69.470001 74.739998 93.690002 94.510002 79.980003
[1 rows x 84 columns]
Date 1990-12-31 1991-09-30 1991-12-31 1992-03-31 1992-06-30 ... 2017-06-30 2018-12-31 2019-09-30 2019-12-31 2020-03-31
HON 6.435244 8.639912 10.457272 12.03629 12.810903 ... 127.751709 132.119995 169.199997 177.0 133.789993
[1 rows x 84 columns]
Date 1990-12-31 1991-09-30 1991-12-31 1992-03-31 1992-06-30 ... 2017-06-30 2018-12-31 2019-09-30 2019-12-31 2020-03-31
BMY 15.942265 19.689886 20.998581 18.14325 15.674578 ... 55.720001 51.98 50.709999 64.190002 55.740002
My issue is how to append these rows together into one df or Excel file.
The function that creates these rows is called by a loop that has a list of the tickers. The problem is that every time I try to append or write something to a file, it overwrites the previous ticker, so in the end I end up with just variations of the BMY ticker.
This is the loop code; the function is "ticker":
list = ["CAT", "CVX", "BA", "AEP", "HON", "BMY"]
for i in list:
    ticker(i)

def ticker(tick):
    df = pd.read_csv(r"C:/Users/NAME/Desktop/S&P data/Data Compilation.csv")
    df1 = df.set_index(["Company Ticker"])
    abt = pd.read_csv(r"C:/Users/NAME/Desktop/S&P data/" + tick + "/" + tick + ".csv")
    abt1 = abt[['Close', "Date"]]
    # I tried a lot of methods to join; I manually inputted the dates I need.
    # The code then appends the ticker data (Close & price) into a new sheet in Data Compilation.
    output = abt1.join(df1, how='left')
    output = output[output["Date"].isin([
        '2020-03-31', '2019-12-31', '2019-09-30', '2019-06-30', '2019-03-31',
        '2018-12-31', '2018-09-30', '2018-06-30', '2018-03-31',
        '2017-12-31', '2017-09-30', '2017-06-30', '2017-03-31',
        '2016-12-31', '2016-09-30', '2016-06-30', '2016-03-31',
        '2015-12-31', '2015-09-30', '2015-06-30', '2015-03-31',
        '2014-12-31', '2014-09-30', '2014-06-30', '2014-03-31',
        '2013-12-31', '2013-09-30', '2013-06-30', '2013-03-31',
        '2012-12-31', '2012-09-30', '2012-06-30', '2012-03-31',
        '2011-12-31', '2011-09-30', '2011-06-30', '2011-03-31',
        '2010-12-31', '2010-09-30', '2010-06-30', '2010-03-31',
        '2009-12-31', '2009-09-30', '2009-06-30', '2009-03-31',
        '2008-12-31', '2008-09-30', '2008-06-30', '2008-03-31',
        '2007-12-31', '2007-09-30', '2007-06-30', '2007-03-31',
        '2006-12-31', '2006-09-30', '2006-06-30', '2006-03-31',
        '2005-12-31', '2005-09-30', '2005-06-30', '2005-03-31',
        '2004-12-31', '2004-09-30', '2004-06-30', '2004-03-31',
        '2003-12-31', '2003-09-30', '2003-06-30', '2003-03-31',
        '2002-12-31', '2002-09-30', '2002-06-30', '2002-03-31',
        '2001-12-31', '2001-09-30', '2001-06-30', '2001-03-31',
        '2000-12-31', '2000-09-30', '2000-06-30', '2000-03-31',
        '1999-12-31', '1999-09-30', '1999-06-30', '1999-03-31',
        '1998-12-31', '1998-09-30', '1998-06-30', '1998-03-31',
        '1997-12-31', '1997-09-30', '1997-06-30', '1997-03-31',
        '1996-12-31', '1996-09-30', '1996-06-30', '1996-03-31',
        '1995-12-31', '1995-09-30', '1995-06-30', '1995-03-31',
        '1994-12-31', '1994-09-30', '1994-06-30', '1994-03-31',
        '1993-12-31', '1993-09-30', '1993-06-30', '1993-03-31',
        '1992-12-31', '1992-09-30', '1992-06-30', '1992-03-31',
        '1991-12-31', '1991-09-30', '1991-06-30', '1991-03-31',
        '1990-12-31', '1990-09-30', '1990-06-30', '1990-03-31'])]
    output = output.pivot_table(values='Close', columns='Date', aggfunc='first')
    output = output.rename(index={"Close": tick})
    print(output)
    return output
If you want to merge rows that share the same columns into one dataframe, the code below may do the work:
df = pd.DataFrame()
list = ["CAT", "CVX", "BA", "AEP", "HON", "BMY"]
for i in list:
    responseDf = ticker(i)
    df = df.append(responseDf)
print(df)
"df" is your main dataframe; on each loop iteration, the result dataframe from the ticker function is added to it with the "append" function.

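Note that DataFrame.append was deprecated and removed in pandas 2.0, so on newer pandas versions the same pattern can be written with pd.concat, for example:
dfs = []
list = ["CAT", "CVX", "BA", "AEP", "HON", "BMY"]
for i in list:
    dfs.append(ticker(i))   # collect each ticker's one-row frame
df = pd.concat(dfs)         # stack them into a single dataframe
print(df)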