How can I slice this dataframe? - python

I have a dataframe 'data' that looks like this:
<bound method NDFrame.head of Close ... Volume
A AA TSLA ... A AA TSLA
Date ...
2020-06-24 86.378616 11.14 960.849976 ... 1806600 7562700 10959600
2020-06-25 87.077148 11.83 985.979980 ... 1350100 6728600 9254500
2020-06-26 85.720001 10.93 959.739990 ... 2225800 25817600 8854900
2020-06-29 87.290001 10.99 1009.349976 ... 1302500 7397600 9026400
2020-06-30 88.370003 11.24 1079.810059 ... 1920200 5796600 16881600
[5 rows x 15 columns]>
Now, from this dataframe, I would like to get all the data for 'A' into a single dataframe.
I can do this via:
df2['Open'] = data['Open']['A']
df2['High'] = data['High']['A']
df2['Low'] = data['Low']['A']
etc.
And that works fine... However, there must be a smarter way to do this, right?
All help appreciated!

Sure, use DataFrame.xs for selecting on a MultiIndex level:
df2 = data.xs('A', axis=1, level=1)
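For example, here is a minimal sketch that rebuilds the same two-level column layout (fields on the outer level, tickers on the inner level; the numbers are placeholders taken from the head output above) and applies xs:

import pandas as pd

# two-level columns: ('Close', 'A'), ('Close', 'AA'), ..., ('Volume', 'TSLA')
cols = pd.MultiIndex.from_product([['Close', 'Volume'], ['A', 'AA', 'TSLA']])
data = pd.DataFrame(
    [[86.378616, 11.14, 960.849976, 1806600, 7562700, 10959600]],
    index=pd.to_datetime(['2020-06-24']),
    columns=cols,
)

# cross-section on the inner column level: every field, just for ticker 'A'
df2 = data.xs('A', axis=1, level=1)
print(df2)  # columns are now the outer-level fields: Close, Volume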

Related

Write to a dataframe or excel/csv file without overlapping in loop

Basically my algorithm creates rows such as:
[1 rows x 84 columns]
Date 1990-12-31 1991-09-30 1991-12-31 1992-03-31 1992-06-30 ... 2017-06-30 2018-12-31 2019-09-30 2019-12-31 2020-03-31
AEP 28.0 30.625 34.25 30.75 31.875 ... 69.470001 74.739998 93.690002 94.510002 79.980003
[1 rows x 84 columns]
Date 1990-12-31 1991-09-30 1991-12-31 1992-03-31 1992-06-30 ... 2017-06-30 2018-12-31 2019-09-30 2019-12-31 2020-03-31
HON 6.435244 8.639912 10.457272 12.03629 12.810903 ... 127.751709 132.119995 169.199997 177.0 133.789993
[1 rows x 84 columns]
Date 1990-12-31 1991-09-30 1991-12-31 1992-03-31 1992-06-30 ... 2017-06-30 2018-12-31 2019-09-30 2019-12-31 2020-03-31
BMY 15.942265 19.689886 20.998581 18.14325 15.674578 ... 55.720001 51.98 50.709999 64.190002 55.740002
My issue is to append these rows together in one df or excel file.
The function that creates these rows is called by a loop over a list of the tickers. The problem is that every time I try to append or write something to a file, it overwrites each previous ticker, so in the end I end up with just variations of the BMY ticker.
This is the loop code; the function is called "ticker":
tickers = ["CAT", "CVX", "BA", "AEP", "HON", "BMY"]
for i in tickers:
    ticker(i)
def ticker(tick):
    df = pd.read_csv(r"C:/Users/NAME/Desktop/S&P data/Data Compilation.csv")
    df1 = df.set_index(["Company Ticker"])
    abt = pd.read_csv(r"C:/Users/NAME/Desktop/S&P data/" + tick + "/" + tick + ".csv")
    abt1 = abt[['Close', "Date"]]
    # I tried a lot of methods to join; I manually inputted the dates I need.
    # The code then appends the ticker's Close data into a new sheet in Data Compilation.
    output = abt1.join(df1, how='left')
    output = output[output["Date"].isin(['2020-03-31','2019-12-31','2019-09-30','2019-06-30','2019-03-31','2018-12-31','2018-09-30','2018-06-30','2018-03-31','2017-12-31','2017-09-30','2017-06-30','2017-03-31','2016-12-31','2016-09-30','2016-06-30','2016-03-31','2015-12-31','2015-09-30','2015-06-30','2015-03-31','2014-12-31','2014-09-30','2014-06-30','2014-03-31','2013-12-31','2013-09-30','2013-06-30','2013-03-31','2012-12-31','2012-09-30','2012-06-30','2012-03-31','2011-12-31','2011-09-30','2011-06-30','2011-03-31','2010-12-31','2010-09-30','2010-06-30','2010-03-31','2009-12-31','2009-09-30','2009-06-30','2009-03-31','2008-12-31','2008-09-30','2008-06-30','2008-03-31','2007-12-31','2007-09-30','2007-06-30','2007-03-31','2006-12-31','2006-09-30','2006-06-30','2006-03-31','2005-12-31','2005-09-30','2005-06-30','2005-03-31','2004-12-31','2004-09-30','2004-06-30','2004-03-31','2003-12-31','2003-09-30','2003-06-30','2003-03-31','2002-12-31','2002-09-30','2002-06-30','2002-03-31','2001-12-31','2001-09-30','2001-06-30','2001-03-31','2000-12-31','2000-09-30','2000-06-30','2000-03-31','1999-12-31','1999-09-30','1999-06-30','1999-03-31','1998-12-31','1998-09-30','1998-06-30','1998-03-31','1997-12-31','1997-09-30','1997-06-30','1997-03-31','1996-12-31','1996-09-30','1996-06-30','1996-03-31','1995-12-31','1995-09-30','1995-06-30','1995-03-31','1994-12-31','1994-09-30','1994-06-30','1994-03-31','1993-12-31','1993-09-30','1993-06-30','1993-03-31','1992-12-31','1992-09-30','1992-06-30','1992-03-31','1991-12-31','1991-09-30','1991-06-30','1991-03-31','1990-12-31','1990-09-30','1990-06-30','1990-03-31'])]
    output = output.pivot_table(values='Close', columns='Date', aggfunc='first')
    output = output.rename(index={"Close": tick})
    print(output)
    return output
If you want to combine rows with the same columns into one dataframe, the code below may do the job:
df = pd.DataFrame()
tickers = ["CAT", "CVX", "BA", "AEP", "HON", "BMY"]
for i in tickers:
    responseDf = ticker(i)
    df = df.append(responseDf)
print(df)
"df" is your main dataframe and in each loop result dataframe from ticker function is added to the main dataframe by the "append" function.

Running into errors trying to filter Pandas DF to show only dates over a week

I have created this df by reading in three files.
SUBREF SAMPNUM ... WORKTABLELOGDATE EXPORTED
30 C468633 10552705 ... 2020-06-09 NaN
44 C649747 20380271 ... 2020-06-16 NaN
112 P026530 50482919 ... 2020-04-29 NaN
113 P026530 50482920 ... 2020-04-29 NaN
140 C055419 50485482 ... 2020-05-12 NaN
... ... ... ... ... ...
15308 C725492 99036976 ... 2020-04-27 NaN
15318 S714508 99037098 ... 2020-06-18 NaN
15319 S714508 99037098 ... 2020-06-18 NaN
15320 S714508 99037100 ... 2020-06-18 NaN
15321 S714508 99037100 ... 2020-06-18 NaN
Using this code
WORKTABLE = pd.read_excel('C:/WORKTABLE.XLS', usecols =['SUBREF','SAMPNUM','DET','LOGDATE','SUITE', 'SECTION'], converters= {'LOGDATE': pd.to_datetime})
SAMPLESTABLE = pd.read_excel('C:/SAMPLESTABLE.XLS', usecols = ['SUBREF','SAMPNUM', 'LOCATION','LOGDATE','COMPDATE','REPDATE'], converters = {'EXPORTED': pd.to_datetime})
CBSEXTERNAL = pd.read_excel('C:/Users/CBSEXTNL.XLS', usecols = ['SUBREF','EXPORTED'])
DF3 = pd.merge(left = SAMPLESTABLE, right = WORKTABLE, how='outer', on=['SAMPNUM','SUBREF'])
DF3.rename(columns={'LOGDATE_x':'SAMPLETABLELOGDATE','LOGDATE_y':'WORKTABLELOGDATE'}, inplace=True)
DF3 = DF3[['SUBREF','SAMPNUM','LOCATION','SECTION','DET','SUITE','SAMPLETABLELOGDATE','WORKTABLELOGDATE','COMPDATE','REPDATE','WORKTABLELOGDATE']]
DF4 = pd.merge(left=DF3, right=CBSEXTERNAL, how='outer', on=['SUBREF'])
What I want to do is filter the col WORKTABLELOGDATE to show anything over 7 days old. I've tried various methods but I always end up with the errors
FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the
'datetime.date' is coerced to a datetime. In the future pandas will
not coerce, and a TypeError will be raised. To retain the current
behavior, convert the 'datetime.date' to a datetime with
'pd.Timestamp'.
and
ValueError: cannot reindex from a duplicate axis
What I tried is variants of
DF7SEROLOGY = DF7SEROLOGY[DF7SEROLOGY['WORKTABLELOGDATE'] < weekago]
I've tried re-indexing DF3 and DF4 using
DF4.reset_index(drop=True)
but that didn't help any.
What am I doing wrong?
Expected output as requested, using 23/06/20 as today's date:
SUBREF SAMPNUM ... WORKTABLELOGDATE EXPORTED
30 C468633 10552705 ... 2020-06-09 NaN
140 C055419 50485482 ... 2020-05-12 NaN
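The FutureWarning itself suggests a fix for the first error: build the cutoff as a pd.Timestamp rather than a datetime.date (and the duplicate-axis error may come from WORKTABLELOGDATE appearing twice in the DF3 column list above). A minimal sketch of that comparison, assuming WORKTABLELOGDATE is already datetime64:

import pandas as pd

# cutoff as a Timestamp, as the warning recommends
weekago = pd.Timestamp.today().normalize() - pd.Timedelta(days=7)
over_a_week_old = DF4[DF4['WORKTABLELOGDATE'] < weekago]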

Why can't I drop any columns in dataframe? [duplicate]

This question already has answers here:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
(11 answers)
Closed 2 years ago.
I don't know why 'Unnamed: 0' got there when I reversed the index, and for the life of me I can't drop or del it. It will NOT go away, no matter what I do: by index, or by any possible string variation from 'Unnamed: 0' to just '0'. I've tried setting it by columns= or by .drop(df.columns, and I've tried everything already in my code, such as drop=True. Then I tried dropping other columns and that wouldn't work either.
import pandas as pd

# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')

# change date format, make date into timestamp object, set date as index, write changes to csv file
def clean_date():
    # TRADER_READER['Date'] = TRADER_READER['Date'].replace({'T': ' ', '-0500': '', '-0400': ''}, regex=True)
    # TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
    TRADER_READER.set_index('Date', inplace=True, drop=True)
    # TRADER_READER.iloc[::-1].reset_index(drop=True)
    print(TRADER_READER)
    # TRADER_READER.to_csv('TastyTrades.csv')

clean_date()
Unnamed: 0 Type ... Strike Price Call or Put
Date ...
2020-04-01 11:00:05 0 Trade ... 21.0 PUT
2020-04-01 11:00:05 1 Trade ... NaN NaN
2020-03-31 17:00:00 2 Receive Deliver ... 22.0 PUT
2020-03-31 17:00:00 3 Receive Deliver ... NaN NaN
2020-03-27 16:15:00 4 Receive Deliver ... 7.5 PUT
... ... ... ... ... ...
2019-12-12 10:10:22 617 Trade ... 11.0 PUT
2019-12-12 10:10:21 618 Trade ... 45.0 CALL
2019-12-12 10:10:21 619 Trade ... 32.5 PUT
2019-12-12 09:45:42 620 Trade ... 18.0 CALL
2019-12-12 09:45:42 621 Trade ... 13.0 PUT
[622 rows x 16 columns]
Process finished with exit code 0
I think the problem comes from the CSV, which includes an unnamed column. To fix it, read the CSV specifying the first column as the index, and then set the Date index:
TRADER_READER = pd.read_csv('TastyTrades.csv', index_col=0)
TRADER_READER.set_index('Date', inplace=True, drop=True)
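For background, the 'Unnamed: 0' column is what to_csv leaves behind when a frame is written with its default RangeIndex still attached. Besides skipping it on read as above, you can avoid writing it in the first place; a sketch, assuming the same file and that Date is now the index:

# write Date back as a real column and drop the numeric index,
# so no 'Unnamed: 0' column appears on the next read
TRADER_READER.reset_index().to_csv('TastyTrades.csv', index=False)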

Get data using row / col reference from two column values in another data frame

df1
Date APA AR BP-GB CDEV ... WLL WPX XEC XOM CL00-USA
0 2018-01-01 42.22 19.00 5.227 19.80 ... 26.48 14.07 122.01 83.64 60.42
1 2018-01-02 44.30 19.78 5.175 20.00 ... 27.37 14.31 125.51 85.03 60.37
2 2018-01-03 45.33 19.78 5.242 20.33 ... 27.99 14.39 126.20 86.70 61.63
3 2018-01-04 46.84 19.80 5.300 20.37 ... 28.11 14.44 128.66 86.82 62.01
4 2018-01-05 46.39 19.44 5.296 20.12 ... 27.79 14.24 127.82 86.75 61.44
df2
Date Ticker Event_Type Event_Description Price add
0 2018-11-19 XEC M&A REN 88.03 1
1 2018-03-28 CXO M&A RSPP 143.25 1
2 2018-08-14 FANG M&A EGN 133.75 1
3 2019-05-09 OXY M&A APC 56.33 1
4 2019-08-26 PDCE M&A SRCI 29.65 1
My goal is to update df2['add'] by using df2['Ticker'] and df2['Date'] to pull the value from df1. So, for example, the first row in df2 is XEC on 2018-11-19; the code needs to first look at df1['XEC'] and then pull the value from the row where df1['Date'] matches 2018-11-19.
My attempt was:
df_Events['add'] = df_Prices.loc[[df_Prices['Date']==df_Events['Date']],[df_Prices.columns==df_Events['Ticker']]]
Try:
# melt df1 into long form (Date, Ticker, add), then merge onto df2 by Date and Ticker;
# the placeholder 'add' column is dropped first so the merged values take its place
df2 = df2.drop(columns='add').merge(df1.melt(id_vars='Date', var_name='Ticker', value_name='add'), on=['Date', 'Ticker'], how='left')
This melts df1's per-ticker columns into a single column, and then merges the values in that column onto df2.
One more approach may be as below (I had started looking at it, so I am putting it here even though you have accepted the other answer).
First convert the dates into datetime objects in both dataframes and set the date as the index only in the first one:
df1['Date'] = pd.to_datetime(df1['Date'])
df1.set_index('Date', inplace=True)
df2['Date'] = pd.to_datetime(df2['Date'])
Then use apply to look up the value for each row:
df2['add'] = df2.apply(lambda x: df1.loc[x['Date'], x['Ticker']], axis=1)
This will work only if the dates and values for all tickers exist in both dataframes (otherwise it will throw a KeyError).
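If missing dates or tickers are a possibility, a more tolerant variant (a sketch, not from the answers above) swaps .loc for .get, which falls back to a default instead of raising:

# .get returns None instead of raising KeyError when the ticker
# column or the date row is absent from df1
df2['add'] = df2.apply(
    lambda x: df1.get(x['Ticker'], pd.Series(dtype=float)).get(x['Date']),
    axis=1,
)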

pandas datetimeindex between_time function (how to get a not_between_time)

I have a pandas df, and I use between_time(a, b) to clean the data. How do I get the opposite, not-between-time behavior?
I know I can try something like:
df.between_time('00:00:00', a)
df.between_time(b, '23:59:59')
then combine the two and sort the new df, but that's very inefficient, and it doesn't work for me as I have data between 23:59:59 and 00:00:00.
Thanks
You could find the index locations of rows with time between a and b, and then use Index.difference to remove those from the index:
import pandas as pd
import io

text = '''\
date,time,val
20120105, 080000, 1
20120105, 080030, 2
20120105, 080100, 3
20120105, 080130, 4
20120105, 080200, 5
20120105, 235959.01, 6
'''
df = pd.read_csv(io.StringIO(text), parse_dates=[[0, 1]], index_col=0)
index = df.index
ivals = index.indexer_between_time('8:01:30', '8:02')
print(df.reindex(index.difference(index[ivals])))
yields
val
date_time
2012-01-05 08:00:00 1
2012-01-05 08:00:30 2
2012-01-05 08:01:00 3
2012-01-05 23:59:59.010000 6
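An equivalent spelling on current pandas, as a short sketch against the same df, is to turn those positions into a boolean mask and negate it:

import numpy as np

# mark rows whose time falls between a and b, then keep the rest
mask = np.zeros(len(df), dtype=bool)
mask[df.index.indexer_between_time('8:01:30', '8:02')] = True
print(df[~mask])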
