Summing Segments of a Dataframe Column Python

I'm trying to loop through my DataFrame to compute the 5-year forward return starting from each row i. My code is as follows, but it fails with a SyntaxError (invalid syntax).
list = ['aapl','tsla','vz','t']
df = pd.io.data.get_data_yahoo(list, start = start_of_interval, end = end_of_interval, interval = data_interval)['Adj Close']
df = DataFrame(df)
df['Returns'] = df.pct_change()
l = df.index.values
for i in range(0,len(l)):
    df.loc[l[i], '5YearReturn'] = df.cumsum(df.loc[l[i], "Returns"]:df.loc[l[i+1824], "Returns"])
Can I not use cumsum in this way?
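
The slice syntax `a:b` is only valid inside square brackets, so passing it as an argument to `cumsum` is what triggers the syntax error. One possible way to get a forward-looking sum without a loop is a rolling sum that is shifted back to the first row of each window. A minimal sketch, assuming a single return series and a window counted in rows (the toy numbers and the name `window` are made up for illustration; 1825 rows would match the `i` to `i+1824` span in the question):

```python
import pandas as pd

# Toy returns; in the question these would come from pct_change().
returns = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05])

# Hypothetical window length in rows (the question's span would be 1825).
window = 3

# rolling(window).sum() looks backwards: position t holds the sum over
# rows t-window+1 .. t.
trailing_sum = returns.rolling(window).sum()

# Shifting up by window-1 rows relabels each sum with the first row of its
# window, so position t now holds the sum over rows t .. t+window-1.
forward_sum = trailing_sum.shift(-(window - 1))

print(forward_sum)
```

The last `window - 1` rows come out as NaN because a full window is not available for them. Note that adding simple returns ignores compounding; that is a separate choice from the windowing shown here.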

Related

How to concatenate a series to a pandas dataframe in python?

I would like to iterate through the rows of a dataframe and concatenate each row onto a different dataframe, essentially building up a new dataframe from selected rows.
For example:
# IPCSection and IPCClass are DataFrames
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis = 0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if (secrow[0] in clrow[0]):
            pdList = [finalpatentclasses, pd.DataFrame(secrow), pd.DataFrame(clrow)]
            finalpatentclasses = pd.concat(pdList, axis=0, ignore_index=True)
display(finalpatentclasses)
The output is a table full of NaN values.
I want the NaN values to disappear and all the data to sit under the correct columns. I tried axis=1, but that messes up the column names. Append does not work either: all values are placed diagonally in the table, again with NaN values around them.

Alright, I have figured it out. The idea is to build each new row as its own one-row DataFrame: concatenate the values from both source rows into a single array, assign that array as the row, and then concat the one-row frame onto the final dataframe.
Here is the code:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis = 0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        # one-row frame: the section values followed by the class values
        newrow = pd.DataFrame(columns=allcolumns)
        values = np.concatenate((secrow.values, clrow.values), axis=0)
        newrow.loc[len(newrow.index)] = values
        finalpatentclasses = pd.concat([finalpatentclasses, newrow], axis=0)
# every one-row frame was labelled 0, so rebuild a clean 0..n-1 index
finalpatentclasses.reset_index(drop=True, inplace=True)
display(finalpatentclasses)
Update: the code below is more efficient:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis = 0)
newList = []
for secrow in IPCSection.itertuples():
    for clrow in IPCClass.itertuples():
        # itertuples() puts the index at position 0, so the columns start at 1
        if (secrow[1] in clrow[1]):
            values = [secrow[1], secrow[2], clrow[1], clrow[2]]
            newList.append(values)
# build the frame once from the collected rows instead of concatenating per row
finalpatentclasses = pd.DataFrame(newList, columns=allcolumns)
display(finalpatentclasses)
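
The nested loops above pair every section with every class and keep the pairs where the section code occurs inside the class code. The same pairing can be sketched with a cross join followed by a filter. This is only a sketch under assumptions: the column names and the sample rows are invented for illustration, and `how="cross"` needs pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical stand-ins for IPCSection and IPCClass.
sections = pd.DataFrame({
    "section": ["A", "B"],
    "section_title": ["Human necessities", "Operations"],
})
classes = pd.DataFrame({
    "class": ["A01", "A21", "B01"],
    "class_title": ["Agriculture", "Baking", "Separation"],
})

# Cross join: every section row paired with every class row.
pairs = sections.merge(classes, how="cross")

# Keep only the pairs where the section code is a substring of the class code,
# which mirrors the `secrow[1] in clrow[1]` test in the loop version.
mask = [sec in cls for sec, cls in zip(pairs["section"], pairs["class"])]
result = pairs[mask].reset_index(drop=True)

print(result)
```

The cross join materialises every pair before filtering, so it trades memory for the loop overhead; for very large inputs that trade-off may not be worthwhile.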

Create dataframe conditionally to other dataframe elements

Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'],'A': [63.63,64.08,64.19,65.11,65.36,65.25,65.36], 'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36], 'C':[63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name':['A','B','C'],'Notice': ['05.05.1982','07.05.1982','12.05.1982']})
The idea is to create df3 such that it takes the values of A until A's notice date (found in df2) is reached, then switches to the values of B until B's notice date is reached, and so on. On a notice date itself, it should take the mean of the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'], 'Result':[63.63,64.08,(64.19+64.19)/2,65.08,(65.33+65.41)/2,65.36,65.44]})
My idea was to first create a temporary dataframe with the same dimensions as df1 and to fill it with 1's where the index date is before the notice date and 0's after. A rolling mean with window 2 would then give, for each column, a series of 1's until it reaches 0.5 (signalling a switch).
Is there a better way to get df3?
I tried the following:
def fill_rule(df_p,df_t):
    return np.where(df_p.index > df_t[df_t.Name==df_p.name]['Notice'][0], 0, 1)
df1['date'] = pd.to_datetime(df1['date'])
df2['Notice'] = pd.to_datetime(df2['Notice'])
df1.set_index("date", inplace = True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis = 0)
And I got the following error: KeyError: (0, 'occurred at index B')

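# label each notice date with the name of the column whose notice falls on it
df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
# back-fill so earlier rows inherit the next notice's column; rows after the last
# notice in df1 fall back to "C" (notice dates take the current column, not a mean)
df1['t'] = df1['t'].bfill().fillna("C")
df3 = pd.DataFrame()
# pick, row by row, the value from the column named in 't'
df3['Result'] = df1.apply(lambda row: row[row['t']],axis =1)
df3['date'] = df1['date']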

You can use the between method to select the date range that belongs to each column, then use loc to accumulate the matching values and divide by the number of ranges each row fell into.
#Initializing the output
df3 = df1.copy()
df3.drop(['B','C'], axis = 1, inplace = True)
df3.columns = ['date','Result']
df3['Result'] = 0.0
df3['count'] = 0
#Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name ='Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp,df2])
for i in range(len(df2)-1):
    startDate = df2.iloc[i]['Notice']
    endDate = df2.iloc[i+1]['Notice']
    name = df2.iloc[i+1]['Name']
    # both ends are inclusive, so a notice date falls into two ranges and gets averaged
    indices = df1.date.between(startDate, endDate, inclusive='both')
    df3.loc[indices,'Result'] += df1.loc[indices, name]
    df3.loc[indices,'count'] += 1
df3.Result = df3.Result / df3['count']
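
For comparison, here is a loop-free sketch of the same rule using `np.searchsorted` to find the active column for each date. It rests on a few assumptions that are not stated in the question: the notice dates in df2 are sorted ascending, the dates are day-first (`%d.%m.%Y`), and rows after the last notice date keep using the last column.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'date': ['03.05.1982', '04.05.1982', '05.05.1982', '06.05.1982',
             '07.05.1982', '10.05.1982', '11.05.1982'],
    'A': [63.63, 64.08, 64.19, 65.11, 65.36, 65.25, 65.36],
    'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36],
    'C': [63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44],
})
df2 = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Notice': ['05.05.1982', '07.05.1982', '12.05.1982']})

# Parse day-first dates explicitly so 03.05 is read as 3 May, not 5 March.
df1['date'] = pd.to_datetime(df1['date'], format='%d.%m.%Y')
df2['Notice'] = pd.to_datetime(df2['Notice'], format='%d.%m.%Y')

# Price columns in the same order as the notice dates.
prices = df1.set_index('date')[df2['Name'].tolist()]
values = prices.to_numpy()
dates = prices.index.to_numpy()
notice = df2['Notice'].to_numpy()
last = len(notice) - 1

# Active column = first column whose notice date is on or after the row date.
active = np.minimum(np.searchsorted(notice, dates, side='left'), last)
following = np.minimum(active + 1, last)

rows = np.arange(len(prices))
current = values[rows, active]
upcoming = values[rows, following]

# On a notice date, average the active column with the next one.
on_notice = dates == notice[active]
result = np.where(on_notice, (current + upcoming) / 2, current)

df3 = pd.DataFrame({'date': prices.index, 'Result': result})
print(df3)
```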

DataFrame with one column 0 to 100

I need a DataFrame of one column ['Week'] that has all values from 0 to 100 inclusive.
I need it as a Dataframe so I can perform a pd.merge
So far I have tried creating an empty DataFrame, creating a series of 0-100 and then attempting to append this series to the DataFrame as a column.
alert_count_list = pd.DataFrame()
week_list= pd.Series(range(0,101))
alert_count_list['week'] = alert_count_list.append(week_list)

Try this:
df = pd.DataFrame(columns=["week"])
df["week"] = np.arange(101)

alert_count_list = pd.DataFrame(np.arange(101), columns=['week'])
or
alert_count_list = pd.DataFrame({'week':range(101)})

You can try:
week_vals = []
for i in range(0, 101):
    week_vals.append(i)
df = pd.DataFrame(columns = ['week'])
df['week'] = week_vals
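
Since the stated goal is a `pd.merge`, here is a small sketch of how the 0 to 100 frame might be used as the left side of a merge so that weeks without data still appear. The `alerts` frame and its column names are hypothetical, invented only to have something to merge with.

```python
import pandas as pd

# One column holding every week from 0 to 100 inclusive.
weeks = pd.DataFrame({"week": range(101)})

# Hypothetical alert data that only covers some weeks.
alerts = pd.DataFrame({"week": [3, 3, 50], "alerts": [1, 2, 5]})

# Collapse to one row per week, then left-merge so all 101 weeks are kept.
counts = alerts.groupby("week", as_index=False)["alerts"].sum()
merged = weeks.merge(counts, on="week", how="left").fillna({"alerts": 0})

print(merged.head())
```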

Slow Data analysis using pandas

I am using a mixture of lists and pandas DataFrames to clean and merge CSV data. The following snippet from my code runs extremely slowly. It generates a CSV with about 3 million lines of data.
UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0,len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0,len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns = ['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1,'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1,'M')))
    finalMonths = Months - Years*12 + 1
    Y,x = str(minDate).split("-",1)
    x,Y = str(Y).split(" ",1)
    for k in range(0,Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)
full_df = pd.concat(dummydata)
result = full_df.merge(dataClean,how='left').fillna(0)
print(result[:100])
result.to_csv(ResultPath, index_label=False, index=False)
This snippet has been running for hours, and the output should have about 3 million lines. There has to be a faster way in pandas to accomplish the goal, which is as follows:
For each unique API, I find all occurrences in the main list of APIs.
Using that information, I build a list of dates.
I find the min and max date for each list corresponding to an API.
I then build an empty pandas DataFrame that has every month between the two dates for the associated API.
I append this DataFrame to a list, "dummydata", and loop to the next API.
I then concatenate this list into a single DataFrame.
This DataFrame is then merged with another DataFrame containing the cleaned data.
The end result is a CSV that has a 0 value for months that did not exist in the data but should, between the min and max dates for each corresponding API in the original unclean list.
This all takes far longer than I would expect. I would have thought that finding the min and max date for each unique item, and interpolating monthly between them while filling months that have no data with 0, would be a three-line job in pandas. Any options you think I should explore, or any snippets of code that could help me out, would be much appreciated!

You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose, since full_df was just created and is already an empty dataframe:
if k > 0:
    del full_df
    full_df = pd.DataFrame()
Then, when you actually build up your full_df, it's better to do it all at once rather than one column at a time. Try building it from a dict of columns, something like this:
full_df = pd.DataFrame({
    'API': UniqueAPI[i],
    'Production Month': [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)],
    'Days': DaysList,
    # etc...
})
The dict keys become the column labels, so there is no separate step to name the columns.
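
The steps listed in the question (min and max date per API, every month in between, left merge, fill with 0) can also be sketched without the nested loops. This is only a sketch under assumptions: the column names `API`, `Production Month` and `Production`, and the toy rows, are invented for illustration, and the dates are assumed to fall on the first day of each month.

```python
import pandas as pd

# Hypothetical cleaned data with gaps: "a" is missing Feb and Mar 2020,
# "b" is missing Dec 2021.
data = pd.DataFrame({
    "API": ["a", "a", "b", "b"],
    "Production Month": pd.to_datetime(
        ["2020-01-01", "2020-04-01", "2021-11-01", "2022-01-01"]),
    "Production": [10.0, 40.0, 5.0, 7.0],
})

# First and last month seen for each API.
spans = data.groupby("API")["Production Month"].agg(["min", "max"])

# One small frame per API holding every month start between its min and max.
skeleton = pd.concat(
    [pd.DataFrame({"API": api,
                   "Production Month": pd.date_range(first, last, freq="MS")})
     for api, first, last in spans.itertuples()],
    ignore_index=True,
)

# The left merge keeps every month of the skeleton; missing months get 0.
result = (skeleton
          .merge(data, on=["API", "Production Month"], how="left")
          .fillna({"Production": 0}))

print(result)
```

There is still one Python-level iteration per API to build the date ranges, but the row-by-row scan over the full list and the column-by-column frame building are gone.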

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift.
Thanks to this previous question and answer, "How to speed up Pandas multilevel dataframe shift by group?", I can confirm that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry of each group in the multi-index to NaN, so I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
But I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use similar code to set the last N entries of each group to NaN, but I am obviously missing some important indexing knowledge, as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
import numpy as np
import pandas as pd
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop(columns='tmpShift',inplace=True)
Thanks!

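I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp,col,N,value):
    # work on a copy and index by position to avoid chained assignment
    grp = grp.copy()
    pos = grp.columns.get_loc(col)
    if (N > 0):
        grp.iloc[:N, pos] = value
    else:
        grp.iloc[N:, pos] = value
    return grp
# group_keys=False keeps the original (category, date) index on the result
df = df.groupby(level=0, group_keys=False).apply(replace_tail,'tmpShift',2,np.nan)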
So the final code is:
import numpy as np
import pandas as pd
def replace_tail(grp,col,N,value):
    # work on a copy and index by position to avoid chained assignment
    grp = grp.copy()
    pos = grp.columns.get_loc(col)
    if (N > 0):
        grp.iloc[:N, pos] = value
    else:
        grp.iloc[N:, pos] = value
    return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
# group_keys=False keeps the original (category, date) index on the result
df = df.groupby(level=0, group_keys=False).apply(replace_tail,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop(columns='tmpShift',inplace=True)
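
Another way to blank the last N rows of each group, without calling a Python function per group, is to number the rows from the end of each group with `cumcount(ascending=False)` and mask on that. A minimal sketch, assuming a small frame with a (category, date) MultiIndex; the data and the column name `fwd` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Two groups of four rows each, indexed by (category, date).
idx = pd.MultiIndex.from_product(
    [[1, 2], pd.date_range("1990-01-01", periods=4, freq="D")],
    names=["category", "date"],
)
df = pd.DataFrame({"colB": np.arange(8.0)}, index=idx)

N = 2

# Global forward shift: fast, but the last N rows of each group now hold
# values that belong to the following group.
df["fwd"] = df["colB"].shift(-N)

# Distance of every row from the end of its group (0 = last row).
from_end = df.groupby(level=0).cumcount(ascending=False)

# Blank the last N rows of each group in a single vectorised assignment.
df.loc[(from_end < N).to_numpy(), "fwd"] = np.nan

# For comparison: a group-aware shift, which should give the same column.
grouped = df.groupby(level=0)["colB"].shift(-N)

print(df)
```

Whether the mask or the built-in grouped shift is faster probably depends on the pandas version and the number of groups, so it is worth timing both on real data.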
