How to create a DataFrame without pre-defining the number of rows - python

I am not very advanced in Python and pandas yet.
I have a function which calculates the returns of a stock under one strategy and outputs them as a DataFrame. For example, if I want to calculate returns from 2017 to 2019, the function outputs returns from 2017 to 2019, and if I want returns from 2010 to 2019, it outputs returns from 2010 to 2019. The function handles one stock at a time.
Now I have multiple stocks, so I call this function in a for loop that goes through the stocks to get their returns. I want to put the returns of all the stocks into one DataFrame, so I was thinking of pre-defining a DataFrame of zeros before the loop and then filling in the returns on each iteration.
The problem is that it is not easy to know in advance how many rows the returns DataFrame will have, so I cannot define the number of rows of the zeros DataFrame that will hold all the returns later (I only know the number of columns, since that is just the number of stocks). So I wonder: is there a way I could put each return series as a whole into the zeros DataFrame (i.e. fill it in column by column)?
I hope I stated my question clearly enough.
Thanks very much!
Please don't advise me not to use loops at this stage. I now re-state my question as follows:
In the code below:
for k in ticker:
    stock_price = dataprepare(start_date, end_date, k)
    mask_end, mask_beg = freq(stock_price, trading_period)
    signal_trade = singals(fast_window, slow_window, stock_price, mask_end)
    a = basiccomponent(fast_window, slow_window, stock_price, mask_beg, mask_end, year, signal_trade, v)[2]
dataprepare, freq, singals and basiccomponent are self-defined functions.
a is a returns DataFrame. I want to save all the a's from each iteration into one DataFrame, something like append, but appending columns after each loop, such as:
a.append(a)
Instead of appending rows, I want to append columns. How can I do it?
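One common pattern (not from the original post; just a sketch reusing the names above, which are assumed to exist) is to collect each loop's result in a list and concatenate column-wise at the end, which avoids pre-allocating a zeros DataFrame of unknown length:
import pandas as pd

returns_list = []
for k in ticker:
    stock_price = dataprepare(start_date, end_date, k)
    mask_end, mask_beg = freq(stock_price, trading_period)
    signal_trade = singals(fast_window, slow_window, stock_price, mask_end)
    a = basiccomponent(fast_window, slow_window, stock_price,
                       mask_beg, mask_end, year, signal_trade, v)[2]
    returns_list.append(a)

# axis=1 places each stock's returns in its own column; pandas aligns rows by index,
# so the individual DataFrames do not all need the same number of rows.
all_returns = pd.concat(returns_list, axis=1, keys=ticker)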

Related

Is there a way to get the first row in a grouped dataframe?

This is code I wrote, but the output is too big (over 6000 rows). How do I get the first result for each year?
df_year = df.groupby('release_year')['genres'].value_counts()
Let's start with a small correction concerning the variable name: value_counts returns a Series (not a DataFrame), so you should not use a name starting with df. Assume that the variable holding this Series is gen.
Then one possible solution is:
result = gen.groupby(level=0).apply(lambda grp:
    grp.droplevel(0).sort_values(ascending=False).head(1))
Initially you wrote that you wanted the most popular genre in each year, so I sorted each group in descending order and returned the first row from the current group.
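For illustration, a self-contained run with made-up data (the release_year and genres column names come from the question) might look like this:
import pandas as pd

df = pd.DataFrame({
    'release_year': [1999, 1999, 1999, 2000, 2000, 2000],
    'genres': ['Drama', 'Drama', 'Comedy', 'Action', 'Action', 'Drama'],
})

# Counts of each genre within each year, as a Series with a (year, genre) MultiIndex.
gen = df.groupby('release_year')['genres'].value_counts()

# Keep only the most frequent genre per year.
result = gen.groupby(level=0).apply(
    lambda grp: grp.droplevel(0).sort_values(ascending=False).head(1))
print(result)   # one row per year: Drama (2) for 1999, Action (2) for 2000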

I need help concatenating 1 csv file and 1 pandas dataframe together without duplicates

My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysis. df1 is, let's say, the first 100 games written to the csv file as the first version. df2 is me reading back those first 100 games the second time around and comparing them against df1 (the new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop_duplicates part. It gives me an 'unhashable type: list' error; I assume that's because the two dataframes contain lists of dictionaries. The goal is to pull 100 games of data, then pull the next 50, but if I pull game 100 again, to drop it and just add 101-150, and then add it all to my csv file. Then if I run it again, to pull 150-200 but drop 150 if it's a duplicate, and so on.
Based on your explanation, you can use this one-liner to find the rows of df1 that are not present in df2:
df_diff = df1[~df1.apply(tuple, 1)
                 .isin(df2.apply(tuple, 1))]
This code checks whether each row exists in the other dataframe. To do the comparison it converts each row to a tuple (applying the tuple conversion along axis 1, i.e. row-wise).
This solution is indeed slow because it compares each row of df1 to all rows of df2, so it has n^2 time complexity.
If you want a more optimised version, try the pandas built-in compare method:
df1.compare(df2)
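As a quick illustration of the row-to-tuple approach (the frames and column names here are made up):
import pandas as pd

df1 = pd.DataFrame({'game_id': [99, 100, 101], 'score': [3, 5, 2]})   # newly pulled games
df2 = pd.DataFrame({'game_id': [98, 99, 100], 'score': [1, 3, 5]})    # games already in the csv

# Keep only the rows of df1 whose full tuple of values does not appear in df2.
df_diff = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
print(df_diff)   # only game 101 is new, so only that row remains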

Modifying the date column calculation in pandas dataframe

I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with the value 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be changed to the number of weeks between the new, most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct number of weeks
    when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above, but I am a little confused about how to do this, especially considering we could have multiple failures and not just two.
I should get something like this returned as output
As you can see, what happened to the 34 is that it got changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested:
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(
        df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear; happy to correct if I interpreted it wrongly.
Try using np.where(condition, choice if condition is True, choice if condition is False):
# Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])

# Solution
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_created_date.sub(df.rma_processed_date),
    df.time_in_weeks)
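Note that subtracting the two date columns yields a Timedelta rather than a week count. A minimal sketch (with made-up data, assuming the goal is whole weeks between the current rma_created_date and the previous row's rma_processed_date) could convert it like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'uniqueid': [1, 1],
    'rma_created_date': pd.to_datetime(['2020-02-01', '2020-10-15']),
    'rma_processed_date': pd.to_datetime(['2020-06-28', '2020-11-01']),
    'time_in_weeks': [20, 34],
})

# Weeks between this row's rma_created_date and the previous row's rma_processed_date.
weeks = (df['rma_created_date'] - df['rma_processed_date'].shift(1)) / pd.Timedelta(weeks=1)

# Only overwrite rows that are repeat occurrences of a uniqueid.
df['time_in_weeks'] = np.where(df['uniqueid'].duplicated(keep='first'),
                               weeks.round(), df['time_in_weeks'])
print(df)   # the 34 becomes the number of weeks between 2020-10-15 and 2020-06-28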

How to calculate based on multiple conditions using Python data frames?

I have an Excel data file with thousands of rows and columns.
I am using Python and have started using pandas dataframes to analyze the data.
What I want to do in column D is calculate the annual change for the values in column C, for each year and each ID.
I can use Excel to do this: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, given the assumption that things are sorted, you can speed this up, because it is not necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you'll need to assign to a column or create a new dataframe with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
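For context, a minimal self-contained run of the groupby version might look like this (the ID, Year and Cash values are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B'],
    'Year': [2018, 2019, 2020, 2019, 2020],
    'Cash': [100.0, 110.0, 99.0, 50.0, 75.0],
})

# Percentage change within each ID; the first row of each ID gets NaN,
# matching the blank "first period" cells described in the question.
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
print(df)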

Water year time series: resampling annually for custom non-calendar year dates

I am writing my very first loop, since resampling doesn't allow me to use custom start dates for annual sampling. My goal is to sum each run of 12 consecutive months in a 30-year time series for a non-calendar-year calculation (the hydrologic water year, Oct-Sept). The dataset begins in the month of October, so I figured I would simply add together the first 12 rows, the next 12 rows, and so on. Perfect for a loop, right?! Two questions:
1) What is the simplest way to add 'n' rows together, with the output placed in a new DataFrame indexed by year?
2) My attempted solution to question 1 is below, and it works. However, the data type of the output is a 'NoneType', which I cannot merge back with another DataFrame via pd.concat. How do I fix this?
def Water_Year_Total(Monthly_Data_30yrs):
    for i in range((len(Monthly_Data_30yrs)) // 12):
        x = 0
        y = 12
        new_value = sum(data[(x + (12 * i)):(y + (12 * i))])
        print(new_value)
The for loop first counts the number of rows via the len() function, then divides it by 12 to get the number of years in the dataset, and then iterates the sum loop i times before printing out the result.
Your function doesn't return anything, which is why it yields a NoneType when it finishes running. Create a variable before the for loop, add the different new_values to it, and then return that variable after the for loop completes.
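A minimal sketch of that fix (assuming the monthly values come in as a pandas Series that starts in October, and using a made-up start_year label for the index) might look like this:
import pandas as pd

def water_year_total(monthly_data, start_year):
    '''Sum each block of 12 consecutive months and return one row per water year.'''
    totals = []
    for i in range(len(monthly_data) // 12):
        # Sum the Oct-Sept months of the i-th water year.
        totals.append(monthly_data.iloc[12 * i: 12 * (i + 1)].sum())
    # Returning a DataFrame (instead of printing) means the result is no longer NoneType
    # and can be merged with other DataFrames via pd.concat.
    return pd.DataFrame({'water_year_total': totals},
                        index=range(start_year, start_year + len(totals)))

monthly = pd.Series(range(1, 25))          # two years of made-up monthly values
print(water_year_total(monthly, 1990))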
