I have a data frame with two columns. The condition I need to implement is: when the 'Balance Created' column is empty, I need to take the last filled value of 'Balance Created' and add to it the Amount value of the next row.
Original Data frame:
After Calculation, my desired result should be:
You can try using the cumulative sum in pandas to achieve this:
df['Amount'].cumsum()
# Edit-1
condition = df['Balance Created'].isnull()
df.loc[condition, 'Balance Created'] = df['Amount'].cumsum()[condition]
You can also apply this per group, for example separately for deposits and withdrawals:
df.groupby('transaction')['Amount'].cumsum()
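For illustration, here is a minimal sketch with made-up numbers, assuming (as in the original frame) that Amount is always filled, Balance Created is only present on some rows, and the first filled balance equals the running total at that point:

import numpy as np
import pandas as pd

# Hypothetical data mirroring the question's shape.
df = pd.DataFrame({
    'Amount': [100, 50, -30, 20],
    'Balance Created': [100, np.nan, np.nan, np.nan],
})

condition = df['Balance Created'].isnull()
df.loc[condition, 'Balance Created'] = df['Amount'].cumsum()[condition]
print(df)
#    Amount  Balance Created
# 0     100            100.0
# 1      50            150.0
# 2     -30            120.0
# 3      20            140.0

Each empty row ends up as the previous balance plus its own Amount (100 + 50 = 150, 150 - 30 = 120, and so on).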
I assume your question is mostly "How do I solve this using pandas?", which is a good question that others have given you pandas-specific answers for.
But in case this question is more along the lines of "How do I solve this using an algorithm?", which is a common problem for people just starting out writing code, then this little paragraph might push you in the right direction.
for i in frame do
    if frame.balance[i] is empty do
        if i equals 0 do // Edge case: the first balance is missing
            frame.balance[i] = frame.amount[i]
        else do
            frame.balance[i] = frame.amount[i] + frame.balance[i-1]
        end
    end
end
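In pandas terms, a direct (if unvectorized) translation of that pseudocode might look like the following, assuming a default integer index and columns named 'Amount' and 'Balance Created':

import pandas as pd

for i in range(len(df)):
    if pd.isna(df.loc[i, 'Balance Created']):
        if i == 0:  # edge case: the first balance is missing
            df.loc[i, 'Balance Created'] = df.loc[i, 'Amount']
        else:
            df.loc[i, 'Balance Created'] = (df.loc[i, 'Amount']
                                            + df.loc[i - 1, 'Balance Created'])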
I am attempting to position some numeric values in an imported Excel file according to changes in my date column (it's called Label). For the moment I am only concerned with placing several values in several columns in the same row when the year changes.
Here's my attempt:
# Values to place
values1 = [1, 2, 3]
values2 = [4, 5, 6]

# New columns
workfile['year'] = pd.DatetimeIndex(workfile['Label']).year
workfile['Voice'] = ''
workfile['Political Stability'] = ''

for j in range(0, len(values1)):
    for i in range(1, len(workfile['Label'])):
        if workfile.iloc[i, 4] != workfile.iloc[i-1, 4] and workfile.iloc[i, 5] == '':
            workfile.iloc[i-1, 5] = values1[j]
            workfile.iloc[i-1, 6] = values2[j]
            break
This returns the last values of both vectors values1 and values2 positioned at just one row (the code only identifies one change in years, whereas each value in these vectors represents a year of its own). I want to place each of the values in these vectors, in FIFO (First In, First Out) order, in the row just before the year changes. I hope I have made myself clear; if not, please let me know.
I imagine it's evident, but I am a complete beginner in Python, so many thanks in advance for any comment or suggestion; I'd be very appreciative!
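One hedged sketch of the FIFO placement described above, assuming workfile has a default RangeIndex and reusing values1/values2 from the snippet: walk the rows once and consume one value from each list at every year change.

voice_col = workfile.columns.get_loc('Voice')
stability_col = workfile.columns.get_loc('Political Stability')
j = 0  # next value to place, in FIFO order
for i in range(1, len(workfile)):
    if workfile['year'].iat[i] != workfile['year'].iat[i - 1] and j < len(values1):
        # Place the values on the row just before the year changes.
        workfile.iloc[i - 1, voice_col] = values1[j]
        workfile.iloc[i - 1, stability_col] = values2[j]
        j += 1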
I'm new to Python and have researched to find an answer; I am most likely not asking the right question. I am streaming data from an exchange into a dataframe (and will later stream the data into a database). My problem is that when I do a calculation on a column to create a new column containing the result, the values of all rows in the new column change to the last result.
I am streaming in the open, high, low, and close of a stock. In one column I am calculating the range of a candle during the timeframe, as on a one-hour chart.
src = candles.close
ohlc = candles
ohlc = ohlc.rename(columns=str.lower)
candles['SMA_21'] = TA.SSMA(ohlc, period)
candles['EMA_21'] = TA.EMA(ohlc, period)
candles['WMA'] = TA.WMA(ohlc, 10)
candles['Range'] = src - candles['open']
candles['AvgRange'] = candles['Range'].tail(21).mean()
The Range column works and has the correct information, which is not changed by each calculation. But the 'AvgRange' column ends up with all values changed to each newly calculated mean.
The following also writes the last data entry to the whole stream['EMA_Dir'] column:
if stream['EMA'].iloc[-1] > stream['EMA'].iloc[-2]:
stream['EMA_Dir'] = "Ascending"
I only want the value in the last (most recent) row of the dataframe.
I've tried several things, but the last calculation changes all values in the 'AvgRange' column.
Thanks in advance. Sorry if I didn't ask the question correctly, but that is probably why I haven't found the answer.
candles['AvgRange'] = candles['Range'].rolling(
    window=3,
    center=False
).mean()
This will give you a 3-row rolling average; use window=21 for the 21-row average you described.
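For the stream['EMA_Dir'] part of the question: assigning a scalar to a column broadcasts it to every row. A hedged sketch that targets only the most recent row (assuming stream keeps its newest candle last):

if stream['EMA'].iloc[-1] > stream['EMA'].iloc[-2]:
    # .loc with the last index label writes to just that one row.
    stream.loc[stream.index[-1], 'EMA_Dir'] = "Ascending"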
I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with the value 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be changed to the number of weeks between the new most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    Fix the time_in_weeks column to calculate the correct number of
    weeks when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above but I am a little confused on how to do this. Especially considering we could have multiple failures and not just 2.
I should get something like this returned as output
As you can see, the 34 was changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested:
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(
        df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks)
I get this
Final note: if there is a date of 1/1/1900, then don't perform any calculation.
The question is not very clear; happy to correct if I interpreted it wrongly.
Try using np.where(condition, choice if condition, choice if not condition):
#Coerce dates into datetime
df['rma_processed_date']=pd.to_datetime(df['rma_processed_date'])
df['rma_created_date']=pd.to_datetime(df['rma_created_date'])
#Solution
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    (df.rma_created_date - df.rma_processed_date.shift(1)).dt.days // 7,
    df.time_in_weeks)
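For the final note about 1/1/1900 dates, a hedged sketch (assuming 1/1/1900 is a literal placeholder value in the date columns) is to fold that check into the condition so those rows are left untouched:

# Hedged sketch: skip the recalculation whenever either date involved is
# the 1/1/1900 placeholder, leaving time_in_weeks unchanged on those rows.
placeholder = pd.Timestamp('1900-01-01')
mask = (df.uniqueid.duplicated(keep='first')
        & df.rma_created_date.ne(placeholder)
        & df.rma_processed_date.shift(1).ne(placeholder))
df['time_in_weeks'] = np.where(
    mask,
    (df.rma_created_date - df.rma_processed_date.shift(1)).dt.days // 7,
    df.time_in_weeks)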
Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
The plot_data is a simple DataFrame:
What I would like to do now is subtract from each column the values of the prior day's column, i.e., I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns with NaN values, i.e. Python tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant way". I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, if it doesn't work in the single-column example, it doesn't work with multiple columns either (as Python will match the column headers and return a bunch of zeros/NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much, @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day, and now I can subtract them from the actual numbers on each day.
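As a side note, pandas can express the same subtraction directly with DataFrame.diff along the columns, which computes each column minus the previous one (the first column becomes NaN):

# Equivalent one-liner: new cases per day as column-wise differences.
new_cases = plot_data.diff(axis=1)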
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find their lowest score up until each date in their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the min of all values, not just from the current row back to the first.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly, but the problem does boil down to the following: you need to select the range of data you want to work with (so select rows for the date range and columns for the user/speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['user', 'speed']]
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a bit, but what you're looking for is how to slice DataFrames.
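If the goal is a running minimum per user rather than a single slice, a hedged sketch using GroupBy.cummin (assuming the rows can be sorted by a date column, here a hypothetical 'date' name):

# Sort so "up until that date" holds, then take the cumulative minimum
# of speed within each user group.
df = df.sort_values('date')  # 'date' is a hypothetical column name
df['min_so_far'] = df.groupby('user')['speed'].cummin()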