How does the .sum() method in pandas.DataFrame actually work?
I'm calculating the proportion of each individual staff worker's salary to the total of all salaries.
The CSV has 33,000 rows.
The function below, add_proportion, goes row by row, reads each worker's salary, and divides it by salary.sum() over all rows.
Question: in each of these 33,000 cycles, does salary.sum() run its own 33,000 iterations to recalculate the total over and over?
I'm asking because in that case the total number of iterations would be about 1 billion (33,000 times 33,000), which should cause some noticeable delay. But there is no delay; the function runs instantly.
So does .sum() calculate the total during the first cycle only and then reuse the value?
Thanks.
import pandas as pd

staff = pd.read_csv('staff.csv', names=['name', 'salary'])

def add_proportion(group):
    # each worker's salary divided by the total of all salaries
    salary = group['salary']
    group['proportion'] = salary / salary.sum()
    return group
pandas uses NumPy under the hood. In NumPy, the behavior of applying operations between differently sized arrays is called broadcasting.
It depends on how you are calling your add_proportion function, but the call to sum should only happen once for the whole dataframe (or once per group if you are doing, for example, groupby(...).apply(add_proportion)).
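For the whole-dataframe case, the vectorized equivalent never loops in Python at all; a minimal sketch, reusing the staff dataframe from the question:

import pandas as pd

staff = pd.read_csv('staff.csv', names=['name', 'salary'])

# salary.sum() is evaluated once and returns a single scalar;
# the division is then broadcast across all 33,000 rows in one vectorized pass
total = staff['salary'].sum()
staff['proportion'] = staff['salary'] / total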
Each sum runs in its own thread, which means that all the sums are computed at the same time; they are parallelized.
The limit is your RAM, which determines the number of parallel processes you are allowed to have.
For more information, I would recommend https://medium.com/#bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b
Related
Background: I have a list of several hundred departments to which I would like to allocate budget as follows:
Each DEPT has a total budget AMT_TOTAL to spend within a given number of months. It also has a monthly limit LIMIT_MONTH that it cannot exceed.
As each DEPT plans to spend its budget as fast as possible, we assume it will spend up to its monthly limit until AMT_TOTAL runs out. The amount we forecast it will spend each month, given this assumption, is in AMT_ALLOC_MONTH.
My objective is to calculate the AMT_ALLOC_MONTH column, given the LIMIT_MONTH and AMT_TOTAL columns. Based on what I've read and searched, I believe a combination of fillna and cumsum() can do the job. So far, the dataframe I've managed to generate is as follows:
I planned to fill the NaN using the following line:
table['AMT_ALLOC_MONTH'] = min((table['AMT_TOTAL'] - table.groupby('DEPT')['AMT_ALLOC_MONTH'].cumsum()).ffill(), table['LIMIT_MONTH'])
My objective is to take AMT_TOTAL minus the cumulative sum of AMT_ALLOC_MONTH (excluding the NaN values), grouped by DEPT; the result is then compared with the value in the LIMIT_MONTH column, and the smaller value is filled into the NaN cell. The process is repeated until all NaN cells of each DEPT are filled.
Needless to say, the result did not come out as I expected; the line only works for the first NaN after a cell with a value, and subsequent NaN cells just copy the value above them. If there is a way to fix the issue, or a more intuitive way to do this, please help. Truly appreciated!
Try this:
for department in table['DEPT'].unique():
    subset = table[table['DEPT'] == department]
    for index, row in subset.iterrows():
        # re-read the department's rows so allocations written earlier in this loop are visible
        subset = table[table['DEPT'] == department]
        # budget already allocated to this department in previous months
        cumsum = subset.loc[:index - 1, 'AMT_ALLOC_MONTH'].sum()
        limit = row['LIMIT_MONTH']
        remaining = row['AMT_TOTAL'] - cumsum
        # spend up to the monthly limit, but never more than what is left of the total budget
        table.at[index, 'AMT_ALLOC_MONTH'] = min(remaining, limit)
It's not very elegant, I guess, but it seems to work.
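If a vectorized alternative is of interest, here is a sketch of the same idea without the nested loops, assuming every month is to be filled by the spend-up-to-the-limit rule (column names taken from the question):

import numpy as np

# cumulative spending if every month used its full limit, capped at the department's total budget
cum_limit = table.groupby('DEPT')['LIMIT_MONTH'].cumsum()
cum_spent = np.minimum(cum_limit, table['AMT_TOTAL'])

# the monthly allocation is the month-over-month increase of that capped cumulative spend
table['AMT_ALLOC_MONTH'] = cum_spent.groupby(table['DEPT']).diff().fillna(cum_spent)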
I have a small toy dataset of 23 hours of irregular time series data (financial tick data) with millisecond granularity, roughly 1M rows. By irregular I mean that the timestamps are not evenly spaced. I also have a column 'mid' with some values.
I am trying to group by, e.g., 2-minute buckets, calculate the absolute difference of 'mid' within each bucket, and then take the median, in the following manner:
df.groupby(["RIC", pd.Grouper(freq='2min')]).mid.apply(
lambda x: np.abs(x[-1] - x[0]) if len(x) != 0 else 0).median()
Note: 'RIC' is just another layer of grouping I am applying before the time bucket grouping.
Basically, I am telling pandas to group into [ith minute : ith + 2 minutes] intervals, and in each interval take the last (x[-1]) and the first (x[0]) 'mid' element and compute their absolute difference. I am doing this over a range of 'freqs' as well, e.g. 2min, 4min, ..., up to 30min intervals.
This approach works completely fine, but it is awfully slow because of the use of pandas' .apply function. I am aware that .apply doesn't take advantage of the built-in vectorization of pandas and numpy, as it is computationally no different from a for loop, and I am trying to figure out how to achieve the same result without apply so I can speed it up by several orders of magnitude.
Does anyone know how to rewrite the above code to ditch .apply? Any tips will be appreciated!
From the pandas groupby.apply documentation page:
"While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform. Pandas offers a wide range of methods that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply."
Therefore, using transform should be a lot faster.
grouped = df.groupby(["RIC", pd.Grouper(freq='2min')])
abs(grouped.mid.transform("last") - grouped.mid.transform("first")).median()
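Note that transform broadcasts each bucket's value back to every row, so the final median above is taken across rows rather than across buckets. If one value per bucket is wanted, matching the original apply, a plain aggregation keeps that shape; a sketch using the same grouping:

g = df.groupby(["RIC", pd.Grouper(freq='2min')])["mid"]

# one absolute first-to-last difference per (RIC, 2-minute) bucket, then the median across buckets;
# empty buckets, which the original lambda mapped to 0, show up as NaN here and may need fillna(0)
result = (g.last() - g.first()).abs().median()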
I'm relatively new to PySpark and I'm currently trying to implement the SVD algorithm for predicting user-item ratings. The input is a matrix with the columns user_id, item_id and rating. In the first step I initialize the biases (bu, bi) and the factor matrices (pu, qi) for each user and each item. So I start the algorithm with the following dataframe:
Initial dataframe
In the current case the number of partitions is 7, and counting all the rows takes 0.7 seconds. The number of rows is 2.5 million.
Partitions and count time
In the next step I add a column, error, to my dataframe. I use a UDF which calculates the error for each row based on all the other columns (I don't think the equation is relevant). Afterwards, the count takes about the same amount of time.
Now comes the tricky part. I have to create 2 new dataframes. In the first I group together all the users (named userGroup) and in the second I group together all the items (named itemGroup). I have another UDF that updates the biases (update_b) and one that updates the factor matrices (update_factor_F). The userGroup dataframe has 1.5 million rows and the itemGroup has 72,000 rows.
Updated biases and factors for each user
I then take the initial dataframe and join it first by user: I take the user_id, item_id and rating from the initial dataframe, and the biases and factors from the userGroup dataframe. I repeat the same process with itemGroup.
train = train.join(userGroup, train.u_id == userGroup.u_id_ug, 'outer') \
             .select(train.u_id, train.i_id, train.rating, userGroup.bu, userGroup.pu)
I end up with a dataframe of the same size as the initial one. However, if I do a .count() it now takes around 8 seconds. I have to repeat the above steps iteratively, and each iteration slows the .count() action down even further.
I know the issue lies in the join of the dataframes and have searched for solutions. So far I have tried different combinations of partitioning (I used .repartition(7, "u_id") on the userGroup dataframe) to try to match the number of partitions. I also tried repartitioning the final dataframe, but the .count() remains high.
My goal is not to lose performance after each iteration.
As some of your dataframes can be used multiple times, you will want to cache them so that they are not re-evaluated every time you need them. To do this you can rely on cache() or persist() operations.
Also, the logical plan of your dataframe will grow as your iterative algorithm moves forward, which increases computation exponentially with each iteration. To cope with this, you will need to rely on the checkpoint() operation to regularly break the lineage of your dataframes.
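A minimal sketch of how that could look inside the training loop, reusing train and userGroup from the question; the SparkSession name spark, the checkpoint directory and the n_iterations counter are assumptions for illustration:

# assumption: spark, train and userGroup exist as in the question; n_iterations is hypothetical
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # must be set before checkpoint()

for i in range(n_iterations):
    train = train.join(userGroup, train.u_id == userGroup.u_id_ug, 'outer') \
                 .select(train.u_id, train.i_id, train.rating, userGroup.bu, userGroup.pu)

    # keep the dataframe that the next iteration reuses in memory instead of recomputing it
    train = train.cache()

    # every few iterations, materialize and truncate the logical plan so it stops growing
    if i % 5 == 0:
        train = train.checkpoint()

    train.count()  # action that forces the cache/checkpoint to actually happen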
I am not very advanced in Python and pandas yet.
I have a function which calculates stock returns under one strategy and outputs them as a DataFrame. E.g., if I want to calculate returns from 2017 to 2019, the function outputs returns from 2017 to 2019, and if I want returns from 2010 to 2019, it outputs returns from 2010 to 2019. This function handles one stock at a time.
Now I have multiple stocks, so I use this function in a for loop which iterates through the stocks to get their returns. I want to put the returns of all the stocks into one DataFrame. So I am thinking of pre-defining a zeros DataFrame before the loop and then putting the returns into that DataFrame one loop iteration at a time.
The problem is that it is not easy to know in advance how many rows the returns DataFrame will have, so I cannot define the number of rows of the zeros DataFrame that will hold all the returns (I only know the number of columns, since the number of stocks is easy to know). So I wonder: is there a way to put each return series as a whole into the zeros DataFrame, i.e. fill it column by column?
I hope I stated my question clearly enough.
Thanks very much!!
Please don't advise me not to use loops at this stage. I now restate my question as follows:
In the code below:
for k in ticker:
    stock_price = dataprepare(start_date, end_date, k)
    mask_end, mask_beg = freq(stock_price, trading_period)
    signal_trade = singals(fast_window, slow_window, stock_price, mask_end)
    a = basiccomponent(fast_window, slow_window, stock_price, mask_beg, mask_end, year, signal_trade, v)[2]
dataprepare, freq, singals and basiccomponent are self-defined functions.
a is a DataFrame of returns. I want to save all the 'a's from each loop iteration in one DataFrame, something like append, but appending columns after each iteration, such as:
a.append(a)
Instead of appending rows, I want to append columns. How can I do it?
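One way to do this without knowing the number of rows in advance is to collect each stock's returns and concatenate them column-wise at the end, instead of pre-allocating a zeros DataFrame. A sketch reusing the names from the question (the squeeze() call is an assumption that each a holds a single column of returns):

import pandas as pd

results = {}  # one entry per stock, keyed by ticker symbol
for k in ticker:
    stock_price = dataprepare(start_date, end_date, k)
    mask_end, mask_beg = freq(stock_price, trading_period)
    signal_trade = singals(fast_window, slow_window, stock_price, mask_end)
    a = basiccomponent(fast_window, slow_window, stock_price, mask_beg, mask_end, year, signal_trade, v)[2]
    results[k] = a.squeeze()  # assumes 'a' is a single column of returns

# align on the index and give each stock its own column; rows missing for a stock become NaN
all_returns = pd.concat(results, axis=1)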
I have the GPS data set (in CSV format) of hundreds of people and I have to study their mobility. I have managed to compute the distance between each pair of consecutive points and then compute the speed by simply dividing by the time increment between those two points. I did all these calculations using pandas, grouping by nickname (this is important because each person has a different trajectory and you cannot mix distances and speeds).
The next step is to compute the average of every three or four velocities to smooth out some GPS errors. I have tried this and it works fine, but I cannot find a way to also group by nickname, since the speeds of different users get mixed. Any ideas?
This can be done simply by using the index as a way to bin the rows:
df['bins'] = df.index // n
and then doing a groupby on 'bins'. To put it in a cleaner function, here is the code:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4, 4], 'B': [1, 2, 3, 4, 5, 6, 7]})

def n_average(df, n):
    # every n consecutive rows get the same bin label, then the mean is taken per bin
    df['bin'] = df.index // n
    grouped_df = df.groupby(['bin']).mean()
    return grouped_df

n_average(df, 3)
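To keep each user's speeds separate, the same binning idea can be combined with a per-nickname counter. A sketch, assuming the dataframe has 'nickname' and 'speed' columns (the column names come from the question's description, not from shown code):

# number the rows within each user's trajectory, then put every n consecutive rows in one bin
within_user_bin = df.groupby('nickname').cumcount() // n
smoothed = df.groupby(['nickname', within_user_bin])['speed'].mean()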