Multiply pd DataFrame column with 7-digit scalar


I am trying to modify a pandas dataframe column this way:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE["Var"]["Jan"] = 2678400*SLICE["Var"]["Jan"]
However, this does not work. The resulting column SLICE["Var"]["Jan"] is still the same as before the multiplication.
If I multiply by a factor two orders of magnitude smaller, the multiplication works. A subsequent multiplication by 100, which yields the value I intended in the first place, also works.
SLICE["Var"]["Jan"] = 26784*SLICE["Var"]["Jan"]
SLICE["Var"]["Jan"] = 100*SLICE["Var"]["Jan"]
It seems like the scalar is too large for the multiplication. Is this a Python thing or a pandas thing? How can I make sure that the multiplication by the 7-digit number works directly?
I am using Python 3.8. The numbers in the dataframe are float32 and lie in the range between -5.0E-5 and 5.0E-5, with some having an absolute value smaller than 1E-11.
EDIT: It might have to do with the 2-level column indexing. When I delete the first level, the calculation works:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE=SLICE.droplevel(0, axis=1)
SLICE["Jan"] = 2678400*SLICE["Jan"]

Your first method might raise a SettingWithCopyWarning, which basically means the assignment may be applied to a temporary copy rather than to the actual dataframe. You can use .loc instead:
SLICE.loc[:,('Var', 'Jan')] = SLICE.loc[:,('Var', 'Jan')]*2678400
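As a minimal sketch of the .loc approach, the snippet below builds a small stand-in for SLICE. The data, the "Feb" column and the variable names before/after are made up for illustration; only the two-level column layout and the float32 dtype follow the question. The whole column is addressed with one (level 0, level 1) tuple, so the assignment goes to the frame itself instead of to an intermediate object.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for SLICE: two-level columns, float32 values.
columns = pd.MultiIndex.from_product([["Var"], ["Jan", "Feb"]])
SLICE = pd.DataFrame(
    np.array([[1e-5, 2e-5], [-3e-5, 4e-5]], dtype=np.float32),
    columns=columns,
)

# Keep a copy of the original values for comparison.
before = SLICE[("Var", "Jan")].to_numpy().copy()

# Single .loc call with a tuple key: no chained indexing involved.
SLICE.loc[:, ("Var", "Jan")] = SLICE.loc[:, ("Var", "Jan")] * 2678400

after = SLICE[("Var", "Jan")].to_numpy()
```

Whether the chained form SLICE["Var"]["Jan"] = ... modifies the frame depends on the pandas version and on whether SLICE["Var"] returns a view or a copy, which is why the single .loc call is the safer choice.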

Related

Pandas dataframe sum of row won't let me use result in equation

Anybody wish to help me understand why below code doesn't work?
start_date = '1990-01-01'
ticker_list = ['SPY', 'QQQ', 'IWM','GLD']
tickers = yf.download(ticker_list, start=start_date)['Close'].dropna()
ticker_vol_share = (tickers.pct_change().rolling(20).std()) \
/ ((tickers.pct_change().rolling(20).std()).sum(axis=1))
Both (tickers.pct_change().rolling(20).std()) and ((tickers.pct_change().rolling(20).std()).sum(axis=1)) run fine by themselves, but when combined they produce a dataframe with thousands of columns, all filled with nan.

Try this.
rolling_std = tickers.pct_change().rolling(20).std()
ticker_vol_share = rolling_std.apply(lambda row:row/sum(row),axis = 1)

Why it's not working as expected:
Your tickers object is a DataFrame, and so are tickers.pct_change() and tickers.pct_change().rolling(20).std() (the intermediate .rolling(20) is a rolling window object). The tickers.pct_change().rolling(20).std().sum(axis=1) is a Series, because sum() collapses an axis.
You're therefore dividing a DataFrame by a Series. With the / operator, pandas aligns the index of the Series with the columns of the DataFrame, not with its rows, and returns a DataFrame.
Here the Series is indexed by date while the DataFrame's columns are ticker names, so no labels match: the result has one column per ticker plus one per date, and every value is nan. Without seeing your source data, it's hard to rule out other causes. nan can also appear if some of the things you're dividing by are 0, or in the first rows, where the rolling standard deviation does not yet have a full window.
It's not clear to me what your expected output is, so I'll defer to others or wait for an update before answering that part.
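If the goal is each ticker's share of the row total, which is an assumption about the intended output, one option is DataFrame.div with axis=0, which matches the Series against the row index instead of the columns. The sketch below uses a small made-up frame in place of the real rolling standard deviations; the numbers and dates are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for tickers.pct_change().rolling(20).std()
idx = pd.date_range("2020-01-01", periods=3)
rolling_std = pd.DataFrame(
    {"SPY": [1.0, 2.0, 3.0], "QQQ": [3.0, 2.0, 1.0]},
    index=idx,
)

# Row totals: a Series indexed by date.
row_total = rolling_std.sum(axis=1)

# axis=0 aligns the Series with the row index, so each row is
# divided by its own total.
ticker_vol_share = rolling_std.div(row_total, axis=0)
```

Each row of ticker_vol_share then sums to 1, and the result keeps the original ticker columns.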

Understanding np.ix_

Code:
import numpy as np
ray = [1,22,33,42,51], [61,71,812,92,103], [113,121,132,143,151], [16,172,183,19,201]
ray = np.asarray(ray)
type(ray)
ray[np.ix_([-2:],[3:4])]
I'd like to use index slicing to get a subarray consisting of the last two rows and the 3rd/4th columns. My current code produces an error.
I'd also like to sum each column. What am I doing wrong?

So you want to make a slice of an array. The most straightforward way to do it is... slicing:
slice = ray[-2:,3:]
or if you want it explicitly
slice = ray[-2:,3:5]
See it explained in Understanding slicing
But if you do want to use np.ix_ for some reason, you need
slice = ray[np.ix_([-2,-1],[3,4])]
You can't use : here, because the [] in this position don't make a slice. They construct lists, so you have to specify explicitly every row number and every column number you want in the result. If there are many consecutive indices, you may use range:
slice = ray[np.ix_(range(-2, 0),range(3, 5))]
And to sum each column:
slice.sum(0)
0 means you want to reduce the 0th dimension (rows) by summation and keep other dimensions (columns in this case).
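For reference, here is a short self-contained sketch that puts the pieces above together on the array from the question. The variable names by_slice, by_ix and col_sums are chosen for illustration and avoid shadowing the built-in slice.

```python
import numpy as np

ray = np.asarray([
    [1, 22, 33, 42, 51],
    [61, 71, 812, 92, 103],
    [113, 121, 132, 143, 151],
    [16, 172, 183, 19, 201],
])

# Plain slicing: last two rows, columns with index 3 and 4.
by_slice = ray[-2:, 3:5]

# Same selection with np.ix_, which needs explicit index lists.
by_ix = ray[np.ix_([-2, -1], [3, 4])]

# Sum over axis 0 (rows), leaving one total per column.
col_sums = by_slice.sum(axis=0)
print(by_slice)   # [[143 151]
                  #  [ 19 201]]
print(col_sums)   # [162 352]
```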

Speed up iteration over DataFrame items

I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. Accordingly, the calculation will require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?

You need to vectorize your division:
result = df_A.values / vector_x.values
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer, I noticed that converting to a NumPy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question. I have edited my answer accordingly.
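A minimal sketch of how the whole function could look when vectorized. It assumes, as the original loop suggests, that xout holds exactly one divisor per column of df_t (so xout.T has a single row); the function name calculate_dfA_vectorized and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def calculate_dfA_vectorized(df_t, xout):
    # Flatten the divisors to a 1-D array with one entry per column.
    divisors = xout.to_numpy().ravel()
    # Broadcasting divides column j of df_t by divisors[j].
    values = df_t.to_numpy() / divisors
    # Wrap the result back into a DataFrame with the original labels.
    return pd.DataFrame(values, index=df_t.index, columns=df_t.columns)

# Toy input: 2 x 2 frame and a column vector of 2 divisors.
df_t = pd.DataFrame([[2.0, 9.0], [4.0, 3.0]])
xout = pd.DataFrame([[2.0], [3.0]])
df_A = calculate_dfA_vectorized(df_t, xout)
print(df_A.to_numpy())   # [[1. 3.]
                         #  [2. 1.]]
```

The division runs as a single NumPy operation over the whole array, so no Python-level loop over cells is needed.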

I'm on mobile now, but you should try to avoid for loops over DataFrame cells in Python; there is usually a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate (but with only one for loop, which is already a performance boost).
I would strongly recommend that you temporarily convert to a NumPy ndarray and work with that.

Get data from Pandas multiIndex

I am using pandas and uproot to read some branches of a tree from a .root file into a table:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['nCleanedJets', 'JetPt', 'JetMass', 'JetPhi']
df = ttree.pandas.df(branches, entrystop=40306)
Essentially, I have to retrieve the "JetPhi" data for each entry that has at least 2 subentries (or equivalently, entries for which "nCleanedJets" is greater than or equal to 2), calculate the difference in "JetPhi" between the first two subentries, and then make a histogram of these differences.
I have tried to look up in the internet and tried different possibilities but I have not found any useful solution.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.
I used to code in C++, so I am new to Python and have not mastered the language yet.

You can do this in Pandas with
df[df["nCleanedJets"] >= 2]
because you have a column with the number of entries. The df["nCleanedJets"] >= 2 expression returns a Series of booleans (True if a row passes, False if a row doesn't pass) and passing a Series or NumPy array as a slice in square brackets masks by that array (returning rows for which the boolean array is True).
You could also do this in Awkward Array before converting to Pandas, which would be easier if you didn't have a "nCleanedJets" column.
array = ttree.arrays(branches, entrystop=40306)
selected = array[array.counts >= 2]
awkward.topandas(selected, flatten=True)
Masking in Awkward Array follows the same principle, but with data structures instead of flat Series or NumPy arrays (each element of array is a list of records with "nCleanedJets", "JetPt", "JetPhi", "JetMass" fields, and counts is the length of each list).
awkward.topandas with flatten=True is equivalent to what uproot does when outputtype=pandas.DataFrame and flatten=True (defaults for ttree.pandas.df).
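The remaining step from the question, the difference of "JetPhi" between the first two subentries, can be sketched in pandas as follows. This assumes the flattened frame has a two-level index named "entry" and "subentry"; the toy values below are made up, and the difference is the plain subtraction, with no wrapping of the angle into a fixed range.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the flattened DataFrame (index names are an assumption).
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 1)],
    names=["entry", "subentry"],
)
df = pd.DataFrame(
    {
        "nCleanedJets": [3, 3, 3, 1, 2, 2],
        "JetPhi": [0.5, -1.0, 2.0, 0.3, 3.0, -3.0],
    },
    index=index,
)

# Keep only entries with at least two jets.
selected = df[df["nCleanedJets"] >= 2]

# First and second subentry of each entry, both indexed by "entry".
phi = selected["JetPhi"]
first = phi.xs(0, level="subentry")
second = phi.xs(1, level="subentry")

# Index-aligned difference: one value per selected entry.
delta_phi = first - second

# Histogram counts; the bin count and range are arbitrary choices.
counts, edges = np.histogram(delta_phi, bins=4, range=(-2 * np.pi, 2 * np.pi))
```

The counts and edges can then be passed to any plotting library to draw the histogram.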

How can I work with .iloc[] in Python to do some calculation?

I have to implement some functions to calculate special values. I read a csv file for it with pd.read_csv(). Then I used .iloc[] to find the respective row and column I need for my calculation:
V_left = data_one.iloc[0,0:4]
V_right= data_one.iloc[0,5:9]
My formula, which I want to implement is: V_left/V_right
V is a vector of 5 parameters (values).
My question is now: How can I use the values, which I pick out with .iloc[], to do a calculation like my formula?

You can use:
V_left.values and V_right.values to convert those Series to NumPy arrays, so that you can do element-wise arithmetic on them.
However, I wouldn't use iloc in the first place; you can convert the whole DataFrame directly:
V_left = data_one.values[0,:4]
V_right = data_one.values[0, 5:9]

Adding V_left.values / V_right.values should be enough.
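To illustrate why the conversion matters, here is a small sketch with a made-up one-row frame. The column names and values are hypothetical, and the slices 0:5 and 5:10 are chosen so that each vector really has five values, as described in the question. Dividing the two Series directly aligns them by column label, and since the labels differ, every result is NaN; dividing the underlying arrays works position by position.

```python
import pandas as pd

# Hypothetical data: one row, ten columns.
data_one = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0]],
    columns=list("abcdefghij"),
)

V_left = data_one.iloc[0, 0:5]    # Series labelled a..e
V_right = data_one.iloc[0, 5:10]  # Series labelled f..j

# Label-aligned division: no labels in common, so all NaN.
aligned = V_left / V_right

# Positional division on the underlying NumPy arrays.
ratio = V_left.to_numpy() / V_right.to_numpy()
print(ratio)   # [0.5 0.5 0.5 0.5 0.5]
```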
