Suppose I have a pandas DataFrame whose columns are a, b, c and whose index is dates.
df = pd.DataFrame(columns = ["a","b","c"],index=range(5),data=np.random.rand(5,3))
I also have a string, formula = (2*a+b)/c, where a, b, c refer to columns of the DataFrame. What is the most efficient way to evaluate it?
The solution should give the same answer as this
(2*df["a"]+df["b"])/df["c"]
A bonus question: what if the formula contains a lagged value, formula = (2*a[-1]+b)/c, where a[-1] uses the data from the previous row of column a? Thanks.
Use DataFrame.eval to evaluate a string describing operations on DataFrame columns.
formula = "(2*a+b)/c"
df.eval(formula)
0 6.432992
1 1.175234
2 3.274955
3 2.050857
4 7.605282
dtype: float64
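For the bonus question, a[-1] is not something eval reads as "previous row". One possible sketch, assuming a helper column is acceptable (a_lag is a hypothetical name, not part of the original data), is to build the lag with shift first and then refer to it in the formula:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["a", "b", "c"], index=range(5), data=np.random.rand(5, 3))

# a_lag is a hypothetical helper column holding the previous row's value of a
lagged = df.assign(a_lag=df["a"].shift(1)).eval("(2*a_lag+b)/c")

# The same calculation written out directly, for comparison
expected = (2 * df["a"].shift(1) + df["b"]) / df["c"]
```

The first row comes out as NaN because it has no previous row to take a from.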
Related
I am having a problem replacing a specific column, selected by its index, in a dataframe with a new dataframe that has only 1 column. Both have the same length.
I need to replace the column knowing only its index, because I am choosing a random column to replace in the dataframe df, which contains 8 columns, with the new dataframe df_temp, which has only 1 column.
N=random.randint(1,8)
df.iloc(: , [N - 1]) = df_temp.values
This gives me a syntax error. I don't know if I am using .iloc wrong or whether there is an alternative way to do this.
I don't fully understand the question, but can you try this:
df.iloc[:, [N-1]] = df_temp.values
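A minimal sketch of the corrected line in context. The data here is made up (the column names and values are assumptions); only the shapes, 8 columns and 1 column of equal length, follow the question:

```python
import random
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df (8 columns) and df_temp (1 column), same length
df = pd.DataFrame(np.zeros((4, 8)), columns=list("abcdefgh"))
df_temp = pd.DataFrame({"new": [1.0, 2.0, 3.0, 4.0]})

N = random.randint(1, 8)  # 1-based position of the column to replace

# Square brackets, not parentheses: iloc is indexed, not called.
# df_temp.values has shape (4, 1), which matches the one-column selection.
df.iloc[:, [N - 1]] = df_temp.values
```

The syntax error in the question comes from writing df.iloc(...) with parentheses, since a bare : slice is only valid inside square brackets.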
I have two data frames like this: The first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows(dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell(row) of that column to the corresponding row in dataframe A .
(Example: For the first column of dataframe B I compare the first row to the first row of dataframe A, then the second row of B to the second row of A etc.)
Basically I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is smaller than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on which is easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically, if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row, I was thinking of creating a for loop using this) but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.ge(dfA.values).any()]
Explanation: dfA.values returns a numpy array with shape (720,1). Then dfB.ge(dfA.values) checks each column of dfB against that single column from dfA; this returns a boolean dataframe with the same shape as dfB. Finally, .any() checks each column of that boolean dataframe for any True.
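To see the broadcasting on a small case, here is a sketch with made-up 3-row frames standing in for the 720-row ones (all values and column names are assumptions). Note that ge tests B >= A; if the condition is meant to be B <= A, as the question's wording suggests, le would be the counterpart.

```python
import pandas as pd

# Small hypothetical stand-ins for the 720-row frames
dfA = pd.DataFrame({"ref": [5, 5, 5]})
dfB = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})

mask = dfB.ge(dfA.values)      # boolean frame, True where B >= A, row by row
counts = mask.sum()            # number of matching rows per column
kept = dfB.loc[:, mask.any()]  # keep only columns with at least one match

print(counts.tolist())     # [0, 2, 3]
print(list(kept.columns))  # ['y', 'z']
```

The counts Series gives the per-column tally the question asks for, and column x is dropped because its count is 0.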
how about this:
pd.DataFrame(np.where(A.to_numpy() <= B.to_numpy(), 1, np.nan), columns=B.columns, index=A.index).dropna(how='all', axis=1)
You can replace the np.nan in the np.where call with whatever values you wish, including keeping the original values of dataframe 'B'.
I have a DataFrame in Pandas for example:
df = pd.DataFrame({"a":[0,0,1,1,0], "penalty":["12", "15","13","100", "22"]})
How can I sum the values in column "penalty", but only those in rows where column "a" is 0?
You can filter your dataframe with this:
import pandas as pd
data ={'a':[0,0,1,1,0],'penalty':[12, 15,13,100, 22]}
df = pd.DataFrame(data)
print(df.loc[df['a'].eq(0), 'penalty'].sum())
This way you are selecting the column penalty from your dataframe where the column a is equal to 0. Afterwards, you are performing the .sum() operation, hence returning your expected output (49). The only change I made was to remove the quotation marks so that the values in the column penalty are interpreted as integers and not strings. If the inputs have to be strings, you can convert them with df['penalty'] = df['penalty'].astype(int)
Filter the rows which have 0 in column a and calculate the sum of the penalty column.
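For the case where penalty really does arrive as strings, as in the question, a short sketch of the conversion followed by the same filter:

```python
import pandas as pd

# Data as in the question, with penalty stored as strings
df = pd.DataFrame({"a": [0, 0, 1, 1, 0], "penalty": ["12", "15", "13", "100", "22"]})

df["penalty"] = df["penalty"].astype(int)  # convert the strings to integers
total = df.loc[df["a"].eq(0), "penalty"].sum()
print(total)  # 49
```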
import pandas as pd
data ={'a':[0,0,1,1,0],'penalty':[12, 15,13,100, 22]}
df = pd.DataFrame(data)
df[df.a == 0].penalty.sum()
I have another basic question. So I have a dataframe like so:
cols = a,b,c,d,e which contains integers.
I want column e's value to equal 1 if columns b and c or columns a, b and c = 1.
Although d's column does not matter in this computation, it matters somewhere else so I cannot drop it.
How would I do that on pandas?
Use .loc:
df.loc[(df['a'] == 1) & (df['b'] == 1) & (df['c'] == 1), 'e'] = 1
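A note on building the condition: Python expands a chained comparison such as x == y == z into (x == y) and (y == z), and using "and" on Series raises a ValueError about an ambiguous truth value. Each comparison therefore needs its own parentheses, joined with & or |. The sketch below uses made-up data (the values are assumptions) and follows the question's wording, "b and c, or a, b and c":

```python
import pandas as pd

# Hypothetical data with the columns from the question
df = pd.DataFrame({
    "a": [1, 0, 1, 0],
    "b": [1, 1, 0, 1],
    "c": [1, 1, 1, 0],
    "d": [9, 9, 9, 9],
    "e": [0, 0, 0, 0],
})

# Each comparison is parenthesised and combined with & (element-wise AND)
b_and_c = (df["b"] == 1) & (df["c"] == 1)
a_b_c = (df["a"] == 1) & b_and_c

# "b and c" already covers "a, b and c", so the combined mask equals b_and_c
df.loc[b_and_c | a_b_c, "e"] = 1

print(df["e"].tolist())  # [1, 1, 0, 0]
```

Column d is never referenced, so it stays in the frame untouched.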
I've figured out how to apply a function to an entire column or subsection of a pandas dataframe in lieu of writing a loop that modifies each cell one by one.
Is it possible to write a function that takes cells within the dataframe as inputs when doing the above?
Eg. A function that in the current cell returns the product of the previous cell's value multiplied by the cell before that previous cell. I'm doing this line by line now in a loop and it is unsurprisingly very inefficient. I'm quite new to python.
For the case you mention (multiplying the two previous cells), you can do the following (which loops through each column, but not each cell):
import pandas as pd
a = pd.DataFrame({0:[1,2,3,4,5],1:[2,3,4,5,6],2:0,3:0})
# Loop over column positions (not rows), starting at the third column
for i in range(2, len(a.columns)):
    a[i] = a[i-1]*a[i-2]
This will make each column in a, from the third onward, the product of the previous two columns.
If you want to perform this operation going down rows instead of columns, you can just transpose the dataframe (and then transpose it again after performing the loop to get it back in the original format)
EDIT
What's actually wanted is the product of the elements in the previous rows of two columns and the current rows of two columns. This can be accomplished using shift:
import pandas as pd
df= pd.DataFrame({"A": [1,2,3,4], "B": [1,2,3,4], "C": [2,3,4,5], "D": [5,5,5,5]})
df['E'] = df['A'].shift(1)*df['B'].shift(1)*df['C']*df['D']
df['E']
Produces:
0 NaN
1 15.0
2 80.0
3 225.0
This does the trick, and shift can go both forward and backward depending on your need:
df['Column'] = df['Column'].shift(1) * df['Column'].shift(2)
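A small sketch of what shift does in each direction, on a made-up Series (the values are assumptions). Note that shift works on the original values, so the product below is not computed recursively from earlier results:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

prev1 = s.shift(1)   # value from one row above; the first row becomes NaN
prev2 = s.shift(2)   # value from two rows above; the first two rows become NaN
nxt = s.shift(-1)    # value from one row below; the last row becomes NaN

product = prev1 * prev2
print(product.tolist())  # [nan, nan, 2.0, 6.0, 12.0]
```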