Apply function in a pandas dataframe - python

I've figured out how to apply a function to an entire column or a subsection of a pandas dataframe, in lieu of writing a loop that modifies each cell one by one.
Is it possible to write a function that takes cells within the dataframe as inputs when doing the above?
E.g. a function where the current cell's value is the product of the previous cell's value and the value of the cell before that. I'm doing this line by line in a loop now, and it is unsurprisingly very inefficient. I'm quite new to Python.

For the case you mention (multiplying the two previous cells), you can do the following (which loops through each column, but not each cell):
import pandas as pd

a = pd.DataFrame({0: [1, 2, 3, 4, 5], 1: [2, 3, 4, 5, 6], 2: 0, 3: 0})
for i in range(2, a.shape[1]):  # loop over the columns, starting from the third
    a[i] = a[i-1] * a[i-2]
This sets each column of a, from the third onward, to the product of the previous two columns.
If you want to perform this operation going down rows instead of columns, you can just transpose the dataframe, run the loop, and then transpose it again to get it back in the original format.
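For illustration, here is a minimal sketch of that transpose round-trip, assuming a small numeric frame where each row from the third onward should become the product of the two rows above it:

import pandas as pd

a = pd.DataFrame({"x": [1, 2, 0, 0, 0], "y": [2, 3, 0, 0, 0]})

t = a.T                         # rows become columns
for i in range(2, t.shape[1]):  # same column-wise loop as above
    t[i] = t[i-1] * t[i-2]
a = t.T                         # transpose back to the original orientation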
EDIT
What's actually wanted is the product of the previous row's values in two columns and the current row's values in two other columns. This can be accomplished using shift:
import pandas as pd
df = pd.DataFrame({"A": [1,2,3,4], "B": [1,2,3,4], "C": [2,3,4,5], "D": [5,5,5,5]})
df['E'] = df['A'].shift(1)*df['B'].shift(1)*df['C']*df['D']
df['E']
Produces:
0 NaN
1 15.0
2 80.0
3 225.0

This does the trick, and shift can go both forward and backward depending on your need:
df['Column'] = df['Column'].shift(1) * df['Column'].shift(2)
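For example, on a made-up column, shift(1) and shift(2) look at earlier rows while shift(-1) looks at the next row:

import pandas as pd

df = pd.DataFrame({'Column': [1, 2, 3, 4, 5]})

# backward-looking: product of the two previous rows (NaN where there is no history)
df['prev_product'] = df['Column'].shift(1) * df['Column'].shift(2)
# forward-looking: the value from the next row
df['next_value'] = df['Column'].shift(-1)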

Related

Pandas dataframes: Create new column with a formula that uses all values of column X until each row (similar to cumsum)

A lot of times (e.g. for time series) I need to use all the values in a column until the current row.
For instance, if my dataframe has 100 rows, I want to create a new column where the value in each row is a (sum, average, product, [any other formula]) of all the previous rows, excluding the following ones:
Row 20 = formula(all_values_until_row_20)
Row 21 = formula(all_values_until_row_21)
etc
I think the easiest way to ask this question would be: How to implement the cumsum() function for a new column in pandas without using that specific method?
One approach, if you cannot use cumsum, is to introduce a new column or index and then apply a lambda function that uses all rows whose value in the new column is less than or equal to the current row's.
import pandas as pd
df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')
df['Id'] = range(0, len(df.index))
df['Sum'] = df.apply(lambda x: df[df['Id']<=x['Id']]['y'].sum(), axis=1)
print(df)
Since there is no sample data, I'll go with an assumed dataframe that has at least one numeric column and no NaN values.
I would start like below for cumulative sums and averages.
cumulative sum:
df['cum_sum'] = df['existing_col'].cumsum()
cumulative average (assuming 'index_col' holds a 1-based running row count):
df['cum_avg'] = df['existing_col'].cumsum() / df['index_col']
or
df['cum_avg'] = df['existing_col'].expanding().mean()
If you can provide a sample DataFrame, you can get better help, I believe.
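For illustration, here is a minimal self-contained sketch, assuming a single numeric column named 'existing_col' with no NaN values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'existing_col': [10, 20, 30, 40]})

df['cum_sum'] = df['existing_col'].cumsum()
# running average two ways: cumulative sum over a 1-based count, or expanding().mean()
df['cum_avg'] = df['existing_col'].cumsum() / np.arange(1, len(df) + 1)
df['cum_avg_alt'] = df['existing_col'].expanding().mean()
print(df)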

Is there a way to have the first 6 rows of a dataset go to their own columns and then have the 7th row go back to the first column?

How can I assign rows to columns so that the 1st-6th columns hold rows 1-6 and the 7th row loops back to the first column?
Suppose we have this data:
import numpy as np
import pandas as pd

data = pd.Series(range(1, 8))
Let's assume that these are repeated readings of six sensors, but we are not sure that the measurement cycle is complete. In the case of data, the cycle stopped at the first sensor of the second pass.
Now we want to arrange them in a table with data from each sensor in its own column. I can see two ways to do this. We could use numpy.reshape or DataFrame.pivot.
Here's the first way:
# pad with enough NaN values to make the length a multiple of 6
pad = [np.nan] * (-len(data) % 6)
values = np.r_[data.values, pad]
df = pd.DataFrame(values.reshape(-1, 6), columns=[*'123456'])
We might use numpy.concatenate or pandas.concat here as an alternative to numpy.r_; we can't use numpy.resize or ndarray.resize because of how they fill the new cells.
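For instance, reusing data and pad from the snippet above, the numpy.concatenate version would be:

# numpy.concatenate as a drop-in replacement for numpy.r_
values = np.concatenate([data.values, pad])
df = pd.DataFrame(values.reshape(-1, 6), columns=[*'123456'])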
Here's the second way I can think of: we create row and column labels for the future table and then build a pivot from them:
df = data.to_frame()
df['my_rows'] = df.index // 6
df['my_columns'] = df.index % 6
df.pivot(index='my_rows', columns='my_columns')
Here we could additionally apply df.reset_index(drop=True) if the original index isn't the sequence 0, 1, 2, ..., or we could use pd.RangeIndex(len(df)) instead of df.index in the calculations above. Either way, I hope the main idea is clear enough.
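A sketch of that variant, using pd.RangeIndex so the row and column labels don't depend on the original index:

df = data.to_frame()
pos = pd.RangeIndex(len(df))
df['my_rows'] = pos // 6
df['my_columns'] = pos % 6
df.pivot(index='my_rows', columns='my_columns')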

Efficiently split pandas dataframe based on combinations of one column values

Let's say I have a dataframe with one column that has 3 unique values:
import pandas as pd
df = pd.DataFrame(['a', 'b', 'c'], columns = ['string'])
df
I want to split this dataframe into smaller dataframes, such that each dataframe contains 2 unique values. In the above case I need 3 dataframes (3C2 = 3): df1 - [a b], df2 - [a c], df3 - [b c]. Here is my current implementation:
import itertools

for i in itertools.combinations(df.string.values, 2):
    print(df[df.string.isin(i)], '\n')
I am looking for something like groupby in pandas, because subsetting the data inside the loop is time-consuming. In one sample case I had 609 unique values and it took around 3 minutes to complete the loop. So I'm looking for an optimized way to perform the same operation, as the number of unique values may shoot up to the thousands in real scenarios.
It will be slow because you're creating 370k dataframes. If all of them are supposed to hold only two values, why do they need to be dataframes?
df = pd.DataFrame({'x': range(100)})
df['key'] = 1
records = df.merge(df, on='key').drop('key', axis=1).to_dict('records')
[pd.Series(x) for x in records]
You will see that records is computed quite fast, but it then takes a few minutes to generate all of these Series objects.
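If the per-pair dataframes really are needed, one way to avoid re-scanning the full frame with isin on every iteration is to group once and reuse the per-value pieces. A sketch, assuming the frame may carry more columns than just 'string' (the 'value' column here is made up):

import itertools
import pandas as pd

df = pd.DataFrame({'string': ['a', 'b', 'c'], 'value': [1, 2, 3]})

# split the frame once per unique value, then build each pair by concatenation
groups = {key: grp for key, grp in df.groupby('string')}
pairs = [pd.concat([groups[x], groups[y]])
         for x, y in itertools.combinations(groups, 2)]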

Comparing two dataframes and storing results in another dataframe

I have two dataframes like this: the first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows (dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell (row) of that column to the corresponding row in dataframe A.
(Example: For the first column of dataframe B I compare the first row to the first row of dataframe A, then the second row of B to the second row of A etc.)
Basically I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is smaller than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on what's easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically, if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row, I was thinking of creating a for loop using this) but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
    DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.le(dfA.values).any()]
Explanation: dfA.values returns the numpy array with shape (720, 1). dfB.le(dfA.values) then checks each column of dfB against that single column of dfA (is the value in B less than or equal to the value in A?), returning a boolean dataframe of the same shape as dfB. Finally, .any() checks each column of that boolean dataframe for at least one True, and only those columns are kept.
how about this:
pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), 1, np.nan), columns=B.columns, index=B.index).dropna(axis=1, how='all')
You can replace the np.nan in the np.where call with whatever values you wish, including keeping the original values of dataframe B.
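For the counting part of the question (tallying how many cells in each column of B satisfy B <= A, and dropping zero-count columns), here is a small sketch with placeholder frames A and B:

import pandas as pd

A = pd.DataFrame({'a': [5, 5, 5]})                  # single-column frame
B = pd.DataFrame({'x': [1, 9, 9], 'y': [9, 9, 9]})  # multi-column frame

counts = B.le(A.values).sum()   # per-column count of cells where B <= A
B_kept = B.loc[:, counts > 0]   # keep only columns with at least one such cell
print(counts)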

Pandas: after slicing along specific columns, get "values" without returning entire dataframe

Here is what is happening:
df = pd.read_csv('data')
important_region = df[df.columns.get_loc('A'):df.columns.get_loc('C')]
important_region_arr = important_region.values
print(important_region_arr)
Now, here is the issue:
print(important_region.shape)
output: (5,30)
print(important_region_arr.shape)
output: (5,30)
print(important_region)
output: my columns, in the pandas way
print(important_region_arr)
output: first 5 rows of the dataframe
How, having indexed my columns, do I transition to the numpy array?
Alternatively, I could just convert to numpy from the get-go and run the slicing operation within numpy. But, how is this done in pandas?
So here is how you can slice the dataframe down to specific columns. loc gives you access to a group of rows and columns: the part before the comma selects rows and the part after selects columns. If a : is given for the rows, it means all rows.
df.loc[:, 'A':'C']
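To get the underlying NumPy array from that slice (the "values" the question asks about), you can chain .to_numpy() (or .values); a minimal sketch, assuming columns 'A' through 'C' exist in df:

important_region_arr = df.loc[:, 'A':'C'].to_numpy()
print(important_region_arr.shape)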
For more understanding, please look at the documentation.
