subtract m rows of dataframe from other m rows - python

I have a dataframe with n rows. All values in the dataframe can be assumed to be integers. I wish to subtract a particular m rows from another set of m rows. For example, I wish to do:
df[i:i+m] - df[j:j+m]
This should return a dataframe.

Direct subtraction of the two slices would align rows by index label (producing NaNs wherever the labels differ), so use the NumPy representation of your sliced dataframes and feed it into the pd.DataFrame constructor:
res = pd.DataFrame(df.iloc[i:i+m].values - df.iloc[j:j+m].values,
                   columns=df.columns)
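For example, with a hypothetical 6-row frame (the column names and the i, j, m values below are made up for illustration):
import pandas as pd
df = pd.DataFrame({'a': range(0, 6), 'b': range(10, 16)})
i, j, m = 0, 3, 3  # subtract rows 3..5 from rows 0..2
# Using .values matches rows by position instead of by index label
res = pd.DataFrame(df.iloc[i:i+m].values - df.iloc[j:j+m].values,
                   columns=df.columns)
print(res)  # every cell is -3 here, since each column increases by 1 per row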

Related

Pandas dataframes: Create new column with a formula that uses all values of column X until each row (similar to cumsum)

A lot of times (e.g. for time series) I need to use all the values in a column until the current row.
For instance, if my dataframe has 100 rows, I want to create a new column where the value in each row is a (sum, average, product, [any other formula]) of all the previous rows, excluding the next ones:
Row 20 = formula(all_values_until_row_20)
Row 21 = formula(all_values_until_row_21)
etc
I think the easiest way to ask this question would be: How to implement the cumsum() function for a new column in pandas without using that specific method?
One approach, if you cannot use cumsum, is to introduce a new column or index and then apply a lambda function that uses all rows whose new column value is less than or equal to the current row's.
import pandas as pd
df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')
df['Id'] = range(0, len(df.index))  # running row position
df['Sum'] = df.apply(lambda x: df[df['Id'] <= x['Id']]['y'].sum(), axis=1)  # sum of y over all rows up to and including this one
print(df)
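For this toy frame the Sum column comes out as 40, 81, 123, 166, 210, 255, 301, 348, 396, 445, i.e. exactly the running total that cumsum() would give for y.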
Since there is no sample data, I will go with an assumed dataframe with at least one column of numeric data and no NaN values.
I would start like below for cumulative sums and averages.
cumulative sum:
df['cum_sum'] = df['existing_col'].cumsum()
cumulative average:
df['cum_avg'] = df['existing_col'].cumsum() / df['index_col']  # assumes 'index_col' holds the 1-based running row count
or
df['cum_avg'] = df['existing_col'].expanding().mean()
If you can provide a sample DataFrame, I believe you can get better help.
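If the formula is something other than a plain sum or mean, expanding().apply() gives the same "all rows up to and including the current one" semantics for an arbitrary function. A minimal sketch, reusing the y column from the example above:
import pandas as pd
df = pd.DataFrame({'y': range(40, 50)})
# Each row gets formula(all values of y up to and including that row)
df['expanding_sum'] = df['y'].expanding().apply(lambda s: s.sum())
df['expanding_prod'] = df['y'].expanding().apply(lambda s: s.prod())
print(df)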

Comparing two dataframes and storing results in another dataframe

I have two data frames like this: the first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows (dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell (row) of that column to the corresponding row in dataframe A.
(Example: For the first column of dataframe B I compare the first row to the first row of dataframe A, then the second row of B to the second row of A etc.)
Basically I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is smaller than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on which is easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically, if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row; I was thinking of creating a for loop around it), but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
    DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.le(dfA.values).any()]
Explanation: dfA.values returns the NumPy array with shape (720, 1). dfB.le(dfA.values) then checks each column of dfB element-wise against that single column of dfA; this returns a boolean dataframe the same size as dfB. Finally, .any() checks each column of that boolean dataframe for at least one True, so only the columns of dfB with at least one cell satisfying B <= A are kept.
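As a quick self-contained check of that expression (toy 4x1 and 4x3 frames standing in for the 720-row data):
import pandas as pd
dfA = pd.DataFrame({'a': [5, 5, 5, 5]})
dfB = pd.DataFrame({'c1': [6, 7, 8, 9],   # never <= 5, gets dropped
                    'c2': [4, 9, 9, 9],   # one cell <= 5, kept
                    'c3': [1, 2, 3, 4]})  # every cell <= 5, kept
kept = dfB.loc[:, dfB.le(dfA.values).any()]
print(kept.columns.tolist())  # ['c2', 'c3']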
How about this:
pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), 1, np.nan), columns=B.columns, index=A.index).dropna(axis=1, how='all')
You can replace the np.nan in the np.where call with whatever values you wish, including keeping the original values of dataframe 'B'.

Pyspark dataframe filter using occurrence based on column

I have a pyspark dataframe and I want to filter it on columns A and B. I want to get only the values of B where the number of occurrences of the corresponding A value is greater than some number N.
Column A is like an id which can have repeated values. Right now I am doing a group by, then filtering, and then using a list of values, which is not efficient, so I am looking for a more efficient solution.
Example
N = 5
[Input image]
[Expected output image]
You can see there that only ID1 and ID3 of column A are selected because of the threshold of 5; all the rest are excluded.
Try the following:
df = ... # The dataframe
N = 5 # The value to test
df_b = df.filter(df['A'] >= N).select('B')
This will first filter the dataframe so it only contains the rows where A is >= N, together with their corresponding B values. After applying the filter, only column B is selected to obtain the final result.
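Note that the snippet above filters on the value of A itself. If the goal is instead to keep rows whose A value occurs more than N times (as the question describes), one way is a window count over A; a sketch, assuming the columns are literally named A and B:
from pyspark.sql import functions as F, Window
w = Window.partitionBy('A')
df_b = (df.withColumn('a_count', F.count('*').over(w))  # occurrences of each A value
          .filter(F.col('a_count') > N)                 # keep ids seen more than N times
          .select('B'))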

python pandas multiindex to array of matrices

I have a pandas DataFrame - assume, for example, that it contains some rolling covariance estimates:
df = #some data
rolling_covariance = df.rolling(window=100).cov()
If the data has n columns, then rolling_covariance will contain m n-by-n matrices, where m is the number of rows in df.
Is there a quick one-liner to transform rolling_covariance into a NumPy array of matrices? For example, you can access individual rows in rolling_covariance using iloc, and you can also iterate through the first level of the MultiIndex and extract the data that way, but I'd like something fast and simple if available.
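One possibility, under the assumption that the MultiIndex is laid out as m consecutive n-row blocks (which is how rolling(...).cov() stacks its output), is a plain reshape of the underlying array; a sketch with made-up data:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(500, 3), columns=['a', 'b', 'c'])
rolling_covariance = df.rolling(window=100).cov()
m, n = df.shape  # m rows, n columns
cov_matrices = rolling_covariance.to_numpy().reshape(m, n, n)
print(cov_matrices.shape)  # (500, 3, 3)
# The first window-1 matrices are all-NaN because the window is not yet full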

Creating new pandas dataframe by extracting columns from other dataframes - ValueError

I have to extract columns from different pandas dataframes and merge them into a single new dataframe. This is what I am doing:
newdf=pd.DataFrame()
newdf['col1']=sorted(df1.columndf1.unique())
newdf['col2']=df2.columndf2.unique(),
newdf['col3']=df3.columndf3.unique()
newdf
I am sure that the three columns have the same length (I have checked) but I get the error
ValueError: Length of values does not match length of index
I have tried to pass them as pd.Series but the result is the same. I am on Python 2.7.
It seems the problem is that the numbers of unique values are different.
One possible solution is to concat all the data together and apply unique.
If the unique values are not all the same size, you get NaNs in the last values of the shorter columns.
newdf = (pd.concat([df1.columndf1, df2.columndf2, df3.columndf3], axis=1)
           .apply(lambda x: pd.Series(x.unique())))
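To see how the NaN padding comes about when the unique counts differ, here is a small toy version using the question's placeholder column names:
import pandas as pd
df1 = pd.DataFrame({'columndf1': [3, 1, 1, 2]})   # uniques: 3, 1, 2
df2 = pd.DataFrame({'columndf2': [5, 5, 6, 7]})   # uniques: 5, 6, 7
df3 = pd.DataFrame({'columndf3': [9, 9, 9, 8]})   # uniques: 9, 8 (shorter)
newdf = (pd.concat([df1.columndf1, df2.columndf2, df3.columndf3], axis=1)
           .apply(lambda x: pd.Series(x.unique())))
print(newdf)
#    columndf1  columndf2  columndf3
# 0          3          5        9.0
# 1          1          6        8.0
# 2          2          7        NaN   <- shorter column padded with NaN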
EDIT:
Another possible solution:
a = sorted(df1.columndf1.unique())
b = list(df2.columndf2.unique())
c = list(df3.columndf3.unique())
newdf=pd.DataFrame({'col1':a, 'col2':b, 'col3':c})
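Note that this second version only works if a, b and c really do end up with the same length; if they differ, the pd.DataFrame constructor will also raise a ValueError complaining that all arrays must be of the same length.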
