Drop rows based on specific conditions - Python

Here is a part of df:
NUMBER  MONEY
12345   20
12345   -20
12345   20
12345   20
123456  10
678910  7.6
123457  3
678910  -7.6
I want to drop rows which have the same NUMBER but opposite money.
The ideal outcome would look like the one below:
NUMBER  MONEY
12345   20
12345   20
123456  10
123457  3
Note: these entries are not in one-to-one correspondence (I mean the total number of entries for a NUMBER can be odd).
For example, there are four entries with [NUMBER] 12345:
three of them have [MONEY] 20, and one has [MONEY] -20.
I just want to delete the two rows whose [MONEY] values are opposite (one 20 and the -20), and keep the other two whose money is 20.

Here is a solution using groupby and apply with a custom function to match and delete pairs:
def remove_pairs(x):
    positive = x.loc[x['MONEY'] > 0].index.values
    negative = x.loc[x['MONEY'] < 0].index.values
    for i, j in zip(positive, negative):
        x = x.drop([i, j])
    return x
df['absvalues'] = df['MONEY'].abs()
dd = df.groupby(['NUMBER', 'absvalues']).apply(remove_pairs)
dd.reset_index(drop=True, inplace=True)
dd.drop('absvalues', axis=1, inplace=True)
An 'absvalues' column holding the absolute values of 'MONEY' is added to perform a double-index selection with groupby; the custom function then drops rows in pairs, matching positive numbers against negative ones.
The last two lines just do some cleanup. Using your sample dataframe, the final result dd is:
   NUMBER  MONEY
0   12345   20.0
1   12345   20.0
2  123456   10.0
3  123457    3.0
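If the row-by-row drop inside remove_pairs ever becomes a bottleneck on a larger frame, the same pairing idea can be vectorized. A minimal sketch (my own variant, not part of the answer above; it assumes MONEY is never exactly 0):
import pandas as pd

df = pd.DataFrame({'NUMBER': [12345, 12345, 12345, 12345, 123456, 678910, 123457, 678910],
                   'MONEY': [20, -20, 20, 20, 10, 7.6, 3, -7.6]})

absval = df['MONEY'].abs().rename('absval')
pos = (df['MONEY'] > 0).rename('pos')

# How many rows of each sign share a (NUMBER, |MONEY|) bucket.
n_pos = pos.groupby([df['NUMBER'], absval]).transform('sum')
n_neg = (~pos).groupby([df['NUMBER'], absval]).transform('sum')

# Rank each row within its (NUMBER, |MONEY|, sign) bucket...
k = df.groupby([df['NUMBER'], absval, pos]).cumcount()

# ...and keep only the rows that cannot be matched with an
# opposite-sign partner.
dd = df[(pos & (k >= n_neg)) | (~pos & (k >= n_pos))].reset_index(drop=True)
print(dd)  # the same four rows as above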

Related

Count instances of a random number of n length

I have a pandas dataframe column containing numbers of varying length. I want to count how many instances of six digit numbers there are in the column, regardless of which digits they contain or their order.
Example:
import pandas as pd
df = pd.DataFrame({"number": [1234, 12345, 777777, 949494, 22, 987654]})
This should return that there are three instances of a six digit number in the column.
I would convert the column to string, check the length of each string, and sum those whose length is 6:
(df['number'].astype(str).apply(len) == 6).sum()
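A slightly more idiomatic spelling of the same idea, for what it's worth, uses pandas' .str accessor instead of apply:
(df['number'].astype(str).str.len() == 6).sum()  # 3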
Use np.log10 and floor division, which give you the order of magnitude of a number (this assumes numpy is imported as np). Then check how many values satisfy that condition:
N = 6
(np.log10(df['number'])//1).eq(N-1).sum()
#3
You can use np.ceil and np.log10:
df['length'] = np.ceil(np.log10(df['number']))
Result:
   number  length
0    1234     4.0
1   12345     5.0
2  777777     6.0
3  949494     6.0
4      22     2.0
5  987654     6.0
To count instances use:
np.ceil(np.log10(df['number'])).eq(6).sum()
Valid only for values > 0. Note also that exact powers of ten come out one low with this formula: np.ceil(np.log10(100000)) is 5.0, yet 100000 has six digits.
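For completeness, a sketch of a reusable helper built on the same identity (my own, assuming positive integers); using floor instead of ceil also handles exact powers of ten:
import numpy as np
import pandas as pd

df = pd.DataFrame({"number": [1234, 12345, 777777, 949494, 22, 987654]})

def count_n_digit(s, n):
    # 10**k has k + 1 digits, so floor(log10(x)) + 1 is the digit count.
    digits = np.floor(np.log10(s)).astype(int) + 1
    return int((digits == n).sum())

print(count_n_digit(df['number'], 6))  # 3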

Pandas: Calculate Median Based on Multiple Conditions in Each Row

I am trying to calculate median values on the fly based on multiple conditions in each row of a data frame and am not getting there.
Basically, for every row, I am counting the number of people in the same department with rank B with pay greater than the pay listed in that row. I was able to get the count to work properly with a lambda function:
df['B Count'] = df.apply(lambda x: sum(df[(df['Department'] == x['Department']) & (df['Rank'] == 'B')]['Pay'] > x['Pay']), axis=1)
However, I now need to calculate the median for each case satisfying those conditions. So in row x of the data frame, I need the median of df['Pay'] for all others matching x['Department'] and df['Rank'] == 'B'. I can't apply .median() instead of sum(), as that gives me the median count, not the median pay. Any thoughts?
Using the fake data below, the 'B Count' code from above counts the number of B's in each Department with higher pay than each A. That part works fine. What I want is to then construct the 'B Median' column, calculating the median pay of the B's in each Department with higher pay than each A in the same Department.
Person  Department  Rank  Pay   B Count  B Median
1       One         A     1000  1        1500
2       One         B     800
3       One         A     500   2        1150
4       One         A     3000  0
5       One         B     1500
6       Two         B     2000
7       Two         B     1800
8       Two         A     1500  3        1800
9       Two         B     1700
10      Two         B     1000
Well, I was able to do what I wanted to do with a function:
def median_b(x):
    if x['B Count'] == 0:
        return np.nan
    else:
        return df[(df['Department'] == x['Department']) & (df['Rank'] == 'B') &
                  (df['Pay'] > x['Pay'])]['Pay'].median()

df['B Median'] = df.apply(median_b, axis=1)
Do any of you know of better ways to achieve this result?
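One possibility is to compute both statistics in a single pass, so that df is filtered only once per row. A sketch (my own variant, reconstructing the sample data above; unlike the table, it fills the pair for every row, not only the rank-A ones):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Person': range(1, 11),
    'Department': ['One'] * 5 + ['Two'] * 5,
    'Rank': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'B'],
    'Pay': [1000, 800, 500, 3000, 1500, 2000, 1800, 1500, 1700, 1000],
})

def b_stats(row):
    # Pay of rank-B people in the same department earning more than this row.
    higher = df.loc[(df['Department'] == row['Department'])
                    & (df['Rank'] == 'B')
                    & (df['Pay'] > row['Pay']), 'Pay']
    return pd.Series({'B Count': len(higher),
                      'B Median': higher.median() if len(higher) else np.nan})

df[['B Count', 'B Median']] = df.apply(b_stats, axis=1)
Filtering once per row halves the work compared with two separate apply calls, though for very large frames a merge-based approach may still be faster.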

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales, and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate one fruit, say 'pears', and subtract each average from its current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding the pear_sales value of 40, backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I do get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output still works.
For the second set of operations, I need to add 5 ('new_purchases') rows to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again I'm worried about readability, since another temporary column is created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, as the result of splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract the (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group (grouped by Fruit), drop the Fruit column, and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
    Description        Value  some_op
0   apple_yearly_avg      57        0
1   apple_sales          100       43
2   apple_monthly_avg     80       23
3   apple_st_dev          12      -45
4   apple_st_dev          12     6100
5   apple_st_dev          12     4900
6   apple_st_dev          12     3700
7   apple_st_dev          12     2500
8   apple_st_dev          12     1300
9   pears_monthly_avg     33       -2
10  pears_sales           40        5
11  pears_yearly_avg      35        0
12  pears_st_dev           8      -27
13  pears_st_dev           8     1640
14  pears_st_dev           8     1320
15  pears_st_dev           8     1000
16  pears_st_dev           8      680
17  pears_st_dev           8      360
Edit
I'm in doubt whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.
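For instance, a minimal sketch of that tweak (the '_new_purchase' label is purely hypothetical; wrk2 still carries a Fruit column, since grp contains it):
# inside reformat, right after wrk2 is created:
wrk2['Description'] = wrk2['Fruit'] + '_new_purchase'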

How many data points are plotted on my matplotlib graph?

So I want to count the number of data points plotted on my graph to keep a running total of graphed data. The problem is that my data table has NaN values in different rows for different columns, so a row may have a value in one column and a NaN in another. For example:
# I use num1 as my y-coordinate and num1-num2 for my x-coordinate.
num1  num2  num3
1     NaN   25
NaN   7     45
3     8     63
NaN   NaN   23
5     10    42
NaN   4     44
# So in this case, there should be only 2 data points on the graph between num1 and num2.
# For num1 and num3, there should be 3. There should be 4 data points between num2 and num3.
I believe Matplotlib doesn't graph the rows of a column that contain NaN values, since they are null (please correct me if I'm wrong; I can only tell because no dots appear at the 0 coordinate of the x and y axes). In the beginning, I thought I could get away with using .count(), finding the smaller of the two columns and using that as my tracker, but realistically that won't work, as shown in my example above, because the true number can be even LESS than that: one column may have a NaN value where the other has an actual value. Some examples of code I tried:
# Both x and y are columns within the DataFrame and are used to "count"
# how many data points are being graphed.
def findAmountOfDataPoints(colA, colB):
    if colA.count() < colB.count():
        print(colA.count())  # The smaller value: print the number of values in colA.
    else:
        print(colB.count())  # The smaller value: print the number of values in colB.
Also, I thought about using .value_counts(), but I'm not sure if that's the exact function I'm looking for to complete what I want. Any suggestions?
Edit 1: Changed the DataFrame names to hopefully make the example clearer.
If I understood your problem correctly, assuming that your table is a pandas dataframe df (and numpy is imported as np), the following code should work:
sum((~np.isnan(df['num1']) & (~np.isnan(df['num2']))))
How it works:
np.isnan returns True if a cell is NaN. ~np.isnan is the inverse, hence it returns True when the cell is not NaN.
The code checks where both column "num1" AND column "num2" contain a non-NaN value; in other words, it returns True for the rows where both values exist.
Finally, those good rows are counted with sum, which takes only True values into account.
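Wrapped as a small helper, a sketch of the same idea using pandas' own notna, which sidesteps np.isnan's restriction to float dtypes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'num1': [1, np.nan, 3, np.nan, 5, np.nan],
                   'num2': [np.nan, 7, 8, np.nan, 10, 4],
                   'num3': [25, 45, 63, 23, 42, 44]})

def plotted_points(df, x_col, y_col):
    # A point is drawn only when both coordinates exist.
    return int((df[x_col].notna() & df[y_col].notna()).sum())

print(plotted_points(df, 'num1', 'num2'))  # 2
print(plotted_points(df, 'num1', 'num3'))  # 3
print(plotted_points(df, 'num2', 'num3'))  # 4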
The way I understood it is that the number of combinations of points that are not NaN is needed. Using a function I found, I came up with this:
import pandas as pd
import numpy as np
def choose(n, k):
    """
    A fast way to calculate binomial coefficients by Andrew Dalke (contrib).
    https://stackoverflow.com/questions/3025162/statistics-combinations-in-python
    """
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0
data = {'num1': [1, np.nan, 3, np.nan, 5, np.nan],
        'num2': [np.nan, 7, 8, np.nan, 10, 4],
        'num3': [25, 45, 63, 23, 42, 44]}
df = pd.DataFrame(data)
df['notnulls'] = df.notnull().sum(axis=1)
df['plotted'] = df.apply(lambda row: choose(int(row.notnulls), 2), axis=1)
print(df)
print("Total data points: ", df['plotted'].sum())
With this result:
   num1  num2  num3  notnulls  plotted
0   1.0   NaN    25         2        1
1   NaN   7.0    45         2        1
2   3.0   8.0    63         3        3
3   NaN   NaN    23         1        0
4   5.0  10.0    42         3        3
5   NaN   4.0    44         2        1
Total data points: 9
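As a side note, on Python 3.8+ the hand-rolled choose can be replaced with math.comb from the standard library:
import math
df['plotted'] = df['notnulls'].apply(lambda n: math.comb(int(n), 2))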

Calculating Excess Cash Amount in ATM

I want to calculate the excess amount remaining in an ATM from the given dataset of transactions and replenishments.
I can do it by looping over the data and subtracting the transactions from the current amount, but I need to do this without using a loop.
# R: Replenishment amount
# T: Transaction amount
'''
R    T
100  50
0    30
0    10
200  110
0    30
60   20
'''
data = {'Date': pd.date_range('2011-05-03', '2011-05-08').tolist(),
        'R': [100, 0, 0, 200, 0, 60],
        'T': [50, 30, 10, 110, 30, 20]}
df = pd.DataFrame(data)
# Calculate a temporary amount and shift it so that future
# transactions are subtracted from it.
df['temp'] = (df['R'] - df['T']).shift(1).bfill()
# Boolean indicating whether the ATM was replenished.
# 1: replenished, 0: not replenished
df['replenished'] = (df['R'] > 0).astype(int)
# If replenished, subtract the transaction amount from the replenishment
# amount; otherwise subtract it from the temp amount.
df['replenished'] * df['R'] + np.logical_not(df['replenished']).astype(int) * df['temp'] - df['T']
Expected Results:
0 50.0
1 20.0
2 10.0
3 90.0
4 60.0
5 40.0
dtype: float64
Actual Results:
0 50.0
1 20.0
2 -40.0
3 90.0
4 60.0
5 40.0
dtype: float64
First of all, we compute a boolean column to know if it was replenished, as you do.
df['replenished'] = df['R'] > 0
We also compute the increment in money, which will be useful to perform the rest of the operations.
df['increment'] = df['R'] - df['T']
We also create the column which will eventually hold the desired values; I called it reserve. To begin, we take the cumulative sum of the increments, which is the desired value from the first replenishment day until the next one.
df['reserve'] = df['increment'].cumsum()
Now we create an auxiliary alias of our dataframe, which lets us do the operations without losing the original data. Remember that this variable is not a copy; it points to the same data as the original: a change in df_aux will change the original variable df.
df_aux = df
Then we can proceed to the loop that will take care of the problem.
while not df_aux.empty:
    df_aux = df_aux.loc[df_aux.loc[df_aux['replenished']].index[0]:]
    k = df_aux.at[df_aux.index[0], 'reserve']
    l = df_aux.at[df_aux.index[0], 'increment']
    df_aux['reserve'] = df_aux['reserve'] - k + l
    if len(df_aux) > 1:
        df_aux = df_aux.loc[df_aux.index[1]:]
    else:
        break
First, we take the part of the dataframe starting from the next replenishment day. From that day to the next replenishment day, the cumulative sum gives the desired outcome if its initial value is equal to the increment, so we adjust the cumsum so that the first value complies with this condition.
Then, if this was the last row of the dataframe, our work is done and we exit the loop. If it wasn't, we drop the replenishment day we just calculated and go on to the next days.
After all these operations, the result (df) is this:
        Date    R    T  increment  replenished  reserve
0 2011-05-03  100   50         50         True       50
1 2011-05-04    0   30        -30        False       20
2 2011-05-05    0   10        -10        False       10
3 2011-05-06  200  110         90         True       90
4 2011-05-07    0   30        -30        False       60
5 2011-05-08   60   20         40         True       40
I have no experience with measuring computation time, so I'm not sure whether this solution is faster than looping through all the rows.
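For what it's worth, a loop-free sketch of the same calculation (my own variant; it assumes, as the expected output implies, that a replenishment resets the balance and that the first row is a replenishment day):
import pandas as pd

data = {'Date': pd.date_range('2011-05-03', '2011-05-08').tolist(),
        'R': [100, 0, 0, 200, 0, 60],
        'T': [50, 30, 10, 110, 30, 20]}
df = pd.DataFrame(data)

# Label each run of days that starts with a replenishment.
segment = (df['R'] > 0).cumsum()

# Within a segment, the excess is the replenished amount minus the
# running total of transactions since that replenishment.
reserve = df.groupby(segment)['R'].transform('first') - df.groupby(segment)['T'].cumsum()
print(reserve)  # 50, 20, 10, 90, 60, 40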
