convert R percentage equation to pandas - python

Hello, I am trying to convert this R expression to pandas, as I am not familiar with R:
sum(data_file$finished_race_date >= 0, na.rm = TRUE)/sum(data_file$signup_race_date >= 0, na.rm = TRUE)
I am trying to figure out what percentage of runners finished the race.

You need to divide the sums of True values of two boolean masks created with notnull():
100 * data_file.finished_race_date.notnull().sum()/data_file.signup_race_date.notnull().sum()
Sample:
import pandas as pd
import numpy as np
data_file = pd.DataFrame({'finished_race_date': ['2/5/16', np.nan, np.nan],
                          'signup_race_date': [np.nan, '2/5/16', '2/5/16']})
print (data_file)
finished_race_date signup_race_date
0 2/5/16 NaN
1 NaN 2/5/16
2 NaN 2/5/16
print (data_file.finished_race_date.notnull())
0 True
1 False
2 False
Name: finished_race_date, dtype: bool
print (data_file.finished_race_date.notnull().sum())
1
finished_race_date = data_file.finished_race_date.notnull().sum()
signup_race_date = data_file.signup_race_date.notnull().sum()
print (100 * finished_race_date / signup_race_date)
50.0
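Alternatively, Series.count() returns the number of non-missing values directly, so the same percentage can be written without an explicit boolean mask. A small sketch using the sample DataFrame from above:

```python
import pandas as pd
import numpy as np

data_file = pd.DataFrame({'finished_race_date': ['2/5/16', np.nan, np.nan],
                          'signup_race_date': [np.nan, '2/5/16', '2/5/16']})

# Series.count() counts non-NaN entries, mirroring what
# R's sum(..., na.rm = TRUE) does over a non-NA condition
pct = 100 * data_file['finished_race_date'].count() / data_file['signup_race_date'].count()
print(pct)  # 50.0
```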

How to search for a string in pandas dataframe and match with another?

I'm trying to compare two string columns from two different pandas DataFrames (A and B); if one matches a piece of the other's string, I would like to assign the value of a column in DataFrame A to DataFrame B.
This is my code:
import numpy as np
import pandas as pd
A = ['DF-PI-05', 'DF-PI-09', 'DF-PI-10', 'DF-PI-15', 'DF-PI-16',
'DF-PI-19', 'DF-PI-89', 'DF-PI-92', 'DF-PI-93', 'DF-PI-94',
'DF-PI-95', 'DF-PI-96', 'DF-PI-25', 'DF-PI-29', 'DF-PI-30',
'DF-PI-34', 'DF-PI-84']
B = ['PI-05', 'PI-10', 'PI-89', 'PI-90', 'PI-93', 'PI-94', 'PI-95',
'PI-96', 'PI-09', 'PI-15', 'PI-16', 'PI-19', 'PI-91A', 'PI-91b',
'PI-92', 'PI-25-CU', 'PI-29', 'PI-30', 'PI-34', 'PI-84-CU-S1',
'PI-84-CU-S2']
import random
sample_size = len(A)
Group = [random.randint(0,1) for _ in range(sample_size)]
A = pd.DataFrame(list(zip(A,Group)),columns=['ID','Group'])
B = pd.DataFrame(B,columns=['Name'])
clus_tx = np.array([])
for date, row in B.iterrows():
    for date2, row2 in A.iterrows():
        if row2['ID'] in row['Name']:
            clus = row['Group']
        else:
            clus = 999
        clus_tx = np.append(clus_tx, clus)
B['Group'] = clus_tx
What I would like to have is a np.array clus_tx with the length of B, where if there is an element with the string that matches in A ('PI-xx'), I would take the value of the column 'Group' from A and assign to B, if there is no string matching, I would assign the value of 999 to the column 'Group' in B.
I think I'm doing the loop wrong because the size of clus_tx is not what I expected...My real dataset is huge, so I can't do this manually.
First, the reason why the size of clus_tx is not what you want is that you put clus_tx = np.append(clus_tx,clus) in the innermost loop, which has no break. So the length of clus_tx will always be len(A) x len(B).
Second, the logic of the if statement block is not what you want.
I've changed the code a bit, hope it helps:
import numpy as np
import pandas as pd
A = ['DF-PI-05', 'DF-PI-09', 'DF-PI-10', 'DF-PI-15', 'DF-PI-16',
'DF-PI-19', 'DF-PI-89', 'DF-PI-92', 'DF-PI-93', 'DF-PI-94',
'DF-PI-95', 'DF-PI-96', 'DF-PI-25', 'DF-PI-29', 'DF-PI-30',
'DF-PI-34', 'DF-PI-84']
B = ['PI-05', 'PI-10', 'PI-89', 'PI-90', 'PI-93', 'PI-94', 'PI-95',
'PI-96', 'PI-09', 'PI-15', 'PI-16', 'PI-19', 'PI-91A', 'PI-91b',
'PI-92', 'PI-25-CU', 'PI-29', 'PI-30', 'PI-34', 'PI-84-CU-S1',
'PI-84-CU-S2']
import random
sample_size = len(A)
Group = [random.randint(0,1) for _ in range(sample_size)]
A = pd.DataFrame(list(zip(A,Group)),columns=['ID','Group'])
B = pd.DataFrame(B,columns=['Name'])
clus_tx = np.array([])
for date, row_B in B.iterrows():
    clus = 999
    for date2, row_A in A.iterrows():
        if row_B['Name'] in row_A['ID']:
            clus = row_A['Group']
            break
    clus_tx = np.append(clus_tx, clus)
B['Group'] = clus_tx
print(B)
The print output of B looks like:
Name Group
0 PI-05 0.0
1 PI-10 0.0
2 PI-89 1.0
3 PI-90 999.0
4 PI-93 0.0
5 PI-94 1.0
6 PI-95 1.0
7 PI-96 0.0
8 PI-09 1.0
9 PI-15 0.0
10 PI-16 1.0
11 PI-19 1.0
12 PI-91A 999.0
13 PI-91b 999.0
14 PI-92 1.0
15 PI-25-CU 999.0
16 PI-29 0.0
17 PI-30 1.0
18 PI-34 0.0
19 PI-84-CU-S1 999.0
20 PI-84-CU-S2 999.0
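Since the question mentions a huge dataset, note that the nested iterrows() loops are O(len(A) × len(B)). For this particular data, where every ID in A appears to be the matching Name prefixed with 'DF-', the same result can be obtained with a vectorized merge. This is a sketch under that prefix assumption, not a general substring matcher:

```python
import pandas as pd

# Small stand-in data in the same shape as the question's A and B
A = pd.DataFrame({'ID': ['DF-PI-05', 'DF-PI-09', 'DF-PI-92'],
                  'Group': [0, 1, 1]})
B = pd.DataFrame({'Name': ['PI-05', 'PI-90', 'PI-92']})

# Strip the 'DF-' prefix so A's key matches B's Name exactly
A['Name'] = A['ID'].str.replace('DF-', '', regex=False)

# Left merge keeps every row of B; unmatched rows get NaN, filled with 999
out = B.merge(A[['Name', 'Group']], on='Name', how='left')
out['Group'] = out['Group'].fillna(999).astype(int)
print(out)
#     Name  Group
# 0  PI-05      0
# 1  PI-90    999
# 2  PI-92      1
```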

Pandas IF statement referencing other column value

I can't for the life of me find an example of this operation in pandas. I am trying to write an IF statement: if my Check column is True, pull the value from my Proficiency_Value column, and if it's False, default to 1000.
report_skills['Check'] = report_skills.apply(lambda x: x.Skill_Specialization in x.Specialization, axis=1)
report_skills = report_skills.loc[report_skills['Check'] == True, 'Proficiency_Value'] = 1000
Any ideas why this is not working? I'm sure this is an easy fix.
Let's create a small example DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Check': [True, False, False, True],
                   'Proficiency_Value': range(4)})
>>> df
Check Proficiency_Value
0 True 0
1 False 1
2 False 2
3 True 3
If you now use the np.where() function, you get the result you are asking for.
df['Proficiency_Value'] = np.where(df['Check']==True, df['Proficiency_Value'], 1000)
>>> df
Check Proficiency_Value
0 True 0
1 False 1000
2 False 1000
3 True 3
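The same conditional assignment can also be done in place with .loc boolean indexing, which overwrites only the rows that fail the check instead of rebuilding the whole column. A small sketch using the same example df:

```python
import pandas as pd

df = pd.DataFrame({'Check': [True, False, False, True],
                   'Proficiency_Value': range(4)})

# Overwrite only the rows where Check is False
df.loc[~df['Check'], 'Proficiency_Value'] = 1000
print(df)
#    Check  Proficiency_Value
# 0   True                  0
# 1  False               1000
# 2  False               1000
# 3   True                  3
```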

how to find increasing-decreasing trends in Python

I am trying to compare a dataframe's different columns with each other row by row, like:
for (i = startday to endday)
    if (df[i] < df[i+1])
        counter = counter + 1
    else
        i = endday + 1
The goal is to find increasing (or decreasing) trends (they need to be consecutive).
And my data looks like this
df= 1 2 3 0 1 1 1
1 1 1 1 0 1 2
1 2 1 0 1 1 2
0 0 0 0 1 0 1
(In this example the span from startday to endday is 7, but in practice these two vary.)
As a result I expect to get {2, 0, 1, 0}, and I need it to be fast because my data is quite big (1.2 million rows). Because of the time limit I tried not to use loops (for, if, etc.).
I tried the code below but couldn't figure out how to stop the counter once the condition is false:
import math
import numpy as np
import pandas as pd
df1 = df.copy()
df2 = df.copy()
bool1 = (np.less_equal.outer(startday.startday, range(1,13))
         & np.greater_equal.outer(endday.endday, range(1,13)))
bool1 = np.c_[np.zeros(len(startday)), bool1].astype('bool')
bool2 = (np.less_equal.outer(startday2.startday2, range(1,13))
         & np.greater_equal.outer(endday2.endday2, range(1,13)))
bool2 = np.c_[bool2, np.zeros(len(startday))].astype('bool')
df1.insert(0, 'c_False', math.pi)
df2.insert(12, 'c_False', math.pi)
#df2.head()
arr_bool = (bool1 & bool2 & (df1.values < df2.values))
df_new = pd.DataFrame(np.sum(arr_bool, axis=1),
                      index=data_idx, columns=['coll'])
df_new.coll = np.select(condlist=[startday.startday > endday.endday],
                        choicelist=[-999],
                        default=df_new.coll)
Add zeros at the end, then use np.diff, then get the first "non positive" using argmin:
(np.diff(np.hstack((df.values, np.zeros((df.values.shape[0], 1)))), axis=1) > 0).argmin(axis=1)
>> array([2, 0, 1, 0], dtype=int64)
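Put together as a runnable sketch (the df here is rebuilt from the sample values in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 0, 1, 1, 1],
                   [1, 1, 1, 1, 0, 1, 2],
                   [1, 2, 1, 0, 1, 1, 2],
                   [0, 0, 0, 0, 1, 0, 1]])

# Pad each row with a trailing zero so the last diff is never positive,
# then argmin finds the first False (the first non-increase) per row
padded = np.hstack((df.values, np.zeros((df.values.shape[0], 1))))
trend = (np.diff(padded, axis=1) > 0).argmin(axis=1)
print(trend)  # [2 0 1 0]
```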

How to get average of increasing values using Pandas?

I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
function I'm trying to write for my problem
def avergeIncreace(data, value):  # not complete but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a table where each value is the percentage change relative to the number in the row before it. fillna(0) replaces the NaN that pct_change() creates in position 0 with 0. gt(0) returns a True/False table depending on whether the value at that index is greater than 0.
current output of this function
In[1]:avergeIncreace(df,'A')
Out[1]: 0 False
1 True
2 False
3 True
Name: BAL, dtype: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas there should be a way to return an array of all the indexes that are true and then use a for loop and go through the original data table, but I believe pandas should have a way to do this without a for loop.
Here is what I think the for-loop way would look like (it is still missing the code to return only the indexes that are True, instead of every index):
avergeIncreace(df, 'A')
indexes = data[value].pct_change().fillna(0).gt(0).index.values  # this returns an array containing all of the indexes (True and False)
answer = 0
times = 0
for x in indexes:
    answer += (data[value][x] - data[value][x-1])
    times += 1
print(answer/times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
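For reference, here is the same one-liner run against the question's table as a self-contained sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

# diff() gives row-to-row deltas; mask() hides the non-positive ones,
# so mean() averages only the increases (all-NaN columns become 0)
result = df.diff().mask(df.diff() <= 0, np.nan).mean().fillna(0)
print(result)
# A    75.0
# B     0.0
# C    10.0
# dtype: float64
```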
How about:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

def averageIncrease(df, col_name):
    # Create array of deltas. Replace nan and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If only zero values... there is no increase
        return 0
    else:
        return np.sum(a) / count

print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0

Pandas: convert column from minutes (type object) to number

I want to convert a column of a Pandas DataFrame from an object to a number (e.g., float64). The DataFrame is the following:
import pandas as pd
import numpy as np
import datetime as dt
df = pd.read_csv('data.csv')
df
ID MIN
0 201167 32:59:00
1 203124 14:23
2 101179 8:37
3 200780 5:22
4 202699 NaN
5 203117 NaN
6 202331 36:05:00
7 2561 30:43:00
I would like to convert the MIN column from type object to a number (e.g., float64). For example, 32:59:00 should become 32.983333.
I'm not sure if it's necessary as an initial step, but I can convert each NaN to 0 via:
df['MIN'] = np.where(pd.isnull(df['MIN']), '0', df['MIN'])
How can I efficiently convert the entire column? I've tried variations of dt.datetime.strptime(), df['MIN'].astype('datetime64'), and pd.to_datetime(df['MIN']) with no success.
Defining a converter function:
def str_to_number(time_str):
    if not isinstance(time_str, str):
        return 0
    minutes, sec, *_ = [int(x) for x in time_str.split(':')]
    return minutes + sec / 60
and applying it to the MIN column:
df.MIN = df.MIN.map(str_to_number)
works.
Before:
ID MIN
0 1 32:59:00
1 2 NaN
2 3 14:23
After:
ID MIN
0 1 32.983333
1 2 0.000000
2 3 14.383333
The above is for Python 3. This works for Python 2:
def str_to_number(time_str):
    if not isinstance(time_str, str):
        return 0
    entries = [int(x) for x in time_str.split(':')]
    minutes = entries[0]
    sec = entries[1]
    return minutes + sec / 60.0
Note the 60.0. Alternatively, use from __future__ import division to avoid the integer division problem.
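If you prefer to stay vectorized, the split can also be done column-wise with str.split(expand=True). This is a sketch assuming every value is a 'MM:SS' or 'MM:SS:00' string or NaN, as in the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 2, 3],
                   'MIN': ['32:59:00', np.nan, '14:23']})

# expand=True spreads the ':'-separated parts into separate columns;
# NaN rows propagate as NaN through the arithmetic and are filled with 0
parts = df['MIN'].str.split(':', expand=True)
df['MIN'] = (parts[0].astype(float) + parts[1].astype(float) / 60).fillna(0)
print(df)
#    ID        MIN
# 0   1  32.983333
# 1   2   0.000000
# 2   3  14.383333
```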
