I'm trying to split a UFC record column into multiple columns and am having trouble. The data looks like this
record
1 22–8–1
2 18–7–1
3 12–4
4 8–2 (1 NC)
5 23–9–1
6 23–12
7 19–4–1
8 18–5–1 (1 NC)
The first number is wins, the second losses. If there is a third it is the draws, and if there is a parenthesis and a number it is the "no contests". I want to split it up and have it look like this.
wins loses draws no_contests
1 22 8 1 NaN
2 18 7 1 NaN
3 12 4 NaN NaN
4 8 2 NaN 1
5 23 9 1 NaN
6 23 12 NaN NaN
7 19 4 1 NaN
8 18 5 1 1
I tried using .str.split("-"), which just made things more complicated for me. Then I tried making a for loop with a bunch of if statements to try and filter out some of the more complicated records, but failed miserably at that. Does anyone have any ideas as to what I could do? Thanks so much!
# So you can copy and paste the data in
import pandas as pd
data = {'record': ['22–8–1', '18–7–1', '12–4', '8–2 (1 NC)', '23–9–1', '23–12', '19–4–1', '18–5–1 (1 NC)']}
df = pd.DataFrame(data)
This is a job for pandas.Series.str.extract():
# Normalize the en-dashes to ASCII hyphens
df['record'] = df['record'].str.replace('–', '-')
new_df = df['record'].str.extract(r'^(?P<wins>\d+)-(?P<loses>\d+)(?:-(?P<draws>\d+))?\s*(?:\((?P<no_contests>\d+) NC\))?')
Output:
>>> new_df
wins loses draws no_contests
0 22 8 1 NaN
1 18 7 1 NaN
2 12 4 NaN NaN
3 8 2 NaN 1
4 23 9 1 NaN
5 23 12 NaN NaN
6 19 4 1 NaN
7 18 5 1 1
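One caveat: str.extract returns string (object) columns, so arithmetic on the result needs a numeric conversion afterwards. A minimal sketch of that follow-up step, assuming the same data as above:

```python
import pandas as pd

data = {'record': ['22–8–1', '18–7–1', '12–4', '8–2 (1 NC)', '23–9–1',
                   '23–12', '19–4–1', '18–5–1 (1 NC)']}
df = pd.DataFrame(data)

# Normalize the dashes, then extract as in the answer above
df['record'] = df['record'].str.replace('–', '-')
new_df = df['record'].str.extract(
    r'^(?P<wins>\d+)-(?P<loses>\d+)(?:-(?P<draws>\d+))?\s*(?:\((?P<no_contests>\d+) NC\))?'
)

# str.extract yields strings; convert every column to numbers (NaN survives)
new_df = new_df.apply(pd.to_numeric)
```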
I have the following dataframe, in which the value should be increasing. Originally the dataframe has some unknown values.
index value
0     1
1
2
3     2
4
5
6
7     4
8
9
10    3
11    3
12
13
14
15    5
Based on the assumption that the value should be increasing, I would like to remove the values at index 10 and 11. This would be the desired dataframe:
index value
0     1
1
2
3     2
4
5
6
7     4
8
9
12
13
14
15    5
Thank you very much
Assuming NaN in the empty cells (if not, temporarily replace them with NaN), use boolean indexing:
# if not NaNs uncomment below
# and use s in place of df['value'] afterwards
# s = pd.to_numeric(df['value'], errors='coerce')
# is the cell empty?
m1 = df['value'].isna()
# are the values strictly increasing?
m2 = df['value'].ge(df['value'].cummax())
out = df[m1|m2]
Output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
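For reference, the boolean-indexing approach as one self-contained sketch, assuming NaN in the empty cells as stated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'index': range(16),
    'value': [1, np.nan, np.nan, 2, np.nan, np.nan, np.nan, 4,
              np.nan, np.nan, 3, 3, np.nan, np.nan, np.nan, 5],
})

m1 = df['value'].isna()                    # is the cell empty?
m2 = df['value'].ge(df['value'].cummax())  # is the value still >= the running max?
out = df[m1 | m2]                          # drops index 10 and 11
```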
Try this:
def del_df(df):
    df_no_na = df.dropna().reset_index(drop = True)
    num_tmp = df_no_na['value'][0] # First value which is not NaN.
    del_index_list = [] # indices to delete
    for row_index in range(1, len(df_no_na)):
        if df_no_na['value'][row_index] > num_tmp : # Increasing
            num_tmp = df_no_na['value'][row_index] # to compare the following two values.
        else : # Not increasing (same or decreasing)
            del_index_list.append(df_no_na['index'][row_index]) # index to delete
    df_goal = df.drop([df.index[i] for i in del_index_list])
    return df_goal
output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
I'm trying to create a rolling function that:
1. Divides two DataFrames with 3 columns each.
2. Calculates the mean of each row from the output of step 1.
3. Sums the averages from step 2.
This could be done with df.iterrows(), i.e. looping through each row. However, this would be inefficient when working with larger datasets. Therefore, my objective is to create a pd.rolling function that could do this much faster.
What I would need help with is to understand why my approach below returns multiple values while the function I'm using only returns a single value.
EDIT : I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output by looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop="True").values)-1)*100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)

# printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be slow when working with large datasets. Therefore, I've tried to create a function which applies to a pd.rolling() object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop="True")
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSum = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
The problem here is that my function returns multiple outputs. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
After reordering the operations, your calculation can be simplified:
BASE = df2.sum(axis=0) /3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, then change:
df1.iloc[4:].rdiv(...
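For the edited (abs/percentage) version of the calculation, one option that avoids iterrows is to slice each 3-row window as a NumPy array. This sketch reproduces the loop's numbers; window_sum is a hypothetical helper, not part of the answer above:

```python
import numpy as np
import pandas as pd

values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2, 3, 4], [3, 4, 1], [3, 6, 1]])

a1 = df1.to_numpy(dtype=float)
a2 = df2.to_numpy(dtype=float)

def window_sum(i):
    # same computation as the iterrows loop, for the 3-row window ending at row i
    div = np.abs((a2 / a1[i - 2:i + 1] - 1) * 100)
    return div.mean(axis=0).sum()

RunningSum = [window_sum(i) for i in range(4, len(a1))]
```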
I have a one-column dataframe which looks like this:
Neive Bayes
0 8.322087e-07
1 3.213342e-24
2 4.474122e-28
3 2.230054e-16
4 3.957606e-29
5 9.999992e-01
6 3.254807e-13
7 8.836033e-18
8 1.222642e-09
9 6.825381e-03
10 5.275194e-07
11 2.224289e-06
12 2.259303e-09
13 2.014053e-09
14 1.755933e-05
15 1.889681e-04
16 9.929193e-01
17 4.599619e-05
18 6.944654e-01
19 5.377576e-05
I want to pivot it to wide format but with specific intervals. The first 9 rows should make up 9 columns of the first row, and continue this pattern until the final table has 9 columns and has 9 times fewer rows than now. How would I achieve this?
Using pivot_table:
df.pivot_table(columns=df.index % 9, index=df.index // 9, values='Neive Bayes')
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
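The idea is that df.index // 9 chooses the target row and df.index % 9 the target column. A self-contained check, rebuilding the column from the values shown in the question:

```python
import pandas as pd

vals = [8.322087e-07, 3.213342e-24, 4.474122e-28, 2.230054e-16, 3.957606e-29,
        9.999992e-01, 3.254807e-13, 8.836033e-18, 1.222642e-09, 6.825381e-03,
        5.275194e-07, 2.224289e-06, 2.259303e-09, 2.014053e-09, 1.755933e-05,
        1.889681e-04, 9.929193e-01, 4.599619e-05, 6.944654e-01, 5.377576e-05]
df = pd.DataFrame({'Neive Bayes': vals})

# row i of the original lands at (i // 9, i % 9) in the wide frame
wide = df.pivot_table(columns=df.index % 9, index=df.index // 9,
                      values='Neive Bayes')
```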
Construct a MultiIndex, then set_index and unstack:
iix = pd.MultiIndex.from_arrays([np.arange(df.shape[0]) // 9,
np.arange(df.shape[0]) % 9])
df_wide = df.set_index(iix)['Neive Bayes'].unstack()
Out[204]:
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
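If the DataFrame has a default RangeIndex, another option (not from the answers above) is to pad the underlying array to a multiple of 9 and reshape it with NumPy; a sketch with stand-in data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Neive Bayes': np.arange(20, dtype=float)})  # stand-in data

vals = df['Neive Bayes'].to_numpy()
nrows = -(-len(vals) // 9)              # ceiling division: rows needed
padded = np.full(nrows * 9, np.nan)     # pad the tail with NaN
padded[:len(vals)] = vals
wide = pd.DataFrame(padded.reshape(nrows, 9))
```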
Here is a test dataframe. I want to use the relationship between EmpID and MgrID to further map the manager of MgrID in a new column.
Test_df = pd.DataFrame({'EmpID':['1','2','3','4','5','6','7','8','9','10'],
'MgrID':['4','4','4','6','8','8','10','10','10','12']})
Test_df
If I create a dictionary for the initial relationship, I will be able to create the first link of the chain, but I'm afraid I need to loop through each of the new columns to create a new one.
ID_Dict = {'1':'4',
'2':'4',
'3':'4',
'4':'6',
'5':'8',
'6':'8',
'7':'10',
'8':'10',
'9':'10',
'10':'12'}
Test_df['MgrID_L2'] = Test_df['MgrID'].map(ID_Dict)
Test_df
What is the most efficient way to do this?
Thank you!
Here's a way with a simple while loop. Note I changed the name of MgrID to MgrID_1.
Test_df = pd.DataFrame({'EmpID':['1','2','3','4','5','6','7','8','9','10'],
'MgrID_1':['4','4','4','6','8','8','10','10','10','12']})
d = Test_df.set_index('EmpID').MgrID_1.to_dict()
s = 2
while s:
    Test_df['MgrID_'+str(s)] = Test_df['MgrID_'+str(s-1)].map(d)
    if Test_df['MgrID_'+str(s)].isnull().all():
        Test_df = Test_df.drop(columns='MgrID_'+str(s))
        s = 0
    else:
        s += 1
Output: Test_df
EmpID MgrID_1 MgrID_2 MgrID_3 MgrID_4 MgrID_5
0 1 4 6 8 10 12
1 2 4 6 8 10 12
2 3 4 6 8 10 12
3 4 6 8 10 12 NaN
4 5 8 10 12 NaN NaN
5 6 8 10 12 NaN NaN
6 7 10 12 NaN NaN NaN
7 8 10 12 NaN NaN NaN
8 9 10 12 NaN NaN NaN
9 10 12 NaN NaN NaN NaN
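If the goal is just the full management chain per employee rather than aligned columns, a plain dict walk is another sketch worth considering; chain is a hypothetical helper, not part of the answer above:

```python
import pandas as pd

Test_df = pd.DataFrame({'EmpID': ['1','2','3','4','5','6','7','8','9','10'],
                        'MgrID': ['4','4','4','6','8','8','10','10','10','12']})
d = Test_df.set_index('EmpID')['MgrID'].to_dict()

def chain(emp_id, mapping):
    # follow the manager mapping until no manager is found (guard against cycles)
    out = []
    cur = mapping.get(emp_id)
    while cur is not None and cur not in out:
        out.append(cur)
        cur = mapping.get(cur)
    return out

chains = {e: chain(e, d) for e in Test_df['EmpID']}
```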
I'm playing around with the Titanic dataset, and what I'd like to do is fill in all the NaN/Null values of the Age column with the median value based on that Pclass.
Here is some data:
train
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 NaN
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 NaN
Here is what I would like to end up with:
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 35
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 35
The first thing I came up with is this - In the interest of brevity I have only included one slice for Pclass equal to 1 rather than including 2 and 3:
Pclass_1 = train['Pclass']==1
train[Pclass_1]['Age'].fillna(train[train['Pclass']==1]['Age'].median(), inplace=True)
As far as I understand, this method creates a view rather than editing train itself (I don't quite understand how this is different from a copy, or if they are analogous in terms of memory -- that is an aside I would love to hear about if possible). I particularly like this Q/A on the topic View vs Copy, How Do I Tell? but it doesn't include the insight I'm looking for.
Looking through Pandas docs I learned why you want to use .loc to avoid this pitfall. However I just can't seem to get the syntax right.
Pclass_1 = train.loc[:,['Pclass']==1]
Pclass_1.Age.fillna(train[train['Pclass']==1]['Age'].median(),inplace=True)
I'm getting lost in indices. This one ends up looking for a column named False which obviously doesn't exist. I don't know how to do this without chained indexing. train.loc[:,train['Pclass']==1] returns an exception IndexingError: Unalignable boolean Series key provided.
In this part of the line,
train.loc[:,['Pclass']==1]
the part ['Pclass'] == 1 is comparing the list ['Pclass'] to the value 1, which returns False. The .loc[] is then evaluated as .loc[:,False] which is causing the error.
I think you mean:
train.loc[train['Pclass']==1]
which selects all of the rows where Pclass is 1. This fixes the error, but it will still give you the "SettingWithCopyWarning".
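The usual way to fix a single class without chained indexing is one .loc call with the boolean mask as the row indexer and 'Age' as the column indexer. A sketch, assuming the sample data from the question:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'PassengerId': range(1, 12),
                      'Pclass': [3, 1, 3, 1, 3, 1, 1, 3, 3, 2, 1],
                      'Age': [22, 35, 26, 35, 35, np.nan, 54, 2, 27, 14, np.nan]})

# single .loc call: rows where Pclass == 1, column 'Age'
mask = train['Pclass'] == 1
train.loc[mask, 'Age'] = train.loc[mask, 'Age'].fillna(train.loc[mask, 'Age'].median())
```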
EDIT 1
(old code removed)
Here is an approach that uses groupby with transform to create a Series
containing the median Age for each Pclass. The Series is then used as the argument to fillna() to replace the missing values with the median. Using this approach will correct all passenger classes at the same time, which is what the OP originally requested. The solution comes from the answer to Python-pandas Replace NA with the median or mean of a group in dataframe
import pandas as pd
from io import StringIO
tbl = """PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1
"""
train = pd.read_table(StringIO(tbl), sep='\s+')
print('Original:\n', train)
median_age = train.groupby('Pclass')['Age'].transform('median') #median Ages for all groups
train['Age'].fillna(median_age, inplace=True)
print('\nNaNs replaced with median:\n', train)
The code produces:
Original:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 NaN
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 NaN
NaNs replaced with median:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 35.0
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 35.0
One thing to note is that this line, which uses inplace=True:
train['Age'].fillna(median_age, inplace=True)
can be replaced with assignment using .loc:
train.loc[:,'Age'] = train['Age'].fillna(median_age)