I have a dataframe like the one created below. The goal of this program is to replace certain values with the immediately preceding value.
import pandas as pd
test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns = ['A'])
If one wants to replace every 1 with the previous value, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li-1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace the 1s with np.nan, then ffill:
import numpy as np

test.replace(1, np.nan).ffill().astype(int)
Out[881]:
A
0 2
1 2
2 3
3 3
4 3
5 2
6 4
7 6
8 43
9 23
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 4
18 5
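An equivalent alternative, as a minimal sketch, is to mask the 1s instead of replacing them; note that with either approach a run of consecutive 1s is filled with the last preceding non-1 value:

import pandas as pd
import numpy as np

test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns=['A'])
# mask() turns values matching the condition into NaN, then ffill
# propagates the last valid value forward
out = test['A'].mask(test['A'].eq(1)).ffill().astype(int)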
Related
I have a data frame like this:

Index  Time      Id
0      10:10:00  11
1      10:10:01  12
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
8      10:10:12  11
9      10:10:14  13
I want to compare the Id column for each pair of rows: row 0 with row 1, row 2 with row 3, and so on.
In other words, I want to compare even rows with odd rows and keep the pairs whose Ids match.
My ideal output would be:

Index  Time      Id
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
I tried this, but it did not work:
df = df[df[::2]["id"] == df[1::2]["id"]]
You can use a GroupBy.transform approach:
import numpy as np

# for each pair, is there only one kind of Id?
out = df[df.groupby(np.arange(len(df))//2)['Id'].transform('nunique').eq(1)]
Or, more efficiently, using the underlying numpy array:
# convert to numpy
a = df['Id'].to_numpy()
# are the odds equal to evens?
out = df[np.repeat((a[::2]==a[1::2]), 2)]
output:
Index Time Id
2 2 10:10:02 12
3 3 10:10:04 12
4 4 10:10:06 13
5 5 10:10:07 13
6 6 10:10:08 11
7 7 10:10:10 11
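One caveat, as a sketch: np.repeat(..., 2) assumes an even number of rows. If the frame can have an odd length, the last row has no partner, so (assuming it should simply be dropped) trim it before comparing:

import numpy as np

# keep only complete pairs, then compare evens with odds
n = len(df) - (len(df) % 2)
a = df['Id'].to_numpy()[:n]
out = df.iloc[:n][np.repeat(a[::2] == a[1::2], 2)]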
I'm trying to create a rolling function that:
Divides two DataFrames, each with 3 columns.
Calculates the mean of each row of the output from step 1.
Sums the averages from step 2.
This could be done with DataFrame.iterrows(), i.e., by looping through each row, but that would be inefficient on larger datasets. Therefore, my objective is to build this as a pd.rolling function that runs much faster.
What I need help with is understanding why my approach below returns multiple values, while the function I'm using returns only a single value.
EDIT: I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output is to loop through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be time-consuming when working with large datasets. Therefore, I've tried to create a function that can be applied to a pd.rolling() object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop=True)
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSum = df1.rolling(window=3, axis=0).apply(SumOfAverageFunction)
The problem here is that my function returns multiple outputs. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
After reordering the operations, your calculation can be simplified. (Note that rolling(...).apply() calls your function once per column, passing each column's window separately; that is why you get one output column per input column instead of a single value.)
# column means of df2's three rows
BASE = df2.sum(axis=0) / 3
# re-key those means with df1's column names
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
# divide BASE by df1 element-wise, then sum across columns
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, then change it to:
df1.iloc[4:].rdiv(...
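Spelled out with the same arguments as above, that would presumably be:

result = df1.iloc[4:].rdiv(BASE_series, axis=1).sum(axis=1)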
I have a dataframe that currently looks somewhat like this.
import pandas as pd
In [161]: pd.DataFrame(np.c_[s,t],columns = ["M1","M2","M1","M2"])
Out[161]:
M1 M2 M1 M2
6/7 1 2 3 5
6/8 2 4 7 8
6/9 3 6 9 9
6/10 4 8 8 10
6/11 5 10 20 40
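(s and t are not shown in the question; a hypothetical reconstruction that reproduces the table above would be:)

import numpy as np
import pandas as pd

# hypothetical arrays chosen to match the values printed above
s = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])
t = np.array([[3, 5], [7, 8], [9, 9], [8, 10], [20, 40]])
df = pd.DataFrame(np.c_[s, t], columns=["M1", "M2", "M1", "M2"],
                  index=["6/7", "6/8", "6/9", "6/10", "6/11"])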
Except, instead of just four columns, there are approximately 1000 columns, from M1 up to ~M340, so multiple columns share the same header. I want to sum the values of all columns with matching headers. Ideally, the result dataframe would look like:
M1_sum M2_sum
6/7 4 7
6/8 9 12
6/9 12 15
6/10 12 18
6/11 25 50
I wanted to somehow apply the "groupby" and "sum" functions, but was unsure how to do that when some headers have three matching duplicates while others have only one matching column, or even none at all.
You probably want to group by the first level of the column axis (axis=1) and then take the .sum(), like:
>>> df.groupby(level=0,axis=1).sum().add_suffix('_sum')
M1_sum M2_sum
0 4 7
1 9 12
2 12 15
3 12 18
4 25 50
If we rename the last column to M1 instead, it will again group this correctly:
>>> df
M1 M2 M1 M1
0 1 2 3 5
1 2 4 7 8
2 3 6 9 9
3 4 8 8 10
4 5 10 20 40
>>> df.groupby(level=0,axis=1).sum().add_suffix('_sum')
M1_sum M2_sum
0 9 2
1 17 4
2 21 6
3 22 8
4 65 10
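On recent pandas versions (2.x, where axis=1 in groupby is deprecated), an equivalent sketch is to transpose, group on the index, and transpose back:

# same result without the deprecated axis=1 groupby
out = df.T.groupby(level=0).sum().T.add_suffix('_sum')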
I'm playing around with the Titanic dataset, and what I'd like to do is fill in all the NaN/null values of the Age column with the median Age of that row's Pclass.
Here is some data:
train
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 NaN
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 NaN
Here is what I would like to end up with:
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 35
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 35
The first thing I came up with is this (in the interest of brevity, I have only included the slice for Pclass equal to 1, rather than 2 and 3 as well):
Pclass_1 = train['Pclass']==1
train[Pclass_1]['Age'].fillna(train[train['Pclass']==1]['Age'].median(), inplace=True)
As far as I understand, this method creates a view rather than editing train itself (I don't quite understand how a view differs from a copy, or whether they are analogous in terms of memory; that is an aside I would love to hear about if possible). I particularly like this Q&A on the topic, View vs Copy, How Do I Tell?, but it doesn't include the insight I'm looking for.
Looking through the Pandas docs, I learned why you want to use .loc to avoid this pitfall. However, I just can't seem to get the syntax right:
Pclass_1 = train.loc[:,['Pclass']==1]
Pclass_1.Age.fillna(train[train['Pclass']==1]['Age'].median(),inplace=True)
I'm getting lost in the indexing. This attempt ends up looking for a column named False, which obviously doesn't exist. I don't know how to do this without chained indexing, and train.loc[:,train['Pclass']==1] raises IndexingError: Unalignable boolean Series key provided.
In this part of the line,
train.loc[:,['Pclass']==1]
the expression ['Pclass'] == 1 compares the list ['Pclass'] to the value 1, which returns False. The .loc[] is then evaluated as .loc[:, False], which causes the error.
I think you mean:
train.loc[train['Pclass']==1]
which selects all of the rows where Pclass is 1. This fixes the error, but it will still give you the "SettingWithCopyWarning".
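For the single-class case, a minimal sketch that selects and assigns through .loc in a single statement sidesteps the warning:

# build a boolean mask for the class, then read and write via one .loc call
mask = train['Pclass'] == 1
train.loc[mask, 'Age'] = train.loc[mask, 'Age'].fillna(train.loc[mask, 'Age'].median())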
EDIT 1
(old code removed)
Here is an approach that uses groupby with transform to create a Series containing the median Age for each Pclass. The Series is then used as the argument to fillna() to replace the missing values with the median. This approach corrects all passenger classes at the same time, which is what the OP originally requested. The solution comes from the answer to Python-pandas Replace NA with the median or mean of a group in dataframe
import pandas as pd
from io import StringIO
tbl = """PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1
"""
train = pd.read_table(StringIO(tbl), sep=r'\s+')
print('Original:\n', train)
median_age = train.groupby('Pclass')['Age'].transform('median') #median Ages for all groups
train['Age'].fillna(median_age, inplace=True)
print('\nNaNs replaced with median:\n', train)
The code produces:
Original:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 NaN
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 NaN
NaNs replaced with median:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 35.0
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 35.0
One thing to note is that this line, which uses inplace=True:
train['Age'].fillna(median_age, inplace=True)
can be replaced with assignment using .loc:
train.loc[:,'Age'] = train['Age'].fillna(median_age)
Pandas 0.12.0
In the DataFrame below, why, for example, does it jumble the indices? Look at the 4s: the indices go 1, 15, 6, 7. What reasoning does pandas use to decide the order? I would have expected the indices to remain sequential for equal values.
import numpy as np
import pandas as pd

mydf = pd.DataFrame(np.random.randint(1, 6, 20), columns=["stars"])
mydf.sort(['stars'], ascending=False)
stars
19 5
14 5
1 4
15 4
6 4
7 4
4 3
12 3
18 3
8 2
2 2
9 2
10 2
11 2
13 2
16 2
5 1
3 1
17 1
0 1
Actually, if you look into the source code of the pandas DataFrame, you'll see that sort() is just a wrapper around sort_index() with different parameters, and, as @Jeff said in this question, sort_index() is the preferred method to use.
When sorting by a single column, sort_index() uses numpy.argsort() with the default kind='quicksort'. Quicksort is not stable, which is why your index looks shuffled.
But you can pass the kind parameter to sort_index() (one of 'mergesort', 'quicksort', 'heapsort'), so you can use the stable sort ('mergesort') for your task:
>>> mydf.sort_index(by=['stars'], ascending=False, kind='mergesort')
stars
17 5
11 5
6 5
1 5
19 4
18 4
15 4
14 4
7 4
5 4
2 4
10 3
8 3
4 3
16 2
12 2
9 2
3 2
13 1
0 1
sort_index() also uses mergesort (or counting sort) if there is more than one column in the by parameter. Interestingly, that means you can do this:
>>> mydf.sort_index(by=['stars', 'stars'], ascending=False)
stars
1 5
6 5
11 5
17 5
2 4
5 4
7 4
14 4
15 4
18 4
19 4
4 3
8 3
10 3
3 2
9 2
12 2
16 2
0 1
13 1
Now the sort is stable, but the indices within each group of equal values are sorted in ascending order.
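For readers on modern pandas (where sort() and the by= form of sort_index() have been removed), the stable-sort equivalent is sort_values with kind='mergesort':

# stable descending sort; ties keep their original row order
mydf.sort_values('stars', ascending=False, kind='mergesort')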
Pandas is using numpy's quicksort. Quicksort swaps the positions of items and stops once they are in the requested order, which in this case does not involve checking the indices, because you didn't ask for that column to be considered. Quicksort is much more efficient than a naive sort algorithm such as bubble sort, which might be what you have in mind; a naive sort would leave the individual numbers closer to their original order, but would require more steps to do so.