Pandas KeyError - python

I have some very simple code:
import pandas as pd

stats2 = {'a': [1, 2, 3, 4, 5, 6],
          'b': [43, 34, 65, 56, 29, 76],
          'c': [65, 67, 78, 65, 45, 52],
          'cac': ['mns', 'ab', 'cd', 'cd', 'ab', 'k']}
f2 = pd.DataFrame(stats2)
f2.set_index(['cac'], inplace=True)
print(f2.ix['mns'])
print(f2['mns'])
f2.ix['mns'] works just fine. However, f2['mns'] raises a KeyError. I am trying to understand why it does that. Is that how pandas works? Do I have to use ix even though I have already set the index?

This is your original dataframe:
>>> df
a b c cac
0 1 43 65 mns
1 2 34 67 ab
2 3 65 78 cd
3 4 56 65 cd
4 5 29 45 ab
5 6 76 52 k
>>> df.set_index(['cac'], inplace=True)
>>> df
a b c
cac
mns 1 43 65
ab 2 34 67
cd 3 65 78
cd 4 56 65
ab 5 29 45
k 6 76 52
So, setting the index in pandas simply replaces the default integer labels (0, 1, 2, ..., 5) with the row values (mns, ab, ..., k) of the cac column.
>>> df.ix['mns']
a 1
b 43
c 65
This command specifically searches for the row in the index column, cac, whose value is equal to mns and retrieves its corresponding elements.
Note: plain bracket indexing on a DataFrame selects columns, and since mns is not a column name of the dataframe, df['mns'] raises a KeyError.
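A caveat worth adding here, beyond the original answer: .ix was deprecated in pandas 0.20 and removed in 1.0, and .loc is the label-based replacement:
f2.loc['mns']  # label-based row lookup, the modern equivalent of f2.ix['mns']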

Related

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried the solution recommended in this question:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that share the exact same name, so a simple groupby can accomplish the result, whereas I am trying to sum columns that match a specific string only.
Code to recreate above sample dataset:
data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us use filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other column names contain CAP elsewhere (e.g. at the front), anchor the match to the end with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
    else:
        pass

Looping a function with Pandas DataFrames

I have some function that takes a DataFrame and an integer as arguments:
func(df, int)
The function returns a new DataFrame, e.g.:
df2 = func(df,2)
I'd like to write a loop for integers 2-10, resulting in 9 DataFrames. If I do this manually it would look like this:
df2 = func(df,2)
df3 = func(df2,3)
df4 = func(df3,4)
df5 = func(df4,5)
df6 = func(df5,6)
df7 = func(df6,7)
df8 = func(df7,8)
df9 = func(df8,9)
df10 = func(df9,10)
Is there a way to write a loop that does this?
This type of thing is what lists are for.
data_frames = [df]
for i in range(2, 11):
    data_frames.append(func(data_frames[-1], i))
It's a sign of brittle code when you see variable names like df1, df2, df3, etc. Use lists when you have a sequence of related objects to build.
To clarify, this data_frames is a list of DataFrames that can be concatenated with data_frames = pd.concat(data_frames, sort=False), resulting in one DataFrame that combines the original df with everything that results from the loop, correct?
Yup, that's right. If your goal is one final data frame, you can concatenate the entire list at the end to combine the information into a single frame.
Do you mind explaining why data_frames[-1], which takes the last item of the list, returns a DataFrame? Not clear on this.
Because as you're building the list, at all times each entry is a data frame. data_frames[-1] evaluates to the last element in the list, which in this case, is the data frame you most recently appended.
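A side note beyond the original answers: if you prefer named access to each intermediate result instead of positional access, a dict keyed by the integer works the same way (a minimal sketch using the question's own func):
frames = {1: df}  # key 1 holds the starting frame
for i in range(2, 11):
    frames[i] = func(frames[i - 1], i)
# frames[10] is the final result, analogous to df10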
You may try using itertools.accumulate as follows:
sample data
df:
a b c
0 75 18 17
1 48 56 3
import itertools
def func(x, y):
    return x + y
dfs = list(itertools.accumulate([df] + list(range(2, 11)), func))
[ a b c
0 75 18 17
1 48 56 3, a b c
0 77 20 19
1 50 58 5, a b c
0 80 23 22
1 53 61 8, a b c
0 84 27 26
1 57 65 12, a b c
0 89 32 31
1 62 70 17, a b c
0 95 38 37
1 68 76 23, a b c
0 102 45 44
1 75 83 30, a b c
0 110 53 52
1 83 91 38, a b c
0 119 62 61
1 92 100 47, a b c
0 129 72 71
1 102 110 57]
dfs is the list of result dataframes, where each one is the previous result with the next integer from 2 to 10 added.
If you want to concat them all into one dataframe, use pd.concat:
pd.concat(dfs)
Out[29]:
a b c
0 75 18 17
1 48 56 3
0 77 20 19
1 50 58 5
0 80 23 22
1 53 61 8
0 84 27 26
1 57 65 12
0 89 32 31
1 62 70 17
0 95 38 37
1 68 76 23
0 102 45 44
1 75 83 30
0 110 53 52
1 83 91 38
0 119 62 61
1 92 100 47
0 129 72 71
1 102 110 57
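If the repeated 0/1 row labels in the concatenated result are unwanted, pd.concat can renumber them (a minor variation):
pd.concat(dfs, ignore_index=True)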
You can use exec with a formatted string:
for i in range(2, 11):
    exec("df{0} = func(df{1}, {0})".format(i, i - 1 if i > 2 else ''))

Getting aggregate from Pandas for other categories

I am trying to perform an aggregate calculation, but I want the calculation to apply to every other category.
So,
df.groupby(['index']).agg({'data': [func1, func2]})
will perform the aggregate calculations func1 and func2 on data grouped by index, but I want to perform the calculations on all the data that isn't in each group.
For example:
index data
A 1
A 2
A 1
B 2
B 2
B 4
B 4
C 1
C 3
D 4
D 1
I would want the results for A to be performed on the data in B,C,D.
Is there a novel way to accomplish this?
Well, I actually think I figured this out. Basically, I created a new dataframe and re-indexed it.
value
original_index
A 44
A 65
A 88
B 69
B 11
B 52
C 56
C 42
C 85
D 66
D 77
D 9
Loop through each index value and copy everything not in that index to a new dataframe, then concat them all together.
l = []
for i in df.index.unique():
    d = df[~df.index.isin([i])].copy()
    d['new_index'] = i
    d.set_index('new_index', inplace=True)  # replaces original_index, so no separate drop is needed
    l.append(d)
df2 = pd.concat(l, axis=0)
Output:
value
new_index
A 69
A 11
A 52
A 56
A 42
A 85
A 66
A 77
A 9
B 44
B 65
B 88
B 56
B 42
B 85
B 66
B 77
B 9
C 44
C 65
C 88
C 69
C 11
C 52
C 66
C 77
C 9
D 44
D 65
D 88
D 69
D 11
D 52
D 56
D 42
D 85
Now we can apply our groupby function on the new index and it will return results from values that were originally not in the index.
group_df = df2.groupby(['new_index']).agg({'value': [func1, func2]})[['value']]
It works, but I'm sure there must be a better way.
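One possible shortcut, not from the original post: when the aggregation is additive (e.g. sum or count), each group's complement can be computed from the overall total minus that group's total, avoiding the row duplication entirely. A minimal sketch using the question's sample data:
import pandas as pd

df = pd.DataFrame({'index': list('AAABBBBCCDD'),
                   'data': [1, 2, 1, 2, 2, 4, 4, 1, 3, 4, 1]})

# for additive aggregations, complement = overall total - group total
total = df['data'].sum()
group_sums = df.groupby('index')['data'].sum()
complement_sums = total - group_sums  # sum of data over every *other* group
For non-additive functions this trick does not apply, and something like the concat approach above is still needed.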

Fastest way to sort each row in a pandas dataframe

I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go to Cython?
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
[9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
A B C D
0 8 4 3 1
1 9 7 2 2
I had thought this might work, but it sorts the columns:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
D C B A
0 1 8 4 3
1 2 7 2 9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
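A side note beyond the original answer: DataFrame.sort was later removed from pandas, and its replacement sort_values cannot sort each row independently either, so the NumPy round trip above remains the usual approach. A compact sketch, assuming df is the frame from the question:
import numpy as np
import pandas as pd

out = pd.DataFrame(np.sort(df.values, axis=1)[:, ::-1],
                   index=df.index, columns=df.columns)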
To add to the answer given by @Andy-Hayden, here is how to do this in place to the whole frame... not really sure why this works, but it does. There seems to be no control over the order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [101]: A.values.sort()
In [102]: A
Out[102]:
one two three four five
0 22 46 49 63 72
1 25 30 33 43 69
2 21 24 39 56 93
3 3 11 52 57 74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
five four three two one
0 72 63 49 46 22
1 69 43 33 30 25
2 93 56 39 24 21
3 74 57 52 11 3
I hope someone can explain the why of this, just happy that it works 8)
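A likely explanation, offered as an addition rather than part of the original answer: on a single-dtype frame, A.values typically returns a view of the underlying NumPy block rather than a copy (an internal detail, not guaranteed in newer pandas versions with copy-on-write), so calling sort() on it mutates the frame's data in place. A.values.sort without parentheses merely references the method, which is why In [99] changed nothing. A minimal sketch of the mechanism:
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(0, 100, (4, 5)))
v = A.values     # typically a view on a homogeneous-dtype frame, not a copy
v.sort(axis=1)   # ndarray.sort works in place, so A's data changes too
print(A)         # each row is now sorted ascending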
You could use DataFrame.apply.
E.g.:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
one two three four five
0 2 75 44 53 46
1 18 51 73 80 66
2 35 91 86 44 25
3 60 97 57 33 79
A = A.apply(np.sort, axis = 1)
print(A)
one two three four five
0 2 44 46 53 75
1 18 51 66 73 80
2 25 35 44 86 91
3 33 57 60 79 97
Since you want it in descending order, you can simply multiply the dataframe by -1 and sort it:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis = 1)
A = A * -1
Instead of using the pd.DataFrame constructor, an easier way to assign the sorted values back is to use double brackets:
original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
A B C D
0 8 4 3 1
1 9 7 2 2
This way you can also sort a part of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
A B C D
0 3 8 4 1
1 9 7 2 2
One could try this approach to preserve the integrity of the df:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
print(type(A))
one two three four five
0 85 27 64 50 55
1 3 90 65 22 8
2 0 7 64 66 82
3 58 21 42 27 30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
one two three four five
0 27 50 55 64 85
1 3 8 22 65 90
2 0 7 64 66 82
3 21 27 30 42 58
<class 'pandas.core.frame.DataFrame'>

Python Pandas Sort DataFrame by Duplicate Rows

What is the nicest way to see which rows in a DataFrame are duplicated, with the duplicate rows sorted and stacked on top of each other? I know I can filter for duplicates with df.duplicated() or something like df[df.duplicated()==True], but I need to produce a dataframe containing the duplicates and then sort it so that both records appear together. I also do not need to use a col subset argument for this. Thank you.
One idea is to sort by all columns. Not sure how efficient that is though.
In [20]: df = pd.DataFrame(np.random.randint(100, size=(3, 3)), columns=list('ABC'))
In [21]: df = df.append(df, ignore_index=True)
In [22]: df
Out[22]:
A B C
0 23 71 65
1 63 0 47
2 47 13 44
3 23 71 65
4 63 0 47
5 47 13 44
In [23]: df.sort(df.columns.tolist())
Out[23]:
A B C
0 23 71 65
3 23 71 65
2 47 13 44
5 47 13 44
1 63 0 47
4 63 0 47
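A note beyond the original answer: DataFrame.sort was later removed in favor of sort_values, and duplicated(keep=False) marks every copy of a duplicated row, so the same idea can be sketched in modern pandas as:
dupes = df[df.duplicated(keep=False)]           # keep all copies of each duplicated row
dupes = dupes.sort_values(by=list(df.columns))  # stack the duplicates next to each other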
