Pandas, substract columns Dataframe in loop - python

I am new with pandas. I have a Dataframe that consists in 6 columns and I would like to make a for loop that does this:
-create a new column (nc 1)
-nc1 = column 1 - column 2
and I want to iterate this for all columns, so the last one would be:
ncx = column 5- column 6
I can substract columns like this:
df['nc'] = df.Column1 - df.Column2
but this is not useful when I try to do a loop since I always have to insert the names of colums.
Can someone help me by telling me how can I refer to columns as numbers?
Thank you!

In [26]: import numpy as np
...: import random
...: import pandas as pd
...:
...: A = pd.DataFrame(np.random.randint(100, size=(5, 6)))
In [27]: A
Out[27]:
0 1 2 3 4 5
0 82 13 17 58 68 67
1 81 45 15 11 20 63
2 0 84 34 60 90 34
3 59 28 46 96 86 53
4 45 74 14 10 5 12
In [28]: for i in range(0, 5):
...: A[(i + 6)] = A[i] - A[(i + 1)]
...:
...:
...: A
...:
Out[28]:
0 1 2 3 4 5 6 7 8 9 10
0 82 13 17 58 68 67 69 -4 -41 -10 1
1 81 45 15 11 20 63 36 30 4 -9 -43
2 0 84 34 60 90 34 -84 50 -26 -30 56
3 59 28 46 96 86 53 31 -18 -50 10 33
4 45 74 14 10 5 12 -29 60 4 5 -7
In [29]: nc = 1 #The first new column
...: A[(nc + 5)] #outputs the first new column
Out[29]:
0 69
1 36
2 -84
3 31
4 -29
Here you don't need to call it by name, just by the column number, and you can just write a simple function that calls the column + 5
Something like this:
In [31]: def call_new_column(n):
...: return(A[(n + 5)])
...:
...:
...: call_new_column(2)
Out[31]:
0 -4
1 30
2 50
3 -18
4 60

Related

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of the columns only where the columns match a string (in this case, all columns with _CAP at the end of their name). And store the sum of the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, the solution doesn't work for me since they are summing up columns that have the same exact name so a simple groupby can accomplish the result whereas I am trying to sum columns with specific string matches only.
Code to recreate above sample dataset:
data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
['3', 22,12,15,42],['4', 46,45,66,54],
['5',16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns = ['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Let us do filter
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If have other CAP in front
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask where the values are True if and only if the column name ends with CAP. As an alternative use filter, with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
if i.find('_CAP') != -1:
df['sum'] = df['sum'] + df[i]
else:
pass

Shift pandas dataframe down in a cyclical manner

If we have the following data:
X = pd.DataFrame({"t":[1,2,3,4,5],"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
X
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
How can I shift the data in a cyclical fashion so that the next step is:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
And then:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
etc.
This should also shift the index values with the row.
I know of pandas X.shift(), but it wasn't making the cyclical thing.
You can combine reindex with np.roll:
X = X.reindex(np.roll(X.index, 1))
Another option is to combine concat with iloc:
shift = 1
X = pd.concat([X.iloc[-shift:], X.iloc[:-shift]])
The resulting output:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
Timings
Using the following setup to produce a larger DataFrame and functions for timing:
df = pd.concat([X]*10**5, ignore_index=True)
def root1(df, shift):
return df.reindex(np.roll(df.index, shift))
def root2(df, shift):
return pd.concat([df.iloc[-shift:], df.iloc[:-shift]])
def ed_chum(df, num):
return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)
def divakar1(df, shift):
return df.iloc[np.roll(np.arange(df.shape[0]), shift)]
def divakar2(df, shift):
idx = np.mod(np.arange(df.shape[0])-1,df.shape[0])
for _ in range(shift):
df = df.iloc[idx]
return df
I get the following timings:
%timeit root1(df.copy(), 25)
10 loops, best of 3: 61.3 ms per loop
%timeit root2(df.copy(), 25)
10 loops, best of 3: 26.4 ms per loop
%timeit ed_chum(df.copy(), 25)
10 loops, best of 3: 28.3 ms per loop
%timeit divakar1(df.copy(), 25)
10 loops, best of 3: 177 ms per loop
%timeit divakar2(df.copy(), 25)
1 loop, best of 3: 4.18 s per loop
You can use np.roll in a custom func:
In [83]:
def roll(df, num):
return pd.DataFrame(np.roll(df,num,axis=0), np.roll(df.index, num), columns=df.columns)
​
roll(X,1)
Out[83]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [84]:
roll(X,2)
Out[84]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
Here we return a df using the rolled df array, with the index rolled also
You can use numpy.roll :
import numpy as np
nb_iterations = 3 # number of steps you want
for i in range(nb_iterations):
for col in X.columns :
df[col] = numpy.roll(df[col], 1)
Which is equivalent to :
for col in X.columns :
df[col] = numpy.roll(df[col], nb_iterations)
Here is a link to the documentation of this useful function.
One approach would be creating such an shifted-down indexing array once and re-using it over and over to index into rows with .iloc, like so -
idx = np.mod(np.arange(X.shape[0])-1,X.shape[0])
X = X.iloc[idx]
Another way to create idx would be with np.roll : np.roll(np.arange(X.shape[0]),1).
Sample run -
In [113]: X # Starting version
Out[113]:
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
In [114]: idx = np.mod(np.arange(X.shape[0])-1,X.shape[0]) # Creating once
In [115]: X = X.iloc[idx] # Using idx
In [116]: X
Out[116]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [117]: X = X.iloc[idx] # Re-using idx
In [118]: X
Out[118]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3 ## and so on

Retriving all the rows from a csv file and plotting

I need to retrieve the rows from a csv file generated from the function:
def your_func(row):
return (row['x-momentum']**2+ row['y-momentum']**2 + row['z-momentum']**2)**0.5 / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'z-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['mean_velocity'] = dataframe.apply(your_func, axis=1)
print dataframe
I got rows up until 29s then it skipped to the last few lines, also I need to plot this column 2 against 1
you can adjust pd.options.display.max_rows option, but it won't affect your plots, so your plots will contain all your data
demo:
In [25]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
In [26]: df
Out[26]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
In [27]: pd.options.display.max_rows = 4
Now it'll display 4 rows at most
In [36]: df
Out[36]:
A B C
0 93 76 5
1 33 70 12
.. .. .. ..
8 47 90 33
9 44 30 26
[10 rows x 3 columns]
but it'll plot all your data
In [37]: df.plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x49e2d68>
In [38]: pd.options.display.max_rows = 60
In [39]: df
Out[39]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26

pandas - get a fraction of each multi-index level label rows

I have an XY problem. My setup is as follows - I have a dataframe with multi-index of 2 levels. I want to split it to two dataframes, taking only a fraction of rows from each label in the first level. For example:
df = pd.DataFrame({'a':[1, 1, 1, 1, 7, 7, 10, 10, 10, 10, 10, 10, 10], 'b': np.random.randint(0, 100, 13), 'c':np.random.randint(0, 100, 13)}).set_index(['a', 'b'])
df
Out[13]:
c
a b
1 86 83
1 37
57 64
53 5
7 4 66
13 49
10 61 0
32 84
97 59
69 98
25 52
17 31
37 95
So let's say the fraction is 0.5, I want to split it to two dataframes:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
69 98
c
a b
1 57 64
53 5
7 13 49
10 25 52
17 31
37 95
I thought about doing (df.groupby(level = 0).count() * 0.5).astype(int) to get the limit on which to "slice" the dataframe. Then, if only I had a way to add a running distance such as this:
c r
a b
1 38 36 0
6 47 1
57 6 2
55 45 3
7 7 51 0
90 96 1
10 59 75 0
27 16 1
58 7 2
79 51 3
58 77 4
63 48 5
87 60 6
I could join the limits and this df and filter with a boolean condition. Any suggestions on either problem? (splitting a fraction of rows or adding a level-aware running index)
This turns out to be pretty trivial with groupby:
In [36]: df.groupby(level=0).apply(lambda x:x.head(int(x.shape[0] * 0.5))).reset_index(level=0, drop=True)
Out[36]:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
Also getting the running index per group:
In [33]: df.groupby(level=0).cumcount()
Out[33]:
a b
1 38 0
6 1
57 2
55 3
7 7 0
90 1
10 59 0
27 1
58 2
79 3
58 4
63 5
87 6

Fastest way to sort each row in a pandas dataframe

I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go cython?
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
[9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
A B C D
0 8 4 3 1
1 9 7 2 2
I had thought this might work, but it sorts the columns:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
D C B A
0 1 8 4 3
1 2 7 2 9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
To Add to the answer given by #Andy-Hayden, to do this inplace to the whole frame... not really sure why this works, but it does. There seems to be no control on the order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [101]: A.values.sort()
In [102]: A
Out[102]:
one two three four five
0 22 46 49 63 72
1 25 30 33 43 69
2 21 24 39 56 93
3 3 11 52 57 74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
five four three two one
0 72 63 49 46 22
1 69 43 33 30 25
2 93 56 39 24 21
3 74 57 52 11 3
I hope someone can explain the why of this, just happy that it works 8)
You could use pd.apply.
Eg:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
one two three four five
0 2 75 44 53 46
1 18 51 73 80 66
2 35 91 86 44 25
3 60 97 57 33 79
A = A.apply(np.sort, axis = 1)
print(A)
one two three four five
0 2 44 46 53 75
1 18 51 66 73 80
2 25 35 44 86 91
3 33 57 60 79 97
Since you want it in descending order, you can simply multiply the dataframe with -1 and sort it.
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis = 1)
A = A * -1
Instead of using pd.DataFrame constructor, an easier way to assign the sorted values back is to use double brackets:
original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
A B C D
0 8 4 3 1
1 9 7 2 2
This way you can also sort a part of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
A B C D
0 3 8 4 1
1 9 7 2 2
One could try this approach to preserve the integrity of the df:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
print(type(A))
one two three four five
0 85 27 64 50 55
1 3 90 65 22 8
2 0 7 64 66 82
3 58 21 42 27 30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
one two three four five
0 27 50 55 64 85
1 3 8 22 65 90
2 0 7 64 66 82
3 21 27 30 42 58
<class 'pandas.core.frame.DataFrame'>

Categories

Resources