If we have the following data:
X = pd.DataFrame({"t":[1,2,3,4,5],"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
X
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
How can I shift the data in a cyclical fashion so that the next step is:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
And then:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
etc.
This should also shift the index values with the row.
I know about pandas' X.shift(), but it doesn't wrap the rows around cyclically.
You can combine reindex with np.roll:
import numpy as np

X = X.reindex(np.roll(X.index, 1))
Another option is to combine concat with iloc:
shift = 1
X = pd.concat([X.iloc[-shift:], X.iloc[:-shift]])
The resulting output:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
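For the repeated steps shown in the question, either approach can be wrapped in a small helper; a minimal sketch (cyclic_shift is a hypothetical name, not from either answer):

import numpy as np
import pandas as pd

def cyclic_shift(df, shift=1):
    # rotate the rows (labels included) down by `shift`, wrapping around the end
    return df.reindex(np.roll(df.index, shift))

X = cyclic_shift(X)  # one step; call again for the next step, and so on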
Timings
Using the following setup to produce a larger DataFrame and functions for timing:
df = pd.concat([X]*10**5, ignore_index=True)
def root1(df, shift):
    return df.reindex(np.roll(df.index, shift))

def root2(df, shift):
    return pd.concat([df.iloc[-shift:], df.iloc[:-shift]])

def ed_chum(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)

def divakar1(df, shift):
    return df.iloc[np.roll(np.arange(df.shape[0]), shift)]

def divakar2(df, shift):
    idx = np.mod(np.arange(df.shape[0]) - 1, df.shape[0])
    for _ in range(shift):
        df = df.iloc[idx]
    return df
I get the following timings:
%timeit root1(df.copy(), 25)
10 loops, best of 3: 61.3 ms per loop
%timeit root2(df.copy(), 25)
10 loops, best of 3: 26.4 ms per loop
%timeit ed_chum(df.copy(), 25)
10 loops, best of 3: 28.3 ms per loop
%timeit divakar1(df.copy(), 25)
10 loops, best of 3: 177 ms per loop
%timeit divakar2(df.copy(), 25)
1 loop, best of 3: 4.18 s per loop
You can use np.roll in a custom func:
In [83]:
def roll(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)
roll(X,1)
Out[83]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [84]:
roll(X,2)
Out[84]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
Here we return a DataFrame built from the rolled values array, with the index rolled along with the rows.
You can use numpy.roll:
import numpy as np

nb_iterations = 3  # number of steps you want
for i in range(nb_iterations):
    for col in X.columns:
        X[col] = np.roll(X[col], 1)
Which is equivalent to:
for col in X.columns:
    X[col] = np.roll(X[col], nb_iterations)
See the numpy.roll documentation for the details of this useful function.
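For reference, a minimal illustration of what np.roll does to a 1-D array:

import numpy as np

a = np.arange(5)   # array([0, 1, 2, 3, 4])
np.roll(a, 1)      # array([4, 0, 1, 2, 3]) - the last element wraps to the front
np.roll(a, 2)      # array([3, 4, 0, 1, 2])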
One approach would be creating such a shifted-down indexing array once and re-using it over and over to index into rows with .iloc, like so -
idx = np.mod(np.arange(X.shape[0])-1,X.shape[0])
X = X.iloc[idx]
Another way to create idx would be with np.roll : np.roll(np.arange(X.shape[0]),1).
Sample run -
In [113]: X # Starting version
Out[113]:
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
In [114]: idx = np.mod(np.arange(X.shape[0])-1,X.shape[0]) # Creating once
In [115]: X = X.iloc[idx] # Using idx
In [116]: X
Out[116]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [117]: X = X.iloc[idx] # Re-using idx
In [118]: X
Out[118]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3 ## and so on
Related
I have a dataframe and I want to group by its "First" and "Second" columns and then produce the expected output mentioned below:
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
print(df)
Output:
First Second Value_1 Value_2
0 a q 17 70
1 b e 44 47
2 c e 5 56
3 a e 23 58
4 b e 10 76
5 a q 11 67
6 b q 21 84
7 c q 42 67
8 b e 36 53
9 c q 16 63
When I group this DataFrame using groupby, I get the output below:
def func(arr, columns):
    return arr.sort_values(by=columns).drop(columns, axis=1)

df.groupby(['First','Second']).apply(func, columns=['First','Second'])
Value_1 Value_2
First Second
a e 3 23 58
q 0 17 70
5 11 67
b e 1 44 47
4 10 76
8 36 53
q 6 21 84
c e 2 5 56
q 7 42 67
9 16 63
However, I want the output below:
Expected output:
Value_1 Value_2
First Second
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
It's not necessary to print the "All" label as such; the point is to append the sum of all rows in each group.
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
First Second Value_1 Value_2
0 a q 4 69
1 b e 20 74
2 c e 13 82
3 a e 9 41
4 b e 11 79
5 a q 32 77
6 b q 6 75
7 c q 39 62
8 b e 26 80
9 c q 26 42
def lambda_t(x):
    df = x.sort_values(['First','Second']).drop(['First','Second'], axis=1)
    df.loc['all'] = df.sum()
    return df
df.groupby(['First','Second']).apply(lambda_t)
Value_1 Value_2
First Second
a e 3 9 41
all 9 41
q 0 4 69
5 32 77
all 36 146
b e 1 20 74
4 11 79
8 26 80
all 57 233
q 6 6 75
all 6 75
c e 2 13 82
all 13 82
q 7 39 62
9 26 42
all 65 104
You can try this:
Reset the index in your groupby result:
d1 = df.groupby(['First','Second']).apply(func, columns = ['First','Second']).reset_index()
Then group by 'First' and 'Second' and sum the value columns:
d2 = d1.groupby(['First', 'Second']).sum().reset_index()
Create the 'level_2' column in the new dataframe and concatenate with the initial one to get the desired result:
d2.loc[:,'level_2'] = 'All'
pd.concat([d1, d2]).sort_values(by = ['First', 'Second'])
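Putting those pieces together, a sketch (it assumes the func defined in the question is in scope; the stable sort and the final set_index are extra steps to keep the 'All' rows last within each group and match the desired display):

d1 = df.groupby(['First', 'Second']).apply(func, columns=['First', 'Second']).reset_index()
d2 = d1.groupby(['First', 'Second']).sum().reset_index()
d2.loc[:, 'level_2'] = 'All'

out = (pd.concat([d1, d2])
       .sort_values(by=['First', 'Second'], kind='stable')  # stable keeps 'All' rows after their group
       .set_index(['First', 'Second', 'level_2']))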
Not sure about your function; however, you could break the problem into two steps:
Create an indexed dataframe, where you append the First and Second columns to the existing index:
df.index = df.index.astype(str).rename("Total")
indexed = df.set_index(["First", "Second"], append=True).reorder_levels(
    ["First", "Second", "Total"]
)
indexed
Value_1 Value_2
First Second Total
a q 0 17 70
b e 1 44 47
c e 2 5 56
a e 3 23 58
b e 4 10 76
a q 5 11 67
b q 6 21 84
c q 7 42 67
b e 8 36 53
c q 9 16 63
Create an aggregation, grouped by First and Second:
summary = (
    df.groupby(["First", "Second"])
    .sum()
    .assign(Total="All")
    .set_index("Total", append=True)
)
summary
Value_1 Value_2
First Second Total
a e All 23 58
q All 28 137
b e All 90 176
q All 21 84
c e All 5 56
q All 58 130
Combine indexed and summary dataframes:
pd.concat([indexed, summary]).sort_index(level=["First", "Second"])
Value_1 Value_2
First Second Total
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means .append can't be used.
You can use concat with sliced df by loc:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#inserted between 4 and 5 index values
print (pd.concat([df1.loc[:4], df2, df1.loc[5:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
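For an insertion at an arbitrary position, position-based .iloc slicing (half-open, unlike the label-based .loc above, which is inclusive on both ends) avoids off-by-one duplication; a sketch with a hypothetical helper name:

def insert_rows(big, small, pos):
    # rows of `small` start at integer position `pos` of the result
    return pd.concat([big.iloc[:pos], small, big.iloc[pos:]], ignore_index=True)

insert_rows(df1, df2, 2)  # e.g. insert df2 between rows 1 and 2 of df1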
I am new to pandas. I have a DataFrame that consists of 6 columns, and I would like to write a for loop that does this:
- create a new column (nc1)
- nc1 = column 1 - column 2
and I want to iterate this for all columns, so the last one would be:
ncx = column 5 - column 6
I can subtract columns like this:
df['nc'] = df.Column1 - df.Column2
but this is not useful when I try to do a loop, since I always have to insert the names of the columns.
Can someone help me by telling me how I can refer to columns as numbers?
Thank you!
In [26]: import numpy as np
...: import random
...: import pandas as pd
...:
...: A = pd.DataFrame(np.random.randint(100, size=(5, 6)))
In [27]: A
Out[27]:
0 1 2 3 4 5
0 82 13 17 58 68 67
1 81 45 15 11 20 63
2 0 84 34 60 90 34
3 59 28 46 96 86 53
4 45 74 14 10 5 12
In [28]: for i in range(0, 5):
    ...:     A[(i + 6)] = A[i] - A[(i + 1)]
    ...:
    ...: A
    ...:
Out[28]:
0 1 2 3 4 5 6 7 8 9 10
0 82 13 17 58 68 67 69 -4 -41 -10 1
1 81 45 15 11 20 63 36 30 4 -9 -43
2 0 84 34 60 90 34 -84 50 -26 -30 56
3 59 28 46 96 86 53 31 -18 -50 10 33
4 45 74 14 10 5 12 -29 60 4 5 -7
In [29]: nc = 1  # The first new column
    ...: A[(nc + 5)]  # outputs the first new column
Out[29]:
0 69
1 36
2 -84
3 31
4 -29
Here you don't need to refer to the column by name, just by its number, and you can write a simple function that returns column n + 5.
Something like this:
In [31]: def call_new_column(n):
    ...:     return A[(n + 5)]
    ...:
    ...: call_new_column(2)
Out[31]:
0 -4
1 30
2 50
3 -18
4 60
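To answer the "columns as numbers" part directly, .iloc selects columns purely by position, so no names are needed; a sketch (the nc labels are illustrative):

import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(100, size=(5, 6)))

n = A.shape[1]  # number of original columns, taken before any are added
for i in range(n - 1):
    # new column = column i minus column i + 1, both referenced by position
    A['nc%d' % (i + 1)] = A.iloc[:, i] - A.iloc[:, i + 1]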
I need to retrieve the rows from a csv file generated from the function:
def your_func(row):
    return (row['x-momentum']**2 + row['y-momentum']**2 + row['z-momentum']**2)**0.5 / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'z-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['mean_velocity'] = dataframe.apply(your_func, axis=1)
print(dataframe)
I got rows up until 29s, then it skipped to the last few lines. Also, I need to plot column 2 against column 1.
You can adjust the pd.options.display.max_rows option; it won't affect your plots, so your plots will still contain all your data.
demo:
In [25]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
In [26]: df
Out[26]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
In [27]: pd.options.display.max_rows = 4
Now it'll display 4 rows at most
In [36]: df
Out[36]:
A B C
0 93 76 5
1 33 70 12
.. .. .. ..
8 47 90 33
9 44 30 26
[10 rows x 3 columns]
but it'll plot all your data
In [37]: df.plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x49e2d68>
In [38]: pd.options.display.max_rows = 60
In [39]: df
Out[39]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
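If you only need the truncated display temporarily, pandas also provides pd.option_context, which restores the option automatically on exit:

with pd.option_context('display.max_rows', 4):
    print(df)  # truncated to 4 rows inside the block
print(df)      # full display again outside the block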
I have an XY problem. My setup is as follows - I have a dataframe with multi-index of 2 levels. I want to split it to two dataframes, taking only a fraction of rows from each label in the first level. For example:
df = pd.DataFrame({'a':[1, 1, 1, 1, 7, 7, 10, 10, 10, 10, 10, 10, 10], 'b': np.random.randint(0, 100, 13), 'c':np.random.randint(0, 100, 13)}).set_index(['a', 'b'])
df
Out[13]:
c
a b
1 86 83
1 37
57 64
53 5
7 4 66
13 49
10 61 0
32 84
97 59
69 98
25 52
17 31
37 95
So let's say the fraction is 0.5, I want to split it to two dataframes:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
69 98
c
a b
1 57 64
53 5
7 13 49
10 25 52
17 31
37 95
I thought about doing (df.groupby(level = 0).count() * 0.5).astype(int) to get the limit at which to "slice" the dataframe. Then, if only I had a way to add a running index such as this:
c r
a b
1 38 36 0
6 47 1
57 6 2
55 45 3
7 7 51 0
90 96 1
10 59 75 0
27 16 1
58 7 2
79 51 3
58 77 4
63 48 5
87 60 6
I could join the limits and this df and filter with a boolean condition. Any suggestions on either problem? (splitting a fraction of rows or adding a level-aware running index)
This turns out to be pretty trivial with groupby:
In [36]: df.groupby(level=0).apply(lambda x:x.head(int(x.shape[0] * 0.5))).reset_index(level=0, drop=True)
Out[36]:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
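The complementary frame, i.e. the remaining rows of each group, can be produced the same way (a sketch under the same 0.5 fraction):

df.groupby(level=0).apply(
    lambda x: x.tail(x.shape[0] - int(x.shape[0] * 0.5))
).reset_index(level=0, drop=True)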
Also getting the running index per group:
In [33]: df.groupby(level=0).cumcount()
Out[33]:
a b
1 38 0
6 1
57 2
55 3
7 7 0
90 1
10 59 0
27 1
58 2
79 3
58 4
63 5
87 6
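Combining this running index with per-group sizes gives both halves in one pass; a sketch (running, sizes and mask are hypothetical names):

frac = 0.5
running = df.groupby(level=0).cumcount()            # level-aware running index
sizes = df.groupby(level=0)['c'].transform('size')  # group size, aligned to each row
mask = (running < (sizes * frac).astype(int)).values

first, second = df[mask], df[~mask]  # head fraction and the remainder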