I have a dataframe like this:
arr = np.random.randint(10, 99, (4,4))
df = pd.DataFrame(arr)
df.columns = pd.MultiIndex.from_product([['X','Y'],['A','B']])
And it looks like this:
X Y
A B A B
0 76 78 29 24
1 34 80 83 56
2 56 44 40 30
3 16 38 45 93
For all rows where A < B in X, I want to do A - B in Y. How do I do that?
I did this to filter and select A and B from Y
df[df['X']['A'] < df['X']['B']].loc[:, ('Y', ['A', 'B'])]
Y
A B
0 29 24
1 83 56
3 45 93
But I am lost on how to do A - B.
Thanks.
Assuming you want to subtract and update A with the result, you can do so by indexing as:
m = (df[('X','A')] < df[('X','B')])
df.loc[m,('Y','A')] = df.loc[m,('Y','A')] - df.loc[m,('Y','B')]
print(df)
X Y
A B A B
0 77 67 55 87
1 36 85 26 50
2 77 14 62 89
3 88 33 82 44
You can select columns by tuples for MultiIndex like:
np.random.seed(20)
arr = np.random.randint(10, 99, (4,4))
df = pd.DataFrame(arr)
df.columns = pd.MultiIndex.from_product([['X','Y'],['A','B']])
print (df)
X Y
A B A B
0 25 38 19 30
1 85 32 81 44
2 50 95 36 93
3 26 72 26 17
mask = df[('X','A')].lt(df[('X','B')])
print (mask)
0 True
1 False
2 True
3 True
dtype: bool
s = df.loc[mask, ('Y','A')].sub(df.loc[mask, ('Y','B')])
print (s)
0 -11
2 -57
3 9
dtype: int32
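Putting the two answers together on the exact values shown in the question, a minimal sketch could look like the following. The column name ('Y', 'diff') is my own choice for illustration; the question does not say where the result should be stored.

```python
import numpy as np
import pandas as pd

# Values copied from the frame printed in the question
arr = np.array([[76, 78, 29, 24],
                [34, 80, 83, 56],
                [56, 44, 40, 30],
                [16, 38, 45, 93]])
df = pd.DataFrame(arr, columns=pd.MultiIndex.from_product([['X', 'Y'], ['A', 'B']]))

# Rows where A < B under X
mask = df[('X', 'A')] < df[('X', 'B')]

# A - B under Y, restricted to those rows
diff = df.loc[mask, ('Y', 'A')] - df.loc[mask, ('Y', 'B')]

# Hypothetical destination column; rows outside the mask become NaN
df[('Y', 'diff')] = diff
print(diff)
```

Rows that fail the condition are simply absent from diff, so assigning it back aligns on the index and leaves NaN in the other rows.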
I have some function that takes a DataFrame and an integer as arguments:
func(df, int)
The function returns a new DataFrame, e.g.:
df2 = func(df,2)
I'd like to write a loop for integers 2-10, resulting in 9 DataFrames. If I do this manually it would look like this:
df2 = func(df,2)
df3 = func(df2,3)
df4 = func(df3,4)
df5 = func(df4,5)
df6 = func(df5,6)
df7 = func(df6,7)
df8 = func(df7,8)
df9 = func(df8,9)
df10 = func(df9,10)
Is there a way to write a loop that does this?
This type of thing is what lists are for.
data_frames = [df]
for i in range(2, 11):
    data_frames.append(func(data_frames[-1], i))
It's a sign of brittle code when you see variable names like df1, df2, df3, etc. Use lists when you have a sequence of related objects to build.
To clarify, this data_frames is a list of DataFrames that can be concatenated with data_frames = pd.concat(data_frames, sort=False), resulting in one DataFrame that combines the original df with everything that results from the loop, correct?
Yup, that's right. If your goal is one final data frame, you can concatenate the entire list at the end to combine the information into a single frame.
Do you mind explaining why data_frames[-1], which takes the last item of the list, returns a DataFrame? Not clear on this.
Because as you're building the list, at all times each entry is a data frame. data_frames[-1] evaluates to the last element in the list, which in this case, is the data frame you most recently appended.
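For anyone who wants to run the list approach end to end, here is a small sketch. The body of func is a stand-in I made up (it just adds the integer), since the real function is not shown in the question; the keys argument to pd.concat is optional and only there to label which step each block came from.

```python
import pandas as pd

def func(frame, n):
    # Hypothetical placeholder for the real function: adds n to every cell
    return frame + n

df = pd.DataFrame({'a': [75, 48], 'b': [18, 56], 'c': [17, 3]})

# Start with the original frame, then feed the latest result back in
data_frames = [df]
for i in range(2, 11):
    data_frames.append(func(data_frames[-1], i))

# One combined frame; keys labels each block with its step number
combined = pd.concat(data_frames, keys=list(range(1, 11)), sort=False)
print(combined)
```

data_frames[0] is the original df and data_frames[-1] is the result of the last call, which corresponds to df10 in the manual version.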
You may try using itertools.accumulate as follows:
sample data
df:
a b c
0 75 18 17
1 48 56 3
import itertools
def func(x, y):
    return x + y
dfs = list(itertools.accumulate([df] + list(range(2, 11)), func))
[ a b c
0 75 18 17
1 48 56 3, a b c
0 77 20 19
1 50 58 5, a b c
0 80 23 22
1 53 61 8, a b c
0 84 27 26
1 57 65 12, a b c
0 89 32 31
1 62 70 17, a b c
0 95 38 37
1 68 76 23, a b c
0 102 45 44
1 75 83 30, a b c
0 110 53 52
1 83 91 38, a b c
0 119 62 61
1 92 100 47, a b c
0 129 72 71
1 102 110 57]
dfs is the list of resulting DataFrames; each one is the previous result with the next integer from 2 to 10 added.
If you want to concatenate them all into one DataFrame, use pd.concat:
pd.concat(dfs)
Out[29]:
a b c
0 75 18 17
1 48 56 3
0 77 20 19
1 50 58 5
0 80 23 22
1 53 61 8
0 84 27 26
1 57 65 12
0 89 32 31
1 62 70 17
0 95 38 37
1 68 76 23
0 102 45 44
1 75 83 30
0 110 53 52
1 83 91 38
0 119 62 61
1 92 100 47
0 129 72 71
1 102 110 57
You can use exec with a formatted string:
for i in range(2, 11):
    exec("df{0} = func(df{1}, {0})".format(i, i - 1 if i > 2 else ''))
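If the point of exec is to look results up by number, a dictionary keyed by the integer is a different technique that gives the same access without generating variable names. This is only a sketch: func is a made-up stand-in, and storing the original frame under key 1 is my own convention.

```python
import pandas as pd

def func(frame, n):
    # Hypothetical placeholder for the real function
    return frame + n

df = pd.DataFrame({'a': [75, 48], 'b': [18, 56], 'c': [17, 3]})

# frames[i] plays the role of the variable df<i>; key 1 holds the original
frames = {1: df}
for i in range(2, 11):
    frames[i] = func(frames[i - 1], i)

print(frames[10])
```

frames[4] then stands where df4 would have been, and the keys can be looped over or passed around like any other data.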
I am looking to convert data frame df1 to df2 using Python. I have a solution that uses loops but I am wondering if there is an easier way to create df2.
df1
Test1 Test2 2014 2015 2016 Present
1 x a 90 85 84 0
2 x a:b 88 79 72 1
3 y a:b:c 75 76 81 0
4 y b 60 62 66 0
5 y c 68 62 66 1
df2
Test1 Test2 2014 2015 2016 Present
1 x a 90 85 84 0
2 x a 88 79 72 1
3 x b 88 79 72 1
4 y a 75 76 81 0
5 y b 75 76 81 0
6 y c 75 76 81 0
7 y b 60 62 66 0
8 y c 68 62 66 1
Here's one way using numpy.repeat and itertools.chain:
import numpy as np
from itertools import chain
# split by delimiter and calculate length for each row
split = df['Test2'].str.split(':')
lens = split.map(len)
# repeat non-split columns
cols = ('Test1', '2014', '2015', '2016', 'Present')
d1 = {col: np.repeat(df[col], lens) for col in cols}
# chain split columns
d2 = {'Test2': list(chain.from_iterable(split))}
# combine in a single dataframe
res = pd.DataFrame({**d1, **d2})
print(res)
2014 2015 2016 Present Test1 Test2
1 90 85 84 0 x a
2 88 79 72 1 x a
2 88 79 72 1 x b
3 75 76 81 0 y a
3 75 76 81 0 y b
3 75 76 81 0 y c
4 60 62 66 0 y b
5 68 62 66 1 y c
This will achieve what you want:
# Converting "Test2" strings into lists of values
df["Test2"] = df["Test2"].apply(lambda x: x.split(":"))
# Creating second dataframe with "Test2" values
test2 = df.apply(lambda x: pd.Series(x['Test2']),axis=1).stack().reset_index(level=1, drop=True)
test2.name = 'Test2'
# Joining both dataframes
df = df.drop('Test2', axis=1).join(test2)
print(df)
Test1 2014 2015 2016 Present Test2
1 x 90 85 84 0 a
2 x 88 79 72 1 a
2 x 88 79 72 1 b
3 y 75 76 81 0 a
3 y 75 76 81 0 b
3 y 75 76 81 0 c
4 y 60 62 66 0 b
5 y 68 62 66 1 c
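On newer pandas versions (0.25 or later, as far as I know) DataFrame.explode does the same thing in one step once the strings are split into lists. A sketch using the df1 values from the question; the final renumbering of the index is only there to match the df2 layout shown above.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Test1': ['x', 'x', 'y', 'y', 'y'],
     'Test2': ['a', 'a:b', 'a:b:c', 'b', 'c'],
     '2014': [90, 88, 75, 60, 68],
     '2015': [85, 79, 76, 62, 62],
     '2016': [84, 72, 81, 66, 66],
     'Present': [0, 1, 0, 0, 1]},
    index=range(1, 6))

# Split "a:b:c" into ['a', 'b', 'c'], then give each element its own row
df2 = (df1.assign(Test2=df1['Test2'].str.split(':'))
          .explode('Test2')
          .reset_index(drop=True))
df2.index += 1  # number the rows from 1, as in the desired output
print(df2)
```

The other columns are repeated automatically for every element of the list, and the column order of df1 is kept.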
I am new to pandas. I have a DataFrame that consists of 6 columns, and I would like to write a for loop that does this:
- create a new column (nc1)
- nc1 = column 1 - column 2
and I want to repeat this for every pair of adjacent columns, so the last one would be:
ncx = column 5 - column 6
I can subtract columns like this:
df['nc'] = df.Column1 - df.Column2
but this is not useful in a loop, since I always have to type the column names.
Can someone tell me how to refer to columns by number?
Thank you!
In [26]: import numpy as np
...: import random
...: import pandas as pd
...:
...: A = pd.DataFrame(np.random.randint(100, size=(5, 6)))
In [27]: A
Out[27]:
0 1 2 3 4 5
0 82 13 17 58 68 67
1 81 45 15 11 20 63
2 0 84 34 60 90 34
3 59 28 46 96 86 53
4 45 74 14 10 5 12
In [28]: for i in range(0, 5):
...:     A[(i + 6)] = A[i] - A[(i + 1)]
...:
...:
...: A
...:
Out[28]:
0 1 2 3 4 5 6 7 8 9 10
0 82 13 17 58 68 67 69 -4 -41 -10 1
1 81 45 15 11 20 63 36 30 4 -9 -43
2 0 84 34 60 90 34 -84 50 -26 -30 56
3 59 28 46 96 86 53 31 -18 -50 10 33
4 45 74 14 10 5 12 -29 60 4 5 -7
In [29]: nc = 1 #The first new column
...: A[(nc + 5)] #outputs the first new column
Out[29]:
0 69
1 36
2 -84
3 31
4 -29
Here you don't need to refer to a column by name, only by its number, so a simple function can return new column n by looking up column n + 5.
Something like this:
In [31]: def call_new_column(n):
...:     return A[n + 5]
...:
...:
...: call_new_column(2)
Out[31]:
0 -4
1 30
2 50
3 -18
4 60
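If the columns have real names rather than integer labels, the same loop can walk over df.columns by position. A sketch, assuming names like Column1 to Column6 and the nc1 to nc5 naming from the question; the two rows of numbers are borrowed from the example above.

```python
import pandas as pd

df = pd.DataFrame(
    [[82, 13, 17, 58, 68, 67],
     [81, 45, 15, 11, 20, 63]],
    columns=['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])

# Freeze the original column list so the new columns are not picked up
cols = list(df.columns)

# nc1 = Column1 - Column2, ..., nc5 = Column5 - Column6
for k in range(len(cols) - 1):
    df['nc{}'.format(k + 1)] = df[cols[k]] - df[cols[k + 1]]

print(df)
```

Taking the column list before the loop matters, because df.columns grows as each new column is added.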
I need to retrieve the rows from a csv file generated from the function:
def your_func(row):
    return (row['x-momentum']**2+ row['y-momentum']**2 + row['z-momentum']**2)**0.5 / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'z-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['mean_velocity'] = dataframe.apply(your_func, axis=1)
print(dataframe)
The printed output shows rows up to 29s and then skips to the last few lines. I also need to plot column 2 against column 1.
You can adjust the pd.options.display.max_rows option to print more rows. It only affects how the frame is displayed, not your plots, so your plots will contain all your data.
demo:
In [25]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
In [26]: df
Out[26]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
In [27]: pd.options.display.max_rows = 4
Now it'll display 4 rows at most
In [36]: df
Out[36]:
A B C
0 93 76 5
1 33 70 12
.. .. .. ..
8 47 90 33
9 44 30 26
[10 rows x 3 columns]
but it'll plot all your data
In [37]: df.plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x49e2d68>
In [38]: pd.options.display.max_rows = 60
In [39]: df
Out[39]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
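If you only want the full printout once, pd.option_context changes the setting temporarily and restores it afterwards. A small sketch; the frame below is made-up data that only borrows the '#time' and 'mean_velocity' column names from the question.

```python
import pandas as pd

# Made-up stand-in for the frame read from the CSV file
dataframe = pd.DataFrame({'#time': range(100),
                          'mean_velocity': [0.5 * t for t in range(100)]})

before = pd.get_option('display.max_rows')

# Truncated display: only the head and tail are printed
with pd.option_context('display.max_rows', 4):
    short_text = repr(dataframe)

# None removes the row limit inside this block only
with pd.option_context('display.max_rows', None):
    full_text = repr(dataframe)

after = pd.get_option('display.max_rows')
print(short_text)
```

The underlying data is the same in both cases. For the plot, something like dataframe.plot(x='#time', y='mean_velocity') should use every row regardless of the display setting, assuming matplotlib is installed.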
I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go cython?
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
[9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
A B C D
0 8 4 3 1
1 9 7 2 2
I had thought this might work, but it sorts the columns:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
D C B A
0 1 8 4 3
1 2 7 2 9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
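For reference, here is a version of the numpy route that works on a copy, so the original frame is left alone. This is a sketch using the two rows from the question; sorted_df is a name I picked.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[3, 4, 8, 1],
                   [9, 2, 7, 2]], columns=list('ABCD'))

# Work on a copy of the underlying values
values = df.to_numpy(copy=True)

# np.sort has no descending option: sort each row ascending, then reverse
values = np.sort(values, axis=1)[:, ::-1]

# Rebuild a frame with the original index and column labels
sorted_df = pd.DataFrame(values, index=df.index, columns=df.columns)
print(sorted_df)
```

The sort runs over the whole 2-D array in one call, which is why it tends to be much faster than sorting row by row in Python.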
To add to the answer given by @Andy-Hayden: to do this in place on the whole frame... not really sure why this works, but it does. There seems to be no control over the sort order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [101]: A.values.sort()
In [102]: A
Out[102]:
one two three four five
0 22 46 49 63 72
1 25 30 33 43 69
2 21 24 39 56 93
3 3 11 52 57 74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
five four three two one
0 72 63 49 46 22
1 69 43 33 30 25
2 93 56 39 24 21
3 74 57 52 11 3
I hope someone can explain the why of this, just happy that it works 8)
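A likely explanation, as far as I can tell: ndarray.sort() works in place and its default axis is -1, the last one, which for a 2-D array means each row is sorted on its own. When A.values hands back the array that the frame itself is built on, sorting that array changes what the frame shows. The numpy half of this can be checked without pandas; the variable names below are mine.

```python
import numpy as np

# Two rows taken from the frame in the example above
a = np.array([[22, 63, 72, 46, 49],
              [43, 30, 69, 33, 25]])

# Stand-in for the array returned by A.values (same object, not a copy)
values = a

# In-place sort; axis defaults to -1, so every row is sorted separately
values.sort()

print(a)
```

Whether A.values really returns a writable view depends on the pandas version and on the frame holding a single dtype, so I would treat the in-place trick as version-dependent and prefer building a new frame from np.sort.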
You could use DataFrame.apply.
Eg:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
one two three four five
0 2 75 44 53 46
1 18 51 73 80 66
2 35 91 86 44 25
3 60 97 57 33 79
A = A.apply(np.sort, axis = 1)
print(A)
one two three four five
0 2 44 46 53 75
1 18 51 66 73 80
2 25 35 44 86 91
3 33 57 60 79 97
Since you want it in descending order, you can simply multiply the dataframe with -1 and sort it.
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis = 1)
A = A * -1
Instead of using pd.DataFrame constructor, an easier way to assign the sorted values back is to use double brackets:
original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
A B C D
0 8 4 3 1
1 9 7 2 2
This way you can also sort a part of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
A B C D
0 3 8 4 1
1 9 7 2 2
One could try this approach to preserve the integrity of the df:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
print(type(A))
one two three four five
0 85 27 64 50 55
1 3 90 65 22 8
2 0 7 64 66 82
3 58 21 42 27 30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
one two three four five
0 27 50 55 64 85
1 3 8 22 65 90
2 0 7 64 66 82
3 21 27 30 42 58
<class 'pandas.core.frame.DataFrame'>