Vectorized operation on three columns - python

First, let's create a random DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.random.randint(0, 70, size=5),
        "B": np.random.randint(-10, 35, size=5),
        "C": np.random.randint(10, 50, size=5),
    }
)
Then I use the min and max functions to create two additional columns:
df['max'] = df[['A', 'B', 'C']].max(axis=1)
df['min'] = df[['A', 'B', 'C']].min(axis=1)
Output:
A B C max min
0 17 26 31 31 17
1 45 31 17 45 17
2 36 24 31 36 24
3 16 17 24 24 16
4 16 12 23 23 12
What would be the most efficient and elegant way to get the remaining value into a 'mid' column, so that the output looks like this:
A B C max min mid
0 17 26 31 31 17 26
1 45 31 17 45 17 31
2 36 24 31 36 24 31
3 16 17 24 24 16 17
4 16 12 23 23 12 16
I am looking for a vectorized solution. I was able to achieve this using conditions:
conditions = [
    (df['A'] > df['B']) & (df['A'] < df['C']) | (df['A'] > df['C']) & (df['A'] < df['B']),
    (df['B'] > df['A']) & (df['B'] < df['C']) | (df['B'] > df['C']) & (df['B'] < df['A']),
    (df['C'] > df['A']) & (df['C'] < df['B']) | (df['C'] > df['B']) & (df['C'] < df['A']),
]
choices = [df['A'], df['B'], df['C']]
df['mid'] = np.select(conditions, choices, default=0)
However, I think there is a more elegant solution.

You could simply use median:
df[["A","B","C"]].median(axis=1)
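As a side note, since the row sum minus the two extremes leaves the middle value, 'mid' can also be computed with plain arithmetic once 'max' and 'min' exist. A minimal sketch:
df['mid'] = df[['A', 'B', 'C']].sum(axis=1) - df['max'] - df['min']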
By the way, instead of running the aggregations one by one, you can do everything in one go as follows:
df.join(df.agg(['min', 'max', 'median'], axis=1))
Output:
A B C min max median
0 2 22 38 2.0 38.0 22.0
1 29 15 40 15.0 40.0 29.0
2 48 -5 17 -5.0 48.0 17.0
3 17 18 43 17.0 43.0 18.0
4 60 -10 39 -10.0 60.0 39.0
The advantage of this is that, in a case like the one you described (i.e. you want to aggregate the entire row), you don't need to specify the names of the columns you want to aggregate. If you instead add the aggregation columns one at a time, you have to make sure you don't include each new column in the following aggregation, so you end up specifying the columns explicitly every time.
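To make that pitfall concrete, here is a small sketch (with a hypothetical 'med' column): once 'max' has been added column by column, a later row-wise aggregation silently includes it unless the original columns are re-listed:
df['max'] = df[['A', 'B', 'C']].max(axis=1)
df['med'] = df.median(axis=1)  # silently includes the new 'max' column
df['med'] = df[['A', 'B', 'C']].median(axis=1)  # correct: re-specify the columns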


How to concatenate rows side by side in pandas

I want to combine every five rows of the same dataset into a single row.
I have 700 rows, so I want to combine each group of five rows.
A B C D E F G
1 10,11,12,13,14,15,16
2 17,18,19,20,21,22,23
3 24,25,26,27,28,29,30
4 31,32,33,34,35,36,37
5 38,39,40,41,42,43,44
...
700
After combining the first five rows, my first row should look like this:
A B C D E F G A B C D E F G A B C D E F G A B C D E F G A B C D E F G
1 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44
If you can guarantee that the total number of rows you have is a multiple of 5, dipping into numpy will be the most efficient way to solve this problem:
import numpy as np
import pandas as pd
data = np.arange(70).reshape(-1, 7)
df = pd.DataFrame(data, columns=[*'ABCDEFG'])
print(df)
A B C D E F G
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
2 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
5 35 36 37 38 39 40 41
6 42 43 44 45 46 47 48
7 49 50 51 52 53 54 55
8 56 57 58 59 60 61 62
9 63 64 65 66 67 68 69
out = pd.DataFrame(
    df.to_numpy().reshape(-1, df.shape[1] * 5),
    columns=[*df.columns] * 5
)
print(out)
A B C D E F G A B C D E F ... B C D E F G A B C D E F G
0 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 22 23 24 25 26 27 28 29 30 31 32 33 34
1 35 36 37 38 39 40 41 42 43 44 45 46 47 ... 57 58 59 60 61 62 63 64 65 66 67 68 69
[2 rows x 35 columns]
You can do:
cols = df.columns.tolist() * 5
dfs = [df[i:min(i + 5, len(df))].reset_index(drop=True) for i in range(0, len(df), 5)]
df2 = pd.concat([pd.DataFrame(chunk.stack()).T for chunk in dfs])
df2.columns = cols
df2.reset_index(drop=True, inplace=True)
See if this helps answer your question.
unstack turns the columns into rows, and once we have the data in a single column, we just need it transposed. reset_index turns the resulting Series into a DataFrame. The original column names become an index level, so after transposing, the columns appear just as you stated in your desired output.
df.unstack().reset_index().set_index('level_0')[[0]].T
level_0 A A A A A B B B B B ... F F F F F G G G G G
0 10 17 24 31 38 11 18 25 32 39 ... 15 22 29 36 43 16 23 30 37 44
The easiest way is to convert your dataframe to a numpy array, reshape it, then cast it back to a new dataframe.
Edit:
data = ...  # your dataframe
new_dataframe = pd.DataFrame(
    data.to_numpy().reshape(len(data) // 5, -1),
    columns=np.tile(data.columns, 5)
)
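Note that this and the earlier reshape-based answer both assume len(data) is an exact multiple of 5; otherwise reshape raises a ValueError. A sketch, under that caveat, for padding the last incomplete group with NaN rows first:
import numpy as np
import pandas as pd

n_pad = -len(data) % 5  # rows needed to complete the last group of 5
padded = np.vstack([data.to_numpy(), np.full((n_pad, data.shape[1]), np.nan)])
new_dataframe = pd.DataFrame(padded.reshape(len(padded) // 5, -1),
                             columns=np.tile(data.columns, 5))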
Stacking and unstacking data in pandas
Data in tables are often presented in multiple ways. Long form ("tidy data") refers to data stacked into a couple of columns, one of which holds categorical indicators about the values. In contrast, wide form is where each category has its own column.
In your example, you present the wide form of the data, and you're trying to get it into long form. The melt, groupby, pivot, stack, unstack, and reset_index functions in pandas help convert between these forms.
Start with your original dataframe:
df = pd.DataFrame({
    'A': [10, 17, 24, 31, 38],
    'B': [11, 18, 25, 32, 39],
    'C': [12, 19, 26, 33, 40],
    'D': [13, 20, 27, 34, 41],
    'E': [14, 21, 28, 35, 42],
    'F': [15, 22, 29, 36, 43],
    'G': [16, 23, 30, 37, 44]})
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
Use pandas.melt to convert it to long form, then sort to arrange the data the way you requested. ignore_index=False preserves the original index, which helps us get back to wide form later.
melted_df = df.melt(ignore_index=False).sort_values(by='value')
variable value
0 A 10
0 B 11
0 C 12
0 D 13
0 E 14
0 F 15
0 G 16
1 A 17
1 B 18
...
Use groupby, unstack, and reset_index to convert it back to wide form. This direction is often much more difficult: you group by the original index and the stacked variable, then unstack and reset the index.
(melted_df
.reset_index() # puts the index values into a column called 'index'
.groupby(['index','variable']) #groups by the index and the variable
.value #selects the value column in each of the groupby objects
.mean() #since there is only one item per group, it only aggregates one item
.unstack() #this sets the first item of the multi-index to columns
.reset_index() #fix the index
.set_index('index') #set index
)
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
This stuff can be quite difficult and requires trial and error. I would recommend making a smaller version of your problem and messing with it; that way you can figure out how the functions work.
Try this: use arange() with floor division to form a group for every 5 rows, then create a new df from the groups. This should work even if the length of your df is not divisible by 5.
l = 5
(df.groupby(np.arange(len(df.index))//l)
.apply(lambda x: pd.DataFrame([x.to_numpy().ravel()]))
.set_axis(df.columns.tolist() * l,axis=1)
.reset_index(drop=True))
or
(df.groupby(np.arange(len(df.index))//5)
.apply(lambda x: x.reset_index(drop=True).stack())
.unstack(level=[1,2])
.droplevel(0,axis=1))
Output:
A B C D E F G A B C ... E F G A B C D E F G
0 9 0 3 2 6 2 9 1 7 5 ... 2 5 9 5 4 9 7 3 8 9
1 9 5 0 8 1 5 8 7 7 7 ... 6 3 5 5 2 3 9 7 5 6

Pandas: adding several columns to a dataframe in a single line

I come from an R background and am wondering if there's a single line of code to add several new columns to an existing dataframe in pandas, just like dplyr. If I have this code:
import pandas as pd
df = pd.DataFrame({'a': range(1, 11)})
df['b'] = range(11, 21)
df['c'] = range(21, 31)
df['d'] = range(31, 41)
df['e'] = range(41, 51)
Is there a way to make all columns addition into df in one line?
An example of what I want in R would be:
library(dplyr)
df <- data.frame('a' = 1:10)
df <- df %>% mutate(b = 11:20, c = 21:30, d = 31:40, e = 41:50)
There is assign:
df.assign(b=range(11,21), c=range(21,31), d=range(31,41))
Things are even easier when you have a dictionary:
# assume you get this from somewhere else
val_dict = {'b': range(11,21), 'c':range(21,31)}
df.assign(**val_dict)
Note the second approach is needed when a name is not valid as a keyword argument, for example a column name containing a space, like 'a b'.
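For instance, a small sketch with such a name, which only the dict form can express:
df = df.assign(**{'a b': range(11, 21), 'c': range(21, 31)})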
As others have noted, you could build them all in the original construction of the dataframe, but if you need to add multiple columns at a later point, you can add them in a single statement through multiple assignment:
df['b'], df['c'], df['d'], df['e'] = range(11, 21), range(21,31), range(31,41), range(41,51)
df = pd.DataFrame( {c: range(x, y) for c,x,y in [(chr(97+x), x*10+1, x*10+11) for x in range(5)]})
>>> df
a b c d e
0 1 11 21 31 41
1 2 12 22 32 42
2 3 13 23 33 43
3 4 14 24 34 44
4 5 15 25 35 45
5 6 16 26 36 46
6 7 17 27 37 47
7 8 18 28 38 48
8 9 19 29 39 49
9 10 20 30 40 50
Or to add to an existing dataframe:
df = pd.DataFrame({'a': range(1,11)})
df = pd.concat([df, pd.DataFrame( {c: range(x, y) for c,x,y in [(chr(97+x), x*10+1, x*10+11) for x in range(1, 5)]})], axis=1)
Check out this:
>>> from datar.all import f, tibble, mutate
>>> df = tibble(a = f[1:10])
>>> df >> mutate(b = f[11:20], c = f[21:30], d = f[31:40], e = f[41:50])
a b c d e
<int64> <int64> <int64> <int64> <int64>
0 1 11 21 31 41
1 2 12 22 32 42
2 3 13 23 33 43
3 4 14 24 34 44
4 5 15 25 35 45
5 6 16 26 36 46
6 7 17 27 37 47
7 8 18 28 38 48
8 9 19 29 39 49
9 10 20 30 40 50
I am the author of the datar package.
You can just pass all the data and the associated column names into pd.DataFrame, just as you did with column 'a', separated by commas.
Like this:
df = pd.DataFrame({'a': range(1, 11), 'b' : range(11, 21)})

Complicated reference to another table

I have the dataframe shown below:
The column named 'Types' shows each defined type.
I would like to add another column named 'number', defined as below.
df=pd.DataFrame({'Sex':['M','F','F','M'],'Age':[30,31,33,32],'Types':['A','C','B','D']})
Out[8]:
Age Sex Types
0 30 M A
1 31 F C
2 33 F B
3 32 M D
and I have another table, for males, below;
each column represents a Type!
(It was difficult for me to create this table. Is there an easier way to create it?)
table_M = pd.DataFrame(np.arange(20).reshape(4,5),index=[30,31,32,33],columns=["A","B","C","D","E"])
table_M.index.name="Age(male)"
A B C D E
Age(male)
30 0 1 2 3 4
31 5 6 7 8 9
32 10 11 12 13 14
33 15 16 17 18 19
and I have the female table below;
table_F = pd.DataFrame(np.arange(20,40).reshape(4,5),index=[30,31,32,33],columns=["A","B","C","D","E"])
table_F.index.name="Age(female)"
A B C D E
Age(female)
30 20 21 22 23 24
31 25 26 27 28 29
32 30 31 32 33 34
33 35 36 37 38 39
so I would like to add the 'number' column as shown below;
Age Sex Types number
0 30 M A 0
1 31 F C 27
2 33 F B 36
3 32 M D 13
The number column refers to the female and male tables, matched on Age, Types, and Sex.
It is too complicated for me.
Can I ask how to add the 'number' column?
I suggest reshaping your male and female tables:
males = (table_M.stack().to_frame('number').assign(Sex='M').reset_index()
.rename(columns={'Age(male)': 'Age', 'level_1': 'Types'}))
females = (table_F.stack().to_frame('number').assign(Sex='F').reset_index()
.rename(columns={'Age(female)': 'Age', 'level_1': 'Types'}))
reshaped = pd.concat([males, females], ignore_index=True)
Then merge:
df.merge(reshaped)
Out:
Age Sex Types number
0 30 M A 0
1 31 F C 27
2 33 F B 36
3 32 M D 13
What this does is stack the columns of the male and female tables and assign an indicator column showing Sex ('M' and 'F'). females.head() looks like this:
females.head()
Out:
Age Types number Sex
0 30 A 20 F
1 30 B 21 F
2 30 C 22 F
3 30 D 23 F
4 30 E 24 F
and males.head():
males.head()
Out:
Age Types number Sex
0 30 A 0 M
1 30 B 1 M
2 30 C 2 M
3 30 D 3 M
4 30 E 4 M
With pd.concat these two are combined into a single DataFrame, and merge by default joins on the common columns, so it looks for matches in the 'Age', 'Sex', and 'Types' columns and merges the two DataFrames based on that.
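Equivalently, you can make the join keys explicit, e.g. (a sketch):
df.merge(reshaped, on=['Age', 'Sex', 'Types'])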
One other possibility is to use df.lookup:
df.loc[df['Sex']=='M', 'number'] = table_M.lookup(*df.loc[df['Sex']=='M', ['Age', 'Types']].values.T)
df.loc[df['Sex']=='F', 'number'] = table_F.lookup(*df.loc[df['Sex']=='F', ['Age', 'Types']].values.T)
df
Out:
Age Sex Types number
0 30 M A 0.0
1 31 F C 27.0
2 33 F B 36.0
3 32 M D 13.0
This looks up the males in table_M, and females in table_F.
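Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current versions, a sketch of the same positional lookup using index/column indexers (shown for the males; the female half is analogous):
m = df['Sex'] == 'M'
rows = table_M.index.get_indexer(df.loc[m, 'Age'])
cols = table_M.columns.get_indexer(df.loc[m, 'Types'])
df.loc[m, 'number'] = table_M.to_numpy()[rows, cols]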
It's easier if your two tables are combined, such that you can access the 'Sex' via an apply.
table = pd.concat([table_F, table_M], axis=1, keys=['F', 'M'])
accessor = lambda row: table.loc[row.Age, (row.Sex, row.Types)]
df['number'] = df.apply(accessor, axis=1)
df
Another way to do this:
In [60]: df['numbers'] = df.apply(lambda x: table_F.loc[[x.Age]][x.Types].iloc[0] if x.Sex == 'F' else table_M.loc[[x.Age]][x.Types].iloc[0], axis = 1)
In [60]: df
Out[60]:
Age Sex Types numbers
0 30 M A 0
1 31 F C 27
2 33 F B 36
3 32 M D 13

Pandas dataframe compression

How do I map one dataframe into another with fewer rows, summing the values of rows whose indices fall in a given interval?
For example
Given df:
Survived
Age
20 1
22 1
23 3
24 2
30 2
33 1
40 8
42 7
Desired df
(for interval = 5):
Survived
Age
20 7
25 0
30 3
35 0
40 15
(for interval = 10):
Survived
Age
20 7
30 3
40 15
You can use a function for the groupby argument:
In [6]: df.groupby(lambda x: x//10 * 10).sum()
Out[6]:
Survived
20 7
30 3
40 15
Note, this also works with 5, but it doesn't do what you want with empty groups; that is, it doesn't fill them in with zeroes!
In [12]: df.groupby(lambda x: x//5 *5).sum()
Out[12]:
Survived
20 7
30 3
40 15
However, if the data contained values for those in-between groups, you could see the 5-interval version working:
In [18]: df
Out[18]:
Survived
Age
20 1
22 1
23 3
24 2
26 99
30 2
33 1
40 8
42 7
47 99
In [19]: df.groupby(lambda x: x//5 *5).sum()
Out[19]:
Survived
20 7
25 99
30 3
40 15
45 99
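If you want the missing bins to appear as zeroes with the groupby approach too, one option (a sketch, assuming a numeric index over a known range) is to reindex the result:
interval = 5
out = df.groupby(lambda x: x // interval * interval).sum()
out = out.reindex(range(out.index.min(), out.index.max() + 1, interval), fill_value=0)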
First convert the int index to a TimedeltaIndex and then resample:
df.index = pd.TimedeltaIndex(df.index.to_series(), unit='s')
print (df)
Survived
00:00:20 1
00:00:22 1
00:00:23 3
00:00:24 2
00:00:30 2
00:00:33 1
00:00:40 8
00:00:42 7
df1 = df.resample('5S').sum().fillna(0)
df1.index = df1.index.seconds
print (df1)
Survived
20 7.0
25 0.0
30 3.0
35 0.0
40 15.0
df2 = df.resample('10S').sum().fillna(0)
df2.index = df2.index.seconds
print (df2)
Survived
20 7
30 3
40 15
EDIT:
If Age > 60, it works nicely too:
print (df)
Survived
Age
20 1
22 1
23 3
24 2
30 2
33 1
40 8
42 7
60 8
62 7
70 8
72 7
df.index = pd.TimedeltaIndex(df.index.to_series(), unit='s')
df1 = df.resample('5S').sum().fillna(0)
df1.index = df1.index.seconds
print (df1)
Survived
20 7.0
25 0.0
30 3.0
35 0.0
40 15.0
45 0.0
50 0.0
55 0.0
60 15.0
65 0.0
70 15.0
df2 = df.resample('10S').sum().fillna(0)
df2.index = df2.index.seconds
print (df2)
Survived
20 7.0
30 3.0
40 15.0
50 0.0
60 15.0
70 15.0
You can create a new column from the Age column and then use groupby.
In order to create the new column, Age needs to be taken out of the index:
df.reset_index(inplace=True)

def cat_age(age):
    return 10 * int(age / 10.)

df['category_age'] = df.Age.apply(cat_age)
df.groupby('category_age', as_index=False).agg({'Survived': sum})
Output:
category_age Survived
0 20 7
1 30 3
2 40 15
Of course, if you want to change the categories, you can pass the interval to cat_age:
def cat_age(age, interval):
    return interval * int(age / interval)
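Usage would then look something like this (a sketch; as with the other groupby answers, empty bins won't appear unless you reindex afterwards):
df['category_age'] = df.Age.apply(lambda x: cat_age(x, 5))
df.groupby('category_age', as_index=False).agg({'Survived': sum})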

Calling agg without first calling groupby

Is there a function similar to agg, that doesn't require a groupby call first?
For example, I often already have an agg map written, and want to evaluate the map for the entire table.
So I want to change
data = data.groupby("key").agg({"foo1":"sum", "foo2":"mean"})
to
data = data.agg({"foo1":"sum", "foo2":"mean"})
I currently do this by inserting a fake key, and then aggregating on that. But that's a hack. Is there a better way?
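For reference, my fake-key hack looks something like this (a sketch; '_key' is a placeholder name):
data.assign(_key=0).groupby('_key').agg({"foo1": "sum", "foo2": "mean"}).reset_index(drop=True)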
UPDATE: as @root proposed in the comments, it is easier and more elegant to group by np.repeat(0, len(df)):
In [5]: df.groupby(np.repeat(0, len(df))).agg({'A':'sum', 'B':'mean', 'C':'min'})
Out[5]:
B A C
0 42.9 484 21
OLD answer:
assuming that you have a numeric index which is always >= 0:
In [139]: df.groupby(df.index >= 0, as_index=False).agg({'A':'sum', 'B':'mean', 'C':'min'})
Out[139]:
A B C
0 484 42.9 21
or assuming that your index doesn't have any NaNs
In [140]: df.groupby(df.index==df.index, as_index=False).agg({'A':'sum', 'B':'mean', 'C':'min'})
Out[140]:
A B C
0 484 42.9 21
if your index can have NaN's use the following trick:
In [160]: df.groupby(pd.notnull(df.index) | pd.isnull(df.index), as_index=False).agg({'A':'sum', 'B':'mean', 'C':'min'})
Out[160]:
A B C
0 484 42.9 21
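Also note that in pandas 0.20 and later, DataFrame.agg accepts a dict directly, so the exact call from the question works without any groupby and returns a Series:
df.agg({'A': 'sum', 'B': 'mean', 'C': 'min'})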
Data:
In [138]: df
Out[138]:
A B C
0 34 45 68
1 71 62 61
2 39 51 33
3 38 62 27
4 16 39 21
5 94 41 41
6 14 11 41
7 76 40 29
8 44 34 70
9 58 44 68
