Hello StackOverflowers!
I have a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'B': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'C': [20, 20, 20, 20, 15, 20, 15, 15, 15, 15, 15],
    'D': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'E': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1]
})
Then I group it by some columns:
groups = df.groupby(['A', 'B', 'C'])
I would like to select/filter the original data based on the groupby indices. For example, I would like to get 3 random combinations out of the groupby.
Any ideas?
Instead of iterating over all the groups len(indices) times and indexing into that list once per requested position, get the list of group keys from the dictionary returned by GroupBy.groups and make a single GroupBy.get_group call per index:
keys = list(groups.groups.keys())
# [(1, 10, 20), (1, 20, 15), (2, 300, 20)...
indices = [1, 4, 3]  # positions of the groups to select
pd.concat([groups.get_group(keys[i]) for i in indices])
A B C D E
6 1 20 15 20 4
10 1 20 15 70 1
5 3 40 20 10 3
4 3 30 15 80 22
8 3 30 15 30 9
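Since the question asks for random combinations, here is a minimal sketch of drawing 3 group keys at random (the seed is an assumption, there only to make the example reproducible):
import random

random.seed(0)  # assumption: seeded only for reproducibility
random_keys = random.sample(keys, 3)  # 3 distinct (A, B, C) key tuples
pd.concat([groups.get_group(k) for k in random_keys])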
What I could do is
groups = df.groupby(['A', 'B', 'C'])
indices = [1, 4, 3]
pd.concat([[df_group for names, df_group in groups][i] for i in indices])
Which results in:
Out[24]:
A B C D E
6 1 20 15 20 4
10 1 20 15 70 1
5 3 40 20 10 3
4 3 30 15 80 22
8 3 30 15 30 9
I wonder if there is a more elegant way, maybe something already implemented in pd.groupby()?
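As an aside (not part of the original question or answers), GroupBy.ngroup offers a vectorized route: it labels every row with its group's ordinal, so the original frame can be filtered directly:
group_ids = df.groupby(['A', 'B', 'C']).ngroup()  # 0-based group number per row
df[group_ids.isin([1, 4, 3])]                     # same rows, without iterating the groups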
Assume I have df1:
df1 = pd.DataFrame({'alligator_apple': range(1, 11),
                    'barbadine': range(11, 21),
                    'capulin_cherry': range(21, 31)})
alligator_apple barbadine capulin_cherry
0 1 11 21
1 2 12 22
2 3 13 23
3 4 14 24
4 5 15 25
5 6 16 26
6 7 17 27
7 8 18 28
8 9 19 29
9 10 20 30
And a df2:
df2 = pd.DataFrame({'alligator_apple': [6, 7, 15, 5],
                    'barbadine': [3, 19, 25, 12],
                    'capulin_cherry': [1, 9, 15, 27]})
alligator_apple barbadine capulin_cherry
0 6 3 1
1 7 19 9
2 15 25 15
3 5 12 27
I'm looking for a way to create a new column in df2 that holds, for each row of df2, the number of rows in df1 whose values are greater than their counterparts in df2 across all columns. For example:
alligator_apple barbadine capulin_cherry greater
0 6 3 1 4
1 7 19 9 1
2 15 25 15 0
3 5 12 27 3
To elaborate, at row 0 of df2, df1.alligator_apple has 4 rows whose values are higher than df2.alligator_apple with the value 6; df1.barbadine has 10 rows whose values are higher than df2.barbadine with the value 3; and similarly df1.capulin_cherry has 10 rows.
Finally, combine all of those conditions with 'and' to get the number 4 in df2.greater for the first row, and repeat for the rest of the rows in df2.
Is there a simple way to do this?
I believe this does what you want:
df2['greater'] = df2.apply(
    lambda row:
        (df1['alligator_apple'] > row['alligator_apple']) &
        (df1['barbadine'] > row['barbadine']) &
        (df1['capulin_cherry'] > row['capulin_cherry']),
    axis=1,
).sum(axis=1)
print(df2)
output:
alligator_apple barbadine capulin_cherry greater
0 6 3 1 4
1 7 19 9 1
2 15 25 15 0
3 5 12 27 3
Edit: if you want to generalize and apply this logic for a given column set, we can use functools.reduce together with operator.and_:
import functools
import operator
columns = ['alligator_apple', 'barbadine', 'capulin_cherry']
df2['greater'] = df2.apply(
    lambda row: functools.reduce(
        operator.and_,
        (df1[column] > row[column] for column in columns),
    ),
    axis=1,
).sum(axis=1)
There's a general solution to this that should work well.
def gt_mask(row, df):
    # AND together one boolean condition per column of the row
    mask = True
    for key, val in row.items():
        mask &= df[key] > val
    return len(df[mask])  # number of rows of df passing every condition

df2['greater'] = df2.apply(gt_mask, df=df1, axis=1)
Output df2:
alligator_apple barbadine capulin_cherry greater
0 6 3 1 4
1 7 19 9 1
2 15 25 15 0
3 5 12 27 3
This creates a mask, iterating through the key/val pairs for a given row.
Edit: this answer was a big help: Masking a DataFrame on multiple column conditions - inside a loop
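As an aside (not from the answers above), the same counts can be computed without apply at all via NumPy broadcasting; a minimal sketch, assuming both frames share the same column set:
import numpy as np

cols = ['alligator_apple', 'barbadine', 'capulin_cherry']
# compare every df1 row (axis 0) against every df2 row (axis 1) in one shot
gt = df1[cols].to_numpy()[:, None, :] > df2[cols].to_numpy()[None, :, :]
df2['greater'] = gt.all(axis=2).sum(axis=0)  # df1 rows beating each df2 row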
I have a dataframe where I need to create new columns based on the multiplication of other columns with a specific column.
Here is how my data frame looks.
df:
Brand Price S_Value S_Factor
A 10 2 2
B 20 4 1
C 30 2 1
D 40 1 2
E 50 1 1
F 10 1 1
I would like to multiply the Value and Factor columns by Price to get new columns. I could do it manually, but I have a lot of columns, and all of the ones I need to multiply start with a specific prefix... here I used S_, which means I need to multiply all the columns that start with S_.
Here are the desired output columns:
Brand Price S_Value S_Factor S_Value_New S_Factor_New
A 10 2 2
B 20 4 1
C 30 2 1
D 40 1 2
E 50 1 1
F 10 1 1
Firstly, to get the columns you have to multiply, you can use a list comprehension and the string method startswith. Then just loop over those columns and create the new columns by multiplying with Price:
multiply_cols = [col for col in df.columns if col.startswith('S_')]
for col in multiply_cols:
    df[col + '_New'] = df[col] * df['Price']
df
Since you did not add an example of the output, this might be what you are looking for:
dfr = pd.DataFrame({
    'Brand': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [10, 20, 30, 40, 50, 10],
    'S_Value': [2, 4, 2, 1, 1, 1],
    'S_Factor': [2, 1, 1, 2, 1, 1]
})
pre_fixes = ['S_']
for prefix in pre_fixes:
    coltocal = [col for col in dfr.columns if col.startswith(prefix)]
    for col in coltocal:
        dfr.loc[:, col + '_new'] = dfr.price * dfr[col]
dfr
Brand price S_Value S_Factor S_Value_new S_Factor_new
0 A 10 2 2 20 20
1 B 20 4 1 80 20
2 C 30 2 1 60 30
3 D 40 1 2 40 80
4 E 50 1 1 50 50
5 F 10 1 1 10 10
Just add as many prefixes as you have to pre_fixes (use commas to separate them).
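For what it's worth, a loop-free sketch of the same idea uses DataFrame.filter to grab the prefixed columns and .mul to multiply them all by price at once:
new_cols = dfr.filter(regex='^S_').mul(dfr['price'], axis=0)  # each S_ column times price
dfr = dfr.join(new_cols.add_suffix('_new'))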
I have a pandas dataframe with some columns in it. The column I am interested in is something like this,
df['col'] = ['A', 'A', 'B', 'C', 'B', 'A']
I want to make another column, say col_count, that shows the count of the value in col from that index to the end of the column.
The first A in the column should have the value 3 because there are 3 occurrences of A from that index onward. The second A will have the value 2, and so on.
Finally, I want to get the following result,
col col_count
0 A 3
1 A 2
2 B 2
3 C 1
4 B 1
5 A 1
How can I do this effectively in pandas? I was able to do it by looping through the dataframe and taking a unique count of the value in a sliced dataframe.
Is there an efficient method to do this? Something without loops, preferably.
Another part of the question: I have another column like this along with col,
df['X'] = [10, 40, 10, 50, 30, 20]
I want to sum up this column in the same fashion I wanted to count the column col.
For instance, at index 0 the sum will be 10 + 40 + 20. At index 1, the sum will be 40 + 20. In short, instead of counting, I want to sum up another column.
The result will be like this,
col col_count X X_sum
0 A 3 10 70
1 A 2 40 60
2 B 2 10 40
3 C 1 50 50
4 B 1 30 30
5 A 1 20 20
Use GroupBy.cumcount and GroupBy.cumsum on the reversed DataFrame:
g = df[::-1].groupby('col')             # reverse so cumulative ops run from the end
df['col_count'] = g.cumcount().add(1)   # results align back to df by index label
df['X_sum'] = g['X'].cumsum()
print(df)
Output:
col X col_count X_sum
0 A 10 3 70
1 A 40 2 60
2 B 10 2 40
3 C 50 1 50
4 B 30 1 30
5 A 20 1 20
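Note that for the count alone, GroupBy.cumcount accepts ascending=False, so the reversal isn't needed for that half (the reverse-slice trick is still what gets you the backwards cumulative sum):
df['col_count'] = df.groupby('col').cumcount(ascending=False).add(1)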
I'd like to keep the columns in the order they were defined with pd.DataFrame. In the example below, df.info() shows that GroupId is the first column, and print(df.iloc[:, 0]) also prints GroupId.
I'm using Python version 3.6.3
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)})
df.info()
print(df.iloc[:,0])
One way is to use collections.OrderedDict, as below. Note that the OrderedDict object takes a list of tuples as an input.
from collections import OrderedDict
df = pd.DataFrame(OrderedDict([('Id', np.random.randint(1, 100, 10)),
                               ('GroupId', np.random.randint(1, 5, 10))]))
# Id GroupId
# 0 37 4
# 1 10 2
# 2 42 1
# 3 97 2
# 4 6 4
# 5 59 2
# 6 12 2
# 7 69 1
# 8 79 1
# 9 17 1
Unless you're using Python 3.6+, where dictionaries are ordered, this just isn't possible with a (standard) dictionary. You will need to zip your items together and pass a list of tuples:
np.random.seed(0)
a = np.random.randint(1, 100, 10)
b = np.random.randint(1, 5, 10)
df = pd.DataFrame(list(zip(a, b)), columns=['Id', 'GroupId'])
Or,
data = [a, b]
df = pd.DataFrame(list(zip(*data)), columns=['Id', 'GroupId'])
df
Id GroupId
0 45 3
1 48 1
2 65 1
3 68 1
4 68 3
5 10 2
6 84 3
7 22 4
8 37 4
9 88 3
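Another small sketch: pd.DataFrame also accepts a columns= argument that fixes the order explicitly, regardless of how the dict happens to be ordered:
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)},
                  columns=['Id', 'GroupId'])  # explicit column order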
I'm sure there is a neat way of doing this, but haven't had any luck finding it yet.
Suppose I have a data frame:
f = pd.DataFrame({'A':[1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C':[100, 200, 300, 400]}).T
that is, with rows indexed A, B and C.
Now suppose I want to take rows A and B, and replace them both by a single row that is their sum; and, moreover, that I want to assign a given index (say 'sum') to that replacement row (note the order of indices doesn't matter).
At the moment I'm having to do:
f.append(pd.DataFrame(f.loc[['A','B']].sum()).T).drop(['A','B'])
followed by something equally clunky to set the index of the replacement row. However, I'm curious to know if there's an elegant, one-line way of doing both of these steps?
Do this:
In [79]: f.append(f.loc[['A', 'B']].sum(), ignore_index=True).drop([0, 1]).set_index(pd.Index(['C', 'sumAB']))
Out[79]:
0 1 2 3
C 100 200 300 400
sumAB 11 22 33 44
Alternatively you can use Index.get_indexer for an even uglier one-liner:
In [96]: f.append(f.loc[['A', 'B']].sum(), ignore_index=True).drop(f.index.get_indexer(['A', 'B'])).set_index(pd.Index(['C', 'sumAB']))
Out[96]:
0 1 2 3
C 100 200 300 400
sumAB 11 22 33 44
Another option is to use concat:
In [11]: AB = list('AB')
First select the rows you wish to sum:
In [12]: f.loc[AB]
Out[12]:
0 1 2 3
A 1 2 3 4
B 10 20 30 40
In [13]: f.loc[AB].sum()
Out[13]:
0 11
1 22
2 33
3 44
dtype: int64
and as a row in a DataFrame (Note: this step may not be necessary in future versions...):
In [14]: pd.DataFrame({'sumAB': f.loc[AB].sum()}).T
Out[14]:
0 1 2 3
sumAB 11 22 33 44
and we want to concat with all the remaining rows:
In [15]: f.loc[f.index.difference(AB)]
Out[15]:
0 1 2 3
C 100 200 300 400
In [16]: pd.concat([pd.DataFrame({'sumAB': f.loc[AB].sum()}).T,
                    f.loc[f.index.difference(AB)]],
                   axis=0)
Out[16]:
0 1 2 3
sumAB 11 22 33 44
C 100 200 300 400
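Since DataFrame.append was removed in recent pandas, here is a sketch of the same one-liner written with concat (Series.rename supplies the new row's label):
summed = f.loc[['A', 'B']].sum().rename('sumAB')   # one Series, named for the new index
pd.concat([f.drop(['A', 'B']), summed.to_frame().T])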