df = pd.DataFrame([["Alpha", 3, 2, 4], ["Bravo", 2, 3, 1], ["Charlie", 4, 1, 3], ["Delta", 1, 4, 2]],
                  columns=["Company", "Running", "Combat", "Range"])
print(df)
   Company  Running  Combat  Range
0    Alpha        3       2      4
1    Bravo        2       3      1
2  Charlie        4       1      3
3    Delta        1       4      2
Hi, I am trying to sort the following dataframe so that the best-performing company across the three columns ends up at the top. In this case that would be Bravo company, as it is 2 in Running, 3 in Combat and 1 in Range.
Would this approach still work if the list had many more companies and it were hard to pick out the exact "best performing company" by eye?
I have tried:
df_sort = df.sort_values(['Running', 'Combat', 'Range'], ascending=[True, True, True])
current output:
   Company  Running  Combat  Range
3    Delta        1       4      2
1    Bravo        2       3      1
0    Alpha        3       2      4
2  Charlie        4       1      3
but it doesn't turn out the way I want. Can this be done with pandas?
I was expecting the output to be:
   Company  Running  Combat  Range
0    Bravo        2       3      1
1    Delta        1       4      2
2  Charlie        4       1      3
3    Alpha        3       2      4
If you want to sort by the mean of each row, first compute the row means, then use Series.argsort to get the positions of the sorted values, and finally reorder the rows with DataFrame.iloc:
df1 = df.iloc[df.mean(axis=1, numeric_only=True).argsort()]
print(df1)
   Company  Running  Combat  Range
1    Bravo        2       3      1
3    Delta        1       4      2
2  Charlie        4       1      3
0    Alpha        3       2      4
EDIT: If you need to remove some columns first, drop them with DataFrame.drop before taking the mean:
cols = ['Overall', 'Subordination']
df2 = text_df.iloc[text_df.drop(cols, axis=1).mean(axis=1, numeric_only=True).argsort()]
print(df2)
   Company  Running  Combat  Overall Subordination  Range
1    Bravo        2       3     0.70          Poor      1
3    Delta        1       4     0.83          Good      2
2  Charlie        4       1     0.81          Good      3
0    Alpha        3       2     0.91     Excellent      4
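If you prefer a more explicit version, the same ordering can be had with a temporary helper column (a sketch of my own, not part of the original answer; numeric_only=True keeps the string Company column out of the mean):
# sort by the per-row mean of the numeric columns via a temporary column
df1 = (df.assign(avg=df.mean(axis=1, numeric_only=True))
         .sort_values('avg')
         .drop(columns='avg'))
print(df1)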
I'm learning Python. I have a dataframe like this:
     cand1    cand2    cand3
0  40.0900  39.6700  36.3700
1  44.2800  44.2800  35.4200
2  43.0900  51.2200  46.3500
3  35.7200  55.2700  36.4700
and I want to rank each row according to the values of the columns, so that I get
   cand1  cand2  cand3
0      1      2      3
1      1      1      3
2      3      1      2
3      3      1      2
I have now:
for index, row in df.iterrows():
    df.loc['Rank'] = df.loc[index].rank(ascending=False).astype(int)
    print(df)
However, this keeps printing the whole dataframe on every iteration. Note also the special case at index 1, where two values are tied.
Suggestions appreciated.
Use DataFrame.rank with axis=1 instead of ranking each row as a separate Series:
df_rank = df.rank(axis=1, ascending=False, method='min').astype(int)
Out[165]:
   cand1  cand2  cand3
0      1      2      3
1      1      1      3
2      3      1      2
3      3      1      2
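As a side note (my addition): the method parameter controls how ties are handled. With the tie at index 1, method='min' gives 1, 1, 3, while method='dense' closes the gap and gives 1, 1, 2:
# dense ranking: tied values share a rank and no ranks are skipped
print(df.rank(axis=1, ascending=False, method='dense').astype(int))
   cand1  cand2  cand3
0      1      2      3
1      1      1      2
2      3      1      2
3      3      1      2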
This seems like it should be easy, but I couldn't find a working solution. I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0, 0, 2, 2, 2],
                   'B': [1, 1, 2, 2, 3],
                   'C': [1, 1, 2, 3, 4]})
   A  B  C
0  0  1  1
1  0  1  1
2  2  2  2
3  2  2  3
4  2  3  4
I want to select rows based on the values of column A, group them by the values of column B, and replace the values of column C with the per-group sum, something along the lines of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for the above example is:
   A  B  C
0  0  1  1
1  0  1  1
2  2  2  5
3  2  3  4
I also know how to do this by splitting the dataframe. I am looking for a solution that works without split + concat/merge. Thank you.
Is it just:
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A', 'B'])['C'].sum().reset_index(),
           df[~s]))
Output:
   A  B  C
0  2  2  5
1  2  3  4
0  0  1  1
1  0  1  1
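If you want the A == 0 rows first with a clean index, as in the desired output, you can sort afterwards (my addition, reusing the mask s from above):
out = (pd.concat((df[s].groupby(['A', 'B'], as_index=False)['C'].sum(),
                  df[~s]))
         .sort_values(['A', 'B'], ignore_index=True))
print(out)
   A  B  C
0  0  1  1
1  0  1  1
2  2  2  5
3  2  3  4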
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
   .assign(D=(~df['A'].isin([2])).cumsum())
   .groupby(['D', 'A', 'B'])['C'].sum()
   .reset_index('D', drop=True)
   .reset_index()
)
Output:
   A  B  C
0  0  1  1
1  0  1  1
2  2  2  5
3  2  3  4
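Another option without concat (my own variation, not from the answers above): overwrite C with the group sums only where A == 2, then drop the duplicate rows that the aggregation makes redundant:
m = df['A'].eq(2)
# per-(A, B) group sums, broadcast back to every row
summed = df.groupby(['A', 'B'])['C'].transform('sum')
# replace C where A == 2, then keep only the first row of each summed group
out = (df.assign(C=df['C'].mask(m, summed))
         .loc[~(m & df.duplicated(['A', 'B']))]
         .reset_index(drop=True))
print(out)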
I am trying to duplicate a result from R in Python. The function I want to apply (np.diff) takes an input and returns an array of the same size, but when I group, I get an output the size of the number of groups rather than the number of rows.
Example DataFrame:
df = pd.DataFrame({'sample':[1,1,1,1,1,2,2,2,2,2],'value':[1,2,3,4,5,1,3,2,4,3]})
If I apply diff to the whole column I get close to the result I want, except at the group border. The -4 value at row 5 is the problem.
x = np.diff([df.loc[:,'value']], 1, prepend=0)[0]
df.loc[:,'delta'] = x
   sample  value  delta
0       1      1      1
1       1      2      1
2       1      3      1
3       1      4      1
4       1      5      1
5       2      1     -4
6       2      3      2
7       2      2     -1
8       2      4      2
9       2      3     -1
I think the answer is to use groupby with apply or transform, but I cannot figure out the syntax. The closest I can get is:
df.groupby('sample').apply(lambda df: np.diff(df['value'], 1, prepend=0))
sample
1     [1, 1, 1, 1, 1]
2    [1, 2, -1, 2, -1]
dtype: object
You can use DataFrameGroupBy.diff, replace the missing first value of each group with 1, and then convert the values to integers:
df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print(df)
   sample  value  delta
0       1      1      1
1       1      2      1
2       1      3      1
3       1      4      1
4       1      5      1
5       2      1      1
6       2      3      2
7       2      2     -1
8       2      4      2
9       2      3     -1
Your solution can also be changed to use GroupBy.transform: specify the column to process after the groupby and drop the explicit column selection inside the lambda:
df['delta'] = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend=0))
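One caveat worth noting (my observation): fillna(1) only reproduces np.diff(..., prepend=0) here because each group happens to start with the value 1. To emulate prepend=0 in general, fill each group's first difference with the value itself:
# the first diff in each group is NaN; prepend=0 means it should equal the value
df['delta'] = df.groupby('sample')['value'].diff().fillna(df['value']).astype(int)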
I want to replace cluster labels that occur too few times with some placeholder text. Here is a fictitious example:
id  cluster
1   3
2   3
3   3
4   1
5   5
So the clusters for ids 4 and 5 should be replaced by some text.
I'm able to find which values have a frequency of less than 3 using:
counts = distclust.groupby("cluster")["cluster"].count()
counts[counts < 3].index.values
Now, I'm not sure how to go and replace these values in my dataframe with some arbitrary text (e.g. "noise").
I think that is enough information; let me know if you'd like me to include anything else.
In [82]: df.groupby('cluster').filter(lambda x: len(x) <= 2)
Out[82]:
   id  cluster
3   4        1
4   5        5
Updating:
In [95]: idx = df.groupby('cluster').filter(lambda x: len(x) <= 2).index
In [96]: df.loc[idx, 'cluster'] = -999
In [97]: df
Out[97]:
   id  cluster
0   1        3
1   2        3
2   3        3
3   4     -999
4   5     -999
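The same can be done without materializing the filtered frame first (my sketch, same threshold): GroupBy.transform('size') gives each row the size of its own cluster, which works directly as a boolean mask:
# per-row cluster size, aligned to the original index
sizes = df.groupby('cluster')['cluster'].transform('size')
df.loc[sizes < 3, 'cluster'] = -999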
Alternatively, build a replacement mapping from value_counts (using < 3 to match the question's threshold):
df.cluster.replace((df.cluster.value_counts() < 3).replace({True: 'noise', False: np.nan}).dropna())
Out[627]:
0        3
1        3
2        3
3    noise
4    noise
Name: cluster, dtype: object
Then assign it back:
df.cluster = df.cluster.replace((df.cluster.value_counts() < 3).replace({True: 'noise', False: np.nan}).dropna())
df
Out[629]:
   id cluster
0   1       3
1   2       3
2   3       3
3   4   noise
4   5   noise
I have this Pandas dataframe, which is a single-year snapshot:
data = pd.DataFrame({'ID': (1, 2),
                     'area': (2, 3),
                     'population': (100, 200),
                     'demand': (100, 200)})
I want to make this into a time series where population grows by 10% per year and demand grows by 20% per year. In this example I do this for two extra years.
This should be the output (note: it includes an added 'year' column):
output = pd.DataFrame({'ID': (1, 2, 1, 2, 1, 2),
                       'year': (1, 1, 2, 2, 3, 3),
                       'area': (2, 3, 2, 3, 2, 3),
                       'population': (100, 200, 110, 220, 121, 242),
                       'demand': (100, 200, 120, 240, 144, 288)})
Setup variables:
k = 5 #Number of years to forecast
a = 1.20 #Demand Growth
b = 1.10 #Population Growth
Forecast dataframe:
df_out = (data[['ID', 'area']]
          .merge(pd.concat([data[['demand', 'population']].mul([pow(a, i), pow(b, i)])
                                                          .assign(year=i + 1)
                            for i in range(k)]),
                 left_index=True, right_index=True)
          .sort_values(by='year'))
print(df_out)
Output:
   ID  area  demand  population  year
0   1     2  100.00      100.00     1
1   2     3  200.00      200.00     1
0   1     2  120.00      110.00     2
1   2     3  240.00      220.00     2
0   1     2  144.00      121.00     3
1   2     3  288.00      242.00     3
0   1     2  172.80      133.10     4
1   2     3  345.60      266.20     4
0   1     2  207.36      146.41     5
1   2     3  414.72      292.82     5
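A small follow-up (my note): the question only asked for two extra years, which corresponds to k = 3 here, and the grown columns can be rounded back to integers to match the expected output exactly:
# assuming df_out was built with k = 3 (year 1 plus two extra years)
df_out = df_out.assign(demand=df_out['demand'].round().astype(int),
                       population=df_out['population'].round().astype(int))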
Another approach:
- create a numpy array with [1.2, 1.1] that I repeat and cumprod
- prepend a row of ones [1.0, 1.0] to account for the initial condition
- multiply by the values of a conveniently stacked pd.Series
- feed the result into the pd.DataFrame constructor
- clean up indices and whatnot
k = 5
cols = ['ID', 'area']

cum_ret = np.vstack(
    [np.ones((1, 2)),
     np.array([[1.2, 1.1]])[[0] * k].cumprod(0)])[:, [0, 0, 1, 1]]

# select demand before population explicitly, so the column order
# matches the [1.2, 1.1] growth factors above
s = data.set_index(cols)[['demand', 'population']].unstack(cols)

pd.DataFrame(
    cum_ret * s.values,
    columns=s.index
).stack(cols).reset_index(cols).reset_index(drop=True)
    ID  area   demand  population
0    1     2  100.000     100.000
1    2     3  200.000     200.000
2    1     2  120.000     110.000
3    2     3  240.000     220.000
4    1     2  144.000     121.000
5    2     3  288.000     242.000
6    1     2  172.800     133.100
7    2     3  345.600     266.200
8    1     2  207.360     146.410
9    2     3  414.720     292.820
10   1     2  248.832     161.051
11   2     3  497.664     322.102
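For comparison, a more explicit loop-based sketch that builds the same forecast year by year (my own, reusing data, k, a and b from above):
import pandas as pd

# one frame per year, with demand and population grown by their factors
frames = [data.assign(year=i + 1,
                      demand=data['demand'] * a**i,
                      population=data['population'] * b**i)
          for i in range(k)]
forecast = pd.concat(frames, ignore_index=True)
print(forecast)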