I have a data frame with many columns that I need to divide by one column to compute proportions. Can someone help with a for loop for this? In the example data below, I want to add columns c1p = c1/ct, c2p = c2/ct, c3p = c3/ct, and c4p = c4/ct.
id c1 c2 c3 c4 ct
1 6 8 8 12 34
2 5 3 11 6 25
3 3 9 6 12 30
4 14 10 10 3 37
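For reference, here is a minimal sketch of how to build this example frame (assuming pandas is imported as pd):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'c1': [6, 5, 3, 14],
                   'c2': [8, 3, 9, 10],
                   'c3': [8, 11, 6, 10],
                   'c4': [12, 6, 12, 3],
                   'ct': [34, 25, 30, 37]})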
The power of DataFrames is that you can do stuff column by column. Like this:
for i in range(1, 5):
    df[f'c{i}p'] = df[f'c{i}'] / df['ct']
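This loops over the column suffixes 1 through 4 and creates each proportion column in place; the f-strings build the matching source and target column names.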
Here is a vectorized approach that builds another dataframe df_p containing all the proportion columns you need:
df_p = df.filter(regex='c[0-9]').divide(df['ct'], axis=0)
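If you then want the new columns merged back into df with the p suffix from the question, one way (a sketch using add_suffix) is:

df = df.join(df_p.add_suffix('p'))  # adds c1p, c2p, c3p, c4p alongside the originals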
The basic idea is that I have a computation that involves multiple columns from a dataframe and returns multiple columns, which I'd like to integrate into the dataframe.
I'd like to do something like this:
df = pd.DataFrame({'id':['i1', 'i1', 'i2', 'i2'], 'a':[1,2,3,4], 'b':[5,6,7,8]})
def custom_f(a, b):
    computation = a + b
    return computation + 1, computation * 2
df['c1'], df['c2'] = df.groupby('id').apply(lambda x: custom_f(x.a, x.b))
Desired output:
id a b c1 c2
0 i1 1 5 7 12
1 i1 2 6 9 16
2 i2 3 7 11 20
3 i2 4 8 13 24
I know how I could do this one column at a time, but in reality the 'computation' operation using the two columns is quite expensive so I'm trying to figure out how I could only run it once.
EDIT: I realised that the given example can be solved without the groupby, but in my actual 'computation' I need the groupby because I use the first and last values of the arrays in each group. For the sake of simplicity I omitted that, but imagine that it is needed.
df['c1'], df['c2'] = custom_f(df['a'], df['b'])  # you don't need apply for your desired output here
You can try:
def custom_f(a, b):
    computation = a + b
    return pd.concat([computation + 1, computation * 2], axis=1)
Finally:
df[['c1', 'c2']] = df.groupby('id').apply(lambda x: custom_f(x.a, x.b)).values
Output of df:
id a b c1 c2
0 i1 1 5 7 12
1 i1 2 6 9 16
2 i2 3 7 11 20
3 i2 4 8 13 24
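Since the real computation reportedly needs the first and last values within each group, here is a hedged sketch of that shape; the iloc[0]/iloc[-1] logic is hypothetical and stands in for your actual computation:

def custom_f(g):
    # hypothetical: combine the first value of 'a' with the last value of 'b'
    base = g['a'].iloc[0] + g['b'].iloc[-1]
    # broadcast the two scalar results across all rows of the group
    return pd.DataFrame({'c1': base + 1, 'c2': base * 2}, index=g.index)

df[['c1', 'c2']] = df.groupby('id', group_keys=False).apply(custom_f)

The expensive part runs once per group, and the result aligns back to the original rows via each group's index.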
I have a dataframe that uses MultiIndex for both index and columns.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.MultiIndex.from_product([[1, 2], [1, 2, 3], [4, 5]], names=['i', 'j', 'k']),
                  columns=pd.MultiIndex.from_product([[1, 2], [1, 2]], names=['x', 'y']))
for c in df.columns:
    df[c] = np.random.randint(100, size=(12, 1))
x 1 2
y 1 2 1 2
i j k
1 1 4 10 13 0 76
5 92 37 52 40
2 4 88 77 50 22
5 75 31 19 1
3 4 61 23 5 47
5 43 68 10 21
2 1 4 23 15 17 5
5 47 68 6 94
2 4 0 12 24 54
5 83 27 46 19
3 4 7 22 5 15
5 7 10 89 79
I want to group the values by a name in the index and by a name in the columns.
For each such group, we will have a 2D array of numbers (rather than a Series). I want a single std() aggregated over all entries of that 2D array.
For example, let's say I groupby ['i', 'x'], one group would be with values of i=1 and x=1. I want to compute std for each of these 2D arrays and produce a DataFrame with i values as index and x values as columns.
What is the best way to achieve this?
If I do stack() to get x as an index, I will still be computing several std() instead of one as there will still be multiple columns.
You can use nested list comprehensions. For your example, with that kind of DataFrame (not the identical one, since the values are random; you may want to fix a seed so that results are comparable) and i and x as the levels of interest, it works like this:
# get the unique values of the top-level row index, in a stable order
rows = sorted(set(df.index.get_level_values(0)))
# get the unique values of the top-level column index, in a stable order
columns = sorted(set(df.columns.get_level_values(0)))
# for every sub-dataframe (every combination of top-level labels),
# compute the sample standard deviation (ddof=1) across all its values
df_groupSD = pd.DataFrame([[df.loc[(row,)][(col,)].values.std(ddof=1)
                            for col in columns] for row in rows],
                          index=rows, columns=columns)
# show result
display(df_groupSD)
Output:
1 2
1 31.455115 25.433812
2 29.421699 33.748962
There may be better ways, of course.
You can use stack to move the 'y' level of the columns into the index and then group by 'i' only to get:
print(df.stack(level='y').groupby(['i']).std())
x 1 2
i
1 32.966811 23.933462
2 28.668825 28.541835
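Note that groupby's std() uses ddof=1 by default, so this is the same sample standard deviation as in the list-comprehension answer; the numbers differ only because the random example data is regenerated on each run.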
Try the following code:
df.groupby(level=0).apply(lambda grp: grp.stack().std())
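Another option is to stack both column levels into the index, so that each (i, x) group becomes a single flat Series and exactly one std is computed per group; a sketch:

res = df.stack(['x', 'y']).groupby(level=['i', 'x']).std().unstack('x')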
I am trying to group by several columns. With one column it is easy, just df.groupby('c1')['c3'].sum().
Original dataframe
c1 c2 c3
1 1 2
1 2 12
2 1 87
2 2 12
2 3 87
2 3 13
Desired result
c2 c3(c1_1) c3(c1_2)
1 2 87
2 12 12
3 (0?) 100
Here c3(c1_1) means the sum of column c3 where c1 has the value 1.
I have no idea how to apply groupby to this. It would be nice if someone could show not only how to solve it, but also what to read so I can avoid asking such basic questions.
You can group by multiple columns at once by providing a list to groupby. If you don't mind the output being formatted slightly differently, you can achieve this result through
In [32]: df.groupby(['c2', 'c1']).c3.sum().unstack(fill_value=0)
Out[32]:
c1 1 2
c2
1 2 87
2 12 12
3 0 100
With a bit of work, this can be massaged into the form you give as well.
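For example, a sketch of that massaging (the c3(c1_...) names simply mirror your desired output):

res = df.groupby(['c2', 'c1']).c3.sum().unstack(fill_value=0)
res.columns = [f'c3(c1_{c})' for c in res.columns]  # label columns as in the question
res = res.reset_index()  # turn the c2 index back into a regular column

For background reading, the pandas user guide chapters on groupby and on reshaping (unstack, pivot_table) cover exactly this pattern.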
Y6=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
Y6=pd.DataFrame(data=Y6)
for i in Y6:
    df[i] = Y6.iloc[i:i+1]
print(df[2])
Desired Output -
df[1]=[1,2]
df[2]=[3,4]
I would like to split this into 10 dataframes with 2 components in each dataframe.
You just want a dataframe of every two values? I'm still a bit confused about what you are looking for:
dfs = list()
for x in range(0, len(Y6), 2):
    df = Y6.iloc[x:x+2].T
    df.columns = ['one', 'two']
    dfs.append(df)

for df in dfs:
    print(df)
    print()
Result is 10 dataframes, each with one row, each with 2 items from the original df:
one two
0 1 2
one two
0 3 4
one two
0 5 6
one two
0 7 8
one two
0 9 10
one two
0 11 12
one two
0 13 14
one two
0 15 16
one two
0 17 18
one two
0 19 20
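If all you need is ten consecutive two-row frames, numpy's array_split is a one-line alternative (a sketch, assuming Y6 has 20 rows):

import numpy as np

dfs = np.array_split(Y6, 10)  # a list of 10 DataFrames, 2 rows each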
I'm trying to take an existing DataFrame and append a new column.
Let's say I have this DataFrame (just some random numbers):
a b c d e
0 2.847674 0.890958 -1.785646 -0.648289 1.178657
1 -0.865278 0.696976 1.522485 -0.248514 1.004034
2 -2.229555 -0.037372 -1.380972 -0.880361 -0.532428
3 -0.057895 -2.193053 -0.691445 -0.588935 -0.883624
And I want to create a new column 'f' that is the dot product of each row with a 'costs' vector, for instance [1, 0, 0, 0, 0]. So, for row zero, the value in column f should be 2.847674.
Here's the function I currently use:
def addEstimate(df, costs):
    cols = df.columns  # capture the original columns before 'f' is added
    for i in df.index:
        # .ix is removed in modern pandas; .loc does the same job here
        df.loc[i, 'f'] = np.dot(costs, df.loc[i, cols])
I'm doing this with a 15-element vector over ~20k rows, and I'm finding it super slow (half an hour). I suspect that iterating with iterrows and assigning with .loc one row at a time is inefficient, but I'm not sure how to correct this.
Is there a way that I can apply this to the entire DataFrame at once, rather than looping through rows? Or do you have other suggestions to speed this up?
You can create the new column with df['f'] = df.dot(costs).
dot is already a DataFrame method: applying it to the DataFrame as a whole will be much quicker than looping over the DataFrame and applying np.dot to individual rows.
For example:
>>> df # an example DataFrame
a b c d e
0 0 1 2 3 4
1 12 13 14 15 16
2 24 25 26 27 28
3 36 37 38 39 40
>>> costs = [1, 0, 0, 0, 2]
>>> df['f'] = df.dot(costs)
>>> df
a b c d e f
0 0 1 2 3 4 8
1 12 13 14 15 16 44
2 24 25 26 27 28 80
3 36 37 38 39 40 116
Pandas has a dot function as well. Does
df['dotproduct'] = df.dot(costs)
do what you are looking for?
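One caveat: when costs is a plain list, its length must match the number of columns; if you pass it as a Series instead (e.g. pd.Series(costs, index=df.columns)), dot aligns on the column labels rather than on position.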