I have a dataframe that looks like this:
ids value
1 0.1
1 0.2
1 0.14
2 0.22
....
I am trying to loop through each id and calculate new columns for each group:
for id, row in df.groupby('ids'):
    x = row.loc[0, 'value']
    for i in range(len(row)):
        row.loc[i, 'new_col_1'] = i * x
        row.loc[i, 'new_col_2'] = i * x * 10
My goal is to add the two new columns for each id back to the original dataframe, so that df looks like this:
ids value new_col_1 new_col_2
1 0.1 0 0
1 0.2 0.2 2
1 0.14 0.28 2.8
2 0.22 0 0
....
cumcount
With a little NumPy broadcasting sprinkled in, cumcount gets you the for i in range(len(df)) part:
df.groupby('ids').cumcount()
0 0
1 1
2 2
3 0
dtype: int64
c = df.groupby('ids').cumcount()   # per-group counter: 0, 1, 2, ...
v = df.value

df.join(
    pd.DataFrame(
        (c.values * v.values)[:, None] * [1, 10],   # broadcast into two columns
        df.index,
    ).rename(columns=lambda x: f"new_col_{x + 1}")
)
ids value new_col_1 new_col_2
0 1 0.10 0.00 0.0
1 1 0.20 0.20 2.0
2 1 0.14 0.28 2.8
3 2 0.22 0.00 0.0
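For readability, the same result can be written with assign; a minimal sketch (the variable names are mine, not from the original answer):
c = df.groupby('ids').cumcount()          # per-group counter: 0, 1, 2, ...
out = df.assign(new_col_1=c * df['value'],
                new_col_2=c * df['value'] * 10)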
Related
So I have a data frame called df. It looks like this.
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
I want to sum up the columns and divide the sum of the columns by the sum of the rows.
So, for example:
row 1, column 0: (1+4+7)/(1+2+3)
row 2, column 0: (1+4+7)/(4+5+6)
and so on, so that my final result looks like this:
     0      1     2
0  2.0  2.500  3.00
1  0.8  1.000  1.20
2  0.5  0.625  0.75
How do I do this in Python using pandas DataFrames?
You can also do it this way:
import numpy as np

a = df.to_numpy()
# divide is a ufunc (universal function) in numpy; all ufuncs support
# outer. np.divide.outer(x, y)[i, j] == x[i] / y[j], so transpose the
# result to put the row sums along the rows.
b = np.divide.outer(a.sum(0), a.sum(1)).T
out = pd.DataFrame(b, index=df.index, columns=df.columns)
output:
0 1 2
0 2.0 2.500 3.00
1 0.8 1.000 1.20
2 0.5 0.625 0.75
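A quick sanity check of the outer orientation, using the sums from the example (12, 15, 18 are the column sums; 6, 15, 24 the row sums):
import numpy as np
x = np.array([12., 15., 18.])    # column sums
y = np.array([ 6., 15., 24.])    # row sums
np.divide.outer(x, y)[0, 1]      # x[0] / y[1] = 0.8
np.divide.outer(x, y).T[0, 1]    # x[1] / y[0] = 2.5, matching the output above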
You can use the underlying numpy array:
a = df.to_numpy()
out = pd.DataFrame(a.sum(0) / a.sum(1)[:, None],
                   index=df.index, columns=df.columns)
output:
0 1 2
0 2.0 2.500 3.00
1 0.8 1.000 1.20
2 0.5 0.625 0.75
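The [:, None] is what makes the broadcasting work: it turns the row sums into a column vector, so each row of column sums is divided by a single row sum. A quick shape check on the example array:
a.sum(0).shape                         # (3,)   column sums
a.sum(1)[:, None].shape                # (3, 1) row sums as a column vector
(a.sum(0) / a.sum(1)[:, None]).shape   # (3, 3) via broadcasting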
I have a pandas dataframe as follows:
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
polyid value
0 1 0.56
1 1 0.59
2 1 0.62
3 1 0.83
4 2 0.85
5 2 0.01
6 2 0.79
7 3 0.37
8 3 0.99
9 3 0.48
10 3 0.55
11 3 0.06
I need to reclassify the 'value' column separately for each 'polyid'. For the reclassification I have two dictionaries: one with the bins that define how to cut the values for each 'polyid':
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
And one with the ids with which I want to label the resulting bins:
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
I tried to adapt this answer to my use case. The best I could come up with was applying pd.cut to each 'polyid' subset and then using pd.concat to combine the subsets back into one dataframe:
import pandas as pd

def reclass_df_dic(df, bins_dic, names_dic, bin_key_col, val_col, name_col):
    df_lst = []
    for key in df[bin_key_col].unique():
        bins = bins_dic[key]
        names = names_dic[key]
        sub_df = df[df[bin_key_col] == key]
        sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
        df_lst.append(sub_df)
    return pd.concat(df_lst)
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
df = reclass_df_dic(df, bins_dic, ids_dic, 'polyid', 'value', 'id')
This results in my desired output:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
However, the line:
sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
raises the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
that I am unable to resolve using .loc. Also, I guess there is generally a more efficient way of doing this without having to loop over each category?
A simpler solution would be to use groupby and apply a custom function to each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:
def reclass(group, name):
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

df['id'] = df.groupby('polyid')['value'].apply(lambda x: reclass(x, x.name))
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
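The helper can also be inlined. A minimal sketch, assuming the bins_dic and ids_dic defined above (inside a groupby-apply, g.name holds the group key):
df['id'] = (df.groupby('polyid')['value']
              .apply(lambda g: pd.cut(g, bins_dic[g.name], labels=ids_dic[g.name])))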
Suppose I have a dataset that looks something like:
INDEX A B C
1 1 1 0.75
2 1 1 1
3 1 0 0.35
4 0 0 1
5 1 1 0
I want to get a dataframe that looks like the following, with the original columns, and all possible interactions between columns:
INDEX A B C A_B A_C B_C
1 1 1 0.75 1 0.75 0.75
2 1 1 1 1 1 1
3 1 0 0.35 0 0.35 0
4 0 0 1 0 0 0
5 1 1 0 1 0 0
My actual datasets are pretty large (~100 columns). What is the fastest way to achieve this?
I could, of course, do a nested loop, or similar, to achieve this but I was hoping there is a more efficient way.
You could use itertools.combinations for this:
>>> import pandas as pd
>>> from itertools import combinations
>>> df = pd.DataFrame({
... "A": [1,1,1,0,1],
... "B": [1,1,0,0,1],
... "C": [.75,1,.35,1,0]
... })
>>> df.head()
A B C
0 1 1 0.75
1 1 1 1.00
2 1 0 0.35
3 0 0 1.00
4 1 1 0.00
>>> for col1, col2 in combinations(df.columns, 2):
... df[f"{col1}_{col2}"] = df[col1] * df[col2]
...
>>> df.head()
A B C A_B A_C B_C
0 1 1 0.75 1 0.75 0.75
1 1 1 1.00 1 1.00 1.00
2 1 0 0.35 0 0.35 0.00
3 0 0 1.00 0 0.00 0.00
4 1 1 0.00 1 0.00 0.00
If you need to apply an arbitrary function to the pairs of columns, you could use np.vectorize (note that it is a convenience wrapper around a Python loop, not true vectorization):
import numpy as np

def fx(x, y):
    return np.multiply(x, y)

for col1, col2 in combinations(df.columns, 2):
    df[f"{col1}_{col2}"] = np.vectorize(fx)(df[col1], df[col2])
I am not aware of a native pandas function to solve this, but itertools.combinations would be an improvement over a nested loop.
You could do something like:
import pandas as pd
from itertools import combinations

df = pd.DataFrame(data={"A": [1, 1, 1, 0, 1],
                        "B": [1, 1, 0, 0, 1],
                        "C": [0.75, 1, 0.35, 1, 0]})

for comb in combinations(df.columns, 2):
    col_name = comb[0] + "_" + comb[1]
    df[col_name] = df[comb[0]] * df[comb[1]]
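Since the question mentions ~100 columns, a fully vectorized alternative may be worth sketching; this computes all pairwise products in a single NumPy operation (the approach and names here are mine, not from the answers above):
import numpy as np
import pandas as pd

a = df.to_numpy()
i, j = np.triu_indices(a.shape[1], k=1)          # index pairs for all columns i < j
names = [f"{df.columns[x]}_{df.columns[y]}" for x, y in zip(i, j)]
out = df.join(pd.DataFrame(a[:, i] * a[:, j],    # one product column per pair
                           index=df.index, columns=names))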
Hello, I have a problem for which I am not able to implement a solution.
I have following two DataFrames:
>>> df1
A B date
1 1 01-2016
2 1 02-2017
1 2 03-2017
2 2 04-2020
>>> df2
A B 01-2016 02-2017 03-2017 04-2020
1 1 0.10 0.22 0.55 0.77
2 1 0.20 0.12 0.99 0.125
1 2 0.13 0.15 0.15 0.245
2 2 0.33 0.1 0.888 0.64
What I want is the following DataFrame:
>>> df3
A B date value
1 1 01-2016 0.10
2 1 02-2017 0.12
1 2 03-2017 0.15
2 2 04-2020 0.64
I already tried the following:
summarize_dates = self.summarize_specific_column(data=df1, column='date')
for date in summarize_dates:
    left_on = np.append(left_on, date)
    right_on = np.append(right_on, merge_columns.upper())

result = pd.merge(left=df2, right=df1,
                  left_on=left_on, right_on=right_on,
                  how='right')
print(result)
This does not work. Can you help me and suggest a cleaner implementation? Many thanks in advance!
You can melt df2 and then merge using the default 'inner' merge:
df3 = df1.merge(df2.melt(id_vars=['A', 'B'], var_name='date'))
A B date value
0 1 1 01-2016 0.10
1 2 1 02-2017 0.12
2 1 2 03-2017 0.15
3 2 2 04-2020 0.64
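For reference, df2.melt(id_vars=['A', 'B'], var_name='date') unpivots the date columns into rows, which is what lets the merge line up (first rows shown):
   A  B     date  value
0  1  1  01-2016   0.10
1  2  1  01-2016   0.20
2  1  2  01-2016   0.13
3  2  2  01-2016   0.33
...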
Using lookup:
df1['value'] = df2.set_index(['A','B']).lookup(df1.set_index(['A','B']).index, df1.date)
df1
Out[228]:
A B date value
0 1 1 01-2016 0.10
1 2 1 02-2017 0.12
2 1 2 03-2017 0.15
3 2 2 04-2020 0.64
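Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of an equivalent using index lookups (my code, not part of the original answer):
wide = df2.set_index(['A', 'B'])
rows = wide.index.get_indexer(pd.MultiIndex.from_frame(df1[['A', 'B']]))
cols = wide.columns.get_indexer(df1['date'])
df1['value'] = wide.to_numpy()[rows, cols]   # one value per (A, B, date) triple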
I have two DataFrames and want to use the second one only on the rows whose index is not already contained in the first one.
What is the most efficient way to do this?
Example:
df_1
idx val
0 0.32
1 0.54
4 0.26
5 0.76
7 0.23
df_2
idx val
1 10.24
2 10.90
3 10.66
4 10.25
6 10.13
7 10.52
df_final
idx val
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
Recap: I need to add the rows in df_2 for which the index is not already in df_1.
EDIT
Removed some indices from df_2 to illustrate that not all indices from df_1 are covered by df_2.
You can use combine_first, or reindex to the union of both indexes and fillna (reindexing to df_2.index alone would drop rows 0 and 5, which exist only in df_1):
df = df_1.combine_first(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
df = df_1.reindex(df_1.index.union(df_2.index)).fillna(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
You can achieve the desired output by using the combine_first method of the DataFrame. From the documentation of the method:
Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
Example usage:
import pandas as pd
df_1 = pd.DataFrame([0.32,0.54,0.26,0.76,0.23], columns=['val'], index=[0,1,4,5,7])
df_1.index.name = 'idx'
df_2 = pd.DataFrame([10.24,10.90,10.66,10.25,10.13,10.52], columns=['val'], index=[1,2,3,4,6,7])
df_2.index.name = 'idx'
df_final = df_1.combine_first(df_2)
This will give the desired result:
In [7]: df_final
Out[7]:
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23