I have a DF with columns A and B, and I would like to add an additional column C which will include the combination of the A and B values per row. I.e., if I have a DF:
A B
0 1 1
1 1 2
2 2 1
3 2 2
I would like to create:
A B C
0 1 1 1_1
1 1 2 1_2
2 2 1 2_1
3 2 2 2_2
Obviously, I can go over all rows of the DF and just merge the values, which is very SLOW for large tables. I can also use .unique() on columns A and B, creating vectors col1_un and col2_un respectively, iterate over all combinations, and update the relevant indexes in the table using something like:
cols_2_merge = ['A', 'B']
new_col_name = 'C'
col1_un = DF[cols_2_merge[0]].unique()
col2_un = DF[cols_2_merge[1]].unique()
for i in range(len(col1_un)):
    try:
        ind1 = np.where(DF[cols_2_merge[0]].str.contains(col1_un[i], na=False))[0]
    except AttributeError:  # column is not of string dtype
        ind1 = np.where(DF[cols_2_merge[0]] == col1_un[i])[0]
    for j in range(len(col2_un)):
        try:
            ind2 = np.where(DF[cols_2_merge[1]].str.contains(col2_un[j], na=False))[0]
        except AttributeError:
            ind2 = np.where(DF[cols_2_merge[1]] == col2_un[j])[0]
        new_ind = col1_un[i] + '-' + col2_un[j]
        tmp_ind = np.in1d(ind1, ind2)
        ind = ind1[tmp_ind]
        if len(ind) > 0:
            DF[new_col_name][ind] = new_ind
This is still SLOW. I can speed it up a bit by not searching over the entire DF but restricting the search to indexes that haven't been changed so far. Still SLOW.
There is the option of groupby, which does exactly what I want: it finds all unique pairs of combinations of the two columns, and it's relatively fast, but I haven't figured out how to access the index of the original DF for each group.
Help please?
You can do this without using groupby; just use the fact that on strings + means concatenation, and that pandas does this elementwise on Series:
df['C'] = df['A'].astype(str) + '_' + df['B'].astype(str)
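If there are many columns to combine, a more general sketch (the cols list here is an assumption for illustration):

cols = ['A', 'B']  # the columns to combine
df['C'] = df[cols].astype(str).agg('_'.join, axis=1)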
@joris - thank you very much.
It did work, of course! FAST, I should add :-)
For more complicated group-based combinations one can use:
GB = DF[cols_2_merge].groupby(cols_2_merge)
for i in GB.groups:
    ...  # do whatever you want with each group
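For reference, .groups maps each (A, B) pair back to the original DF's index labels, which is the per-group index access asked about above; a minimal sketch (output shown approximately):

GB = DF[cols_2_merge].groupby(cols_2_merge)
print(GB.groups)
# roughly: {(1, 1): [0], (1, 2): [1], (2, 1): [2], (2, 2): [3]}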
Thanks again!
edit: I understand how to get the actual values, but I wonder how to append a row with these 2 sums to the existing df?
I have a dataframe score_card that looks like:
   15min_colour  15min_high  15min_price  30min_colour  30min_high  30min_price
0             1           1           -1             1          -1            1
1             1          -1            1             1           1            1
2            -1           1           -1             1           1            1
3            -1           1           -1             1          -1            1
Now I'd like to add a row that sums up all the 15min numbers (first 3 columns), the 30min numbers, and so on (the actual df is larger). That means I don't want to add up the individual columns but rather take the sum of the columns' sums. The row I'd like to add would look like:
   sum_15min_colour&15min_high&15min_price  sum_30min_colour&30min_high&30min_price
0                                         0                                         8
Please disregard the header, it's only to clarify what I'm intending to do.
I assume there's a multiindex involved, but I couldn't figure out how to apply it to my existing df to achieve the desired output.
Also, is it possible to add a column with the sum of the whole table?
Thanks for your support.
You can sum in this way:
np.sum(df.filter(like='15').values), np.sum(df.filter(like='30').values)
(0, 8)
groupby can take a callable (think: a function) and apply it to the index or the columns:
df.groupby(lambda x: x.split('_')[0], axis=1).sum().sum()
15min 0
30min 8
dtype: int64
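On newer pandas versions, where groupby(..., axis=1) is deprecated, an equivalent sketch transposes first:

# Group the transposed frame's index (the original column names) by prefix.
df.T.groupby(lambda x: x.split('_')[0]).sum().sum(axis=1)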
It depends on the axis. Simply put, this sums the values along axis 0, so in your case down the columns (it sums all the values in each column vertically):
df.sum(axis=0, skipna=True)
print(df)

sum_column = df["col1"] + df["col2"]
df["col3"] = sum_column
print(df)
So in your case:
summed0Axis = df.sum(axis = 0, skipna = True)
sum_column = summed0Axis["15min_colour"] + summed0Axis["15min_high"] + summed0Axis["15min_price"]
print(sum_column)
A more intelligent option: find all columns whose names contain "15" (and likewise "30"):
columnsWith15 = df.loc[:, df.columns.str.contains("15")]
columnsWith30 = df.loc[:, df.columns.str.contains("30")]
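From those selections the two totals can then be read off; a minimal sketch:

print(columnsWith15.values.sum())  # 0 for the sample data above
print(columnsWith30.values.sum())  # 8 for the sample data above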
I just wanted to ask the community and see if there is a more efficient way to do this.
I have several columns in a data frame and I am using .loc to filter values in column A so I can perform calculations on column B.
I can easily do something like...
filter_1 = df.loc[df['Condition'] == 1]
And then perform the mathematical calculation on column B that I need.
But there are many conditions I must go through, so I was wondering if I could possibly make a list of the conditions and then iterate them through the .loc function in fewer lines of code?
Would something like this work where I create a list, then iterate the conditions through a loop?
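E.g., something like this hypothetical sketch (the column names 'Condition' and 'B' are placeholders):

conditions = [1, 2, 3]  # hypothetical list of values to filter on
for c in conditions:
    subset = df.loc[df['Condition'] == c]
    print(subset['B'].mean())  # whatever calculation is needed on column B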
Thank you!
The example below gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I change the iteration so that it shows the results for the unique values in column 'a'?
import pandas as pd

a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a, b):
    list_1.append([i, j])
df1 = pd.DataFrame(list_1, columns=col)
for i in a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using set
set_a = set(a)
for i in set_a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using pandas mean function
Is this what you are looking for?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
df = pd.DataFrame({'a':a,'b':b})
print (df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
a b
0 1 5
1 2 1
2 1 3
3 2 5
4 1 7
5 2 20
6 1 9
7 2 5
8 1 8
9 2 4
The mean of 'b' for each value of 'a' is:
b
a
1 6.4
2 7.0
Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
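For a longer list of conditions, a sketch that combines them in one step (the masks are illustrative assumptions):

import numpy as np

masks = [df['A'] > 1, df['A'] < 10]  # any number of boolean masks
df = df[np.logical_and.reduce(masks)]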
Imagine that I have a DataFrame whose columns are [A, B, C]. There are some different values for each of these columns, and I want to produce one more column D which can be computed with the following function:
def produce_column(i):
    # Extract current row by index
    raw = df.loc[i]
    # Extract previous 3 values for the same sub-df which are before i
    df_same = df[
        (df['A'] == raw.A)
        & (df['B'] == raw.B)
    ].loc[:i].tail(3)
    # Check that we have enough values
    if df_same.shape[0] != 3:
        return False
    # Doesn't matter which function is in use, I just need to apply it on the column / columns
    diffs = df_same['C'].map(lambda x: x <= 10 and x > 0)
    return all(diffs)

df['D'] = df.index.map(lambda x: produce_column(x))
So on each step, I need to get the sub-DataFrame that has the same set of properties as the current row and perform some operations on its columns. I have a few hundred thousand rows, so this code takes a long time to execute. I think a good idea would be to vectorize the operation, but I don't know how to do that. Maybe there's another way to perform this?
Thanks in advance!
UPD Here's an example
df = pd.DataFrame([(1,2,3), (4,5,6), (7,8,9)], columns=['A','B','C'])
A B C
0 1 2 3
1 4 5 6
2 7 8 9
df['D'] = df.index.map(lambda x: produce_column(x))
A B C D
0 1 2 3 True
1 4 5 6 True
2 7 8 9 False
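One possible vectorized direction, as a sketch rather than a drop-in replacement (it assumes the "last 3 rows per (A, B) group, current row included" semantics of produce_column above):

# Compute the per-row condition once, then roll a 3-wide window within each (A, B) group.
cond = (df['C'].le(10) & df['C'].gt(0)).astype(int)
df['D'] = (
    cond.groupby([df['A'], df['B']])
        .rolling(3).min()                      # 1.0 only if the last 3 rows all satisfy cond
        .reset_index(level=[0, 1], drop=True)  # drop the group keys added by groupby
        .eq(1.0)                               # NaN (fewer than 3 rows) becomes False
)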
I have this dataframe:
I want to add each pair of columns, e.g. duration + credit_amount, so I have created the following algorithm:
def automate_add(add):
    for i, column in enumerate(df):
        for j, operando in enumerate(df):
            if column != operando:
                columnName = column + '_sum_' + operando
                add[columnName] = df[column] + df[operando]
with the output:
duration_sum_credit_amount
duration_sum_installment_commitment
credit_amount_sum_duration
credit_amount_sum_installment_commitment
installment_commitment_sum_duration
installment_commitment_sum_credit_amount
However, knowing that duration + credit_amount = credit_amount + duration, I wouldn't like to have repeated columns.
Expecting this result from the function:
duration_sum_credit_amount
duration_sum_installment_commitment
credit_amount_sum_installment_commitment
How can I do it?
I am trying to use hash sets, but they seem to work only on pandas Series [1].
EDIT:
Dataframe: https://www.openml.org/d/31
Use the below; it should work faster:
import itertools
my_list = [pd.Series(df.loc[:, list(i)].sum(axis=1),
                     name='_sum_'.join(df.loc[:, list(i)].columns))
           for i in itertools.combinations(df.columns, 2)]
final_df=pd.concat(my_list,axis=1)
print(final_df)
duration_sum_credit_amount duration_sum_installment_commitment \
0 1175 10
1 5999 50
2 2108 14
3 7924 44
4 4894 27
credit_amount_sum_installment_commitment
0 1173
1 5953
2 2098
3 7884
4 4873
Explanation:
print(list(itertools.combinations(df.columns,2))) gives:
[('duration', 'credit_amount'),
('duration', 'installment_commitment'),
('credit_amount', 'installment_commitment')]
After that, do:
for i in itertools.combinations(df.columns, 2):
    print(df.loc[:, list(i)])
    print("---------------------------")
This prints the combinations of columns together, so I just summed them on axis=1, wrapped each result in a pd.Series, and gave it a name by joining the column names.
After that, just append them to the list and concat them on axis=1 to get the final result. :)
You have already been pointed to itertools.combinations, which is the right tool here and will save you some for loops and the issue with repeated columns. See the documentation for more details about permutations, combinations, etc.
First, let's create the DataFrame so we can reproduce the example:
import pandas as pd
from itertools import combinations
df = pd.DataFrame({
'a': [1,2,3],
'b': [4,5,6],
'c': [7,8,9]
})
>>> df
a b c
0 1 4 7
1 2 5 8
2 3 6 9
Now let's get to work. The idea is to get all the combinations of the columns, then do a dictionary comprehension to return something like {column_name: sum}. Here it is:
>>> pd.DataFrame({c1 + '_sum_' + c2: df[c1] + df[c2]
for c1, c2 in combinations(df.columns, 2)})
a_sum_b a_sum_c b_sum_c
0 5 8 11
1 7 10 13
2 9 12 15
Notice you can replace sum with any other function that operates on two pd.Series.
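For instance, a sketch with products instead of sums (the _prod_ naming is just an illustration):

>>> pd.DataFrame({c1 + '_prod_' + c2: df[c1] * df[c2]
                  for c1, c2 in combinations(df.columns, 2)})
   a_prod_b  a_prod_c  b_prod_c
0         4         7        28
1        10        16        40
2        18        27        54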
The function can have one more if condition to check whether the reversed (commuted) addition has already been added as a column to the dataframe, like below:
def automate_add(add):
    columnLst = []
    # list of column names added so far, used to skip the reversed-sum columns
    for i, column in enumerate(df):
        for j, operando in enumerate(df):
            if column != operando:
                if operando + '_sum_' + column not in columnLst:
                    columnName = column + '_sum_' + operando
                    add[columnName] = df[column] + df[operando]
                    columnLst.append(columnName)
I haven't tested this on your data. Try it and let me know if it doesn't work.
I have a dataset, shown below.
What I would like to do is three things.
Step 1: AA to CC is an index; however, I'm happy to keep it in the dataset for future purposes.
Step 2: Count the 0 values in each row.
Step 3: If 0s make up more than 20% of a row, which means more than 2 in this case because DD to MM is 10 columns, remove the row.
So I used a stupid way to achieve the above three steps.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
Then I got the expected result, shown below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So I removed row #5 as indicated below.
df2 = df.drop([5], axis=0)
print(df2)
This works well, even though it's not elegant; kind of a stupid way to go, though.
However, if I import my dataset with header=0, this approach does not work at all:
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
How come this happens?
Also, if I wanted to write the code with loop, count, and drop functions, what would it look like?
You can just continue using boolean indexing.
First we calculate the number of columns and the number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only the rows that have less than 20% zeroes:
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
It would have been great if you had posted how the dataframe looks in pandas rather than a picture of an Excel file. However, constructing a dummy df:
df = pd.DataFrame({'index1': ['a','b','c'], 'index2': ['b','g','f'], 'index3': ['w','q','z'],
                   'Col1': [0,1,0], 'Col2': [1,1,0], 'Col3': [1,1,1], 'Col4': [2,2,0]})
Step 1, assigning the index, can be done using the .set_index() method as per below:
df.set_index(['index1','index2','index3'],inplace=True)
Instead of doing everything manually when it comes to filtering out, you can use the result you got from df_bool.sum(axis=1) to filter the dataframe as per below (here with a 60% threshold so the dummy data shows a match):
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
and using that you can drop those rows; assuming 20%, you would then use
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
When it comes to the header issue, it's a bit difficult to answer without seeing what the file or dataframe looks like. A likely cause, though, is dtype: with header=None the header row is read as data, which forces every column to strings, so df == "0" matches; with header=0 the columns are numeric, so comparing against the string "0" never matches (compare against the integer 0 instead).