I have two Pandas DataFrames (A and B), each with 2 columns and a different number of rows.
They used to be NumPy 2D matrices, and they both contain integer values.
Is there any way to retrieve the indices of matching rows between those two?
I've been trying isin(), query(), and merge(), without success.
This is actually a follow-up to a previous question: I'm trying with pandas DataFrames since the original matrices are rather huge.
The desired output, if possible, should be an array (or list) containing in the i-th position the row index in B for the i-th row of A. E.g. an output list of [1,5,4] means that the first row of A has been found in the first row of B, the second row of A has been found in the fifth row of B, and the third row of A has been found in the fourth row of B.
I would do it this way:
In [199]: df1.reset_index().merge(df2.reset_index(), on=['a','b'])
Out[199]:
   index_x  a  b  index_y
0        1  9  1       17
1        3  4  0        4
or like this:
In [211]: pd.merge(df1.reset_index(), df2.reset_index(), on=['a','b'], suffixes=['_1','_2'])
Out[211]:
   index_1  a  b  index_2
0        1  9  1       17
1        3  4  0        4
data:
In [201]: df1
Out[201]:
   a  b
0  1  9
1  9  1
2  8  1
3  4  0
4  2  0
5  2  2
6  2  9
7  1  1
8  4  3
9  0  4
In [202]: df2
Out[202]:
    a  b
0   3  5
1   5  0
2   7  8
3   6  8
4   4  0
5   1  5
6   9  0
7   9  4
8   0  9
9   0  1
10  6  9
11  6  7
12  3  3
13  5  1
14  4  2
15  5  0
16  9  5
17  9  1
18  1  6
19  9  5
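To turn either merge result into the list described in the question (the B row index for each row of A), one could sort by A's index and pull B's index column. A minimal sketch, with small hypothetical frames standing in for A and B:

```python
import pandas as pd

# small hypothetical frames standing in for A and B
A = pd.DataFrame({'a': [9, 4], 'b': [1, 0]})
B = pd.DataFrame({'a': [3, 9, 4], 'b': [5, 1, 0]})

m = A.reset_index().merge(B.reset_index(), on=['a', 'b'], suffixes=['_A', '_B'])
# one B row index per matched A row, in A's original row order
out = m.sort_values('index_A')['index_B'].tolist()
print(out)  # [1, 2]
```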
Without merging, you can compare with == and then check whether any value in each row is False.
df1 = pd.DataFrame({'a':[0,1,2,3,4],'b':[0,1,2,3,4]})
df2 = pd.DataFrame({'a':[0,1,2,3,4],'b':[2,1,2,2,4]})
test = pd.DataFrame(index=df1.index, columns=['test'])
for row in df1.index:
    if False in (df1 == df2).loc[row].values:
        test.loc[row, 'test'] = False
    else:
        test.loc[row, 'test'] = True
Out[1]:
test
0 False
1 True
2 True
3 False
4 True
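The loop above can be collapsed into a single vectorized expression with DataFrame.all over axis=1, which produces the same per-row result:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [0, 1, 2, 3, 4]})
df2 = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [2, 1, 2, 2, 4]})

# True on rows where every column matches
test = (df1 == df2).all(axis=1)
print(test.tolist())  # [False, True, True, False, True]
```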
Related
I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Use DataFrameGroupBy.diff, compare not equal 1 with Series.ne, take Series.cumsum, and last subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print(df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
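Both answers produce the same labels: the first diff in the series is NaN, so ne(1) is True there, the cumulative sum starts at 1 and grows by at most 1 per row, and therefore subtracting 1 and factorizing encode it identically. A quick check:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 10, 11, 12, 13, 18],
                   'B': [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]})
g = df.groupby('B')['A'].diff().ne(1).cumsum()
print(g.sub(1).tolist())       # [0, 0, 1, 2, 3, 4, 4, 4, 5, 6]
print(list(g.factorize()[0]))  # [0, 0, 1, 2, 3, 4, 4, 4, 5, 6]
```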
I have a DataFrame with one column and I would like to get a DataFrame with N columns, all of which are identical to the first one. I can simply duplicate a column by:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns, which is why that approach doesn't work. One important thing: the figures in the columns should change; for instance, if the first column is 0, the n-th column should be n and the previous one n-1.
If it's a single column, you can transpose it, replicate the rows with pd.concat, and transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second line, without touching the data in the DataFrame, which would be the most expensive part performance-wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
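None of the replication approaches here handles the question's second requirement, that the values shift by the column position. A sketch of that variant using NumPy broadcasting (the Column_0 … names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column': [0, 1, 2]})
n = 5
# add the column offset i to the i-th copy via broadcasting
out = pd.DataFrame(df['Column'].to_numpy()[:, None] + np.arange(n),
                   columns=[f'Column_{i}' for i in range(n)],
                   index=df.index)
print(out.iloc[0].tolist())  # [0, 1, 2, 3, 4]
```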
Say that you have a DataFrame with a column 'company_name' that consists of 8 companies:
df = pd.DataFrame({"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}})
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0, 1000):
    df['company_name' + str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated the same column 1000 times. The names of the duplicated columns are the original name plus an integer suffix:
'company_name', 'company_name0', 'company_name1', 'company_name2', ..., 'company_name999'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient way is to index with DataFrame.loc instead of using an outer loop:
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
                df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want add a suffix
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) +
                 not_dup_cols]
          .set_axis(list(map(lambda suffix: f'column_duplicate_{suffix}',
                             suffix_tup)) +
                    not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
or add an index
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n +
                 not_dup_cols]
          .set_axis(list(map(lambda x: f'column_duplicate_{x}', range(n))) +
                    not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['A','B','C'])
df[df['B']%2 ==0]['C'] = 5
I am expecting this code to change the value of column C to 5 wherever B is even, but it is not working.
It returns the table as follows:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I am expecting it to return
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
If you need to change values of a column in the DataFrame, use DataFrame.loc with the condition and the column name:
df.loc[df['B'] % 2 == 0, 'C'] = 5
print(df)
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
Your solution is a nice example of chained indexing - docs.
You could just change the order to:
df['C'][df['B']%2 == 0] = 5
This happens to work here, but it is still chained indexing: it relies on df['C'] returning a view, raises SettingWithCopyWarning, and will no longer modify df under copy-on-write (the default in pandas 3.0), so .loc is the safer choice.
Using numpy.where:
import numpy as np

df['C'] = np.where(df['B'] % 2 == 0, 5, df['C'])
Output
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
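For completeness, Series.mask does the same replacement without chained indexing; this is just an equivalent alternative, not the approach from the answers above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  columns=['A', 'B', 'C'])
# replace C with 5 wherever B is even
df['C'] = df['C'].mask(df['B'] % 2 == 0, 5)
print(df['C'].tolist())  # [5, 6, 5, 12]
```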
I'm having trouble working out how to add the index value of a pandas dataframe to each value at that index. For example, if I have a dataframe of zeroes, the row with index 1 should have a value of 1 for all columns. The row at index 2 should have values of 2 for each column, and so on.
Can someone enlighten me please?
You can use pd.DataFrame.add with axis=0. Just remember, as below, to convert your index to a Series first.
df = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 9 6 1 8 0
2 2 9 0 5 3
3 3 1 1 7 0
4 2 6 3 6 6
df = df.add(df.index.to_series(), axis=0)
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 10 7 2 9 1
2 4 11 2 7 5
3 6 4 4 10 3
4 6 10 7 10 10
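When the index is the default RangeIndex, the same result can also be obtained by broadcasting the row positions with NumPy. Note this sketch uses positions, not labels, so it only matches the add-based solution for a default integer index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 3), dtype=int))
# broadcast a (4, 1) column of row positions across all columns
out = df + np.arange(len(df))[:, None]
print(out[0].tolist())  # [0, 1, 2, 3]
```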
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
How can I insert a new row of zeros at index 0 in one single line?
I tried pd.concat([pd.DataFrame([[0,0,0]]), df]) but it did not work.
The desired output:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
You can concat the temp df with the original df, but you need to pass the same column names so that they align in the concatenated df. Additionally, to get the index you desire, call reset_index with the drop=True param.
In [87]:
pd.concat([pd.DataFrame([[0,0,0]], columns=df.columns),df]).reset_index(drop=True)
Out[87]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
Alternatively to EdChum's solution you can do this (note: DataFrame.append was removed in pandas 2.0, so on current versions use pd.concat instead):
In [163]: pd.DataFrame([[0,0,0]], columns=df.columns).append(df, ignore_index=True)
Out[163]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
An answer more specific to the DataFrame being prepended to:
pd.concat([df.iloc[[0], :] * 0, df]).reset_index(drop=True)
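One more variant, sketched here: assign a full row at label -1 (which sorts before 0) and then restore a clean RangeIndex:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]})
df.loc[-1] = 0                               # new all-zero row at label -1
df = df.sort_index().reset_index(drop=True)  # -1 sorts first, then relabel
print(df['a'].tolist())  # [0, 1, 2, 3, 4]
```

This mutates df in place before re-sorting, so it is two statements rather than the requested one-liner, but it avoids building a second DataFrame.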