Merge dataframes on index - python

So I have two dataframes, in pandas:
x = pd.DataFrame([[1,2],[3,4]])
>>> x
0 1
0 1 2
1 3 4
y = pd.DataFrame([[7,8],[5,6]])
>>> y
0 1
0 7 8
1 5 6
Clearly they are the same size. Now it seems you can do merges and joins on a selected column, but I can't seem to do it on an index. I want the outcome to be:
0 1 2 3
0 7 8 1 2
1 5 6 3 4

How about:
>>> x = pd.DataFrame([[1,2],[3,4]])
>>> y = pd.DataFrame([[7,8],[5,6]])
>>> df = pd.concat([y,x],axis=1,ignore_index=True)
>>> df
0 1 2 3
0 7 8 1 2
1 5 6 3 4
[2 rows x 4 columns]
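If you specifically want a merge keyed on the index rather than concat, pandas also accepts left_index/right_index. A minimal sketch (the renumbering at the end is only there to match the output above):
merged = pd.merge(y, x, left_index=True, right_index=True)  # overlapping column labels get suffixes, e.g. 0_x, 1_x, 0_y, 1_y
merged.columns = range(merged.shape[1])                     # renumber to 0..3 like the concat result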

Related

How can one duplicate columns N times in DataFrame?

I have a dataframe with one column and I would like to get a DataFrame with N columns, all of which are identical to the first one. I can duplicate a single column with:
df[['new column name']] = df[['column name']]
but I have to create more than 1000 identical columns, so doing it by hand doesn't work. One important thing: the figures in the columns should change with the column index, for instance if the first column is 0, the nth column is n and the previous one is n-1.
If it's a single column, you can transpose it, replicate it with pd.concat, and transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second step, which only touches the labels and not the data in the dataframe (which would be the expensive part performance-wise):
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
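The question also asks that the figures change with the column index (the nth column holding n more than the first). A minimal sketch, assuming the intended rule is "copy i = original column + i"; the name wide is illustrative:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column': [1, 2, 3, 4, 5]})
n = 1000
# Broadcast: copy i gets i added to the original values, so column 0 keeps
# the original figures, column 1 holds original + 1, and so on.
wide = pd.DataFrame(df['Column'].to_numpy()[:, None] + np.arange(n),
                    columns=['Column_' + str(i) for i in range(n)])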
Say that you have a df with a column named 'company_name' that consists of 8 companies:
df = pd.DataFrame({"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}})
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0, 1000):
    df['company_name' + str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. the same column is replicated 1000 times. The names of the duplicated columns are the original name with an increasing integer appended at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2','company_name..N'
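Note that inserting 1000 columns one at a time may emit a PerformanceWarning about a fragmented frame in recent pandas versions. A hedged alternative sketch (the name copies is illustrative) builds all copies first and concatenates once:
copies = pd.concat([df['company_name'].rename('company_name' + str(i))
                    for i in range(1000)], axis=1)
df = pd.concat([df, copies], axis=1)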
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient approach is to index with DataFrame.loc instead of using an explicit loop.
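For reference, the outputs below are printed from a hypothetical sample frame like this (reconstructed from the output, not given in the original answer):
import pandas as pd
df = pd.DataFrame({'column_duplicate': range(10), 'other': range(10, 20)})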
n = 3
new_df = df.loc[:, ['column_duplicate'] * n +
                df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want to add a suffix:
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate'] * len(suffix_tup) + not_dup_cols]
            .set_axis([f'column_duplicate_{suffix}' for suffix in suffix_tup] +
                      not_dup_cols, axis=1))
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
Or add a numeric index:
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate'] * n + not_dup_cols]
            .set_axis([f'column_duplicate_{i}' for i in range(n)] +
                      not_dup_cols, axis=1))
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19

How do I get dataframe values with multiindex where some value is NOT in multiindex?

Here is an example of my df:
2000-02-01 2000-03-01 ...
sub_col_one sub_col_two sub_col_one sub_col_two ...
idx_one idx_two
2 a 5 2 3 3
0 b 0 5 8 1
2 x 0 0 6 1
0 d 8 3 5 5
3 x 5 6 5 9
2 e 2 5 0 5
3 x 1 7 4 4
The question:
How could I get all rows of that df where idx_two is not equal to 'x'?
I've tried get_level_values, but I can't get what I need.
Use Index.get_level_values with the name of the level, combined with boolean indexing:
df1 = df[df.index.get_level_values('idx_two') != 'x']
Or with the position of the level, here 1, because Python counts from 0:
df1 = df[df.index.get_level_values(1) != 'x']
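A minimal reproducible sketch (the index values and the single column are illustrative, simplified from the question's frame):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2, 'a'), (0, 'b'), (2, 'x'), (3, 'x')],
    names=['idx_one', 'idx_two'])
df = pd.DataFrame({'2000-02-01': [5, 0, 0, 5]}, index=idx)

df1 = df[df.index.get_level_values('idx_two') != 'x']  # drops the rows where idx_two == 'x'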

Pandas select first x rows corresponding to y values, removing results below x

I have a dataframe like so:
ID A B
0 7 4
0 5 2
0 0 3
1 6 7
1 8 9
2 5 5
I would like to select the first x rows for each ID, but only when there are at least x rows for that ID, like so:
If x == 2:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
If x == 3:
ID A B
0 7 4
0 5 2
0 0 3
... and so on.
Using df.groupby("ID").head(2) approximates what I want, but includes the first row for ID "2", which I don't want:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
2 5 5
Is there an efficient way to do that, without having to resort to counting rows for each ID?
Use groupby + duplicated with keep=False:
v = df.groupby('ID').head(2)
v[v.ID.duplicated(keep=False)]
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
You could also do a 2x groupby (nah... wouldn't recommend):
df[df.groupby('ID').ID.transform('size').gt(1)].groupby('ID').head(2)
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
Use the following code:
x = 2
gr = df.groupby('ID', as_index=False)\
       .apply(lambda grp: grp.head(x) if len(grp) >= x else None)\
       .reset_index(drop=True)
The lambda function applied here checks whether the group length is at least x (a kind of filtering on group length) and, for such groups, outputs the first x rows. This way you avoid a second groupby.
The result is:
ID A B
0 0 7 4
1 0 5 2
2 1 6 7
3 1 8 9
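For completeness, a single-pass sketch (not from the answers above; the names g and out are illustrative) that combines the size filter and the head selection using cumcount:
x = 2
g = df.groupby('ID')
# keep rows whose group has at least x members and whose position in the group is < x
out = df[(g['ID'].transform('size') >= x) & (g.cumcount() < x)]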

Merge a pandas dataframe with a list through an outer join

I have a dataframe that looks as follow
A B
0 1 4
1 2 5
2 3 6
and a list
names = ['x','y']
I want to get a dataframe that kind of performs an outer join with that list. The desired result is:
A B name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y
Using pd.concat:
res = pd.concat([df.assign(name=i) for i in names], ignore_index=True)
Result:
A B name
0 1 4 x
1 2 5 x
2 3 6 x
3 1 4 y
4 2 5 y
5 3 6 y
Using an additional key for merge:
df.assign(key=1).merge(pd.DataFrame({'Name':names,'key':1})).drop('key', axis=1)
Out[54]:
A B Name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y
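In pandas 1.2+ the helper key is unnecessary, since merge supports how='cross' directly. A sketch, assuming df and names from the question (the column is named name to match the desired output):
res = df.merge(pd.DataFrame({'name': names}), how='cross')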
Comprehension
pd.DataFrame(
    # zip(*map(df.get, df)) yields the rows of df as tuples; each row is
    # paired with every name from the list.
    [r + (n,) for r in zip(*map(df.get, df)) for n in names],
    columns=[*df.columns, 'name']
)
A B name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y

Numpy Array to Pandas Data Frame of X Y Coordinates

I have a two dimensional numpy array:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
How would I go about converting this into a pandas DataFrame that has the x coordinate, y coordinate, and the corresponding array value at that index, like this:
x y val
0 0 1
0 1 4
0 2 7
1 0 2
1 1 5
1 2 8
...
With stack and reset_index:
df = pd.DataFrame(arr).stack().rename_axis(['y', 'x']).reset_index(name='val')
df
Out:
y x val
0 0 0 1
1 0 1 2
2 0 2 3
3 1 0 4
4 1 1 5
5 1 2 6
6 2 0 7
7 2 1 8
8 2 2 9
If ordering is important:
df.sort_values(['x', 'y'])[['x', 'y', 'val']].reset_index(drop=True)
Out:
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
Here's a NumPy method -
>>> arr
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> shp = arr.shape
>>> r,c = np.indices(shp)
>>> pd.DataFrame(np.c_[r.ravel(), c.ravel(), arr.ravel('F')],
...              columns=['x', 'y', 'val'])
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
