I have a dataframe with one column and I would like to get a DataFrame with N columns, all of which are identical to the first one. I can duplicate a single column with:
df[['new column name']] = df[['column name']]
but I need to make more than 1000 identical columns, so doing it one at a time doesn't work.
One important detail: the figures in the columns should change, e.g. if the first column is 0, the nth column should be n and the previous one n-1.
If it's a single column, you can transpose it, replicate it with pd.concat, and transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second line; renaming only touches the labels, not the data in the dataframe, which would be the most expensive part performance-wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
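The question also asks for the values in the nth column to be offset by n. Assuming the column holds numeric data, a minimal sketch using NumPy broadcasting would be:
import numpy as np

# a length-1000 offset array broadcasts across the 1000 columns,
# adding n to the values in the nth column
df = df + np.arange(1000)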
Say that you have a df with a column named 'company_name' that consists of 8 companies:
df = {"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}}
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(1000):
    df['company_name' + str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated the same column 1000 times. The duplicated columns get the original name with the loop index appended at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2', ..., 'company_name999'
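Note that recent pandas versions emit a PerformanceWarning about a fragmented frame when columns are inserted one at a time like this. If that matters, a sketch of the same idea assembled in a single pd.concat call (the rename is just one way to label the copies):
copies = pd.concat(
    [df['company_name'].rename('company_name' + str(i)) for i in range(1000)],
    axis=1,
)
df = pd.concat([df, copies], axis=1)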
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient approach is to index with DataFrame.loc instead of using an outer loop:
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
                df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want to add a suffix:
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) +
                    not_dup_cols]
            .set_axis(list(map(lambda suffix: f'column_duplicate_{suffix}',
                               suffix_tup)) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
Or add an index:
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n +
                    not_dup_cols]
            .set_axis(list(map(lambda x: f'column_duplicate_{x}', range(n))) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
Related
I need to go through a large DataFrame and select consecutive rows with similar values in a column. I.e., in the DataFrame below, looking at column x, I want to select only runs of consecutive equal values, say runs of 3 and runs of 5 only.
col row x y
1 1 1 1
5 7 3 0
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
5 8 6 2
3 7 6 0
The resulting output would be:
col row x y consecutive-count
6 3 3 8 1
9 2 3 4 1
5 3 3 9 1
5 5 5 1 2
3 7 5 2 2
I tried:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6s, which I don't want.
I also tried:
df.query('x in [3, 5]')
That returns every row where x is 3 or 5, consecutive or not.
IIUC, use masks for boolean indexing: check for 3 or 5, require the value to repeat in an adjacent row (which is what makes a run consecutive), and use a cummax and reverse cummax to keep only rows between the first 3 and the last 5:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
# the value must also appear in the row directly above or below
m3 = df['x'].eq(df['x'].shift()) | df['x'].eq(df['x'].shift(-1))
out = df[(m1|m2) & m3 & (m1.cummax() & m2[::-1].cummax())]
Output:
col row x y
3 6 3 3 8
4 9 2 3 4
5 5 3 3 9
7 5 5 5 1
8 3 7 5 2
You can create a group id for consecutive values, then filter by the group length and the value of x:
# create unique ids for consecutive groups, then get each group's length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter the main df (copy to avoid a SettingWithCopyWarning below):
df2 = df[df.x.isin([3, 5]) & (group_len > 1)].copy()
# add the new group number column
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
Output:
col row x y consecutive-count
3 6 3 3 8 1
4 9 2 3 4 1
5 5 3 3 9 1
7 5 5 5 1 2
8 3 7 5 2 2
I have a dataframe dfa:
y X1 X2 X3
Company Period
1 1 1 2 3 4
2 3 4 5 6
3 3 6 5 6
2 1 1 2 3 4
2 3 4 5 6
3 7 8 9 10
...
and dfb
Company Period
1 1
2
3
7 1
2
3
1 1
2
3
...
As you can see, dfb has a non-unique MultiIndex. I'd like to concatenate both dfs in a way that handles the non-uniqueness and adds the values of dfa to dfb everywhere the indexes are equal. The desired result would look like this:
y X1 X2 X3
Company Period
1 1 1 2 3 4
2 3 4 5 6
3 3 6 5 6
7 1 1 2 3 4
2 1 5 5 6
3 1 6 8 9
1 1 1 2 3 4
2 3 4 5 6
3 3 6 5 6
...
I have tried the following:
dfb.join(dfa, how='left')  # results in dfb
dfb = pd.concat([dfb, dfa], axis=1, join='inner')  # raises: ValueError: cannot handle a non-unique multi-index!
dfb.merge(dfa.reset_index(), on=['Company', 'Period'], how='left')  # results in dfb
What am I doing wrong?
I saw a similar question here but the solution did not work for me.
You can reindex your DataFrame with duplicate indices as well and it will just repeat the corresponding rows.
In [11]: df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['X', 'Y', 'Z'], index=pd.MultiIndex.from_product([[1], [1,2,3]]))
In [12]: df
Out[12]:
X Y Z
1 1 1 2 3
2 4 5 6
3 7 8 9
In [15]: df.loc[pd.MultiIndex.from_product([[1], [1,2,1,2]]), :]
Out[15]:
X Y Z
1 1 1 2 3
2 4 5 6
1 1 2 3
2 4 5 6
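Applied to the question, a sketch (assuming every (Company, Period) pair in dfb's index also exists in dfa; if some are missing, like Company 7 here, reindex fills NaN instead of raising):
# repeats dfa's rows for every index entry of dfb, duplicates included
result = dfa.loc[dfb.index]
# or, tolerant of index pairs absent from dfa:
result = dfa.reindex(dfb.index)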
I have a dataframe like so:
ID A B
0 7 4
0 5 2
0 0 3
1 6 7
1 8 9
2 5 5
I would like to select the first x rows for each ID, but only when there are at least x rows for that ID, like so:
If x == 2:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
If x == 3:
ID A B
0 7 4
0 5 2
0 0 3
... and so on.
Using df.groupby("ID").head(2) approximates what I want, but includes the first row for ID "2", which I don't want:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
2 5 5
Is there an efficient way to do that, without having to resort to counting rows for each ID?
Use groupby + duplicated with keep=False:
v = df.groupby('ID').head(2)
v[v.ID.duplicated(keep=False)]
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
You could also do a 2x groupby (nah... wouldn't recommend):
df[df.groupby('ID').ID.transform('size').gt(1)].groupby('ID').head(2)
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
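Note both snippets are written for x == 2 (duplicated and .gt(1) both encode "appears at least twice"). For an arbitrary x, a sketch of the generalized filter:
# keep only IDs with at least x rows, then take the first x rows per ID
x = 3
out = df[df.groupby('ID')['ID'].transform('size').ge(x)].groupby('ID').head(x)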
Use the following code:
x = 2
gr = df.groupby('ID', as_index=False)\
       .apply(lambda grp: grp.head(x) if len(grp) >= x else None)\
       .reset_index(drop=True)
The lambda function applied here checks whether the group length is at least x (a kind of filter on group length) and, for such groups, outputs the first x rows.
This way you avoid the second groupby.
The result is:
ID A B
0 0 7 4
1 0 5 2
2 1 6 7
3 1 8 9
I have a dataframe
C V S D LOC
1 2 3 4 X
5 6 7 8
1 2 3 4
5 6 7 8 Y
9 10 11 12
How can I select the rows from LOC X to Y and export them to another CSV?
Use idxmax to get the first index where each condition is True:
df = df.loc[(df['LOC'] == 'X').idxmax():(df['LOC'] == 'Y').idxmax()]
print(df)
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
In [133]: df.loc[df.index[df.LOC=='X'][0]:df.index[df.LOC=='Y'][0]]
Out[133]:
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
P.S. This will select all rows between the first occurrence of X and the first occurrence of Y.
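To write the selected rows to another CSV, as the question asks (the filename here is just a placeholder):
df.to_csv('selected_rows.csv', index=False)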
I have two Pandas DataFrames (A and B) with 2 columns and different number of rows.
They used to be numpy 2D matrices and they both contain integer values.
Is there any way to retrieve the indices of matching rows between those two?
I've been trying isin() or query() or merge(), without success.
This is actually a follow-up to a previous question: I'm trying with pandas dataframes since the original matrices are rather huge.
The desired output, if possible, should be an array (or list) containing at the i-th position the row index in B of the i-th row of A. E.g. an output list of [1, 5, 4] means that the first row of A was found at index 1 of B, the second row of A at index 5 of B, and the third row of A at index 4 of B.
I would do it this way:
In [199]: df1.reset_index().merge(df2.reset_index(), on=['a','b'])
Out[199]:
index_x a b index_y
0 1 9 1 17
1 3 4 0 4
or like this:
In [211]: pd.merge(df1.reset_index(), df2.reset_index(), on=['a','b'], suffixes=['_1','_2'])
Out[211]:
index_1 a b index_2
0 1 9 1 17
1 3 4 0 4
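To turn this into the desired list (the row index in df2 for each matching row of df1, in df1 order), a sketch building on the merge above; rows of df1 without a match simply don't appear:
m = df1.reset_index().merge(df2.reset_index(), on=['a', 'b'], suffixes=['_1', '_2'])
idx_list = m.sort_values('index_1')['index_2'].tolist()
# [17, 4] for the sample data below: rows 1 and 3 of df1 match rows 17 and 4 of df2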
data:
In [201]: df1
Out[201]:
a b
0 1 9
1 9 1
2 8 1
3 4 0
4 2 0
5 2 2
6 2 9
7 1 1
8 4 3
9 0 4
In [202]: df2
Out[202]:
a b
0 3 5
1 5 0
2 7 8
3 6 8
4 4 0
5 1 5
6 9 0
7 9 4
8 0 9
9 0 1
10 6 9
11 6 7
12 3 3
13 5 1
14 4 2
15 5 0
16 9 5
17 9 1
18 1 6
19 9 5
Without merging, you can use == and then check whether any value in a row compares as False (note this only works when both frames have the same shape and index):
df1 = pd.DataFrame({'a':[0,1,2,3,4],'b':[0,1,2,3,4]})
df2 = pd.DataFrame({'a':[0,1,2,3,4],'b':[2,1,2,2,4]})
test = pd.DataFrame(index=df1.index, columns=['test'])
for row in df1.index:
    # .loc replaces the long-removed .ix indexer
    if False in (df1 == df2).loc[row].values:
        test.loc[row, 'test'] = False
    else:
        test.loc[row, 'test'] = True
Out[1]:
test
0 False
1 True
2 True
3 False
4 True
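The loop can also be collapsed into a single vectorized expression (same assumption of identical shape and index):
# True where every column in the row matches
test = (df1 == df2).all(axis=1)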