I have this column (similar, but with many more entries)
import pandas as pd

numbers = range(1, 16)
sequence = []
for number in numbers:
    sequence.append(number)
df = pd.DataFrame(sequence).rename(columns={0: 'sequence'})
and I want to distribute the same values across many more columns periodically (and automatically), to get something like this (but with many more values):
   0   1   2
0  1   6  11
1  2   7  12
2  3   8  13
3  4   9  14
4  5  10  15
Thanks
Use reshape with 5 for the number of new rows, and -1 to count the number of columns automatically:
import numpy as np

numbers = range(1, 16)
df = pd.DataFrame(np.array(numbers).reshape(-1, 5).T)
print(df)
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If the number of values in the range does not fill N rows exactly, here is a possible solution that pads the remainder with a fill value:
L = range(1,22)
N = 5
filled = 0
# pad to the next multiple of N with the fill value, then reshape
arr = np.full(((len(L) - 1)//N + 1)*N, filled)
arr[:len(L)] = L
df = pd.DataFrame(arr.reshape((-1, N)).T)
print(df)
0 1 2 3 4
0 1 6 11 16 21
1 2 7 12 17 0
2 3 8 13 18 0
3 4 9 14 19 0
4 5 10 15 20 0
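If you would rather pad with NaN than 0, a small variation of the same idea (note the array must be float to hold NaN):
L = range(1, 22)
N = 5
# float array, NaN-filled; length is the next multiple of N
arr = np.full(((len(L) - 1)//N + 1)*N, np.nan)
arr[:len(L)] = L
df = pd.DataFrame(arr.reshape((-1, N)).T)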
Use pandas.Series.values with reshape and the desired numbers of rows and columns; Fortran order makes the values run down the columns:
pd.DataFrame(df.sequence.values.reshape(5, -1, order='F'))
If you would like to reshape after reading the DataFrame, then:
df = pd.DataFrame(df.to_numpy().reshape(5, -1, order='F'))
num_cols = 3
result = pd.DataFrame(df.sequence.to_numpy().reshape(-1, num_cols, order="F"))
for a given number of columns (e.g., 3 here) reshapes df.sequence to (total_number_of_values / num_cols, num_cols), where the first dimension is inferred with -1. The Fortran order ("F") matches the structure so that the numbers are "going down first",
to get
>>> result
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If num_cols = 5, then
>>> result
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15
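For contrast, the default C (row-major) order fills across each row first; a quick sketch with the same data:
import numpy as np
import pandas as pd

numbers = np.arange(1, 16)
# C order (the default) fills each row left to right: the first row becomes 1, 2, 3
print(pd.DataFrame(numbers.reshape(-1, 3)))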
Related
I have a dataframe with one column and I would like to get a DataFrame with N columns, all of which are identical to the first one. I can simply duplicate it by:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns, which is why this doesn't work. One important thing: the figures in the columns should change; for instance, if the first column is 0, the nth column is n and the previous one is n-1.
If it's a single column, you can transpose it, replicate it with pd.concat, and transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second line, without touching all the data in the DataFrame, which would be the most expensive part performance-wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
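If, as the question mentions at the end, the figures should also shift with the column index, one possible variant of the same idea (a sketch, assuming a simple +i offset for the i-th column; the name shifted is mine):
import pandas as pd

df = pd.DataFrame({'Column': [1, 2, 3, 4, 5]})
n = 1000
# the i-th replica is offset by i, so column 0 keeps the original values
shifted = pd.concat([df['Column'] + i for i in range(n)], axis=1)
shifted.columns = [f'Column_{i}' for i in range(n)]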
Say that you have a df with a column named 'company_name' that contains 8 companies:
df = pd.DataFrame({"company_name": {"0": "Telia", "1": "Proximus", "2": "Tmobile", "3": "Orange", "4": "Telefonica", "5": "Verizon", "6": "AT&T", "7": "Koninklijke"}})
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0, 1000):
    df['company_name' + str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated the same column 1000 times. The names of the duplicated columns are the same as the original one, plus an incrementing integer at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2', ..., 'company_nameN'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient approach is to index with DataFrame.loc instead of using an outer loop:
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
                df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want to add a suffix:
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) + not_dup_cols]
            .set_axis([f'column_duplicate_{suffix}' for suffix in suffix_tup] +
                      not_dup_cols, axis=1))
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
Or to add an index:
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n + not_dup_cols]
            .set_axis([f'column_duplicate_{i}' for i in range(n)] +
                      not_dup_cols, axis=1))
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
I have the following pandas sample dataset:
Dim1 Dim2 Dim3 Dim4
0 1 2 7 15
1 1 10 12 2
2 9 19 18 16
3 4 2 4 15
4 8 1 9 5
5 14 18 3 14
6 19 9 9 17
I want to make a complex comparison based on all 4 columns and generate a column called Domination_count. For every row, I want to calculate how many other rows the given one dominates. Domination is defined as "being better in one dimension, while not being worse in the others". A is better than B if the value of A is less than B.
The final result should become:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
Some explanation behind the final numbers:
option 0 is better than options 2 and 6
option 1 is better than option 2
options 2, 5, and 6 are better than no other option
options 3 and 4 are better than options 2 and 6
I could not think of any code that allows me to compare multiple columns simultaneously. I found this approach which does not do the comparison simultaneously.
Improving on the answer:
My first answer worked if there were no equal rows. In the case of equal rows they would increment the domination count because they are not worse than the other rows.
This somewhat simpler solution takes care of that problem.
# create a dataframe with a duplicate row
df = pd.DataFrame([[1, 2, 7, 15], [1, 10, 12, 2], [9, 19, 18, 16], [4, 2, 4, 15],
                   [8, 1, 9, 5], [14, 18, 3, 14], [19, 9, 9, 17], [14, 18, 3, 14]],
                  columns=['Dim1', 'Dim2', 'Dim3', 'Dim4'])
df2 = df.copy()
def domination(row, df):
    # keep rows where none of the columns are worse (row <= other row everywhere)
    df = df[(row <= df).all(axis=1)]
    # of those, keep rows where at least one column is strictly better
    df = df[(row < df).any(axis=1)]
    return len(df)

df['Domination_count'] = df.apply(domination, args=[df], axis=1)
df
This will correctly account for the criteria in the post and will not count the duplicate row in the Domination_count column:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
7 14 18 3 14 0
My previous solution counts the equal rows:
df2['Domination_count'] = df2.apply(lambda x: (x <= df2).all(axis=1).sum() -1, axis=1)
df2
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 1
6 19 9 9 17 0
7 14 18 3 14 1
Original Solution
I like this as a solution. It takes each row of the dataframe and compares each element to all rows of the dataframe, to see whether that element is less than or equal to the other rows (not worse than). Then it counts the rows where all of the elements are not worse than the other rows. This count includes the current row, which is never worse than itself, so we subtract 1.
df['Domination_count'] = df.apply(lambda x: (x <= df).all(axis=1).sum() -1, axis=1)
The result is:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
In one line using list comprehension:
df['Domination_count'] = [(df.loc[df.index!=row] - df.loc[row].values.squeeze() > 0).all(axis = 1).sum() for row in df.index]
Subtract each row from all remaining rows elementwise, then count rows with all positive values (meaning that each corresponding value in the row we subtracted was lower) in the resulting dataframe.
I may have gotten your definition of domination wrong, so perhaps you'll need to change the strict positivity check to whatever you need.
A simple iterative solution:
import numpy as np

df['Domination_count'] = 0          # initialize the count column to zero
cols = df.columns[:-1]              # all columns except Domination_count
for i in range(len(df.index)):      # compare every row i ...
    for j in range(len(df.index)):  # ... against every other row j
        if np.all(df.loc[i, cols] <= df.loc[j, cols]) and i != j:  # row i is nowhere worse than row j
            df.loc[i, 'Domination_count'] += 1
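For larger frames, the double loop can be replaced by a NumPy broadcasting comparison that evaluates all row pairs at once; a sketch under the same domination definition (not worse anywhere, strictly better somewhere):
a = df[cols].to_numpy()
# not_worse[i, j]: row i is <= row j in every dimension
not_worse = (a[:, None, :] <= a[None, :, :]).all(axis=2)
# strictly_better[i, j]: row i is < row j in at least one dimension
strictly_better = (a[:, None, :] < a[None, :, :]).any(axis=2)
df['Domination_count'] = (not_worse & strictly_better).sum(axis=1)
This builds an n x n x d boolean array, so it trades memory for speed.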
Say I have a list of integers which correspond to the points where I want to increase an integer value by 1, for example Int64Index([5, 10]), not necessarily evenly spaced like that, and I have a dataframe like:
val new_col
0 0.729726564 1
1 0.067509062 1
2 0.943927114 1
3 0.037718436 1
4 0.512142908 1
5 0.767198655 2
6 0.202230787 2
7 0.343767479 2
8 0.540026305 2
9 0.256425022 2
10 0.403845023 3
11 0.444475008 3
12 0.464677745 3
I want to create new_col, which is an int, but increases by one at each of the above index rows.
Edit:
import pandas as pd
import numpy as np
df = pd.DataFrame({'val': np.random.rand(14)})
df['new_col'] = 1
How can I increase the value of new_col by one at each index point (5, 10)?
I see from your comment that you refer to an "arbitrary position", so you can space them as you wish with bins.
example:
bins = [-1,3,5,12,14] #space as you wish
labels = [1,2,3,4] #labels or in your case values that you want
df['new_col'] = pd.cut(list(df.index.values), bins=bins, labels=labels)
val new_col
0 0.509742 1
1 0.081701 1
2 0.990583 1
3 0.813398 1
4 0.905022 2
5 0.951973 2
6 0.702487 3
7 0.916432 3
8 0.647568 3
9 0.955188 3
10 0.875067 3
11 0.284496 3
12 0.393931 3
13 0.341115 4
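Applied to the question's own index points (5, 10) and the 14-row frame from the edit, the bins might look like this (a sketch):
bins = [-1, 4, 9, len(df) - 1]  # boundaries just before each increment point
labels = [1, 2, 3]
df['new_col'] = pd.cut(df.index, bins=bins, labels=labels)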
Use numpy.split with enumerate:
import numpy as np
import pandas as pd

indices = [5, 10]
df['add_col'] = pd.concat([s + n for n, s in enumerate(np.split(df['new_col'], indices))])
print(df)
Output:
val new_col add_col
0 0.953431 1 1
1 0.929134 1 1
2 0.548343 1 1
3 0.080713 1 1
4 0.465212 1 1
5 0.290549 1 2
6 0.570886 1 2
7 0.232350 1 2
8 0.036968 1 2
9 0.455084 1 2
10 0.385177 1 3
11 0.811477 1 3
12 0.802502 1 3
13 0.001847 1 3
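An alternative that avoids splitting is to mark the increment positions and take a cumulative sum; a sketch, reusing indices = [5, 10] from above:
increments = pd.Series(0, index=df.index)
increments[indices] = 1  # +1 at each marked index point
df['add_col'] = df['new_col'] + increments.cumsum()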
Say I have a data frame that looks like this:
Id ColA
1 2
2 2
3 3
4 5
5 10
6 12
7 18
8 20
9 25
10 26
I would like my code to create a new column at the end of the DataFrame that divides the total # of observations into 5 segments, labeled from 5 down to 1.
Id ColA Segment
1 2 5
2 2 5
3 3 4
4 5 4
5 10 3
6 12 3
7 18 2
8 20 2
9 25 1
10 26 1
I tried the following code, but it doesn't work:
df['segment'] = pd.qcut(df['Id'],5)
I also want to know what would happen if the total number of my observations was not divisible by 5.
Actually, you were closer to the answer than you think. This will work regardless of whether len(df) is a multiple of 5 or not.
bins = 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 4
4 5 10 3
5 6 12 3
6 7 18 2
7 8 20 2
8 9 25 1
9 10 26 1
Where

pd.qcut(df['Id'], bins).cat.codes

0    0
1    0
2    1
3    1
4    2
5    2
6    3
7    3
8    4
9    4
dtype: int8

represents the categorical intervals returned by pd.qcut as integer codes.
Another example, for a DataFrame with 7 rows.
df = df.head(7).copy()
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 3
4 5 10 2
5 6 12 1
6 7 18 1
This should work:
df['segment'] = np.linspace(1, 6, len(df), False, dtype=int)
It creates an array of ints between 1 and 5, the size of your DataFrame. If you want it to run from 5 to 1, just add [::-1] at the end of the line, as shown below.
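Spelled out, the reversed version would be:
df['segment'] = np.linspace(1, 6, len(df), False, dtype=int)[::-1]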
I have two Pandas DataFrames (A and B) with 2 columns and different number of rows.
They used to be numpy 2D matrices and they both contain integer values.
Is there any way to retrieve the indices of matching rows between those two?
I've been trying isin() or query() or merge(), without success.
This is actually a follow-up to a previous question: I'm trying with pandas dataframes since the original matrices are rather huge.
The desired output, if possible, should be an array (or list) containing in the i-th position the row index in B for the i-th row of A. E.g. an output list of [1,5,4] means that the first row of A has been found in the first row of B, the second row of A has been found in the fifth row of B, and the third row of A has been found in the fourth row of B.
I would do it this way:
In [199]: df1.reset_index().merge(df2.reset_index(), on=['a','b'])
Out[199]:
index_x a b index_y
0 1 9 1 17
1 3 4 0 4
or like this:
In [211]: pd.merge(df1.reset_index(), df2.reset_index(), on=['a','b'], suffixes=['_1','_2'])
Out[211]:
index_1 a b index_2
0 1 9 1 17
1 3 4 0 4
data:
In [201]: df1
Out[201]:
a b
0 1 9
1 9 1
2 8 1
3 4 0
4 2 0
5 2 2
6 2 9
7 1 1
8 4 3
9 0 4
In [202]: df2
Out[202]:
a b
0 3 5
1 5 0
2 7 8
3 6 8
4 4 0
5 1 5
6 9 0
7 9 4
8 0 9
9 0 1
10 6 9
11 6 7
12 3 3
13 5 1
14 4 2
15 5 0
16 9 5
17 9 1
18 1 6
19 9 5
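To get the output list described in the question (the i-th entry being the row index in B for the i-th matching row of A), one possible follow-up on the merged result (a sketch; rows of df1 with no match in df2 simply drop out):
merged = df1.reset_index().merge(df2.reset_index(), on=['a', 'b'], suffixes=['_1', '_2'])
# keep df1's original row order, then read off the matching df2 row indices
matches = merged.sort_values('index_1')['index_2'].tolist()  # [17, 4] for the data above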
Without merging, you can use == and then check each row for a False.
df1 = pd.DataFrame({'a':[0,1,2,3,4],'b':[0,1,2,3,4]})
df2 = pd.DataFrame({'a':[0,1,2,3,4],'b':[2,1,2,2,4]})
test = pd.DataFrame(index=df1.index, columns=['test'])
for row in df1.index:
    if False in (df1 == df2).loc[row].values:
        test.loc[row, 'test'] = False
    else:
        test.loc[row, 'test'] = True
Out[1]:
test
0 False
1 True
2 True
3 False
4 True
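The loop above is equivalent to a single vectorized comparison, which is both shorter and faster:
test['test'] = (df1 == df2).all(axis=1)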