Filling DataFrame with unique positive integers - python

I have a DataFrame that looks like this:
   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1
I want to assign a unique positive integer greater than 1 to each 0 entry, so I want a DataFrame that looks like this:
   col1  col2  col3  col4  col5
0     2     1     3     1     1
1     4     1     5     6     1
The integers don't have to be from an ordered sequence, just positive and unique.

np.arange(...).reshape(df.shape) generates an array the size of df consisting of consecutive integers starting at 2.
df.where(df, ...) works because your dataframe consists of binary indicators (zeros and ones). It keeps all truthy values (i.e. the ones) and uses the consecutive numpy array to fill in the zeros.
# optional: inplace=True
>>> df.where(df, np.arange(start=2, stop=df.shape[0] * df.shape[1] + 2).reshape(df.shape))
   col1  col2  col3  col4  col5
0     2     1     4     1     1
1     7     1     9    10     1

I think you can use numpy.arange to generate unique numbers of the required shape, and replace all 0 values using the boolean mask generated by df == 0:
print(df)
   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1
print(df == 0)
   col1   col2  col3   col4   col5
0  True  False  True  False  False
1  True  False  True   True  False
print(df.shape)
(2, 5)
#count of integers needed
min_count = df.shape[0] * df.shape[1]
print(min_count)
10
#you need to add 2, because 0 and 1 are omitted
print(np.arange(start=2, stop=min_count + 2).reshape(df.shape))
[[ 2  3  4  5  6]
 [ 7  8  9 10 11]]
#use integers from 2 up to the total count of values in df
df[df == 0] = np.arange(start=2, stop=min_count + 2).reshape(df.shape)
print(df)
   col1  col2  col3  col4  col5
0     2     1     4     1     1
1     7     1     9    10     1
Or use numpy.random.choice to pick unique random integers from a bigger range:
#count of integers needed
min_count = df.shape[0] * df.shape[1]
print(min_count)
10
#you can use a bigger stop value in np.arange, e.g. 100, but the minimum is min_count + 2
df[df == 0] = np.random.choice(np.arange(2, 100), replace=False, size=df.shape)
print(df)
   col1  col2  col3  col4  col5
0    17     1    53     1     1
1    39     1    15    76     1

This will work, although it isn't the greatest performance in pandas:
import random

MAX_INT = 100
# draw unique values greater than 1 up front (randrange could repeat and can return 1)
values = iter(random.sample(range(2, MAX_INT), df.size))
for i in df.index:
    for col in df.columns:
        if df.at[i, col] == 0:
            df.at[i, col] = next(values)
Something like itertuples() will be faster, but if it's not a lot of data this is fine.
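A minimal sketch of that itertuples() variant, reusing the unique-sample idea from above:
import random

MAX_INT = 100
values = iter(random.sample(range(2, MAX_INT), df.size))
# itertuples yields one namedtuple per row, avoiding per-cell lookups on read
for row in df.itertuples():
    for col, val in zip(df.columns, row[1:]):
        if val == 0:
            df.at[row.Index, col] = next(values)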

df[df == 0] = np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)
Lots of good answers here already, but throwing this out there.
replace indicates whether the sample is drawn with or without replacement.
np.arange runs from 2 to df.size + 2. It starts at 2 because you want values greater than 1.
size has to match the shape of df, so I just used df.shape.
To illustrate what array values np.random.choice generates:
>>> np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)
array([[11,  4,  6,  5,  9],
       [ 7,  8, 10,  3,  2]])
Note that they are all greater than 1 and are all unique.
Before:
   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1
After:
   col1  col2  col3  col4  col5
0     9     1     7     1     1
1     6     1     3    11     1
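Equivalently, if you'd rather not assign through a boolean mask in place, DataFrame.mask returns a new frame with the zeros replaced (a sketch using the same arange trick):
import numpy as np

# replace every 0 with a consecutive integer starting at 2, leaving the 1s alone
out = df.mask(df == 0, np.arange(2, df.size + 2).reshape(df.shape))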

Related

How to select rows filtered with a condition on the previous and next rows in pandas and put them in an empty df?

Consider the following dataframe df:
df = pd.DataFrame(
    {
        "col1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "col2": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"],
        "col3": [1e-0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10],
        "col4": [0, 4, 2, 5, 6, 7, 6, 3, 6, 2, 1],
    }
)
I would like to select rows when the col4 value of the current row is greater than the col4 values of the previous and next rows and to store them in an empty frame.
I wrote the following code, which works:
df1 = pd.DataFrame()
for i in range(1, len(df) - 1):
    if (df.iloc[i]['col4'] > df.iloc[i + 1]['col4']) and (df.iloc[i]['col4'] > df.iloc[i - 1]['col4']):
        df1 = pd.concat([df1, df.iloc[i:i + 1]])
I got the expected dataframe df1:
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
But this code is very ugly and not readable. Is there a better solution?
Use boolean indexing: compare against the next and previous values with Series.shift and Series.gt (greater than), and chain the two conditions with bitwise AND (&):
df = df[df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))]
print (df)
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
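To see what the chained condition produces, here is the intermediate mask on the original sample df; it is True exactly at rows 1, 5 and 8:
mask = df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))
print(mask.tolist())
# [False, True, False, False, False, True, False, False, True, False, False]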
EDIT: Solution to always include the first and last rows:
mask = df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))
mask.iloc[[0, -1]] = True
df = df[mask]
print (df)
col1 col2 col3 col4
0 0 A 1.000000e+00 0
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
10 10 K 1.000000e-10 1

How to swap column1 value with column2 value under a condition in Pandas

I'd like to swap column1 value with column2 value if column1.value >= 14 in pandas!
col1  col2
16    1
3     2
4     3
This should become:
col1  col2
1     16
3     2
4     3
Thanks!
Use Series.mask and reassign the two columns' values:
m = df["col1"].ge(14)
out = df.assign(
    col1=df["col1"].mask(m, df["col2"]),
    col2=df["col2"].mask(m, df["col1"]),
)
Output:
col1 col2
0 1 16
1 3 2
2 4 3
A simple one-liner solution:
df.loc[df['col1'] >= 14,['col1','col2']] = df.loc[df['col1'] >= 14,['col2','col1']].values
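The .values on the right-hand side is what makes this work: it strips the column labels, so pandas assigns positionally instead of realigning col2/col1 back to their own names (which would make the swap a no-op). The same idea with a named mask:
# boolean mask of rows to swap
m = df["col1"] >= 14
df.loc[m, ["col1", "col2"]] = df.loc[m, ["col2", "col1"]].to_numpy()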

Pandas: Split dataframe with duplicate values into dataframe with unique values

I have a dataframe in Pandas with duplicate values in Col1:
Col1
a
a
b
a
a
b
What I want to do is to split this df into different df-s with unique Col1 values in each.
DF1:
Col1
a
b
DF2:
Col1
a
b
DF3:
Col1
a
DF4:
Col1
a
Any suggestions?
I don't think you can achieve this in a vectorized way.
One possibility is to use a custom function that iterates over the items and keeps track of the unique ones, then use its output to split with groupby:
def cum_uniq(s):
    i = 0
    seen = set()
    out = []
    for x in s:
        if x in seen:
            i += 1
            seen = set()
        out.append(i)
        seen.add(x)
    return pd.Series(out, index=s.index)
out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]
output:
[ Col1
0 a,
Col1
1 a
2 b,
Col1
3 a,
Col1
4 a
5 b]
intermediate:
cum_uniq(df['Col1'])
0 0
1 1
2 1
3 2
4 3
5 3
dtype: int64
if order doesn't matter
Let's add a Col2 to the example:
Col1 Col2
0 a 0
1 a 1
2 b 2
3 a 3
4 a 4
5 b 5
the previous code gives:
[ Col1 Col2
0 a 0,
Col1 Col2
1 a 1
2 b 2,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4
5 b 5]
If order does not matter, you can vectorize it:
out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]
output:
[ Col1 Col2
0 a 0
2 b 2,
Col1 Col2
1 a 1
5 b 5,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4]
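This works because df.groupby('Col1').cumcount() numbers each occurrence of a value (0 for the first 'a', 1 for the second, and so on), so grouping on that counter collects the n-th occurrence of every value into the n-th frame:
print(df.groupby('Col1').cumcount().tolist())
# [0, 1, 0, 2, 3, 1]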

How to enter the value of one index and column into a new cell with +1 in the iteration?

I have the following DataFrame named df1:
col1  col2  col3
5     3     50
10    4     3
2     0     1
I would like to create a loop that adds a new column called "Total": it takes the value of col1 at index 0 (5) and enters it under "Total" at index 0. On the next iteration it takes col2 at index 1 (4), and that value goes under "Total" at index 1. This continues until all columns and rows are covered.
The ideal output will be the following df1:
col1  col2  col3  Total
5     3     50    5
10    4     3     4
2     0     1     1
I have the following code, but I would like to find a more efficient way of doing this as I have a large DataFrame:
df1.iloc[0,3] = df1.iloc[0,0]
df1.iloc[1,3] = df1.iloc[1,1]
df1.iloc[2,3] = df1.iloc[2,2]
Thank you!
NumPy has a built-in diagonal function:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [5, 10, 2], 'col2': [3, 4, 0], 'col3': [50, 3, 1]})
df['Total'] = np.diag(df)
print(df)
Output
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
You can try apply on the rows (this relies on the default RangeIndex, since row.name is used as the position to pick):
df['Total'] = df.apply(lambda row: row.iloc[row.name], axis=1)
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
Hope this logic helps:
length = len(df1["col1"])
# in i % 3, 3 is the number of columns (col1, col2, col3)
total = pd.Series([df1.iloc[i, i % 3] for i in range(length)])
# add this total Series to df1
df1["Total"] = total
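A vectorized version of the same wrap-around indexing, as a sketch assuming df1 still has only its three numeric columns and the default RangeIndex:
import numpy as np

rows = np.arange(len(df1))
# pick (row i, column i mod number-of-columns) for every row at once
df1["Total"] = df1.to_numpy()[rows, rows % df1.shape[1]]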

Using a column in a Pandas dataframe as a lookup to choose a second column in the same df, twice, and then do a comparison on the results

What is the best way to use data in a DF to retrieve data from other columns in the same DF, do some logical processing, and then write a new value back to the DF?
I have a Pandas dataframe which contains a column that I want to use as a lookup to pick a column out of three options - after I append a suffix to the value.
E.g.
Col1 Col2 Col3A Col4A Col5A
1 Col3 Col3 1 -2 3
2 Col4 Col5 2 -3 4
3 Col3 Col4 -3 4 -5
. ... ... ... ... ...
So in row 1: I need to pick the string "Col3" out of Col1, append "A", and then get the value from Col3A (1).
Then for row 2: that should result in Col4A (-3).
Etc., for all rows.
Then do the same for Col2 and have a second set of values (1, 4, 4, etc.).
Then take those two sets of numbers (1, -3, -3, etc. and 1, 4, 4, etc.), and see if the sign has changed (N, Y, Y, etc.).
That output then needs to be saved in a new column like this:
Col1 Col2 Col3A Col4A Col5A Col6
1 Col3 Col3 1 -2 3 N
2 Col4 Col5 2 -3 4 Y
3 Col3 Col4 -3 4 -5 Y
. ... ... ... ... ... ...
My attempt to solve this so far has mostly thrown memory errors (the shape of my actual df is only (91376, 121)), and I feel there must be a better way...
df['Col6'] = np.where(
    np.sign(df[df['Col1'] + 'A']) != np.sign(df[df['Col2'] + 'A']),
    'Y',
    'N'
)
I don't want to have to write an exhaustive tree of np.where's to capture all 9 combinations of columns, so any suggestions are gratefully received.
Thanks.
Use DataFrame.lookup:
a = df.lookup(df.index, df['Col1'] + 'A')
b = df.lookup(df.index, df['Col2'] + 'A')
print (a)
[ 1 -3 -3]
print (b)
[1 4 4]
df['Col6'] = np.where(np.sign(a) != np.sign(b), 'Y', 'N')
print (df)
Col1 Col2 Col3A Col4A Col5A Col6
1 Col3 Col3 1 -2 3 N
2 Col4 Col5 2 -3 4 Y
3 Col3 Col4 -3 4 -5 Y
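Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On current versions, a minimal positional stand-in (a sketch, assuming the same df as above) is:
import numpy as np

def lookup(df, row_labels, col_labels):
    # positional replacement for the removed DataFrame.lookup
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    return df.to_numpy()[rows, cols]

# the mixed frame yields an object array, so cast back to a numeric dtype
a = lookup(df, df.index, df['Col1'] + 'A').astype(int)
b = lookup(df, df.index, df['Col2'] + 'A').astype(int)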
Same idea as @jezrael's, but with a custom lookup function: pd.get_dummies one-hot encodes the chosen column names, and the einsum row-wise dot product picks out the matching value.
def look_up(df, col, suffix):
    encode = pd.get_dummies(df[col])
    columns = [str(c) + suffix for c in encode.columns]
    encode_array = encode.values
    data_array = df[columns].values
    return np.einsum('ij,ij->i', encode_array, data_array)
The rest is pretty much the same:
a = look_up(df, 'Col1', 'A')
b = look_up(df, 'Col2', 'A')
print (a)
[ 1 -3 -3]
print (b)
[1 4 4]
df['Col6'] = np.where(np.sign(a) != np.sign(b), 'Y', 'N')
print (df)
Col1 Col2 Col3A Col4A Col5A Col6
1 Col3 Col3 1 -2 3 N
2 Col4 Col5 2 -3 4 Y
3 Col3 Col4 -3 4 -5 Y
The custom look_up function is ~28 times faster for the above problem, but probably not worth the extra effort.
