Pandas: Split dataframe with duplicate values into dataframes with unique values - python

I have a dataframe in Pandas with duplicate values in Col1:
Col1
a
a
b
a
a
b
What I want to do is split this df into different dfs, each with unique Col1 values.
DF1:
Col1
a
b
DF2:
Col1
a
b
DF3:
Col1
a
DF4:
Col1
a
Any suggestions?

I don't think you can achieve this in a fully vectorized way.
One possibility is to use a custom function that iterates over the items and keeps track of the ones already seen, then use the result as the grouping key for groupby:
def cum_uniq(s):
    i = 0
    seen = set()
    out = []
    for x in s:
        # start a new group whenever a value repeats within the current group
        if x in seen:
            i += 1
            seen = set()
        out.append(i)
        seen.add(x)
    return pd.Series(out, index=s.index)
out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]
output:
[ Col1
0 a,
Col1
1 a
2 b,
Col1
3 a,
Col1
4 a
5 b]
intermediate:
cum_uniq(df['Col1'])
0 0
1 1
2 1
3 2
4 3
5 3
dtype: int64
If order doesn't matter
Let's add a Col2 to the example:
Col1 Col2
0 a 0
1 a 1
2 b 2
3 a 3
4 a 4
5 b 5
the previous code gives:
[ Col1 Col2
0 a 0,
Col1 Col2
1 a 1
2 b 2,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4
5 b 5]
If order does not matter, you can vectorize it:
out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]
output:
[ Col1 Col2
0 a 0
2 b 2,
Col1 Col2
1 a 1
5 b 5,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4]
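For reference (an illustration, not part of the original answer), this is the grouping key that df.groupby('Col1').cumcount() produces on the example; rows sharing the same cumulative count land in the same output dataframe, which is why adjacency is not preserved:
df.groupby('Col1').cumcount()
0    0
1    1
2    0
3    2
4    3
5    1
dtype: int64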

Related

How to select rows filtered with condition on the previous and the next rows in pandas and put them in a empty df?

Consider the following dataframe df:
df = pd.DataFrame(
    {
        "col1": [0,1,2,3,4,5,6,7,8,9,10],
        "col2": ["A","B","C","D","E","F","G","H","I","J","K"],
        "col3": [1e-0,1e-1,1e-2,1e-3,1e-4,1e-5,1e-6,1e-7,1e-8,1e-9,1e-10],
        "col4": [0,4,2,5,6,7,6,3,6,2,1]
    }
)
I would like to select rows when the col4 value of the current row is greater than the col4 values of the previous and next rows and to store them in an empty frame.
I wrote the following code, which works:
df1 = pd.DataFrame()
for i in range(1, len(df)-1, 1):
    if (df.iloc[i]['col4'] > df.iloc[i+1]['col4']) and (df.iloc[i]['col4'] > df.iloc[i-1]['col4']):
        df1 = pd.concat([df1, df.iloc[i:i+1]])
I got the expected dataframe df1
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
But this code is very ugly and not readable. Is there a better solution?
Use boolean indexing, comparing with the next and previous values via Series.shift and Series.gt for greater values; chain the conditions with the bitwise AND operator &:
df = df[df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))]
print (df)
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
EDIT: Solution that always includes the first and last rows:
mask = df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))
mask.iloc[[0, -1]] = True
df = df[mask]
print (df)
col1 col2 col3 col4
0 0 A 1.000000e+00 0
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
10 10 K 1.000000e-10 1
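As a side note (an illustration, not part of the original answer), the explicit override of the first and last rows is needed because Series.shift introduces NaN at the boundaries and any comparison against NaN is False, so those rows can never pass the plain mask. Assuming df is the original dataframe from the question:
prev_ok = df['col4'].gt(df['col4'].shift())    # False at index 0, shift() puts NaN there
next_ok = df['col4'].gt(df['col4'].shift(-1))  # False at index 10, shift(-1) puts NaN there
print(prev_ok.iloc[0], next_ok.iloc[-1])
False False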

How to switch n columns to rows of an r-row pandas dataframe (n*r rows in the final dataframe)?

Let's take this dataframe:
df = pd.DataFrame(dict(Col1=["a","c"], Col2=["b","d"], Col3=[1,3], Col4=[2,4]))
Col1 Col2 Col3 Col4
0 a b 1 2
1 c d 3 4
I would like to have one row per value in column Col1 and column Col2 (n=2 and r=2, so the expected dataframe has 2*2 = 4 rows).
Expected result :
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
How could I do this, please?
Pandas melt does the job here; the rest just has to do with repositioning and renaming the columns appropriately.
Use pandas melt to transform the dataframe, using Col3 and Col4 as the id variables. melt converts from wide to long format.
Next step: reindex the columns, with variable and value as the leading columns.
Finally, rename the columns appropriately.
(df.melt(id_vars=['Col3','Col4'])
.reindex(['variable','value','Col3','Col4'],axis=1)
.rename({'variable':'Ind','value':'Value'},axis=1)
)
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
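To make the intermediate step visible (an illustration, not shown in the answer above), this is what the raw melt output looks like before the reindex and rename:
df.melt(id_vars=['Col3','Col4'])
Col3 Col4 variable value
0 1 2 Col1 a
1 3 4 Col1 c
2 1 2 Col2 b
3 3 4 Col2 d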

How can I match two rows in a pyspark dataframe when the value in a column in a row matches the value in another column in another row?

I have a Spark dataframe like below. If the value in col2 is found in col1 of other rows, I want to collect the corresponding col3 values into a list in a new column. And I would rather not use a self-join.
input:
col1 col2 col3
A B 1
B C 2
B A 3
output:
col1 col2 col3 col4
A B 1 [2,3]
B C 2 []
B A 3 [1]
You need to create a mapping using groupby and then use merge.
mapper = df.groupby('col1', as_index=False).agg({'col3': list}).rename(columns={'col3':'col4', 'col1': 'col2'})
df.merge(mapper, on='col2', how='left')
Output:
col1 col2 col3 col4
0 A B 1 [2, 3]
1 B C 2 NaN
2 B A 3 [1]
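The merged output has NaN where the question's expected output shows an empty list. If you need the [] form, one way (a sketch) is to replace the missing entries after the merge:
out = df.merge(mapper, on='col2', how='left')
out['col4'] = out['col4'].apply(lambda v: v if isinstance(v, list) else [])
col1 col2 col3 col4
0 A B 1 [2, 3]
1 B C 2 []
2 B A 3 [1]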

Sort and align 2 dataframes by values in corresponding columns

I have 2 dataframes, similar in structure to what I have shown below, that I want to sort; the rows, when looking at only the first 3 columns, are jumbled. How do I sort the dataframes so that the row indices match?
Also, it could happen that there is no matching row, in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
"How do I sort the dataframes such that the row indices match"
You can sort both dataframes by the columns that should determine the order and reset the index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add one more row to df1:
df1 = pd.DataFrame({
    'Col1': list('abfh'),
    'Col2': list('bceg'),
    'Col3': list('cdgi'),
    'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left merge to add a blank row for df2, with NaN in each column, at index 3.
If you have sorted both dataframes already, you can merge using the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that *should* determine the sort order; this will create a new dataframe with the joined values, sorted the same way df1 is sorted:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns that came from the left dataframe:
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
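For reference (an illustration not shown above), the column-based merge from the second variant gives df2's values aligned to df1's row order, with NaN for the unmatched row h g i:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5.0
1 b c d 3.0
2 f e g 6.0
3 h g i NaN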

Filling DataFrame with unique positive integers

I have a DataFrame that looks like this
col1 col2 col3 col4 col5
0 0 1 0 1 1
1 0 1 0 0 1
I want to assign a unique positive integer greater than 1 to each 0 entry.
So I want a DataFrame that looks like this:
col1 col2 col3 col4 col5
0 2 1 3 1 1
1 4 1 5 6 1
The integers don't have to be from an ordered sequence, just positive and unique.
np.arange(...).reshape(df.shape) generates an array the size of df consisting of consecutive integers starting at 2.
df.where(df, ...) works because your dataframe consists of binary indicators (zeros and ones). It keeps all truthy values (i.e. the ones) and uses the consecutive numpy array to fill in the zeros.
# optional: inplace=True
>>> df.where(df, np.arange(start=2, stop=df.shape[0] * df.shape[1] + 2).reshape(df.shape))
col1 col2 col3 col4 col5
0 2 1 4 1 1
1 7 1 9 10 1
I think you can use numpy.arange for generating unique numbers with the required shape and replace all 0 values using the boolean mask generated by df == 0:
print(df)
col1 col2 col3 col4 col5
0 0 1 0 1 1
1 0 1 0 0 1
print(df == 0)
col1 col2 col3 col4 col5
0 True False True False False
1 True False True True False
print(df.shape)
(2, 5)
#count of integers
min_count = df.shape[0] * df.shape[1]
print(min_count)
10
#you need to add 2, because 0 and 1 are omitted
print(np.arange(start=2, stop=min_count + 2).reshape(df.shape))
[[ 2 3 4 5 6]
[ 7 8 9 10 11]]
#use integers from 2 to max count of values of df
df[ df == 0 ] = np.arange(start=2, stop=min_count + 2).reshape(df.shape)
print(df)
col1 col2 col3 col4 col5
0 2 1 4 1 1
1 7 1 9 10 1
Or use numpy.random.choice for bigger unique random integers:
#count of integers
min_count = df.shape[0] * df.shape[1]
print(min_count)
10
#you can use a bigger number in np.arange, e.g. 100, but the minimum is min_count + 2
df[ df == 0 ] = np.random.choice(np.arange(2, 100), replace=False, size=df.shape)
print(df)
col1 col2 col3 col4 col5
0 17 1 53 1 1
1 39 1 15 76 1
This will work, although it isn't the best-performing approach in pandas:
import random
MAX_INT = 100
# draw enough distinct values greater than 1 up front, then assign them to the zero cells
values = iter(random.sample(range(2, MAX_INT), int((df == 0).sum().sum())))
for i in df.index:
    for col in df.columns:
        if df.at[i, col] == 0:
            df.at[i, col] = next(values)
Something like itertuples() will be faster, but if it's not a lot of data this is fine.
df[df == 0] = np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)
Lots of good answers here already, but throwing this out there.
replace indicates whether the sample is drawn with or without replacement.
np.arange runs from 2 to the size of the df + 2; it starts at 2 because you want values greater than 1.
size has to be the same shape as df, so I just used df.shape.
To illustrate what array values np.random.choice generates:
>>> np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)
array([[11, 4, 6, 5, 9],
[ 7, 8, 10, 3, 2]])
Note that they are all greater than 1 and are all unique.
Before:
col1 col2 col3 col4 col5
0 0 1 0 1 1
1 0 1 0 0 1
After:
col1 col2 col3 col4 col5
0 9 1 7 1 1
1 6 1 3 11 1
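A quick sanity check (a sketch, assuming df has been filled by one of the approaches above): every value should be positive, and the values that replaced the zeros should all be unique.
filled = df.to_numpy().ravel()
assert (filled > 0).all()                      # all entries are positive
replaced = filled[filled > 1]                  # the original 1s stay 1; everything > 1 was filled in
assert len(set(replaced)) == len(replaced)     # the filled-in values are unique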
