Split rows into multiple rows based on column value - python

Input DF:
Index Parameters  A      B      C
1     Apple       1      2      3
2     Banana      2      4      5
3     Potato      3      5      2
4     Tomato      1 x 4  1 x 6  2 x 12
Output DF:
Index Parameters  A  B  C
1     Apple       1  2  3
2     Banana      2  4  5
3     Potato      3  5  2
4     Tomato_P    1  1  2
5     Tomato_Q    4  6  12
Problem Statement:
I want to convert a row of data into multiple rows based on a particular column value (Tomato), with ' x ' as the split delimiter.
Code/Findings:
I have code which works well if I transpose this data set, apply one of the linked answers, and then re-transpose the result.
I am looking for a solution that works directly on the given dataframe.

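For reference, the input frame used below can be reconstructed like this (an assumed setup, not from the original post; the values must be strings so that .str.split applies, and the last answer below instead keeps Index as a regular column, i.e. after df.reset_index()):
import pandas as pd

df = pd.DataFrame({'Parameters': ['Apple', 'Banana', 'Potato', 'Tomato'],
                   'A': ['1', '2', '3', '1 x 4'],
                   'B': ['2', '4', '5', '1 x 6'],
                   'C': ['3', '5', '2', '2 x 12']},
                  index=pd.Index([1, 2, 3, 4], name='Index'))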
A solution if there is always at most one ' x ' split per cell: first split the columns in the list with Series.str.split, then call Series.explode on each, join all the other columns back with DataFrame.join, and add the _P / _Q suffixes with Series.duplicated and numpy.select:
import numpy as np

cols = ['A','B','C']
# split every cell on ' x '; single values become one-element lists
df[cols] = df[cols].apply(lambda x: x.str.split(' x '))
# explode each column separately and align the results side by side
df1 = pd.concat([df[x].explode() for x in cols], axis=1)
#print (df1)
df = df[df.columns.difference(cols)].join(df1)
# rows with a duplicated index came from a split: suffix the first with _P, the second with _Q
df['Parameters'] += np.select([df.index.duplicated(keep='last'),
                               df.index.duplicated()],
                              ['_P','_Q'],
                              default='')
df = df.reset_index(drop=True)
print (df)
Parameters A B C
0 Apple 1 2 3
1 Banana 2 4 5
2 Potato 3 5 2
3 Tomato_P 1 1 2
4 Tomato_Q 4 6 12
EDIT:
Answer without explode (for pandas versions before 0.25):
cols = df.columns[1:]
df1 = (pd.concat([df[x].str.split(' x ', expand=True).stack() for x in cols],
                 axis=1, keys=cols)
         .reset_index(level=1, drop=True))
print (df1)
A B C
Index
1 1 2 3
2 2 4 5
3 3 5 2
4 1 1 2
4 4 6 12
df = df.iloc[:, [0]].join(df1)
df['Parameters'] += np.select([df.index.duplicated(keep='last'),
df.index.duplicated()],
['_P','_Q'],
default='')
df = df.reset_index(drop=True)
print (df)
Parameters A B C
0 Apple 1 2 3
1 Banana 2 4 5
2 Potato 3 5 2
3 Tomato_P 1 1 2
4 Tomato_Q 4 6 12

This is more of an explode problem; Series.explode is available from pandas 0.25 onward. Note that this answer keeps Index as a regular column:
df[['A','B','C']] = df[['A','B','C']].apply(lambda x: x.str.split(' x '))
df
Index Parameters A B C
0 1 Apple [1] [2] [3]
1 2 Banana [2] [4] [5]
2 3 Potato [3] [5] [2]
3 4 Tomato [1, 4] [1, 6] [2, 12]
df.set_index(['Index','Parameters'],inplace=True)
pd.concat([df[x].explode() for x in ['A','B','C']],axis=1)
A B C
Index Parameters
1 Apple 1 2 3
2 Banana 2 4 5
3 Potato 3 5 2
4 Tomato 1 1 2
Tomato 4 6 12
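Since pandas 1.3, DataFrame.explode also accepts a list of columns, so the per-column concat above can be collapsed into one call. A minimal sketch, starting again from the raw frame (all listed columns must hold lists of equal length per row):
cols = ['A', 'B', 'C']
df[cols] = df[cols].apply(lambda x: x.str.split(' x '))
# explode all three columns in lockstep (pandas >= 1.3)
df = df.explode(cols).reset_index(drop=True)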

Related

Append Pandas disjunction of 2 dataframes to first dataframe

Given two pandas tables, both with the three columns id, x and y. Several rows sharing the same id represent one graph with its x/y values. How would I find paths that do not exist in the first table but do exist in the second, and append them to the first table? The key problem is that the order of the graphs can differ between the two tables.
Example:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})
df1            df2            ---------> df1 (result)
id  x  y       id  x  y       id  x  y
 1  1  1        1  1  4        1  1  1
 1  1  2        1  1  5        1  1  2
 2  5  4        1  1  6        2  5  4
 2  4  4        2  1  1        2  4  4
 2  4  3        2  1  2        2  4  3
 3  1  4        3  5  4        3  1  4
 3  1  5        3  4  4        3  1  5
 3  1  6        3  4  3        3  1  6
                4 10  1        4 10  1
                4 10  2        4 10  2
                4  9  2        4  9  2
Should become:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3,4,4,4], 'x':[1,1,5,4,4,1,1,1,10,10,9], 'y':[1,2,4,4,3,4,5,6,1,2,2]})
As you can see, up to id = 3 df1 and df2 contain the same graphs, only in a different order: df1's first graph is, for example, df2's second. df2 also has a 4th path that is not in df1; that path should be detected and appended to df1. In short, I want the intersection of the two tables plus the disjunction of both appended to the first table, allowing for the fact that the ids (i.e. the order of the paths) can differ between the tables.
Imports:
import pandas as pd
Set starting DataFrames:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3],
'x':[1,1,5,4,4,1,1,1],
'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4],
'x':[1,1,1,1,1,5,4,4,10,10,9],
'y':[4,5,6,1,2,4,4,3,1,2,2]})
Outer Merge:
df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
produces:
df_merged =
id_x x y id_y
0 1.0 1 1 2
1 1.0 1 2 2
2 2.0 5 4 3
3 2.0 4 4 3
4 2.0 4 3 3
5 3.0 1 4 1
6 3.0 1 5 1
7 3.0 1 6 1
8 NaN 10 1 4
9 NaN 10 2 4
10 NaN 9 2 4
Note: why does id_x become floats? Because the outer merge fills unmatched rows with NaN, and NaN cannot be stored in a plain int64 column, so pandas upcasts the column to float.
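As an aside (not part of the original answer): if you want to keep integers while NaNs are still present, the nullable integer dtype available since pandas 0.24 is one option:
# capital-I 'Int64' is the nullable integer dtype, which can hold missing values
df_merged['id_x'] = df_merged['id_x'].astype('Int64')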
Fill NaN:
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
produces:
df_merged =
id_x x y id_y
0 1 1 1 2
1 1 1 2 2
2 2 5 4 3
3 2 4 4 3
4 2 4 3 3
5 3 1 4 1
6 3 1 5 1
7 3 1 6 1
8 4 10 1 4
9 4 10 2 4
10 4 9 2 4
Drop id_y:
df_merged = df_merged.drop(['id_y'], axis=1)
produces:
df_merged =
id_x x y
0 1 1 1
1 1 1 2
2 2 5 4
3 2 4 4
4 2 4 3
5 3 1 4
6 3 1 5
7 3 1 6
8 4 10 1
9 4 10 2
10 4 9 2
Rename id_x to id:
df_merged = df_merged.rename(columns={'id_x': 'id'})
produces:
df_merged =
id x y
0 1 1 1
1 1 1 2
2 2 5 4
3 2 4 4
4 2 4 3
5 3 1 4
6 3 1 5
7 3 1 6
8 4 10 1
9 4 10 2
10 4 9 2
The final program is four lines of code (after imports and DataFrame setup):
import pandas as pd
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3],
'x':[1,1,5,4,4,1,1,1],
'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4],
'x':[1,1,1,1,1,5,4,4,10,10,9],
'y':[4,5,6,1,2,4,4,3,1,2,2]})
df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
df_merged = df_merged.drop(['id_y'], axis=1)
df_merged = df_merged.rename(columns={'id_x': 'id'})
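For what it's worth, the same four steps can also be chained into a single expression on modern pandas (a sketch; suffixes=('', '_y') keeps the left column name as plain id):
df_merged = (df1.merge(df2, on=['x', 'y'], how='outer', suffixes=('', '_y'))
                .assign(id=lambda d: d['id'].fillna(d['id_y']).astype(int))
                .drop(columns='id_y'))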
Please remember to put a check next to the selected answer.
Mauritius, try this code:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4,5], 'x':[1,1,1,1,1,5,4,4,10,10,9,1], 'y':[4,5,6,1,2,4,4,3,1,2,2,2]})
df1_s = [{(x,y) for x, y in df1[['x','y']][df1.id==i].values} for i in df1.id.unique()]
def f(group):
    # set of (x, y) pairs making up one path in df2
    data = {(x, y) for x, y in group[['x', 'y']].values}
    return data not in df1_s

check = df2.groupby('id').apply(f).apply(pd.Series)
ids = check[check[0]].index.values
df2 = df2.set_index('id').loc[ids].reset_index()
df1 = df1.append(df2)  # on pandas >= 2.0, use: df1 = pd.concat([df1, df2])
OUT:
id x y
0 1 1 1
1 1 1 2
2 2 5 4
3 2 4 4
4 2 4 3
5 3 1 4
6 3 1 5
7 3 1 6
0 4 10 1
1 4 10 2
2 4 9 2
3 5 1 2
I think it can be done in a simpler, more pythonic way, but I have thought about it a lot and still don't know how. =)
Also, I think the ids should be checked for collisions between df1 and df2 before appending one frame to the other at the end. I might add this later.
Does this code do what you want?
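A possibly more compact variant of the same idea (a sketch, assuming a path is fully defined by its set of (x, y) pairs):
# one frozenset of coordinates per path in df1
paths1 = set(df1.groupby('id').apply(lambda g: frozenset(zip(g.x, g.y))))
# keep only df2 paths whose coordinate set is not already in df1
new_ids = [i for i, g in df2.groupby('id')
           if frozenset(zip(g.x, g.y)) not in paths1]
df1 = pd.concat([df1, df2[df2.id.isin(new_ids)]], ignore_index=True)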

Pandas: Sort the column on frequency by another column having same value grouped

I have a dataframe which is grouped by the y column and sorted by the count of each y value.
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort('count', ascending=False)
Output:
x y count
1 a 4
3 a 4
2 a 4
1 a 4
2 c 3
1 c 3
2 c 3
2 b 2
1 b 2
Now I want to sort the x column by its frequency within the same y group, like below:
Expected Output:
x y count
1 a 4
1 a 4
2 a 4
3 a 4
2 c 3
2 c 3
1 c 3
2 b 2
1 b 2
It seems you need groupby with value_counts, and then numpy.repeat to expand the index values by their counts into a DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y x
a 1 2
2 1
3 1
c 2 2
1 1
b 1 1
2 1
Name: x, dtype: int64
df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
#change order of columns (reindex_axis was removed in later pandas, use reindex)
df1 = df1.reindex(['x','y'], axis=1)
print (df1)
x y
0 1 a
1 1 a
2 2 a
3 3 a
4 2 c
5 2 c
6 1 c
7 1 b
8 2 b
If you are using an older pandas version where df.sort_values is not supported, you can use:
df.sort(columns=['count','x'], ascending=[False,True])
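On modern pandas the whole problem can also be solved with transform counts and a single sort_values (a sketch; ties between distinct y groups having the same count would need an extra tie-break key):
# size of each y group, and frequency of each x within its y group
df['count'] = df.groupby('y')['y'].transform('count')
df['x_count'] = df.groupby(['y', 'x'])['x'].transform('count')
df = (df.sort_values(['count', 'y', 'x_count', 'x'],
                     ascending=[False, True, False, True])
        .drop(columns='x_count'))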

Add multiple columns to DataFrame and set them equal to an existing column

I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
You can use the .assign() method:
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
or a slightly more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
NOTE: as #Cpt_Jauchefuerst mentioned in the comments, DataFrame.assign(z=1, a=1) adds columns in alphabetical order on Python < 3.6 - i.e. a is added before z. On Python 3.6+ with pandas 0.23+, the keyword argument order is preserved.
A pd.concat approach:
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
Turns out you can use a loop to do this:
for i in ['b','c']: df[i] = df.a
You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
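One more trick, as an aside: chained assignment also works when creating a couple of new columns, since pandas copies the data on each column insertion:
df['b'] = df['c'] = df['a']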

Sort pandas dataframe within groups

I have a dataframe:
>>> df
Category Score
0 A 1
1 A 2
2 A 3
3 B 5
4 B 9
I expect the output below, with Score sorted in descending order within each Category:
>>> df
Category Score
2 A 3
1 A 2
0 A 1
4 B 9
3 B 5
Any ideas?
Use sort_values, listing the sort columns in order and giving each its own ascending flag:
In [17]: df.sort_values(by=['Category', 'Score'], ascending=[True, False])
Out[17]:
Category Score
2 A 3
1 A 2
0 A 1
4 B 9
3 B 5
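If the categories must keep their original appearance order instead of being sorted alphabetically, a groupby-based variant is one option (a sketch):
(df.groupby('Category', sort=False, group_keys=False)
   .apply(lambda g: g.sort_values('Score', ascending=False)))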

Pandas DataFrame get column combined max values

I have a pandas DataFrame like the following.
df = pd.DataFrame({"A": [3,1,2,4,5,3,4,10], "B": [1,3,2,4,0,0,1,0]})
The row values 0 to 10 are recommendations (10 is best). Each column (A, B, etc.) is a category the 0-to-10 recommendation relates to. All categories have the same weight, and each row is related to one item.
I want the DataFrame sorted so that items whose values are high across both (or more) categories come first. A row with a value of 10 in category A but 0 in category B would therefore not be the highest-rated item. In the example above, the row with values [4, 4] is the best choice.
My groupby solution does not give the expected result.
grouped = df.groupby(['A', 'B'])
grouped[["A", "B"]].max().sort(ascending=False)
result:
        A   B
A  B
10 0   10   0
5  0    5   0
4  4    4   4
   1    4   1
3  1    3   1
   0    3   0
2  2    2   2
1  3    1   3
A row-wise total sum would also not yield the expected result, since it does not differentiate between categories.
df = pd.DataFrame({"A": [3,1,2,4,5,3,4,10], "B": [1,3,2,4,0,0,1,0]})
Then calculate the rank for each column in the data frame:
rank = df.rank(method = "dense")
rank
Out[44]:
A B
0 3 2
1 1 4
2 2 3
3 4 5
4 5 1
5 3 1
6 4 2
7 6 1
Add a new column to the data frame which is the total rank across all categories:
df['total_rank'] = rank.sum(axis = 1)
df
Out[46]:
A B total_rank
0 3 1 5
1 1 3 5
2 2 2 5
3 4 4 9
4 5 0 6
5 3 0 4
6 4 1 6
7 10 0 7
Finally, sort your data frame by total rank:
df.sort_values('total_rank', ascending=False)  # older pandas: df.sort(columns='total_rank', ascending=False)
Out[49]:
A B total_rank
3 4 4 9
7 10 0 7
4 5 0 6
6 4 1 6
0 3 1 5
1 1 3 5
2 2 2 5
5 3 0 4
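The same idea condenses to two lines on modern pandas (a sketch):
df['total_rank'] = df[['A', 'B']].rank(method='dense').sum(axis=1)
df = df.sort_values('total_rank', ascending=False)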
How about this:
df['pos'] = df.A/df.A.mean() + df.B/df.B.mean()
df.sort_values('pos', ascending=False)  # older pandas: df.sort(columns='pos', ascending=False)
# A B pos
#3 4 4 3.909091
#7 10 0 2.500000
#1 1 3 2.431818
#2 2 2 1.954545
#6 4 1 1.727273
#0 3 1 1.477273
#4 5 0 1.250000
#5 3 0 0.750000
If you have more columns you want to rank, e.g. ['A','B','C', ...]:
cols = ['A','B'] # ,'C', 'D', ... ]
df['pos'] = np.sum([df[col]/df[col].mean() for col in cols], axis=0)  # pandas.np was removed; use numpy imported as np
Update:
Because 0 is considered a quality value (the lowest), I would amend my answer as follows (not sure it makes a huge difference):
df['pos'] = (df.A+1)/(df.A.max()+1) + (df.B+1)/(df.B.max()+1)
df.sort_values('pos', ascending=False)  # older pandas: df.sort(columns='pos', ascending=False)
# A B pos
#3 4 4 1.454545
#7 10 0 1.200000
#1 1 3 0.981818
#2 2 2 0.872727
#6 4 1 0.854545
#0 3 1 0.763636
#4 5 0 0.745455
#5 3 0 0.563636
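The amended scoring generalizes to more columns the same way (a sketch, assuming numeric columns and numpy imported as np):
cols = ['A', 'B']  # extend with 'C', 'D', ... as needed
df['pos'] = np.sum([(df[col] + 1) / (df[col].max() + 1) for col in cols], axis=0)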
