Let's say I have a DF like this:
   Mean 1  Mean 2  Stat 1  Stat 2 ID
0       5      10      15      20  Z
1       3       6       9      12  X
Now, I want to split the dataframe so the data is separated based on whether it belongs to #1 or #2 for each ID.
Basically, I would double the number of rows for each ID, with each row dedicated to either #1 or #2, and a new column would specify which number we are looking at. Instead of Mean 1 and Mean 2 being on the same row, they would be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Expected output:

   Mean  Stat ID  #
0     5    15  Z  1
1    10    20  Z  2
2     3     9  X  1
3     6    12  X  2
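For reference, the input can be reproduced like this (a sketch; the column names and values are taken directly from the tables above):

import pandas as pd

df = pd.DataFrame({
    'Mean 1': [5, 3],
    'Mean 2': [10, 6],
    'Stat 1': [15, 9],
    'Stat 2': [20, 12],
    'ID': ['Z', 'X'],
})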
Use pd.wide_to_long:
new_df = pd.wide_to_long(
df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or, if the row order must match the OP's expected output: set_index on ID, split the column names into a MultiIndex with str.split, then stack:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
        .pivot(index=['idx', 'ID', '#'], columns='variable')
        .reset_index()
        .drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
This might be quite an easy problem, but I can't deal with it properly and didn't find an exact answer here. So, let's say we have a pandas DataFrame as below:
df:
ID a b c d
0 1 3 4 9
1 2 8 8 3
2 1 3 10 12
3 0 1 3 0
I want to remove all the rows that contain repeating values in different columns. In other words, I am only interested in keeping rows with unique values. Referring to the above example, the desired output should be:
ID a b c d
0 1 3 4 9
2 1 3 10 12
(I deliberately didn't change the ID values, to make the comparison easier.) Please let me know if you have any ideas. Thanks!
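For reference, a sketch of this sample frame (the printed outputs below show ID as the index, so it is assumed to be the index here):

import pandas as pd

df = pd.DataFrame(
    {'a': [1, 2, 1, 0], 'b': [3, 8, 3, 1], 'c': [4, 8, 10, 3], 'd': [9, 3, 12, 0]},
    index=pd.Index([0, 1, 2, 3], name='ID'),
)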
You can compare the length of the set of each row's values with the number of columns:
lc = len(df.columns)
df1 = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
print (df1)
a b c d
ID
0 1 3 4 9
2 1 3 10 12
Or test with Series.duplicated and Series.any:
df1 = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
Or DataFrame.nunique:
df1 = df[df.nunique(axis=1).eq(lc)]
Or:
df1 = df[[len(set(x)) == lc for x in df.to_numpy()]]
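For larger frames, a vectorized NumPy check may be faster than the row-wise apply (a sketch, assuming all values are numeric):

import numpy as np

# Sort each row; a row contains a repeated value exactly when two adjacent
# sorted values are equal.
arr = np.sort(df.to_numpy(), axis=1)
df1 = df[(arr[:, 1:] != arr[:, :-1]).all(axis=1)]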
I have 2 dataframes
df1 = pd.DataFrame([["M","N","O"],["A","B","C"],["X","Y","Z"],[2,3,4],[1,2,3]])
0 1 2
M N O
A B C
X Y Z
2 3 4
1 2 3
df2 = pd.DataFrame([["P","Q","R","S"],["X","Z","W","Y"],[4,5,6,7],[7,8,9,3]])
0 1 2 3
P Q R S
X Z W Y
4 5 6 7
7 8 9 3
I want to read the 1st dataframe, drop the rows up to the row that starts with X, and make that row the column names; then read the 2nd dataframe, again drop the rows up to the row that starts with X, and append it to the 1st dataframe. I need to repeat this process in a loop because I have multiple such dataframes.
Expected Output:
df_out = pd.DataFrame([[2,3,4,0],[1,2,3,0],[4,7,5,6],[7,3,8,9]],columns=["X","Y","Z","W"])
X Y Z W
2 3 4 0
1 2 3 0
4 7 5 6
7 3 8 9
How to do it?
First, test whether the value X exists in any column of the shifted DataFrame to select all rows after the match, using DataFrame.cummax with DataFrame.any, and set the column names from the matching row with DataFrame.set_axis. Apply the same steps to the other DataFrame, join them with concat, replace missing values, and for the expected column order add DataFrame.reindex with the union of both sets of column names:
m1 = df1.shift().eq('X').cummax().any(axis=1)
cols1 = df1[df1.eq('X').any(axis=1)].to_numpy().tolist()
df11 = df1[m1].set_axis(cols1, axis=1)
m2 = df2.shift().eq('X').cummax().any(axis=1)
cols2 = df2[df2.eq('X').any(axis=1)].to_numpy().tolist()
df22 = df2[m2].set_axis(cols2, axis=1)
df = (pd.concat([df11, df22], ignore_index=True)
        .fillna(0)
        .reindex(df11.columns.union(df22.columns, sort=False), axis=1))
print (df)
X Y Z W
0 2 3 4 0
1 1 2 3 0
2 4 7 5 6
3 7 3 8 9
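Since the OP mentions having several such dataframes, the same steps could be wrapped in a small helper and run in a loop (a sketch; the helper name and the dfs list are hypothetical):

def trim_and_label(frame):
    # Rows strictly after the row that holds the header values ('X', ...)
    m = frame.shift().eq('X').cummax().any(axis=1)
    # The header row itself, as a flat list of labels
    cols = frame[frame.eq('X').any(axis=1)].to_numpy().tolist()[0]
    return frame[m].set_axis(cols, axis=1)

dfs = [df1, df2]  # extend with the other dataframes
parts = [trim_and_label(d) for d in dfs]

# Union the column names in order of first appearance
all_cols = parts[0].columns
for p in parts[1:]:
    all_cols = all_cols.union(p.columns, sort=False)

df_out = pd.concat(parts, ignore_index=True).fillna(0).reindex(all_cols, axis=1)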
This works,
shift = 0
for index in df.index:
    if df.iloc[index - 1, 0] == "X":
        X = df.iloc[index - 1, :].values
        break
    shift -= 1

df = df.shift(shift).dropna()
df.columns = X
df
Output:

   X  Y  Z
0  2  3  4
1  1  2  3
I am trying to pivot df111 into df222:
ID1 ID2 Type Value
0 1 a X 1
1 1 a Y 2
2 1 b X 3
3 1 b Y 4
4 2 a X 5
5 2 a Y 6
6 2 b X 7
7 2 b Y 8
ID1 ID2 X Value Y Value
0 1 a 1 2
1 1 b 3 4
2 2 a 5 6
3 2 b 7 8
I tried df111.pivot() and df111.groupby() but had no luck. Can someone throw me a one-liner? Thanks!
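For reference, df111 can be built like this (a sketch, reconstructed from the table above):

import pandas as pd

df111 = pd.DataFrame({
    'ID1':   [1, 1, 1, 1, 2, 2, 2, 2],
    'ID2':   ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
    'Type':  ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8],
})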
You can do it by first calling set_index on the first three columns and then unstack. To match the exact output, rename the columns by keeping only the second level, then reset_index:
df222 = df111.set_index(['ID1', 'ID2','Type']).unstack()
df222.columns = [col[1] + ' Value' for col in df222.columns]
df222 = df222.reset_index()
print (df222)
ID1 ID2 X Value Y Value
0 1 a 1 2
1 1 b 3 4
2 2 a 5 6
3 2 b 7 8
And if you want to do it with method chaining:
df222 = (df111.set_index(['ID1', 'ID2', 'Type']).Value.unstack()
              .rename(columns={'X': 'X Value', 'Y': 'Y Value'})
              .rename_axis(None, axis="columns")
              .reset_index())
If you have the pivot_table function, why the hell provide pivot as well? This is just confusing...
df333 = pd.pivot_table(df111, index=['ID1','ID2'], columns=['Type'], values='Value')
df333.reset_index()
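If the column headers need to match the OP's expected output exactly ('X Value', 'Y Value'), one possible follow-up (a sketch) is:

df333 = (pd.pivot_table(df111, index=['ID1', 'ID2'], columns='Type', values='Value')
           .add_suffix(' Value')        # X -> 'X Value', Y -> 'Y Value'
           .rename_axis(None, axis=1)   # drop the 'Type' columns name
           .reset_index())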
df222 = (df111.set_index(['ID1', 'ID2', 'Type']).unstack()
              .add_suffix(' Value'))
df222.columns = [lev[1] for lev in df222.columns]
df222.reset_index(inplace=True)
I want an efficient way to solve the problem below, because my current code seems inefficient.
First of all, let me provide a dummy dataset.
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df1= {'a0' : [1,2,2,1,3], 'a1' : [2,3,3,2,4], 'a2' : [3,4,4,3,5], 'a3' : [4,5,5,4,6], 'a4' : [5,6,6,5,7]}
df2 = {'b0' : [3,6,6,3,8], 'b1' : [6,8,8,6,9], 'b2' : [8,9,9,8,7], 'b3' : [9,7,7,9,2], 'b4' : [7,2,2,7,1]}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
My actual dataset has more than 100,000 rows and 15 columns. Now, what I want to do is pretty complicated to explain, but here we go.
Goal: I want to create a new df using the two dfs above.
find the global min and max from df1. Since the values are sorted within each row, column 'a0' will always hold the row minimum and 'a4' the row maximum. Therefore, I will find the minimum of column 'a0' and the maximum of column 'a4'.
Min = df1['a0'].min()
Max = df1['a4'].max()
Min  # 1
Max  # 7
Then I will create a data frame filled with 0s, with columns covering range(Min, Max + 1). In this case, 1 through 7.
column = []
for i in np.arange(Min, Max + 1):
    column.append(i)

newdf = pd.DataFrame(0, index=df1.index, columns=column)
The third step is to find the place where the values from df2 will go:
I want to loop through each value in df1 and match it with the column of the same name in the new df, in the same row.
For example, if we are looking at row 0 and go through each column, the values are [1, 2, 3, 4, 5]. Then row 0 of newdf, columns 1, 2, 3, 4, 5, will be filled with the corresponding values from df2.
Lastly, each corresponding value of df2 (at the same position) is written into the location found in the previous step.
So, the very first row of the new df will look like this:
output = {'1' : [3], '2' : [6], '3' : [8], '4' : [9], '5' : [7], '6' : [0], '7' : [0]}
output = pd.DataFrame(output)
Columns 6 and 7 are not updated because 6 and 7 do not appear in the very first row of df1.
Here is my code for this process:
for rowidx in range(0, len(df1)):
    for columnidx in range(0, len(df1.columns)):
        new_column = df1[str(df1.columns[columnidx])][rowidx]
        newdf.loc[newdf.index[rowidx], new_column] = df2['b' + df1.columns[columnidx][1:]][rowidx]
I think this does the job, but as I said, my actual dataset is huge, with 2,999,999 rows, and the Min-to-Max range is 282, which means 282 columns in the new data frame.
So, the code above runs forever. Is there a faster way to do this? I think I learned something like map-reduce, but I don't know if that would apply here.
The idea is to assign default positional column names to both DataFrames, then concat the stacked Series, append the first (0) column to the index, drop the second index level, and finally use DataFrame.unstack:
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
newdf = (pd.concat([df1.stack(), df2.stack()], axis=1)
           .set_index(0, append=True)
           .reset_index(level=1, drop=True)[1]
           .unstack(fill_value=0)
           .rename_axis(None, axis=1))
print (newdf)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Other solutions:
comp = [pd.Series(a, index=df1.loc[i]) for i, a in enumerate(df2.values)]
df = pd.concat(comp, axis=1).T.fillna(0).astype(int)
print (df)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Or:
comp = [dict(zip(x, y)) for x, y in zip(df1.values, df2.values)]
c = pd.DataFrame(comp).fillna(0).astype(int)
print (c)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
I have a table
I want to sum the values of the columns belonging to the same class h.*, so my final table will look like this:
Is it possible to aggregate by string column name?
Thank you for any suggestions!
Use a lambda function that selects the first 3 characters of each column name, with parameter axis=1, or index the column names the same way, and aggregate with sum:
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
Or:
df1 = df.set_index('object')
df2 = df1.groupby(df1.columns.str[:3], axis=1).sum().reset_index()
Sample:
np.random.seed(123)
cols = ['object', 'h.1.1','h.1.2','h.1.3','h.1.4','h.1.5',
'h.2.1','h.2.2','h.2.3','h.2.4','h.3.1','h.3.2','h.3.3']
df = pd.DataFrame(np.random.randint(10, size=(4, 13)), columns=cols)
print (df)
object h.1.1 h.1.2 h.1.3 h.1.4 h.1.5 h.2.1 h.2.2 h.2.3 h.2.4 \
0 2 2 6 1 3 9 6 1 0 1
1 9 3 4 0 0 4 1 7 3 2
2 4 8 0 7 9 3 4 6 1 5
3 8 3 5 0 2 6 2 4 4 6
h.3.1 h.3.2 h.3.3
0 9 0 0
1 4 7 2
2 6 2 1
3 3 0 6
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
print (df2)
object h.1 h.2 h.3
0 2 21 8 9
1 9 11 13 13
2 4 27 16 9
3 8 16 16 9
The solution above works great, but is vulnerable if the group number in h.X goes beyond a single digit (e.g. h.10.* would be lumped together with h.1.*). I'd recommend the following:
Sample Data:
cols = ['h.%d.%d' %(i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols, index=['p_%d'%p for p in range(4)])
Proposed Solution:
new_df = df.groupby(df.columns.str.split('.').str[1], axis=1).sum()
new_df.columns = 'h.' + new_df.columns  # the columns are originally just the numbers 1, 2, 3, ...; this brings them back to h.1, h.2, h.3, ...
Alternative Solution:
Going through a MultiIndex might be more convoluted, but may be useful when manipulating this data elsewhere.
df.columns = df.columns.str.split('.', expand=True) # Transform into a multiindex
new_df = df.sum(axis = 1, level=[0,1])
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1) # Rename columns
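Note that newer pandas versions have deprecated (and eventually removed) the level= argument of DataFrame.sum; if yours is one of them, a grouped sum over the column levels should be equivalent (a sketch):

# Grouped sum over the first two column levels, for pandas versions
# where DataFrame.sum(..., level=...) is no longer available.
new_df = df.T.groupby(level=[0, 1]).sum().T
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1)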