Pandas stack() if columns have a specific value - python

I am trying to stack this table based on ID column but only considering columns [A-D] where the value is 1 and not 0.
Current df:
ID A B C D
1 1 0 0 1
3 0 1 0 1
7 1 0 1 1
8 1 0 0 0
What I want:
ID LETTER
1 A
1 D
3 B
3 D
7 A
7 C
7 D
8 A
The following code works but I need a more efficient solution as I have a df with 93434 rows x 12377 columns.
stacked_df = df.set_index('ID').stack().reset_index(name='has_letter').rename(columns={'level_1':'LETTER'})
stacked_df = stacked_df[stacked_df['has_letter']==1].reset_index(drop=True)
stacked_df.drop(['has_letter'], axis=1, inplace=True)

Try:
print(
    df.set_index("ID")
    .apply(lambda x: x.index[x == 1], axis=1)
    .reset_index()
    .explode(0)
    .rename(columns={0: "LETTERS"})
)
Prints:
ID LETTERS
0 1 A
0 1 D
1 3 B
1 3 D
2 7 A
2 7 C
2 7 D
3 8 A
Or:
x = df.set_index("ID").stack()
print(
    x[x == 1]
    .reset_index()
    .drop(columns=0)
    .rename(columns={"level_1": "LETTER"})
)
Prints:
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A

You can mask the non-1 values and stack to remove the NaNs:
df2 = df.rename_axis(columns='LETTERS').set_index('ID')
stacked_df = (df2.where(df2.eq(1)).stack()
                 .reset_index().iloc[:, :2]
              )
Output:
ID LETTERS
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A

Try this:
(df.set_index('ID')
   .dot(df.columns[1:])          # use inner product of column names and values
   .apply(list)                  # separate each letter
   .explode()                    # explode each list
   .reset_index(name='LETTER')   # reset index for df
)
Output:
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
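Since the question asks for something more efficient on a very wide frame, here is a minimal sketch of a NumPy-based variant. It is not one of the answers above and has not been benchmarked on the 93434 x 12377 data, so treat the speed benefit as an assumption. It finds the positions of the 1s with np.nonzero and looks up the matching IDs and column names, without building the full stacked frame first.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "ID": [1, 3, 7, 8],
    "A": [1, 0, 1, 1],
    "B": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "D": [1, 1, 1, 0],
})

letters = df.columns.drop("ID")  # every column except ID
# Row and column positions of the cells equal to 1, in row-major order
rows, cols = np.nonzero(df[letters].to_numpy() == 1)

stacked_df = pd.DataFrame({
    "ID": df["ID"].to_numpy()[rows],
    "LETTER": letters.to_numpy()[cols],
})
print(stacked_df)
```

Because np.nonzero walks the array row by row, the result keeps the same ID order as the stack-based answers.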

Related

split columns wrt column names using pandas dataframe

I have a dataframe df:
(A,B) (B,C) (D,B) (E,F)
0 3 0 1
1 1 3 0
2 2 4 2
I want to split it into different columns for all columns in df as shown below:
A B B C D B E F
0 0 3 3 0 0 1 1
1 1 1 1 3 3 0 0
2 2 2 2 4 4 2 2
and add similar columns together:
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
how to achieve this using pandas?
With pandas, you can use this :
out = (
    df
    .T
    .reset_index()
    .assign(col=lambda x: x.pop("index").str.strip("()").str.split(","))
    .explode("col")
    .groupby("col", as_index=False).sum()
    .set_index("col")
    .T
    .rename_axis(None, axis=1)
)
# Output:
print(out)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
Assuming the column labels are tuples such as ('A', 'B') rather than strings:
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).sum(level=0).T
result:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
If a FutureWarning occurs (sum(level=...) was deprecated and then removed in pandas 2.0), use the following code instead:
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).groupby(level=0).sum().T
This gives the same result.
Convert the column labels to a MultiIndex with Series.str.findall, then concat the frame with each level dropped in turn and sum the columns that share a name:
df.columns = df.columns.str.findall(r'(\w+)').map(tuple)
df = (pd.concat([df.droplevel(x, axis=1) for x in range(df.columns.nlevels)], axis=1)
        .groupby(level=0, axis=1)
        .sum())
print (df)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
To write the output to a file without the index, use:
df.to_csv('file.csv', index=False)
You can use findall to extract the variables in the header, then melt and explode, and finally pivot_table:
out = (df
    .reset_index().melt('index')
    .assign(variable=lambda d: d['variable'].str.findall(r'(\w+)'))
    .explode('variable')
    .pivot_table(index='index', columns='variable', values='value', aggfunc='sum')
    .rename_axis(index=None, columns=None)
)
Output:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
Reproducible input:
df = pd.DataFrame({'(A,B)': [0, 1, 2],
'(B,C)': [3, 1, 2],
'(D,B)': [0, 3, 4],
'(E,F)': [1, 0, 2]})
printing/saving without index:
print(out.to_string(index=False))
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
# as file
out.to_csv('yourfile.csv', index=False)
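As a more explicit alternative to the chained answers above, the same sums can be accumulated with a plain loop over the column labels. This is only a sketch that assumes the labels are strings of the form '(A,B)', as in the reproducible input; the variable names are illustrative.

```python
import pandas as pd

# Reproducible input from the answer above
df = pd.DataFrame({'(A,B)': [0, 1, 2],
                   '(B,C)': [3, 1, 2],
                   '(D,B)': [0, 3, 4],
                   '(E,F)': [1, 0, 2]})

out = pd.DataFrame(index=df.index)
for col in df.columns:
    # '(A,B)' -> ['A', 'B']
    for name in col.strip("()").split(","):
        name = name.strip()
        # Add this column's values to the running total for that name
        out[name] = out.get(name, 0) + df[col]

out = out[sorted(out.columns)]  # order the columns alphabetically
print(out)
```

The loop visits each original column once per name in its label, so it may be easier to follow and adapt than the transpose-and-explode chain.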

DataFrame.pivot() gives a different result from what I expected

I'm referring to
https://github.com/pandas-dev/pandas/tree/main/doc/cheatsheet.
According to the cheat sheet, after pivot() all the values sit in rows 0 and 1.
But when I actually call pivot(), the result is different, as shown below.
DataFrame before pivot():
DataFrame after pivot():
Is this result intended?
In your data, the grey column (index of the row) is missing:
df = pd.DataFrame({'variable': list('aaabbbccc'), 'value': range(9)})
print(df)
# Output
variable value
0 a 0
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
6 c 6
7 c 7
8 c 8
Add the grey column:
df['grey'] = df.groupby('variable').cumcount()
print(df)
# Output
variable value grey
0 a 0 0
1 a 1 1
2 a 2 2
3 b 3 0
4 b 4 1
5 b 5 2
6 c 6 0
7 c 7 1
8 c 8 2
Now you can pivot:
df = df.pivot(index='grey', columns='variable', values='value')
print(df)
# Output
variable a b c
grey
0 0 3 6
1 1 4 7
2 2 5 8
Take the time to read How can I pivot a dataframe?
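The screenshots from the question are not preserved here, so the following is only a plausible reconstruction of what happened: it assumes pivot was called without a row key. In that case every original row keeps its own index label and the result is mostly NaN. Adding the cumcount key, as in the answer, collapses it to the compact shape.

```python
import pandas as pd

df = pd.DataFrame({'variable': list('aaabbbccc'), 'value': range(9)})

# Without a row key, pivot keeps one row per original index label (0..8)
wide_sparse = df.pivot(columns='variable', values='value')
print(wide_sparse.shape)  # 9 rows, 3 columns, mostly NaN

# With a per-group counter as the row key, values line up in 3 rows
df['grey'] = df.groupby('variable').cumcount()
wide = df.pivot(index='grey', columns='variable', values='value')
print(wide)
```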

Pandas apply to create multiple columns, using multiple columns as input

I am trying to use a function to create multiple outputs, using multiple columns as inputs. Here's my attempt:
df = pd.DataFrame(np.random.randint(0,10,size=(6, 4)), columns=list('ABCD'))
df.head()
A B C D
0 8 2 5 0
1 9 9 8 6
2 4 0 1 7
3 8 4 0 3
4 5 6 9 9
def some_func(a, b, c):
    return a+b, a+b+c
df['dd'], df['ee'] = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
df.head()
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
The outputs are all 0 for the first new column and all 1 for the second. I am interested in the correct solution, but I am also curious why my code produced this result.
You can assign to subset ['dd','ee']:
def some_func(a, b, c):
    return a+b, a+b+c
df[['dd','ee']] = df.apply(lambda x: some_func(a = x['A'],
                                               b = x['B'],
                                               c = x['C']), axis=1, result_type='expand')
print (df)
A B C D dd ee
0 4 7 7 3 11 18
1 2 1 3 4 3 6
2 4 7 6 0 11 17
3 0 9 1 1 9 10
4 5 6 5 9 11 16
5 3 2 4 9 5 9
If possible, a vectorized solution is better and faster:
df = df.assign(dd = df.A + df.B, ee = df.A + df.B + df.C)
To explain the 0 and 1: they are actually the column names of
df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
That is
x = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
a, b = x
print(a) # first column name
print(b) # second column name
output:
0
1
So the assignment is effectively
df['dd'], df['ee'] = 0, 1
which results in
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
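The behaviour described above can be reproduced on its own: iterating over (or unpacking) a DataFrame yields its column labels, not its columns. A small sketch with a hypothetical two-column frame standing in for the expanded apply result:

```python
import pandas as pd

# Stand-in for the frame returned by apply(..., result_type='expand')
expanded = pd.DataFrame([[10, 15], [18, 26]])

# Tuple unpacking iterates over the frame, which yields column labels
first, second = expanded
print(first, second)

# To get the actual columns, select them explicitly
dd, ee = expanded[0], expanded[1]
print(dd.tolist(), ee.tolist())
```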
Alternative way:
df['dd'], df['ee'] = zip(*df.apply(lambda x: some_func(x['A'], x['B'], x['C']), axis=1))

Pad dataframe discontinuous column

I have the following dataframe:
Name B C D E
1 A 1 2 2 7
2 A 7 1 1 7
3 B 1 1 3 4
4 B 2 1 3 4
5 B 3 1 3 4
What I'm trying to do is to obtain a new dataframe in which, for rows with the same "Name", the elements in the "B" column are continuous, hence in this example for rows with "Name" = A, the dataframe would have to be padded with elements ranging from 1 to 7, and the values for columns C, D, E should be 0.
Name B C D E
1 A 1 2 2 7
2 A 2 0 0 0
3 A 3 0 0 0
4 A 4 0 0 0
5 A 5 0 0 0
6 A 6 0 0 0
7 A 7 0 0 0
8 B 1 1 3 4
9 B 2 1 5 4
10 B 3 4 3 6
What I've done so far is to turn the B column values for the same "Name" into continuous values:
new_idx = df_.groupby('Name').apply(lambda x: np.arange(x.index.min(), x.index.max() + 1)).apply(pd.Series).stack()
and reindexing the original (having set B as the index) df using this new Series, but I'm having trouble reindexing using duplicates. Any help would be appreciated.
You can use:
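def f(x):
    # Reindex each group on its full range of B values; new rows are filled with 0
    a = np.arange(x.index.min(), x.index.max() + 1)
    return x.reindex(a, fill_value=0)
new_idx = (df.set_index('B')
             .groupby('Name')
             .apply(f)
             .drop(columns='Name', errors='ignore')  # .drop('Name', 1) no longer works in pandas 2.0+
             .reset_index()
             .reindex(columns=df.columns))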
print (new_idx)
Name B C D E
0 A 1 2 2 7
1 A 2 0 0 0
2 A 3 0 0 0
3 A 4 0 0 0
4 A 5 0 0 0
5 A 6 0 0 0
6 A 7 1 1 7
7 B 1 1 3 4
8 B 2 1 3 4
9 B 3 1 3 4
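An alternative sketch that avoids groupby.apply: build the complete (Name, B) index up front and reindex once. This is not the answer's method, just one way to sidestep the duplicate-label problem mentioned in the question. It assumes each (Name, B) pair is unique, and the names bounds and full_index are illustrative.

```python
import pandas as pd

# Input frame from the question
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "B"],
    "B": [1, 7, 1, 2, 3],
    "C": [2, 1, 1, 1, 1],
    "D": [2, 1, 3, 3, 3],
    "E": [7, 7, 4, 4, 4],
})

# Smallest and largest B per Name
bounds = df.groupby("Name")["B"].agg(["min", "max"])

# Every (Name, B) pair that should exist in the padded frame
full_index = pd.MultiIndex.from_tuples(
    [(name, b) for name, lo, hi in bounds.itertuples() for b in range(lo, hi + 1)],
    names=["Name", "B"],
)

padded = (df.set_index(["Name", "B"])
            .reindex(index=full_index, fill_value=0)  # missing pairs become rows of 0
            .reset_index())
print(padded)
```

Because the index is the (Name, B) pair rather than B alone, repeated B values across different names no longer collide.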

Add new column to pandas DataFrame with value dependent on previous row

I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
for example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
and now I want to create column e, where for each row i the value of df1['e'][i] would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
how can I achieve this?
You can use sub with shift:
df['e'] = df.d.sub(df.d.shift(), fill_value=0)
print (df)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If you need to convert the result to int:
df['e'] = df.d.sub(df.d.shift(), fill_value=0).astype(int)
print (df)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
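For a plain row-to-row difference, Series.diff is a shorter route to the same column. A minimal sketch using the sample values from the question (the frame is rebuilt by hand here because the original used random data):

```python
import pandas as pd

# Sample values copied from the question's output
df1 = pd.DataFrame({
    "a": [9, 3, 1, 8],
    "b": [3, 9, 7, 0],
    "c": [3, 5, 5, 1],
    "d": [0, 1, 6, 7],
})

# diff() computes d[i] - d[i-1]; the first row has no predecessor, so it is NaN
df1["e"] = df1["d"].diff().fillna(0).astype(int)
print(df1)
```

One difference to be aware of: with diff().fillna(0) the first row is always 0, whereas sub(shift(), fill_value=0) gives d[0] - 0 for the first row. The two agree here only because d[0] is 0.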
