I have a dataframe df:
   (A,B)  (B,C)  (D,B)  (E,F)
0      0      3      0      1
1      1      1      3      0
2      2      2      4      2
I want to split each column of df into separate columns, as shown below:
   A  B  B  C  D  B  E  F
0  0  0  3  3  0  0  1  1
1  1  1  1  1  3  3  0  0
2  2  2  2  2  4  4  2  2
and then add columns with the same name together:
   A  B  C  D  E  F
0  0  3  3  0  1  1
1  1  5  1  3  0  0
2  2  8  2  4  2  2
How can I achieve this using pandas?
With pandas, you can use this:
out = (
    df
    .T
    .reset_index()
    .assign(col=lambda x: x.pop("index").str.strip("()").str.split(","))
    .explode("col")
    .groupby("col", as_index=False).sum()
    .set_index("col")
    .T
    .rename_axis(None, axis=1)
)
# Output:
print(out)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
This treats each column name, e.g. (A, B), as an actual tuple:
pd.concat(
    [pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]
).sum(level=0).T
result:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
If a FutureWarning occurs (sum(level=...) is deprecated in newer pandas), use the following code instead:

pd.concat(
    [pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]
).groupby(level=0).sum().T

The result is the same.
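If the headers are strings like '(A,B)' rather than real tuples, a minimal conversion sketch first (assuming the names always follow that exact pattern, with no spaces):

# turn '(A,B)' into the tuple ('A', 'B') so len(i) and list(i) behave as intended
df.columns = [tuple(c.strip('()').split(',')) for c in df.columns]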
Convert the column names to a MultiIndex with Series.str.findall, then use concat with each level removed in turn:
df.columns = df.columns.str.findall(r'(\w+)').map(tuple)

df = (pd.concat([df.droplevel(x, axis=1) for x in range(df.columns.nlevels)], axis=1)
        .groupby(level=0, axis=1)
        .sum())
print (df)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
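Note: groupby(level=0, axis=1) is deprecated in pandas 2.x; a sketch of the same aggregation via a double transpose:

df = (pd.concat([df.droplevel(x, axis=1) for x in range(df.columns.nlevels)], axis=1)
        .T.groupby(level=0).sum()  # group the (transposed) rows by their label
        .T)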
To write the output to a file without the index, use:
df.to_csv('file.csv', index=False)
You can use findall to extract the variables in the header, then melt and explode, and finally pivot_table:
out = (df
    .reset_index().melt('index')
    .assign(variable=lambda d: d['variable'].str.findall(r'(\w+)'))
    .explode('variable')
    .pivot_table(index='index', columns='variable', values='value', aggfunc='sum')
    .rename_axis(index=None, columns=None)
)
Output:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
Reproducible input:
df = pd.DataFrame({'(A,B)': [0, 1, 2],
                   '(B,C)': [3, 1, 2],
                   '(D,B)': [0, 3, 4],
                   '(E,F)': [1, 0, 2]})
Printing/saving without the index:
print(out.to_string(index=False))
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
# as file
out.to_csv('yourfile.csv', index=False)
Suppose I have a 2*3 dataframe:
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
A B C
0 1 3 5
1 2 4 6
I'm wondering how I can convert df to a (2*3)*1 dataframe of the following form. I've tried pd.DataFrame.explode() and pd.wide_to_long(), but they don't appear to be the functions I'm looking for.
value
A 0 1
A 1 2
B 0 3
B 1 4
C 0 5
C 1 6
You just need stack, plus swaplevel and sort_index to get the requested order:
df.stack().swaplevel().sort_index()
output:
A 0 1
1 2
B 0 3
1 4
C 0 5
1 6
Or use melt after resetting the index:
df.reset_index().melt(id_vars='index')
output:
index variable value
0 0 A 1
1 1 A 2
2 0 B 3
3 1 B 4
4 0 C 5
5 1 C 6
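If you want different column names straight away, melt accepts var_name and value_name (a small sketch):

# same reshape, but with custom names instead of the default variable/value
df.reset_index().melt(id_vars='index', var_name='col', value_name='val')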
Alternative outputs
As dataframe:
(df.stack()
.rename('value')
.swaplevel()
.sort_index()
.to_frame()
)
value
A 0 1
1 2
B 0 3
1 4
C 0 5
1 6
All as columns:
(df.stack()
.rename('value')
.swaplevel()
.rename_axis(['col1', 'col2'])
.sort_index()
.reset_index()
)
col1 col2 value
0 A 0 1
1 A 1 2
2 B 0 3
3 B 1 4
4 C 0 5
5 C 1 6
I have 2 DataFrames, and I want to check if df1["A"] is in df2; if not, fill df2["A"] with 0.
I got it working with an ugly for loop, and I'm trying to optimize it, but I cannot figure out how.
testing_list = list(testing_df.columns)
for i in range(len(training_df.columns)):
    if not training_df.columns[i] in testing_list:
        testing_df[training_df.columns[i]] = 0
Use DataFrame.reindex with new columns created by Index.union:
testing_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'F': list('aaabbb')
})

training_df = pd.DataFrame({
    'A': list('abcdef'),
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
})
cols = testing_df.columns.union(training_df.columns, sort=False)
df = testing_df.reindex(cols, axis=1, fill_value=0)
print (df)
A B F C D
0 a 4 a 0 0
1 b 5 a 0 0
2 c 4 a 0 0
3 d 5 b 0 0
4 e 5 b 0 0
5 f 4 b 0 0
If you want to add the columns to both DataFrames, with sorted columns, use DataFrame.align:
testing_df, training_df = testing_df.align(training_df, fill_value=0)
print (testing_df)
A B C D F
0 a 4 0 0 a
1 b 5 0 0 a
2 c 4 0 0 a
3 d 5 0 0 b
4 e 5 0 0 b
5 f 4 0 0 b
print (training_df)
A B C D F
0 a 0 7 1 0
1 b 0 8 3 0
2 c 0 9 5 0
3 d 0 4 7 0
4 e 0 2 1 0
5 f 0 3 0 0
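As a closer match to the original loop that only adds the missing columns to testing_df, a sketch using Index.difference:

# add any training-only columns to testing_df, filled with 0
for col in training_df.columns.difference(testing_df.columns):
    testing_df[col] = 0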
When using the drop method of a pandas.DataFrame, it accepts lists of column names but not tuples, despite the documentation saying that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions - Using Python 2.7.12, Pandas 0.20.3.
The problem is that a tuple selects from a MultiIndex:
import numpy as np

np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print (df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print (df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
The same tuple used for selection returns that column of the original df as a Series:

print (df[('a', 'c')])
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32
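If you instead want to drop everything under one first-level label, drop also accepts a level argument (a sketch on the original MultiIndex df):

# drop every column whose first-level label is 'a'
df = df.drop('a', axis=1, level=0)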
Pandas treats tuples as multi-index values, so try this instead:
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Here is an example of deleting rows (axis=0, the default) in a multi-index DF:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you specify a tuple, Pandas treats it as a multi-index label.
I used this to delete a column named by a tuple:
del df3[('val1', 'val2')]
and it got deleted.
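The non-mutating drop equivalent wraps the tuple in a list so it is read as a single label (a sketch):

df3 = df3.drop(columns=[('val1', 'val2')])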
I am seeing weird behavior from pandas; maybe it's just me, but I am expecting a different result from what I am getting.
Assume I have a multi-index dataframe such as:
import pandas as pd
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df_first = pd.concat({'ticker1': df, 'ticker2': df, 'ticker3': df}, axis=1)
df_first.columns = df_first.columns.rename(('ticker', 'variables'))
df_first
Out[91]:
ticker ticker1 ticker2 ticker3
variables A B A B A B
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
d 3 3 3 3 3 3
e 4 4 4 4 4 4
and a second dataframe with the same level names but reversed, such as:
df2 = pd.DataFrame(index=list('abcde'), data={'ticker1': range(5), 'ticker2': range(5)})
df_sec = pd.concat({'C': df2, 'D': df2, 'E': df2}, axis=1)
df_sec.columns = df_sec.columns.rename(('variables', 'ticker'))
df_sec
Out[93]:
variables C D E
ticker ticker1 ticker2 ticker1 ticker2 ticker1 ticker2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
d 3 3 3 3 3 3
e 4 4 4 4 4 4
As you can see, the levels have the same names but are reversed. When I concat those 2 dataframes on axis=1, it mixes up my columns:
pd.concat([df_first, df_sec], axis=1)
Out[94]:
ticker ticker1 ticker2 ticker3 C D E
variables A B A B A B ticker1 ticker2 ticker1 ticker2 ticker1 ticker2
a 0 0 0 0 0 0 0 0 0 0 0 0
b 1 1 1 1 1 1 1 1 1 1 1 1
c 2 2 2 2 2 2 2 2 2 2 2 2
d 3 3 3 3 3 3 3 3 3 3 3 3
e 4 4 4 4 4 4 4 4 4 4 4 4
I know I can swap levels first and get the expected result, such as:
pd.concat([df_first, df_sec.swaplevel(0, 1, 1)], axis=1)
Out[95]:
ticker ticker1 ticker2 ticker3 ticker1 ticker2 ticker1 ticker2 ticker1 ticker2
variables A B A B A B C C D D E E
a 0 0 0 0 0 0 0 0 0 0 0 0
b 1 1 1 1 1 1 1 1 1 1 1 1
c 2 2 2 2 2 2 2 2 2 2 2 2
d 3 3 3 3 3 3 3 3 3 3 3 3
e 4 4 4 4 4 4 4 4 4 4 4 4
But is there a way to concat based on the level names directly?
Thanks!
I can't think of anything that doesn't manipulate the columns index in some way, but this gets close to what you asked for: namely, it operates on the level name.
ln = 'variables'
pd.concat([df_first.stack(ln), df_sec.stack(ln)]).unstack(ln)
OR
ln = 'ticker'
pd.concat([df_first.stack(ln), df_sec.stack(ln)], axis=1).unstack(ln)
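Another option that operates on level names is to reorder the second frame's column levels to match the first before concatenating (a sketch, assuming both frames share the same level names):

# reorder df_sec's column levels, by name, to match df_first, then concat
out = pd.concat(
    [df_first, df_sec.reorder_levels(df_first.columns.names, axis=1)],
    axis=1
)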
I have a Pandas dataframe like this (each row in B is a string of values joined with the | symbol):
A B
a 1|2|3
b 2|4|5
c 3|2|5
I want to create columns that indicate whether each value is present in that row of column B or not:
A B 1 2 3 4 5
a 1|2|3 1 1 1 0 0
b 2|4|5 0 1 0 1 1
c 3|2|5 0 1 1 0 1
I have tried this by looping over the columns, but can it be done using a lambda or a comprehension?
You can try get_dummies:
print(df)
A B
0 a 1|2|3
1 b 2|4|5
2 c 3|2|5
print(df.B.str.get_dummies(sep='|'))
1 2 3 4 5
0 1 1 1 0 0
1 0 1 0 1 1
2 0 1 1 0 1
And if you need old column B use join:
print(df.join(df.B.str.get_dummies(sep='|')))
A B 1 2 3 4 5
0 a 1|2|3 1 1 1 0 0
1 b 2|4|5 0 1 0 1 1
2 c 3|2|5 0 1 1 0 1
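One caveat: str.get_dummies returns string column labels ('1' through '5'); if you need integer labels, a sketch:

dummies = df.B.str.get_dummies(sep='|')
dummies.columns = dummies.columns.astype(int)  # '1'..'5' -> 1..5
print(df.join(dummies))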
Hope this helps.
In [19]: df
Out[19]:
A B
0 a 1|2|3
1 b 2|4|5
2 c 3|2|5
In [20]: op = df.merge(df.B.apply(lambda s: pd.Series(dict((col, 1) for col in s.split('|')))),
                       left_index=True, right_index=True).fillna(0)
In [21]: op
Out[21]:
A B 1 2 3 4 5
0 a 1|2|3 1 1 1 0 0
1 b 2|4|5 0 1 0 1 1
2 c 3|2|5 0 1 1 0 1
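Since fillna(0) leaves the merged indicator columns as floats, you may want to cast them back to int afterwards (a sketch):

num_cols = op.columns.difference(['A', 'B'])  # the indicator columns
op[num_cols] = op[num_cols].astype(int)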