I have 2 dataframes like this,
df1
0 1 2 3 4 5 category
0 1 2 3 4 5 6 foo
1 4 5 6 5 6 7 bar
2 7 8 9 5 6 7 foo1
and
df2
0 1 2 category
0 1 2 3 bar
1 4 5 6 foo
Shape of df1 is (3,7) and shape of df2 is (2,4).
I want to reshape df2 to (2,7) (as per first dataframe df1 columns) keeping the last column same.
df2
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
If you want the dataframe with fewer columns to be padded with zero-filled columns so that it matches the dataframe with more columns, you can use DataFrame.align with axis=1. It aligns the columns of the two dataframes and leaves the rows unchanged:
df1, df2 = df1.align(df2, axis=1, fill_value=0)
print(df2)
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
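If only df2 needs to change, reindexing it against df1's columns is another option. The sketch below rebuilds the two frames from the question (the integer column labels 0 to 5 are an assumption about how the frames were created) and uses DataFrame.reindex with fill_value=0:

```python
import pandas as pd

# Frames rebuilt from the question; integer column labels are assumed.
df1 = pd.DataFrame([[1, 2, 3, 4, 5, 6, 'foo'],
                    [4, 5, 6, 5, 6, 7, 'bar'],
                    [7, 8, 9, 5, 6, 7, 'foo1']],
                   columns=[0, 1, 2, 3, 4, 5, 'category'])
df2 = pd.DataFrame([[1, 2, 3, 'bar'],
                    [4, 5, 6, 'foo']],
                   columns=[0, 1, 2, 'category'])

# Columns missing from df2 are created and filled with 0;
# the column order follows df1.
df2_padded = df2.reindex(columns=df1.columns, fill_value=0)
print(df2_padded)
```

Unlike align, this leaves df1 untouched and returns a new frame for df2.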
You can use .shape[0] to get the number of rows of a dataframe and .shape[1] to get the number of columns.
Use the column counts with insert to add the missing columns, filled with 0, just before the last column:
s1, s2 = df1.shape[1], df2.shape[1]
for s in range(s2, s1):
    # insert a zero column at position s-1 with label s-1
    df2.insert(s - 1, s - 1, 0)
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
Another method using iloc:
s1, s2 = (df1.shape[1] - 1), (df2.shape[1] - 1)
df3 = pd.concat([df2.iloc[:, :-1],
                 df1.iloc[:df2.shape[0], s2:s1],
                 df2.iloc[:, -1]], axis=1)
df3.iloc[:, s2:s1] = 0
0 1 2 3 4 5 category
0 1 2 3 0 0 0 bar
1 4 5 6 0 0 0 foo
I have a dataframe
pd.DataFrame([1,2,3,4,1,2,3])
0
0 1
1 2
2 3
3 4
4 1
5 2
6 3
I want to create another column that records the most recent index at which the value 1 occurred:
d={'data':[1,2,3,4,1,2,3], 'desired_new_col': [0,0,0,0,4,4,4]}
pd.DataFrame(d)
data desired_new_col
0 1 0
1 2 0
2 3 0
3 4 0
4 1 4
5 2 4
6 3 4
I have some idea of using df.expanding().apply(func), but I am not sure what an appropriate function would be.
Thanks
Using a mask on the index and ffill:
df = pd.DataFrame({'data': [1,2,3,4,1,2,3]})
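df['new'] = (df.index.to_series()
               .where(df['data'].eq(1))   # keep the index only where data == 1
               .ffill()                   # carry the last such index forward
               .astype(int)               # replaces ffill(downcast='infer'), deprecated in recent pandas;
                                          # assumes the first row is a 1, so no NaN remains
            )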
Output:
data new
0 1 0
1 2 0
2 3 0
3 4 0
4 1 4
5 2 4
6 3 4
You can build a group key with cumsum on the mask, then use groupby with transform('idxmax') to get the index of the 1 that starts each group:
s = df['data'].eq(1)
df['out'] = s.groupby(s.cumsum()).transform('idxmax')
Out[293]:
0 0
1 0
2 0
3 0
4 4
5 4
6 4
Name: data, dtype: int64
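You can also do this with a list comprehension. Note that it only looks at the last occurrence of 1 and assigns 0 to every earlier row, which is enough for this example: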
idx = [i for i in df.index if df[0][i] == 1][-1]
df['desired_new_col'] = [idx if idx <= df.index[i] else 0 for i in df.index]
Output:
df
0 desired_new_col
0 1 0
1 2 0
2 3 0
3 4 0
4 1 4
5 2 4
6 3 4
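For data where 1 occurs more than twice, or not in the first row, a running maximum over the positions of the 1s is another option. A minimal sketch with NumPy, assuming positional row numbers are wanted and that rows before the first 1 should get -1 (both are assumptions, not part of the question); the sample data is extended for illustration:

```python
import numpy as np
import pandas as pd

# Example data extended with a third occurrence of 1 (illustrative only).
df = pd.DataFrame({'data': [1, 2, 3, 4, 1, 2, 3, 1, 5]})

# Row position where data == 1, otherwise -1 as a "not seen yet" marker.
pos = np.where(df['data'].eq(1), np.arange(len(df)), -1)

# The running maximum carries the latest position of a 1 forward.
df['last_one'] = np.maximum.accumulate(pos)
print(df)
```

With a default RangeIndex the positions coincide with the index labels; for other indexes the positions would have to be mapped back to labels.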
I have a df1:
a b c
1 0 1 4
2 0 2 5
3 1 1 3
and a second df2:
a b c
1 0 1 5
2 0 2 5
3 1 1 4
These dataframes have the same groups in a and b. Within each group of 'a' and 'b', I want the rows of df2 underneath those of df1:
a b c
1 0 1 4
2 0 1 5
3 0 2 5
4 0 2 5
5 1 1 3
6 1 1 4
How can I combine groupby() and concat() to get the desired output?
You can use concat and then sort_values:
df = pd.concat([df1, df2]).sort_values(['a','b']).reset_index(drop=True)
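If you would rather not rely on how the sort orders tied rows, a helper column can make the df1-before-df2 order explicit. A minimal sketch, with the frames rebuilt from the question and a hypothetical helper column named src:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [4, 5, 3]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [5, 5, 4]})

# Tag each frame with its origin, sort on the group keys plus the tag,
# then drop the tag again. 'src' is a hypothetical helper name.
out = (pd.concat([df1.assign(src=1), df2.assign(src=2)])
         .sort_values(['a', 'b', 'src'])
         .drop(columns='src')
         .reset_index(drop=True))
print(out)
```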
I have a dataframe with the following form:
data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3],'Time':[0,1,2,0,1,2,3,0,1],
'sig':[2,3,1,4,2,0,2,3,5],'sig2':[9,2,8,0,4,5,1,1,0],
'group':['A','A','A','B','B','B','B','A','A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad the data so that each 'ID' has the same number of Time values, sig and sig2 are padded with zeros (or with the mean value within the ID), and group carries the same letter for the whole ID. The output after padding would be:
data_pad = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3],'Time':[0,1,2,3,0,1,2,3,0,1,2,3],
                         'sig':[2,3,1,0,4,2,0,2,3,5,0,0],'sig2':[9,2,8,0,0,4,5,1,1,0,0,0],
                         'group':['A','A','A','A','B','B','B','B','A','A','A','A']})
print(data_pad)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to ultimately reshape this into something with shape (number of ID, number of time points, number of sequences {2 here}).
It seems that if I pivot data, it fills in with nan values, which is fine for the signal values, but not the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.
Here's one approach: build the full index with pd.MultiIndex.from_product and use it to reindex the frame, so that every ID gets every Time value:
df = data.set_index(['ID', 'Time'])
# define the new index
ix = pd.MultiIndex.from_product([df.index.levels[0],
                                 df.index.levels[1]],
                                names=['ID', 'Time'])
# reindex using the above multiindex
df = df.reindex(ix, fill_value=0)
# group was also filled with 0: mask those entries and forward fill from the rows above
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
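For the end goal stated in the question, the padded signal columns can then be turned into a 3-D NumPy array. A minimal sketch, assuming the padded rows are ordered by ID and then Time (which the from_product index provides); padded and arr are illustrative names:

```python
import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 2, 3, 3],
                     'Time': [0, 1, 2, 0, 1, 2, 3, 0, 1],
                     'sig': [2, 3, 1, 4, 2, 0, 2, 3, 5],
                     'sig2': [9, 2, 8, 0, 4, 5, 1, 1, 0],
                     'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A']})

df = data.set_index(['ID', 'Time'])
ix = pd.MultiIndex.from_product(df.index.levels, names=['ID', 'Time'])

# Pad only the numeric signal columns with zeros.
padded = df[['sig', 'sig2']].reindex(ix, fill_value=0)

# Rows are ordered ID-major, Time-minor, so a plain reshape gives
# (number of IDs, number of time points, number of signals).
n_ids, n_times = len(ix.levels[0]), len(ix.levels[1])
arr = padded.to_numpy().reshape(n_ids, n_times, 2)
print(arr.shape)
```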
IIUC:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0
I have the following pandas dataframe :
a
0 0
1 0
2 1
3 2
4 2
5 2
6 3
7 2
8 2
9 1
I want to store the values in another dataframe such that every run of consecutive identical values becomes one labeled group, like this:
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
Column A represents the value of the group and B represents the number of occurrences.
This is what I've done so far:
df = pd.DataFrame({'a':[0,0,1,2,2,2,3,2,2,1]})
df2 = pd.DataFrame()
for i,g in df.groupby([(df.a != df.a.shift()).cumsum()]):
    vc = g.a.value_counts()
    # note: DataFrame.append was removed in pandas 2.0
    df2 = df2.append({'A':vc.index[0], 'B': vc.iloc[0]}, ignore_index=True).astype(int)
It works but it's a bit messy.
Can you think of a shorter or better way of doing this?
Use GroupBy.agg with named aggregation in pandas >= 0.25.0:
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .agg(A=('a','first'), B=('a','count')))
print(new_df)
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
For pandas < 0.25.0:
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .a
            .agg({'A':'first','B':'count'}))
I would try:
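df['blocks'] = df['a'].ne(df['a'].shift()).cumsum()
# size() counts the rows of each (a, blocks) run; count() would return no
# count column here because both columns are grouping keys
(df.groupby(['a','blocks'], sort=False)
   .size()
   .reset_index(name='B')
   .drop('blocks', axis=1)
)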
Output:
a B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
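Outside pandas, the same run-length encoding can be sketched with itertools.groupby from the standard library, which groups consecutive equal items. This is an alternative to the answers above, not a replacement for them; values and runs are illustrative names:

```python
from itertools import groupby

import pandas as pd

values = [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]

# groupby yields (value, iterator over the run); count the items in each run.
runs = [(value, sum(1 for _ in run)) for value, run in groupby(values)]

new_df = pd.DataFrame(runs, columns=['A', 'B'])
print(new_df)
```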
I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
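To see why this works, it can help to split the one-liner into steps for a single column. A small sketch on column a of the example (the intermediate variable names are only for illustration):

```python
import pandas as pd

col = pd.Series([0, 1, 0, 1, 1, 0], name='a')

nonzero = col != 0
running = nonzero.cumsum()              # 0 1 1 2 3 3: plain running count
at_zero = running.where(~nonzero)       # running count kept only at the zeros
# Carry the count seen at the last zero forward; 0 before the first zero.
baseline = at_zero.ffill().fillna(0).astype(int)   # 0 0 1 1 1 3
result = running - baseline             # 0 1 0 1 2 0: count since the last zero
print(result.tolist())
```

Subtracting the count reached at the most recent zero is what resets the counter.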
Try this
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
df['groupId1']=df.a.eq(0).cumsum()
df['groupId2']=df.b.eq(0).cumsum()
New=pd.DataFrame()
New['a']=df.groupby('groupId1').a.transform('cumsum')
New['b']=df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
You may also try the following naive but reliable approach.
For every column, create the groups to sum within. A new group starts whenever the value changes from one row to the next and lasts while the value stays constant: (x != x.shift()).cumsum().
The group labels for the example data:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate the cumulative sums within these groups for each column using DataFrame.apply and groupby, and you get a cumsum with the zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum. This only gives the right result when the column contains at most one zero, as column b does here:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
z = np.where(df['b']==0)
df.loc[z[0], 'b'] = -z[0]  # .loc avoids chained assignment
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2
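A NumPy-only variant that handles any number of zeros per column can be sketched with a running maximum. It assumes the values are 0 or 1, so the plain cumulative sum never decreases; cumsum_reset is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def cumsum_reset(col):
    """Cumulative sum of a 0/1 array that restarts after every zero."""
    col = np.asarray(col)
    total = np.cumsum(col)
    # Running total at each zero, carried forward to the following rows.
    # Valid because the total is non-decreasing for 0/1 input.
    at_last_zero = np.maximum.accumulate(np.where(col == 0, total, 0))
    return total - at_last_zero

df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]],
                  columns=['a', 'b'])
out = df.apply(cumsum_reset)
print(out)
```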