Supposing I have the two DataFrames shown below:
import pandas as pd

dd = pd.DataFrame([1, 0, 3, 0, 5])
0
0 1
1 0
2 3
3 0
4 5
and
df = pd.DataFrame([2, 4])
0
0 2
1 4
How can I broadcast the values of df into dd with step = 2 so I end up with
0
0 1
1 2
2 3
3 4
4 5
One solution:
dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])
dd.iloc[1::2] = df.values
dd
# Out:
0
0 1
1 2
2 3
3 4
4 5
Alternatively, you can assign through the underlying NumPy array:
dd.values[1::2] = df.values
dd now contains:
0
0 1
1 2
2 3
3 4
4 5
Note that the step=2 requirement is handled by the slice: array[1::2] means start at the element with index 1 and take every second element until the end.
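For example, the same slice on a throwaway NumPy array (arbitrary values, purely for illustration):
import numpy as np

a = np.array([10, 20, 30, 40, 50])
print(a[1::2])  # [20 40] -- the elements at indices 1 and 3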
Another option is to reassign df.index with a range so it lines up with the target positions in dd, then fill via loc:
df.index = range(1, len(dd)+1, 2)[:len(df)]
print(df)
0
1 2
3 4
dd.loc[df.index] = df
print(dd)
0
0 1
1 2
2 3
3 4
4 5
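If you need this for other start offsets or step sizes, a small helper along these lines may be convenient; fill_with_step is a hypothetical name, not a pandas function:
import pandas as pd

def fill_with_step(target, source, start=1, step=2):
    # write source's rows into target at positions start, start+step, ...;
    # the number of those positions must equal len(source)
    out = target.copy()
    out.iloc[start::step] = source.values  # .values sidesteps index alignment
    return out

dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])
print(fill_with_step(dd, df))  # same result as above, without mutating dd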
I have a bunch of data frames. They all have the same columns but different numbers of rows. They look like this:
df_1
0
0 1
1 0
2 0
3 1
4 1
5 0
df_2
0
0 1
1 0
2 0
3 1
df_3
0
0 1
1 0
2 0
3 1
4 1
I have them all stored in a list.
Then, I have a numpy array where each item maps to a row in each individual df. The numpy array looks like this:
[3 1 1 2 4 0 6 7 2 1 3 2 5 5 5]
If I were to pd.concat my list of dataframes, then I could merge the np array onto the concatenated df. However, I want to preserve the individual df structure, so it should look like this:
0 1
0 1 3
1 0 1
2 0 1
3 1 2
4 1 4
5 0 0
0 1
0 1 6
1 0 7
2 0 2
3 1 1
0 1
0 1 3
1 0 2
2 0 5
3 1 5
4 1 5
Considering the given dataframes and array as:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([1, 0, 0, 1, 1, 0])
df2 = pd.DataFrame([1, 0, 0, 1])
df3 = pd.DataFrame([1, 0, 0, 1, 1])
arr = np.array([3, 1, 1, 2, 4, 0, 6, 7, 2, 1, 3, 2, 5, 5, 5])
You can use numpy.split to split the array into sub-arrays whose lengths match the given dataframes. Then you can append each sub-array as a column to its respective dataframe.
Use:
dfs = [df1, df2, df3]

def get_indices(dfs):
    """
    Returns the split indices inside the array.
    """
    indices = [0]
    for df in dfs:
        indices.append(len(df) + indices[-1])
    return indices[1:-1]

# split the given arr into multiple sections.
sections = np.split(arr, get_indices(dfs))
for df, s in zip(dfs, sections):
    df[1] = s  # append the section of array to the dataframe
    print(df)
This results in:
# df1
0 1
0 1 3
1 0 1
2 0 1
3 1 2
4 1 4
5 0 0
#df2
0 1
0 1 6
1 0 7
2 0 2
3 1 1
# df3
0 1
0 1 3
1 0 2
2 0 5
3 1 5
4 1 5
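As a side note, the split points could also be computed with np.cumsum over the dataframe lengths, which should be equivalent to get_indices above:
# cumulative lengths are [6, 10, 15]; np.split only needs the interior boundaries
split_at = np.cumsum([len(df) for df in dfs])[:-1]
sections = np.split(arr, split_at)  # same three sections as before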
I was wondering if there's an easier way to create the variables "freq_t1" and "freq_t2", grouped by id, from the following data:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 1, 2, 3, 2, 2]
})
to
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 1, 2, 3, 2, 2],
    'freq_t1': [2, 2, 2, 0, 0, 0],
    'freq_t2': [1, 1, 1, 2, 2, 2]
})
That is, id == 1 has two observations of time == 1, while id == 2 has zero. Similarly, id == 1 has one observation of time == 2, while id == 2 has two.
Use a broadcasted comparison of the "time" column against your selected time values, then groupby and transform to broadcast the per-group sums back to the original rows. Here's an example:
tvals = [1, 2]
(pd.DataFrame(df['time'].values[:, None] == tvals, columns=tvals)
    .groupby(df['id'])
    .transform('sum')
    .astype(int)
    .add_prefix('freq_t'))
freq_t1 freq_t2
0 2 1
1 2 1
2 2 1
3 0 2
4 0 2
5 0 2
When tvals = [1, 2, 3], this produces
freq_t1 freq_t2 freq_t3
0 2 1 0
1 2 1 0
2 2 1 0
3 0 2 1
4 0 2 1
5 0 2 1
If you want columns for all t-values, you can also use get_dummies:
pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
freq_t1 freq_t2 freq_t3
0 2 1 0
1 2 1 0
2 2 1 0
3 0 2 1
4 0 2 1
5 0 2 1
Finally, to concatenate the result to df, use pd.concat:
res = pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
pd.concat([df, res], axis=1)
id time freq_t1 freq_t2 freq_t3
0 1 1 2 1 0
1 1 1 2 1 0
2 1 2 2 1 0
3 2 3 0 2 1
4 2 2 0 2 1
5 2 2 0 2 1
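Another possible route is to count the id/time pairs with pd.crosstab and merge the counts back on id; this sketch should reproduce the same table:
freqs = pd.crosstab(df['id'], df['time']).add_prefix('freq_t')
df.merge(freqs, left_on='id', right_index=True)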
I have a pandas DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 2, 3], ['a,b', 5, 6], ['c', 8, 9]])
0 1 2
0 a 2 3
1 a,b 5 6
2 c 8 9
I want to spread the first column into n columns (where n is the number of unique comma-separated values, in this case 3). Each resulting column should be 1 if the value is present and 0 otherwise. Expected result is:
1 2 a c b
0 2 3 1 0 0
1 5 6 1 0 1
2 8 9 0 1 0
I came up with the following code, but it seems a bit circuitous to me.
>>> import re
>>> dfSpread = pd.get_dummies(df[0].str.split(',', expand=True)).\
...     rename(columns=lambda x: re.sub('.*_', '', x))
>>> pd.concat([df.iloc[:,1:], dfSpread], axis = 1)
Is there a built-in function that does just that that I wasn't able to find?
Using str.get_dummies:
df.set_index([1,2])[0].str.get_dummies(',').reset_index()
Out[229]:
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
You can use pop + concat here for an alternative version of Wen's answer.
pd.concat([df, df.pop(df.columns[0]).str.get_dummies(sep=',')], axis=1)
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
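One caveat: df.pop removes the column from df in place, so df itself loses column 0 after the concat. If you need df untouched, a non-destructive variant could be:
col = df.columns[0]  # assumes df still has its original first column
pd.concat([df.drop(columns=col), df[col].str.get_dummies(sep=',')], axis=1)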
If I have an array [1, 2, 3, 4, 5] and a pandas DataFrame
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I fill in the DataFrame so that each row equals the previous row plus my array?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row (counting from zero), so you can build the increments with np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
A variant of the same idea, shown here on a smaller three-column frame:
df = pd.DataFrame([
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
])
s = pd.Series([1, 2, 3])

# add s to every row, put the first row back, then take the cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment, where l is the array from the question:
l = [1, 2, 3, 4, 5]
df[1:] = (df + l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
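The concat version keeps the original row labels (0, 0, 1, 2); if you'd rather have a clean 0..3 index, passing ignore_index=True should do it:
pd.concat((df[:1], (df + l).cumsum()[:-1]), ignore_index=True)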
Alternatively, add the array, take the cumulative sum, then shift it down one row and add it back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
print(df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
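To see why this works, here is an illustrative step-by-step view for column a (values [0, 1, 0, 1, 1, 0]); the intermediate names are made up for clarity:
a = df != 0
running = a.cumsum()                                # column a: 0 1 1 2 3 3
at_last_zero = running.where(~a).ffill().fillna(0)  # column a: 0 0 1 1 1 3 -- total seen at the last zero
print((running - at_last_zero).astype(int))         # column a: 0 1 0 1 2 0 -- count since the last zero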
Try this:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
df['groupId1'] = df.a.eq(0).cumsum()
df['groupId2'] = df.b.eq(0).cumsum()
New = pd.DataFrame()
New['a'] = df.groupby('groupId1').a.transform('cumsum')
New['b'] = df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
You may also try the following naive but reliable approach. For every column, create groups to count within: a new group starts whenever the value changes from one row to the next and lasts as long as the value stays constant, i.e. (x != x.shift()).cumsum().
For the sample df above, these per-column group IDs look like:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate cumulative sums within those groups per column using pd.DataFrame's apply and groupby methods, and you get the zero-resetting cumsum in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
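To inspect the intermediate group IDs from the example table above, the grouping expression can be run on its own:
print(df.apply(lambda x: (x != x.shift()).cumsum()))
# a b
# 0 1 1
# 1 2 1
# 2 3 1
# 3 4 2
# 4 4 3
# 5 5 3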
A slightly hacky way is to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum. Note this only cancels the running total when that total equals the zero's index, which holds here because column b is all ones before its single zero:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
z = np.where(df['b'] == 0)[0]
df.loc[z, 'b'] = -z  # .loc avoids chained-assignment pitfalls
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2
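If a column may contain several zeros, the planted value at each zero would have to be the negative of the run of ones since the previous zero, not the raw index. A sketch of that generalization, assuming 0/1 data as in the question:
import numpy as np

b = np.array([1, 1, 0, 1, 0])  # example 0/1 column with two zeros
z = np.where(b == 0)[0]
# run lengths of ones before each zero: diff of zero positions minus 1,
# with a -1 sentinel so the first run is counted from the start
b[z] = -(np.diff(np.concatenate(([-1], z))) - 1)
print(np.cumsum(b))  # [1 2 0 1 0]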