How to merge one numpy array onto multiple dataframes - python

I have a bunch of dataframes. They all have the same columns but different numbers of rows. They look like this:
df_1
   0
0  1
1  0
2  0
3  1
4  1
5  0
df_2
   0
0  1
1  0
2  0
3  1
df_3
   0
0  1
1  0
2  0
3  1
4  1
I have them all stored in a list.
Then, I have a numpy array in which each item maps to one row of the dataframes, taken in order. The numpy array looks like this:
[3 1 1 2 4 0 6 7 2 1 3 2 5 5 5]
If I were to pd.concat my list of dataframes, then I could merge the np array onto the concatenated df. However, I want to preserve the individual df structure, so it should look like this:
   0  1
0  1  3
1  0  1
2  0  1
3  1  2
4  1  4
5  0  0

   0  1
0  1  6
1  0  7
2  0  2
3  1  1

   0  1
0  1  3
1  0  2
2  0  5
3  1  5
4  1  5
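For reference, a minimal sketch of the pd.concat route mentioned above, assuming the frames are collected in a list dfs and the array is arr (names taken from the answer below): concatenate, attach the array as a column, then slice the combined frame back apart by the original lengths.
import numpy as np
import pandas as pd

# Hedged sketch, not a definitive implementation.
combined = pd.concat(dfs, ignore_index=True)
combined[1] = arr  # attach the array as a second column

# Recover the per-frame boundaries, e.g. [0, 6, 10, 15], then slice back apart.
bounds = np.cumsum([0] + [len(d) for d in dfs])
parts = [combined.iloc[b:e] for b, e in zip(bounds[:-1], bounds[1:])]
Note that each part keeps the concatenated RangeIndex; reset it if the original per-frame indexes matter.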

Considering the given dataframes and array as:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([1, 0, 0, 1, 1, 0])
df2 = pd.DataFrame([1, 0, 0, 1])
df3 = pd.DataFrame([1, 0, 0, 1, 1])
arr = np.array([3, 1, 1, 2, 4, 0, 6, 7, 2, 1, 3, 2, 5, 5, 5])
You can use numpy.split to split the array into multiple sub-arrays whose lengths match the given dataframes, then append each sub-array as a column to its respective dataframe.
Use:
dfs = [df1, df2, df3]

def get_indices(dfs):
    """
    Returns the split indices inside the array.
    """
    indices = [0]
    for df in dfs:
        indices.append(len(df) + indices[-1])
    return indices[1:-1]

# Split the given arr into multiple sections.
sections = np.split(arr, get_indices(dfs))

for df, s in zip(dfs, sections):
    df[1] = s  # append the section of the array to the dataframe
    print(df)
This results in:
# df1
   0  1
0  1  3
1  0  1
2  0  1
3  1  2
4  1  4
5  0  0

# df2
   0  1
0  1  6
1  0  7
2  0  2
3  1  1

# df3
   0  1
0  1  3
1  0  2
2  0  5
3  1  5
4  1  5
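As a side note (my addition, not part of the original answer), the split points can also be computed in one line with np.cumsum:
# Cumulative lengths are [6, 10, 15]; np.split only needs the interior
# boundaries, so the final cumulative total is dropped.
sections = np.split(arr, np.cumsum([len(df) for df in dfs])[:-1])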

Related

Padding columns of dataframe

I have two dataframes like this:
df1
   0  1  2  3  4  5 category
0  1  2  3  4  5  6      foo
1  4  5  6  5  6  7      bar
2  7  8  9  5  6  7     foo1
and
df2
   0  1  2 category
0  1  2  3      bar
1  4  5  6      foo
The shape of df1 is (3, 7) and the shape of df2 is (2, 4).
I want to reshape df2 to (2, 7) (matching the columns of the first dataframe, df1) while keeping the last column the same.
df2
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo
If you want the dataframe with fewer columns to be padded with zero columns according to the dataframe with more columns, you can try DataFrame.align on axis=1 to align the columns of the two dataframes while keeping the rows unchanged:
df1, df2 = df1.align(df2, axis=1, fill_value=0)
print(df2)
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo
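If you only need to pad df2 and want to leave df1 untouched, a minimal alternative sketch uses DataFrame.reindex against df1's columns (my suggestion, not part of the answer above):
# Missing columns (3, 4, 5) are created and filled with 0.
df2 = df2.reindex(columns=df1.columns, fill_value=0)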
You can use .shape[0] to get the number of rows and .shape[1] to get the number of columns of each dataframe.
Use these with insert to add the missing columns, filled with 0:
s1, s2 = df1.shape[1], df2.shape[1]
for s in range(s2, s1):
    df2.insert(s - 1, s - 1, 0)  # insert a zero column named s-1 just before 'category'
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo
Another method using iloc:
s1, s2 = (df1.shape[1] - 1), (df2.shape[1] - 1)
df3 = pd.concat([df2.iloc[:, :-1],                # df2's data columns
                 df1.iloc[:df2.shape[0], s2:s1],  # borrow the extra column labels from df1
                 df2.iloc[:, -1]],                # df2's trailing 'category' column
                axis=1)
df3.iloc[:, s2:s1] = 0                            # zero out the borrowed columns
   0  1  2  3  4  5 category
0  1  2  3  0  0  0      bar
1  4  5  6  0  0  0      foo

Difference of one multi index level

For a MultiIndex with a repeating level, how can I calculate the differences with another level of the index, effectively ignoring it?
Let me explain in code.
>>> ix = pd.MultiIndex.from_product([(0, 1, 2), (0, 1, 2, 3)])
>>> df = pd.DataFrame([5]*4 + [4]*4 + [3, 2, 1, 0], index=ix)
>>> df
     0
0 0  5
  1  5
  2  5
  3  5
1 0  4
  1  4
  2  4
  3  4
2 0  3
  1  2
  2  1
  3  0
Now by some operation I'd like to subtract the last set of values (2, 0:4) from the whole data frame. I.e. df - df.loc[2] to produce this:
     0
0 0  2
  1  3
  2  4
  3  5
1 0  1
  1  2
  2  3
  3  4
2 0  0
  1  0
  2  0
  3  0
But that statement produces an error. df - df.loc[2:3] does not, but apart from the trailing zeros it produces only NaNs - naturally, because the indices don't match.
How could this be achieved?
I realised that the index level is precisely the problem. So I got a bit closer.
>>> df.droplevel(0) - df.loc[2]
   0
0  2
0  1
0  0
1  3
1  2
1  0
2  4
2  3
2  0
3  5
3  4
3  0
Still not quite what I want. But I don't know if there's a convenient way of achieving what I'm after.
This can be done with unstack and stack:
new_df = df.unstack()              # the inner index level becomes the columns
new_df.sub(new_df.loc[2]).stack()  # subtract row 2 from every row, then restore the MultiIndex
Output:
     0
0 0  2
  1  3
  2  4
  3  5
1 0  1
  1  2
  2  3
  3  4
2 0  0
  1  0
  2  0
  3  0
Try creating a dataframe with an identical index, mapping the last set of data through the inner index level so it is repeated across the whole frame, and then subtract:
df - pd.DataFrame(index=df.index,
                  data=df.index.get_level_values(1).map(df.loc[2].squeeze()))
     0
0 0  2
  1  3
  2  4
  3  5
1 0  1
  1  2
  2  3
  3  4
2 0  0
  1  0
  2  0
  3  0
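As one more option (my own sketch, not from the answers above), DataFrame.sub accepts a level argument that broadcasts a Series across one level of a MultiIndex:
# df.loc[2].squeeze() is the Series [3, 2, 1, 0] indexed by the inner level;
# level=1 aligns it with the inner index level before subtracting.
df.sub(df.loc[2].squeeze(), axis=0, level=1)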

Count by group and assign to the new variables

I was wondering if there's an easier way to create the variables "freq_t1" and "freq_t2", grouped by id, from the following data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2],
    'time': [1, 1, 2, 3, 2, 2]
})
to
df = pd.DataFrame({
    'id':      [1, 1, 1, 2, 2, 2],
    'time':    [1, 1, 2, 3, 2, 2],
    'freq_t1': [2, 2, 2, 0, 0, 0],
    'freq_t2': [1, 1, 1, 2, 2, 2]
})
That is, id == 1 has two observations of time == 1, while id == 2 has zero. Similarly, id == 1 has one observation of time == 2, while id == 2 has two.
Use broadcasted comparison on the "time" column with your selected time values, then groupby and transform to broadcast the sum to the original columns. Here's an example:
tvals = [1, 2]
(pd.DataFrame(df['time'].values[:, None] == tvals, columns=tvals)
   .groupby(df['id'])
   .transform('sum')
   .astype(int)
   .add_prefix('freq_t'))
   freq_t1  freq_t2
0        2        1
1        2        1
2        2        1
3        0        2
4        0        2
5        0        2
When tvals = [1, 2, 3], this produces:
   freq_t1  freq_t2  freq_t3
0        2        1        0
1        2        1        0
2        2        1        0
3        0        2        1
4        0        2        1
5        0        2        1
If you want columns for all t-values, you can also use get_dummies:
pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
   freq_t1  freq_t2  freq_t3
0        2        1        0
1        2        1        0
2        2        1        0
3        0        2        1
4        0        2        1
5        0        2        1
Finally, to concatenate the result to df, use pd.concat:
res = pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
pd.concat([df, res], axis=1)
   id  time  freq_t1  freq_t2  freq_t3
0   1     1        2        1        0
1   1     1        2        1        0
2   1     2        2        1        0
3   2     3        0        2        1
4   2     2        0        2        1
5   2     2        0        2        1
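For completeness, a minimal sketch of the same counts via pd.crosstab (my own suggestion, not taken from the answers above): build an id-by-time count table, then broadcast it back onto the original rows with a merge.
# Rows: id, columns: time value, cells: occurrence counts.
counts = pd.crosstab(df['id'], df['time']).add_prefix('freq_t')
out = df.merge(counts, left_on='id', right_index=True)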

Get values from a smaller DataFrame with a specified step

Supposing I have the two DataFrames shown below:
dd = pd.DataFrame([1, 0, 3, 0, 5])
   0
0  1
1  0
2  3
3  0
4  5
and
df = pd.DataFrame([2, 4])
   0
0  2
1  4
How can I broadcast the values of df into dd with step=2, so that I end up with:
   0
0  1
1  2
2  3
3  4
4  5
Another solution:
dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])

dd.iloc[1::2] = df.values
dd
# Out:
   0
0  1
1  2
2  3
3  4
4  5
You can also assign through the underlying numpy array:
dd.values[1::2] = df.values
dd now contains:
   0
0  1
1  2
2  3
3  4
4  5
Note that the step=2 condition is expressed directly in the slice: the array[1::2] syntax means start at the element with index 1, go until the end, and take every second element.
Change df.index to the matching positions with range and use it to fill the second DataFrame:
df.index = range(1, len(dd)+1, 2)[:len(df)]
print(df)
   0
1  2
3  4
dd.loc[df.index] = df
print(dd)
   0
0  1
1  2
2  3
3  4
4  5
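To generalize beyond step=2, a minimal sketch with an arbitrary start and step (both names are illustrative, not from the question):
# Compute the target positions, truncated to the number of available values,
# then assign positionally with .iloc.
start, step = 1, 2
pos = list(range(start, len(dd), step))[:len(df)]
dd.iloc[pos] = df.values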

Pandas DataFrame: Spread CSV columns to multiple columns

I have a pandas DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 2, 3], ['a,b', 5, 6], ['c', 8, 9]])
     0  1  2
0    a  2  3
1  a,b  5  6
2    c  8  9
I want to spread the first column to n columns (where n is the number of unique, comma-separated values, in this case 3). Each of the resulting columns shall be 1 if the value is present, and 0 otherwise. The expected result is:
   1  2  a  c  b
0  2  3  1  0  0
1  5  6  1  0  1
2  8  9  0  1  0
I came up with the following code, but it seems a bit circuitous to me.
>>> import re
>>> dfSpread = pd.get_dummies(df[0].str.split(',', expand=True)).\
...     rename(columns=lambda x: re.sub('.*_', '', x))
>>> pd.concat([df.iloc[:, 1:], dfSpread], axis=1)
Is there a built-in function that does just that that I wasn't able to find?
Using str.get_dummies:
df.set_index([1, 2])[0].str.get_dummies(',').reset_index()

Out[229]:
   1  2  a  b  c
0  2  3  1  0  0
1  5  6  1  1  0
2  8  9  0  0  1
You can use pop + concat here for an alternative version of Wen's answer.
pd.concat([df, df.pop(df.columns[0]).str.get_dummies(sep=',')], axis=1)

   1  2  a  b  c
0  2  3  1  0  0
1  5  6  1  1  0
2  8  9  0  0  1
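A closely related variant (my own sketch, assuming the original df before the pop above) keeps df intact by using drop plus join instead:
# Build the indicator columns from column 0, then join them onto the rest.
df.drop(columns=[0]).join(df[0].str.get_dummies(sep=','))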
