I have two sample DataFrames, dfa and dfb:
import pandas as pd
a = {
    'unit': ['A', 'B', 'C', 'D'],
    'count': [1, 12, 34, 52]
}
b = {
    'department': ['E', 'F'],
    'count': [6, 12]
}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
They look like:
dfa
count unit
1 A
12 B
34 C
52 D
dfb
count department
6 E
12 F
What I want is simply to stack dfa on top of dfb, not based on any column or any index. I have checked this page: https://pandas.pydata.org/pandas-docs/stable/merging.html but couldn't find the right method for my purpose.
My desired output is a dfc that looks like the dataset below; I want to keep both headers:
dfc:
count unit
1 A
12 B
34 C
52 D
count department
6 E
12 F
In [37]: pd.concat([dfa, pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)],
                   ignore_index=True)
Out[37]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
or
In [39]: dfa.append(pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)) \
             .reset_index(drop=True)
Out[39]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
UPDATE: concatenating 3 DataFrames:
pd.concat([dfa,
           pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns),
           pd.DataFrame(dfc.T.reset_index().T.values, columns=dfa.columns)],
          ignore_index=True)
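For readers unfamiliar with the dfb.T.reset_index().T trick: transposing, resetting the index, and transposing back pushes dfb's column names down into the data, so once .values strips the labels they survive as an ordinary row. A quick look at the intermediate result (a sketch; tmp is my own name, and the column order matches the question's display):

# assuming dfb from the question
tmp = dfb.T.reset_index().T
print(tmp.values)
# [['count' 'department']
#  [6 'E']
#  [12 'F']]

Note that the In [39] variant relies on DataFrame.append, which was removed in pandas 2.0, so it only works on older versions.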
Option 1
You can construct it from scratch using np.vstack
import numpy as np

pd.DataFrame(
    np.vstack([dfa.values, dfb.columns, dfb.values]),
    columns=dfa.columns
)
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
Option 2
You can export to csv and read it back
from io import StringIO
import pandas as pd
pd.read_csv(StringIO(
    '\n'.join([d.to_csv(index=None) for d in [dfa, dfb]])
))
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
dfa.loc[len(dfa), :] = dfb.columns  # append dfb's header as a data row at the bottom of dfa
dfb.columns = dfa.columns           # rename dfb's columns so they line up with dfa's
dfa.append(dfb)                     # stack dfb underneath (DataFrame.append was removed in pandas 2.0)
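On pandas 2.0 and later, where DataFrame.append no longer exists, the same idea can be written with pd.concat. A minimal sketch (header_row is my own name; dfa and dfb rebuilt as in the question):

import pandas as pd

dfa = pd.DataFrame({'unit': ['A', 'B', 'C', 'D'], 'count': [1, 12, 34, 52]})
dfb = pd.DataFrame({'department': ['E', 'F'], 'count': [6, 12]})

# dfb's column names as a one-row frame, labelled with dfa's columns
header_row = pd.DataFrame([dfb.columns], columns=dfa.columns)
# relabel dfb's columns to match dfa's, then stack everything
dfc = pd.concat([dfa, header_row, dfb.set_axis(dfa.columns, axis=1)], ignore_index=True)
print(dfc)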
Related
My dataframe has multiple columns; one of them, named "multiple", contains boolean values stored as 1s and 0s. Now, I want to replicate the rows 4 extra times (keeping the original, so 5 copies in total), but only for the rows in df.loc[df.multiple==1]. How can I do that? (I don't want to replicate the indexes.)
example input:
df=
index strings multiple
0 A 0
1 B 1
2 C 1
3 D 0
4 E 1
Expected output:
index strings multiple
0 A 0
1 B 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 1
7 C 1
8 C 1
9 C 1
10 C 1
11 D 0
12 E 1
13 E 1
14 E 1
15 E 1
16 E 1
Here is another alternative, based on @Vinzent's answer.
It uses the same approach to construct the repeats, but doesn't require reconstructing the full dataframe; it is based on indexing instead. This solution is ~30% faster on the provided dataset and on larger ones.
df.loc[np.repeat(df.multiple, df.multiple.values*4+1).index].reset_index(drop=True)
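For reference, a self-contained version of this on the question's example frame (a sketch; variable names are my own):

import numpy as np
import pandas as pd

df = pd.DataFrame({'strings': ['A', 'B', 'C', 'D', 'E'],
                   'multiple': [0, 1, 1, 0, 1]})

# repeat each index label once (multiple == 0) or five times (multiple == 1),
# then use the repeated labels to pull the rows back out of df
repeated = np.repeat(df.multiple, df.multiple.values * 4 + 1)
out = df.loc[repeated.index].reset_index(drop=True)
print(out)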
This is what numpy.repeat is for:
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 0],
                   ['B', 1],
                   ['C', 1],
                   ['D', 0],
                   ['E', 1]],
                  columns=['strings', 'multiple'])
df = pd.DataFrame(np.repeat(df.values, df['multiple']*4+1, axis=0), columns=df.columns)
print(df)
# strings multiple
# 0 A 0
# 1 B 1
# 2 B 1
# 3 B 1
# 4 B 1
# 5 B 1
# 6 C 1
# 7 C 1
# 8 C 1
# 9 C 1
# 10 C 1
# 11 D 0
# 12 E 1
# 13 E 1
# 14 E 1
# 15 E 1
# 16 E 1
You can do it with pandas:
(df.groupby('multiple')
   .apply(lambda x: pd.concat([x]*5) if x.name else x)  # group 1: original row plus 4 copies
   .droplevel(level=0)
   .sort_index()
   .reset_index(drop=True)
)
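Another pandas-only variant that skips groupby entirely (my own sketch, not one of the original answers), using Index.repeat to keep each flagged row five times:

import pandas as pd

df = pd.DataFrame({'strings': ['A', 'B', 'C', 'D', 'E'],
                   'multiple': [0, 1, 1, 0, 1]})

# rows with multiple == 0 are kept once, rows with multiple == 1 five times
out = df.loc[df.index.repeat(df['multiple'] * 4 + 1)].reset_index(drop=True)
print(out)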
I have a pandas dataframe with the names of variables, the value for each, and a count column (which gives the frequency of that row):
df = pd.DataFrame({'var':['A', 'B', 'C'], 'value':[10, 20, 30], 'count':[1,2,3]})
var value count
A 10 1
B 20 2
C 30 3
I want to use count to get an output like this:
var value
A 10
B 20
B 20
C 30
C 30
C 30
What is the best way to do that?
You can use index.repeat:
i = df.index.repeat(df['count'])
d = df.loc[i, :'value'].reset_index(drop=True)
var value
0 A 10
1 B 20
2 B 20
3 C 30
4 C 30
5 C 30
Use repeat with reindex for this short one-liner:
df.reindex(df.index.repeat(df['count']))
Output:
var value count
0 A 10 1
1 B 20 2
1 B 20 2
2 C 30 3
2 C 30 3
2 C 30 3
Or to eliminate the 'count' column:
df[['var','value']].reindex(df.index.repeat(df['count']))
OR
df.reindex(df.index.repeat(df['count'])).drop('count', axis=1)
Output:
var value
0 A 10
1 B 20
1 B 20
2 C 30
2 C 30
2 C 30
Using Series.repeat
import pandas as pd
df = pd.DataFrame({'var':['A', 'B', 'C'], 'value':[10, 20, 30], 'count':[1,2,3]})
new_df = pd.DataFrame()
new_df['var'] = df['var'].repeat(df['count'])
new_df['value'] = df['value'].repeat(df['count'])
new_df
var value
0 A 10
1 B 20
1 B 20
2 C 30
2 C 30
2 C 30
There are many, many ways to achieve this. Here is one cheeky approach that I like doing:
df.transform({
    # turn each count value n into the list [0, 1, ..., n-1]
    "count": lambda x: [i for i in range(x)],
    "var": lambda x: x,
    "value": lambda x: x
}).explode("count").drop("count", axis=1)  # explode emits one row per list element, then drop the helper
It seems so basic, but I can't work out how to achieve the following...
Consider the scenario where I have the following data:
all_columns = ['A','B','C','D']
first_columns = ['A','B']
second_columns = ['C','D']
new_columns = ['E','F']
values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
df = pd.DataFrame(data = values, columns = all_columns)
df
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Using this data, how can I subtract, say, column C - column A, then column D - column B, and return the results as two new columns E and F in my pandas DataFrame? I have multiple columns, so writing the formula one by one is not an option.
I imagine it should be something like the following, but Python thinks that I am trying to subtract the list names rather than the values in the actual lists...
df[new_columns] = df[second_columns] - df[first_columns]
Expected output:
A B C D E F
0 1 2 3 4 2 2
1 5 6 7 8 2 2
2 9 10 11 12 2 2
3 13 14 15 16 2 2
df['E'] = df['C'] - df['A']
df['F'] = df['D'] - df['B']
Or, alternatively (similar to @rafaelc's comment):
new_cols = ['E', 'F']
second_cols = ['C', 'D']
first_cols = ['A', 'B']
df[new_cols] = df[second_cols] - df[first_cols].values
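The .values on the right-hand side is what makes this work: without it, pandas aligns the subtraction by column labels, 'C'/'D' never match 'A'/'B', and every column in the result is NaN. A quick sketch of the difference:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
                  columns=['A', 'B', 'C', 'D'])

print(df[['C', 'D']] - df[['A', 'B']])         # label-aligned: columns A, B, C, D, all NaN
print(df[['C', 'D']] - df[['A', 'B']].values)  # positional: the element-wise differences we want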
As @rafaelc and @Ben.T mentioned, the approach below is a good fit. I'm just placing it in the answer section for posterity...
>>> df
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Result:
>>> df[['E', 'F']] = df[['C', 'D']] - df[['A', 'B']].values
>>> df
A B C D E F
0 1 2 3 4 2 2
1 5 6 7 8 2 2
2 9 10 11 12 2 2
3 13 14 15 16 2 2
I have a dataframe defined as follows:
df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
                   'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'count': [2, 2, 2, 2, 1, 1, 2, 2, 2]})
For each class, I'd like to select the top n rows, where n is given by the count column. The expected output from the above dataframe would be like this:
   id class  count
0  11     A      2
1  12     A      2
4  21     B      1
6  31     C      2
7  32     C      2
How can I achieve this?
You could use
In [771]: df.groupby('class').apply(
              lambda x: x.head(x['count'].iloc[0])
          ).reset_index(drop=True)
Out[771]:
id class count
0 11 A 2
1 12 A 2
2 21 B 1
3 31 C 2
4 32 C 2
Use:
(df.groupby('class', as_index=False, group_keys=False)
   .apply(lambda x: x.head(x['count'].iloc[0])))
Output:
id class count
0 11 A 2
1 12 A 2
4 21 B 1
6 31 C 2
7 32 C 2
Using cumcount to build a boolean mask: keep a row if its 1-based position within its class group does not exceed its count value.
df[(df.groupby('class').cumcount()+1).le(df['count'])]
Out[150]:
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32
Here is a solution which groups by class, then looks at the first count value in each group and returns that many rows from the top.
def func(df_):
    count_val = df_['count'].values[0]
    return df_.iloc[0:count_val]
df.groupby('class', group_keys=False).apply(func)
returns
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32
I am trying to concatenate multiple Pandas DataFrames, some of which use multi-indexing and others use single indices. As an example, let's consider the following single indexed dataframe:
> import pandas as pd
> df1 = pd.DataFrame({'single': [10,11,12]})
> df1
single
0 10
1 11
2 12
Along with a multiindex dataframe:
> level_dict = {}
> level_dict[('level 1','a','h')] = [1,2,3]
> level_dict[('level 1','b','j')] = [5,6,7]
> level_dict[('level 2','c','k')] = [10, 11, 12]
> level_dict[('level 2','d','l')] = [20, 21, 22]
> df2 = pd.DataFrame(level_dict)
> df2
level 1 level 2
a b c d
h j k l
0 1 5 10 20
1 2 6 11 21
2 3 7 12 22
Now I wish to concatenate the two dataframes. When I try to use concat it flattens the multiindex as follows:
> df3 = pd.concat([df2,df1], axis=1)
> df3
(level 1, a, h) (level 1, b, j) (level 2, c, k) (level 2, d, l) single
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If instead I append a single column to the multiindex dataframe df2 as follows:
> df2['single'] = [10,11,12]
> df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
How can I instead generate this dataframe from df1 and df2 with concat, merge, or join?
I don't think you can avoid converting df1's single-level columns into a MultiIndex. This is probably the easiest way; you could also convert after joining (see the sketch after the output below).
In [48]: df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
In [49]: pd.concat([df2, df1], axis=1)
Out[49]:
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
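One way to read the "convert after joining" remark (my own sketch, not code from the original answer): concatenate first, which flattens the labels as shown in the question, then rebuild the MultiIndex by padding the non-tuple labels:

# assuming df1 and df2 as built in the question
df3 = pd.concat([df2, df1], axis=1)   # columns flatten to tuples plus the plain 'single' label
df3.columns = pd.MultiIndex.from_tuples(
    [c if isinstance(c, tuple) else (c, '', '') for c in df3.columns]
)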
If you're just appending one column you could access df1 essentially as a series:
df2[df1.columns[0]] = df1.iloc[:, 0]
df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you could have just made a series in the first place it would be a little easier to read. This command would do the same thing:
ser1 = df1.iloc[:, 0] # make df1's column into a series
df2[ser1.name] = ser1