I have a pandas dataframe with name of variables, the values for each and the count (which shows the frequency of that row):
df = pd.DataFrame({'var':['A', 'B', 'C'], 'value':[10, 20, 30], 'count':[1,2,3]})
var value count
A 10 1
B 20 2
C 30 3
I want to use count to get an output like this:
var value
A 10
B 20
B 20
C 30
C 30
C 30
What is the best way to do that?
You can use index.repeat:
i = df.index.repeat(df['count'])
d = df.loc[i, :'value'].reset_index(drop=True)
var value
0 A 10
1 B 20
2 B 20
3 C 30
4 C 30
5 C 30
Use repeat with reindex for this short one-liner:
df.reindex(df.index.repeat(df['count']))
Output:
var value count
0 A 10 1
1 B 20 2
1 B 20 2
2 C 30 3
2 C 30 3
2 C 30 3
Or to eliminate the 'count' column:
df[['var','value']].reindex(df.index.repeat(df['count']))
OR
df.reindex(df.index.repeat(df['count'])).drop('count', axis=1)
Output:
var value
0 A 10
1 B 20
1 B 20
2 C 30
2 C 30
2 C 30
Using Series.repeat
import pandas as pd
df = pd.DataFrame({'var':['A', 'B', 'C'], 'value':[10, 20, 30], 'count':[1,2,3]})
new_df = pd.DataFrame()
new_df['var'] = df['var'].repeat(df['count'])
new_df['value'] = df['value'].repeat(df['count'])
new_df
var value
0 A 10
1 B 20
1 B 20
2 C 30
2 C 30
2 C 30
There are many, many ways to achieve this. Here is one cheeky approach that I like doing:
df.transform({
"count": lambda x: [i for i in range(x)],
"var": lambda x: x,
"value": lambda x: x
}).explode("count").drop("count", axis=1)
Related
Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(l = []):
return (l.nlargest(2).min())
print(df.apply(second_largest, axis = 1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for positions of second largest values:
df['new'] = df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should working, but is slow:
def second_largest(l = []):
return (l.nlargest(2).idxmin())
print(df.apply(second_largest, axis = 1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
I have a DataFrame with a format like this (simplified)
a b 43
a c 22
I would like this to be split up in the following way.
a b 20
a b 20
a b 1
a b 1
a b 1
a c 20
a c 1
a c 1
Where I have as many rows as the number divides by 20, and then as many rows as the remainder. I have a solution that basically iterates over the rows and fills up a dictionary which can then be converted back to Dataframe but I was wondering if there is a better solution.
You can use floor divison with modulo first and then create new DataFrame by constructor with numpy.repeat.
Last need numpy.concatenate with list comprehension for C:
a,b = df.C // 20, df.C % 20
#print (a, b)
cols = ['A','B']
df = pd.DataFrame({x: np.repeat(df[x], a + b) for x in cols})
df['C'] = np.concatenate([[20] * x + [1] * y for x,y in zip(a,b)])
print (df)
A B C
0 a b 20
0 a b 20
0 a b 1
0 a b 1
0 a b 1
1 a c 20
1 a c 1
1 a c 1
Setup
Consider the dataframe df
df = pd.DataFrame(dict(A=['a', 'a'], B=['b', 'c'], C=[43, 22]))
df
A B C
0 a b 43
1 a c 22
np.divmod and np.repeat
m = np.array([20, 1])
dm = list(zip(*np.divmod(df.C.values, m[0])))
# [(2, 3), (1, 2)]
rep = [sum(x) for x in dm]
new = np.concatenate([m.repeat(x) for x in dm])
df.loc[df.index.repeat(rep)].assign(C=new)
A B C
0 a b 20
0 a b 20
0 a b 1
0 a b 1
0 a b 1
1 a c 20
1 a c 1
1 a c 1
i have 2 sample datasets dfa and dfb:
import pandas as pd
a = {
'unit': ['A', 'B', 'C', 'D'],
'count': [ 1, 12, 34, 52]
}
b = {
'department': ['E', 'F'],
'count': [ 6, 12]
}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
they looks like:
dfa
count unit
1 A
12 B
34 C
52 D
dfb
count department
6 E
12 F
what I want is simply have dfa stack on top of dfb not based on any column or any index. i have checked this page: https://pandas.pydata.org/pandas-docs/stable/merging.html but couldn't find the right one for my purpose.
my desired output is to create a dfc that looks like below dataset, i want to keep the headers:
dfc:
count unit
1 A
12 B
34 C
52 D
count department
6 E
12 F
In [37]: pd.concat([dfa, pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
Out[37]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
or
In [39]: dfa.append(pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)) \
.reset_index(drop=True)
Out[39]:
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
UPDATE: merging 3 DFs:
pd.concat([dfa,
pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns),
pd.DataFrame(dfc.T.reset_index().T.values, columns=dfa.columns)],
ignore_index=True)
Option 1
You can construct it from scratch using np.vstack
pd.DataFrame(
np.vstack([dfa.values, dfb.columns, dfb.values]),
columns=dfa.columns
)
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
Option 2
You can export to csv and read it back
from io import StringIO
import pandas as pd
pd.read_csv(StringIO(
'\n'.join([d.to_csv(index=None) for d in [dfa, dfb]])
))
count unit
0 1 A
1 12 B
2 34 C
3 52 D
4 count department
5 6 E
6 12 F
dfa.loc[len(dfa),:] = dfb.columns
dfb.columns = dfa.columns
dfa.append(dfb)
I am trying to concatenate multiple Pandas DataFrames, some of which use multi-indexing and others use single indices. As an example, let's consider the following single indexed dataframe:
> import pandas as pd
> df1 = pd.DataFrame({'single': [10,11,12]})
> df1
single
0 10
1 11
2 12
Along with a multiindex dataframe:
> level_dict = {}
> level_dict[('level 1','a','h')] = [1,2,3]
> level_dict[('level 1','b','j')] = [5,6,7]
> level_dict[('level 2','c','k')] = [10, 11, 12]
> level_dict[('level 2','d','l')] = [20, 21, 22]
> df2 = pd.DataFrame(level_dict)
> df2
level 1 level 2
a b c d
h j k l
0 1 5 10 20
1 2 6 11 21
2 3 7 12 22
Now I wish to concatenate the two dataframes. When I try to use concat it flattens the multiindex as follows:
> df3 = pd.concat([df2,df1], axis=1)
> df3
(level 1, a, h) (level 1, b, j) (level 2, c, k) (level 2, d, l) single
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If instead I append a single column to the multiindex dataframe df2 as follows:
> df2['single'] = [10,11,12]
> df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
How can I instead generate this dataframe from df1 and df2 with concat, merge, or join?
I don't think you can avoid converting the single index into a MultiIndex. This is probably the easiest way, you could also convert after joining.
In [48]: df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
In [49]: pd.concat([df2, df1], axis=1)
Out[49]:
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you're just appending one column you could access df1 essentially as a series:
df2[df1.columns[0]] = df1.iloc[:, 0]
df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you could have just made a series in the first place it would be a little easier to read. This command would do the same thing:
ser1 = df1.iloc[:, 0] # make df1's column into a series
df2[ser1.name] = ser1
I have a large DataFrame of observations. i.e.
value 1,value 2
a,1
a,1
a,2
b,3
a,3
I now have an external DataFrame of values
_ ,a,b
1 ,10,20
2 ,30,40
3 ,50,60
What will be an efficient way to add to the first DataFrame the values from the indexed table? i.e.:
value 1,value 2, new value
a,1,10
a,1,10
a,2,30
b,3,60
a,3,50
An alternative solution using .lookup(). It's just one line, vectorized solution. suitable for large dataset.
import pandas as pd
import numpy as np
# generate some artificial data
# ================================
np.random.seed(0)
df1 = pd.DataFrame(dict(value1=np.random.choice('a b'.split(), 10), value2=np.random.randint(1, 10, 10)))
df2 = pd.DataFrame(dict(a=np.random.randn(10), b=np.random.randn(10)), columns=['a', 'b'], index=np.arange(1, 11))
df1
Out[178]:
value1 value2
0 a 6
1 b 3
2 b 5
3 a 8
4 b 7
5 b 9
6 b 9
7 b 2
8 b 7
9 b 8
df2
Out[179]:
a b
1 2.5452 0.0334
2 1.0808 0.6806
3 0.4843 -1.5635
4 0.5791 -0.5667
5 -0.1816 -0.2421
6 1.4102 1.5144
7 -0.3745 -0.3331
8 0.2752 0.0474
9 -0.9608 1.4627
10 0.3769 1.5350
# processing: one liner lookup function
# =======================================================
# df1.value2 is the index and df1.value1 is the column
df1['new_values'] = df2.lookup(df1.value2, df1.value1)
Out[181]:
value1 value2 new_values
0 a 6 1.4102
1 b 3 -1.5635
2 b 5 -0.2421
3 a 8 0.2752
4 b 7 -0.3331
5 b 9 1.4627
6 b 9 1.4627
7 b 2 0.6806
8 b 7 -0.3331
9 b 8 0.0474
Assuming your first and second dfs are df and df1 respectively, you can merge on the matching columns and then mask the 'a' and 'b' conditions:
In [9]:
df = df.merge(df1, left_on=['value 2'], right_on=['_'])
a_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'a')
b_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'b')
df.loc[a_mask, 'new value'] = df['a'].where(a_mask)
df.loc[b_mask, 'new value'] = df['b'].where(b_mask)
df
Out[9]:
value 1 value 2 _ a b new value
0 a 1 1 10 20 10
1 a 1 1 10 20 10
2 a 2 2 30 40 30
3 b 3 3 50 60 60
4 a 3 3 50 60 50
You can then drop the additional columns:
In [11]:
df = df.drop(['_','a','b'], axis=1)
df
Out[11]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
Another way is to define a func to perform the lookup:
In [15]:
def func(x):
row = df1[(df1['_'] == x['value 2'])]
return row[x['value 1']].values[0]
df['new value'] = df.apply(lambda x: func(x), axis = 1)
df
Out[15]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
EDIT
Using #Jianxun Li's lookup works but you have to offset the index as your index is 0 based:
In [20]:
df['new value'] = df1.lookup(df['value 2'] - 1, df['value 1'])
df
Out[20]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50