Split row into multiple rows in pandas - python

I have a DataFrame with a format like this (simplified)
a b 43
a c 22
I would like this to be split up in the following way.
a b 20
a b 20
a b 1
a b 1
a b 1
a c 20
a c 1
a c 1
That is, I want as many rows of 20 as the number of times 20 divides into the value, followed by as many rows of 1 as the remainder. I have a solution that iterates over the rows and fills a dictionary, which can then be converted back to a DataFrame, but I was wondering if there is a better solution.

You can use floor division and modulo first, and then create a new DataFrame with the constructor and numpy.repeat.
Finally, build column C with numpy.concatenate and a list comprehension:
a,b = df.C // 20, df.C % 20
#print (a, b)
cols = ['A','B']
df = pd.DataFrame({x: np.repeat(df[x], a + b) for x in cols})
df['C'] = np.concatenate([[20] * x + [1] * y for x,y in zip(a,b)])
print (df)
A B C
0 a b 20
0 a b 20
0 a b 1
0 a b 1
0 a b 1
1 a c 20
1 a c 1
1 a c 1

Setup
Consider the dataframe df
df = pd.DataFrame(dict(A=['a', 'a'], B=['b', 'c'], C=[43, 22]))
df
A B C
0 a b 43
1 a c 22
np.divmod and np.repeat
m = np.array([20, 1])
dm = list(zip(*np.divmod(df.C.values, m[0])))
# [(2, 3), (1, 2)]
rep = [sum(x) for x in dm]
new = np.concatenate([m.repeat(x) for x in dm])
df.loc[df.index.repeat(rep)].assign(C=new)
A B C
0 a b 20
0 a b 20
0 a b 1
0 a b 1
0 a b 1
1 a c 20
1 a c 1
1 a c 1
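The two answers above share the same idea, which can be sketched as one small helper. This is only an illustration: the function name split_rows and its parameters col and unit are hypothetical, and it assumes non-negative integer values in the column being split.

```python
import numpy as np
import pandas as pd


def split_rows(df, col="C", unit=20):
    """Split each row into value // unit rows of `unit` plus value % unit rows of 1."""
    quotient, remainder = np.divmod(df[col].to_numpy(), unit)
    # Repeat every original row once per output row it produces.
    out = df.loc[df.index.repeat(quotient + remainder)].copy()
    # For each original row: `unit` repeated q times, then 1 repeated r times.
    out[col] = np.concatenate(
        [np.repeat([unit, 1], [q, r]) for q, r in zip(quotient, remainder)]
    )
    return out.reset_index(drop=True)


df = pd.DataFrame({"A": ["a", "a"], "B": ["b", "c"], "C": [43, 22]})
result = split_rows(df)
print(result)
```

Unlike the outputs shown above, this sketch resets the index, so the result is numbered 0 to 7 instead of repeating the original row labels.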

Related

Get the column names for 2nd largest value for each row in a Pandas dataframe

Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(l = []):
    return (l.nlargest(2).min())
print(df.apply(second_largest, axis = 1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for positions of second largest values:
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should work, but it is slow:
def second_largest(l = []):
    return (l.nlargest(2).idxmin())
print(df.apply(second_largest, axis = 1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas approach (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
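As a self-contained sketch, the argsort approach can be wrapped so that it works for any rank. The helper name nth_largest_column and its parameter n are hypothetical, and the sketch assumes every column is numeric; when a row contains ties, which of the tied columns is returned is not guaranteed.

```python
import numpy as np
import pandas as pd


def nth_largest_column(df, n=2):
    """Return, for each row, the label of the column holding the n-th largest value."""
    values = df.to_numpy()
    # argsort is ascending, so position -n along each row is the n-th largest.
    positions = np.argsort(values, axis=1)[:, -n]
    return pd.Series(df.columns.to_numpy()[positions], index=df.index)


df = pd.DataFrame({
    "a": [4, 5, 3, 1, 2],
    "b": [20, 10, 40, 50, 30],
    "c": [25, 20, 5, 15, 10],
})
second = nth_largest_column(df, n=2)
print(second)
```

With n=1 the helper should agree with df.idxmax(axis=1) on this data, which is a quick way to sanity-check it.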

Join an array to every row in the pandas dataframe

I have a data frame and an array as follows:
df = pd.DataFrame({'x': range(0,5), 'y' : range(1,6)})
s = np.array(['a', 'b', 'c'])
I would like to attach the array to every row of the data frame, so that I get a data frame as follows:
What would be the most efficient way to do this?
Just plain assignment:
# replace the first `s` with your desired column names
df[s] = [s]*len(df)
Try this:
for i in s:
    df[i] = i
Output:
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You could use pandas.concat:
pd.concat([df, pd.DataFrame(s).T], axis=1).ffill()
output:
x y 0 1 2
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You can try using df.loc here.
df.loc[:, s] = s
print(df)
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
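Another option that leaves the original frame untouched is DataFrame.assign with one scalar per new column. The sketch below assumes, as in the outputs above, that each new column should be named after the value it holds; that naming is an assumption, not something the question states.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(0, 5), "y": range(1, 6)})
s = np.array(["a", "b", "c"])

# One constant column per element of `s`; pandas broadcasts each scalar
# down all rows. Column names equal the values (an assumption).
result = df.assign(**{str(value): str(value) for value in s})
print(result)
```

Because assign returns a new DataFrame, df itself still has only the x and y columns afterwards.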

Create adjacency matrix from adjacency list

I have the following DataFrame with two columns:
A x
A y
A z
B x
B w
C x
C w
C i
I want to produce an adjacency matrix like this (count the intersection)
A B C
A 0 1 2
B 1 0 2
C 2 2 0
I have the following code, but it doesn't work:
import pandas as pd
df = pd.read_csv('lista.csv')
drugs = pd.read_csv('drugs.csv')
drugs = drugs['Drug'].tolist()
df = pd.crosstab(df.Drug, df.Gene)
df = df.reindex(index=drugs, columns=drugs)
How can I obtain the adjacency matrix?
Thanks
Try self merge on column 2 and then crosstab:
s = df.merge(df,on='col2').query('col1_x != col1_y')
pd.crosstab(s['col1_x'], s['col1_y'])
Output:
col1_y A B C
col1_x
A 0 1 1
B 1 0 2
C 1 2 0
Input:
>>> drugs
Drug Gene
0 A x
1 A y
2 A z
3 B x
4 B w
5 C x
6 C w
7 C i
Merge on gene before crosstab and fill diagonal with zeros
df = pd.merge(drugs, drugs, on="Gene")
df = pd.crosstab(df["Drug_x"], df["Drug_y"])
np.fill_diagonal(df.values, 0)
Output:
>>> df
Drug_y A B C
Drug_x
A 0 1 1
B 1 0 2
C 1 2 0
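The same counts can also be obtained without a self merge, by multiplying the drug-by-gene indicator matrix with its transpose. This is a minimal sketch that rebuilds the input shown above by hand; the variable names incidence, shared, and adjacency are illustrative only.

```python
import numpy as np
import pandas as pd

drugs = pd.DataFrame({
    "Drug": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "Gene": ["x", "y", "z", "x", "w", "x", "w", "i"],
})

# Drug-by-gene indicator matrix (1 if the pair occurs in the list).
incidence = pd.crosstab(drugs["Drug"], drugs["Gene"]).clip(upper=1)
# Entry (i, j) of the product counts the genes shared by drug i and drug j.
shared = incidence.dot(incidence.T)
# Zero the diagonal on a copy of the values rather than in place.
counts = shared.to_numpy().copy()
np.fill_diagonal(counts, 0)
adjacency = pd.DataFrame(counts, index=shared.index, columns=shared.index)
print(adjacency)
```

Working on an explicit copy avoids relying on df.values being a writable view of the DataFrame, which may not hold in every pandas configuration.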

pandas combine pivot table with DataFrame

I want to group my data set and enrich it with a formatted representation of the aggregated information.
This is my data set:
h = ['A', 'B', 'C']
d = [["a", "x", 1], ["a", "y", 2], ["b", "y", 4]]
rows = pd.DataFrame(d, columns=h)
A B C
0 a x 1
1 a y 2
2 b y 4
I create a pivot table to generate 0 for missing values:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
C
B x y
A
a 1 2
b 0 4
I group by A to remove dimension B:
wanted = rows.groupby("A")[["C"]].sum()
C
A
a 3
b 4
I try to add a column with the string representation of the aggregate details:
wanted["D"] = pivot["C"].applymap(lambda vs: reduce(lambda a,b: str(a)+"+"+str(b), vs.values))
AttributeError: ("'int' object has no attribute 'values'", u'occurred at index x')
It seems that I don't understand applymap.
What I want to achieve is:
C D
A
a 3 1+2
b 4 0+4
First, pass scalars instead of lists to pivot_table, which avoids a MultiIndex in the columns:
pivot = pd.pivot_table(rows,index="A", values="C", columns="B",fill_value=0)
Then sum values by columns:
pivot['C'] = pivot.sum(axis=1)
print (pivot)
B x y C
A
a 1 2 3
b 0 4 4
Cast the integer columns x and y to str with astype and write the joined result to D:
pivot['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (pivot)
B x y C D
A
a 1 2 3 1+2
b 0 4 4 0+4
Finally, remove the columns name with rename_axis (new in pandas 0.18.0) and drop the unnecessary columns:
pivot = pivot.rename_axis(None, axis=1).drop(['x', 'y'], axis=1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
If you want to keep the MultiIndex in the columns instead:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
pivot['E'] = pivot["C"].sum(1)
print (pivot)
C E
B x y
A
a 1 2 3
b 0 4 4
pivot["D"] = pivot[('C','x')].astype(str) + '+' + pivot[('C','y')].astype(str)
print (pivot)
C E D
B x y
A
a 1 2 3 1+2
b 0 4 4 0+4
pivot = pivot.rename_axis((None,None), axis=1).drop('C', axis=1).rename(columns={'E':'C'})
pivot.columns = pivot.columns.droplevel(-1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
EDIT:
Another solution with groupby and MultiIndex.droplevel:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
#remove top level of Multiindex in columns
pivot.columns = pivot.columns.droplevel(0)
print (pivot)
B x y
A
a 1 2
b 0 4
wanted = rows.groupby("A")[["C"]].sum()
wanted['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (wanted)
C D
A
a 3 1+2
b 4 0+4
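The solutions above name the x and y columns explicitly. As a hedged sketch for the case where the number of distinct B values is not known in advance, the per-row strings can be joined across all pivot columns. The explicit aggfunc="sum" and the cast to int are assumptions added here so that the strings read "1+2" and not "1.0+2.0".

```python
import pandas as pd

rows = pd.DataFrame(
    [["a", "x", 1], ["a", "y", 2], ["b", "y", 4]], columns=["A", "B", "C"]
)

# One column per value of B, with 0 where a combination is missing.
pivot = rows.pivot_table(
    index="A", columns="B", values="C", aggfunc="sum", fill_value=0
).astype(int)

wanted = pd.DataFrame({
    "C": pivot.sum(axis=1),
    # Join the per-B values of each row with "+", however many columns exist.
    "D": pivot.astype(str).agg("+".join, axis=1),
})
print(wanted)
```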

Convert N by N Dataframe to 3 Column Dataframe

I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n DataFrame where:
1) The index represents people's names
2) The column headers are the same people's names in the same order
3) Each cell of the DataFrame is the average number of times they email each other each day.
How would I transform that DataFrame into a DataFrame with 3 columns, where:
1) Column 1 would be the index of the n by n DataFrame
2) Column 2 would be the column headers of the n by n DataFrame
3) Column 3 would be the cell value corresponding to those two names from the index, column header combination from the n by n DataFrame
Edit
Apologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, from the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
##df of all relationships to build
flds = pd.Series(SO_df.fld1.unique())
flds = pd.Series(flds.append(pd.Series(SO_df.fld2.unique())).unique())
combos = []
for L in range(0, len(flds)+1):
    for subset in permutations(flds, L):
        if len(subset) == 2:
            combos.append(subset)
        if len(subset) > 2:
            break
rel_df = pd.DataFrame.from_records(data = combos, columns = ['fld1','fld2'])
rel_df['value'] = [1,4,5,7,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
print df1
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove self-pairs (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4
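For reference, here is a self-contained sketch of the stack approach that names the output columns up front and renumbers the rows so the index matches the rel_df in the question. The frame is rebuilt from the printed df1; the names long_df and rel_df are used only for illustration.

```python
import pandas as pd

names = ["a", "b", "c", "d", "e"]
df1 = pd.DataFrame(
    [[0, 1, 4, 5, 7],
     [1, 0, 1, 1, 1],
     [2, 2, 0, 2, 2],
     [3, 3, 3, 0, 3],
     [4, 4, 4, 4, 0]],
    index=names, columns=names,
)

# Name both axes so that reset_index yields the final column labels directly.
long_df = (
    df1.rename_axis(index="fld1", columns="fld2")
       .stack()
       .rename("value")
       .reset_index()
)
# Drop self-pairs such as ('a', 'a') and renumber the remaining rows.
rel_df = long_df[long_df["fld1"] != long_df["fld2"]].reset_index(drop=True)
print(rel_df)
```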
