DataFrame of lists to appropriate series - python

Suppose I have the dataframe df defined below
df = pd.DataFrame([
[[1, 2], [3, 4, 5]],
[[6], [7, 8, 9, 0]]
], list('AB'), list('XY'))
How do I convert it to:
A X 0 1
1 2
Y 0 3
1 4
2 5
B X 0 6
Y 0 7
1 8
2 9
3 0
dtype: int64
What I have tried
I started by admonishing the person who did this. That did not work.

Calling stack a couple times and applying pd.Series:
df.stack().apply(pd.Series).stack().astype(int)
The resulting output (the values are correct, but the dtype is int32 rather than the desired int64):
A X 0 1
1 2
Y 0 3
1 4
2 5
B X 0 6
Y 0 7
1 8
2 9
3 0
dtype: int32
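For completeness, here is a sketch of an alternative that skips the intermediate wide frame that apply(pd.Series) creates and builds the flattened Series (with its int64 dtype) directly. The list comprehensions are my own, not part of the original attempt:

```python
import pandas as pd

df = pd.DataFrame([
    [[1, 2], [3, 4, 5]],
    [[6], [7, 8, 9, 0]],
], list('AB'), list('XY'))

# Stack to get one list per (row, column) pair, then build the
# flattened values and a three-level index by hand.
s = df.stack()
index = pd.MultiIndex.from_tuples(
    [(r, c, i) for (r, c), lst in s.items() for i in range(len(lst))]
)
result = pd.Series([v for lst in s for v in lst], index=index)
```

Because the values are plain Python ints, the Series comes out as int64 without an explicit astype.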

Rename unique values pandas column in ascending order

I have a pandas DataFrame that looks similar to the one below:
df = pd.DataFrame({
'label': [0, 0, 2, 3, 8, 8, 9],
'value1': [2, 1, 9, 8, 7, 4, 2],
'value2': [0, 1, 9, 4, 2, 3, 1],
})
>>> df
label value1 value2
0 0 2 0
1 0 1 1
2 2 9 9
3 3 8 4
4 8 7 2
5 8 4 3
6 9 2 1
Values in the label column are not consecutive (not range(0, n, 1)) due to previous slicing. I would like to relabel the column, assigning a sequential range of ascending values, so that it becomes:
>>> df
label value1 value2
0 1 2 0
1 1 1 1
2 2 9 9
3 3 8 4
4 4 7 2
5 4 4 3
6 5 2 1
I currently use the code below. Because my real DataFrame has thousands of unique values, any suggestions for doing this more efficiently (i.e. without looping over every unique value) would be appreciated.
for new_idx, idx in enumerate(df.label.unique()):
    df.loc[df['label'] == idx, ['label']] = new_idx
Thanks in advance
Use factorize to improve performance:
df['label'] = pd.factorize(df['label'])[0] + 1
print (df)
label value1 value2
0 1 2 0
1 1 1 1
2 2 9 9
3 3 8 4
4 4 7 2
5 4 4 3
6 5 2 1
Another idea with Series.rank:
df['label'] = df['label'].rank(method='dense').astype(int)
print (df)
label value1 value2
0 1 2 0
1 1 1 1
2 2 9 9
3 3 8 4
4 4 7 2
5 4 4 3
6 5 2 1
Both give the same result only when the unique values already appear in ascending order:
# data changed to show the difference
df = pd.DataFrame({
'label': [0, 10, 10, 3, 8, 8, 9],
'value1': [2, 1, 9, 8, 7, 4, 2],
'value2': [0, 1, 9, 4, 2, 3, 1],
})
df['label1'] = pd.factorize(df['label'])[0] + 1
df['label2'] = df['label'].rank(method='dense').astype(int)
print (df)
label value1 value2 label1 label2
0 0 2 0 1 1
1 10 1 1 2 5
2 10 9 9 2 5
3 3 8 4 3 2
4 8 7 2 4 3
5 8 4 3 4 3
6 9 2 1 5 4
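Yet another option, sketched here as my own addition rather than taken from the answer above, is numpy.unique with return_inverse=True. It is equivalent to rank(method='dense') (not factorize), because the codes follow sorted order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'label': [0, 0, 2, 3, 8, 8, 9],
    'value1': [2, 1, 9, 8, 7, 4, 2],
    'value2': [0, 1, 9, 4, 2, 3, 1],
})

# return_inverse gives, for every element, its position among the
# sorted unique values -- i.e. sorted-order codes, like a dense rank.
_, inverse = np.unique(df['label'].to_numpy(), return_inverse=True)
df['label'] = inverse + 1
```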

How do you add an array to each previous row in pandas?

If I have an array [1, 2, 3, 4, 5] and a Pandas Dataframe
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row; you can build those increments with np.arange(len(df))[:, None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add to every row except first, then cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment (l is the array [1, 2, 3, 4, 5]):
l = [1, 2, 3, 4, 5]
df[1:] = (df + l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
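Putting that last approach together as a self-contained sketch (the astype(int) at the end is my addition, since shift/fillna promotes the frame to float):

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1]] + [[0] * 5] * 3)
a = [1, 2, 3, 4, 5]

# Add a to every row, take the running total, then shift down one row
# so each row receives the accumulated increments of the rows above it.
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
out = df.add(updated).astype(int)
```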

pandas cut multiple columns

I am looking to apply a bin across a number of columns.
a = [1, 2, 9, 1, 5, 3]
b = [9, 8, 7, 8, 9, 1]
c = [a, b]
print(pd.cut(c, 3, labels=False))
which works great and creates:
[[0 0 2 0 1 0]
[2 2 2 2 2 0]]
However, I would like to apply the cut to create, for each list, a dataframe of the values and their bins as below.
Values bin
0 1 0
1 2 0
2 9 2
3 1 0
4 5 1
5 3 0
Values bin
0 9 2
1 8 2
2 7 2
3 8 2
4 9 2
5 1 0
This is a simple example of what I'm looking to do. In reality I have 63 separate dataframes, and a and b are examples of a column from each dataframe.
Use zip with a list comprehension to build a list of dataframes:
c = [a, b]
r = pd.cut(c, 3, labels=False)
df_list = [pd.DataFrame({'Values' : v, 'Labels' : l}) for v, l in zip(c, r)]
df_list
[ Labels Values
0 0 1
1 0 2
2 2 9
3 0 1
4 1 5
5 0 3, Labels Values
0 2 9
1 2 8
2 2 7
3 2 8
4 2 9
5 0 1]
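Note that recent pandas versions insist on 1-D input to pd.cut, so if the 2-D call above fails for you, a sketch of the same idea is to flatten first (which keeps the bin edges shared across all lists) and slice the codes back apart; the helper names here are my own:

```python
import pandas as pd

a = [1, 2, 9, 1, 5, 3]
b = [9, 8, 7, 8, 9, 1]
lists = [a, b]

# Cutting all the values at once makes the three bins span the joint
# min/max, so every list is binned against the same edges.
flat = [v for lst in lists for v in lst]
codes = pd.cut(flat, 3, labels=False)

df_list, start = [], 0
for lst in lists:
    df_list.append(pd.DataFrame({'Values': lst,
                                 'bin': codes[start:start + len(lst)]}))
    start += len(lst)
```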

Split a MultiIndex DataFrame based on another DataFrame

Say you have a multiindex DataFrame
x y z
a 1 0 1 2
2 3 4 5
b 1 0 1 2
2 3 4 5
3 6 7 8
c 1 0 1 2
2 0 4 6
Now you have another DataFrame which is
col1 col2
0 a 1
1 b 1
2 b 3
3 c 1
4 c 2
How do you split the multiindex DataFrame based on the one above?
Use loc with a list of tuples:
df = df1.loc[df2.set_index(['col1','col2']).index.tolist()]
print (df)
x y z
a 1 0 1 2
b 1 0 1 2
3 6 7 8
c 1 0 1 2
2 0 4 6
Or build the tuples directly from df2's values:
df = df1.loc[[tuple(x) for x in df2.values.tolist()]]
print (df)
x y z
a 1 0 1 2
b 1 0 1 2
3 6 7 8
c 1 0 1 2
2 0 4 6
Or join:
df = df2.join(df1, on=['col1','col2']).set_index(['col1','col2'])
print (df)
x y z
col1 col2
a 1 0 1 2
b 1 0 1 2
3 6 7 8
c 1 0 1 2
2 0 4 6
Simply using isin:
df[df.index.isin(list(zip(df2['col1'],df2['col2'])))]
Out[342]:
0 1 2 3
index1 index2
a 1 1 0 1 2
b 1 1 0 1 2
3 3 6 7 8
c 1 1 0 1 2
2 2 0 4 6
You can also do this using the reindex method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
## Recreate your dataframes
tuples = [('a', 1), ('a', 2),
('b', 1), ('b', 2),
('b', 3), ('c', 1),
('c', 2)]
data = [[1, 0, 1, 2],
[2, 3, 4, 5],
[1, 0, 1, 2],
[2, 3, 4, 5],
[3, 6, 7, 8],
[1, 0, 1, 2],
[2, 0, 4, 6]]
idx = pd.MultiIndex.from_tuples(tuples, names=['index1','index2'])
df = pd.DataFrame(data=data, index=idx)
df2 = pd.DataFrame([['a', 1],
['b', 1],
['b', 3],
['c', 1],
['c', 2]])
# Answer Question
idx_subset = pd.MultiIndex.from_tuples([(a, b) for a, b in df2.values], names=['index1', 'index2'])
out = df.reindex(idx_subset)
print(out)
0 1 2 3
index1 index2
a 1 1 0 1 2
b 1 1 0 1 2
3 3 6 7 8
c 1 1 0 1 2
2 2 0 4 6

Pandas dataframe: how to group by values in a column and create new columns out of grouped values

I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that all the values in column x are repeated N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in order N times,
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
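The cumcount/pivot idea can also be written as a single chain; passing values='y' keeps the result a flat frame instead of the MultiIndex columns shown above. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2] * 4,
                   'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})

# cumcount supplies the new column labels (0..3 per value of x), and
# pivot spreads y across them.
result = (df.assign(col=df.groupby('x').cumcount())
            .pivot(index='x', columns='col', values='y'))
```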
