I have a dataframe like this:
1 2 3 4 5 6
Ax Ax Ax Ax Ax Ax
delta delta delta delta delta delta
0 6 4 1 5 3 2
1 6 1 5 3 2 4
2 6 1 5 3 2 4
3 6 1 5 3 2 4
4 6 1 5 3 2 4
5 6 1 5 3 2 4
6 6 1 5 3 2 4
7 6 1 5 3 2 4
8 6 1 5 3 2 4
9 6 1 5 3 2 4
I would like to pivot this such that the values are the column, and the columns are the value.
So, the first two rows would become the following:
1 2 3 4 5 6
0 3 6 5 2 4 1
1 3 6 2 5 4 1
I hope that this makes sense. I have tried using pivot() and pivot_table() but it doesn't seem possible with that.
Try:
df1 = df.copy()
df1.columns = df1.columns.droplevel([1,2])
df1.stack().reset_index().pivot(index='level_0', columns=0)
Slice the columns by the sorted indices:
import numpy as np
import pandas as pd
cols = df.columns.get_level_values(0).to_numpy()
pd.DataFrame(cols[np.argsort(df.to_numpy(), 1)],
columns=list(range(1, df.shape[1]+1)))
1 2 3 4 5 6
0 3 6 5 2 4 1
1 2 5 4 6 3 1
2 2 5 4 6 3 1
3 2 5 4 6 3 1
4 2 5 4 6 3 1
5 2 5 4 6 3 1
6 2 5 4 6 3 1
7 2 5 4 6 3 1
8 2 5 4 6 3 1
9 2 5 4 6 3 1
I'm figuring out how to assign a categorization from an increasing enumeration column. Here an example of my dataframe:
df = pd.DataFrame({'A':[1,1,1,1,1,1,2,2,3,3,3,3,3],'B':[1,2,3,12,13,14,1,2,5,6,7,8,50]})
This produce:
df
Out[9]:
A B
0 1 1
1 1 2
2 1 3
3 1 12
4 1 13
5 1 14
6 2 1
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 3 50
The column B has an increasing numerical serie, but sometimes the series is interrupted and keeps going with other numbers or start again. My desired output is:
Out[11]:
A B C
0 1 1 1
1 1 2 1
2 1 3 1
3 1 12 2
4 1 13 2
5 1 14 2
6 2 1 3
7 2 2 3
8 3 5 3
9 3 6 4
10 3 7 4
11 3 8 4
12 3 50 5
I appreciate your suggestions, because I can not find an ingenious way to
do it. Thanks
Is this what you need ?
df.B.diff().ne(1).cumsum()
Out[463]:
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 4
9 4
10 4
11 4
12 5
Name: B, dtype: int32
I have a data like this:
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
from source.
I would like to change all different distinct values from all of the data (dataframe) into numeric values in most efficient way.
In the above mentioned example I would like to transform republican-> 1 and democrat -> 2, y ->3, n->4 and ? -> 5 (or NULL).
I tried to use the following:
# Convert string column to integer
def str_column_to_int(dataset, column):
class_values = [row[column] for row in dataset]
unique = set(class_values)
lookup = dict()
for i, value in enumerate(unique):
lookup[value] = i
for row in dataset:
row[column] = lookup[row[column]]
return lookup
However, I'm not sure if using Pandas can be more efficient or there are some other better solutions for it. (This should be generic to any source of data).
Here is the transform of data into dataframe using Pandas:
import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset = pd.read_csv(file_path, header=None)
v = df.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
pd.DataFrame(f)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 1 2 1 2 2 2 1 1 1 2 3 2 2 2 1 2
1 0 1 2 1 2 2 2 1 1 1 1 1 2 2 2 1 3
2 4 3 2 2 3 2 2 1 1 1 1 2 1 2 2 1 1
3 4 1 2 2 1 3 2 1 1 1 1 2 1 2 1 1 2
4 4 2 2 2 1 2 2 1 1 1 1 2 3 2 2 2 2
5 4 1 2 2 1 2 2 1 1 1 1 1 1 2 2 2 2
6 4 1 2 1 2 2 2 1 1 1 1 1 1 3 2 2 2
7 0 1 2 1 2 2 2 1 1 1 1 1 1 2 2 3 2
Use replace on the whole dataframe to make the mappings. You could first pass a dictionary of known mappings for values you need to remain consistent, and then generate a set of values for the dataset and map these extra values to say values 100 upwards.
For example, the ? here is not mapped, so would get a value of 100:
mappings = {'republican':1, 'democrat':2, 'y':3, 'n':4}
unknown = set(pd.unique(df.values.ravel())) - set(mappings.keys())
mappings.update([v, c] for c, v in enumerate(unknown, start=100))
df.replace(mappings, inplace=True)
Giving you:
republican n n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9 ? n.10 n.11 n.12 n.13 n.14
0 1 4 3 4 3 3 3 4 4 4 3 100 3 3 3 4 3
1 1 4 3 4 3 3 3 4 4 4 4 4 3 3 3 4 100
2 2 100 3 3 100 3 3 4 4 4 4 3 4 3 3 4 4
3 2 4 3 3 4 100 3 4 4 4 4 3 4 3 4 4 3
4 2 3 3 3 4 3 3 4 4 4 4 3 100 3 3 3 3
5 2 4 3 3 4 3 3 4 4 4 4 4 4 3 3 3 3
6 2 4 3 4 3 3 3 4 4 4 4 4 4 100 3 3 3
7 1 4 3 4 3 3 3 4 4 4 4 4 4 3 3 100 3
A more generalized version would be:
mappings = {v:c for c, v in enumerate(sorted(set(pd.unique(df.values.ravel()))), start=1)}
df.replace(mappings, inplace=True)
You can use:
v = df.values
a, b = v.shape
f = pd.factorize(v.T.ravel())[0].reshape(b,a).T
df = pd.DataFrame(f)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 2 4 2 4 4 4 2 2 2 4 3 4 4 4 2 4
1 0 2 4 2 4 4 4 2 2 2 2 2 4 4 4 2 3
2 1 3 4 4 3 4 4 2 2 2 2 4 2 4 4 2 2
3 1 2 4 4 2 3 4 2 2 2 2 4 2 4 2 2 4
4 1 4 4 4 2 4 4 2 2 2 2 4 3 4 4 4 4
5 1 2 4 4 2 4 4 2 2 2 2 2 2 4 4 4 4
6 1 2 4 2 4 4 4 2 2 2 2 2 2 3 4 4 4
7 0 2 4 2 4 4 4 2 2 2 2 2 2 4 4 3 4
How do I count the number of unique strings in a rolling window of a pandas dataframe?
a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
Output, same as original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you need first convert values to numeric - by factorize or by rank. Also min_periods parameter is necessary for avoid NaN in start of column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I have a dictionary as follows:
d={1:(array[2,3]), 2:(array[8,4,5]), 3:(array[6,7,8,9])}
As depicted, here the values for each key are variable length arrays.
Now I want to convert it to DataFrame. So the output looks like:
A B
1 2
1 3
2 8
2 4
2 5
3 6
3 7
3 8
3 9
I used pd.Dataframe(d), but it does not handle one to many mapping.Any help would be appreciated.
Use Series constructor with str.len for lenghts of lists (arrays was converted to lists).
Then create new DataFrame with numpy.repeat, numpy.concatenate and Index.values:
d = {1:np.array([2,3]), 2:np.array([8,4,5]), 3:np.array([6,7,8,9])}
print (d)
a = pd.Series(d)
l = a.str.len()
df = pd.DataFrame({'A':np.repeat(a.index.values, l), 'B': np.concatenate(a.values)})
print (df)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
pd.DataFrame(
[[k, v] for k, a in d.items() for v in a.tolist()],
columns=['A', 'B']
)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
Setup
d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}
Here's my version:
(pd.DataFrame.from_dict(d, orient='index').rename_axis('A')
.stack()
.reset_index(name='B')
.drop('level_1', axis=1)
.astype('int'))
Out[63]:
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9