Efficient conversion of dataframe distinct values in Python

I have data like this:
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
from source.
I would like to convert every distinct value in the whole dataset (dataframe) into a numeric value in the most efficient way.
In the example above, I would like to map republican -> 1, democrat -> 2, y -> 3, n -> 4 and ? -> 5 (or NULL).
I tried the following:
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
However, I'm not sure whether Pandas would be more efficient, or whether there is a better solution altogether. (It should be generic to any data source.)
Here is how I load the data into a dataframe using Pandas:
import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset = pd.read_csv(file_path, header=None)

v = df.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
pd.DataFrame(f)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 1 2 1 2 2 2 1 1 1 2 3 2 2 2 1 2
1 0 1 2 1 2 2 2 1 1 1 1 1 2 2 2 1 3
2 4 3 2 2 3 2 2 1 1 1 1 2 1 2 2 1 1
3 4 1 2 2 1 3 2 1 1 1 1 2 1 2 1 1 2
4 4 2 2 2 1 2 2 1 1 1 1 2 3 2 2 2 2
5 4 1 2 2 1 2 2 1 1 1 1 1 1 2 2 2 2
6 4 1 2 1 2 2 2 1 1 1 1 1 1 3 2 2 2
7 0 1 2 1 2 2 2 1 1 1 1 1 1 2 2 3 2

Use replace on the whole dataframe to apply the mappings. You could first pass a dictionary of known mappings for the values that must stay consistent, then collect the remaining values in the dataset and map them to, say, 100 upwards.
For example, the ? here is not in the known mappings, so it gets the value 100:
mappings = {'republican':1, 'democrat':2, 'y':3, 'n':4}
unknown = set(pd.unique(df.values.ravel())) - set(mappings.keys())
mappings.update([v, c] for c, v in enumerate(unknown, start=100))
df.replace(mappings, inplace=True)
Giving you:
republican n n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9 ? n.10 n.11 n.12 n.13 n.14
0 1 4 3 4 3 3 3 4 4 4 3 100 3 3 3 4 3
1 1 4 3 4 3 3 3 4 4 4 4 4 3 3 3 4 100
2 2 100 3 3 100 3 3 4 4 4 4 3 4 3 3 4 4
3 2 4 3 3 4 100 3 4 4 4 4 3 4 3 4 4 3
4 2 3 3 3 4 3 3 4 4 4 4 3 100 3 3 3 3
5 2 4 3 3 4 3 3 4 4 4 4 4 4 3 3 3 3
6 2 4 3 4 3 3 3 4 4 4 4 4 4 100 3 3 3
7 1 4 3 4 3 3 3 4 4 4 4 4 4 3 3 100 3
A more generalized version would be:
mappings = {v:c for c, v in enumerate(sorted(set(pd.unique(df.values.ravel()))), start=1)}
df.replace(mappings, inplace=True)
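The two-step idea above (fixed codes for known values, generated codes for everything else) can be sketched on a small frame. This is only an illustration: the three-row sample, the names known and encoded, and the use of Series.map instead of replace are my assumptions, not part of the answer above. Sorting the unknown values is also my addition, so that the generated codes are reproducible between runs.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the voting data.
df = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
    ["democrat", "y", "n", "n"],
])

# Codes that must stay consistent across datasets.
known = {"republican": 1, "democrat": 2, "y": 3, "n": 4}

# Everything else gets a generated code from 100 upwards;
# sorting keeps the assignment stable between runs.
unknown = sorted(set(pd.unique(df.to_numpy().ravel())) - set(known))
mapping = {**known, **{v: c for c, v in enumerate(unknown, start=100)}}

# Look every cell up in the mapping, one column at a time.
encoded = df.apply(lambda col: col.map(mapping))
print(encoded)
```

With map, a value missing from the mapping becomes NaN rather than being left unchanged, which can make gaps in the lookup table easier to spot.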

You can use:
v = df.values
a, b = v.shape
f = pd.factorize(v.T.ravel())[0].reshape(b,a).T
df = pd.DataFrame(f)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 2 4 2 4 4 4 2 2 2 4 3 4 4 4 2 4
1 0 2 4 2 4 4 4 2 2 2 2 2 4 4 4 2 3
2 1 3 4 4 3 4 4 2 2 2 2 4 2 4 4 2 2
3 1 2 4 4 2 3 4 2 2 2 2 4 2 4 2 2 4
4 1 4 4 4 2 4 4 2 2 2 2 4 3 4 4 4 4
5 1 2 4 4 2 4 4 2 2 2 2 2 2 4 4 4 4
6 1 2 4 2 4 4 4 2 2 2 2 2 2 3 4 4 4
7 0 2 4 2 4 4 4 2 2 2 2 2 2 4 4 3 4
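The only difference between this and factorizing v.ravel() directly is the order in which codes are handed out: pd.factorize numbers values by first appearance, so transposing first means the first column is numbered before the others. A minimal sketch on a made-up two-row frame (the sample data and the names row_major and col_major are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-row sample.
df = pd.DataFrame([["republican", "n", "y"],
                   ["democrat", "?", "y"]])
v = df.to_numpy()
rows, cols = v.shape

# Row-major: codes are assigned in reading order, row by row.
row_major = pd.factorize(v.ravel())[0].reshape(v.shape)

# Column-major: codes are assigned down each column first,
# so the labels in column 0 receive the lowest codes.
col_major = pd.factorize(v.T.ravel())[0].reshape(cols, rows).T

print(row_major)
print(col_major)
```

Here the column-major version gives republican and democrat the codes 0 and 1, whereas the row-major version gives democrat the code 3 because n and y are seen first.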

Related

How to fill by counting last and forward N values with static window in pandas

I have calendar data with a day-type flag: whether each day is a holiday or not.
I want to create two new features:
feature_1: the number of holidays in the week that contains the row.
feature_2: the number of holidays in the N-window around the row (to the left and to the right). In the example, N=5 (including the current value).
Example:
is_holiday feature_1 feature_2
idx
0 0 2 0
1 0 2 1
2 0 2 2
3 0 2 2
4 0 2 2
5 1 2 2
6 1 2 2
7 0 3 3
8 0 3 4
9 0 3 5
10 0 3 4
11 1 3 3
12 1 3 3
13 1 3 3
...
I think you need to group every 7 rows and aggregate with sum; for the second feature, use Series.rolling:
df['f1'] = df.groupby(df.index // 7)['is_holiday'].transform('sum')
df['f2'] = df['is_holiday'].rolling(9, center=True, min_periods=1).sum().astype(int)
print (df)
is_holiday feature_1 feature_2 f1 f2
idx
0 0 2 0 2 0
1 0 2 1 2 1
2 0 2 2 2 2
3 0 2 2 2 2
4 0 2 2 2 2
5 1 2 2 2 2
6 1 2 2 2 2
7 0 3 3 3 3
8 0 3 4 3 4
9 0 3 5 3 5
10 0 3 4 3 4
11 1 3 3 3 3
12 1 3 3 3 3
13 1 3 3 3 3
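The window size of 9 comes from N=5 counted in both directions with the current row shared, that is 2*N - 1. A self-contained sketch that parameterizes this and cross-checks the rolling result against plain slicing (the variable names n, window and check are assumptions, and a default 0..13 integer index is assumed):

```python
import pandas as pd

df = pd.DataFrame({"is_holiday": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]})

n = 5               # reach on each side, counting the current row
window = 2 * n - 1  # 4 rows left + current row + 4 rows right

# Weekly total: rows 0-6 form the first block, rows 7-13 the second.
df["f1"] = df.groupby(df.index // 7)["is_holiday"].transform("sum")

# Centered window; min_periods=1 lets the edges use a shorter window.
df["f2"] = (df["is_holiday"]
            .rolling(window, center=True, min_periods=1)
            .sum()
            .astype(int))

# Brute-force check of the same window using list slicing.
values = df["is_holiday"].tolist()
check = [sum(values[max(0, i - (n - 1)): i + n]) for i in range(len(values))]
print(df)
```

Note that df.index // 7 only corresponds to calendar weeks if the index is a 0-based row counter and the data starts on the first day of a week.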

Count the most frequent values in a row pandas and make a column with that most frequent value

I have a data frame like this below:
a b c
0 3 3 3
1 3 3 3
2 3 3 3
3 3 3 3
4 2 3 2
5 3 3 3
6 1 2 1
7 2 3 2
8 0 0 0
9 0 1 0
I want to find the most frequent value in each row and add a column result containing that value, like this:
a b c result
0 3 3 3 3
1 3 3 3 3
2 3 3 3 3
3 3 3 3 3
4 2 3 2 2
5 3 3 3 3
6 1 2 1 1
7 2 3 2 2
8 0 0 0 0
9 0 1 0 0
I tried transposing and looping through the transposed columns to get the value_counts, but could not get the right result.
Any help is highly appreciated.
Use DataFrame.mode with axis=1, then select the first column by position with DataFrame.iloc:
df['result'] = df.mode(axis=1).iloc[:, 0]
print (df)
a b c result
0 3 3 3 3
1 3 3 3 3
2 3 3 3 3
3 3 3 3 3
4 2 3 2 2
5 3 3 3 3
6 1 2 1 1
7 2 3 2 2
8 0 0 0 0
9 0 1 0 0
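One thing worth knowing is why iloc[:, 0] is needed: when a row has ties, mode returns one column per tied value and pads the other rows with NaN. A small sketch with a tied row (the four-row sample and the name modes are assumptions; as far as I know the tied values come back sorted, so column 0 holds the smallest of them):

```python
import pandas as pd

# Hypothetical sample; the last row has a three-way tie.
df = pd.DataFrame({"a": [3, 2, 1, 0],
                   "b": [3, 3, 2, 1],
                   "c": [3, 2, 1, 2]})

# One column per tied mode; rows with fewer modes are padded with NaN.
modes = df.mode(axis=1)

# The NaN padding makes the columns float, so cast back to int.
df["result"] = modes.iloc[:, 0].astype(int)
print(modes)
print(df)
```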

How to drop certain values after checking a condition on the second column?

Assuming a df as follows:
Product Time
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
2 5
2 6
2 7
3 1
3 2
3 3
4 1
4 2
4 3
I would like to keep only those Products that have more than 3 rows (Time values) and drop the others.
In the above example, after I do
df.groupby(['Product']).size()
I get the following output:
1 4
2 7
3 3
4 3
and based on this, from my main df, I would only like to retain Product 1 & 2
Expected output:
Product Time
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
2 5
2 6
2 7
Use GroupBy.transform to return a Series with the same size as the original, which makes filtering by boolean indexing possible:
df = df[df.groupby(['Product'])['Product'].transform('size') > 3]
print (df)
Product Time
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 2 7
Details:
b = df.groupby(['Product'])['Product'].transform('size') > 3
a = df.groupby(['Product'])['Product'].transform('size')
print (df.assign(size=a, filter=b))
Product Time size filter
0 1 1 4 True
1 1 2 4 True
2 1 3 4 True
3 1 4 4 True
4 2 1 7 True
5 2 2 7 True
6 2 3 7 True
7 2 4 7 True
8 2 5 7 True
9 2 6 7 True
10 2 7 7 True
11 3 1 3 False
12 3 2 3 False
13 3 3 3 False
14 4 1 3 False
15 4 2 3 False
16 4 3 3 False
If the DataFrame is not large, an alternative is DataFrameGroupBy.filter:
df = df.groupby(['Product']).filter(lambda x: len(x) > 3)
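To see that both approaches select the same rows, here is a self-contained sketch that rebuilds the sample data and compares them (the names sizes, kept and by_filter are assumptions for illustration):

```python
import pandas as pd

# Rebuild the sample: Products 1-4 with 4, 7, 3 and 3 rows.
df = pd.DataFrame({
    "Product": [1] * 4 + [2] * 7 + [3] * 3 + [4] * 3,
    "Time": list(range(1, 5)) + list(range(1, 8)) + list(range(1, 4)) * 2,
})

# transform('size') broadcasts each group's row count back to its rows.
sizes = df.groupby("Product")["Product"].transform("size")
kept = df[sizes > 3]

# filter calls the lambda once per group and keeps whole groups.
by_filter = df.groupby("Product").filter(lambda g: len(g) > 3)

print(kept)
```

The transform version stays vectorized, while filter runs a Python function per group, which is why it tends to be the slower choice when there are many groups.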
Instead, use transform('size') after grouping, check which results are greater than (gt) 3, and use that to perform boolean indexing on your dataframe:
df[df.groupby('Product').Time.transform('size').gt(3)]
Product Time
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 2 7
You can do this if you don't plan to use assign and prefer boolean indexing:
g = df.groupby('Product')
t = g['Time'].transform('count')
df['c'] = t  # new column holding the count
df2 = df[df['c'] > 3]
print(df2)
Product Time
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 2 7
11 3 1
12 3 2
13 3 3
14 4 1
15 4 2
16 4 3
Product Time c
0 1 1 4
1 1 2 4
2 1 3 4
3 1 4 4
4 2 1 7
5 2 2 7
6 2 3 7
7 2 4 7
8 2 5 7
9 2 6 7
10 2 7 7

pandas - pivot column names as values

I have a dataframe like this:
1 2 3 4 5 6
Ax Ax Ax Ax Ax Ax
delta delta delta delta delta delta
0 6 4 1 5 3 2
1 6 1 5 3 2 4
2 6 1 5 3 2 4
3 6 1 5 3 2 4
4 6 1 5 3 2 4
5 6 1 5 3 2 4
6 6 1 5 3 2 4
7 6 1 5 3 2 4
8 6 1 5 3 2 4
9 6 1 5 3 2 4
I would like to pivot this such that the values are the column, and the columns are the value.
So, the first two rows would become the following:
1 2 3 4 5 6
0 3 6 5 2 4 1
1 3 6 2 5 4 1
I hope that this makes sense. I have tried using pivot() and pivot_table() but it doesn't seem possible with that.
Try:
df1 = df.copy()
df1.columns = df1.columns.droplevel([1,2])
df1.stack().reset_index().pivot(index='level_0', columns=0)
Slice the columns by the sorted indices:
import numpy as np
import pandas as pd
cols = df.columns.get_level_values(0).to_numpy()
pd.DataFrame(cols[np.argsort(df.to_numpy(), 1)],
             columns=list(range(1, df.shape[1]+1)))
1 2 3 4 5 6
0 3 6 5 2 4 1
1 2 5 4 6 3 1
2 2 5 4 6 3 1
3 2 5 4 6 3 1
4 2 5 4 6 3 1
5 2 5 4 6 3 1
6 2 5 4 6 3 1
7 2 5 4 6 3 1
8 2 5 4 6 3 1
9 2 5 4 6 3 1
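This works because each row is a permutation of 1..6: argsort returns, for each value in ascending order, the position where it sits, and indexing the column labels with those positions swaps values and labels. A self-contained sketch on the first two rows, using plain integer column labels instead of the three-level header (that simplification and the names order and out are assumptions):

```python
import numpy as np
import pandas as pd

# First two rows of the example, with a flat header for simplicity.
df = pd.DataFrame([[6, 4, 1, 5, 3, 2],
                   [6, 1, 5, 3, 2, 4]],
                  columns=[1, 2, 3, 4, 5, 6])

cols = df.columns.to_numpy()

# order[r, k] is the position in row r of the (k+1)-th smallest value.
order = np.argsort(df.to_numpy(), axis=1)

# Fancy indexing turns those positions into the original column labels.
out = pd.DataFrame(cols[order], columns=range(1, df.shape[1] + 1))
print(out)
```

If a row contained repeated or missing values it would no longer be a permutation, and the result would need a different interpretation.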

Filling in sequence previous values based on current value in pandas

I have a pandas data frame which looks like below:
ID Value
1 2
2 6
3 3
4 5
I want a new dataframe which gives
ID Value
1 0
1 1
1 2
2 0
2 1
2 2
2 3
2 4
2 5
2 6
3 1
3 2
3 3
3 4
Any kind of suggestions would be appreciated.
Use reindex with repeat to duplicate the rows, then cumcount to fill in the new values:
df.reindex(df.index.repeat(df.Value+1)).assign(Value=lambda x : x.groupby('ID').cumcount())
Out[611]:
ID Value
0 1 0
0 1 1
0 1 2
1 2 0
1 2 1
1 2 2
1 2 3
1 2 4
1 2 5
1 2 6
2 3 0
2 3 1
2 3 2
2 3 3
3 4 0
3 4 1
3 4 2
3 4 3
3 4 4
3 4 5
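The same steps as a self-contained sketch, with the duplicated index reset at the end (the reset_index call and the name out are my additions; the approach also assumes each ID appears only once in the input, as in the example):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4], "Value": [2, 6, 3, 5]})

out = (
    # Repeat each row Value + 1 times (0..Value inclusive).
    df.reindex(df.index.repeat(df["Value"] + 1))
      # Number the repeats within each ID: 0, 1, 2, ...
      .assign(Value=lambda x: x.groupby("ID").cumcount())
      # Replace the duplicated index labels with a fresh 0..n-1 index.
      .reset_index(drop=True)
)
print(out)
```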
Try:
new_df = (df.groupby('ID').Value
          .apply(lambda x: pd.Series(np.arange(x.iloc[0] + 1)))
          .reset_index().drop(columns='level_1'))
ID Value
0 1 0
1 1 1
2 1 2
3 2 0
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 3 0
11 3 1
12 3 2
13 3 3
14 4 0
15 4 1
16 4 2
17 4 3
18 4 4
19 4 5
Using stack and a list comprehension:
vals = [np.arange(i+1) for i in df.Value]
(pd.DataFrame(vals, index=df.ID)
.stack().reset_index(1, drop=True).astype(int).to_frame('Value'))
Value
ID
1 0
1 1
1 2
2 0
2 1
2 2
2 3
2 4
2 5
2 6
3 0
3 1
3 2
3 3
4 0
4 1
4 2
4 3
4 4
4 5
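A variation on the list-comprehension idea, not taken from the answers above, is to store the ranges in the Value column and let DataFrame.explode (available from pandas 0.25) expand them. This keeps ID as a regular column rather than the index. The name out is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4], "Value": [2, 6, 3, 5]})

out = (
    # Replace each Value with the list [0, 1, ..., Value].
    df.assign(Value=[list(range(v + 1)) for v in df["Value"]])
      # One output row per list element; ID is repeated alongside.
      .explode("Value")
      # explode leaves an object column, so cast back to int.
      .astype({"Value": int})
      .reset_index(drop=True)
)
print(out)
```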
