How to collapse/group columns in pandas - python

I have data whose column names are day numbers (up to 3000 columns), with values 0/1.
I would like to group the columns by week (days 1-7 into week_1, days 8-14 into week_2, and so on).
If at least one of the columns 1-7 contains a 1, week_1 should be 1, otherwise 0.

Convert the id column to the index, then aggregate with max over groups of columns defined by a helper array: integer-divide each column position by 7 and add 1 to get the week number:
import numpy as np
import pandas as pd
pd.options.display.max_columns = 30
np.random.seed(2020)
df = pd.DataFrame(np.random.choice([1,0], size=(5, 21), p=(0.1, 0.9)))
df.columns += 1
df.insert(0, 'id', 1000 + df.index)
print (df)
id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 \
0 1000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1001 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1002 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
3 1003 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
4 1004 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20 21
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
df = df.set_index('id')
arr = np.arange(len(df.columns)) // 7 + 1
df = df.groupby(arr, axis=1).max().add_prefix('week_').reset_index()
print (df)
id week_1 week_2 week_3
0 1000 0 0 0
1 1001 1 0 0
2 1002 1 1 0
3 1003 1 1 1
4 1004 1 0 0
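Newer pandas releases (2.1 and later) emit a deprecation warning for groupby(..., axis=1). Below is a minimal sketch of the same weekly aggregation that groups the transposed frame instead. The small `days` frame and its values are made up for illustration and are not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 ids, 14 daily 0/1 columns named 1..14
data = np.zeros((3, 14), dtype=int)
data[1, 5] = 1    # id 1001 has a 1 on day 6
data[2, 2] = 1    # id 1002 has a 1 on day 3
data[2, 13] = 1   # ... and on day 14
days = pd.DataFrame(data, index=[1000, 1001, 1002], columns=range(1, 15))
days.index.name = 'id'

# Week number for each column position: 1 for the first seven, 2 for the next seven
week = np.arange(days.shape[1]) // 7 + 1

# Group the rows of the transposed frame, then transpose back
weekly = days.T.groupby(week).max().T.add_prefix('week_').reset_index()
print(weekly)
```

Transposing copies the data, so for very wide frames it may be worth checking memory use before relying on this.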

import pandas as pd
import numpy as np
id = list(range(1000, 1010))
cl = list(range(1,22))
data_ = np.random.rand(10,21)
data_
client_data = pd.DataFrame(data=data_, index=id, columns=cl)
def change_col(col_hd: int) -> str:
    # Map a 1-based day number to its week label, e.g. days 1-7 -> 'week_1'
    week_num = (col_hd + 6) // 7
    week_header = 'week_' + str(week_num)
    return week_header
new_col_header = []
for c in cl:
    new_col_header.append(change_col(c))
client_data.columns = new_col_header
client_data.columns.name = 'id'
client_data.groupby(axis='columns', level=0).sum()

Related

Trying to merge dictionaries together to create a new df, but the dictionaries' values aren't showing up in the df

For my quarter columns I get NaN instead of values such as 1,0,0,0.
How do I fix the code below so that the values appear in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that contains qrt_1, qrt_2, etc. under their corresponding column names.
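By default pd.concat stacks the frames vertically (axis=0), so each frame's column is NaN in the rows that came from the other frames. Use axis=1 in pd.concat to place them side by side: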
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1
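To see why the original call produced NaN, here is a small sketch comparing the two axes. The three-row frames are made up for illustration and are much shorter than the lists in the question.

```python
import pandas as pd

# Two hypothetical single-column frames sharing the same default index 0..2
year = pd.DataFrame({'year': [1, 1, 2]})
q1 = pd.DataFrame({'q1': [1, 0, 1]})

stacked = pd.concat([year, q1])               # default axis=0: rows are appended
side_by_side = pd.concat([year, q1], axis=1)  # columns are aligned on the index

# Stacking gives 6 rows, and each column is NaN in the other frame's rows
print(stacked.shape, int(stacked.isna().sum().sum()))            # (6, 2) 6
# Side by side gives 3 rows and no missing values
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))  # (3, 2) 0
```

Because all the dictionaries in the question have lists of equal length, merging them into one dictionary and calling pd.DataFrame once would be another option.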

How to alternate values in two columns in a dataframe?

I'm trying to create two new columns in which starts and ends alternate in a dataframe:
each start has at most one matching end
the last start may have no matching end
there is no end before the first start
two or more consecutive starts, or two or more consecutive ends, are not allowed
How can I do that without a loop, using only numpy or pandas functions?
The code to create the dataframe :
df = pd.DataFrame({ 'start':[0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0],
'end':[1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0]})
The input and the result I want:
start end start_wanted end_wanted
0 0 1 0 0
1 0 0 0 0
2 1 0 1 0
3 0 0 0 0
4 1 0 0 0
5 0 0 0 0
6 1 0 0 0
7 0 1 0 1
8 0 0 0 0
9 0 1 0 0
10 0 0 0 0
11 1 0 1 0
12 0 0 0 0
13 1 0 0 0
14 0 0 0 0
15 0 1 0 1
16 0 0 0 0
17 1 0 1 0
18 0 0 0 0
I don't know how to do this with pure pandas/numpy but here's a simple for loop that gives your expected output. I tested it with a pandas dataframe 50,000 times the size of your example data (so around 1 million rows in total) and it runs in roughly 1 second:
import pandas as pd
df = pd.DataFrame({ 'start':[0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0],
'end':[1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0]})
start = False
start_wanted = []
end_wanted = []
for s, e in zip(df['start'], df['end']):
    if start:
        if e == 1:
            start = False
        start_wanted.append(0)
        end_wanted.append(e)
    else:
        if s == 1:
            start = True
        start_wanted.append(s)
        end_wanted.append(0)
df['start_wanted'] = start_wanted
df['end_wanted'] = end_wanted
print(df)
Output:
end start start_wanted end_wanted
0 1 0 0 0
1 0 0 0 0
2 0 1 1 0
3 0 0 0 0
4 0 1 0 0
5 0 0 0 0
6 0 1 0 0
7 1 0 0 1
8 0 0 0 0
9 1 0 0 0
10 0 0 0 0
11 0 1 1 0
12 0 0 0 0
13 0 1 0 0
14 0 0 0 0
15 1 0 0 1
16 0 0 0 0
17 0 1 1 0
18 0 0 0 0
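For comparison, here is a loop-free sketch of the same filtering. It relies on one observation: after any start the state is "open" and after any end it is "closed", whether or not that event was kept, so an event is kept exactly when it differs from the previous event. The sketch assumes no row has start and end both equal to 1, which holds for the sample data. The names `event`, `events` and `kept` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'start': [0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0],
                   'end':   [1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0]})

# +1 marks a start, -1 marks an end, 0 marks no event
# (assumption: a row never has start == 1 and end == 1 together)
event = df['start'] - df['end']
events = event[event != 0]

# Keep an event only if it differs from the previous event. The fill value -1
# treats the state before the first row as "ended", so a leading end is dropped.
keep = events != events.shift(fill_value=-1)
kept = events[keep].reindex(df.index, fill_value=0)

df['start_wanted'] = (kept == 1).astype(int)
df['end_wanted'] = (kept == -1).astype(int)
print(df)
```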

Create a sub columns in the dataframe using a another dataframe

I am new to Python and pandas. I have the following dataframe:
did features offset word JAPE_feature manual_feature
0 200 0 aa 200 200
0 200 11 bf 200 200
0 200 12 vf 100 100
0 100 13 rw 2200 2200
0 100 14 asd 2600 100
0 2200 16 dsdd 2200 2200
0 2600 18 wd 2200 2600
0 2600 20 wsw 2600 2600
0 4600 21 sd 4600 4600
I also have an array with all the feature values that can appear for an id:
feat = [100,200,2200,2600,156,162,4600,100]
I am trying to create a dataframe that looks like this:
id Features
100 200 2200 2600 156 162 4600 100
0 0 1 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 0 0 1 0
In addition, when comparing columns, I want:
feature_manual
1
1
0
0
1
1
1
1
1
Here I am comparing the features and manual_feature columns: if the values are the same the result is 1, otherwise 0. For example, the first row has 200 in both columns, so it gets 1.
That is the expected output. I am trying to put a 1 under the matching feature in the new CSV and 0 under all the others.
This is done row by row.
For example, the feature in the first row is 200, so there is a 1 under 200 and 0 elsewhere.
Can anyone help me with this?
What I tried is:
mux = pd.MultiIndex.from_product([['features'],feat])
df = pd.DataFrame(data, columns=mux)
This creates the sub-columns but removes all the other values. Can anyone help me?
Use get_dummies with DataFrame.reindex:
feat = [100,200,2200,2600,156,162,4600,100]
df = df.join(pd.get_dummies(df.pop('features')).reindex(feat, axis=1, fill_value=0))
print (df)
id 100 200 2200 2600 156 162 4600 100
0 0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0 0
2 2 0 1 0 0 0 0 0 0
3 4 1 0 0 0 0 0 0 1
4 5 1 0 0 0 0 0 0 1
5 7 0 0 1 0 0 0 0 0
6 8 0 0 0 1 0 0 0 0
7 9 0 0 0 1 0 0 0 0
8 10 0 0 0 0 0 0 1 0
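A self-contained sketch of the get_dummies plus reindex idea on a tiny frame may make the mechanics clearer. The three rows and the shortened feature list are made up for illustration. dtype=int is passed because recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# Hypothetical miniature of the question's data
df = pd.DataFrame({'id': [0, 1, 2], 'features': [200, 100, 2600]})
feat = [100, 200, 2200, 2600]  # every feature that may occur, in the wanted order

# One indicator column per observed value, then force the full column set;
# features that never occur (2200 here) are added as all-zero columns
dummies = pd.get_dummies(df['features'], dtype=int).reindex(columns=feat, fill_value=0)
out = df[['id']].join(dummies)
print(out)
```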
If you need a MultiIndex in the columns, pass mux to reindex instead, and also convert the id column to the index:
feat = [100,200,2200,2600,156,162,4600,100]
mux = pd.MultiIndex.from_product([['features'],feat])
df = pd.get_dummies(df.set_index('id')['features']).reindex(mux, axis=1, fill_value=0)
print (df)
features
100 200 2200 2600 156 162 4600 100
id
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0
EDIT:
cols = ['features', 'JAPE_feature', 'manual_feature']
df = pd.get_dummies(df, columns=cols)
df.columns = df.columns.str.rsplit('_', n=1, expand=True)
print (df)
did offset word features JAPE_feature \
NaN NaN NaN 100 200 2200 2600 4600 100 200 2200 2600
0 0 0 aa 0 1 0 0 0 0 1 0 0
1 0 11 bf 0 1 0 0 0 0 1 0 0
2 0 12 vf 0 1 0 0 0 1 0 0 0
3 0 13 rw 1 0 0 0 0 0 0 1 0
4 0 14 asd 1 0 0 0 0 0 0 0 1
5 0 16 dsdd 0 0 1 0 0 0 0 1 0
6 0 18 wd 0 0 0 1 0 0 0 1 0
7 0 20 wsw 0 0 0 1 0 0 0 0 1
8 0 21 sd 0 0 0 0 1 0 0 0 0
manual_feature
4600 100 200 2200 2600 4600
0 0 0 1 0 0 0
1 0 0 1 0 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 1 0 0 0 0
5 0 0 0 1 0 0
6 0 0 0 0 1 0
7 0 0 0 0 1 0
8 1 0 0 0 0 1
To avoid missing values in the column MultiIndex for the columns that are not one-hot encoded, move those columns into the index first:
cols = ['features', 'JAPE_feature', 'manual_feature']
df = df.set_index(df.columns.difference(cols).tolist())
df = pd.get_dummies(df, columns=cols)
df.columns = df.columns.str.rsplit('_', n=1, expand=True)
print (df)
features JAPE_feature \
100 200 2200 2600 4600 100 200 2200 2600 4600
did offset word
0 0 aa 0 1 0 0 0 0 1 0 0 0
11 bf 0 1 0 0 0 0 1 0 0 0
12 vf 0 1 0 0 0 1 0 0 0 0
13 rw 1 0 0 0 0 0 0 1 0 0
14 asd 1 0 0 0 0 0 0 0 1 0
16 dsdd 0 0 1 0 0 0 0 1 0 0
18 wd 0 0 0 1 0 0 0 1 0 0
20 wsw 0 0 0 1 0 0 0 0 1 0
21 sd 0 0 0 0 1 0 0 0 0 1
manual_feature
100 200 2200 2600 4600
did offset word
0 0 aa 0 1 0 0 0
11 bf 0 1 0 0 0
12 vf 1 0 0 0 0
13 rw 0 0 1 0 0
14 asd 1 0 0 0 0
16 dsdd 0 0 1 0 0
18 wd 0 0 0 1 0
20 wsw 0 0 0 1 0
21 sd 0 0 0 0 1
EDIT:
To compare some columns from a list with the manual_feature column, use DataFrame.eq and convert the result to integers:
cols = ['JAPE_feature', 'features']
df1 = df[cols].eq(df['manual_feature'], axis=0).astype(int)
print (df1)
JAPE_feature features
0 1 1
1 1 1
2 1 0
3 1 0
4 0 1
5 1 1
6 0 1
7 1 1
8 1 1
Less fancy solution, but maybe easier to understand:
First, put the feature value of each row into a list called, for example, list_features.
Then:
# List all possible features
feat = [100,200,2200,2600,156,162,4600,100]
# Build one row per entry of list_features: 1 in the matching feature column, 0 elsewhere.
# (DataFrame.append was removed in pandas 2.0, so the rows are collected first.)
rows = [{y: 1 if x == y else 0 for y in feat} for x in list_features]
df_final = pd.DataFrame(rows)
Problems like this can be solved in many ways; here is a simple one. Create a df with the features list as column names, then use comparison logic to fill it with 0 and 1. Other approaches can avoid the for loops.
import pandas as pd
data = {'id':[0,1,2,3,4,5,7,8,9,10],
'features':[200, 200, 200, 200, 100, 100, 2200, 2600, 2600, 4600]}
df1 = pd.DataFrame(data)
features_list = [100,200,2200,2600,156,162,4600]
id_list = df1.id.to_list()
df2 = pd.DataFrame(columns=features_list)
list2 = list()
for i in id_list:
    list1 = list()
    for k in df2.columns:
        if df1[df1.id == i].features.iloc[0] == k:
            list1.append(1)
        else:
            list1.append(0)
    list2.append(list1)
for i in range(0, len(list2)):
    df2.loc[i] = list2[i]
df2.insert(0, "id", id_list)
print(df2)
id 100 200 2200 2600 156 162 4600
0 0 0 1 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 1 0 0 0 0 0
3 3 0 1 0 0 0 0 0
4 4 1 0 0 0 0 0 0
5 5 1 0 0 0 0 0 0
6 7 0 0 1 0 0 0 0
7 8 0 0 0 1 0 0 0
8 9 0 0 0 1 0 0 0
9 10 0 0 0 0 0 0 1

How to concatenate bit columns in Python Pandas?

Seems like an easy question but I'm running into an odd error. I have a large dataframe with 24+ columns that all contain 1s or 0s. I wish to concatenate each field to create a binary key that'll act as a signature.
However, when the number of columns exceeds 12, the whole process falls apart.
a = np.zeros(shape=(3,12))
df = pd.DataFrame(a)
df = df.astype(int) # This converts each 0.0 into just 0
df[2]=1 # Changes one column to all 1s
#result
0 1 2 3 4 5 6 7 8 9 10 11
0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0
Concatenating function...
df['new'] = df.astype(str).sum(1).astype(int).astype(str) # Concatenate
df['new'].apply('{0:0>12}'.format) # Pad leading zeros
# result
0 1 2 3 4 5 6 7 8 9 10 11 new
0 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
1 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
2 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
This is good. However, if I increase the number of columns to 13, I get...
a = np.zeros(shape=(3,13))
# ...same intermediate steps as above...
0 1 2 3 4 5 6 7 8 9 10 11 12 new
0 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
1 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
2 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
Why am I getting -2147483648? I was expecting 0010000000000
Any help is appreciated!
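-2147483648 is the smallest 32-bit signed integer. A likely explanation (an assumption, since the platform is not stated) is that .astype(int) maps to a 32-bit integer in this environment: the 12-digit key 001000000000 is 10^9 and still fits, while the 13-digit key 0010000000000 is 10^10 and does not. A minimal sketch that sidesteps the integer round trip by joining the digits as text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 13), dtype=int))
df[2] = 1  # one column of all 1s, as in the question

# Join the 0/1 digits of each row as strings; no integer conversion is involved,
# so leading zeros survive and the key width is not limited by an integer type
df['new'] = df.astype(str).agg(''.join, axis=1)
print(df['new'].tolist())
```

Since the key stays a string, no zero-padding step is needed afterwards.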

Accessing the values of surrounding cells in a dataframe without using a loop

I am looking for a way to calculate, for each cell in a dataframe, the sum of the values of all surrounding cells (including the diagonal ones), without using a loop.
I have come up with something like the code below, but it does not include diagonals, and as soon as I include diagonals some cells are counted too many times.
# Initializing matrix a
columns = [x for x in range(10)]
rows = [x for x in range(10)]
matrix = pd.DataFrame(index=rows, columns=columns).fillna(0)
# filling up with mock values
matrix.iloc[5,4] = 1
matrix.iloc[5,5] = 1
matrix.iloc[5,6] = 1
matrix.iloc[5,7] = 1
matrix.iloc[4,5] = 1
matrix1 = matrix.apply(lambda x: x.shift(1)).fillna(0)
matrix2 = matrix.T.apply(lambda x: x.shift(1)).T.fillna(0)
matrix3 = matrix.apply(lambda x: x.shift(-1)).fillna(0)
matrix4 = matrix.T.apply(lambda x: x.shift(-1)).T.fillna(0)
matrix_out = matrix1 + matrix2 + matrix3 + matrix4
To be more precise, I plan on populating the dataframe only with 0 or 1 values. The test matrix above looks like this:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 1 0 0 0 0
5 0 0 0 0 1 1 1 1 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
The expected output for this input is:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 1 1 0 0 0
4 0 0 0 1 3 3 4 2 1 0
5 0 0 0 1 2 3 3 1 1 0
6 0 0 0 1 2 3 3 2 1 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
Am I in the right direction with this shift() function used within apply, or would you suggest doing otherwise?
Thanks a lot!
Seems like you need
def sum_diag(matrix):
    # Sum of the four diagonal neighbours
    return (matrix.shift(1, axis=1).shift(1, axis=0)
            + matrix.shift(-1, axis=1).shift(1, axis=0)
            + matrix.shift(1, axis=1).shift(-1, axis=0)
            + matrix.shift(-1, axis=1).shift(-1, axis=0))
def sum_nxt(matrix):
    # Sum of the four horizontal and vertical neighbours
    return matrix.shift(-1) + matrix.shift(1) + matrix.shift(1, axis=1) + matrix.shift(-1, axis=1)
final = sum_nxt(matrix) + sum_diag(matrix)
Outputs
print(final.fillna(0).astype(int))
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 1 1 0 0 0
4 0 0 0 1 3 3 4 2 1 0
5 0 0 0 1 2 3 3 1 1 0
6 0 0 0 1 2 3 3 2 1 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
Notice that you might want to add .fillna(0) to all shift operations to ensure the borders behave well too, if numbers in the borders are not zero.
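As an alternative sketch that handles the borders directly, the frame can be padded with a ring of zeros in NumPy and the eight shifted views added together. The only loop runs over the eight neighbour offsets, not over cells. The test cells are the five ones shown in the question's matrix, and the names `padded`, `offsets` and `neighbours` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

matrix = pd.DataFrame(0, index=range(10), columns=range(10))
for r, c in [(5, 4), (5, 5), (5, 6), (5, 7), (4, 5)]:
    matrix.iloc[r, c] = 1

values = matrix.to_numpy()
n_rows, n_cols = values.shape
padded = np.pad(values, 1)  # add a one-cell border of zeros on every side

# Add the eight shifted views of the padded array (the centre offset is skipped)
offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
total = sum(padded[1 + dr:1 + dr + n_rows, 1 + dc:1 + dc + n_cols] for dr, dc in offsets)

neighbours = pd.DataFrame(total, index=matrix.index, columns=matrix.columns)
print(neighbours)
```

Because the padding supplies zeros outside the frame, cells on the border get the sum of the neighbours that exist instead of NaN.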
