Data frame group ID, create value: count in column - python

Given the following sample dataset:
import numpy as np
import pandas as pd
df1 = (pd.DataFrame(np.random.randint(3, size=(5, 4)), columns=('ID', 'X1', 'X2', 'X3')))
print(df1)
ID X1 X2 X3
0 2 2 0 2
1 1 0 2 1
2 1 2 1 1
3 1 2 0 2
4 2 0 0 0
d = {'ID' : pd.Series([1, 2, 1, 4, 5]), 'Tag' : pd.Series(['One', 'Two', 'Two', 'Four', 'Five'])}
df2 = (pd.DataFrame(d))
print(df2)
ID Tag
0 1 One
1 2 Two
2 1 Two
3 4 Four
4 5 Five
df1['Merged_Tags'] = df1.ID.map(df2.groupby('ID').Tag.apply(list))
print(df1)
ID X1 X2 X3 Merged_Tags
0 2 2 0 2 [Two]
1 1 0 2 1 [One, Two]
2 1 2 1 1 [One, Two]
3 1 2 0 2 [One, Two]
4 2 0 0 0 [Two]
Expected output for ID = 1:
1. How would one group by each key and produce a Tag: Frequency format in the Merged_Tags column?
ID X1 X2 X3 Merged_Tags
1 1 0 2 1 [One: 3, Two: 3]
2. Create a new column holding the number of rows with that ID.
ID X1 X2 X3 Merged_Tags Frequency
1 1 0 2 1 [One: 3, Two: 3] 3
3. Sum the values of column X3 over every row that shares the same ID.
ID X1 X2 X3 Merged_Tags Frequency X3++
1 1 0 2 1 [One: 3, Two: 3] 3 4

Shouldn't the row
1 0 2 1 [One: 3, Two: 3]
be [One: 2, Two: 3] instead? Considering that:
1 : [One, Two]
0 : None
2 : [Two]
1 : [One, Two]
and you want a total count of each key in the row?
Please help me understand the intuition behind [One: 3, Two: 3] in case I am missing something; otherwise your question should be easy to solve.
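A minimal sketch of the three requested columns, under one assumed interpretation: the tag counts are taken over all df1 rows that share an ID, which is the reading that reproduces the [One: 3, Two: 3], 3 and 4 in the expected output. The frames are fixed copies of the printed samples, and the column names Tag_Counts and X3_sum are illustrative, not from the question.

```python
from collections import Counter

import pandas as pd

# Fixed copies of the frames printed in the question (the original df1 is random).
df1 = pd.DataFrame({'ID': [2, 1, 1, 1, 2],
                    'X1': [2, 0, 2, 2, 0],
                    'X2': [0, 2, 1, 0, 0],
                    'X3': [2, 1, 1, 2, 0]})
df2 = pd.DataFrame({'ID': [1, 2, 1, 4, 5],
                    'Tag': ['One', 'Two', 'Two', 'Four', 'Five']})

df1['Merged_Tags'] = df1['ID'].map(df2.groupby('ID')['Tag'].apply(list))

# 1. Count every tag over all df1 rows sharing an ID (assumed interpretation).
tag_counts = {}
for key, tag_lists in df1.groupby('ID')['Merged_Tags']:
    counter = Counter(tag for tags in tag_lists for tag in tags)
    tag_counts[key] = ', '.join(f'{tag}: {n}' for tag, n in counter.items())
df1['Tag_Counts'] = df1['ID'].map(tag_counts)

# 2. Number of df1 rows with the same ID.
df1['Frequency'] = df1['ID'].map(df1['ID'].value_counts())

# 3. Sum of X3 over all rows with the same ID.
df1['X3_sum'] = df1.groupby('ID')['X3'].transform('sum')

print(df1[['ID', 'Tag_Counts', 'Frequency', 'X3_sum']])
```

If the counts should instead come from df2 alone (one entry per tag occurrence there), replace the loop with a Counter over each group of df2.groupby('ID')['Tag'].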

Related

Split a column into multiple columns that has value as list

I have a problem splitting a column into multiple columns.
I have data like the input table below.
Column B contains lists of values.
I want to split the values of column B into columns, as in the expected output. Each value in the output table is the number of occurrences of that value in column B of the input.
input:
A B
a [1, 2]
b [3, 4, 5]
c [1, 5]
expected output:
A 1 2 3 4 5
a 1 1 0 0 0
b 0 0 1 1 1
c 1 0 0 0 1
You can explode the column of lists and use crosstab:
df2 = df.explode('B')
out = pd.crosstab(df2['A'], df2['B']).reset_index().rename_axis(columns=None)
output:
A 1 2 3 4 5
0 a 1 1 0 0 0
1 b 0 0 1 1 1
2 c 1 0 0 0 1
used input:
df = pd.DataFrame({'A': list('abc'), 'B': [[1,2], [3,4,5], [1,5]]})

Counting values with condition in one DataFrame and adding the result to another DataFrame

I have two DataFrames:
df1 = pd.DataFrame({"id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": [1, 1, 2, 4, 4, 4],
"text": ["a", "a", "b", "a", "b", "b"]})
Output df1:
id
0 1
1 2
2 3
3 4
Output df2:
id text
0 1 a
1 1 a
2 2 b
3 4 a
4 4 b
5 4 b
My goal is to add three columns in df1.
In count_all I would like to count the corresponding ids in df2. E.g. id 4 exists 3 times in df2.
In count_a I would like to count the corresponding ids in df2 where the text value == a.
In count_b I would like to count the corresponding ids in df2 where the text value == b.
id count_all count_a count_b
0 1 2 2 0
1 2 1 0 1
2 3 0 0 0
3 4 3 1 2
How can this be done with pandas?
Use crosstab with the margins parameter, add the missing index values and fix the column order with DataFrame.reindex, rename the columns with DataFrame.add_prefix, and finally join the result to df1 with DataFrame.join:
df = (df1.join(pd.crosstab(df2['id'], df2['text'], margins=True)
                 .reindex(index=df1['id'].unique(),
                          columns=['All'] + df2['text'].unique().tolist(),
                          fill_value=0)
                 .add_prefix('count_'), on='id'))
print(df)
id count_All count_a count_b
0 1 2 2 0
1 2 1 0 1
2 3 0 0 0
3 4 3 1 2
Here is another way:
df1.join(df2.groupby('id').agg(
    count_all=('id', 'count'),
    count_a=('text', lambda x: sum(x.eq('a'))),
    count_b=('text', lambda x: sum(x.eq('b')))), on='id').fillna(0)
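Note that joining and then calling fillna(0) turns the count columns into floats, because the missing id 3 is first filled with NaN. A possible variant, sketched here with helper columns is_a and is_b that are not part of the original answers, reindexes with fill_value=0 so the counts stay integers:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": [1, 1, 2, 4, 4, 4],
                    "text": ["a", "a", "b", "a", "b", "b"]})

counts = (
    # is_a / is_b are hypothetical helper columns flagging the matching rows.
    df2.assign(is_a=df2["text"].eq("a"), is_b=df2["text"].eq("b"))
       .groupby("id")
       .agg(count_all=("text", "size"),
            count_a=("is_a", "sum"),
            count_b=("is_b", "sum"))
       # Add ids that exist only in df1, with zero counts instead of NaN.
       .reindex(df1["id"].tolist(), fill_value=0)
       .rename_axis("id")
       .reset_index()
)
print(counts)
```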

Pandas adding calculated vectors into df

My goal is to add formula-based vectors (columns) to the following df:
Day Name a b 1 2 x1 x2
1 ijk 1 2 3 3 0 1
2 mno 2 1 1 3 1 1
outcome:
Day Name a b 1 2 x1 x2 y1 y2 z1 z2
1 ijk 1 2 3 3 0 1 (1*2)+3 (1*2)+3 (1+2)*(3*1+0*1) (1+2)*(3*2+1*2)
2 mno 2 1 1 3 1 1 (2*1)+1 (2*1)+3 (2+1)*(1*1+1*1) (2+1)*(3*2+1*2)
This is my tedious approach:
df['y1'] = df['a']*df['b'] + df[1]  # y1 = a*b + value of column 1
df['y2'] = df['a']*df['b'] + df[2]  # y2 = a*b + value of column 2
If column 3 and x3 were added, then y3 = a*b + value of column 3;
if column 4 and x4 were added, then y4 = a*b + value of column 4, and so on.
df['z1'] = (df['a']+df['b'])*(df[1]*1 + df['x1']*1)  # z1 = (a+b)*[(value of column 1)*1 + (value of column x1)*1]; the "1" comes from the column names 1 and x1
df['z2'] = (df['a']+df['b'])*(df[2]*2 + df['x2']*2)  # z2 = (a+b)*[(value of column 2)*2 + (value of column x2)*2]; the "2" comes from the column names 2 and x2
If column 3 and x3 were added, then z3 = (a+b)*[(value of column 3)*3 + (value of column x3)*3], and so on.
This works fine; however, this will get tedious if there are more columns added in. For example, it might get "3 4,... x3 x4,..." I'm wondering if there's a better approach to this using a loop maybe?
Many thanks :)
This is one way:
import pandas as pd
df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 2, 0, 1],
[2, 'mno', 2, 1, 1, 3, 1, 1, 1]],
columns=['Day', 'Name', 'a', 'b', 1, 2, 3, 'x1', 'x2'])
for i in range(1, 4):
    df['y'+str(i)] = df['a'] * df['b'] + df[i]
#output
#Day Name a b 1 2 3 x1 x2 y1 y2 y3
#1 ijk 1 2 3 3 2 0 1 5 5 4
#2 mno 2 1 1 3 1 1 1 3 5 3
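The same loop can be extended to the z columns. The sketch below uses the question's original frame (columns 1, 2, x1, x2) and assumes that every integer-named column i has a matching xi column; the indices list is an illustrative helper, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 0, 1],
                   [2, 'mno', 2, 1, 1, 3, 1, 1]],
                  columns=['Day', 'Name', 'a', 'b', 1, 2, 'x1', 'x2'])

# Assumption: the integer column labels are exactly the indices to loop over.
indices = [c for c in df.columns if isinstance(c, int)]

for i in indices:
    # y_i = a*b + value of column i
    df[f'y{i}'] = df['a'] * df['b'] + df[i]
    # z_i = (a+b) * (value of column i * i + value of column x_i * i)
    df[f'z{i}'] = (df['a'] + df['b']) * (df[i] * i + df[f'x{i}'] * i)

print(df)
```

Adding columns 3 and x3 to the frame would then produce y3 and z3 without any change to the loop.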

How could I do one hot encoding with multiple values in one cell?

I have this table in Excel:
id class
0 2 3
1 1 3
2 3 5
Now, I want to do a 'special' one-hot encoding in Python.
For each id in the first table, there are two numbers. Each number corresponds to a class (class1, class2, etc.). The second table is created based off of the first such that for each id, each number in its row shows up in its corresponding class column and the other columns just get zeros. For example, the numbers for id 0 are 2 and 3. The 2 is placed at class2 and the 3 is placed at class3. Classes 1, 4, and 5 get the default of 0. The result should be like:
id class1 class2 class3 class4 class5
0 0 2 3 0 0
1 1 0 3 0 0
2 0 0 3 0 5
My previous solution,
foo = lambda x: pd.Series([i for i in x.split()])
result=onehot['hotel'].apply(foo)
result.columns=['class1','class2']
pd.get_dummies(result, prefix='class', columns=['class1','class2'])
results in:
class_1 class_2 class_3 class_3 class_5
0 0.0 1.0 0.0 1.0 0.0
1 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 1.0 0.0 1.0
(class_3 appears twice). What can I do to fix this? (After this step, I can transform it to the final format I want.)
You need to make your variables categorical, with all five classes declared as categories; then you can use one-hot encoding as shown:
In [18]: df1 = pd.DataFrame({"class": pd.Series(['2','1','3']).astype(pd.CategoricalDtype(categories=['1','2','3','4','5']))})
In [19]: df2 = pd.DataFrame({"class": pd.Series(['3','3','5']).astype(pd.CategoricalDtype(categories=['1','2','3','4','5']))})
In [20]: df_1 = pd.get_dummies(df1, dtype=int)
In [21]: df_2 = pd.get_dummies(df2, dtype=int)
In [22]: df_1.add(df_2).apply(lambda x: x * [i for i in range(1,len(df_1.columns)+1)], axis = 1).astype(int).rename_axis('id')
Out[22]:
class_1 class_2 class_3 class_4 class_5
id
0 0 2 3 0 0
1 1 0 3 0 0
2 0 0 3 0 5
Does this satisfy your problem as stated?
#!/usr/bin/env python3
from functools import reduce

rows = [
    (0, (2, 3)),
    (1, (1, 3)),
    (2, (3, 5)),
]
# Largest class number that occurs anywhere in the data.
maximum = max(reduce(lambda x, y: x + list(y[1]), rows, []))
# Or ...
# maximum = 0
# for i, classes in rows:
#     maximum = max(maximum, *classes)

# Print header.
print("\t".join(["id"] + ["class_%d" % i for i in range(1, maximum + 1)]))
for i, classes in rows:
    # One cell per class: the class number if present for this id, else 0.0.
    cells = [float(r) if r in classes else 0.0 for r in range(1, maximum + 1)]
    print("\t".join([str(i)] + [str(c) for c in cells]))
Output:
id class_1 class_2 class_3 class_4 class_5
0 0.0 2.0 3.0 0.0 0.0
1 1.0 0.0 3.0 0.0 0.0
2 0.0 0.0 3.0 0.0 5.0
It may be simpler to split the original dataframe into 3 columns:
id class_a class_b
0 2 3
1 1 3
2 3 5
And then perform a normal one-hot encoding on that. Afterwards you may end up with duplicates of columns like:
id ... class_a_3 class_b_3 ... class_b_5
0 0 1 0
1 0 1 0
2 1 0 0
But you can merge/sum those after the fact pretty simply.
Likewise, you could pivot the same logic and transform your df into the form:
id class
0 2
0 3
1 1
1 3
2 3
2 5
Then one-hot that, and aggregate using sum on the key id.
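The long-form variant described above might look like the following sketch. It assumes the two-column split (class_a, class_b) has already been done, that classes run from 1 to 5, and that no id lists the same class twice; the variable names are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({'id': [0, 1, 2],
                     'class_a': [2, 1, 3],
                     'class_b': [3, 3, 5]})

# Long form: one (id, class) pair per row.
long = wide.melt(id_vars='id', value_name='class')[['id', 'class']]

# One-hot the class values, then sum the indicator rows belonging to each id.
dummies = pd.get_dummies(long['class'], prefix='class', dtype=int)
one_hot = dummies.groupby(long['id']).sum()

# Add classes that never occur (1..5 is an assumption) and scale each
# indicator column by its class number to get the requested layout.
class_numbers = list(range(1, 6))
one_hot = one_hot.reindex(columns=[f'class_{k}' for k in class_numbers],
                          fill_value=0)
result = one_hot.mul(class_numbers, axis='columns')
print(result)
```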
What about this?
Given this data
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})
1- split values
df['class'] = df['class'].apply(lambda x: x.split(' '))
df
id class
0 0 [2, 3]
1 1 [1, 3]
2 2 [3, 5]
2- explode --> each record in a row
df_long = df.explode('class')
df_long
id class
0 0 2
0 0 3
1 1 1
1 1 3
2 2 3
2 2 5
3- get one hot encoded values
df_one_hot_encoded = df.join(pd.get_dummies(df_long['class'], prefix='class', prefix_sep='_', dtype=int))
df_one_hot_encoded
id class class_1 class_2 class_3 class_5
0 0 [2, 3] 0 1 0 0
0 0 [2, 3] 0 0 1 0
1 1 [1, 3] 1 0 0 0
1 1 [1, 3] 0 0 1 0
2 2 [3, 5] 0 0 1 0
2 2 [3, 5] 0 0 0 1
4- groupby id and get the max value per column (same result of logical OR for binary values) --> one row per id
df_one_hot_encoded.groupby('id').max().reset_index()
id class class_1 class_2 class_3 class_5
0 0 [2, 3] 0 1 1 0
1 1 [1, 3] 1 0 1 0
2 2 [3, 5] 0 0 1 1
Bringing all together
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})
df['class'] = df['class'].apply(lambda x: x.split(' '))
df_long = df.explode('class')
df_one_hot_encoded = df.join(pd.get_dummies(df_long['class'], prefix='class', prefix_sep='_', dtype=int))
df_one_hot_encoded_compact = df_one_hot_encoded.groupby('id').max().reset_index()

Pandas histogram from count of columns

I have a big, sparse dataframe of about 6500 columns, where one column is a class label and the rest hold boolean values of either 0 or 1.
example:
df = pd.DataFrame({
'label' : ['a', 'b', 'c', 'b','a', 'c', 'b', 'a'],
'x1' : np.random.choice(2, 8),
'x2' : np.random.choice(2, 8),
'x3' : np.random.choice(2, 8)})
What I want is a report (preferably in pandas so I can plot it easily) that shows me the sum of unique elements of the columns grouped by the label.
So for example this data frame:
x1 x2 x3 label
0 0 1 1 a
1 1 0 1 b
2 0 1 0 c
3 1 0 0 b
4 1 1 1 a
5 0 0 1 c
6 1 0 0 b
7 0 1 0 a
The result should be something like this:
a: 3 (since it has x1, x2 and x3)
b: 2 (since it has x1, x3)
c: 2 (since it has x2, x3)
So it's kind of a count of which columns are present in each label. Think of a histogram where the x-axis is the label and the y-axis the number of columns.
You could try pivoting:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'label' : ['a', 'b', 'c', 'b','a', 'c', 'b', 'a'],
'x1' : np.random.choice(2, 8),
'x2' : np.random.choice(2, 8),
'x3' : np.random.choice(2, 8)})
pd.pivot_table(df, index='label').transpose().apply(np.count_nonzero)
For df:
label x1 x2 x3
0 a 0 0 0
1 b 0 1 0
2 c 1 0 1
3 b 0 1 0
4 a 1 1 1
5 c 1 0 1
6 b 0 1 0
7 a 1 1 1
The result is:
label
a 3
b 1
c 2
dtype: int64
label = df.groupby('label')
for key, val in label.count()['x1'].items():
    # Start with "label:row count", then append every column that is non-zero for it.
    strg = '%s:%s' % (key, val)
    for col, vl in label.sum().loc[key].items():
        if vl != 0:
            strg += ' %s' % col
    print(strg)
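For a deterministic check, here is a small sketch that uses the fixed example frame from the question instead of random data. It counts, per label, how many columns contain at least one non-zero value; the names present and n_columns are illustrative.

```python
import pandas as pd

# The example frame from the question, written out explicitly.
df = pd.DataFrame({
    'label': ['a', 'b', 'c', 'b', 'a', 'c', 'b', 'a'],
    'x1': [0, 1, 0, 1, 1, 0, 1, 0],
    'x2': [1, 0, 1, 0, 1, 0, 0, 1],
    'x3': [1, 1, 0, 0, 1, 1, 0, 0]})

# True where a column has at least one 1 for the given label.
present = df.groupby('label').sum().gt(0)

# Number of columns present per label.
n_columns = present.sum(axis=1)
print(n_columns)
```

Since n_columns is a Series indexed by label, it can be passed straight to a bar plot (for example n_columns.plot.bar(), which requires matplotlib).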
