Counting values with a condition in one DataFrame and adding the result to another DataFrame - python

I have two DataFrames:
df1 = pd.DataFrame({"id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": [1, 1, 2, 4, 4, 4],
                    "text": ["a", "a", "b", "a", "b", "b"]})
Output df1:
id
0 1
1 2
2 3
3 4
Output df2:
id text
0 1 a
1 1 a
2 2 b
3 4 a
4 4 b
5 4 b
My goal is to add three columns to df1.
In count_all I would like to count the corresponding ids in df2. E.g. id 4 exists 3 times in df2.
In count_a I would like to count the corresponding ids in df2 where the text value == 'a'.
In count_b I would like to count the corresponding ids in df2 where the text value == 'b'.
id count_all count_a count_b
0 1 2 2 0
1 2 1 0 1
2 3 0 0 0
3 4 3 1 2
How can this be done with pandas?

Use crosstab with the margins parameter, add missing index values and set the column order with DataFrame.reindex, rename the columns with DataFrame.add_prefix, and finally join to df1 with DataFrame.join:
df = (df1.join(pd.crosstab(df2['id'], df2['text'], margins=True)
                 .reindex(index=df1['id'].unique(),
                          columns=['All'] + df2['text'].unique().tolist(),
                          fill_value=0)
                 .add_prefix('count_'), on='id'))
print(df)
id count_All count_a count_b
0 1 2 2 0
1 2 1 0 1
2 3 0 0 0
3 4 3 1 2
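Note that crosstab labels the margins column 'All', so the joined column comes out as count_All rather than count_all. A small final rename if you want the exact names from the question:
df = df.rename(columns={'count_All': 'count_all'})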

Here is another way:
df1.join(df2.groupby('id').agg(
             count_all=('id', 'count'),
             count_a=('text', lambda x: sum(x.eq('a'))),
             count_b=('text', lambda x: sum(x.eq('b')))),
         on='id').fillna(0)
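One caveat with this variant: the join introduces NaN for ids missing from df2 (such as 3), which forces the count columns to float even after fillna(0). A minimal sketch of the same idea with integer dtype restored:
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": [1, 1, 2, 4, 4, 4],
                    "text": ["a", "a", "b", "a", "b", "b"]})

out = (df1.join(df2.groupby('id').agg(
                    count_all=('id', 'count'),
                    count_a=('text', lambda x: x.eq('a').sum()),
                    count_b=('text', lambda x: x.eq('b').sum())),
                on='id')
          .fillna(0)
          .astype({'count_all': int, 'count_a': int, 'count_b': int}))
print(out)
#    id  count_all  count_a  count_b
# 0   1          2        2        0
# 1   2          1        0        1
# 2   3          0        0        0
# 3   4          3        1        2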

Related

Split a column that has list values into multiple columns

I have a problem splitting a column into multiple columns.
Column B contains lists of values.
I want to split the values of column B into separate columns, where each new column holds the number of occurrences of that value in the row's list.
input:
A B
a [1, 2]
b [3, 4, 5]
c [1, 5]
expected output:
A 1 2 3 4 5
a 1 1 0 0 0
b 0 0 1 1 1
c 1 0 0 0 1
You can explode the column of lists and use crosstab:
df2 = df.explode('B')
out = pd.crosstab(df2['A'], df2['B']).reset_index().rename_axis(columns=None)
output:
A 1 2 3 4 5
0 a 1 1 0 0 0
1 b 0 0 1 1 1
2 c 1 0 0 0 1
used input:
df = pd.DataFrame({'A': list('abc'), 'B': [[1,2], [3,4,5], [1,5]]})
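If a list could repeat a value (e.g. [1, 1, 2]) and you only want a 0/1 presence indicator rather than occurrence counts, a hedged variant is to clip the crosstab:
out = (pd.crosstab(df2['A'], df2['B'])
         .clip(upper=1)
         .reset_index()
         .rename_axis(columns=None))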

check if values of a column are in values of another numpy array column in pandas

I have a pandas dataframe
import pandas as pd
dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 3, 1, 2],
                   'col_b': [2, 2, [2, 3], 4, [2, 3]]})
I would like to create a column which will assess whether the values of col_a are in col_b.
The output dataframe should look like this:
dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 3, 1, 2],
                   'col_b': [2, 2, [2, 3], 4, [2, 3]],
                   'exists': [0, 1, 1, 0, 1]})
How could I do that?
You can use:
dt["exists"] = dt.col_a.isin(dt.col_b.explode()).astype(int)
explode the list-containing column and check if col_a isin it. Lastly cast to int.
to get
>>> dt
id col_a col_b exists
0 a 1 2 0
1 a 2 2 1
2 a 3 [2, 3] 1
3 b 1 4 0
4 b 2 [2, 3] 1
If row-by-row comparison is required, you can use:
dt["exists"] = dt.col_a.eq(dt.col_b.explode()).groupby(level=0).any().astype(int)
which checks equality by row and if any of the (grouped) exploded values gives True, we say it exists.
Solutions if you need to test values per row (that is, not testing each value of col_a against all values of col_b):
You can use custom function with if-else statement:
f = lambda x: (x['col_a'] in x['col_b']
               if isinstance(x['col_b'], list)
               else x['col_a'] == x['col_b'])
dt['e'] = dt.apply(f, axis=1).astype(int)
print(dt)
id col_a col_b exists e
0 a 1 2 0 0
1 a 2 2 1 1
2 a 3 [2, 3] 1 1
3 b 1 4 0 0
4 b 2 [2, 3] 1 1
Or use DataFrame.explode, compare both columns, and then test whether at least one value per index is True:
dt['e'] = dt.explode('col_b').eval('col_a == col_b').groupby(level=0).any().astype(int)
print(dt)
id col_a col_b exists e
0 a 1 2 0 0
1 a 2 2 1 1
2 a 3 [2, 3] 1 1
3 b 1 4 0 0
4 b 2 [2, 3] 1 1
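A plain list comprehension is another way to express the same row-wise test, and avoids apply; a minimal sketch:
dt['e'] = [int(a in b) if isinstance(b, list) else int(a == b)
           for a, b in zip(dt['col_a'], dt['col_b'])]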

Pandas: Split columns into multiple columns by two delimiters

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by '=', but this seems not so efficient when I have many rows, especially when there are so many fields that they cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract. In the end, concat the 'ID' column back. This assumes you always have A=, B=, and C= in a line.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2', then we can split on ';' and partition on '='. Pivot in the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
   .str.split(';', expand=True)
   .stack()
   .str.partition('=')
   .reset_index(-1, drop=True)
   .pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
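One detail worth noting: after str.partition the values are strings and 'ID' is left as the index. A hedged finishing step, assuming the chain above was assigned to a variable named out:
out = out.apply(pd.to_numeric)  # cast the string values to numbers (NaN passes through)
out = out.reset_index()         # bring ID back as a column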
Browsing a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting the dataframe columns twice.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
Another solution is Series.str.findall to extract the values, then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop("INFO", 1)
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
                   [2, "A=3;B=4;C=1"],
                   [3, "A=1;B=3;C=2"]],
                  columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", 1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
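One caveat: findall captures the values purely positionally, so this relies on the fields always appearing in A;B;C order. A hedged order-independent sketch using Series.str.extractall to keep the key names:
kv = df.INFO.str.extractall(r'(?P<key>[^=;]+)=(?P<val>[^;]+)')
wide = kv.reset_index(level=1, drop=True).pivot(columns='key', values='val')
df = df.drop(columns='INFO').join(wide)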
Another solution: split on ';', explode, then split on '=', and pivot.
df_INFO = (df.INFO
             .str.split(';')
             .explode()
             .str.split('=', expand=True)
             .pivot(columns=0, values=1))
pd.concat([df.ID, df_INFO], axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

Calculating the number of non-zeros in a column corresponding to another column

I have a dataframe:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
which looks like-
A B class
0 0 4 0
1 4 1 1
2 8 0 1
3 1 0 0
4 0 3 1
5 0 1 0
I want to calculate, for each column, the corresponding a, b, c, d, where:
a = number of non-zeros in the column where class == 1
b = number of non-zeros in the column where class == 0
c = number of zeros in the column where class == 1
d = number of zeros in the column where class == 0
For example, for column A the (a, b, c, d) are (2, 1, 1, 2).
Explanation: in column A, where class == 1 the number of non-zero values is 2, therefore a = 2 (indices 1, 2). Similarly b = 1 (index 3).
My attempt (when the dataframe had an equal number of rows in class 0 and class 1):
import numpy as np
import pandas as pd

dataset = pd.read_csv('aaf.csv')
n = len(dataset.columns)  # no. of columns
X = dataset.iloc[:, 1:n].values
l = len(X)  # no. of rows
score = []
for i in range(n - 1):
    X_column = X[:, i]
    neg_array, pos_array = np.hsplit(X_column, 2)  # hardcoded
    a = np.count_nonzero(pos_array)
    b = np.count_nonzero(neg_array)
    c = l / 2 - a
    d = l / 2 - b
Use:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
df = (df.set_index('class')
        .ne(0)
        .stack()
        .groupby(level=[0, 1])
        .value_counts()
        .unstack(1)
        .sort_index(level=1, ascending=False)
        .T)
print(df)
class 1 0 1 0
True True False False
A 2 1 1 2
B 2 2 1 1
df.columns = list('abcd')
print(df)
a b c d
A 2 1 1 2
B 2 2 1 1
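A hedged alternative that may be easier to follow: count the non-zeros per class with a groupby, then derive the zero counts from the group sizes:
nz = df.set_index('class').ne(0)     # True where the value is non-zero
counts = nz.groupby(level=0).sum()   # non-zero counts per class
sizes = nz.groupby(level=0).count()  # rows per class
res = pd.DataFrame({'a': counts.loc[1],
                    'b': counts.loc[0],
                    'c': sizes.loc[1] - counts.loc[1],
                    'd': sizes.loc[0] - counts.loc[0]})
print(res)
#    a  b  c  d
# A  2  1  1  2
# B  2  2  1  1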

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
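For this particular shape there is also a shorter route (a sketch, not part of the original answer): NumPy fancy indexing, reading df's two columns as coordinates into df1:
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# df[1] holds the row index into df1, df[0] the column index
df['value'] = df1.to_numpy()[df[1], df[0]]
print(df)
#    0  1 value
# 0  0  0     A
# 1  1  1     E
# 2  2  1     F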
