I have a table in a pandas DataFrame df:
id_x id_y
a    b
b    c
c    d
d    a
b    a
and so on (around 1000 rows).
I want to find the count of combinations of each id_x with id_y,
i.e. a has the combinations a-b and d-a (2 combinations in total).
Similarly, b has 2 combinations in total: b-c, plus a-b, which also counts as a combination for b (a-b = b-a).
I want to create a DataFrame df2 which has:
id combinations
a  2
b  2
c  2  # (c-d and b-c)
d  1
and so on (one row per distinct id).
I tried this code:
df.groupby(['id_x']).size().reset_index()
but I am getting the wrong result:
  id_x  0
0    a  1
1    b  1
2    c  1
3    d  1
What approach should I follow?
My Python skills are at a beginner level.
Thanks in advance.
You can first sort the values within each row with apply and sorted, drop the duplicate pairs (since a-b and b-a are the same combination), then create a Series with stack and finally count with value_counts. In recent pandas versions, result_type='expand' is needed so that apply returns a DataFrame rather than a Series of lists:
df = df.apply(sorted, axis=1, result_type='expand').drop_duplicates().stack().value_counts()
print (df)
d 2
a 2
b 2
c 2
dtype: int64
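The question asks for df2 as a DataFrame with id and combinations columns; a minimal sketch building on the same pipeline (the sample data is reconstructed from the question):
import pandas as pd

df = pd.DataFrame({'id_x': list('abcdb'), 'id_y': list('bcdaa')})

# sort each pair so that b-a collapses onto a-b, drop duplicate pairs,
# then count how often each id appears across the remaining pairs
df2 = (df.apply(sorted, axis=1, result_type='expand')
         .drop_duplicates()
         .stack()
         .value_counts()
         .rename_axis('id')
         .reset_index(name='combinations'))
print (df2)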
I have a DataFrame and a list:
Person_ID Prod
A 1
B 2
A 3
C 4
D 5
D 1
exclude_people_who_bought = [1]
I would like the following DataFrame. In this example, persons A and D are excluded because they bought item 1:
Person_ID Prod
B 2
C 4
Try with isin:
out = df.loc[~df.Person_ID.isin(df.loc[df.Prod==1,'Person_ID'])]
Person_ID Prod
1 B 2
3 C 4
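If the exclude list can contain more than one product, a sketch that generalizes the same idea (the sample data is rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'Person_ID': list('ABACDD'), 'Prod': [1, 2, 3, 4, 5, 1]})
exclude_people_who_bought = [1]

# collect everyone who bought an excluded product, then keep the rest
excluded = df.loc[df.Prod.isin(exclude_people_who_bought), 'Person_ID']
out = df.loc[~df.Person_ID.isin(excluded)]
print (out)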
I have a bunch of rows which I want to rearrange one after the other, alternating based on a particular column.
df
B/S
0 B
1 B
2 S
3 S
4 B
5 S
I have thought about using loc to split the B and S rows apart and then stitching them back together into a new DataFrame, but that doesn't seem like good practice for pandas.
Is there a pandas-centric approach to this?
Output required
B/S
0 B
2 S
1 B
3 S
4 B
5 S
We can achieve this by making smart use of reset_index. (DataFrame.append was removed in pandas 2.0, so pd.concat is used below; the stable mergesort keeps each B ahead of the S that shares its index.)
m = df['B/S'].eq('B')
b = df[m].reset_index(drop=True)
s = df[~m].reset_index(drop=True)
out = pd.concat([b, s]).sort_index(kind='mergesort').reset_index(drop=True)
B/S
0 B
1 S
2 B
3 S
4 B
5 S
If you want to keep your index information, we can slightly adjust our approach:
m = df['B/S'].eq('B')
b = df[m].reset_index()
s = df[~m].reset_index()
out = pd.concat([b, s]).sort_index(kind='mergesort').set_index('index')
B/S
index
0 B
2 S
1 B
3 S
4 B
5 S
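An alternative sketch that avoids splitting and re-joining: rank each row within its group with cumcount and stable-sort by that rank (toy data rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'B/S': list('BBSSBS')})

# 0 for the first B and the first S, 1 for the second of each, and so on
key = df.groupby('B/S').cumcount()
# a stable sort on that rank interleaves the two groups: B S B S B S
out = df.loc[key.sort_values(kind='mergesort').index]
print (out)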
In the following dataset, what's the best way to duplicate rows so that every groupby(['Type']) count below 3 is brought up to 3? df is the input and df1 is my desired outcome: you can see that row 3 of df was duplicated twice at the end. This is only an example; the real data has approximately 20 million rows and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I do not know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0).astype(int)
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: DataFrame.append was removed in pandas 2.0, so pd.concat is used above. On older pandas the same result comes from df.append(..., sort=False, ignore_index=True), where sort=False requires pandas >= 0.23.0.
EDIT: If the data contains multiple value columns, move every column except one into the index, repeat, and then reset_index:
extra = df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)
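If duplicating the last row of each undersized group is acceptable, another vectorized sketch (sample data from the question; the Val_1/Val_2 columns are omitted):
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})

counts = df['Type'].value_counts()
need = 3 - counts[counts < 3]  # extra rows required per small group

# take the last row of each small group and repeat it the required number of times
last = df.drop_duplicates('Type', keep='last').set_index('Type')
extra = last.loc[need.index.repeat(need)].reset_index()
df1 = pd.concat([df, extra], ignore_index=True)
print (df1)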
I have a DataFrame with dtype=object columns, i.e. categorical variables, for which I'd like the counts of each level. I'd like the result to be a pretty summary of all categorical variables.
To achieve the aforementioned goals, I tried the following:
(line 1) grab the names of all object-type variables
(line 2) count the number of observations for each level (a, b of v1)
(line 3) rename the column so it reads "count"
stringCol = list(df.select_dtypes(include=['object']))  # names of the categorical columns
a = df.groupby(stringCol[0]).agg({stringCol[0]: 'count'})
a = a.rename(index=str, columns={stringCol[0]: 'count'}); a
count
v1
a 1279
b 2382
I'm not sure how to elegantly get the following result, where the counts for all string columns are printed, like so (only v1 and v4 are shown, but it should work for a variable number of columns):
   count        count
v1           v4
a   1279     l     32
b   2382     u   3055
             y    549
The way I can think of doing it is:
select one element of stringCol
calculate the count for each group of that column
store the result in a pandas DataFrame
store that DataFrame in an object (a list?)
repeat
once the last element of stringCol is done, break
but there must be a better way than that; I'm just not sure how to do it.
I think the simplest approach is to use a loop:
import pandas as pd

df = pd.DataFrame({'A':list('abaaee'),
                   'B':list('abbccf'),
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aacbbb')})
print (df)
A B C D E F
0 a a 7 1 5 a
1 b b 8 3 3 a
2 a b 9 5 6 c
3 a c 4 7 9 b
4 e c 2 1 2 b
5 e f 3 0 4 b
stringCol = list(df.select_dtypes(include=['object']))
for c in stringCol:
    a = df[c].value_counts().rename_axis(c).to_frame('count')
    # alternative:
    # a = df.groupby(c)[c].count().to_frame('count')
    print (a)
count
A
a 3
e 2
b 1
count
B
b 2
c 2
a 1
f 1
count
F
b 3
a 2
c 1
For a list of DataFrames, use a list comprehension:
dfs = [df[c].value_counts().rename_axis(c).to_frame('count') for c in stringCol]
print (dfs)
[ count
A
a 3
e 2
b 1, count
B
b 2
c 2
a 1
f 1, count
F
b 3
a 2
c 1]
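If you would rather have one combined summary than a list, a sketch that concatenates the per-column counts with keys, giving a single MultiIndexed frame (dfs and stringCol as above):
summary = pd.concat(dfs, keys=stringCol, names=['column', 'level'])
print (summary)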
I am trying to do the following in pandas:
I have 2 DataFrames, both of which have a number of columns.
DataFrame 1 has a column A that is of interest for my task;
DataFrame 2 has columns B and C that are of interest.
What needs to be done: go through the values in column A and see if the same value exists somewhere in column B. If it does, create a column D in DataFrame 1 and fill its respective cell with the value from C that is on the same row as the matching value in B.
If the value from A does not exist in B, then fill the cell in D with a zero.
for i in range(len(df1)):
    if df1['A'].iloc[i] in df2.B.values:
        df1['D'].iloc[i] = df2['C'].iloc[i]
    else:
        df1['D'].iloc[i] = 0
This gives me an error: Keyword 'D'. If I create column D in advance and fill it, for example, with 0's, then I get the warning "A value is trying to be set on a copy of a slice from a DataFrame". How can I solve this? Or is there a better way to accomplish what I'm trying to do?
Thank you so much for your help!
If I understand correctly:
Given these 2 dataframes:
import pandas as pd
import numpy as np
np.random.seed(42)
df1=pd.DataFrame({'A':np.random.choice(list('abce'), 10)})
df2=pd.DataFrame({'B':list('abcd'), 'C':np.random.randn(4)})
>>> df1
A
0 c
1 e
2 a
3 c
4 c
5 e
6 a
7 a
8 c
9 b
>>> df2
B C
0 a 0.279041
1 b 1.010515
2 c -0.580878
3 d -0.525170
You can achieve what you want using a merge:
new_df = df1.merge(df2, left_on='A', right_on='B', how='left').fillna(0)[['A','C']]
And then just rename the columns:
new_df.columns=['A', 'D']
>>> new_df
A D
0 c -0.580878
1 e 0.000000
2 a 0.279041
3 c -0.580878
4 c -0.580878
5 e 0.000000
6 a 0.279041
7 a 0.279041
8 c -0.580878
9 b 1.010515
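A sketch of an alternative that skips the merge: treat df2 as a B-to-C lookup table and map column A through it, filling misses with 0 (same df1 and df2 as above):
# set_index('B') turns df2 into a lookup Series: B value -> C value
df1['D'] = df1['A'].map(df2.set_index('B')['C']).fillna(0)
print (df1)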