Groupby and Sample pandas - python

I am trying to sample the resulting data after doing a groupby on multiple columns. If a group has more than 2 rows, I want to take a sample of 2 records; otherwise I want to keep all of its records.
df:
col1 col2 col3 col4
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
target df:
col1 col2 col3 col4
A1 A2 A3 A4 or A5 or A6
A1 A2 A3 A4 or A5 or A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
I have written A4 or A5 or A6 because sampling may return any of the three.
This is what I have tried so far:
trial = pd.DataFrame(df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2)))
However, the result does not contain col1, col2 and col3 as columns.

I think you need reset_index twice: first to remove level 3 of the MultiIndex (the original row index), and second to convert the remaining MultiIndex levels to columns:
trial = (df.groupby(['col1', 'col2', 'col3'])['col4']
           .apply(lambda x: x if (len(x) <= 2) else x.sample(2))
           .reset_index(level=3, drop=True)
           .reset_index())
Or call reset_index once and then drop the level_3 column:
trial = (df.groupby(['col1', 'col2', 'col3'])['col4']
           .apply(lambda x: x if (len(x) <= 2) else x.sample(2))
           .reset_index()
           .drop(columns='level_3'))
print (trial)
col1 col2 col3 col4
0 A1 A2 A3 A4
1 A1 A2 A3 A6
2 B1 B2 B3 B4
3 B1 B2 B3 B5
4 C1 C2 C3 C4

There is no need to wrap the result in pd.DataFrame. The groupby returns a Series, and reset_index converts it to a DataFrame:
trial = df.groupby(['col1', 'col2', 'col3'])['col4'].apply(lambda x: x if (len(x) <= 2) else x.sample(2))
This restores col1, col2 and col3 as columns (together with a level_3 column holding the original row index):
trial = trial.reset_index()
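
For comparison, here is a minimal sketch of the same "at most 2 rows per group" sampling without apply. It shuffles the whole frame and then keeps the first two rows of each group. The `keys` variable and the fixed `random_state` are assumptions added for illustration; the sample data is the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A1", "A1", "A1", "B1", "B1", "C1"],
    "col2": ["A2", "A2", "A2", "B2", "B2", "C2"],
    "col3": ["A3", "A3", "A3", "B3", "B3", "C3"],
    "col4": ["A4", "A5", "A6", "B4", "B5", "C4"],
})
keys = ["col1", "col2", "col3"]  # grouping columns (name chosen for this sketch)

# Shuffle every row, keep the first 2 rows of each group, then restore
# the original row order. random_state is fixed only for reproducibility.
trial = (
    df.sample(frac=1, random_state=0)
      .groupby(keys)
      .head(2)
      .sort_index()
      .reset_index(drop=True)
)
print(trial)
```

Because head keeps whole rows, col1, col2 and col3 stay ordinary columns and no reset of a MultiIndex is needed.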

Related

How to drop duplicates in each group in a dataframe?

I have the following dataset:
id1 id2 value
a1 b1 "main"
a1 b1 "main"
a1 b1 "secondary"
a2 b2 "main"
a2 b2 "repair"
a2 b2 "uploaded"
a2 b2 "main"
I want to drop duplicate values in the column called value in each id1 and id2 group. So the desired result is:
id1 id2 value
a1 b1 "main"
a1 b1 "secondary"
a2 b2 "main"
a2 b2 "repair"
a2 b2 "uploaded"
How could I do that? I know the method drop_duplicates, but how can I use it with groupby?
Try:
x = (
    df.groupby(["id1", "id2"])
    .apply(lambda x: x.drop_duplicates("value"))
    .reset_index(drop=True)
)
print(x)
Prints:
id1 id2 value
0 a1 b1 "main"
1 a1 b1 "secondary"
2 a2 b2 "main"
3 a2 b2 "repair"
4 a2 b2 "uploaded"
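
Since a row is a duplicate within its group exactly when id1, id2 and value all repeat, the groupby may not be needed at all. A small sketch, assuming the frame from the question with plain string values:

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "id2": ["b1", "b1", "b1", "b2", "b2", "b2", "b2"],
    "value": ["main", "main", "secondary", "main", "repair", "uploaded", "main"],
})

# Treat the group keys plus "value" as the duplicate key; the first
# occurrence of each combination is kept.
deduped = (
    df.drop_duplicates(subset=["id1", "id2", "value"])
      .reset_index(drop=True)
)
print(deduped)
```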

Dataframe slicing with string values

I have a dataframe of strings that I would like to modify. I need to cut off each row at a given value, say A4, and either replace the values after A4 with -- or remove them. In other words, I would like to create a new dataframe that keeps each row's values only up to the string "A4". How would I do this?
import pandas as pd
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','A5','A6'],['A1','A3','A2','A5','A4','A6'],['A1','A2','A4','A3','A6','A5'],['A2','A1','A3','A4','A5','A6'], ['A2','A1','A3','A4','A6','A5'],['A1','A2','A4','A3','A5','A6']]
input = pd.DataFrame(values, columns)
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','--','--'],['A1','A3','A2','A5','A4','--'],['A1','A2','A4','--','--','--'],['A2','A1','A3','A4','--','--'], ['A2','A1','A3','A4','--','--'],['A1','A2','A4','--','--','--']]
output = pd.DataFrame(values, columns)
You can make a small function, that will take an array, and modify the values after your desired value:
def myfunc(x, val):
    for i in range(len(x)):
        if x[i] == val:
            break
    x[(i+1):] = '--'
    return x
Then you need to apply the function to the dataframe in a rowwise (axis = 1) manner:
input.apply(lambda x: myfunc(x, 'A4'), axis = 1)
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 A5 A4 --
c3 A1 A2 A4 -- -- --
c4 A2 A1 A3 A4 -- --
c5 A2 A1 A3 A4 -- --
c6 A1 A2 A4 -- -- --
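Assuming the values to mask are those above A4 (A5 through A9), a regex replace also works. Note that it masks by value rather than by position, so values below A4 that come after A4 are kept:
input.replace('A([5-9])', '--', regex=True)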
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 -- A4 --
c3 A1 A2 A4 A3 -- --
c4 A2 A1 A3 A4 -- --
c5 A2 A1 A3 A4 -- --
c6 A1 A2 A4 A3 -- --
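
The positional cutoff can also be done without a Python loop per row. The sketch below is one possible vectorized variant, not the method from the answers above: it counts how many times the marker has appeared strictly before each cell and masks every cell where that count is positive. The helper name `cut_after` and the `filler` parameter are made up for this example.

```python
import numpy as np
import pandas as pd

values = [
    ["A1", "A2", "A3", "A4", "A5", "A6"],
    ["A1", "A3", "A2", "A5", "A4", "A6"],
    ["A1", "A2", "A4", "A3", "A6", "A5"],
]
frame = pd.DataFrame(values)

def cut_after(frame, val, filler="--"):
    """Replace everything to the right of the first `val` in each row."""
    data = frame.to_numpy(dtype=object)
    hit = (data == val).astype(int)
    # Number of occurrences of `val` strictly to the left of each cell.
    hits_before = np.cumsum(hit, axis=1) - hit
    masked = np.where(hits_before > 0, filler, data)
    return pd.DataFrame(masked, index=frame.index, columns=frame.columns)

result = cut_after(frame, "A4")
print(result)
```

Rows that never contain the marker are returned unchanged, which matches the behaviour of the loop-based function.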

How to get the difference between two csv by Index using Pandas

I need to get the difference between two CSV files, dropping duplicates and NaN fields.
I am trying this one but it adds them together instead of subtracting.
df1 = pd.concat([df,cite_id]).drop_duplicates(keep=False)[['id','website']]
df is main dataframe
cite_id is dataframe that has to be subtracted.
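You can do this efficiently using isin. Assign the cleaned frames back first, because dropna and drop_duplicates return new objects:
df = df.dropna().drop_duplicates()
cite_id = cite_id.dropna().drop_duplicates()
df[~df.id.isin(cite_id.id)]
Or you can merge them and keep only the rows that contain a NaN. This only helps when the two frames do not share all their columns, since that is what pads unmatched rows with NaN:
merged = pd.merge(cite_id, df, how='outer')
merged[merged.isnull().any(axis=1)]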
import pandas as pd
df1 = pd.read_csv("1.csv")
df2 = pd.read_csv("2.csv")
df1 = df1.dropna().drop_duplicates()
df2 = df2.dropna().drop_duplicates()
df = df2.loc[~df2.id.isin(df1.id)]
You can concatenate the two dataframes into one and then remove all duplicates:
df1
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
cite_id
ID B C D
4 A2 B4 C4 D4
5 A3 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
pd.concat([df1,cite_id]).drop_duplicates(subset=['ID'], keep=False)
Out:
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
6 A6 B6 C6 D6
7 A7 B7 C7 D7
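
Note that the concat approach returns the rows unique to either frame (A6 and A7 come from cite_id). If the goal is a true subtraction, keeping only the rows of the main frame whose ID is absent from cite_id, a left merge with indicator=True is one option. This is a sketch under that assumption; the helper name `anti_join` is made up, and the sample frames are trimmed to two columns.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": ["A0", "A1", "A2", "A3"],
                    "B": ["B0", "B1", "B2", "B3"]})
cite_id = pd.DataFrame({"ID": ["A2", "A3", "A6", "A7"],
                        "B": ["B4", "B5", "B6", "B7"]})

def anti_join(left, right, key):
    """Return the rows of `left` whose `key` value does not occur in `right`."""
    right_keys = right[[key]].drop_duplicates()
    merged = left.merge(right_keys, on=key, how="left", indicator=True)
    # "_merge" is the column pandas adds when indicator=True.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

only_in_df1 = anti_join(df1, cite_id, "ID")
print(only_in_df1)
```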

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each of these dataframes to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it with a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Are there other ways to solve it?
Thank you!
First you should align your numeric column names. In this case:
main_df = main_df.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = (pd.concat([main_df, df_1, df_2, df_3])
         .groupby(['Cri1', 'Cri2', 'Cri3']).sum()
         .reset_index())
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [main_df, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
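
Both approaches above return every key combination that appears in any of the frames. If only the rows already present in the main frame should be kept, one possible variation is to sum the other frames first and left-merge the result. This is a sketch built from the question's data; the intermediate name `extra` is an assumption for this example.

```python
import pandas as pd

keys = ["Cri1", "Cri2", "Cri3"]
main_df = pd.DataFrame(
    [["A1", "A2", "A3", 4], ["B1", "B2", "B3", 5], ["C1", "C2", "C3", 6]],
    columns=keys + ["total"],
)
df_1 = pd.DataFrame([["A1", "A2", "A3", 1], ["B1", "B2", "B3", 2]],
                    columns=keys + ["value"])
df_2 = pd.DataFrame([["A1", "A2", "A3", 9], ["C1", "C2", "C3", 10]],
                    columns=keys + ["value"])
df_3 = pd.DataFrame([["B1", "B2", "B3", 15], ["C1", "C2", "C3", 17]],
                    columns=keys + ["value"])

# Sum the contributions per key across the three frames.
extra = (
    pd.concat([df_1, df_2, df_3])
      .groupby(keys, as_index=False)["value"]
      .sum()
)

# A left merge keeps exactly the rows of main_df; keys without any
# contribution get NaN, which is treated as 0 here.
res = main_df.merge(extra, on=keys, how="left")
res["total"] = res["total"] + res["value"].fillna(0).astype(int)
res = res.drop(columns="value")
print(res)
```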

Difference of two dataframes in python

I have two dataframes
ex:
test_1
name1 name2
a1 b1
a1 b2
a2 b1
a2 b2
a2 b3
test_2
name1 name2
a1 b1
a1 b2
a2 b1
I need the difference of two dataframes like
name1 name2
a2 b2
a2 b3
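Concatenate both dataframes, group by all columns, and keep the rows whose group has exactly one member:
df = pd.concat([test_1, test_2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
df1 = df.reindex(idx)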
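
The grouping approach returns rows that are unique to either frame. If only the rows of test_1 that are missing from test_2 are wanted, one alternative is to compare whole rows as tuples through a MultiIndex. A minimal sketch, assuming both frames have the same columns in the same order; the names `rows_1` and `rows_2` are made up here.

```python
import pandas as pd

test_1 = pd.DataFrame({"name1": ["a1", "a1", "a2", "a2", "a2"],
                       "name2": ["b1", "b2", "b1", "b2", "b3"]})
test_2 = pd.DataFrame({"name1": ["a1", "a1", "a2"],
                       "name2": ["b1", "b2", "b1"]})

# Turn each row into one MultiIndex entry so rows can be compared as a whole.
rows_1 = pd.MultiIndex.from_frame(test_1)
rows_2 = pd.MultiIndex.from_frame(test_2)

# Keep the rows of test_1 that do not occur anywhere in test_2.
diff = test_1[~rows_1.isin(rows_2)].reset_index(drop=True)
print(diff)
```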
