iloc[] by value columns - python

I want to use iloc with value in column.
df1 = pd.DataFrame({'col1': ['1' ,'1','1','2','2','2','2','2','3' ,'3','3'],
'col2': ['A' ,'B','C','D','E','F','G','H','I' ,'J','K']})
I want to select index 2 in each column value as data frame and the result will be like
col1 col2
1 C
2 F
3 K
Thank you so much

Use GroupBy.nth:
df2 = df1.groupby('col1', as_index=False).nth(2)
Alternative with GroupBy.cumcount:
df2 = df1[df1.groupby('col1').cumcount().eq(2)]
print (df2)
col1 col2
2 1 C
5 2 F
10 3 K

Use GroupBy.nth with as_index=False:
df1.groupby('col1', as_index=False).nth(2)
output:
col1 col2
2 1 C
5 2 F
10 3 K

df1.groupby('col1').agg(lambda ss:ss.iloc[2])
col2
col1
1 C
2 F
3 K

Related

Pandas: Split dataframe with duplicate values into dataframe with unique values

I have a dataframe in Pandas with duplicate values in Col1:
Col1
a
a
b
a
a
b
What I want to do is to split this df into different df-s with unique Col1 values in each.
DF1:
Col1
a
b
DF2:
Col1
a
b
DF3:
Col1
a
DF4:
Col1
a
Any suggestions ?
I don't think you can achieve this in a vectorial way.
One possibility is to use a custom function to iterate the items and keep track of the unique ones. Then use this to split with groupby:
def cum_uniq(s):
i = 0
seen = set()
out = []
for x in s:
if x in seen:
i+=1
seen = set()
out.append(i)
seen.add(x)
return pd.Series(out, index=s.index)
out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]
output:
[ Col1
0 a,
Col1
1 a
2 b,
Col1
3 a,
Col1
4 a
5 b]
intermediate:
cum_uniq(df['Col1'])
0 0
1 1
2 1
3 2
4 3
5 3
dtype: int64
if order doesn't matter
Let's ad a Col2 to the example:
Col1 Col2
0 a 0
1 a 1
2 b 2
3 a 3
4 a 4
5 b 5
the previous code gives:
[ Col1 Col2
0 a 0,
Col1 Col2
1 a 1
2 b 2,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4
5 b 5]
If order does not matter, you can vectorize it:
out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]
output:
[ Col1 Col2
0 a 0
2 b 2,
Col1 Col2
1 a 1
5 b 5,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4]

Sum column in one dataframe based on row value of another dataframe

Say, I have one data frame df:
a b c d e
0 1 2 dd 5 Col1
1 2 3 ee 9 Col2
2 3 4 ff 1 Col4
There's another dataframe df2:
Col1 Col2 Col3
0 1 2 4
1 2 3 5
2 3 4 6
I need to add a column sum in the first dataframe, wherein it sums values of columns in the second dataframe df2, based on values of column e in df1.
Expected output
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col4 0
The Sum value in the last row is 0 because Col4 doesn't exist in df2.
What I tried: Writing some lamdas, apply function. Wasn't able to do it.
I'd greatly appreciate the help. Thank you.
Try
df['Sum']=df.e.map(df2.sum()).fillna(0)
df
Out[89]:
a b c d e Sum
0 1 2 dd 5 Col1 6.0
1 2 3 ee 9 Col2 9.0
2 3 4 ff 1 Col4 0.0
Try this. The following solution sums all values for a particular column if present in df2 using apply method and returns 0 if no such column exists in df2.
df1.loc[:,"sum"]=df1.loc[:,"e"].apply(lambda x: df2.loc[:,x].sum() if(x in df2.columns) else 0)
Use .iterrows() to iterate through a data frame pulling out the values for each row as well as index.
A nest for loop style of iteration can be used to grab needed values from the second dataframe and apply them to the first
import pandas as pd
df1 = pd.DataFrame(data={'a': [1,2,3], 'b': [2,3,4], 'c': ['dd', 'ee', 'ff'], 'd': [5,9,1], 'e': ['Col1','Col2','Col3']})
df2 = pd.DataFrame(data={'Col1': [1,2,3], 'Col2': [2,3,4], 'Col3': [4,5,6]})
df1['Sum'] = df1['a'].apply(lambda x: None)
for index, value in df1.iterrows():
sum = 0
for index2, value2 in df2.iterrows():
sum += value2[value['e']]
df1['Sum'][index] = sum
Output:
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col3 15

Create new Dataframe from matching two dataframe index's

I'm looking create a new dataframe from data in two separate dataframes - effectively matching the index of each cell and input into a two column dataframe. My real datasets have the exact same number of rows and columns, FWIW. Example below:
DF1:
Col1 Col2 Col3
1 2 3
3 8 7
DF2:
Col1 Col2 Col3
A B E
R S W
Desired Dataframe:
Col1 Col2
1 A
2 B
3 E
3 R
8 S
7 W
Thank you for your help!
here is your code
df3 = pd.Series(df1.values.ravel('F'))
df4 = pd.Series(df2.values.ravel('F'))
df = pd.concat([df3, df4], axis=1)
Use, DataFrame.to_numpy and .flatten:
df = pd.DataFrame(
{'Col1': df1.to_numpy().flatten(), 'Col2': df2.to_numpy().flatten()})
# print(df)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
You can do it easily like so:
list1 = df1.values.tolist()
list1 = [item for sublist in list1 for item in sublist]
list2 = df2.values.tolist()
list2 = [item for sublist in list2 for item in sublist]
df = {
'Col1': list1,
'Col2': list2
}
df = DataFrame(df)
print(df)
Hope this helps :)
pd.concat(map(lambda x: x.unstack().sort_index(level=-1), (df1, df2)), axis=1).reset_index(drop=True).rename(columns=['Col1', 'Col2'].__getitem__)
Result:
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
Another way (alternative):
pd.concat((df1.stack(),df2.stack()),axis=1).add_prefix('Col').reset_index(drop=True)
or:
d = {'Col1':df1,'Col2':df2}
pd.concat((v.stack() for k,v in d.items()),axis=1,keys=d.keys()).reset_index(drop=True)
#or pd.concat((d.values()),keys=d.keys()).stack().unstack(0).reset_index(drop=True)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W

Deleting rows from a pandas Dataframe which does not match a combination of colums in another Dataframe

My data Frame 1 looks like:
Col1 Col2 Col3
1 A 4 ab
2 A 5 de
3 A 2 ah
4 B 1 ac
5 B 3 jd
6 B 2 am
data frame 2:
col1 col2
1 A 4
2 B 3
How do i delete all the rows in Data Frame 1 which do not match the combination of rows of dataframe 2?
Output Expected:
Col1 Col2 Col3
1 A 4 ab
2 B 3 jd
Use DataFrame.merge with inner join, only necessary rename columns:
df = df2.rename(columns={'col1':'Col1','col2':'Col2'}).merge(df1, on=['Col1','Col2'])
#on should be omited, then merge by intersection of columns of df1, df2
#df = df2.rename(columns={'col1':'Col1','col2':'Col2'}).merge(df1)
print (df)
Col1 Col2 Col3
0 A 4 ab
1 B 3 jd
Another idea is use left_on and right_on parameter and then remove columns with names by df2.columns:
df = (df2.merge(df1, left_on=['col1','col2'],
right_on=['Col1','Col2']).drop(df2.columns, axis=1))
print (df)
Col1 Col2 Col3
0 A 4 ab
1 B 3 jd
If columns names are same:
print (df2)
Col1 Col2
1 A 4
2 B 3
df = df2.merge(df1, on=['Col1','Col2'])
#df = df2.merge(df1)
print (df)
Col1 Col2 Col3
0 A 4 ab
1 B 3 jd
You can also use join, to do an inner join
dfR = df1.join( df ,on=['Col1','Col2'] ,how='inner',rsuffix='_x')
dfR[['Col1','Col2','Col3']]
This will also give you the same result
Col1 Col2 Col3
1 A 4 ab
2 B 3 jd
For more details check these links Join Documentation and
examples

Sort and align 2 dataframes by values in corresponding columns

I have 2 dataframes that I want to sort that are similar in structure to what I have shown below, but the rows of values when looking at only the first 3 columns are jumbled. How do I sort the dataframes such that the row indices match?
Also it could so happen that there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?:
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort by the columns that should determine order on both data frames & reset index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
lets add 1 more row to df1
df1 = pd.DataFrame({
'Col1': list('abfh'),
'Col2': list('bceg'),
'Col3': list('cdgi'),
'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use an outer join to add a blank row to df2 where each column in pd.Nan at index 3
if you have sorted both databases already, you can merge using the indexes
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
otherwise, merge on the columns that *should* determine the sort order, this will create a new dataframe with joined values, sorted in the same way df1 is sorted
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns from the left data frame
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN

Categories

Resources