Create new dataframe by groups based on another dataframe - python

I don't have much experience working with pandas. I have a pandas dataframe as shown below.
df = pd.DataFrame({'A': [1, 2, 1],
                   'start': [1, 3, 4],
                   'stop': [3, 4, 8]})
I would like to create a new dataframe by iterating through the rows and appending to the result. For example, row 1 of the input (A=1, start=1, stop=3) should generate the sequence [1, 2, 3], each paired with its A value of 1:
A seq
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
So far, I've managed to identify what function to use to iterate through the rows of the pandas dataframe.

Here's one way with apply:
import numpy as np

(df.set_index('A')
   .apply(lambda x: pd.Series(np.arange(x['start'], x['stop'] + 1)), axis=1)
   .stack()
   .to_frame('seq')
   .reset_index(level=1, drop=True)
   .astype('int')
)
Out:
seq
A
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
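
A mostly vectorised alternative, in case the input frame is large (a sketch using numpy.repeat, not from the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1], 'start': [1, 3, 4], 'stop': [3, 4, 8]})

# number of rows each (start, stop) range expands to
lengths = (df['stop'] - df['start'] + 1).to_numpy()

out = pd.DataFrame({
    'A': np.repeat(df['A'].to_numpy(), lengths),
    'seq': np.concatenate([np.arange(a, b + 1)
                           for a, b in zip(df['start'], df['stop'])]),
})
This avoids the per-row Python work of apply, except for building the ranges themselves.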

If you want to use a loop:
In [1164]: data = []

In [1165]: for _, x in df.iterrows():
      ...:     data += [[x.A, y] for y in range(x.start, x.stop + 1)]
      ...:
In [1166]: pd.DataFrame(data, columns=['A', 'seq'])
Out[1166]:
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
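
A variant of the same loop with itertuples, which is usually faster than iterrows since it avoids constructing a Series per row:
data = [[t.A, y]
        for t in df.itertuples(index=False)
        for y in range(t.start, t.stop + 1)]
pd.DataFrame(data, columns=['A', 'seq'])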

To add to the answers above, here's a function that expands the dataframe input shown into the form the poster wants:
def gen_df_permutations(perm_def_df):
    m_list = []
    for i in perm_def_df.index:
        row = perm_def_df.loc[i]
        for n in range(row.start, row.stop + 1):
            r_list = [row.A, n]
            m_list.append(r_list)
    return m_list
Call it, referencing the specification dataframe:
gen_df_permutations(df)
Or wrap the call in a DataFrame constructor to return a final dataframe output:
pd.DataFrame(gen_df_permutations(df), columns=['A', 'seq'])
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
N.B. the first column there is the dataframe index that can be removed/ignored as requirements allow.

Related

How to drop rows with brackets in Pandas

I would like to drop the rows containing an empty list [] for a given df
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
Such that the expected output will be
a
0 1
1 2
2 4
3 5
Edit:
To make things more interesting, what if we have two columns and some of the cells contain [] to be dropped?
df = pd.DataFrame(dict(a=[1, 2, 4, [], 5], b=[2, [], 1, [], 6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(axis=1)]
output:
a b
0 1 2
2 4 1
4 5 6
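
If the only lists that ever appear are the empty ones being dropped, filtering on type is another option (a sketch, not from the answers above):
mask = df.apply(lambda c: c.map(lambda v: not isinstance(v, list))).all(axis=1)
out = df[mask]
This keeps a row only when none of its cells is a list, which is equivalent here because every list in the data is empty.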

Is it possible to combine agg and value_counts in a single line with Pandas

Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the summation of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code does the job:
import pandas as pd

df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether a pandas built-in exists that achieves this in a single line.
You can do it using only groupby + agg.
import pandas as pd

df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])

res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=('a', 'sum'), b=('b', 'sum'),
           nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
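For reference, the tuple syntax is shorthand for pd.NamedAgg; spelled out, the same call reads:
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=pd.NamedAgg(column='a', aggfunc='sum'),
           b=pd.NamedAgg(column='b', aggfunc='sum'),
           nrow_same_group=pd.NamedAgg(column='a', aggfunc='size'))
)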
First aggregate a and b with sum, then compute the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1

Select rows of pandas dataframe from list, in order of list

The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.
For a given pandas.DataFrame, let us say
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
How can we select rows from a list, based on the values in a column ('A' for instance)?
For instance:
# from
list_of_values = [3,4,6]
# we would like, as a result
# A B
# 2 3 3
# 3 4 5
# 1 6 2
Using isin as mentioned here is not satisfactory, as it does not preserve the order of the input list of 'A' values.
How can the above goal be achieved?
One way to overcome this is to make the 'A' column the index and use loc on the newly indexed pandas.DataFrame. Finally, the subsampled dataframe's index can be reset.
Here is how:
ret = df.set_index('A').loc[list_of_values].reset_index()
# ret is
# A B
# 0 3 3
# 1 4 5
# 2 6 2
Note that the drawback of this method is that the original indexing has been lost in the process.
More on pandas indexing: What is the point of indexing in pandas?
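If the original index needs to survive, one workaround (a sketch, not part of the answer above) is to stash it in a column before re-indexing:
ret = (df.reset_index()          # keep the original index as a column
         .set_index('A')
         .loc[list_of_values]
         .reset_index()
         .set_index('index')     # restore the original index
         .rename_axis(None))
# ret is
#    A  B
# 2  3  3
# 3  4  5
# 1  6  2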
Use merge with a helper DataFrame created from the list, with the column name of the matched column:
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})
list_of_values = [3,6,4]
df1 = pd.DataFrame({'A':list_of_values}).merge(df)
print (df1)
A B
0 3 3
1 6 2
2 4 5
For a more general solution:
df = pd.DataFrame({'A' : [5,6,5,3,4,4,6,5], 'B':range(8)})
print (df)
A B
0 5 0
1 6 1
2 5 2
3 3 3
4 4 4
5 4 5
6 6 6
7 5 7
list_of_values = [6,4,3,7,7,4]
#create df from list
list_df = pd.DataFrame({'A':list_of_values})
print (list_df)
A
0 6
1 4
2 3
3 7
4 7
5 4
#column for original index values
df1 = df.reset_index()
#helper column for counting duplicated values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print (df)
A B
1 6 1
4 4 4
3 3 3
5 4 5
1] Generic approach for list_of_values.
In [936]: dff = df[df.A.isin(list_of_values)]
In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
A B
2 3 3
3 4 5
1 6 2
2] If list_of_values is sorted, you can use
In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
A B
2 3 3
3 4 5
1 6 2
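
A third option is an ordered Categorical, which makes sort_values follow the list order (a sketch; assumes the values in list_of_values are unique):
out = df[df['A'].isin(list_of_values)].copy()
out['A'] = pd.Categorical(out['A'], categories=list_of_values, ordered=True)
out = out.sort_values('A')
Note that column 'A' comes back with a category dtype; cast it back with out['A'].astype(int) if that matters.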

Comparing the value of a row in a certain column to the values in other columns

Using Pandas
I'm trying to determine whether the value in a certain column is greater than the values in all the other columns of the same row.
To do this I'm looping through the rows of a dataframe and using the all function to compare against the values in the other columns, but this throws the error "string indices must be integers".
It seems like this should work; what's wrong with this approach?
for row in dataframe:
    if all(i < row['col1'] for i in [row['col2'], row['col3'], row['col4'], row['col5']]):
        row['newcol'] = 'value'
Build a mask and pass it to loc:
df.loc[df['col1'] > df.loc[:, 'col2':'col5'].max(axis=1), 'newcol'] = 'newvalue'
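As for the original error: iterating over a DataFrame yields its column labels (strings), not its rows, so row['col1'] ends up indexing into a string. A minimal demonstration:
import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [2]})
for row in df:
    print(type(row), row)  # <class 'str'> col1, then <class 'str'> col2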
The main problem, in my opinion, is using a loop for vectorisable logic.
Below is an example of how your logic can be implemented using numpy.where.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 9, (5, 10)))
# compare column 1 against the row-wise max of the remaining columns
df['new_col'] = np.where(df[1] > df.drop(columns=1).max(axis=1),
                         'col1_is_max',
                         'col1_not_max')
Result:
0 1 2 3 4 5 6 7 8 9 new_col
0 4 1 3 8 3 2 5 1 1 2 col1_not_max
1 2 7 1 2 5 3 5 1 8 5 col1_not_max
2 1 8 2 5 7 4 0 3 6 3 col1_is_max
3 6 4 2 1 7 2 0 8 3 2 col1_not_max
4 0 1 3 3 0 3 7 4 4 1 col1_not_max

Pandas row manipulation

I'm trying to replace a row in a dataframe with the row of another dataframe only if they share a common column.
Here is the first dataframe:
index no foo
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
and the second dataframe:
index no foo
0 2 aaa
1 3 bbb
2 22 3
3 33 4
4 44 5
5 55 6
I'd like my result to be
index no foo
0 0 1
1 1 2
2 2 aaa
3 3 bbb
4 4 5
5 5 6
The inner merge of the two dataframes returns the correct rows, but I'm having trouble inserting them at the correct indices in the first dataframe.
Any help would be greatly appreciated.
Thank you.
This should work as well:
df1['foo'] = (pd.merge(df1, df2, on='no', how='left')
                .apply(lambda r: r['foo_y'] if r['foo_y'] == r['foo_y'] else r['foo_x'],
                       axis=1))
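The r['foo_y'] == r['foo_y'] comparison is a NaN test (NaN is not equal to itself). A more direct sketch of the same idea, assuming the no values in df2 are unique:
df1['foo'] = df1['no'].map(df2.set_index('no')['foo']).fillna(df1['foo'])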
You could use apply; note that here df is the first dataframe and df1 the second. There is probably a better way than this:
In [67]:
# define a function that takes a row and tries to find a match
def func(x):
    # check whether the 'no' value matches by testing the length of the series
    if len(df1.loc[df1.no == x.no, 'foo']) > 0:
        return df1.loc[df1.no == x.no, 'foo'].values[0]  # return the first array value
    else:
        return x.foo  # no match, so return the existing value

# call apply row-wise (axis=1 means row-wise)
df.foo = df.apply(lambda row: func(row), axis=1)
df
Out[67]:
index no foo
0 0 0 1
1 1 1 2
2 2 2 aaa
3 3 3 bbb
4 4 4 5
5 5 5 6
[6 rows x 3 columns]
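
DataFrame.update also fits this pattern, aligning on the index (a sketch; again df is the first dataframe and df1 the second, and the no values are assumed unique in both):
out = df.set_index('no')
out.update(df1.set_index('no')[['foo']])  # overwrite foo where 'no' matches
out = out.reset_index()
update modifies out in place and only touches positions where the right-hand frame has non-NaN values.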
