Filling in a new data frame based on two other data frames - python

I want an efficient way to solve this problem below because my code seems inefficient.
First of all, let me provide a dummy dataset.
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df1= {'a0' : [1,2,2,1,3], 'a1' : [2,3,3,2,4], 'a2' : [3,4,4,3,5], 'a3' : [4,5,5,4,6], 'a4' : [5,6,6,5,7]}
df2 = {'b0' : [3,6,6,3,8], 'b1' : [6,8,8,6,9], 'b2' : [8,9,9,8,7], 'b3' : [9,7,7,9,2], 'b4' : [7,2,2,7,1]}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
My actual dataset has more than 100,000 rows and 15 columns. Now, what I want to do is pretty complicated to explain, but here we go.
Goal: I want to create a new df using the two dfs above.
find the global min and max from df1. Since the values are sorted within each row, column 'a0' always holds the row minimum and 'a4' the row maximum. Therefore, I will look for the global minimum in column 'a0' and the global maximum in 'a4'.
Min = df1['a0'].min()
Max = df1['a4'].max()
Min
Max
Then I will create a data frame filled with 0s, whose columns are range(Min, Max + 1): in this case, 1 through 7.
column = []
for i in np.arange(Min, Max+1):
    column.append(i)
newdf = pd.DataFrame(0, index=df1.index, columns=column)
The third step is to find where the values from df2 should go: I loop through each value in df1 and match it, within the same row, with the column of that name in the new df. For example, row 0 of df1 contains [1,2,3,4,5], so columns 1, 2, 3, 4 and 5 of row 0 in newdf will be filled. Lastly, the corresponding values of df2 (same row, same column position) are written into the locations found this way.
So, the very first row of the new df will look like this:
output = {'1' : [3], '2' : [6], '3' : [8], '4' : [9], '5' : [7], '6' : [0], '7' : [0]}
output = pd.DataFrame(output)
Columns 6 and 7 are not updated because 6 and 7 do not appear in the first row of df1.
Here is my code for this process:
for rowidx in range(0, len(df1)):
    for columnidx in range(0, len(df1.columns)):
        new_column = df1[str(df1.columns[columnidx])][rowidx]
        newdf.loc[newdf.index[rowidx], new_column] = df2['b' + df1.columns[columnidx][1:]][rowidx]
I think this does the job, but as I said, my actual dataset is huge: 2,999,999 rows, with a Min-to-Max range of 282, which means 282 columns in the new data frame. So the code above runs forever. Is there a faster way to do this? I once learned about something like map-reduce, but I don't know whether it would apply here.

The idea is to create default (positional) column names in both DataFrames, then concatenate the stacked Series, append the first column (0) to the index, remove the second index level, and finally use DataFrame.unstack:
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
newdf = (pd.concat([df1.stack(), df2.stack()], axis=1)
           .set_index(0, append=True)
           .reset_index(level=1, drop=True)[1]
           .unstack(fill_value=0)
           .rename_axis(None, axis=1))
print(newdf)
   1  2  3  4  5  6  7
0  3  6  8  9  7  0  0
1  0  6  8  9  7  2  0
2  0  6  8  9  7  2  0
3  3  6  8  9  7  0  0
4  0  0  8  9  7  2  1
Other solutions:
comp = [pd.Series(a, index=df1.loc[i]) for i, a in enumerate(df2.values)]
df = pd.concat(comp, axis=1).T.fillna(0).astype(int)
print(df)
   1  2  3  4  5  6  7
0  3  6  8  9  7  0  0
1  0  6  8  9  7  2  0
2  0  6  8  9  7  2  0
3  3  6  8  9  7  0  0
4  0  0  8  9  7  2  1
Or:
comp = [dict(zip(x, y)) for x, y in zip(df1.values, df2.values)]
c = pd.DataFrame(comp).fillna(0).astype(int)
print(c)
   1  2  3  4  5  6  7
0  3  6  8  9  7  0  0
1  0  6  8  9  7  2  0
2  0  6  8  9  7  2  0
3  3  6  8  9  7  0  0
4  0  0  8  9  7  2  1
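For inputs at the scale mentioned in the question, a fully vectorized NumPy scatter may also be worth benchmarking. This is a sketch rather than one of the answers above; it assumes the values in df1 are integers and, like the original loop, lets the last duplicate within a row win:
import numpy as np
import pandas as pd

Min, Max = df1.values.min(), df1.values.max()

out = np.zeros((len(df1), Max - Min + 1), dtype=df2.values.dtype)
rows = np.repeat(np.arange(len(df1)), df1.shape[1])  # row index of every df1 cell
cols = (df1.values - Min).ravel()                    # target column of every cell
out[rows, cols] = df2.values.ravel()                 # scatter df2's values in one shot

newdf = pd.DataFrame(out, index=df1.index, columns=range(Min, Max + 1))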

Related

pandas ascend sort multiple columns but reverse sort one column

I have a pandas DataFrame that has a little over 100 columns.
There are about 50 columns that I want to sort ascending and then there is one column (a date_time column) that I want to reverse sort.
How do I go about achieving this? I know I can do something like...
df = df.sort_values(by = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time'], ascending=[True, True, True, True,... False])
... but I am trying to avoid having to type 'True' 50 times.
Just wondering if there is a quick hand way of doing this.
Thanks, Dan
You can use:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=[True]*49+[False])
Or, for a programmatic variant that doesn't require knowing the position of the False, using numpy:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=np.array(cols)!='date_time')
It should go something like this (note that sorting the reverse-sorted column independently of the others would scramble the rows, so it has to go into the same sort_values call):
to_be_reversed = "COLUMN_TO_BE_REVERSED"
cols = [col for col in df.columns if col != to_be_reversed]
df = df.sort_values(by=cols + [to_be_reversed],
                    ascending=[True] * len(cols) + [False],
                    ignore_index=True)
You can also use filter if your 49 columns have a regular pattern:
# if you have a column name pattern
cols = df.filter(regex=('^(column_|date_time)')).columns.tolist()
ascending_false = ['date_time']
ascending = [c not in ascending_false for c in cols]
df.sort_values(by=cols, ascending=ascending)
Example:
>>> df
   column_0  column_1  date_time  value  other_value  another_value
0         4         2          6      6            1              1
1         4         4          0      6            0              2
2         3         2          6      9            0              7
3         9         2          1      7            4              7
4         6         9          2      4            4              1
>>> df.sort_values(by=cols, ascending=ascending)
   column_0  column_1  date_time  value  other_value  another_value
2         3         2          6      9            0              7
0         4         2          6      6            1              1
1         4         4          0      6            0              2
4         6         9          2      4            4              1
3         9         2          1      7            4              7
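If more than one column ever needs to be reverse-sorted, the boolean-list idea generalizes naturally. A small sketch with placeholder names (the descending set and the column pattern are assumptions, not part of the question):
descending = {'date_time'}  # hypothetical set of columns to sort in reverse
cols = [c for c in df.columns if c.startswith('column_')] + ['date_time']
df = df.sort_values(by=cols, ascending=[c not in descending for c in cols])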

How to represent as a matrix a 4-column dataframe, where column 0 specifies row, columns 1-2 specify column range, and column 3 specifies entry

I have a four-column data frame, given as follows: column zero consists of text labels chosen from the list ['A','B','C','D'], with possible repetitions. Columns one and two are labelled start and stop, where the former is less than the latter, and column three, intensity, is a float. For each label, none of the corresponding [start, stop] intervals overlap.
A simple example is given by:
import numpy as np
import pandas as pd
labels=['A','B','C','D']
d = {'label': ['A','B','A','C','D','B','A'],
     'start': [1, 2, 6, 4, 1, 8, 12],
     'stop': [4, 4, 9, 6, 7, 11, 16],
     'intensity': [8, 2, 4, 6, 7, 1, 5]}
df = pd.DataFrame(data=d)
print(df)
  label  start  stop  intensity
0     A      1     4          8
1     B      2     4          2
2     A      6     9          4
3     C      4     6          6
4     D      1     7          7
5     B      8    11          1
6     A     12    16          5
I wish to create a matrix, M, having four (= len(labels)) rows and 16 columns. (The number of columns must be at least the maximum entry in df['stop']; whether it's larger doesn't matter.) For each integer k between 0 and 6, the index of df['label'][k] in labels specifies a row of my matrix M. The entries in columns df['start'][k] to df['stop'][k] of this row should all equal df['intensity'][k]. All other entries of M equal zero.
For example, label A corresponds to rows 0, 2, and 6. In row 0, entries in columns 1-4 equal 8, entries in columns 6-9 equal 4, and entries in columns 12-16 equal 5.
I'd like to do this in the most pythonic way using list operations and at most one loop.
Here's a solution:
MAX = df['stop'].max()
new_df = pd.DataFrame(
    df.groupby('label')
      .apply(lambda g: sum(g.apply(
          lambda x: np.isin(np.arange(MAX),
                            np.arange(x['start'] - 1, x['stop'])).astype(int) * x['intensity'],
          axis=1)))
      .tolist(),
    index=labels)
Output:
>>> new_df
   0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
A  8  8  8  8  0  4  4  4  4  0   0   5   5   5   5   5
B  0  2  2  2  0  0  0  1  1  1   1   0   0   0   0   0
C  0  0  0  6  6  6  0  0  0  0   0   0   0   0   0   0
D  7  7  7  7  7  7  7  0  0  0   0   0   0   0   0   0
Another way, using explode (fill_value=0 replaces the NaNs that unstack would otherwise leave):
df['range'] = df.apply(lambda r: list(range(r['start'], r['stop'] + 1)), axis=1)
df.explode('range').set_index(['label', 'range'])[['intensity']].unstack(fill_value=0)
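For completeness, the single loop the question allows also maps directly onto a preallocated NumPy matrix. A sketch using the same zero-based column convention as the first answer (start and stop treated as 1-based and inclusive):
import numpy as np

MAX = df['stop'].max()
M = np.zeros((len(labels), MAX))

# Each row paints its inclusive [start, stop] range into the matrix;
# the 1-based inclusive bounds become the zero-based slice start-1:stop.
for row in df.itertuples(index=False):
    M[labels.index(row.label), row.start - 1:row.stop] = row.intensity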

Split a dataframe based on certain column values

Let's say I have a DF like this:
Mean 1  Mean 2  Stat 1  Stat 2  ID
5       10      15      20      Z
3       6       9       12      X
Now, I want to split the dataframe to separate the data based on whether it belongs to #1 or #2 for each ID. Basically I would double the number of rows, with each row dedicated to either #1 or #2, and a new column added to specify which number we are looking at. Instead of Mean 1 and Mean 2 sitting on the same row, they will be listed in two separate rows, with the # column making clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case. The desired output:
Mean  Stat  ID  #
5     15    Z   1
10    20    Z   2
3     9     X   1
6     12    X   2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
    df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
  ID  #  Mean  Stat
0  Z  1     5    15
1  X  1     3     9
2  Z  2    10    20
3  X  2     6    12
Or set_index, then str.split the columns, then stack, if the row order must match the OP's:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
  ID  #  Mean  Stat
0  Z  1     5    15
1  Z  2    10    20
2  X  1     3     9
3  X  2     6    12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
        .pivot(index=['idx', 'ID', '#'], columns='variable')
        .reset_index()
        .drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
  ID  #  Mean  Stat
0  Z  1     5    15
1  X  1     3     9
2  Z  2    10    20
3  X  2     6    12

Select rows of pandas dataframe from list, in order of list

The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.
For a given pandas.DataFrame, let us say
df = pd.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
df
   A  B
0  5  1
1  6  2
2  3  3
3  4  5
How can we select rows from a list, based on the values in a column ('A', for instance)?
For instance:
# from
list_of_values = [3,4,6]
# we would like, as a result
#    A  B
# 2  3  3
# 3  4  5
# 1  6  2
Using isin, as mentioned here, is not satisfactory, as it does not keep the order of the input list of 'A' values. How can the above-mentioned goal be achieved?
One way to overcome this is to make the 'A' column an index and use loc on the newly generated pandas.DataFrame. Finally, the subsampled dataframe's index can be reset.
Here is how:
ret = df.set_index('A').loc[list_of_values].reset_index(inplace=False)
# ret is
#    A  B
# 0  3  3
# 1  4  5
# 2  6  2
Note that the drawback of this method is that the original indexing has been lost in the process.
More on pandas indexing: What is the point of indexing in pandas?
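If the original index matters, one workaround (a sketch, not part of the answer above) is to stash it in a column before the lookup and restore it afterwards:
ret = (df.reset_index()          # keep the original index as a column
         .set_index('A')
         .loc[list_of_values]
         .reset_index()          # bring 'A' back as a column
         .set_index('index')     # restore the original index
         .rename_axis(None))
#    A  B
# 2  3  3
# 3  4  5
# 1  6  2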
Use merge with a helper DataFrame created from the list, using the column name of the matched column:
df = pd.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
list_of_values = [3,6,4]
df1 = pd.DataFrame({'A': list_of_values}).merge(df)
print(df1)
   A  B
0  3  3
1  6  2
2  4  5
A more general solution:
df = pd.DataFrame({'A': [5,6,5,3,4,4,6,5], 'B': range(8)})
print(df)
   A  B
0  5  0
1  6  1
2  5  2
3  3  3
4  4  4
5  4  5
6  6  6
7  5  7
list_of_values = [6,4,3,7,7,4]
# create df from the list
list_df = pd.DataFrame({'A': list_of_values})
print(list_df)
   A
0  6
1  4
2  3
3  7
4  7
5  4
# column for original index values
df1 = df.reset_index()
# helper column to count duplicated values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()
# merge together, create index from the column, and remove the g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print(df)
   A  B
1  6  1
4  4  4
3  3  3
5  4  5
1] Generic approach for list_of_values.
In [936]: dff = df[df.A.isin(list_of_values)]
In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
   A  B
2  3  3
3  4  5
1  6  2
2] If list_of_values is sorted, you can use:
In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
   A  B
2  3  3
3  4  5
1  6  2
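Another option worth knowing about (a sketch, assuming the values in list_of_values are unique, as in the original [3,4,6]) is an ordered Categorical, which turns the desired order into a sortable key:
out = (df[df.A.isin(list_of_values)]
         .assign(key=lambda d: pd.Categorical(d.A, categories=list_of_values,
                                              ordered=True))
         .sort_values('key')
         .drop(columns='key'))
#    A  B
# 2  3  3
# 3  4  5
# 1  6  2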

Create new dataframe by groups based on another dataframe

I don't have much experience with working with pandas. I have a pandas dataframe as shown below.
df = pd.DataFrame({'A': [1, 2, 1],
                   'start': [1, 3, 4],
                   'stop': [3, 4, 8]})
I would like to create a new dataframe by iterating through the rows and appending to the result. For example, from row 1 of the input dataframe, generate the sequence of numbers [1, 2, 3] and pair each value with the corresponding 'A' value, 1. The desired output:
A  seq
1    1
1    2
1    3
2    3
2    4
1    4
1    5
1    6
1    7
1    8
So far, I've managed to identify what function to use to iterate through the rows of the pandas dataframe.
Here's one way with apply:
(df.set_index('A')
   .apply(lambda x: pd.Series(np.arange(x['start'], x['stop'] + 1)), axis=1)
   .stack()
   .to_frame('seq')
   .reset_index(level=1, drop=True)
   .astype('int')
)
Out:
   seq
A
1    1
1    2
1    3
2    3
2    4
1    4
1    5
1    6
1    7
1    8
If you want to use loops:
In [1164]: data = []
In [1165]: for _, x in df.iterrows():
      ...:     data += [[x.A, y] for y in range(x.start, x.stop+1)]
      ...:
In [1166]: pd.DataFrame(data, columns=['A', 'seq'])
Out[1166]:
   A  seq
0  1    1
1  1    2
2  1    3
3  2    3
4  2    4
5  1    4
6  1    5
7  1    6
8  1    7
9  1    8
To add to the answers above, here's a method that defines a function for translating the dataframe input shown into the form the poster wants:
def gen_df_permutations(perm_def_df):
    m_list = []
    for i in perm_def_df.index:
        row = perm_def_df.loc[i]
        for n in range(row.start, row.stop + 1):
            r_list = [row.A, n]
            m_list.append(r_list)
    return m_list
Call it, referencing the specification dataframe:
gen_df_permutations(df)
Or optionally call it wrapped in a dataframe creation function to return a final dataframe output:
pd.DataFrame(gen_df_permutations(df), columns=['A', 'seq'])
   A  seq
0  1    1
1  1    2
2  1    3
3  2    3
4  2    4
5  1    4
6  1    5
7  1    6
8  1    7
9  1    8
N.B. the first column is the dataframe index, which can be removed or ignored as requirements allow.
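Finally, for large inputs a loop-free NumPy construction is possible (a sketch, not one of the answers above): np.repeat expands each 'A' value by its interval length, and a flat arange with per-run offsets rebuilds the sequences:
import numpy as np
import pandas as pd

lengths = (df['stop'] - df['start'] + 1).to_numpy()

# Start of each run inside the flat output, repeated over that run's length.
offsets = np.repeat(np.cumsum(lengths) - lengths, lengths)
seq = np.arange(lengths.sum()) - offsets + np.repeat(df['start'].to_numpy(), lengths)

out = pd.DataFrame({'A': np.repeat(df['A'].to_numpy(), lengths), 'seq': seq})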
