Python - initialize an empty dataframe and populate it from another dataframe

Working with python pandas 0.19.
I want to create a new dataframe (df2) as a subset of an existing dataframe (df1). df1 looks like this:
In [1]: df1.head()
Out [1]:
col1_name col2_name col3_name
0 23 42 55
1 27 55 57
2 52 20 52
3 99 18 53
4 65 32 51
The logic is:
df2 = []
for i in range(0,N):
    loc = some complicated logic
    df1_sub = df1.ix[loc,]
    df2.append(df1_sub)
df2 = pd.DataFrame.from_records(df2)
The result df2 is indeed a dataframe, but every cell contains a column name of df1 instead of data. It looks like this:
In [2]: df2.head()
Out [2]:
col1_name col2_name col3_name
0 col1_name col2_name col3_name
1 col1_name col2_name col3_name
2 col1_name col2_name col3_name
3 col1_name col2_name col3_name
4 col1_name col2_name col3_name
I know it's probably related to the conversion from list to dataframe but I'm not sure what exactly I'm missing here. Or is there a better way of doing this?

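As per Ted Petrou, the solution is simply:
pd.concat(df2)
I was confused by the data type of df2: it is a list of dataframes, so it has to be combined with pd.concat rather than DataFrame.from_records.
Given the logic within the for loop, it is impossible to select from df1 directly using a single index.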
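For anyone landing here later, a minimal sketch of the collect-then-concat pattern. The `pick_rows` helper is hypothetical and only stands in for the "complicated logic" from the question, and `.loc` is used in place of `.ix`, which was removed in later pandas releases.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1_name": [23, 27, 52, 99, 65],
    "col2_name": [42, 55, 20, 18, 32],
    "col3_name": [55, 57, 52, 53, 51],
})

def pick_rows(i):
    # Hypothetical stand-in for the "complicated logic": returns index labels.
    return [i, i + 2]

pieces = []
for i in range(0, 2):
    labels = pick_rows(i)
    pieces.append(df1.loc[labels])  # each piece is itself a DataFrame

# Concatenate the list of DataFrames row-wise into a single DataFrame.
df2 = pd.concat(pieces)
print(df2)
```

Appending to a plain list and calling `pd.concat` once at the end avoids rebuilding the dataframe on every iteration.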

How about just slicing the dataframe?
import pandas as pd
DF1 = pd.DataFrame()
DF1['x'] = ['a','b','c','a','c','b']
DF1['y'] = [1,3,2,-1,-2,-3]
DF2 = DF1[[(x == 'a' and y > 0) for x,y in zip(DF1['x'], DF1['y'])]]
This should be far more efficient than appending. DF1[complicated_condition] accepts any Boolean mask with one entry per row.
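The same filter can probably be written without the Python-level loop over `zip`. A small sketch, assuming the same `DF1` as above, that builds the mask from column comparisons combined with `&`:

```python
import pandas as pd

DF1 = pd.DataFrame({
    "x": ["a", "b", "c", "a", "c", "b"],
    "y": [1, 3, 2, -1, -2, -3],
})

# Each comparison yields a Boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and >.
mask = (DF1["x"] == "a") & (DF1["y"] > 0)
DF2 = DF1[mask]
print(DF2)
```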

You can take advantage of pandas' boolean indexing (which builds on numpy's boolean mask indexing).
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e'],
'c': [10, 11, 12, 13, 14]})
# a b c
# 0 1 a 10
# 1 2 b 11
# 2 3 c 12
# 3 4 d 13
# 4 5 e 14
Let's assume that df2 should be a subset of df1: it should have columns b and c and only the rows where column a has an even value:
df2 = df1[df1['a'] % 2 == 0][['b', 'c']]
# b c
# 1 b 11
# 3 d 13
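As a variation, the row filter and the column selection can be done in one step with `.loc`, which avoids the chained indexing. This is only a sketch on the same example data:

```python
import pandas as pd

df1 = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["a", "b", "c", "d", "e"],
    "c": [10, 11, 12, 13, 14],
})

# First argument: Boolean row mask; second argument: list of column labels.
df2 = df1.loc[df1["a"] % 2 == 0, ["b", "c"]]
print(df2)
```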

Related

Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
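You can just call sum with axis=1 to sum across each row. Passing numeric_only=True skips non-numeric columns (older pandas versions skipped them by default, while recent versions raise a TypeError on mixed types):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1, numeric_only=True)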
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1, numeric_only=True)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up (list_name below):
df['total']=df.loc[:,list_name].sum(axis=1)
The ':' selects every row; to sum only certain rows, replace it with a row selection.
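A short sketch of both cases on the dataframe from the question. `list_name` is the column list mentioned above, and the row slice `0:1` is just an illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [2, 3, 4],
    "c": ["dd", "ee", "ff"],
    "d": [5, 9, 1],
})

list_name = ["a", "b", "d"]

# ':' keeps every row; the sum runs across the listed columns.
df["total"] = df.loc[:, list_name].sum(axis=1)

# Label-based slices with .loc include both endpoints, so 0:1 is rows 0 and 1.
partial = df.loc[0:1, list_name].sum(axis=1)
print(df)
print(partial)
```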
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works. For example, both of the following are syntax errors:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
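One possible way to mix a range with individual positions is numpy's `np.r_` helper, which concatenates slices and scalars into a single integer array that `iloc` accepts. A sketch on the question's dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [2, 3, 4],
    "c": ["dd", "ee", "ff"],
    "d": [5, 9, 1],
})

# np.r_[0:2, 3] expands the slice and appends the scalar: positions 0, 1 and 3.
cols = np.r_[0:2, 3]
df["i"] = df.iloc[:, cols].sum(axis=1)
print(df)
```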
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when the columns to sum are in sequence:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate or agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
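The shortest and simplest way here is to use eval. By default it returns a new dataframe with the added column, so assign the result back (or pass inplace=True):
df = df.eval('e = a + b + d')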
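A small sketch showing that, with the default `inplace=False` in current pandas, `eval` leaves the original dataframe untouched and returns a copy carrying the new column:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [2, 3, 4],
    "c": ["dd", "ee", "ff"],
    "d": [5, 9, 1],
})

# The assignment inside the expression creates column 'e' on the returned copy.
result = df.eval("e = a + b + d")

print("e" in df.columns)   # the original frame has no 'e' column
print(result)
```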

How to filter dataframe with multiple boolean conditions

I need to filter a pandas dataframe with two boolean conditions, meaning I want to keep the rows for which either one is True.
dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
output:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
single filter works:
filter = (df.b == 2)
df = df[filter]
output:
a b c
0 1 2 3
But how can I filter with df.b == 2 or df.b == 5 ?
I tried:
filter = [(df['b']==2) | (df['b']==5)]
df = df[filter]
print(df)
I get :
ValueError: Item wrong length 1 instead of 3
Any suggestions on how to achieve it?
My desired output is:
a b c
0 1 2 3
1 4 5 6
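You are passing a list that wraps the mask, which is why pandas sees length 1 instead of 3. Try this instead (also, it is better not to use filter as a variable name, since it is a built-in function in Python):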
mask = ((df['b']==2) | (df['b']==5))
df = df[mask]
You can use .isin() as an alternative solution, like below:
mask = [2,5]
df = df[df['b'].isin(mask)]
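To put the two answers side by side, here is a sketch on the question's dataframe. The last line, combining conditions on two different columns with `&`, is an extra illustration that was not part of the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=["a", "b", "c"])

# OR of two conditions: each comparison must be wrapped in parentheses.
either = df[(df["b"] == 2) | (df["b"] == 5)]

# Equivalent membership test against a list of allowed values.
same = df[df["b"].isin([2, 5])]

# AND across different columns works the same way with &.
both = df[(df["b"] > 1) & (df["c"] < 9)]
print(either)
```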

Switching values from 2 columns

I am writing a Python script that reads an Excel sheet.
In my excel sheet I have two columns, let's say A & B.
If column B's value is greater than column A's, I would like to swap the two values.
Example Sheet:
[A] [B]
1 6
10 2
3 11
Output Wanted:
[A] [B]
6 1
10 2
11 3
Right now I have this, but it is giving me completely different values:
s = (~(col['A'] < col['B'])).cumsum().eq(0)
col.loc[s, 'B'] /=2
col.loc[s, 'A'] = col.loc[s, ['A', 'B']].sum(1)
I'm assuming you're using Pandas based on your syntax. This would be a good situation for using the DataFrame.apply() method.
import pandas as pd
df = pd.DataFrame({'A': [1, 10, 3], 'B': [6, 2, 11]})
def switch(row):
    if row['A'] < row['B']:
        row['A'], row['B'] = row['B'], row['A']
    return row
df = df.apply(switch, axis=1)
print(df)
gives:
A B
0 6 1
1 10 2
2 11 3
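If the sheet is large, a vectorized swap may be worth trying instead of a row-wise apply. This is a sketch on the same sample data; `.to_numpy()` is used so the values are assigned by position rather than being re-aligned by column label:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 10, 3], "B": [6, 2, 11]})

# Rows where A is smaller than B need their two values exchanged.
swap = df["A"] < df["B"]

# Read the columns in reversed order and write them back as raw values.
df.loc[swap, ["A", "B"]] = df.loc[swap, ["B", "A"]].to_numpy()
print(df)
```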

how to filter this python dataframe

Greetings,
I am trying to get the smallest dataframe that contains only valid rows.
import pandas as pd
import random
columns = ['x0','y0']
df_ = pd.DataFrame(index=range(0,30), columns=columns)
df_ = df_.fillna(0)
columns1 = ['x1','y1']
df = pd.DataFrame(index=range(0,11), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x1"] = random.randint(1, 100)
    df.loc[index, "y1"] = random.randint(1, 100)
df_ = df_.combine_first(df)
df = pd.DataFrame(index=range(0,17), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x2"] = random.randint(1, 100)
    df.loc[index, "y2"] = random.randint(1, 100)
df_ = df_.combine_first(df)
From the example, the dataframe should output rows 0 to 10, and the rest should be filtered out.
I am thinking of keeping a counter to track the minimum row count,
or using pandasql,
or perhaps there is a trick to get this information from the dataframe itself,
for example from the size of the dataframe.
Actually, I will be appending 500+ files of various sizes
and using the result for some analysis, so performance is a consideration.
-student of python
If you want to drop the rows which have NaNs use dropna (here, this leaves the first eleven rows, 0 to 10):
In [11]: df_.dropna()
Out[11]:
x0 x1 x2 y0 y1 y2
0 0 49 58 0 68 2
1 0 2 37 0 19 71
2 0 26 95 0 12 17
3 0 87 5 0 70 69
4 0 84 77 0 70 92
5 0 71 98 0 22 5
6 0 28 95 0 70 15
7 0 31 19 0 24 31
8 0 9 37 0 55 29
9 0 30 53 0 15 45
10 0 8 61 0 74 41
However a cleaner, more efficient, and faster way to do this entire process is to update just those first rows (I'm assuming the random integer stuff is just you generating some example dataframes).
Let's store your DataFrames in a list:
In [21]: df1 = pd.DataFrame([[1, 2], [np.nan, 4]], columns=['a', 'b'])
In [22]: df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=['a', 'c'])
In [23]: dfs = [df1, df2]
Take the minimum length:
In [24]: m = min(len(df) for df in dfs)
First create an empty DataFrame with the desired rows and columns (on Python 3, reduce has to be imported from functools first):
In [25]: columns = reduce(lambda x, y: y.columns.union(x), dfs, [])
In [26]: res = pd.DataFrame(index=np.arange(m), columns=columns)
To do this efficiently we're going to use update, which makes the changes in place on just this DataFrame*:
In [27]: for df in dfs:
             res.update(df)
In [28]: res
Out[28]:
a b c
0 1 2 2
1 5 4 6
*If we didn't do this, or were using combine_first or similar, we'd most likely have lots of copying (new DataFrames being created), which will slow things down.
Note: combine_first doesn't offer an inplace flag... you could use combine but this is also more complicated (as well as less efficient). It's also quite straightforward to use where (and manually update), which IIRC is what combine does under the hood.
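For reference, a self-contained sketch of the update approach as a plain script. It uses the same two small example frames; building the column list with a sorted set is my own substitution for the reduce call above, not part of the original answer:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2], [np.nan, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=["a", "c"])
dfs = [df1, df2]

# The shortest frame decides how many rows the result keeps.
m = min(len(d) for d in dfs)

# Union of all column names (assumption: a sorted set instead of reduce).
columns = sorted(set().union(*(d.columns for d in dfs)))

# Empty frame with the target shape; update() then fills it in place.
res = pd.DataFrame(index=np.arange(m), columns=columns)
for d in dfs:
    res.update(d)  # only non-NaN values on matching labels overwrite res

print(res)
```

Rows beyond the shortest frame are ignored because `update` only touches labels that already exist in `res`.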
