I have a DataFrame in this format:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 15
and an array like this, with column names:
['a', 'a', 'b', 'c', 'b']
and I’m hoping to extract an array of data, one value from each row. The array of column names specifies which column I want from each row. Here, the result would be:
[1, 4, 8, 12, 14]
Is this possible as a single command with Pandas, or do I need to iterate? I tried using indexing:
i = pd.Index(['a', 'a', 'b', 'c', 'b'])
i.choose(df)
but I got a segfault, which I couldn’t diagnose because the documentation is lacking.
You could use lookup, e.g.
>>> i = pd.Series(['a', 'a', 'b', 'c', 'b'])
>>> df.lookup(i.index, i.values)
array([ 1, 4, 8, 12, 14])
where i.index could be different from range(len(i)) if you wanted.
For large datasets, you can use indexing on the base numpy data, if you're prepared to transform your column names into a numerical index (simple in this case):
import numpy as np
df.values[np.arange(5), [0, 0, 1, 2, 1]]
out: array([ 1, 4, 8, 12, 14])
This will be much more efficient than list comprehensions or other explicit iteration.
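If the mapping from names to positions isn't as simple, here is a minimal sketch of how to build the numerical index from the column names with Index.get_indexer (the variable names are mine, not from the answer above):
import numpy as np
cols = ['a', 'a', 'b', 'c', 'b']
col_idx = df.columns.get_indexer(cols)  # maps ['a', 'a', 'b', 'c', 'b'] to [0, 0, 1, 2, 1]
result = df.to_numpy()[np.arange(len(df)), col_idx]
# array([ 1, 4, 8, 12, 14])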
As MorningGlory stated in the comments, lookup has been deprecated in version 1.2.0.
The documentation states that the same can be achieved using melt and loc, but I didn't find it very obvious, so here goes.
First, use melt to create a look-up DataFrame:
i = pd.Series(["a", "a", "b", "c", "b"], name="col")
melted = pd.melt(
    pd.concat([i, df], axis=1),
    id_vars="col",
    value_vars=df.columns,
    ignore_index=False,
)
col variable value
0 a a 1
1 a a 4
2 b a 7
3 c a 10
4 b a 13
0 a b 2
1 a b 5
2 b b 8
3 c b 11
4 b b 14
0 a c 3
1 a c 6
2 b c 9
3 c c 12
4 b c 15
Then, use loc to only get relevant values:
result = melted.loc[melted["col"] == melted["variable"], "value"]
0 1
1 4
2 8
4 14
3 12
Name: value, dtype: int64
Finally - if needed - to get the same index order as before:
result.loc[df.index]
0 1
1 4
2 8
3 12
4 14
Name: value, dtype: int64
Pandas also provides a different solution in the documentation using factorize and numpy indexing:
import numpy as np

df = pd.concat([i, df], axis=1)
idx, cols = pd.factorize(df['col'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
array([ 1, 4, 8, 12, 14])
You can always use a list comprehension:
[df.loc[idx, col] for idx, col in enumerate(['a', 'a', 'b', 'c', 'b'])]
Related
I have a list full of pandas DataFrames. Is there a way to remove duplicates from it? Here is some example code:
import pandas as pd
import numpy as np
if __name__ == '__main__':
    data1 = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
    df1 = pd.DataFrame.from_dict(data1, orient='index', columns=['A', 'B', 'C', 'D'])
    data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
    df2 = pd.DataFrame.from_dict(data2)
    df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                       columns=['a', 'b', 'c'])
    data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
                    dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
    df4 = pd.DataFrame(data, columns=['c', 'a'])
    l_input = [df1, df2, df1, df3, df4, df4, df1, df3]
    # l_aim = [df1, df2, df3, df4]
The duplicates in the input list l_input in the example should be removed, so that l_aim is the result.
An efficient method to find the duplicates in linear time would be to compute a hash of the dataframes. You can't do it with the python hash function, but there is a helper function in pandas: pandas.util.hash_pandas_object.
The function computes a hash per row, so you need to aggregate to a single value. sum could be used but it might lead to collisions. Here I opted for a concatenation of all hashes. If you have huge dataframes this might consume a lot of memory (in such case, maybe hash the list of hashes).
Update. The hash of hashes seems to be ideal, see the second option at the end of the answer.
hashes = [pd.util.hash_pandas_object(d).astype(str).str.cat(sep='-')
          for d in l_input]
# identify duplicated per index
dups = pd.Series(hashes).duplicated()
Output:
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
dtype: bool
To filter the unique dataframes (duplicated marks repeats with True, so keep the entries that are not flagged):
out = [d for d, h in zip(l_input, dups) if not h]
variant with a hash of the hashes
I was initially unsure whether computing the hash of a list of hashes would be collision-safe, but this seems to be the case, so the second method below should probably be preferred:
def df_hash(df):
    s = pd.util.hash_pandas_object(df)
    return hash(tuple(s))

hashes = [df_hash(d) for d in l_input]
dups = pd.Series(hashes).duplicated()
out = [d for d, h in zip(l_input, dups) if not h]
Try df.equals():
out = []
while l_input:  # note: this loop consumes (empties) l_input
    d = l_input.pop()
    if any(d.equals(df) for df in l_input):
        continue
    out.append(d)
print(*out[::-1], sep="\n\n")
Prints:
A B C D
row_1 3 2 1 0
row_2 a b c d
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
a b c
0 1 2 3
1 4 5 6
2 7 8 9
c a
0 3 1
1 6 4
2 9 7
If these are literally the same objects (as in your example), then you could use their ids:
out = list(dict((id(df), df) for df in l_input).values())
If not, you could use equals:
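The equals-based code was not shown in the original answer; a minimal sketch of one way to do it, keeping the first occurrence of each frame:
out = []
for df in l_input:
    if not any(df.equals(kept) for kept in out):
        out.append(df)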
Output:
[ A B C D
row_1 3 2 1 0
row_2 a b c d,
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d,
a b c
0 1 2 3
1 4 5 6
2 7 8 9,
c a
0 3 1
1 6 4
2 9 7]
I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set the param axis=1 to sum across the rows; this will ignore non-numeric columns (on recent pandas versions you may need to pass numeric_only=True for that):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list = list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates a new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up:
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum for certain rows, replace the ':' (which selects all rows) with a row selection, as in the sketch below.
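For example (the contents of list_name here are my assumption, reusing the columns from the earlier answers; loc slices are inclusive on both ends):
list_name = ['a', 'b', 'd']
df['total'] = df.loc[:, list_name].sum(axis=1)  # all rows
first_two = df.loc[0:1, list_name].sum(axis=1)  # only rows 0 and 1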
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
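One thing that does work for this, for what it's worth, is numpy's np.r_, which concatenates slices and single indices into one integer array (my own addition, not part of the original answer):
import numpy as np
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)  # columns 0, 1 and 3, i.e. a + b + d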
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
Note that eval returns a new DataFrame by default; assign it back (df = df.eval('e = a + b + d')) or pass inplace=True to keep the new column.
Pretty sure this is very simple.
I am reading a CSV file and have the dataframe:
Attribute A B C
a 1 4 7
b 2 5 8
c 3 6 9
I want to do a transpose to get
Attribute a b c
A 1 2 3
B 4 5 6
C 7 8 9
However, when I do df.T, it results in
0 1 2
Attribute a b c
A 1 2 3
B 4 5 6
C 7 8 9
How do I get rid of the indexes on top?
You can set the index to your first column (or, in general, the column you want to use as index) in your dataframe first, then transpose the dataframe. For example, if the column you want to use as index is 'Attribute', you can do:
df.set_index('Attribute', inplace=True)
df.transpose()
Or
df.set_index('Attribute').T
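For the frame in the question, a quick check (rebuilding it by hand here, since the original comes from a CSV file):
df = pd.DataFrame({'Attribute': ['a', 'b', 'c'],
                   'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df.set_index('Attribute').T)
# Attribute  a  b  c
# A          1  2  3
# B          4  5  6
# C          7  8  9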
It works for me:
>>> data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
>>> df = pd.DataFrame(data, index=['a', 'b', 'c'])
>>> df.T
a b c
A 1 2 3
B 4 5 6
C 7 8 9
If your index column 'Attribute' is really set as the index before the transpose, then the top row after the transpose is not a data row but a title row (it holds the name of the column index). If you don't like it, you can drop that name after the transpose, as sketched below.
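A minimal sketch of one way to do that, using rename_axis (my own suggestion, not from the answer above):
out = df.set_index('Attribute').T
out = out.rename_axis(None, axis=1)  # removes the 'Attribute' title from the column header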
Suppose I have a dataframe df
df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
columns=['A', 'B', 'C', 'D', 'E'])
Which looks like this
A B C D E
0 1 2 3 4 5
1 6 7 8 9 10
How do I reverse the order of the column values but leave the column headers as A, B, C, D, E?
I want it to look like
A B C D E
0 5 4 3 2 1
1 10 9 8 7 6
I've tried sorting the column index with df.sort_index(1, ascending=False), but that changes the column heads (obviously), and also I don't know whether my columns start off sorted anyway.
Or you can just reverse your column names and sort the columns back into order (note that sortlevel has since been removed from pandas; sort_index(axis=1) is the current equivalent):
df.columns = df.columns[::-1]
df.sort_index(axis=1)
# A B C D E
#0 5 4 3 2 1
#1 10 9 8 7 6
method 1
reconstruct
pd.DataFrame(df.values[:, ::-1], df.index, df.columns)
method 2
assign values
df[:] = df.values[:, ::-1]
df
both give:
A B C D E
0 5 4 3 2 1
1 10 9 8 7 6
Also, using np.fliplr, which flips the values along the horizontal direction:
import numpy as np
pd.DataFrame(np.fliplr(df.values), columns=df.columns, index=df.index)
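For completeness, a similar one-liner (my own sketch, not from the answers above): reverse the columns positionally with iloc, then restore the original header with set_axis:
df.iloc[:, ::-1].set_axis(df.columns, axis=1)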