Suppose I have a DataFrame
df = pandas.DataFrame({'a': [1,2], 'b': [3,4]}, ['foo', 'bar'])
a b
foo 1 3
bar 2 4
And I want to add a column based on another Series:
s = pandas.Series({'foo': 10, 'baz': 20})
foo 10
baz 20
dtype: int64
How do I assign the Series to a column of the DataFrame and provide a default value if the DataFrame index value is not in the Series index?
I'm looking for something of the form:
df['c'] = s.withDefault(42)
Which would result in the following Dataframe:
a b c
foo 1 3 10
bar 2 4 42
#Note: bar got value 42 because it's not in s
Thank you in advance for your consideration and response.
Using map with get
get has an argument that you can use to specify the default value.
df.assign(c=df.index.map(lambda x: s.get(x, 42)))
a b c
foo 1 3 10
bar 2 4 42
Use reindex with fill_value
df.assign(c=s.reindex(df.index, fill_value=42))
a b c
foo 1 3 10
bar 2 4 42
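The two approaches above can be compared side by side in a self-contained sketch. The variable names via_get and via_reindex are only illustrative, and the get-based lookup is written as a list comprehension here rather than with Index.map.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["foo", "bar"])
s = pd.Series({"foo": 10, "baz": 20})

# Option 1: look up each index label in s, falling back to 42.
via_get = df.assign(c=[s.get(label, 42) for label in df.index])

# Option 2: align s to df's index; labels missing from s get 42,
# and labels that exist only in s ('baz') are dropped.
via_reindex = df.assign(c=s.reindex(df.index, fill_value=42))

print(via_reindex)
```

Neither call modifies df; assign returns a new DataFrame, so reassign it if you want to keep the column.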
Join df with a DataFrame built from s, then fill the resulting NaN values with the default value (42 in your case):
df['c'] = df.join(pandas.DataFrame(s, columns=['c']))['c'].fillna(42).astype(int)
Output:
a b c
foo 1 3 10
bar 2 4 42
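A sketch of why the astype(int) step is there, using a named Series in the join instead of wrapping s in a DataFrame (that substitution is my own variation, and the name joined is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["foo", "bar"])
s = pd.Series({"foo": 10, "baz": 20})

# Left join on the index: 'bar' has no match in s, so it becomes NaN,
# which forces the joined column to a float dtype.
joined = df.join(s.rename("c"))
print(joined["c"].dtype)

# Fill the gap with the default, then restore the integer dtype.
df["c"] = joined["c"].fillna(42).astype(int)
print(df)
```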
I want to add a new column to my dataframe containing an incremental number starting from 1
type value
a 25
b 23
c 33
d 31
I expect my dataframe would be:
type value id
a 25 1
b 23 2
c 33 3
d 31 4
Besides the id column, I also want to add a new column, status_id, where ids 1 to 2 are labeled foo and ids 3 to 4 are labeled bar. I expect the full dataframe to look like:
type value id status_id
a 25 1 foo
b 23 2 foo
c 33 3 bar
d 31 4 bar
How can I do this with pandas? Thanks in advance
Something like this?
df['id'] = np.arange(1, len(df) + 1)
df['status_id'] = df['id'].sub(1).floordiv(2).map({0: 'foo', 1: 'bar'})
type value id status_id
0 a 25 1 foo
1 b 23 2 foo
2 c 33 3 bar
3 d 31 4 bar
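To see what the arithmetic is doing, here is the same idea as a self-contained sketch with the intermediate step named. The variable block is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"type": list("abcd"), "value": [25, 23, 33, 31]})

# 1-based running id.
df["id"] = np.arange(1, len(df) + 1)

# (id - 1) // 2 gives 0, 0, 1, 1: a zero-based block number for each pair of rows.
block = (df["id"] - 1) // 2

# Translate block numbers into labels; unmapped block numbers would become NaN.
df["status_id"] = block.map({0: "foo", 1: "bar"})
print(df)
```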
We can try with cut
df['status_id'] = pd.cut(df.id,[0,2,4],labels=['foo','bar'])
df
type value id status_id
0 a 25 1 foo
1 b 23 2 foo
2 c 33 3 bar
3 d 31 4 bar
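Two things are worth knowing about the cut approach, sketched below on a standalone Series with one extra id (5) that I added to show the out-of-range case: the result is categorical rather than plain strings, and values outside every bin come back as NaN.

```python
import pandas as pd

ids = pd.Series([1, 2, 3, 4, 5], name="id")

# Bins are right-inclusive by default: (0, 2] -> 'foo', (2, 4] -> 'bar'.
status = pd.cut(ids, bins=[0, 2, 4], labels=["foo", "bar"])

print(status.dtype)    # a categorical dtype, not plain strings
print(status.isna())   # id 5 falls outside every bin, so it is NaN
```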
For the first question,
df.insert(0, 'id', range(1, 1 + len(df)))
For the second question, are you looking at only 4 rows? If so, you can insert the values manually. If the pattern is two foos followed by two bars, repeated over x rows, you can use modulo 4 on the row position to insert them correctly.
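One possible reading of the modulo suggestion, as a sketch: it assumes the foo, foo, bar, bar pattern repeats every four rows, which the question does not actually state. The eight-row frame and the name pos are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"type": list("abcdefgh"), "value": range(8)})

# Put the id first, as with df.insert(0, ...) above.
df.insert(0, "id", range(1, 1 + len(df)))

# Position within each block of four rows: 0, 1, 2, 3, 0, 1, 2, 3, ...
pos = np.arange(len(df)) % 4

# First two rows of each block are 'foo', the last two are 'bar'.
df["status_id"] = np.where(pos < 2, "foo", "bar")
print(df)
```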
Assume I have a dataframe in Pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
The dataframe df looks like
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
How can I write the code to get the value of D when C equals 1? In other words, how can I return the D value, which is 2, when C = 1?
You can filter and use .loc:
result = df.loc[df.C == 1, "D"]
An alternative, equivalent syntax, as noted in the comments:
result = df.loc[df['C'].eq(1), 'D']
While chained indexing such as df[df.C == 1]["D"] will get you the correct result, it builds an intermediate DataFrame first, so you will encounter poor performance as your data scales.
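The .loc call returns a Series, not a bare number. If exactly one matching row is expected, a minimal sketch for pulling out the scalar looks like this (the names matches and value are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"C": np.arange(8), "D": np.arange(8) * 2})

# One .loc call selects the rows and the column together.
matches = df.loc[df["C"] == 1, "D"]   # a Series, possibly with several rows

# If exactly one row is expected, .item() returns the scalar
# and raises ValueError when there are zero or several matches.
value = matches.item()
print(value)
```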
You can filter the data frame on a condition (here "C" == 1) and then get the column by index lookup.
This returns a Series; use .values to get the NumPy array.
>>> df
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
>>> df[df["C"] == 1]["D"].values
array([2])
>>> df[df["C"] == 1]["D"].values[0]
2
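Indexing with [0] raises IndexError when no row matches. A small sketch of a guard for that case follows; the helper name first_d_where_c and its default parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"C": np.arange(8), "D": np.arange(8) * 2})

def first_d_where_c(frame, c_value, default=None):
    """Return the first D whose C equals c_value, or default if none match."""
    values = frame.loc[frame["C"] == c_value, "D"].to_numpy()
    return values[0] if len(values) else default

print(first_d_where_c(df, 1))
print(first_d_where_c(df, 99))
```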
Given this DF:
a b c d
1 2 1 4
4 3 4 2
foo bar foo yes
What is the best way to delete columns that are identical but have different names in a large pandas DF? For example:
a b d
1 2 4
4 3 2
foo bar yes
Column c was removed from the above dataframe because a and c were the same column with different names. So far I have tried:
df = df.iloc[:, ~df.columns.duplicated()]
However, that only compares column names, and it is not clear to me how to compare the row values inside the DF.
Use transpose, as below:
df.T.drop_duplicates().T
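A self-contained sketch of the transpose trick on the question's data (the name deduped is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 4, "foo"],
    "b": [2, 3, "bar"],
    "c": [1, 4, "foo"],
    "d": [4, 2, "yes"],
})

# Rows of the transposed frame are the original columns, so dropping
# duplicate rows drops duplicate columns; the first of each group is kept.
deduped = df.T.drop_duplicates().T
print(deduped.columns.tolist())
```

Transposing copies the whole frame, so on a very large DataFrame this may be memory-hungry.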
I tried a straightforward approach: loop through the column names and compare each column with the columns after it, using np.all for an exact match. This approach took only 336 ms.
repeated_columns = []
for i, column in enumerate(df.columns):
    # Compare only against columns to the right, so each pair is checked once.
    r_columns = df.columns[i+1:]
    for r_c in r_columns:
        if np.all(df[column] == df[r_c]):
            repeated_columns.append(r_c)
new_columns = [x for x in df.columns if x not in repeated_columns]
df[new_columns]
It gives the following output:
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
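A variation on the pairwise loop, offered only as a sketch: it swaps in itertools.combinations for the nested loops and Series.equals for the np.all comparison, which is my substitution rather than the answer's method. Series.equals treats NaNs in the same position as equal and also requires matching dtypes.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "a": [1, 4, "foo"], "b": [2, 3, "bar"],
    "c": [1, 4, "foo"], "d": [4, 2, "yes"],
})

repeated = set()
for left, right in combinations(df.columns, 2):
    # Skip columns already known to be copies of an earlier one.
    if left in repeated or right in repeated:
        continue
    # Compare values only; the differing column names do not matter here.
    if df[left].equals(df[right]):
        repeated.add(right)

result = df.drop(columns=list(repeated))
print(result.columns.tolist())
```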
Alternatively, build a boolean mask of the duplicated columns and select the rest with .loc:
df.loc[:,~df.T.duplicated()]
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
Say I have the pandas DataFrame below:
A B C D
1 foo one 0 0
2 foo one 2 4
3 foo two 4 8
4 cat one 8 4
5 bar four 6 12
6 bar three 7 14
7 bar four 7 14
I would like to select all the rows that have equal values in A but differing values in B. So I would like the output of my code to be:
A B C D
1 foo one 0 0
3 foo two 4 8
5 bar three 7 14
6 bar four 7 14
What's the most efficient way to do this? I have approximately 11,000 rows with a lot of variation in the column values, but this situation comes up a lot. In my dataset, if elements in column A are equal, then the corresponding column B values should also be equal. Due to mislabeling this is not always the case, and I would like to fix it; doing this one by one would be impractical.
You can try groupby() + filter + drop_duplicates():
>>> df.groupby('A').filter(lambda g: len(g) > 1).drop_duplicates(subset=['A', 'B'], keep="first")
A B C D
0 foo one 0 0
2 foo two 4 8
4 bar four 6 12
5 bar three 7 14
Or, if you only want to drop duplicates over the subset of columns A and B, you can use the following, but the result will also include the cat row.
>>> df.drop_duplicates(subset=['A', 'B'], keep="first")
A B C D
0 foo one 0 0
2 foo two 4 8
3 cat one 8 4
4 bar four 6 12
5 bar three 7 14
Use groupby + filter + head:
result = df.groupby('A').filter(lambda g: len(g) > 1).groupby(['A', 'B']).head(1)
print(result)
Output
A B C D
0 foo one 0 0
2 foo two 4 8
4 bar four 6 12
5 bar three 7 14
The first group-by and filter will remove the rows with no duplicated A values (i.e. cat), the second will create groups with same A, B and for each of those get the first element.
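Note that len(g) > 1 keeps any A value that appears more than once, even when all of its B labels agree. If only groups with genuinely conflicting B labels are wanted, one possible variation (my own sketch, not taken from the answers) counts the distinct labels per group instead; n_labels and conflicts are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "cat", "bar", "bar", "bar"],
    "B": ["one", "one", "two", "one", "four", "three", "four"],
    "C": [0, 2, 4, 8, 6, 7, 7],
    "D": [0, 4, 8, 4, 12, 14, 14],
})

# Number of distinct B labels within each A group, broadcast back to every row.
n_labels = df.groupby("A")["B"].transform("nunique")

# Keep only groups whose B labels disagree, then one row per (A, B) pair.
conflicts = df[n_labels > 1].drop_duplicates(subset=["A", "B"])
print(conflicts)
```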
The current answers are correct and may be more sophisticated too. If you have complex criteria, the filter function will be very useful. If, like me, you want to keep things simple, I feel the following is a more beginner-friendly way:
>>> df = pd.DataFrame({
'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
'C': [0,2,4,8,6,7,7],
'D': [0,4,8,4,12,14,14]
}, index=[1,2,3,4,5,6,7])
>>> df = df.drop_duplicates(['A', 'B'], keep='last')
A B C D
2 foo one 2 4
3 foo two 4 8
4 cat one 8 4
6 bar three 7 14
7 bar four 7 14
>>> df = df[df.duplicated(['A'], keep=False)]
A B C D
2 foo one 2 4
3 foo two 4 8
6 bar three 7 14
7 bar four 7 14
keep='last' is optional here
This question is closely related to two other questions, and I'll even use the example from the very helpful accepted solution to one of them. Here's that example (credit to unutbu):
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
But I want all rows of A and only the rows in B that have 'two' in them. My attempt at it is to try
print(df.loc[df['A']) & df['B'] == 'two'])
This does not work, unfortunately. Can anybody suggest a way to implement something like this? It would be a great help if the solution were somewhat general: for example, where column A does not hold the single value 'foo' but several different values, and you still want the whole column.
Easy. If you do
df[['A','B']][df['B']=='two']
you will get:
A B
2 foo two
4 foo two
5 bar two
To filter on both A and B:
df[['A','B']][(df['B']=='two') & (df['A']=='foo')]
You get:
A B
2 foo two
4 foo two
And if you want all the columns:
df[df['B']=='two']
you will get:
A B C D
2 foo two 2 4
4 foo two 4 8
5 bar two 5 10
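The row filter and the column selection can also be combined into a single .loc call, which avoids the chained df[cols][mask] form. A minimal sketch, with mask and subset as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Each comparison needs its own parentheses: & binds tighter than ==.
mask = (df["A"] == "foo") & (df["B"] == "two")

# Row mask and column list in a single .loc call.
subset = df.loc[mask, ["A", "B"]]
print(subset)
```

The missing parentheses are one reason an expression like df['A'] & df['B'] == 'two' fails.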
I think I understand your modified question. After sub-selecting on a condition of B, then you can select the columns you want, such as:
In [1]: df.loc[df.B =='two'][['A', 'B']]
Out[1]:
A B
2 foo two
4 foo two
5 bar two
For example, if I wanted to concatenate all the string of column A, for which column B had value 'two', then I could do:
In [2]: df.loc[df.B =='two'].A.sum() # <-- use .mean() for your quarterly data
Out[2]: 'foofoobar'
You could also groupby the values of column B and get such a concatenation result for every different B-group from one expression:
In [3]: df.groupby('B').apply(lambda x: x.A.sum())
Out[3]:
B
one foobarfoo
three barfoo
two foofoobar
dtype: object
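For string concatenation per group, a sketch of an alternative that selects the column first and joins explicitly instead of relying on sum over strings (the name joined is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Select column A within each B group and join its strings in row order.
joined = df.groupby("B")["A"].agg("".join)
print(joined)
```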
To filter on A and B use numpy.logical_and:
In [1]: df.loc[np.logical_and(df.A == 'foo', df.B == 'two')]
Out[1]:
A B C D
2 foo two 2 4
4 foo two 4 8
Row subsetting: isn't this what you are looking for?
df.loc[(df['A'] == 'foo') & (df['B'] == 'two')]
A B C D
2 foo two 2 4
4 foo two 4 8
You can also add .reset_index(drop=True) at the end to renumber the index from zero; a plain .reset_index() does the same but keeps the old index as an extra column.
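A short sketch of what reset_index does to the filtered rows, with and without drop=True (the names renumbered and kept are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

subset = df.loc[(df["A"] == "foo") & (df["B"] == "two")]

# drop=True discards the old labels (2, 4) and renumbers from zero.
renumbered = subset.reset_index(drop=True)

# Without drop=True the old labels are kept in a new 'index' column.
kept = subset.reset_index()
print(renumbered)
print(kept)
```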