I have been trying to solve this problem for a while.
I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['A', 2, 3], ['B', 5, 6], ['C', 8, 9]]), columns=['a', 'b', 'c'])
j=[0,2]
But when I try to select just part of it, filtering by a list of indices and a condition on a column, I get an error:
df[df.loc[j]['a']=='A']
Something is wrong, but I don't see what the problem is here. Can you help me?
This is the error message:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
The boolean mask is built from the filtered DataFrame but applied to the original one, so their indices differ and the error is raised. You need to compare within the filtered DataFrame:
df1 = df.loc[j]
print (df1)
a b c
0 A 2 3
2 C 8 9
out = df1[df1['a']=='A']
print(out)
a b c
0 A 2 3
Your original approach is possible if you reindex the boolean mask to the original index with Series.reindex:
out = df[(df.loc[j, 'a']=='A').reindex(df.index, fill_value=False)]
print(out)
a b c
0 A 2 3
Or nicer solution:
out = df[(df['a'] == 'A') & (df.index.isin(j))]
print(out)
a b c
0 A 2 3
A boolean mask and the DataFrame it indexes must have the same length. Here your df has length 3, but the boolean mask df.loc[j]['a']=='A' has length 2.
You should do:
>>> df.loc[j][df.loc[j]['a']=='A']
a b c
0 A 2 3
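To make the mismatch from the error message concrete, you can compare the two lengths directly (a small diagnostic sketch, not part of the original answer):
mask = df.loc[j]['a'] == 'A'
print(len(df), len(mask))
# 3 2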
I have a pandas DataFrame whose values I want to conditionally change into strings without looping over every value.
Example input:
In [1]: df = pd.DataFrame(data = [[1,2], [4,5]], columns = ['a', 'b'])
Out[2]:
a b
0 1 2
1 4 5
This is my best attempt, which doesn't work properly:
df['a'] = np.where(df['a'] < 3, f'string-{df["a"]}', df['a'])
In [1]: df
Out[2]:
a b
0 string0 1\n1 4\nName: a, dtype: int64 2
1 4 5
Desired output:
Out[2]:
a b
0 string-1 2
1 4 5
I am using np.where() since looping is not feasible due to the size of the actual DataFrame. The actual f-string I am using is also more complex and has two variables that include column names, but the problem is the same.
Are there other ways to conditionally change pandas values into f-strings without looping over each value?
You can use .map() together with an f-string, as follows:
df['a'] = df['a'].map(lambda x: f'string-{x}' if x < 3 else x)
Alternatively, you can also use .loc together with string concatenation, as follows:
df.loc[df['a'] < 3, 'a'] = 'string-' + df['a'].astype(str)
# or
df['a'] = np.where(df['a'] < 3, 'string-' + df['a'].astype(str), df['a'])
Result:
print(df)
a b
0 string-1 2
1 4 5
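For the multi-column case mentioned in the question, the same vectorized pattern extends naturally. A minimal sketch, assuming a fresh copy of the example df and that both columns feed into the string:
mask = df['a'] < 3
df['a'] = df['a'].astype(object)  # object dtype avoids upcasting issues when mixing strings and ints
df.loc[mask, 'a'] = 'string-' + df['a'].astype(str) + '-' + df['b'].astype(str)
# row 0 becomes 'string-1-2'; row 1 keeps its original value 4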
I'm simply trying to access named pandas columns by an integer.
You can select a row by location using df.ix[3].
But how do you select a column by integer?
My dataframe:
import pandas
import numpy as np
df = pandas.DataFrame({'a': np.random.rand(5), 'b': np.random.rand(5)})
Two approaches that come to mind:
>>> df
A B C D
0 0.424634 1.716633 0.282734 2.086944
1 -1.325816 2.056277 2.583704 -0.776403
2 1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025 1.325853 -2.513373
4 1.366180 -1.265185 -2.184617 0.881514
>>> df.iloc[:, 2]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
>>> df[df.columns[2]]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
Edit: The original answer suggested the use of df.ix[:,2] but this function is now deprecated. Users should switch to df.iloc[:,2].
You can also use df.icol(n) to access a column by integer.
Update: icol is deprecated and the same functionality can be achieved by:
df.iloc[:, n] # to access the column at the nth position
You can use label-based slicing with .loc or position-based slicing with .iloc to select columns, including column ranges:
In [50]: import pandas as pd
In [51]: import numpy as np
In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
In [53]: df
Out[53]:
a b c d
0 0.806811 0.187630 0.978159 0.317261
1 0.738792 0.862661 0.580592 0.010177
2 0.224633 0.342579 0.214512 0.375147
3 0.875262 0.151867 0.071244 0.893735
In [54]: df.loc[:, ["a", "b", "d"]] ### select specific columns by label
Out[54]:
a b d
0 0.806811 0.187630 0.317261
1 0.738792 0.862661 0.010177
2 0.224633 0.342579 0.375147
3 0.875262 0.151867 0.893735
In [55]: df.loc[:, "a":"c"] ### label-based column range slicing
Out[55]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
In [56]: df.iloc[:, 0:3] ### position-based column range slicing
Out[56]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
You can access multiple columns by passing a list of column indices to DataFrame.ix.
For example:
>>> df = pandas.DataFrame({
'a': np.random.rand(5),
'b': np.random.rand(5),
'c': np.random.rand(5),
'd': np.random.rand(5)
})
>>> df
a b c d
0 0.705718 0.414073 0.007040 0.889579
1 0.198005 0.520747 0.827818 0.366271
2 0.974552 0.667484 0.056246 0.524306
3 0.512126 0.775926 0.837896 0.955200
4 0.793203 0.686405 0.401596 0.544421
>>> df.ix[:,[1,3]]
b d
0 0.414073 0.889579
1 0.520747 0.366271
2 0.667484 0.524306
3 0.775926 0.955200
4 0.686405 0.544421
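Note that .ix has since been removed from pandas; a positional equivalent of the example above uses .iloc, which returns the same b and d columns shown here:
>>> df.iloc[:, [1, 3]]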
The method .transpose() converts columns to rows and rows to columns, hence you could even write:
df.transpose().ix[3]
Most answers explain how to take columns starting from an index, but there are scenarios where you need to pick specific columns from in between, which you can do with the solution below.
Say you have columns A, B, and C. If you need to select only columns A and C, you can use the code below.
df = df.iloc[:, [0,2]]
where 0 and 2 specify that you need to select only the 1st and 3rd columns.
You can use the take method. For example, to select the first and last columns:
df.take([0, -1], axis=1)
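Since take is purely positional, the same selection can also be written with .iloc, which likewise accepts negative positions (an equivalent spelling, for comparison):
df.iloc[:, [0, -1]]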
I have a pandas Series that contains a list of strings in each row. I'd like to create another Series holding the last string of each row's list.
So one row may have a list, e.g.
['a', 'b', 'c', 'd']
I'd like to create another pandas Series made up of the last element of each row's list, normally accessed with a -1 reference; in this case, 'd'. The lists for each observation (i.e. row) are of varying length. How can this be done?
You need indexing with .str, which works with all iterables:
df = pd.DataFrame({'col':[['a', 'b', 'c', 'd'],['a', 'b'],['a'], []]})
df['last'] = df['col'].str[-1]
print (df)
col last
0 [a, b, c, d] d
1 [a, b] b
2 [a] a
3 [] NaN
Strings are iterables too:
df = pd.DataFrame({'col':['abcd','ab','a', '']})
df['last'] = df['col'].str[-1]
print (df)
col last
0 abcd d
1 ab b
2 a a
3 NaN
Why not expand the list column into a helper DataFrame and use the shared index to join back?
Infodf = pd.DataFrame(df.col.values.tolist(), index=df.index)
Infodf
Out[494]:
0 1 2 3
0 a b c d
1 a b None None
2 a None None None
3 None None None None
I think I overlooked the question, and both PiR and Jez provided valuable suggestions that helped me reach the final result:
Infodf.ffill(axis=1).iloc[:, -1]
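Putting the pieces together, a minimal sketch reusing the df from the first answer:
Infodf = pd.DataFrame(df.col.values.tolist(), index=df.index)
df['last'] = Infodf.ffill(axis=1).iloc[:, -1]
# gives d, b, a, NaN for the four rows, matching the .str[-1] result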
I want to, at the same time, create a new column in a pandas dataframe and set its first value to a list.
I want to transform this dataframe
df = pd.DataFrame.from_dict({'a':[1,2],'b':[3,4]})
a b
0 1 3
1 2 4
into this one
a b c
0 1 3 [2,3]
1 2 4 NaN
I tried:
df.loc[0, 'c'] = [2,3]
df.loc[0, 'c'] = np.array([2,3])
df.loc[0, 'c'] = [[2,3]]
df.at[0,'c'] = [2,3]
df.at[0,'d'] = [[2,3]]
It does not work.
How should I proceed?
If the first element of a series is a list, then the series must be of type object (not the most efficient for numerical computations). This should work, however.
df = df.assign(c=None)
df.loc[0, 'c'] = [2, 3]
>>> df
a b c
0 1 3 [2, 3]
1 2 4 None
If you really need the remaining values of column c to be NaNs instead of None, use this:
df.loc[1:, 'c'] = np.nan
The problem seems to have something to do with the dtype of the c column. If you convert it to object dtype, you can use iat, at, or loc to set a cell to a list. (The original answer used set_value, which has since been removed from pandas; .at is the modern equivalent.)
df2 = (
    df.assign(c=np.nan)
      .assign(c=lambda x: x.c.astype(object))
)
df2.at[0, 'c'] = [2, 3]
df2
Out[86]:
a b c
0 1 3 [2, 3]
1 2 4 NaN
I have the following DataFrame:
a b c
b
2 1 2 3
5 4 5 6
As you can see, column b is used as an index. I want to get the ordinal number of the row fulfilling ('b' == 5), which in this case would be 1.
The column being tested can be either an index column (as with b in this case) or a regular column, e.g. I may want to find the index of the row fulfilling ('c' == 6).
Use Index.get_loc instead.
Reusing @unutbu's setup code (see the np.where answer below), you'll achieve the same result.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
>>> df
a b c
b
2 1 2 3
5 4 5 6
>>> df.index.get_loc(5)
1
You could use np.where like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
print(df)
# a b c
# b
# 2 1 2 3
# 5 4 5 6
print(np.where(df.index==5)[0])
# [1]
print(np.where(df['c']==6)[0])
# [1]
The value returned is an array since there could be more than one row with a particular index or value in a column.
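For example, with duplicated index values (a hypothetical variant of the setup above), all matching positions are returned:
df_dup = pd.DataFrame(np.arange(1,10).reshape(3,3),
                      columns = list('abc'),
                      index=pd.Series([2,5,5], name='b'))
print(np.where(df_dup.index==5)[0])
# [1 2]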
With Index.get_loc and a general condition:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
>>> df
a b c
b
2 1 2 3
5 4 5 6
>>> df.index.get_loc(df.index[df['b'] == 5][0])
1
The other answers based on Index.get_loc() do not provide a consistent result: that function returns an integer if the index values are all unique, but a slice or a boolean mask array if the index contains duplicates. A more consistent approach that returns a list of integer positions every time is the following, shown here for an index with non-unique values:
df = pd.DataFrame([
{"A":1, "B":2}, {"A":2, "B":2},
{"A":3, "B":4}, {"A":1, "B":3}
], index=[1,2,3,1])
If searching based on index value:
[i for i,v in enumerate(df.index == 1) if v]
[0, 3]
If searching based on a column value:
[i for i,v in enumerate(df["B"] == 2) if v]
[0, 1]
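If you prefer to avoid the Python-level loop, np.flatnonzero returns the same positions; a NumPy alternative not used in the original answer, assuming numpy is imported as np:
np.flatnonzero(df["B"] == 2).tolist()
# [0, 1]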