Index values of specific rows in Python

I am trying to find the index of each row that comes immediately before a "None".
pId=["a","b","c","None","d","e","None"]
df = pd.DataFrame(pId,columns=['pId'])
pId
0 a
1 b
2 c
3 None
4 d
5 e
6 None
df.index[df.pId.eq('None') & df.pId.ne(df.pId.shift(-1))]
I am expecting the output of the above code to be
Index([2, 5])
but it gives me
Index([3, 6])
Please correct me.

I am not sure about the specific example you showed. Anyway, you could do it in a simpler way:
indexes = [i - 1 for i, x in enumerate(pId) if x == 'None']
(This operates on the plain Python list rather than the DataFrame, and gives [2, 5].)

The problem is that you're returning the index of the "None". You compare it against the previous item, but you're still reporting the index of the "None" itself. Note that your accepted answer doesn't make this check.
In short, you still need to subtract 1 from the result of your check.
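As a minimal sketch, here is that correction applied to your original expression (keeping your shift(-1) check, which skips a "None" that is immediately followed by another "None"):

df.index[df.pId.eq('None') & df.pId.ne(df.pId.shift(-1))] - 1
# -> Index([2, 5]) for the sample data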

Just subtract 1 from df[df["pId"] == "None"].index:
import pandas as pd

pId = ["a", "b", "c", "None", "d", "e", "None"]
df = pd.DataFrame(pId, columns=['pId'])
print(df[df["pId"] == "None"].index - 1)
Which gives you:
Int64Index([2, 5], dtype='int64')
Or if you just want a list of values:
(df[df["pId"] == "None"].index - 1).tolist()
You should be aware that for a list like:
pId=["None","None","b","c","None","d","e","None"]
You get a df like:
pId
0 None
1 None
2 b
3 c
4 None
5 d
6 e
7 None
And output like:
[-1, 0, 3, 6]
Which does not make a great deal of sense.
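Since negative positions and runs of "None" can slip through, a possible guard (my own addition, not part of the answer above) is to keep only indices that are non-negative and do not themselves hold a "None":

idx = df[df["pId"] == "None"].index - 1
# keep only valid positions that do not point at another "None"
idx = [i for i in idx if i >= 0 and df.loc[i, "pId"] != "None"]
# -> [3, 6] for the second example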

Related

Drop few rows of a pandas dataframe using lambda

I'm currently facing a problem with method chaining when manipulating data frames in pandas. Here is the structure of my data:
import pandas as pd

lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame({
    'Frenquency': lst1,
    'lst2Tite': lst2,
    'lst3Tite': lst3
})
The task is to get the entries (rows) where the frequency is less than 6, but it needs to be done with method chaining.
I know the traditional way is easy; I could just do
df[df["Frenquency"] < 6]
to get the answer.
However, the question is about how to do it with method chaining. I tried something like
df.drop(lambda x: x.index if x["Frequency"] < 6 else null)
but it raised the error "[<function <lambda> at 0x7faf529d3510>] not contained in axis".
Could anyone shed some light on this issue?
This is an old question, but since there is no accepted answer I will answer it for future reference.
df[df.apply(lambda x: True if x.Frenquency < 6 else False, axis=1)]
Explanation: the lambda checks each row's Frenquency and returns True if it is less than 6, otherwise False; that Series of booleans then indexes the DataFrame, keeping only the True rows. Note that the column name Frenquency is a typo, but I kept it as-is since the question uses it.
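The True if ... else False is redundant, since the comparison already returns a boolean; as a small simplification, this is equivalent:

df[df.apply(lambda x: x.Frenquency < 6, axis=1)]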
Or maybe this:
df.drop(i for i in df.Frenquency if i >= 6)
Or use inplace:
df.drop((i for i in df.Frenquency if i >= 6), inplace=True)
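Note that drop removes rows by index label, not by value: the generator above yields column values, and it only works here because the default RangeIndex happens to coincide with the Frenquency values. A label-based sketch of the same idea:

# select the labels of the offending rows, then drop them
df.drop(df.index[df.Frenquency >= 6])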
For this sort of selection, you can maintain a fluent interface and use method-chaining by using the query method:
>>> df.query('Frenquency < 6')
Frenquency lst2Tite lst3Tite
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
>>>
So something like:
df.rename(<something>).query('Frenquency < 6').assign(<something>)
Or more concretely:
>>> (df.rename(columns={'Frenquency':'F'})
... .query('F < 6')
... .assign(FF=lambda x: x.F**2))
F lst2Tite lst3Tite FF
0 0 0 0 0
1 1 1 1 1
2 2 2 2 4
3 3 3 3 9
4 4 4 4 16
5 5 5 5 25
I feel this post did not have an answer that addressed the spirit of the question. The most chain-friendly way is (probably) to use pandas' .loc.
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame({"Frequency": lst1, "lst2Tite": lst2, "lst3Tite": lst3})
df.loc[lambda _df: _df["Frequency"] < 6]
Simple!
Would this satisfy your needs?
df.mask(df.Frequency >= 6).dropna()
Note that mask replaces the unwanted rows with NaN before dropna() removes them, so integer columns will come back as floats.

From a dataframe using the apply() method, how to return a new column with lists of elements from the dataframe?

There's an operation that is a little counterintuitive when using the pandas apply() method. It took me a couple of hours of reading to solve, so here it is.
So here is what I was trying to accomplish.
I have a pandas dataframe like so:
test = pd.DataFrame({'one': [[2],['test']], 'two': [[5],[10]]})
one two
0 [2] [5]
1 [test] [10]
and I want to add the columns per row to create a resulting column of lists, whose length is equal to the DataFrame's original length, like so:
def combine(row):
    result = row['one'] + row['two']
    return result
When running it through the dataframe using the apply() method:
test.apply(lambda x: combine(x), axis=1)
one two
0 2 5
1 test 10
Which isn't quite what we wanted. What we want is:
result
0 [2, 5]
1 [test, 10]
EDIT
I know there are simpler solutions to this example, but it is an abstraction of a much more complex operation. Here's an example of a more complex one:
df_one:
org_id date status id
0 2 2015/02/01 True 3
1 10 2015/05/01 True 27
2 10 2015/06/01 True 18
3 10 2015/04/01 False 27
4 10 2015/03/01 True 40
df_two:
org_id date
0 12 2015/04/01
1 10 2015/02/01
2 2 2015/08/01
3 10 2015/08/01
Here's a more complex operation:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return id_list
then finally run:
df_one.sort_values('date', inplace=True)
df_two['id_list'] = df_two.apply(
    operation,
    axis=1,
    args=(df_one,)
)
This would be impossible with simpler solutions. Hence my proposal below is to rewrite operation as:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return pd.Series({'id_list': id_list})
We'd expect the following result:
id_list
0 []
1 []
2 [3]
3 [27,18,40]
IIUC we can simply sum two columns:
In [93]: test.sum(axis=1).to_frame('result')
Out[93]:
result
0 [2, 5]
1 [test, 10]
because when we sum lists:
In [94]: [2] + [5]
Out[94]: [2, 5]
they are getting concatenated...
So the answer to this problem lies in how the pandas apply() method works.
When defining
def combine(row):
    result = row['one'] + row['two']
    return result
the function returns a list for each row that gets passed in. This is a problem with the .apply() method, because pandas interprets the resulting list as a Series in which each element is a column of that same row.
To solve this we need to create a Series where we specify a new column name like so:
def combine(row):
    result = row['one'] + row['two']
    return pd.Series({'result': result})
And if we run this again:
test.apply(lambda x: combine(x), axis=1)
result
0 [2, 5]
1 [test, 10]
We'll get what we originally wanted! Again, this is because we are forcing pandas to interpret the entire result as a column.
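As an aside (this assumes a newer pandas, 0.23 or later): DataFrame.apply accepts a result_type argument, and result_type='reduce' asks pandas to return a Series of the lists rather than expanding them into columns, so the original list-returning combine would also work:

test.apply(combine, axis=1, result_type='reduce')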

Python [[0]] meaning

I am running a Python script (a Kaggle script). It works in a Python 3.4.5 virtualenv, but not in 3.5.2.
I am not sure why and I am not familiar with the [[0]] syntax. Below is the snippet.
import pandas as pd
data = pd.read_csv(r'path\train.csv')
labels_flat = data[[0]].values.ravel()
It should produce a list of values from the csv's first column.
In 3.5.2 I get this error:
KeyError: '[0] not in index'
I tried to replicate the value with
labels_flat = []
lf = data.values.tolist()
for row in lf:
    labels_flat.append(row[0])
But I don't think it is the same thing.
I don't think the problem is with the syntax; your DataFrame just does not contain the index you are looking for.
For me this works:
In [1]: data = pd.DataFrame({0:[1,2,3], 1:[4,5,6], 2:[7,8,9]})
In [2]: data[[0]]
Out[2]:
0
0 1
1 2
2 3
I think what confuses you about the [[0]] syntax is that square brackets are used in Python for two completely different things, and the [[0]] statement uses both:
A. [] is used to create a list. In the example above, [0] creates a list with the single element 0.
B. [] is also used to access an element of a list (or dict, ...). So data[0] returns the element of data at index 0.
The next confusing thing is that while ordinary Python lists are indexed by numbers (e.g. data[4] is the element at index 4), pandas DataFrames can also be indexed by lists. This is syntactic sugar for easily accessing multiple columns of the dataframe at once.
So in my example from above, to get column 0 and 1 you can do:
In [3]: data[[0, 1]]
Out[3]:
0 1
0 1 4
1 2 5
2 3 6
Here the inner [0, 1] creates a list with the elements 0 and 1. The outer [] retrieves the columns of the dataframe, using the inner list as an index.
For more readability, look at this; it's exactly the same:
In [4]: l = [0, 1]
In [5]: data[l]
Out[5]:
0 1
0 1 4
1 2 5
2 3 6
If you only want the first column (column 0) you get this:
In [6]: data[[0]]
Out[6]:
0
0 1
1 2
2 3
Which is exactly what you were looking for.
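As a guess at the original error (an assumption: train.csv has a header row, so read_csv labels the columns with strings and the label 0 does not exist, with the behavior differing between the pandas versions in the two environments): selecting the first column by position with iloc sidesteps the problem:

labels_flat = data.iloc[:, [0]].values.ravel()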

Collapsing identical adjacent rows in a Pandas Series

Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series, storing only the first element of each run.
As far as I know, there is no tool built into pandas to do this. But it is not a lot of code to do it yourself.
import pandas

example_series = pandas.Series([1, 1, 1, 2, 2, 3])

def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen

collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
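A standard-library alternative (a sketch using itertools.groupby, which groups runs of equal adjacent elements):

from itertools import groupby

# keep one representative per run of equal adjacent values
seen = [key for key, _ in groupby(example_series)]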
You could write a function that does the following (note that this relies on subtraction, so it assumes a numeric series):
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)
y[0] = 1
result = x[y != 0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
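If you want the collapsed values as a plain list rather than a Series that keeps its original index, .tolist() on the result works:

s[s != s.shift()].tolist()  # [1, 2, 3, 4, 5]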

Filter a pandas dataframe using values from a dict

I need to filter a data frame with a dict, constructed with the key being the column name and the value being the value that I want to filter:
filter_v = {'A': 1, 'B': 0, 'C': 'This is right'}
# this would be the normal approach
df[(df['A'] == 1) & (df['B'] == 0) & (df['C'] == 'This is right')]
But I want to do something along the lines of
for column, value in filter_v.items():
    df = df[df[column] == value]
but this will filter the data frame several times, one value at a time, and not apply all filters at the same time. Is there a way to do it programmatically?
EDIT: an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 0, 1, 1, np.nan], 'B': [1, 1, 1, 0, 1],
                    'C': ['right', 'right', 'wrong', 'right', 'right'], 'D': [1, 2, 2, 3, 4]})
filter_v = {'A': 1, 'B': 0, 'C': 'right'}
df1.loc[df1[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
gives
A B C D
0 1 1 right 1
1 0 1 right 2
3 1 0 right 3
but the expected result was
A B C D
3 1 0 right 3
only the last one should be selected.
IIUC, you should be able to do something like this:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
This works by making a Series to compare against:
>>> pd.Series(filter_v)
A 1
B 0
C right
dtype: object
Selecting the corresponding part of df1:
>>> df1[list(filter_v)]
A C B
0 1 right 1
1 0 right 1
2 1 wrong 1
3 1 right 0
4 NaN right 1
Finding where they match:
>>> df1[list(filter_v)] == pd.Series(filter_v)
A B C
0 True False True
1 False False True
2 True False False
3 True True True
4 False False True
Finding where they all match:
>>> (df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)
0 False
1 False
2 False
3 True
4 False
dtype: bool
And finally using this to index into df1:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
An abstraction of the above for the case of passing an array of filter values rather than a single value (analogous to Series.isin()). Using the same example:
df1 = pd.DataFrame({'A': [1, 0, 1, 1, np.nan], 'B': [1, 1, 1, 0, 1],
                    'C': ['right', 'right', 'wrong', 'right', 'right'], 'D': [1, 2, 2, 3, 4]})
filter_v = {'A': [1], 'B': [1, 0], 'C': ['right']}

## Start with an array of all True
ind = [True] * len(df1)

## Loop through the filters, updating the index
for col, vals in filter_v.items():
    ind = ind & (df1[col].isin(vals))

## Return the filtered dataframe
df1[ind]

## Returns
A B C D
0 1.0 1 right 1
3 1.0 0 right 3
Here is a way to do it:
df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
UPDATE:
With values being the same across columns you could then do something like this:
# Create your filtering function:
def filter_dict(df, dic):
    return df[df[list(dic.keys())].apply(
        lambda x: x.equals(pd.Series(list(dic.values()), index=x.index, name=x.name)), axis=1)]
# Use it on your DataFrame:
filter_dict(df1, filter_v)
Which yields:
A B C D
3 1 0 right 3
If it is something that you do frequently, you could go as far as patching DataFrame for easy access to this filter:
pd.DataFrame.filter_dict_ = filter_dict
And then use this filter like this:
df1.filter_dict_(filter_v)
Which would yield the same result.
BUT, it is not the right way to do it, clearly.
I would use DSM's approach.
In Python 2, that's OK in @primer's answer. But you should be careful in Python 3 because of dict_keys. For instance:
>>> df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
TypeError: unhashable type: 'dict_keys'
The correct way in Python 3:
df.loc[df[list(filter_v.keys())].isin(list(filter_v.values())).all(axis=1), :]
Here's another way:
filterSeries = pd.Series(np.ones(df.shape[0], dtype=bool))
for column, value in filter_v.items():
    filterSeries = (df[column] == value) & filterSeries
This gives:
>>> df[filterSeries]
A B C D
3 1 0 right 3
To follow up on DSM's answer, you can also use any() to turn your query into an OR operation (instead of AND):
df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).any(axis=1)]
You can also create a query:
query_string = ' and '.join(
    [f'({key} == "{val}")' if type(val) == str else f'({key} == {val})'
     for key, val in filter_v.items()]
)
df1.query(query_string)
Combining previous answers, here's a function you can feed to df1.loc. It allows AND/OR (via how='all'/'any'), plus comparisons other than == via the op keyword, if desired.
import operator

def quick_mask(df, filters, how='all', op=operator.eq) -> pd.Series:
    if how == 'all':
        comb = pd.Series.all
    elif how == 'any':
        comb = pd.Series.any
    return comb(op(df[[*filters]], pd.Series(filters)), axis=1)

# Usage
df1.loc[quick_mask(df1, filter_v)]
I had an issue due to my dictionary having multiple values for the same key.
I was able to change DSM's query to:
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]
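For example (a sketch reusing the df1 from above, with list values so that each column is matched against its own set of allowed values):

filter_v = {'A': [1], 'B': [0], 'C': ['right']}
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]  # selects row 3 only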
