I am doing an analysis of a dataset with 6 classes, zero based. The dataset is many thousands of items long.
I need two dataframes with classes 0 & 1 for the first data set and 3 & 5 for the second.
I can get 0 & 1 together easily enough:
mnist_01 = mnist.loc[mnist['class']<= 1]
However, I am not sure how to get classes 3 & 5... so what I would like to be able to do is:
mnist_35 = mnist.loc[mnist['class'] == (3 or 5)]
...rather than doing:
mnist_3 = mnist.loc[mnist['class'] == 3]
mnist_5 = mnist.loc[mnist['class'] == 5]
mnist_35 = pd.concat([mnist_3,mnist_5],axis=0)
You can use isin, passing the wanted classes as a set or list; pandas then does the membership test for every row in a single vectorised call:
mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val': ['a', 'b', 'c', 'd', 'e', 'f']})
>>> mnist.loc[mnist['class'].isin({3, 5})]
   class val
3      3   d
5      5   f
>>> mnist.loc[mnist['class'].isin({0, 1})]
   class val
0      0   a
1      1   b
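As a side note on why the `== (3 or 5)` attempt from the question cannot work: Python evaluates `3 or 5` to `3` before pandas ever sees it, so the comparison only matches class 3. A minimal sketch, reusing the toy frame from this answer, that compares it with `isin` and with an explicit `|` of two masks:

```python
import pandas as pd

mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val': ['a', 'b', 'c', 'd', 'e', 'f']})

# `3 or 5` is plain Python and evaluates to 3, so this keeps class 3 only
only_3 = mnist.loc[mnist['class'] == (3 or 5)]

# isin tests membership in the whole collection
mnist_35 = mnist.loc[mnist['class'].isin([3, 5])]

# equivalent explicit form: element-wise OR of two boolean masks
mnist_35_or = mnist.loc[(mnist['class'] == 3) | (mnist['class'] == 5)]

print(only_3['class'].tolist())
print(mnist_35['class'].tolist())
print(mnist_35.equals(mnist_35_or))
```

The parentheses around each comparison in the `|` form are required, because `|` binds more tightly than `==`.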
Please take this question lightly; I am asking out of curiosity.
While trying to see how slicing works on a MultiIndex, I came across the following situation ↓
# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
Returns:
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32
NOTE that the indices are not in sorted order, i.e. the order is a, c, b, which produces the expected error when slicing.
# When we do slicing
data.loc["a":"c"]
Errors like:
UnsortedIndexError
----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
That's expected. But now, after doing the following steps:
# Making a DataFrame
data = data.unstack()
# Reindexing - to unsort the indices like before
data = data.reindex(["a", "c", "b"])
# Which looks like
   1  2
a  5  0
c  8  6
b  6  3
# Then again making series
data = data.stack()
# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)
# Which looks like before
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32
The Problem
So, now the process is: Series → Unstack → DataFrame → Stack → Series
Now, if I do the slicing like before (still on with the indices unsorted) we don't get any error!
# The same slicing
data.loc["a":"c"]
Results without an error:
a  1    5
   2    0
c  1    8
   2    6
dtype: int32
This works even though data.index.is_monotonic is still False. So why can we slice now?
To restate the situation: the same series that raised an error before no longer raises one after the unstack and stack round trip.
Is that a bug, or a concept that I am missing here?
Thanks!
Aayush ∞ Shah
UPDATE:
I have now used data.reindex() to unsort the index once more. Please have another look.
The difference between your 2 dataframes is the following:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.randint(10, size=6), index=index)
data2 = data.unstack().reindex(["a", "c", "b"]).stack()
>>> data.index.codes
FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])
>>> data2.index.codes
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
Even though your two indexes look the same (same values), their internal representations (the codes) are different.
Check this method of MultiIndex:
Create a new MultiIndex from the current to monotonically sorted
items IN the levels. This does not actually make the entire MultiIndex
monotonic, JUST the levels.
The resulting MultiIndex will have the same outward
appearance, meaning the same .values and ordering. It will also
be .equals() to the original.
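To make the distinction between outward appearance and internal codes concrete, here is a minimal sketch. It assumes fixed values in place of the random ones so that it is reproducible, and it only inspects the first level of each index:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# fixed values instead of np.random.randint, purely for reproducibility
data = pd.Series([5, 0, 8, 6, 6, 3], index=index)
data2 = data.unstack().reindex(['a', 'c', 'b']).stack()

# from_product stores the first level sorted, so the codes are out of order
print(list(data.index.levels[0]), list(data.index.codes[0]))

# the round-tripped index: same labels in the same order, possibly other codes
print(list(data2.index.levels[0]), list(data2.index.codes[0]))

# both indexes show the same tuples in the same order
print(list(data.index) == list(data2.index))
```

The labels you see are always `levels[codes]`; two indexes can agree on those labels while disagreeing on how the levels themselves are ordered.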
Old answer
# Making a DataFrame
data = data.unstack()
# Which looks like # <- WRONG
1 2 # 1 2
a 5 0 # a 8 0
c 8 6 # b 4 1
b 6 3 # c 7 6
# Then again making series
data = data.stack()
# Which looks like before # <- WRONG
a 1 5 # a 1 2
2 0 # 2 1
c 1 8 # b 1 0
2 6 # 2 1
b 1 6 # c 1 3
2 3 # 2 9
dtype: int32
If you want to use slicing, you have to check if the index is monotonic:
# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
>>> data.index.is_monotonic
False
>>> data.unstack().stack().index.is_monotonic
True
>>> data.sort_index().index.is_monotonic
True
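A minimal sketch of the check-then-sort workflow described above. It assumes a recent pandas release, where the attribute is spelled `is_monotonic_increasing`, and uses fixed values in place of the random ones:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# fixed values instead of np.random.randint, purely for reproducibility
data = pd.Series([5, 0, 8, 6, 6, 3], index=index)

# the a, c, b order is not monotonic
print(data.index.is_monotonic_increasing)

sorted_data = data.sort_index()
print(sorted_data.index.is_monotonic_increasing)

# label slicing on the first level is safe once the index is sorted
print(sorted_data.loc['a':'b'])
```

Sorting changes the row order, so this is only appropriate when the original a, c, b order does not need to be preserved.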
Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
One way is to create a dictionary with your mappings, rename the columns with it, stack, apply your groupby operation, and unstack back to the original shape.
I couldn't see any logic in your mappings, so I'm afraid it will have to be a manual operation.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map( {i : x for x,y in buckets.items() for i in y})
out = df.stack().groupby(level=[0,1]).agg(list).unstack(1)[buckets.keys()]
print(out)
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
First create the dict for the mapping, then group by it:
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d,axis=1).agg(lambda x : [y[y!='-']for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}
def clean(val):
    # drop NaNs, then collapse to NaN, a single value, or a list
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val
new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    # one list of raw values per row, cleaned row by row
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
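For reference, a self-contained sketch of the iterate-then-clean idea, run against the example frame from the question. The helper name `collapse` and the use of `pd.isna` are my own choices for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

nan = np.nan
# the example frame from the question, with "-" written as NaN
df = pd.DataFrame({
    'a': [1, nan, nan], 'b': [nan, 6, 3], 'c': [nan, nan, 2],
    'd': [nan, 0, nan], 'e': [2, 4, nan], 'f': [nan, nan, nan],
    'g': [4, nan, nan], 'h': [3, nan, 1], 'i': [nan, nan, nan],
    'j': [nan, 2, 9],
})

buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'],
           'z': ['e', 'h', 'i'], 'j': ['j']}

def collapse(values):
    """Return NaN, a single value, or a list of the non-NaN values."""
    kept = [v for v in values if not pd.isna(v)]
    if not kept:
        return np.nan
    return kept[0] if len(kept) == 1 else kept

out = pd.DataFrame(
    {name: [collapse(row) for row in df[cols].to_numpy().tolist()]
     for name, cols in buckets.items()},
    index=df.index,
)
print(out)
```

Cells that hold lists force the affected columns to object dtype, so this layout is convenient for display but awkward for further numeric work.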
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row on a dataframe (here, a sub-dataframe, if you will, since we'll get the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd
bucket_cols=[["a", "b", "c"], ["d", "f", "g"], ["e", "h","i"], ["j"]]
bucket_names=["x", "y", "z", "j"]
buckets = {}
def bucket_rows(row):
    row = row.dropna().to_list() # applying pd.Series.dropna method to remove NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # returns bucketed row
    return row
# looping through buckets and performing the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    # key by bucket name so the output columns are x, y, z, j
    buckets[bucket_names[idx]] = bucket
# creating bucketed df from buckets dict
df_bucketted = pd.DataFrame(buckets)
I need to apply a function to a subset of columns in a dataframe. Consider the following toy example:
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']
What I want to do is this:
[pdf[c] = pdf[c].apply(lambda x : 99 if x == 2 else x) for c in arb_cols]
But this is bad syntax. Is it possible to accomplish such a task without a for loop?
With mask
pdf.mask(pdf.loc[:,arb_cols]==2,99).assign(c=pdf.c)
Out[1190]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Or with assign
pdf.assign(**pdf.loc[:,arb_cols].mask(pdf.loc[:,arb_cols]==2,99))
Out[1193]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
    pdf.loc[pdf[col] == 2, col] = 99
Another option is to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
If you need to apply a custom function, it would probably be better to use applymap in this case:
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)
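To check that the vectorised options in this thread agree on the toy frame, here is a minimal sketch. It works on copies so the original `pdf` is untouched; the names `replaced` and `masked` are only for illustration:

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [5, 6, 7]})
arb_cols = ['a', 'b']

# option 1: replace the value 2 in the selected columns only
replaced = pdf.copy()
replaced[arb_cols] = replaced[arb_cols].replace(2, 99)

# option 2: mask on the sub-frame, so column c is never touched
masked = pdf.copy()
masked[arb_cols] = masked[arb_cols].mask(masked[arb_cols] == 2, 99)

print(replaced)
print(replaced.values.tolist() == masked.values.tolist())
```

Restricting both the condition and the assignment to `arb_cols` avoids the extra `assign(c=pdf.c)` step, because the remaining columns are never part of the operation.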
out_gate,in_gate,num_connection
a,b,1
a,b,3
b,a,2
b,c,4
c,a,5
c,b,5
c,b,3
c,a,4
Shown above is a sample CSV file.
My final goal is to compile the data into a table of the number of connections between gates, like this:
   a  b  c
a  0  4  0
b  2  0  4
c  9  8  0
So far I have made a list of the values in the first column (out_gate), listfile = ['a','b','c'], and I am trying to match each of these values (a, b, c) one by one against in_gate.
For example, for out_gate 'c' -> in_gate 'b' the number of connections is 8, and for 'c' -> 'a' it is 9.
I can match out_blk and in_blk in a row with its connection number, but I find it hard to accumulate the connection numbers for each out_gate.
Is there any solution?
In plain Python you should look at the csv module for the input and a collections.defaultdict for collecting the totals:
from csv import reader
from collections import defaultdict
d = defaultdict(lambda: defaultdict(int))
with open('file.csv') as f:
    r = reader(f)
    next(r) # skip headers
    for row in r:
        if len(row) >= 3:
            x, y, count = row
            d[x][y] += int(count)
keys = sorted(d)
for x in keys:
    print(' '.join(str(d[x][y]) for y in keys))
0 4 0
2 0 4
9 8 0
If you do this for large amounts of data, you should absolutely check out numpy and pandas, which both have more effective and natural methods of handling tables than native python.
In case you only need a solution right now, the accumulation can be done straightforwardly in pure Python with collections.defaultdict:
from collections import defaultdict
con = defaultdict(int)
# connections is assumed to be an iterable of the file's lines, header first
for count, line in enumerate(connections):
    if count == 0:
        continue
    out_gate, in_gate, number = line.split(',')
    con[f"{out_gate}->{in_gate}"] += int(number)
Now you can access the entries the following way:
print(con['a->b'])
>> 4
print(con['a->c'])
>> 0
This is a one-line high-level answer via pandas.pivot_table, if you do not wish to resort to line-by-line readers and defaultdict.
import pandas as pd
df = pd.DataFrame([['a', 'b', 1], ['a', 'b', 3], ['b', 'a', 2], ['b', 'c', 4],
['c', 'a', 5], ['c', 'b', 5], ['c', 'b', 3], ['c', 'a', 4]],
columns=['out_gate', 'in_gate', 'num_connection'])
pd.pivot_table(df, index='out_gate', columns='in_gate', values='num_connection', aggfunc='sum').fillna(0)
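A slightly extended sketch of the pivot_table approach, reading the sample data from an in-memory string instead of a file (an assumption made so the example is self-contained). It also reindexes to the full set of gates so the result stays square even if some gate never appears as a source or destination:

```python
import io
import pandas as pd

csv_text = """out_gate,in_gate,num_connection
a,b,1
a,b,3
b,a,2
b,c,4
c,a,5
c,b,5
c,b,3
c,a,4
"""

df = pd.read_csv(io.StringIO(csv_text))

# every gate that occurs on either side of a connection
gates = sorted(set(df['out_gate']) | set(df['in_gate']))

table = (
    df.pivot_table(index='out_gate', columns='in_gate',
                   values='num_connection', aggfunc='sum', fill_value=0)
      .reindex(index=gates, columns=gates, fill_value=0)
)
print(table)
```

Passing `fill_value=0` to `pivot_table` fills the missing pairs directly, which avoids the separate `fillna(0)` call.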
You can use itertools.groupby:
import csv
import itertools
data = list(csv.reader(open('filename.csv')))[1:]  # [1:] skips the header row
new_data = [b+[int(a)] for *b, a in data]
final_data = {tuple(a):sum(map(lambda x:x[-1], list(b))) for a, b in itertools.groupby(sorted(new_data, key=lambda x:x[:2]), key=lambda x:x[:2])}
letters = sorted(set([i for b in final_data.keys() for i in b]))
matrix = '\n'.join([' '.join(map(str, [final_data.get((b, i), 0) for i in letters])) for b in letters])
Output:
0 4 0
2 0 4
9 8 0
I have a dataframe with an index and multiple columns. I also have a few lists containing index values sampled on certain criteria. Now I want to create columns with labels based on whether or not the index of a given row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and since I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a nested numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
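If more than two labels are ever needed, the nested numpy.where can get hard to read. A possible alternative, sketched here with the same sample lists, is numpy.select, which takes a list of conditions and a matching list of labels. This is a swapped-in technique, not the one used in the answer above, and the column `A` holds placeholder values rather than the random ones:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10)})  # placeholder data, index 0..9

random_m = [0, 1]
random_y = [2, 3]
model_m = [7, 4]
model_y = [5, 6]

# conditions are checked in order; the first match wins
conditions = [
    df.index.isin(random_m + model_m),
    df.index.isin(random_y + model_y),
]
labels = ['A', 'B']

df['2_name'] = np.select(conditions, labels, default='not_assigned')
print(df)
```

As with the nested numpy.where, each condition is a single vectorised `isin` call, so the per-row Python lambda from the question is avoided entirely.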