I am trying to get a list of tuples with the first and last index of grouped NaNs.
An example input could look like
import pandas as pd
import numpy as np
series = pd.Series([1,2,3,np.nan,np.nan,4,5,6,7,np.nan,8,9,np.nan,np.nan,np.nan])
get_nan_inds(series)
and the output should be
[(3, 5), (9, 10), (12, 15)]
The only similar question I could find doesn't solve my problem.
Alternative solution:
import pandas as pd
import numpy as np
series = pd.Series([1,2,3,np.nan,np.nan,4,5,6,7,np.nan,8,9,np.nan,np.nan,np.nan])
def get_nan_inds(series):
    # Append False so a NaN group is still closed when the last element is null
    is_null_diff = pd.isnull(pd.Series(list(series) + [False])).diff()
    res = [i for i, x in enumerate(list(is_null_diff)) if x is True]
    res = [(a, b) for i, (a, b) in enumerate(zip(res, res[1:])) if i % 2 == 0]
    return res
get_nan_inds(series)
While writing this question, I came up with the following function, in case someone else has a similar problem.
def get_nan_inds(series):
    ''' Obtain the first and last index of each consecutive NaN group.
    '''
    series = series.reset_index(drop=True)
    index = series[series.isna()].index.to_numpy()
    if len(index) == 0:
        return []
    indices = np.split(index, np.where(np.diff(index) > 1)[0] + 1)
    return [(ind[0], ind[-1] + 1) for ind in indices]
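As a cross-check on the functions above, the same half-open (start, stop) pairs can be found by looking for transitions in a padded boolean mask. This is only a sketch and not the original poster's method; the name nan_runs is made up for illustration, and it assumes positional indices are wanted rather than index labels.

```python
import numpy as np
import pandas as pd

def nan_runs(series):
    # Boolean mask of NaN positions, padded with False on both sides so that
    # runs touching either end of the series still produce a transition.
    mask = series.isna().to_numpy()
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the mask flips: even entries are starts, odd are stops.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(start), int(stop)) for start, stop in zip(edges[::2], edges[1::2])]

series = pd.Series([1, 2, 3, np.nan, np.nan, 4, 5, 6, 7, np.nan, 8, 9, np.nan, np.nan, np.nan])
print(nan_runs(series))  # [(3, 5), (9, 10), (12, 15)]
```

Padding the mask avoids special-casing a series that starts or ends with NaN.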
import numpy as np
import pandas as pd
import cmath
a = np.array([[complex(3,6),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,7),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,8),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,9),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,1),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,2),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,3),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,4),complex(7,9),complex(2,8),complex(6,5)],
])
l = np.array(['eval1_real','eval2_real','eval3_real','eval4_real','eval1_imag','eval2_imag','eval3_imag','eval4_imag'])
x = 1
for i in range(0, len(a),1):
    w = a[i]
    e1r = w[0].real
    e1c = w[0].imag
    e2r = w[1].real
    e2c = w[1].imag
    e3r = w[2].real
    e3c = w[2].imag
    e4r = w[3].real
    e4c = w[3].imag
    p = np.array([e1r, e1c, e2r, e2c, e3r, e3c, e4r, e4c])
    m = np.insert(l,x,p,0)
    x = x + 1
I tried a for loop to separate the real and imaginary parts, but I cannot get those numbers to combine into a full matrix.
Is there a way to separate them all at once without a loop, or some array function I can use to put them together?
NumPy's built-in functions operate element-wise on whole arrays, so no explicit loop is needed. You can try:
result = np.dstack(
np.apply_along_axis(
lambda x: [x.real, x.imag], 0, a)
).flatten().reshape(8,8)
numpy.apply_along_axis
numpy.dstack
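A shorter route to the same interleaved layout may be to use the .real and .imag attributes of the array directly. The following is a sketch under the assumption that each output row should read real1, imag1, real2, imag2, and so on, as in the loop from the question; a two-row array stands in for the full 8x4 input.

```python
import numpy as np

a = np.array([[3 + 6j, 7 + 9j, 2 + 8j, 6 + 5j],
              [3 + 7j, 7 + 9j, 2 + 8j, 6 + 5j]])

# Pair each real part with its imaginary part along a new last axis: shape (2, 4, 2)
pairs = np.stack([a.real, a.imag], axis=-1)
# Flatten the last two axes so each row becomes real1, imag1, real2, imag2, ...
result = pairs.reshape(a.shape[0], -1)
print(result)
```

For the 8x4 array in the question this gives an 8x8 float array without calling a Python function per column.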
I'm having a problem with an old function computing the concentration of pandas categorical columns. There seems to have been a change making it impossible to subset the result of the .value_counts() method of a categorical series.
Minimal non-working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":["a","b","c","a"]})
def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.keys():
        single = np.square(np.divide(float(counts[key]),float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
get_concentration(df, "A")
This results in a key error for counts["a"]. I'm quite sure this worked in a past version of pandas and the documentation doesn't seem to mention a change regarding the .value_counts() method.
Let's agree on methodology:
>>> df.A.value_counts()
a 2
b 1
c 1
>>> obs = len(df['A'].astype('category'))
>>> obs
4
The concentration should be as follows (per the Herfindahl Index):
>>> (2 / 4.) ** 2 + (1 / 4.) ** 2 + (1 / 4.) ** 2
0.375
Which is equivalent to (Pandas 0.17+):
>>> ((df.A.value_counts() / df.A.count()) ** 2).sum()
0.375
If you really want a function:
def concentration(df, col):
    return ((df[col].value_counts() / df[col].count()) ** 2).sum()
>>> concentration(df, 'A')
0.375
Since you're iterating in a loop (rather than using vectorized operations), you might as well iterate over the key-value pairs explicitly. It simplifies the syntax, IMHO:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":["a","b","c","a"]})
def get_concentration(df,cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    # See change in following line - you're anyway iterating
    # over key-value pairs; why not do so explicitly?
    for k, v in counts.to_dict().items():
        single = np.square(np.divide(float(v),float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
>>> get_concentration(df, "A")
0.375
To fix the current function, you just need to access the values by index label using .loc (see below; the older .ix indexer has been removed from pandas). You might be better off using a vectorized function; I've added one at the end.
df = pd.DataFrame({"A":["a","b","c","a"]})
def get_concentration(df, cat):
    tmp = df[cat].astype('category')
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.index:
        single = np.square(np.divide(float(counts.loc[key]), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
yields:
>>> get_concentration(df, "A")
0.375
You might want to try a vectorized version, which also doesn't necessarily need the category dtype, such as:
def get_concentration(df, cat):
    counts = df[cat].value_counts()
    # divide by the number of observations, not by the number of distinct values
    return counts.div(counts.sum()).pow(2).sum()
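Since the concentration is just the sum of squared shares, the shares can also be requested directly with value_counts(normalize=True). A small sketch on the example frame from the question; the function name concentration_from_shares is chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

def concentration_from_shares(df, col):
    # normalize=True returns each value's share of the non-null observations
    shares = df[col].value_counts(normalize=True)
    return float((shares ** 2).sum())

print(concentration_from_shares(df, "A"))  # 0.375
```

This should behave the same whether or not the column has the category dtype, because no label lookup on the counts is involved.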
I need to split a DataFrame into 10 parts, then use one part as the test set and the remaining 9 (merged) as the training set. I have come up with the following code, which splits the dataset, and I am trying to merge the remaining parts after picking one of the 10.
The first iteration goes fine, but I get an error in the second iteration.
df = pd.DataFrame(np.random.randn(10, 4), index=list(range(10)))
for x in range(3):
    dfList = np.array_split(df, 3)
    testdf = dfList[x]
    dfList.remove(dfList[x])
    print(testdf)
    traindf = pd.concat(dfList)
    print(traindf)
    print("================================================")
I don't think you have to split the dataframe into 10 parts, just into 2.
I use this code for splitting a dataframe in training set and validation set:
test_index = np.random.choice(df.index, int(len(df.index)/10), replace=False)
test_df = df.loc[test_index]
train_df = df.loc[~df.index.isin(test_index)]
Okay, I got it working this way:
df = pd.DataFrame(np.random.randn(10, 4), index=list(range(10)))
dfList = np.array_split(df, 3)
for x in range(3):
    trainList = []
    for y in range(3):
        if y == x:
            testdf = dfList[y]
        else:
            trainList.append(dfList[y])
    traindf = pd.concat(trainList)
    print(testdf)
    print(traindf)
    print("================================================")
But a better approach is welcome.
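One possible simplification is to split row positions instead of the frame itself and drop the test rows to obtain the training set. This is a sketch rather than a definitive recipe; it assumes the index is unique, and n_folds and folds are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))
n_folds = 3

# Split the row positions 0..len(df)-1 into n_folds nearly equal chunks
fold_positions = np.array_split(np.arange(len(df)), n_folds)

folds = []
for test_pos in fold_positions:
    test_df = df.iloc[test_pos]
    # Everything that is not in the test chunk becomes the training set
    train_df = df.drop(df.index[test_pos])
    folds.append((train_df, test_df))

for train_df, test_df in folds:
    print(len(train_df), len(test_df))
```

Working with positions avoids comparing DataFrames with each other, which is what list.remove does in the first attempt.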
You can use the permutation function from numpy.random
import numpy as np
import pandas as pd
import math as mt
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df = pd.DataFrame({'a': l, 'b': l})
# shuffle the dataframe index
shuffled_idx = np.random.permutation(df.index)
# divide the shuffled index into N equal(ish) parts
# for this example, let N = 4
N = 4
n = len(shuffled_idx) / N
parts = []
for j in range(N):
    parts.append(shuffled_idx[mt.ceil(j*n): mt.ceil(j*n+n)])
# to show each shuffled part of the data frame
for k in parts:
    print(df.iloc[k])
I wrote a script (available to find or fork on GitHub) for splitting a pandas DataFrame randomly. See also the pandas documentation on merge, join, and concatenate.
Same code for your reference:
import pandas as pd
import numpy as np
from xlwings import Sheet, Range, Workbook
#path to file
df = pd.read_excel(r"//PATH TO FILE//")
df.columns = [c.replace(' ',"_") for c in df.columns]
x = df.columns[0].encode("utf-8")
#number of parts the data frame or the list needs to be split into
n = 7
seq = list(df[x])
np.random.shuffle(seq)
lists1 = [seq[i:i+n] for i in range(0, len(seq), n)]
listsdf = pd.DataFrame(lists1).reset_index()
dataframesDict = dict()
# calling xlwings workbook function
Workbook()
for i in range(0,n):
    if Sheet.count() < n:
        Sheet.add()
    # Column_Name is a placeholder for the column being matched
    dataframesDict[i] = df.loc[df.Column_Name.isin(list(listsdf[listsdf.columns[i+1]]))]
    Range(i,"A1").value = dataframesDict[i]
It looks like you are trying to do a k-fold style split rather than a one-off. This code should help. scikit-learn's k-fold functionality may also work in your case and is worth checking out.
# Split dataframe by rows into n roughly equal portions and return list of
# them.
def splitDf(df, n) :
    splitPoints = list(map( lambda x: int(x*len(df)/n), (list(range(1,n)))))
    splits = list(np.split(df.sample(frac=1), splitPoints))
    return splits
# Take splits from splitDf, and return into test set (splits[index]) and training set (the rest)
def makeTrainAndTest(splits, index) :
    # index is zero based, so range 0-9 for 10 fold split
    test = splits[index]
    leftLst = splits[:index]
    rightLst = splits[index+1:]
    train = pd.concat(leftLst+rightLst)
    return train, test
You can then use these functions to make the folds
df = <my_total_data>
n = 10
splits = splitDf(df, n)
trainTest = []
for i in range(0,n) :
    trainTest.append(makeTrainAndTest(splits, i))
# Get test set 2
test2 = trainTest[2][1]
# Get training set zero
train0 = trainTest[0][0]
I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying NumPy machinery, it is most likely going to be faster than anything you can do in a Python loop.
import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order] # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]
#max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order] #this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
In pure Python:
from itertools import groupby
from operator import itemgetter as ig
print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]
A variation:
print([data[id==i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]
Based on @Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and Numpy but, it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.full(data_id.max() + 1, -np.inf) # start from -inf; np.empty would leave arbitrary values for maximum.at to compare against
np.maximum.at(ans,data_id,data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.88189818, 0.49800668, 0.77982853, 0.9288989 , 0.10956681,
0.73215416])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge...presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
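For ids that are large, sparse, or otherwise unsuitable as direct indices, one option hinted at above is to remap them to 0..n-1 with np.unique(..., return_inverse=True) first. A minimal sketch with made-up ids and values (ids, vals, keys, and group_max are illustrative names, not from the answer):

```python
import numpy as np

ids = np.array([10, 10, 500, 500, 500, 7])
vals = np.array([2.0, 7.0, 3.0, 8.0, 9.0, 1.0])

# keys holds the sorted unique ids; inverse maps every element to its key position
keys, inverse = np.unique(ids, return_inverse=True)

# Start from -inf so that any real value replaces the initial entry
group_max = np.full(len(keys), -np.inf)
np.maximum.at(group_max, inverse, vals)

print(keys)       # [  7  10 500]
print(group_max)  # [1. 7. 9.]
```

The result array then has one slot per distinct id rather than one per possible id value.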
With only NumPy and without loops (pandas is used only for the comparison at the end):
import numpy as np
import pandas as pd
id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface, and it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster though, considering the max/min isn't explicitly computed. The ability to ignore nans of the accepted solution is neat; but one may also simply assign nan values a dummy key.
import numpy as np
def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
    """
    #upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    #first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    #the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1))
    #first entry of each bin is a unique key
    unique_keys = key[slices]
    #reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    #number of counts per key is the difference of our slice points. cap off with number of keys for last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print('per group mean')
print(reduced_values / key_count)
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print('per group min')
print(reduced_values)
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print('per group max')
print(reduced_values)
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
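The preprocessing mentioned above could look roughly like this: collect the values of each id into its own list in a single pass, then take the max per list. This is a sketch with the question's sample values; ids, groups, and group_max are names chosen here for illustration.

```python
from collections import defaultdict

ids = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]

# One pass over the data: append each value to the list for its id
groups = defaultdict(list)
for key, value in zip(ids, data):
    groups[key].append(value)

# One max per id, reported in sorted id order
group_max = [max(groups[key]) for key in sorted(groups)]
print(group_max)  # [7, 10, 1]
```

Each element is visited once during grouping, instead of once per distinct id.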
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the latest value of x for each value of o, so that r[[0, 0]] = [1, 2] leaves r[0] == 2. It requires that your groups are integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result
def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result