Translating R's which() function and a cl column comparison to pandas - Python

I am trying to translate this:
n <- NROW(train)
s <- which(train$cl[-n] == state)
I know that which is just a comparison, so I believe that in pandas I could just do:
n = train.count()
s = train['-n'] == state
I am really not sure how to translate cl from R to pandas. Thanks!

If you need the size of the DataFrame, use:
n = len(train)
Or:
n = len(train.index)
Or:
n = train.shape[0]
For the second line, note that R's train$cl[-n] selects the cl column without its last element, so the pandas equivalent of the comparison is:
s = train['cl'].iloc[:-1] == state
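R's which() then returns the positions where that comparison is TRUE, and numpy.where is the usual NumPy/pandas analogue. A minimal sketch of the full translation (the cl column name and the sample data below are assumptions for illustration; NumPy positions are 0-based while R's which() is 1-based):
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`, just for illustration
train = pd.DataFrame({'cl': ['A', 'B', 'A', 'B', 'A']})
state = 'A'

n = len(train)                           # R: n <- NROW(train)
mask = train['cl'].iloc[:-1] == state    # R: train$cl[-n] == state (drop the last row)
s = np.where(mask)[0]                    # R: which(...) - positional indices, 0-based here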

Related

Python: separate an array of complex numbers into real and imaginary parts, outputting an 8x8 matrix

import numpy as np
import pandas as pd
import cmath
a = np.array([[complex(3,6),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,7),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,8),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,9),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,1),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,2),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,3),complex(7,9),complex(2,8),complex(6,5)],
[complex(3,4),complex(7,9),complex(2,8),complex(6,5)],
])
l = np.array(['eval1_real','eval2_real','eval3_real','eval4_real','eval1_imag','eval2_imag','eval3_imag','eval4_imag'])
x = 1
for i in range(0, len(a),1):
    w = a[i]
    e1r = w[0].real
    e1c = w[0].imag
    e2r = w[1].real
    e2c = w[1].imag
    e3r = w[2].real
    e3c = w[2].imag
    e4r = w[3].real
    e4c = w[3].imag
    p = np.array([e1r, e1c, e2r, e2c, e3r, e3c, e4r, e4c])
    m = np.insert(l, x, p, 0)
    x = x + 1
I tried a for loop to separate the parts, but I cannot get the numbers to come together into a full matrix.
Is there a way to separate them without a loop, or an array function that will put them together?
You should use NumPy's built-in functions for element-wise operations on all elements. You can try:
result = np.dstack(
    np.apply_along_axis(
        lambda x: [x.real, x.imag], 0, a)
).flatten().reshape(8, 8)
See the documentation for numpy.apply_along_axis and numpy.dstack.
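As a further alternative, here is a minimal sketch (reusing the array a and label array l from above, and assuming the grouped column order of l is the layout you want): the .real and .imag attributes are already vectorized, so the 8x8 matrix can be assembled directly without apply_along_axis:
# Real parts of all four values, then imaginary parts, per row -> shape (8, 8)
grouped = np.hstack([a.real, a.imag])
result_df = pd.DataFrame(grouped, columns=l)   # columns follow the order of `l`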

GroupBy + Condition + Mean()

Suppose we have 3 columns, A, B and C. I need to group by A, but only for rows where B is in a range (B > 0 and B < 20), and then calculate the mean of C for that set.
Can you help me?
Thanks very much!
Try this:
import pandas as pd

data = pd.read_csv('rows.csv')
temp = []
for val in data['PPV']:
    if val < 20:
        temp.append(1)
    elif 20 < val < 40:
        temp.append(2)
    else:
        temp.append(3)
data['temp'] = temp
output = data.groupby(['Responsable', 'temp'])['Yield'].mean()
print(output)
You will need to adapt the column names to your data. You can also do this more elegantly with numpy.digitize.
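If the columns really are named A, B and C as in the question, the filter-then-group step can also be done directly. A minimal sketch with made-up data (the column names and values here are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': [5, 30, 10],
                   'C': [1.0, 2.0, 3.0]})
# Keep only rows with 0 < B < 20, then take the mean of C per group of A
result = df[(df['B'] > 0) & (df['B'] < 20)].groupby('A')['C'].mean()
print(result)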

Filtering a dataframe in a more efficient way

How can I write the following code in a more pandas-idiomatic way?
majority_df = df[(df.voting_majority_status_fk == 4) & (df.other == True)]
minority_df = df[(df.voting_majority_status_fk == 3)]
I need to take only the vp_fk values that are in majority_df and not in minority_df, and then keep only the unique rows of majority_df for those vp_fk values. This is what I have now; how can I write it in a more pandas-idiomatic way?
majority_vp_fk = set(majority_df.vp_fk)
minority_vp_fk = set(minority_df.vp_fk)
clean_majority_vp_fk = majority_vp_fk - minority_vp_fk
clean_majority_df = majority_df[majority_df.vp_fk.isin(clean_majority_vp_fk)]
clean_majority_df = clean_majority_df.drop_duplicates(subset=['probe_fk', 'vp_fk', 'masking_box_fk', 'product_fk'])
Here is my somewhat theoretical solution (it's hard to test without a sample data set):
minority_df = df[(df.voting_majority_status_fk == 3)]
qry = "voting_majority_status_fk == 4 and other == True and vp_fk not in @minority_df.vp_fk"
result = (df.query(qry)
            .drop_duplicates(subset=['probe_fk', 'vp_fk', 'masking_box_fk', 'product_fk']))
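An equivalent sketch that stays with boolean indexing instead of query(), under the same assumptions about the column names, would be:
majority_mask = (df.voting_majority_status_fk == 4) & (df.other)
minority_vp_fk = df.loc[df.voting_majority_status_fk == 3, 'vp_fk'].unique()
clean_majority_df = (df[majority_mask & ~df.vp_fk.isin(minority_vp_fk)]
                     .drop_duplicates(subset=['probe_fk', 'vp_fk', 'masking_box_fk', 'product_fk']))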

Compute "concentration" of pandas categoricals

I'm having a problem with an old function computing the concentration of pandas categorical columns. There seems to have been a change making it impossible to subset the result of the .value_counts() method of a categorical series.
Minimal non-working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":["a","b","c","a"]})
def get_concentration(df, cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.keys():
        single = np.square(np.divide(float(counts[key]), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
get_concentration(df, "A")
This results in a KeyError for counts["a"]. I'm quite sure this worked in a past version of pandas, and the documentation doesn't seem to mention a change regarding the .value_counts() method.
Let's agree on methodology:
>>> df.A.value_counts()
a 2
b 1
c 1
>>> obs = len(df['A'].astype('category'))
>>> obs
4
The concentration should be as follows (per the Herfindahl Index):
>>> (2 / 4.) ** 2 + (1 / 4.) ** 2 + (1 / 4.) ** 2
0.375
Which is equivalent to (Pandas 0.17+):
>>> ((df.A.value_counts() / df.A.count()) ** 2).sum()
0.375
If you really want a function:
def concentration(df, col):
    return ((df[col].value_counts() / df[col].count()) ** 2).sum()
>>> concentration(df, 'A')
0.375
Since you're iterating in a loop (and not working in a vectorized way), you might as well just explicitly iterate over pairs. It simplifies the syntax, IMHO:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":["a","b","c","a"]})
def get_concentration(df, cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    # See change in the following line - you're iterating over
    # key-value pairs anyway, so why not do so explicitly?
    for k, v in counts.to_dict().items():
        single = np.square(np.divide(float(v), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
>>> get_concentration(df, "A")
0.375
To fix the current function, you just need to access the index values using .ix (see below). You might be better off using a vectorized function - I've added one at the end.
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

def get_concentration(df, cat):
    tmp = df[cat].astype('category')
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.index:
        single = np.square(np.divide(float(counts.ix[key]), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
yields:
get_concentration(df, "A")
0.375
You might want to try a vectorized version, which also doesn't necessarily need the category dtype, such as:
def get_concentration(df, cat):
    counts = df[cat].value_counts()
    return counts.div(counts.sum()).pow(2).sum()
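Equivalently, as one more sketch, value_counts(normalize=True) returns the relative frequencies directly, so the Herfindahl concentration agreed on above can be written as:
def concentration(df, col):
    # relative frequencies, squared and summed (Herfindahl index)
    return (df[col].value_counts(normalize=True) ** 2).sum()

>>> concentration(df, 'A')
0.375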

Pandas optimizing an interpolation/counting algorithm

I have a bunch of data (10M + records) that breaks down to an identifier, a location and a date. I want to find the number of times that any identifier moved from some locationA to some other locationB over the entire set of dates. Any identifier may not have a location for all possible dates. When an identifier does not have a location recorded, that should be treated as an actual 'unknown' location for that date.
Here is some reproducible fake data...
import numpy as np
import pandas as pd
import datetime
base = datetime.date.today()
num_days = 50
dates = np.array([base - datetime.timedelta(days=x) for x in range(num_days-1, -1, -1)])
ids = np.arange(50)
mi = pd.MultiIndex.from_product([ids, dates])
locations = np.array([chr(x) for x in 97 + np.random.randint(26, size=len(mi))])
s = pd.Series(locations, index=mi)
mask = np.random.rand(len(mi)) > .5
s[mask] = np.nan
s = s.dropna()
My initial thought was to create a dataframe and use boolean masking/vectorized operations to solve this:
df = s.unstack(0).fillna('unknown')
Apparently my data is sparse enough to cause a MemoryError (from all the extra entries resulting from unstacking).
My current working solution is the following:
def series_fn(s):
    s = s.reindex(pd.date_range(s.index.levels[1].min(),
                                s.index.levels[1].max()), level=-1).fillna('unknown')
    mask_prev = (s != s.shift(-1))[:-1]
    mask_next = (s != s.shift())[1:]
    s_prev = s[:-1][mask_prev]
    s_next = s[1:][mask_next]
    s_tup = pd.Series(list(zip(s_prev, s_next)))
    return s_tup.value_counts()
result_per_id = s.groupby(level=0).apply(series_fn)
result = result_per_id.sum(level=-1)
result looks like
(a, b) 1
(a, c) 5
(a, e) 3
(a, f) 3
(a, g) 3
(a, h) 3
(a, i) 1
(a, j) 1
(a, k) 2
(a, l) 2
...
This is going to take ~5 hours for all my data. Does anyone know any faster ways of doing this?
Thanks!
Hmmm, I guess I should have transposed the data... well that was a relatively simple fix. Instead of using groupby and apply,
import time

s = s.reorder_levels(['date', 'id'])
s = s.sortlevel(0)
results = []
for i in range(len(s.index.levels[0]) - 1):
    t = time.time()
    s0 = s.loc[s.index.levels[0][i]]
    s1 = s.loc[s.index.levels[0][i + 1]]
    df = pd.concat((s0, s1), axis=1)
    # Note: this is slower than the line above
    # df = s.loc[s.index.levels[0][0:2], :].unstack(0)
    df = df.fillna('unknown')
    mi = pd.MultiIndex.from_arrays((df.iloc[:, 0], df.iloc[:, 1]))
    s2 = pd.Series(1, mi)
    res = s2.groupby(level=[0, 1]).apply(np.sum)
    results.append(res)
    print(time.time() - t)
results = pd.concat(results, axis=1)
I'm still unclear on why the commented-out section takes about three times as long as the three lines above it.
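As a further sketch (untested against the full 10M-row data), the per-id previous/next pairs can also be built with groupby + shift and counted in one pass. This assumes the original s, ids and dates from the question setup (s indexed by id then date, before the reorder_levels above) and that the global date range is appropriate for every id:
# Restore every (id, date) combination, filling the gaps with 'unknown'
full = s.reindex(pd.MultiIndex.from_product([ids, dates]), fill_value='unknown')
pairs = full.rename('curr').to_frame()
# Next day's location within each id (dates are already ascending per id)
pairs['next'] = pairs.groupby(level=0)['curr'].shift(-1)
pairs = pairs.dropna(subset=['next'])
# Count only actual moves, i.e. rows where the location changed
result = pairs[pairs['curr'] != pairs['next']].groupby(['curr', 'next']).size()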
