Count instances of a random number of n length - python

I have a pandas dataframe column containing numbers of varying length. I want to count how many six-digit numbers there are in the column, regardless of which digits they contain or their order.
Example:
import pandas as pd
df = pd.DataFrame({"number": [1234, 12345, 777777, 949494, 22, 987654]})
This should return that there are three instances of a six-digit number in the column.

I would convert the column to string, check the length of each string, and sum those whose length is 6:
(df['number'].astype(str).apply(len) == 6).sum()
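A quick runnable check of this approach, using the example frame from the question (the expected count of 3 is stated in the question):
import pandas as pd

df = pd.DataFrame({"number": [1234, 12345, 777777, 949494, 22, 987654]})
# count values whose string representation is exactly 6 characters long
print((df['number'].astype(str).apply(len) == 6).sum())  # 3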

Use np.log10 and floor division, which gives you the order of magnitude of each number, then count how many values satisfy the condition:
import numpy as np

N = 6
(np.log10(df['number']) // 1).eq(N - 1).sum()
# 3

You can use np.ceil and np.log10:
df['length'] = np.ceil(np.log10(df['number']))
Result:
number length
0 1234 4.0
1 12345 5.0
2 777777 6.0
3 949494 6.0
4 22 2.0
5 987654 6.0
To count instances use:
np.ceil(np.log10(df['number'])).eq(6).sum()
Valid only for values > 0. Note that exact powers of ten (e.g. 100000) come out one short with this formula, since their log10 is already an integer.

Related

Pandas - Equal occurrences of unique type for a column

I have a Pandas DataFrame called "DF". Given an occurrence count N = 100 and the column "Type", I would like to sample 100 rows from that column in such a way that the occurrences of each type are equally distributed.
SNo  Type      Difficulty
1    Single    5
2    Single    15
3    Single    4
4    Multiple  2
5    Multiple  14
6    None      7
7    None      4323
For instance, if I specify N = 3, the output must be:
SNo  Type      Difficulty
1    Single    5
3    Multiple  4
6    None      7
If, for a given N, the occurrences of certain types do not meet the minimum split, I can randomly increase another type's count.
I am wondering how to approach this programmatically. Thanks!
Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.
NB. This assumes that N is a multiple of the number of types if you want strict equality.
N = 3
N2 = N//df['Type'].nunique()
out = df.groupby('Type').sample(n=N2)
Handling an N that is not a multiple of the number of types
Use the same approach as above and complete up to N with random rows, excluding those already selected.
N = 5
N2, R = divmod(N, df['Type'].nunique())
out = df.groupby('Type').sample(n=N2)
out = pd.concat([out, df.drop(out.index).sample(n=R)])
As there is still a chance that the completing rows come from the same group, if you really want to ensure sampling from different groups, replace the last step with:
out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
Example output:
SNo Type Difficulty
4 5 Multiple 14
6 7 None 4323
2 3 Single 4
3 4 Multiple 2
5 6 None 7
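The snippets above assume df is built from the question's table; a minimal sketch of that construction (treating "None" as the literal string rather than a missing value):
import pandas as pd

df = pd.DataFrame({
    'SNo': [1, 2, 3, 4, 5, 6, 7],
    'Type': ['Single', 'Single', 'Single', 'Multiple', 'Multiple', 'None', 'None'],
    'Difficulty': [5, 15, 4, 2, 14, 7, 4323],
})
N = 3
out = df.groupby('Type').sample(n=N // df['Type'].nunique())  # one random row per Type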

Find index of first bigger row of current value in pandas dataframe

I have a big dataset of values as follows:
The column "bigger" should hold the index of the first row whose "bsl" is bigger than the current row's "mb". I need to do this without a loop, as it has to be done in less than a second; with a loop it takes over a minute.
For example, for the first row (with index 74729), the "bigger" value is going to be 74731. I know it can be done with LINQ in C#, but I'm almost new to Python.
Here is a text version of an example:
index bsl mb bigger
74729 47091.89 47160.00 74731.0
74730 47159.00 47201.00 74735.0
74731 47196.50 47201.50 74735.0
74732 47186.50 47198.02 74735.0
74733 47191.50 47191.50 74735.0
74734 47162.50 47254.00 74736.0
74735 47252.50 47411.50 74736.0
74736 47414.50 47421.00 74747.0
74737 47368.50 47403.00 74742.0
74738 47305.00 47310.00 74742.0
74739 47292.00 47320.00 74742.0
74740 47302.00 47374.00 74742.0
74741 47291.47 47442.50 74899.0
74742 47403.50 47416.50 74746.0
74743 47354.34 47362.50 74746.0
I'm not sure how many rows you have, but if the number is reasonable, you can perform a pairwise comparison:
# get data as arrays
a = df['bsl'].to_numpy()
b = df['mb'].to_numpy()
idx = df.index.to_numpy()
# compare values and mask the lower triangle
# to ensure comparing only against greater (or equal) indices
out = np.triu(a > b[:, None]).argmax(1)
# reindex to original indices (as float, so NaN can be stored)
idx = idx[out].astype(float)
# mask invalid indices (rows with no bigger value found)
idx[out < np.arange(len(out))] = np.nan
df['bigger'] = idx
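For reference, the output below was produced from a small toy frame which, reconstructed from that output, looks like this:
import pandas as pd

df = pd.DataFrame({'bsl': [1, 2, 3, 2, 3, 4, 5, 1],
                   'mb':  [2, 4, 3, 1, 5, 2, 1, 0]})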
Output:
bsl mb bigger
0 1 2 2.0
1 2 4 6.0
2 3 3 5.0
3 2 1 3.0
4 3 5 NaN
5 4 2 5.0
6 5 1 6.0
7 1 0 7.0

Create a custom percentile rank for a pandas series

I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values in the series less than s[i] + 0.5 * # of other values equal to s[i]) / total # of values
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5, there are 6 values less than 5 and no other values equal to 5. The rank would be (6 + 0 * 0.5)/13, or 6/13.
For the fourth element (1) it would be (0+ 2x0.5)/13 or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.
You could use NumPy broadcasting. First convert s to a NumPy column array. Then use broadcasting to count the number of items less than i, for each i. Then count the number of items equal to i, for each i (note that we need to subtract 1, since i is always equal to itself). Finally add them, divide by the total count, and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]
less_than_i_count = (s_col>tmp).sum(axis=1)
eq_to_i_count = ((s_col==tmp).sum(axis=1) - 1) * 0.5
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
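Wrapped up as a small helper function (a sketch of the same broadcasting approach; the name custom_rank is just illustrative):
import pandas as pd

def custom_rank(s):
    # rank[i] = (# of values < s[i] + 0.5 * # of other values == s[i]) / len(s)
    a = s.to_numpy()
    col = a[:, None]
    less = (col > a).sum(axis=1)                 # values strictly less than s[i]
    equal = ((col == a).sum(axis=1) - 1) * 0.5   # other values equal to s[i]
    return pd.Series((less + equal) / len(a), index=s.index)

s = pd.Series(data=[5, 3, 8, 1, 9, 4, 14, 12, 6, 1, 1, 4, 15])
print(custom_rank(s))  # matches the ranks above, e.g. 6/13 ≈ 0.461538 for the first element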

How many data points are plotted on my matplotlib graph?

So I want to count the number of data points plotted on my graph to keep track of the total graphed data. The problem is that my data table has NaN values that may appear in one column's row while another column may or may not have a NaN in the same row. For example:
# I use num1 as my y-coordinate and num1-num2 for my x-coordinate.
num1 num2 num3
1 NaN 25
NaN 7 45
3 8 63
NaN NaN 23
5 10 42
NaN 4 44
# So in this case, there should be only 2 data points on the graph between num1 and num2. For num1 and num3, there should be 3. There should be 4 data points between num2 and num3.
I believe Matplotlib doesn't graph the rows of a column that contain NaN values, since they're null (please correct me if I'm wrong; I can only tell because no dots appear at the 0 coordinate of the x and y axes). At first I thought I could get away with using .count() to find the smaller of the two columns and use that as my tracker, but realistically that won't work, as shown in my example above, because the real count can be even less than that: one column may have a NaN where the other has an actual value. Some examples of code I tried:
# both x and y are columns within the DataFrame and are used to "count"
# how many data points are being graphed
def findAmountOfDataPoints(colA, colB):
    if colA.count() < colB.count():
        print(colA.count())  # Since it's the smaller value, print the number of values in colA.
    else:
        print(colB.count())  # Since it's the smaller value, print the number of values in colB.
Also, I thought about using .value_counts(), but I'm not sure if that's the exact function I'm looking for to accomplish what I want. Any suggestions?
Edit 1: Changed Data Frame names to make example clearer hopefully.
If I understood your problem correctly, assuming that your table is a pandas DataFrame df, the following code should work:
sum((~np.isnan(df['num1']) & (~np.isnan(df['num2']))))
How it works:
np.isnan returns True if a cell is NaN. ~np.isnan is the inverse, hence it returns True when the cell is not NaN.
The code checks where both the column "num1" AND the column "num2" contain a non-NaN value; in other words, it returns True for those rows where both values exist.
Finally, those good rows are counted with sum, which takes into account only True values.
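Applied to the question's example, a quick sketch (the expected counts of 2, 3 and 4 are taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'num1': [1, np.nan, 3, np.nan, 5, np.nan],
                   'num2': [np.nan, 7, 8, np.nan, 10, 4],
                   'num3': [25, 45, 63, 23, 42, 44]})
# rows where both columns of a pair are non-NaN, i.e. points that actually get plotted
print(sum(~np.isnan(df['num1']) & ~np.isnan(df['num2'])))  # 2
print(sum(~np.isnan(df['num1']) & ~np.isnan(df['num3'])))  # 3
print(sum(~np.isnan(df['num2']) & ~np.isnan(df['num3'])))  # 4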
The way I understood it is that the number of combinations of points that are not NaN is needed. Using a function I found, I came up with this:
import pandas as pd
import numpy as np

def choose(n, k):
    """
    A fast way to calculate binomial coefficients by Andrew Dalke (contrib).
    https://stackoverflow.com/questions/3025162/statistics-combinations-in-python
    """
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0

data = {'num1': [1, np.nan, 3, np.nan, 5, np.nan],
        'num2': [np.nan, 7, 8, np.nan, 10, 4],
        'num3': [25, 45, 63, 23, 42, 44]
        }
df = pd.DataFrame(data)

df['notnulls'] = df.notnull().sum(axis=1)
df['plotted'] = df.apply(lambda row: choose(int(row.notnulls), 2), axis=1)
print(df)
print("Total data points: ", df['plotted'].sum())
With this result:
num1 num2 num3 notnulls plotted
0 1.0 NaN 25 2 1
1 NaN 7.0 45 2 1
2 3.0 8.0 63 3 3
3 NaN NaN 23 1 0
4 5.0 10.0 42 3 3
5 NaN 4.0 44 2 1
Total data points: 9

drop rows based on specific conditions

Here is a part of df:
NUMBER MONEY
12345 20
12345 -20
12345 20
12345 20
123456 10
678910 7.6
123457 3
678910 -7.6
I want to drop rows which have the same NUMBER but opposite money.
The ideal outcome would look like below:
NUMBER MONEY
12345 20
12345 20
123456 10
123457 3
Note: these entries are not in one-to-one correspondence (I mean the total number of entries for a NUMBER can be odd).
For example, there are four entries with [NUMBER] 12345: three of them have [MONEY] 20, and one has [MONEY] -20.
I just want to delete the two whose [MONEY] values are opposite, and keep the other two whose money is 20.
Here is a solution using groupby and apply with a custom function to match and delete pairs.
def remove_pairs(x):
    positive = x.loc[x['MONEY'] > 0].index.values
    negative = x.loc[x['MONEY'] < 0].index.values
    for i, j in zip(positive, negative):
        x = x.drop([i, j])
    return x
df['absvalues'] = df['MONEY'].abs()
dd = df.groupby(['NUMBER', 'absvalues']).apply(remove_pairs)
dd.reset_index(drop=True, inplace=True)
dd.drop('absvalues', axis=1, inplace=True)
'absvalue' column with the absolute values of 'MONEY' is added to perform a double index selection with groupby, and then the custom function drops rows in pairs selecting positive and negative numbers.
The two last lines just do some cleaning. Using your sample dataframe, the final result dd is:
NUMBER MONEY
0 12345 20.0
1 12345 20.0
2 123456 10.0
3 123457 3.0
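For completeness, a sketch constructing the sample data above so the snippet can be run end to end (the two 678910 entries cancel out and are both dropped):
import pandas as pd

df = pd.DataFrame({
    'NUMBER': [12345, 12345, 12345, 12345, 123456, 678910, 123457, 678910],
    'MONEY':  [20, -20, 20, 20, 10, 7.6, 3, -7.6],
})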
