Cluster similar - but not identical - digits in pandas dataframe - python

I have a pandas dataframe with 2M+ rows. One of the columns, pin, contains a 14-digit number.
I'm trying to cluster similar — but not identical — digits. Specifically, I want to match the first 10 digits without regard to the final four. The pin column was imported as an int then converted to a string.
Put another way, the first 10 digits should match but the final four shouldn't. Duplicates of exact-matching pins should be dropped.
For example these should all be grouped together:
17101110141403
17101110141892
17101110141763
17101110141199
17101110141788
17101110141851
17101110141831
17101110141487
17101110141914
17101110141843
Desired output:
Biggest cluster | other columns
Second biggest cluster | other columns
...and so on | other columns
I've tried using a combination of groupby and regex without success.
pat2 = '1710111014\d\d\d\d'
pat = '\d\d\d\d\d\d\d\d\d\d\d\d\d\d'
grouped = df2.groupby(df2['pin'].str.extract(pat, expand=False), axis= 1)
and
df.groupby(['pin']).filter(lambda group: re.match > 1)
Here's a link to the original data set: https://datacatalog.cookcountyil.gov/Property-Taxation/Assessor-Parcel-Sales/wvhk-k5uv

It's not clear why you need regex for this. What about the following, assuming pin is stored as a string? (Note that you haven't shown your exact expected output.)
pin
0 17101110141403
1 17101110141892
2 17101110141763
3 17101110141199
4 17101110141788
5 17101110141851
6 17101110141831
7 17101110141487
8 17101110141914
9 17101110141843
df.groupby(df['pin'].str[:10]).size()
pin
1710111014 10
dtype: int64
If you want this information appended back to your original dataframe, you can use
df['size'] = df.groupby(df['pin'].astype(str).str[:10])['pin'].transform(len)
pin size
0 17101110141403 10
1 17101110141892 10
2 17101110141763 10
3 17101110141199 10
4 17101110141788 10
5 17101110141851 10
6 17101110141831 10
7 17101110141487 10
8 17101110141914 10
9 17101110141843 10
Then, assuming you have more columns, you can sort your dataframe by cluster size (largest first) with
df.sort_values('size', ascending=False)
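If you also need the exact-duplicate pins dropped and the biggest clusters listed first, as described in the question, a minimal end-to-end sketch could look like this (assumptions: pin is stored as a string and the duplicate check is on the full 14-digit pin):
df = df.drop_duplicates(subset='pin')                   # drop exact-matching pins
key = df['pin'].astype(str).str[:10]                    # cluster key: first 10 digits
df['size'] = df.groupby(key)['pin'].transform('size')   # cluster size per row
df = df.sort_values('size', ascending=False)            # biggest cluster first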

Related

Pandas - Equal occurrences of unique type for a column

I have a Pandas DataFrame called "DF". Given an occurrence count, say N = 100, and a column, say "Type", I would like to sample a total of 100 rows such that each type occurs an equal number of times.
SNo      Type  Difficulty
  1    Single           5
  2    Single          15
  3    Single           4
  4  Multiple           2
  5  Multiple          14
  6      None           7
  7      None        4323
For instance, if I specify N = 3, the output must be:
SNo      Type  Difficulty
  1    Single           5
  3  Multiple           4
  6      None           7
If, for a given N, the occurrences of certain types do not meet the minimum split, another type's count can be randomly increased.
I am wondering how to approach this programmatically. Thanks!
Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.
NB: this assumes N is a multiple of the number of types if you want strict equality.
N = 3
N2 = N//df['Type'].nunique()
out = df.groupby('Type').sample(n=N2)
Handling N that is not a multiple of the number of types:
Use the same approach as above and complete to N with random rows, excluding those already selected.
N = 5
N2, R = divmod(N, df['Type'].nunique())
out = df.groupby('Type').sample(n=N2)
out = pd.concat([out, df.drop(out.index).sample(n=R)])
As there is still a chance that the extra rows come from the same group, if you really want to ensure sampling from different groups, replace the last step with:
out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
Example output:
SNo Type Difficulty
4 5 Multiple 14
6 7 None 4323
2 3 Single 4
3 4 Multiple 2
5 6 None 14
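For reference, a self-contained sketch (assuming pandas ≥ 1.1 for groupby.sample) that rebuilds the example frame from the question and applies the steps above:
import pandas as pd

df = pd.DataFrame({'SNo': [1, 2, 3, 4, 5, 6, 7],
                   'Type': ['Single', 'Single', 'Single', 'Multiple', 'Multiple', 'None', 'None'],
                   'Difficulty': [5, 15, 4, 2, 14, 7, 4323]})

N = 5
N2, R = divmod(N, df['Type'].nunique())                  # rows per type, plus remainder
out = df.groupby('Type').sample(n=N2)                    # N2 rows from each type
out = pd.concat([out, df.drop(out.index).sample(n=R)])   # top up to N with unused rows
print(out)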

Pandas: How to convert series with an integer/fraction mix into a whole number

So I'm iterating through Excel columns containing numbers, and I'm trying to round all the numbers using .apply(pd.to_numeric).round()
This has always worked for me, but recently some of the Excel files contain columns with numbers mixed with fractions (e.g. 27 3/8, 50 17/32). When my script runs, I get "Unable to parse string "50 17/32" at position 0".
Suppose this is my series:
0 250.25
1 32.75
2 64
3 50 17/32
4 16 3/8
Name: Qty, dtype: object
Desired result:
0 250
1 33
2 64
3 51
4 16
Name: Qty, dtype: object
I'm trying to split the column on the whitespace and then somehow add the two resulting columns together, but I'm running into all sorts of issues. The code below sort of works, but my original 'Qty' column ends up with a bunch of NaNs instead of the original numbers for rows where there is no delimiter character.
df['Qty'] = df['Qty'].fillna(value=np.nan)
df[['Qty','Fraction']] = df['Qty'].str.split(' ', expand=True)
Interestingly, it does properly split the rows with an integer-fraction mix, but turning certain rows to NaN for reasons I don't understand is throwing me off. Another thing I've tried is using lambda functions, but from what I can gather, those work best when it's just a plain fraction like 3/8, without an integer in front of it. I've been researching for hours and I'm close to giving up, so if anyone has a clue how to go about this, I'd love to know.
Thanks
Here is one approach using fractions.Fraction:
from fractions import Fraction

# split each value into an optional whole/decimal part and an optional fraction part
df2 = df['Qty'].str.extract(r'(\d+(?:\.\d+)?)?\s*(\d+/\d+)?')

# add the two parts back together as floats
out = (pd.to_numeric(df2[0], errors='coerce')
       + df2[1].fillna(0).apply(lambda x: float(Fraction(x))))

df['float'] = out
df['int'] = out.round().astype(int)
output:
Qty float int
0 250.25 250.25000 250
1 32.75 32.75000 33
2 64 64.00000 64
3 50 17/32 50.53125 51
4 16 3/8 16.37500 16
Alternative using arithmetic:
df2 = df['Qty'].str.extract(r'(\d+(?:\.\d+)?)?\s*(?:(\d+)/(\d+))?').astype(float)
df['int'] = (df2[0] + df2[1].fillna(0) / df2[2].fillna(1)).round().astype(int)
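For anyone who wants to test this quickly, here is a self-contained sketch of the Fraction approach run against the example series from the question:
import pandas as pd
from fractions import Fraction

qty = pd.Series(['250.25', '32.75', '64', '50 17/32', '16 3/8'], name='Qty')

parts = qty.str.extract(r'(\d+(?:\.\d+)?)?\s*(\d+/\d+)?')        # whole part, fraction part
whole = pd.to_numeric(parts[0], errors='coerce').fillna(0)
frac = parts[1].fillna('0').apply(lambda x: float(Fraction(x)))
print((whole + frac).round().astype(int))                        # 250, 33, 64, 51, 16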

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}

df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate one fruit, say 'pears', and subtract its current sales from each of its average sales values.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works, by creating a temporary column holding pear_sales of 40, backfill it and then use it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I do get the common warning saying I should use .loc[row_indexer, col_indexer], even though the output still works.
For the second set of operations, I need to add 5 rows (new_purchases) to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear) - 1, len(df_pear) - new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again I'm worried about readability, since there's another temporary column created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm in doubt whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function after wrk2 is created, e.g. as sketched below.
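A possible way to do that (the '_new_purchase' label is only an illustration, not something from the question) is to overwrite the column right after wrk2 is created, inside reformat:
    # grp.name holds the current Fruit value inside groupby().apply()
    wrk2 = wrk2.assign(Description=grp.name + '_new_purchase')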

How to replace 3 largest values sorted by index in a column

I have an easy question, but I've been struggling with the answer. I have a DataFrame in which I want to replace the 3 largest values with their 7-day rolling means, but in index order. So for a DataFrame like this one:
Sales
2
4
6
8
10
12
14
100
100
200
I want to first replace the two rows with 100 in Sales, and then the row with 200. I tried the following:
df.Sales.replace(df.Sales.nlargest(3).sort_index(),df.Sales.rolling(window=7).mean())
But it brings the following error:
AttributeError: 'numpy.float64' object has no attribute 'replace'
I know that this works:
df.Sales.replace(df.Sales.max(),df.Sales.rolling(window=7).mean())
And I could do that 3 times, but I have the problem that it would replace 200 first, and then the others, so it isn't exactly what I need.
I guess something like this would work:
for i in df.Sales.nlargest(3).sort_index():
    df.Sales.replace(i, df.Sales.rolling(window=7).mean())
But I would rather avoid loops. Is it possible?
EDIT: expected output would be:
Sales
2
4
6
8
10
12
14
8
8.86
9.55
In other words, replace the first 100 with the average of 2 through 14, which is 8. Then replace the second 100 with the average of 4 through the newly inserted 8, which is 8.86, and so on.
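One possible sketch that reproduces that expected output (just an illustration, not a definitive answer): because each replacement has to feed into the next rolling window, a short loop over only the three affected positions seems hard to avoid; shifting the rolling mean by one makes the window cover the 7 values before the row being replaced.
import pandas as pd

df = pd.DataFrame({'Sales': [2, 4, 6, 8, 10, 12, 14, 100, 100, 200]}).astype(float)

# the 3 largest values, processed in index order
for i in df['Sales'].nlargest(3).sort_index().index:
    df.loc[i, 'Sales'] = df['Sales'].rolling(window=7).mean().shift(1).loc[i]

print(df['Sales'].round(2))   # ..., 8.0, 8.86, 9.55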

counting T/F values for several conditions

I am a beginner with pandas.
I'm looking for mutations in several patients, and I have 16 different conditions. I wrote code for this, but how can I do it with a for loop? I look for the changes in the MUT column and mark them as True or False, then count the True/False values. I have only done this for 4 conditions.
Can you suggest a simpler way, instead of writing the same code 16 times?
s1 = df["MUT"]
A_T = s1.str.contains("A:T")
ATnum = A_T.value_counts(sort=True)

s2 = df["MUT"]
A_G = s2.str.contains("A:G")
AGnum = A_G.value_counts(sort=True)

s3 = df["MUT"]
A_C = s3.str.contains("A:C")
ACnum = A_C.value_counts(sort=True)

s4 = df["MUT"]
A__ = s4.str.contains("A:-")
A_num = A__.value_counts(sort=True)
I'm not an expert with Pandas, so I don't know if there's a cleaner way of doing this, but perhaps the following might work?
chars = 'TGC-'
nums = {}
for char in chars:
    s = df["MUT"]
    A = s.str.contains("A:" + char)
    num = A.value_counts(sort=True)
    nums[char] = num

ATnum = nums['T']
AGnum = nums['G']
# ...etc
Basically, go through each unique character (T, G, C, -) then pull out the values that you need, then finally stick the numbers in a dictionary. Then, once the loop is finished, you can fetch whatever numbers you need back out of the dictionary.
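If the 16 conditions are all the reference:alternative combinations of A/T/G/C with A/T/G/C/- (an assumption; adjust the two loops to whatever your actual conditions are), the same idea extends to a nested loop, and summing the boolean mask gives the True count directly:
counts = {}
for ref in 'ATGC':
    for alt in 'ATGC-':
        if alt == ref:
            continue                      # skip e.g. A:A, leaving 16 combinations
        pat = ref + ':' + alt
        counts[pat] = df['MUT'].str.contains(pat, regex=False).sum()   # number of True rows
counts['A:T'] then plays the role of ATnum, and so on.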
Just use value_counts; this will give you a count of all unique values in your column, with no need to create 16 variables:
In [5]:
df = pd.DataFrame({'MUT':np.random.randint(0,16,100)})
df['MUT'].value_counts()
Out[5]:
6 11
14 10
13 9
12 9
1 8
9 7
15 6
11 6
8 5
5 5
3 5
2 5
10 4
4 4
7 3
0 3
dtype: int64
