Removing duplicates based on repeated column indices in Python

I have a dataframe that has rows with repeated values in sequences.
For example:
df_raw
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14....
220 450 451 456 470 224 220 223 221 340 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315 226 212 115 117 315.....
As you can see, columns 0-5 are unique in this example, and then we have the repeated sequence [220 223 221 340 224] for row 1 in columns 6-10, which appears again from column 11 onwards.
This pattern is the same for row 2.
I'd like to remove the repeated sequences from each row of my dataframe (there can be more than just two repetitions) to get an output like this:
df_clean
0 1 2 3 4 5 6 7 8 9.....
220 450 451 456 470 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315.....
I trail with ...... because the rows are long and have multiple repetitions each. I also cannot assume that each row has the exact same number of repeated sequences, nor that each sequence starts or ends at the exact same index.
Is there an easy way to do this with pandas or even a numpy array?
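One brute-force sketch of an approach (it assumes each row is a unique prefix followed by copies of a single repeating block; trim_repeats and min_len are illustrative names, not from the question):
import pandas as pd

def trim_repeats(row, min_len=2):
    # Scan for the earliest block that is immediately followed by an identical block,
    # then keep everything up to and including the first copy of that block.
    vals = list(row)
    n = len(vals)
    for start in range(n):
        for width in range(min_len, (n - start) // 2 + 1):
            if vals[start:start + width] == vals[start + width:start + 2 * width]:
                return vals[:start + width]
    return vals

# Rows can end up with different lengths, so shorter ones are padded with NaN
df_clean = pd.DataFrame([trim_repeats(r) for _, r in df_raw.iterrows()])
For the example above this keeps columns 0-9 of each row, matching the desired df_clean, but whether it is robust depends on how short an accidental repeat can be in your real data.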

Related

Removing duplicate entries and extracting desired information

I have a matrix of results that looks like this:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16 44 23 49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2 96 5 95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3 21 277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14 29 345 360
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1 121 1 121
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30 80 157 209
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94 2 101 273 369
SMC_N 220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3 199 19 351
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015 231 300 279 352
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05 4 53 19 67
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014 3 41 22 60
I want to filter the results so that, for example, for the item "DNA_pol3_beta_3", which has 2 entries, I extract only the row whose value in the 5th column is the lowest. That means, out of the two entries:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
the above one should be in the result. Similarly, "DNA_pol3_beta_2" has 4 entries and the program should extract only
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
because it has the lowest value in the 5th column among the 4. Also, the program should ignore entries whose value in the 5th column is less than 1E-5.
I tried the following code:
for i in lines:
    if lines[i+1] == lines[i]:
        if lines[i+1][4] > lines[i][4]:
            evalue = lines[i][4]
        else:
            evalue = lines[i+1][4]
You would be better off using pandas for this. See below:
import pandas as pd
# Read the whitespace-separated file; the nine columns get integer names 0-8
df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))
# Ignore entries whose value in the 5th column (index 4) is less than 1e-5
df = df[df[4] >= 0.00001]
# For each item in column 0, keep the row with the lowest value in column 4
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)
Output:
>>> print(result)
0 1 2 3 4 5 6 7 8
0 DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1200.00000 16 44 23 49
1 DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.70000 2 96 5 95
2 DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94000 2 101 273 369
3 AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
4 AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00004 4 53 19 67
5 AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00140 3 41 22 60
If you want the result back as a CSV file, you can save it with result.to_csv().
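As an aside, an equivalent idiom (my own sketch, not part of the original answer) is to sort by the e-value column and then drop duplicates per item:
# Sort ascending on column 4, keep the first (lowest) row per item in column 0
result_alt = (df.sort_values(4)
                .drop_duplicates(subset=0, keep='first')
                .sort_index()
                .reset_index(drop=True))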

How to split merged column with blank spaces inside

I have a dataframe with a column like this:
0
0 ND 95 356 618 949
1 ND 173 379 571 317
2 ND 719 451 1 040 782
3 ND 1 546 946 588 486
4 ND 3 658 146 1 317 165
5 ND 6 773 270 1 137 655
6 ND 11 148 978 1 303 481
7 14 648 890 ND ND
8 16 968 348 ND 1 436 353
9 ND ND ND
10 ND ND ND
I don't know how to split it into columns, because the values have no comma separator, so I can't do dataset[0].str.split(',', expand=True).
I tried dataset[0].str.extract(r'((\d{1,2}) (\d{2,3}) (\d{3})|(\d{2,3}) (\d{3}))') but it only works for the first group of numbers: the first column is right and the other five are combinations of the first.
0 1 2 3 4 5
0 95 356 NaN NaN NaN 95 356
I think that the solution is related to RegEx, but I'm not really familiar with that.
The desired output that I would like to have is:
0 1 2
0 ND 95 356 618 949
1 ND 173 379 571 317
2 ND 719 451 1 040 782
3 ND 1 546 946 588 486
4 ND 3 658 146 1 317 165
5 ND 6 773 270 1 137 655
6 ND 11 148 978 1 303 481
7 14 648 890 ND ND
8 16 968 348 ND 1 436 353
9 ND ND ND
10 ND ND ND
IIUC, the logic here is to group each row into chunks of three tokens, while treating ND as if it were three tokens:
def chunks(lst, n):
    "https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks"
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def join(arr, n):
    return pd.Series([" ".join(chunk) for chunk in chunks(arr, n)])

df["0"] = df["0"].str.replace("ND", "ND_1 ND_2 ND_3")
df2 = df["0"].str.split(r"\s", expand=True).fillna("").astype(str)
df2 = df2.apply(join, n=3, axis=1).replace("ND_1 ND_2 ND_3", "ND")
print(df2)
Output:
0 1 2
0 ND 95 356 618 949
1 ND 173 379 571 317
2 ND 719 451 1 040 782
3 ND 1 546 946 588 486
4 ND 3 658 146 1 317 165
5 ND 6 773 270 1 137 655
6 ND 11 148 978 1 303 481
7 14 648 890 ND ND
8 16 968 348 ND 1 436 353
9 ND ND ND
10 ND ND ND
You may use
^(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)$
See the regex demo. It matches
^ - start of a string
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 1:
ND| - ND, or
\d{1,2}(?:\s\d{3})*| - one or two digits followed with 0 or more occurrences of a whitespace and then three digits, or
\d{3}(?:\s\d{3})? - three digits followed with an optional sequence of a whitespace and three digits
\s+ - 1 or more whitespaces
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 2: same pattern as in Group 1
\s+ - 1+ whitespaces
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 3: same pattern as in Group 1
$ - end of string.
Note that you do not need to write this long pattern by hand: define the block that matches an ND or a number once and reuse it. In Python, you may use it with the Pandas Series.str.extract method:
v = r'(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)'
dataset[0].str.extract(fr'^{v}\s+{v}\s+{v}$', expand=True)

Pandas - disappearing values in value_counts()

I started this question yesterday and have done more work on it.
Thanks #AMC, #ALollz
I have a dataframe of surgical activity data that has 58 columns and 200,000 records. Each row corresponds to a patient encounter. One of the columns, 'TRETSPEF' (treatment specialty), is what I want to look at to see the relative contribution of medical specialties. I have used `pd.read_csv(..., usecols=['TRETSPEF'])` to import the series.
df
TRETSPEF
0 150
1 150
2 150
3 150
4 150
... ...
218462 150
218463 &
218464 150
218465 150
218466 218
The most common treatment specialty is neurosurgery (code 150). So here's the problem: when I apply
.value_counts() I get two groups for the 150 code (and the 218 code):
df['TRETSPEF'].value_counts()
150 140411
150 40839
218 13692
108 10552
218 4143
...
501 1
120 1
302 1
219 1
106 1
Name: TRETSPEF, Length: 69, dtype: int64
There are some '&' values in there (454 of them), so I wondered if the fact that they aren't integers was messing things up. I changed them to null values and ran value_counts again.
df['TRETSPEF'].str.replace("&", "").value_counts()
150 140411
218 13692
108 10552
800 858
110 835
811 692
191 580
323 555
454
100 271
400 116
420 47
301 45
812 38
214 24
215 23
180 22
300 17
370 15
421 11
258 11
314 5
422 4
260 4
192 4
242 4
171 4
350 2
307 2
302 2
328 2
160 1
219 1
120 1
107 1
101 1
143 1
501 1
144 1
320 1
104 1
106 1
430 1
264 1
Name: TRETSPEF, dtype: int64
So now I seem to have lost the second group of 150 (about 40,000 records) by changing '&' to null. The nulls are still showing up in .value_counts() though. The length of the series has gone down to 45 from 69.
I tried stripping whitespace - no difference. Not sure what tests to run to see why this is happening. I feel it must somehow be due to the data.
This is 100% a data cleansing issue. Try to force the column to be numeric.
pd.to_numeric(df['TRETSPEF'], errors='coerce').value_counts()
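For what it's worth, a plausible reproduction of the symptom with toy data of my own (assuming the column mixes integer and string values, which the question's data suggests):
import pandas as pd

# 150 (int) and '150' (str) print identically but are counted as separate groups
s = pd.Series([150, 150, '150', 218, '218', '&'])
print(s.value_counts())

# .str methods return NaN for the non-string (integer) entries, so replacing '&'
# silently drops the integer 150/218 groups from the counts.
print(s.str.replace('&', '').value_counts())

# Coercing to numeric merges the groups; '&' becomes NaN and falls out.
print(pd.to_numeric(s, errors='coerce').value_counts())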

Read 4 lines of data into one row of pandas data frame

I have txt file with such values:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of data frame.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f, sep=",| ", header=None)
But this takes , and (space) as separators, whereas I want it to take the newline as a separator.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
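A minimal sketch of that many-files loop (the glob pattern and file locations are assumptions for illustration):
import glob
import pandas as pd

rows = []
for path in glob.glob('data/*.txt'):      # hypothetical location of the input files
    df = pd.read_csv(path, header=None)
    # Flatten the small frame into one long Series, as above
    rows.append(pd.concat((df.loc[i] for i in df.index), ignore_index=True))

# Build the final frame in one go; each file becomes one row
result = pd.DataFrame(rows).reset_index(drop=True)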

Python Pandas GroupBy().Sum() Having Clause

So I have this DataFrame with 3 columns: 'Order ID', 'Order Qty' and 'Fill Qty'.
I want to sum the Fill Qty per order and then compare it to Order Qty. Ideally I would return only a dataframe that gives me the Order IDs whose aggregate Fill Qty is greater than Order Qty.
In SQL I think what I'm looking for is
SELECT * FROM DataFrame GROUP BY Order ID, Order Qty HAVING sum(Fill Qty)>Order Qty
So far I have this:
SumFills = DataFrame.groupby(['Order ID', 'Order Qty']).sum()
output:
                      Fill Qty
Order ID  Order Qty
1         300              300
2          80               40
3          20               20
4         110              220
5         100              200
6         100              200
The above is already aggregated; I would ideally like to return a list/array of [4, 5, 6], since those have sum(Fill Qty) > Order Qty.
View original dataframe:
In [57]: print original_df
Order Id Fill Qty Order Qty
0 1 419 334
1 2 392 152
2 3 167 469
3 4 470 359
4 5 447 441
5 6 154 190
6 7 365 432
7 8 209 181
8 9 140 136
9 10 112 358
10 11 384 302
11 12 307 376
12 13 119 237
13 14 147 342
14 15 279 197
15 16 280 137
16 17 148 381
17 18 313 498
18 19 193 328
19 20 291 193
20 21 100 357
21 22 161 286
22 23 453 168
23 24 349 283
Create and view new dataframe summing the Fill Qty:
In [58]: new_df = original_df.groupby(['Order Id','Order Qty'], as_index=False).sum()
In [59]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
2 3 469 167
3 4 359 470
4 5 441 447
5 6 190 154
6 7 432 365
7 8 181 209
8 9 136 140
9 10 358 112
10 11 302 384
11 12 376 307
12 13 237 119
13 14 342 147
14 15 197 279
15 16 137 280
16 17 381 148
17 18 498 313
18 19 328 193
19 20 193 291
20 21 357 100
21 22 286 161
22 23 168 453
23 24 283 349
Slice new dataframe to only those rows where Fill Qty > Order Qty:
In [60]: new_df = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'],:]
In [61]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
3 4 359 470
4 5 441 447
7 8 181 209
8 9 136 140
10 11 302 384
14 15 197 279
15 16 137 280
19 20 193 291
22 23 168 453
23 24 283 349
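If you only want the Order IDs themselves (the [4, 5, 6]-style list the question asks for), a small addition to the answer above is to pull that column out of the sliced frame:
# Order IDs whose summed Fill Qty exceeds Order Qty
over_filled = new_df['Order Id'].tolist()
print(over_filled)
With the question's own sample data this would give [4, 5, 6]; with the random data above it simply lists the 'Order Id' values shown in the sliced frame.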
