How to split merged column with blank spaces inside - python

I have a dataframe with a column like this:
0
0 ND 95 356 618 949
1 ND 173 379 571 317
2 ND 719 451 1 040 782
3 ND 1 546 946 588 486
4 ND 3 658 146 1 317 165
5 ND 6 773 270 1 137 655
6 ND 11 148 978 1 303 481
7 14 648 890 ND ND
8 16 968 348 ND 1 436 353
9 ND ND ND
10 ND ND ND
I don't know how to split it into columns, because the values have no comma separator that would let me do dataset[0].str.split(',', expand=True).
I tried dataset[0].str.extract(r'((\d{1,2}) (\d{2,3}) (\d{3})|(\d{2,3}) (\d{3}))'), but it only works for the first group of numbers, and in the output the first column is right while the other five are a combination of the first.
0 1 2 3 4 5
0 95 356 NaN NaN NaN 95 356
I think the solution is related to RegEx, but I'm not really familiar with that.
The desired output that I would like to have is:
0 1 2
0 ND 95 356 618 949
1 ND 173 379 571 317
2 ND 719 451 1 040 782
3 ND 1 546 946 588 486
4 ND 3 658 146 1 317 165
5 ND 6 773 270 1 137 655
6 ND 11 148 978 1 303 481
7 14 648 890 ND ND
8 16 968 348 ND 1 436 353
9 ND ND ND
10 ND ND ND

IIUC, the logic here is to group each row into chunks of three items, while treating ND as three items:
import pandas as pd

def chunks(lst, n):
    "https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks"
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def join(arr, n):
    return pd.Series([" ".join(chunk) for chunk in chunks(arr, n)])

df["0"] = df["0"].str.replace("ND", "ND_1 ND_2 ND_3")
df2 = df["0"].str.split(r"\s", expand=True).fillna("").astype(str)
df2 = df2.apply(join, n=3, axis=1).replace("ND_1 ND_2 ND_3", "ND")
print(df2)
Output:
0 1 2
0 ND 95 356 618 949
1 ND 173 379 571 317
2 ND 719 451 1 040 782
3 ND 1 546 946 588 486
4 ND 3 658 146 1 317 165
5 ND 6 773 270 1 137 655
6 ND 11 148 978 1 303 481
7 14 648 890 ND ND
8 16 968 348 ND 1 436 353
9 ND ND ND
10 ND ND ND

You may use
^(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)$
It matches:
^ - start of a string
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 1:
ND| - ND, or
\d{1,2}(?:\s\d{3})*| - one or two digits followed by zero or more occurrences of a whitespace and then three digits, or
\d{3}(?:\s\d{3})? - three digits followed by an optional sequence of a whitespace and three digits
\s+ - 1 or more whitespaces
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 2: same pattern as in Group 1
\s+ - 1+ whitespaces
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 3: same pattern as in Group 1
$ - end of string.
Note that you do not need to write this long pattern by hand: define the block that matches an ND or a number once and reuse it. In Python, you can use it with the Series.str.extract Pandas method:
v = r'(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)'
dataset[0].str.extract(fr'^{v}\s+{v}\s+{v}$', expand=True)
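As a quick sanity check, the same reusable block can also be tested with the standard re module on one of the sample rows (this snippet simply builds on the pattern above and is a sketch, not part of the original answer):
import re

block = r'(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)'
pattern = re.compile(fr'^{block}\s+{block}\s+{block}$')

print(pattern.match('14 648 890 ND ND').groups())
# ('14 648 890', 'ND', 'ND')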

Custom ranking selection with Pandas

I have a 100x5 pandas dataframe with values ranging from 0 to 499.
import numpy as np
import pandas as pd

# seed for reproducibility
np.random.seed(3)
sample = pd.DataFrame(np.random.randint(0, 500, size=(100, 5)))
sample.columns = "X Y Z F V".split()
I want to select 10 rows from this dataframe: for each column, I select the rows corresponding to the top 2 values of that column (separately), without duplicates.
If there are duplicates, say the top 1st of column X is the same row as the top 2nd of column Y, then randomly keep one of them and replace the other with the next biggest value (top 3rd for X or top 3rd for Y, chosen randomly), and repeat until no duplicate rows are selected for any column.
What I have so far
# convert it to long format and use groupby to get top values and their index - ID
stacked = (
    sample
    .stack()
    .reset_index()
    .rename(columns={"level_0": "ID", "level_1": "Feature", 0: "Value"})
    .set_index("ID")
)
stacked.groupby("Feature").Value.nlargest(2)
Which gives me this
Feature ID Value
F 37 489
F 32 481
V 19 497
V 22 497
X 25 495
X 32 491
Y 17 498
Y 22 496
Z 95 496
Z 45 489
This means I need to select the rows based on the ID values from that dataset. However, as you can see, duplicate rows are selected for columns V and Y, and for F and X. I could not come up with an implementation of this duplicate-handling logic. I would be grateful for any help.
One potential approach could be to select the top 5 values per column with groupby.nlargest() and, from that dataframe, pick two rows per column that are as high-ranked as possible while avoiding duplicates (a sketch of how this intermediate table can be produced follows right after it). Unfortunately, I do not know a Pythonic way of doing this:
Feature ID Value
0 F 37 489
1 F 32 481
2 F 65 474
3 F 82 470
4 F 66 467
5 V 19 497
6 V 22 497
7 V 11 489
8 V 98 486
9 V 15 484
10 X 25 495
11 X 32 491
12 X 99 490
13 X 76 487
14 X 93 486
15 Y 17 498
16 Y 22 496
17 Y 89 494
18 Y 68 493
19 Y 3 480
20 Z 95 496
21 Z 45 489
22 Z 62 488
23 Z 79 485
24 Z 22 484
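For reference, the 5-per-column table above can presumably be produced from the stacked frame defined earlier with something like this (a sketch, not necessarily the exact call used):
top5 = stacked.groupby("Feature").Value.nlargest(5).reset_index()
print(top5)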
The desired result would be (there is randomness):
X Y Z F V
37 133 212 351 489 106
32 491 135 441 481 427
19 48 445 289 308 497
22 182 496 484 91 497
25 495 444 216 311 267
99 490 164 345 23 365
17 400 498 274 331 183
89 309 494 122 82 140
95 275 213 496 167 98
45 267 246 489 252 17
Maybe I explained the logic in a complicated way, but if you look at the dataset and the desired outcome, it should become clearer.
EDIT:
fixed the desired result, it contained a duplicate
Here is one possible way to approach the problem, where we use a set to store the indices corresponding to the top two largest values of each column.
Pseudocode
For every column in the dataframe:
    Drop the indices that are already selected from the previous columns
    Drop the duplicate values
    Sort the values in descending order
    Using set union (|=), add the indices corresponding to the top 2 largest values in the current column
ix = set()
for c in df.columns:
    s = df[c].drop(ix).drop_duplicates().sort_values(ascending=False)
    ix |= set(s.index[:2])

>>> df.loc[list(ix)]
X Y Z F V
32 491 135 441 481 427
65 274 320 455 474 437
37 133 212 351 489 106
11 375 54 0 192 489
45 267 246 489 252 17
17 400 498 274 331 183
19 48 445 289 308 497
22 182 496 484 91 497
25 495 444 216 311 267
95 275 213 496 167 98
I came up with a solution although it is not the most elegant:
indexes = []
while len(indexes) < 2 * len(sample.columns):
    new_df = sample.drop(index=indexes)
    if new_df.iloc[:, len(indexes) // 2].idxmax() not in indexes:
        indexes.append(new_df.iloc[:, len(indexes) // 2].idxmax())
sample.iloc[indexes]
Output:
>>> print(sample.iloc[indexes])
X Y Z F V
25 495 444 216 311 267
32 491 135 441 481 427
17 400 498 274 331 183
22 182 496 484 91 497
95 275 213 496 167 98
45 267 246 489 252 17
37 133 212 351 489 106
65 274 320 455 474 437
19 48 445 289 308 497
11 375 54 0 192 489

Removing duplicate entries and extracting desired information

I have a matrix that looks like this:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16 44 23 49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2 96 5 95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3 21 277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14 29 345 360
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1 121 1 121
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30 80 157 209
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94 2 101 273 369
SMC_N 220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3 199 19 351
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015 231 300 279 352
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05 4 53 19 67
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014 3 41 22 60
I want to filter the results so that, for example, for the item "DNA_pol3_beta_3" there are 2 entries; out of these two entries, I want to extract only the row whose value in the 5th column is the lowest. So that means, out of the two entries:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
the above one should be in the result. Similarly, for "DNA_pol3_beta_2" there are 4 entries and the program should extract only
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
because it has the lowest value in the 5th column among the 4. Also, the program should ignore entries whose value in the 5th column is less than 1E-5.
I tried the following code:
for i in lines:
    if lines[i+1] == lines[i]:
        if lines[i+1][4] > lines[i][4]:
            evalue = lines[i][4]
        else:
            evalue = lines[i+1][4]
You would be better off using pandas for this. See below:
import pandas as pd

df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))
# keep only entries whose 5th column (index 4) is at least 1E-5
df = df[df[4] >= 0.00001]
# for each item in column 0, keep the row with the smallest value in column 4
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)
Output:
>>> print(result)
0 1 2 3 4 5 6 7 8
0 DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1200.00000 16 44 23 49
1 DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.70000 2 96 5 95
2 DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94000 2 101 273 369
3 AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
4 AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00004 4 53 19 67
5 AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00140 3 41 22 60
If you want to write the result back to a file, you can save it with result.to_csv().
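For example, to keep the original space-separated layout (the output filename here is just an assumption):
# write the filtered rows back out, space-separated, without index or header
result.to_csv('filtered.txt', sep=' ', index=False, header=False)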

Pandas - how to sort week and year numbers formatted as strings?

I have a pandas dataframe like this, which I have sorted like so:
>>> weekly_count.sort_values(by='date_in_weeks', inplace=True)
>>> weekly_count.loc[:9,:]
date_in_weeks count
0 1-2013 362
1 1-2014 378
2 1-2015 201
3 1-2016 294
4 1-2017 300
5 1-2018 297
6 10-2013 329
7 10-2014 314
8 10-2015 324
9 10-2016 322
In the above data, every row of the first column, date_in_weeks, is simply "week number of the year - year". I now want to sort it like this:
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
How do I do this?
Use Series.argsort on the values converted to datetimes with format %W (week number of the year):
df = df.iloc[pd.to_datetime(df['date_in_weeks'] + '-0', format='%W-%Y-%w').argsort()]
print (df)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can also convert to datetime, assign it to the df, then sort the values and drop the extra column:
# note: format '%M-%Y' parses the week number into the minutes field, which still
# gives a correctly sortable key because week numbers (1-53) fit in the 0-59 minute range
s = pd.to_datetime(df['date_in_weeks'], format='%M-%Y')
final = df.assign(dt=s).sort_values(['dt', 'count']).drop(columns='dt')
print(final)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can try using auxiliary columns:
import pandas as pd
df = pd.DataFrame({'date_in_weeks': ['1-2013', '1-2014', '1-2015', '10-2013', '10-2014'],
                   'count': [362, 378, 201, 329, 314]})
df['aux'] = df['date_in_weeks'].str.split('-')
df['aux_2'] = df['aux'].str.get(1).astype(int)
df['aux'] = df['aux'].str.get(0).astype(int)
df = df.sort_values(['aux_2','aux'],ascending=True)
df = df.drop(columns=['aux','aux_2'])
print(df)
Output:
date_in_weeks count
0 1-2013 362
3 10-2013 329
1 1-2014 378
4 10-2014 314
2 1-2015 201

Removing duplicates based on repeated column indices Python

I have a dataframe that has rows with repeated values in sequences.
For example:
df_raw
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14....
220 450 451 456 470 224 220 223 221 340 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315 226 212 115 117 315.....
As you can see, columns 0-6 are unique in this example, and then we have the repeated sequence [220 223 221 340 224] in row 1 from columns 6-10 and then again from 11-14.
This pattern is the same for row 2.
I'd like to remove the repeated sequences from each row of my dataframe (there can be more than just 2 of them), for an output like this:
df_clean
0 1 2 3 4 5 6 7 8 9.....
220 450 451 456 470 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315.....
I trail off with ...... because the rows are long and contain multiple repetitions each. I also cannot assume that each row has the exact same number of repeated sequences, nor that each sequence starts or ends at the exact same index.
Is there an easy way to do this with pandas or even a numpy array?
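One possible brute-force sketch, assuming the values are plain numbers and the repeated block always runs to the end of the row (the last copy may be partial): try every block length and start position per row, and cut right after the first copy of the block.
import pandas as pd

def trim_repeats(row, min_len=2):
    """Cut a row after the first copy of a trailing repeated block, if any."""
    vals = [v for v in row if pd.notna(v)]
    n = len(vals)
    for p in range(min_len, n // 2 + 1):          # candidate block length
        for start in range(0, n - 2 * p + 1):     # candidate start of the repetition
            tail = vals[start:]
            block = tail[:p]
            reps, rem = divmod(len(tail), p)
            # the tail must consist of at least 2 copies of the block (last one may be partial)
            if reps >= 2 and tail == block * reps + block[:rem]:
                return vals[:start + p]
    return vals

df_clean = pd.DataFrame([trim_repeats(r) for r in df_raw.to_numpy()])
Rows that end up with different lengths are padded with NaN by the DataFrame constructor; on the two sample rows above this keeps columns 0-9, as in the desired df_clean.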

Read 4 lines of data into one row of pandas data frame

I have txt file with such values:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of a data frame:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex, but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f, sep=",| ", header=None)
But this takes , and (space) as separators, whereas I want the newline to be treated as a separator as well.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
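A minimal sketch of that idea, assuming the files can be collected with a glob pattern (the 'data/*.txt' path is hypothetical):
import glob
import pandas as pd

series_list = []
for path in sorted(glob.glob('data/*.txt')):   # hypothetical location and pattern
    df = pd.read_csv(path, header=None)
    # flatten the per-file frame into one long Series, as above
    flat = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
    series_list.append(flat)

# build the final frame once: one row per input file
result = pd.DataFrame([s.tolist() for s in series_list])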
