I have a TSV that looks as follows:
chr_1 start_1 chr_2 start_2
11 69633786 14 105884873
12 81940993 X 137690551
13 29782093 12 97838049
14 105864244 11 69633799
17 33207000 20 9992701
17 38446991 20 2102271
17 38447482 17 29623333
20 9992701 17 33207000
20 10426599 17 33094167
20 13765533 17 29469669
22 27415959 8 36197094
22 37191634 8 38983042
22 44464751 18 74004141
8 36197054 22 23130534
8 36197054 22 23131537
8 36197054 8 23130539
This will be referred to as transDiffStartEndChr, which is a DataFrame.
I am working on a program that takes this TSV as input and outputs rows that have the same chr_1 and chr_2, and start_1 and start_2 values that are within +/- 1000 of each other.
Ideal output would look like:
chr_1 start_1 chr_2 start_2
8 36197054 8 23130539
8 36197054 22 23131537
Potentially creating groups for every hit based on chr_1 and chr_2.
My current script/thoughts:
transDiffStartEndChr = pd.read_csv('test-input.tsv', sep='\t')
#I will extract rows first by chr_1, in this case I'm doing a test case for 17.
rowsStartChr17 = transDiffStartEndChr[transDiffStartEndChr.apply(extractChr, chr='17', axis=1)]
#I figure I could brute-force it, but I feel like I'm not tackling this problem correctly
for index, row in rowsStartChr17.iterrows():
    for index2, row2 in rowsStartChr17.iterrows():
        if index == index2:
            continue
        elif row['chr_1'] == row2['chr_1'] and row['chr_2'] == row2['chr_2']:
            if proximityCheck(row['start_1'], row2['start_1']) and proximityCheck(row['start_2'], row2['start_2']):
                print(f'Row: {index} Match: {index2}')
Any thoughts are appreciated.
You can play with numpy and pandas to filter out the groups that don't match your requirements.
>>> df.groupby(['chr_1', 'chr_2'])\
       .filter(lambda s: len(np.array(np.where(
           np.tril(
               np.abs(
                   np.subtract.outer(s['start_2'].values,
                                     s['start_2'].values)) < 1500, -1)))
           .flatten()) > 0)
The logic is to group by chr_1 and chr_2 and perform an outer subtraction between the start_2 values to check whether there are values below 1500 (the threshold I used).
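If you also want to enforce the original requirement (same chr_1 and chr_2, with both start_1 and start_2 within +/- 1000), a minimal sketch along the same lines; has_close_pair is a hypothetical helper, and the file name is taken from the question:
import numpy as np
import pandas as pd

transDiffStartEndChr = pd.read_csv('test-input.tsv', sep='\t')

def has_close_pair(group, tol=1000):
    # Pairwise absolute differences of start_1 and of start_2 within the group
    d1 = np.abs(np.subtract.outer(group['start_1'].values, group['start_1'].values))
    d2 = np.abs(np.subtract.outer(group['start_2'].values, group['start_2'].values))
    # Keep only the strictly lower triangle so each pair is counted once
    # and a row is never compared with itself
    close = np.tril((d1 <= tol) & (d2 <= tol), k=-1)
    return bool(close.any())

matches = (transDiffStartEndChr
           .groupby(['chr_1', 'chr_2'])
           .filter(has_close_pair))
print(matches)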
I have a dataframe df :
a b c
0 0.897134 -0.356157 -0.396212
1 -2.357861 2.066570 -0.512687
2 -0.080665 0.719328 0.604294
3 -0.639392 -0.912989 -1.029892
4 -0.550007 -0.633733 -0.748733
5 -0.712962 -1.612912 -0.248270
6 -0.571474 1.310807 -0.271137
7 -0.228068 0.675771 0.433016
8 0.005606 -0.154633 0.985484
9 0.691329 -0.837302 -0.607225
10 -0.011909 -0.304162 0.422001
11 0.127570 0.956831 1.837523
12 -1.074771 0.379723 -1.889117
13 -1.449475 -0.799574 -0.878192
14 -1.029757 0.551023 2.519929
15 -1.001400 0.838614 -1.006977
16 0.677216 -0.403859 0.451338
17 0.221596 -0.323259 0.324158
18 -0.241935 -2.251687 -0.088494
19 -0.995426 0.665569 -2.228848
20 1.714709 -0.353391 0.671539
21 0.155050 1.136433 -0.005721
22 -0.502412 -0.610901 1.520165
23 -0.853906 0.648321 1.124464
24 1.149151 -0.187300 -0.412946
25 0.329229 -1.690569 -2.746895
26 0.165158 0.173424 0.896344
27 1.157766 0.525674 -1.279618
28 1.729730 -0.798158 0.644869
29 -0.107285 -1.290374 0.544023
that I need to split into multiple dataframes, each containing 10 rows of df, and write every small dataframe to a separate file. So I decided to create a multilevel dataframe, and for this I first assign an index to every 10 rows in my df with this method:
df['split'] = df['split'].apply(lambda x: np.searchsorted(df.iloc[::10], x, side='right')[0])
that throws out
TypeError: 'function' object has no attribute '__getitem__'
So, do you have an idea of how to fix it? Where is my method wrong?
But if you have another approach to split my dataframe into multiple dataframes, each of which contains 10 rows of df, you are also welcome, because this approach was just the first one I thought of, and I'm not sure it's the best.
There are many ways to do what you want; your method looks over-complicated. A groupby using a scaled index as the grouping key would work:
df = pd.DataFrame(data=np.random.rand(100, 3), columns=list('ABC'))
groups = df.groupby(np.arange(len(df.index))//10)
for (frameno, frame) in groups:
    frame.to_csv("%s.csv" % frameno)
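A quick way to see what the scaled index does (just an illustration of the grouping key, not part of the original answer):
import numpy as np
# Rows 0-9 map to group 0, rows 10-19 to group 1, and so on,
# so each CSV ends up with ten consecutive rows (the last group may be shorter)
print(np.arange(25) // 10)  # [0 0 0 ... 1 1 1 ... 2 2 2 2 2]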
You can use a dictionary comprehension to save slices of the dataframe in groups of ten rows:
df_dict = {n: df.iloc[n:n+10, :]
           for n in range(0, len(df), 10)}
>>> df_dict.keys()
[0, 10, 20]
>>> df_dict[10]
a b c
10 -0.011909 -0.304162 0.422001
11 0.127570 0.956831 1.837523
12 -1.074771 0.379723 -1.889117
13 -1.449475 -0.799574 -0.878192
14 -1.029757 0.551023 2.519929
15 -1.001400 0.838614 -1.006977
16 0.677216 -0.403859 0.451338
17 0.221596 -0.323259 0.324158
18 -0.241935 -2.251687 -0.088494
19 -0.995426 0.665569 -2.228848
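Since the original goal was to write every 10-row chunk to its own file, a possible follow-up using the df_dict built above (the file-name pattern is only an example):
for start, chunk in df_dict.items():
    # e.g. rows_0_to_9.csv, rows_10_to_19.csv, rows_20_to_29.csv
    chunk.to_csv("rows_%d_to_%d.csv" % (start, start + len(chunk) - 1))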
I am new to Python and practising writing a function with conditions:
create a function that takes an integer m as input (where m is between 2 and n, and n is the maximum number of rows). This function calculates 'Sum A' and 'Sum B' over the last m days. There will be no value for the first m days.
The original data:
V TP A B Sum A Sum B
3509 47.81
4862 48.406667 235353.2133
1810 49.26 89160.6
3824 49.263333 188382.9867
2209 47.386667 104677.1467
4558 45.573333 207723.2533
3832 44.396667 170128.0267
3778 43.75 165287.5
1005 44.64 44863.2
4047 43.76 177096.72
2201 44.383333 97687.71667 655447.7167 824912.6467
2507 45.156667 113207.7633 533302.2667 824912.6467
4392 44.4333 195151.2 444141.6667 1020063.847
3497 43.296667 151408.4433 255758.68 1171472.29
1181 43.07 50865.67 255758.68 1117660.813
1971 42.89 84536.19 255758.68 994473.75
4994 43.563333 217555.2867 473313.9667 824345.7233
2017 44.816667 90395.21667 563709.1833 659058.2233
2823 44.936667 126856.21 645702.1933 659058.2233
2774 45.13 125190.62 770892.8133 481961.5033
The original data continues with a Day column that simply runs from 1 to 20.
The attempt I have made so far is below, and it shows the error KeyError: 'A':
curret_period = int(input("enter days: "))
sumA = curret_period * ((df["A"] < df["A"]),'')
sumB = curret_period * ((df["B"] >= df["B"]),'')
print(sumA)
print(sumB)
I am wondering whether there is a better way to create the function. I also wonder if a skeleton like the one below is what I need:
def function_name():
    print()
Expected result when m= 10:
A B Sum A Sum B
0
1 235353.21333333332
2 89160.59999999999
3 188382.98666666663
4 104677.1466666667
5 207723.25333333333
6 170128.02666666667
7 165287.5
8 44863.200000000004
9 177096.72
10 97687.71666666666 655447.7167 824912.6467
11 113207.76333333334 533302.2667 824912.6467
12 195151.2 444141.6667 1020063.847
13 151408.4433333333 255758.68 1171472.29
14 50865.66999999999 255758.68 1117660.813
15 84536.19000000002 255758.68 994473.75
16 217555.28666666665 473313.9667 824345.7233
17 90395.21666666666 563709.1833 659058.2233
18 126856.21 645702.1933 659058.2233
19 125190.61999999998 770892.8133 481961.5033
Any suggestion? Thank you in advance.
You can use df.tail() to get the last m rows of the dataframe and then simply sum() each column.
We can also check that m is not greater than the length of the dataframe, although even without this check it would just sum the entire dataframe.
def sumof(df, m):
    if m <= len(df.index):
        rows = df.tail(m)
        print(rows['A'].sum())
        print(rows['B'].sum())
    else:
        print("'m' can not be greater than length of dataframe")
Probably the title of my question is somewhat wrong. Currently I have a list:
a = [11,12,13,14,15,16,17,18,19,20,21,22,25,26,27,28,29,30,31,37,38,39]
and a dataframe df:
colfrom colto
1 99
23 24
25 32
25 40
How can I filter my dataframe so that colfrom is inside the array a or smaller than it, and colto is inside the array or bigger than it? So basically this rule would lead to:
colfrom colto
1 99
25 32
25 40
The only row that gets kicked out is row 2 (or in Python, row 1), as 23 and 24 are not in the array (and not lower than 11 and not higher than 39).
Use:
mask = ((df['colfrom'].isin(a)) | (df['colfrom']<min(a)) & (df['colto'].isin(a)) | (df['colto']>max(a)))
df[mask]
colfrom colto
0 1 99
2 25 32
3 25 40
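For completeness, a self-contained reproduction of the example, with the data copied from the question:
import pandas as pd

a = [11,12,13,14,15,16,17,18,19,20,21,22,25,26,27,28,29,30,31,37,38,39]
df = pd.DataFrame({'colfrom': [1, 23, 25, 25],
                   'colto':   [99, 24, 32, 40]})

mask = ((df['colfrom'].isin(a)) | (df['colfrom'] < min(a)) &
        (df['colto'].isin(a)) | (df['colto'] > max(a)))
print(df[mask])  # rows 0, 2 and 3 remain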
I am creating a new dataframe which should contain only the middle value (not the median!) of every n rows; however, my code doesn't work.
I've tried several approaches with pandas and plain Python, but I always fail.
value date index
14 40 1983-07-15 14
15 86 1983-07-16 15
16 12 1983-07-17 16
17 78 1983-07-18 17
18 69 1983-07-19 18
19 78 1983-07-20 19
20 45 1983-07-21 20
21 47 1983-07-22 21
22 48 1983-07-23 22
23 ..... ......... ..
RSDF5 = RSDF4.groupby(pd.Grouper(freq='15D', key='DATE')).[int(len(RSDF5)//2)].reset_index()
I know that the code is wrong (it throws SyntaxError: invalid syntax) and I am completely out of ideas!
A solution based on indexes.
df is your original dataframe, N is the number of rows you want to group (assumed to be an odd number, so there is a unique middle row).
df2 = df.groupby(np.arange(len(df))//N).apply(lambda x : x.iloc[len(x)//2])
Be aware that if the total number of rows is not divisible by N, the last group is shorter (you still get its middle value, though).
If N is an even number, you get the central row closer to the end of the group: for example, if N=6, you get the 4th row of each group of 6 rows.
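A small usage sketch on a made-up stand-in for the frame in the question (the column names match, the values do not), with N = 3:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(10, 19),
                   'date': pd.date_range('1983-07-15', periods=9)})

N = 3
middle_rows = df.groupby(np.arange(len(df)) // N).apply(lambda x: x.iloc[len(x) // 2])
print(middle_rows)  # rows at positions 1, 4 and 7 of the original frame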
I have a .csv file which holds two columns which I want to sort.
I want to sort the first column alphabetically and the second one by highest number to lowest.
I used sortedColumn = sorted(csv_opener,key=operator.itemgetter(0)) to sort the first column alphabetically but I also want to do the same thing for the second column. How would I go about doing that?
You can sort by two aspects by having the key callable return a tuple.
I'm assuming that the second column is a string convertible to an integer:
sortedColumn = sorted(csv_opener, key=lambda row: (row[0], -int(row[1])))
By returning negative values for row[1] you can sort from highest to lowest, while the main sort is done on row[0] in alphabetical order.
So for the sample rows:
Alpha, 10
Beta, 30
Alpha, 42
Gamma, 81
Beta, 10
the sorted output gives you:
Alpha, 42
Alpha, 10
Beta, 30
Beta, 10
Gamma, 81
sorting first alphabetically on the first column, and then for equal values in the first column, the rows are sorted in descending order on the second column.
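For a self-contained version, assuming csv_opener comes from csv.reader over a file shaped like the sample above (the file name data.csv is just an example):
import csv

with open('data.csv', newline='') as f:
    rows = list(csv.reader(f))

# Alphabetical on the first column, then highest-to-lowest on the second
rows.sort(key=lambda row: (row[0], -int(row[1])))

for row in rows:
    print(', '.join(row))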
Martijn Pieters already provided a perfect answer, but I think it is worth checking out Pandas DataFrame for dealing with CSV data in case you have not considered it.
You can use pandas.read_csv() to read the CSV input as a DataFrame and then use DataFrame.sort_values() to sort it any way you want.
To add an example, let's first generate some random sample data
from faker import Factory
from random import randint, choice
import pandas
fake = Factory.create()
names = [fake.name() for i in range(5)]
nums = [randint(1, 50) for i in range(5)]
data = []
for i in range(10):
    data.append((choice(names), choice(nums)))
df = pandas.DataFrame.from_records(data, columns=("Names", "Nums"))
Resulting in, for example
Names Nums
0 Jeffry Wintheiser 25
1 Dr. Corine Sporer PhD 25
2 Jeffry Wintheiser 17
3 Emmett Reilly 17
4 Jeffry Wintheiser 17
5 Emmett Reilly 33
6 Jeffry Wintheiser 33
7 Lilah Purdy 17
8 Emmett Reilly 22
9 Miss Julie Wisoky 25
Then you can use the sort_values as follows
df.sort_values(["Names", "Nums"], ascending=[True, False])
Resulting in
Names Nums
1 Dr. Corine Sporer PhD 25
5 Emmett Reilly 33
8 Emmett Reilly 22
3 Emmett Reilly 17
6 Jeffry Wintheiser 33
0 Jeffry Wintheiser 25
2 Jeffry Wintheiser 17
4 Jeffry Wintheiser 17
7 Lilah Purdy 17
9 Miss Julie Wisoky 25
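If the end goal is still a sorted CSV on disk, the sorted frame can be written back out with to_csv (the file name is only an example):
sorted_df = df.sort_values(["Names", "Nums"], ascending=[True, False])
sorted_df.to_csv("sorted.csv", index=False)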