Select middle value for nth rows in Python

I am creating a new dataframe which should contain only the middle value (not the median!) of every nth rows, but my code doesn't work.
I've tried several approaches with pandas and plain Python, but I keep failing.
value date index
14 40 1983-07-15 14
15 86 1983-07-16 15
16 12 1983-07-17 16
17 78 1983-07-18 17
18 69 1983-07-19 18
19 78 1983-07-20 19
20 45 1983-07-21 20
21 47 1983-07-22 21
22 48 1983-07-23 22
23 ..... ......... ..
RSDF5 = RSDF4.groupby(pd.Grouper(freq='15D', key='DATE')).[int(len(RSDF5)//2)].reset_index()
I know that the code is wrong and I am completely out of ideas!
SyntaxError: invalid syntax

A solution based on indexes.
df is your original dataframe, N is the number of rows you want to group (assumed to be an odd number, so there is a unique middle row).
df2 = df.groupby(np.arange(len(df))//N).apply(lambda x : x.iloc[len(x)//2])
Be aware that if the total number of rows is not divisible by N, the last group is shorter (you still get its middle value, though).
If N is an even number, you get the central row closer to the end of the group: for example, if N=6, you get the 4th row of each group of 6 rows.
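For concreteness, here is a minimal, self-contained sketch of that approach; the data and N below are made up for illustration and are not from the question:
import numpy as np
import pandas as pd

# Hypothetical data: 14 daily values in the same layout as the question's example
df = pd.DataFrame({
    "value": [40, 86, 12, 78, 69, 78, 45, 47, 48, 51, 33, 72, 19, 64],
    "date": pd.date_range("1983-07-15", periods=14, freq="D"),
})

N = 5  # rows per group (odd, so every full group has a unique middle row)

# Label each row with its group number 0,0,...,1,1,... and keep the middle row of each group
df2 = df.groupby(np.arange(len(df)) // N).apply(lambda x: x.iloc[len(x) // 2])
print(df2)
With 14 rows and N = 5 the last group has only 4 rows, so its third row is returned, as described above.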

Related

Finding overlap in range based on multiple dataframe column values

I have a TSV that looks as follows:
chr_1 start_1 chr_2 start_2
11 69633786 14 105884873
12 81940993 X 137690551
13 29782093 12 97838049
14 105864244 11 69633799
17 33207000 20 9992701
17 38446991 20 2102271
17 38447482 17 29623333
20 9992701 17 33207000
20 10426599 17 33094167
20 13765533 17 29469669
22 27415959 8 36197094
22 37191634 8 38983042
22 44464751 18 74004141
8 36197054 22 23130534
8 36197054 22 23131537
8 36197054 8 23130539
This will be referred to as transDiffStartEndChr, which is a Dataframe.
I am working on a program that takes this TSV as input and outputs rows that have the same chr_1 and chr_2, and start_1 and start_2 values that are within +/- 1000 of each other.
Ideal output would look like:
chr_1 start_1 chr_2 start_2
8 36197054 8 23130539
8 36197054 22 23131537
Potentially creating groups for every hit based on chr_1 and chr_2.
My current script/thoughts:
transDiffStartEndChr = pd.read_csv('test-input.tsv', sep='\t')

#I will extract rows first by chr_1, in this case I'm doing a test case for 17.
rowsStartChr17 = transDiffStartEndChr[transDiffStartEndChr.apply(extractChr, chr='17', axis=1)]

#I figure I can do something stupid and use brute force, but I feel like I'm not tackling this problem correctly
for index, row in rowsStartChr17.iterrows():
    for index2, row2 in rowsStartChr17.iterrows():
        if index == index2:
            continue
        elif row['chr_1'] == row2['chr_1'] and row['chr_2'] == row2['chr_2']:
            if proximityCheck(row['start_1'], row2['start_1']) and proximityCheck(row['start_2'], row2['start_2']):
                print(f'Row: {index} Match: {index2}')
Any thoughts are appreciated.
You can play with numpy and pandas to filter out the groups that don't match your requirements.
>>> df.groupby(['chr_1', 'chr_2'])\
       .filter(lambda s: len(np.array(np.where(
           np.tril(
               np.abs(
                   np.subtract.outer(s['start_2'].values,
                                     s['start_2'].values)) < 1500, -1)))
           .flatten()) > 0)
The logic is to group by chr_1 and chr_2 and perform an outer subtraction between start_2 values to check whether there are any absolute differences below 1500 (the threshold I used).
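The filter above only compares start_2 values; if both start_1 and start_2 need to be within the tolerance, a sketch along the following lines may help (the close_pairs helper and the 1000 tolerance are my own illustration and have not been tested against the full data):
import numpy as np
import pandas as pd

def close_pairs(group, tol=1000):
    # Pairwise absolute differences within the group, for both start columns
    d1 = np.abs(np.subtract.outer(group["start_1"].values, group["start_1"].values))
    d2 = np.abs(np.subtract.outer(group["start_2"].values, group["start_2"].values))
    # A row qualifies if some *other* row of the group is within tol on both columns
    both = (d1 <= tol) & (d2 <= tol)
    np.fill_diagonal(both, False)  # ignore each row's comparison with itself
    return group[both.any(axis=1)]

# transDiffStartEndChr is the DataFrame read from the TSV above
matches = (transDiffStartEndChr
           .groupby(["chr_1", "chr_2"], group_keys=False)
           .apply(close_pairs))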

Filter dataframe based on list with ranges

The title of my question is probably somewhat wrong. Currently I have a list:
a = [11,12,13,14,15,16,17,18,19,20,21,22,25,26,27,28,29,30,31,37,38,39]
and a dataframe df:
colfrom colto
1 99
23 24
25 32
25 40
How can I filter my dataframe so that colfrom is inside the array a or smaller than it, and colto is inside the array or bigger than it? So basically this rule would lead to:
colfrom colto
1 99
25 32
25 40
The only row that gets kicked out is row 2 (or in Python, row 1), as 23 and 24 are not in the array (and not lower than 11 and not higher than 39).
Use:
mask = ((df['colfrom'].isin(a)) | (df['colfrom']<min(a)) & (df['colto'].isin(a)) | (df['colto']>max(a)))
df[mask]
colfrom colto
0 1 99
2 25 32
3 25 40
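For reference, a self-contained sketch that reconstructs the example data and applies the mask as given (note that & binds more tightly than |, so the expression groups as isin | (< min & isin) | > max, which reproduces the expected rows here):
import pandas as pd

a = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26,
     27, 28, 29, 30, 31, 37, 38, 39]
df = pd.DataFrame({"colfrom": [1, 23, 25, 25], "colto": [99, 24, 32, 40]})

mask = ((df['colfrom'].isin(a)) | (df['colfrom'] < min(a)) &
        (df['colto'].isin(a)) | (df['colto'] > max(a)))
print(df[mask])
#    colfrom  colto
# 0        1     99
# 2       25     32
# 3       25     40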

Calculating average/standard deviations of rows containing certain string in pandas dataframe

I have a large pandas dataframe read in as table. I would like to calculate the means and standard deviations of the ages of the two different groups, CRPS and CONTROLS, so I can plot them in a bar plot with the std deviations as the error bars.
I can get the mean of the whole Age column, and I figured I have to construct a for loop, but I don't know how to get further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance. I want to look in the Group column and calculate the average and standard deviation of the ages for each group; so, an average and standard deviation value for the ages of the CRPS group, for example.
I have the first 25 rows down below just to show what the dataframe looks like. I have also imported numpy as np.
Group Age
0 CRPS 50
1 CRPS 59
2 CRPS 22
3 CRPS 48
4 CRPS 53
5 CRPS 48
6 CRPS 29
7 CRPS 44
8 CRPS 28
9 CRPS 42
10 CRPS 35
11 CONTROLS 54
12 CONTROLS 43
13 CRPS 50
14 CRPS 62
15 CONTROLS 64
16 CONTROLS 39
17 CRPS 40
18 CRPS 59
19 CRPS 46
20 CONTROLS 56
21 CRPS 21
22 CRPS 45
23 CONTROLS 41
24 CRPS 46
25 CONTROLS 35
I don't think you need a for-loop.
Instead, you might try something like:
table.loc[table['Group'] == 'CRPS']['Age'].mean()
I haven't tested with your table, but I think that will work.
The idea is to first create a boolean array, which is true for row indices where the Group field contains 'CRPS', then to select all of those rows using loc (boolean masks need the label-based loc rather than the positional iloc), and finally to take the mean. You could iterate over all of the groups in the following way:
mean_age = dict()
for group in set(table['Group']):
    mean_age[group] = table.loc[table['Group'] == group]['Age'].mean()
Maybe this is where you intended to use a for loop.
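Since the end goal is a bar plot of means with standard deviations as error bars, a single groupby call can compute both statistics per group at once; here is a minimal sketch using a few of the rows above (the small table is just to keep the example runnable):
import pandas as pd

# A handful of rows from the example above
table = pd.DataFrame({
    "Group": ["CRPS", "CRPS", "CRPS", "CONTROLS", "CONTROLS"],
    "Age":   [50, 59, 22, 54, 43],
})

# Mean and standard deviation of Age for each value of Group
stats = table.groupby("Group")["Age"].agg(["mean", "std"])
print(stats)

# Ready for plotting, e.g. a bar chart with std as error bars:
# stats["mean"].plot(kind="bar", yerr=stats["std"])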

Fetching records and then putting in a new column

I am working with a Pandas data frame for one of my projects.
I have a column named Count containing integer values.
I have 720 values, one for each hour, i.e. 24 * 30 days.
I want to run a loop which first takes the first 24 values from the data frame and puts them in a new column, then takes the next 24 and puts them in the next column, and so on.
for example:
input:
34
45
76
87
98
34
output:
34 87
45 98
76 34
Here there is a column of 6 values and I am taking the first 3 and putting them in the first column and the next 3 in the second one.
Can someone please help with writing a code/program for the same. It would be of great help.
Thanks!
You can also try numpy's reshape method performed on pd.Series.values.
s = pd.Series(np.arange(720))
df = pd.DataFrame(s.values.reshape((30,24)).T)
Or use np.split (specifying how many arrays you want to split into):
df = pd.DataFrame({"day" + str(i): v for i, v in enumerate(np.split(s.values, 30))})

efficiently find maxes of one column over id in a Pandas dataframe

I am working with a very large dataframe (3.5 million x 150, which takes 25 GB of memory when unpickled) and I need to find the maximum of one column over an id number and a date and keep only the row with that maximum value. Each row is a recorded observation for one id at a certain date, and I also need the latest date.
This is animal test data where there are twenty additional columns seg1-seg20 for each id and date that are filled with test day information consecutively; for example, the first test's data fills seg1, the second test's data fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:
df = DataFrame({'id':[1000,1000,1001,2000,2000,2000],
                "date":[20010101,20010201,20010115,20010203,20010223,20010220],
                "value":[3,1,4,2,6,6],
                "seg1":[22,76,23,45,12,53],
                "seg2":[23,"",34,52,24,45],
                "seg3":[90,"",32,"",34,54],
                "seg4":["","",32,"",43,12],
                "seg5":["","","","",43,21],
                "seg6":["","","","",43,24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 41 6
5 20010220 2000 12 24 34 43 44 35 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 41 6
I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS and not just the maximum value of each column for each id. My current solution is:
for i in df.id.unique():
    df = df.drop(df.loc[df.id==i].sort(['value','date']).index[:-1])
But this takes around 10 seconds each time through, I assume because it's scanning the entire dataframe on every iteration. There are 760,000 unique ids, each 17 digits long, so it will take way too long to be feasible at this rate.
Is there another method that would be more efficient? Currently every column is read in as an "object", but converting the relevant columns to the smallest possible integer type doesn't seem to help either.
I tried groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all Pandas operations) is not in-place.
If you do:
df.groupby('id', sort = False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id as the index, you do:
df.groupby('id', sort = False, as_index = False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reset:
df.loc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6
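If the apply-based version is still slow on 3.5 million rows, a fully vectorized alternative is to sort and drop duplicates; this is only a sketch (not benchmarked), but it keeps, for each id, the row with the highest value and, among ties, the latest date, which matches the original loop:
df2 = (df.sort_values(['id', 'value', 'date'])
         .drop_duplicates('id', keep='last')
         .sort_index())
The final sort_index() just restores the original row order and can be dropped if the order doesn't matter.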
