How can I count the element in specific intervals in a dataframe? - python

I've got a dataframe like below where columns in c01 represent the start time and c04 the end for time intervals:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements in the list falls in the intervals shown by the dataframe, where I coded like this:
count = 0
for index, row in speech.iterrows():
count += gtls.count(lambda i : i in [row['c01'], row['c04']])
the file works as a whole but all 'count' turns out to be 0, would you please tell me where did I mess up?

I took the liberty of converting your list into a numpy array() (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
def get_count(row): #the logic for your summation is here
return np.sum([(row['c01'] < arr) & (row['c04'] >= arr)])
df['C_sum'] = df.apply(get_count, axis = 1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using lambda:
df['C_sum'] = df.apply(lambda row: np.sum([(row['c01'] < arr) & (row['c04'] >= arr)]), axis = 1)

Welcome to Stack Overflow! The i in [row['c01'], row['c04']] doesn't do what you seem to think; it stands for checking whether element i can be found from the two-element list instead of checking the range between row['c01'] and row['c04']. For checking if a floating point number is within a range, use row['c01'] < i < row['c04'].

Related

How to get a value from a pandas core series?

I have a dataframe df in which are the timezones for particular ip numbers:
ip1 ip2 timezone
0 16777215 0
16777216 16777471 +10:00
16777472 16778239 +08:00
16778240 16779263 +11:00
16779264 16781311 +08:00
16781312 16785407 +09:00
...
The first row is valid for the ip numbers from 0 to 16777215, the second from 16777216 to 16777471 an so on.
Now, I go through a folder an want to know the timezone for every file (after I calculate the ip_number of the file).
I use:
time=df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone']
and become my expected output:
1192 +05:30
Name: timezone, dtype: object
But this is panda core series series and I just want to have "+5:30".
How do I become this? Or is there another way instead of df.loc[...]to become directly the value of the column timezonein df?
just list it
list(time)
if you are excepting only one value
list(time)[0]
or you can make it earlier:
#for numpy array
time=df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].values
#for list
time=list(df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].values)
To pull the only value out of a Series of size 1, use the Series.item() method:
time = df.loc[(df['ip1'] <= ip_number) & (ip_number <= df['ip2']), 'timezone'].item()
Note that this raises a ValueError if the Series contains more than one item.
Usually pulling single values out of a Series is an anti-pattern. NumPy/Pandas
is built around the idea that applying vectorized functions to large arrays is
going to be much much faster than using a Python loop that processes single
values one at a time.
Given your df and a list of IP numbers, here is a way to find the
corresponding timezone offsets for all the IP numbers with just one call to pd.merge_asof.
import pandas as pd
df = pd.DataFrame({'ip1': [0, 16777216, 16777472, 16778240, 16779264, 16781312],
'ip2': [16777215, 16777471, 16778239, 16779263, 16781311, 16785407],
'timezone': ['0', '+10:00', '+08:00', '+11:00', '+08:00', '+09:00']})
df1 = df.melt(id_vars=['timezone'], value_name='ip').sort_values(by='ip').drop('variable', axis=1)
ip_nums = [16777473, 16777471, 16778238, 16785406]
df2 = pd.DataFrame({'ip':ip_nums}).sort_values(by='ip')
result = pd.merge_asof(df2, df1)
print(result)
yields
ip timezone
0 16777471 +10:00
1 16777473 +08:00
2 16778238 +08:00
3 16785406 +09:00
Ideally, your next step would be to apply more NumPy/Pandas vectorized functions
to process the whole DataFrame at once. But if you must, you could iterate
through the result DataFrame row-by-row. Still, your code will look a little bit cleaner
since you'll be able to read off ip and corresponding offset easily (and without calling .item()).
for row in result.itertuples():
print('{} --> {}'.format(row.ip, row.timezone))
# 16777471 --> +10:00
# 16777473 --> +08:00
# 16778238 --> +08:00
# 16785406 --> +09:00

Reading values from datafram.iloc is too slow and problem in dataframe.values

I use python and I have data of 35 000 rows I need to change values by loop but it takes too much time
ps: I have columns named by succes_1, succes_2, succes_5, succes_7....suces_120 so I get the name of the column by the other loop the values depend on the other column
exemple:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
for d in range (len(succ_len)):
ids = succ_len[d]
if data_jeux['SK_%s' % ids][i] == 1:
data_jeux.iloc[i]['Succes_%s' % ids]= 1+i
I ask if there is a way for executing this problem with the faster way I try :
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns me the following error maybe it doesn't accept string index
You can define columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't you can use sorted with a custom function. String-based sorting will cause '20' to come before '100'.
def splitter(x):
return int(x.rsplit('_', maxsplit=1)[-1])
cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1

Counting Values In Columns Igonorig AlphaNumeric Values

First post here, I am trying to find out total count of values in an excel file. So after importing the file, I need to run a condition which is count all the values except 0 also where it finds 0 make that blank.
> df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the above one but not working properly, It is counting the column name as well, I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are column name and index value which needs to be ignored when counting.
The output Should be like this after counting:
Final 8 9 9 5
If you just nee the count, but change the values in your dataframe, you could apply a function to each cell in your DataFrame with the applymap method. First create a function to check for a float:
def floatcheck(value):
if isinstance(value, float):
return 1
else:
return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
I was able to solve the issue, So here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What i did was i changed the starting index so i didn't count the first row which was alphanumeric and then i just compared it.
Thanks for your response.

Taking single value from a grouped data frame in Pandas

I am a new Python convert (from Matlab). I am using the pandas groupby function, and I am getting tripped up by a seemingly easy problem. I have written a custom function that I apply to the grouped df that returns 4 different values. Three of the values are working great, but the other value is giving me an error. Here is the original df:
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
...
This is the function that transforms the data. Basically, it counts the number of 'positive' values and the total number of observations in the group. I also want it to return the ID value, and this is where the problem is:
def _ct_id_pos(grp):
return grp['ID'][0], grp[grp.A == 'positive'].shape[0], grp[grp.B == 'positive'].shape[0], grp.shape[0]
I apply the _ct_id_pos function to the data grouped by Date and SN:
FullMx_prime = FullMx.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index()
So, the method should return something like this:
Date SN ID 0
0 9/1/16 32 360 (360,2,1,4)
1 9/1/16 35 718 (718,0,0,1)
2 9/2/16 38 728 (728,1,0,2)
3 9/3/16 30 728 (728,2,0,3)
But, I keep getting the following error:
...
KeyError: 0
Obviously, it does not like this part of the function: grp['ID'][0] . I just want to take the first value of grp['ID'] because--if there are multiple values--they should all be the same (i.e., I could take the last, it does not matter). I have tried other ways to index, but to no avail.
Change grp['ID'][0] to grp.iloc[0]['ID']
The problem you are having is due to grp['ID'] which selects a column and returns a pandas.Series. Which is straight forward enough, and you could reasonably expect that [0] would select the first element. But the [0] actually selects based on the index for the Series, and in this case the index is from the dataframe that was grouped. So, 0 is not always going to be a valid index.
Code:
def _ct_id_pos(grp):
id = grp.iloc[0]['ID']
a = grp[grp.A == 'positive'].shape[0]
b = grp[grp.B == 'positive'].shape[0]
sz = grp.shape[0]
return id, a, b, sz
Test Code:
df = pd.read_csv(StringIO(u"""
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
"""), header=0, index_col=0)
print(df.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index())
Results:
Date SN 0
0 9/1/16 30 (728-13, 0, 0, 3)
1 9/1/16 32 (360, 0, 1, 4)
2 9/1/16 35 (718, 0, 0, 1)
3 9/1/16 38 (728-13, 0, 0, 2)

Merging DataFrames on multiple conditions - not specifically on equal values

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by either looking at chromosomes or genes seperately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
genic_snps = pysqldf(q)
all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
info = line[1] # Getting the Series object
this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
(snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
# Getting rid of redundant genes
min_bp = this_chr_snp['BP'].min()
max_bp = this_chr_snp['BP'].max()
this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
~(this_genes['chr_stop'] <= min_bp)]
for line in this_genes.iterrows():
info = line[1]
this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
(this_chr_snp['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(1, 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278

Categories

Resources