I have a test dataframe that looks something like this:
data = pd.DataFrame([[0,0,0,3,6,5,6,1],[1,1,1,3,4,5,2,0],[2,1,0,3,6,5,6,1],[3,0,0,2,9,4,2,1]], columns=["id", "sex", "split", "group0Low", "group0High", "group1Low", "group1High", "trim"])
grouped = data.groupby(['sex','split']).mean()
stacked = grouped.stack().reset_index(level=2)
stacked.columns = ['group_level', 'mean']
Next, I want to separate out group_level and stack those 2 new factors:
stacked['group'] = stacked.group_level.str[:6]
stacked['level'] = stacked.group_level.str[6:]
This all works fine. My question is this:
This works if my column names ("group0Low", "group0High", "group1Low", "group1High") have something in common with each other.
What if instead my column names were more like "routeLow", "routeHigh", "landmarkLow", "landmarkHigh"? How would I use str to split group_level in this case?
This question is similar to this one posted here: Slice/split string Series at various positions
The difference is all of my column subnames are different and have no commonality (whereas in the other post everything had group or class in the name). Is there a regex string, or some other method, I can use to do this stacking?
Here is another way. It assumes that the low/high groups end with the words Low and High respectively, so we can use .str.endswith() to identify which rows are Low and which are High.
Here is the sample data
df = pd.DataFrame('group0Low group0High group1Low group1High routeLow routeHigh landmarkLow landmarkHigh'.split(), columns=['group_level'])
df
group_level
0 group0Low
1 group0High
2 group1Low
3 group1High
4 routeLow
5 routeHigh
6 landmarkLow
7 landmarkHigh
Using np.where, we can do the following:
df['level'] = np.where(df['group_level'].str.endswith('Low'), 'Low', 'High')
df['group'] = np.where(df['group_level'].str.endswith('Low'), df['group_level'].str[:-3], df['group_level'].str[:-4])
df
group_level level group
0 group0Low Low group0
1 group0High High group0
2 group1Low Low group1
3 group1High High group1
4 routeLow Low route
5 routeHigh High route
6 landmarkLow Low landmark
7 landmarkHigh High landmark
I suppose it depends on how general the strings you're working with are. Assuming the levels are always delimited by a capital letter, you can do
In [30]:
s = pd.Series(['routeHigh', 'routeLow', 'landmarkHigh',
'landmarkLow', 'routeMid', 'group0Level'])
s.str.extract('([\d\w]*)([A-Z][\w\d]*)')
Out[30]:
0 1
0 route High
1 route Low
2 landmark High
3 landmark Low
4 route Mid
5 group0 Level
You can even name the columns of the result in the same line by doing
s.str.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
So in your use case you can do
group_level_df = stacked.group_level.str.extract(r'(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
stacked = pd.concat([stacked, group_level_df], axis=1)
Here's another approach which assumes only knowledge of the level names in advance. Suppose you have three levels:
lower = stacked.group_level.str.lower()
for level in ['low', 'mid', 'high']:
rows_in = lower.str.contains(level)
stacked.loc[rows_in, 'level'] = level.capitalize()
stacked.loc[rows_in, 'group'] = stacked.group_level[rows_in].str.replace(level, '', case=False)
This should work as long as the level doesn't also appear in the group name, e.g. 'highballHigh'. In cases where group_level doesn't contain any of these levels, you would end up with null values in the corresponding rows.
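Alternatively, a vectorized sketch of the same idea using a single regex (this assumes the possible levels are known in advance, here Low, Mid and High, and that they sit at the end of group_level):
parts = stacked['group_level'].str.extract(r'(?P<group>.*?)(?P<level>Low|Mid|High)$')
stacked['group'] = parts['group']
stacked['level'] = parts['level']
Rows whose group_level does not end in one of the listed levels come back as NaN, just like in the loop version.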
Related
I have the following dataframe "data" composed of ID and associated cluster number :
ID cluster
FP_101 1
FP_102 1
SP_209 3
SP_300 3
SP_209 1
FP_45 90
SP_50 90
FP_398 100
...
I would like to print clusters which contain more than one ID starting with SP and/or FP.
I think I have the two parts of the answer but just do not know how to combine them in a proper way:
data = data[data['ID'].str.startswith('FP')] (same for SP)
selection function: data = data.groupby(['cluster']).filter(lambda x: x['ID'].nunique() > 1)
The result should give from the previous example :
ID cluster
FP_101 1
FP_102 1
SP_209 1
SP_209 3
SP_300 3
How can I combine/arrange these functions to obtain this result?
This is my understanding of your question; let me know if it helps:
Separating SP & FP
df['Prefix'] = df['ID'].apply(lambda x: x.split('_')[0])
Grouping by clusters
df2 = df.groupby(['cluster', 'Prefix'], as_index = False).agg({'ID':['nunique','unique']})
Filtering
df2.columns = df2.columns.to_flat_index().str.join('')
df2[df2['IDnunique']>1]
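If the goal is to see the original rows rather than the aggregated counts, one possible follow-up (a sketch, assuming df and df2 as built above) is to collect the qualifying cluster numbers and filter the original frame with them:
keep = df2.loc[df2['IDnunique'] > 1, 'cluster'].unique()
df[df['cluster'].isin(keep)].sort_values('cluster')[['ID', 'cluster']]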
Python/pandas newbie here. The csv file I'm trying to work with has been populated with data that looks something like this:
A B C D
Option1(item1=12345, item12='string', item345=0.123) 2020-03-16 1.234 Option2(item4=123, item56=234, item678=345)
I'd like it to look like this:
item1 item12 item345 B C item4 item56 item678
12345 'string' 0.123 2020-03-16 1.234 123 234 345
In other words, I want to replace columns A and D with new columns headed by what's on the left of the equal sign, using what's to the right of the equal sign as the corresponding value, and with the Option1() and Option2() parts and the commas stripped out. The columns that don't contain functions should be left as is.
Is there an elegant way to do this?
Actually, at this point, I'd settle for any old way, elegant or not; I've found various ways of dealing with this situation if, say, there were dicts populating columns, but nothing to help me pick it apart if there are functions there. Trying to search for the answer only gives me a bunch of results for how to apply functions to dataframes.
As long as your functions always have the same arguments, this should work.
You can read the csv with the following (if separators are 2 or more spaces, which is what I get when I paste your question's example):
df = pd.read_csv('test.csv',sep='[\s]{2,}', index_col=False, engine='python')
If your dataframe is df:
# break out both sides of the equal sign in function into columns
A_vals = df['A'].str.extractall(r'([\w\d]+)=([^,\)]*)')
# get rid of the multi-index and put the values after '=' into columns
A_converted = A_vals.unstack(level=-1)[1]
# set column names to values before '='
A_converted.columns = list(A_vals.unstack(level=-1)[0].values[0])
# same thing for 'D'
D_vals = df['D'].str.extractall(r'([\w\d]+)=([^,\)]*)')
D_converted = D_vals.unstack(level=-1)[1]
D_converted.columns = list(D_vals.unstack(level=-1)[0].values[0])
# join everything together
df = A_converted.join(df.drop(['A','D'], axis=1)).join(D_converted)
Some clarification on the regex: '([\w\d]+)=([^,\)]*)' has two capture groups (each part in parens):
Group 1 ([\w\d]+) is one or more characters (+) that are word characters \w or numbers \d.
= between groups.
Group 2 ([^,\)]*) is 0 or more characters (*) that are not (^) a comma , or paren \).
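For intuition, here is the pattern applied to the sample value from column A; extractall returns one row per name=value match, with the names in capture group 0 and the values in group 1:
sample = pd.Series(["Option1(item1=12345, item12='string', item345=0.123)"])
sample.str.extractall(r'([\w\d]+)=([^,\)]*)')
# three matches: (item1, 12345), (item12, 'string'), (item345, 0.123)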
I believe you're looking for something along these lines:
contracts = ["Option(conId=384688665, symbol='SPX', lastTradeDateOrContractMonth='20200116', strike=3205.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 200117P03205000', tradingClass='SPX')",
"Option(conId=12345678, symbol='DJX', lastTradeDateOrContractMonth='20200113', strike=1205.0, right='P', multiplier='200', exchange='SMART', currency='USD', localSymbol='DJXX 333117Y13205000', tradingClass='DJX')"]
new_conts = []
columns = []
for i in range(len(contracts)):
    # strip the Option( ... ) wrapper, leaving the comma-separated name=value pairs
    mod = contracts[i].replace('Option(', '').replace(')', '')
    contracts[i] = mod
    new_cont = contracts[i].split(', ')
    new_conts.append(new_cont)
for contract in new_conts:
    column = []
    for i in range(len(contract)):
        # left of '=' is the column name, right of '=' is the value
        mod = contract[i].split('=')
        contract[i] = mod[1]
        column.append(mod[0])
    columns.append(column)
print(len(columns[0]))
df = pd.DataFrame(new_conts,columns=columns[0])
df
Output:
conId symbol lastTradeDateOrContractMonth strike right multiplier exchange currency localSymbol tradingClass
0 384688665 'SPX' '20200116' 3205.0 'P' '100' 'SMART' 'USD' 'SPX 200117P03205000' 'SPX'
1 12345678 'DJX' '20200113' 1205.0 'P' '200' 'SMART' 'USD' 'DJXX 333117Y13205000' 'DJX'
Obviously you can then delete unwanted columns, change names, etc.
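For instance, if the leftover single quotes and surrounding whitespace are unwanted, a small cleanup sketch (this assumes every column is still string-typed, which is the case when the frame is built from split strings):
# strip stray whitespace and single quotes from every column
df = df.apply(lambda col: col.str.strip().str.strip("'"))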
I have the following code, which reads a csv file and then analyzes it. A patient can have more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label":[],
"Count_of_Patientes_Having":[],
"Count_of_Times_Being_Shown_In_An_Image":[]})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
illnesses.at[index, "Finding_Label"] = ctr
illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
for i in ids:
illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also those which have a specific patient ID. If you have a lot of patients, you will need to run this query a lot of times. A simpler way is to not filter on the patient ID and just take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case, since you want the number of rows, len is what you are looking for instead of size (size is the number of cells in the dataframe).
Finally, another source of error in your current code is that you were not keeping the count for every patient ID: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"] rather than set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
illnesses.at[index, "Finding_Label"] = ctr
illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
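As an aside, if Finding Labels really is '|'-separated as in the question's sample, both counts can be computed for every label at once with str.get_dummies. This is only a sketch and assumes exact label matches rather than substring matches:
dummies = raw_data['Finding Labels'].str.get_dummies(sep='|')          # one 0/1 column per label
image_counts = dummies.sum()                                           # images per label
patient_counts = dummies.groupby(raw_data['Patient ID']).max().sum()   # distinct patients per label
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': image_counts,
                          'Count_of_Patientes_Having': patient_counts})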
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is usually a sign of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) with the list of illnesses, so that each Finding Labels value is lined up against each illness in the same row; then filter to the rows where the longer string contains the shorter item. Finally, run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
.merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
.drop(columns=['key'])
)
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
'Finding Labels': np.core.defchararray.add(
np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
np.random.choice(ills, 25).astype('str')),
np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
})
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
I have 2 columns. I need to take specific string information from each column and create a new column with new strings based on this.
In the column "Name" I have well names. I need to look at the last 4 characters of each well name, and if they contain "H", then label that row "HZ" in a new column.
I need to do the same thing if the column "WELLTYPE" contains specific words.
Using the data analysis program Spotfire, I can do this all in one simple expression (see below).
case
When right([UWI],4)~="H" Then "HZ"
When [WELLTYPE]~="Horizontal" Then "HZ"
When [WELLTYPE]~="Deviated" Then "D"
When [WELLTYPE]~="Multilateral" Then "ML"
else "V"
End
What would be the best way to do this in Python Pandas?
Is there a simple, clean way to do this all at once like in the Spotfire expression above?
Here is the data table with the two columns and my hoped-for outcome column (it did not copy very well into this); I also provide the code to create the table below.
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V
Here is the code to create the dataframe
# Dataframe with hopeful outcome
raw_data = {'Name': ['HH-001HST2', 'HH-001HST', 'HB-002H', 'HB-002', 'HB-002','HB-004','HB-005','BB-007'],
'WELLTYPE':['Oil Horizontal', 'Oil_Horizontal', 'Oil', 'Water_Deviated', 'Oil_Multilateral','Oil','Source','Water'],
'What I Want': ['HZ', 'HZ', 'HZ', 'D', 'ML','V','V','V']}
df = pd.DataFrame(raw_data, columns = ['Name','WELLTYPE','What I Want'])
df
Nested 'where' variant:
df['What I Want'] = np.where(df.Name.str[-4:].str.contains('H'), 'HZ',
np.where(df.WELLTYPE.str.contains('Horizontal'),'HZ',
np.where(df.WELLTYPE.str.contains('Deviated'),'D',
np.where(df.WELLTYPE.str.contains('Multilateral'),'ML',
'V'))))
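np.select expresses the same priority-ordered conditions without the nesting; a sketch of the identical logic:
conditions = [df.Name.str[-4:].str.contains('H'),
              df.WELLTYPE.str.contains('Horizontal'),
              df.WELLTYPE.str.contains('Deviated'),
              df.WELLTYPE.str.contains('Multilateral')]
choices = ['HZ', 'HZ', 'D', 'ML']
df['What I Want'] = np.select(conditions, choices, default='V')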
Using apply by row:
def criteria(row):
    # 'in' checks avoid the edge case where find() returns 0 for a match at the very start
    if 'H' in row.Name[-4:]:
        return 'HZ'
    elif 'Horizontal' in row.WELLTYPE:
        return 'HZ'
    elif 'Deviated' in row.WELLTYPE:
        return 'D'
    elif 'Multilateral' in row.WELLTYPE:
        return 'ML'
    else:
        return 'V'
df['want'] = df.apply(criteria, axis=1)
This feels more natural to me. Obviously subjective.
from_name = df.Name.str[-4:].str.contains('H').map({True: 'HZ'})
regex = '(Horizontal|Deviated|Multilateral)'
m = dict(Horizontal='HZ', Deviated='D', Multilateral='ML')
from_well = df.WELLTYPE.str.extract(regex, expand=False).map(m)
df['What I Want'] = from_name.fillna(from_well).fillna('V')
print(df)
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators; however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
genic_snps = pysqldf(q)
all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
info = line[1] # Getting the Series object
this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
(snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs falling within their range. This makes use of a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
# Getting rid of redundant genes
min_bp = this_chr_snp['BP'].min()
max_bp = this_chr_snp['BP'].max()
this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
~(this_genes['chr_stop'] <= min_bp)]
for line in this_genes.iterrows():
info = line[1]
this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
(this_chr_snp['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(1, 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
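One caveat with the merge-then-filter approach: merging on chromosome alone first materializes every SNP-gene pair on the same chromosome, which can be very large for ~6.8 million SNPs. A memory-conscious sketch of the same idea processes one chromosome at a time:
pieces = []
for chrom, snps in snp_df.groupby('chromosome'):
    genes = gene_df[gene_df['chromosome'] == chrom]
    # inner join within the chromosome, then keep SNPs falling inside a gene's range
    m = snps.merge(genes, on='chromosome', how='inner')
    m = m[(m.BP >= m.chr_start) & (m.BP <= m.chr_stop)]
    pieces.append(m[['SNP', 'feature_id']])
all_genic_snps = pd.concat(pieces, ignore_index=True)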