I have a pandas DataFrame Bg that was built with samples in rows and r in columns, where r is a list of genes that I want to split row-wise across the entire DataFrame.
My code below takes a long time to run and repeatedly crashes. Is there a more efficient way to achieve this?
import pandas as pd
Bg = pd.DataFrame()
for idx, r in pathway_genes.itertuples():
    for i, p in enumerate(M.index):
        if idx == p:
            for genes, samples in common_mrna.iterrows():
                b = pd.DataFrame({r:samples})
                Bg = Bg.append(b).fillna(0)
M.index
M.index = ['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION', 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM']
pathway_genes
                                                 geneSymbols
KEGG_ABC_TRANSPORTERS                            ['ABCA1', 'ABCA10', 'ABCA12']
KEGG_ACUTE_MYELOID_LEUKEMIA                      ['AKT1', 'AKT2', 'AKT3', 'ARAF']
KEGG_ADHERENS_JUNCTION                           ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2']
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY             ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5']
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM  ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']
common_mrna
common_mrna = pd.DataFrame([[1.2, 1.3, 1.4, 1.5], [1.6,1.7,1.8,1.9], [2.0,2.1,2.2,2.3], [2.4,2.5,2.6,2.7], [2.8,2.9,3.0,3.1],[3.2,3.3,3.4,3.5],[3.6,3.7,3.8,3.9],[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ABCA1','ABCA10','ABCA12','AKT1','AKT2','AKT3','ARAF','ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])
Desired output:
Bg = pd.DataFrame([[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])
First of all, you can use a list comprehension to match M_index (the pathway names from M.index) against pathway_genes:
pathway_genes = {'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'], 'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'], 'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'], 'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'], 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']}
matched_index_symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in M_index]
After that, flatten the matched lists and use loc to select all the symbols:
flatten_list = [j for sub in matched_index_symbols for j in sub]
Bg = common_mrna.loc[flatten_list]
Out[26]:
TCGA-02-0033-01 TCGA-02-2470-01 TCGA-02-2483-01 TCGA-06-0124-01
ABCA1 1.2 1.3 1.4 1.5
ABCA10 1.6 1.7 1.8 1.9
ABCA12 2.0 2.1 2.2 2.3
ACP1 4.0 4.1 4.2 4.3
ACTB 4.4 4.5 4.6 4.7
ACTG1 4.8 4.9 5.0 5.1
ACTN1 5.2 5.3 5.4 5.5
ACTN2 5.6 5.7 5.8 5.9
ABAT 6.0 6.1 6.2 6.3
ACY3 6.4 6.5 6.6 6.7
ADSL 6.8 6.9 7.0 7.1
ADSS1 7.2 7.3 7.4 7.5
ADSS2 7.6 7.7 7.8 7.9
Update
It seems that your pathway_genes is not originally a dictionary but a DataFrame. If that's the case, you can look up each gene list through the DataFrame's index:
pathway_genes
Out[46]:
geneSymbols
KEGG_ABC_TRANSPORTERS [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM [ABAT, ACY3, ADSL, ADSS1, ADSS2]
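import numpy as np
# Keep a plain list: np.array on lists of unequal length is not 2-D, so ravel() would not flatten it
matched_index_symbols = [pathway_genes['geneSymbols'].loc[i] for i in pathway_genes.index if i in M_index]
flatten_list = np.concatenate(matched_index_symbols)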
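If pathway_genes is a DataFrame as in the update, the whole lookup can probably be done without any Python-level loops. The following is only a minimal sketch under a few assumptions: pandas 0.25 or later (for Series.explode), the sample data from the question (the common_mrna values are generated from the same pattern instead of typed out), and genes that are missing from common_mrna are silently skipped, which may or may not be what you want.

```python
import pandas as pd

# Sample data from the question
pathway_genes = pd.DataFrame(
    {"geneSymbols": [
        ["ABCA1", "ABCA10", "ABCA12"],
        ["AKT1", "AKT2", "AKT3", "ARAF"],
        ["ACP1", "ACTB", "ACTG1", "ACTN1", "ACTN2"],
        ["ACACB", "ACSL1", "ACSL3", "ACSL4", "ACSL5"],
        ["ABAT", "ACY3", "ADSL", "ADSS1", "ADSS2"],
    ]},
    index=[
        "KEGG_ABC_TRANSPORTERS",
        "KEGG_ACUTE_MYELOID_LEUKEMIA",
        "KEGG_ADHERENS_JUNCTION",
        "KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY",
        "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM",
    ],
)
M_index = [
    "KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION",
    "KEGG_DRUG_METABOLISM_OTHER_ENZYMES",
    "KEGG_PEROXISOME",
    "KEGG_LONG_TERM_POTENTIATION",
    "KEGG_ADHERENS_JUNCTION",
    "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM",
]
genes = ["ABCA1", "ABCA10", "ABCA12", "AKT1", "AKT2", "AKT3", "ARAF",
         "ACP1", "ACTB", "ACTG1", "ACTN1", "ACTN2",
         "ABAT", "ACY3", "ADSL", "ADSS1", "ADSS2"]
samples = ["TCGA-02-0033-01", "TCGA-02-2470-01", "TCGA-02-2483-01", "TCGA-06-0124-01"]
# Same values as in the question: 1.2, 1.3, ... increasing by 0.1 per cell
values = [[round(1.2 + 0.4 * i + 0.1 * j, 1) for j in range(4)] for i in range(17)]
common_mrna = pd.DataFrame(values, index=genes, columns=samples)

# Keep only the pathways listed in M_index, then turn each gene list into one row per gene
selected = pathway_genes.loc[pathway_genes.index.isin(M_index), "geneSymbols"].explode()
# Assumption: genes without expression data are dropped instead of raising a KeyError
selected = selected[selected.isin(common_mrna.index)]

Bg = common_mrna.loc[selected.tolist()]
print(Bg)
```

With the M.index from the question, only the two matching pathways contribute rows, so Bg contains the ten genes from the desired output.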
Background:
I'm currently developing some data profiling in SQL Server. This consists of calculating aggregate statistics on the values in targeted columns.
I'm using SQL for most of the heavy lifting, but calling Python for some of the statistics that SQL is poor at calculating. I'm leveraging the Pandas package through SQL Server Machine Learning Services.
However, I'm currently developing this script in Visual Studio, so the SQL portion is irrelevant other than as background.
Problem:
My issue is that when I call one of the Python statistics functions, it produces the output as a series with the labels seemingly not part of the data. I cannot access the labels at all. I need the values of these labels, and I need to normalize the data and insert a column with static values describing which calculation was performed on that row.
Constraints:
I will need to normalize each statistic so I can union the datasets and pass the values back to SQL for further processing. All output needs to accept dynamic schemas, so no hardcoding labels etc.
Attempted solutions:
I've tried explicitly coercing output to dataframes. This just results in a series with label "0".
I've also tried adding static values to the columns. This just adds the target column name as one of the inaccessible labels, and the intended static value as part of the series.
I've searched many times for a solution, and couldn't find anything relevant to the problem.
Code and results below. Using the iris dataset as an example.
###########################
## AGG STATS TEST SCRIPT
##
###########################
#LOAD MODULES
import pandas as pds
#GET SAMPLE DATASET
iris = pds.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#CENTRAL TENDENCY
mode1 = iris.mode()
stat_mode = pds.melt(
mode1
)
stat_median = iris.median(numeric_only=True) #Skip the non-numeric species column
stat_median['STAT_NAME'] = 'STAT_MEDIAN' #Try to add a column with the value 'STAT_MEDIAN'
#AGGREGATE STATS
stat_describe = iris.describe()
#PRINT RESULTS
print(iris)
print(stat_median)
print(stat_describe)
###########################
## OUTPUT
##
###########################
>>> #PRINT RESULTS
... print(iris) #ORIGINAL DATASET
...
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
>>> print(stat_median) #YOU CAN SEE THAT IT INSERTED COLUMN INTO ROW LABELS, VALUE INTO RESULTS SERIES
sepal_length 5.8
sepal_width 3
petal_length 4.35
petal_width 1.3
STAT_NAME STAT_MEDIAN
dtype: object
>>> print(stat_describe) #BASIC DESCRIPTIVE STATS, NEED TO LABEL THE STATISTIC NAMES TO UNPIVOT THIS
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>>>
Any assistance is greatly appreciated. Thank you!
I figured it out. There's a method called reset_index that converts the index (the labels) into a regular column and creates a new numerical index.
stat_median = pds.DataFrame(stat_median)
stat_median.reset_index(inplace=True)
stat_median = stat_median.rename(columns={'index' : 'fieldname', 0: 'value'})
stat_median['stat_name'] = 'median'
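The same idea seems to extend to the other statistics without hardcoding any column labels. The sketch below is an illustration, not a definitive implementation: the toy DataFrame and the helper name series_stat_to_long are made up for the example, and the output column names (fieldname, stat_name, value) simply follow the ones used above.

```python
import pandas as pd

# Toy data standing in for the real table (hypothetical values)
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "label": ["x", "y", "z", "x"],
})

def series_stat_to_long(stat, stat_name):
    """Turn a Series of per-column statistics into fieldname/value rows."""
    out = stat.rename_axis("fieldname").reset_index(name="value")
    out["stat_name"] = stat_name
    return out

median_long = series_stat_to_long(df.median(numeric_only=True), "median")

# describe() has the statistic names in the index, so move them to a column and unpivot
describe_long = (
    df.describe()
      .rename_axis("stat_name")
      .reset_index()
      .melt(id_vars="stat_name", var_name="fieldname", value_name="value")
)

# Both pieces now share one schema and can be stacked (the "union")
combined = pd.concat([median_long, describe_long], ignore_index=True)
combined = combined[["fieldname", "stat_name", "value"]]
print(combined)
```

Because every statistic ends up with the same three columns, the result can be passed back as a single table regardless of how many columns the input has.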
I have a dataframe like this.
status new allocation
asset csh fi eq csh fi eq
person act_type
p1 inv 0.0 0.0 100000.0 0.0 0.0 1.0
rsp 0.0 30000.0 20000.0 0.0 0.6 0.4
tfsa 10000.0 40000.0 0.0 0.2 0.8 0.0
The right three columns are percent of total for each act_type. The following does calculate the columns correctly:
# set the percent allocations
df.loc[idx[:,:],idx["allocation",'csh']] = df.loc[idx[:,:],idx["new",'csh']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'fi']] = df.loc[idx[:,:],idx["new",'fi']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'eq']] = df.loc[idx[:,:],idx["new",'eq']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
I have tried to do these calculations on one line combining 'csh', 'fi', 'eq' as follows:
df.loc[idx[:,:],idx["new", ('csh', 'fi', 'eq')]] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
But this results in ValueError: cannot join with no level specified and no overlapping names
Any suggestions on how I can reduce these three lines to one line of code, so that I'm dividing ('csh','fi','eq') by the account total and getting percents in the next columns?
First, idx[:,:] can be simplified to : (idx here is pd.IndexSlice). Then use DataFrame.div with axis=0, and for the new columns use rename together with DataFrame.join:
df1=df.loc[:, idx["new",('csh', 'fi', 'eq')]].div(df.loc[:, idx["new",:]].sum(axis=1),axis=0)
df = df.join(df1.rename(columns={'new':'allocation'}, level=0))
print (df)
status new allocation
asset csh fi eq csh fi eq
person act_type
p1 inv 0.0 0.0 100000.0 0.0 0.0 1.0
rsp 0.0 30000.0 20000.0 0.0 0.6 0.4
tfsa 10000.0 40000.0 0.0 0.2 0.8 0.0
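For reference, here is a self-contained sketch of the same row-wise division that can be run as is. It rebuilds the sample frame with only the "new" columns, and it attaches the result with pd.concat and an explicitly built column MultiIndex instead of rename plus join; that is just one possible variant, not the only way to do it.

```python
import pandas as pd

rows = pd.MultiIndex.from_tuples(
    [("p1", "inv"), ("p1", "rsp"), ("p1", "tfsa")],
    names=["person", "act_type"],
)
cols = pd.MultiIndex.from_product(
    [["new"], ["csh", "fi", "eq"]], names=["status", "asset"]
)
df = pd.DataFrame(
    [[0.0, 0.0, 100000.0],
     [0.0, 30000.0, 20000.0],
     [10000.0, 40000.0, 0.0]],
    index=rows, columns=cols,
)

# Total per account (row), then divide every asset column by it row-wise
totals = df["new"].sum(axis=1)
alloc = df["new"].div(totals, axis=0)

# Put the result under a new top-level column label and attach it
alloc.columns = pd.MultiIndex.from_product(
    [["allocation"], alloc.columns], names=df.columns.names
)
df = pd.concat([df, alloc], axis=1)
print(df)
```

The key point is axis=0 in div: it aligns the totals with the rows, whereas the plain / operator tries to align the Series with the columns, which is what triggers the ValueError.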
I'm using Pandas for data analysis. I have an input file like this snippet:
VEH SEC POS ACCELL SPEED
2 8.4 36.51 -0.2929 27.39
3 8.4 23.57 -0.7381 33.09
4 8.4 6.18 0.6164 38.8
1 8.5 47.76 0 25.57
I need to reorganize the data so that the rows are the unique (ordered) values from SEC as the 1st column, and then the other columns would be VEH1_POS, VEH1_SPEED, VEH1_ACCELL, VEH2_POS, VEH2_SPEED, VEH2_ACCELL, etc.:
TIME VEH1_POS VEH1_SPEED VEH1_ACCEL VEH2_POS, VEH2_SPEED, etc.
0.1 6.2 3.7 0.0 7.5 2.1
0.2 6.8 3.2 -0.5 8.3 2.1
etc.
So, for example, the value for VEH1_POS for each row in the new dataframe would be filled in by selecting values from the POS column in the original dataframe using the row where the SEC value matches the TIME value for the row in the new dataframe and the VEH value == 1.
To set up the rows in the new data frame I'm doing this:
start = inputdf['SIMSEC'].min()
end = inputdf['SIMSEC'].max()
time_steps = frange(start, end, 0.1)
outputdf['TIME'] = time_steps
But I'm lost at how to select the right values from the input dataframe and create the rest of the new dataframe for further analysis. Note also that the input file will NOT have data for every VEH for every SEC (time stamp). So the solution needs to handle that as well. My best guess was:
outputdf['veh1_pos'] = np.where((inputdf['VEH NO'] == 1) & (inputdf['SIMSEC'] == row['Time Step']))
but that doesn't work.
import pandas as pd
# your data
# ==========================
print(df)
Out[272]:
VEH SEC POS ACCELL SPEED
0 2 8.4 36.51 -0.2929 27.39
1 3 8.4 23.57 -0.7381 33.09
2 4 8.4 6.18 0.6164 38.80
3 1 8.5 47.76 0.0000 25.57
# reshaping
# ==========================
result = df.set_index(['SEC','VEH']).unstack()
Out[278]:
POS ACCELL SPEED
VEH 1 2 3 4 1 2 3 4 1 2 3 4
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
So here the columns have a multi-level index, where the 1st level is POS, ACCELL, SPEED and the 2nd level is VEH=1,2,3,4.
# if you want to rename the column
temp_z = result.columns.get_level_values(0)
temp_y = result.columns.get_level_values(1)
temp_x = ['VEH'] * len(temp_y)
result.columns = ['{}{}_{}'.format(x,y,z) for x,y,z in zip(temp_x, temp_y, temp_z)]
Out[298]:
VEH1_POS VEH2_POS VEH3_POS VEH4_POS VEH1_ACCELL VEH2_ACCELL VEH3_ACCELL VEH4_ACCELL VEH1_SPEED VEH2_SPEED VEH3_SPEED VEH4_SPEED
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
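If the columns should be grouped per vehicle (VEH1_..., VEH2_..., and so on) with the time as a regular first column, as in the layout sketched in the question, one possible variation is shown below. It is a sketch that assumes the four sample rows above and at most one row per (SEC, VEH) pair; within each vehicle the columns come out in alphabetical order, not in the POS, SPEED, ACCELL order from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "VEH": [2, 3, 4, 1],
    "SEC": [8.4, 8.4, 8.4, 8.5],
    "POS": [36.51, 23.57, 6.18, 47.76],
    "ACCELL": [-0.2929, -0.7381, 0.6164, 0.0],
    "SPEED": [27.39, 33.09, 38.8, 25.57],
})

# One row per SEC, one column per (measure, VEH); missing combinations become NaN
wide = df.set_index(["SEC", "VEH"]).unstack("VEH")

# Put VEH on the outer column level and sort so each vehicle's columns sit together
wide = wide.swaplevel(axis=1).sort_index(axis=1)

# Flatten the two column levels into names such as VEH1_POS
wide.columns = [f"VEH{veh}_{name}" for veh, name in wide.columns]

# Expose the time stamps as an ordinary TIME column
wide = wide.rename_axis("TIME").reset_index()
print(wide)
```

Time stamps with no data for a given vehicle simply show NaN in that vehicle's columns, which covers the case where not every VEH has a row for every SEC.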
I have a data frame (df) in pandas with four columns and I want a new column to represent the mean of these four columns: df['mean']= df.mean(1)
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400
So far so good. But when I save the results to a csv file this is what I found:
5.9,5.4,2.4,3.2,4.2250000000000005
0.6,0.7,0.7,0.7,0.6749999999999999
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
I guess I can force the format in the mean column, but any idea why this is happening?
I am using winpython with python 3.3.2 and pandas 0.11.0
You could use the float_format parameter:
import pandas as pd
import io
content = '''\
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400'''
df = pd.read_table(io.StringIO(content), sep=r'\s+')  # StringIO, because content is a str
df.to_csv('/tmp/test.csv', float_format='%g', index=False)
yields
1,2,3,4,mean
,,,,
5.9,5.4,2.4,3.2,4.225
0.6,0.7,0.7,0.7,0.675
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
The answers seem correct. Most decimal fractions cannot be represented exactly in binary floating point, so small differences like this are bound to appear. Read The Floating Point Guide.
>>> a = 5.9+5.4+2.4+3.2
>>> a / 4
4.2250000000000005
As you said, you could always format the results if you only want a fixed number of digits after the decimal point.
>>> "{:.3f}".format(a/4)
'4.225'
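Putting the two suggestions together, the sketch below compares formatting at write time with rounding the column before writing. It uses two rows from the question and writes to a string instead of a file so it can be run anywhere; the choice of three decimals is only an assumption for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [[5.9, 5.4, 2.4, 3.2],
     [0.6, 0.7, 0.7, 0.7]],
    columns=["1", "2", "3", "4"],
)
df["mean"] = df.mean(axis=1)

# Option 1: leave the data untouched and format every float when writing
formatted = df.to_csv(index=False, float_format="%.3f")

# Option 2: round only the mean column, so the other columns keep their shortest repr
rounded = df.round({"mean": 3}).to_csv(index=False)

print(formatted)
print(rounded)
```

Note that float_format applies to all float columns (5.9 is written as 5.900 with '%.3f'), while rounding a single column leaves the rest as they were.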