Efficient way to iterate over rows and columns in pandas - python
I have a pandas DataFrame Bg built from samples and a gene list r; r is a list of genes that I want to split out row-wise across the entire dataframe.
My code below takes a long time to run and repeatedly crashes. Is there a more efficient way to achieve this?
import pandas as pd

Bg = pd.DataFrame()
for idx, r in pathway_genes.itertuples():
    for i, p in enumerate(M.index):
        if idx == p:
            for genes, samples in common_mrna.iterrows():
                b = pd.DataFrame({r: samples})
                Bg = Bg.append(b).fillna(0)
M.index
M.index = ['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION', 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM']
pathway_genes

                                                  geneSymbols
KEGG_ABC_TRANSPORTERS                             ['ABCA1', 'ABCA10', 'ABCA12']
KEGG_ACUTE_MYELOID_LEUKEMIA                       ['AKT1', 'AKT2', 'AKT3', 'ARAF']
KEGG_ADHERENS_JUNCTION                            ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2']
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY              ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5']
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM   ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']
common_mrna
common_mrna = pd.DataFrame(
    [[1.2, 1.3, 1.4, 1.5], [1.6, 1.7, 1.8, 1.9], [2.0, 2.1, 2.2, 2.3],
     [2.4, 2.5, 2.6, 2.7], [2.8, 2.9, 3.0, 3.1], [3.2, 3.3, 3.4, 3.5],
     [3.6, 3.7, 3.8, 3.9], [4.0, 4.1, 4.2, 4.3], [4.4, 4.5, 4.6, 4.7],
     [4.8, 4.9, 5.0, 5.1], [5.2, 5.3, 5.4, 5.5], [5.6, 5.7, 5.8, 5.9],
     [6.0, 6.1, 6.2, 6.3], [6.4, 6.5, 6.6, 6.7], [6.8, 6.9, 7.0, 7.1],
     [7.2, 7.3, 7.4, 7.5], [7.6, 7.7, 7.8, 7.9]],
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'],
    index=['ABCA1', 'ABCA10', 'ABCA12', 'AKT1', 'AKT2', 'AKT3', 'ARAF', 'ACP1', 'ACTB',
           'ACTG1', 'ACTN1', 'ACTN2', 'ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2'])
Desired output:
Bg = pd.DataFrame(
    [[4.0, 4.1, 4.2, 4.3], [4.4, 4.5, 4.6, 4.7], [4.8, 4.9, 5.0, 5.1],
     [5.2, 5.3, 5.4, 5.5], [5.6, 5.7, 5.8, 5.9], [6.0, 6.1, 6.2, 6.3],
     [6.4, 6.5, 6.6, 6.7], [6.8, 6.9, 7.0, 7.1], [7.2, 7.3, 7.4, 7.5],
     [7.6, 7.7, 7.8, 7.9]],
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'],
    index=['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2', 'ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2'])
First of all, you can use a list comprehension to match M_index with pathway_genes:
pathway_genes = {
    'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'],
    'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'],
    'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'],
    'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'],
    'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']}
matched_index_symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in M_index]
After that, you can use loc to match all the symbols.
flatten_list = [j for sub in matched_index_symbols for j in sub]
Bg = common_mrna.loc[flatten_list]
Out[26]:
TCGA-02-0033-01 TCGA-02-2470-01 TCGA-02-2483-01 TCGA-06-0124-01
ABCA1 1.2 1.3 1.4 1.5
ABCA10 1.6 1.7 1.8 1.9
ABCA12 2.0 2.1 2.2 2.3
ACP1 4.0 4.1 4.2 4.3
ACTB 4.4 4.5 4.6 4.7
ACTG1 4.8 4.9 5.0 5.1
ACTN1 5.2 5.3 5.4 5.5
ACTN2 5.6 5.7 5.8 5.9
ABAT 6.0 6.1 6.2 6.3
ACY3 6.4 6.5 6.6 6.7
ADSL 6.8 6.9 7.0 7.1
ADSS1 7.2 7.3 7.4 7.5
ADSS2 7.6 7.7 7.8 7.9
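One caveat, assuming some of the listed genes might be missing from common_mrna: since pandas 1.0, .loc raises a KeyError if any label in the list is absent from the index. A small tolerant variant (a sketch using the same variables) keeps only the genes that are actually present:

kept = common_mrna.index.intersection(flatten_list)
Bg = common_mrna.loc[kept]

Alternatively, common_mrna.reindex(flatten_list) keeps every requested gene and fills the missing ones with NaN rows instead of dropping them.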
Update
It seems that your pathway_genes is not originally a dictionary but a dataframe. If that's the case, you can pull the gene lists out of its geneSymbols column instead.
pathway_genes
Out[46]:
geneSymbols
KEGG_ABC_TRANSPORTERS [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM [ABAT, ACY3, ADSL, ADSS1, ADSS2]
import numpy as np

matched_index_symbols = np.array([pathway_genes['geneSymbols'].loc[i] for i in pathway_genes.index if i in M_index])
flatten_list = matched_index_symbols.ravel()
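One caveat on the ravel step: it only flattens cleanly when every matched gene list has the same length. With lists of different lengths, np.array builds a 1-D object array of lists and ravel leaves them nested. A plain list comprehension, as in the dictionary version above, avoids that (a sketch using the same variables):

matched_index_symbols = [pathway_genes['geneSymbols'].loc[i] for i in pathway_genes.index if i in M_index]
flatten_list = [j for sub in matched_index_symbols for j in sub]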
Related
How can an array be added as a separate column to a CSV file using numpy (preferable) or python 3 without overwriting the CSV file or data?
The new data to be appended has a shorter length! Here is an example. The CSV currently has two columns of equal length:

1.1 2.1
1.2 2.2
1.3 2.3
1.4 2.4
1.5 2.5
1.6 2.6

Append the numpy array ema = [3.3 3.4 3.5 3.6] to end up with a third column (the CSV now has 3 columns of equal length), padded at the top:

1.1 2.1 0
1.2 2.2 0
1.3 2.3 3.3
1.4 2.4 3.4
1.5 2.5 3.5
1.6 2.6 3.6
#kinshukdua's comment suggestion as code: the best way would be to read the old data as a pandas dataframe, append the column to it, fill the empty cells with 0s, and finally write it back to the CSV:

import pandas as pd
import numpy as np

my_csv_path = r"path/to/csv.csv"
df = pd.read_csv(my_csv_path)
ema_padded = np.concatenate([np.zeros(len(df) - ema.shape[0]), ema])
df['ema'] = pd.Series(ema_padded, index=df.index)
df.to_csv(my_csv_path)
df.to_csv(file_path, index=False)
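An alternative sketch that stays entirely in pandas (assuming, as above, that the shorter ema array should be aligned to the bottom rows and padded with zeros at the top): build the new column as a Series indexed to the last rows of the frame and let index alignment do the padding.

df['ema'] = pd.Series(ema, index=df.index[-len(ema):])
df['ema'] = df['ema'].fillna(0)
df.to_csv(my_csv_path, index=False)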
Using pandas to get max value per row and column header
I have a data frame and I am looking to get the max value for each row and the column header for the column where the max value is located, returning a new dataframe. In reality my data frame has over 50 columns and over 30,000 rows:

df1:
ID   Tis  RNA  DNA  Prot  Node  Exv
AB   1.4  2.3  0.0  0.3   2.4   4.4
NJ   2.2  3.4  2.1  0.0   0.0   0.2
KL   0.0  0.0  0.0  0.0   0.0   0.0
JC   5.2  4.4  2.1  5.4   3.4   2.3

So the ideal output looks like this:

df2:
ID
AB   Exv   4.4
NJ   RNA   3.4
KL   N/A   N/A
JC   Prot  5.4

I have tried the following without any success:

df2 = df1.max(axis=1)
result.index = df1.idxmax(axis=1)

also tried:

df2 = pd.Series(df1.columns[np.argmax(df1.values, axis=1)])
final = pd.DataFrame(df1.lookup(s.index, s), s)

I have looked at other posts but still can't seem to solve this. Any help would be great.
Use DataFrame.agg with ['idxmax', 'max'] and mask the rows whose max is 0 as missing values. If ID is the index:

df = df1.agg(['idxmax','max'], axis=1).mask(lambda x: x['max'].eq(0))
print (df)

   idxmax  max
AB    Exv  4.4
NJ    RNA  3.4
KL    NaN  NaN
JC   Prot  5.4

Use this if ID is a column:

df = df1.set_index('ID').agg(['idxmax','max'], axis=1).mask(lambda x: x['max'].eq(0))
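For a very wide frame you can also build the two columns explicitly from idxmax and max and then blank out the all-zero rows; a sketch that should give the same result as the agg call above:

import numpy as np

df2 = pd.DataFrame({'idxmax': df1.idxmax(axis=1), 'max': df1.max(axis=1)})
df2.loc[df2['max'].eq(0)] = np.nan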
Reshape subsections of Pandas DataFrame into a wide format
I am importing data from a PDF which has not been optimised for analysis. The data has been imported into the following dataframe:

  NaN   NaN  Plant_A  NaN  Plant_B  NaN
  Pre   1,2      1.1  1.2      6.1  6.2
  Pre   3,4      1.3  1.4      6.3  6.4
  Post  1,2      2.1  2.2      7.1  7.2
  Post  3,4      2.3  2.4      7.3  7.4

and I would like to reorganise it into the following form:

         Pre_1  Pre_2  Pre_3  Pre_4  Post_1  Post_2  Post_3  Post_4
Plant_A    1.1    1.2    1.3    1.4     2.1     2.2     2.3     2.4
Plant_B    6.1    6.2    6.3    6.4     7.1     7.2     7.3     7.4

I started by splitting the 2nd column by commas, and then combining that with the first column to give me Pre_1 and Pre_2, for instance. However, I have struggled to match that with the data in the rest of the columns (for instance, Pre_1 with 1.1 and Pre_2 with 1.2). Any help would be greatly appreciated.
I had to make some assumptions in regards to the consistency of your data:

from itertools import cycle
import pandas as pd

tracker = {}
for temporal, spec, *data in df.itertuples(index=False):
    data = data[::-1]
    cycle_plant = cycle(['Plant_A', 'Plant_B'])
    spec_i = spec.split(',')
    while data:
        plant = next(cycle_plant)
        for i in spec_i:
            tracker[(plant, f"{temporal}_{i}")] = data.pop()

pd.Series(tracker).unstack()

         Post_1  Post_2  Post_3  Post_4  Pre_1  Pre_2  Pre_3  Pre_4
Plant_A     2.1     2.2     2.3     2.4    1.1    1.2    1.3    1.4
Plant_B     7.1     7.2     7.3     7.4    6.1    6.2    6.3    6.4
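For anyone who wants to reproduce this, a sketch of the imported frame as I read it from the question (the real column headers are NaN, so itertuples is unpacked purely by position):

import pandas as pd

df = pd.DataFrame([
    ['Pre',  '1,2', 1.1, 1.2, 6.1, 6.2],
    ['Pre',  '3,4', 1.3, 1.4, 6.3, 6.4],
    ['Post', '1,2', 2.1, 2.2, 7.1, 7.2],
    ['Post', '3,4', 2.3, 2.4, 7.3, 7.4],
])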
How can I manage units in pandas data?
I'm trying to figure out if there is a good way to manage units in my pandas data. For example, I have a DataFrame that looks like this:

   length (m)  width (m)  thickness (cm)
0         1.2        3.4             5.6
1         7.8        9.0             1.2
2         3.4        5.6             7.8

Currently, the measurement units are encoded in column names. Downsides include:

column selection is awkward -- df['width (m)'] vs. df['width']
things will likely break if the units of my source data change

If I wanted to strip the units out of the column names, is there somewhere else that the information could be stored?
There isn't any great way to do this right now, see the github issue here for some discussion. As a quick hack, you could do something like this, maintaining a separate dict with the units:

In [3]: units = {}

In [5]: newcols = []
   ...: for col in df:
   ...:     name, unit = col.split(' ')
   ...:     units[name] = unit
   ...:     newcols.append(name)

In [6]: df.columns = newcols

In [7]: df
Out[7]:
   length  width  thickness
0     1.2    3.4        5.6
1     7.8    9.0        1.2
2     3.4    5.6        7.8

In [8]: units['length']
Out[8]: '(m)'
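If you later need the labelled headers back, the same dict can rebuild them (a small sketch, assuming units was filled as above):

labelled = [f"{name} {units[name]}" for name in df.columns]
# ['length (m)', 'width (m)', 'thickness (cm)']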
As I was searching for this too, here is what pint and the (experimental) pint_pandas are capable of today:

import pandas as pd
import pint
import pint_pandas

ureg = pint.UnitRegistry()
ureg.Unit.default_format = "~P"
pint_pandas.PintType.ureg.default_format = "~P"

df = pd.DataFrame({
    "length": pd.Series([1.2, 7.8, 3.4], dtype="pint[m]"),
    "width": pd.Series([3.4, 9.0, 5.6], dtype="pint[m]"),
    "thickness": pd.Series([5.6, 1.2, 7.8], dtype="pint[cm]"),
})

print(df.pint.dequantify())

     length width thickness
unit      m     m        cm
0       1.2   3.4       5.6
1       7.8   9.0       1.2
2       3.4   5.6       7.8

df['width'] = df['width'].pint.to("inch")
print(df.pint.dequantify())

     length       width thickness
unit      m          in        cm
0       1.2  133.858268       5.6
1       7.8  354.330709       1.2
2       3.4  220.472441       7.8
A few options to consider:

pandas-units-extension: janpipek/pandas-units-extension, a units extension array for pandas based on astropy
pint-pandas: hgrecco/pint-pandas, pandas support for pint
you can also extend pandas yourself, following the "Extending pandas" section of the pandas 1.3.0 documentation
How to group data by time
I am trying to find a way to group data daily. This is an example of my data set:

Dates                   Price1  Price 2
2002-10-15 11:17:03pm      0.6      5.0
2002-10-15 11:20:04pm      1.4      2.4
2002-10-15 11:22:12pm      4.1      9.1
2002-10-16 12:21:03pm      1.6      1.4
2002-10-16 12:22:03pm      7.7      3.7
Yeah, I would definitely use pandas for this. The trickiest part is just figuring out the datetime parser for pandas to use to load in the data. After that, it's just a resampling of the subsequent DataFrame.

In [62]: parse = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %I:%M:%S%p')

In [63]: dframe = pandas.read_table("data.txt", delimiter=",", index_col=0, parse_dates=True, date_parser=parse)

In [64]: print dframe
                     Price1  Price 2
Dates
2002-10-15 23:17:03     0.6      5.0
2002-10-15 23:20:04     1.4      2.4
2002-10-15 23:22:12     4.1      9.1
2002-10-16 12:21:03     1.6      1.4
2002-10-16 12:22:03     7.7      3.7

In [78]: means = dframe.resample("D", how='mean', label='left')

In [79]: print means
              Price1  Price 2
Dates
2002-10-15  2.033333     5.50
2002-10-16  4.650000     2.55

where data.txt:

Dates , Price1 , Price 2
2002-10-15 11:17:03pm, 0.6 , 5.0
2002-10-15 11:20:04pm, 1.4 , 2.4
2002-10-15 11:22:12pm, 4.1 , 9.1
2002-10-16 12:21:03pm, 1.6 , 1.4
2002-10-16 12:22:03pm, 7.7 , 3.7
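A note for readers on current versions: the answer above uses Python 2 print statements and an old resample signature. The how= keyword has since been removed, so with dframe loaded the same way the equivalent today is:

means = dframe.resample("D").mean()   # label='left' is already the default for daily bins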
From the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/pandas.pdf

# 72 hours starting with midnight Jan 1st, 2011
In [1073]: rng = date_range('1/1/2011', periods=72, freq='H')
Use data.groupby(data['dates'].map(lambda x: x.day))
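Note that grouping on .day alone lumps together the same day-of-month from different months. Two variants that group by full calendar day instead (a sketch, assuming the 'dates' column holds datetimes):

data.groupby(data['dates'].dt.date).mean()
data.groupby(pd.Grouper(key='dates', freq='D')).mean()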