Efficient way to iterate over rows and columns in pandas - python

I have a pandas dataframe Bg that I am building with samples in the columns and genes in the rows. r is a list of genes (one list per pathway) that I want to pull out of the full expression dataframe, row-wise.
My code below takes a long time to run and repeatedly crashes. Is there a more efficient way to achieve this?
import pandas as pd

Bg = pd.DataFrame()
for idx, r in pathway_genes.itertuples():
    for i, p in enumerate(M.index):
        if idx == p:
            for genes, samples in common_mrna.iterrows():
                b = pd.DataFrame({r: samples})
                Bg = Bg.append(b).fillna(0)
M.index
M.index = ['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION', 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM']
pathway_genes
                                                    geneSymbols
KEGG_ABC_TRANSPORTERS                               ['ABCA1', 'ABCA10', 'ABCA12']
KEGG_ACUTE_MYELOID_LEUKEMIA                         ['AKT1', 'AKT2', 'AKT3', 'ARAF']
KEGG_ADHERENS_JUNCTION                              ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2']
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY                ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5']
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM     ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']
common_mrna
common_mrna = pd.DataFrame(
    [[1.2, 1.3, 1.4, 1.5], [1.6, 1.7, 1.8, 1.9], [2.0, 2.1, 2.2, 2.3], [2.4, 2.5, 2.6, 2.7],
     [2.8, 2.9, 3.0, 3.1], [3.2, 3.3, 3.4, 3.5], [3.6, 3.7, 3.8, 3.9], [4.0, 4.1, 4.2, 4.3],
     [4.4, 4.5, 4.6, 4.7], [4.8, 4.9, 5.0, 5.1], [5.2, 5.3, 5.4, 5.5], [5.6, 5.7, 5.8, 5.9],
     [6.0, 6.1, 6.2, 6.3], [6.4, 6.5, 6.6, 6.7], [6.8, 6.9, 7.0, 7.1], [7.2, 7.3, 7.4, 7.5],
     [7.6, 7.7, 7.8, 7.9]],
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'],
    index=['ABCA1', 'ABCA10', 'ABCA12', 'AKT1', 'AKT2', 'AKT3', 'ARAF', 'ACP1', 'ACTB', 'ACTG1',
           'ACTN1', 'ACTN2', 'ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2'])
Desired output:
Bg = pd.DataFrame(
    [[4.0, 4.1, 4.2, 4.3], [4.4, 4.5, 4.6, 4.7], [4.8, 4.9, 5.0, 5.1], [5.2, 5.3, 5.4, 5.5],
     [5.6, 5.7, 5.8, 5.9], [6.0, 6.1, 6.2, 6.3], [6.4, 6.5, 6.6, 6.7], [6.8, 6.9, 7.0, 7.1],
     [7.2, 7.3, 7.4, 7.5], [7.6, 7.7, 7.8, 7.9]],
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'],
    index=['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2', 'ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2'])

First of all, you can use a list comprehension to match M_index with pathway_genes:
pathway_genes = {'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'], 'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'], 'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'], 'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'], 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']}
matched_index_symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in M_index]
After that, you can use loc to match all the symbols.
flatten_list = [j for sub in matched_index_symbols for j in sub]
Bg = common_mrna.loc[flatten_list]
Out[26]:
TCGA-02-0033-01 TCGA-02-2470-01 TCGA-02-2483-01 TCGA-06-0124-01
ABCA1 1.2 1.3 1.4 1.5
ABCA10 1.6 1.7 1.8 1.9
ABCA12 2.0 2.1 2.2 2.3
ACP1 4.0 4.1 4.2 4.3
ACTB 4.4 4.5 4.6 4.7
ACTG1 4.8 4.9 5.0 5.1
ACTN1 5.2 5.3 5.4 5.5
ACTN2 5.6 5.7 5.8 5.9
ABAT 6.0 6.1 6.2 6.3
ACY3 6.4 6.5 6.6 6.7
ADSL 6.8 6.9 7.0 7.1
ADSS1 7.2 7.3 7.4 7.5
ADSS2 7.6 7.7 7.8 7.9
Update
It seems that your pathway_genes is not originally a dictionary but a DataFrame. If that's the case, you can pull the gene lists out of its geneSymbols column via the index instead:
pathway_genes
Out[46]:
geneSymbols
KEGG_ABC_TRANSPORTERS [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM [ABAT, ACY3, ADSL, ADSS1, ADSS2]
import numpy as np

matched_index_symbols = [pathway_genes['geneSymbols'].loc[i] for i in pathway_genes.index if i in M_index]
# np.concatenate flattens the matched lists even when they have different lengths
flatten_list = np.concatenate(matched_index_symbols)
Bg = common_mrna.loc[flatten_list]
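If you would rather stay entirely in pandas, here is an equivalent sketch (assuming pathway_genes is the DataFrame from the Update and M_index is the list of pathway names) that lets Series.explode do the flattening:
# Keep only the pathways present in M_index, then explode the list-valued
# geneSymbols column into one gene per row.
genes = pathway_genes.loc[pathway_genes.index.isin(M_index), 'geneSymbols'].explode()

# .loc raises a KeyError if a gene is missing from common_mrna;
# use common_mrna.reindex(genes) instead if that can happen.
Bg = common_mrna.loc[genes]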

Related

How can an array be added as a separate column to a CSV file using numpy (preferable) or python 3 without overwriting the CSV file or data?

The new data to be appended has a shorter length!
Here is an example.
Numpy array to add:
ema = [3.3 3.4 3.5 3.6]

csv before      append ema to end up with (csv now has 3 columns of equal length):
1.1 2.1         1.1 2.1 0
1.2 2.2         1.2 2.2 0
1.3 2.3         1.3 2.3 3.3
1.4 2.4         1.4 2.4 3.4
1.5 2.5         1.5 2.5 3.5
1.6 2.6         1.6 2.6 3.6
Here is @kinshukdua's comment suggestion as code: read the old data into a pandas DataFrame, append the new column to it, fill the empty rows with 0s, and finally write it back to CSV.
import numpy as np
import pandas as pd

my_csv_path = r"path/to/csv.csv"
df = pd.read_csv(my_csv_path)

# pad the front of ema with zeros so it matches the number of rows in df
ema_padded = np.concatenate([np.zeros(len(df) - ema.shape[0]), ema])
df['ema'] = pd.Series(ema_padded, index=df.index)

df.to_csv(my_csv_path, index=False)
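A slightly different sketch under the same assumption (the new values belong to the last rows of the frame) aligns ema with the tail of the index and lets reindex fill the rest:
# Build a Series over the last len(ema) row labels, then reindex to the full
# index so the earlier rows are filled with 0.
df['ema'] = pd.Series(list(ema), index=df.index[-len(ema):]).reindex(df.index, fill_value=0)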

Using pandas to get max value per row and column header

I have a data frame and I am looking to get the max value for each row and the column header for the column where the max value is located and return a new dataframe. In reality my data frame has over 50 columns and over 30,000 rows:
df1:
ID Tis RNA DNA Prot Node Exv
AB 1.4 2.3 0.0 0.3 2.4 4.4
NJ 2.2 3.4 2.1 0.0 0.0 0.2
KL 0.0 0.0 0.0 0.0 0.0 0.0
JC 5.2 4.4 2.1 5.4 3.4 2.3
So the ideal output looks like this:
df2:
ID
AB Exv 4.4
NJ RNA 3.4
KL N/A N/A
JC Prot 5.4
I have tried the following without any success:
df2 = df1.max(axis=1)
result.index = df1.idxmax(axis=1)
also tried:
df2=pd.Series(df1.columns[np.argmax(df1.values,axis=1)])
final=pd.DataFrame(df1.lookup(s.index,s),s)
I have looked at other posts but still can't seem to solve this.
Any help would be great
If ID is the index, use DataFrame.agg and replace all-zero rows with missing values:
df = df1.agg(['idxmax','max'], axis=1).mask(lambda x: x['max'].eq(0))
print (df)
idxmax max
AB Exv 4.4
NJ RNA 3.4
KL NaN NaN
JC Prot 5.4
If ID is a column:
df = df1.set_index('ID').agg(['idxmax','max'], axis=1).mask(lambda x: x['max'].eq(0))
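For reference, here is a small self-contained sketch of the ID-as-column variant, built from the sample frame in the question:
import pandas as pd

# sample frame from the question
df1 = pd.DataFrame({
    'ID':   ['AB', 'NJ', 'KL', 'JC'],
    'Tis':  [1.4, 2.2, 0.0, 5.2],
    'RNA':  [2.3, 3.4, 0.0, 4.4],
    'DNA':  [0.0, 2.1, 0.0, 2.1],
    'Prot': [0.3, 0.0, 0.0, 5.4],
    'Node': [2.4, 0.0, 0.0, 3.4],
    'Exv':  [4.4, 0.2, 0.0, 2.3],
})

# per-row max and its column label, with all-zero rows masked to NaN
df2 = df1.set_index('ID').agg(['idxmax', 'max'], axis=1).mask(lambda x: x['max'].eq(0))
print(df2)
#    idxmax  max
# ID
# AB    Exv  4.4
# NJ    RNA  3.4
# KL    NaN  NaN
# JC   Prot  5.4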

Reshape subsections of Pandas DataFrame into a wide format

I am importing data from a PDF which has not been optimised for analysis.
The data has been imported into the following dataframe
NaN NaN Plant_A NaN Plant_B NaN
Pre 1,2 1.1 1.2 6.1 6.2
Pre 3,4 1.3 1.4 6.3 6.4
Post 1,2 2.1 2.2 7.1 7.2
Post 3,4 2.3 2.4 7.3 7.4
and I would like to reorganise it into the following form:
Pre_1 Pre_2 Pre_3 Pre_4 Post_1 Post_2 Post_3 Post_4
Plant_A 1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4
Plant_B 6.1 6.2 6.3 6.4 7.1 7.2 7.3 7.4
I started by splitting the 2nd column by commas, and then combining that with the first column to give me Pre_1 and Pre_2 for instance. However I have struggled to match that with the data in the rest of the columns. For instance, Pre_1 with 1.1 and Pre_2 with 1.2
Any help would be greatly appreciated.
I had to make some assumptions regarding the consistency of your data.
from itertools import cycle
import pandas as pd

tracker = {}
for temporal, spec, *data in df.itertuples(index=False):
    data = data[::-1]
    cycle_plant = cycle(['Plant_A', 'Plant_B'])
    spec_i = spec.split(',')
    while data:
        plant = next(cycle_plant)
        for i in spec_i:
            tracker[(plant, f"{temporal}_{i}")] = data.pop()

pd.Series(tracker).unstack()
Post_1 Post_2 Post_3 Post_4 Pre_1 Pre_2 Pre_3 Pre_4
Plant_A 2.1 2.2 2.3 2.4 1.1 1.2 1.3 1.4
Plant_B 7.1 7.2 7.3 7.4 6.1 6.2 6.3 6.4
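To run the snippet above verbatim, here is one possible reconstruction of the imported frame; the column names are placeholders (the real headers came in as NaN from the PDF), which is fine because itertuples(index=False) only relies on column positions:
import pandas as pd

# Hypothetical reconstruction of the imported frame shown in the question.
df = pd.DataFrame(
    [['Pre',  '1,2', 1.1, 1.2, 6.1, 6.2],
     ['Pre',  '3,4', 1.3, 1.4, 6.3, 6.4],
     ['Post', '1,2', 2.1, 2.2, 7.1, 7.2],
     ['Post', '3,4', 2.3, 2.4, 7.3, 7.4]],
    columns=['temporal', 'spec', 'Plant_A_1', 'Plant_A_2', 'Plant_B_1', 'Plant_B_2'])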

How can I manage units in pandas data?

I'm trying to figure out if there is a good way to manage units in my pandas data. For example, I have a DataFrame that looks like this:
length (m) width (m) thickness (cm)
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
Currently, the measurement units are encoded in column names. Downsides include:
column selection is awkward -- df['width (m)'] vs. df['width']
things will likely break if the units of my source data change
If I wanted to strip the units out of the column names, is there somewhere else that the information could be stored?
There isn't any great way to do this right now; see the GitHub issue here for some discussion.
As a quick hack, you could do something like this, maintaining a separate dict with the units.
In [3]: units = {}

In [5]: newcols = []
   ...: for col in df:
   ...:     name, unit = col.split(' ')
   ...:     units[name] = unit
   ...:     newcols.append(name)

In [6]: df.columns = newcols

In [7]: df
Out[7]:
   length  width  thickness
0     1.2    3.4        5.6
1     7.8    9.0        1.2
2     3.4    5.6        7.8

In [8]: units['length']
Out[8]: '(m)'
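If the stripped units ever need to go back into the headers, e.g. when writing the data out, a small follow-up sketch using the same units dict could be:
# Hedged sketch: re-attach the stored unit strings to the column names on export.
# "measurements_with_units.csv" is a hypothetical output path.
df_out = df.rename(columns=lambda c: f"{c} {units[c]}" if c in units else c)
df_out.to_csv("measurements_with_units.csv", index=False)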
I was searching for this too. Here is what pint and the (experimental) pint_pandas are capable of today:
import pandas as pd
import pint
import pint_pandas
ureg = pint.UnitRegistry()
ureg.Unit.default_format = "~P"
pint_pandas.PintType.ureg.default_format = "~P"
df = pd.DataFrame({
    "length": pd.Series([1.2, 7.8, 3.4], dtype="pint[m]"),
    "width": pd.Series([3.4, 9.0, 5.6], dtype="pint[m]"),
    "thickness": pd.Series([5.6, 1.2, 7.8], dtype="pint[cm]"),
})
print(df.pint.dequantify())
length width thickness
unit m m cm
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
df['width'] = df['width'].pint.to("inch")
print(df.pint.dequantify())
length width thickness
unit m in cm
0 1.2 133.858268 5.6
1 7.8 354.330709 1.2
2 3.4 220.472441 7.8
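A small follow-up, not in the original answer but built on the calls shown above: arithmetic on pint-backed columns propagates the units, and dequantify() then reports the derived unit.
# width was converted to inches above, so the product carries metre * inch.
area = (df["length"] * df["width"]).to_frame("area")
print(area.pint.dequantify())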
Here are some other options:
pandas-units-extension: janpipek/pandas-units-extension: Units extension array for pandas based on astropy
pint-pandas: hgrecco/pint-pandas: Pandas support for pint
You can also extend pandas yourself, following the "Extending pandas" page of the pandas documentation (pandas 1.3.0).

How to group data by time

I am trying to find a way to group data daily.
This is an example of my data set.
Dates Price1 Price 2
2002-10-15 11:17:03pm 0.6 5.0
2002-10-15 11:20:04pm 1.4 2.4
2002-10-15 11:22:12pm 4.1 9.1
2002-10-16 12:21:03pm 1.6 1.4
2002-10-16 12:22:03pm 7.7 3.7
Yeah, I would definitely use Pandas for this. The trickiest part is figuring out the datetime parser for pandas to use when loading the data. After that, it's just a resampling of the resulting DataFrame.
In [61]: import datetime, pandas
In [62]: parse = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %I:%M:%S%p')
In [63]: dframe = pandas.read_table("data.txt", delimiter=",", index_col=0, parse_dates=True, date_parser=parse)
In [64]: print(dframe)
Price1 Price 2
Dates
2002-10-15 23:17:03 0.6 5.0
2002-10-15 23:20:04 1.4 2.4
2002-10-15 23:22:12 4.1 9.1
2002-10-16 12:21:03 1.6 1.4
2002-10-16 12:22:03 7.7 3.7
In [78]: means = dframe.resample("D", label='left').mean()
In [79]: print(means)
Price1 Price 2
Dates
2002-10-15 2.033333 5.50
2002-10-16 4.650000 2.55
where data.txt:
Dates , Price1 , Price 2
2002-10-15 11:17:03pm, 0.6 , 5.0
2002-10-15 11:20:04pm, 1.4 , 2.4
2002-10-15 11:22:12pm, 4.1 , 9.1
2002-10-16 12:21:03pm, 1.6 , 1.4
2002-10-16 12:22:03pm, 7.7 , 3.7
From pandas documentation: http://pandas.pydata.org/pandas-docs/stable/pandas.pdf
# 72 hours starting with midnight Jan 1st, 2011
In [1073]: rng = date_range('1/1/2011', periods=72, freq='H')
Alternatively, group on the calendar date:
data.groupby(data['Dates'].map(lambda x: x.date()))
(Grouping on x.day alone would merge the same day-of-month across different months.)
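For completeness, here is a minimal modern-pandas sketch of the same daily grouping (assuming the timestamps are already parsed to datetime64, as in the read_table call above):
import pandas as pd

# With the timestamps in the index (as in dframe above), resample directly:
daily_means = dframe.resample("D").mean()

# With the timestamps in an ordinary 'Dates' column, pd.Grouper does the same:
daily_means = data.groupby(pd.Grouper(key="Dates", freq="D")).mean()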
