so I need to analyse the peak number & width of a signal (in my case a Calcium signal from epidermis cells) that I have stored in an Excel sheet. Each column holds all the values for one cell (600 values).
To analyse the peaks, which I will be doing with the scipy.signal.find_peaks() and scipy.signal.peak_widths() functions, I put each individual column into a 1D numpy array containing all the values from that column.
I did this by saving the individual columns (columns are named A, B, C, D, etc. in the Excel sheet) into their own dataframes (df_A, df_B) and then putting them in arrays:
import numpy as np
import pandas as pd
df = pd.read_excel('test.xlsx')
df_A = df.loc[:,'A']
df_B = df.loc[:,'B']
arrA = np.array(df_A)
arrB = np.array(df_B)
To calculate the peak number & width I used the following lines:
from scipy.signal import find_peaks, peak_widths
peaks_A, _ = find_peaks(arrA, height=7000, prominence=1)
results_peakwidth_A = peak_widths(arrA, peaks_A, rel_height=0.5)
Now, since I have not just one but > 100 cells/signals to analyse, is there a simple way to do this for all the cells/arrays? This exceeds my capabilities, so I would gladly welcome any help.
The proposal would be as follows. In essence, you first select the required columns (however many there are). Then you create a function that takes in a column (no need to turn it into an array, unless scipy disagrees; in that case add column = column.values at the top of the process function).
Afterwards use apply, which will loop through each column in the dataframe and pass it into the function you defined.
import pandas as pd
from scipy.signal import find_peaks, peak_widths
df = pd.read_excel('test.xlsx')
df = ... # select all columns from A-Z into a single dataframe with the columns required.
# The shape here would be
# A B C
# 1 4 4.1
# 2 3 4.0
# ...
# define the function you want to apply to each column
def process(column):
    peaks, _ = find_peaks(column, height=7000, prominence=1)
    return peak_widths(column, peaks, rel_height=0.5)
new_columns = df.apply(process)
As I'm unsure of what the actual output should look like, you might want to keep both the peaks and the widths. In that case you could alter the process function slightly:
def process(column):
    peaks, _ = find_peaks(column, height=7000, prominence=1)
    width = peak_widths(column, peaks, rel_height=0.5)
    return pd.Series({"width": width, "peaks": peaks})
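A minimal sketch of how this version might be called and read out afterwards (the column label 'A' is only an example, assuming the Excel column names are kept as dataframe column names):
results = df.apply(process)
# results has the rows "width" and "peaks" and one column per cell
peaks_A = results.loc["peaks", "A"]   # indices of the peaks found in column A
widths_A = results.loc["width", "A"]  # the tuple returned by peak_widths for column A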
I am trying to change the order of variables I use to make a facet grid in xarray. For example, I have [a,b,c,d] as column names and I want to reorder them to [c,d,a,b]. Unfortunately, unlike seaborn, I could not find parameters such as col_order or row_order in the xarray plot function (https://xarray.pydata.org/en/stable/generated/xarray.plot.FacetGrid.html).
Update:
To help myself better explain what I need, I took the example below from the user guide of xarray:
In the following example, I need to change the position of the months. For example, I want to put month 7 as the first column and month 2 as the fifth, and so on and so forth.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature.nc").rename({"air": "Tair"})
# we will add a gradient field with appropriate attributes
ds["dTdx"] = ds.Tair.differentiate("lon") / 110e3 / np.cos(ds.lat * np.pi / 180)
ds["dTdy"] = ds.Tair.differentiate("lat") / 105e3
ds.dTdx.attrs = {"long_name": "$∂T/∂x$", "units": "°C/m"}
ds.dTdy.attrs = {"long_name": "$∂T/∂y$", "units": "°C/m"}
monthly_means = ds.groupby("time.month").mean()
# xarray's groupby reductions drop attributes. Let's assign them back so we get nice labels.
monthly_means.Tair.attrs = ds.Tair.attrs
fg = monthly_means.Tair.plot(
    col="month",
    col_wrap=4,  # each row has a maximum of 4 columns
)
plt.show()
Any help is highly appreciated.
xarray will respect the shape of your data, so you can rearrange the data prior to plotting:
In [2]: ds = xr.tutorial.open_dataset("air_temperature.nc")
In [3]: ds_mon = ds.groupby("time.month").mean()
In [4]: # order the data by month, descending
...: ds_mon.air.sel(month=list(range(12, 0, -1))).plot(
...: col="month", col_wrap=4,
...: )
Out[4]: <xarray.plot.facetgrid.FacetGrid at 0x16b9a7700>
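The same .sel approach works for any explicit ordering, not only a descending one. A minimal sketch for the ordering asked about above (month 7 first; the rest of the order is only an example):
custom_order = [7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6]  # put month 7 in the first facet
ds_mon.air.sel(month=custom_order).plot(col="month", col_wrap=4)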
I'm interested in the first time a random process crosses a threshold. I am storing the results from observing the process in a dataframe, and have plotted how many times several realisations of that process cross 0.9 when observed at the end of 14 rounds.
The plot was created with this code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
fin = pd.DataFrame(data=np.random.uniform(size=(100, 13))).T
pos = (fin>0.9).astype(float)
ax=fin.loc[:, pos.loc[12, :] != 1.0].plot(figsize=(12, 6), color='silver', legend=False)
fin.loc[:, pos.loc[12, :] == 1.0].plot(figsize=(12, 6), color='indianred', legend=False, ax=ax)
where fin contained the random numbers, and pos was 1 every time that process crossed 0.9.
I would now like to plot the first time the process in fin crosses 0.9 for each realisation (columns represent realisations, rows represent observation times).
I can find the first occurrence of a value above 0.9 with idxmax(), but I'm stumped about how to remove everything in the dataframe after that point in each column.
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(size=(100, 10)))
maxes = (df > 0.9).idxmax()  # index of the first value above 0.9 in each column
It's just that I'm having real difficulty thinking through this.
If I understand correctly, you can use
df = df[df.index < maxes[0]]
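If the cut should happen per column rather than at the first realisation's crossing only, one hedged sketch (reusing maxes from the question and keeping the value at the crossing itself):
truncated = df.apply(lambda col: col.where(col.index <= maxes[col.name]))  # NaN after each column's first crossing
truncated.plot(figsize=(12, 6), legend=False)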
IIUC, we can use a boolean matrix with cumprod: (df < .9) is True up to each column's first crossing, cumprod zeroes out everything from the first False onwards, and where then masks those positions as NaN, so each line stops at its first crossing:
df.where((df < .9).cumprod().astype(bool)).plot()
Output: a plot in which each realisation is drawn only up to its first crossing of 0.9.
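A minimal sketch of how that mask behaves on a single made-up column (values chosen for illustration only):
import pandas as pd
s = pd.Series([0.2, 0.5, 0.95, 0.3, 0.99])
mask = (s < 0.9).cumprod().astype(bool)  # True, True, False, False, False
print(s.where(mask))                     # 0.2, 0.5, NaN, NaN, NaN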
I am using the Housing train.csv data from Kaggle to run a prediction.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
I am trying to generate a correlation matrix and keep only the features whose correlation with SalePrice is between 0.5 and 0.9. I tried to use this function to filter some of it, but it only removes features whose pairwise correlation is above 0.9.
How would I update this function to only keep those specific features that I need to generate a correlation heat map?
data = train
corr = data.corr()
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
data = data[selected_columns]
import pandas as pd
data = pd.read_csv('train.csv')
col = data.columns
c = [i for i in col if data[i].dtypes == 'int64' or data[i].dtypes == 'float64']  # keep only numeric columns
data = data[c]  # drop the object-dtype columns before computing correlations
main_col = ['SalePrice'] # column with which we have to compare correlation
corr_saleprice = data.corr().filter(main_col).drop(main_col)
c1 = (corr_saleprice['SalePrice'] >= 0.5) & (corr_saleprice['SalePrice'] <= 0.9)
c2 = (corr_saleprice['SalePrice'] >= -0.9) & (corr_saleprice['SalePrice'] <= -0.5)
req_index = list(corr_saleprice[c1 | c2].index)  # selecting the columns that meet the criteria
# req_index.append('SalePrice')  # uncomment this line if you want the SalePrice column in your final dataframe too
data = data[req_index]
data
Also, using for loops is not very efficient; a direct vectorised implementation is preferable. I hope this is what you want!
For generating the heatmap, you can use the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
a =data.corr()
mask = np.triu(np.ones_like(a, dtype=bool))
plt.figure(figsize=(10,10))
_ = sns.heatmap(a,cmap=sns.diverging_palette(250, 20, n=250),square=True,mask=mask,annot=True,center=0.5)
I have a dataframe of the following format.
df
A B Target
5 4 3
1 3 4
I am finding the correlation of each column (except Target) with the Target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
But the issue is that the size of my actual dataframe is (216, 72391), which takes at least 30 minutes to process on my system. Is there any way to parallelize it using a GPU? I need to compute values of this kind multiple times, so I can't wait for the normal processing time of 30 minutes each time.
Here, I have tried to implement your operation using numba:
import numpy as np
import pandas as pd
from numba import jit, int64, float64
#
#------------You can ignore the code starting from here---------
#
# Create a random DF with cols_size = 72391 and row_size =300
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
target_array = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
# ----------Ignore code till here. This is just to generate dummy data-------
# Assume df is your original DataFrame
target_array = df['Target'].values
# You can choose to restore this column later
# But for now we will remove it, since we will
# call df.values and find the correlation of each
# column with the target
df.drop(['Target'], inplace=True, axis=1)
# This function takes a numpy 2D array and a target array as input
# The numpy 2D array has the data of all the columns
# We find the correlation of each column with the target array
# numba's jit requires that the shapes be compatible
# Hence the 2D array is passed transposed, i.e. its shape is (72391, 300)
# while the target array's shape is (300,)
def do_stuff(df_values, target_arr):
    # Just create a random array to store the result
    # df_values.shape[0] = 72391, equal to the no. of columns in df
    result = np.random.random(df_values.shape[0])
    # Iterate over each column
    for i in range(0, df_values.shape[0]):
        # Find the correlation of a column with the target column
        # Both arguments are 1D arrays of length 300, so they are compatible
        result[i] = np.corrcoef(np.transpose(df_values[i]), target_arr.reshape(300,))[0][1]
    return result
# Decorate the function do_stuff
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)
# This contains all the correlation
result_array = do_stuff_numba(np.transpose(df.values), target_array)  # shape (72391, 300), as the function expects
Link to colab notebook.
You should take a look at dask. It should be able to do what you want and a lot more.
It parallelizes most of the DataFrame functions.
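As a rough, hedged illustration of what that could look like for the column-vs-Target correlation from the question (assuming df is the dataframe described there; npartitions is an arbitrary choice and this is not benchmarked on a 72391-column frame):
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)       # split the pandas frame into partitions
corr_matrix = ddf.corr().compute()            # Pearson correlation computed in parallel
corr_with_target = corr_matrix.iloc[:-1, -1]  # same slice as in the question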
I have a dataframe with 3 columns.
UserId | ItemId | Rating
(where Rating is the rating a user gave to an item; it's an np.float16. The two IDs are np.int32.)
How do you best compute correlations between items using python pandas?
My take is to first pivot the table (to wide format) and then apply DataFrame.corr:
df = df.pivot(index='UserId', columns='ItemId', values='Rating')
df.corr()
It works on small datasets, but not on big ones.
That first step creates a big matrix that is mostly full of missing values. It's quite RAM-intensive and I can't run it with bigger dataframes.
Isn't there a simpler way to compute the correlations directly on the long dataset, without pivoting?
(I looked into DataFrame.groupby, but that seems to only split the dataframe, which is not what I'm looking for.)
EDIT: oversimplified data and working pivot code
import pandas as pd
import numpy as np
d = {'UserId': [1,2,3, 1,2,3, 1,2,3],
     'ItemId': [1,1,1, 2,2,2, 3,3,3],
     'Rating': [1.1,4.5,7.1, 5.5,3.1,5.5, 1.1,np.nan,2.2]}
df = pd.DataFrame(data=d)
df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
print(df.info())
pivot = df.pivot(index='UserId', columns='ItemId', values='Rating')
print('')
print(pivot)
corr = pivot.corr()
print('')
print(corr)
EDIT2: Large random data generator
def randDf(size=100):
    ## MAKE RANDOM DATAFRAME, df =======================
    import numpy as np
    import pandas as pd
    import random
    import math
    dict_for_df = {}
    for i in ('UserId', 'ItemId', 'Rating'):
        dict_for_df[i] = {}
        for j in range(size):
            if i == 'Rating': val = round(random.random() * 5, 1)
            else: val = round(random.random() * math.sqrt(size / 2))
            dict_for_df[i][j] = val  # store in a dict
    # print(dict_for_df)
    df = pd.DataFrame(dict_for_df)  # after the loop convert the dict to a dataframe
    # print(df.head())
    df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
    # df = df.astype(dtype={'UserId': np.int64, 'ItemId': np.int64, 'Rating': np.float64})
    ## remove doubles -----
    df.drop_duplicates(subset=['UserId', 'ItemId'], keep='first', inplace=True)
    ## show -----
    print(df.info())
    print(df.head())
    return df
# =======================
df = randDf()
I had another go, and have something that gets exactly the same correlation numbers as your method without using pivot, but it is much slower. I can't say whether it uses less or more memory:
from scipy.stats import pearsonr
import itertools
import pandas as pd
import numpy as np
d = []
itemids = list(set(df['ItemId']))
pairsofitems = list(itertools.combinations(itemids, 2))
n_users = df.UserId.max() + 1  # size the arrays by the largest UserId so it can be used as an index
for itempair in pairsofitems:
    a = df[df['ItemId'] == itempair[0]][['Rating', 'UserId']]
    b = df[df['ItemId'] == itempair[1]][['Rating', 'UserId']]
    z = np.full(n_users, np.nan)
    z[a.UserId.values] = a.Rating.values
    w = np.full(n_users, np.nan)
    w[b.UserId.values] = b.Rating.values
    valid = ~np.logical_or(np.isnan(w), np.isnan(z))  # keep only users who rated both items
    z = np.compress(valid, z)
    w = np.compress(valid, w)
    d.append({'firstitem': itempair[0],
              'seconditem': itempair[1],
              'correlation': pearsonr(z, w)[0]})
df_out = pd.DataFrame(d, columns=['firstitem', 'seconditem', 'correlation'])
This was helpful for working out how to handle the NaNs before taking the correlation.
The slicing in the first two lines inside the for loop takes time. I think, though, that it may have potential if the bottlenecks could be fixed (see the sketch after the list below).
Yes, there is some repetition in there with the z and w variables; you could put that in a function.
Some explanation of what it does:
- find all combinations of pairs within your items
- organise an "x" and "y" set of points for UserId / Rating, where any point pair in which one of the two is missing (NaN) is dropped. I think of a scatter plot, with the correlation being how well a straight line fits through it.
- run the Pearson correlation on this x-y pair
- put the ItemIds of each pair and the correlation into a dataframe
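One hedged idea for that slicing bottleneck (a sketch only, not benchmarked): group the frame by ItemId once, so each pair lookup becomes a dictionary access and the alignment on UserId is done by pandas instead of by filling NaN arrays by hand.
# assumes df, pairsofitems and pearsonr from the code above
per_item = {item: grp.set_index('UserId')['Rating'].dropna()
            for item, grp in df.groupby('ItemId')}
d = []
for itempair in pairsofitems:
    # restrict both rating Series to the users they have in common
    a, b = per_item[itempair[0]].align(per_item[itempair[1]], join='inner')
    if len(a) > 1:  # pearsonr needs at least two points
        d.append({'firstitem': itempair[0],
                  'seconditem': itempair[1],
                  'correlation': pearsonr(a, b)[0]})
The df_out dataframe can then be built from d exactly as above.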