I have this dataframe. I would like to find a way to make a correlation matrix between an hour and the same hour of the previous day (for example, H01 on 28/09 vs H01 on 27/09).
I thought about two different approaches:
1) Compute the correlation matrix of the transposed dataframe:
dft=df.transpose()
dft.corr()
2) Create a copy of the dataframe lagged by 1 day (one row) and then use .corrwith() to compare them.
With the first approach I obtain weird results (for example, rows 634 and 635 come out weakly correlated even though their values are very similar); with the second approach I obtain all ones. Ideally, what I am looking for is the correlation between days that are close to each other. Send help please
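A minimal sketch of what approach 2 could look like, assuming the frame has one row per day and the hours H01..H24 as columns (these names are illustrative, not taken from the actual data):

lagged = df.shift(1)                      # the same frame lagged by one day (one row)

# one correlation per hour column: H01 of each day vs H01 of the previous day
per_hour = df.corrwith(lagged)

# one correlation per day: a day's 24 hourly values vs the previous day's values
per_day = df.corrwith(lagged, axis=1)

The second variant (axis=1) is the one that compares days that are close to each other; the first gives a lag-1 autocorrelation per hour.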
In the picture I plot the values from an array of shape (400, 8).
I wish to reorganize the points so that I get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I am trying to recover them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the highest indices of the array.
For instance, at time t=136 I have only 4 valid values: array[t, i] > 0 for i <= 3 and array[t, i] = 0 for i > 3.
How can I cluster the points so that I get "continuous" time series, i.e. at time t=136, array[136, 0] should go into d, array[136, 1] into e, array[136, 2] into f, and array[136, 3] into g?
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually. Then begin in a place where several adjacent columns have full spans of real data, and work away from that location, first to the left and then to the right, one column at a time: if the column contains no zeros, it is OK. If it does contain zeros, compute local row averages of the immediately adjacent columns using only the non-zero values (how many columns to use depends on the density of missing data and on the separation between the signals), then put each valid value in the current column into the row with the closest local row average, and put zeros in the remaining rows. How to code that depends on what you have done so far. If you are using numpy, it would be convenient to first convert the zeros to NaNs, because numpy.nanmean() ignores NaNs.
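Sketching that idea in numpy (untested against the linked data, written for the (400, 8) orientation from the question, i.e. rows are time steps and columns are the 8 series; the window size is just a guess):

import numpy as np

def reassign(arr, window=10):
    # treat zeros as missing and sort each time step; NaNs sort to the end of the row
    a = np.where(arr == 0, np.nan, arr.astype(float))
    a = np.sort(a, axis=1)
    n_t = a.shape[0]

    out = np.full_like(a, np.nan)
    full = ~np.isnan(a).any(axis=1)        # time steps where all 8 values are present
    out[full] = a[full]                    # sorted order is already correct there

    # for incomplete time steps, place each value in the series whose local
    # (time-windowed) mean is closest; assumes some fully observed time steps
    # fall inside each window
    for t in np.where(~full)[0]:
        lo, hi = max(0, t - window), min(n_t, t + window + 1)
        ref = np.nanmean(out[lo:hi], axis=0)
        for v in a[t, ~np.isnan(a[t])]:
            j = np.nanargmin(np.abs(ref - v))
            out[t, j] = v
            ref[j] = np.nan                # at most one value per series per time step
    return out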
I have downloaded ten open datasets of air pollution for 2010-2019 (loaded into Pandas DataFrames with read_csv) that have some missing values.
The rows are ordered by day, and each day includes several items (like PM2.5, SO2, ...). Most days include 17 or 18 items. There are 27 columns: Year, Station, Item, 00, 01, ..., 23.
In this case, I already used
df.fillna(np.nan).apply(lambda x: pd.to_numeric(x, errors='coerce'))
and df.interpolate(axis=1, inplace=True)
But if the data have missing values from '00' onwards, the interpolate function does not work. If I want to fill all these blanks, I need to bring in the last non-null data from the previous day and use interpolate again.
However, different days have different numbers of items, which means there are still some rows that can't be filled.
In a nutshell, I'm now trying to concatenate all the data keyed by item and then use interpolate.
By the way, after data cleaning, I would like to apply xgboost and linear regression to predict PM2.5. Is there any recommended way to deal with the data?
(Or any demo code online?)
For example, the data would be like:
[screenshot: one of the datasets]
I used df.groupby('date').size() and got
[screenshot: size of different days]
Or, in other words, how do I split the different days and concat them back together?
groupby(['date', 'items'])? And then how do I merge?
Or is it possible to interpolate from the last value of the previous row?
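Not knowing the exact layout, here is a rough sketch of the "concatenate by item, then interpolate" idea, assuming columns named date, Item and '00'..'23' (adjust the names to the real ones):

import pandas as pd

hours = [f'{h:02d}' for h in range(24)]

# long format: one row per (date, Item, hour), in time order within each item
long = df.melt(id_vars=['date', 'Item'], value_vars=hours,
               var_name='hour', value_name='value')
long['value'] = pd.to_numeric(long['value'], errors='coerce')
long = long.sort_values(['Item', 'date', 'hour'])

# interpolate across day boundaries, separately for each item
long['value'] = long.groupby('Item')['value'].transform(
    lambda s: s.interpolate(limit_direction='both'))

# back to the original wide layout
wide = long.pivot_table(index=['date', 'Item'], columns='hour', values='value')

This way a value missing at '00' is filled from the last hour of the previous day for the same item.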
I am trying to align my data so that, when I use another comparison method, the two data sets are positioned where they are most similar. So far I have cross-correlated the two Pandas Series and found the lag position with the highest correlation. How can I then shift my data so that, when the Series are cross-correlated again, the new highest-correlation lag position is 0?
I have 4 fairly large Pandas Series. One of these Series is a Query to be compared to the other 3 Series and itself.
To find the offset with the highest correlation for each query-target pair, I have used np.correlate() and calculated the lag position of the peak. Having found this lag position, I have tried to incorporate the lag into each of the series so that, once a cross-correlation is recalculated, the new lag for highest correlation is 0. Unfortunately, this has not been very successful.
I feel there are a few ways I could be going wrong in my methodology here; I'm very new to coding, so any pointers will be very much appreciated.
What I Have So Far
Producing a DataFrame containing the original lag positions for highest correlation in each comparison.
lags_o = pd.DataFrame({"a": [np.correlate(s4, s1, mode='full').argmax() - np.correlate(s4, s1, mode='full').size/2],
                       "b": [np.correlate(s4, s2, mode='full').argmax() - np.correlate(s4, s2, mode='full').size/2],
                       "c": [np.correlate(s4, s3, mode='full').argmax() - np.correlate(s4, s3, mode='full').size/2],
                       "d": [np.correlate(s4, s4, mode='full').argmax() - np.correlate(s4, s4, mode='full').size/2]})
When this is run I get the expected value of 0 for the "d" column, indicating that the two series are optimally aligned (which makes sense). The other columns return non-zero values, so now I want to incorporate these required shifts into the new cross-correlation.
# shifting the series by the lag given in lags_o for that comparison
s1_lagged = s1.shift(lags_o["a"].item())
# selecting all non-NaN values in the series for the next correlation
s1_lagged = s1_lagged[~np.isnan(s1_lagged)]
# this is repeated for the other routes - selecting the appropriate df column
What I expected, when the query route and the newly shifted target series were passed to the cross-correlation again, was that each lag position in lags_n would be 0. However, this is not what I am getting at all. Even more confusing is that the new lag position does not seem to relate to the old one (the lag does not shift along with the shift applied to the series). I have tried shifting both the query and the target series in turn but have not managed to get the required value.
So my question is: how should I correctly manipulate these Series so that I can align the data sets? Happy New Year, and thank you for your time and any suggestions you may have.
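Not a full answer, but a sketch of one way the lag can be computed and undone for a single query-target pair, assuming s1 and s4 are equal-length Series that share the same index (as in the code above):

import numpy as np

a, b = s4.to_numpy(), s1.to_numpy()
xcorr = np.correlate(a, b, mode='full')

# with mode='full', the zero-lag term sits at index len(b) - 1, not size/2
lag = int(xcorr.argmax() - (len(b) - 1))   # lag > 0 means s4 lags s1 by `lag` samples

# delay (or advance) the target by the lag, then drop the NaNs that shift introduces
s1_aligned = s1.shift(lag).dropna()
s4_trimmed = s4.loc[s1_aligned.index]      # keep only the overlapping part of the query

# re-check: the peak of the new cross-correlation should now be at (or very near) lag 0
new = np.correlate(s4_trimmed.to_numpy(), s1_aligned.to_numpy(), mode='full')
new_lag = new.argmax() - (len(s1_aligned) - 1)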
I'm working on product recommendations.
My dataset is as follows (a sample; the full one has more than 110,000 rows and more than 80,000 unique product_ids):
   user_id                  product_id
0  0E3D17EA-BEEF-493        12909837
1  0FD6955D-484C-4FC8-8C3F  12732936
2  CC2877D0-A15C-4C0A       Gklb38
3  b5ad805c-f295-4852       12909841
4  0E3D17EA-BEEF-493        12645715
I want to calculate the cosine similarity between products based on purchased products per user.
Why? I need to have as a final result:
the list of the 5 most similar products for each product_id.
So I thought the first thing I need to do is convert the dataframe into a format where I have one row per user_id and one column per product_id. If a user bought product_id X, then the corresponding (row, column) cell contains 1, otherwise 0.
I did that using crosstab function of pandas dataframe.
crosstab_df = pd.crosstab(df.user_id, df.product_id).astype('bool').astype('int')
After that, I calculated the similarities between products.
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    # create a scipy sparse matrix
    data_sparse = sparse.csr_matrix(data_items)
    # pairwise similarities between all samples in data_sparse.transpose()
    similarities = cosine_similarity(data_sparse.transpose())
    # put the similarities between products in a dataframe
    sim = pd.DataFrame(data=similarities, index=data_items.columns, columns=data_items.columns)
    return sim
similarity_matrix = calculate_similarity(crosstab_df)
I know that this is not efficient, because crosstab doesn't perform well when there are many rows and many columns, which is the case I have to handle. So I thought that, instead of using a crosstab DataFrame, I should use a scipy sparse matrix, as it makes the calculations (similarity, vector normalisation) faster because the input is a numpy array rather than a dataframe.
However, I don't know how to do it. I also need to keep track of which product_id each column corresponds to, so that I can then get the most similar product_ids for each product_id.
I found in answers to other questions that
scipy.sparse.csr_matrix(df.values)
can be used, but in my case I think I can only use it after applying crosstab, whereas I want to get rid of the crosstab step.
Also, people suggested using scipy's coo_matrix, but I didn't understand how to apply it in my case to get the results I want.
I'm looking for a memory-efficient solution, as the initial dataset can grow to many thousands of rows and hundreds of thousands of product_ids.
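For what it's worth, a sketch of building the user-product matrix as a scipy sparse matrix directly from the two columns, skipping crosstab entirely (df, user_id and product_id are the names from the sample above; the final lookup is just an illustration):

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# map ids to integer row/column indices without building a dense table
user_idx, user_ids = pd.factorize(df.user_id)
prod_idx, prod_ids = pd.factorize(df.product_id)

# sparse user x product purchase matrix; duplicates are summed, then clipped to 1
ones = np.ones(len(df), dtype=np.float32)
purchases = sparse.coo_matrix((ones, (user_idx, prod_idx)),
                              shape=(len(user_ids), len(prod_ids))).tocsr()
purchases.data[:] = 1

# column-wise (product-product) cosine similarity, kept sparse
sim = cosine_similarity(purchases.T, dense_output=False)

# the 5 most similar products to one product (column 0 here), excluding itself
col = 0
scores = sim.getrow(col).toarray().ravel()
scores[col] = 0
top5 = prod_ids[np.argsort(scores)[::-1][:5]]

prod_ids keeps the mapping from column position back to product_id, which covers the bookkeeping part of the question.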
I will be shocked if there isn't some standard library function for this, especially in numpy or scipy, but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is timeseries of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond but didn't think that was necessary in the example). The 2nd column would be one of the prices - either the buy or the sell price.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector which fit between one timestamp (t0) and the next (t1), then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there is a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use a matrix of Nx2 (column 1: times, column 2: price), but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handle multiple prices and such in different columns would be great.
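Not a single library call, but the loop described above can be vectorized with np.searchsorted, which also handles any number of value columns at once (data, start_time, end_time and num_points are placeholder names):

import numpy as np

t = data[:, 0]                       # original, irregular timestamps
values = data[:, 1:]                 # one or more price columns

new_t = np.linspace(start_time, end_time, num_points)

# index of the most recent original sample at or before each new time
idx = np.searchsorted(t, new_t, side='right') - 1
idx = np.clip(idx, 0, len(t) - 1)    # guard against new times before the first sample

resampled = np.column_stack([new_t, values[idx]])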
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA
For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It can do a zero-order hold interpolation if you specify kind="zero", and it can also simultaneously interpolate multiple columns of a matrix; you just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will return a function that you can evaluate at any point in the interpolation range. You can then get your resampled data by calling f(np.linspace(start_time, end_time, time_step)).
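For what it's worth, a small self-contained sketch of that suggestion using the example data from the question:

import numpy as np
from scipy.interpolate import interp1d

# irregular samples: first column is time, the remaining columns are prices
raw = np.array([[1.0,      0.0003234],
                [1.01,     0.0003233],
                [10.0004,  0.00033],
                [124.23,   0.0003334]])

t, prices = raw[:, 0], raw[:, 1:]

# zero-order hold: each new time takes the most recently observed value
f = interp1d(t, prices, kind='zero', axis=0)

# resample on a regular grid inside the observed range
new_t = np.arange(np.ceil(t[0]), t[-1], 1.0)
resampled = np.column_stack([new_t, f(new_t)])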