In the picture I plot the values from an array of shape (400, 8).
I wish to reorganize the points so that I get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I am trying to obtain them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the highest indices of the array.
For instance, at time t=136 I have only 4 valid values: array[t,i] > 0 for i <= 3 and array[t,i] = 0 for i > 3.
How can I cluster the points so that I get "continuous" time series, i.e. at time t=136, array[136,0] should go into d, array[136,1] into e, array[136,2] into f, and array[136,3] into g?
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually. Then, beginning in a place where several adjacent columns have full spans of real data, work away from that location, first to the left and then to the right, one column at a time. If the column contains no zeros, it is OK. If it contains zeros, compute local row averages of the immediately adjacent columns using only the non-zero values (how many columns to use depends on the density of missing data and the separation between the signals), then put each valid value in the current column into the row with the closest local row average, and put zeros in the remaining rows. How to code that depends on what you have done so far. If you are using numpy, it is convenient to first convert the zeros to NaNs, because numpy.nanmean() ignores NaNs.
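Here is a rough numpy sketch of that idea on synthetic data. The synthetic array, the simple left-to-right fill order, and the window size are simplifications and assumptions, not part of the suggestion above; in particular the window should be wide enough that it usually contains some already-assigned time steps.

```python
import numpy as np

# Synthetic stand-in for the real (400, 8) array: 8 non-crossing series per
# time step, with ~15% of the values knocked out and replaced by 0.
rng = np.random.default_rng(0)
truth = np.sort(rng.random((400, 8)), axis=1) + np.arange(8)
raw = np.where(rng.random((400, 8)) < 0.15, 0.0, truth)

data = np.sort(raw, axis=1)          # sort each time step individually
data[data == 0] = np.nan             # zeros sort to the front; mark them missing

# Complete time steps are already correct after sorting; copy them over.
out = np.where(np.isnan(data).any(axis=1, keepdims=True), np.nan, data)

window = 10   # half-width of the neighbourhood used for the local row averages

# Fill each incomplete time step using local per-series averages of the
# already-assigned neighbours (np.nanmean ignores the NaNs).
for t in np.where(np.isnan(data).any(axis=1))[0]:
    lo, hi = max(0, t - window), min(len(data), t + window + 1)
    local_avg = np.nanmean(out[lo:hi], axis=0)          # one average per series
    for v in data[t, ~np.isnan(data[t])]:               # valid values, ascending
        free = np.isnan(out[t])                         # series still unassigned at t
        dist = np.where(free, np.abs(local_avg - v), np.inf)
        out[t, np.argmin(dist)] = v
```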
I have a big dataset, 10,000 or so rows, as a pandas DataFrame with the columns ['Date', 'TAMETR'].
The float values under 'TAMETR' increase and decrease over time.
I wish to loop through the 'TAMETR' column and check if there are consecutive instances where the values are greater than, let's say, 1. Ultimately I'd like to get the average duration of these instances and their distribution.
I've played around a little with what is written here: How to count consecutive ordered values on pandas data frame
I doubt I fully understand the code, but I can't make it work. I don't understand how to adapt it to greater than or lower than (> / <).
The preferred output would be a dataframe, or array, with all the instances (greater than 1).
I can calculate the average and plot the distribution.
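Not a full answer, but one possible way to collect the runs with pandas, sketched on made-up data (the column names follow the question; everything else is assumed):

```python
import pandas as pd

# Made-up frame with the columns from the question.
df = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=12, freq="D"),
    "TAMETR": [0.5, 1.2, 1.4, 0.8, 1.1, 1.3, 1.6, 0.9, 0.7, 1.05, 0.95, 1.2],
})

above = df["TAMETR"] > 1                      # flip to < 1 for "lower than"
run_id = (above != above.shift()).cumsum()    # new id every time the condition flips

# One row per run where TAMETR > 1, with its start date, end date and length.
runs = (
    df[above]
    .groupby(run_id[above])
    .agg(start=("Date", "first"), end=("Date", "last"), length=("TAMETR", "size"))
    .reset_index(drop=True)
)

print(runs)
print("average run length:", runs["length"].mean())   # distribution: runs["length"].hist()
```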
I'm using the diamonds dataset; below are the columns.
Question: create bins having equal population. Also generate a report that contains a cross-tab between the bins and cut, representing the number in each cell as a percentage of the total.
That is the task. Being a beginner, I created the Volume column and tried to create bins with equal population using qcut, but I'm not able to proceed further. Could someone help me out with an approach to solve this?
pd.qcut(diamond['Volume'], q=4)
You are on the right path: pd.qcut() attempts to break the data you provide into q equal-sized bins (though it may have to adjust a little, depending on the shape of your data).
pd.qcut() also lets you specify labels=False as an argument, which will give you back the number of the bin into which each observation falls. This is a little confusing, so here's a quick explanation: you could pass labels=['A','B','C','D'] (given your request for 4 bins), which would return the label of the bin into which each row falls. By telling pd.qcut that you don't have labels to give the bins, the function returns a bin number, just without a specific label. Otherwise, the function gives back the interval (value range) into which each observation (row) falls.
The reason you want the bin number is because of your next request: a cross-tab for the bin-indicator column and cut. First, create a column with the bin numbering:
diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)
Next, use pd.crosstab() to get your table:
pd.crosstab(diamond['binned_volume'], diamond['cut'], normalize=True)
The normalize=True argument divides each entry by the grand total, so every cell is a fraction of the total (multiply by 100 for percentages), which covers the last part of your question, I believe.
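Putting the pieces together, a rough end-to-end sketch might look like the following. Loading the data through seaborn and defining Volume as x*y*z are assumptions; substitute your own `diamond` DataFrame and Volume column.

```python
import pandas as pd
import seaborn as sns   # used here only to fetch a copy of the diamonds dataset

diamond = sns.load_dataset("diamonds")
diamond["Volume"] = diamond["x"] * diamond["y"] * diamond["z"]   # assumed definition

# Four equal-population bins, identified by bin number
diamond["binned_volume"] = pd.qcut(diamond["Volume"], q=4, labels=False)

# Cross-tab of bins vs. cut, each cell as a percentage of the grand total
report = pd.crosstab(diamond["binned_volume"], diamond["cut"], normalize=True) * 100
print(report.round(2))
```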
I am trying to align my data so that when I use another comparison method the two data sets are aligned so that they are most similar. So far I have cross-correlated the two Pandas Series and found the lag position for highest correlation. How can I then shift my data so that, when the Series are cross-correlated again, the new highest-correlation lag position is 0?
I have 4 fairly large Pandas Series. One of these Series is a Query to be compared to the other 3 Series and itself.
To find the offset for highest correlation between a query-target series pair, I have used np.correlate() and calculated the lag position of the highest correlation. Having found this lag position, I have tried to incorporate this lag into each of the series so that, once a cross-correlation is recalculated, the new lag for highest correlation is 0. Unfortunately, this has not been very successful.
I feel there are a few ways I could be going wrong in my methodology here. I'm very new to coding, so any pointers will be very much appreciated.
What I Have So Far
Producing a DataFrame containing the original lag positions for highest correlation in each comparison.
lags_o = pd.DataFrame({"a": [np.correlate(s4, s1, mode='full').argmax() - np.correlate(s4, s1, mode='full').size // 2],
                       "b": [np.correlate(s4, s2, mode='full').argmax() - np.correlate(s4, s2, mode='full').size // 2],
                       "c": [np.correlate(s4, s3, mode='full').argmax() - np.correlate(s4, s3, mode='full').size // 2],
                       "d": [np.correlate(s4, s4, mode='full').argmax() - np.correlate(s4, s4, mode='full').size // 2]})
When this is run I get the expected value of 0 for the "d" column, indicating that the two series are optimally aligned (which makes sense). The other columns return non-zero values, so now I want to incorporate these required shifts into the new cross-correlation.
# shifting the series by the lag given in lags_o for that comparison
s1_lagged = s1.shift(lags_o["a"].item())
# selecting all non-NaN values in the series for the next correlation
s1_lagged = s1_lagged[~np.isnan(s1_lagged)]
# this is repeated for the other routes - selecting the appropriate df column
What I expected to get back, when the query route and the new shifted target series were then passed to the cross-correlation, was that each lag position in lags_n would be 0. However, this is not what I am getting at all. Even more confusing, the new lag position does not seem to relate to the old lag position (the lag does not shift along with the shift applied to the series). I have tried shifting both the query and target series in turn but have not managed to get the required value.
So my question is: how should I correctly manipulate these Series so that I can align these data sets? Happy New Year, and thank you for your time and any suggestions you may have.
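Not a definitive answer, but a sketch of two things that may be going wrong, on made-up data. For a positive lag, shift() pads the front with NaNs, and dropping them undoes the shift as far as np.correlate is concerned, since it ignores the index and only sees the values. And once the two series have different lengths, the zero-lag position of the full correlation is len(target) - 1, not size // 2. The sketch below avoids both issues by trimming the two series to their overlapping region (sign conventions follow np.correlate):

```python
import numpy as np
import pandas as pd

# Two made-up series where the target leads the query by 30 samples.
rng = np.random.default_rng(0)
x = np.arange(500)
s4 = pd.Series(np.sin(x / 25.0) + 0.05 * rng.standard_normal(500))          # query
s1 = pd.Series(np.sin((x + 30) / 25.0) + 0.05 * rng.standard_normal(500))   # target

c = np.correlate(s4, s1, mode="full")
lag = int(c.argmax() - (len(s1) - 1))        # zero lag sits at index len(s1) - 1

# Keep only the region where the two series overlap once the lag is applied.
if lag >= 0:
    s4_al, s1_al = s4.iloc[lag:], s1.iloc[:len(s1) - lag]
else:
    s4_al, s1_al = s4.iloc[:len(s4) + lag], s1.iloc[-lag:]

c_new = np.correlate(s4_al, s1_al, mode="full")
new_lag = c_new.argmax() - (len(s1_al) - 1)
print(lag, new_lag)   # expect roughly 30 and 0
```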
I am playing with some geo data. Given a point, I am trying to map to an object. So for each connection, I generate two distances, both floats. To find the closest, I want to sort my dataframe by both distances and pick the top row.
Unfortunately, when I run a sort (df.sort_values(by=['direct distance', 'pt_to_candidate'])) I get the following out-of-order result.
I would expect the top two rows, but flipped. If I run the sort on either column solo, I get the expected results. If I flip the order of the sort (['pt_to_candidate', 'direct distance']) I get a correct result, though not the one I necessarily want for my function.
Both columns are type float64.
Why is this sort returning oddly?
For completeness, I should state that I have more columns and rows. From the main dataframe, I filter first and then sort. Also, I cannot recreate the problem by manually entering data into a new dataframe, so I suspect the float precision is the issue.
Edit
Adding a value_counts on 'direct distance'
4.246947 7
3.147303 2
2.875081 1
2.875081 1
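One way to test that suspicion, on hypothetical values: two numbers that display identically at the default 6-decimal precision can still differ, in which case the sort is correct and only the printout is misleading.

```python
import pandas as pd

# Hypothetical data: the two "equal" distances differ past the printed decimals.
df = pd.DataFrame({
    "direct distance": [2.8750811234, 2.8750809876, 3.147303],
    "pt_to_candidate": [5.0, 1.0, 2.0],
})

# Looks out of order at the default display precision...
print(df.sort_values(by=["direct distance", "pt_to_candidate"]))

# ...but showing the full stored precision reveals the values are not tied.
with pd.option_context("display.float_format", "{:.15f}".format):
    print(df.sort_values(by=["direct distance", "pt_to_candidate"]))
```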
I have a pandas dataframe column as shown in the figure below. Only two values, Increase and Decrease, occur randomly in the column. Is there a way to process that data?
For this particular problem, I want to get the first occurrence of 2 CONSECUTIVE Increase values AFTER at least one occurrence of 2 CONSECUTIVE Decrease values (maybe more; 2 is the minimum).
As an example, if the series is (I for "Increase", D for "Decrease"): "I,I,I,I,D,I,I,D,I,D,I,D,D,D,D,I,D,I,D,D,I,I,I,I", it should return the index of row 21 (the third-last I in the series). Assume that the example series I just showed is in a pandas column, i.e. the series is vertical rather than horizontal, and that indexing starts at 0, so the first I is row 0.
For this particular example, it should return 2009q4, which is the index of that particular row.
If somebody can show me a way to do common tasks like counting the number of consecutive occurrences of a given value, detecting a value change, getting the value at a particular position after a value change, etc. for this type of data (which may not be required for this problem, but could be useful for future problems), I shall be really grateful.
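A sketch on a made-up Series built from the example string (in the real data the index would be the dates, so the same code would return 2009q4 instead of 21):

```python
import pandas as pd

# The example series from the question, one letter per row, index 0..23.
s = pd.Series(list("IIIIDIIDIDIDDDDIDIDDIIII"))

change = s != s.shift()        # True wherever the value changes
run_id = change.cumsum()       # label each run of identical values

# "Count consecutive occurrences": length of the run each row belongs to.
run_len = s.groupby(run_id).transform("size")

# One row per run: its value and its length.
runs = pd.DataFrame({"value": s.groupby(run_id).first(),
                     "length": s.groupby(run_id).size()})

# First run of >= 2 consecutive I that comes after a run of >= 2 consecutive D.
d_runs = runs[(runs["value"] == "D") & (runs["length"] >= 2)]
i_runs = runs[(runs["value"] == "I") & (runs["length"] >= 2)]
target_run = i_runs[i_runs.index > d_runs.index.min()].index.min()

# Row label of the second element of that run, as described in the question.
print(s[run_id == target_run].index[1])   # -> 21 for the example string
```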