Pandas Series Resample - python

I have the following pandas series:
dummy_array = pd.Series(np.array(range(-10, 11)), index=(np.array(range(0, 21))/10))
This yields the following series:
0.0 -10
0.1 -9
0.2 -8
0.3 -7
0.4 -6
0.5 -5
0.6 -4
0.7 -3
0.8 -2
0.9 -1
1.0 0
1.1 1
1.2 2
1.3 3
1.4 4
1.5 5
1.6 6
1.7 7
1.8 8
1.9 9
2.0 10
If I want to resample, how can I do it? I read the docs and it suggested this:
dummy_array.resample('20S').mean()
But it's not working. Any ideas?
Thank you.
Edit:
I want my final vector to have double the frequency. So something like this:
0.0 -10
0.05 -9.5
0.1 -9
0.15 -8.5
0.2 -8
0.25 -7.5
etc.

Here is a solution using np.linspace(), .reindex() and .interpolate():
The Series dummy_array is created as described above.
# get properties of original index
start = dummy_array.index.min()
end = dummy_array.index.max()
num_gridpoints_orig = dummy_array.index.size
# calc number of grid-points in new index
num_gridpoints_new = (num_gridpoints_orig * 2) - 1
# create new index, with twice the number of grid-points (i.e., smaller step-size)
idx_new = np.linspace(start, end, num_gridpoints_new)
# re-index the series. New grid-points have a value of NaN,
# and we replace these NaNs with interpolated values
df2 = dummy_array.reindex(index=idx_new).interpolate()
print(df2.head())
0.00 -10.0
0.05 -9.5
0.10 -9.0
0.15 -8.5
0.20 -8.0
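As for why the resample call in the question fails: resample only works on a DatetimeIndex, TimedeltaIndex or PeriodIndex, and the series here has a plain float index. Below is a minimal sketch of the resample route. It assumes (this is not stated in the question) that the index values are seconds, so the index is rebuilt as a TimedeltaIndex in whole milliseconds. The names values, td_index, series_td and upsampled are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as dummy_array, but indexed by a TimedeltaIndex.
# Assumption: the original float index (0.0, 0.1, ..., 2.0) is in seconds,
# i.e. one sample every 100 ms.
values = np.arange(-10, 11)
td_index = pd.to_timedelta(np.arange(21) * 100, unit="ms")
series_td = pd.Series(values, index=td_index)

# Upsample to a 50 ms grid (double the frequency); the new grid points
# start as NaN and are then filled by linear interpolation.
upsampled = series_td.resample("50ms").asfreq().interpolate()

print(upsampled.head())
```

If the index is not time-like at all, the reindex and interpolate approach above is probably the more natural fit.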

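Build the midpoints from the original series: shift each index by 0.05 and each value by 0.5, split the pairs into indices and values, and create a new pd.Series from them. Then concatenate it with the original series and sort the result. Note that this also adds one extra point (2.05, 10.5) beyond the original range.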
# new list
ups = [[x+0.05,y+0.5] for x,y in zip(dummy_array.index, dummy_array)]
idx = [i[0] for i in ups]
val = [i[1] for i in ups]
d2 = pd.Series(val, index=idx)
d3 = pd.concat([dummy_array,d2], axis=0)
# sort by index rather than by value, so the order stays correct for non-monotonic data
d3.sort_index(inplace=True)
d3
0.00 -10.0
0.05 -9.5
0.10 -9.0
0.15 -8.5
0.20 -8.0
0.25 -7.5
0.30 -7.0
0.35 -6.5
0.40 -6.0
0.45 -5.5
0.50 -5.0
0.55 -4.5
0.60 -4.0
0.65 -3.5
0.70 -3.0
0.75 -2.5
0.80 -2.0
0.85 -1.5
0.90 -1.0
0.95 -0.5
1.00 0.0
1.05 0.5
1.10 1.0
1.15 1.5
1.20 2.0
1.25 2.5
1.30 3.0
1.35 3.5
1.40 4.0
1.45 4.5
1.50 5.0
1.55 5.5
1.60 6.0
1.65 6.5
1.70 7.0
1.75 7.5
1.80 8.0
1.85 8.5
1.90 9.0
1.95 9.5
2.00 10.0
2.05 10.5
dtype: float64

Thank you all for your contributions. After looking at the answers and thinking a bit more, I found a more generic solution that should handle every possible case. In this case, I wanted to upsample dummy_arrayA to the same index as dummy_arrayB. I create a new index that contains the points of both A and B, then use reindex and interpolate to calculate the new values, and at the end I drop the old index points so that the result has the same size as dummy_arrayB.
import pandas as pd
import numpy as np
# Create Dummy arrays
dummy_arrayA = pd.Series(np.array(range(0, 4)), index=[0,0.5,1.0,1.5])
dummy_arrayB = pd.Series(np.array(range(0, 5)), index=[0,0.4,0.8,1.2,1.6])
# Create new index based on array A
new_ind = pd.Index(dummy_arrayA.index)
# merge index A and B
new_ind=new_ind.union(dummy_arrayB.index)
# Use the reindex function. This copies all the existing values and fills the missing ones with NaN.
# Then call interpolate with method="index" so that it interpolates based on the index values (here, the time).
df2 = dummy_arrayA.reindex(index=new_ind).interpolate(method="index")
# Find the points that A and B have in common.
common_ind = dummy_arrayA.index.intersection(dummy_arrayB.index)
# Keep those common points: only the points that belong to A alone are dropped.
old_ind = dummy_arrayA.index.difference(common_ind)
# Delete the old points so that the final array matches the index of dummy_arrayB.
df2 = df2.drop(old_ind)
print(df2)
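The same idea can be condensed into a small helper. This is a sketch rather than the code above: the function name interpolate_onto is hypothetical, and it selects the target index with a final reindex instead of dropping the old points. It assumes both indexes are numeric and contain no duplicates.

```python
import numpy as np
import pandas as pd

def interpolate_onto(source, target_index):
    """Linearly interpolate `source` onto `target_index` (hypothetical helper).

    Assumes both indexes are numeric and free of duplicates.
    """
    combined = source.index.union(target_index)
    # Reindexing adds NaN at the new points; method="index" interpolates
    # using the index values as x-coordinates rather than row positions.
    filled = source.reindex(combined).interpolate(method="index")
    # Keep only the points of the target index.
    return filled.reindex(target_index)

dummy_arrayA = pd.Series(np.arange(0, 4), index=[0, 0.5, 1.0, 1.5])
dummy_arrayB = pd.Series(np.arange(0, 5), index=[0, 0.4, 0.8, 1.2, 1.6])

resampled = interpolate_onto(dummy_arrayA, dummy_arrayB.index)
print(resampled)
```

Points of the target index that lie beyond the last source point (1.6 here) are not extrapolated linearly, so treat those values with care.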

Related

Calculate the sum of a pandas column depending on the change of values in another column

I have a dataframe as follows:
df =
col_1 val_1
0 4.0 0.89
1 4.0 0.56
2 49.0 0.7
3 49.0 1.23
4 49.0 0.8
5 52.0 0.5
6 52.0 0.2
I want to calculate the sum of the column val_1 with a penalising factor which depends on the change in the values of col_1.
For example: if the value in col_1 changes, we take the val_1 value from the last row before the change and subtract a penalising factor of 0.4 from it.
sum = 0.89 + (0.56-0.4) (because the value in col_1 changes from 4.0 to 49.0) + 0.7 + 1.23 + (0.8-0.4) (because the value in col_1 changes from 49.0 to 52.0) + 0.5 + 0.2
sum = 4.08
Is there a way to do this?
Use np.where to assign a new column, and detect changes by comparing each row against the next one with .shift().
import numpy as np
df['val_1_adj'] = np.where(df['col_1'].ne(df['col_1'].shift(-1).ffill()),
                           df['val_1'].sub(0.4),
                           df['val_1'])
print(df)
col_1 val_1 val_1_adj
0 4.0 0.89 0.89
1 4.0 0.56 0.16
2 49.0 0.70 0.70
3 49.0 1.23 1.23
4 49.0 0.80 0.40
5 52.0 0.50 0.50
6 52.0 0.20 0.20
df['val_1_adj'].sum()
4.08
Slight variation on @UmarH's answer:
df['penalties'] = np.where(~df.col_1.diff(-1).isin([0, np.nan]), 0.4, 0)
my_sum = (df['val_1'] - df['penalties']).sum()
print(my_sum)
Output:
4.08
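For reference, here is a self-contained sketch of the same calculation that can be run as is. It rebuilds the example frame from the question and computes the total without adding a helper column. The names next_val, changes and total are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [4.0, 4.0, 49.0, 49.0, 49.0, 52.0, 52.0],
    "val_1": [0.89, 0.56, 0.7, 1.23, 0.8, 0.5, 0.2],
})

# A row is penalised when the next row exists and has a different col_1.
next_val = df["col_1"].shift(-1)
changes = next_val.notna() & df["col_1"].ne(next_val)

# Subtract 0.4 once for every penalised row.
total = df["val_1"].sum() - 0.4 * changes.sum()
print(round(total, 2))
```

Here the rows at positions 1 and 4 are the ones flagged, which matches the two penalties in the question.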

sum column based on level selected in column header

I have a pd.dataframe and it looks like this. Note column names represent level.
df
PC 0 1 2 3
0 PC_1 0.74 0.25 0.1 0.0
1 PC_1 0.72 0.26 0.1 0.1
2 PC_2 0.80 0.18 0.2 0.0
3 PC_3 0.79 0.19 0.1 0.1
I want to create another 4 columns next to the existing columns and shift the values based on the condition assigned.
For example: if level =1, df should look like this:
df
PC 0 1 2 3 0_1 1_1 2_1 3_1
0 PC_1 0.74 0.25 0.1 0.0 0.0 (0.74+0.25) 0.1 0.0
1 PC_1 0.72 0.26 0.1 0.1 0.0 (0.72+0.26) 0.1 0.1
2 PC_2 0.80 0.18 0.2 0.0 0.0 (0.80+0.18) 0.2 0.0
3 PC_3 0.79 0.19 0.1 0.1 0.0 (0.79+0.19) 0.1 0.1
If level=3,
df
PC 0 1 2 3 0_3 1_3 2_3 3_3
0 PC_1 0.74 0.25 0.1 0.0 0.0 0.0 0.0 sum(0.74+0.25+0.1+0.0)
1 PC_1 0.72 0.26 0.1 0.1 0.0 0.0 0.0 sum(0.72+0.26+0.1+0.1)
2 PC_2 0.80 0.18 0.2 0.0 0.0 0.0 0.0 sum(0.80+0.18+0.20+0.0)
3 PC_3 0.79 0.19 0.1 0.1 0.0 0.0 0.0 sum(0.79+0.19+0.1+0.1)
I don't know how to solve the problem and am looking for help.
Thank you in advance.
Set 'PC' to the index to make things easier. We zero everything before your column, cumsum up to the column, and keep everything as is after your column.
df = df.set_index('PC')
def add_sum(df, level):
    i = df.columns.get_loc(level)
    df_add = (pd.concat([pd.DataFrame(0, index=df.index, columns=df.columns[:i]),
                         df.cumsum(1).iloc[:, i],
                         df.iloc[:, i+1:]],
                        axis=1)
              .add_suffix(f'_{level}'))
    return pd.concat([df, df_add], axis=1)
add_sum(df, '1') # 1 if columns labels are int
0 1 2 3 0_1 1_1 2_1 3_1
PC
PC_1 0.74 0.25 0.1 0.0 0 0.99 0.1 0.0
PC_1 0.72 0.26 0.1 0.1 0 0.98 0.1 0.1
PC_2 0.80 0.18 0.2 0.0 0 0.98 0.2 0.0
PC_3 0.79 0.19 0.1 0.1 0 0.98 0.1 0.1
add_sum(df, '3')
0 1 2 3 0_3 1_3 2_3 3_3
PC
PC_1 0.74 0.25 0.1 0.0 0 0 0 1.09
PC_1 0.72 0.26 0.1 0.1 0 0 0 1.18
PC_2 0.80 0.18 0.2 0.0 0 0 0 1.18
PC_3 0.79 0.19 0.1 0.1 0 0 0 1.18
Since your title says "based on level selected in column header", I understand that:
there is no "external" level variable,
the level (how many columns to sum) follows just from the source column name.
So the task is actually to combine both of your expected results (you showed only how to compute columns 1_1 and 3_3) and compute the other new columns the same way.
The solution is surprisingly concise.
Run the following one-liner:
df = df.join(df.iloc[:, 1:].cumsum(axis=1)
.rename(lambda name: str(name) + '_1', axis=1))
Details:
df.iloc[:, 1:] - Take all rows, starting from column 1 (columns are numbered from 0).
cumsum(axis=1) - Compute the cumulative sum, horizontally.
rename(..., axis=1) - Rename the columns.
lambda name: str(name) + '_1' - Lambda function to compute the new column name.
The result so far is the set of new columns.
df = df.join(...) - Join with the original DataFrame and save the result back under df.
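To make the cumulative-sum idea concrete, here is a self-contained sketch on the example data. It assumes the column labels are strings and that 'PC' is still an ordinary column. The `_cum` suffix and the names running and result are my own choices for illustration (the one-liner above uses `_1`).

```python
import pandas as pd

df = pd.DataFrame({
    "PC": ["PC_1", "PC_1", "PC_2", "PC_3"],
    "0": [0.74, 0.72, 0.80, 0.79],
    "1": [0.25, 0.26, 0.18, 0.19],
    "2": [0.1, 0.1, 0.2, 0.1],
    "3": [0.0, 0.1, 0.0, 0.1],
})

# Row-wise running total over the numeric columns (everything except 'PC').
running = df.iloc[:, 1:].cumsum(axis=1).add_suffix("_cum")

# Attach the running totals next to the original columns.
result = df.join(running)
print(result)
```

Each `N_cum` column then holds the sum of levels 0 through N for that row, so the column for the chosen level can be picked out afterwards.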

pandas: How to merge multiple dataframes with same column names on one column, with duplicate values?

Following this question,
I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.
I have the exact same situation, but my t column may contain duplicates.
For each set of rows with a duplicated t, I would like to keep only the row whose data column is maximal.
Following the original example:
df1:
t y1
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
3 1.0 3.0
4 1.5 1.5
5 2.0 2.0
df2:
t y2
0 0.00 0.00
1 0.34 1.02
2 1.01 3.03
3 1.40 4.20
4 1.60 4.80
5 1.70 5.10
6 2.01 6.03
df_combined:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 3.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
Notice that at t=1.0, y1 is now 3.0.
How do I do this?
There are three tasks:
drop duplicates on df1
interpolate df2,
merge the two
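So here's a solution (new_idx is not defined in the snippet as posted; it is assumed here to be the sorted union of the two t columns):
import numpy as np
# assumption: interpolate over every timestamp that appears in either frame
new_idx = np.union1d(df1['t'], df2['t'])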
(df2.set_index('t')
.reindex(new_idx)
.interpolate('index')
.reset_index()
.merge(df1.sort_values('y1', ascending=False)
.drop_duplicates('t'),
on='t', how='right')
)
Output:
t y2 y1
0 0.0 0.0 0.0
1 0.5 1.5 0.5
2 1.0 3.0 3.0
3 1.5 4.5 1.5
4 2.0 6.0 2.0
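As a cross-check, the three tasks can also be sketched in a self-contained way. Note that this swaps in a different technique from the chain above: duplicates are resolved with groupby and max, and y2 is interpolated with np.interp instead of reindex and merge. It assumes df2 is sorted by t, which np.interp requires.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"t": [0.0, 0.5, 1.0, 1.0, 1.5, 2.0],
                    "y1": [0.0, 0.5, 1.0, 3.0, 1.5, 2.0]})
df2 = pd.DataFrame({"t": [0.0, 0.34, 1.01, 1.40, 1.60, 1.70, 2.01],
                    "y2": [0.0, 1.02, 3.03, 4.20, 4.80, 5.10, 6.03]})

# Task 1: for duplicated t values, keep the maximal y1.
df_combined = df1.groupby("t", as_index=False)["y1"].max()

# Tasks 2 and 3: interpolate y2 linearly at the (now unique) t values of df1
# and store it alongside y1.
df_combined["y2"] = np.interp(df_combined["t"], df2["t"], df2["y2"])

print(df_combined)
```

The result is ordered by t, since groupby sorts its keys by default.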
If you are dealing with actual timestamps, then you have to use the datetime package. It is often overlooked, but it is important here and for time series forecasting as well.

Greedy most diverse subset of pandas dataframe

This is my dataset:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
A M F
0 A 1 plus
1 A 1 minus
2 A 1 square
3 A 2 plus
4 A 2 minus
5 A 2 square
I want to get the top-n rows (a subset) of that dataframe that are maximally diverse.
To compute diversity, I used 1 - jaccard.
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
Using dataframe operations, I can take the cartesian product of the dataframe with itself via apply, compute the diversity value of each pair, and get the most diverse partner of each row with df.idxmax(axis=1). But this requires computing the diversity values of all pairs first, which is not efficient.
0 1 2 3 4 5 6 7 8 9 10
0 0.0 1.0 0.8 0.5 0.5 0.8 0.5 1.0 0.8 0.8 0.8
1 0.0 0.0 1.0 0.8 1.0 0.8 1.0 0.8 0.8 0.8 0.8
2 0.0 0.0 0.0 1.0 0.5 1.0 0.5 0.8 0.8 1.0 1.0
3 0.0 0.0 0.0 0.0 0.8 0.8 0.8 0.8 0.5 0.8 0.5
4 0.0 0.0 0.0 0.0 0.0 0.8 0.8 1.0 0.5 1.0 0.8
df.idxmax(axis=1).sample(4)
5 6
2 3
0 1
8 9
dtype: int64
I want to implement this algorithm, but somehow I do not understand lines 6 and 7.
How is the argmax computed here? And why does line 10 return Sk when Sk is never initialised inside the loop?
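The referenced algorithm is not reproduced here, so the following is only a generic greedy sketch and not necessarily that algorithm. It assumes a "max-min" rule: at each step, pick the candidate whose smallest distance to the already selected rows is largest. That is one common reading of an argmax step in greedy diversity selection. The function name greedy_diverse and the choice of row 0 as the seed are hypothetical.

```python
import itertools
import pandas as pd

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def greedy_diverse(rows, k):
    """Greedily pick k row positions from a list of sets (hypothetical helper)."""
    selected = [0]  # assumption: seed with the first row
    while len(selected) < min(k, len(rows)):
        candidates = [i for i in range(len(rows)) if i not in selected]
        # argmax step: the candidate whose minimum distance (1 - jaccard)
        # to the rows selected so far is largest
        best = max(candidates,
                   key=lambda i: min(1 - jaccard(rows[i], rows[j]) for j in selected))
        selected.append(best)
    return selected

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F'])

# Represent each row as the set of its values.
rows = [set(r) for r in df.itertuples(index=False)]
chosen = greedy_diverse(rows, 3)
print(df.iloc[chosen])
```

Only distances to the rows already selected are computed, so the full pairwise matrix is never built.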

Why isn't this code to plot a histogram on a continuous value Pandas column working?

I am trying to create a histogram of the continuous-valued column Trip_distance in a large (1.4M-row) pandas dataframe. I wrote the following code:
fig = plt.figure(figsize=(17,10))
trip_data.hist(column="Trip_distance")
plt.xlabel("Trip_distance",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.xlim([0.0,100.0])
#plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
But I am not sure why all values give the same frequency plot which shouldn't be the case. What's wrong with the code?
Test data:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
3 2 2015-09-01 00:02:36 2015-09-01 00:06:42 N 1 -73.921387 40.766678 -73.931427 40.771584 1 0.74 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
4 2 2015-09-01 00:00:14 2015-09-01 00:04:20 N 1 -73.955482 40.714046 -73.944412 40.714729 1 0.61 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
5 2 2015-09-01 00:00:39 2015-09-01 00:05:20 N 1 -73.945297 40.808186 -73.937668 40.821198 1 1.07 5.5 0.5 0.5 1.36 0.0 NaN 0.3 8.16 1 1.0
6 2 2015-09-01 00:00:52 2015-09-01 00:05:50 N 1 -73.890877 40.746426 -73.876923 40.756306 1 1.43 6.5 0.5 0.5 0.00 0.0 NaN 0.3 7.80 1 1.0
7 2 2015-09-01 00:02:15 2015-09-01 00:05:34 N 1 -73.946701 40.797321 -73.937645 40.804516 1 0.90 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
8 2 2015-09-01 00:02:36 2015-09-01 00:07:20 N 1 -73.963150 40.693829 -73.956787 40.680531 1 1.33 6.0 0.5 0.5 1.46 0.0 NaN 0.3 8.76 1 1.0
9 2 2015-09-01 00:02:13 2015-09-01 00:07:23 N 1 -73.896820 40.746128 -73.888626 40.752724 1 0.84 5.5 0.5 0.5 0.00 0.0 NaN 0.3 6.80 2 1.0
Trip_distance column
0 0.00
1 0.00
2 0.59
3 0.74
4 0.61
5 1.07
6 1.43
7 0.90
8 1.33
9 0.84
10 0.80
11 0.70
12 1.01
13 0.39
14 0.56
Name: Trip_distance, dtype: float64
After 100 bins:
EDIT:
After your comments, it actually makes perfect sense that you don't get a histogram of each different value. There are 1.4 million rows and ten discrete buckets, so apparently each bucket is exactly 10% (to within what you can see in the plot).
A quick rerun of your data:
In [25]: df.hist(column='Trip_distance')
Prints out absolutely fine.
The df.hist function comes with an optional keyword argument bins=10 which buckets the data into discrete bins. With only 10 discrete bins and a more or less homogeneous distribution of hundreds of thousands of rows, you might not be able to see the difference in the ten different bins in your low resolution plot:
In [34]: df.hist(column='Trip_distance', bins=50)
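To see what the bins argument changes without drawing anything, the bin counts can be computed directly with np.histogram. This is a small sketch on the 15 sample Trip_distance values listed in the question, not on the full 1.4M-row data, and the bin counts of 3 and 10 are arbitrary choices for illustration.

```python
import numpy as np

# The 15 sample Trip_distance values shown in the question.
trip_distance = np.array([0.00, 0.00, 0.59, 0.74, 0.61, 1.07, 1.43, 0.90,
                          1.33, 0.84, 0.80, 0.70, 1.01, 0.39, 0.56])

# Few bins: coarse picture, many values lumped together.
coarse_counts, coarse_edges = np.histogram(trip_distance, bins=3)

# More bins: finer resolution over the same range.
fine_counts, fine_edges = np.histogram(trip_distance, bins=10)

print(coarse_counts, coarse_edges)
print(fine_counts, fine_edges)
```

Both calls cover the same range and count every value exactly once; only the number of buckets differs.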
Here's another way to plot the data. It involves turning the datetime column into an index, which might help you with future slicing:
#convert column to datetime
trip_data['lpep_pickup_datetime'] = pd.to_datetime(trip_data['lpep_pickup_datetime'])
#turn the datetime to an index
trip_data.index = trip_data['lpep_pickup_datetime']
#Plot
trip_data['Trip_distance'].plot(kind='hist')
plt.show()
