How to identify consecutive dates - python

I would like to identify dates in a dataframe which are consecutive, that is, dates that have either an immediate predecessor or an immediate successor. I would then like to mark, in a new column, which dates are and are not consecutive. Additionally, I would like to do this operation within particular subsets of my data.
First I create a new variable where I'd identify True or False for consecutive days.
weatherFile['CONSECUTIVE_DAY'] = 'NA'
I've converted dates into datetime objects then to ordinal ones:
weatherFile['DATE_OBJ'] = [datetime.strptime(d, '%Y%m%d') for d in weatherFile['DATE']]
weatherFile['DATE_INT'] = [d.toordinal() for d in weatherFile['DATE_OBJ']]
Now I would like to identify consecutive dates in the following groups:
weatherFile.groupby(['COUNTY_GEOID_YEAR', 'TEMPBIN'])
I am thinking of looping through the groups and applying an operation that will identify which days are consecutive and which are not, within unique county/tempbin subsets.
I'm rather new to programming and Python. Is this a good approach so far, and if so, how can I progress?
Thank you - Let me know if I should provide additional information.
Update:
Using @karakfa's advice I tried the following:
weatherFile.groupby(['COUNTY_GEOID_YEAR', 'TEMPBIN'])
weatherFile['DISTANCE'] = weatherFile[1:, 'DATE_INT'] - weatherFile[:-1,'DATE_INT']
weatherFile['CONSECUTIVE?'] = np.logical_or(np.insert((weatherFile['DISTANCE']),0,0) == 1, np.append((weatherFile['DISTANCE']),0) == 1)
This resulted in a TypeError: unhashable type. The traceback points to the second line. weatherFile['DATE_INT'] has dtype int64.

You can use .shift(-1) or .shift(1) to compare consecutive entries:
df.loc[df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1, 'CONSECUTIVE_DAY'] = True
This will set CONSECUTIVE_DAY to True where the next entry is the following day.
df.loc[(df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1) | (df['DATE_INT'].shift(1) - df['DATE_INT'] == -1), 'CONSECUTIVE_DAY'] = True
This will set CONSECUTIVE_DAY to True if the entry is preceded by or followed by a consecutive date.
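Since the question asks for this within (COUNTY_GEOID_YEAR, TEMPBIN) subsets, here is a sketch (not from the original answer, and assuming rows are sorted by date within each group) that applies the same comparison per group:
def mark_consecutive(group):
    # A date is consecutive if the previous or the next date in the
    # group is exactly one day away.
    prev_gap = group['DATE_INT'].diff()    # current - previous
    next_gap = group['DATE_INT'].diff(-1)  # current - next
    group['CONSECUTIVE_DAY'] = (prev_gap == 1) | (next_gap == -1)
    return group

weatherFile = weatherFile.groupby(
    ['COUNTY_GEOID_YEAR', 'TEMPBIN'], group_keys=False
).apply(mark_consecutive)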

Once you have the ordinal numbers it's a trivial task. Here I'm using numpy arrays to propose one alternative:
import numpy as np

a = np.array([1, 2, 4, 6, 7, 10, 12, 13, 14, 20])
d = a[1:] - a[:-1]  # compute deltas between neighbours
ind = np.logical_or(np.insert(d, 0, 0) == 1, np.append(d, 0) == 1)  # at least one side matches
a[ind]  # get matching entries
This gives you the numbers that have a consecutive neighbour:
array([ 1, 2, 6, 7, 12, 13, 14])
namely 4, 10 and 20 are removed.

Related

Phone number cleaning

I have a list of phone numbers in pandas like this:
Phone Number
923*********
0923********
03**********
0923********
I want to clean the phone numbers based on two rules
If the length of string is 11, number should start with '03'
If the length of string is 12, number should start with '923'
I want to discard all other numbers.
So far I have tried creating two separate columns with the following code:
before_cust['digits'] = before_cust['Information'].str.len()
before_cust['starting'] = before_cust['Information'].astype(str).str[:3]
before_cust.loc[((before_cust['digits'] == 11) & before_cust[before_cust['starting'].str.contains("03")==True]) | ((before_cust['digits'] == 12) & (before_cust[before_cust['starting'].str.contains("923")==True]))]
However this code doesn't work. Is there a more efficient way to do this?
Create two boolean masks, one for each condition, then filter your dataframe:
# If the length of string is 11, number should start with '03'
m1 = df['Information'].str.len().eq(11) & df['Information'].str.startswith('03')
# If the length of string is 12, number should start with '923'
m2 = df['Information'].str.len().eq(12) & df['Information'].str.startswith('923')
out = df.loc[m1|m2]
print(out)
# Output:
Information
0 923*********
Note: I think it doesn't work because you use str.contains rather than str.startswith.
Assuming you want to get rid of all the rows that do not satisfy your condition (as you haven't included any other information about the dataframe), I'd go with this approach:
func = lambda num: (len(num) == 11 and num.startswith("03")) or (len(num) == 12 and num.startswith("923"))
df = df[df["Information"].apply(func)].reset_index(drop=True)
The lambda function simply returns a boolean that is True if your desired condition is met, else False.
Then simply apply this filter to your dataframe and get rid of all the other rows!
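If the entries are plain digit strings, the same two rules can also be expressed as a single regular expression (a sketch, assuming pandas >= 1.1 for str.fullmatch):
# '03' + 9 digits = 11 chars; '923' + 9 digits = 12 chars
out = df.loc[df['Information'].str.fullmatch(r'03\d{9}|923\d{9}')]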

Counting dates as numbers in python

So I have this excel file that I loaded into my notebook (roughly 900,000 rows), and it has a column "Date" that I want to access. I want to number the dates 1, 2, 3, ... all the way to 900,000. Afterwards I want to be able to square those numbers to create another column in my dataframe.
day = df['Date']

def da():
    for i in range(len(day)):
        print(i + 1)
I end up getting the output I want, but I cannot square it or do much else with it using a for loop. Is there a simpler way to do this? New to this, so thanks for any help.
If you have an index column starting from 1, you don't even have to create an extra Series:
df['squared_index'] = df.index**2
If it starts from zero instead, you can add (or subtract) an offset; in this case we add 1:
df['squared_index'] = (df.index+1)**2
You can also use a Series as an auxiliary column:
df['squared_index'] = pd.Series(range(len(df)))**2
We have used **2, but there are other ways to accomplish the same result:
df.index**2 # regular python **2
np.square(df.index) # numpy square
pd.Series(df.index).pow(2) # pow method - here we have to use a Series
# because it does not apply to indexes
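Putting this together for the question, a minimal sketch that numbers the rows from 1 to len(df) and stores the squares in a new column (the column names here are illustrative):
df['row_number'] = range(1, len(df) + 1)         # 1, 2, 3, ..., len(df)
df['row_number_squared'] = df['row_number'] ** 2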

How to only iterate certain positions in the Itertool combinations

I am working on a python project which iterates through all the possible combinations of entries in a row of excel data to find which combination produces the correct output.
To achieve this, I am iterating through different combinations of 0 and 1 to choose whether that entry is required for the combination. 1 meaning data point is included in the calculation and 0 meaning the data point is not included.
The number of combinations would thus be equal to 2 ^ (Number of excel columns)
Example Excel Data:
1, 22, 7, 11, 2, 4
Example Iteration:
(1, 0, 0, 0, 1, 0)
For example, I could be looking for which combination of the excel data results in an output of 3; the only correct combination of the data above would be the iteration shown.
However, I would know that any value greater than 3 would not be included in a possible combination that would equal 3. As such I would like to choose and set the values of these columns to 0 and iterate the other columns only. This would in turn reduce the number of combinations.
Combination = 2 ^ (Number of excel columns - Fixed Entry Columns)
At the moment I am using itertools.product to get all the combinations I need:
Numbers = ["0", "1"]
for item in itertools.product(Numbers, repeat=len(df.columns)):
    Iteration = pd.DataFrame(item)  # Iteration e.g. (0,1,1,1,0,0,1)
    Data = df.iloc[0]               # Excel data row
    Data = Data.to_numpy()
    Iteration = Iteration.astype(float)
    Answer = np.dot(Data, Iteration)  # Get the result of (Iteration * Data) to check if answer is correct
This results in iterating through combinations which I know will not work.
Is there a way to only iterate 0's and 1's in certain positions of the combination while keeping the known entries a fixed value (either 0 or 1) to reduce the possible combinations?
Some excel files have over 25 columns, which would result in 33,554,432 combinations. As such, I am trying to reduce the number of columns I need to iterate over by fixing the values of the columns that I do know.
If you need further clarification please let me know. I am a novice programmer, so I may be overlooking or overcomplicating a simple solution.
Find which columns meet your criteria for exclusion. Then just get the product combinations for the other columns.
One possible method:
from itertools import product

LIMIT = 10
column_data = [1, 22, 7, 11, 2, 4]
changeable_indexes = [i for i, x in enumerate(column_data) if x <= LIMIT]

for item in product([0, 1], repeat=len(changeable_indexes)):
    row_iteration = [0] * len(column_data)
    for index, value in zip(changeable_indexes, item):
        row_iteration[index] = value
    print(row_iteration)
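Each reduced combination can then be expanded back to a full row mask and tested with the dot product from the question. A sketch building on the code above (the target value here is illustrative):
import numpy as np

target = 3  # hypothetical target output
data = np.array(column_data, dtype=float)

for item in product([0, 1], repeat=len(changeable_indexes)):
    mask = np.zeros(len(column_data))
    mask[changeable_indexes] = item  # columns fixed by the rule stay 0
    if np.dot(data, mask) == target:
        print("matching combination:", mask.astype(int))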

How to check if one Pandas time-series is present in another long time-series?

I have two very long time-series. I have to check if Series B is present (in the given order) in Series A.
Series A: 1,2,3,4,5,6,5,4,3.
Series B: 3,4,5.
Result: True, along with the index where the first element of the small series is found. Here that index is 2 (since 3 appears at index 2 in Series A).
Note: the two series are quite big; let's say A contains 50,000 elements and B contains 350.
A very slow solution is to convert both series to lists and check whether the sub-list appears, in order, within the main list:
def is_series_a_subseries_in_order(main, sub):
    n = len(sub)
    main = main.tolist()
    sub = sub.tolist()
    return any((main[i:i+n] == sub) for i in range(len(main) - n + 1))
will return True or False
A naive approach is to check for B(1) in A. In your example B(1) = A(3), so now you have to check if B(2) = A(4) and you continue till the end of your substring... If it's not correct, start with A(4) and continue till the end.
A better way to search for a substring is to apply Knuth-Morris-Pratt's algorithm. I'll let you search for more information about it!
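For concreteness, a minimal sketch of Knuth-Morris-Pratt adapted to lists of values (illustrative, not part of the original answer):
def kmp_find(main, sub):
    # Returns the index of the first occurrence of sub in main, or -1.
    # Build the failure table: for each prefix of sub, the length of
    # the longest proper prefix that is also a suffix.
    fail = [0] * len(sub)
    k = 0
    for i in range(1, len(sub)):
        while k > 0 and sub[i] != sub[k]:
            k = fail[k - 1]
        if sub[i] == sub[k]:
            k += 1
        fail[i] = k
    # Scan main without ever re-examining matched elements.
    k = 0
    for i, x in enumerate(main):
        while k > 0 and x != sub[k]:
            k = fail[k - 1]
        if x == sub[k]:
            k += 1
        if k == len(sub):
            return i - len(sub) + 1
    return -1

# kmp_find([1, 2, 3, 4, 5, 6, 5, 4, 3], [3, 4, 5]) returns 2
This runs in O(len(main) + len(sub)), avoiding the quadratic worst case of the naive scan.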
Unfortunately the rolling method of pandas cannot be used as an iterator, even though implementation is planned in #11704.
Thus we have to implement a rolling window for subset checking on our own.
import pandas as pd

ser_a = pd.Series(data=[1, 2, 3, 4, 5, 6, 5, 4, 3])
ser_b = pd.Series(data=[3, 4, 5])

slider_df = pd.concat(
    [ser_a.shift(-i)[:ser_b.size] for i in range(ser_a.size - ser_b.size + 1)],
    axis=1).astype(ser_a.dtype).T

sub_series = (ser_b == slider_df).all(axis=1)
# if you want, you can extract only the indices where a subseries was found:
sub_series_startindex = sub_series.index[sub_series]
What I am doing here:
[ser_a.shift(-i)[:ser_b.size] for i in range(ser_a.size - ser_b.size + 1)]: Create a "rolling window" by increased shifting of ser_a, limited to the size of the sub series ser_b to check for. Since shifts at the end will yield NaN, these are excluded in the range.
pd.concat(..., axis=1): Concatenate shifted Series, so that slider_df contains all shifts in the columns.
.astype(ser_a.dtype): is strictly optional. For large Series this may improve performance, for small Series it may degrade performance.
.T: transpose df, so that sub-series-index are aligned by axis 0.
sub_series = (ser_b == slider_df).all(axis=1): Find where ser_b matches sub-series.
sub_series.index[sub_series]: extract the indices, where a matching sub-series was found.
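For the example Series above, sub_series_startindex contains the single index 2, which matches the expected result from the question.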

Remove outliers from the target column when an independent variable column has a specific value

I have a dataframe that looks as follows (see the link below):
df.head(10)
https://ibb.co/vqmrkXb
What I would like to do is remove outliers from the target column (occupied_parking_spaces) when the value of the day column is equal to 6, for instance, which refers to Sunday (df['day'] == 6), using the normal distribution 68-95-99.7 rule.
I tried the following code :
df = df.mask((df['occupied_parking_spaces'] - df['occupied_parking_spaces'].mean()).abs() > 2 * df['occupied_parking_spaces'].std()).dropna()
This line of code removes outliers from the whole dataset regardless of the independent variables, but I only want to remove outliers from the occupied_parking_spaces column where the day value is equal to 6, for example.
What I can do is to create a different dataframe for which I will remove outliers:
sunday_df = df.loc[df['day'] == 0]
sunday_df = sunday_df.mask((sunday_df['occupied_parking_spaces'] - sunday_df['occupied_parking_spaces'].mean()).abs() > 2 * sunday_df['occupied_parking_spaces'].std()).dropna()
But by doing this I will get a separate dataframe for every day of the week, which I will have to concatenate at the end, and this is something I do not want to do, as there must be a way to do this inside the same dataframe.
Could you please help me out?
Having defined some function remove_outliers, you could use np.where to apply it selectively (note that the result should be assigned back to the column, not to the whole dataframe):
import numpy as np

df['occupied_parking_spaces'] = np.where(
    df['day'] == 0,
    remove_outliers(df['occupied_parking_spaces']),
    df['occupied_parking_spaces'],
)
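Alternatively, to handle every day at once inside the same dataframe, a sketch using groupby and transform to apply the question's 2-sigma rule per day:
# keep only rows within 2 standard deviations of their day's mean
grouped = df.groupby('day')['occupied_parking_spaces']
deviation = (df['occupied_parking_spaces'] - grouped.transform('mean')).abs()
df = df[deviation <= 2 * grouped.transform('std')]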
