Phone number cleaning - Python

I have a list of phone numbers in pandas like this:
Phone Number
923*********
0923********
03**********
0923********
I want to clean the phone numbers based on two rules:
If the length of string is 11, number should start with '03'
If the length of string is 12, number should start with '923'
I want to discard all other numbers.
So far I have tried creating two separate columns with the following code:
before_cust['digits'] = before_cust['Information'].str.len()
before_cust['starting'] = before_cust['Information'].astype(str).str[:3]
before_cust.loc[((before_cust['digits'] == 11) & before_cust[before_cust['starting'].str.contains("03")==True]) | ((before_cust['digits'] == 12) & (before_cust[before_cust['starting'].str.contains("923")==True]))]
However this code doesn't work. Is there a more efficient way to do this?

Create two boolean masks, one for each condition, then filter your dataframe:
# If the length of string is 11, number should start with '03'
m1 = df['Information'].str.len().eq(11) & df['Information'].str.startswith('03')
# If the length of string is 12, number should start with '923'
m2 = df['Information'].str.len().eq(12) & df['Information'].str.startswith('923')
out = df.loc[m1|m2]
print(out)
# Output:
Information
0 923*********
Note: I think your code doesn't work because you use str.contains (which matches anywhere in the string) rather than str.startswith.
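If you prefer a single expression, both rules also collapse into one regular expression with str.fullmatch. This is a sketch, assuming the numbers are stored as strings with no missing values that matter:
# 11 chars starting with '03' -> '03' + 9 digits; 12 chars starting with '923' -> '923' + 9 digits
out = df[df['Information'].str.fullmatch(r'03\d{9}|923\d{9}', na=False)]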

Assuming you want to get rid of all the rows that do not satisfy your conditions (you haven't included any other information about the dataframe), I'd go with this approach:
func = lambda num :(len(num) == 11 and num.startswith("03")) or (len(num) == 12 and num.startswith("923"))
df = df[df["Information"].apply(func)].reset_index(drop = True)
The lambda function returns True if either of your desired conditions is met, and False otherwise.
Then simply apply this filter to your dataframe and get rid of all the other rows!

Related

How do you sum a dataframe based on a grouping in Python pandas?

I have a for loop with the intent of checking for values greater than zero.
Problem is, I only want each iteration to check the sum of a group of IDs.
The grouping would be a match of the first 8 characters of the ID string.
I have that grouping taking place before the loop but the loop still appears to search the entire df instead of each group.
LeftGroup = newDF.groupby('ID_Left_8')
for g in LeftGroup.groups:
    if sum(newDF['Hours_Calc'] > 0):
        print(g)
Is there a way to filter that sum to each grouping of leftmost 8 characters?
I was expecting the .groups function to accomplish this, but it still seems to search every single ID.
Thank you.
def filter_and_sum(group):
    return sum(group[group['Hours_Calc'] > 0]['Hours_Calc'])

LeftGroup = newDF.groupby('ID_Left_8')
results = LeftGroup.apply(filter_and_sum)
print(results)
This will compute the sum of the Hours_Calc column for each group, filtered by the condition Hours_Calc > 0. The resulting series will have the leftmost 8 characters as the index, and the sum of the Hours_Calc column as the value.
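If you prefer to avoid apply, an equivalent vectorized variant filters the positive rows first and then aggregates. A runnable sketch with made-up sample data:
import pandas as pd

# Hypothetical sample data for illustration
newDF = pd.DataFrame({
    'ID_Left_8': ['AAAA1111', 'AAAA1111', 'BBBB2222', 'BBBB2222'],
    'Hours_Calc': [5.0, -2.0, 0.0, 3.5],
})

# Keep only positive hours, then sum per 8-character ID prefix
results = newDF.loc[newDF['Hours_Calc'] > 0].groupby('ID_Left_8')['Hours_Calc'].sum()
print(results)
# ID_Left_8
# AAAA1111    5.0
# BBBB2222    3.5
# Name: Hours_Calc, dtype: float64
Note that groups with no positive hours drop out of this result, whereas the apply version returns 0 for them.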

If the value is a number between x and y, change it to z (pd df)

Consider:
data.loc[data.cig_years (10 < a < 20), 'cig_years'] = 1
This is the code I have tried, but it's not working. In pseudocode I want:
In the df 'data'
In the column 'cig_years'
If the value is a number between 10 and 20, change it to 1
Is there a Pythonic way of doing this? Preferably without for loops.
You need to use your dataframe name "data" on both sides of the comparison and combine the two bounds with & (a chained comparison like 10 < ... < 20 raises an error on a Series):
data.loc[(data['cig_years'] > 10) & (data['cig_years'] < 20), 'cig_years'] = 1
I got you. You can filter pandas dataframes with square brackets []:
data['cig_years'][(data['cig_years'] > 10) & (data['cig_years'] < 20)] = 1
This basically says:
The column 'cig_years' in data, where 'cig_years' is more than 10 and less than 20, is set equal to 1. (With newer pandas, prefer assigning through .loc as above to avoid chained-assignment issues.)
This is super useful in pandas dataframes because you can filter for specific columns, or filter by conditions on other columns, and then set those filtered values.
You could also use an np.where statement.
This statement assumes you are going to leave cig_years alone if it is not between 10 and 20:
import numpy as np

data['cig_years'] = np.where((data['cig_years'] > 10) & (data['cig_years'] < 20), 1, data['cig_years'])
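Series.between is another option here. A sketch, where inclusive='neither' assumes the bounds 10 and 20 are meant to be exclusive:
# Build the mask once with between, then assign through .loc
mask = data['cig_years'].between(10, 20, inclusive='neither')
data.loc[mask, 'cig_years'] = 1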

How to strip values from columns

I have this dataset where there is a column named 'Discount' and the values are given as '20% off', '25% off' etc.
What I want is to keep just the number in the column and remove the % symbol and the 'off' string.
I'm using this formula to do it:
df['discount'] = df['discount'].apply(lambda x: x.lstrip('%').rstrip('off'))
However, when I apply that formula, all the values in the column 'discount' become "nan".
I also tried this formula:
df['discount'] = df['discount'].str.replace('off' , '')
However, this does the same thing.
Is there any other way of handling this? I just want all the values in that column to be just the number, like 25, 20, 10, without the percentage sign and the string.
Try this:
df['discount'] = df['discount'].str.replace(r'(%|\s*off)', '', regex=True).astype(int)
Output:
>>> df
discount
0 20
1 25
I came up with this solution:
df['discount'] = df['discount'].str.split('%').str[0]
or as int:
df['discount'] = df['discount'].str.split('%').str[0].astype(int)
We chop each string in two pieces at the %-sign and then take the first part, the number.
If you have a fixed '% off' suffix, the most efficient option is to just remove the last 5 characters:
df['discount'] = df['discount'].str[:-5].astype(int)
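If the formatting varies (stray spaces, a missing 'off', etc.), a regex extraction is a more forgiving sketch:
# Pull out the first run of digits and convert to int
df['discount'] = df['discount'].str.extract(r'(\d+)', expand=False).astype(int)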

Pandas - how to filter dataframe by regex comparisons on multiple column values

I have a dataframe like the following, where everything is formatted as a string:
df
  property  value  count
0   propAb   True     10
1   propAA  False     10
2   propAB   blah     10
3   propBb      3      8
4   propBA      4      7
5   propCa    100      4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
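If you do need the numeric rule, one sketch is to coerce the value column with pd.to_numeric first; non-numeric strings become NaN, and NaN comparisons evaluate to False, so those rows are never flagged:
value_num = pd.to_numeric(df['value'], errors='coerce')  # 'True', 'blah' -> NaN
rule2 = df.property.str.startswith('propB') & (value_num < 4)
df = df[~rule2]  # Keep everything that does NOT match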
For the first one:
df = df.drop(df[df.property.str.startswith('propA') & (df.value != 'True')].index)
and the other one (after coercing value to numeric, since the column holds strings):
df = df.drop(df[df.property.str.startswith('propB') & (pd.to_numeric(df.value, errors='coerce') < 4)].index)
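To apply a whole series of such rules without repeating the drop boilerplate, here is a sketch (the rule list below is hypothetical) that builds one boolean mask per rule and combines them:
import pandas as pd

# Hypothetical rules: (property prefix, predicate over the value column)
rules = [
    ('propA', lambda v: v != 'True'),
    ('propB', lambda v: pd.to_numeric(v, errors='coerce') < 4),
]

drop = pd.Series(False, index=df.index)
for prefix, predicate in rules:
    drop |= df['property'].str.startswith(prefix) & predicate(df['value'])
df = df[~drop]  # keep rows no rule flagged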

How to identify consecutive dates

I would like to identify dates in a dataframe which are consecutive, that is there exists either an immediate predecessor or successor. I would then like to mark which dates are and are not consecutive in a new column. Additionally I would like to do this operation within particular subsets of my data.
First I create a new variable where I'd identify True or False for consecutive days.
weatherFile['CONSECUTIVE_DAY'] = 'NA'
I've converted dates into datetime objects then to ordinal ones:
from datetime import datetime

weatherFile['DATE_OBJ'] = [datetime.strptime(d, '%Y%m%d') for d in weatherFile['DATE']]
weatherFile['DATE_INT'] = [d.toordinal() for d in weatherFile['DATE_OBJ']]
Now I would like to identify consecutive dates in the following groups:
weatherFile.groupby(['COUNTY_GEOID_YEAR', 'TEMPBIN'])
I am thinking of looping through the groups and applying an operation that will identify which days are consecutive and which are not within each unique county/tempbin subset.
I'm rather new to programming and Python. Is this a good approach so far? If so, how can I progress?
Thank you - Let me know if I should provide additional information.
Update:
Using @karakfa's advice I tried the following:
weatherFile.groupby(['COUNTY_GEOID_YEAR', 'TEMPBIN'])
weatherFile['DISTANCE'] = weatherFile[1:, 'DATE_INT'] - weatherFile[:-1,'DATE_INT']
weatherFile['CONSECUTIVE?'] = np.logical_or(np.insert((weatherFile['DISTANCE']),0,0) == 1, np.append((weatherFile['DISTANCE']),0) == 1)
This resulted in a TypeError: unhashable type. The traceback points to the second line. weatherFile['DATE_INT'] is dtype: int64.
You can use .shift(-1) or .shift(1) to compare consecutive entries:
df.loc[df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1, 'CONSECUTIVE_DAY'] = True
This will set CONSECUTIVE_DAY to True if the next entry is the following day.
df.loc[(df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1) | (df['DATE_INT'].shift(1) - df['DATE_INT'] == -1), 'CONSECUTIVE_DAY'] = True
This will set CONSECUTIVE_DAY to True if the entry is preceded by or followed by a consecutive date.
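To honor your (COUNTY_GEOID_YEAR, TEMPBIN) groups, a sketch of the same idea using groupby with diff, so the comparison never crosses a group boundary:
g = weatherFile.groupby(['COUNTY_GEOID_YEAR', 'TEMPBIN'])['DATE_INT']
# diff() compares each row to the previous row in its group, diff(-1) to the next row
weatherFile['CONSECUTIVE_DAY'] = g.diff().eq(1) | g.diff(-1).eq(-1)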
Once you have the ordinal numbers it's a trivial task; here I'm using numpy arrays to propose one alternative:
import numpy as np

a = np.array([1, 2, 4, 6, 7, 10, 12, 13, 14, 20])
d = a[1:] - a[:-1]  # compute deltas between neighbors
ind = np.logical_or(np.insert(d, 0, 0) == 1, np.append(d, 0) == 1)  # at least one side matches
a[ind]  # get matching entries
This gives you the numbers that have a consecutive neighbor:
array([ 1, 2, 6, 7, 12, 13, 14])
namely 4, 10, and 20 are removed.
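The same two-sided check translates back to pandas directly, a sketch using Series.diff in both directions:
import pandas as pd

s = pd.Series([1, 2, 4, 6, 7, 10, 12, 13, 14, 20])
keep = s.diff().eq(1) | s.diff(-1).eq(-1)  # previous-day or next-day neighbor
print(s[keep].tolist())  # [1, 2, 6, 7, 12, 13, 14]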
