How to speed up pandas drop() method? - python

I have a large excel file to clean around 200000 rows. So Im using pandas to drop unwanted rows if the conditions meet but it takes some time to run.
My current code looks like this
def cleanNumbers(number): # checks number if it is a valid number
vaild = True
try:
num = pn.parse('+' + str(number), None)
if not pn.is_valid_number(num):
vaild = False
except:
vaild = False
return vaild
for UncleanNum in tqdm(TeleNum):
valid = cleanNumbers(UncleanNum) # calling cleanNumbers function
if valid is False:
df = df.drop(df[df.telephone == UncleanNum].index)
# dropping row if number is not a valid number
It takes around 30 min for this line of code to finish. Is there a more efficient way to drop rows with pandas? If not can I use numpy to have the same output?
Im not that aquainted with pandas or numpy so if you have any tips to share it would be helpful.
Edit:
Im using phonenumbers lib to check if the telephone number is valid. If its not a valid phonenumber i drop the row that number is on.
Example data
address name surname telephone
Street St. Bill Billinson 7398673456897<--let say this is wrong
Street St. Nick Nick 324523452345
Street St. Sam Sammy 234523452345
Street St. Bob Bob 32452345234534<--and this too
Street St. John Greg 234523452345
Output
address name surname telephone
Street St. Nick Nick 324523452345
Street St. Sam Sammy 234523452345
Street St. John Greg 234523452345
This is what my code does but it slow.

In my opinion here main bootleneck is not drop, but custom function repeating for large number of values.
Create list of all valid numbers and then filter by boolean indexing with Series.isin:
v = [UncleanNum for UncleanNum in tqdm(TeleNum) if cleanNumbers(UncleanNum)]
df = df[df.telephone.isin(v)]
EDIT:
After some testing solution should be simplify, because function return boolean:
df1 = df[df['telephone'].apply(cleanNumbers)]

Related

Pandas DF: Create New Col by removing last word from of existing column

This should be easy, but I'm stumped.
I have a df that includes a column of PLACENAMES. Some of these have multiple word names:
Able County
Baker County
Charlie County
St. Louis County
All I want to do is to create a new column in my df that has just the name, without the "county" word:
Able
Baker
Charlie
St. Louis
I've tried a variety of things:
1. places['name_split'] = places['PLACENAME'].str.split()
2. places['name_split'] = places['PLACENAME'].str.split()[:-1]
3. places['name_split'] = places['PLACENAME'].str.rsplit(' ',1)[0]
4. places = places.assign(name_split = lambda x: ' '.join(x['PLACENAME].str.split()[:-1]))
Works - splits the names into a list ['St.','Louis','County']
The list splice is ignored, resulting in the same list ['St.','Louis','County'] rather than ['St.','Louis']
Raises a ValueError: Length of values (2) does not match length of index (41414)
Raises a TypeError: sequence item 0: expected str instance, list found
I've also defined a function and called it with .assign():
def processField(namelist):
words = namelist[:-1]
name = ' '.join(words)
return name
places = places.assign(name_split = lambda x: processField(x['PLACENAME]))
This also raises a TypeError: sequence item 0: expected str instance, list found
This seems to be a very simple goal and I've probably overthought it, but I'm just stumped. Suggestions about what I should be doing would be deeply appreciated.
Apply Series.str.rpartition function:
places['name_split'] = places['PLACENAME'].str.rpartition()[0]
Use str.replace to remove the last word and the preceding spaces:
places['new'] = place['PLACENAME'].str.replace(r'\s*\w+$', '', regex=True)
# or
places['new'] = place['PLACENAME'].str.replace(r'\s*\S+$', '', regex=True)
# or, only match 'County'
places['new'] = place['PLACENAME'].str.replace(r'\s*County$', '', regex=True)
Output:
PLACENAME new
0 Able County Able
1 Baker County Baker
2 Charlie County Charlie
3 St. Louis County St. Louis
regex demo

Setting specific rows to the value found in a row if differing index

I work with a lot of CSV data for my job. I am trying to use Pandas to convert the member 'Email' to populate into the row of their spouses 'PrimaryMemberEmail' column. Here is a sample of what I mean:
import pandas as pd
user_data = {'FirstName':['John','Jane','Bob'],
'Lastname':['Snack','Snack','Tack'],
'EmployeeID':['12345','12345S','54321'],
'Email':['John#issues.com','NaN','Bob#issues.com'],
'DOB':['09/07/1988','12/25/1990','07/13/1964'],
'Role':['Employee On Plan','Spouse On Plan','Employee Off Plan'],
'PrimaryMemberEmail':['NaN','NaN','NaN'],
'PrimaryMemberEmployeeId':['NaN','12345','NaN']
}
df = pd.DataFrame(user_data)
I have thousands of rows like this. I need to only populate the 'PrimaryMemberEmail' when the user is a spouse with the 'Email' of their associated primary holders email. So in this case I would want to autopopulate the 'PrimaryMemberEmail' for Jane Snack to be that of her spouse, John Snack, which is 'John#issues.com' I cannot find a good way to do this. currently I am using:
for i in (df['EmployeeId']):
p = (p + len(df['EmployeeId']) - (len(df['EmployeeId'])-1))
EEID = df['EmployeeId'].iloc[p]
if 'S' in EEID:
df['PrimaryMemberEmail'].iloc[p] = df['Email'].iloc[p-1]
What bothers me is that this only works if my file comes in correctly, like how I showed in the example DataFrame. Also my NaN values do not work with dropna() or other methods, but that is a question for another time.
I am new to python and programming. I am trying to add value to myself in my current health career and I find this all very fascinating. Any help is appreciated.
IIUC, map the values and fillna:
df['PrimaryMemberEmail'] = (df['PrimaryMemberEmployeeId']
.map(df.set_index('EmployeeID')['PrimaryMemberEmail'])
.fillna(df['PrimaryMemberEmail'])
)
Alternatively, if you have real NaNs, (not strings), use boolean indexing:
df.loc[df['PrimaryMemberEmployeeId'].notna(),
'PrimaryMemberEmail'] = df['PrimaryMemberEmployeeId'].map(df.set_index('EmployeeID')['PrimaryMemberEmail'])
output:
FirstName Lastname EmployeeID DOB Role PrimaryMemberEmail PrimaryMemberEmployeeId
0 John Mack 12345 09/07/1988 Employee On Plan John#issues.com NaN
1 Jane Snack 12345S 12/25/1990 Spouse On Plan John#issues.com 12345
2 Bob Tack 54321 07/13/1964 Employee Off Plan Bob#issues.com NaN

How to filter and sort specific csv using python

Please help me with the python script to filter the below CSV.
Below is the example of the CSV dump for which I have done the initial filtration.
Last_name
Gender
Name
Phone
city
Ford
Male
Tom
123
NY
Rich
Male
Robert
21312
LA
Ford
Female
Jessica
123123
NY
Ford
Male
John
3412
NY
Rich
Other
Linda
12312
LA
Ford
Other
James
4321
NY
Smith
Male
David
123123
TX
Rich
Female
Mary
98689
LA
Rich
Female
Jennifer
86860
LA
Ford
Male
Richard
12123
NY
Smith
Other
Daniel
897097
TX
Ford
Other
Lisa
123123123
NY
import re
def gather_info (L_name):
dump_filename = "~/Documents/name_report.csv"
LN = []
with open(dump_filename, "r") as FH:
for var in FH.readlines():
if L_name in var
final = var.split(",")
print(final[1], final[2], final[3])
return LN
if __name__ == "__main__":
L_name = input("Enter the Last name: ")
la_name = gather_info(L_name)
By this, I am able to filter by the last name. for example, if I choose L_name as Ford, then I have my output as
Gender
Name
Phone
Male
Tom
123
Female
Jessica
123123
Male
John
3412
Other
James
4321
Male
Richard
12123
Other
Lisa
22412
I need help extending the script by selecting each gender and the values in the list to perform other functions, then calling the following gender and the values to achieve the same functions. for example, first, it selects the gender Male [Tom, John] and performs other functions. then selects the next gender Female [Jessica] and performs the same functions and then selects the gender Other [James, Lisa] and performs the same functions.
I would recomend using the pandas module which allows for easy filtering and grouping of data
import pandas as pd
if __name__ == '__main__':
data = pd.read_csv('name_reports.csv')
L_name = input("Enter the last name: ")
by_last_name = data[data['Last_name'] == L_name]
groups = by_last_name.groupby(['Gender'])
for group_name, group_data in groups:
print(group_name)
print(group_data)
Breaking this down into its pieces the first part is
data = pd.read_csv('name_reports.csv')
This reads the data from the csv and places it into a dataframe
Second we have
by_last_name = data[data['Last_name'] == L_name]
This filters the dataframe to only have results with Last_name equal to L_name
Next we group the data.
groups = by_last_name.groupby(['Gender'])
this groups the filtered data frames by gender
then we iterate over this. It returns a tuple with the group name and the dataframe associated with that group.
for group_name, group_data in groups:
print(group_name)
print(group_data)
This loop just prints out the data to access fields from it you can use the iterrows function
for index,row in group_data.iterrows():
print(row['city']
print(row['Phone']
print(row['Name']
And then you can use those for whatever function you want. I would recommend reading on the documentation for pandas since depending on the function you plan on using there may be a better way to do it using the library. Here is the link to the library https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Since you cannot use the pandas module then a method using only the csv module would look like this
import csv
def has_last_name(row,last_name):
return row['Last_name'] == last_name
def has_gender(row,current_gender):
return row['Gender'] == current_gender
if __name__ == '__main__':
data = None
genders = ['Male','Female','Other']
with open('name_reports.csv') as csvfile:
data = list(csv.DictReader(csvfile,delimiter=','))
L_name = input('Enter the Last name: ')
get_by_last_name = lambda row: has_last_name(row,L_name)
filtered_by_last_name = list(filter(get_by_last_name,data))
for gender in genders:
get_by_gender = lambda row: has_gender(row,gender)
filtered_by_gender = list(filter(get_by_gender,filtered_by_last_name))
print(filtered_by_gender)
The important part is the filter built in function. This takes in a function that takes in an item from a list and returns a bool. filter takes this function and an iterable and returns a generator of items that return true for that function. The other important part is the csv.DictReader which returns your csv file as a dictionary which makes allows you to access attributes by key instead of by index.

Formatting strings in a dataframe

i have a dataframe
Name
Joe Smith
Jane Doe
Homer Simpson
i am trying to format this to get to
Name
Smith, Joe
Doe, Jane
Simpson, Homer
i have this code, and it works for ~ 80% of users in my list but some users are not coming through right.
invalid_users = ['Test User', 'Test User2', 'Test User3']
for index, row in df_Users.iterrows():
gap_pos = df_Users["Name"][index].find(" ")
if gap_pos > 0 and row["Name"] not in invalid_users:
row["Name"] = df_Users["Name"][index][len(df_Users["Name"][index])-gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
the users who are not coming through correctly, usually their last name is truncated somewhere - i.e. Simpson ==> mpson
What am I doing wrong here?
Just split on space, then reverse it (that's what .str[::-1] is doing) and join on , :
>>> df['Name'].str.split(' ').str[::-1].str.join(', ')
0 Smith, Joe
1 Doe, Jane
2 Simpson, Homer
Name: Name, dtype: object
And if your data contains the name like Jr. Joe Smith, then you may do it following way:
df['Name'].str.split(' ').str[::-1].apply(lambda x:(x[0],' '.join(x[1:]))).str.join(', ')
I'm not sure what you were trying to with len there, but it's not right. You just want to start straight from gap_pos:
row["Name"] = df_Users["Name"][index][gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
I would be tempted to use split for this.
Pandas is a library that takes profit of vectorial operations, especially for simple transformations and most of DataFrame manipulations.
Given your example, here is a code that would work:
import pandas as pd
df = pd.DataFrame({"name": ["Joe Smith", "Jane Doe", "Homer Simpson"]})
# df
# name
# 0 Joe Smith
# 1 Jane Doe
# 2 Homer Simpson
df["name"] = df["name"].apply(lambda x: f"{x.split(' ')[1]}, {x.split(' ')[0]}")
# df
# name
# 0 Smith, Joe
# 1 Doe, Jane
# 2 Simpson, Homer
The apply function takes every row and applies the specified function to each one of them.
Here, the specified function is a lambda function that, supposing the name pattern is "FirstName LastName", does what you want.

pandas row manipulation - If startwith keyword found - append row to end of previous row

I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great & somewhat uniform however, still just one column. Ultimately, I'd like to append the row where the keyword is found to the end of the top previous row until data is one long row. Then I'll use str.split() to cut up sections into columns as I need.
In Excel (code below-Top) I took this same text file and removed headers, aligned left, and performed searches for keywords. When found, Excel has a nice feature called offset where you can place or append the cell value basically anywhere using this offset(x,y).value from the active-cell start position. Once done, I would delete the row. This allowed my to get the data into a tabular column format that I could work with.
What I Need:
The below Python code will cycle down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append the row to the end of the previous row. This is where my problem is. I can not find a way to get the active row number into a variable so I can use in place of the word [index] for the active row. Or [index-1] for the previous row.
Excel Code of similar task
Do
Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
If Not Rng Is Nothing Then
Rng.Offset(-1, 2).Value = Rng.Value
Rng.Value = ""
End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series
file = {'Test': ['Last Name: Nobody','First Name: Tommy','Address: 1234 West Juniper St.','Fav
Toy', 'Notes','Time Slot' ] }
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
if line.startswith('Address:'):
df.loc[[index-1],:].values = df.loc[index-1].values + ' ' + df.loc[index].values
Line above does not work with index statement
else:
pass
# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values # copies row 2 at the end of row 1,
# works with static row numbers only
# df.drop([2,0], inplace=True) # Deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the entire series vectorization approach but still stuck trying loops that I'm semi familiar with. If there is a way to achieve this please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test then use Series.str.startswith to create a boolean mask, then use boolean indexing with this mask to update the values in Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot

Categories

Resources