I am quite new to pandas (I'm working with third-party code and forced to use it!), and I have a dataframe that looks like this:
name_id  cookie_id  file_name_id
John     56         /some/loc
Doe      45         /some/loc2
John     67         /some/loc3
hilary   768        /some/loc4
wendy    8          /some/loc3
hilary   4          /some/loc4
I would like to sort them by name_id, like so:
name_id  cookie_id  file_name_id
Doe      45         /some/loc2
John     56         /some/loc
John     67         /some/loc3
hilary   768        /some/loc4
hilary   4          /some/loc4
wendy    8          /some/loc3
I am looking at:
df.sort_values(by=['name_id'])
and it does seem to give me the correct answer, but since I am new to pandas, I am afraid there might be some gotchas I need to be aware of.
df.sort_values(by=['name_id']) should be perfectly fine to use. Watch out for spaces at the beginning of the name_id strings, as those sort first: " wendy" would be placed at the top in your case. Note also that sorting is case-sensitive by default, with uppercase letters ordered before lowercase ones, which happens to match your expected output (Doe and John before hilary and wendy).
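If either gotcha matters for your data, here is a minimal sketch of normalizing the column while sorting (assuming pandas >= 1.1, which introduced the key argument to sort_values):

import pandas as pd

df = pd.DataFrame({
    "name_id": ["John", "Doe", " wendy", "hilary"],
    "cookie_id": [56, 45, 8, 768],
})

# strip stray whitespace before comparing, without modifying the column itself
by_name = df.sort_values(by="name_id", key=lambda s: s.str.strip())

# case-insensitive variant, if uppercase-before-lowercase ordering is unwanted
by_name_ci = df.sort_values(by="name_id", key=lambda s: s.str.lower())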
I'm looking to append a column in a pandas data frame that is similar to the following "Identifier" column:
Name         Age  Identifier
Peter Pan    13   PanPe
James Jones  24   JonesJa
Peter Pan    22   PanPe
Chris Smith  19   SmithCh
I need the "Identifier" column to look like:
Identifier
PanPe01
JonesJa01
PanPe02
SmithCh01
How would I number each original string with 01? And if there are duplicates (for example Peter Pan), how do I give the subsequent duplicates (after the original 01) 02, 03, and so forth?
I've been referred to the following approach:
combo = "PanPe"
counts = {}
if combo in counts:
    count = counts[combo]
    counts[combo] = count + 1
else:
    counts[combo] = 1
However, a working code example would be ideal, as I am relatively new to Python, and I would love to know the syntax for running an entire column through this process, instead of just the one string shown above with "PanPe".
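For reference, a direct translation of that counter logic into a loop over the whole column might look like the sketch below (sample data taken from the question; the answer that follows shows a more idiomatic pandas approach):

import pandas as pd

df = pd.DataFrame({
    "Name": ["Peter Pan", "James Jones", "Peter Pan", "Chris Smith"],
    "Age": [13, 24, 22, 19],
    "Identifier": ["PanPe", "JonesJa", "PanPe", "SmithCh"],
})

counts = {}
new_ids = []
for combo in df["Identifier"]:
    # running count per string: 1 the first time it appears, then 2, 3, ...
    counts[combo] = counts.get(combo, 0) + 1
    # zero-pad the occurrence number, e.g. "PanPe" -> "PanPe01"
    new_ids.append(f"{combo}{counts[combo]:02d}")

df["Identifier"] = new_ids
print(df)  # PanPe01, JonesJa01, PanPe02, SmithCh01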
You can use cumcount here:
df['new_Identifier'] = df['Identifier'] + (df.groupby('Identifier').cumcount() + 1).astype(str).str.pad(2, 'left', '0')  # thanks @dm2 for the str.pad part
Output:
Name Age Identifier new_Identifier
0 Peter Pan 13 PanPe PanPe01
1 James Jones 24 JonesJa JonesJa01
2 Peter Pan 22 PanPe PanPe02
3 Chris Smith 19 SmithCh SmithCh01
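A small variant on the same idea: str.zfill performs the identical zero-padding as str.pad(2, 'left', '0') and may read a little more clearly:

df['new_Identifier'] = df['Identifier'] + (df.groupby('Identifier').cumcount() + 1).astype(str).str.zfill(2)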
I'm having trouble formulating in pandas a statement that would be very simple in Excel. I have a dataframe sample as follows:
    colA      colB      colC
10  0         27:15     John Doe
11  0         24:33     John Doe
12  1         29:43     John Doe
13  Inactive  John Doe  None
14  N/A       John Doe  None
Obviously the dataframe is much larger than this, with 10,000+ rows, so I'm trying to find an easier way to do this. I want to create a column that checks whether colA is equal to 0 or 1. If so, the new column equals colC; if not, it equals colB. In Excel, I would simply create a new column (new_col) and write
=IF(OR(A2=0,A2=1),C2,B2)
And then drag fill the entire sheet.
I'm sure this is fairly simple but I cannot for the life of me figure this out.
The result should look like this:
    colA      colB      colC      new_col
10  0         27:15     John Doe  John Doe
11  0         24:33     John Doe  John Doe
12  1         29:43     John Doe  John Doe
13  Inactive  John Doe  None      John Doe
14  N/A       John Doe  None      John Doe
np.where should do the trick.
df['new_col'] = np.where(df['colA'].isin([0, 1]), df['colC'], df['colB'])
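As a side note, if more rules are ever needed, np.select generalizes np.where to several conditions; here is a minimal sketch using the sample values from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colA": [0, 0, 1, "Inactive", "N/A"],
    "colB": ["27:15", "24:33", "29:43", "John Doe", "John Doe"],
    "colC": ["John Doe", "John Doe", "John Doe", None, None],
})

active = df["colA"].isin([0, 1])
conditions = [active, ~active]     # active rows take colC, the rest take colB
choices = [df["colC"], df["colB"]]

df["new_col"] = np.select(conditions, choices)
print(df)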
Here is a solution that appends your results to a list given your conditions, then adds the list back into the dataframe as a colD column.
your_results = []
for i, data in enumerate(df["colA"]):
    # use positional .iloc lookups, since the index labels here are 10-14
    if data == 0 or data == 1:
        your_results.append(df["colC"].iloc[i])
    else:
        your_results.append(df["colB"].iloc[i])
df["colD"] = your_results
I have a dataset that looks like this:
The count represents the number of times they worked.
Title    Name   Count
Coach    Bob    4
teacher  sam    5
driver   mark   8
Coach    tina   10
teacher  kate   3
driver   frank  2
I want to create a table (which I think will have to be a pivot) that groups name and count under each title, sorted by count, so for example the output would look like this:
coach    teacher  driver
tina 10  sam 5    mark 8
bob 4    kate 3   frank 2
I am familiar with general pivot table code but I think I'm going to need to use something a little bit more comprehensive.
DF_PIV = pd.pivot_table(DF, values=['count'], index=['title', 'Name'],
                        columns=['title'], aggfunc=np.max)
I get the error ValueError: Grouper for 'view_title' not 1-dimensional, but I do not even think I am on the right track here.
You can try:
(df.set_index(['Title', df.groupby('Title').cumcount()])
.unstack(0)
.astype(str)
.T
.groupby(level=1).agg(' '.join)
.T)
Output:
Title Coach driver teacher
0 Bob 4 mark 8 sam 5
1 tina 10 frank 2 kate 3
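For comparison, an equivalent sketch that builds the combined "name count" string first and then pivots it; column names are taken from your sample, and sorting by Count first puts the highest counts on top:

out = (df.sort_values('Count', ascending=False)
         .assign(pair=lambda d: d['Name'] + ' ' + d['Count'].astype(str),
                 n=lambda d: d.groupby('Title').cumcount())
         .pivot(index='n', columns='Title', values='pair'))
print(out)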
I have a big list of full names, for example:
datafile.csv:
full_name, dob,
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012
I'm trying to use fuzz.ratio to see if the names in the full_name column have any similarities, but the code takes forever, mostly because of the nested for loop.
sample code:
dataframe = pd.read_csv('datafile.csv')
_list = []
for row1 in dataframe['full_name']:
    for row2 in dataframe['full_name']:
        x = fuzz.ratio(row1, row2)
        if x > 90:
            _list.append([row1, row2, x])
print(_list)
Is there a better method to iterate over a single pandas column to get a ratio for potential duplicate data?
Thanks
Jim
You can first create the fuzzy-match data:
import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz
data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")
df = pd.read_csv(data, names=['full_name'])
for index, row in df.iterrows():
    df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))
print(df.to_string())
Output:
full_name Jerry Smith Morty Smith Rick Sanchez Jery Smith Morti Smith
0 Jerry Smith 100 73 26 95 64
1 Morty Smith 73 100 26 76 91
2 Rick Sanchez 26 26 100 27 35
3 Jery Smith 95 76 27 100 67
4 Morti Smith 64 91 35 67 100
Then find the best matches for a selected name:
data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)
Output:
full_name Jerry Smith Morty Smith Rick Sanchez Jery Smith Morti Smith
0 Jerry Smith 100 73 26 95 64
3 Jery Smith 95 76 27 100 67
import pandas as pd
from io import StringIO
from fuzzywuzzy import process
s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""
df = pd.read_csv(StringIO(s))
# 1 - use fuzzywuzzy.process.extract with list comprehension
# 2 - You still have to iterate once but this method avoids the use of apply, which can be very slow
# 3 - convert the list comprehension results to a dataframe
# Note that I am limiting the results to one match. You can adjust the code as you see fit
df2 = pd.DataFrame([process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0] for i in range(len(df))],
index=df.index, columns=['match_name', 'match_percent', 'match_index'])
# join the new dataframe to the original
final = df.join(df2)
full_name dob match_name match_percent match_index
0 Jerry Smith 21/01/2010 Jery Smith 95 3
1 Morty Smith 18/06/2008 Morti Smith 91 4
2 Rick Sanchez 27/04/1993 Morti Smith 43 4
3 Jery Smith 27/12/2012 Jerry Smith 95 0
4 Morti Smith 13/03/2012 Morty Smith 91 1
This comparison method is doing double work, since running fuzz.ratio between "Jerry Smith" and "Morti Smith" is the same as between "Morti Smith" and "Jerry Smith".
If you iterate through a sub-array, you will be able to complete this faster.
dataframe = pd.read_csv('datafile.csv')
_list = []
for i_dataframe in range(len(dataframe) - 1):
    comparison_fullname = dataframe['full_name'][i_dataframe]
    # compare only against the names that come after this one;
    # pass a plain list so process.extract yields (name, score) pairs
    remaining = dataframe['full_name'][i_dataframe + 1:].tolist()
    for entry_fullname, entry_score in process.extract(comparison_fullname, remaining, scorer=fuzz.ratio):
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))
print(_list)
This will prevent any duplicate work.
There are generally two parts that can help you to improve the performance:
reduce the number of comparisons
use a faster way to match the strings
In your implementation you are performing a lot of unneeded comparisons, since you are always comparing A <-> B and later on B <-> A. You are also comparing A <-> A, which always scores 100. So you can reduce the number of comparisons by more than 50%. Since you only want to keep matches with a score over 90, that threshold can be used to speed up the comparisons. This cannot be done in FuzzyWuzzy, but it can be done in RapidFuzz (I am the author). RapidFuzz implements the same algorithms as FuzzyWuzzy behind a relatively similar interface, but with a lot of performance improvements.
Your code could be reworked as follows to implement those two changes, which should make it a lot faster: on my machine your version runs in around 12 seconds, while this improved version requires only 1.7 seconds.
import pandas as pd
from io import StringIO
from rapidfuzz import fuzz
# generate a bigger list of examples to show the performance benefits
s = "fullname,dob"
s += '''
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012''' * 500

dataframe = pd.read_csv(StringIO(s))

# only create the data series once
full_names = dataframe['fullname']
_list = []

for index, row1 in full_names.items():
    # skip elements that were already compared
    for row2 in full_names.iloc[index + 1:]:
        # score_cutoff improves the runtime for bad matches:
        # fuzz.ratio returns 0 when the score falls below it
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])
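If you would rather not write the loops yourself, RapidFuzz also provides process.extract and process.extractOne helpers that accept the same score_cutoff argument and handle the iteration over the choices internally.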
I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users with two phone numbers each. I want to compare the two phone-number columns and generate an extra row when the values differ; otherwise keep the single row unchanged. Is there any way to achieve this in MySQL? I also don't mind doing the transformation in a dataframe and then loading the result into a table.
The source data:
id  username  primary_phone  landline
1   John      222            222
2   Michael   123            121
3   lucy      456            456
4   Anderson  900            901
The processed data would look like this:
id  username  phone
1   John      222
2   Michael   123
2   Michael   121
3   lucy      456
4   Anderson  900
4   Anderson  901
Thanks!!!
Use DataFrame.melt, drop the variable column, and apply DataFrame.drop_duplicates:
df = (df.melt(['id','username'], value_name='phone')
.drop('variable', axis=1)
.drop_duplicates()
.sort_values('id'))
print(df)
id username phone
0 1 John 222
1 2 Michael 123
5 2 Michael 121
2 3 lucy 456
3 4 Anderson 900
7 4 Anderson 901
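Since the question mentions not minding a dataframe round-trip, here is a minimal sketch of reading from MySQL, transforming, and loading the result back; the connection string and table names below are placeholders to adapt to your setup:

import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials and table names - adjust to your environment
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

df = pd.read_sql("SELECT id, username, primary_phone, landline FROM users", engine)

out = (df.melt(['id', 'username'], value_name='phone')
         .drop('variable', axis=1)
         .drop_duplicates()
         .sort_values('id'))

# write the normalized rows to a new table
out.to_sql("user_phones", engine, index=False, if_exists="replace")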