Pandas Get Row with Smallest Distance Between Strings (Closest Match) - python

Suppose I have a dataframe with an index column filled with strings. Now, suppose I have very similar but somewhat different strings that I want to use to look up rows within the dataframe. How would I do this since they aren't identical? My guess would be to simply choose the row with the lowest distance between the two strings, but I'm not really sure how I could do that efficiently.
For example, if my dataframe is:
and I want to lookup "Lord of the rings", I should get the 2nd row. How would I do this in pandas?
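A minimal sketch of the closest-match lookup described above, using difflib.get_close_matches from the Python standard library. The example frame, the "year" column, and the closest_row helper are hypothetical (the original table is not shown), and the similarity used here is difflib's ratio rather than a true edit distance.

```python
import difflib

import pandas as pd

# Hypothetical data: the original example table is not available.
df = pd.DataFrame(
    {"year": [1997, 1954, 1965]},
    index=[
        "Harry Potter and the Philosopher's Stone",
        "The Lord of the Rings",
        "Dune",
    ],
)


def closest_row(frame, query, cutoff=0.0):
    """Return the row whose index label is most similar to `query`.

    Hypothetical helper; comparison is case-insensitive.
    """
    # Map lowercased labels back to the original index labels.
    lowered = {str(label).lower(): label for label in frame.index}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    if not matches:
        raise KeyError(query)
    return frame.loc[lowered[matches[0]]]


row = closest_row(df, "Lord of the rings")
print(row.name)
```

Every lookup compares the query against each index label, so for a very large index a dedicated fuzzy-matching library may be worth considering.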

Related

How to parse batches of flagged rows and keep the row satisfying some conditions in a Pandas dataframe?

I have a dataframe containing duplicates, that are flagged by a specific variable. The df looks like this:
[image of the dataframe, not reproduced here]
The idea is that the rows to keep and their duplicates are stacked in batches (a pair, or more if there are many duplicates) and identified by the "duplicate" column. For each batch, I would like to keep one row based on a single condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep").
The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" columns, because they are not always filled (the duplicates are computed on these 3 concatenated columns by an ML algo).
Unlike already posted questions I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition for selecting the row to keep is not fixed (like keeping the first or the last row within the batch of duplicates) but depends on how complete the rows are in each batch.
How can I parse the dataframe according to this "duplicate" column and, within each batch, extract the row I want?
I tried to assign a unique label to each batch in order to iterate over these labels, but it fails.
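One possible way to label the batches without iterating, sketched under strong assumptions: the "duplicate" column is assumed to hold the value "keep" on the first row of every batch and some other value on the rows that follow it. The sample data and the column values are hypothetical, since the original frame is only shown as an image.

```python
import pandas as pd

# Hypothetical data: each batch starts with a row flagged "keep".
df = pd.DataFrame(
    {
        "name": ["Alice", "Alice", "Bob", "Bob", "Bob"],
        "lastname": [None, "Smith", "Jones", None, None],
        "phone": [None, "123", "456", "456", None],
        "duplicate": ["keep", "duplicate", "keep", "duplicate", "duplicate"],
    }
)

# A new batch begins at every "keep" row, so a cumulative sum gives batch ids.
batch = (df["duplicate"] == "keep").cumsum()

# Count empty cells per row, then take the index of the fullest row per batch.
missing = df.isna().sum(axis=1)
best_idx = missing.groupby(batch).idxmin()

result = df.loc[best_idx]
print(result)
```

If the empty cells are empty strings rather than NaN/None, they would need to be converted first (for example with df.replace("", pd.NA)) before counting.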

Extract unique value with multiple columns from DataFrame

I have a dataframe where I want to extract values from two columns, but the criterion is unique values from one of the columns. In the image below, I want to extract the unique values of 'education' along with the corresponding values from 'education-num'. I can easily extract the unique values with df['education'].unique(), but I am stuck on extracting the matching 'education-num'.
image of the dataframe.
(Originally the task was to compute the population of people with an education of Bachelors, Masters or Doctorate, and I assumed this would be easier by comparing 'education-num' rather than using logical operators on strings. But if there's a way to do it directly from 'education', that would also be helpful.
Edit: It turns out DataFrame.isin helps to select rows by a list of strings, as given in the solution here.)
P.S. stack-overflow didn't allow me to post the image directly and posted a link to it instead...😒
Select columns by subset and call DataFrame.drop_duplicates:
df1 = df[['education', 'education-num']].drop_duplicates()
If you need to count the population, use:
df2 = df.groupby(['education', 'education-num']).size().reset_index(name='count')
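To illustrate both snippets together with the isin approach mentioned in the question's edit, here is a small sketch on made-up data. The rows and the specific 'education-num' values are assumptions, not taken from the original dataset.

```python
import pandas as pd

# Made-up sample; the real dataframe is only shown as an image.
df = pd.DataFrame(
    {
        "education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "Doctorate", "HS-grad"],
        "education-num": [13, 9, 14, 13, 16, 9],
    }
)

# Unique (education, education-num) pairs.
df1 = df[["education", "education-num"]].drop_duplicates()

# Number of rows per pair.
df2 = df.groupby(["education", "education-num"]).size().reset_index(name="count")

# Count people with one of the listed education levels, directly from strings.
higher_ed = df["education"].isin(["Bachelors", "Masters", "Doctorate"])
n_higher_ed = int(higher_ed.sum())

print(df1)
print(df2)
print(n_higher_ed)
```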

Pandas dataframe returns incorrect sort using two float columns

I am playing with some geo data. Given a point, I am trying to map to an object. So for each connection, I generate two distances, both floats. To find the closest, I want to sort my dataframe by both distances and pick the top row.
Unfortunately, when I run a sort (df.sort_values(by=['direct distance', 'pt_to_candidate'])), I get the following out-of-order result:
I would expect the top two rows, but flipped. If I run the sort on either column alone, I get the expected results. If I flip the order of the sort (['pt_to_candidate', 'direct distance']), I get a correct result, though not necessarily the one I want for my function.
Both columns are type float64.
Why is this sort returning oddly?
For completeness, I should state that I have more columns and rows. From the main dataframe, I filter first and then sort. Also, I cannot recreate by manually entering data into a new dataframe, so I suspect the float length is the issue.
Edit
Adding a value_counts on 'direct distance'
4.246947 7
3.147303 2
2.875081 1
2.875081 1
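The value_counts output above lists 2.875081 twice, which suggests the two values differ beyond the displayed precision, so the sort on the first column never falls through to the second. A minimal sketch of that effect and one possible workaround (rounding the first key before sorting); the sample numbers, the 6-decimal tolerance, and the helper column name dd_rounded are assumptions.

```python
import pandas as pd

# Made-up values: the first two distances print identically at 6 decimals
# but are not equal as floats.
df = pd.DataFrame(
    {
        "direct distance": [2.8750810001, 2.8750809999, 3.147303],
        "pt_to_candidate": [0.5, 1.5, 0.2],
    }
)

# Plain sort: the tiny difference in the first key decides the order.
plain = df.sort_values(by=["direct distance", "pt_to_candidate"])

# Sort on a rounded copy of the first key so near-equal values tie
# and the second column breaks the tie.
rounded = (
    df.assign(dd_rounded=df["direct distance"].round(6))
    .sort_values(by=["dd_rounded", "pt_to_candidate"])
    .drop(columns="dd_rounded")
)

print(plain)
print(rounded)
```

Whether rounding is appropriate depends on how much difference in distance is meaningful for the data.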

Merging points based on coordinates using python (pandas or geopandas)

I have a dataset like:
pointID lat lon otherinfo
I want to round up the coordinates and aggregate all the points whose coordinates become equal into one single item, and assign it a new name, which would probably be a new dataframe column. The "otherinfo" column must be preserved, meaning that by the end of the operation I will have the same number of rows I had before, but with new IDs based on the rounded coordinates.
How can I achieve this using pandas? Is it any easier if I use geoPandas?
If you already have columns for the coordinates (lat and lon), you can do, for example (rounding to 2 decimal places):
df['new_id'] = df.groupby([df.lat.round(2), df.lon.round(2)]).ngroup()
The ngroup method on the groupby returns, for each original row, the number of the group it belongs to, so in effect it gives you a new unique ID based on the rounded lat/lon.
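A short sketch of this on made-up points (the coordinate values and the "otherinfo" contents are hypothetical), showing that the row count is preserved and that points with equal rounded coordinates share an ID:

```python
import pandas as pd

# Made-up points; the first two round to the same coordinates.
df = pd.DataFrame(
    {
        "pointID": ["a", "b", "c"],
        "lat": [10.001, 10.004, 11.5],
        "lon": [20.002, 20.001, 20.0],
        "otherinfo": ["x", "y", "z"],
    }
)

# Group by rounded coordinates and number the groups.
df["new_id"] = df.groupby([df.lat.round(2), df.lon.round(2)]).ngroup()

print(df)
```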

Python - Identify Common Values

I am currently using Python 2.7. I have three columns in an Excel document, each containing different integer values. The number of values can vary from 10 through to thousands. Basically, what I am looking to do is scan through column one and compare each value to see if it appears in columns two and three. Similarly, I will then do the same with column two to see if any values appear in columns one and three, etc.
My thinking on this is to populate the content of each column into a respective list and then iterate over list 1 (column 1) and then run an if statement to compare each iteration value and see if it exists in list 2 (column 2).
My question is, is this the most efficient means of running this comparison? As said, within the three columns, the same number should appear in each of the three columns (it may appear on a number of occasions) and so I'm looking to identify those numbers which appear in each of the three columns.
Thanks
What about using set intersection?
set(column_1_vals) & set(column_2_vals) & set(column_3_vals)
That will give you those values which appear in all three columns.
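A small runnable sketch of the intersection, with made-up lists standing in for the three columns (the values are hypothetical; the names follow the snippet above):

```python
# Made-up column contents; duplicates within a column are fine.
column_1_vals = [1, 3, 3, 7, 9]
column_2_vals = [3, 4, 7, 7, 10]
column_3_vals = [0, 3, 7, 8]

# Sets drop repeated values, and & keeps only values present in all three.
common = set(column_1_vals) & set(column_2_vals) & set(column_3_vals)

print(sorted(common))
```

Membership tests on sets are constant time on average, so this avoids the repeated list scans of the nested-loop approach.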
