Mapping/Zipping between two Pandas data frames with a partial string match - python

I have two dataframes of size roughly 1,000,000 rows each. Both share a common 'Address' column which I am using to join the dataframes. Using this join, I wish to move information, which I shall call 'details', from dataframe1 to dataframe2.
df2.details = df2.Address.map(dict(zip(df1.Address,df1.details)))
However, the address columns do not match exactly. I tried cleaning them as best I could, but I can still only move roughly 40% of the data across. Is there a way to modify my code above to allow for a partial match? I'm totally stumped on this one.
The data is exactly as described; fabricated sample data from the two dataframes is below:
df1
Address                               Details
Apt 15 A, Long Street, Fake town, US  A

df2
Address                               Details
15A, Long Street, Fake town, U.S.

First, I would recommend performing the join operation and identifying the rows in each data frame that do not have a perfect match. Once you have identified these rows, exclude the others and proceed with the following suggestions:
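For instance, one quick way to split df2 into exact matches and leftovers (a sketch using isin; the exact matches can be filled with the map/zip line from the question, and the leftovers are what need partial matching):

exact = df2[df2.Address.isin(df1.Address)]
leftovers = df2[~df2.Address.isin(df1.Address)]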
One approach is to parse the addresses and attempt to standardize them. You might try using the usaddress module to standardize your addresses.
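A minimal sketch of that idea, assuming usaddress is installed: tag each address, build a comparable key from a few tagged components, and join on the key. Which components belong in the key is an assumption you would tune against your data:

import usaddress

def address_key(addr):
    # Tag the free-text address; give up (None) if usaddress is unsure.
    try:
        tagged, _ = usaddress.tag(addr)
    except usaddress.RepeatedLabelError:
        return None
    # Crude comparable key; adjust which components you include.
    parts = (tagged.get('AddressNumber', ''),
             tagged.get('StreetName', ''),
             tagged.get('PlaceName', ''))
    return ' '.join(p for p in parts if p).lower()

df1['key'] = df1.Address.map(address_key)
df2['key'] = df2.Address.map(address_key)
df2['Details'] = df2.key.map(dict(zip(df1.key, df1.Details)))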
You could also try the approaches recommended in the answers to this question, although they may take some tweaking for your case. It's hard to say without seeing more examples of the partial string matches.
Another approach would be to use the Google Maps API (or Bing or MapQuest) for address standardization, though with over a million rows per data frame you will far outstrip the free API calls per day and would need to pay for the service.
A final suggestion is to use the fuzzywuzzy module for fuzzy (approximate) string matching.
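As a rough sketch of the fuzzy-matching route (note that brute-force fuzzy matching between two million-row frames is quadratic and will be slow, so you would normally block on something cheap such as town or zip code first; the cutoff of 90 is arbitrary):

from fuzzywuzzy import fuzz, process

choices = df1.Address.tolist()
details_by_address = dict(zip(df1.Address, df1.Details))

def best_details(addr, score_cutoff=90):
    # extractOne returns (best_match, score), or None if nothing clears the cutoff.
    match = process.extractOne(addr, choices,
                               scorer=fuzz.token_sort_ratio,
                               score_cutoff=score_cutoff)
    return details_by_address[match[0]] if match else None

df2['Details'] = df2.Address.map(best_details)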

Related

Python - Fastest Way to Search Hundreds of Thousands/Millions of Records Against Hundreds of Thousands/Millions of Records?

I have a Python program that takes any number of addresses in one database table (Dataset A) and queries each one against another database table that contains promotional pricing for addresses across the country (Dataset B). The purpose of the program is to see which addresses in Dataset A are in Dataset B and, if they are, to record that a match was found along with the record ID in the table. I have a set series of wildcard queries that run in a certain order to pull results, which is part of the problem: the wildcard searches use LIKE '%CRITERIA%', which slows things down tremendously despite everything being indexed in MySQL.
Here's an example type of search that it does today…let’s say the input address in Dataset A is 123 Main Street, Brooklyn, NY 11201 but the address in Dataset B is 123 North Main Street, Brooklyn, NY 11201. Doing a straight query of Dataset A against Dataset B would not find a match since it’s not an exact match, but doing a wildcard search would yield a possible valid result with this style of query:
SELECT *
FROM Dataset_B
WHERE House_Number = 123 AND Street LIKE '%Main%' AND City = 'Brooklyn' AND State = 'NY'
I skipped the zip code because the records commonly have an incorrect zip code on them, which interferes with results; a zip code search would be a secondary pass if I didn't find a hit using the above.
After my program receives the query results, it conducts an analysis to make sure that the result for each entry is not a false positive.
I've been using a Python multiprocessing approach that I developed to run these SQL queries, and it has worked pretty well overall, but I'm not sure how resource-efficient it is compared to alternatives. It commonly saturates the hard drive's throughput because of all the wildcard querying, with up to 8 concurrent queries running at once to take advantage of all of my cores.
The problem occurs when Dataset A has hundreds of thousands or millions of entries: the run can take 12 hours or more, since Dataset B can also have hundreds of thousands or millions of records. I feel like SQL is the slowest way to do this, though, and I'm not sure whether there is a much more efficient solution using Python data structures that still supports my wildcard criteria. Pandas would probably be overwhelmed by this volume, and I'm not sure how much faster it would be than SQL. I was playing around with NumPy but wasn't sure if that was the right direction. Can someone provide some guidance on the fastest way to tackle this kind of problem?
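For reference, the logic of the query above can be reproduced in memory by indexing Dataset B on the fields that must match exactly and only substring-matching the street inside each small bucket. A minimal sketch, assuming both datasets fit in RAM as lists of dicts; the field names house_number, street, city, state and record_id are illustrative, not the real column names:

from collections import defaultdict

# Index Dataset B on the exact-match fields so the expensive
# substring check only runs inside a small bucket.
index_b = defaultdict(list)
for rec in dataset_b:
    key = (rec['house_number'], rec['city'].upper(), rec['state'].upper())
    index_b[key].append(rec)

def find_match(rec_a):
    key = (rec_a['house_number'], rec_a['city'].upper(), rec_a['state'].upper())
    street_a = rec_a['street'].upper()
    for cand in index_b.get(key, []):
        street_b = cand['street'].upper()
        # rough equivalent of Street LIKE '%Main%': containment either way
        if street_a in street_b or street_b in street_a:
            return cand['record_id']
    return None

matches = {rec['record_id']: find_match(rec) for rec in dataset_a}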

What is the best way to integrate different Excel files with different sheets with different formats in Python?

I have multiple Excel files with different sheets in each file. These files were made by people, so each one has a different format, a different number of columns, and a different structure for representing the data.
For example, in one sheet the dataframe/table starts at the 8th row, second column. In another it starts at row 122, etc.
What I want to retrieve from these Excel files is common to all of them: variable names and their information.
However, I don't know how I could retrieve all this information without writing custom parsing for each individual file. That is not an option because there are a lot of these files, with lots of sheets in each one.
I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.
Any help is appreciated.
I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).
An attempt to "solve" the problem you have with existing files.
Without regularity in your input files (such as at least a common column name), I think what you're describing is among the best solutions. Having said that, perhaps a "fancier" similarity metric between column names would be more useful than regular expressions.
If you believe there will be some regularity in the column names, you could look at string distances such as the Hamming distance or the Levenshtein distance, and use a threshold on the distance that works for you. As an example, say you have a function d(a: str, b: str) -> float that calculates a distance between column names; you could do something like this:
# this variable is a small sample of "expected" column names
plausible_columns = [
    'interesting column',
    'interesting',
    'interesting-column',
    'interesting_column',
]

for f in excel_files:
    # process the file until you find columns;
    # I'm assuming you can put the column names into
    # a variable `columns` here.
    for c in columns:
        for p in plausible_columns:
            if d(c, p) < threshold:
                # do something to process the column:
                # add to a pandas DataFrame, calculate the mean, etc.
                ...
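For a concrete d, here is a minimal sketch using only the standard library, with similarity turned into a distance so that smaller means more alike (the 0.25 threshold is just an illustrative starting point to tune):

import difflib

def d(a: str, b: str) -> float:
    # 0.0 means identical strings, 1.0 means nothing in common.
    return 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

threshold = 0.25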
If the data itself can tell you something about whether you should process it (such as having a particular distribution, or being in a particular range), you can use such features to decide whether to use that column or not. Even better, you can combine several of these characteristics to make a finer decision.
Having said this, I don't think a fully automated solution exists without inspecting some of the data manually and studying the distribution of the data, the variability in the column names, etc.
For the future
Even with fancy methods to calculate features and some data analysis on what you have right now, I think it would be impossible to guarantee that you will always get the data you need (by the very nature of the problem). A reasonable way to solve this, in my opinion (and if it is feasible in whatever context you're working in), is to impose a stricter format at the data-generation end; I suppose the data is entered manually by people typing directly into Excel. I would argue that the best solution is to get rid of the problem at the root: create a unified form or Excel sheet template and distribute it to the people who will fill in the data, so that the files can be ingested automatically with minimal risk of errors.

Python: (findall, wide to long, pandas): Data manipulation

So, I am trying to pivot this data (link here) so that all the metrics/numbers sit in a column, with another column serving as the ID column. Obviously, having a ton of data/metrics spread across a bunch of columns makes it much harder to compare and do calculations than having it all in one column.
So, I know what tools I need for this: Pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having a bit of difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data I get pretty confused. I am basing my idea on this example (about halfway down, where they use findall/regex to define new column names). I am looking to create a new column for each category of metric (i.e. population estimate is one column and % change is another; they should not all be one column).
Can someone help me set up the syntax correctly for this part? I am not good at the expressions dealing with pattern recognition.
Thank you.
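For reference, the wide_to_long pattern being asked about looks roughly like the sketch below; the column names (popestimate2020 and so on) are invented stand-ins for whatever your findall/regex pulls out of df.columns:

import re
import pandas as pd

# Invented wide-format data: one row per ID, one column per metric-and-year.
wide = pd.DataFrame({
    'id': [1, 2],
    'popestimate2020': [100, 200],
    'popestimate2021': [110, 190],
    'pctchange2020': [1.0, -2.0],
    'pctchange2021': [10.0, -5.0],
})

# Derive the stubnames by stripping the trailing year from each column name.
stubnames = sorted({re.sub(r'\d+$', '', c) for c in wide.columns if c != 'id'})

# Each metric keeps its own column; the year becomes a new 'year' column.
long_df = pd.wide_to_long(wide, stubnames=stubnames, i='id', j='year').reset_index()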

General Approach to Working with Data in DataFrames

Question for experienced Pandas users on approach to working with Dataframe data.
Invariably we want to use Pandas to explore relationships among data elements. Sometimes we use groupby-type functions to get summary-level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other applications I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly to run applications or plot. This approach certainly has disadvantages in that if you strip out a subset of data into a smaller df and then want to run an analysis against a column of data you left in the bigger df, you have to go back and recut stuff.
My question is: is it best practice for more experienced users to leave the large dataframe intact and pull out the data syntactically, so that the effect is the same or similar to cutting out a smaller df? Or is it best to actually cut out smaller dfs to work with?
Thanks in advance.
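Not a definitive rule, but as an illustration of the keep-one-big-dataframe style: boolean masks, .loc, and groupby let you analyze subsets without cutting out separate dfs (the column names below are made up):

import pandas as pd

df = pd.DataFrame({
    'region': ['east', 'west', 'east', 'west'],
    'sales': [10, 20, 30, 40],
    'returns': [1, 2, 3, 4],
})

# Analyze a subset in place instead of copying it into a new dataframe.
east_mean = df.loc[df['region'] == 'east', 'sales'].mean()

# Summaries by group, still computed against the full frame.
by_region = df.groupby('region')[['sales', 'returns']].sum()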

Association rules with pandas dataframe

I have a dataframe like this
df = pd.DataFrame(data=[980,169,104,74], columns=['Count'], index=['X,Y,Z', 'X,Z','X','Y,Z'])
         Count
X,Y,Z      980
X,Z        169
X          104
Y,Z         74
I want to be able to extract association rules from this. I've seen that the Apriori algorithm is the standard reference, and I also found that the Orange library for data mining is well known in this field.
But the problem is that, in order to use the AssociationRulesInducer, I first need to create a file containing all the transactions. Since my dataset is really huge (20 columns and 5 million rows), it would be too expensive to write all this data to a file and read it back with Orange.
Do you have any idea how I can take advantage of my current dataframe structure in order to find association rules?
The new Orange3-Associate add-on for the Orange data mining suite seems to include widgets and code that mine frequent itemsets (and, from them, association rules) even from sparse arrays or lists of lists, which may work for you.
With 5M rows, it'd be quite awesome if it did. :)
I know it is an old question, but for anyone running into it when trying to use pandas dataframes for association rules and frequent itemsets (e.g. Apriori):
Have a look at this blog entry explaining how to do that using the mlxtend library.
My only caveat about this great blog entry is that if you are dealing with large datasets, you may run into OOM errors with one-hot encoded dataframes. In that case I recommend using pandas' sparse dtypes: df = df.astype(pd.SparseDtype(int, fill_value=0))
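For the dataframe in this question specifically, one possible sketch (assuming mlxtend is installed; the min_support and min_threshold values are arbitrary) is to expand the itemset counts into one-hot "transactions" first:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.DataFrame(data=[980, 169, 104, 74], columns=['Count'],
                  index=['X,Y,Z', 'X,Z', 'X', 'Y,Z'])

# Expand the itemset counts into one-hot encoded transactions.
items = sorted({i for idx in df.index for i in idx.split(',')})
rows = []
for itemset, count in df['Count'].items():
    present = set(itemset.split(','))
    rows.extend([{item: item in present for item in items}] * count)
onehot = pd.DataFrame(rows)

frequent = apriori(onehot, min_support=0.05, use_colnames=True)
# Argument names can vary a bit between mlxtend versions; this is the common form.
rules = association_rules(frequent, metric='confidence', min_threshold=0.6)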
