Maximum Limit of distinct fake data using Python Faker package - python

I have used Python Faker for generating fake data, but I need to know the maximum number of distinct fake data items (e.g. fake names) that can be generated with Faker (e.g. fake.name()).
I have generated 100,000 fake names and got fewer than 76,000 distinct names. I need to know the maximum limit so that I can tell how far this package can scale for generating data.
I need to generate a huge dataset. I also want to know whether the PHP Faker and Perl Faker libraries are the same thing ported to different environments.
Suggestions for other packages for generating huge datasets would be highly appreciated.

I had this same issue and looked into it more.
In the en_US provider there are about 1,000 last names and 750 first names, for about 750,000 unique combinations. If you randomly select a first and last name, there is a chance you'll get duplicates. But in reality, that's how the real world works; there are many John Smiths and Robert Doyles out there.
There are 7,203 first names and 473 last names in the en profile, which can help a bit. Faker chooses a first-name/last-name combination, so there are about 7,203 * 473 = 3,407,019 possibilities.
But still, there is a chance you'll get duplicates.
I solved this problem by appending numbers to names (see the sketch below).
"I need to generate a huge dataset."
Keep in mind that, in reality, any huge dataset of names will have duplicates. I work with large datasets (> 1 million names) and we see a ton of duplicate first and last names.
If you read the faker package code, you can probably figure out how to modify it so you get all ~3.4M distinct combinations.
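A minimal sketch of the numbering approach, assuming the en_US locale (the distinct_names helper is made up for illustration):

from faker import Faker

fake = Faker('en_US')

def distinct_names(n):
    # Yield n names, de-duplicating repeats by appending a counter.
    seen = {}
    for _ in range(n):
        name = fake.name()
        if name in seen:
            seen[name] += 1
            name = f"{name} {seen[name]}"   # e.g. "John Smith 2"
        else:
            seen[name] = 1
        yield name

names = list(distinct_names(100_000))
assert len(names) == len(set(names))   # all 100,000 are distinct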

Related

Python - Fastest Way to Search Hundreds of Thousands/Millions of Records Against Hundreds of Thousands/Millions of Records?

I have a Python program that takes any number of addresses from one database table (Dataset A) and queries each one against another table that contains promotional pricing for addresses across the country (Dataset B). The purpose of the program is to determine which addresses in Dataset A are in Dataset B and, if a match is found, record that fact along with the matching record ID. I run a fixed series of wildcard queries in a set order to pull results, which is part of the problem, since wildcard searches with LIKE '%CRITERIA%' slow things down tremendously despite everything being indexed in MySQL.
Here's an example of the type of search it does today. Say the input address in Dataset A is 123 Main Street, Brooklyn, NY 11201, but the address in Dataset B is 123 North Main Street, Brooklyn, NY 11201. A straight query of Dataset A against Dataset B would not find a match, since the two are not identical, but a wildcard search would yield a possible valid result with this style of query:
SELECT *
FROM Dataset_B
WHERE House_Number = 123 AND Street LIKE '%Main%' AND City = 'Brooklyn' AND State = 'NY'
I skip the zip code because addresses commonly have an incorrect zip code on them, which interferes with results; a zip-code-based search is a secondary pass I run later if the query above doesn't find a hit.
After my program receives the query results, it conducts an analysis to make sure that the result for each entry is not a false positive.
I've been using a Python multiprocessing approach that I developed for this purpose, and it has worked pretty well overall, but I'm not sure how resource-efficient it is compared to the alternatives. It commonly saturates the hard drive's throughput because of all the wildcard querying, with up to 8 concurrent queries running at once to take advantage of all of my cores. The problem comes when Dataset A has hundreds of thousands or millions of entries: a run can take 12 hours or more, since Dataset B can also contain hundreds of thousands or millions of records. I feel like SQL is the slowest way to do this, though, and I'm not sure whether there is a much more efficient solution using Python data structures that still supports my wildcard criteria. Pandas would probably be overwhelmed by this volume, and I'm not sure how much faster it would be than SQL. I was playing around with NumPy but wasn't sure that was the right direction. Can someone provide guidance on the fastest way to tackle this kind of problem?
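For reference, a minimal sketch of the kind of multiprocessing wildcard querying described above, assuming the pymysql driver; the connection details, the id column, and the input rows are placeholders:

import multiprocessing as mp
import pymysql  # assumed MySQL driver; any DB-API connector works similarly

def match_address(addr):
    # Each worker opens its own connection (kept simple here; a real run would
    # reuse one connection per worker rather than opening one per address).
    conn = pymysql.connect(host='localhost', user='user',
                           password='secret', database='pricing')
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id FROM Dataset_B "
                "WHERE House_Number = %s AND Street LIKE %s "
                "AND City = %s AND State = %s",
                (addr['house_number'], f"%{addr['street']}%",
                 addr['city'], addr['state']),
            )
            return addr['id'], [row[0] for row in cur.fetchall()]
    finally:
        conn.close()

if __name__ == '__main__':
    dataset_a = [  # placeholder rows; in practice these come from Dataset A
        {'id': 1, 'house_number': 123, 'street': 'Main',
         'city': 'Brooklyn', 'state': 'NY'},
    ]
    with mp.Pool(processes=8) as pool:  # 8 workers, as described in the question
        for a_id, b_ids in pool.map(match_address, dataset_a):
            print(a_id, b_ids or 'no match')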

smart way to structure my SQLite Database

I am new to database things and only have a very basic understanding of them.
I need to save historic data of a leaderboard and I am not sure how to do that in a good way.
I will get a list of accountName, characterName and xp.
Options I was thinking of so far:
An extra table for each account where I add their xp as another entry every 10 min (not sure where to put the character name in that option)
A table where I add another table into it every 10 min containing all the data I got for that interval
I am not very sure about the first option: since there will be about 2,000 players, I don't know if I want to have 2,000 tables (would that be a problem?). But I also don't feel like the second option is a good idea.
It feels like with some basic dimensional modeling techniques you will be able to solve this.
Specifically it sounds like you are in need of a Player Dimension and a Play Fact table...maybe a couple more supporting tables along the way.
It is my pleasure to introduce you to the Guru of Dimensional Modeling (IMHO): Kimball Group - Dimensional Modeling Techniques
My advice - invest a bit of time there, put a few basic dimensional modeling tools in your toolbox, and this build should be quite enjoyable.
In general you want to have a small number of tables, and the number of rows per table doesn't matter so much. That's the case databases are optimized for. Technically you'd want to strive for a structure that implements the Third normal form.
If you wanted to know which account had the most xp, how would you do it? If each account has a separate table, you'd have to query each table. If there's a single table with all the accounts, it's a trivial single query. Expanding that to say the top 15 is likewise a simple single query.
If you had a history table with a snapshot every 10 minutes, that would get pretty big over time but should still be reasonable by database standards. A snapshot every 10 minutes for 2000 characters over 10 years would result in 1,051,920,000 rows, which might be close to the maximum number of rows in a sqlite table. But if you got to that point I think you might be better off splitting the data into multiple databases rather than multiple tables. How far back do you want easily accessible history?
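A minimal sketch of the single-history-table layout described above, using Python's built-in sqlite3 module (the table and column names are made up for illustration):

import sqlite3

conn = sqlite3.connect('leaderboard.db')
conn.executescript("""
    CREATE TABLE IF NOT EXISTS player (
        account_name   TEXT,
        character_name TEXT,
        PRIMARY KEY (account_name, character_name)
    );
    CREATE TABLE IF NOT EXISTS xp_snapshot (
        account_name   TEXT NOT NULL,
        character_name TEXT NOT NULL,
        taken_at       TEXT NOT NULL,    -- timestamp of the 10-minute poll
        xp             INTEGER NOT NULL
    );
    CREATE INDEX IF NOT EXISTS idx_snapshot_time ON xp_snapshot (taken_at);
""")

# Top 15 accounts by xp in the most recent snapshot: one query over one table.
top15 = conn.execute("""
    SELECT account_name, MAX(xp) AS xp
    FROM xp_snapshot
    WHERE taken_at = (SELECT MAX(taken_at) FROM xp_snapshot)
    GROUP BY account_name
    ORDER BY xp DESC
    LIMIT 15
""").fetchall()
print(top15)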

Creating sample data (with faker) that has a % overlap in Python

I have 2 tables - customers and purchases
I've used faker to create 100 entries for the customer table with various details, one of which is customer_numbers, unique 5-digit numbers.
I now want to create a table of 100 purchases that reuses the list of customer_numbers, while ensuring that at least 25% of the records are duplicates.
I am not sure of the best way to ensure the 25% requirement.
I initially created a custom function that resamples my original list (using faker.random_elements()) and just takes the first 100 records of the new list, but that doesn't ensure a minimum of 25% overlap.
Is there a built-in function I can use? If not, what would be the math behind recreating a list with 25% overlap from an existing list?
Code seems less relevant here but let me know if you need samples.
Solution I went with (not the only possible one; a sketch is below):
Calculated the number of samples to drop: list_size - (list_size / (1 + repeat_rate))
Dropped that many samples from the customer list and resampled, from the reduced list, the same number of samples I had dropped (with replacement)
Merged the shortened and resampled lists
This ensures that len(original_list) = len(set(new_list)) * 1.25
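A minimal sketch of those steps with the standard random module (the 5-digit IDs are generated here only as a stand-in for the Faker-produced customer_numbers):

import random

repeat_rate = 0.25
customer_numbers = random.sample(range(10000, 100000), 100)  # stand-in for the 100 unique IDs
list_size = len(customer_numbers)

# Number of entries to drop so that len(new) == len(set(new)) * (1 + repeat_rate)
n_drop = round(list_size - list_size / (1 + repeat_rate))    # 20 when list_size is 100

kept = random.sample(customer_numbers, list_size - n_drop)   # 80 distinct IDs
resampled = random.choices(kept, k=n_drop)                   # 20 repeats, drawn with replacement
purchase_customers = kept + resampled
random.shuffle(purchase_customers)

assert len(purchase_customers) == len(set(purchase_customers)) * (1 + repeat_rate)
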
A solution I can think of is as follows:
Randomly select a number of 'guaranteed' overlapping records, somewhere between 25 and 100
Randomly pick that many records from the existing list
Randomly generate the remaining records (without reference to the existing list)
This works because your customer_numbers consist of 5 digits, so generating 5-digit numbers at random 100 times will generally not create duplicates of the existing list, which has only 100 items. So I think a way to 'guarantee' the overlap percentage would work reasonably well. It should also not be difficult to implement the above steps in Python; a sketch follows.
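A minimal sketch of that approach (again using generated stand-in 5-digit IDs in place of the real customer list):

import random

existing = random.sample(range(10000, 100000), 100)      # stand-in for existing customer_numbers

n_overlap = random.randint(25, 100)                      # step 1: guaranteed overlap count
overlapping = random.sample(existing, n_overlap)         # step 2: reuse existing IDs
fresh = [random.randint(10000, 99999)                    # step 3: fill the rest at random
         for _ in range(100 - n_overlap)]

purchases = overlapping + fresh
random.shuffle(purchases)
print(len(set(purchases) & set(existing)) / len(purchases))  # overlap fraction, at least 0.25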

Mapping/Zipping between two Pandas data frames with a partial string match

I have two dataframes of size roughly 1,000,000 rows each. Both share a common 'Address' column which I am using to join the dataframes. Using this join, I wish to move information, which I shall call 'details', from dataframe1 to dataframe2.
df2.details = df2.Address.map(dict(zip(df1.Address,df1.details)))
However, the address columns do not match exactly between the two dataframes. I tried cleaning them as best I could, but can still only move roughly 40% of the data across. Is there a way to modify the code above to allow for a partial match? I'm totally stumped on this one.
The data is quite simple, as described; fabricated sample rows from the two dataframes are below:
df1
Address                                 Details
Apt 15 A, Long Street, Fake town, US    A

df2
Address                                 Details
15A, Long Street, Fake town, U.S.
First, I would recommend performing the join operation and identifying the rows in each data frame that do not have a perfect match. Once you have identified these rows, exclude the others and proceed with the following suggestions:
One approach is to parse the addresses and attempt to standardize them. You might try using the usaddress module to standardize your addresses.
You could also try the approaches recommended in answers to this question, although they may take some tweaking for your case. It's hard to say without multiple examples of the partial string matches.
Another approach would be to use the Google Maps API (or Bing or MapQuest) for address standardization, though with over a million rows per data frame you will far outstrip the free API calls per day and would need to pay for the service.
A final suggestion is to use the fuzzywuzzy module for fuzzy (approximate) string matching; a sketch is below.
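A minimal sketch of the fuzzywuzzy route, run only over the rows left unmatched after the exact join (the 85-point score threshold is an arbitrary choice, and pairwise scoring is expensive, so keep the candidate lists as small as possible):

import pandas as pd
from fuzzywuzzy import fuzz, process

df1 = pd.DataFrame({'Address': ['Apt 15 A, Long Street, Fake town, US'], 'Details': ['A']})
df2 = pd.DataFrame({'Address': ['15A, Long Street, Fake town, U.S.']})

lookup = dict(zip(df1['Address'], df1['Details']))
choices = list(lookup)

def best_details(addr, min_score=85):
    # Find the closest df1 address; only accept it above the score threshold.
    match, score = process.extractOne(addr, choices, scorer=fuzz.token_sort_ratio)
    return lookup[match] if score >= min_score else None

df2['Details'] = df2['Address'].apply(best_details)
print(df2)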

Association rules with pandas dataframe

I have a dataframe like this
df = pd.DataFrame(data=[980,169,104,74], columns=['Count'], index=['X,Y,Z', 'X,Z','X','Y,Z'])
       Count
X,Y,Z    980
X,Z      169
X        104
Y,Z       74
I want to be able to extract association rules from this. I've seen that the Apriori algorithm is the standard reference, and I also found that the Orange library for data mining is well known in this field.
But the problem is that, in order to use the AssociationRulesInducer, I first need to create a file containing all the transactions. Since my dataset is really huge (20 columns and 5 million rows), it would be too expensive to write all this data to a file and read it back with Orange.
Do you have any idea how I can take advantage of my current dataframe structure to find association rules?
The new Orange3-Associate add-on for the Orange data mining suite seems to include widgets and code that mine frequent itemsets (and, from them, association rules) even from sparse arrays or lists of lists, which may work for you.
With 5M rows, it'd be quite awesome if it did. :)
I know this is an old question, but for anyone running into it when trying to use pandas dataframes for association rules and frequent itemsets (e.g. Apriori):
Have a look at this blog entry explaining how to do that with the mlxtend library.
My only caveat about this great blog entry is that, if you are dealing with large datasets, you may run into out-of-memory errors for the one-hot-encoded dataframes. In that case I recommend using sparse dtypes: df = df.astype(pd.SparseDtype(int, fill_value=0))
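A minimal sketch of the mlxtend route applied to the aggregated counts from the question (the support and confidence thresholds are arbitrary choices, and the one-hot expansion step is just one possible way to turn the counted itemsets into the boolean transaction table mlxtend expects):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Aggregated counts, as in the question
df = pd.DataFrame(data=[980, 169, 104, 74], columns=['Count'],
                  index=['X,Y,Z', 'X,Z', 'X', 'Y,Z'])

# Expand each counted itemset into one boolean row per transaction
items = sorted({i for idx in df.index for i in idx.split(',')})
rows = []
for idx, count in df['Count'].items():
    present = set(idx.split(','))
    rows.extend([[item in present for item in items]] * count)

# A sparse boolean dtype keeps the one-hot table small for large datasets
onehot = pd.DataFrame(rows, columns=items).astype(pd.SparseDtype(bool, fill_value=False))

freq = apriori(onehot, min_support=0.05, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])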
