I have 2 tables - customers and purchases
I've used Faker to create 100 entries for the customers table with various details, one of which is customer_number, a set of unique 5-digit values.
I now want to create a table of 100 purchases that reuses the list of customer_numbers while ensuring that at least 25% of the records are duplicates.
I am not sure the best way to do this to ensure the 25% requirement.
I initially created a custom function that resamples my original list (using faker.random_elements()) and just takes the first 100 records from the new list, but that doesn't guarantee a minimum of 25% overlap.
Is there a built-in function I can use? If not, what is the math behind recreating a list with a 25% overlap from an existing list?
Code seems less relevant here but let me know if you need samples.
Solution I went with (not the only one to this problem):
Calculated the number of samples to drop: list_size - (list_size / (1 + repeat_rate))
Dropped that many samples from the customer list, then resampled the same number from the reduced list (with replacement)
Merged the shortened and resampled lists
This ensures that len(original_list) == len(set(new_list)) * 1.25
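As a minimal sketch of that approach (using the standard random module in place of Faker's helpers, and a hypothetical stand-in for the original customer list):

import random

repeat_rate = 0.25
# hypothetical stand-in for the 100 unique 5-digit customer_numbers from the customers table
customer_numbers = random.sample(range(10000, 100000), 100)

list_size = len(customer_numbers)
# number of samples to drop so that len(new_list) == len(set(new_list)) * (1 + repeat_rate)
n_drop = round(list_size - list_size / (1 + repeat_rate))

# drop n_drop customers, then resample that many from the reduced list with replacement
random.shuffle(customer_numbers)
reduced = customer_numbers[:list_size - n_drop]
resampled = random.choices(reduced, k=n_drop)

purchase_customers = reduced + resampled
random.shuffle(purchase_customers)
assert len(purchase_customers) == len(set(purchase_customers)) * (1 + repeat_rate)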
A solution I can think of is as follows:
Randomly selecting a 'guaranteed' overlap count between 25 and 100
Randomly picking that many records from the existing list
Randomly picking the rest of the required records (without considering the existence of the existing list)
This is because your customer_numbers consist of 5 digits, so picking 5-digit numbers at random 100 times will generally not collide with the existing list, which has only 100 items. So I think using a way to 'guarantee' the overlap percentage would work reasonably well, and it would not be difficult to implement the above steps in Python.
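A rough sketch of those three steps (again with the standard random module; the existing list here is a hypothetical stand-in):

import random

customer_numbers = random.sample(range(10000, 100000), 100)  # hypothetical existing list

n_total = 100
# step 1: choose how many records are guaranteed to come from the existing list
n_guaranteed = random.randint(25, n_total)
# step 2: draw that many records from the existing list (with replacement)
guaranteed = random.choices(customer_numbers, k=n_guaranteed)
# step 3: fill the remainder with fresh random 5-digit numbers, ignoring the existing list
fresh = [random.randint(10000, 99999) for _ in range(n_total - n_guaranteed)]

purchases = guaranteed + fresh
random.shuffle(purchases)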
I am using Redis with Python to store my per second ticker data (price and volume of an instrument). I am performing r.hget(instrument,key) and facing the following issue.
My key (a string) looks like 01/01/2020-09:32:01 and increments per second until the end of the user-specified interval.
For example: 01/01/2020-09:32:01, 01/01/2020-09:32:02, 01/01/2020-09:32:03, ...
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval (a year or two), e.g. the data from 01/01/2020 to 31/12/2020 (d/m/y format). To perform the get operation I first have to generate the timestamps for that period and then perform a get for each one to build a pandas DataFrame. Generating these timestamps to use as keys is slowing down my process terribly (although it does ensure strict ordering: for example, 01/01/2020-09:32:01 will definitely come before 01/01/2020-09:32:02). Is there another way to achieve the same result?
If I simply do r.hgetall(...) I won't be able to satisfy the user's time-interval condition.
Redis sorted sets are a good fit for such range queries. Sorted sets are made up of unique members, each with a score; in your case the timestamp (in epoch seconds) can be the score, and the price and volume can be the member. Since members in a sorted set must be unique, you may consider appending the timestamp to the member to make it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set you can do range queries using zrangebyscore, as below:
zrangebyscore instrument 1577883600 1577883610
If your instrument contains many values then consider sharding it into multiple sets, for example one per month, like instrument:202001, instrument:202002 and so on.
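The same flow from Python with redis-py looks roughly like this (a sketch; the key name and values mirror the CLI example above):

import redis

r = redis.Redis()

# score = timestamp in epoch seconds, member = "price,volume,timestamp"
# (appending the timestamp keeps each member unique)
r.zadd("instrument", {"672.2,432,1577883600": 1577883600,
                      "672.2,412,1577883610": 1577883610})

# range query between two epoch timestamps; results come back ordered by score
for raw in r.zrangebyscore("instrument", 1577883600, 1577883610):
    price, volume, ts = raw.decode().split(",")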
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields a smaller set of values covering a smaller time span (one week or one month).
So the new workflow will run in batches; see this loop:
generate a small set of timestamps
fetch items from redis
Pros:
minimize the memory usage
easy to change your current code to this new algo.
I don't know about Redis-specific functions, so other, more specific solutions may be better. My idea is a general approach that I have used with success for other problems.
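For illustration, the batched loop could look something like the sketch below. It assumes redis-py, per-second hash keys formatted as in the question, and a made-up helper name:

import pandas as pd
import redis

r = redis.Redis()

def fetch_in_batches(instrument, start, end, batch="7D"):
    """Fetch per-second values from the hash one week at a time instead of
    generating every key for the whole interval up front."""
    frames = []
    for batch_start in pd.date_range(start, end, freq=batch):
        batch_end = min(batch_start + pd.Timedelta(batch) - pd.Timedelta(seconds=1),
                        pd.Timestamp(end))
        # one key per second, formatted like 01/01/2020-09:32:01
        keys = [t.strftime("%d/%m/%Y-%H:%M:%S")
                for t in pd.date_range(batch_start, batch_end, freq="s")]
        values = r.hmget(instrument, keys)  # one round trip per batch
        frames.append(pd.DataFrame({"key": keys, "raw": values}))
    return pd.concat(frames, ignore_index=True)

# e.g. fetch_in_batches("RELIANCE", "2020-01-01 09:32:01", "2020-01-08 09:32:01")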
Have you considered using RedisTimeSeries for this task? It is a redis module that is tailored exactly for the sort of task you are describing.
You can keep two time series per instrument, one for price and one for volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and it uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.
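If you go this route from Python, the same commands are exposed through redis-py's ts() helper. This is only a sketch; it assumes redis-py 4.x or later and a Redis server with the RedisTimeSeries module loaded, and the labels are purely illustrative:

import redis

r = redis.Redis()
ts = r.ts()  # RedisTimeSeries command group (module must be loaded server-side)

# mirror of the CLI example above
ts.create("instrument:price", labels={"instrument": "violin", "type": "string"})
ts.create("instrument:volume", labels={"instrument": "violin", "type": "string"})

ts.add("instrument:price", 123456, 9.99)
ts.add("instrument:volume", 123456, 42)

print(ts.range("instrument:price", "-", "+"))
print(ts.mrange("-", "+", filters=["instrument=violin"]))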
I have a CSV that I want to read with Python. It has two columns, one for the customer's name and the other for their age, and I want to group the customers according to their age. However, the CSV has 10^9 rows, so I have to use an efficient algorithm and avoid reading all the rows at once. Is there a way to do this?
I presume you're asking how to cluster the data without reading all the rows into memory at once.
One idea is to use a two stage approach to clustering:
First, define your clusters with a sample (random subsets) of the data. For example, you can randomly select 1,000 records (or some other reasonable value) and see how many clusters you need along with the cluster centers. You can repeat this process several times until you are satisfied with your clusters.
Second, since now you have the cluster centers, you can "assign" each customer to their appropriate cluster (that is, using the nearest cluster center). You can do this one-by-one for each record, or in convenient batches, since there is no need to do them all at once. You can even do this assignment "lazily" (only when required) if you don't have to cluster all the records immediately.
This way you never have to load huge amounts of records into memory at once.
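A rough sketch of the two stages with pandas and scikit-learn (the file name, chunk sizes and number of clusters are all assumptions):

import pandas as pd
from sklearn.cluster import KMeans

CSV = "customers.csv"  # hypothetical: columns 'name' and 'age', ~10^9 rows

# stage 1: define the clusters from a manageable sample of the file
sample = pd.read_csv(CSV, nrows=100_000)  # ideally a random sample, not just the head
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample[["age"]])

# stage 2: stream the full file in chunks and assign each row to its nearest centre
for chunk in pd.read_csv(CSV, chunksize=1_000_000):
    chunk["cluster"] = kmeans.predict(chunk[["age"]])
    # ...write the labelled chunk out or update per-cluster aggregates here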
I have used Python Faker for generating fake data, but I need to know the maximum number of distinct fake values (e.g. fake names) that can be generated using Faker (e.g. fake.name()).
I have generated 100,000 fake names and got fewer than 76,000 distinct names. I need to know the maximum limit so that I know how far this package can scale for generating data.
I need to generate a huge dataset. I also want to know whether the PHP and Perl Faker packages behave the same way in their respective environments.
Suggestions for other packages for generating huge datasets would be highly appreciated.
I had this same issue and looked more into it.
In the en_US provider there are about 1,000 last names and 750 first names, for about 750,000 unique combos. If you randomly select a first and last name, there is a chance you'll get duplicates. But that's how the real world works: there are many John Smiths and Robert Doyles out there.
There are 7,203 first names and 473 last names in the en profile, which can help a bit. Faker chooses a combination of first name and last name, meaning there are about 7203 * 473 = 3,407,019 possible names.
But still, there is a chance you'll get duplicates.
I solved this problem by adding numbers to the names.
I need to generate huge dataset.
Keep in mind that in reality, any huge dataset of names will have duplicates. I work with large datasets (> 1 million names) and we see a ton of duplicate first and last names.
If you read the faker package code, you can probably figure out how to modify it so you get all 3M distinct names.
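For illustration, "adding numbers to names" can be as simple as the sketch below; Faker also ships a fake.unique proxy that enforces uniqueness until the underlying pool is exhausted:

from faker import Faker

fake = Faker("en_US")

# suffix a counter so every generated name is distinct even when Faker repeats itself
names = [f"{fake.name()} {i}" for i in range(100_000)]
assert len(names) == len(set(names))

# alternatively, let Faker enforce uniqueness (raises UniquenessException when exhausted)
unique_names = [fake.unique.name() for _ in range(10_000)]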
Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one of them to match the length of the time series seems wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby-type analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong is repeated as many times as needed to match the length of that trial's time series (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x- y- and z-locations of several motion-capture markers at time intervals of 10ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd

# concatenate one CSV per trial into a single frame, keyed by trial number
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=range(len(subject_trials)))
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
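To make that concrete, here is a small sketch of a stacked frame with the per-trial column repeated down each trial, and one way (using groupby rather than .unique()) to collapse it back to one row per trial for unbiased statistics; the column names are made up:

import pandas as pd

# per-sample data with the single-valued 'outcome' repeated down each trial
df = pd.concat(
    {0: pd.DataFrame({"x": [1.0, 1.2], "outcome": ["correct"] * 2}),
     1: pd.DataFrame({"x": [0.9, 0.7, 0.6], "outcome": ["wrong"] * 3})},
    names=["trial", "sample"])

# collapse back to one value per trial before computing trial-level statistics
per_trial = df.groupby(level="trial")["outcome"].first()
print((per_trial == "correct").mean())  # proportion correct, each trial counted once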
I am trying to investigate differences between runs/experiments in a continuously logged data set. I am taking a fixed subset of a few months of this data set and then analysing it to come up with an estimate of when each run was started. I have these start times sorted in a series.
With this I then chop the data up into 30-hour chunks (the approximate time between runs) and put them into a dictionary:
import numpy as np

data = {}
for time in times:
    # take a 30-hour slice of df starting at each estimated run start time
    timeNow = np.datetime64(time.to_datetime())
    time30hr = np.datetime64(time.to_datetime()) + np.timedelta64(30 * 60 * 60, 's')
    data[time] = df[timeNow:time30hr]
So now I have a dictionary of dataframes indexed by start time, and each one contains all of my data for a run, plus some extra to ensure I have everything for every run. But to compare two runs I need a common X value to stack them on top of each other. Every run is different, and the point I want to consider "the same" varies depending on what I'm looking at. For the example below I have used the largest value in the dataset to "pivot" on.
for time in data:
    A = data[time]
    # Find the max point for Value, and take the first if there is more than one
    maxTtime = A[A['Value'] == A['Value'].max()]['DateTime'][0]
    # Now take the window from 12 hours before to 12 hours after that point
    new = A[maxTtime - datetime.timedelta(0.5):maxTtime + datetime.timedelta(0.5)]
    # Stick on a new column with time relative to the zero point
    new['RTime'] = new['DateTime'] - maxTtime
    # Plot values against this new relative time
    plot(new['RTime'], new['Value'])
This yields a graph like:
Which is great, except I can't get a decent legend in order to tell which run was which and work out how much variation there is. I believe half my problem is that I'm iterating over a dictionary of dataframes, which is causing issues.
Could someone recommend how to better organise this (a dictionary of dataframes is all I could get to work)? I've thought of using a hierarchical dataframe and, instead of indexing it by run time, assigning a set of identifiers to the runs (the actual time is contained within the dataframes themselves, so I have no problem losing the assumed start time) and then plotting it with a legend.
My final aim is a dataset and methodology that let me investigate the similarities and differences between runs using different "pivot points" and produce a graph of each one that I can then interrogate (or at least tell which data set is which so I can interrogate the data directly), but I couldn't get past various errors when creating it.
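For reference, the kind of hierarchical layout I have in mind would be something like the sketch below (the run identifiers are arbitrary, and it assumes each frame in data already carries the RTime and Value columns from the loop above):

import pandas as pd
import matplotlib.pyplot as plt

# concatenate the per-run frames, labelling each with a simple run identifier
runs = pd.concat(
    {f"run{i}": frame for i, (start, frame) in enumerate(sorted(data.items()))},
    names=["Run", "Row"])

# plotting per run then gives one legend entry per identifier
for run_id, grp in runs.groupby(level="Run"):
    plt.plot(grp["RTime"], grp["Value"], label=run_id)
plt.legend()
plt.show()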
I can upload a set of the data as a CSV if required but am not sure of the best place to upload it. Thanks.