I have a Python list of size 67 with three unique values, with the following distribution:
A - 20 occurrences randomly spread in the list
B - 36 occurrences randomly spread in the list
C - 11 occurrences randomly spread in the list
I want to randomly select 10% of the items within each group and apply a special treatment to the selected values.
Based on the occurrences shown above, that means roughly 2 treatments for group A, 3 for group B and 1 for group C.
The selection need not fall exactly on every 10th occurrence of a value, but the ratio of treated items to total items in each group should stay at approximately 10%.
Right now, I have this code:
import random

if random.random() <= 0.1:
    do_something()  # placeholder for the special treatment
This code doesn't give me the treatment counts above at a group level; instead it randomly picks items across all groups. I want to constrain the random selection to a group level. How do I do that?
Also, if this list were dynamic and kept growing at run time, populated with more values of A, B and C (albeit with different distributions), how could I still maintain the randomisation at a group (unique value in the list) level?
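For reference, here is a minimal sketch of per-group sampling; the list name, the apply_treatment function and the 10% rate are placeholders:

import random
from collections import defaultdict

def treat_per_group(values, apply_treatment, rate=0.1):
    # collect the positions of each unique value
    groups = defaultdict(list)
    for i, v in enumerate(values):
        groups[v].append(i)
    # sample roughly `rate` of the positions within each group (at least one per non-empty group)
    for v, positions in groups.items():
        k = max(1, int(len(positions) * rate))
        for i in random.sample(positions, k):
            apply_treatment(i, values[i])

For the 67-element example this treats 2 A's, 3 B's and 1 C. For the dynamic case, one option is to keep a per-group count of items seen and items treated, and treat a new item whenever treated/seen falls below the target rate; that keeps each group at roughly 10% as it grows.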
Suppose I have a dataset that records camera sightings of some object over time, and I groupby date so that each group represents sightings within the same day. I'd then like to break one group into 'subgroups' based on the time between sightings -- if the gap is too large, then I want them to be in different groups.
Consider the following as one group.
Camera   Time
A        6
B        12
C        17
D        21
E        47
F        50
Suppose I had a cutoff matrix that told me how close the next sighting had to be for two adjacent cameras to stay in the same group. For example, we might have cutoff_mat[d, e] = 10, which means that since cameras D and E are more than 10 time units apart, I should break the group into two after D and before E. I would like to do this in a way that allows efficient iteration over each of the resulting groups, since my real goal is to compute some other matrix using values within each sub-group, and I may need to break one group into many, not just two. How do I do this? The dataset is large (>100M points), so something fast would be appreciated.
I am thinking I could do this by creating another column in the original dataset that represents the time between consecutive sightings on the same day, and somehow grouping by both date AND this new column, but I'm not quite sure how that would work. I also don't think pd.cut() works here since I don't have pre-determined bins.
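The usual trick is to flag rows whose gap from the previous sighting exceeds the cutoff and take a cumulative sum of those flags; the running total labels the sub-groups and can be fed straight into groupby. A sketch, assuming the frame is called df with columns 'date', 'camera' and 'time', sorted by time within each date, and using a scalar cutoff for simplicity (a per-pair cutoff_mat lookup on the previous/current camera could replace it):

import pandas as pd

cutoff = 10  # scalar stand-in for cutoff_mat[previous_camera, current_camera]

gap = df.groupby('date')['time'].diff()                                    # time since previous sighting that day
df['subgroup'] = (gap > cutoff).astype(int).groupby(df['date']).cumsum()   # new label whenever the gap is exceeded

for (date, sub), block in df.groupby(['date', 'subgroup']):
    pass  # block holds one sub-group; compute the per-sub-group matrix here

Since the labelling is vectorised, it should scale to >100M rows far better than an explicit Python loop over sightings.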
I have a matrix shown below. The next step in the project is to identify spreads. These are defined as a series of trades composed of at least two different contracts, all of the same product type. The trades making up the spread must happen within 10 minutes, and the total buy volume must equal the total sell volume. After identifying which rows belong to a spread, they should be output or tagged for later use.
Spreads are highlighted in blue for the demo matrix.
I assume you know how to properly slice the time frames.
Then you can create a list containing all buy/sell volumes, counting the sell volumes as negative.
At this point you are only missing the list that contains all combinations of rows in that time window. This list can be created with the help of the itertools module, e.g. with:
import itertools

time_window = [1, 2, 3]
for L in range(0, len(time_window) + 1):
    for subset in itertools.combinations(time_window, L):
        print(subset)
At this point you only have to test each combination against the required condition, i.e. that its signed volumes sum to zero.
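To round the idea off, here is a small sketch of that last check; the row indices and volumes are made-up values, with buys positive and sells negative:

import itertools

# hypothetical rows in one 10-minute window: row index -> signed volume
window = {0: 50, 1: -30, 2: -20, 3: 10}

for r in range(2, len(window) + 1):                  # a spread needs at least two trades
    for subset in itertools.combinations(window, r):
        if sum(window[i] for i in subset) == 0:      # buy volume equals sell volume
            print("possible spread:", subset)

The "at least two different contracts, same product type" rule would need an extra check on the contract and product columns of the selected rows.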
I have brain anatomy measurements from 2 different groups of individuals. One group has more individuals than the other (say n and m individuals each). I have to run the KS test on this data. I am a little unclear about the arguments to pass to the SciPy two-sample KS test. Should the arguments be every individual from group 1 against every individual in group 2 in a for loop? Or is it every feature in group 1 against every other feature in group 2?
I wrote this code, but it's obviously wrong as I am using iteritems() to loop over the columns, when perhaps it should be n*m comparisons?
import numpy as np
from scipy.stats import ks_2samp

for (name1, col1), (name2, col2) in zip(group1.transpose().items(),
                                        group2.transpose().items()):
    value, pvalue = ks_2samp(np.array(col1), np.array(col2))
    print(value, pvalue)
    if pvalue > 0.05:
        print('Samples are likely drawn from the same distributions (fail to reject H0)')
    else:
        print('Samples are likely drawn from different distributions (reject H0)')
Let's say one of the measurements is brain mass. Gather all the brain mass measurements for group 1 into a sequence (or 1-d array), and do the same for group 2. Pass these two sequences to ks_2samp. That will test whether the brain masses of the two groups come from the same distribution.
For example, if group1 and group2 are Pandas DataFrames with a row for each individual and with columns for the different measurements associated with each individual, including one called "mass" for brain mass, you would do:
value, pvalue = ks_2samp(group1['mass'].to_numpy(), group2['mass'].to_numpy())
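If you want to run the same test for every measurement rather than just mass, looping over the shared columns is enough; a short sketch (the column layout of the two DataFrames is an assumption):

from scipy.stats import ks_2samp

# one two-sample KS test per measurement column present in both groups
for col in group1.columns.intersection(group2.columns):
    stat, pvalue = ks_2samp(group1[col].to_numpy(), group2[col].to_numpy())
    print(col, stat, pvalue)

The loop is over features (one test per measurement), not over the n*m pairs of individuals.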
I have a pandas dataframe column as shown in the figure below. Only two values, Increase and Decrease, occur randomly in the column. Is there a way to process that data?
For this particular problem, I want to get the first occurrence of two CONSECUTIVE Increase values AFTER at least one run of two CONSECUTIVE Decrease values (maybe more; two is the minimum).
As an example, if the series is (I for "Increase", D for "Decrease"): "I,I,I,I,D,I,I,D,I,D,I,D,D,D,D,I,D,I,D,D,I,I,I,I", it should return the index of row 21 (the third-last I in the series). Assume that this example series is in a pandas column, i.e. the series is vertical, not horizontal, and the indexing starts at 0, so the first I is row 0.
For the actual data in the figure, it should return 2009q4, which is the index label of that particular row.
If somebody can show me a way to do common tasks like counting the number of consecutive occurrences of a given value, detecting a value change, or getting the value at a particular position after a value change, etc. for this type of data (which may not be required for this problem but could be useful for future problems), I would be really grateful.
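The usual building blocks here are shift() and cumsum(): comparing a column with its shifted self detects value changes, and a cumulative sum of those changes labels consecutive runs. A sketch on the example series (the I/D encoding is taken from the example above; on real data s would be the dataframe column):

import pandas as pd

s = pd.Series(list("IIIIDIIDIDIDDDDIDIDDIIII"))   # the example column, I = Increase, D = Decrease

# generic building blocks
run_id = (s != s.shift()).cumsum()                # increments at every value change
run_len = s.groupby(run_id).transform('size')     # number of consecutive occurrences in each row's run

# this particular problem: first pair of consecutive I after a pair of consecutive D
seen_dd = ((s == 'D') & (s.shift() == 'D')).cumsum().gt(0)   # True once two consecutive D have appeared
is_ii = (s == 'I') & (s.shift() == 'I')                      # second element of two consecutive I
print(s[is_ii & seen_dd].index[0])                           # -> 21 for this example

On the real dataframe the same expression would return the row's index label (e.g. 2009q4) instead of 21.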
I have 140 CSV files. Each file has 3 variables and is about 750 GB. The number of observations varies from 60 to 90 million.
I also have another, small file, treatment_data, with 138,000 rows (one for each unique ID) and 21 columns (1 column for ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not).
The variables are:
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
(So in total I have 138,000 × 138,000 = 19+ billion rows, one for every possible bilateral combination of IDs, divided across these 140 files.)
Research question: given a distance, how many neighbors (of each treatment type) does an ID have?
So I need help with a system (preferably in pandas) where:
1. the researcher inputs a distance;
2. the program looks over all the files and selects the rows where the DISTANCE between ID_FROM and ID_TO is less than the given distance, outputting a single dataframe (DISTANCE can be dropped at this point);
3. it merges that dataframe with treatment_data by matching ID_TO with ID (ID_TO can be dropped at this point);
4. it collapses the data by ID_FROM (group by and sum the 1s across the 20 treatment variables).
In the final output dataset, I will have 138,000 rows and 21 columns: 1 column for ID and 20 columns, one for each treatment type. So, for example, I will be able to answer the question: "Within 2,000 meters, how many neighbors of ID 500 are in the 'treatment_media' category?" (See the rough pandas sketch below.)
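Here is how I picture those steps, reading each file in chunks so memory stays bounded; this is only a sketch, and the file pattern, treatment file name and chunk size are placeholders:

import glob
import pandas as pd

def neighbors_by_treatment(max_distance, file_pattern="distance_*.csv",
                           treatment_path="treatment_data.csv", chunksize=5_000_000):
    treatment = pd.read_csv(treatment_path)   # columns: ID + 20 treatment indicator columns
    pieces = []
    for path in glob.glob(file_pattern):
        for chunk in pd.read_csv(path, usecols=["ID_FROM", "ID_TO", "DISTANCE"],
                                 chunksize=chunksize):
            near = chunk.loc[chunk["DISTANCE"] < max_distance, ["ID_FROM", "ID_TO"]]
            if near.empty:
                continue
            merged = near.merge(treatment, left_on="ID_TO", right_on="ID", how="inner")
            merged = merged.drop(columns=["ID_TO", "ID"])
            pieces.append(merged.groupby("ID_FROM").sum())   # partial sums per chunk
    # partial results overlap on ID_FROM, so aggregate once more at the end
    return pd.concat(pieces).groupby(level=0).sum().reset_index()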
IMPORTANT SIDE NOTE:
The DISTANCE variable ranges from 0 to roughly the radius of an average-sized US state (in meters). The researcher is mostly interested in what happens within 5,000 meters, which usually drops 98% of the observations, but sometimes he/she will check longer distances too, so I have to keep all the observations available. Otherwise, I could simply have filtered out rows with DISTANCE greater than 5,000 from the raw input files and made my life easier. The reason I think this is important is that the data are sorted on ID_FROM across the 140 files. If I could somehow rearrange these 19+ billion rows by DISTANCE and associate them with some kind of dictionary or index system, the program would not need to go over all 140 files, since most of the time the researcher will only be looking at the bottom 2 percent of the DISTANCE range. It seems like a colossal waste of time to loop over 140 files, but this is a secondary thought. Please do provide an answer even if you can't use this additional side note.
I tried looping over the 140 files for a particular distance in Stata; it takes 11+ hours to complete, which is not acceptable since the researcher will want to vary the distance within the 0 to 5,000 range. Most of the computation time is wasted on reading each dataset into memory (that is how Stata does it). That is why I am seeking help in Python.
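One idea I am considering for the side note, sketched below under the assumption that pyarrow (or fastparquet) is available and a one-off preprocessing pass is acceptable: re-partition the rows once into distance buckets stored as Parquet, so a later query only reads the buckets at or below the requested distance instead of all 140 files. The bucket size and file layout are placeholders:

import glob
import os
import pandas as pd

os.makedirs("by_distance", exist_ok=True)
bucket_size = 1000   # meters per bucket; tune to the typical query range
part_no = 0

for path in glob.glob("distance_*.csv"):
    for chunk in pd.read_csv(path, chunksize=5_000_000):
        chunk["bucket"] = (chunk["DISTANCE"] // bucket_size).astype(int)
        for b, part in chunk.groupby("bucket"):
            # one small Parquet file per (bucket, chunk); requires pyarrow or fastparquet
            part.to_parquet(f"by_distance/bucket_{b:04d}_part_{part_no:06d}.parquet", index=False)
            part_no += 1

A 5,000-meter query would then only read buckets 0 to 4 rather than the full 19+ billion rows.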
Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
TABLE Distances {
Integer PrimaryKey,
String IdFrom,
String IdTo,
Integer Distance
}
INDEX ON Distances(IdFrom, Distance);
TABLE TreatmentData {
Integer PrimaryKey,
String Id,
String TreatmentType
}
INDEX ON TreatmentData(Id, TreatmentType);
-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the "treatment_media" treatment?
SELECT
  d.IdFrom AS Id,
  td.TreatmentType,
  COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = '500'
  AND d.Distance <= 2000
  AND td.TreatmentType = 'treatment_media'
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.
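If you'd rather stay close to Python, a file-based engine such as SQLite can be loaded and queried through pandas. A rough sketch following the outline above (file names, chunk size and the indicator-column layout of treatment_data are assumptions; at 19+ billion rows a server-grade database would likely be more appropriate, but the pattern is the same):

import glob
import sqlite3
import pandas as pd

con = sqlite3.connect("neighbors.db")

# one-off load: stream the big CSVs into an indexed table
for path in glob.glob("distance_*.csv"):
    for chunk in pd.read_csv(path, chunksize=5_000_000):
        chunk.to_sql("distances", con, if_exists="append", index=False)
pd.read_csv("treatment_data.csv").to_sql("treatment_data", con, if_exists="replace", index=False)
con.execute("CREATE INDEX IF NOT EXISTS idx_dist ON distances (ID_FROM, DISTANCE)")

# per-query step: the researcher supplies the distance as a parameter
query = """
    SELECT d.ID_FROM AS ID, COUNT(*) AS total
    FROM distances d
    JOIN treatment_data t ON t.ID = d.ID_TO
    WHERE d.DISTANCE <= ?
      AND t.treatment_media = 1
    GROUP BY d.ID_FROM
"""
result = pd.read_sql_query(query, con, params=(2000,))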