Finding the minimum distance for each row using pandas - python

I am trying to match the DNA of different bacteria with their ancestors, and I have around 1 million observations. I want to identify the closest ancestor for each bacterium, i.e. I want to compare it only against bacteria of the same or an older generation (equal or smaller generation numbers). My data frame looks like this (for simplicity, let's assume the DNA vector consists of one number):
bacteria_id generation DNA_vector
213 230 23
254 230 18
256 229 39
289 229 16
310 228 24
324 228 45
I tried to create a distance matrix and choose the smallest value in each row for each bacterium, but since the matrix would have a huge number of rows and columns, I get a memory error before it is even created.
As an analogy, assume these are not bacteria but cars, and that I compare each car with its own generation (e.g. cars launched in 2010) and with older ones. Also replace DNA_vector with the number of features, and assume one car is more similar to another if the difference between their numbers of features is smaller.
So I want to create two additional columns: one with the ID of the most similar row and one with the minimum difference (e.g. for the first row the distance is 1, and the most similar one is 310).
Expected output is:
bacteria_id generation DNA_vector most_similar_bacteria distance
213 230 23 310 1 (i.e. 24 -23)
254 230 18 289 2
256 229 39 324 6
289 229 16 310 8
310 228 24 324 19
324 228 45 NA NA
Do you have any recommendations?

If you're running into memory errors because of a large dataset, you could try using dask. It is a parallel computing library with an API very similar to pandas that lets you process larger-than-memory datasets by spilling to your hard drive instead of holding everything in RAM.
https://dask.pydata.org/en/latest/
It may not be exactly what you're looking for, but I have had good luck using it with large datasets like the one you describe.
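As for the pairwise comparison itself, here is a minimal pandas sketch of the per-row logic, under the assumption that the candidates for a row are all bacteria of the same or an older generation other than the row itself (one reading of the expected output above). Note that this per-row scan is quadratic, so for a million rows it would still need chunking or dask:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bacteria_id": [213, 254, 256, 289, 310, 324],
    "generation":  [230, 230, 229, 229, 228, 228],
    "DNA_vector":  [23, 18, 39, 16, 24, 45],
})

def closest_ancestor(row, pool):
    # candidates: same or older generation, excluding the row itself
    cand = pool[(pool["generation"] <= row["generation"]) &
                (pool["bacteria_id"] != row["bacteria_id"])]
    if cand.empty:
        return pd.Series({"most_similar_bacteria": np.nan, "distance": np.nan})
    diffs = (cand["DNA_vector"] - row["DNA_vector"]).abs()
    best = diffs.idxmin()
    return pd.Series({"most_similar_bacteria": cand.loc[best, "bacteria_id"],
                      "distance": diffs.loc[best]})

df[["most_similar_bacteria", "distance"]] = df.apply(closest_ancestor, axis=1, pool=df)
print(df)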

Related

How to perform TTest on multiple columns

My dataframe is below
patid age gender tg0 tg1 tg2 tg3 tg4 wgt0 wgt1 wgt2 wgt3 wgt4
0 1 45 Male 180 148 106 113 100 198 196 193 188 192
1 2 56 Male 139 94 119 75 92 237 233 232 228 225
2 3 50 Male 152 185 86 149 118 233 231 229 228 226
3 4 46 Female 112 145 136 149 82 179 181 177 174 172
4 5 64 Male 156 104 157 79 97 219 217 215 213 214
Is it the right approach to average tg0, tg1, tg2, tg3, tg4 and wgt0, wgt1, wgt2, wgt3, wgt4, so that I get two columns a and b, and then do the t-test on those?
I am also copying the case study:
A physician is evaluating a new diet for her patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights and triglyceride levels are measured before and after the study, and the physician wants to know if either set of measurements has changed
Null hypothesis: There is no difference in the triglyceride levels and weight of individuals after using the new diet for 6 months.
Alternative hypothesis: There is a significant difference in the triglyceride levels and weight of individuals after using the new diet for 6 months.
For two variables we can do it with the code below:
import numpy as np
from scipy import stats
#Data of group 1
a = np.array([42.1, 80.0, 30.0, 45.8, 57.7, 80.0, 82.4, 66.2, 66.9, 79.0])
#Data of group 2
b = np.array([80.7, 85.1, 88.6, 81.7, 69.8, 79.5, 107.2, 69.3, 80.9, 63.0])
t2, p2 = stats.ttest_ind(a,b)
It looks like you want to find the difference before and after the 6 month period for each measurement type. Based on this, it seems that you would want to do two separate tests:
Whether the final triglyceride measurement value significantly differs from the initial triglyceride measurement.
Whether the final weight measurement value significantly differs from the initial weight measurement.
Note: I'm assuming that each column represents a measurement over time, starting with 0 and ending with 4. This would mean that tg0 and wgt0 are the initial triglyceride and weight measurements respectively, and that tg4 and wgt4 are the final measurements
For each test, you are comparing the final measurement with the initial measurement, so you would want to structure the tests like this:
# assuming the question's dataframe is loaded as df
t_tg, p_tg = stats.ttest_ind(df['tg4'], df['tg0'])
t_wgt, p_wgt = stats.ttest_ind(df['wgt4'], df['wgt0'])
Then use p_tg and p_wgt to make a unique determination for the triglycerides and the weight.
I am not sure why there are four measurements of Triglycerides and weights for each patient.
Assuming that the measurements were taken, let's say, a month apart while on the diet (with tg0 and wgt0 taken when starting the diet), then you could do one of two things:
Take the first and last values (tg0, tg4) and use those as the two groups (a, b). Do the same for wgt0 and wgt4.
To get better accuracy, we can incorporate the other values by fitting a line of best fit through each patient's triglyceride levels, and then using the first value of that fitted line as a and the last as b for each patient. Do the same for weights (a sketch follows below).
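A minimal sketch of that second option, assuming the question's dataframe is loaded as df with the column names shown above:
import numpy as np
import pandas as pd
from scipy import stats

tg_cols = ['tg0', 'tg1', 'tg2', 'tg3', 'tg4']
x = np.arange(len(tg_cols))  # time points 0..4

def fitted_endpoints(row):
    # fit a straight line through one patient's five triglyceride readings
    slope, intercept = np.polyfit(x, row[tg_cols].astype(float), 1)
    return pd.Series({'tg_fit_start': intercept,
                      'tg_fit_end': intercept + slope * x[-1]})

fits = df.apply(fitted_endpoints, axis=1)
t_tg, p_tg = stats.ttest_ind(fits['tg_fit_end'], fits['tg_fit_start'])
The same pattern applies to the wgt columns.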
Is it the right approach to average tg0, tg1, tg2, tg3, tg4 and wgt0, wgt1, wgt2, wgt3, wgt4, so that I get two columns a and b, and then do the t-test on those?
If (tg0, tg1, tg2, tg3, tg4) are measurements before the diet and (wgt0, wgt1, wgt2, wgt3, wgt4) are measurements after, and they measure the same thing (for example weight), then you could do what you propose.

Pandas: how to apply weight column to create a new dataframe with weighted data

I have this dataset from US Census Bureau with weighted data:
Weight Income ......
2 136 72000
5 18 18000
10 21 65000
11 12 57000
23 43 25700
The first person represents 136 people, the second 18, and so on. There are a lot of other columns, and I need to do several charts and calculations. It will be too much work to apply the weight every time I need to do a chart, pivot table, etc.
Ideally, I would like to use this:
df2 = df.iloc[np.repeat(df.index.values, df.PERWT)]
To create an unweighted or flat dataframe.
This produces a new large (1.4GB) dataframe:
Weight Wage
0 136 72000
0 136 72000
0 136 72000
0 136 72000
0 136 72000
.....
The thing is that when I use all the columns of the dataset, my computer runs out of memory.
Any idea on how to use the weights to create a new weighted dataframe?
I've tried this:
df2 = df.sample(frac=1, weights=df['Weight'])
But it seems to produce the same data. Changing frac to 0.5 could be a solution, but I'll lose 50% of the information.
Thanks!
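A minimal sketch, using the Weight and Income columns from the example above, of computing a weighted statistic directly with np.average and of expanding only the columns a particular chart needs, which keeps the repeated frame much smaller:
import numpy as np
import pandas as pd

# small example with the question's columns
df = pd.DataFrame({'Weight': [136, 18, 21, 12, 43],
                   'Income': [72000, 18000, 65000, 57000, 25700]})

# weighted mean income without materializing the expanded frame
weighted_mean = np.average(df['Income'], weights=df['Weight'])

# expand only the columns needed for one chart, not the whole dataset
expanded = df.loc[np.repeat(df.index.values, df['Weight']), ['Income']]
assert np.isclose(expanded['Income'].mean(), weighted_mean)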

Is there an efficient way to filter and apply a function to this dataset?

I have a dataset with columns origin, destination, and cost. There are x origins and y destinations. Each origin is mapped to the y destinations with corresponding cost.
My goal is to create a new column that shows the number of destinations that can be reached from each origin, given the amount of budget spent. I can easily do this for each origin alone but that takes forever to go through x different origins.
Is there a way to filter this huge dataset and define a function to arrive at the correct count of destinations for each origin?
My understanding of the question is that you want the number of destinations reachable from each origin within the given budget - so, kind of like the number of other destinations you could reach for the same price or less.
We can do this by grouping the data by origin, then ranking the budgets - using method='max' we take the maximum rank in case of ties:
x.groupby('OriginID').apply(lambda g: g.Budget.rank(method='max'))
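To assign the result back as a column (a sketch, keeping the dataframe name x from the line above), the rank can be taken directly on the grouped Budget column so the index stays aligned:
# max rank within each origin = number of that origin's rows with Budget <= this row's Budget
x['Cumulative destination'] = x.groupby('OriginID')['Budget'].rank(method='max')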
Alright, I read the question carefully and this should give you what you need.
import pandas as pd

df = pd.read_csv('data.csv')

def get_cumulative_destinations(row):
    # count destinations for the same origin whose budget does not exceed this row's budget
    return len(df.loc[(df['OriginID'] == row['OriginID']) &
                      (df['Budget'] <= row['Budget'])].Destination)

df['Cumulative destination'] = df.apply(get_cumulative_destinations, axis=1)
Answer:
OriginID Destination Label Budget Cumulative destination
2507 661 Hos 9.78 30
2507 502 CC 9.98 31
2507 566 Rec 14.76 55
2507 483 CC 20.54 90
2507 461 CC 8.58 20
2507 452 CC 12.22 38
2507 440 CC 14.82 56
2507 516 Rec 14.27 52
2507 580 Rec 15.27 62
...

Finding subset of dataframe rows that maximize one column sum while limiting sum of another

A beginner to pandas and Python, I'm trying to select the 10 rows in a dataframe such that the following requirements are fulfilled:
Only 1 of each category in a categorical column
Maximize sum of a column
While keeping sum of another column below a specified threshold
The concept I struggle with is how to do all of this at the same time. In this case, the goal is to select 10 rows resulting in a subset where sum of OPW is maximized, while the sum of salary remains below an integer threshold, and all strings in POS are unique. If it helps understanding the problem, I'm basically trying to come up with the baseball dream team on a budget, with OPW being the metric for how well the player performs and POS being the position I would assign them to. The current dataframe looks like this:
playerID OPW POS salary
87 bondsba01 62.061290 OF 8541667
439 heltoto01 41.002660 1B 10600000
918 thomafr04 38.107000 1B 7000000
920 thomeji01 37.385272 1B 6337500
68 berkmla01 36.210367 1B 10250000
785 ramirma02 35.785630 OF 13050000
616 martied01 32.906884 3B 3500000
775 pujolal01 32.727629 1B 13870949
966 walkela01 30.644305 OF 6050000
354 giambja01 30.440007 1B 3103333
859 sheffga01 29.090699 OF 9916667
511 jonesch06 28.383418 3B 10833333
357 gilesbr02 28.160054 OF 7666666
31 bagweje01 27.133545 1B 6875000
282 edmonji01 23.486406 CF 4500000
0 abreubo01 23.056375 RF 9000000
392 griffke02 22.965706 OF 8019599
... ... ... ...
If my team were only 3 people, with an OF, 1B, and 3B, and I had a salary-sum threshold of $19,100,000, I would get the following team:
playerID OPW POS salary
87 bondsba01 62.061290 OF 8541667
918 thomafr04 38.107000 1B 7000000
616 martied01 32.906884 3B 3500000
The output would ideally be another dataframe with just the 10 rows that fulfill the requirements. The only solution I can think of is to bootstrap a bunch of teams (10 rows each) with each row having a unique POS, remove teams above the salary sum threshold, and then sort the teams by df.OPW.sum() with sort_values(). Not sure how to implement that, though. Perhaps there is a more elegant way to do this?
Edit: Changed dataframe to provide more information, added more context.
This is a linear programming problem. For each POS, you're trying to maximize individual OPW while total salary across the entire team is subject to a constraint. You can't solve this with simple pandas operations, but PuLP could be used to formulate and solve it (see the Case Studies there for some examples).
However, you could get closer to a manual solution by using pandas to group by (or sort by) POS and then either (1) sort by OPW descending and salary ascending, or (2) add some kind of "return on investment" column (OPW divided by salary, perhaps) and sort on that descending to find the players that give you the biggest bang for the buck in each position.
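A minimal sketch of how such a PuLP formulation might look, assuming the question's dataframe is loaded as df with the columns shown above; SALARY_CAP and TEAM_SIZE are illustrative values and the names are my own:
import pulp
import pandas as pd

SALARY_CAP = 19100000   # example threshold from the question
TEAM_SIZE = 3           # use 10 for the full team

prob = pulp.LpProblem('dream_team', pulp.LpMaximize)

# one binary decision variable per row: 1 if that player is picked
pick = pulp.LpVariable.dicts('pick', df.index, cat='Binary')

# objective: maximize total OPW of the picked players
prob += pulp.lpSum(df.loc[i, 'OPW'] * pick[i] for i in df.index)

# total salary must stay under the cap
prob += pulp.lpSum(df.loc[i, 'salary'] * pick[i] for i in df.index) <= SALARY_CAP

# exactly TEAM_SIZE players, at most one per position
prob += pulp.lpSum(pick[i] for i in df.index) == TEAM_SIZE
for pos in df['POS'].unique():
    prob += pulp.lpSum(pick[i] for i in df[df['POS'] == pos].index) <= 1

prob.solve()
team = df[[pick[i].value() == 1 for i in df.index]]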
IIUC you can use groupby with aggregating sum:
df1 = df.groupby('category', as_index=False).sum()
print (df1)
category value cost
0 A 70 2450
1 B 67 1200
2 C 82 1300
3 D 37 4500
Then filter by boolean indexing with a threshold:
tresh = 3000
df1 = df1[df1.cost < tresh]
And last get top 10 values by nlargest:
# the sample uses the top 3; with the real data, set this to 10
print (df1.nlargest(3,columns=['value']))
category value cost
2 C 82 1300
0 A 70 2450
1 B 67 1200

Scipy Stats ttest_1samp Hypothesis Testing For Comparing Previous Performance To Sample

My Problem I'm Trying To Solve
I have 11 months' worth of performance data:
Month Branded Non-Branded Shopping Grand Total
0 2/1/2015 1330 334 161 1825
1 3/1/2015 1344 293 197 1834
2 4/1/2015 899 181 190 1270
3 5/1/2015 939 208 154 1301
4 6/1/2015 1119 238 179 1536
5 7/1/2015 859 238 170 1267
6 8/1/2015 996 340 183 1519
7 9/1/2015 1138 381 172 1691
8 10/1/2015 1093 395 176 1664
9 11/1/2015 1491 426 199 2116
10 12/1/2015 1539 530 156 2225
Let's say it's February 1, 2016, and I'm asking: "Are the results in January statistically different from the past 11 months?"
Month Branded Non-Branded Shopping Grand Total
11 1/1/2016 1064 408 106 1578
I came across a blog...
I came across iaingallagher's blog. I will reproduce it here (in case the blog goes down).
1-sample t-test
The 1-sample t-test is used when we want to compare a sample mean to a
population mean (which we already know). The average British man is
175.3 cm tall. A survey recorded the heights of 10 UK men and we want to know whether the mean of the sample is different from the
population mean.
# 1-sample t-test
from scipy import stats
one_sample_data = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
one_sample = stats.ttest_1samp(one_sample_data, 175.3)
print "The t-statistic is %.3f and the p-value is %.3f." % one_sample
Result:
The t-statistic is 2.296 and the p-value is 0.047.
Finally, to my question...
In iaingallagher's example, he knows the population mean and is comparing a sample (one_sample_data). In MY example, I want to see if 1/1/2016 is statistically different from the previous 11 months. So in my case, the previous 11 months is an array (instead of a single population mean value) and my sample is one data point (instead of an array)... so it's kind of backwards.
QUESTION
If I was focused on the Shopping column data:
Will scipy.stats.ttest_1samp([161,197,190,154,179,170,183,172,176,199,156], 106) produce a valid result even though my sample (the first parameter) is a list of previous results, and I'm comparing it to a popmean that's not the population mean but instead a single observation?
If this is not the correct stats function, any recommendation on what to use for this hypothesis test situation?
If you are only interested in the "Shopping" column, try to create a .xlsx or .csv file containing the data from only the "Shopping" column.
This way you could import this data and make use of pandas to perform the same T-test for each column individually.
import pandas as pd
from scipy import stats
data = pd.read_excel("datafile.xlsx")
one_sample_data = data["Shopping"]
one_sample = stats.ttest_1samp(one_sample_data, 106)  # popmean: the single January 2016 Shopping value from the question
