Python - converting list of lists results from a function?

Edit: fixed a misunderstanding on my part - I am getting a nested list, not an array.
I'm working with a function in a for loop - bootstrapping some model predictions.
The code looks like this:
def revenue(product):
    revenue = product * 4500
    profit = revenue - 500000
    return profit
And the loop I am feeding it into looks like this:
# set up a loop to select 500 random samples and train our region 2 data set
model = LinearRegression(fit_intercept=True, normalize=False)
features = r2_test.drop(['product'], axis=1)
values = []
for i in range(1000):
    subsample = r2_test.sample(500, replace=False)
    features = subsample.drop(['product'], axis=1)
    predict = model.predict(features)
    result = revenue(predict)
    values.append(result)
So I'm doing 1000 loops of predictions on 500 samples from this dataframe:
id f0 f1 f2 product
0 74613 -15.001348 -8.276000 -0.005876 3.179103
1 9753 14.272088 -3.475083 0.999183 26.953261
2 93502 6.263187 -5.948386 5.001160 134.766305
3 33405 -13.081196 -11.506057 4.999415 137.945408
4 16486 12.702195 -8.147433 5.004363 134.766305
5 27901 -3.327590 -2.205276 3.003647 84.038886
6 69620 -11.142655 -10.133399 4.002382 110.992147
7 78940 4.234715 -0.001354 2.004588 53.906522
8 56159 13.355129 -0.332068 4.998647 134.766305
9 73142 1.069227 -11.025667 4.997844 137.945408
10 12663 11.777049 -5.334084 2.003033 53.906522
11 39849 16.320755 -0.562946 -0.001783 0.000000
12 61800 7.736313 -6.093374 3.982531 107.813044
13 72213 6.695604 -0.749449 -0.007630 0.000000
14 5479 -10.985487 -5.605994 2.991130 84.038886
15 6297 -0.347599 -6.275884 -0.003448 3.179103
16 88123 12.300570 2.944454 2.005541 53.906522
17 68352 8.900460 -5.632857 4.994324 134.766305
18 99029 -13.412826 -4.729495 2.998590 84.038886
19 64238 -4.373526 -8.590017 2.995379 84.038886
Now, once I have my output, I want to select the top 200 predictions from each iteration. I'm using this loop:
# calculate the max value of each of the 500 iterations, then total them for the total profit
top_200 = []
for i in range(0,500):
    profits = values.nlargest(200, [i], keep='all')
    top_200.append(profits)
The problem I am running into is that when I feed values into the top_200 loop, I end up with an array of the selected 200 by column:
[ 0 1 2 3 \
628 125790.297387 -10140.964686 -361625.210913 -243132.040492
32 125429.134599 -368765.455544 -249361.525792 -497190.522207
815 124522.095794 -1793.660411 -11410.126264 114928.508488
645 123891.732231 115946.193531 104048.117460 -246350.752024
119 123063.545808 -124032.987348 -367200.191889 -131237.863430
.. ... ... ... ...
But I'd like to turn it into a dataframe. However, I haven't figured out how to do that while preserving the structure where column 0 has its 200 values, column 1 has its 200 values, etc.
I thought I could do something like:
top_200 = pd.DataFrame(top_200, columns=range(0,500))
and it gives me 500 columns, but only column 0 has anything in it, and I end up with a [500, 500] dataframe instead of the anticipated 200 rows by 500 columns.
I'm fairly sure there is a good way to do this, but my searching thus far has not turned anything up. I'm also not sure what this operation is called, so I'm not sure exactly what to search for.
Any input would be appreciated! Thanks in advance.
Further edit:
So now that I know I'm getting a list of lists, not an array, I thought I'd try to write to a dataframe instead:
# calculate the top 200 values of each of the 500 iterations
top_200 = pd.DataFrame(columns=['profits'])
for i in range(0,500):
    top_200.loc[i] = i
    profits = values.nlargest(200, [i], keep='all')
    top_200.append(profits)
top_200.head()
But I've futzed something up here, as my results are:
profits
0 0
1 1
2 2
3 3
4 4
Where my expected results would be something like:
   col1              col2              col3
0  first n_largest   first n_largest   first n_largest
1  second n_largest  second n_largest  second n_largest
2  third n_largest   third n_largest   third n_largest

So, after doing some research based on @CygnusX's recommended question, I figured out that I was laboring under the impression that I had an array as the output, but of course top_200 = [] is a list, which, when combined with nlargest, gives me a list of lists.
Now that I understood the problem better, I converted the list of lists into a dataframe and then transposed the data, which gave me the results I was looking for.
# calculate the max value of each of the 500 iterations, then total them for the total profit
top_200 = []
for i in range(0,500):
    profits = (values.nlargest(200, [i], keep='all')).mean()
    top_200.append(profits)
test = pd.DataFrame(top_200)
test = test.transpose()
Output: a screenshot is omitted here, since the result has 500 columns.
There is probably a more elegant way to accomplish this, like not using a list but a dataframe, but I couldn't get .append to work the way I wanted on a dataframe, since I wanted to preserve the 200 nlargest values per column, not just a sum or a mean (which append worked great for!).
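For what it's worth, a more compact option - just a sketch, assuming values is a dataframe with one column per iteration, as the calls to nlargest above require - is to build the 200-row by 500-column frame in one pass with pd.concat, so no transpose is needed:
import pandas as pd
# take the top 200 of each column, reset each pick onto a fresh 0..199 index,
# then stack the 500 columns side by side
top_200 = pd.concat(
    [values.nlargest(200, [i], keep='all')[i].reset_index(drop=True)
     for i in range(500)],
    axis=1,
)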

Related

pandas dataframe and external list interaction

I have a pandas dataframe df which looks like this
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.225660 0.083903
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.029690 0.188627 0.200235 0.224703 0.081434
3 0.009938 0.059595 0.109310 0.069609 0.009970 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009
Then I have a vector dk that looks like this:
[0.18,0.35,0.71,1.41,2.83,5.66,11.31,22.63,45.25,90.51,181.02]
What I need to do is:
calculate a new vector which is
psik = [np.log2(dki/1e3) for dki in dk]
calculate the sum of each row multiplied with the psik vector (just like the SUMPRODUCT function in Excel)
calculate the log2 of each psig value
expected output should be:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10 psig dg
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083 -5.848002631 0.017361042
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.22566 0.083903 -5.903532822 0.016705502
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.02969 0.188627 0.200235 0.224703 0.081434 -5.908820802 0.016644383
3 0.009938 0.059595 0.10931 0.069609 0.00997 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249 -5.930608559 0.016394906
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009 -5.924408689 0.016465513
I would do that with a for loop cycling over the rows, like this:
psig, dg = [], []
for _, r in df.iterrows():
    psig_i = sum(psik[i] * ri for i, ri in enumerate(r))
    psig.append(psig_i)
    dg.append(np.log2(psig_i))
df['psig'] = psig
df['dg'] = dg
Is there any other way to update the df without iterating through its rows?
EDIT: I found the solution, and I am ashamed of how simple it is:
df['psig'] = df.mul(psik).sum(axis=1)
df['dg'] = df['psig'].apply(lambda x: np.log2(x))
EDIT2: Now my df has more entries, so I have to filter it with a regex to find only the columns with a name starting with "betasub".
I have my array psik and a new column psig in the df. I would like to calculate, for each row (i.e. each value of psig):
sum(((psik - psig)**2) * betasub[0...n])
I did it like this, but maybe there's a better way?
PsimPsig2 = [[(psik_i-psig_i)**2 for psik_i in psik] for psig_i in list(df['psig'])]
psikmpsigname = ['psikmpsig'+str(i) for i in range(len(psik))]
dfPsimPsig2 = pd.DataFrame(data=PsimPsig2,columns=psikmpsigname)
siggAL = np.power(2,(np.power(pd.DataFrame(df.filter(regex=r'^betasub[0-9]',axis=1).values*dfPsimPsig2.values).sum(axis=1),0.5)))
df['siggAL'] = siggAL
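For reference, a broadcasted variant is possible without building the intermediate dataframe - a sketch only, assuming psik is a plain 1-D sequence whose entries line up with the betasub columns:
import numpy as np
betas = df.filter(regex=r'^betasub[0-9]', axis=1).values               # shape (n_rows, n_k)
diff2 = (np.asarray(psik)[None, :] - df['psig'].values[:, None]) ** 2  # shape (n_rows, n_k)
df['siggAL'] = np.power(2, np.sqrt((betas * diff2).sum(axis=1)))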

Pandas, get pct change period mean

I have a Data Frame which contains a column like this:
pct_change
0 NaN
1 -0.029767
2 0.039884 # period of one
3 -0.026398
4 0.044498 # period of two
5 0.061383 # period of two
6 -0.006618
7 0.028240 # period of one
8 -0.009859
9 -0.012233
10 0.035714 # period of three
11 0.042547 # period of three
12 0.027874 # period of three
13 -0.008823
14 -0.000131
15 0.044907 # period of one
I want to get all the periods where the pct change was positive into a list, so with the example column it will be:
raise_periods = [1,2,1,3,1]
Assuming that the column of your dataframe is a series called y which contains the pct_changes, the following code provides a vectorized solution without loops.
y = df['pct_change']
raise_periods = (y < 0).cumsum()[y > 0]
raise_periods.groupby(raise_periods).count()
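To turn that grouped count into the list the question asks for, just materialize it as a Python list; for the example column above this gives the expected result:
counts = raise_periods.groupby(raise_periods).count()
raise_periods_list = counts.tolist()   # [1, 2, 1, 3, 1]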
Eventually, the answer provided by @gioxc88 didn't get me where I wanted, but it did put me in the right direction.
What I ended up doing is this:
def get_rise_avg_period(cls, df):
    df[COMPOUND_DIFF] = df[NEWS_COMPOUND].diff()
    df[CONSECUTIVE_COMPOUND] = df[COMPOUND_DIFF].apply(lambda x: 1 if x > 0 else 0)
    # group together the periods of rise and down changes
    unfiltered_periods = [list(group) for key, group in itertools.groupby(df.consecutive_high.values.tolist())]
    # filter out only the rise periods
    positive_periods = [li for li in unfiltered_periods if 0 not in li]
I wanted to get the average length of these positive periods, so I added this at the end:
    positive_periods_lens = [len(li) for li in positive_periods]  # length of each rise run
    period = round(np.mean(positive_periods_lens))
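For completeness, a purely pandas version of the same run-length idea is sketched below, assuming the 0/1 indicator column is the consecutive_high column used above:
runs = (df['consecutive_high'] != df['consecutive_high'].shift()).cumsum()  # label each run of equal values
pos_lens = df[df['consecutive_high'] == 1].groupby(runs).size()             # lengths of the positive runs
period = round(pos_lens.mean())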

DataFrame.sort_values only looks at first digit rather than entire number

I have a DataFrame that looks like this,
del Ticker Open Interest
0 1 SPY 20,996,893
1 3 IWM 7,391,074
2 5 EEM 6,545,445
...
47 46 MU 1,268,256
48 48 NOK 1,222,759
49 50 ET 1,141,467
I want it to go in order from the lowest number to greatest with df['del'], but when I write df.sort_values('del') I get
del Ticker
0 1 SPY
29 10 BAC
5 11 GE
It appears to do it based on the first digit rather than going in numeric order. Am I using the correct code, or do I need to completely change it?
Assuming you have numbers as type string you can do:
add leading zeros to the string numbers which will allow for ordering of the string
df["del"] = df["del"].map(lambda x: x.zfill(10))
df = df.sort_values('del')
or convert the type to integer
df["del"] = df["del"].astype('int') # as recommended by Alex.Kh in comment
#df["del"] = df["del"].map(int) # my initial answer
df = df.sort_values('del')
I also noticed that del seems to be sorted in the same way your index is sorted, so you could even do:
df = df.sort_index(ascending=False)
To go from lowest to highest, you can explicitly use .sort_values('del', ascending=True).
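For illustration, a tiny self-contained example with made-up values showing the lexicographic sort versus the numeric sort:
import pandas as pd
df = pd.DataFrame({'del': ['1', '3', '10', '2'], 'Ticker': ['SPY', 'IWM', 'BAC', 'GE']})
print(df.sort_values('del')['del'].tolist())   # ['1', '10', '2', '3'] - string order
df['del'] = df['del'].astype(int)
print(df.sort_values('del')['del'].tolist())   # [1, 2, 3, 10] - numeric order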

Pandas very slow query

I have the following code, which reads a CSV file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness is seen across all patients. But this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 minutes. Is there a way to make it faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
Finally another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"] not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
illnesses.at[index, "Finding_Label"] = ctr
illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
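As a side note, if the labels really are '|'-delimited as in the sample rows, a sketch like the following (pandas 0.25+ for explode and named aggregation) avoids the repeated str.contains scans entirely; column names are kept from the question, and an exact-token match is assumed:
# one row per (image, illness) pair, then aggregate once
exploded = (raw_data.assign(ill=raw_data['Finding Labels'].str.split('|'))
                    .explode('ill'))
illnesses = exploded.groupby('ill').agg(
    Count_of_Times_Being_Shown_In_An_Image=('Patient ID', 'size'),
    Count_of_Patientes_Having=('Patient ID', 'nunique'),
).reset_index()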
Likely the reason your code is slow is that you are growing a data frame row-by-row inside a loop which can involve multiple in-memory copying. Usually this is reminiscent of general purpose Python and not Pandas programming which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in same row to be filtered if longer string contains shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
            )
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                                      np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2

Loop to perform same upsampling task over several pandas dataframes for logistic regression

I have a series of dataframes containing daily rainfall totals (continuous data) and whether or not a flood occurs (binary data, i.e. 1 or 0). Each data frame represents a year (e.g. df01, df02, df03, etc.), which looks like this:
date ppt fld
01/02/2011 1.5 0
02/02/2011 0.0 0
03/02/2011 2.7 0
04/02/2011 4.6 0
05/02/2011 15.5 1
06/02/2011 1.5 0
...
I wish to perform logistic regression on each year of data, but the data is heavily imbalanced due to the very small number of flood events relative to the number of rainfall events. As such, I wish to upsample just the minority class (values of 1 in 'fld'). So far I know to split each dataframe into two according to the 'fld' value, upsample the resulting '1' dataframe, and then remerge into one dataframe.
# So if I apply to one dataframe it looks like this:
# Separate majority and minority classes
mask = df01.fld == 0
fld_0 = df01[mask]
fld_1 = df01[~mask]
# Upsample minority class
fld_1_upsampled = resample(fld_1,
                           replace=True,     # sample with replacement
                           n_samples=247,    # to match majority class
                           random_state=123) # reproducible results
# Combine majority class with upsampled minority class
df01_upsampled = pd.concat([fld_0, fld_1_upsampled])
As I have 17 dataframes, it is inefficient to go dataframe-by-dataframe. Are there any thoughts as to how I could be more efficient with this? So far I have tried this (it is probably evident I have no idea what I am doing with loops of this kind, I am quite new to python):
df_all = [df01, df02, df03, df04,
          df05, df06, df07, df08,
          df09, df10, df11, df12,
          df13, df14, df15, df16, df17]
# This is my list of annual data
for i in df_all:
    fld_0 = i[mask]
    fld_1 = i[~mask]
    fld_1_upsampled = resample(fld_1,
                               replace=True,         # sample with replacement
                               n_samples=len(fld_0), # to match majority class
                               random_state=123)     # reproducible results
    i_upsampled = pd.concat([fld_0, fld_1_upsampled])
    return i_upsampled
Which returns the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-6fd782d4c469> in <module>()
11 replace=True, # sample with replacement
12 n_samples=247, # to match majority class
---> 13 random_state=123) # reproducible results
14 i_upsampled = pd.concat([fld_0, fld_1_upsampled])
15 return i_upsampled
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py in resample(*arrays, **options)
259
260 if replace:
--> 261 indices = random_state.randint(0, n_samples, size=(max_n_samples,))
262 else:
263 indices = np.arange(n_samples)
mtrand.pyx in mtrand.RandomState.randint()
ValueError: low >= high
Any advice or comments greatly appreciated :)
UPDATE: one reply suggested that some of my dataframes may not contain any samples from the minority class. This was correct, so I have removed them, but the same error arises.
Giving you the benefit of the doubt that you're using the same mask syntax in your second code block as in your first, it looks like you may not have any samples to pass in to your resample in one or more of your DFs:
df=pd.DataFrame({'date':[1,2,3,4,5,6],'ppt':[1.5,0,2.7,4.6,15.5,1.5],'fld':[0,1,0,0,1,1]})
date ppt fld
1 1.5 0
2 0.0 1
3 2.7 0
4 4.6 0
5 15.5 1
6 1.5 1
resample(df[df.fld==1], replace=True, n_samples=3, random_state=123)
date ppt fld
6 1.5 1
5 15.5 1
6 1.5 1
resample(df[df.fld==2], replace=True, n_samples=3, random_state=123)
"...ValueError: low >= high"
