I am working on the dataset, there are a number of categorical features, which I want to encode in top categories, while the bottom to be replaced with other in combination. As there is large number of features with different threshold values. I defined a function like this
def enc(threshold , features):
for i in features:
j = i + "other"
repl = train[features].value_counts()[train[features].value_counts() <= threshold].index
sample = pd.get_dummies(train[features].replace(repl , j))
train.drop(features , axis = 1 , inplace = True)
n_data = pd.concat([sample , train] , axis = 1 , join = "inner")
return n_data
It gives the error Cannot compare types 'ndarray(dtype=object)' and 'tuple' on the line sample = ...
Here threshold is some threshold value for top categories, and features is the list of features I want to iterate over.
When I try to run this code out of the function, it still gives me error.
But the catch is when I try just the major 2 lines out of for loop which are
repl = train["LandContour"].value_counts()[train["LandContour"].value_counts() <= 1000].index
a = pd.get_dummies(train["LandContour"].replace(repl , "LandContou_other"))
with dedicated values, it runs perfectly as I want.
I have also tried my level best of debugging and still was not able to rectify the error.
DataType of repl is <class 'pandas.core.indexes.multi.MultiIndex'>
I have also looked upon a similar question but it didn't solved my problem
I think there is some problem in the for iteration only, or the repl is causing some problem
Related
Luckily i found this side:
https://www.linuxtut.com/en/150745ae0cc17cb5c866/
(There are many Linetypes difined
Excel Enum XlLineStyle)
(xlContinuous = 1
xlDashDot = 4
xlDashDotDot = 5
xlSlantDashDot = 13
xlDash = -4115
xldot = -4118
xlDouble = -4119
xlLineStyleNone = -4142)
i run with try and except +/- 100.000 times set lines because i thought anywhere should be this
[index] number for put this line in my picture too but they warsnt.. why not?
how can i set this line?
why are there some line indexe's in a such huge negative ranche and not just 1, 2, 3...?
how can i discover things like the "number" for doing things like that?
why is this even possible, to send apps data's in particular positions, i want to step a little deeper in that, where can i learn more about this?
(1) You can't find the medium dashed in the linestyle enum because there is none. The line that is drawn as border is a combination of lineStyle and Weight. The lineStyle is xlDash, the weight is xlThin for value 03 in your table and xlMedium for value 08.
(2) To figure out how to set something like this in VBA, use the Macro recorder, it will reveal that lineStyle, Weight (and color) are set when setting a border.
(3) There are a lot of pages describing all the constants, eg have a look to the one #FaneDuru linked to in the comments. They can also be found at Microsoft itself: https://learn.microsoft.com/en-us/office/vba/api/excel.xllinestyle and https://learn.microsoft.com/en-us/office/vba/api/excel.xlborderweight. It seems that someone translated them to Python constants on the linuxTut page.
(4) Don't ask why the enums are not continuous values. I assume especially the constants with negative numbers serve more that one purpose. Just never use the values directly, always use the defined constants.
(5) You can assume that numeric values that have no defined constant can work, but the results are kind of unpredictable. It's unlikely that there are values without constant that result in something "new" (eg a different border style).
As you can see in the following table, not all combination give different borders. Setting the weight to xlHairline will ignore the lineStyle. Setting it to xlThick will also ignore the lineStyle, except for xlDouble. Ob the other hand, xlDouble will be ignored when the weight is not xlThick.
Sub border()
With ThisWorkbook.Sheets(1)
With .Range("A1:J18")
.Clear
.Interior.Color = vbWhite
End With
Dim lStyles(), lWeights(), lStyleNames(), lWeightNames
lStyles() = Array(xlContinuous, xlDash, xlDashDot, xlDashDotDot, xlDot, xlDouble, xlLineStyleNone, xlSlantDashDot)
lStyleNames() = Array("xlContinuous", "xlDash", "xlDashDot", "xlDashDotDot", "xlDot", "xlDouble", "xlLineStyleNone", "xlSlantDashDot")
lWeights = Array(xlHairline, xlThin, xlMedium, xlThick)
lWeightNames = Array("xlHairline", "xlThin", "xlMedium", "xlThick")
Dim x As Long, y As Long
For x = LBound(lStyles) To UBound(lStyles)
Dim row As Long
row = x * 2 + 3
.Cells(row, 1) = lStyleNames(x) & vbLf & "(" & lStyles(x) & ")"
For y = LBound(lWeights) To UBound(lWeights)
Dim col As Long
col = y * 2 + 3
If x = 1 Then .Cells(1, col) = lWeightNames(y) & vbLf & "(" & lWeights(y) & ")"
With .Cells(row, col).Borders
.LineStyle = lStyles(x)
.Weight = lWeights(y)
End With
Next
Next
End With
End Sub
I have many aggregations of the type below in my code:
period = 'ag'
index = ['PA']
lvl = 'pa'
wm = lambda x: np.average(x, weights=dfdom.loc[x.index, 'pop'])
dfpa = dfdom[(dfdom['stratum_kWh'] !=8)].groupby(index).agg(
pa_mean_ea_ag_kwh = ('mean_ea_'+period+'_kwh', wm),
pa_pop = ('dom_pop', 'sum'))
It's straightforward to build the right hand side of the aggregation equation. I want to also dynamically build the left hand side of the aggregate equations so that 'dom', 'ea', 'ag' and 'kw/kwh/thm' can be all created as variable inputs and used depending on which process I'm executing. This will significantly reduce the amount of code that needs to be written and updates will also be easier to manage as otherwise I need to write separate otherwise identical code for each combination of the above.
Can I use eval to do this? I'd appreciate guidance on how to do it. Thanks.
Adding code written after feedback from Vaidøtas I.:
index = ['PA']
lvl = 'pa'
fname = lvl+"_pop"
b = f'dfdom.groupby({index}).agg({lvl}_pop = ("dom_pop", "sum"))'
dfpab = exec(b)
The output for the above is a 'NoneType object'. If I lift the text in variable b and directly run the code as show below, I get a dataframe.
dfpab = dfdom.groupby(['PA']).agg(pa_pop = ("dom_pop", "sum"))
(I've simplified my original example to better connect with the second code added.)
Use exec(), eval() is something different
For example:
exec(f"variable_name{added_namepart} = variable_value{added_valuepart}")
I know this question has been asked several times and I've read them but still haven't been able to figure it out.
Like other people, my feature names at the end are shown as f56, f234, f12 etc. and I want to have the actual names instead of f-somethings! This is the part of the code related to the model:
optimized_params, xgb_model = find_best_parameters() #where fitting and GridSearchCV happens
xgdmat = xgb.DMatrix(X_train_scaled, y_train_scaled)
feature_names=xgdmat.feature_names
final_gb = xgb.train(optimized_params, xgdmat, num_boost_round =
find_optimal_num_trees(optimized_params,xgdmat))
final_gb.get_fscore()
mapper = {'f{0}'.format(i): v for i, v in enumerate(xgdmat.feature_names)}
mapped = {mapper[k]: v for k, v in final_gb.get_fscore().items()}
mapped
xgb.plot_importance(mapped, color='red')
I also tried this:
feature_important = final_gb.get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.plot(kind='barh')
but still the features are shown as f+number. I'd really appreciate any help.
What I'm doing at the moment is to get the number at the end of fs, like 234 from f234 and use it in X_train.columns[234] to see what the actual name was. However, I'm having second thoughts as the name I'm getting this way is the actual feature f234 represents.
First make a dictionary from your original features and map them back to feature names.
# create dict to use later
myfeatures = X_train_scaled.columns
dict_features = dict(enumerate(myfeatures))
# feat importance with names f1,f2,...
axsub = xgb.plot_importance(final_gb )
# get the original names back
Text_yticklabels = list(axsub.get_yticklabels())
dict_features = dict(enumerate(myfeatures))
lst_yticklabels = [ Text_yticklabels[i].get_text().lstrip('f') for i in range(len(Text_yticklabels))]
lst_yticklabels = [ dict_features[int(i)] for i in lst_yticklabels]
axsub.set_yticklabels(lst_yticklabels)
print(dict_features)
plt.show()
Here is the example how it works:
The problem can be solved by using feature_names parameter when creating your xgb.DMatrix
xgdmat = xgb.DMatrix(X_train_scaled, y_train_scaled,feature_names=feature_names)
I am trying to get the doc2vec function to work in python 3.
I Have the following code:
tekstdata = [[ index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()]
def prep (x):
low = x.lower()
return word_tokenize(low)
def cleanMuch(data, clean):
output = []
for x, y in data:
z = clean(y)
output.append([str(x), z])
return output
tekstdata = cleanMuch(tekstdata, prep)
def tagdocs(docs):
output = []
for x,y in docs:
output.append(gensim.models.doc2vec.TaggedDocument(y, x))
return output
tekstdata = tagdocs(tekstdata)
print(tekstdata[100])
vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size = 100, window = 4,min_count = 3, iter = 2)
ranks = []
second_ranks = []
for x, y in tekstdata:
print (x)
print (y)
inferred_vector = vectorModel.infer_vector(y)
sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab = None)
rank = [docid for docid, sim in sims].index(y)
ranks.append(rank)
All works as far as I can understand until the rank function.
The error I get is that there is no zero in my list e.g. the documents I am putting in does not have 10 in list:
File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list
It seems to me that it is the similar function that does not work.
the model trains on my data (1000 documents) and build a vocab which is tagged.
The documentation I mainly have used is this:
Gensim dokumentation
Torturial
I hope that some one can help. If any additional info is need please let me know.
best
Niels
If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?
It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.
But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tags of '10', it will be seen as ['1', '0'] – and len(vectorModel.doctags) will be just 10 (for the 10 single-digit strings).
Separate comments on your setup:
1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents
an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025)
Hey so I am just working on some coding homework for my Python class using JES. Our assignment is to take a sound, add some white noise to the background and to add an echo as well. There is a bit more exacts but I believe I am fine with that. There are four different functions that we are making: a main, an echo equation based on a user defined length of time and amount of echos, a white noise generation function, and a function to merge the noises.
Here is what I have so far, haven't started the merging or the main yet.
#put the following line at the top of your file. This will let
#you access the random module functions
import random
#White noise Generation functiton, requires a sound to match sound length
def whiteNoiseGenerator(baseSound) :
noise = makeEmptySound(getLength(baseSound))
index = 0
for index in range(0, getLength(baseSound)) :
sample = random.randint(-500, 500)
setSampleValueAt(noise, index, sample)
return noise
def multipleEchoesGenerator(sound, delay, number) :
endSound = getLength(sound)
newEndSound = endSound +(delay * number)
len = 1 + int(newEndSound/getSamplingRate(sound))
newSound = makeEmptySound(len)
echoAmplitude = 1.0
for echoCount in range (1, number) :
echoAmplitude = echoAmplitude * 0.60
for posns1 in range (0, endSound):
posns2 = posns1 + (delay * echoCount)
values1 = getSampleValueAt(sound, posns1) * echoAmplitude
values2 = getSampleValueAt(newSound, posns2)
setSampleValueAt (newSound, posns2, values1 + values2)
return newSound
I receive this error whenever I try to load it in.
The error was:
Inappropriate argument value (of correct type).
An error occurred attempting to pass an argument to a function.
Please check line 38 of C:\Users\insanity180\Desktop\Work\Winter Sophomore\CS 140\homework3\homework_3.py
That line of code is:
setSampleValueAt (newSound, posns2, values1 + values2)
Anyone have an idea what might be happening here? Any assistance would be great since I am hoping to give myself plenty of time to finish coding this assignment. I have gotten a similar error before and it was usually a syntax error however I don't see any such errors here.
The sound is made before I run this program and I defined delay and number as values 1 and 3 respectively.
Check the arguments to setSampleValueAt; your sample value must be out of bounds (should be within -32768 - 32767). You need to do some kind of output clamping for your algorithm.
Another possibility (which indeed was the error, according to further input) is that your echo will be out of the range of the sample - that is, if your sample was 5 seconds long, and echo was 0.5 seconds long; or the posns1 + delay is beyond the length of the sample; the length of the new sound is not calculated correctly.