
Scikit-learn TypeError: If no scoring is specified, the estimator passed should have a 'score' method

I created a custom model in Python using scikit-learn, and I want to use cross-validation.

The class for the model is defined as follows:

    import copy

    import numpy as np


    class MultiLabelEnsemble:
        '''
        MultiLabelEnsemble(predictorInstance, balance=False)

        Like OneVsRestClassifier: Wrapping class to train multiple models when
        several objectives are given as target values. Its predictor may be an
        ensemble.

        This class can be used to create a one-vs-rest classifier from multiple
        0/1 labels to treat a multi-label problem or to create a one-vs-rest
        classifier from a categorical target variable.

        Arguments:
            predictorInstance -- A predictor instance is passed as argument (be
                careful, you must instantiate the predictor class before passing
                the argument, i.e. end with (), e.g. LogisticRegression()).
            balance -- True/False. If True, attempts to re-balance classes in
                training data by including a random sample (without replacement)
                s.t. the largest class has at most 2 times the number of elements
                of the smallest one.

        Example Usage:
            mymodel = MultiLabelEnsemble(GradientBoostingClassifier(), True)
        '''

        def __init__(self, predictorInstance, balance=False):
            self.predictors = [predictorInstance]
            self.n_label = 1
            self.n_target = 1
            self.n_estimators = 1  # for predictors that are ensembles of estimators
            self.balance = balance

        def __repr__(self):
            return "MultiLabelEnsemble"

        def __str__(self):
            return ("MultiLabelEnsemble : \n"
                    + "\tn_label={}\n".format(self.n_label)
                    + "\tn_target={}\n".format(self.n_target)
                    + "\tn_estimators={}\n".format(self.n_estimators)
                    + str(self.predictors[0]))

        def fit(self, Xtrain, Ytrain):
            if len(Ytrain.shape) == 1:
                # Transform vector into column matrix
                Ytrain = np.array([Ytrain]).transpose()
                # This is NOT what we want: Y = Y.reshape( -1, 1 ), because Y.shape[1] out of range
            self.n_target = Ytrain.shape[1]  # Num target values = num col of Y
            # Num labels = num classes (categories of categorical var if
            # n_target=1 or n_target if labels are binary)
            self.n_label = len(set(Ytrain.ravel()))
            # Create the right number of copies of the predictor instance
            if len(self.predictors) != self.n_target:
                predictorInstance = self.predictors[0]
                self.predictors = [predictorInstance]
                for i in range(1, self.n_target):
                    self.predictors.append(copy.copy(predictorInstance))
            # Fit all predictors
            for i in range(self.n_target):
                # Update the number of desired predictors
                if hasattr(self.predictors[i], 'n_estimators'):
                    self.predictors[i].n_estimators = self.n_estimators
                # Subsample if desired
                if self.balance:
                    pos = Ytrain[:, i] > 0
                    neg = Ytrain[:, i] <= 0
                    if sum(pos) < sum(neg):
                        chosen = pos
                        not_chosen = neg
                    else:
                        chosen = neg
                        not_chosen = pos
                    num = sum(chosen)
                    # Indices of the samples in the larger class
                    idx = np.flatnonzero(not_chosen)
                    np.random.shuffle(idx)
                    chosen[idx[0:min(num, len(idx))]] = True
                    # Train with chosen samples
                    self.predictors[i].fit(Xtrain[chosen, :], Ytrain[chosen, i])
                else:
                    self.predictors[i].fit(Xtrain, Ytrain[:, i])
            return

        def predict_proba(self, Xtrain):
            if len(Xtrain.shape) == 1:  # IG modif Feb3 2015
                Xtrain = np.reshape(Xtrain, (-1, 1))
            prediction = self.predictors[0].predict_proba(Xtrain)
            if self.n_label == 2:  # Keep only 1 prediction, 1st column = (1 - 2nd column)
                prediction = prediction[:, 1]
            for i in range(1, self.n_target):  # More than 1 target, we assume that labels are binary
                new_prediction = self.predictors[i].predict_proba(Xtrain)[:, 1]
                prediction = np.column_stack((prediction, new_prediction))
            return prediction

When I use this class for cross-validation like this:

    kf = cross_validation.KFold(len(Xtrain), n_folds=10)
    score = cross_val_score(self.model, Xtrain, Ytrain, cv=kf, n_jobs=-1).mean()

I got the following error:

    TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator MultiLabelEnsemble does not.

How do I create a score method?

The easiest way to make the error go away is to pass scoring="accuracy" or a Hamming-loss scorer to cross_val_score. The cross_val_score function itself doesn't know what kind of problem you are trying to solve, so it doesn't know which metric is appropriate. It looks like you are trying to do multi-label classification, so maybe you want to use the Hamming loss?
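As a rough sketch of this first option, the snippet below passes an explicit scoring argument for an estimator that has no score method. It assumes a recent scikit-learn, where cross_val_score and KFold live in sklearn.model_selection rather than the older cross_validation module used in the question. NoScoreModel and the synthetic data are hypothetical stand-ins, not the asker's class. The Hamming variant is built with make_scorer, on the assumption that no ready-made "hamming" scoring string is registered.

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import hamming_loss, make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


class NoScoreModel(BaseEstimator):
    """Hypothetical custom model: it has fit/predict but no score()."""

    def fit(self, X, y):
        self.tree_ = DecisionTreeClassifier(random_state=0).fit(X, y)
        return self

    def predict(self, X):
        return self.tree_.predict(X)


# Synthetic multi-label data: Y is a 0/1 indicator matrix with 3 columns.
X, Y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=0)
kf = KFold(n_splits=5)

# Subset accuracy: a sample counts as correct only if every label matches.
acc = cross_val_score(NoScoreModel(), X, Y, cv=kf, scoring="accuracy")

# Hamming loss is a loss, so greater_is_better=False makes the scorer
# return its negation (higher is still better for cross_val_score).
hamming_scorer = make_scorer(hamming_loss, greater_is_better=False)
neg_hamming = cross_val_score(NoScoreModel(), X, Y, cv=kf, scoring=hamming_scorer)

print("mean subset accuracy:", acc.mean())
print("mean Hamming loss:", -neg_hamming.mean())
```

Note that these scorers call predict on the estimator, so a class that only exposes predict_proba would also need a predict method for this route to work.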

You can also implement a score method as explained in the "Roll your own estimator" docs; its signature is def score(self, X, y_true). See http://scikit-learn.org/stable/developers/#different-objects
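A minimal sketch of what such a score method might look like follows. PerLabelEnsemble is a hypothetical, much simplified stand-in for the class in the question (one cloned estimator per target column), and returning the fraction of correctly predicted labels (1 minus the Hamming loss) is an assumption about the metric you want. Inheriting from BaseEstimator is also an assumption: it supplies the get_params/set_params that cross_val_score relies on when cloning the estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


class PerLabelEnsemble(BaseEstimator):
    """Hypothetical wrapper: fits one copy of base_estimator per target column."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, Y):
        Y = np.asarray(Y)
        if Y.ndim == 1:
            Y = Y.reshape(-1, 1)
        self.estimators_ = [
            clone(self.base_estimator).fit(X, Y[:, j]) for j in range(Y.shape[1])
        ]
        return self

    def predict(self, X):
        # One column of predictions per target.
        return np.column_stack([est.predict(X) for est in self.estimators_])

    def score(self, X, y_true):
        y_true = np.asarray(y_true)
        if y_true.ndim == 1:
            y_true = y_true.reshape(-1, 1)
        # Fraction of individual labels predicted correctly (1 - Hamming loss).
        return float(np.mean(self.predict(X) == y_true))


X, Y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=0)
model = PerLabelEnsemble(DecisionTreeClassifier(max_depth=3, random_state=0))

# No scoring argument: cross_val_score falls back to model.score.
scores = cross_val_score(model, X, Y, cv=KFold(n_splits=5))
print("mean score:", scores.mean())
```

With a score method in place, the original call works without a scoring argument; whichever metric you return, higher values should mean a better model.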

By the way, you do know about OneVsRestClassifier, right? It looks a bit like you are reimplementing it.
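For comparison, here is a short sketch of OneVsRestClassifier on a multi-label problem. The synthetic data and the choice of LogisticRegression are assumptions made only for illustration, and the import paths assume a recent scikit-learn. Because OneVsRestClassifier already has a score method, cross_val_score needs no scoring argument.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: Y is a 0/1 indicator matrix with 3 columns.
X, Y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=0)

# One binary LogisticRegression is fitted per label column.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# The built-in score method is used (subset accuracy for multi-label targets).
scores = cross_val_score(ovr, X, Y, cv=KFold(n_splits=5))
print("mean subset accuracy:", scores.mean())

# Per-label probabilities, one column per label.
ovr.fit(X, Y)
proba = ovr.predict_proba(X)
print(proba.shape)
```

It does not cover the class re-balancing option from the question, so that part would still need custom code.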