What is a pipeline in machine learning?

  1. Preparing data for a machine learning model almost always involves tying together many different processing steps
  2. It is paramount that the data transformation stages represented by these steps are standardized
  3. The Pipeline class of sklearn simplifies chaining the transformation steps and the model
  4. Pipeline, together with GridSearchCV, helps search over the hyperparameters applicable at each stage
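
Point 4 can be sketched as follows. This is a minimal illustration, not a tuned setup: the synthetic data, the step names "scaler" and "lr", and the candidate values in the grid are all assumptions chosen for the example. Parameters of a step are addressed as the step name, two underscores, and the parameter name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Each key is "<step name>__<parameter name>"; the values are assumed candidates.
param_grid = {
    "scaler__feature_range": [(0, 1), (-1, 1)],
    "lr__C": [0.1, 1.0, 10.0],
}

# The whole pipeline is refit for every parameter combination and fold.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the scaler is refit inside every cross-validation fold, the held-out fold never influences the scaling parameters.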


  1. Sequentially apply a list of transforms and a final estimator.
  2. Intermediate steps of the pipeline must be ‘transforms’, that is, they must
    implement fit and transform methods.
  3. The final estimator only needs to implement fit.
  4. Helps standardize the model project by enforcing consistency across building,
    testing and production.
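
To illustrate the requirement that intermediate steps implement fit and transform, here is a small sketch with a hand-written transformer. The class MeanCenterer and the toy data are hypothetical and exist only for this example; sklearn already ships scalers that do this job better.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # Learn the statistics from the training data only.
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the learned statistics to any data passed in later.
        return np.asarray(X) - self.mean_


# Toy data, made up for the example.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# The intermediate step has fit/transform; the final step only needs fit.
pipe = Pipeline([("center", MeanCenterer()), ("lr", LogisticRegression())])
pipe.fit(X, y)

centered = pipe.named_steps["center"].transform(X)
print(centered.mean(axis=0))
```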

Build a pipeline

  1. Import the Pipeline class, along with the transformer and estimator used in the steps
    a. from sklearn.pipeline import Pipeline
    b. from sklearn.preprocessing import MinMaxScaler
    c. from sklearn.linear_model import LogisticRegression
  2. Instantiate the class by listing the transformation steps. In the following example, a scaler is followed by logistic regression
    a. pipe = Pipeline([("scaler", MinMaxScaler()), ("lr", LogisticRegression())])
  3. Call the fit() function on the pipeline object
    a. pipe.fit(X_train, y_train)
  4. Call the score() or predict() function on the pipeline object
    a. pipe.score(X_test, y_test)
    In step 2a, the pipeline object is created from a list of (name, object) tuples. The name is a string, e.g. "scaler", and it is followed by the transformer or estimator instance to use.
    The name identifies the step.
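
The four steps above can be put together into one runnable sketch. The dataset here is synthetic and the train/test split is an assumption added so that X_train, y_train, X_test and y_test exist; substitute your own data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 2: a list of (name, object) tuples, scaler first, estimator last.
pipe = Pipeline([("scaler", MinMaxScaler()), ("lr", LogisticRegression())])

# Step 3: fit the scaler and the model on the training data.
pipe.fit(X_train, y_train)

# Step 4: evaluate on, or predict for, unseen data.
accuracy = pipe.score(X_test, y_test)
predictions = pipe.predict(X_test)
print("Test accuracy:", accuracy)
```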
  • The pipeline object requires all the stages to have both fit()
    and transform() functions, except for the last stage when it is
    an estimator
  • The estimator does not need a transform() function because
    it builds the model using the data from the previous step. It does
    not transform the data
  • The transform function transforms the input data and emits
    transformed data as output, which becomes the input to the
    next stage
  • pipeline.fit() calls the fit and transform functions on each
    stage in sequence. In the last stage, if it is an estimator, only
    the fit function is called to create the model.
  • The fitted model becomes a part of the pipeline automatically
  • pipeline.predict() calls the transform function of every stage before the last on the given data
  • In the last stage, it skips the fit step because the model is already built
  • It then executes the predict() function of the model on the transformed data
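
The behaviour described in these bullets can be checked by doing the same work by hand and comparing the results. This is a sketch of the equivalent manual steps on synthetic data, not a copy of what sklearn does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Manual version of fit(): fit and transform the scaler, then fit the model.
scaler = MinMaxScaler()
lr = LogisticRegression()
X_train_scaled = scaler.fit_transform(X_train)
lr.fit(X_train_scaled, y_train)

# Manual version of predict(): transform only (no refit), then predict.
manual_pred = lr.predict(scaler.transform(X_test))

# The pipeline performs the same sequence in one call each.
pipe = Pipeline([("scaler", MinMaxScaler()), ("lr", LogisticRegression())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print(np.array_equal(manual_pred, pipe_pred))
```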


A pipeline can also be constructed purely for data transformation, which means it is not mandatory to have an estimator as the last step.
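
A transformation-only pipeline might look like the sketch below. The choice of SimpleImputer as the first step and the tiny array with a missing value are assumptions made for the example; any sequence of transformers works the same way.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data with one missing value, made up for the example.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])

# No estimator at the end: every step is a transformer.
prep = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])

# fit_transform() fits each step in turn and returns the transformed data.
X_prepared = prep.fit_transform(X)
print(X_prepared)
```

Such a pipeline has no predict() to call; it is used through fit(), transform() and fit_transform() only.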


  • Naming the stages of a pipeline manually is error-prone. Each stage has to be uniquely named, and the names should follow a consistent convention, such as
    using lower case letters only and choosing names that reflect the purpose of the stage
  • Alternatively, we can use the make_pipeline() function, which creates the pipeline and automatically names each step with the lowercased class name of the object passed in
    + from sklearn.pipeline import make_pipeline
    + pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
    + print("Pipeline steps:\n{}".format(pipe.steps))
  • The advantage of make_pipeline() is consistent naming of each stage. A pipeline can contain multiple stages performing the same transformation, and each stage is still guaranteed a unique name because repeated class names get a numeric suffix (for example standardscaler-1 and standardscaler-2)
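
The automatic naming can be inspected through the steps attribute. The second pipeline below, with StandardScaler repeated twice, is a contrived example added only to show how duplicate class names are kept unique.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Names are derived from the lowercased class names.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
names = [name for name, _ in pipe.steps]
print(names)  # ['minmaxscaler', 'logisticregression']

# Repeating a class (contrived here) gets numeric suffixes to stay unique.
dup = make_pipeline(StandardScaler(), StandardScaler(), LogisticRegression())
dup_names = [name for name, _ in dup.steps]
print(dup_names)  # ['standardscaler-1', 'standardscaler-2', 'logisticregression']
```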

Reference: What are Azure Machine Learning pipelines?

at: https://learn.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2






