How to train your own model for CoreML

In this guide we will train a Caffe model using DIGITS on an EC2 g2.2xlarge instance, convert it into a CoreML model using Apple’s coremltools and integrate it into an iOS app. If some of what is mentioned here is unfamiliar or sounds intimidating, fear not: I will try to cover the process in as much detail as possible. We will do all of this while building SeeFood, the app first introduced on HBO’s popular TV show Silicon Valley. The difference is that we will succeed where Erlich and Jian-Yang failed and go beyond Hot Dog/Not Hot Dog. By the end of this guide we will have trained our own model and integrated it into an iOS app that can recognize 101 dishes locally on the device.

This guide is broken down into three parts:

Part 1: Train
Part 2: Convert
Part 3: Integrate

Each part builds on the one before it; however, I will provide all the resources needed to pick up and continue from anywhere.

Just like most things in life, there are many ways to achieve the same goal. This guide covers one of many ways to train, convert and integrate an image classifier into an iOS app. If any of the decisions made in this guide does not suit you, I encourage you to explore and find what works best for you.

If you have suggestions or want to reach out to me directly, feel free to do so through Twitter: Reza Shirazian

There are a few prerequisites you need before continuing with this guide:

If you’re all set, we can begin.

Part 1: Train

This part can technically be split into two different parts: setup and training. First we’ll cover setting up an EC2 instance, and then we’ll focus on the actual training. I will highlight the process for setting up the instance in more detail since it’s something that might be a bit more foreign to iOS developers. If you’re an expert with AWS, then here is the summary of the setup:

We will launch a preconfigured Amazon Machine Image (AMI) that has Caffe and NVIDIA’s DIGITS already set up. This image was created by Satya Mallick and is named “bigvision-digits” under public images. (If you cannot find it, it’s most likely because this AMI is in the Oregon region. Make sure your current region is set as such.) I encourage you to check out his website and view his YouTube tutorial, which has a lot of overlap with what’s highlighted in this guide. We will use a g2.2xlarge instance to take advantage of GPU computing for more efficient training. This instance type is not free and at the time of writing costs about $0.65/hour. For more up-to-date pricing check out Amazon instance pricing. Fortunately you won’t need to run the instance for more than 6-7 hours, but do make sure to stop it when you’re done. Running one of these for a whole month can cost close to $500.
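To put that pricing in perspective, here is the rough arithmetic at the $0.65/hour rate mentioned above (rates change over time, so treat these as ballpark figures):

```python
hourly_rate = 0.65  # USD/hour for g2.2xlarge at the time of writing

# A single 6-7 hour training session costs only a few dollars...
training_run = 7 * hourly_rate        # roughly $4.55

# ...but an instance you forget to stop adds up fast over a month.
full_month = 24 * 30 * hourly_rate    # roughly $468

print(round(training_run, 2), round(full_month, 2))
```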

Setup

Sign into your AWS console and click on EC2

Select AMIs on the left hand side

Select Public Images from the dropdown next to the search bar

Search for bigvision-digits, select the image and click launch. Make sure you’re in the Oregon region. Your current region is displayed on the top right hand side, between your name and Support.

Select g2.2xlarge from the list of instance types, click “Next: Configure Instance Details”

There is no need to change anything here, Click on “Add Storage”

Change size to 75 GiB and click “Add Tags”

Add Key “Name” and set its value to something descriptive. In this example I picked “Caffe DIGITS”. Click on “Configure Security Group”

Click Add Rule, select Custom TCP with protocol TCP, Port Range 80.
Set Source for SSH and Custom TCP to My IP.
Click “Review and Launch”.

Review everything and click “Launch”

Select “Create a new key pair” and use “digits” as the key pair name. Click “Download Key Pair”. This will download a file named digits.pem. Remember where this file is and make sure you don’t lose it; you will need it to access your instance later. Click “Launch Instances”.

Wait a few minutes while the instance gets set up. Click on your instance and look at its description. Once the Instance State is set to running and you can see an IP value next to “IPv4 Public IP”, your EC2 instance is ready.

Copy and paste the public IP into your browser and you should see the following:

DIGITS is up and running. Before we can begin using our instance and DIGITS to train our model, we need a dataset. We will be using the Food-101 dataset, which can be found here. To download the dataset onto our instance we need to SSH into it. Open your terminal and follow along.

Navigate to the folder where you downloaded digits.pem and change the file’s access permissions to 600, meaning only the owner can read and write it (for more info on what this means click here):

chmod 600 digits.pem

SSH into your instance by running the following command, replacing YOUR INSTANCE’S PUBLIC ADDRESS with your instance’s public address: the very same IP you pasted into your browser to view DIGITS.

ssh -i digits.pem ubuntu@[YOUR INSTANCE'S PUBLIC ADDRESS]

After a few seconds you should be connected to your instance. If you’re unfamiliar with the terminal, fear not: the next few steps are fairly straightforward.

View the folders available to you.

ls

Go to the data folder

cd data

In here you will see a folder named 17flowers, which is part of Satya Mallick’s tutorial. He uses that dataset to train a flower classifier. We, however, will use a different dataset to train a different model.

Create a new folder named 101food

mkdir 101food

Navigate into the folder.

cd 101food

Download the dataset:

wget "http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz"

This process will take a while; the dataset is more than 5GB in size. Once the file is downloaded, unzip it using the following command:

tar -xzf food-101.tar.gz

This too will take a few minutes. Once done, there should be a food-101 folder. Navigate to its images folder and you should see a folder for each dish type we will classify.

Exit out of the terminal and go back to your browser.

Training

Navigate to your instance’s public IP. Under “Datasets” select “Images” and then “Classification”.

Change “Resize Transformation” to “Crop” and select the folder of food-101 images as the “Training Images”. If you’ve followed this guide, your folder will be /home/ubuntu/data/101food/food-101/images. Set “Dataset Name” to 101food and click “Create”.

DIGITS will begin to create a database based on your dataset. This database will be used by Caffe during training.

When the process is complete you should be able to explore your database. With the database ready, we are now ready to train our model. If you ever use your own dataset, it’s worth knowing that DIGITS doesn’t require any specific mapping or label file; it creates the database based on the folder structure.
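To make that concrete, DIGITS derives the class labels from the directory layout itself: every subfolder of the training images directory becomes one label. A simplified Python sketch of the idea (this is an illustration, not DIGITS’s actual code):

```python
import os

def labels_from_folders(images_root):
    # Each subfolder (e.g. "pizza", "sushi") is treated as one class label;
    # the images inside it become that label's training examples.
    return sorted(
        entry for entry in os.listdir(images_root)
        if os.path.isdir(os.path.join(images_root, entry))
    )
```

For the Food-101 images folder this yields the 101 dish names, the same labels that end up in the labels.txt file bundled with the model we download later.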

Go back to the home page of your DIGITS instance. On the “Models” panel select the “Images” dropdown and pick “Classification”.

On the “New Image Classification Model” page under “Select Dataset” select “101food”. Under “Standard Networks” tab select “Caffe” and pick “AlexNet” as your model. Next select “Customize.”

Technically you could start training your model now, but to get better accuracy we are going to use AlexNet and fine-tune the weights of the original trained model with our dataset. To do this we need to point DIGITS at the original pretrained model, which is available with Caffe. Under “Pretrained model” type /home/ubuntu/models/bvlc_alexnet.caffemodel. (DIGITS provides autocompletion, so finding the original model should be easy.)

Next we need to change the “Base Learning Rate”. Since this model is already trained, we want our dataset to adjust its weights at a smaller rate, so change “Base Learning Rate” from 0.01 to 0.001.

Next we need to change the name of the final layer of our network. The pretrained AlexNet model was trained to classify 1000 objects; our dataset has only 101 classes. Renaming the last layer prevents Caffe from copying its pretrained weights, so it gets reinitialized with the right number of outputs. This can be done by simply changing fc8 to something else; in this example we’ll change it to fc8_food. So under “Customize Networks” change all instances of fc8 to fc8_food.
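After the rename, the final layer in the customized network definition should look roughly like the following Caffe prototxt fragment (an illustration only; the exact fields DIGITS generates may differ, and DIGITS sets num_output from the dataset automatically):

```
layer {
  name: "fc8_food"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8_food"
  inner_product_param {
    num_output: 101
  }
}
```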

Visualize the model to ensure that the last layer has successfully been renamed to “fc8_food”.

For a more in-depth explanation of this process, check out Flickr’s fine-tuning of CaffeNet: Fine-tuning CaffeNet for Style Recognition on “Flickr Style” Data

Name your model and click “Create”.

DIGITS will begin to train your model. This process can take some time; for me, on a g2.2xlarge instance, it took about four hours.

Once the process is complete, we’ll see that our model can detect various dishes at a 65% success rate. This is okay but not perfect. To get better accuracy you would need a larger dataset, which is something you can take on later, but for now we’re going to settle for 65% and proceed.

At the bottom upload or use a URL of an image to test the trained model. Do this multiple times with images that were not part of the training set. Check “Show visualizations and statistics” to see how each layer processed the image.

The model is trained and ready to be converted to CoreML. Click “Download Model”.

Click here to download the trained model

Part 2: Convert

To convert an existing model to CoreML’s .mlmodel we need to install coremltools. Check out https://pypi.python.org/pypi/coremltools to familiarize yourself with the library.

To install coremltools, run the following from the terminal:

pip install -U coremltools

coremltools requires Python 2.7. If you get a “Could not find a version that satisfies the requirement coremltools (from versions: )” error, it’s likely that you’re not running Python 2.7. If you don’t get this error and everything works, then you’re luckier than I was and can skip the next few steps.

One way to solve this is to set up a virtual environment with Python 2.7. You can do this using Anaconda Navigator, which can be downloaded along with Anaconda here

Once you have Anaconda Navigator running click on “Environments” on the left hand side.

And select “Create” under the list of environments.

Type “coreml” for the name, check Python and select “2.7” from the dropdown. Click “Create”.

You should have the new “coreml” environment up and running after a few seconds. Click on the new environment and press play. A new command line should open up.

If you run python --version you should see that your new environment is now running under Python 2.7

Run pip install -U coremltools from your new command line. If you see a “no file or directory” error for any of the packages, run pip install -U coremltools --ignore-installed

Navigate to the folder where you downloaded the trained model. Create a new file named run.py and add the following code:

import coremltools

# Convert a caffe model to a classifier in Core ML
coreml_model = coremltools.converters.caffe.convert(
    ('snapshot_iter_24240.caffemodel', 'deploy.prototxt', 'mean.binaryproto'),
    image_input_names='data',
    class_labels='labels.txt',
    is_bgr=True,
    image_scale=255.)

# Now save the model
coreml_model.author = "Reza Shirazian"
coreml_model.save('food.mlmodel')

The first line imports coremltools. Then we create coreml_model and provide all the necessary input for coremltools to convert the Caffe model into a .mlmodel. The file references passed as the first parameter work as-is if run.py is in the same folder as the unpacked trained model downloaded earlier from DIGITS.

You can provide other metadata such as author, description and license for the converted model. For more details I suggest going over the coremltools documentation here: https://pythonhosted.org/coremltools/

Save run.py and run it from the console:

python run.py

This process will take a few minutes. Once completed you will have a food.mlmodel in the same folder as run.py. You’re now ready to integrate your trained CoreML model into an iOS project.

Click here to download the CoreML model

Part 3: Integrate

Start a new Xcode project. Select iOS and Single View App.

Name your app “SeeFood” and click next. Select the folder where you wish to create your new project and click Create.

Once you’ve created your project, right-click on the SeeFood folder under the Project Navigator on the left and select Add Files to “SeeFood”…

Find and select food.mlmodel from the folder where we ran run.py. Make sure “Copy items if needed” under Options is checked and click “Add”.

After a few seconds food.mlmodel should appear in your Project Navigator. Click on the file and you should see the model’s details.

Make sure under inputs you see data Image<BGR 227,227>, and under outputs you see prob Dictionary<String,Double> and classLabel String. If you see anything else, remove the model and add it again. If that doesn’t work, check the run.py we created earlier; make sure it is exactly as what’s used in this example and that it has access to all the files it references. Also ensure that the model’s target membership for the project is checked in the File Inspector.

We’re now ready to work with our CoreML model. Open ViewController.swift and add the following imports before the class definition:

import UIKit
import CoreML
import Vision

Then add the following function at the bottom of ViewController

  func detectScene(image: CIImage) {
    guard let model = try? VNCoreMLModel(for: food().model) else {
      fatalError()
    }
    // Create a Vision request with completion handler
    let request = VNCoreMLRequest(model: model) { [unowned self] request, error in
      guard let results = request.results as? [VNClassificationObservation],
        let _ = results.first else {
          self.settingImage = false
          return
      }
      
      DispatchQueue.main.async { [unowned self] in
        if let first = results.first {
           if Int(first.confidence * 100) > 1 {
            self.iSee.text = "I see \(first.identifier) \(self.addEmoji(id: first.identifier))"
            self.settingImage = false
          }
        }
      }
    }
    let handler = VNImageRequestHandler(ciImage: image)
    DispatchQueue.global(qos: .userInteractive).async {
      do {
        try handler.perform([request])
      } catch {
        print(error)
      }
    }
  }

To test our CoreML model, add some sample images that were not part of the training set to the Assets.xcassets folder.

Go back to the ViewController and add the following lines in the viewDidLoad method after super.viewDidLoad()

    if let uiExample = UIImage(named: "pizza"),
      let example = CIImage(image: uiExample) {
      self.detectScene(image: example)
    }

Change “pizza” to the name of whatever image file you added to Assets.xcassets. Build and run the app. Ignore the Simulator and take a look at the output in the console.

Great! Our CoreML model is working. However, classifying an image from Assets.xcassets is not super useful. Let’s build out the app so it continuously takes a frame from the camera, runs it through our classifier and displays on the screen what it thinks it sees. CoreML is pretty fast, and this makes for a much better user experience than having to take a picture and then run it through the classifier.

Click on Main.storyboard. Add a UIImageView and a UILabel to the ViewController and link them back to outlets on ViewController.swift:

  @IBOutlet weak var previewImage: UIImageView!
  @IBOutlet weak var iSee: UILabel!

To capture individual frames from the camera we’re going to use FrameExtractor, a class described here by Boris Ohayon. The original class is written in Swift 3; I have made the necessary changes and converted it to Swift 4, so you can copy-paste it directly from here. I do suggest going through it to understand how AVFoundation works. I’m not going to go into too much detail, since it’s outside the scope of CoreML and this guide, but AVFoundation is definitely worth exploring. If you wish to dive into it, this is a good place to start

import UIKit
import AVFoundation

protocol FrameExtractorDelegate: class {
  func captured(image: UIImage)
}

class FrameExtractor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
  
  private var position = AVCaptureDevice.Position.back
  private let quality = AVCaptureSession.Preset.medium
  
  private var permissionGranted = false
  private let sessionQueue = DispatchQueue(label: "session queue")
  private let captureSession = AVCaptureSession()
  private let context = CIContext()
  
  weak var delegate: FrameExtractorDelegate?
  
  override init() {
    super.init()
    checkPermission()
    sessionQueue.async { [unowned self] in
      self.configureSession()
      self.captureSession.startRunning()
    }
  }
  
  public func flipCamera() {
    sessionQueue.async { [unowned self] in
      self.captureSession.beginConfiguration()
      guard let currentCaptureInput = self.captureSession.inputs.first else { return }
      self.captureSession.removeInput(currentCaptureInput)
      guard let currentCaptureOutput = self.captureSession.outputs.first else { return }
      self.captureSession.removeOutput(currentCaptureOutput)
      self.position = self.position == .front ? .back : .front
      self.configureSession()
      self.captureSession.commitConfiguration()
    }
  }
  
  // MARK: AVSession configuration
  private func checkPermission() {
    switch AVCaptureDevice.authorizationStatus(for: AVMediaType.video) {
    case .authorized:
      permissionGranted = true
    case .notDetermined:
      requestPermission()
    default:
      permissionGranted = false
    }
  }
  
  private func requestPermission() {
    sessionQueue.suspend()
    AVCaptureDevice.requestAccess(for: AVMediaType.video) { [unowned self] granted in
      self.permissionGranted = granted
      self.sessionQueue.resume()
    }
  }
  
  private func configureSession() {
    guard permissionGranted else { return }
    captureSession.sessionPreset = quality
    guard let captureDevice = selectCaptureDevice() else { return }
    guard let captureDeviceInput = try? AVCaptureDeviceInput(device: captureDevice) else { return }
    guard captureSession.canAddInput(captureDeviceInput) else { return }
    captureSession.addInput(captureDeviceInput)
    let videoOutput = AVCaptureVideoDataOutput()
    videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sample buffer"))
    guard captureSession.canAddOutput(videoOutput) else { return }
    captureSession.addOutput(videoOutput)
    guard let connection = videoOutput.connection(with: AVFoundation.AVMediaType.video) else { return }
    guard connection.isVideoOrientationSupported else { return }
    guard connection.isVideoMirroringSupported else { return }
    connection.videoOrientation = .portrait
    connection.isVideoMirrored = position == .front
  }
  
  private func selectCaptureDevice() -> AVCaptureDevice? {
    return AVCaptureDevice.default(for: .video)
  }
  
  // MARK: Sample buffer to UIImage conversion
  private func imageFromSampleBuffer(sampleBuffer: CMSampleBuffer) -> UIImage? {
    guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return nil }
    let ciImage = CIImage(cvPixelBuffer: imageBuffer)
    guard let cgImage = context.createCGImage(ciImage, from: ciImage.extent) else { return nil }
    return UIImage(cgImage: cgImage)
  }
  
  // MARK: AVCaptureVideoDataOutputSampleBufferDelegate
  func captureOutput(_ captureOutput: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let uiImage = imageFromSampleBuffer(sampleBuffer: sampleBuffer) else { return }
    DispatchQueue.main.async { [unowned self] in
      self.delegate?.captured(image: uiImage)
    }
  }
}

Go back to ViewController. To get an image, classify it and display the top prediction on the screen, change the ViewController so it looks like this:

import UIKit
import CoreML
import Vision
import AVFoundation

class ViewController: UIViewController, FrameExtractorDelegate {
 
  var frameExtractor: FrameExtractor!
  
  @IBOutlet weak var previewImage: UIImageView!
  @IBOutlet weak var iSee: UILabel!

  var settingImage = false
  
  var currentImage: CIImage? {
    didSet {
      if let image = currentImage{
        self.detectScene(image: image)
      }
    }
  }
  
  override func viewDidLoad() {
    super.viewDidLoad()
    frameExtractor = FrameExtractor()
    frameExtractor.delegate = self
  }
  
  func captured(image: UIImage) {
    self.previewImage.image = image
    if let cgImage = image.cgImage, !settingImage {
      settingImage = true
      DispatchQueue.global(qos: .userInteractive).async {[unowned self] in
        self.currentImage = CIImage(cgImage: cgImage)
      }
    }
  }

  func addEmoji(id: String) -> String {
    switch id {
    case "pizza":
      return "🍕"
    case "hot dog":
      return "🌭"
    case "chicken wings":
      return "🍗"
    case "french fries":
      return "🍟"
    case "sushi":
      return "🍣"
    case "chocolate cake":
      return "🍫🍰"
    case "donut":
      return "🍩"
    case "spaghetti bolognese":
      return "🍝"
    case "caesar salad":
      return "🥗"
    case "macaroni and cheese":
      return "🧀"
    default:
      return ""
    }
  }
  func detectScene(image: CIImage) {
    guard let model = try? VNCoreMLModel(for: food().model) else {
      fatalError()
    }
    // Create a Vision request with completion handler
    let request = VNCoreMLRequest(model: model) { [unowned self] request, error in
      guard let results = request.results as? [VNClassificationObservation],
        let _ = results.first else {
          self.settingImage = false
          return
      }
      
      DispatchQueue.main.async { [unowned self] in
        if let first = results.first {
           if Int(first.confidence * 100) > 1 {
            self.iSee.text = "I see \(first.identifier) \(self.addEmoji(id: first.identifier))"
            self.settingImage = false
          }
        }
        
      }
    }
    let handler = VNImageRequestHandler(ciImage: image)
    DispatchQueue.global(qos: .userInteractive).async {
      do {
        try handler.perform([request])
      } catch {
        print(error)
      }
    }
  }
}

The code above is fairly straightforward: we set our ViewController to conform to FrameExtractorDelegate. We create an instance of FrameExtractor named frameExtractor, set its delegate to self and implement func captured(image: UIImage) to complete the delegate implementation.

We declare a CIImage variable named currentImage and set its value whenever captured returns an image. We add a didSet observer to currentImage so that whenever its value changes we call detectScene with the new image. Since captured fires more often than detectScene can finish, we use a boolean flag called settingImage to prevent new calls into detectScene before the previous one is done. The flag is set to true when a new image has been captured and back to false once detectScene has classified it; while it’s true, we skip the captured frame.

Build and deploy the app on a device running iOS 11. The first time the app runs it will ask for permission to use the camera. If you’ve been following this guide so far, your app will most likely crash with the error “The app’s Info.plist must contain an NSCameraUsageDescription key.” Since iOS 10 you need to provide a usage description, which is shown to the user when the popup for a specific permission is displayed. To fix this, add a string value describing your reason to the Info.plist file under the NSCameraUsageDescription key.
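If you prefer editing the file as text (right-click Info.plist and choose Open As, then Source Code), the entry looks like this; the description string below is just an example and can be anything that explains your use of the camera:

```xml
<key>NSCameraUsageDescription</key>
<string>SeeFood uses the camera to identify the dish in front of you.</string>
```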

Build and run. The app should launch; once permission to use the camera is granted, you should see what the camera sees, and the label should update to what CoreML thinks is in front of it.

Congratulations, you just trained and integrated your own image classifier and deployed it to an iOS device! As lengthy and convoluted as the process might appear, once you go through it you’ll realize it’s simple. Thanks to Caffe, DIGITS and now CoreML, the hard parts have been figured out, and it’s up to you to collect your dataset, train your models and build awesome apps. The amount of coding is minimal and the power is immense. Machine learning is the future, and I would love to see what you do with it. Feel free to hit me up on Twitter and show me your creations!

Click here to view repo for the full project

Glossary

Further readings

If you’ve made it this far, congratulations again. Although we’ve covered a lot, we haven’t even scratched the surface. Here are some suggestions as to where to go from here:

WWDC 2017 - Introducing Core ML
WWDC 2017 - Core ML in depth
WWDC 2017 - Vision Framework
Apple CoreML Documentation
Caffe: Convolutional Architecture for Fast Feature Embedding Caffe paper
ImageNet Classification with Deep Convolutional Neural Networks AlexNet paper
DIY Deep Learning for Vision: A hands on tutorial with Caffe Caffe Tutorial
Deep Learning for Computer Vision with Caffe and cuDNN
Hacker’s guide to Neural Networks Great article by Andrej Karpathy
Deep Learning using NVIDIA DIGITS 3 on EC2 by Satya Mallick
Matthijs Hollemans’ blog on Machine Learning