Unity AR Foundation and CoreML: Hand detection and tracking

Last updated on Dec 15, 2019

This post was first published on Medium.

0x00 Description

The AR Foundation package in Unity wraps low-level APIs such as ARKit and ARCore into a cohesive whole.

CoreML is a framework for integrating machine learning models into apps on the iOS platform.

This article, together with the demo project linked at the end, shows how to make CoreML work with AR Foundation in Unity. With AR Foundation in Unity and CoreML on iOS, we can interact with virtual objects using our hands.

This article builds on Gil Nakache’s article and uses the same mlmodel. His article describes how to implement hand detection on the native iOS platform with Swift.

Version

Unity Version: 2018.3.13f1

Xcode Version: 10.2.1

The ARFoundation Plugin: 1.5.0-preview.5

iPhone 7: iOS 12.3.1

0x01 Implementation

Import AR Foundation Plugin

For convenience, I import the packages locally. This is simple: edit the manifest.json file in the project's Packages folder and add the local packages to it.

    "com.unity.xr.arfoundation": "file:../ARPackages/com.unity.xr.arfoundation",
    "com.unity.xr.arkit": "file:../ARPackages/com.unity.xr.arkit"

After importing the AR Foundation plugin, we can create the related components in the scene, such as AR Session and AR Session Origin.

Then, in our script, we subscribe to the camera manager's frameReceived event to receive the data for each frame.

    if (m_CameraManager != null)
    {
        m_CameraManager.frameReceived += OnCameraFrameReceived;
    }

Create a Swift plugin for Unity

For C# to communicate with Swift, we need to create an Objective-C file as a bridge.

In this way, C# can call a function in Objective-C via [DllImport("__Internal")], and Objective-C then calls the Swift methods that are exposed with @objc. After importing UnityInterface.h, Swift can call the UnitySendMessage function to pass data back to C#.
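The C# end of this bridge can be sketched as follows. This is a minimal sketch and not the demo project's actual code: the class name HandDetectorNative and the native function name HandDetector_StartDetect are hypothetical, and the fallback branch for non-iOS builds is an assumption added so the script also compiles and runs in the editor.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical managed wrapper around the native hand detection plugin.
public static class HandDetectorNative
{
#if UNITY_IOS && !UNITY_EDITOR
    // iOS plugins are statically linked into the app, hence "__Internal".
    // HandDetector_StartDetect is an assumed name for an extern "C" function
    // exported by the Objective-C bridge file.
    [DllImport("__Internal")]
    [return: MarshalAs(UnmanagedType.I1)] // native bool is one byte
    private static extern bool HandDetector_StartDetect(IntPtr nativeFramePtr);
#else
    // Assumed fallback: without the native plugin, report that nothing started.
    private static bool HandDetector_StartDetect(IntPtr nativeFramePtr)
    {
        return false;
    }
#endif

    // Returns true if the native side accepted the frame for detection.
    public static bool StartDetect(IntPtr nativeFramePtr)
    {
        if (nativeFramePtr == IntPtr.Zero)
        {
            return false; // never hand a null pointer to native code
        }
        return HandDetector_StartDetect(nativeFramePtr);
    }
}
```

Keeping the extern declaration private and exposing a small wrapper means the null check and the platform switch live in one place.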

There is a sample project that demonstrates how to create a Swift plugin for Unity and print “Hello, I’m Swift” in Unity.

In the Unity-ARFoundation-HandDetection project, the structure of the plugins folder is as follows:

<Plugins>
  └── iOS
      ├── HandDetector
      │   ├── Native
      │   │   ├── HandDetector.swift
      │   │   └── HandDetectorBridge.mm
      │   └── Managed
      │       └── HandDetector.cs
      └── Unity

Note, however, that the Xcode project exported by Unity does not specify a Swift version.

You can either set the version manually in Xcode or write a script in Unity that sets it automatically.

Import mlmodel

Add the HandModel to the Xcode project, and Xcode will automatically generate an Objective-C model class for it. Since I want the mlmodel to generate a Swift class instead, change Build Settings/CoreML Model Compiler - Code Generation Language from Auto to Swift.

Then we get an automatically generated Swift model class called HandModel.

Of course, if you don’t want to add it manually, you can also add it automatically through a post-process build script in Unity.

How to get the ARFrame ptr from AR Foundation

After completing the above steps, the basic framework is built. Next, we will use CoreML to implement hand detection and tracking.

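@objc func startDetection(buffer: CVPixelBuffer) -> Bool {
    // Keep a reference to the buffer until the Vision request has finished.
    self.retainedBuffer = buffer
    // The .right orientation assumes the device is held in portrait.
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: buffer, orientation: .right)

    // Perform the request asynchronously on visionQueue.
    visionQueue.async {
        do {
            // Release the buffer whether the request succeeds or throws.
            defer { self.retainedBuffer = nil }
            try imageRequestHandler.perform([self.predictionRequest])
        } catch {
            fatalError("Perform Failed:\"\(error)\"")
        }
    }

    return true
}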

In Swift, we need a CVPixelBuffer to create a VNImageRequestHandler to perform the hand detection. Usually we can get it from ARFrame.

CVPixelBufferRef buffer = frame.capturedImage;

The next question, therefore, is how to get ARKit's ARFrame pointer from AR Foundation in C# and then pass that pointer to the hand detection plugin in Swift.

In AR Foundation, you can get a nativePtr from an XRCameraFrame, which points to a struct on the ARKit side that looks like this:

typedef struct UnityXRNativeFrame_1
{
    int version;
    void* framePtr;
} UnityXRNativeFrame_1;

and the framePtr points to the latest ARFrame.
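To make the layout concrete, the same struct can be mirrored in C#, for example to sanity-check the pointer before handing it to native code. This is only a sketch under the assumption that the native layout is exactly the one shown above; the names NativeFrame and NativeFrameReader are hypothetical, and the article itself dereferences the pointer on the Objective-C side.

```csharp
using System;
using System.Runtime.InteropServices;

// Managed mirror of UnityXRNativeFrame_1. Field order must match the native struct.
[StructLayout(LayoutKind.Sequential)]
public struct NativeFrame
{
    public int version;
    public IntPtr framePtr; // void* framePtr on the native side
}

public static class NativeFrameReader
{
    // Copies the struct out of native memory.
    // Returns true only if both nativePtr and the contained framePtr are non-null.
    public static bool TryRead(IntPtr nativePtr, out NativeFrame frame)
    {
        frame = default(NativeFrame);
        if (nativePtr == IntPtr.Zero)
        {
            return false;
        }

        frame = (NativeFrame)Marshal.PtrToStructure(nativePtr, typeof(NativeFrame));
        return frame.framePtr != IntPtr.Zero;
    }
}
```

In a Unity script this could be called as NativeFrameReader.TryRead(frame.nativePtr, out nativeFrame) to log the version field while debugging.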

Specifically, we can call TryGetLatestFrame, defined in XRCameraSubsystem, to get an XRCameraFrame instance.

cameraManager.subsystem.TryGetLatestFrame(cameraParams, out frame)

Then pass the nativePtr from C# to Objective-C.

m_HandDetector.StartDetect(frame.nativePtr);
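Camera frames arrive every frame, while the detection result comes back asynchronously, so it may be worth skipping frames while a detection is still running. The following is a minimal sketch of such a guard. DetectionGate is a hypothetical helper and is not part of the demo project.

```csharp
using System.Threading;

// Hypothetical helper that allows only one detection request in flight at a time.
public sealed class DetectionGate
{
    private int m_Busy; // 0 = idle, 1 = a detection is in flight

    // Returns true only for the caller that moves the gate from idle to busy.
    public bool TryBegin()
    {
        return Interlocked.CompareExchange(ref m_Busy, 1, 0) == 0;
    }

    // Call this when the result (or failure) message arrives from native code.
    public void End()
    {
        Interlocked.Exchange(ref m_Busy, 0);
    }
}
```

With this in place, the frame callback would call StartDetect only when TryBegin() returns true, and the method that receives the OnHandDetecedFromNative message would call End().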

In Objective-C, we receive a UnityXRNativeFrame_1 pointer, from which we can get the ARFrame pointer.

    UnityXRNativeFrame_1* unityXRFrame = (UnityXRNativeFrame_1*) ptr;
    ARFrame* frame = (__bridge ARFrame*)unityXRFrame->framePtr;

    CVPixelBufferRef buffer = frame.capturedImage;

Once the ARFrame is acquired, the rest is ordinary iOS development. Create a VNImageRequestHandler object and perform the detection. When the detection completes, the detectionCompleteHandler callback is invoked and passes the detection result to Unity via UnitySendMessage.

private func detectionCompleteHandler(request: VNRequest, error: Error?) {

    DispatchQueue.main.async {

        // An empty message signals to the C# side that no hand position is available.
        if let error = error {
            UnitySendMessage(self.callbackTarget, "OnHandDetecedFromNative", "")
            fatalError("Detection failed: \(error)")
        }

        guard let observation = self.predictionRequest.results?.first as? VNPixelBufferObservation else {
            UnitySendMessage(self.callbackTarget, "OnHandDetecedFromNative", "")
            fatalError("Unexpected result type from VNCoreMLRequest")
        }

        // The model output is an image; search it for the top point of the hand.
        let outBuffer = observation.pixelBuffer

        guard let point = outBuffer.searchTopPoint() else {
            UnitySendMessage(self.callbackTarget, "OnHandDetecedFromNative", "")
            return
        }

        // Send the point to Unity as "x,y".
        UnitySendMessage(self.callbackTarget, "OnHandDetecedFromNative", "\(point.x),\(point.y)")
    }
}

Then we will get the position data in viewport space.

Viewport space is normalized and relative to the camera. The bottom-left of the viewport is (0,0); the top-right is (1,1). The z position is in world units from the camera.

Once we have the position in viewport space, we transform it to world space with Unity's ViewportToWorldPoint function. The function takes a vector whose x and y components come from the hand detection and whose z component is the distance from the camera at which the resulting point should lie.

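   var handPos = new Vector3();
   handPos.x = pos.x;
   // Flip y: the viewport origin (0,0) is at the bottom-left.
   handPos.y = 1 - pos.y;
   // Distance from the camera in world units (m_Cam.nearClipPlane is an alternative).
   handPos.z = 4;
   var handWorldPos = m_Cam.ViewportToWorldPoint(handPos);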
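The pos value used above has to be recovered from the string sent through UnitySendMessage, which is either empty or of the form "x,y". A minimal sketch of that parsing step follows. HandMessage and TryParse are hypothetical names, and the demo project may parse the message differently.

```csharp
using System.Globalization;

// Hypothetical parser for the "x,y" message sent from the native plugin.
public static class HandMessage
{
    // Returns false for an empty or malformed message, true with the raw x and y otherwise.
    // The values are returned as sent; flipping y is left to the caller.
    public static bool TryParse(string message, out float x, out float y)
    {
        x = 0f;
        y = 0f;
        if (string.IsNullOrEmpty(message))
        {
            return false; // the native side sends "" when no hand was found
        }

        string[] parts = message.Split(',');
        if (parts.Length != 2)
        {
            return false;
        }

        // Invariant culture: the native side always formats numbers with '.' as the decimal separator.
        return float.TryParse(parts[0], NumberStyles.Float, CultureInfo.InvariantCulture, out x)
            && float.TryParse(parts[1], NumberStyles.Float, CultureInfo.InvariantCulture, out y);
    }
}
```

Parsing with the invariant culture avoids surprises on devices whose locale uses a comma as the decimal separator.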

We can create a new object in Unity with the world space position or move the old object to the world space position. In other words, the position of the object is controlled according to the position of the hand.

Post Process Build

As I said above, we can write a C# script in Unity that automatically sets properties of the generated Xcode project. For example, we can set the Swift version in the Build Settings of the Xcode project. We can even add the mlmodel file to a build phase, such as the Compile Sources phase. For this we use the PBXProject class, defined in the UnityEditor.iOS.Xcode namespace, which provides many useful functions such as AddBuildProperty, SetBuildProperty and AddSourcesBuildPhase.

[PostProcessBuild]
public static void OnPostProcessBuild(BuildTarget buildTarget, string path)
{
    if(buildTarget != BuildTarget.iOS)
    {
        return;
    }

    string projPath = path + "/Unity-iPhone.xcodeproj/project.pbxproj";
    
    var proj = new PBXProject();
    proj.ReadFromFile(projPath);
    var targetGUID = proj.TargetGuidByName("Unity-iPhone");

    // Set Xcode project properties.
    proj.AddBuildProperty(targetGUID, "SWIFT_VERSION", "4.0");
    proj.SetBuildProperty(targetGUID, "SWIFT_OBJC_BRIDGING_HEADER", "Libraries/Plugins/iOS/HandDetector/Native/HandDetector.h");
    proj.SetBuildProperty(targetGUID, "SWIFT_OBJC_INTERFACE_HEADER_NAME","HandDetector-Swift.h");
    proj.SetBuildProperty(targetGUID, "COREML_CODEGEN_LANGUAGE", "Swift");

    // Add HandModel.mlmodel to the Compile Sources build phase of the Xcode project.
    var buildPhaseGUID = proj.AddSourcesBuildPhase(targetGUID);
    var handModelPath = Application.dataPath + "/../CoreML/HandModel.mlmodel";
    var fileGUID = proj.AddFile(handModelPath, "/HandModel.mlmodel");
    proj.AddFileToBuildSection(targetGUID, buildPhaseGUID, fileGUID);
    
    proj.WriteToFile(projPath);

}

0x02 Conclusion

With the AR Foundation in Unity and the CoreML, we can let Unity Chan stand on our fingers.

This article is a brief description of how to integrate CoreML with AR Foundation. I hope you can use them to make more interesting content.

Here is the demo project used in the article.

Unity-ARFoundation-HandDetection

Useful Links

https://heartbeat.fritz.ai/hand-detection-with-core-ml-and-arkit-f4c8da98e88e

https://medium.com/@kevinhuyskens/implementing-swift-in-unity-53e0b668f895

Jiadong Chen