Trying Out People Occlusion with Unity AR Foundation

This post is the August 11 entry in the Unity Yurufuwa Summer Advent Calendar 2019 (#ゆるふわアドカレ).

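Overview

AR Foundation is distributed through Unity's Package Manager. With ARKit 3 support added recently, it has become a lot more exciting.

So this time I tried implementing People Occlusion with AR Foundation, and this post is my write-up.

The AR Foundation documentation is here: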

docs.unity3d.com

AR Foundation, including its samples, is published on GitHub.

github.com

Using those samples, I implemented something resembling People Occlusion. Here is the result:

I've created the people occlusion with Unity AR Foundation. But I don't know why depth texture is null when it set to full resolution. #Unity #ARKit #ARFoundation pic.twitter.com/pleDdNYUSe
- edom18@AR / MESON (@edo_m18), August 11, 2019

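Processing flow

This post walks through everything needed to get a simple-looking People Occlusion like the one in the tweet above.
The overall flow is not very complicated.

1) Gather the textures People Occlusion needs
- 1-1) Stencil, depth, and camera-image textures
2) Compare the depth of the AR objects with the depth of the person
3) Draw the result as a post effect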

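Gathering the textures People Occlusion needs

For step (1) of the flow, the textures can be obtained from the ARHumanBodyManager and ARCameraBackground classes that AR Foundation provides.

Assign ARHumanBodyManager and ARCameraBackground through the Inspector (or by other means); the textures can then be obtained as shown below.

Getting the stencil and depth textures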

// Mask texture whose value is `1` wherever a person is estimated to be
Texture2D humanStencil = _humanBodyManager.humanStencilTexture;

// Texture holding the depth values wherever a person is estimated to be (the unit appears to be meters)
Texture2D humanDepth = _humanBodyManager.humanDepthTexture;

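The stencil and depth textures are easy to get: they are simply properties of ARHumanBodyManager.

The camera image takes a little more work.

Getting the ARCamera image

The camera image is obtained through the ARCameraBackground class.
The camera image is needed separately because the rendered result already contains the AR objects. (If the depth texture could be used directly as a depth buffer, this step might be unnecessary.)

The documentation describes how to capture this image. Quoting it: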

Copying the Camera Texture to a Render Texture

The camera textures are likely External Textures and may not last beyond a frame boundary. It can be useful to copy the camera image to a Render Texture for persistence or further processing. This code will blit the camera image to a render texture of your choice:

Graphics.Blit(null, m_MyRenderTexture, m_ARBackgroundCamera.material);

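As this snippet shows, you can write the current camera image into a render texture you created yourself by passing the material property of ARCameraBackground to Graphics.Blit.

Comparing the depth of the AR objects and the person

Once the required data has been gathered, compare depths at the pixels estimated to be a person and, where necessary, draw the person's body in front of the AR objects.

The comparison is done in a shader, as follows.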

// Check the depth delta. A delta above zero means the pixel estimated to be a person is in front of the AR objects.
float sceneZ = LinearEyeDepth(tex2D(_CameraDepthTexture, i.uv));
float delta = saturate(sceneZ - depth);
if (delta > 0.0)
{
    return tex2D(_BackgroundTex, i.uv);
}
else
{
    return col;
}

The first line converts the value read from the depth buffer of the rendered 3D scene into linear depth.

The reason for the linear conversion comes from the following forum thread:

https://forum.unity.com/threads/how-to-setup-people-occlusion.691789/

The values in the depth buffer are in meters with the range [0, infinity) and need to be converted into the view space with the depth value [0, 1] mapped between the near & far clip plane.

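Strictly speaking, the quote says to normalize the depth texture side to 0 - 1. Since Unity's unit is also the meter, though, I figured that converting the scene depth to linear depth would let me compare both values directly in meters, so that is what I do here.

That said, because of float precision and similar issues, properly normalizing the depth texture side to 0 - 1 as described above might give cleaner results. (I'll look into that later.)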
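
For reference, the two options can be written out as equations. This is only a sketch under assumptions that are not in the original: a standard perspective projection, near and far clip distances n and f, and a depth-buffer value d that is 0 at the near plane and 1 at the far plane. The first line has the form that, as far as I can tell, LinearEyeDepth evaluates; the second is its inverse, which is what normalizing the human depth texture would require.

```latex
% Assumed symbols (not from the original post):
%   n, f : near / far clip distances in meters
%   z    : eye-space distance in meters
%   d    : depth-buffer value in [0, 1], d = 0 at the near plane
\begin{align}
  z &= \frac{1}{\left(\frac{1}{f} - \frac{1}{n}\right) d + \frac{1}{n}}
  && \text{(buffer value to meters)} \\
  d &= \frac{\frac{1}{z} - \frac{1}{n}}{\frac{1}{f} - \frac{1}{n}}
  && \text{(meters to buffer value)}
\end{align}
% Example with assumed n = 0.1 and f = 20: a person at z = 1.5 m gives
%   d = (0.6667 - 10) / (0.05 - 10) \approx 0.938
```

Because d increases monotonically with z, comparing in meters and comparing in buffer space give the same answer in exact arithmetic; they differ only in where rounding error ends up. On platforms where Unity uses a reversed depth buffer, d would be replaced by 1 - d.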

An article I wrote earlier may help in understanding these conversions:

edom18.hateblo.jp

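Drawing as a post effect

Finally, People Occlusion is implemented as a post effect based on this information.
Conceptually, the CG scene and the camera image are selected per pixel according to depth and then rendered.

Here is the full code, including the setup.
First, the shader.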

Shader "Hidden/PeopleOcclusion"
{
    Properties
    {
        _MainTex ("Texture", 2D) = "white" {}
    }
    SubShader
    {
        Cull Off ZWrite Off ZTest Always

        Pass
        {
            CGPROGRAM
            #pragma vertex vert
            #pragma fragment frag

            #include "UnityCG.cginc"

            struct appdata
            {
                float4 vertex : POSITION;
                float2 uv : TEXCOORD0;
            };

            struct v2f
            {
                float2 uv : TEXCOORD0;
                float4 vertex : SV_POSITION;
            };

            v2f vert (appdata v)
            {
                v2f o;
                o.vertex = UnityObjectToClipPos(v.vertex);
                o.uv = v.uv;
                return o;
            }

            sampler2D _MainTex;
            sampler2D _BackgroundTex;
            sampler2D _DepthTex;
            sampler2D _StencilTex;

            UNITY_DECLARE_DEPTH_TEXTURE(_CameraDepthTexture);

            fixed4 frag (v2f i) : SV_Target
            {
                fixed4 col = tex2D(_MainTex, i.uv);

                float2 uv = i.uv;

                // Flip x axis.
                uv.x = 1.0 - uv.x;

                // Correct the aspect ratio of the textures obtained from ARHumanBodyManager to match the screen.
                float ratio = 1.62;
                uv.y /= ratio;
                uv.y += 1.0 - (ratio * 0.5);

                float stencil = tex2D(_StencilTex, uv).r;
                if (stencil < 0.9)
                {
                    return col;
                }

                // Check the depth delta. A delta above zero means the pixel estimated to be a person is in front of the AR objects.
                float depth = tex2D(_DepthTex, uv).r;
                float sceneZ = LinearEyeDepth(tex2D(_CameraDepthTexture, i.uv));
                float delta = saturate(sceneZ - depth);
                if (delta > 0.0)
                {
                    return tex2D(_BackgroundTex, i.uv);
                }
                else
                {
                    return col;
                }
            }
            ENDCG
        }
    }
}

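The shader only compares depth values and switches which texture it fetches from, so it is not very long.
As mentioned above, the important part is the following.

// Check the depth delta. A delta above zero means the pixel estimated to be a person is in front of the AR objects.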
float depth = tex2D(_DepthTex, uv).r;
float sceneZ = LinearEyeDepth(tex2D(_CameraDepthTexture, i.uv));
float delta = saturate(sceneZ - depth);
if (delta > 0.0)
{
    return tex2D(_BackgroundTex, i.uv);
}

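One thing to note is that the UV values are adjusted slightly.
This is because the stencil and depth textures obtained from AR Foundation differ in size and aspect ratio from the device's screen. (They are also mirrored horizontally, so that is corrected at the same time.)

The UVs are therefore adjusted so that the aspect ratios line up.
The correction code is as follows.

// Correct the aspect ratio of the textures obtained from ARHumanBodyManager to match the screen.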
float ratio = 1.62;
uv.y /= ratio;
uv.y += 1.0 - (ratio * 0.5);

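The value 1.62 converts from the aspect ratio of 256x192 to that of 2688x1242: (2688 / 1242) / (256 / 192) ≈ 1.62.
Incidentally, the stencil and depth textures can be requested at three resolutions: standard resolution, half resolution, and full resolution. The textures seem to have the same aspect ratio at every resolution, so this calculation should work for all of them.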
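
If the ratio should not be hardcoded, it could be derived from the texture and screen sizes at runtime. The following is a minimal sketch in plain C#. The class OcclusionUv and its methods are hypothetical helpers introduced here for illustration; they are not part of AR Foundation or of the code in this post. It also assumes that both sizes are measured in landscape and that the texture is cropped evenly at the top and bottom.

```csharp
using System;

// Hypothetical helper for illustration only (not part of AR Foundation).
public static class OcclusionUv
{
    // Ratio between the screen aspect ratio and the aspect ratio of the
    // human stencil/depth texture, both given as width x height.
    public static float ComputeRatio(int textureWidth, int textureHeight,
                                     int screenWidth, int screenHeight)
    {
        float textureAspect = (float)textureWidth / textureHeight;
        float screenAspect = (float)screenWidth / screenHeight;
        return screenAspect / textureAspect;
    }

    // Maps a screen-space V in [0, 1] onto the vertically centered band of
    // the texture that is visible on screen (assumes an even crop at the
    // top and bottom).
    public static float RemapV(float v, float ratio)
    {
        return (v - 0.5f) / ratio + 0.5f;
    }
}
```

With the sizes from this post, ComputeRatio(256, 192, 2688, 1242) comes out at roughly 1.623. In Unity, the result could be handed to the shader with Material.SetFloat under a property name of your choosing. Note that the centered offset used here, 0.5 * (1 - 1 / ratio), and the shader's offset, 1.0 - ratio * 0.5, agree only when the ratio is close to 1.62, so the centered form may be the safer choice on screens with a different aspect ratio.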

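After that, the depth texture value received from AR Foundation is compared with the depth of the rendered 3D scene. Where the pixel estimated to be a person is judged to be in front of the 3D scene, the camera image is used; otherwise the rendered 3D scene is output unchanged.

The C# code that sets all of this up is as follows.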

using System.Text;
using UnityEngine;
using UnityEngine.UI;
using UnityEngine.XR.ARFoundation;

public class PeopleOcclusion : MonoBehaviour
{
    [SerializeField, Tooltip("The ARHumanBodyManager which will produce frame events.")]
    private ARHumanBodyManager _humanBodyManager;

    [SerializeField]
    private Material _material = null;

    [SerializeField]
    private ARCameraBackground _arCameraBackground = null;

    [SerializeField]
    private RawImage _captureImage = null;

    private RenderTexture _captureTexture = null;

    public ARHumanBodyManager HumanBodyManager
    {
        get { return _humanBodyManager; }
        set { _humanBodyManager = value; }
    }

    [SerializeField]
    private RawImage _rawImage;

    /// <summary>
    /// The UI RawImage used to display the image on screen.
    /// </summary>
    public RawImage RawImage
    {
        get { return _rawImage; }
        set { _rawImage = value; }
    }

    [SerializeField]
    private Text _imageInfo;

    /// <summary>
    /// The UI Text used to display information about the image on screen.
    /// </summary>
    public Text ImageInfo
    {
        get { return _imageInfo; }
        set { _imageInfo = value; }
    }

    #region ### MonoBehaviour ###
    private void Awake()
    {
        Camera camera = GetComponent<Camera>();
        camera.depthTextureMode |= DepthTextureMode.Depth;

        _rawImage.texture = _humanBodyManager.humanDepthTexture;

        _captureTexture = new RenderTexture(Screen.width, Screen.height, 0);
        _captureImage.texture = _captureTexture;
    }
    #endregion ### MonoBehaviour ###

    private void LogTextureInfo(StringBuilder stringBuilder, string textureName, Texture2D texture)
    {
        stringBuilder.AppendFormat("texture : {0}\n", textureName);
        if (texture == null)
        {
            stringBuilder.AppendFormat("   <null>\n");
        }
        else
        {
            stringBuilder.AppendFormat("   format : {0}\n", texture.format.ToString());
            stringBuilder.AppendFormat("   width  : {0}\n", texture.width);
            stringBuilder.AppendFormat("   height : {0}\n", texture.height);
            stringBuilder.AppendFormat("   mipmap : {0}\n", texture.mipmapCount);
        }
    }

    private void Update()
    {
        var subsystem = _humanBodyManager.subsystem;

        if (subsystem == null)
        {
            if (_imageInfo != null)
            {
                _imageInfo.text = "Human Segmentation not supported.";
            }
            return;
        }

        StringBuilder sb = new StringBuilder();
        Texture2D humanStencil = _humanBodyManager.humanStencilTexture;
        Texture2D humanDepth = _humanBodyManager.humanDepthTexture;
        LogTextureInfo(sb, "stencil", humanStencil);
        LogTextureInfo(sb, "depth", humanDepth);

        if (_imageInfo != null)
        {
            _imageInfo.text = sb.ToString();
        }

        _material.SetTexture("_StencilTex", humanStencil);
        _material.SetTexture("_DepthTex", humanDepth);
        _material.SetTexture("_BackgroundTex", _captureTexture);
    }

    private void LateUpdate()
    {
        if (_arCameraBackground.material != null)
        {
            Graphics.Blit(null, _captureTexture, _arCameraBackground.material);
        }
    }

    private void OnRenderImage(RenderTexture src, RenderTexture dest)
    {
        Graphics.Blit(src, dest, _material);
    }
}