AIE Backend Bug: Type Mismatch In Dynamic Subview

by ADMIN

Hey guys, let's dive into a pretty tricky issue we've run into with the AIE backend. This is all about a type mismatch that pops up when we're dealing with dynamic subviews and the way our static function interfaces are set up. We're gonna break down the problem, how to see it in action, and why it's happening. This is a technical deep dive, so buckle up!

The Core Issue: Type Mismatch

Alright, so the main headache here is that the AIE backend struggles with dynamic memref types in function interfaces. A bit of background: the AIE backend compiles our code for AI Engine hardware. A memref represents a region of memory, and its type records the shape, strides, and offset of that region. A dynamic memref leaves at least one of those unknown at compile time (MLIR prints the unknown value as ?). A static memref, on the other hand, has all of them fixed at compile time.
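To make the static vs. dynamic distinction concrete, here's a minimal Python sketch that models a strided memref type. The MemRefType class and its fields are hypothetical names made up for illustration (they are not part of the MLIR or MLIR-AIE Python APIs), and None stands in for MLIR's ?.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical model of a strided memref type; None plays the role of MLIR's `?`.
@dataclass(frozen=True)
class MemRefType:
    shape: Tuple[Optional[int], ...]
    strides: Tuple[Optional[int], ...]
    offset: Optional[int] = 0
    elem: str = "i16"

    def is_static(self) -> bool:
        """True only if every dimension, every stride, and the offset are known."""
        return None not in (*self.shape, *self.strides, self.offset)

    def __str__(self) -> str:
        def q(v: Optional[int]) -> str:
            return "?" if v is None else str(v)

        dims = "x".join(q(d) for d in self.shape)
        strides = ", ".join(q(s) for s in self.strides)
        return (f"memref<{dims}x{self.elem}, "
                f"strided<[{strides}], offset: {q(self.offset)}>>")


# Same shape and strides; only the offset differs.
static_buf = MemRefType(shape=(16, 16), strides=(16, 1), offset=0)
dynamic_subview = MemRefType(shape=(16, 16), strides=(16, 1), offset=None)

print(static_buf.is_static())
print(dynamic_subview.is_static())
print(dynamic_subview)
```

Note that a single unknown field is enough to make the whole type dynamic, even when the shape and strides are fully known.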

The crux of the problem is that MLIR-AIE, the part of our system that handles the low-level details of the AIE hardware, doesn't fully support these dynamic memref types in function interfaces. This means that when we try to pass a dynamic subview (a portion of a larger memory region) to a function, we run into trouble.

So, what's the workaround? Well, we currently choose to allocate new memref objects. These are created with types that match the expected static memref type, and then used as arguments. It's a way of tricking the system into working, but it's not ideal. It would be far better if we could directly pass the dynamic subviews, but the MLIR-AIE limitation gets in the way.
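Here's a rough Python sketch of the idea behind the workaround. It assumes the data is copied element by element from the strided window into the freshly allocated buffer, which is an assumption about the mechanics rather than a description of the actual implementation. The function name copy_subview_to_static and the 32x16 parent buffer are made up for illustration.

```python
from typing import List, Tuple


def copy_subview_to_static(backing: List[int], offset: int,
                           strides: Tuple[int, int] = (16, 1),
                           shape: Tuple[int, int] = (16, 16)) -> List[int]:
    """Copy a strided window of `backing` into a new contiguous buffer.

    The returned buffer always has the same fixed layout (row-major, offset 0),
    no matter which runtime offset the window started at.
    """
    rows, cols = shape
    fresh = [0] * (rows * cols)  # stands in for the newly allocated static memref
    for r in range(rows):
        for c in range(cols):
            src = offset + r * strides[0] + c * strides[1]
            fresh[r * cols + c] = backing[src]
    return fresh


# Hypothetical 32x16 parent buffer; the window's start row is a runtime value.
parent = list(range(32 * 16))
start_row = 5
window = copy_subview_to_static(parent, offset=start_row * 16)
print(len(window), window[0], window[-1])
```

The price of this trick is the extra allocation and copy, which is exactly why passing the subview directly would be preferable.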

The Root Cause

This workaround exists purely because of MLIR-AIE. Since the framework doesn't support dynamic memref types in function interfaces, a dynamic subview can't be passed to a function directly. Allocating new memref objects that conform to the expected static memref type sidesteps the incompatibility and lets the function calls go through.

Reproducing the Bug: Step-by-Step

Now, let's see how to trigger this bug and get the error message. You can find the necessary files in the top.zip archive, available at the provided link. Here’s a quick guide:

  1. Get the Files: Download and unpack the top.zip file. The archive contains top.mlir, the input file that triggers the error.

  2. Set up the Environment: Make sure you have all the necessary tools installed. This includes the AIE compiler driver (aiecc.py), the MLIR tools, and the required environment variables, particularly $PEANO_INSTALL_DIR, which the command below passes to --peano.

  3. Run the Compilation Command: Use the provided command to compile the top.mlir file. Pay close attention to the arguments; they configure the compiler and specify how the code should be generated. Here's the command:

    aiecc.py --alloc-scheme=basic-sequential --aie-generate-xclbin --no-compile-host --xclbin-name=build/final.xclbin --no-xchesscc --no-xbridge --peano $PEANO_INSTALL_DIR --aie-generate-npu-insts --npu-insts-name=insts.txt top.mlir
    

    This command generates an XCLBIN file (the binary loaded onto the hardware) at build/final.xclbin and the NPU instruction file insts.txt. It also skips host compilation and uses the Peano toolchain at $PEANO_INSTALL_DIR rather than xchesscc and xbridge.

  4. Observe the Error: If all goes as expected, you should see an error message during the compilation process. The error message indicates a type mismatch, specifically that the llvm.call operation is unable to find a valid LLVM function due to the type incompatibility between the dynamic subview and the expected static type.

By following these steps, you should be able to reproduce the bug and experience the error firsthand.
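If you'd rather script the repro, here's a small Python sketch that assembles the same command as an argument list. The flags are copied verbatim from step 3. The helper name build_aiecc_command is made up, and the subprocess call is left commented out because it needs the tools from step 2 on your PATH.

```python
import os
import shlex
from typing import List


def build_aiecc_command(mlir_file: str, peano_dir: str) -> List[str]:
    """Return the aiecc.py invocation from step 3 as an argument list."""
    return [
        "aiecc.py",
        "--alloc-scheme=basic-sequential",
        "--aie-generate-xclbin",
        "--no-compile-host",
        "--xclbin-name=build/final.xclbin",
        "--no-xchesscc",
        "--no-xbridge",
        "--peano", peano_dir,
        "--aie-generate-npu-insts",
        "--npu-insts-name=insts.txt",
        mlir_file,
    ]


# Fall back to the literal variable name if the environment variable is unset.
peano = os.environ.get("PEANO_INSTALL_DIR", "$PEANO_INSTALL_DIR")
cmd = build_aiecc_command("top.mlir", peano)
print(shlex.join(cmd))

# To actually run it (requires the tools from step 2):
#   import subprocess
#   result = subprocess.run(cmd, capture_output=True, text=True)
#   print(result.stderr)
```

Passing the command as a list avoids shell quoting problems if the Peano install path contains spaces.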

The Error Message Breakdown

If you get the error, you'll see something like this:

MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:-- 0:00:00 0/1 1 Worker
/top.prj/top.mlir.prj/input_with_addresses.mlir:388:7: error: 'llvm.call' op 'add_i16_vector' does not reference a valid LLVM function
    func.call @add_i16_vector(%fifo_4_buff_0, %subview, %_anonymous3) {lib = "add_i16_vector"} : (memref<16x16xi16>, memref<16x16xi16, strided<[16, 1], offset: ?>>, memref<16x16xi16>) -> ()
    ^
/top.prj/top.mlir.prj/input_with_addresses.mlir:388:7: note: see current operation: "llvm.call"(%795, %796, %797) <{CConv = #llvm.cconv<ccc>, TailCallKind = #llvm.tailcallkind<none>, callee = @add_i16_vector, fastmathFlags = #llvm.fastmath<none>, op_bundle_sizes = array<i32>, operandSegmentSizes = array<i32: 3, 0>}> {lib = "add_i16_vector"} : (!llvm.ptr, !llvm.ptr, !llvm.ptr) -> ()
Error encountered while running: aie-opt --pass-pipeline=builtin.module(aie.device(aie-localize-locks,aie-normalize-address-spaces),aie-standard-lowering,aiex-standard-lowering,canonicalize,cse,convert-vector-to-llvm,expand-strided-metadata,lower-affine,convert-math-to-llvm,convert-index-to-llvm,arith-expand,convert-arith-to-llvm,finalize-memref-to-llvm,convert-func-to-llvm{ use-bare-ptr-memref-call-conv=1 },convert-cf-to-llvm,canonicalize,cse) /top.prj/top.mlir.prj/input_with_addresses.mlir -o /top.prj/top.mlir.prj/input_opt_with_addresses.mlir

Let's break this down:

  • The Problem: The error message clearly states that the llvm.call operation, which is used to call the add_i16_vector function, cannot find a valid LLVM function. This is the core issue.
  • Type Mismatch: The func.call line in the log shows the argument types. The second argument, %subview, has type memref<16x16xi16, strided<[16, 1], offset: ?>>. Its shape and strides are static, but its offset is dynamic (the ?), which makes it a dynamic subview. The other two arguments are plain memref<16x16xi16>, and the function interface expects a static memref type for the second one as well.
  • Compilation Failure: The compilation process ultimately fails due to this type incompatibility. The aie-opt tool, which runs the lowering pass pipeline shown on the last line of the log, hits this error and cannot proceed.
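When the log gets long, it can help to pick out the offending operand automatically. Here's a small Python sketch that takes the func.call line from the error above and reports which operand types contain a ?. The helper names are made up, and the parsing is deliberately simple: it assumes the line has the same `... : (types) -> results` layout as the one in this log.

```python
from typing import List, Tuple


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside <...>, [...] or (...)."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "<[(":
            depth += 1
        elif ch in ">])":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current).strip())
    return parts


def dynamic_operands(call_line: str) -> List[Tuple[int, str]]:
    """Return (index, type) for every operand type that contains a `?`."""
    signature = call_line.rsplit(" : ", 1)[1]             # "(types) -> results"
    operand_types = signature.split(" -> ")[0].strip()[1:-1]  # drop outer parens
    types = split_top_level(operand_types)
    return [(i, t) for i, t in enumerate(types) if "?" in t]


line = ('func.call @add_i16_vector(%fifo_4_buff_0, %subview, %_anonymous3) '
        '{lib = "add_i16_vector"} : (memref<16x16xi16>, '
        'memref<16x16xi16, strided<[16, 1], offset: ?>>, memref<16x16xi16>) -> ()')

for index, type_str in dynamic_operands(line):
    print(f"operand {index} is dynamic: {type_str}")
```

For the line in this log, only the operand at index 1 (%subview) is flagged.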

Additional Context: Past Discussions

This issue has been discussed before, like in this thread: https://github.com/cornell-zhang/allo/pull/432#issuecomment-3393768288. It's a good place to see the broader discussion and potential future solutions.

This bug isn't just a minor inconvenience; it directly impacts how we can use subviews and dynamic memory in our AIE code. Addressing it will require deeper changes in how MLIR-AIE handles function interfaces and memory types. We'll keep you posted as we make progress on this! Thanks for sticking around, guys!