Exercises: CAN Bus — Encoder Velocity Frames

Covers: Deep-dive session 07 — SN65HVD230 wiring, Zephyr CAN API, encoder frames, bus-off recovery

Section A — Conceptual Questions

A1. Before writing a single line of firmware, you want to verify the physical CAN bus is healthy. Describe the exact sequence of three checks: one multimeter check (power off), one cansend/candump verification (two-terminal test), and what each result tells you.

Answer

**1. Multimeter check (power off):** Place probes across CANH and CANL. Measure resistance. Expected: ~60Ω (two 120Ω termination resistors in parallel). If you read >200Ω, one or both terminators are missing or open. If you read <30Ω, there is a short between the lines. The SN65HVD230 is 3.3V-compatible — do NOT use TJA1050 (5V only). **2. candump + cansend (two terminals, host Linux):** Terminal 1: `candump can0` Terminal 2: `cansend can0 123#DEADBEEF` Terminal 1 should echo the frame back (including its own frame as reflected by the bus). If nothing appears, the interface is off — run `sudo ip link set can0 up type can bitrate 500000` first. **3. What each result tells you:** - 60Ω + candump echo → physical layer is healthy before any firmware is involved. - No echo despite 60Ω → check D/R pin logic level on SN65HVD230 (must be driven low to enable Rx). - Frame CRC errors in `dmesg | grep can` → wiring capacitance or missing stub termination.

A2. The can_add_rx_filter_msgq() API takes a struct can_filter. Explain the difference between setting filter.mask = 0x7FF versus filter.mask = 0x000. Which do you use to receive only the encoder frame with ID 0x101, and why?

Answer

`filter.mask` is the bits that **must match** the `filter.id`. A `1` in a mask bit means "this bit is checked"; a `0` means "don't care". - `filter.mask = 0x7FF` → all 11 standard ID bits are checked → only frames with exactly `filter.id` pass through. - `filter.mask = 0x000` → no bits are checked → every frame passes through (promiscuous). To receive only `0x101`:

struct can_filter enc_filter = {
    .flags = 0,
    .id    = 0x101,
    .mask  = CAN_STD_ID_MASK   /* 0x7FF */
};
can_add_rx_filter_msgq(dev, &enc_msgq, &enc_filter);

Using `0x000` as mask would also deliver encoder frames, but the node would receive every frame on the bus (motor commands, battery status, etc.) and the message queue would fill immediately at any bus activity.

A3. Your CAN bus enters BUS-OFF state. Explain: 1. What sequence of hardware events caused BUS-OFF (transmit error counter)? 2. What does the Zephyr can_set_state_change_callback() callback receive? 3. Why is calling can_recover() directly inside the state-change callback dangerous?

Answer

**1. Sequence causing BUS-OFF:** The CAN controller maintains a transmit error counter (TEC). Each failed transmission increments TEC by 8. When TEC exceeds 255, the controller enters BUS-OFF and disconnects itself from the bus (stops both transmitting and receiving) to protect other nodes. Common causes: missing or incorrect termination, wrong bitrate, loose wiring, or the STM32 transmitting while the transceiver's RS/S pin (SN65HVD230) is in standby mode. **2. Callback receives:** `struct can_bus_err_cnt` with `tx_err_cnt` and `rx_err_cnt`, plus a `state` parameter of type `enum can_state` (one of: `CAN_STATE_ERROR_ACTIVE`, `CAN_STATE_ERROR_PASSIVE`, `CAN_STATE_BUS_OFF`, `CAN_STATE_STOPPED`). **3. Danger of calling `can_recover()` in the callback:** `can_set_state_change_callback()` fires from **interrupt context**. `can_recover()` is a blocking call that initiates a hardware recovery sequence and may sleep. Calling any blocking function from interrupt context causes undefined behaviour (often a kernel panic or immediate crash on Zephyr/Cortex-M). The fix: use `k_work_submit()` inside the callback to schedule recovery in a workqueue thread.

A4. You are parsing the encoder velocity frame:

struct encoder_frame {
    int16_t left_vel;   /* mm/s */
    int16_t right_vel;  /* mm/s */
    uint32_t timestamp_ms;
};

The frame arrives as 8 CAN bytes. Write the exact memcpy call to unpack it safely from can_frame.data[], and explain why a direct cast ((struct encoder_frame *)&frame.data) is incorrect on Cortex-M7.

Answer

**Correct:**

struct encoder_frame enc;
memcpy(&enc, frame.data, sizeof(enc));  /* 8 bytes */

**Why cast is wrong:** `can_frame.data` is a `uint8_t[]` array which may be aligned to 1 byte. `struct encoder_frame` requires at minimum 2-byte alignment for `int16_t` fields (and 4-byte alignment for `uint32_t` on Cortex-M7 with strict alignment). Casting the address directly gives a pointer with insufficient alignment. Accessing it triggers a **UsageFault** (unaligned access fault) on Cortex-M7 when the UNALIGN_TRP bit is set in CCR, or produces silently wrong results otherwise. `memcpy` is always safe because it operates byte-by-byte.

A5. Your encoder thread calls k_msgq_get(&enc_msgq, &frame, K_FOREVER). The CAN bus goes silent (robot is lifted). What happens to the thread? How would you modify the call to detect CAN silence within 500ms and log a warning?

Answer

With `K_FOREVER`, the thread **blocks indefinitely** until a message arrives. It consumes no CPU (Zephyr puts it into `pend` state) but it also never runs — silent failure with no warning. To detect 500ms silence:

int ret = k_msgq_get(&enc_msgq, &frame, K_MSEC(500));
if (ret == -EAGAIN) {
    LOG_WRN("CAN encoder: no frame in 500ms — bus silent or robot stopped");
    /* increment a watchdog counter; trigger safe stop if count exceeds threshold */
    continue;
}

In production code you would also publish a "CAN timeout" event to ZBus so the state machine can transition to degraded mode rather than continuing to use stale velocity data.

Section B — Practical / Debug Scenarios

B1. You flash your firmware and run candump can0 on the host. You see no frames, but your Zephyr code calls can_send() every 100ms and returns 0 (success). The logic analyzer shows valid CAN waveform on the CANH/CANL wires. What is the most likely cause?

Answer

The host Linux interface `can0` is **down** or configured with the wrong bitrate. `can_send()` returning 0 means the STM32 successfully placed the frame on the bus and got an ACK (from the SN65HVD230 transceiver's internal ACK or another node). The Jetson is not listening. Diagnosis:

ip -details link show can0   # check if state is ERROR-ACTIVE or UP
sudo ip link set can0 down
sudo ip link set can0 up type can bitrate 500000
candump can0

If after bringing can0 up you immediately see `can0: arbitration lost` or `error frame` in candump output, you have a bitrate mismatch between STM32 and host (check your `CONFIG_CAN_DEFAULT_BITRATE` in Kconfig vs `500000` in the ip command).

B2. Your encoder velocity reading shows intermittent spikes: left_vel suddenly reads +32767 or -32768 for one frame, then returns to normal. The logic analyzer shows the CAN frame is arriving intact. What is the likely firmware bug?

Answer

Endianness mismatch. CAN data bytes default to big-endian (Motorola byte order) in automotive standards, but `struct encoder_frame` is being `memcpy`d assuming little-endian (x86/ARM default). When the high byte and low byte of `left_vel` are swapped, the value `0x0100` (256 mm/s) is read as `0x0001` (1 mm/s), and near the int16 boundary values like `0x0080` (128 mm/s) become `0x8000` (-32768 mm/s) — a sign-extension artifact of treating the MSB as the high byte. Fix: either pack the struct in big-endian order on the STM32 transmitter, or swap bytes on the receiver:

enc.left_vel  = __builtin_bswap16(enc.left_vel);
enc.right_vel = __builtin_bswap16(enc.right_vel);

Always document the byte order in the CAN DBC or header file comment.

B3. You connect a second STM32 to the same CAN bus for testing. Now your first STM32 enters BUS-OFF immediately on power-up. The second STM32 is also in BUS-OFF. What physical condition causes this?

Answer

**Missing termination.** Without the 120Ω resistors at each end of the bus, every transmitted dominant bit (0V differential) reflects back as an error. Both nodes see their own transmission as an error, increment their TEC, and rapidly reach BUS-OFF. Secondary possibility: both nodes are transmitting simultaneously with different data on the same frame ID (`0x101`), causing arbitration loss to loop — but this produces `CAN_STATE_ERROR_ACTIVE` with incremental TEC, not instant BUS-OFF. Verify: measure CANH–CANL resistance with both nodes powered off → should be ~60Ω. If you read open circuit (>10kΩ), terminators are missing. Add one 120Ω resistor across CANH/CANL at each physical end of the cable.

Section C — Code Reading

C1. This code is supposed to receive encoder frames and publish them to ZBus. Find the bug:

K_MSGQ_DEFINE(enc_msgq, sizeof(struct can_frame), 4, 4);

void can_rx_thread(void *a, void *b, void *c) {
    struct can_frame frame;
    struct encoder_frame enc;
    while (1) {
        k_msgq_get(&enc_msgq, &frame, K_FOREVER);
        memcpy(&enc, frame.data, sizeof(enc));
        zbus_chan_pub(&encoder_chan, &enc, K_NO_WAIT);
    }
}

/* In main: */
can_add_rx_filter_msgq(can_dev, &enc_msgq, NULL);

Answer

**Two bugs:** 1. `can_add_rx_filter_msgq(can_dev, &enc_msgq, NULL)` — passing `NULL` as the filter means the kernel will accept the `NULL` pointer (behaviour is driver-defined, often crashes or matches nothing). A valid `struct can_filter *` must be passed. The promiscuous equivalent is a filter with `mask = 0x000`, not NULL. 2. `memcpy(&enc, frame.data, sizeof(enc))` — `sizeof(struct encoder_frame)` is 8 bytes, but `sizeof(struct can_frame)` is larger than `frame.data` alone (the `can_frame` struct also contains the frame ID, DLC, flags). This specific line is fine as written because `frame.data` is the data array start, but the preceding `k_msgq_get` copies the **entire** `can_frame` struct into `frame`, which is correct. However: there is no DLC check. If a different node sends a 4-byte frame on the same ID, `memcpy` reads 8 bytes, silently copying 4 bytes of `frame.flags` into `enc.timestamp_ms`. Fix:

if (frame.dlc < sizeof(struct encoder_frame)) {
    LOG_WRN("Short encoder frame: dlc=%d", frame.dlc);
    continue;
}