RTL838x switch overview

Many of the switch features are implemented through what Realtek calls tables. Internal state of the SoC is read through such a table: a command containing an index into the table is written to the table access register, and the data is returned in several data registers belonging to that access register. A table can also be written, in which case the data registers are filled first and the write command is then executed by writing to the table access register. The 838x platform has three table access registers, each of which can control up to 4 different tables, whereas the 9x platform has 4 control registers, each of which controls up to 7 different tables.

The following gives an overview of the different control registers and the tables behind them:

Table access register L2

  • Controls access to L2 functions such as reading forwarding entries from the Hash table or the CAM
  • Access register RTL83XX_TBL_ACCESS_L2_CTRL
    • On 8x the Execute bit is 16, the RW bit is 15 and the table type is encoded in bits 13 and 14, bits 0-12 encode the index into the table
    • On 9x the Execute bit is 17, the RW bit is 16 and the table type is encoded in bits 14 and 15, bits 0-13 encode the index into the table
    • WARNING: The RW bit has opposite polarity on the two families: a read is requested by setting the RW bit to 1 on the 8x SoCs, but to 0 on the 9x SoCs.
  • 3 Data registers are used with the table RTL83XX_TBL_ACCESS_L2_DATA(0-2)

Table access register 0

  • Controls access to VLAN, access control (IACL) and logging (LOG) functionality
  • Access register RTL83XX_TBL_ACCESS_CTRL_0
    • On 8x the Execute bit is 15, the RW bit is 14 and the table type is encoded in bits 12 and 13, bits 0-11 encode the index into the table
    • On 9x the Execute bit is 16, the RW bit is 15 and the table type is encoded in bits 12 to 14, bits 0-11 encode the index into the table
  • Data registers RTL83XX_TBL_ACCESS_DATA_0(n) are used with this register, the number of registers depends on the table (see the table lists below)

Table access register 1

  • Controls access to untagged ports (UNTAG), Vlan egress conversion features (VLAN_EGR_CNVT) and routing (ROUTING) control
    • On 8x and 9x the Execute bit is 15, the RW bit is 14, the table type is encoded in bits 12 and 13, and bits 0-11 encode the index into the table

Table access register 2 (Only 9x)

  • Controls access to scheduling features (SCHED table)
    • The Execute bit is 9, the RW bit is 8, the table type is encoded in bits 6 and 7, and the index is encoded in bits 0-5
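
The bit positions listed above for the L2 access register can be captured in a small descriptor, so that one helper builds the command word for both families. This is only a sketch: the struct, the descriptor names and tbl_cmd() are hypothetical and do not come from the SDK or the Linux driver.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of one table access control register (hypothetical type). */
struct tbl_ctrl_layout {
	unsigned exec_bit;   /* Execute bit */
	unsigned rw_bit;     /* Read/write bit */
	unsigned type_shift; /* Lowest bit of the table type field */
	uint32_t idx_mask;   /* Mask for the table index */
	unsigned read_val;   /* Value of the RW bit that requests a read */
};

/* L2 access register layouts as listed above. */
static const struct tbl_ctrl_layout rtl838x_l2 = { 16, 15, 13, 0x1fff, 1 };
static const struct tbl_ctrl_layout rtl839x_l2 = { 17, 16, 14, 0x3fff, 0 };

/* Assemble a command word; the RW polarity differs between 8x and 9x. */
static uint32_t tbl_cmd(const struct tbl_ctrl_layout *l, int is_read,
			unsigned type, uint32_t idx)
{
	unsigned rw = is_read ? l->read_val : !l->read_val;

	assert(type < 4); /* the L2 register has two table type bits */
	return (1u << l->exec_bit) | ((uint32_t)rw << l->rw_bit) |
	       ((uint32_t)type << l->type_shift) | (idx & l->idx_mask);
}
```

Keeping the read polarity in the descriptor avoids sprinkling SoC-specific conditionals over every table access.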

RTL838x

Table                 Table type  Access register  Size   Data registers  Comment
IACL                  1           0                1536   18
L2_CAM_IP_MC          1           L2               64     3
L2_CAM_IP_MC_SIP      1           L2               64     3
L2_CAM_MC             1           L2               64     3
L2_CAM_UC             1           L2               64     3
L2_IP_MC              0           L2               8192   3
L2_IP_MC_SIP          0           L2               8192   3
L2_MC                 0           L2               8192   3
L2_NEXT_HOP           0           L2               8192   3
L2_NEXT_HOP_LEGACY    0           L2               8192   3
L2_UC                 0           L2               8192   3
LOG                   3           0                128    2
MC_PMSK               2           L2               512    1
MSTI                  2           0                64     2
ROUTING               2           1                512    2
UNTAG                 0           1                4096   1
VLAN                  0           0                4096   2
VLAN_EGR_CNVT         1           1                128    6

RTL839x

Table                 Table type  Access register  Size   Data registers  Comment
EACL                  2           0                2304   17
IACL                  2           0                2304   17
L2_CAM_IP_MC          1           L2               64     3
L2_CAM_IP_MC_SIP      1           L2               64     3
L2_CAM_MC             1           L2               64     3
L2_CAM_UC             1           L2               64     3
L2_IP_MC              0           L2               16384  3
L2_IP_MC_SIP          0           L2               16384  3
L2_MC                 0           L2               16384  3
L2_NEXT_HOP           0           L2               16384  3
L2_NEXT_HOP_LEGACY    0           L2               16384  3
L2_UC                 0           L2               16384  3
LOG                   4           0                1024   2
MC_PMSK               2           L2               4096   2
METER                 3           0                512    2
MPLS_LIB              3           1                256    2
MSTI                  5           0                256    4
OUT_Q                 2           2                53     8
ROUTING               2           1                2048   2
SCHED                 0           2                53     9
SPG_PORT              1           2                52     7
UNTAG                 0           1                4096   2
VLAN                  0           0                4096   3
VLAN_EGR_CNVT         1           1                1024   4
VLAN_IGR_CNVT         1           0                1024   5
VLAN_IP_SUBNET_BASED  1           0                1024   5
VLAN_MAC_BASED        1           0                1024   5

An example read access on an RTL838x to a table looks like this:

	u32 idx = (0 << 14) | (hash << 2) | position;

	u32 cmd = 1 << 16 /* Execute cmd */
		| 1 << 15 /* Read */
		| 0 << 13 /* Table type 0b00 */
		| (idx & 0x1fff);

	sw_w32(cmd, RTL838X_TBL_ACCESS_L2_CTRL);
	/* Busy-wait until the ASIC clears the execute bit */
	do { } while (sw_r32(RTL838X_TBL_ACCESS_L2_CTRL) & (1 << 16));

	r[0] = sw_r32(RTL838X_TBL_ACCESS_L2_DATA(0));
	r[1] = sw_r32(RTL838X_TBL_ACCESS_L2_DATA(1));
	r[2] = sw_r32(RTL838X_TBL_ACCESS_L2_DATA(2));

In this example, hash is a hash of a MAC address, which is shifted left by two bits to make room for the position within a hash bucket, which can hold up to 4 entries. A read command cmd is assembled from the execute bit, the RW bit, the table type and the index, and written to RTL838X_TBL_ACCESS_L2_CTRL. The ASIC then executes the command and fills the result registers. The code busy-waits until the execute bit reads 0 and then reads the data from the result registers. In order to write to a table, the data registers are filled first, and then a write command is executed. The layout of the data returned from a table is described in e.g. the files rtk-sdk/src/hal/chipdef/maple/rtk_maple_tableField_list.c or rtk-sdk/src/hal/chipdef/cypress/rtk_cypress_tableField_list.c of the GPL dump of the Realtek SDK. An example entry is:

    rtk_tableField_t RTL8380_VLAN_FIELDS[] = {
    {   /* name     MAPLE_VLAN_MBRtf */
        /* lsp */   32,
        /* len */   29,
    },
    {   /* name     MAPLE_VLAN_FID_MSTItf */
        /* lsp */   5,
        /* len */   6,
    },
    {   /* name     MAPLE_VLAN_L2_HASH_KEY_UCtf */
        /* lsp */   4,
        /* len */   1,
    },
    {   /* name     MAPLE_VLAN_L2_HASH_KEY_MCtf */
        /* lsp */   3,
        /* len */   1,
    },
    {   /* name     MAPLE_VLAN_VLAN_PROFILEtf */
        /* lsp */   0,
        /* len */   3,
    },
};

This gives the field names (e.g. MAPLE_VLAN_FID_MSTItf), the starting bit of each field (5) and its length (6 bits). Note that bit 0 is always the least significant bit of the highest-numbered result register that is filled. In the case of the above table it is the right-most bit in RTL83XX_TBL_ACCESS_DATA_0(1). The tables are always right-aligned, i.e. the data might not start at the highest bit of the lowest-numbered data register (e.g. RTL83XX_TBL_ACCESS_DATA_0(0)). You can think of a table result as one large big-endian structure spread across the data registers.
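
The right-aligned, big-endian layout described above can be turned into a generic field accessor. The following is a minimal sketch, assuming the data registers have already been read into an array in register order; tbl_field() is a hypothetical helper, not an SDK function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extract the field starting at entry bit 'lsb' with 'len' bits from a table
 * entry held in 'nregs' data registers. Entry bit 0 is the least significant
 * bit of regs[nregs - 1], the highest-numbered register.
 */
static uint32_t tbl_field(const uint32_t *regs, int nregs, int lsb, int len)
{
	int word = nregs - 1 - lsb / 32; /* register holding entry bit 'lsb' */
	int shift = lsb % 32;
	uint64_t v;

	assert(len > 0 && len <= 32 && word >= 0);
	v = regs[word] >> shift;
	/* Pull in the next lower-numbered register if the field straddles two. */
	if (shift + len > 32 && word > 0)
		v |= (uint64_t)regs[word - 1] << (32 - shift);
	return (uint32_t)(v & ((1ULL << len) - 1));
}
```

With the two VLAN data registers in r[0] and r[1], tbl_field(r, 2, 32, 29) would yield the MBR port mask and tbl_field(r, 2, 5, 6) the FID_MSTI value.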

An overview of all tables can be found in e.g. rtk-sdk/src/hal/chipdef/maple/rtk_maple_table_list.c:

    {   /* table name               INT_MAPLE_RTL8380_VLAN */
        /* access table type */     0,
        /* table size */            4096,
        /* total data registers */  2,
        /* total field numbers */   MAPLE_VLANFIELD_LIST_END,
        /* table fields */          RTL8380_VLAN_FIELDS
    },

This defines the VLAN table. The table type is 0 (these are the table type bits that need to be given in the command written to the table access register). The table size is 4096, i.e. there is one entry for each of the 4096 VLAN IDs, and the VLAN ID serves as the index into the table. The results are returned in 2 data registers (RTL83XX_TBL_ACCESS_DATA_0(0-1)).

The information on which table is associated with which control register can be found in ./rtk-sdk/src/hal/mac/drv.c, in the functions rtl8380_table_read() and rtl8390_table_read(). For each table name there is a groupId defined, which is an index into the ctrlReg table containing the three possible control registers L2_CTRL, CTRL_0 and CTRL_1.

VLANs are controlled by the VLAN, UNTAG and VLAN_EGR_CNVT tables. Each port also has an inner and an outer primary VLAN ID, which are defined by the RTL83XX_VLAN_PORT_PB_VLAN registers. The RTL83XX_VLAN_PORT_PB_VLAN register has the following fields:

  • Bits 16-27: Outer primary VLAN ID
  • Bits 14-15: Format for outer PVID, by default 0 in the SDK
  • Bits 2-13: Inner primary VLAN ID
  • Bits 0-1: Format for inner PVID, by default 0 in the SDK
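
As an illustration of the field layout above, the register value can be packed and unpacked as follows. The helper names are hypothetical; only the bit positions are taken from the list.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the per-port primary VLAN register from its four fields. */
static uint32_t pb_vlan_pack(unsigned outer_pvid, unsigned outer_fmt,
			     unsigned inner_pvid, unsigned inner_fmt)
{
	assert(outer_pvid < 4096 && inner_pvid < 4096);
	return ((uint32_t)outer_pvid << 16) | ((uint32_t)(outer_fmt & 0x3) << 14) |
	       ((uint32_t)inner_pvid << 2) | (inner_fmt & 0x3);
}

/* Read back the two PVIDs from a register value. */
static unsigned pb_vlan_outer_pvid(uint32_t reg) { return (reg >> 16) & 0xfff; }
static unsigned pb_vlan_inner_pvid(uint32_t reg) { return (reg >> 2) & 0xfff; }
```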

The VLAN table has type 0 using table access register 0. The index into this table is the VLAN-ID 0-4095. It stores the following fields:

  • MBR (port members of this VLAN, i.e. those ports which send packets tagged with this VLAN): lsb 32, length 29. A bit mask for ports 28-0.
  • FID_MSTI (Forwarding ID and Multiple Spanning Tree ID): lsb 5, length 6 bits
  • L2_HASH_KEY_UC (Unicast L2 hash key): lsb 4, length 1
  • L2_HASH_KEY_MC (Multicast L2 hash key): lsb 3, length 1
  • VLAN_PROFILE (ID of the up to 8 different VLAN profiles that can be defined on the SoC): lsb 0, length 3 bits

The UNTAG table contains for each VID a mask of the untagged ports. The table returns exactly one 32 bit register, with bits 28-0 representing the respective port. This bitmap is ANDed with the respective MBR field of the VLAN table to produce the set of ports where this VID is untagged.

An entry of the VLAN_EGR_CNVT table is 192 bits wide and contains automatic conversion rules for VLANs on egress. It also adjusts the priority of packets. Decisions can be based on the original port ID, and a new outgoing port is assigned. The RTL839x supports 1024 conversion rules, the RTL8380 128 (the rule number is the index into the table).

The 32-bit per-port VLAN_PORT_TAG_STS_CTRL register controls VLAN tagging on egress. It has these fields:

  • RESERVED: lsb 12, length 20
  • EGR_P_OTAG_KEEP (keep outer tag on egress): lsb 10, length 2
  • EGR_P_ITAG_KEEP (keep inner tag on egress): lsb 8, length 2
  • IGR_P_OTAG_KEEP (keep outer tag on ingress): lsb 6, length 2
  • IGR_P_ITAG_KEEP (keep inner tag on ingress): lsb 4, length 2
  • OTAG_STS (use outer tag unless port is member of UNTAG): lsb 2, length 2
  • ITAG_STS (use inner tag unless port is member of UNTAG): lsb 0, length 2

EGR_ fields can be

  • 0: disabled
  • 1: enabled
  • 2: enabled
  • 3: invalid

IGR_ fields can be

  • 0: disabled
  • 1: keep format
  • 2: keep content
  • 3: invalid

_STS fields can be

  • 0: untagged
  • 1: tagged
  • 2: priority tagged
  • 3: invalid

The UNTAG table is only used to manage egress tagging if ITAG_STS and/or OTAG_STS is enabled. The KEEP fields allow unmanaged switches to preserve the original tagging of forwarded packets.

A typical managed switch will disable all KEEP fields and set ITAG_STS to "tagged", forcing a tag on egress unless the port is a member of the VID's UNTAG bitmap.
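
The field layout of VLAN_PORT_TAG_STS_CTRL given above can be packed as in the following sketch. The struct and function names are hypothetical. Leaving OTAG_STS at "untagged" in the managed-switch example is an assumption, since the text only specifies ITAG_STS.

```c
#include <stdint.h>

enum tag_sts {
	TAG_STS_UNTAGGED = 0,
	TAG_STS_TAGGED = 1,
	TAG_STS_PRIORITY_TAGGED = 2,
};

struct tag_sts_cfg {
	unsigned egr_otag_keep, egr_itag_keep; /* 2 bits each, lsb 10 and 8 */
	unsigned igr_otag_keep, igr_itag_keep; /* 2 bits each, lsb 6 and 4 */
	enum tag_sts otag_sts, itag_sts;       /* 2 bits each, lsb 2 and 0 */
};

/* Pack the per-port tag status configuration into the register value. */
static uint32_t tag_sts_ctrl_pack(const struct tag_sts_cfg *c)
{
	return ((c->egr_otag_keep & 3u) << 10) | ((c->egr_itag_keep & 3u) << 8) |
	       ((c->igr_otag_keep & 3u) << 6) | ((c->igr_itag_keep & 3u) << 4) |
	       (((uint32_t)c->otag_sts & 3u) << 2) | ((uint32_t)c->itag_sts & 3u);
}
```

For the managed-switch case, a configuration with all KEEP fields at 0 and itag_sts set to TAG_STS_TAGGED packs to the register value 0x1.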

The RTL83xx SoCs have a CPU-Port which allows the SoC to send and receive Ethernet packets from Linux userspace via the Linux kernel. The port number of the CPU port is 28 on the RTL838x, and 52 on the RTL839x. The CPU-port can also be used to mirror or trap incoming or outgoing packets so that they can be analyzed by the MIPS CPU on the SoC. The CPU-Port has 8 receive queues and 2 send queues, which allow prioritizing packets that are sent or received. The 8 receive queues correspond to the 8 queues also used for packet forwarding. Packets are sent by storing them in uncached memory (MIPS KSEG1 memory) and pointing one of the pointers on the TX ring of either of the 2 send queues at the data to be sent. Receiving works by providing empty buffers pointed to by pointers on the 8 receive rings, which the ASIC will fill.

The Realtek GPL dumps of the SDK contain two different implementations of an Ethernet driver. There is a simple implementation with only one ring each for TX and RX packets. This implementation does not use interrupts, nor is it able to cope with a situation where too many packets arrive for the CPU and packets would in principle need to be dropped. This implementation can be found in rtl8390_nic.c or rtl8380_nic.c in the xx directory. A second, fully-featured implementation as a Linux kernel module can be found for all SoC architectures in sdk/rtk-sdk/system/drv/nic/, the user-space part of the implementation is in the user subdirectory therein. This implementation also allows receiving Jumbo frames on the CPU-port of the RTL839x SoCs. Note that this is independent of the switch switching Jumbo frames between ports, which are forwarded anyway with a cut-through approach.

In order to send a packet, first the ring structures for the 2 send rings have to be initialized (function rtl838x_setup_ring_buffer()). This is done by allocating TXRINGLEN empty buffers and storing pointers to them in the *buf field of buffer header structures. A second ring, which only holds pointers to these header structures, is set up in uncached memory, and the SoC is informed of the start of these rings (function rtl838x_hw_ring_setup() sets the RTL83XX_DMA_TX_BASE register to the beginning of the ring). Presently, the buffer header structure has the following fields in the Linux driver:

struct p_hdr {
	uint8_t		*buf;
	uint16_t	reserved;
	uint16_t	size;   /* buffer size */
	uint16_t	offset;
	uint16_t	len;    /* pkt len */
	uint16_t	reserved2;
	uint16_t	cpu_tag[5];
} __packed __aligned(1);

The size attribute gives the size of the empty packet buffer, while the len attribute gives the length of the actual data to be sent. The cpu-tag allows giving further instructions on how the packet is to be sent, such as the ports on which the packet is to leave the switch. Further information in the cpu-tag can be VLAN-related or priority-related. The detailed fields of the TX cpu_tag are:

RTL838x:

struct { 
    u8  CPUTAGIF;     // For a valid tag this needs to be 0x4, otherwise the tag is ignored
    u8      :2;     
    u8  BP_FLTR1:1; 
    u8  BP_FLTR2:1; 
    u8  AS_TAGSTS:1; 
    u8  ACL_ACT:1; 
    u8  RVID_SEL:1; 
    u8  L2LEARNING:1;
    u8  AS_PRI:1;     // Enable priority field
    u8  PRI:3;        // Priority with which to send the packet
    u8      :2;     
    u8  AS_DPM:1;     // Enable the destination port matrix
    u8  DPM_TYPE:1; 
    u8  RSV0;   
    u8  RSV1;   
    u8  RSV2;   
    u32  :3; 
    u32  DPM:29;      // Destination port mask, one bit per port 0-28
} __attribute__ ((aligned(1), packed)) tx; 

RTL839x:

struct { 
     u8   CPUTAGIF;      // For a valid tag this needs to be 0x01
     u8   DPM_TYPE:1; 
     u8   ACL_ACT:1; 
     u8     :2; 
     u8   DM_PKT:1; 
     u8   DG_PKT:1; 
     u8   BP_FLTR1:1; 
     u8   BP_FLTR2:1;
     u8      :4;
     u8   AS_PRIO:1;     // Enable priority field
     u8   PRI:3;         // Priority with which to send
     u32  L2LEARNING:1;  // Should the MAC address be learned (not clear whether SA or DA)
     u32  AS_TAGSTS:1;
     u32  RVID_SEL:1;
     u32  AS_DPM:1;
     u32  DPM41_32:20;   // Upper 20 bits of the destination port mask
     u32  DPM32_0;       // Lower 32 bits of the destination port mask
} __attribute__ ((aligned(1), packed)) tx; 

Flags starting with AS_xxx denote that the field xxx is taken into account when sending the packet. So if AS_PRI is 0, then PRI will not be looked at when sending the packet (note that PRI itself can legitimately be 0). DPM stands for Destination Port Matrix, i.e. the ports through which the packet will leave the switch after entering it at the CPU-port. Note that both cpu tags are 10 bytes long, so the driver can use the same definition for both the RTL838x and 9x SoCs, merely the data in the tag needs to be interpreted differently.
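
Because bit-field layout is compiler-dependent, a driver may prefer to fill the 10 tag bytes explicitly. The following sketch does this for the RTL838x TX tag. It assumes that the bit-fields in the struct above are allocated MSB-first, as is usual on big-endian MIPS. The function name and the convention that a negative prio means "no priority override" are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Fill a 10-byte RTL838x TX CPU tag (assumes MSB-first bit-field layout). */
static void rtl838x_tx_tag(uint8_t tag[10], int prio, uint32_t dpm)
{
	memset(tag, 0, 10);
	tag[0] = 0x04; /* CPUTAGIF: marks the tag as valid */

	if (prio >= 0) /* AS_PRI (bit 7) and PRI (bits 6-4) of byte 2 */
		tag[2] |= (uint8_t)(0x80 | ((prio & 0x7) << 4));

	if (dpm) {
		tag[2] |= 0x02;    /* AS_DPM (bit 1 of byte 2) */
		dpm &= 0x1fffffff; /* DPM is 29 bits wide */
		tag[6] = (uint8_t)(dpm >> 24);
		tag[7] = (uint8_t)(dpm >> 16);
		tag[8] = (uint8_t)(dpm >> 8);
		tag[9] = (uint8_t)dpm;
	}
}
```

Calling rtl838x_tx_tag(tag, 5, 1u << 3) would request priority 5 and egress on port 3 only.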

When sending a packet, the kernel will call the rtl838x_eth_tx(struct sk_buff *skb, struct net_device *dev) function. The function first sets up one of the header structures on the ring including a cpu tag. If DSA is active, the cpu-tag will need to contain the destination port of the packet. It is found at the end of the skb as part of the cpu-tag that DSA has set up (do not confuse this with the CPU-tag for the switch, this tag is merely data added by DSA at the end of the packet buffer, while the cpu-tag in the header is a structure understood by the switch). It also copies the data from the sk_buff to uncached memory, into one of the packet buffers already set up and leaving space (4 bytes) for the CRC sum, which the ASIC will calculate by itself and fill in before actually sending the packet.

In order to actually send a packet, its header structure is handed over to the ASIC by setting the LSB of the pointer in the ring structure with pointers to the headers to 1. This marks that header as now belonging to the ASIC. The ASIC will clear that bit again after having sent the packet. Note that the addresses of the header structures are 4-byte aligned, so normally bits 0 and 1 are empty. Bit 0 is used to signify ownership by the ASIC, whereas bit 1 is used to denote the last entry in the ring, which tells the ASIC to find the next header again at the beginning of the ring. Finally, the TX_FETCH bit (bit 1) has to be set in the RTL83XX_DMA_IF_CTRL register. The TX_EN bit (TX enable, bit 3) of that register should already have been set to 1 after setting up the ring structure, it enables the TX packet engine.
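
The use of the two low pointer bits can be sketched as follows. The macro and function names are hypothetical, and header addresses are modelled as 32-bit values as on the MIPS SoC. A real driver would place the ring in uncached memory and access it through volatile pointers.

```c
#include <assert.h>
#include <stdint.h>

#define RING_OWN_ASIC 0x1u /* bit 0: entry belongs to the ASIC */
#define RING_WRAP     0x2u /* bit 1: last entry, continue at ring start */

/* Fill the pointer ring with header addresses, all owned by the CPU. */
static void ring_init(uint32_t *ring, const uint32_t *hdr_addr, int n)
{
	for (int i = 0; i < n; i++) {
		assert((hdr_addr[i] & 0x3) == 0); /* headers are 4-byte aligned */
		ring[i] = hdr_addr[i];
	}
	ring[n - 1] |= RING_WRAP;
}

/* Hand entry i over to the ASIC, e.g. before triggering TX_FETCH. */
static void ring_hand_to_asic(uint32_t *ring, int i)
{
	ring[i] |= RING_OWN_ASIC;
}

static int ring_owned_by_cpu(const uint32_t *ring, int i)
{
	return !(ring[i] & RING_OWN_ASIC);
}

/* Strip the flag bits to recover the header address. */
static uint32_t ring_hdr_addr(const uint32_t *ring, int i)
{
	return ring[i] & ~0x3u;
}
```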

When all pending packets have been sent, a network interrupt is triggered, so the CPU can clean up the ring buffer. Alternatively, the simple u-boot driver simply busy-waits on the TX_BUSY bit (bit 0) in the RTL83XX_DMA_IF_CTRL register. When using buffers for packets of the same size, the buffers can be re-used directly, as their ownership is again with the CPU. The driver simply walks forward from its position in the ring of pointers to the buffer headers as long as the ASIC ownership bit (bit 0 in the pointer to the buffer header in KSEG1) is not set, and sends more packets. In principle there is also a RTL83XX_DMA_IF_TX_CUR_DESC_ADDR_CTRL register which can be used to read (control?) the position in the buffer header pointer ring which the ASIC currently looks at (this is not used in the Linux driver).

The interrupt status can be read from RTL83XX_DMA_IF_INTR_STS, which has the following status bits:

  • bits 0-7: RX buffer overrun bits for each of the 8 queues
  • bits 8-15: RX done interrupt for each of the queues separately
  • bits 16 and 17: TX done interrupt for each of the 2 queues
  • bit 18: TX all done interrupt fired when both queues are empty
  • Additionally, the RTL839x SoCs have 3 additional status bits:
    • bit 20: Local notification buffer overrun
    • bit 21: notification buffer overrun
    • bit 22: Notification received

In the network interrupt routine, status bits which are 1 need to be cleared by writing 1 to them, otherwise the interrupt will fire again. To control interrupt generation, the SoCs have RTL83XX_DMA_IF_INTR_MSK mask registers with the same bit layout as the RTL83XX_DMA_IF_INTR_STS status register, which allow individual interrupts to be disabled.
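
A handler typically splits the status word into its groups before acting on them. The following sketch decodes the bits common to both SoC families. The struct and function names are hypothetical.

```c
#include <stdint.h>

struct dma_intr {
	uint8_t rx_overrun;  /* bits 0-7, one bit per RX queue */
	uint8_t rx_done;     /* bits 8-15, one bit per RX queue */
	uint8_t tx_done;     /* bits 16-17, one bit per TX queue */
	uint8_t tx_all_done; /* bit 18 */
};

/* Split a RTL83XX_DMA_IF_INTR_STS value into its interrupt groups. */
static struct dma_intr dma_intr_decode(uint32_t sts)
{
	struct dma_intr i = {
		.rx_overrun = sts & 0xff,
		.rx_done = (sts >> 8) & 0xff,
		.tx_done = (sts >> 16) & 0x3,
		.tx_all_done = (sts >> 18) & 0x1,
	};
	return i;
}
```

Since the status bits are write-1-to-clear, writing the value that was just read back to the status register acknowledges exactly the interrupts that were seen.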

Presently the Linux driver only makes use of one TX and one RX ring.

In order to receive packets, again the double ring structure has to be set up for all queues that are used by the ASIC to receive packets. Which queues are being used depends on rules having been set up mapping e.g. priorities or packet types (?) to respective queues. The Linux driver makes use of only one RX queue.

The ring of packet headers makes use of the same p_hdr structure as the TX ring. However, while the size attribute again gives the buffer size, the len attribute is initially 0 and will be filled with the correct size by the ASIC once a packet has been received. In the offset value, the highest bit (more-bit) can be used to indicate a continuation of data in a second buffer, which can be used to receive Jumbo frames on the CPU-port, the next packet then has the offset in bytes set. Again, a parallel ring structure with pointers to these headers has to be set up, with the pointers consecutive in memory and with the ASIC ownership flag (bit 0) set. The last pointer also has the wrap-around bit set (bit 1).

Once the double rings are set up for each queue, the RX enable bit (bit 2) in RTL83XX_DMA_IF_CTRL can be set. The Linux driver currently also sets the RX_TRUNCATE_EN bit (bit 4) in that register and limits the size of received packets to 1600 bytes (bits 5 to 19 in the RTL83XX_DMA_IF_CTRL register).
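
The RTL83XX_DMA_IF_CTRL bits mentioned in this text can be combined as in the following sketch. The macro and function names are hypothetical, only the bit positions are taken from the description.

```c
#include <stdint.h>

#define DMA_IF_TX_BUSY          (1u << 0)
#define DMA_IF_TX_FETCH         (1u << 1)
#define DMA_IF_RX_EN            (1u << 2)
#define DMA_IF_TX_EN            (1u << 3)
#define DMA_IF_RX_TRUNCATE_EN   (1u << 4)
#define DMA_IF_RX_TRUNC_SHIFT   5
#define DMA_IF_RX_TRUNC_MASK    0x7fffu /* bits 5-19, 15 bits */

/* Return 'ctrl' with RX and TX enabled and RX truncation at 'max_len' bytes. */
static uint32_t dma_if_ctrl_enable(uint32_t ctrl, unsigned max_len)
{
	ctrl &= ~(DMA_IF_RX_TRUNC_MASK << DMA_IF_RX_TRUNC_SHIFT);
	ctrl |= (max_len & DMA_IF_RX_TRUNC_MASK) << DMA_IF_RX_TRUNC_SHIFT;
	return ctrl | DMA_IF_RX_TRUNCATE_EN | DMA_IF_RX_EN | DMA_IF_TX_EN;
}
```

For a limit of 1600 bytes and an initially cleared register this yields 0xC81C.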

The driver needs to keep track by itself which was the last entry in each ring that has already been forwarded into the network stack after reception. When the receive interrupt of a queue (there is one per queue, see above) fires, the Linux driver schedules a NAPI receive event. The NAPI handler then calls rtl838x_hw_receive(struct net_device *dev, int r, int budget) in order to hand the next received packet over into the network stack. The following gives the cpu-tag as filled by the ASIC when it has received a packet on the RTL838x:

struct {
            u8   CPUTAGIF;  // An ID number of the type of CPU-tags, on a valid tag 0x04
            
            u8   QID:3;  // Queue ID on which the packet was received, identical to the ring number on which the header is
            u8   SPN:5;  // Number of the source port on which the packet was received
            
            u16  MIR_HIT:4; // The packet was sent to the CPU-port because of a mirror hit, this gives the mirror ID
            u16  ACL_HIT:1; // An ACL was hit
            u16  ACL_IDX:11;
            
            u16          :2;
            u16  OTAGIF:1;
            u16  ITAGIF:1;
            u16  RVID:12;   // The remote VLAN ID

            u8          :1;
            u8   MAC_CST:1;
            u8   ATK_HIT:1;   // An attack was identified and the packet trapped to the CPU port
            u8   ATK_TYPE:5;  // The type of the attack
            
            u8   NEW_SA:1;    // A packet arrived with a new Source address, give the CPU a chance to learn
            u8   L2_PMV:1;
            u8           :2;
            u8   REASON:4;    // The reason the packet was forwarded to the CPU-Port
           
            u8   RSV0;
            u8   RSV1;
        } __attribute__ ((aligned(1), packed)) rx;

enum reasons { 
               NIC_RX_REASON_RLDP_RLPP = 1, NIC_RX_REASON_RMA, NIC_RX_REASON_IGR_VLAN_FILTER, NIC_RX_REASON_INNER_OUTTER_CFI, NIC_RX_REASON_MY_MAC,
               NIC_RX_REASON_SPECIAL_TRAP, NIC_RX_REASON_SPECIAL_COPY, NIC_RX_REASON_ROUTING_EXCEPTION, NIC_RX_REASON_UNKWN_UCST_MCST,
               NIC_RX_REASON_MAC_CONSTRAINT_SYS, NIC_RX_REASON_MAC_CONSTRAINT_VLAN, NIC_RX_REASON_MAC_CONSTRAINT_PORT, NIC_RX_REASON_CRC_ERROR,
               NIC_RX_REASON_IP6_UNKWN_EXT_HDR, NIC_RX_REASON_NORMAL_FWD
             };
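
The fields of the RTL838x RX tag that a driver typically needs can be read from the raw tag bytes as sketched below. This assumes MSB-first bit-field allocation for the struct above (as is usual on big-endian MIPS). The struct and function names are hypothetical.

```c
#include <stdint.h>

struct rx_info {
	unsigned queue;  /* QID */
	unsigned port;   /* SPN */
	unsigned reason; /* REASON, see enum reasons */
	int new_sa;      /* NEW_SA */
};

/* Decode a 10-byte RTL838x RX CPU tag; returns -1 if the tag is not valid. */
static int rtl838x_rx_tag_decode(const uint8_t tag[10], struct rx_info *out)
{
	if (tag[0] != 0x04) /* CPUTAGIF of a valid tag */
		return -1;
	out->queue = tag[1] >> 5;    /* upper 3 bits of byte 1 */
	out->port = tag[1] & 0x1f;   /* lower 5 bits of byte 1 */
	out->new_sa = tag[7] >> 7;   /* top bit of byte 7 */
	out->reason = tag[7] & 0x0f; /* lower 4 bits of byte 7 */
	return 0;
}
```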

For the RTL839x, the cpu-tag has the following structure:

 struct {
            u8   CPUTAGIF;    // An ID number of the type of CPU-tags, on a valid tag 0x04
            u8       :2;
            u8   SPN:6;       // The source port number
            u16  MIR_HIT:4;   // A hit in one of the mirror groups
            u16  ACL_IDX:12;  // An ACL hit: the index
            u16  ACL_HIT:1;   // An ACL hit
            u16  OTAGIF:1;
            u16  ITAGIF:1;
            u16          :1;
            u16  RVID:12;     // Remote VLAN ID
            u8   QID:3;       // Number of the queue on which the packet was received, identical to the ring number on which the header is
            u8   ATK_TYPE:5;
            u8   MAC_CST:1;
            u8   CRC:1;
            u8   SFLOW:6;
            u8           :2;
            u8   DM_RXIDX:6;
            u8   NEW_SA:1;    // A packet arrived with a new Source address, give the CPU a chance to learn
            u8   L2_PMV:1;
            u8   OVERSIZE:1;
            u8   REASON:5;    // Reason ID for which the packet was sent to the CPU
        } __attribute__ ((aligned(1), packed)) rx;
        
enum reasons { 
               NIC_RX_REASON_OAM = 1, NIC_RX_REASON_CFM, NIC_RX_REASON_CFM_ETHDM, NIC_RX_REASON_IGR_VLAN_FILTER, NIC_RX_REASON_VLAN_ERROR,
               NIC_RX_REASON_INNER_OUTTER_CFI, NIC_RX_REASON_RMA_USR_DEF1, NIC_RX_REASON_RMA_USR_DEF2, NIC_RX_REASON_RMA_BPDU,
               NIC_RX_REASON_RMA_LACP, NIC_RX_REASON_RMA_PTP, NIC_RX_REASON_RMA_LLDP, NIC_RX_REASON_RMA,
               NIC_RX_REASON_IP6_HOPBYHOP_EXT_HDR_ERROR, NIC_RX_REASON_IP6_UNKWN_EXT_HDR, NIC_RX_REASON_IP4_HDR_ERROR, NIC_RX_REASON_TTL_EXCEED,
               NIC_RX_REASON_IP4_OPTIONS, NIC_RX_REASON_IP6_HDR_ERROR, NIC_RX_REASON_HOP_EXCEED, NIC_RX_REASON_IP6_HOPBYHOP_OPTION,
               NIC_RX_REASON_GW_MAC_ERROR, NIC_RX_REASON_IGMP, NIC_RX_REASON_MLD, NIC_RX_REASON_EAPOL, NIC_RX_REASON_ARP_REQ,
               NIC_RX_REASON_IP6_NEIGHBOR_DISCOVER, NIC_RX_REASON_UNKWN_UCST_MCST, NIC_RX_REASON_MY_MAC, NIC_RX_REASON_INVALID_SA,
               NIC_RX_REASON_NORMAL_FWD
             };

The receiving function rtl838x_hw_receive() starts with the remembered current pointer ring entry and loops over this and the following entries until the position given by RTL83XX_DMA_IF_RX_CUR_DESC_ADDR_CTRL for that queue has been reached. For each of these entries on the pointer ring it checks whether the header is indeed owned by the CPU and not the ASIC (bit 0 needs to be clear) and then copies the data, as given by the len field of the packet header, over to the network stack. At the end of the packet buffer, 4 bytes are added which contain the cpu-tag information, which the function fills in from the cpu-tag in the packet header. At present only the source port is used by DSA.

After reading out the buffer and packet header, the header is reset by setting the len field to 0, setting the size to the buffer size and deleting the cpu_tag in the header.

When the ASIC's current ring position pointer hits an element on the ring that is not owned by the ASIC, the ring overflow interrupt is fired (bits 0-7 corresponding to the ring-id in RTL83XX_DMA_IF_INTR_STS), as it means there is no more space on the ring to store incoming packets. Presently, the driver reacts by discarding packets on the ring to make space for the ASIC. It seems that if this is not done quickly enough, or if the RX ring is smaller than about 200 entries, the ASIC can deadlock in situations where there is a high load of traffic to the CPU-port in combination with lots of traffic through the switch: no more entries are being received at all. In such a situation, the NIC needs to be reset (writing 1 to bit 3 of RTL83xx_RST_GLB_CNTRL resets the NIC) after cleaning up the ring buffers.

The forwarding database is the data structure that allows the ASIC in the switch to decide to which port to forward a packet. It consists of 2 parts: a hash-based L2 Forwarding Database, and a so-called CAM (Content Addressable Memory), which stores entries that do not find space in the hash-based Forwarding Database. On the RTL838x, the L2 Forwarding Database contains 8192 entries, on the RTL839x SoCs 16384 entries, of which 4 entries are found in the same hash-bucket, i.e. are addressed by the same hash value. In both cases, the CAMs may contain up to 64 entries.

The FDB contains unicast entries, L2 multicast entries, IPv4 multicast entries, IPv6 multicast entries and entries for trunk load balancing based on SIP or DIP. In the original SDK abstraction they are considered to be accessed through different tables which return different datatypes, but one can also view the L2 FDB as a single table containing entries which have a union datatype. Typically an entry in the database consists of a MAC address, information on how to forward a matching packet, and information about the age of that entry.

In order to access the L2 FDB for a unicast entry, first the hash of the MAC address concatenated with the destination VLAN number is calculated: h = hash(mac << 12 | vid). The hash algorithm applied to the 64 bit concatenation of MAC and VLAN ID needs to be the same as the one applied by the SoC, which can be configured to use one of two different hash algorithms (bit 0 of RTL83XX_L2_CTRL_0). On the RTL838x, the following table access command is then used to read the table:

	u16 idx = hash << 2 | pos;
	u32 cmd = 1 << 16 /* Execute cmd */
		| 1 << 15 /* Read */
		| 0 << 13 /* Table type 0b00 */
		| (idx & 0x1fff);

Here pos is the position in the hash-bucket (0-3). In order to find an entry, these positions have to be tried one after the other. The entry returned can be of any entry type stored in the common L2 table, which corresponds to entries of different tables as defined by the SDK, namely the following tables: L2_UC (L2 unicast entries), L2_IP_MC (IP multicast entries), L2_IP_MC_SIP (load balancing), L2_MC (L2 multicast entries), L2_NEXT_HOP (next hop routing). All these entries have 87 bits and are right-aligned in the 96 bits of the 3 data registers storing the returned entry.
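
The lookup key and the per-slot read commands can be sketched as follows. The hash function itself is deliberately left out, since it must match the algorithm configured in the SoC. Both function names are hypothetical.

```c
#include <stdint.h>

/* Concatenate MAC address and VLAN ID into the value that gets hashed. */
static uint64_t l2_hash_seed(uint64_t mac, unsigned vid)
{
	return (mac << 12) | (vid & 0xfff);
}

/* RTL838x read command for slot 'pos' (0-3) of the bucket selected by 'hash'. */
static uint32_t rtl838x_l2_read_cmd(uint32_t hash, unsigned pos)
{
	uint32_t idx = (hash << 2) | (pos & 0x3);

	return (1u << 16)          /* Execute */
	       | (1u << 15)        /* Read (RW bit is 1 for reads on 8x) */
	       | (0u << 13)        /* Table type 0b00 */
	       | (idx & 0x1fff);   /* 13 bit index into the 8192 entries */
}
```

A lookup would issue rtl838x_l2_read_cmd(h, pos) for pos = 0 to 3 and compare the MAC address and VLAN ID of each returned entry with the key.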

This shows the data fields of an L2 Unicast entry, note that not all bits are consecutive (e.g. bit 84 is not defined), because the entry is a union of all possible entries and it needs to be possible to identify the entry type:

 - bit 86: This entry is actually an IPv4 Multicast entry -> continue to decode the entry with that type
 - bit 85: This entry is actually an IPv6 multicast entry
 - bit 83: the entry is static, i.e. does not age
 - bit 81-82: Age of the entry
 - bit 76-80: SLP: The Source Learned Port
 - bit 64-75: VLAN id
 - bit 63: Block this as Source Address
 - bit 62: Block this as Destination Address
 - bit 61: This entry was suspended
 - bit 60: This entry is actually a next hop entry
 - bit 12-59: the MAC address
 - bit 0-11: The forwarding ID / remote VLAN id

A next hop entry then has identical fields, apart from the following:

 - bit 64-72: Route index
 - bit 60: Next hop bit, must be 1

The L2 multicast entry has the following fields, note that one would start decoding the entry as a unicast entry and identify from the MAC address that it is in fact a multicast address:

 - bit 86: This entry is actually an IPv4 Multicast entry -> continue to decode the entry with that type
 - bit 85: This entry is actually an IPv6 multicast entry
 - bit 76-84: Port mask index for multicasts
 - bit 64-75: VLAN ID
 - bit 12-63: MAC address
 - bit 0-11: forwarding ID / remote VLAN id
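
The unicast layout listed earlier can be decoded from the three data registers as in the following sketch, where r[0] to r[2] hold RTL838X_TBL_ACCESS_L2_DATA(0) to (2) and entry bit 0 is the LSB of r[2]. The struct and function names are hypothetical.

```c
#include <stdint.h>

struct l2_uc_entry {
	uint64_t mac;
	unsigned vid, port, age, rvid;
	int is_static, is_next_hop, is_ip_mc, is_ip6_mc;
};

/* Decode an RTL838x L2 unicast entry from its 3 data registers. */
static struct l2_uc_entry l2_uc_decode(const uint32_t r[3])
{
	uint64_t low = ((uint64_t)r[1] << 32) | r[2]; /* entry bits 0-63 */
	struct l2_uc_entry e = {
		.is_ip_mc = (r[0] >> 22) & 1,           /* bit 86 */
		.is_ip6_mc = (r[0] >> 21) & 1,          /* bit 85 */
		.is_static = (r[0] >> 19) & 1,          /* bit 83 */
		.age = (r[0] >> 17) & 0x3,              /* bits 81-82 */
		.port = (r[0] >> 12) & 0x1f,            /* bits 76-80 */
		.vid = r[0] & 0xfff,                    /* bits 64-75 */
		.is_next_hop = (low >> 60) & 1,         /* bit 60 */
		.mac = (low >> 12) & 0xffffffffffffULL, /* bits 12-59 */
		.rvid = low & 0xfff,                    /* bits 0-11 */
	};
	return e;
}
```

A caller would first check is_ip_mc, is_ip6_mc and is_next_hop, and only treat the remaining fields as a plain unicast entry if none of them is set.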

The SoC manages QoS by assigning a priority (between 0 and 7, 7 being the highest priority) to a packet received on an ingress port or created by the CPU-port, and converting this priority to a queue ID. This priority is called the internal or inner priority, and it is independent of the priority of an Ethernet packet (802.1q). The queues with IDs 0-7 also have a priority, the priority being higher for higher-numbered queues. Higher-priority packets have access to all queues with a priority lower than a maximum priority available to packets of that priority. Because higher-priority packets have more queues available than lower-priority packets, they get more bandwidth on an egress port. It is important to understand that 2 different types of queues exist: a set of 8 queues for each of the egress ports, and a CPU-Port queue for receiving packets at the CPU-Port. The CPU-Port queue is identical to the receiving ring number.

Assignment of priorities starts when a packet is received. Such a packet is assigned a set of 5 (7 for the RTL839x) priority-related attributes which in the datasheet are called:

  • Port-based (Inner-tag) Priority (A priority depending on the inner VLAN-tag according to rules defining inner tags for ingress ports) [3]
  • Port-based Outer-tag Priority (A priority depending on the VLAN tag and port of the received packet) [4]
  • Inner-tag Priority: A priority based on the inner VLAN tag [6]
  • Outer-tag Priority: A priority based on the outer VLAN tag [7]
  • Differentiated Service Code Point (DSCP) in the received packet [5]

The RTL839x uses a different set of QoS attributes:

  • Port based [1]
  • Inner Tag [2]
  • Outer Tag [3]
  • DSCP [1]
  • ACL [4]
  • MAC-based VLAN [1] (A VLAN ID based on the Source Address of a packet)
  • Protocol-based VLAN [1]

All of these attributes are used to calculate a final inner priority which then gets mapped to output queues. For this each of the attributes gets a weight assigned, the default weights in the SDK are given above in square brackets. There are 4 different priority selection groups (transformation groups of priority-attributes to inner priority), but the SDK makes use only of group 0 and sets it up with the default weights. The setup is done with a control register per selection group with the following fields:

RTL8380_PRI_SEL_TBL_CTRL:
 - bit 0-2: Port (inner tag) weight (0-7)
 - bit 3-5: Port outer tag weight (0-7)
 - bit 8-11: DSCP weight (0-7)
 - bit 12-14: Inner tag weight (0-7)
 - bit 16-18: Outer tag weight (0-7)
RTL8390_PRI_SEL_TBL_CTRL:
 - bit 0-2: Port based weight (0-7)
 - bit 4-6: Ingress ACL weight
 - bit 8-10: DSCP weight
 - bit 12-14: Inner tag weight
 - bit 16-18: Outer tag weight
 - bit 20-22: MAC-based VLAN weight
 - bit 24-26: Protocol-based VLAN weight
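
For the RTL8390 layout, where every weight sits in its own nibble, the register value can be packed as in this sketch. The function name is hypothetical. The example values in the usage note are the default weights given in square brackets in the RTL839x attribute list.

```c
#include <stdint.h>

/* Pack the seven priority weights (0-7 each) into RTL8390_PRI_SEL_TBL_CTRL. */
static uint32_t rtl8390_pri_sel_pack(unsigned port, unsigned acl, unsigned dscp,
				     unsigned itag, unsigned otag,
				     unsigned mac_vlan, unsigned proto_vlan)
{
	return (port & 7u) | ((acl & 7u) << 4) | ((dscp & 7u) << 8) |
	       ((itag & 7u) << 12) | ((otag & 7u) << 16) |
	       ((mac_vlan & 7u) << 20) | ((proto_vlan & 7u) << 24);
}
```

With the SDK default weights (port 1, ACL 4, DSCP 1, inner tag 2, outer tag 3, MAC-based VLAN 1, protocol-based VLAN 1) the packed value is 0x01132141.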

There is no description of the algorithm used to apply these weights, but it appears a re-scaled scalar product of the priority attributes with the weights is being calculated, which gives the internal priority.

The internal priority is mapped to a queue for each of the egress ports and the CPU-Port. With the default setup, the mapping looks like this:

              Number of Available Output Queue
      Priority  1   2   3   4   5   6   7   8
            0   0   0   0   0   0   0   0   0
            1   0   0   0   0   0   0   0   1
            2   0   0   0   1   1   1   1   2
            3   0   0   0   1   1   2   2   3
            4   0   1   1   2   2   3   3   4
            5   0   1   1   2   3   4   4   5
            6   0   1   2   3   4   5   5   6
            7   0   1   2   3   4   5   6   7

This means that for priority 0, only queue 0 is available, whereas for priority 2, queues 0, 1 and 2 are available, and for packets with internal priority 7, all 8 output queues are available.
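
The default mapping above can be expressed as a simple lookup table. The array and function names are hypothetical, the values are copied from the table.

```c
#include <assert.h>
#include <stdint.h>

/* pri2queue[priority][number of available queues - 1], default setup */
static const uint8_t pri2queue[8][8] = {
	{ 0, 0, 0, 0, 0, 0, 0, 0 },
	{ 0, 0, 0, 0, 0, 0, 0, 1 },
	{ 0, 0, 0, 1, 1, 1, 1, 2 },
	{ 0, 0, 0, 1, 1, 2, 2, 3 },
	{ 0, 1, 1, 2, 2, 3, 3, 4 },
	{ 0, 1, 1, 2, 3, 4, 4, 5 },
	{ 0, 1, 2, 3, 4, 5, 5, 6 },
	{ 0, 1, 2, 3, 4, 5, 6, 7 },
};

/* Highest queue usable by 'prio' when 'nqueues' output queues exist. */
static unsigned max_queue(unsigned prio, unsigned nqueues)
{
	assert(prio < 8 && nqueues >= 1 && nqueues <= 8);
	return pri2queue[prio][nqueues - 1];
}
```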

For packets received by the CPU-Port the output queue to the CPU port corresponds to the 8 RX rings. The mapping between internal priority and the queue number is done via the RTL83XX_QM_PKT2CPU_INTPRI_MAP register, which has the following bits:

RTL83XX_QM_PKT2CPU_INTPRI_MAP:
 - bit 0-2: Maximum Queue Number for priority 0
 - bit 3-5: Maximum Queue Number for priority 1
 - bit 6-8: Maximum Queue Number for priority 2
 - bit 9-11: Maximum Queue Number for priority 3
 - bit 12-14: Maximum Queue Number for priority 4
 - bit 15-17: Maximum Queue Number for priority 5
 - bit 18-20: Maximum Queue Number for priority 6
 - bit 21-23: Maximum Queue Number for priority 7

The mapping to the egress queue is done with the RTL83XX_QM_INTPRI2QID_CTRL register, which has the same bit-layout assigning a maximum queue to a given internal priority as the RTL83XX_QM_PKT2CPU_INTPRI_MAP register.
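
Both registers can be packed with the same helper, assuming three bits per priority (so that priority 4 occupies bits 12-14). The function names are hypothetical.

```c
#include <stdint.h>

/* Pack the maximum queue number for priorities 0-7, 3 bits per priority. */
static uint32_t intpri_map_pack(const uint8_t queue[8])
{
	uint32_t reg = 0;

	for (int prio = 0; prio < 8; prio++)
		reg |= (uint32_t)(queue[prio] & 0x7) << (3 * prio);
	return reg;
}

/* Read back the maximum queue number for one priority. */
static unsigned intpri_map_queue(uint32_t reg, unsigned prio)
{
	return (reg >> (3 * prio)) & 0x7;
}
```

An identity mapping (priority n to queue n) packs to 0xFAC688.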

Additionally, it is possible to give packets priority due to their content and directly assign them to a CPU-Port queue. This is done with the following 3 registers:

RTL8380_QM_PKT2CPU_INTPRI_0: (bits give queue assigned)
 - bits 0-2: Invalid (packets?)

On the RTL839x, scheduling of the queues is controlled by the SCHED table, the index into the table is the port number, the fields are

 - bits 268-287: SCHED_LB_APR_Q7: 
 - bits 248-267: SCHED_LB_APR_Q6:
 - bits 228-247: SCHED_LB_APR_Q5:
 - bits 208-227: SCHED_LB_APR_Q4:
 - bits 188-207: SCHED_LB_APR_Q3:
 - bits 168-187: SCHED_LB_APR_Q2:
 - bits 148-167: SCHED_LB_APR_Q1:
 - bits 128-147: SCHED_LB_APR_Q0:
 - bits 118-127: SCHED_WEIGT_Q7: Scheduling weight for Queue 7 for the given port
 - bits 108-117: SCHED_WEIGT_Q6: Scheduling weight for Queue 6 for the given port
 - bits 98-107: SCHED_WEIGT_Q5: Scheduling weight for Queue 5 for the given port
 - bits 88-97: SCHED_WEIGT_Q4: Scheduling weight for Queue 4 for the given port
 - bits 78-87: SCHED_WEIGT_Q3: Scheduling weight for Queue 3 for the given port
 - bits 68-77: SCHED_WEIGT_Q2: Scheduling weight for Queue 2 for the given port
 - bits 58-67: SCHED_WEIGT_Q1: Scheduling weight for Queue 1 for the given port
 - bits 48-57: SCHED_WEIGT_Q0: Scheduling weight for Queue 0 for the given port
 - bits 40-47: FIX_TKNQX (X = 0-7): 
 - bits 20-39: SCHED_EGR_RATE: Total egress rate for that port
 - bit 19: SCHED_TYPE: The type of the scheduler: 0 = Weighted-Fair-Queueing, 1 = Weighted-Round-Robin

Last modified: 2022/12/30 15:41 by svanheule