Asynchronous packet socket writing with PACKET_TX_RING

2016 update

This post is quite old by now. For a more recent example, take a look at github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c

In my last post, I showed how you can read packets enqueued on a packet socket without system calls, by setting up a memory mapped ring buffer between kernel and userspace. Since version 2.6.31, the kernel also supports a transmission ring (or at least, the macro exists since that version; I tested this code against version 2.6.36).

Setting up a transmission ring is trivial once you know how to create a reception ring. In the setup snippet of the previous post, simply change the call to init_packetsock to read

 
  fd = init_packetsock(&ring, PACKET_TX_RING);
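
For readers who do not have the previous post at hand, here is a rough sketch of what such a setup function might look like. It is not the original init_packetsock: the name sketch_init_packetsock, the helper fill_ring_req, the extra frames parameter and the choice of a SOCK_DGRAM socket with a single block of page-sized frames are all assumptions made for illustration. Creating the socket needs CAP_NET_RAW (typically root).

```c
#include <stddef.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

// Hypothetical helper: describe a ring of `frames` page-sized frames
// that all live in one block (an assumed layout, not the only valid one).
static void
fill_ring_req(struct tpacket_req *tp, unsigned int frames)
{
  tp->tp_frame_size = getpagesize();
  tp->tp_frame_nr   = frames;
  tp->tp_block_size = frames * getpagesize();
  tp->tp_block_nr   = 1;
}

// Hypothetical stand-in for init_packetsock().
// ringtype is PACKET_RX_RING or PACKET_TX_RING.
// Returns the socket descriptor, or -1 on failure.
static int
sketch_init_packetsock(char **ring, int ringtype, unsigned int frames)
{
  struct tpacket_req tp;
  int fd;

  // SOCK_DGRAM: packets are handled from the network layer up
  fd = socket(PF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
  if (fd < 0)
    return -1;

  // ask the kernel to allocate the ring
  fill_ring_req(&tp, frames);
  if (setsockopt(fd, SOL_PACKET, ringtype, &tp, sizeof(tp)) < 0) {
    close(fd);
    return -1;
  }

  // map the ring into our address space
  *ring = mmap(NULL, (size_t) tp.tp_block_size * tp.tp_block_nr,
               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (*ring == MAP_FAILED) {
    close(fd);
    return -1;
  }
  return fd;
}
```

The only difference between the receive and transmit cases in this sketch is the option name passed to setsockopt().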

Then, at runtime, write packets as follows:
 
// needs <assert.h>, <errno.h>, <poll.h>, <stdio.h>, <string.h>,
//       <unistd.h>, <sys/socket.h> and <linux/if_packet.h>

/// transmit a packet using the packet ring
//  NOTE: for high-rate processing, batch system calls by writing
//        multiple packets to the ring before calling sendto()
//
//  @param pkt is a packet from the network layer up (e.g., IP)
//  @return 0 on success, -1 on failure
static int
process_tx(int fd, char *ring, const char *pkt, int pktlen)
{
  static int ring_offset = 0;

  // packet data starts right after the aligned frame header
  const int data_off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
  struct tpacket_hdr *header;
  struct pollfd pollset;
  char *off;
  int ret;

  // refuse packets that do not fit in a single frame
  if (pktlen < 0 || pktlen > getpagesize() - data_off)
    return -1;

  // fetch a frame
  // as in the PACKET_RX_RING case, we define frames to be a page long,
  // including their header. This explains the use of getpagesize().
  header = (void *) (ring + (ring_offset * getpagesize()));
  assert((((unsigned long) header) & (getpagesize() - 1)) == 0);
  while (header->tp_status != TP_STATUS_AVAILABLE) {

    // if none available: wait until the kernel frees a frame
    pollset.fd = fd;
    pollset.events = POLLOUT;
    pollset.revents = 0;
    ret = poll(&pollset, 1, 1000 /* don't hang */);
    if (ret < 0 && errno != EINTR) {
      perror("poll");
      return -1;
    }
    // on timeout or EINTR: recheck the frame status
  }

  // fill data
  off = ((char *) header) + data_off;
  memcpy(off, pkt, pktlen);

  // fill header
  header->tp_len = pktlen;
  header->tp_status = TP_STATUS_SEND_REQUEST;

  // advance the producer ring pointer
  // (the mask requires CONF_RING_FRAMES to be a power of two)
  ring_offset = (ring_offset + 1) & (CONF_RING_FRAMES - 1);

  // notify kernel
  if (sendto(fd, NULL, 0, 0, (struct sockaddr *) &txring_daddr,
             sizeof(txring_daddr)) < 0) {
    perror("sendto");
    return -1;
  }

  return 0;
}
As the function comment says, this example makes inefficient use of the ring, because it issues a sendto() call for every packet that it writes. The whole purpose of the ring is to transmit multiple packets without having to issue a system call (and cause a kernel-mode switch) for each one.
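
To make the batching idea concrete, here is a minimal sketch that splits the work in two: one function fills as many frames as it can without entering the kernel, and a second one issues a single sendto() for everything queued. The names queue_tx_batch and flush_tx, and the explicit nframes and ring_offset parameters, are hypothetical; the frame layout (one page per frame, TPACKET_V1 header) is assumed to match the rest of this post.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

// Hypothetical helper: copy up to npkts packets into consecutive frames.
// Stops early if the next frame is still owned by the kernel or if a
// packet does not fit in one frame. Returns the number of packets queued.
static int
queue_tx_batch(char *ring, int nframes, int *ring_offset,
               const char **pkts, const int *pktlens, int npkts)
{
  const size_t framelen = getpagesize();
  const size_t data_off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
  int i;

  for (i = 0; i < npkts; i++) {
    struct tpacket_hdr *header = (void *) (ring + (*ring_offset * framelen));

    if (header->tp_status != TP_STATUS_AVAILABLE)
      break;  // ring is full: flush and try again later
    if (pktlens[i] < 0 || (size_t) pktlens[i] > framelen - data_off)
      break;  // packet does not fit in a single frame

    memcpy((char *) header + data_off, pkts[i], pktlens[i]);
    header->tp_len = pktlens[i];
    header->tp_status = TP_STATUS_SEND_REQUEST;

    *ring_offset = (*ring_offset + 1) % nframes;
  }
  return i;
}

// Hypothetical helper: a single system call that asks the kernel to
// transmit every frame marked TP_STATUS_SEND_REQUEST.
static int
flush_tx(int fd, const struct sockaddr_ll *daddr)
{
  if (sendto(fd, NULL, 0, 0, (const struct sockaddr *) daddr,
             sizeof(*daddr)) < 0)
    return -1;
  return 0;
}
```

A caller would queue a burst with queue_tx_batch(), call flush_tx() once, and retry whatever did not fit after the kernel has released some frames.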
process_tx also uses a global variable, txring_daddr, that has not been introduced yet. Packets are copied to the Tx ring from the network layer up, so the kernel needs the link layer information in this destination address structure to complete each packet. I do not know why we cannot just write packets from the link layer up, but this works. The following snippet sets up the destination address structure. It fills in the link layer destination as the broadcast address ff:ff:ff:ff:ff:ff. Replace this with a sane address in your code.
 

// needs <stdio.h>, <string.h>, <arpa/inet.h>, <net/if.h>,
//       <sys/ioctl.h>, <linux/if_ether.h> and <linux/if_packet.h>

static struct sockaddr_ll txring_daddr;

/// create a link-layer destination address
//  @param fd is the packet socket (any open socket works for the ioctl)
//  @param ringdev is a link layer device name, such as "eth0"
//  @return 0 on success, -1 on failure
static int
init_ring_daddr(int fd, const char *ringdev)
{
    struct ifreq ifreq;

    // get device index
    memset(&ifreq, 0, sizeof(ifreq));
    strncpy(ifreq.ifr_name, ringdev, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFINDEX, &ifreq)) {
      perror("ioctl");
      return -1;
    }

    txring_daddr.sll_family    = AF_PACKET;
    txring_daddr.sll_protocol  = htons(ETH_P_IP);
    txring_daddr.sll_ifindex   = ifreq.ifr_ifindex;

    // set the link-layer destination address
    // NOTE: this should be a real address, not ff:ff:...
    txring_daddr.sll_halen     = ETH_ALEN;
    memset(txring_daddr.sll_addr, 0xff, ETH_ALEN);

    return 0;
}
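
If you want to replace the broadcast address with a real one, something along these lines could work. This is only a sketch: set_ring_dmac is a hypothetical helper that is not part of the original code, and it assumes the address arrives as a colon-separated string such as "00:1b:21:0a:0b:0c".

```c
#include <stdio.h>
#include <net/ethernet.h>      // ETH_ALEN
#include <netpacket/packet.h>  // struct sockaddr_ll

// Hypothetical helper: parse "aa:bb:cc:dd:ee:ff" into daddr->sll_addr.
// Returns 0 on success, -1 if the string is not a valid MAC address.
static int
set_ring_dmac(struct sockaddr_ll *daddr, const char *mac)
{
  unsigned int b[ETH_ALEN];
  char trailing;
  int i;

  // the final %c only matches if there is garbage after the sixth byte,
  // in which case sscanf() returns 7 instead of 6
  if (sscanf(mac, "%2x:%2x:%2x:%2x:%2x:%2x%c",
             &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &trailing) != ETH_ALEN)
    return -1;

  daddr->sll_halen = ETH_ALEN;
  for (i = 0; i < ETH_ALEN; i++)
    daddr->sll_addr[i] = (unsigned char) b[i];
  return 0;
}
```

Calling set_ring_dmac(&txring_daddr, "00:1b:21:0a:0b:0c") after init_ring_daddr would then overwrite the broadcast address with the given one.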

The sockaddr_ll structure is defined in <netpacket/packet.h>.
