HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bounded by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362        17     27622        7     6823
  128    758150     464364        21      9301       10     7738
  256    445632     774646        42     15507       21    12906
  512    232666     994445    241292     19147   241192     1062
 1024    119061    1000003    872519     19258   872511        0
 1440     85193    1000003    946576     19505   946569        0
 

Legend:
"Ipps" stands for input packets per second.
"Tput" == packets out of total 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load, the more times we couldn't clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from the table). So there is
a possibility that under low enough input, you get one poll call for each
input packet, caused by a single interrupt each time. And if the system
can't handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....
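
As a rough illustration of the interrupt/packet ratio discussed above, the
sketch below computes rx interrupts per packet for one row of the table.
Using the "Tput" column as the packet count is an assumption made only for
this illustration, and the helper name rx_ints_per_packet() is hypothetical.

```c
/* Hypothetical helper: rx interrupts per packet for one table row.
 * "packets" is taken from the Tput column (an assumption of this sketch). */
static double rx_ints_per_packet(unsigned long rxint, unsigned long packets)
{
	if (packets == 0)
		return 0.0;	/* avoid dividing by zero on an empty run */
	return (double)rxint / (double)packets;
}
```

With the 60-byte row (17 interrupts, 409362 packets) the ratio is far below
one interrupt per thousand packets; with the 1024-byte row (872519
interrupts, 1000003 packets) it approaches one interrupt per packet.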


0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up 
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll(). 
The rest of the events MAY be processed by the regular interrupt handler 
to reduce processing latency (justified also because there are not that 
many of them).
Note, however, NAPI does not enforce that dev->poll() only processes 
receive events. 
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document is to move the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handling to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to 
dev->poll(). Different NICs work differently depending on their status/event 
acknowledgement setup. 
There are two types of event register ACK mechanisms.
	I)  what is known as Clear-on-read (COR).
	when you read the status/event register, it clears everything!
	The natsemi and sunbmac NICs are known to do this.
	In this case your only choice is to move all to dev->poll()

	II) Clear-on-write (COW)
	 i) you clear the status by writing a 1 in the bit-location you want.
		These are the majority of the NICs and work the best with NAPI.
		Put only receive events in dev->poll(); leave the rest in
		the old interrupt handler.
	 ii) whatever you write in the status register clears everything ;->
		We can't seem to find any NICs supported by Linux which do this.
		If someone knows of such a chip, please email us.
		Move all to dev->poll()
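
To make case II.i concrete, here is a minimal userspace model of a
clear-on-write status register, assuming a made-up bit layout. The EV_*
bits, struct w1c_reg and both helpers are hypothetical and do not describe
any real NIC; the point is only that the non-receive events can be
acknowledged while the receive bits stay pending for dev->poll().

```c
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define EV_RX		0x01u
#define EV_RX_NOBUF	0x02u
#define EV_TX		0x04u
#define EV_LINK		0x08u

/* Software model of a clear-on-write status register: writing a 1 to a
 * bit position clears that bit, writing a 0 leaves it untouched. */
struct w1c_reg {
	uint32_t status;
};

static void w1c_write(struct w1c_reg *reg, uint32_t value)
{
	reg->status &= ~value;
}

/* Acknowledge everything except the receive events, which are left
 * pending so that dev->poll() can clear them when it does the work. */
static void ack_non_rx_events(struct w1c_reg *reg)
{
	uint32_t pending = reg->status;

	w1c_write(reg, pending & ~(EV_RX | EV_RX_NOBUF));
}
```

With a clear-on-read register this split is not possible, since the read
itself wipes every bit, which is why that case forces everything into
dev->poll().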

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
A new packet might sneak in during the small window in which interrupts
are being re-enabled (refer to appendix 2). We would only get to know about
such a packet when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to further
discussion.

Locking rules and environmental guarantees
==========================================

- Guarantee: only one CPU at any time can call dev->poll(); this is because
only one CPU can pick up the initial interrupt and hence the initial
netif_rx_schedule(dev).
- The core layer invokes devices to send packets in a round-robin fashion.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring).
****Guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to dev->poll()).
For example, link/MII and tx-complete events continue to function the same
old way. This improves the latency of processing these events. It is also
assumed that the receive interrupt is the largest cause of noise. Note this
might not always be true.
[According to Manfred Spraul, the winbond insists on sending one
tx-complete interrupt for each packet (although this can be mitigated).]
For these broken drivers, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for device

b) netif_rx_schedule_prep(dev)
puts the device in a state which allows for it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of  netif_rx_schedule(dev) above; the second half
being c) below.

c) __netif_rx_schedule(dev)
Add device to the poll list for this CPU; assuming that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for device specifically for some
deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
Remove the interface from the CPU poll list: it must be in the poll list
on the current CPU. This primitive is called by dev->poll() when
it completes its work. The device cannot be out of the poll list at this
call; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below. So keep reading for clarity.
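
The relationship between methods a), b), c) and e) can be sketched with a
small userspace model. This is not the kernel implementation: struct
model_dev and the model_* functions are hypothetical names, and plain
booleans stand in for whatever atomic state the kernel really keeps.

```c
#include <stdbool.h>

/* Userspace model only; all names here are hypothetical. */
struct model_dev {
	bool running;		/* interface is up */
	bool rx_sched;		/* device has been claimed for polling */
	bool on_poll_list;	/* device sits on this CPU's poll list */
};

/* Models b): claim the device for polling. Returns 1 only if the device
 * is running and nobody has claimed it yet. */
static int model_schedule_prep(struct model_dev *dev)
{
	if (!dev->running || dev->rx_sched)
		return 0;
	dev->rx_sched = true;
	return 1;
}

/* Models c): add to the poll list; prep must already have returned 1. */
static void model_do_schedule(struct model_dev *dev)
{
	dev->on_poll_list = true;
}

/* Models a): simply the two halves above, back to back. */
static void model_schedule(struct model_dev *dev)
{
	if (model_schedule_prep(dev))
		model_do_schedule(dev);
}

/* Models e): leave the poll list and release the claim. */
static void model_complete(struct model_dev *dev)
{
	dev->on_poll_list = false;
	dev->rx_sched = false;
}
```

A second call to model_schedule_prep() before model_complete() returns 0,
which mirrors the "interrupt while in poll" case a driver has to handle in
its interrupt handler.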

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method 
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets on the current CPU before yielding to the network
subsystem (so other devices also get an opportunity to send to the stack).

dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
	The total number of packets sent cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver just sends
up to the requested number of packets to the stack, if it can.

more on dev->poll() below after the interrupt changes are explained.
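
The bookkeeping described above can be sketched in plain C as follows.
account_poll() is a hypothetical helper, not a kernel function, and capping
the work by both the quota and the budget is an assumption of this sketch;
the text above only requires that the total not exceed dev->quota.

```c
/* Hypothetical helper modelling one dev->poll() call.
 * pending: packets currently waiting on the rx ring
 * quota:   stands in for dev->quota
 * budget:  stands in for *budget
 * Returns 1 if packets remain (not done), 0 if the ring was drained. */
static int account_poll(int pending, int *quota, int *budget)
{
	int limit = *quota < *budget ? *quota : *budget;
	int sent = pending < limit ? pending : limit;

	/* the driver is responsible for decrementing both counters */
	*quota -= sent;
	*budget -= sent;

	return pending > sent;
}
```

For example, with a quota of 16, a budget of 300 and 40 packets pending,
16 packets go up the stack, the quota drops to 0, the budget to 284, and
the call reports "not done" so that the device stays on the poll list.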

2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method. 
e.g:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;


3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{

	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;

	int work_count = my_work_count;
	u32 status;

	status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE; /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;      /* Hot unplug */
        if (status & error)
		do_some_error_handling();
        
	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
		
		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}
			
		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	} while (!(status & error) && more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;

	u32 status;

	status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE;         /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;         /* Hot unplug */
        if (status & error)
		do_some_error_handling();
        
	do {
/************************ start note *********************************/		
		acknowledge_ints_ASAP();  /* don't ack rx and rxnobuff here */
/************************ end note *********************************/		

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/		
		if ((status & rx_interrupt) || (status & rx_nobufs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused 
			         *	by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell system we have work to be done. */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts  */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note note *********************************/		
			
		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);

			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/		
	} while (!(status & error) && more_work_to_be_done(status));
/************************ end note note *********************************/		
	return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet 
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff) .
This also means acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
the proper work is done within NAPI: in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in the running state and
is successfully put in a state that allows it to be added to the core poll
list. If we get a zero value we can _almost_ assume we are already added to
the list (instead of not running; the logic is based on the fact that you
shouldn't get an interrupt if not running).
We rectify this by disabling rx and rxnobuf interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() appear to have vanished
from the interrupt handler. These functionalities are actually still around:

in fact, receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll()

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
		/* the number 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

		/* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return;
                }

                if (--rx_work_limit < 0)
                        break;

		/* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
			.
			.
			netif_rx (skb);
			.
			.
                } else {  /* OOM */
			/*seems very driver specific ... some just pass
			whatever is on the ring already. */
                }

		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++ ;

        }

	/* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
	refill_rx_ring(dev);
	.
	.

}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{

	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	/* maximum packets to send to the stack */
/************************ note note *********************************/		
	int rx_work_limit = dev->quota;

/************************ end note note *********************************/		
    do {  // outer beginning loop starts here

	clear_rx_status_register_bit();

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
		/* the number 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

		/* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return 1;
                }

/************************ note note *********************************/		
                if (--rx_work_limit < 0) { /* we got packets, but no quota */
			/* store current ring pointer state */
			tp->cur_rx = cur_rx;

			/* Refill the Rx ring buffers if they are needed */
			refill_rx_ring(dev);
                        goto not_done;
		}
/**********************  end note **********************************/

		/* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
			.
			.
/************************ note note *********************************/		
			netif_receive_skb (skb);
/**********************  end note **********************************/
			.
			.
                } else {  /* OOM */
			/*seems very driver specific ... common is just pass
			whatever is on the ring already. */
                }

		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++ ;

        }

	/* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
	refill_rx_ring(dev);
	
	/* no packets on ring; but new ones can arrive since we last 
	   checked  */
	status = read_interrupt_status_reg();
	if (!rx_status_is_set) {
                        /* If something arrives in this narrow window,
			an interrupt will be generated */
                        goto done;
	}
	/* done! at least that's what it looks like ;->
	if new packets came in after our last check on status bits
	they'll be caught by the while check and we go back and clear them
	since we haven't exceeded our quota */
    } while (rx_status_is_set); 

done:

/************************ note note *********************************/		
        dev->quota -= received;
        *budget -= received;

        /* If RX ring is not full we are out of memory. */
        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                goto oom;

	/* we are happy/done, no more packets on ring; put us back
	to where we can start processing interrupts again */
        netif_rx_complete(dev);
	enable_rx_and_rxnobuf_ints();

       /* The last op happens after poll completion. Which means the following:
        * 1. it can race with disabling irqs in irq handler (which are done to 
	* schedule polls)
        * 2. it can race with dis/enabling irqs in other poll threads
        * 3. if an irq raised after the beginning of the outer beginning 
        * loop (marked in the code above), it will be immediately
        * triggered here.
        *
        * Summarizing: the logic may result in some redundant irqs both
        * due to races in masking and due to too late acking of already
        * processed irqs. The good news: no events are ever lost.
        */

        return 0;   /* done */

not_done:
        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        if (!received) {
                printk("received==0\n");
                received = 1;
        }
        dev->quota -= received;
        *budget -= received;
        return 1;  /* not_done */

oom:
        /* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
        return 0;  /* we'll take it from here so tell core "done"*/

/************************ End note note *********************************/		
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and a not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the OOM condition.
5) A new outer do { } while loop has been added. This serves the purpose of
ensuring that if a new packet has come in after we are all set and done,
and we have not exceeded our quota, we continue sending packets up.
 

-----------------------------------------------------------
Poll timer code will need to do the following:

a) 

        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) 
                refill_rx_ring(dev);

        /* If RX ring is not full we are still out of memory.
	   Restart the timer again. Else we re-add ourselves 
           to the master poll list.
         */

        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                restart_timer();
        else
                netif_rx_schedule(dev);  /* we are back on the poll list */
	
5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this; the top net layer takes
care of it.

6) Adding new Stats to /proc 
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC (flow control) only send a pause packet when they run
out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packet storm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note FC in itself is a good solution, but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence it falls
under the same category as using NIC-based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka the lazy receiving that NAPI offers), and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but is
not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme 
=============================================================

There are two types of associations seen here

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobuff is set and the corresponding 
interrupt-enable bit is not on, then no interrupts will be generated. However, 
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is 
generated.  [assuming the status bit was not turned off].
Generally the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in tulip).
The corresponding interrupt bit (CSR7 in tulip) might be turned off (but
the CSR5 bit will continue to be turned on with new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt bit on when the
status bit is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things, then when we enable
interrupts, there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
        do {
                ACK;
                while (ring_is_not_empty()) {
                        work-work-work
                        if quota is exceeded: exit, no touching irq status/mask
                }
                /* No packets, but new can arrive while we are doing this*/
                CSR5 := read
                if (CSR5 is not set) {
                        /* If something arrives in this narrow window here,
                        *  where the comments are ;-> irq will be generated */
                        unmask irqs;
                        exit poll;
                }
        } while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
Look at the last if statement:
you have just finished grabbing all the packets from the rx ring; you check
whether the status bit says more packets have just come in; it says none;
you then enable rx interrupts again. If a new packet came in during this
check, we are counting on CSR5 being set in that small window of opportunity,
so that re-enabling interrupts actually triggers an interrupt
to register the new packet for processing.
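
The level-triggered behaviour this scheme relies on can be modelled in a few
lines of C. This is a sketch under stated assumptions: struct nic_model,
RX_BIT and irq_asserted() are hypothetical, and the two fields only play
the roles that CSR5 (status) and CSR7 (interrupt enable) play on the tulip.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_BIT 0x1u	/* hypothetical bit position */

struct nic_model {
	uint32_t status;	/* pending events (CSR5-like) */
	uint32_t enable;	/* interrupt-enable mask (CSR7-like) */
};

/* Level-triggered: the interrupt line is asserted for as long as any
 * pending event is also enabled, no matter which bit was set first. */
static bool irq_asserted(const struct nic_model *nic)
{
	return (nic->status & nic->enable) != 0;
}
```

If a packet sets RX_BIT in status while the enable bit is off, no interrupt
is seen; the moment RX_BIT is set in enable, irq_asserted() becomes true,
so the packet that sneaked in is not left to rot.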

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware

These generally do not respect level-triggered IRQs. Normally,
irqs may be lost while being masked, and the only safe way to leave poll is
to do a double check for new input after netif_rx_complete() is invoked
and to re-enable polling if new input is seen.

Sample code:

---------
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts();
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs();
		goto restart_poll;
	}
---------
		
Basically netif_rx_complete() removes us from the poll list, but because a
new packet might come in during the race window and never be caught, we
check the ring again and, if needed, attempt to re-add ourselves to the
poll list.


APPENDIX 3: Scheduling issues.
==============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt, by
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It has the effect that the priority of ksoftirqd
needs to be considered when running very CPU-intensive applications together
with networking, to get the proper softirq/user balance. Increasing ksoftirqd
priority to 0 (eventually more) is reported to cure problems with low network
performance at high CPU load.

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated 

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton  <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>