This documentation was written after getting effective results. We will retrace the road of tuning client and server, but to make it easier to focus on a single side at a time, the other peer always runs a known "good" configuration. Monitoring graphs for the different benches can be found [[http://www.hagtheil.net/files/system/benches10gbps/direct/|here]].

====== Server ======

The main focus is to tune the server so it can handle a lot of connections. The changes are made and ordered so that each one gives a noticeable gain. Some changes could have been done much earlier, but often with little impact at that point.

===== baseline =====

No tuning, just a fresh nginx install serving the default home page (a very small HTML file).

Input file: small-1.txt
<code>
new page0 0 get 10.128.0.0:80 /
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

20932 hits/s

OK, that gives us a baseline: what we get without even trying.

===== All your core are belong to us =====

The default nginx configuration only has 4 workers. The system sees 24 CPUs. Let's get 24 workers!

file: /etc/nginx/nginx.conf
<code>
-worker_processes 4;
+worker_processes 24;
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

We get some errors in /var/log/nginx/error.log:
<code>
[...] accept4() failed (24: Too many open files)
</code>

Increase the number of open files. That's just memory, and memory is cheap. Instead of 1k files (''ulimit -n'' shows 1024), let's allow 1M files (1048576).

file: /etc/default/nginx
<code>
+ULIMIT="-n 1048576"
</code>

New error...
<code>
[...] "/var/log/nginx/access.log" failed (28: No space left on device) while logging request [...]
</code>

No space left? Why am I even logging my requests? That's heavy disk I/O and should simply be removed. Let's stop writing the useless access.log (but keep the error.log: there shouldn't be anything in it, and if there is, it will probably be useful).

file: /etc/nginx/nginx.conf
<code>
-access_log /var/log/nginx/access.log;
+access_log off;
</code>

Yet another error...
<code>
768 worker_connections are not enough
</code>

Let's allow a LOT of connections, so this one doesn't come back anytime soon.

file: /etc/nginx/nginx.conf
<code>
-worker_connections 768;
+worker_connections 524288;
</code>

Yeah, no more errors.

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

47875 hits/s

Good, we are getting somewhere. 24 processes that can handle connections are better than 4.

===== Multiple ways to get in =====

There might be some limitation on the single bound socket (for example the kernel locking the socket to check whether the accept queue is too long before accepting the connection... pure speculation, the code was not checked). Let's replace the single listen with multiple IPs to listen on.

file: /etc/nginx/sites-enabled/default
<code>
-#listen 80;
+listen 10.128.0.0:80;
+listen 10.128.0.1:80;
[...]
+listen 10.128.0.23:80;
</code>

New input file: small-24.txt
<code>
new page0 0 get 10.128.0.0:80 /
new page1 0 get 10.128.0.1:80 /
[...]
new page23 0 get 10.128.0.23:80 /
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

50743 hits/s

Good, it does help not to be limited by a single socket.

===== sorry to interrupt =====

The overall CPU graph shows that one CPU is much more used than the others. Checking the CPU#0 graph, we can see that a lot of its time is spent in soft interrupts. We should try to assign the interrupts to the other CPUs too.
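The same observation can also be made from the command line; a quick sketch (it assumes the sysstat package is installed for ''mpstat''):

<code>
# per-CPU utilisation; the %soft column shows time spent in softirqs
mpstat -P ALL 1

# per-CPU softirq counters; NET_RX climbing on a single CPU means
# that CPU is doing all the network receive work
watch -n1 'cat /proc/softirqs'

# which CPU currently services each NIC queue interrupt
grep eth1 /proc/interrupts
</code>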
As we can see in ''/proc/interrupts'', we have 24 interrupts for each interface (as many as the CPUs - hardware threads - seen by the system). A first approach is to assign them in order:

<code>
eth1-TxRx-0  -> cpu 0
eth1-TxRx-1  -> cpu 1
[...]
eth1-TxRx-23 -> cpu 23
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

53721 hits/s

Better.

===== stop locking yourselves =====

Now that the network interrupts are no longer a bottleneck, we get a nice number of connections each second. Nginx just doesn't accept them fast enough. By default, nginx uses a mutex so that only one process accepts new connections at a time. Who cares? What if every worker tried to accept? Most of them will fail, but the ones that succeed get a new socket right away; that could speed things up.

file: /etc/nginx/nginx.conf
<code>
+accept_mutex off;
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

97682 hits/s

Wow. That much was lost just because nginx was locking itself and preventing the other workers from grabbing new connections at the same time.

===== too crowded =====

  * We have 24 interrupts spread over our 24 CPUs.
  * We have 24 nginx workers on our 24 CPUs.

What if we use fewer workers?

file: /etc/nginx/nginx.conf
<code>
-worker_processes 24;
+worker_processes 16;
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

126731 hits/s

What if we go down to 12?

file: /etc/nginx/nginx.conf
<code>
-worker_processes 16;
+worker_processes 12;
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

138247 hits/s

So much for having as many workers as CPUs.

===== lets focus =====

Having fewer workers than CPUs gives better performance because it leaves more free CPUs to handle the IRQs. So what if we split things explicitly: IRQs on some CPUs, workers on the others?

From ''/sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list'', we get an idea of how the CPU numbers map to processors, cores and threads:

^ CPU ^ processor ^ core ^ thread ^
| 0-5 | 0 | 0-5 | 0 |
| 6-11 | 1 | 0-5 | 0 |
| 12-17 | 0 | 0-5 | 1 |
| 18-23 | 1 | 0-5 | 1 |

How to split? Let's try different splits.

Each core has 2 threads. Let's use one thread for IRQs and one for a worker.
<code>
irq 0-23 => cpu 0-11,0-11
workers  => cpu 12-23
</code>
184769 hits/s

We have 2 physical processors with 12 threads each. Let's try one processor for IRQs and one for workers.
<code>
irq 0-23 => cpu 0-5,12-17,0-5,12-17 (processor #0)
workers  => cpu 6-11,18-23          (processor #1)
</code>
190712 hits/s

Better. What if we use the first 3 cores (both threads) of each processor for IRQs, and the last 3 for workers?
<code>
irq 0-23 => cpu 0-2,6-8,12-14,18-20,0-2,6-8,12-14,18-20
workers  => cpu 3-5,9-11,15-17,21-23
</code>
187394 hits/s

Not as good. Maybe now that we have a separation, we can add a few workers back and gather the IRQs onto fewer CPUs: 8 CPUs for IRQs, 16 workers (worker_processes back to 16).

Let's try again to use one thread for IRQs and one for workers, on the first 4 cores of each processor.
<code>
irq 0-23 => cpu 0-3,6-9,0-3,6-9,0-3,6-9
workers  => cpu 4,5,10-23
</code>
153129 hits/s

Ouch, not that good. What about one processor for IRQs, using the first 4 cores (both threads) of processor #0?
<code>
irq 0-23 => cpu 0-3,12-15,0-3,12-15,0-3,12-15
workers  => cpu 4-11,16-23
</code>
218857 hits/s

Wow, much better. Just changing which threads handle what has a big impact.

===== pin the hopper =====

Our nginx now has 16 processes working on 16 CPUs. Why not bind each process to a single CPU, so they stop hopping from one to another?

224544 hits/s

Better yet, just from CPU affinity.
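How the pinning is done is not shown above; a minimal sketch using nginx's ''worker_cpu_affinity'' directive, assuming the worker CPUs 4-11 and 16-23 from the previous section (the rightmost bit of each mask is CPU 0):

<code>
worker_processes 16;

# one mask per worker: this lands the 16 workers on CPUs 4-11 and 16-23
worker_cpu_affinity
    000000000000000000010000 000000000000000000100000
    000000000000000001000000 000000000000000010000000
    000000000000000100000000 000000000000001000000000
    000000000000010000000000 000000000000100000000000
    000000010000000000000000 000000100000000000000000
    000001000000000000000000 000010000000000000000000
    000100000000000000000000 001000000000000000000000
    010000000000000000000000 100000000000000000000000;
</code>

An alternative is to pin the already-running workers afterwards with ''taskset -pc''.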
===== keep it opened =====

Now that we have a nice quick data path, our nginx serves the same single file about 200k times per second. Maybe it should cache the open file instead of looking it up from scratch on every request; at that rate, it might make a difference.

file: /etc/nginx/nginx.conf
<code>
+open_file_cache max=1000;
</code>

236607 hits/s

===== I can has cookies =====

The kernel logs some SYN flood warnings:
<code>
TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
</code>

Let's get that off our back (some of these options are not related to that message, but they are included here too):

file: /etc/sysctl.conf
<code>
+net.ipv4.tcp_fin_timeout = 1
+net.ipv4.tcp_tw_recycle = 1
+net.ipv4.tcp_tw_reuse = 1
+net.ipv4.tcp_syncookies = 0
+net.core.netdev_max_backlog = 1048576
+net.core.somaxconn = 1048576
+net.ipv4.tcp_max_syn_backlog = 1048576
</code>

Check how it holds over a longer period:

<code>
/root/inject -p 24 -d 600 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

236103 hits/s

We can hold 236k connections per second without hitting any limit in any log.

===== about client =====

The server bench was done with a patched version of inject that pinned each process to a single CPU, with the network interrupts gathered on a few CPUs. This gave the best result at the time, but the client tests below show it is not optimal.

====== Client ======

Now let's get back to tuning the client. We reset the client to a default configuration and tune it until it reaches a high hit rate. The server keeps its latest configuration.

We already established that hitting multiple IPs is better than hitting a single one, so we keep that part in place.

As our client needs to connect at a high rate, we have to use multiple source IPs. If we don't, we quickly run out of source ip/port -> destination ip/port tuples. Having the client bind to an IP without specifying the port (letting it be picked from the ephemeral port range) still hits the same limit (at least under Linux). That means we need a client that binds to a specific IP AND port for each outgoing connection.

inject does just that. It takes a range of IPs and a range of ports, splits the ports between the processes, and tries each port with every IP in the range before moving on to the next port.

At our connection rate, and to present a decent number of different sources, a /20 is used (4096 IPs) along with all the upper ports (1024 -> 65535), which gives about 264M ip/port tuples (4096 x 64512).

Note: at the rate we reach, the client burns on average about 60 ports per second, so it takes about 18 minutes before it loops back to the first ports.

===== baseline =====

Let's get a few baselines, starting with 1 process and 1 user.

<code>
/root/inject -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

4984 hits/s

That's what a single user gets: about 0.20 ms per query.

===== more processes =====

1 process is nice, but there is no reason not to use more, as we have 24 hardware threads.

<code>
/root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

51080 hits/s

===== interrupt someone else =====

As on the server, CPU#0 is saturated with soft interrupts. Let's spread the network IRQs over all the CPUs (IRQs 0-23 to CPUs 0-23).

<code>
/root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

112035 hits/s
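The commands used to move the interrupts are not shown here (nor in the server section); one way to do it is to write a CPU bitmask into ''/proc/irq/<n>/smp_affinity''. A minimal sketch for the one-queue-per-CPU layout used here, assuming the queues show up as eth1-TxRx-0 to eth1-TxRx-23 in /proc/interrupts and that irqbalance is not running (it would overwrite the masks):

<code>
#!/bin/bash
# Spread the 24 NIC queue interrupts over CPUs 0-23, one queue per CPU,
# by writing a hex CPU bitmask into /proc/irq/<irq>/smp_affinity.
for n in $(seq 0 23); do
  # find the IRQ number of queue eth1-TxRx-<n>
  irq=$(awk -v q="eth1-TxRx-$n" '$NF == q { sub(":", "", $1); print $1 }' /proc/interrupts)
  # bitmask with only bit <n> set, i.e. CPU <n>
  printf '%x' $((1 << n)) > /proc/irq/$irq/smp_affinity
done
</code>

The other layouts tried on the server side only change which CPU bit is written for which queue.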
===== more users =====

Let each process simulate more users.

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

228367 hits/s

===== no timestamp =====

By default, TCP puts timestamps on its connections. When chasing the little bit of performance we are still missing, it can be a good idea not to set them. (Note: this can be done on the server OR the client, with similar results.)

file: /etc/sysctl.conf
<code>
net.ipv4.tcp_timestamps = 0
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

241193 hits/s

====== dual ======

To check on which side the bottleneck is, let's try with 2 servers, then with 2 clients. The tests are done with the latest client and server configurations, which give about 240k hits/s.

===== dual servers =====

We set up a second server with the same configuration, and check that it can also handle 240k hits/s on its own. Then we change the scenario to hit the 24 IPs of both servers.

New input file: dual-24.txt
<code>
new page0a 0 get 10.128.0.0:80 /
new page0b 0 get 10.132.0.0:80 /
new page1a 0 get 10.128.0.1:80 /
new page1b 0 get 10.132.0.1:80 /
[...]
new page23a 0 get 10.128.0.23:80 /
new page23b 0 get 10.132.0.23:80 /
</code>

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f dual-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

401391 hits/s

Although the client seemed to be using all its CPU at 240k hits/s, it can still go up and handle 400k hits/s. The bottleneck is probably not on that side.

===== dual client =====

We set up a second client with the same configuration, and check that it can also generate 240k hits/s on its own. To launch both clients at the same time, cssh is very handy :)

<code>
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
</code>

123016 hits/s
121312 hits/s
total: 244328 hits/s

The client is clearly not the limitation: with two clients, we get the same total.

====== conclusions ======

The benches above show the following:

  * as everyone knows, using multiple cores is better than using only one
  * SMP affinity matters, and can make a huge difference
  * under high load, it can be better to segregate core usage (as shown by separating IRQs and nginx)
  * under high load, reducing the number of processes to one per used core is better
  * 240k connections per second is doable on a single host

For some reason unknown at the time of writing, the connection rate sometimes drops sharply for 1-2 s, as can be seen on the [[http://www.hagtheil.net/files/system/benches10gbps/direct/bench-bad/nginx-bad/elastiques-nginx/|bench-bad/nginx-bad]] graphs. I tried to avoid using results showing that behaviour. Any ideas or hints on what could produce it are welcome.

====== post-bench ======

After publishing the first benches, someone advised using httpterm instead of nginx. Unlike nginx, httpterm is aimed purely at stress benches, not at serving real pages.

Benching a multi-process httpterm directly shows a bug: it still sends the headers, but fails to send the data. Going down to 1 process keeps it working, but obviously does not use all the cores. Since 16 cores are reserved for the web server, 16 single-process instances with 1 IP each were launched, each pinned to its own CPU with taskset.

file-0.cfg:
<code>
# taskset 000010 ./httpterm -D -f file-0.cfg
global
    maxconn 30000
    ulimit-n 500000
    nbproc 1
    quiet

listen proxy1 10.128.0.0:80
    object weight 1 name test1 code 200 size 200
    clitimeout 10000
</code>

That gives even more connections per second: 278765 hits/s. It helps get more requests per second, but we still see some stalls at times.
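The exact launch commands are not listed above; a minimal sketch of how the 16 pinned instances could be generated and started follows. The IP list (10.128.0.0-10.128.0.15) and the full CPU list (4-11 and 16-23) are assumptions; only the CPU-4 taskset example above is given.

<code>
#!/bin/bash
# Generate one single-process httpterm config per IP and pin each
# instance to its own CPU with taskset.
# Assumptions: IPs 10.128.0.0-10.128.0.15 and the CPUs previously
# reserved for the web server workers (4-11,16-23).
cpus=(4 5 6 7 8 9 10 11 16 17 18 19 20 21 22 23)
for i in $(seq 0 15); do
  cat > file-$i.cfg <<EOF
global
    maxconn 30000
    ulimit-n 500000
    nbproc 1
    quiet

listen proxy1 10.128.0.$i:80
    object weight 1 name test1 code 200 size 200
    clitimeout 10000
EOF
  taskset -c ${cpus[$i]} ./httpterm -D -f file-$i.cfg
done
</code>

The inject scenario file then just has to point at whichever IPs the httpterm instances listen on.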