Avalanche: strange behaviour in the simulations


#1

Hi,

I’m following the instructions on GitHub to run Avalanche on an EC2 instance. Everything seems to work, and I even ran some linear and exponential simulations, but the result charts look odd:


:point_up: This should be an exponential

:point_up: This should be linear

Both should run for 1 minute, but the exponential simulation runs for much longer…

It seems to run 9 exponential peaks and a baseline, but I expected the baseline to be constant and run throughout the simulation rather than just once, and the peaks to be much higher (3x) than they were.

Finally, below is the command-line output with:

$SP_SIM_TIME=1
$SP_BASELINE_USERS=100
$SP_PEAK_USERS=300
================================================================================
---- Global Information --------------------------------------------------------
request count                                       5887 (OK=5876   KO=11    )
min response time                                      2 (OK=2      KO=2     )
max response time                                   1611 (OK=1611   KO=257   )
mean response time                                    27 (OK=27     KO=68    )
std deviation                                         74 (OK=74     KO=91    )
response time 50th percentile                         11 (OK=11     KO=3     )
response time 75th percentile                         19 (OK=19     KO=119   )
response time 95th percentile                        124 (OK=121    KO=225   )
response time 99th percentile                        330 (OK=330    KO=250   )
mean requests/sec                                 94.952 (OK=94.774 KO=0.177 )
---- Response Time Distribution ------------------------------------------------
t < 800 ms                                          5869 (100%)
800 ms < t < 1200 ms                                   0 (  0%)
t > 1200 ms                                            7 (  0%)
failed                                                11 (  0%)
---- Errors --------------------------------------------------------------------
status.find.in(200,304,201,202,203,204,205,206,207,208,209), but actually found 502     11 (100.0%)
================================================================================

It stayed at roughly 100 reqs/sec, which is the baseline. It seems that it didn’t use the $SP_PEAK_USERS environment variable.
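One thing worth checking is whether those variables were actually exported into the environment the simulation runs in: in bash, an assignment written with a leading `$` (as in `$SP_SIM_TIME=1`) does not set anything, and a plain `SP_PEAK_USERS=300` without `export` is not visible to child processes. A minimal Python sketch to see what the process actually receives (the variable names are taken from the post above; everything else is illustrative):

```python
import os

# "<not set>" means the running process never received the variable
# (e.g. it was assigned without `export`, or with a stray leading `$`
# as in `$SP_SIM_TIME=1`).
for name in ("SP_SIM_TIME", "SP_BASELINE_USERS", "SP_PEAK_USERS"):
    print(name, "=", os.environ.get(name, "<not set>"))
```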

What is the expected behaviour of these simulations? Am I doing something wrong?

Thanks


#2

Hi @caio,

The simulations do not really work at just a one-minute length. I would recommend running them for 30-60 minutes to actually see the curves work correctly. Our use case when building Avalanche was to observe scaling and behaviour changes over several hours, and to ensure that we were scaling quickly enough to stay on top of the different styles of peaks.

The current simulations will likely need a lot of tweaking to work over such short timespans due to the way that time is being handled at the moment.

HTH,
Josh


#3

Hi @josh ,

Thanks a lot for your answer!

We were hoping to test how much memory the Scala Stream Collector needs and how large its buffers should be, but it seems that Avalanche was made to test whether our containers scale, not just one of them but the entire infrastructure, right?

Thanks again,
Caio


#4

Hey @caio,

We were hoping to test how much memory the Scala Stream Collector needs and how large its buffers should be, but it seems that Avalanche was made to test whether our containers scale, not just one of them but the entire infrastructure, right?

It is made for both - however, our most important use case was to test scaling. Memory usage and buffer sizing would also be easier to observe under a longer test case with varying volumes of data over time.


What are the answers you are trying to get out of your testing?

Thanks,
Josh


#5

Hi @josh,

We would like to know:

  • How much memory each container should have reserved
  • Which is better: one big container or lots of small ones?
  • When to scale up: %CPU limit, %memory limit, or network usage limit?
  • Increasing the number of containers increases the total buffer size. Do we lose real-time with that?
  • What happens if a container dies? Have we lost all its data?
  • The relationship between containers and instances

Lots of questions right? :sweat_smile:

Thanks,
Caio


#6

Hey @caio,

So for the Kinesis applications you also need to think in terms of stream limits.

http://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html

In the case of the Stream Collector you only need to think about the PUT limits: your containers combined can only push as fast as the stream will let them.
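As a rough way to reason about those PUT limits, here is an illustrative Python sketch. The per-shard figures (1,000 records/sec and 1 MiB/sec) come from the AWS limits page linked above and may change, and the traffic numbers are made up:

```python
# Kinesis per-shard PUT limits (check the AWS docs linked above for
# current values): 1,000 records/sec and 1 MiB/sec per shard.
RECORDS_PER_SHARD = 1_000
BYTES_PER_SHARD = 1_048_576  # 1 MiB

def shards_needed(events_per_sec: int, avg_event_bytes: int) -> int:
    """Minimum shard count so the combined collectors stay under PUT limits."""
    by_records = -(-events_per_sec // RECORDS_PER_SHARD)                  # ceil
    by_bytes = -(-(events_per_sec * avg_event_bytes) // BYTES_PER_SHARD)  # ceil
    return max(by_records, by_bytes)

# Illustrative: ~5,000 events/sec at ~2 KiB each is byte-bound, not record-bound.
print(shards_needed(5_000, 2_048))  # → 10
```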

1. How much memory each container should have reserved

When we run these applications on a server we tend to dedicate half the available memory of the server to the application. The semantics of this will be very different in a container setup so this will require some experimentation on your side… would love to see the numbers you come up with!

2. Which is better: one big container or lots of small ones?

Each container will have its own buffer - so many small containers will mean that you are increasing your overall buffer. This can have implications for the Kinesis put limitations as you will have more and more applications pushing to the stream. In production we tend to lean towards fewer large collectors rather than many small collectors.

3. When to scale up: %CPU limit, %memory limit, or network usage limit?

For collectors we scale on CPU. This has worked very well in production. We have experimented with other metrics, such as latency, with little positive impact.

For the downstream consumers we scale on throughput metrics within the target Kinesis streams to balance our consumers against the perceived load on the stream.
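If you happen to be running the collectors on ECS, CPU-based scaling like this can be expressed as an Application Auto Scaling target-tracking policy. A hypothetical sketch, not from this thread: the cluster/service names and the 60% target are placeholders, and the dict mirrors the parameters of the real `put_scaling_policy` API:

```python
# Hypothetical target-tracking policy encoding "scale collectors on CPU".
# Apply with boto3.client("application-autoscaling").put_scaling_policy(**policy)
# once the ECS service is registered as a scalable target.
policy = {
    "PolicyName": "collector-cpu-target",
    "ServiceNamespace": "ecs",
    "ResourceId": "service/my-cluster/collector",  # placeholder names
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # illustrative CPU target, tune for your workload
    },
}
```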

4. Increasing the number of containers increases the total buffer size. Do we lose real-time with that?

That all depends on how your buffers are set up. I would recommend controlling the real-time aspect through the time limit - this ensures that no matter how many buffers you have, you will always have data at the same latency. This assumes that downstream you have enough bandwidth of course!
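To make that concrete: the per-event buffering delay is capped by whichever buffer limit trips first, and the cap does not grow as you add collectors. A hypothetical sketch (parameter names and figures are illustrative, not Snowplow configuration keys):

```python
def max_buffer_delay_ms(time_limit_ms: int, arrival_rate_per_sec: float,
                        record_limit: int) -> float:
    """Worst-case ms an event waits in one collector's buffer before flush.

    The buffer flushes when either the time limit expires or the record
    limit fills, whichever happens first. Running more collectors never
    raises this bound, since each buffers independently.
    """
    if arrival_rate_per_sec <= 0:
        return float(time_limit_ms)
    time_to_fill_ms = record_limit / arrival_rate_per_sec * 1000
    return min(float(time_limit_ms), time_to_fill_ms)
```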

5. What happens if a container dies? Have we lost all its data?

If a container dies before it manages to sink its data to the stream then, yes, you will lose that data. To mitigate this, keep your buffer size fairly small to limit the impact. We have thought about using some form of disk cache for the collector but nothing has been scheduled yet. Any thoughts on this can be added here:
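To put a rough bound on that exposure: a crashed collector can lose at most one un-flushed buffer's worth of events. A hypothetical sketch (all figures are illustrative, not from this thread):

```python
def worst_case_loss_events(events_per_sec: float, time_limit_s: float,
                           record_limit: int, byte_limit: int,
                           avg_event_bytes: int) -> int:
    """Upper bound on events lost if one collector dies mid-buffer.

    The buffer flushes at whichever limit trips first, so the loss is
    bounded by the smallest of the three limits expressed in events.
    """
    by_time = events_per_sec * time_limit_s
    by_records = record_limit
    by_bytes = byte_limit / avg_event_bytes
    return int(min(by_time, by_records, by_bytes))

# Smaller buffers shrink the blast radius: e.g. at 100 events/sec with a
# 5 s time limit, at most ~500 events are in flight per collector.
```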

6. The relationship between containers and instances

That depends on your instance and the memory / CPU constraints you set for your container. Could you expand a bit more on what answer you are looking for here?

Collector setup sidenote

Collectors should generally be set up as their own cluster, away from other consumers. We lean towards this because it is the main area where event loss can occur - once the event is in Kinesis you have up to a week to work with it, so issues downstream are much more recoverable.

Running out of container space for another collector as load increases is operationally much more dangerous than for any other application. As collectors are also scaled very differently, it makes sense to separate them out.


Hope this helps,
Josh