#aws
2023-02-01
CarnunMP14:02:44

Hi. I'm in the process of deploying a Clojure webapp to Elastic Beanstalk as per https://clojureverse.org/t/running-a-clojure-application-on-aws-elastic-beanstalk-with-clojure-cli-tools/4161 (cc: @orestis πŸ™) And it's working, it's up and running... but CPU utilization Idle% hovers near 100% and page load times are (consequently) super sluggish. Granted, it's a single t2.small instance (up from the t2.micro that I started with). But still! Any general thoughts as to why, before I dive into figuring out monitoring and profiling and such on AWS (all ~new to me) and hopefully get to the root of it? E.g. is it expected that running Clojure apps this way is more CPU-intensive than running an Uberjar? :thinking_face:

orestis14:02:21

Shouldn't you see idle% close to 100%?

orestis14:02:30

That means it's an idle instance πŸ™‚

orestis14:02:46

For the first few ~seconds, you might still be compiling (but you wouldn't see a response just yet).

orestis14:02:31

Have you checked whether it's the database that's slow? If you have a page that just does CPU-only stuff (just rendering some html without touching network/filesystem) you should be able to see if it's the CPU that's slow or something else.

orestis14:02:11

Our staging environment runs on a t3.medium and it's very acceptable performance wise.

CarnunMP14:02:52

> Shouldn't you see idle% close to 100%?
D'oh if so! πŸ˜„ But it's not just the (maybe misread) CPU usage that's throwing me off. Health stuff keeps dipping into 'Severe' more generally:

CarnunMP15:02:04

Also not a first few secs (or first request) vs subsequent requests problem. It's every page load that's slow. (Perhaps because I'm generating/parsing each page anew from markdown stored in the DB, and this would be the time to look into doing some caching...? πŸ˜… )
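A minimal sketch of that kind of caching in Clojure: keep rendered HTML in an atom keyed by the source markdown, so each page is parsed at most once per version of its content. (The render function below is a stand-in, not the app's real markdown pipeline; all names are illustrative.)

```clojure
;; Cache of markdown -> rendered HTML. Keyed by the markdown string itself,
;; so changed content naturally misses the cache and gets re-rendered.
(def ^:private page-cache (atom {}))

(defn render-page*
  "Stand-in for the real markdown->HTML pipeline."
  [markdown]
  (str "<article>" markdown "</article>"))

(defn render-page
  "Render `markdown`, reusing a previously rendered result when possible."
  [markdown]
  (or (get @page-cache markdown)
      (let [html (render-page* markdown)]
        (swap! page-cache assoc markdown html)
        html)))
```

For a small site this avoids re-parsing on every request; an unbounded atom is fine at this scale, though a real deployment might want a bounded cache (e.g. core.cache) instead.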

CarnunMP15:02:10

> Have you checked whether it's the database that's slow? If you have a page that just does CPU-only stuff (just rendering some html without touching network/filesystem) you should be able to see if it's the CPU that's slow or something else.
Hmm, I'll give this a shot. πŸ™‚

orestis15:02:19

Can you check perhaps if your application is restarting all the time? Do you see anything suspicious in the logs?

orestis15:02:43

Looks like you had only 2 requests, so probably it's not the requests that are causing this.

CarnunMP15:02:05

One suspicious thing is that it looks like there are constant GETs of /:

Feb  1 14:32:24 ip-172-31-80-204 web: system started
Feb  1 14:32:28 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:28 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:30 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:43 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:43 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:45 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:58 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:58 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:00 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:13 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:13 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:15 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /words/1C67kb9It/home", :line 80}
Feb  1 14:33:30 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:36 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /words/FC9iGu9V5/foo", :line 80}
Feb  1 14:33:43 ip-172-31-80-204 web: [qtp1843774765-20] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:43 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:45 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:58 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}

orestis15:02:26

Sounds like the health check

orestis15:02:14

We've set the health check endpoint to something like _health so that it's obvious in the logs. We just reach out to the database (to check connectivity, a simple "SELECT 1" does the trick) and return a 200.

πŸ‘ 2
CarnunMP15:02:03

Hmm, makes sense! Maybe the health check doesn't like that / returns a 302? :thinking_face:

/var/log/nginx/access.log
----------------------------------------
172.31.67.190 - - [01/Feb/2023:14:57:59 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:01 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:14 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:14 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:16 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:29 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:29 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:31 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:44 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:44 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:46 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:59 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:59 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:59:01 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:59:14 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:59:15 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"

...

----------------------------------------
/var/log/nginx/error.log
----------------------------------------
2023/02/01 14:24:00 [error] 20449#20449: *259 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:13 [error] 20449#20449: *261 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:13 [error] 20449#20449: *263 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:15 [error] 20449#20449: *265 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:28 [error] 20449#20449: *267 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:28 [error] 20449#20449: *269 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:30 [error] 20449#20449: *271 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:43 [error] 20449#20449: *273 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:43 [error] 20449#20449: *275 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:45 [error] 20449#20449: *277 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:58 [error] 20449#20449: *279 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:58 [error] 20449#20449: *281 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:25:00 [error] 20449#20449: *283 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:25:13 [error] 20449#20449: *285 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:25:13 [error] 20449#20449: *287 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"

CarnunMP15:02:42

> We've set the health check endpoint to something like _health so that it's obvious in the logs. We just reach out to the database (to check connectivity, a simple "SELECT 1" does the trick) and return a 200.
Neat!

orestis15:02:51

Yep, I think it’s best to always have an explicit health check path that can do exactly what you need it to, without authentication and other concerns.

CarnunMP15:02:37

It's possible that's the only reason I'm getting health warnings.
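If the redirect on / is indeed the cause, pointing the load balancer at a dedicated path should clear the warnings. On Elastic Beanstalk this can be set via an .ebextensions option (a sketch for load-balanced environments, assuming the default process and a /_health route):

```yaml
# .ebextensions/healthcheck.config
# Point the ELB health check at the dedicated endpoint instead of "/".
option_settings:
  aws:elasticbeanstalk:environment:process:default:
    HealthCheckPath: /_health
```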

CarnunMP15:02:19

But still there's the question of slow loads, hmm... Out of interest @orestis, how would you go about monitoring/profiling such a project? (E.g. is it worth setting up X-Ray?)

orestis15:02:39

You get a lot of nginx errors too

orestis15:02:24

We use CloudWatch. X-Ray is for tracing individual requests, which is useful but operates at another level.

πŸ‘ 2
orestis15:02:53

In the beginning, the easiest way is to ssh in and tail the log files.

πŸ™ 2
CarnunMP15:02:56

> You get a lot of nginx errors too
Hmm... still need to confirm the 302 hunch of course, but I assumed this was the same problem?

orestis15:02:31

It's curious that nginx reports "connection refused" too. You should see this during application startup but not any more.

orestis15:02:13

No, the 302 means the request already made it past nginx and into your Clojure code; the connection-refused errors mean nginx couldn't reach your app at all. The timestamps are different, though (14:24-14:25 for the errors, 14:58-14:59 for the 302s)

πŸ‘ 2
CarnunMP15:02:26

> It's curious that nginx reports "connection refused" too. You should see this during application startup but not any more.
Hmm, ^that might be exactly it actually... looking more carefully at the timestamps. :))

CarnunMP15:02:50

And ha, yeah. Beat me to noticing the timestamps. πŸ˜›

CarnunMP15:02:51

Brill, okay. I have some stuff to dig into. Thanks a bunch @orestis!

CarnunMP19:02:00

Been away from this for a little while. But the dedicated health check endpoint is working a treat! πŸ™‚

msolli09:02:35

A nice way to tail logs on AWS EB is with the aws command-line tool:

aws logs tail /aws/elasticbeanstalk/<env-name>/var/log/web.stdout.log --follow

πŸ™ 2
msolli09:02:53

For troubleshooting performance issues I use https://github.com/steffan-westcott/clj-otel along with service providers such as Honeycomb and New Relic. I’m sure you can use X-Ray too.

πŸ™ 2
CarnunMP14:02:33

Fuck yeah. ~10x faster page loads (e.g. see first and last foo?single=true req in screenshot) after ruling out the DB connection as the bottleneck, profiling with https://github.com/clojure-goes-fast/clj-async-profiler to identify page parsing/generation as the likely culprit, and implementing caching. https://clojurians.slack.com/archives/C09N0H1RB/p1675263604685469?thread_ts=1675263104.785889&cid=C09N0H1RB
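For the record, clj-async-profiler usage looks roughly like this at the REPL (API per its README; the profiled code below is a placeholder, and results depend on the JVM setup):

```clojure
;; REPL sketch of clj-async-profiler (requires the
;; com.clojure-goes-fast/clj-async-profiler dependency and the JVM flag
;; -Djdk.attach.allowAttachSelf=true; `render-page`/`sample-markdown`
;; are placeholders for the code path under suspicion).
(require '[clj-async-profiler.core :as prof])

;; Profile a suspect code path; writes a flamegraph to the results directory.
(prof/profile
  (dotimes [_ 100]
    (render-page sample-markdown)))

;; Browse the generated flamegraphs at http://localhost:8080
(prof/serve-ui 8080)
```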