aws

carnundotcom 2023-02-01T14:51:44.785889Z

Hi. I'm in the process of deploying a Clojure webapp to Elastic Beanstalk as per https://clojureverse.org/t/running-a-clojure-application-on-aws-elastic-beanstalk-with-clojure-cli-tools/4161 (cc: @orestis πŸ™) And it's working, it's up and running... but CPU utilization Idle% hovers near 100% and page load times are (consequently) super sluggish. Granted, it's a single t2.small instance (up from the t2.micro that I started with). But still! Any general thoughts as to why, before I dive into figuring out monitoring and profiling and such on AWS (all ~new to me) and hopefully get to the root of it? E.g. is it expected that running Clojure apps this way is more CPU-intensive than running an Uberjar? πŸ€”

msolli 2023-02-02T09:41:35.509749Z

A nice way to tail logs on AWS EB is using the aws command-line tool: aws logs tail /aws/elasticbeanstalk/<env-name>/var/log/web.stdout.log --follow

πŸ™ 1
msolli 2023-02-02T09:44:53.344479Z

For troubleshooting performance issues I use https://github.com/steffan-westcott/clj-otel along with service providers such as Honeycomb and New Relic. I’m sure you can use X-Ray too.

πŸ™ 1
carnundotcom 2023-02-02T14:35:33.380329Z

Fuck yeah. ~10x faster page loads (e.g. see first and last foo?single=true req in screenshot) after ruling out the DB connection as slow, some profiling with https://github.com/clojure-goes-fast/clj-async-profiler identifying page parsing/generation as the likely culprit, and implementing caching. https://clojurians.slack.com/archives/C09N0H1RB/p1675263604685469?thread_ts=1675263104.785889&amp;cid=C09N0H1RB

orestis 2023-02-01T14:53:21.216409Z

Shouldn't you see idle% close to 100%?

orestis 2023-02-01T14:53:30.177539Z

That means it's an idle instance πŸ™‚

orestis 2023-02-01T14:54:46.612089Z

For the first few ~seconds, you might still be compiling (but you wouldn't see a response just yet).

orestis 2023-02-01T14:55:31.693939Z

Have you checked whether it's the database that's slow? If you have a page that just does CPU-only stuff (just rendering some html without touching network/filesystem) you should be able to see if it's the CPU that's slow or something else.

orestis 2023-02-01T14:56:11.248349Z

Our staging environment runs on a t3.medium and it's very acceptable performance wise.

carnundotcom 2023-02-01T14:57:52.439109Z

> Shouldn't you see idle% close to 100%? D'oh if so! πŸ˜„ But it's not just the (maybe misread) CPU usage that's throwing me off. Health stuff keeps dipping into 'Severe' more generally:

carnundotcom 2023-02-01T15:00:04.685469Z

Also not a first few secs (or first request) vs subsequent requests problem. It's every page load that's slow. (Perhaps because I'm generating/parsing each page anew from markdown stored in the DB, and this would be the time to look into doing some caching...? πŸ˜… )

carnundotcom 2023-02-01T15:01:10.105519Z

> Have you checked whether it's the database that's slow? If you have a page that just does CPU-only stuff (just rendering some html without touching network/filesystem) you should be able to see if it's the CPU that's slow or something else. Hmm, I'll give this a shot. πŸ™‚

orestis 2023-02-01T15:02:19.459409Z

Can you check perhaps if your application is restarting all the time? Do you see anything suspicious in the logs?

orestis 2023-02-01T15:03:43.761889Z

Looks like you had only 2 requests, so probably it's not the requests that are causing this.

carnundotcom 2023-02-01T15:05:05.289129Z

One suspicious thing is that it looks like there are constant GETs of /:

Feb  1 14:32:24 ip-172-31-80-204 web: system started
Feb  1 14:32:28 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:28 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:30 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:43 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:43 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:45 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:58 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:32:58 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:00 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:13 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:13 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:15 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:28 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /words/1C67kb9It/home", :line 80}
Feb  1 14:33:30 ip-172-31-80-204 web: [qtp1843774765-17] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:36 ip-172-31-80-204 web: [qtp1843774765-21] INFO io.pedestal.http - {:msg "GET /words/FC9iGu9V5/foo", :line 80}
Feb  1 14:33:43 ip-172-31-80-204 web: [qtp1843774765-20] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:43 ip-172-31-80-204 web: [qtp1843774765-23] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:45 ip-172-31-80-204 web: [qtp1843774765-22] INFO io.pedestal.http - {:msg "GET /", :line 80}
Feb  1 14:33:58 ip-172-31-80-204 web: [qtp1843774765-24] INFO io.pedestal.http - {:msg "GET /", :line 80}

orestis 2023-02-01T15:05:26.347529Z

Sounds like the health check

orestis 2023-02-01T15:06:14.258469Z

We've set the health check endpoint to something like _health so that it's obvious in the logs. We just reach out to the database (to check connectivity, a simple "SELECT 1" does the trick) and return a 200.

πŸ‘ 1
carnundotcom 2023-02-01T15:08:03.781609Z

Hmm, makes sense! Maybe the health check doesn't like that / returns a 302? πŸ€”

/var/log/nginx/access.log
----------------------------------------
172.31.67.190 - - [01/Feb/2023:14:57:59 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:01 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:14 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:14 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:16 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:29 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:29 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:31 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:44 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:44 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:58:46 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:58:59 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:58:59 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.94.1 - - [01/Feb/2023:14:59:01 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.67.190 - - [01/Feb/2023:14:59:14 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"
172.31.34.70 - - [01/Feb/2023:14:59:15 +0000] "GET / HTTP/1.1" 302 5 "-" "ELB-HealthChecker/2.0" "-"

...

----------------------------------------
/var/log/nginx/error.log
----------------------------------------
2023/02/01 14:24:00 [error] 20449#20449: *259 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:13 [error] 20449#20449: *261 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:13 [error] 20449#20449: *263 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:15 [error] 20449#20449: *265 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:28 [error] 20449#20449: *267 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:28 [error] 20449#20449: *269 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:30 [error] 20449#20449: *271 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:43 [error] 20449#20449: *273 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:43 [error] 20449#20449: *275 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:45 [error] 20449#20449: *277 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:58 [error] 20449#20449: *279 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:24:58 [error] 20449#20449: *281 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:25:00 [error] 20449#20449: *283 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.94.1, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:25:13 [error] 20449#20449: *285 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.34.70, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"
2023/02/01 14:25:13 [error] 20449#20449: *287 connect() failed (111: Connection refused) while connecting to upstream, client: 172.31.67.190, server: , request: "GET / HTTP/1.1", upstream: "", host: "172.31.80.204"

carnundotcom 2023-02-01T15:08:42.811889Z

> We've set the health check endpoint to something like _health so that it's obvious in the logs. We just reach out to the database (to check connectivity, a simple "SELECT 1" does the trick) and return a 200. Neat!

orestis 2023-02-01T15:14:51.455139Z

Yep, I think it’s best to always have an explicit health check path that can do exactly what you need it to, without authentication and other concerns.

carnundotcom 2023-02-01T15:15:37.148669Z

It's possible that's the only reason I'm getting health warnings.

carnundotcom 2023-02-01T15:17:19.257699Z

But still there's the question of slow loads, hmm... Out of interest @orestis, how would you go about monitoring/profiling such a project? (E.g. is it worth setting up X-Ray?)

orestis 2023-02-01T15:17:39.212519Z

You get a lot of nginx errors too

orestis 2023-02-01T15:18:24.765669Z

We use cloud watch. X-ray is for tracing individual requests which is useful but at another level.

πŸ‘ 1
orestis 2023-02-01T15:20:53.369129Z

In the beginning, the easiest way is to ssh in and tail the log files.

πŸ™ 1
carnundotcom 2023-02-01T15:20:56.723659Z

> You get a lot of nginx errors too Hmm... still need to confirm the 302 hunch of course, but I assumed this was the same problem?

orestis 2023-02-01T15:21:31.387029Z

It's curious that nginx reports "connection refused" too. You should see this during application startup but not any more.

orestis 2023-02-01T15:23:13.199559Z

No, the 302 you see in the pedestal logs which means it's already past the nginx and into your Clojure code. The 500 means nginx can't connect to your app. It seems that the timestamps are different though (14:25 for the errors, 14:59 for the 302)

πŸ‘ 1
carnundotcom 2023-02-01T15:24:26.910859Z

> It's curious that nginx reports "connection refused" too. You should see this during application startup but not any more. Hmm, ^that might be exactly it actually... looking more carefully at the timestamps. :))

carnundotcom 2023-02-01T15:24:36.727979Z

Gotcha!

carnundotcom 2023-02-01T15:24:50.932339Z

And ha, yeah. Beat me to noticing the timestamps. πŸ˜›

carnundotcom 2023-02-01T15:25:51.892429Z

Brill, okay. I have some stuff to dig into. Thanks a bunch @orestis!

orestis 2023-02-01T15:26:18.455259Z

πŸ™Œ

carnundotcom 2023-02-01T19:25:00.125209Z

Been away from this for a little while. But the dedicated health check endpoint is working a treat! πŸ™‚