[{"content":"Yesterday I was told that I had to investigate an issue on production that affected our end user using our web front end and was throwing a ton of exceptions. I didn\u0026rsquo;t know how to get that data until a colleague pointed me to a table which recorded our exceptions. After he showed me the table I could see the most common exceptions by fetching and grouping by the StackTrace. This made me wonder if there was a way to make this information available to the developers in a more transparent way and I found out about Gitlab\u0026rsquo;s Error Tracking page.\nGitlab Error Tracking page Basically it uses Sentry as the backend and requires you to setup an instance of Sentry or use Sentry.io SAAS solution.\nI\u0026rsquo;m currently trying to build a Sentry infrastructure on my Kubernetes cluster currently but I\u0026rsquo;m having issues deploying it because something related to the DB migration isn\u0026rsquo;t working 🤷🏻‍♂️\nedgar@UbuntuDesktop:~/k8s_cluster/sentri$ helm install sentry sentry/sentry -n sentry --create-namespace -f values.yaml coalesce.go:175: warning: skipped value for kafka.config: Not a table. coalesce.go:175: warning: skipped value for kafka.zookeeper.topologySpreadConstraints: Not a table. W1206 09:21:02.633883 6836 warnings.go:70] spec.template.spec.containers[0].env[39]: hides previous definition of \u0026#34;KAFKA_ENABLE_KRAFT\u0026#34; W1206 09:21:02.998697 6836 warnings.go:70] annotation \u0026#34;kubernetes.io/ingress.class\u0026#34; is deprecated, please use \u0026#39;spec.ingressClassName\u0026#39; instead Error: INSTALLATION FAILED: failed post-install: 1 error occurred: * timed out waiting for the condition Once I finish the deployment, I\u0026rsquo;m planning to start sending some events to it and checking how it works so we can use it in our infrastructure. 
This adds a lot of value to our error reporting and makes the devs more aware of the exceptions happening in production.\n","date":"2023-12-06T18:18:50Z","image":"https://blog.endoedgar.net/posts/error-tracking-on-gitlab/errortrackinglist_huca269f91587e3e39879b4b981be7d434_1264493_120x120_fill_box_smart1_3.png","permalink":"https://blog.endoedgar.net/posts/error-tracking-on-gitlab/","title":"Error Tracking on Gitlab"},{"content":"One thing that made me scared of Elastic was the way we have to create its queries. It is a completely new way of fetching, sorting and aggregating data, using what it calls the ElasticSearch Query DSL. I was getting used to it until I ran into issues and couldn\u0026rsquo;t easily build AND\u0026rsquo;s and OR\u0026rsquo;s in this new language. After some unrelated Google searching I found what I should\u0026rsquo;ve found way, way earlier:\nHow to build a OR condition in Elasticsearch Query DSL\nThis website shows exactly what you need to do to create your own DSL queries, but the magical thing that got my attention was the built-in ElasticSearch SQL translator, where you can send it a SQL query such as this:\nPOST _sql/translate\n{\n \u0026#34;query\u0026#34;: \u0026#34;SELECT first_name FROM users WHERE (MATCH(first_name, \u0026#39;frank\u0026#39;) OR MATCH(first_name, \u0026#39;damien\u0026#39;)) AND age \u0026lt; 100\u0026#34;,\n \u0026#34;fetch_size\u0026#34;: 10\n}\nAnd it returns DSL like this:\n{\n \u0026#34;size\u0026#34;: 10,\n \u0026#34;query\u0026#34;: {\n \u0026#34;bool\u0026#34;: {\n \u0026#34;must\u0026#34;: [\n { \u0026#34;bool\u0026#34;: {\n \u0026#34;should\u0026#34;: [\n { \u0026#34;match\u0026#34;: { \u0026#34;first_name\u0026#34;: { \u0026#34;query\u0026#34;: \u0026#34;frank\u0026#34; } } },\n { \u0026#34;match\u0026#34;: { \u0026#34;first_name\u0026#34;: { \u0026#34;query\u0026#34;: \u0026#34;damien\u0026#34; } } }\n ],\n \u0026#34;boost\u0026#34;: 1\n } },\n { \u0026#34;range\u0026#34;: { \u0026#34;age\u0026#34;: { 
\u0026#34;lt\u0026#34;: 100,\n \u0026#34;boost\u0026#34;: 1\n } } }\n ],\n \u0026#34;boost\u0026#34;: 1\n }\n },\n \u0026#34;_source\u0026#34;: false,\n \u0026#34;fields\u0026#34;: [ { \u0026#34;field\u0026#34;: \u0026#34;first_name\u0026#34; } ],\n \u0026#34;sort\u0026#34;: [ { \u0026#34;_doc\u0026#34;: { \u0026#34;order\u0026#34;: \u0026#34;asc\u0026#34; } } ],\n \u0026#34;track_total_hits\u0026#34;: -1\n}\nSince I\u0026rsquo;m far more familiar with SQL, I can build more powerful DSL queries this way and still enjoy ElasticSearch\u0026rsquo;s quick searching. This was magical to me.\n","date":"2023-12-04T23:17:54Z","permalink":"https://blog.endoedgar.net/posts/elasticsearch-query-sos/","title":"ElasticSearch Query SOS"},{"content":"At the beginning of this year I was introduced to ElasticSearch, an impressive piece of software which returns results very quickly, especially if you need to fetch humongous amounts of data. I was skeptical at first, but when we started implementing our first queries there, everything went so smoothly and fast that we thought it would be great to have it everywhere instead of SQL. Everything looked great until we got our first issues.\nOne of those issues was related to version conflicts. ElasticSearch really doesn\u0026rsquo;t like it when you update the same document in your index concurrently. It gives back errors like this:\nversion conflict, required seqNo [113789], primary term [19]. current document has seqNo [113797] and primary term [19]\nWhen I got those messages I wondered how we could work around the issue. One good Google search later I found a \u0026ldquo;solution\u0026rdquo;: our update requests would take a new argument, conflicts=proceed. Without much thinking, adding it seemed to fix our initial conflict issue, until I found out what exactly it was doing.\nconflicts=proceed basically skips updating any document where a conflict is found. 
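For context, the flag is a query-string (or body) parameter on ElasticSearch's update-by-query endpoint. A minimal sketch of how such a request is assembled; the cluster address, index name, script and query below are made-up placeholders, and the request is only built, not sent:

```python
import json
from urllib.parse import urlencode

def update_by_query_request(base_url: str, index: str, script: dict,
                            query: dict, proceed_on_conflict: bool = False):
    """Build the URL and body for an ElasticSearch _update_by_query call.

    With conflicts=proceed, documents whose version changed between the
    search and the update are silently skipped instead of failing the request.
    """
    url = f"{base_url}/{index}/_update_by_query"
    if proceed_on_conflict:
        url += "?" + urlencode({"conflicts": "proceed"})
    body = json.dumps({"script": script, "query": query})
    return url, body

# Hypothetical example: tag every document matching a user id.
url, body = update_by_query_request(
    "http://localhost:9200",                      # placeholder cluster address
    "users",                                      # placeholder index
    {"source": "ctx._source.tagged = true"},      # placeholder update script
    {"term": {"user_id": 42}},                    # placeholder query
    proceed_on_conflict=True,
)
```

The convenience is obvious, which is exactly why it is easy to miss that "proceed" means "skip", not "retry".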
This turned out to be a terrible solution and had to be reverted, because it meant we were silently dropping updates coming from the Pub/Sub system. Reverting it made the conflict issue reappear, but since the Pub/Sub system retried those requests, we were OK for the most part.\nThrottling Issues\nYay! CPU usage issues to fix incoming. After half a year of using this approach, I found out that the CPU usage on our ElasticSearch clusters was skyrocketing. At first I thought we had added new listeners to the cluster, and I was right, but there was no reason for the CPU to jump from the typical 10% to more than 100% in such a short period of time. Calling _tasks?actions=*\u0026amp;detailed showed me that there were a ton of requests to refresh the indexes. By a ton I really mean a ton. An index refresh is a very expensive ElasticSearch operation and should be done periodically, say every second, not on every request (we were receiving hundreds of requests per second at the time). Removing those refresh calls seemed like a no-brainer, and it fixed our issue at the time.\nAnother thing I did here was fix an issue in a listener we had set up, because it was crashing whenever it hit an exception. This would be fine if Kubernetes didn\u0026rsquo;t apply a restart wait time that increases every time that happens. I couldn\u0026rsquo;t find a way to change this behavior at the time, so I just added a try/catch to every single listener method, and it worked fine: no more crashes.\nThrottling Issues II\nJust one day after fixing that index refresh issue, I was yet again receiving CPU throttling alerts on my cell phone related to the cluster. It was the same symptom but a different cause this time. It turns out we had a listener set up that wasn\u0026rsquo;t finding some records in an index. I had to refactor that code to use another column to identify the record. 
The root cause for this is yet to be found, but using the other column fixed the issue.\nCPU Usage fixed again\nAs you can see there, the fix reduced the CPU usage, but it was still way too high afterwards. I thought it was fine until\u0026hellip;\nThrottling Issues III\nYou guessed it: one day after my column fix, I had yet another CPU throttling issue happening in the cluster. This was getting old already lol.\nIdentifying the cause\nI found out that the listener was being bombarded with the same failed requests from Pub/Sub; by bombarded I mean 400 requests per second. That isn\u0026rsquo;t fine.\nFixing that is as simple as changing the Retry Policy on Pub/Sub from Retry Immediately to Retry after exponential backoff delay. But that still didn\u0026rsquo;t fix the root cause of the problem.\nThe actual fix was to stop making two update-by-query requests on the same index back to back. Remember that initial conflict issue where the same document causes conflicts? Yeah, we were testing ElasticSearch\u0026rsquo;s patience by doing exactly that on the same index, and it didn\u0026rsquo;t like it.\nAfter those two fixes, the cluster seems much happier now.\nCPU Usage fixed again II\nAftermath\nI\u0026rsquo;m still skeptical about whether this fix will be enough. It looks like we are receiving a ton of requests from Pub/Sub that look duplicated and shouldn\u0026rsquo;t even be updating our indexes. That is something to fix during the week, but I am certain it can be fixed easily by adding old and new properties to the message Pub/Sub sends us and checking whether the document in the index actually needs to be updated. I\u0026rsquo;ll see.\n","date":"2023-12-03T22:53:10Z","image":"https://blog.endoedgar.net/posts/elasticsearch-challenges/thisisfine_hu8f2f945706077eb2735e9275541627f2_33522_120x120_fill_q75_box_smart1.jpg","permalink":"https://blog.endoedgar.net/posts/elasticsearch-challenges/","title":"ElasticSearch Challenges"}]