DZone
DZone Spotlight

Wednesday, October 15
*You* Can Shape Trend Reports: Join DZone's Database Systems Research

By DZone Editorial
Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you wish) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Database Systems Research

With databases powering nearly every modern application nowadays, how are developers and organizations utilizing, managing, and evolving these systems — across usage, architecture, operations, security, and emerging trends like AI and real-time analytics? Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. Oh, and did we mention that anyone who takes the survey could be one of the lucky four to win an e-gift card of their choosing?

We're diving into key topics such as:

  • The databases and query languages developers rely on
  • Experiences and challenges with cloud migration
  • Practices and tools for data security and observability
  • Data processing architectures and the role of real-time analytics
  • Emerging approaches like vector and AI-assisted databases

Join the Database Systems Research

Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our upcoming Trend Report. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team
Fixing a Test Involves Much More Than Simply Making It Pass

By Stelios Manioudakis, PhD
Software tests are essential mechanisms in the development lifecycle designed to verify that code behaves as expected. They catch bugs, ensure functionality aligns with requirements, and maintain overall system reliability through automated assertions on inputs, outputs, and states. However, fixing a failing test extends far beyond merely adjusting it to pass, as this superficial approach can mask deeper issues in the underlying code or architecture. True resolution demands a thorough investigation into the root cause, which could be a logic error, an environmental mismatch, a dependency failure, or an outdated assumption in the test itself, followed by targeted corrections that preserve the test's intent as a safeguard against regressions. Blindly modifying assertions to match unexpected outcomes, such as changing an expected value from false to true without validating why the discrepancy occurred, risks introducing false positives and eroding trust in the test suite. In doing so, we potentially allow flawed software to reach production, where it could lead to security vulnerabilities, data inconsistencies, or user-facing failures. After all, a proper fix not only restores the green status but also enhances code quality, observability, documentation, and team knowledge. Test failures should be treated as opportunities for genuine improvement rather than quick band-aids.

As developers and testers, we rely on tools like Postman to streamline API testing and ensure our applications behave as expected. Postman’s intuitive interface, powerful scripting capabilities, and recent AI integrations make it a go-to for automating tests in CI/CD pipelines or during exploratory testing. However, one seemingly helpful feature — the "Fix test" option powered by Postbot (Postman’s AI assistant) — can introduce subtle risks if used without caution.

In this article, we'll explore how the "Fix test" option works, why it might tempt you to use it, and most importantly, the potential dangers of blindly accepting its suggestions. We'll use an example to illustrate how "fixing" a test to make it pass can mask underlying issues, leading to false confidence in your API's reliability. By the end, you'll have best practices to avoid these pitfalls and maintain robust, meaningful tests.

Postman Fundamentals

Postman allows you to write tests in JavaScript that run after sending a request, verifying aspects like status codes, response bodies, headers, and more. These tests use the pm.test and pm.expect syntax to assert expected outcomes. For instance, you might check if a response JSON field matches a specific value.

Postbot is Postman's AI-powered assistant. I entered the following prompt to Postbot:

How can you help me? What can you do?

Here is the response that I’ve got:

You can call me Postbot :) I'm here to help you with your API workflows in Postman. Here are some of the things I can do:
1. Add or update test scripts for your API requests.
2. Add or update visualization scripts to visualize your API responses.
3. Edit existing test scripts to make changes or fix errors.
4. Add or update descriptions for your requests to enhance documentation.
5. Construct new API requests based on your existing requests.
6. Debug and fix your API requests to ensure they work correctly.

If a test fails, Postbot can analyze the failure and offer suggestions, including a "Fix test" option. This feature scans the actual response and proposes updates to your test script. It usually involves adjusting assertions to align with the received data. I find Postbot very helpful indeed, as it can save me a great deal of time in writing tests and debugging. But I’ve also learned to always be critical of its results. While convenient, this AI bot may prioritize making the test pass over diagnosing the root cause. Postbot might rewrite an assertion from expecting false to true if that's what the response currently returns, without questioning whether the response itself is correct.

How a Boolean Flip Can Hide a Bug

Let’s say you're testing an API endpoint for user authentication. Your test collection includes a request to log in with valid credentials, followed by a test script that verifies a boolean field in the response:

JavaScript

pm.test("Verify authentication status", function () {
    var jsonData = pm.response.json();
    pm.expect(jsonData.isAuthenticated).to.be.false; // Expected: false for guest mode
});

You run the test, but it fails because the response shows isAuthenticated: true. Intrigued, you invoke Postbot (via the AI scripting panel or failure suggestions), and it offers the "Fix test" option. Using "Fix test" updates the script to:

JavaScript

pm.test("Verify authentication status", function () {
    var jsonData = pm.response.json();
    pm.expect(jsonData.isAuthenticated).to.be.true; // Now matches the actual response
});

Voilà — the test now passes on rerun. But did you fix the problem? In this scenario, the API might have a bug. Perhaps a configuration error is incorrectly authenticating guests as full users, exposing sensitive data. By "fixing" the test, you've adapted it to the faulty behavior instead of investigating the discrepancy. The test suite turns green, but the underlying issue persists. Are we sure that there is no security vulnerability that we've just masked by turning the test suite green?

This example highlights a core issue that goes beyond testing tools. It actually goes beyond testing types and testing levels: Testing is a critical thinking activity. Our tests exist for a reason: to verify behaviour, functionality, or anything else that we need to verify. If a test passes or fails, we must understand why. Is it a false positive or a false negative? Did the test catch a bug or not? We need to investigate. In the example, instead of investigating, we just let AI change the test so that it passes. All we have now is a test that passes for the wrong reason. The test suite is green, misleading us into the wrong conclusion that authentication works as expected.

The Dangers of Blindly "Fixing" Tests

Our authentication example also highlights another important fact: Postbot's fixes are reactive to the current response; they are not proactive in ensuring it aligns with business requirements or specifications. This is where we need to be critical and investigate as testers. While the "Fix test" feature can accelerate testing and development, misusing it can lead to several risks:

  • Masking bugs in the API or backend. Tests exist to catch deviations from expected behavior. If the response is wrong due to a code change, database issue, or external dependency, updating the test to match it sweeps the problem under the rug. This erodes the test's value as a safety net.
  • False positives in CI/CD pipelines. Green tests give teams confidence to deploy. If "fixed" tests ignore real issues, you risk shipping broken features. In a fast-paced environment, this can result in production outages, data corruption, or compliance violations.
  • Loss of test intent and documentation. Original tests often encode business rules or assumptions (e.g., "isAuthenticated should be false for unauthenticated requests"). Altering them without review dilutes this intent and makes the test suite harder to maintain and understand for future team members.
  • Encouraging lazy debugging practices. Relying on AI to "fix" failures discourages thorough root-cause analysis. What if the failure stems from flaky networks, incorrect request payloads, or environmental differences? Skipping the investigation misses opportunities to improve overall system reliability.
  • Accumulated technical debt. Repeated quick fixes can bloat your test scripts with outdated or incorrect assertions. When the API evolves legitimately, these band-aid tests will fail en masse, requiring more effort to untangle.

Best Practices: Safely Leveraging Postbot and "Fix Test"

The practices below emphasize a balanced approach where AI assists but does not replace human judgment. The goal is that test fixes contribute to long-term quality rather than short-term convenience. By integrating them into your workflow, you can mitigate risks while still benefiting from the efficiency of automation.

  • Always investigate failures first. Use Postbot to generate initial tests or suggestions, but validate them against specs. Before clicking "Fix test," manually review the response. After a fix, add comments explaining the change. Ask: Is this the expected output according to the API docs or requirements? Use Postman’s console or external tools to debug.
  • Version control your collections. Store Postman collections in Git. This way, changes from "Fix test" appear in diffs, allowing peer reviews before merging.
  • Incorporate environment variables for flexibility. Instead of hardcoding values, use variables like {{expectedAuthStatus}}. This makes tests modular without rewriting assertions.
  • Run tests in multiple environments. Test against staging, production-like setups, or mocks to confirm whether failures are environment-specific.
  • Educate your team. Share this awareness in code reviews or team meetings. Encourage a culture where test failures prompt discussion, not automatic fixes.

By treating "Fix test" as a starting point rather than a solution, you maintain control over your test integrity. This is of paramount importance. After all, why do we test if our test results are misleading and meaningless?

Wrapping Up

Software testing is a critical thinking activity. Do we understand how and why our systems work? Can we determine whether something is wrong and why? There is no testing without observability and no observability without testing. We need to wear our inspector's hat and investigate. After all, this is what we do, this is who we are: inspectors.

Postman’s AI features like Postbot are game-changers for productivity. However, they're just tools, not oracles. Making a test pass can be tempting, but true fixing requires understanding the "why" behind failures. In our example, flipping a Boolean assertion might greenlight a security flaw, underscoring the need for vigilance. Next time a test fails in Postman, pause before fixing. Dive into the details. Align with requirements. Be confident that your changes enhance reliability, not just silence alerts. By doing so, you'll build a more robust API testing practice that catches issues early and keeps your applications trustworthy.

Trend Report

Kubernetes in the Enterprise

Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.

As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.

DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.


Refcard #387

Getting Started With CI/CD Pipeline Security

By Sudip Sengupta

Refcard #216

Java Caching Essentials

By Granville Barnett

More Articles

Infusing AI into Your Java Applications

Artificial intelligence (AI) is becoming increasingly pervasive. As an Enterprise Java developer, you might be wondering what value AI can add to your business applications, what tools Java provides to easily do that, and what skills and knowledge you might need to learn. In this article, we equip you with the basic knowledge and skills that you need to start exploring the capabilities of AI to build intelligent and responsive Enterprise Java applications.

When we talk about AI in this article, we mean getting responses from a large language model (LLM) based on a request that the Java application sends to the LLM. In our article’s example, we create a simple chatbot that customers can ask for planetary tourist destination recommendations, and then use to book a spaceship to visit them. We demonstrate using Java frameworks like LangChain4j with Quarkus to efficiently interact with LLMs and create satisfying applications for end-users.

Hello (AI) World: Getting an LLM to Respond to a Prompt

The first version of our spaceship rental application will build a chatbot that interacts with customers using natural language. It should answer any customer questions about planets they wish to visit in the solar system. For the full application code, see the spaceship rental step-01 directory in the GitHub repository. The chatbot sends the customer's questions to the application, which interacts with the LLM to help process the natural language questions and to respond to the customer.

For the AI-related parts of the application, we create just two files:

  • An AI service, CustomerSupportAgent.java, which builds a prompt informing the LLM about our solar system’s planets and instructs the LLM to answer questions from customers.
  • A WebSocket endpoint, ChatWebSocket.java, which receives the user’s messages from the chatbot.

AI services are Java interfaces that provide a layer of abstraction. When using LangChain4j, these interfaces make LLM interaction easier. AI services are an integration point, so in a real application, you would need to consider security, observability, and fault tolerance of the connections and interactions with the LLM. As well as handling LLM connection details (stored separately in the application.properties configuration file), an AI service builds the prompts and manages chat memory for the requests it sends to the LLM.

The prompt is built from two pieces of information in the AI service: the system message and the user message. System messages are typically used by developers to give the LLM contextual information and instructions for handling the request, often including examples that you want the LLM to follow when generating its response. User messages provide the LLM with application user requests.

The CustomerSupportAgent interface is registered as the AI service in the application. It defines the messages used to build the prompt and sends the prompt to the LLM:

Java

@SessionScoped
@RegisterAiService
public interface CustomerSupportAgent {

    @SystemMessage("""
            You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers,
            a spaceship rental shop. You answer questions from potential guests about the
            different planets they can visit. If asked about the planets, only use info
            from the fact sheet below.
            """ + PlanetInfo.PLANET_FACT_SHEET)
    String chat(String userMessage);
}

Let’s look at what this code is doing.
The @SessionScoped annotation maintains the session for the duration of the web service connection and maintains the chat memory for the duration of the conversation. The @RegisterAIService annotation registers an interface as an AI service. LangChain4j automatically implements the interface. The @SystemMessage annotation tells the LLM how to behave when responding to the prompt. When the end user types a message in the chatbot, the WebSocket endpoint passes the message to the chat() method in the AI service. There is no @UserMessage annotation specified in our AI service interface, so the AI service implementation automatically creates a user message with the chat() method parameter value (in this case the userMessage parameter). The AI service adds the user’s message to the system message to build a prompt that it sends to the LLM, then displays the response from the LLM in the chatbot interface. Note that, for readability, the planet information has been placed in a separate PlanetInfo class. Alternatively, you could place the planet information directly in the system message. The ChatWebSocket class defines a WebSocket endpoint for the application’s chatbot UI to interact with: Java @WebSocket(path = "/chat/batch") public class ChatWebSocket { private final CustomerSupportAgent customerSupportAgent; public ChatWebSocket(CustomerSupportAgent customerSupportAgent) { this.customerSupportAgent = customerSupportAgent; } @OnOpen public String onOpen() { return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?"; } @OnTextMessage public String onTextMessage(String message) { return customerSupportAgent.chat(message); } } The CustomerSupportAgent interface uses constructor injection to automatically provide a reference to the AI service. When the end user types a message in the chatbot, the onTextMessage() method passes the message to the AI service chat() method. For example, if the user asks, "What's a good planet to visit if I want to see volcanoes?", the application responds with a recommendation and why the user might like to visit there, as a fan of volcanoes: The Spaceship Rental application chatbot Providing an Illusion of Memory As you continue your conversation with the chatbot, it might seem as though it is aware of previous messages exchanged, that is, the context of your conversation. When you talk to another person, you take for granted that they remember what you (and they) last said. Requests to an LLM are stateless, though, so each response is generated solely based on the information contained within the request prompt. To maintain context in a conversation, the AI service uses chat memory, through LangChain4j, to store prior user messages and the chatbot’s responses. By default, the Quarkus LangChain4j extension stores the chat in memory, and the AI service manages the chat memory (for example, by dropping or summarizing the oldest messages) as needed to remain within the memory limits. LangChain4j by itself would require you to first configure a memory provider, but that is not needed when using the Quarkus LangChain4j extension. This gives a practical illusion of memory to end users and improves the user experience so they can enter follow-on messages without needing to repeat everything they previously said. The user chatbot experience can also be improved by streaming the responses from the LLM. Streaming Responses for a More Responsive User Experience You might notice responses to your chat message window take time to generate and then appear all at once. 
To improve the chatbot’s perceived responsiveness, we can modify the code to return each token of the response as it is generated. This approach, called streaming, allows users to start reading a partial response before the entire response is available. For the full application code, see the GitHub spaceship rental step-02 directory. Changing our application to stream the chatbot response is easy. First, we’ll update the CustomerSupportAgent interface to add a method that returns an instance of the SmallRye Mutiny Multi<String> interface: Java @SessionScoped @RegisterAiService @SystemMessage(""" You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers, a spaceship rental shop. You answer questions from potential guests about the different planets they can visit. If asked about the planets, only use info from the fact sheet below. """ + PlanetInfo.PLANET_FACT_SHEET) public interface CustomerSupportAgent { String chat(String userMessage); Multi<String> streamChat(String userMessage); } Moving the @SystemMessage annotation to the interface means that the annotation doesn’t have to be added to each of the methods in the interface. The streamChat() method returns the LLM’s response to the chat window one token at a time (instead of waiting to display the full response all at once). We also need to call the new streamChat() method from a WebSocket endpoint. To preserve both batch and stream functionality, we create a new ChatWebSocketStream class that exposes the /chat/stream WebSocket endpoint: Java @WebSocket(path = "/chat/stream") public class ChatWebSocketStream { private final CustomerSupportAgent customerSupportAgent; public ChatWebSocketStream(CustomerSupportAgent customerSupportAgent) { this.customerSupportAgent = customerSupportAgent; } @OnOpen public String onOpen() { return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?"; } @OnTextMessage public Multi<String> onStreamingTextMessage(String message) { return customerSupportAgent.streamChat(message); } } The customerSupportAgent.streamChat() call invokes the AI service to send the user message to the LLM. After making some minor tweaks to the UI, we can now toggle streaming on and off in our chatbot: The application with the new streaming platform enabled With streaming enabled, each token (each word, or part-word) produced by the LLM is immediately returned to the chat interface. Generating Structured Outputs From Unstructured Data Up to this point, the LLM’s outputs have been intended for the application’s end user. But what if, instead, we want the LLM’s output to be used directly by our application? When the LLM responds to a request, the AI service that mediates the interaction with the LLM can return structured outputs, which are formats that are more structured than a String, such as POJOs, lists of POJOs, and native types. Returning structured outputs significantly simplifies the integration of an LLM’s output with your Java code because it enforces that the output received by the application from the AI service maps to your Java object’s predefined schema. Let’s demonstrate the usefulness of structured outputs by helping the end user select a spaceship from our fleet that meets their needs. For the full application code, see the GitHub spaceship rental step-03 directory. 
We begin by creating a simple Spaceship record to store information about each individual spaceship in the fleet: Java record Spaceship(String name, int maxPassengers, boolean hasCargoBay, List<String> allowedDestinations) { } Similarly, to represent the user’s query about the spaceships in our fleet, we create a SpaceshipQuery record, which is based on the information the user provided in the chat: Java @Description("A request for a compatible spaceship") public record SpaceshipQuery(int passengers, boolean hasCargo, List<String> destinations) { } The Fleet class populates several Spaceship objects and provides a way to filter out those that do not match the user. Next, we update the CustomerSupportAgent interface to take the user’s message (unstructured text) to create a structured output in the form of the SpaceshipQuery record. To accomplish this feat, we only need to set the return type for a new extractSpaceshipAttributes() method in our AI service to be a SpaceshipQuery: Java SpaceshipQuery extractSpaceshipAttributes(String userMessage); Under the covers, LangChain4j automatically generates a request to the LLM including a JSON schema representation of the desired response. LangChain4j deserializes the JSON-formatted response from the LLM and uses it to return a SpaceshipQuery record, as requested. We also need to know whether the user’s input is about one of our spaceships, or about some other topic. This filtering is accomplished using a simpler, structured output request that returns a boolean: Java @SystemMessage(""" You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers, a spaceship rental shop. Respond with 'true' if the user message is regarding spaceships in our rental fleet, and 'false' otherwise. """) boolean isSpaceshipQuery(String userMessage); Our last addition to the CustomerSupportAgent interface enables the agent to provide a spaceship suggestion based on our fleet and the user’s request, with and without streaming: Java @UserMessage(""" Given the user's query regarding available spaceships for a trip {message}, provide a well-formed, clear and concise response listing our applicable spaceships. Only use the spaceship fleet data from {compatibleSpaceships} for your response. """) String suggestSpaceships(String message, List<Spaceship> compatibleSpaceships); @UserMessage(""" Given the user's query regarding available spaceships for a trip {message}, provide a well-formed, clear and concise response listing our applicable spaceships. Only use the spaceship fleet data from {compatibleSpaceships} for your response. """) Multi<String> streamSuggestSpaceships(String message, List<Spaceship> compatibleSpaceships); } Our last step is to update the ChatWebSocket and ChatWebSocketStream classes to first check if the user’s query is about spaceships in our fleet. If so, the customer support agent creates a SpaceshipQuery record by extracting the information from the user’s message and then responds with suggested spaceships from the fleet that are compatible with the user’s request. 
The updated code is similar for both the ChatWebSocket and ChatWebSocketStream classes, so only the ChatWebSocket class is shown here: Java @OnTextMessage public String onTextMessage(String message) { boolean isSpaceshipQuery = customerSupportAgent.isSpaceshipQuery(message); if (isSpaceshipQuery) { SpaceshipQuery userQuery = customerSupportAgent.extractSpaceshipAttributes(message); List<Spaceship> spaceships = Fleet.findCompatibleSpaceships(userQuery); return customerSupportAgent.suggestSpaceships(message, spaceships); } else return customerSupportAgent.chat(message); } With these updates, the customer support agent is ready to use the structured outputs to provide the user with spaceship suggestions: The application providing the user with spaceship suggestions based on the structured output With that, we have completed an AI-infused Java chatbot application that provides planetary tourism recommendations and spaceship rentals. To continue learning, experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs. More on These AI Concepts We’ve discussed various AI concepts throughout this article. If you want to know more about any of them, here is a quick explainer. Large Language Models (LLMs) When we talk about AI in this article, we generally mean getting responses from a large language model. LLMs are machine learning models that are trained to generate a sequence of outputs based on a sequence of inputs (often text inputs and outputs, but some multi-modal LLMs can work with images, audio, or video). LLMs can perform a wide variety of tasks, such as summarizing a document, translating between languages, fact extraction, writing code, etc. This task of creating new content from the input is what’s referred to as Generative AI, or GenAI. You can infuse such capabilities into your application as needed. Making Requests to LLMs: Prompts, Chat Memory, and Tokens How you request information from an LLM influences not only the response you get back from the LLM but also the end user’s experience and the application's running costs. Prompts Sending a request to an LLM, whether from application code or as an end-user in a chat interface, involves writing a prompt. A prompt is the information (usually, but not always, text) to which the LLM responds. If you think of communicating with an LLM like communicating with another person, how you phrase your request is important to making sure the other person (or the LLM, in this case) understands what you want to know. For example, ensuring that you give the context of the request before going on to ask for a specific piece of information, and not providing lots of irrelevant information to confuse the listener. Chat Memory Unlike when you are talking to another person, LLMs are stateless and don’t remember the previous request, so everything you need the LLM to take into consideration needs to be in your request: the prompt, any previous requests and responses (the chat memory), and any tools you provide to help the LLM respond. However, providing too much information to the LLM in the prompt can potentially complicate the request. It can also be costly. Tokens LLMs convert the words in your prompt into a sequence of tokens. Most hosted LLMs charge usage based on the number of tokens in the request and response. A token can represent a whole word or a part of a word. For example, the word "unbelievable" is typically split into multiple tokens: "un", "bel", and "ievable". 
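Although the sample application is written in Java, a quick way to get a feel for tokenization is a short Python sketch using the tiktoken library (an OpenAI tokenizer; this is purely an illustration outside the sample app, and the exact split varies by model, so it is not necessarily the split your LLM will use):

Python

import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by several OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("unbelievable")
print(len(tokens))                        # how many tokens the word becomes
print([enc.decode([t]) for t in tokens])  # the individual token strings

Counting tokens like this is a rough way to estimate how large (and therefore how costly) a prompt plus its chat memory will be before sending it.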
The more tokens that you include in the request, especially when you include all the chat memory, the greater the potential cost of running the application. Providing all the chat memory in a request can make requests both costly and less clear. Requests to LLMs are limited in length, so it’s important to manage the chat memory and how much information is included in the request. This can be helped a lot by the Java frameworks that you use, such as LangChain4j with Quarkus, which we use for the sample application in this article. LangChain4j and Quarkus Frameworks LangChain4j is an open-source Java framework that manages interactions between Java applications and LLMs. For example, LangChain4j, through the concept of AI services, stores and helps you to manage chat memory, so that you can keep requests to the LLM efficient, focused, and less expensive. Quarkus is a modern, cloud-native, open-source Java framework optimized for developer productivity, running in containerized environments, and with fast startup and low memory usage. The LangChain4j extensions to Quarkus simplify the configuration of connecting to and interacting with LLMs in AI-infused Java applications. The LangChain4j project can be used with other Java application frameworks, including Open Liberty, Spring Boot, and Micronaut. MicroProfile and Jakarta EE are also working together with LangChain4j to provide an open standards-based programming model for developing AI applications. The Sample Application You can find the complete sample application that we demonstrate throughout this article on GitHub. The application is written in Java and runs on Quarkus using the Quarkus LangChain4j extensions. Conclusion Infusing AI into Java applications enhances the application’s capabilities and the end-user's experience. With the help of Java frameworks like Quarkus and LangChain4j to simplify interactions with LLMs, Java developers can easily infuse AI into business applications. Writing AI-infused applications in Java means you're working in Java’s robust, enterprise-ready ecosystem, which not only helps you to easily interact with AI models, but also makes it easy for the applications to benefit from enterprise essentials such as performance, security, observability, and testing. The field of AI is rapidly evolving. By mastering the concepts and technologies in this article, you can stay ahead of the curve and start exploring how AI can help you build intelligent and engaging Java applications. Experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs. If you’d like to learn more, try this tutorial on how to extend the knowledge of the LLM with content from PDF documents by using retrieval-augmented generation (RAG): Build an AI-powered document assistant with Quarkus and LangChain4j. Thanks to Red Hatters Clement Escoffier, Markus Eisele, and Georgios Andrianakis for valuable review comments. This article was originally published on InfoQ on Aug 15, 2025.

By Don Bourne
A Fresh Look at Optimizing Apache Spark Programs

I have spent countless hours debugging slow Spark jobs, and it almost always comes down to a handful of common pitfalls. Apache Spark is a powerful distributed processing engine, but getting top performance requires more than just running your code on a cluster. Even with Spark’s built-in Catalyst optimizer and Tungsten execution engine, a poorly written or configured Spark job can run slowly or inefficiently. In my years as a software engineer, I have learned that getting top performance from Spark requires moving beyond the defaults and treating performance tuning as a core part of the development process. In this article, I will share the practical lessons I use to optimize Spark programs for speed and resource efficiency.

Overview: The goal is to tackle performance from every angle. We will start at the top with cluster-level configurations like resource allocation and memory, then dive right into the code to cover best practices for writing efficient Spark APIs. From there, we will get into the often overlooked but critical layer of data storage and formatting for faster I/O. To wrap it all up, we will see how monitoring the Spark UI and logs is key to refining performance over time.

Prelude: Understanding Spark’s Architecture and Lazy Evaluation

Before we get into performance optimization, it helps to anchor on how Spark runs your program.

Diagram: Spark architecture

  • Driver: The driver program runs your main Spark application, builds a logical plan (a DAG of transformations), turns it into a physical plan, and schedules tasks across the executors. It tracks job progress and collects results.
  • Executors: Executors live on worker nodes. They run tasks in parallel, keep partitions of data in memory or on disk, and spill to disk if needed. Executors are where your transformations actually execute.
  • Cluster Manager (allocator): YARN, Kubernetes, or Spark’s standalone manager decides how many executors you get and where they run. With the right settings, Spark can scale executors up/down as workload changes.
  • Lazy evaluation: Transformations like select, filter, groupBy, and join do not execute immediately. Spark builds a plan and only materializes work when you call an action (count, collect, write, show, etc.). This gives Spark the chance to:
      • Push filters down to the data source
      • Prune unused columns
      • Reorder operations
      • Choose efficient join/partition strategies

Spark Configuration Tuning

Spark offers many knobs and configuration properties that can significantly impact performance. Proper tuning of these settings ensures that your Spark cluster resources are used optimally. Key areas include dynamic resource allocation, memory management, and setting the right level of parallelism for shuffles.

1. Dynamic Resource Allocation

By default, Spark uses a static number of executors for the entire application. However, in real-world scenarios, data sizes can vary from one job run to another or even between stages of a single job. Allocating a fixed number of executors can either waste resources or cause bottlenecks. Dynamic Resource Allocation allows Spark to scale the number of executors up or down based on workload. This means Spark can add executors when your job has a lot of work and remove them when they are no longer needed, improving efficiency. To enable dynamic allocation, set spark.dynamicAllocation.enabled to true in your Spark configuration.
You should also ensure the cluster is configured to support this by enabling either the external shuffle service or Spark’s internal shuffle tracking (via spark.shuffle.service.enabled or spark.dynamicAllocation.shuffleTracking.enabled). Additionally, you can tune related settings such as the minimum and maximum number of executors Spark can allocate (spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors, respectively). For example, you might allow Spark to scale between 2 and 10 executors depending on load:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", 2) \
    .config("spark.dynamicAllocation.maxExecutors", 10) \
    .getOrCreate()

With dynamic allocation on, Spark will automatically “stretch” or “shrink” your cluster usage as needed, so you don’t have to guess the perfect number of executors for each job. This is especially useful for multi-tenant environments or jobs with varying input sizes.

2. Memory Configuration

Memory is a critical resource in Spark. If executors or the driver run out of memory, you will get the dreaded out-of-memory (OOM) errors, which often crash the job. Proper memory tuning helps avoid these issues. When you allocate memory (spark.executor.memory for executors, spark.driver.memory for the driver), it is important to understand that these two serve different purposes. The driver uses its memory mainly for job coordination (building the DAG, scheduling tasks, and collecting results), not for holding large datasets. The executors, on the other hand, use their memory for actual data processing and caching, and this is where most tuning matters.

Each executor JVM divides its allocated heap into regions for different uses:

  • Execution memory (for shuffles, sorts, joins, and aggregations)
  • Storage memory (for caching datasets and broadcast variables)
  • User and system memory (for internal metadata and user objects)

The parameter spark.memory.fraction controls what portion of the executor’s usable JVM heap is reserved for the combined execution + storage regions, with the remainder left for user and system overhead. In current Spark releases, spark.memory.fraction defaults to 0.6 (older releases used 0.75), meaning roughly 60% of the usable executor heap is available for Spark’s data operations, and the rest is reserved for overhead and bookkeeping. Within the region reserved for data operations, Spark dynamically manages execution and storage memory, giving preference to computation when needed (e.g., a large join) and evicting cached blocks if space runs short.

For most applications, this default split works well, but certain workloads benefit from fine-tuning. For example, if you cache large datasets, increasing spark.memory.fraction slightly (e.g., toward 0.7) gives Spark’s unified memory region more room for storage memory, allowing more data to stay in memory and reducing cache evictions. However, this leaves less overall JVM heap available for temporary objects, which can increase the chance of garbage collection or spills during computation. Conversely, if your job involves heavy joins, aggregations, or UDFs with little caching, lowering the fraction slightly reduces the unified region but gives the JVM more free heap outside Spark’s managed memory, helping it handle transient objects and preventing out-of-memory errors during shuffle-intensive workloads.
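As a rough sketch of how these memory settings fit together (the values are illustrative only, not recommendations; spark.memory.storageFraction, which is not covered above, sets how much of the unified region is protected for cached data):

Python

from pyspark.sql import SparkSession

# Illustrative memory settings; tune against your own workload and node sizes.
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.executor.memory", "8g")           # heap per executor, where most work happens
    .config("spark.memory.fraction", "0.6")          # share of usable heap for execution + storage
    .config("spark.memory.storageFraction", "0.5")   # portion of that region protected for caching
    .getOrCreate()
)
# Note: spark.driver.memory is usually set at submit time (spark-submit --driver-memory),
# since the driver JVM is already running by the time this code executes.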
The key is to avoid excessive spilling (which occurs when Spark runs out of in-memory space for shuffles) while leaving enough headroom to prevent OOM errors. You can monitor this balance in the Spark UI’s Executors tab. Frequent disk spills or long GC times are signals that the memory configuration might need adjustment. In summary, allocate sufficient memory to each executor (where most of the work happens), and be mindful of how that memory is divided. Keeping the defaults is often safest unless you have specific caching or shuffle-heavy workloads. And of course, never exceed the physical memory available on a node, or the OS may terminate your processes. 3. Shuffle Parallelism (Partitions) Another important configuration is the degree of parallelism for shuffle operations. Spark breaks tasks into partitions; too few partitions means each task handles a lot of data (potentially straining memory and increasing shuffle time), while too many partitions means a lot of small tasks with overhead and possibly too many small output files. By default, Spark uses 200 shuffle partitions for joins and aggregations (spark.sql.shuffle.partitions = 200). This default may not be optimal for your data size or cluster. Tuning the number of partitions can lead to better performance. For big jobs with very large datasets, you might increase the partitions to get more parallelism. For example, if processing a 1 TB dataset, 200 partitions (~5 GB per partition) is likely too few — you might raise it to 1000 or more so each partition is ~1 GB or less. On the other hand, for smaller jobs or after filtering down data, 200 partitions might be too many (each partition too small). In such cases, reducing the number of partitions can cut down scheduling overhead and small task inefficiencies. Spark’s repartition() and coalesce() transformations or SQL hints can also be used in code to increase or decrease partitions for a given DataFrame. A common rule of thumb is to aim for 100 MB–200 MB of data per partition for efficient processing. You can estimate this by looking at your input data size and dividing by a target partition size. For example, if you expect to process ~50 GB of data, using about 250–500 partitions might be reasonable (since 50 GB / 200 MB ≈ 250). Conversely, if you have a small dataset of 1 GB, having 200 partitions (each ~5 MB) is overkill — you could reduce spark.sql.shuffle.partitions to maybe 10 or 20 in that case. The goal is balanced partitions: not too large to cause memory issues, and not too small to cause excessive task overhead. In Spark 3.x and beyond, there is also an Adaptive Query Execution (AQE) feature that can automatically optimize the number of post-shuffle partitions at runtime. When AQE is enabled (spark.sql.adaptive.enabled=true by default in recent versions), Spark will merge small shuffle partitions or split skewed ones on the fly based on actual data statistics. This helps mitigate issues of suboptimal partition counts. We’ll discuss AQE a bit more later, but it is good to be aware that Spark itself can adjust some of this if configured. Takeaway: Don’t just stick with the default 200 partitions for all jobs. Tune it or let AQE handle it, but ensure your tasks are neither overwhelmed by data nor spending more time coordinating than computing. Code Level Optimization Optimizing Spark is not just about configs. How you write your Spark code has a big impact on performance. 
This section covers best practices in coding your transformations and actions to minimize expensive operations and fully leverage Spark’s internal optimizations. The guiding principle is to work with Spark, not against it: use high-level APIs that allow Spark to optimize under the hood, and be mindful of operations that cause a lot of data movement or computation. 1. Prefer Data Frame/Dataset APIs Over RDDs Apache Spark started with the low-level RDD API, but nowadays you should favor the higher-level Data Frame or Dataset APIs for most tasks. Data Frames and Datasets provide declarative, SQL-like operations (e.g., select, filter, groupBy) that Spark’s Catalyst optimizer can analyze and optimize aggressively. In contrast, RDDs (with manual map, reduce, etc.) do not get those optimizations. The result: Data Frame/Dataset operations are generally much faster and more memory efficient than equivalent RDD code because Spark can optimize the query plan and even generate efficient underlying code (thanks to Tungsten). For example, if you have data you want to aggregate by key, doing it as a Data Frame df.groupBy("key").agg(sum("value")) will let Spark decide the best execution plan. If you did the same with RDDs (e.g., using rdd.groupByKey() and then summing), Spark has less information to optimize, and the operation could be slower or use more memory. In short, leverage Spark’s built-in engine by using DataFrames/Datasets and express your computations at a high level. This not only gives performance benefits, but typically leads to cleaner and more concise code as well. The only time you might need RDDs is for very specialized processing or when using certain legacy libraries, but even then, you can often convert between DataFrames and RDDs as needed. As a rule of thumb, if an operation is supported in the DataFrame API, use it instead of the RDD API. High-level APIs let Spark do the hard work of optimization for you. 2. Filter Early and Minimize Data Shuffles Minimizing data movement (shuffling) is critical for performance. Shuffles (like those that happen during joins or aggregations) involve redistributing data across the network and are often the most expensive part of a Spark job. One of the best ways to cut down shuffle cost is to simply reduce the data size early in your job. Apply filters and projections as soon as possible in your ETL pipeline so that subsequent operations work on less data. If you only need a subset of data, do not wait until after a big join to filter it and instead push those filters up front. Removing even 10% of data upfront can have ripple effects of speeding up downstream shuffles and computations. For example, if you are processing logs for a specific date range, filter by date as the first step rather than after doing a bunch of processing. This “predicate pushdown” approach often happens automatically with DataFrame queries on persistent data (Parquet will only read needed partitions), but it is also a mindset for your transformations: do the narrow, data-reducing transformations early, and the wide, expensive ones (like joins) later on the smaller data. In addition, consider data partitioning before heavy operations to avoid skew and balance the load. If you know you will join two DataFrames on a key that is very skewed, you might repartition the DataFrame by that key first to ensure even distribution, or use techniques like salting. 
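As a minimal sketch of the filter-early and repartition-by-key ideas just described (assuming a SparkSession named spark and a second DataFrame named users; the paths, column names, and partition count are hypothetical):

Python

from pyspark.sql import functions as F

# Filter and project as early as possible so later shuffles move less data.
events = (
    spark.read.parquet("s3://bucket/events/")                      # hypothetical input path
    .where(F.col("event_date").between("2025-01-01", "2025-01-31"))
    .select("user_id", "event_type", "amount")
)

# Repartition by the join key before a large join so matching keys are co-located
# and the load is spread more evenly across tasks (400 is an illustrative count).
events = events.repartition(400, "user_id")
joined = events.join(users, on="user_id")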
On the other hand, if you have a many-partitioned dataset but your next stage is an aggregate that does not need so many partitions, you could coalesce it down to avoid unnecessary shuffle partitions. In summary, think about data volume at each step and try to keep it minimal. Every time you can cut down data size or avoid an unnecessary shuffle, you win performance gains. 3. Use Efficient Join Strategies (Broadcast Small Data) Joins are a common source of performance problems in Spark, especially when joining a very large dataset with a smaller one. A standard join will shuffle both sides of the join across the network to group matching keys together, which can be expensive. However, Spark provides an optimization for the scenario where one dataset is much smaller than the other: broadcast joins. A broadcast join sends a copy of the small dataset to every executor, so that Spark can join without shuffling the large dataset at all. Each executor already has the small dataset locally in memory. This can dramatically speed up joins where, say, you have a large fact table and a small dimension table (e.g., a few MBs or a few thousand records). Spark’s Catalyst optimizer will automatically broadcast a table in a join if it estimates the table size is below a certain threshold (controlled by spark.sql.autoBroadcastJoinThreshold, default ~10MB). But sometimes you know better which table is small, or the estimates are off, so you can also manually hint or specify a broadcast. Using the Data Frame API in PySpark/Scala, you can import the broadcast function and apply it to the smaller Data Frame in a join, for example: Python from pyspark.sql.functions import broadcast # large_df: a huge DataFrame # small_df: a small DataFrame that fits in memory result_df = large_df.join(broadcast(small_df), on="customer_id") In the above code, small_df will be broadcast to all executors. The join can then be performed without a costly shuffle of large_df. Broadcast joins work best when the smaller dataset can easily fit in each executor’s memory. If the “small” dataset is too large to broadcast (bigger than the threshold or memory limits), then a broadcast join could backfire by causing memory pressure, so use this for truly small reference data. Another join strategy is to ensure that both sides of a large join are partitioned by the join key (using Data Frame's repartition on the key, or ensuring the data was written partitioned by that key). If both datasets are partitioned on the join key, Spark will still do a shuffle join, but it can make sure the same keys end up in the same partition with potentially less network traffic (this is more advanced and often handled by Catalyst/AQE). 4. Avoid UDFs and Utilize Built-in Functions Spark allows you to define user-defined functions (UDFs) to run custom code on Data Frame rows, but they come at a cost. UDFs (especially in PySpark) break the optimized flow: they force Spark to serialize data to Python, execute your Python code, and deserialize results. They are also “black boxes” to the Catalyst optimizer (Spark cannot look inside them to apply further optimizations). All this means UDFs typically run much slower than Spark’s native operations. Whenever possible, try to use Spark’s built-in functions (in pyspark.sql.functions or the Data Frame API) instead of writing a UDF. Spark has a rich set of functions for string manipulation, date processing, aggregations, etc., which run in the JVM and can be vectorized and optimized by Spark. 
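To make the difference concrete, here is a minimal sketch (assuming a DataFrame df with a string column called name) comparing a row-at-a-time Python UDF with the equivalent built-in; both produce the same result, but only the second stays inside Spark's optimized engine:

Python

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Slower: a Python UDF pushes every row through the Python interpreter
# and is a black box to the Catalyst optimizer.
to_upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df_slow = df.withColumn("name_upper", to_upper_udf(F.col("name")))

# Faster: the built-in function runs inside the JVM and can be optimized by Catalyst.
df_fast = df.withColumn("name_upper", F.upper(F.col("name")))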
For example, to convert a string column to uppercase, using the built-in func.upper(col("name")) is far faster than writing a Python UDF to do the same, because the built-in version will execute within Spark’s engine (and even leverage optimized Java code). If you absolutely must use a UDF (for example, you need some complex logic not available as a built-in, or you want to use a specific Python library function on your data), consider using Pandas UDFs (vectorized UDFs). Pandas UDFs can process a batch of rows at once using Apache Arrow for data transfer between JVM and Python, which is much more efficient than the “one row at a time” standard UDF. They were introduced in Spark 2.3+ and can often give performance closer to built-ins while still allowing you to write Python code for the custom logic. To summarize: avoid UDFs unless necessary. Check Spark’s function library first — chances are there is already a way to do what you need. By using built-in functions, you keep the execution within the optimized Spark engine and benefit from Catalyst’s ability to rearrange and optimize those operations. 5. Cache Judiciously to Reuse Data Spark’s lazy evaluation means it will recompute transformations each time an action is called, unless it can optimize otherwise. If you have a DataFrame or RDD that you need to use multiple times in your job (say you do multiple actions or multiple joins with it), it might make sense to cache (persist) that data in memory. Caching can save expensive recomputation in iterative algorithms or complex pipelines. However, caching is not a silver bullet. Use it only for datasets that are reused enough to pay off the cost of caching. If you cache something that is only used once, you incur extra overhead (to store it) for no benefit. Identify points in your job where caching makes a difference. For example, if you filter and transform a base Data Frame and then use the result in two separate joins, caching the result of that filter + transform step might be worthwhile so that step is not redone for each join. When using df.cache() or df.persist(), prefer the MEMORY_ONLY storage level for purely in-memory caching, or MEMORY_AND_DISK if the data is too large (Spark will spill to disk if it cannot fit in RAM). Also, remember to undo persisting the data later if it is no longer needed to free up executor memory. A common pattern is to cache the “core” dataset that many downstream operations depend on. For instance, in a machine learning pipeline, you might cache the prepared feature Data Frame that is used by multiple models for training. Or in an ETL, cache the cleaned data that will be aggregated in multiple ways. This can give significant speedups. Spark will automatically avoid recomputation of some intermediate results thanks to DAG optimization, but it is not omniscient — explicit caching is your tool to hint to Spark, “I’m going to use this again, keep it around.” Used properly, caching and persistence are powerful for optimization. Data Storage and Format Optimization Optimizing how data is stored and accessed can greatly speed up Spark jobs, especially for I/O heavy workloads. This is often an overlooked aspect of performance. By choosing efficient data formats and organizing data cleverly, you can reduce the amount of data that Spark needs to read and write, thus accelerating your jobs. 
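As a running example for the storage techniques described below (assuming a SparkSession named spark and a DataFrame named df; the paths and column names are hypothetical), writing partitioned, compressed Parquet and reading it back with a partition filter might look like this:

Python

# Write: columnar format, compressed, partitioned by commonly filtered keys.
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .option("compression", "snappy")
   .parquet("s3://bucket/sales_parquet/"))      # hypothetical output path

# Read: only matching partition directories are scanned (partition pruning),
# and the columnar layout lets Spark skip columns it does not need.
recent = (
    spark.read.parquet("s3://bucket/sales_parquet/")
    .where("year = 2025 AND month = 10")
    .select("order_id", "amount")
)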
Use Columnar, Compressed File Formats If you have a choice, store your data in a columnar format like Parquet or ORC rather than as plain text (CSV/JSON) or other row-based formats. Columnar formats are optimized for analytic queries: Spark can skip reading columns that are not needed (saving I/O) and apply predicate pushdown (only read data that matches filters). They also integrate well with Spark’s off-heap memory management. Additionally, they support compression and encoding that significantly reduce file sizes. For example, Parquet with Snappy compression often gives huge space savings and speeds up data reads due to needing to transfer less data from disk. The compression is automatic and split into blocks so that Spark can still read in parallel. In short, prefer Parquet/ORC with compression for Spark data — your jobs will thank you with faster disk and network operations. Partition Data on Disk for Pruning If your data is large and stored in a distributed filesystem (like HDFS or S3), organize it into partitioned directories by key fields that you commonly filter by. For instance, it is very common to partition datasets by date (.../year=2025/month=10/day=11/ etc.). When you load a partitioned table and apply a filter like WHERE year=2025 AND month=10, Spark (through Hive Metastore or internal logic) will only read the folders matching those partition values, skipping the rest entirely. This partition pruning can cut down the runtime dramatically for time range queries or other partition key filters. Choose partition columns that have enough granularity to split data, but not so high cardinality that you get a million tiny files. Dates, regions, or categorical fields are good candidates. On a related note, avoid creating too many small files in the output. If you have, say, 1000 partitions each resulting in a small file of a few KB, Spark’s reading overhead increases (and many files can strain the file system). It’s often better to coalesce or repartition data before writing out to end up with larger chunkier files (e.g., each 100MB or more). The goal is a balance: not one huge monolithic file (which can’t be read in parallel efficiently), but also not tens of thousands of tiny files. File I/O operations have overhead, so fewer, larger files are usually better for throughput. Bucketing and Sorting For certain use cases (especially joining on a key or doing aggregations), partitioning might not be enough. Spark offers bucketing (hash partitioning data into a fixed number of files by column and optionally sorting within each bucket). If two tables are bucketed by the same key into the same number of buckets, Spark can do a join or aggregation on those with less shuffle. Bucketing is more of an advanced technique and adds complexity (and it is static once set up), but it can be worth it for repeated queries on huge data. If you have control over how data is written, consider bucketing on high cardinality join keys to optimize those operations. In summary, store data in a Spark-friendly way. Use the right formats (columnar + compressed), partition the data for predicate pushdown, and write outputs in appropriately sized files. These optimizations at the storage layer often make a big difference, because they reduce the amount of work Spark has to do before your actual computation even starts. Monitoring Tuning is an iterative process. 
Even after applying the above strategies, it’s important to monitor your Spark application’s performance and identify any remaining bottlenecks or areas to improve. Spark provides excellent tools for this. The Spark Web UI is your first stop. When you run a Spark application, you can access its UI (typically at http://<driver-node>:4040 for local or the tracking URL in cluster mode). On the UI, examine the Jobs and Stages tabs to see where time is being spent. Look at metrics like Task Duration, Shuffle Read/Write, Memory Spill, etc. For example, if you see a stage with an enormous Shuffle Read size, that is a hint that maybe you need to reduce data shuffling or use a broadcast join. Large "Shuffle Write" metrics indicate a lot of data being shuffled; this might signal an opportunity for optimization by filtering or repartitioning differently. Similarly, if you notice tasks are slow due to GC, that indicates memory tuning might be needed. Spark UI also helps spot data skew. If one task is running much longer than others in a stage, perhaps one partition had way more data (skew), and you might handle that by better partitioning or using techniques like salting or the spark.sql.adaptive.skewJoin.enabled feature in AQE. Besides the UI, you can enable Spark event logs and use tools or Spark History Server to analyze runs after the fact. There are also external APM (Application Performance Management) tools and metrics sinks (Ganglia, Prometheus, etc.) that can collect Spark metrics. Leverage these tools to get insight into your job’s behavior. It is often helpful to test your job on a sample of data or on a smaller scale, profile it, and then extrapolate or catch issues before running on full data. Finally, remember that optimization is a balancing act and sometimes requires tradeoffs. The “best” settings can depend on your specific data sizes, data distributions, and cluster resources. What works for one job (or one day’s data) might need adjustment for another. By monitoring and understanding Spark’s execution, you will build an intuition for where the bottleneck is — be it CPU, memory, network, or I/O and apply the appropriate fix from the toolbox of techniques discussed above. Key Takeaways Leverage Spark’s optimizer: Use high-level APIs (Data Frames/Datasets) and built-in functions to let Spark’s Catalyst and Tungsten engines optimize your code. Avoid using RDDs or UDFs unless absolutely necessary, as they bypass many optimizations.Tune resources and configs: Adjust Spark configurations to your workload. Enable dynamic allocation for variable workloads, set appropriate executor memory to avoid OOM, and tune the number of shuffle partitions so tasks are neither too large nor too small.Minimize data shuffling: Design your computations to reduce data movement. Filter data early in the pipeline, and use techniques like broadcast joins for small lookup datasets. Where possible, partition or coalesce data to avoid skew and excessive tiny tasks.Efficient data storage: Use columnar formats like Parquet/ORC with compression for input and output data. Partition your data files by common keys (e.g., date) to enable Spark to skip reading unnecessary data. Avoid writing out a myriad of small files. Aim for fewer, larger files for better I/O throughput.Cache smartly: Cache/persist intermediate datasets only when they are reused multiple times in your job. 
This can save time on re-computation for iterative algorithms or repeated joins, but do not waste memory on caching data that is used just once. Monitor and iterate: Use the Spark UI and logs to profile your job’s performance. Identify stages with heavy shuffles, long tasks, or skew, and address those with the appropriate optimization techniques. Optimization is iterative — observe, tweak, and repeat to home in on the best performance for your Spark application. By following these best practices and continuously observing your Spark jobs, you can significantly improve the speed and efficiency of your Apache Spark programs.
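To make the caching and storage takeaways concrete, here is a minimal PySpark sketch that filters early, caches a result reused by two aggregations, and writes partitioned, compressed Parquet; the paths, columns, and DataFrames are illustrative assumptions, not code from the article.
Python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-and-storage").getOrCreate()
events = spark.read.parquet("s3://my-bucket/events/")  # hypothetical input path

# Filter early, then cache the intermediate result that feeds two aggregations
active = (events
          .filter(F.col("status") == "active")
          .select("user_id", "event_date", "amount"))
active.cache()  # or active.persist(StorageLevel.MEMORY_AND_DISK); from pyspark import StorageLevel

totals_by_user = active.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))
counts_by_day = active.groupBy("event_date").agg(F.count("*").alias("event_count"))

totals_by_user.write.mode("overwrite").parquet("s3://my-bucket/out/by_user/")
counts_by_day.write.mode("overwrite").parquet("s3://my-bucket/out/by_day/")

active.unpersist()  # free executor memory once the cached data is no longer needed

# Columnar, compressed, partitioned output with a controlled number of files
(events
 .repartition("event_date")            # avoid thousands of tiny files per partition
 .write.mode("overwrite")
 .option("compression", "snappy")
 .partitionBy("event_date")
 .parquet("s3://my-bucket/out/events_partitioned/"))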

By Nataraj Mocherla
CNCF Triggers a Platform Parity Breakthrough for Arm64 and x86
CNCF Triggers a Platform Parity Breakthrough for Arm64 and x86

The Challenge Developing open-source software for deployment on Arm64 architecture requires a robust continuous integration and continuous deployment (CI/CD) environment. Yet, there has historically been a disparity between the levels of support for Arm64 and traditional x86 processor architectures, with Arm64 usually at a disadvantage. Developers of infrastructure components for multiple architectures have certain expectations of their work environments: Consistency of the tools and methods they use across platforms, so they don’t have to adopt different development procedures just to adopt a less prevalent platform.Performance from their platforms and support mechanisms, so their deployment schemes don’t suffer from speed deficiency when they choose to support multiple platforms.Testing coverage so the very same tests for efficiency, compliance, and security apply to all platforms simultaneously and without substantial differentiation.Maintainability, enabling developers to automate their integration and redevelopment processes so they apply to all platforms without alteration. Product managers for these same components have these same requirements, plus at least two more: Platform coverage capability, so that technical account managers (TAM) may have the skills and readiness they need to respond to customer needs.Support tiering capability, enabling TAM and other IT personnel to classify their levels of software support according to their capability to respond to urgent or emerging customer issues. The Solution Working in collaboration with both Ampere and infrastructure provider Equinix, open-source developer Alex Ellis made available his Actuated CI/CD platform to some of the most critical open-source projects in the cloud-native software ecosystem. Actuated takes GitHub self-hosted automation processes demonstrated by security engineers to be inherently vulnerable to malicious attack, and runs them in microVMs abstracted from the public Internet. Implementation Several key open-source Cloud Native Computing Foundation projects took advantage of an Actuated environment to run all of their GitHub Actions for Arm64. This environment is based on Ampere® Altra® processors made available with the help of infrastructure provider Equinix. The success of this initiative was instrumental in prompting GitHub to implement full support of the Arm64 architecture with GitHub Actions. Now, developers who had been running Arm64 build processes in QEMU emulation environments on x86 architectures can relocate those processes to Arm64 on bare metal. Self-Hosted Runners for GitHub Actions on ARM64 GitHub dominates the hosting of software projects these days. The most popular way that GitHub-hosted projects generate builds and releases for continuous integration is with the platform’s built-in CI toolset, GitHub Actions. The most important role played by the GitHub Actions CI/CD platform is automating software development pipelines. The party responsible for triggering any GitHub Action is a runner. It’s an agent running on a server, waiting for something to do and eager to do it once it’s given the assignment. It’s assigned a job from the workflow and tasked with getting it done. GitHub is a complete software deployment platform. As such, it hosts its own runners, each of which is adapted to its specified target environment and architecture. Until recently, GitHub did not offer hosted runner environments for Arm64. 
Projects that wanted to generate Arm64-native builds did have an option — the self-hosted runner. GitHub users could install an agent on a physical or virtual machine hosted elsewhere, and have GitHub Actions dispatch jobs to that host, managed by the project users. This required project administrators not only to manage the project itself but also to take care of the maintenance and security of the build environment that the projects would use. In CNCF’s case, developers took advantage of credits to Equinix Metal, enabling them to provision bare metal instances and use them as self-hosted runners for projects. But for a code lab whose projects must be made available 24/7/365 to other developers worldwide, the security of self-hosted runners poses a challenge: Anyone could clone the project repository, modify the Actions jobs, and get access to the runner node to run arbitrary jobs, according to this GitHub documentation. Another problem was ensuring consistency between CI runs. With self-hosted runners, if there were side effects of the CI jobs, such as configuration changes or files left behind afterwards, they would still be there for ensuing jobs. This posed a problem — when running a CI job to build or test software, you should have a controlled environment, so that the only thing that changes between runs is the software. In the case of self-hosted runners, the environment can drift over time. In the absence of a cleanup process, it was possible for runs of the same build job on the same host to generate different results over time. One way developers bypassed the need for Arm64 native runners was by running virtual Arm64 environments on x86 servers, using QEMU open-source emulation. Emulated environments add a huge performance overhead for software compilations, which run at a fraction of the pace of compilations on native, non-emulated hardware. Emulation worked well enough for developing small to medium projects. But if developers had to build something big and important for ARM64, the strain would become so great on their virtual environments that builds would completely fail. “In the past, people were doing builds using QEMU,” said Equinix’s Developer Partner Manager Ed Vielmetti. “Say you were building a compiler, where the intermediate steps require large amounts of memory and very deep integration with the processor. That just would not work in an emulated environment.” The Disparity Phenomenon Unlike the typical enterprise, the Cloud Native Computing Foundation has a special obligation to build its cloud-native components for all the world’s major processor architectures. Projects such as the containerd portable container runtime, the etcd key/value data store, the fluentd log data collector, the Falco real-time threat detection tool, and the OpenTelemetry observability and instrumentation toolkit, among dozens of others, are critical dependencies for the cloud-native ecosystem, and as such, must be built for both x86 and Arm64. To build low-level infrastructure components with support for Arm64, CNCF developers need access to native Arm64 infrastructure. This means, ironically, they need the very class of tools they’re trying to create. At first, Ampere and Equinix collaborated with CNCF to address these gaps by donating Ampere Altra-based servers or setting up Altra-based bare metal nodes at Equinix facilities. The granularity of the Arm64-based server resources that Equinix could share was bare metal nodes — a 160-core dual-socket Ampere Altra system. 
Ideally, a server like this would be shared among several projects, but this was, at the time, beyond the capabilities of the CNCF. This is the problem that Ampere and Actuated proposed to solve for CNCF by allowing multiple projects to run on fewer hosts, thus providing easy access to build services for more projects while consuming less hardware. “OpenTelemetry is a full-on, full-time-on, CI/CD system,” said Antoine Toulmé, Senior Engineering Manager for Blockchain and DLT and Splunk Maintainer for OpenTelemetry project. “We were able to leverage [our Ampere server] infrastructure for ourselves, but we weren’t able to share it with open source at large." “We cannot give GitHub runners away,” Toulmé said. “Once we were happy with certifying the downstream distributions to our customers, we opened issues with the OpenTelemetry project saying we would like to see ARM64 support being delivered at the highest level — meaning, it should run for every commit, it should run for main, it should run all the time. And the feedback was, well, great, but there are no ARM64 runners in GitHub. So we’re going to need you to work with what we can do here.” Due to the lack of readily available Arm64 platforms for these projects, developers were unaware if the changes they committed were causing issues on Arm64, as test suites were not run as frequently as for x86. Since container orchestration platforms are among the platforms being developed to support Arm64, this phenomenon became a vicious cycle: Releases were gated on passing integration test suites for x86, but releases were not gated on the same test suites passing for Arm64. The solution CNCF’s developers would discover falls far short of qualifying as radical or revolutionary — in fact, it’s more of a bug fix in practice. It’s so simple to implement that it completely compensates for this disparity, not just for CNCF but for any developer of any platform-level component for any architecture. Breakthrough: Actuated, Plus Editing One Line of Code To take the first step towards platform parity between x86 and Arm64, Ampere enlisted the help of Alex Ellis, the creator of a service called Actuated. It’s a product that runs GitHub Actions jobs in secure, isolated microVMs, instrumented to receive build jobs from GitHub Actions, and offering developers visibility into the performance of their build jobs and the load on the shared build systems. Actuated could run all the CNCF’s existing GitHub Actions runners after altering a single line of their configuration files, plus in some cases the pasting of a few code snippets — changes which took less than five minutes to implement. These changes enabled GitHub-hosted projects to point to Actuated’s microVM-driven environment on Ampere Altra processors for their build jobs. “Falco really needed Arm64 GitHub runners to elevate its support for the architecture and enlarge its user base,” Falco project Senior Open Source Engineer and Sysdig Maintainer Federico Di Pierro said. “[Actuated] was the perfect solution for us because it was easy to leverage and relieved any burden for the maintainers. This way, we as maintainers can focus on what really matters for the project, instead of fighting with maintaining and deploying self-hosted infrastructure. 
Now we are building, testing, and releasing artifacts for ARM64, leveraging Actuated for many of our projects, and it works flawlessly.” Having seen the increase in demand for Arm native build environments in recent years, GitHub announced last June the availability in public beta of Arm64-based hosted runners for GitHub Actions, powered by Ampere compute instances on Microsoft Azure, followed in January 2025 by the release into public preview of free hosted runners for public repositories. For OpenTelemetry, this means the end of network loads as high as 10 times their assigned bandwidth caps, on account of OpenTelemetry builds constantly downloading dependencies from Docker Hub repositories. “Yeah, we were definitely breaking things,” the OpenTelemetry Project’s Antoine Toulmé said. “We got lucky, because the Arm runners for GitHub shipped. We have moved to ARM runners, we are happy as can be, and nothing is breaking anymore.” Now for the first time, project maintainers can pay as close attention to the safety and security of Arm64 builds as they have for x86 builds, knowing that they’re no longer likely to encounter performance degradations or penalties. “[Actuated] gave us great confidence in the CI builds on ARM64,” Principal Software Engineer and AWS Maintainer for the containerd project Phil Estes said. “If the Arm CI breaks now, there’s no way we will merge that [pull request] until we figure out why... We have full confidence now that [build failures] are not an issue with flaky hardware [as they sometimes were before].” For its part, Oracle is continuing its policy of donating $3 million per year in OCI credits for Arm64 instances powered by Ampere to CNCF projects. This generosity, along with the newfound stability of Arm64 platforms catalyzed by Ampere and Equinix, and brought about by Actuated, is enabling prominent cloud infrastructure vendors, including Red Hat, SUSE, Canonical, and Mirantis, to provide full support for their enterprise customers who choose ARM64 infrastructure. Parity makes it possible for enterprises to make sensible choices about their computing infrastructure and platforms without incurring penalties just for choosing an alternative architecture. Large cloud customers are proving that Arm64 can provide organizations with the performance they need and reduced expenses for workloads — all with industry-leading energy efficiency. But organizations can’t experience those benefits until they can deploy their workloads on all infrastructure options on a level playing field with one another and measure the results for themselves. Leveling the Playing Field In early 2023, few options existed for GitHub-hosted projects that wanted to fully integrate Arm64 into their continuous integration processes. Through this initiative, leveraging an innovative software solution from Actuated with Ampere CPUs hosted by Equinix, we lowered the bar for CNCF projects to make a start towards parity of support for ARM64 and x86. Key cloud-native projects, including etcd, containerd, Open Telemetry, Falco, and others, were able to advance their support of Arm64, accelerate their CI runs on native Arm64 infrastructure, and support increasing numbers of their users taking advantage of ARM64 compute in the cloud. By the end of this pilot project, the number of options for developers has grown considerably. 
The CNCF now offers its projects the ability to run GitHub Actions jobs on managed Kubernetes clusters on OCI, using Ampere-powered instances and the GitHub Actions Runner Controller project. With the addition of hosted Arm64 runners to GitHub, it has never been easier for projects to support this fast-growing and exciting architecture for cloud-native applications. Check out the full Ampere article collection here.

By Scott Fulton III
Stop React Native Crashes: A Production-Ready Error Handling Guide
Stop React Native Crashes: A Production-Ready Error Handling Guide

You are demonstrating your next mobile application idea to investors, and suddenly the application crashes. You have no idea whether it crashed because of bad Wi-Fi at the venue, because your service is down, or because the application itself hit a snag. There are no logs or alerts to tell you what just happened. Merely imagining this scenario is enough to send any founder or developer into a panic. Therefore, having robust error handling is critical to modern-day mobile applications. React Native has become an industry standard for building cross-platform native mobile applications. React Native boasts a rich ecosystem of error handling at various levels of architecture. However, projects often lack clear guidelines on how to handle various error scenarios. This results in either overdoing or underdoing error handling in otherwise robust and feature-rich applications. In this article, you will learn how to adopt a three-layer error handling mechanism, which is gaining widespread popularity with major enterprise-level applications built at leading companies. The article will give you a practical guide instead of pure theory, so you are ready to apply these concepts immediately. Error Types To understand the layers of error handling in React Native applications, you first need to understand the core error types. These error categories are explained below: Network errors: Network errors can occur due to service loss, for example, losing cell reception while traveling or a Wi-Fi disconnection. These are the most common types of errors a mobile app experiences. Component errors: Component errors can happen due to wrongly implemented React lifecycle hooks, incorrect rendering of JSX, accessing undefined state properties, or infinite render loops. These errors can crash the entire app at once. Runtime errors: These errors are core JavaScript runtime errors like accessing properties on null or undefined values, JSON parsing failures, type mismatches, and unhandled promise rejections. These errors are the toughest to catch as they usually slip through standard testing. Due to the varied nature of these errors, error handling needs to be layered to prevent crashes and to recover the application from a crash state. Three Layers of Error Handling Implementing error handling at three critical layers ensures you catch all major errors. These layers are local error handling, component-level error handling, and global error handling. Please note that these layers are not a 1:1 match for each of the above error types; instead, they work together to ensure that you catch all errors safely. Three layers of error handling Layer 1: Local Error Handling Local error handling refers to errors caught and handled within specific functions or code blocks. These errors are predictable and generally recoverable. You can perform several types of local error handling as described below: Try-catch: Classic try-catch blocks are particularly useful when working with async operations such as calling external APIs. Please refer to the code below: JavaScript const fetchUserProfile = async (userId) => { try { const response = await fetch(`https://api.example.com/users/${userId}`); if (!response.ok) { throw new Error(`HTTP error! status: ${response.status}`); } const data = await response.json(); return data; } catch (error) { console.error('Failed to fetch user profile:', error); Alert.alert('Error', 'Unable to load profile.
Please try again.'); return null; } }; React query error handling: Async operations can be enhanced using modern constructs like React Query, which, along with standard try-catch level error handling, offers more options such as retrying failed operations. This allows for graceful recovery from an error state. Observe the sample implementation given below. JavaScript import { useQuery, useMutation } from '@tanstack/react-query'; import api from './services/api'; const UserProfile = () => { const { data, error, isError, isLoading } = useQuery({ queryKey: ['user', 'profile'], queryFn: async () => { const response = await api.get('/user/profile'); return response.data; }, retry: 3, retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 30000), onError: (error) => { if (error.code !== 'NETWORK_ERROR') { Alert.alert('Error', 'Failed to load profile'); } }, }); if (isLoading) return <Text>Loading...</Text>; if (isError) return <Text>Error: {error.message}</Text>; return <Text>{data?.name}</Text>; }; Form validation: Form validations at React Native components not only improve the app’s user experience but also improve backend performance by reducing unnecessary API calls. Below is an example for you to refer to: JavaScript import { useForm } from 'react-hook-form'; import { z } from 'zod'; import { zodResolver } from '@hookform/resolvers/zod'; const loginSchema = z.object({ email: z.string().email('Invalid email address'), password: z.string().min(6, 'Password must be at least 6 characters'), }); const LoginForm = () => { const { handleSubmit, formState: { errors } } = useForm({ resolver: zodResolver(loginSchema), }); const onSubmit = async (data) => { try { await api.post('/auth/login', data); } catch (error) { Alert.alert('Login Failed', error.message); } }; return ( <View> {errors.email && <Text style={styles.error}>{errors.email.message}</Text>} {errors.password && <Text style={styles.error}>{errors.password.message}</Text>} </View> ); }; Layer 2: Component-Level Error Handling Error boundaries catch component-level errors in the child component tree that go uncaught during local error handling. These can include infinite rendering, improper lifecycle management, and accessing undefined properties during rendering inside components. Without an Error Boundary, a single error can crash the entire application, leading to an unpleasant customer experience. There are two major steps involved in creating an error boundary: 1) creating a dedicated error boundary component, and 2) wrapping the entire application code in it as a child. The code below highlights how to create the error boundary component. JavaScript import React from 'react'; import { View, Text, TouchableOpacity, StyleSheet } from 'react-native'; class ErrorBoundary extends React.Component { constructor(props) { super(props); this.state = { hasError: false, error: null, errorInfo: null }; } static getDerivedStateFromError(error) { return { hasError: true, error }; } componentDidCatch(error, errorInfo) { console.error('ErrorBoundary caught an error:', error, errorInfo); this.setState({ errorInfo: errorInfo.componentStack, }); } handleReset = () => { this.setState({ hasError: false, error: null, errorInfo: null }); }; render() { if (this.state.hasError) { return ( <View style={styles.container}> <Text style={styles.title}>Oops! Something went wrong</Text> <Text style={styles.message}> We're sorry for the inconvenience. Please try again. 
</Text> {__DEV__ && this.state.error && ( <View style={styles.errorDetails}> <Text style={styles.errorText}> {this.state.error.toString()} </Text> {this.state.errorInfo && ( <Text style={styles.errorStack}> {this.state.errorInfo} </Text> )} </View> )} <TouchableOpacity style={styles.button} onPress={this.handleReset} > <Text style={styles.buttonText}>Try Again</Text> </TouchableOpacity> </View> ); } return this.props.children; } } Once the error boundary component is created, you can wrap the entire application in it by referring to the code below. JavaScript import React from 'react'; import { SafeAreaView } from 'react-native'; import ErrorBoundary from './components/ErrorBoundary'; import MainNavigator from './navigation/MainNavigator'; const App = () => { return ( <ErrorBoundary> <SafeAreaView style={{ flex: 1 }}> <MainNavigator /> </SafeAreaView> </ErrorBoundary> ); }; export default App; Layer 3: Global Error Handling Global error handling is the final safety measure that helps catch errors left uncaught by the local and component-level layers. These include uncaught JavaScript exceptions, unhandled promise rejections, and more. Furthermore, global error handling can help centralize API error handling, creating manageable error handling for entire applications. Global error handling component: Implementation of the global error handling component is quite similar to the error boundary component, with two stages: creating the component and using it in the root application initialization. You can refer to the code below to create a global error handling component. JavaScript import { Alert } from 'react-native'; class GlobalErrorHandler { static init() { this.setupErrorHandler(); this.setupPromiseRejectionHandler(); } static setupErrorHandler() { ErrorUtils.setGlobalHandler((error, isFatal) => { console.error('Global Error Handler:', { error, isFatal, message: error.message, stack: error.stack, }); if (isFatal) { Alert.alert( 'Unexpected Error', 'The app encountered a critical error and needs to restart. ' + 'We apologize for the inconvenience.', [ { text: 'Restart', onPress: () => {}, }, ] ); } else { console.warn('Non-fatal error occurred:', error.message); } }); } static setupPromiseRejectionHandler() { if (__DEV__) { require('promise/setimmediate/rejection-tracking').enable({ allRejections: true, onUnhandled: (id, error) => { console.warn('Unhandled Promise Rejection (Dev):', id, error); }, onHandled: (id) => { console.log('Promise rejection was handled:', id); }, }); } } static handleError(error, context = {}) { console.error('Handled Error:', { error, context, message: error.message, stack: error.stack, }); let userMessage = 'Something went wrong'; if (error.code === 'NETWORK_ERROR') { userMessage = 'No internet connection. Please check your network.'; } else if (error.code === 'UNAUTHORIZED') { userMessage = 'Your session has expired. Please login again.'; } else if (error.message) { userMessage = error.message; } return userMessage; } } export default GlobalErrorHandler; Once created, initialize it in the main application component using the reference code below.
JavaScript import React, { useEffect } from 'react'; import { SafeAreaView } from 'react-native'; import ErrorBoundary from './components/ErrorBoundary'; import GlobalErrorHandler from './utils/GlobalErrorHandler'; import MainNavigator from './navigation/MainNavigator'; const App = () => { useEffect(() => { GlobalErrorHandler.init(); }, []); return ( <ErrorBoundary> <SafeAreaView style={{ flex: 1 }}> <MainNavigator /> </SafeAreaView> </ErrorBoundary> ); }; export default App; Axios Interceptor: The Axios interceptor acts as a centralized API error handler. You can refer to the code below to implement it. After the implementation, you can import it wherever you need to call the API. JavaScript import axios from 'axios'; import AsyncStorage from '@react-native-async-storage/async-storage'; import { Alert } from 'react-native'; const api = axios.create({ baseURL: 'https://api.example.com', timeout: 10000, }); api.interceptors.request.use( async (config) => { const token = await AsyncStorage.getItem('authToken'); if (token) { config.headers.Authorization = `Bearer ${token}`; } return config; }, (error) => Promise.reject(error) ); api.interceptors.response.use( (response) => response, async (error) => { if (!error.response) { Alert.alert('Network Error', 'Please check your internet connection'); return Promise.reject({ code: 'NETWORK_ERROR', message: 'No internet connection', }); } switch (error.response.status) { case 401: await AsyncStorage.removeItem('authToken'); return Promise.reject({ code: 'UNAUTHORIZED', message: 'Session expired. Please login again.', }); case 403: Alert.alert('Access Denied', 'You do not have permission'); break; case 404: Alert.alert('Not Found', 'Resource not found'); break; case 500: case 502: case 503: Alert.alert('Server Error', 'Please try again later'); break; } return Promise.reject(error); } ); export default api; Conclusion You have learned about the different errors your React Native application can encounter and a layered approach to handling them. This knowledge will help you build robust mobile applications that are reliable and resilient. If you wish to dive deeper, you can explore error logging services like Sentry and Firebase Crashlytics. While these services have costs, they are very useful tools for understanding and fixing production errors.

By Anujkumarsinh Donvir DZone Core CORE
Efficiently Reading Large Excel Files (Over 1 Million Rows) Using the Open-Source Sjxlsx Java API
Efficiently Reading Large Excel Files (Over 1 Million Rows) Using the Open-Source Sjxlsx Java API

If you are a developer, regardless of the technologies you use, at some point, you will face the challenge of handling large Excel files. Most commonly, you will see the "out of memory" error when handling these files. Here is the requirement: You get a workbook with 100 spreadsheets, and each of them has 1 million records with a minimum of 100 columns. For your reference, here's how many rows each Excel version can support: 1,048,576 rows – Excel 365, 2013, 2010, and 2007; 65,536 rows – Excel 2003 and earlier versions. Cost and Hosting Server Limitations Third-party license purchases are not feasible; open source is allowed. Deployment should be on average-tier cloud environments (4 GB disk/4 GB RAM or less) or on-premises Windows Server (16/32 GB RAM) already loaded with 10+ running applications. How would you handle this situation? Let's explore the current digital market. Are there any open-source solutions available to meet this requirement? Interop is not an option when the cloud runs on Linux, and it also causes intermittent hang issues in multi-threaded applications. Most developers will point to the popular "Apache POI" streaming library. Let's proceed with a practical implementation using Apache POI. To get a large dataset (Excel) file, there are plenty of websites, such as awesome-public-datasets, Google Dataset Search, and World Bank Data. Personally, I frequently visit Kaggle. Let's download the Containers dataset as a 1.9 GB CSV file and then save it as an Excel file (.XLSX), which comes to around 600 MB. The sheet includes 87 columns and 1,048,576 rows. Before we run the solution, let me capture my laptop's resource usage so that we can compare later. Resource Usage CPU: 3%, Memory: 54% IDE I am using Eclipse IDE (v: 4.36.0). Apache POI I am using the latest Apache POI 5.2.5 and other dependency JARs available from Apache POI. Source Code Here, I am just attempting to read the sheet name from the workbook, not the rows. Java public static void main(String[] args) throws Exception { String filePath = "C:\\POC\\Containers_Dataset.xlsx"; ReadExcelbyApachePOI(filePath); } /*List out sheet name*/ static void ReadExcelbyApachePOI(String filePath) throws Exception { try (OPCPackage opcPackage = OPCPackage.open(new File(filePath), PackageAccess.READ)) { XSSFWorkbook workbook = new XSSFWorkbook(opcPackage); XSSFReader xssfReader = new XSSFReader(opcPackage); StylesTable styles = xssfReader.getStylesTable(); XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader .getSheetsData(); while (iter.hasNext()) { InputStream stream = iter.next(); String sheetName = iter.getSheetName(); System.out.println("Sheetname: " + sheetName); } } catch (IOException e) { e.printStackTrace(); } } Result We encounter a "Java heap space" out-of-memory error. Java Exception in thread "main" java.lang.OutOfMemoryError: Java heap space Resource Usage CPU: 58%, Memory: 94% Let us try with the sjxlsx open-source API. It's an open-source Java API whose source code was first published on Google Code, and the project now appears unmaintained. On GitHub, it is available for anyone to download and update for their needs. "Memory" and "speed" are the primary goals of this API. It provides two modes: "classic" and "stream." Classic – All records of the sheet will be loaded. Stream – Read one record at a time. Microsoft XLSX uses XML+zip (OOXML) to store the data. So, to be fast, "sjxlsx" uses STAX for XML input and output.
Source Code Java public static void main(String[] args) throws Exception { String filePath = "C:\\POC\\Containers_Dataset.xlsx"; SimpleXLSXWorkbook workbook = newWorkbook(filePath); testLoadALL(workbook); } private static SimpleXLSXWorkbook newWorkbook(String filePath) { return new SimpleXLSXWorkbook(new File(filePath)); } /*Read Each Row*/ private static void printRow(int rowPos, com.incesoft.tools.excel.xlsx.Cell[] row) { int cellPos = 0; for (com.incesoft.tools.excel.xlsx.Cell cell : row) { System.out.println(com.incesoft.tools.excel.xlsx.Sheet.getCellId(rowPos, cellPos) + "=" + cell.getValue()); cellPos++; } } /*Load & Read workbook * false => Read each row * true => Load all rows */ public static void testLoadALL(SimpleXLSXWorkbook workbook) { com.incesoft.tools.excel.xlsx.Sheet sheetToRead = workbook.getSheet(0,false); SheetRowReader rowreader = sheetToRead.newReader(); int rowPos = 0; while (rowreader != null) { com.incesoft.tools.excel.xlsx.Cell[] row = rowreader.readRow(); printRow(rowPos, row); rowPos++; } } Resource Usage CPU: 3% (No changes)RAM: 61% (7% usage: 1 GB usage) Output Java BN1048519=40298 BO1048519=0 BP1048519=0 BQ1048519=0 BR1048519=0 BS1048519=610 BT1048519=0 BU1048519=1 BV1048519=64240 BW1048519=923 BX1048519=158 BY1048519=32 BZ1048519=0 CA1048519=0 CB1048519=0 CC1048519=0 CD1048519=0 CE1048519=0 CF1048519=0 CG1048519=0 CH1048519=10000206 CI1048519=0 A1048520=100.64.0.2-10.16.0.9-35919-8080-6 B1048520=100.64.0.2 C1048520=35919 D1048520=10.16.0.9 E1048520=8080 F1048520=6 G1048520=45266.83932053241 H1048520=41626 I1048520=6 J1048520=5 K1048520=515 L1048520=357 M1048520=515 N1048520=0 O1048520=85.8333333333333 P1048520=210.24786958888899 Q1048520=357 R1048520=0 S1048520=71.400000000000006 T1048520=159.65525359348399 U1048520=20948.445682986501 V1048520=264.25791572574798 W1048520=4162.5999999999904 X1048520=12728.124713056101 Y1048520=40374 Z1048520=9 AA1048520=41626 AB1048520=8325.2000000000007 AC1048520=17922.528077813098 AD1048520=40374 AE1048520=29 AF1048520=41594 AG1048520=10398.5 AH1048520=20011.5685292282 AI1048520=40406 AJ1048520=26 AK1048520=1 AL1048520=1 AM1048520=0 AN1048520=0 AO1048520=0 AP1048520=0 AQ1048520=200 AR1048520=168 AS1048520=144.14068130495301 AT1048520=120.11723442079401 AU1048520=0 AV1048520=515 AW1048520=79.272727272727295 AX1048520=179.87445116474399 AY1048520=32354.8181818181 AZ1048520=2 BA1048520=2 BB1048520=0 BC1048520=2 BD1048520=10 BE1048520=0 BF1048520=0 BG1048520=0 BH1048520=0.83333333333333304 BI1048520=79.272727272727195 BJ1048520=85.8333333333333 BK1048520=71.400000000000006 BL1048520=0 BM1048520=0 BN1048520=0 BO1048520=0 BP1048520=0 BQ1048520=0 BR1048520=0 BS1048520=46 BT1048520=0 BU1048520=32 BV1048520=64240 BW1048520=502 BX1048520=1 BY1048520=32 BZ1048520=0 CA1048520=0 CB1048520=0 CC1048520=0 CD1048520=0 CE1048520=0 CF1048520=0 CG1048520=0 CH1048520=41626 CI1048520=0 A1048521=100.64.0.2-10.16.0.9-9479-8080-6 B1048521=100.64.0.2 C1048521=9479 D1048521=10.16.0.9 E1048521=8080 F1048521=6 G1048521=45266.835683206016 H1048521=111205 I1048521=6 J1048521=5 K1048521=537 L1048521=357 Performance Results Woo-hoo! I can read the records. The winner is "sjxlsx." It has been proven that this library consumes less than 1 GB of memory, compared to the higher usage by Apache POI. It is an excellent open-source Java API for reading large Excel datasets. Additional Features In addition to that, it supports writing Excel and offers rich data formatting in sheets. 
Java public static void main(String[] args) throws Exception { // WRITE - we take WRITE as a special kind of MODIFY SimpleXLSXWorkbook workbook = newWorkbook(); OutputStream output = ExcelOutput("write"); WriteExcel(workbook, output); output.close(); } private static SimpleXLSXWorkbook newWorkbook() { return new SimpleXLSXWorkbook(new File("/sample.xlsx")); } private static OutputStream ExcelOutput(String suffix) throws Exception { return new BufferedOutputStream(new FileOutputStream("/sample_" + suffix + ".xlsx")); } public static void WriteExcel(SimpleXLSXWorkbook workbook, OutputStream outputStream) throws Exception { com.incesoft.tools.excel.xlsx.Sheet sheet = workbook.getSheet(0); WriteRecords(workbook, sheet, 0); workbook.commit(outputStream); } static public void WriteRecords(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet, int rowOffset) { int columnCount = 10; int rowCount = 10; int offset = rowOffset; for (int r = offset; r < offset + rowCount; r++) { int modfiedRowLength = sheet.getModfiedRowLength(); for (int c = 0; c < columnCount; c++) { sheet.modify(modfiedRowLength, c, r + "," + c, null); } } } Writing a Richly Styled Row Java public static void WriteRichStyleRow(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet) throws Exception { Font font = wb.createFont(); font.setColor("FFFF0000"); Fill fill = wb.createFill(); fill.setFgColor("FF00FF00"); CellStyle style = wb.createStyle(font, fill); RichText richText = wb.createRichText(); richText.setText("test_text"); Font font2 = wb.createFont(); font2.setColor("FFFF0000"); richText.applyFont(font2, 1, 2); sheet.modify(0, 0, (String) null, style); sheet.modify(1, 0, richText, null); } Summary Ultimately, "sjxlsx" provides an efficient, lightweight way to read large Excel files without infrastructure headaches. Thank you!

By Mahendran Chinnaiah
Building Realistic Test Data in Java: A Hands-On Guide for Developers
Building Realistic Test Data in Java: A Hands-On Guide for Developers

There’s something that every backend or API developer faces sooner or later: the need for good fake data. Whether you’re testing a new API, populating a database for demos, or simply trying to make your unit tests less “boring”, fake data is part of your daily routine. The problem? Most fake data feels… fake. You end up with “John Doe” and “123 Main Street” repeated over and over, which doesn’t look great when showing a prototype to your team or client. So today, let’s fix that. In this article, we’ll explore two powerful Java libraries that make generating fake yet realistic data a breeze: DataFaker and EasyRandom. We’ll go beyond just generating names and emails — we’ll learn how to integrate both libraries inside a Spring Boot 3 project, how to combine their strengths, and how to make everything available through a REST API that returns test data. This isn’t a theoretical overview. We’ll look at real code, and you’ll walk away knowing exactly how to reproduce it in your next project. Why Bother Generating Fake Data? Let’s face it: manually crafting test data is time-consuming and error-prone. Imagine you’re developing a system for managing users. You need to test pagination, filtering, sorting, and edge cases (like missing emails or very long names). Instead of hand-writing 100 lines of sample JSON, wouldn’t it be nicer to generate it automatically and instantly? Good fake data helps you: Validate your logic in a more realistic scenarioShowcase prototypes with data that “looks real”Stress test APIs or UI components with variable inputsAutomate unit tests without boilerplate “mock builders” So instead of hardcoding “Alice” and “Bob,” we’ll let DataFaker and EasyRandom do the heavy lifting. DataFaker: The Modern, Improved JavaFaker If you’ve used JavaFaker in the past, DataFaker is its modern, actively maintained successor. It’s built for recent Java versions (Java 17+), is fast, and offers hundreds of data categories — including names, addresses, finance, company information, internet data, crypto keys, and even Star Wars characters if you feel nostalgic. Let’s see a quick example: Java import net.datafaker.Faker; Faker faker = new Faker(); System.out.println(faker.name().fullName()); System.out.println(faker.internet().emailAddress()); System.out.println(faker.address().fullAddress()); Run that, and you’ll get something like: Plain Text Matilde Marques [email protected] Rua do Carmo 45, 1200-093 Lisboa Pretty cool, right? And it even looks localized if you change the locale. Java Faker faker = new Faker(new Locale("pt")); Now your data fits your language and region — an enjoyable touch for international testing. EasyRandom: Because We Need More Than Fields While DataFaker focuses on realistic field-level data, EasyRandom (formerly Random Beans) takes a different approach. It’s great when you have complex Java objects — like entities or DTOs — and you want them automatically filled with random but valid values. Think of EasyRandom as a smart “object generator” that knows how to populate your classes, including nested objects, lists, and maps. Example: Java import org.jeasy.random.EasyRandom; EasyRandom easyRandom = new EasyRandom(); Person randomPerson = easyRandom.nextObject(Person.class); This will create a fully populated Person instance, with random strings, numbers, and even nested attributes. So, where DataFaker gives realism (e.g., “John Smith, [email protected]”), EasyRandom gives structure and automation (e.g., filling an entire POJO graph). And the best part? 
You can combine both — letting EasyRandom create your object and then using DataFaker to polish specific fields with more believable data. Combining DataFaker and EasyRandom: The Sweet Spot Here’s where things get fun. We’ll create a small Spring Boot REST API that exposes endpoints to generate fake users. Each user will have an id, fullName, email, phone, and address. We’ll use DataFaker for realism and EasyRandom for automation. Our project structure looks like this: Plain Text src/ ├─ main/java/com/example/fakedata/ │ ├─ Application.java │ ├─ config/ │ ├─ api/ │ ├─ controller/ │ ├─ domain/ │ ├─ dto/ │ ├─ service/ │ └─ mapper/ └─ resources/ └─ static/index.html The User Domain Class We’ll keep it simple, using Lombok to avoid boilerplate: Java @Data @Builder public class User { private String id; private String fullName; private String email; private String phone; private String address; } And for the API responses, we’ll use a Java record for immutability and readability: Java public record UserDto(String id, String fullName, String email, String phone, String address) { } The Service: Combining Both Libraries Here’s the core of our project: Java @Service public class DataGenService { private final Faker faker = new Faker(Locale.ENGLISH); private final EasyRandom easyRandom; public DataGenService() { EasyRandomParameters params = new EasyRandomParameters() .seed(System.currentTimeMillis()) .stringLengthRange(5, 20); this.easyRandom = new EasyRandom(params); } public User randomUserViaDatafaker() { return User.builder() .id(UUID.randomUUID().toString()) .fullName(faker.name().fullName()) .email(faker.internet().emailAddress()) .phone(faker.phoneNumber().cellPhone()) .address(faker.address().fullAddress()) .build(); } public User randomUserViaEasyRandom() { User u = easyRandom.nextObject(User.class); if (u.getId() == null || u.getId().isBlank()) { u.setId(UUID.randomUUID().toString()); } u.setFullName(faker.name().fullName()); u.setEmail(faker.internet().emailAddress()); return u; } public List<User> manyUsers(int count, boolean easyRandomMode) { return IntStream.range(0, count) .mapToObj(i -> easyRandomMode ? randomUserViaEasyRandom() : randomUserViaDatafaker()) .collect(Collectors.toList()); } } You can see how we use DataFaker for realism and EasyRandom for structure — like a two-chef recipe: one creates the base, the other adds seasoning. The REST Controller Now, let’s make it accessible through a REST API. Java @RestController @RequestMapping("/api/users") public class UserController { private final DataGenService service; public UserController(DataGenService service) { this.service = service; } @GetMapping("/{count}") public ApiResponse<List<UserDto>> generateUsers(@PathVariable int count, @RequestParam(defaultValue = "false") boolean easy) { List<UserDto> users = service.manyUsers(count, easy) .stream().map(UserMapper::toDto) .collect(Collectors.toList()); return ApiResponse.of(users); } } And to make our API responses consistent, we wrap everything in an envelope with a timestamp: Java public record ApiResponse<T>(T data, Instant timestamp) { public static <T> ApiResponse<T> of(T data) { return new ApiResponse<>(data, Instant.now()); } } That way, every API call returns data like this: JSON { "data": [ { "id": "e7b1c37a-8b20-43c1-8ff3-b4aef8d89c3a", "fullName": "Lina Cordeiro", "email": "[email protected]", "phone": "+351 912 345 678", "address": "Rua do Comércio 12, Porto" } ], "timestamp": "2025-10-06T13:02:45.321Z" } Much cleaner and easier to debug. Why Timestamp in Responses? 
Adding timestamps isn’t just for looks. It’s a simple, useful practice that improves observability. When debugging requests in distributed systems or when clients log responses, having the server timestamp right in the payload helps you correlate events — it’s a micro detail with macro benefits. Why Both Libraries Are Better Together You might wonder: “Why not just use DataFaker alone?” Good question. DataFaker is unbeatable for producing realistic values, but it doesn’t automatically populate deep object structures.EasyRandom, on the other hand, is great for object graphs, but its randomness feels too synthetic — you’ll end up with “[email protected].” Together, they give you: Realism + AutomationEase of integration with tests and APIsConsistency through configuration and seeds It’s a bit like combining a random word generator with a translator — one provides variation, the other makes sense of it. Going Further: Postman, Docker, and CI/CD Our complete project also includes: A Postman collection for quick testingA Dockerfile and docker-compose.yml for containerizationGitHub Actions CI and Dependabot setup for automated builds and dependency updates That makes this small demo a production-grade reference project for testing and learning. If you’re mentoring junior developers or building internal utilities, this is a great example to show clean architecture and reproducible data generation. Repo: github.com/wallaceespindola/fake-data-springboot Practical Ideas for Using This Setup Load testing: Generate thousands of fake users to populate a database.UI prototyping: Feed your frontend with realistic API data.Demo environments: Seed a sandbox with dynamic sample users.Unit tests: Replace new User("a","b") with a call to DataGenService.randomUserViaDatafaker().Data anonymization: Quickly replace sensitive production data with fake equivalents. Each of these is a real-world scenario where this combination shines. Closing Thoughts The difference between a “meh” test dataset and a “wow, this looks real!” demo often comes down to how you generate data. With DataFaker and EasyRandom, you can automate that process elegantly — using modern Java, minimal boilerplate, and libraries that just make sense together. You’ll not only save hours when building tests or mock APIs but also deliver demos that feel alive, diverse, and realistic. The best part? It’s all open-source, lightweight, and easy to integrate with Spring Boot, Quarkus, Micronaut, or even a plain Java console app. So next time you need to populate an API or test your system’s resilience, don’t settle for "John Doe" anymore. Give your fake data some personality — and let Java do the heavy lifting. Need more tech insights? Check out my GitHub repo and LinkedIn page. Happy coding!

By Wallace Espindola
AI-Assisted Kubernetes Diagnostics: A Practical Implementation
AI-Assisted Kubernetes Diagnostics: A Practical Implementation

Kubernetes troubleshooting follows a repetitive pattern: identify unhealthy pods, examine descriptions, review logs, analyze events, and correlate information to find root causes. For common issues like CrashLoopBackOff, ImagePullBackOff, or OOMKilled pods, engineers repeat the same diagnostic steps daily, sometimes dozens of times per week in busy production environments. The traditional workflow requires running multiple kubectl commands in sequence, mentally correlating outputs from pod descriptions, container logs, event streams, and resource configurations. An engineer investigating a single failing pod might execute 5–10 commands, read through hundreds of lines of output, and spend 10-30 minutes connecting the dots between symptoms and root causes. For straightforward issues like memory limits or missing images, this time investment yields solutions that follow predictable patterns. Large language models can process this same information — pod descriptions, logs, events — and apply pattern recognition trained on thousands of similar scenarios. Instead of an engineer manually correlating data points, an LLM can analyze the complete context at once and suggest likely root causes with specific remediation steps. This article walks through a proof-of-concept tool available at [opscart/k8s-ai-diagnostics](https://github.com/opscart/k8s-ai-diagnostics). The tool detects unhealthy pods in a namespace, analyzes them using OpenAI GPT-4, and provides diagnostics with suggested remediation steps. For certain failure types like CrashLoopBackOff or OOMKilled, it applies fixes automatically with human approval. The implementation stays minimal — just Python, kubectl, and the OpenAI API — making it easy to deploy and test in existing Kubernetes environments. The Problem Space Manual Diagnostic Overhead When a pod fails in Kubernetes, the diagnostic process typically looks like this: Shell # Check pod status kubectl get pods -n production # Examine pod details kubectl describe pod failing-pod -n production # Review container logs kubectl logs failing-pod -n production # Check previous container logs if crashed kubectl logs failing-pod -n production --previous # Examine events kubectl get events -n production --field-selector involvedObject.name=failing-pod For experienced engineers, this workflow becomes muscle memory. However, it still requires: Context switching between multiple kubectl commandsMental correlation of information across different outputsKnowledge of common failure patterns and their solutionsTime to write and apply remediation patches Common Failure Patterns Kubernetes pods fail in predictable ways: ImagePullBackOff: Wrong image name, missing credentials, or registry connectivity issuesCrashLoopBackOff: Application startup failures, missing dependencies, or configuration errorsOOMKilled: Container memory usage exceeds defined limitsProbe Failures: Readiness or liveness probes fail due to application issues or misconfigurations Each pattern has typical root causes and standard remediation approaches. This repetitive nature makes automation worth exploring. 
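To illustrate how mechanical the detection step is, here is a minimal sketch (not the project's actual code) that lists unhealthy pods in a namespace by shelling out to kubectl; the function name and selection logic are illustrative.
Python
import json
import subprocess

def find_unhealthy_pods(namespace):
    """Return names of pods whose phase or container state looks unhealthy."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    unhealthy = []
    for pod in json.loads(result.stdout)["items"]:
        phase = pod["status"].get("phase", "Unknown")
        waiting_reasons = [
            cs["state"]["waiting"].get("reason", "")
            for cs in pod["status"].get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        # CrashLoopBackOff and ImagePullBackOff surface as waiting reasons
        # even while the pod phase is still reported as Running or Pending.
        if phase not in ("Running", "Succeeded") or waiting_reasons:
            unhealthy.append(pod["metadata"]["name"])
    return unhealthy

if __name__ == "__main__":
    print(find_unhealthy_pods("ai-apps"))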
The Solution: LLM-Powered Diagnostics The k8s-ai-diagnostics project implements an agent that: Scans a namespace for unhealthy podsCollects pod descriptions and logs via kubectlSends context to OpenAI GPT-4 for analysisReceives structured diagnostics, including root cause, reasons, and fixesOptionally applies remediation with human approval Architecture The tool uses a simple pipeline: Shell ┌──────────────────┐ │ kubectl CLI │ │ (pod status, │ │ descriptions, │ │ logs) │ └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ Python Script │ │ - Detect pods │ │ - Collect data │ │ - Build context │ └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ OpenAI GPT-4 │ │ - Analyze data │ │ - Root cause │ │ - Suggest fixes │ └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ Remediation │ │ - Human approve │ │ - Apply patches │ │ - kubectl cmds │ └──────────────────┘ The implementation keeps dependencies minimal: Python 3.8+, kubectl, and the OpenAI API. Installation and Setup Prerequisites Shell # Python 3.8 or higher python3 --version # kubectl configured with cluster access kubectl cluster-info # OpenAI API key export OPENAI_API_KEY="your-api-key" Installation Shell # Clone repository git clone https://github.com/opscart/k8s-ai-diagnostics.git cd k8s-ai-diagnostics # Create virtual environment python3 -m venv venv source venv/bin/activate # Install dependencies pip install -r requirements.txt Deploy Test Scenarios Set up local env The repository includes test deployments that simulate common failures: Shell # Create namespace kubectl create namespace ai-apps # Deploy test scenarios sh k8s-manifests/deploy.sh This deploys four intentionally broken pods: broken-nginx: ImagePullBackOff (invalid image name)crashy: CrashLoopBackOff (container exits immediately)oom-test: OOMKilled (exceeds memory limits)unhealthy-probe: Probe failures (missing expected files) Verify deployment: Shell kubectl get pods -n ai-apps Expected output: Shell NAME READY STATUS RESTARTS AGE broken-nginx-5f6cdfb774-m7kw7 0/1 ImagePullBackOff 0 2m crashy-77747bbb47-mr75j 0/1 CrashLoopBackOff 6 2m oom-test-5fd8f6b8d9-c9p52 0/1 OOMKilled 3 2m unhealthy-probe-78d9b76567-5x8h6 0/1 Running 1 2m Running the Diagnostic Agent Execute the agent: Python python3 k8s_ai_agent.py The script prompts for a namespace: Python Enter the namespace to scan: ai-apps Example Diagnostic Session Python Found 4 unhealthy pod(s): ['broken-nginx', 'oom-test', 'crashy', 'unhealthy-probe'] Analyzing pod: crashy... k8s_ai_agent.py execution Plain Text ROOT CAUSE ANALYSIS: Container is exiting immediately with code 1. The application fails to start due to a missing dependency or configuration error. DIAGNOSTIC DETAILS: - Exit Code: 1 (general application error) - Container restart count: 6 - Last termination reason: Error - No application logs available (exits too quickly) SUGGESTED REMEDIATION: 1. Review container entrypoint and command 2. Check for missing environment variables 3. Verify required config files are mounted 4. Add verbose logging to startup process IMMEDIATE ACTION: Restart the pod to clear transient state issues. The agent then prompts: Plain Text Detected CrashLoopBackOff. Suggest restarting the pod. Do you want to apply the above remediation? (yes/no): yes Upon approval, it executes: Shell kubectl delete pod crashy-77747bbb47-mr75j -n ai-apps Result: Plain Text Deployment crashy is now healthy. OOMKilled Remediation Example For memory issues: Plain Text Analyzing pod: oom-test... 
ROOT CAUSE ANALYSIS: Container terminated due to Out-of-Memory (OOM) condition. Memory consumption exceeded the defined limit of 128Mi. SUGGESTED REMEDIATION: Increase memory limit to accommodate actual usage patterns. Based on termination history and typical application requirements, recommend 400Mi limit. PATCH TO APPLY: spec: template: spec: containers: - name: oom-test resources: limits: memory: "400Mi" requests: memory: "200Mi" Detected OOMKilled. Suggest increasing memory limits. Do you want to apply the above remediation? (yes/no): yes The agent patches the deployment: Shell kubectl patch deployment oom-test -n ai-apps --type='json' \ -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"400Mi"}]' ImagePullBackOff Handling For image issues, the agent identifies the problem but skips automatic remediation: Plain Text Analyzing pod: broken-nginx... ROOT CAUSE ANALYSIS: Failed to pull container image "nginx:nonexistent-tag". The specified image tag does not exist in the registry. DIAGNOSTIC DETAILS: - Image: nginx:nonexistent-tag - Error: manifest for nginx:nonexistent-tag not found - Registry: docker.io SUGGESTED REMEDIATION: 1. Verify image tag exists: docker pull nginx:nonexistent-tag 2. Check image name spelling and tag format 3. Ensure registry credentials are configured if using private registry 4. Update deployment with correct image reference ImagePullBackOff detected — likely an image issue. Skipping remediation. This requires manual intervention because the correct image name needs human judgment. How GPT-4 Analysis Works Context Building The agent collects context before sending to GPT-4: Python def collect_pod_context(namespace, pod_name): context = { 'pod_description': run_kubectl(['describe', 'pod', pod_name, '-n', namespace]), 'pod_logs': run_kubectl(['logs', pod_name, '-n', namespace, '--tail=100']), 'previous_logs': run_kubectl(['logs', pod_name, '-n', namespace, '--previous', '--tail=50']), 'pod_events': run_kubectl(['get', 'events', '-n', namespace, '--field-selector', f'involvedObject.name={pod_name}']) } return context Prompt Construction The system prompt guides GPT-4 to provide structured responses: Python system_prompt = """ You are a Kubernetes expert analyzing pod failures. Provide: 1. ROOT CAUSE ANALYSIS: Clear identification of the primary issue 2. DIAGNOSTIC DETAILS: Supporting evidence from events and logs 3. SUGGESTED REMEDIATION: Specific fixes with commands or YAML patches 4. IMMEDIATE ACTION: What to do right now Focus on actionable advice. For resource issues, suggest specific limits. For configuration problems, identify missing or incorrect settings. """ user_prompt = f""" Analyze this Kubernetes pod failure: POD NAME: {pod_name} NAMESPACE: {namespace} STATUS: {pod_status} DESCRIPTION: {pod_description} LOGS: {logs} EVENTS: {events} Provide detailed diagnosis and remediation steps. 
""" GPT-4 Response Parsing The agent extracts structured information from GPT-4's response: Python def parse_diagnosis(response): diagnosis = { 'root_cause': extract_section(response, 'ROOT CAUSE'), 'details': extract_section(response, 'DIAGNOSTIC DETAILS'), 'remediation': extract_section(response, 'SUGGESTED REMEDIATION'), 'immediate_action': extract_section(response, 'IMMEDIATE ACTION') } return diagnosis The tool implements different remediation approaches based on failure type: IssueDiagnosisAutomated ActionRationaleImagePullBackOffImage issueNone (manual)Requires human judgment on correct imageCrashLoopBackOffContainer crashPod restartClears transient state issuesOOMKilledMemory overusePatch memory limitsPrevents future OOM killsProbe failureMisconfigurationNone (manual)Needs application-level fixes Restart Remediation For CrashLoopBackOff: Python def restart_pod(namespace, pod_name): """Delete pod to trigger recreation by deployment""" run_kubectl(['delete', 'pod', pod_name, '-n', namespace]) # Wait for new pod to be ready wait_for_pod_ready(namespace, deployment_name) Memory Patch Remediation For OOMKilled: Python def patch_memory_limit(namespace, deployment_name, new_limit='400Mi'): """Patch deployment to increase memory limit""" patch = { 'spec': { 'template': { 'spec': { 'containers': [{ 'name': get_container_name(namespace, deployment_name), 'resources': { 'limits': {'memory': new_limit}, 'requests': {'memory': str(int(new_limit[:-2]) // 2) + 'Mi'} } }] } } } } run_kubectl(['patch', 'deployment', deployment_name, '-n', namespace, '--type', 'strategic', '-p', json.dumps(patch)]) Important note: The current implementation uses a fixed memory value (400Mi) as a remediation. This is a limitation — the actual memory size should be determined from historical usage patterns captured through metrics systems like Prometheus. Without access to actual resource consumption data, the tool cannot make informed decisions about appropriate memory limits. This is why Prometheus integration (discussed in the Future Work section) is necessary for production use. Human Approval Gate All remediation actions require explicit approval: Python def apply_remediation(diagnosis, namespace, resource): print(f"\n{diagnosis['immediate_action']}") print(f"\nSUGGESTED FIX:\n{diagnosis['remediation']}") response = input("\nDo you want to apply the above remediation? (yes/no): ") if response.lower() == 'yes': execute_remediation(namespace, resource, diagnosis) else: print("Skipping remediation. Manual intervention required.") Real-World Example: Complete Workflow Here's a complete diagnostic session for an OOMKilled pod: Python $ python3 k8s_ai_agent.py Enter the namespace to scan: production Found 1 unhealthy pod(s): ['api-server-7d5f8b-xk2p9'] Analyzing pod: api-server-7d5f8b-xk2p9... === GPT-4 ANALYSIS === ROOT CAUSE ANALYSIS: The api-server container was terminated by the OOM killer. Memory usage exceeded the configured limit of 256Mi. Analysis of previous terminations shows this is a recurring pattern, indicating the application's memory footprint has grown beyond the initial resource allocation. 
DIAGNOSTIC DETAILS: - Container terminated with reason: OOMKilled - Memory limit: 256Mi - Restart count: 8 (all OOM-related) - Last termination: 2 minutes ago - Pattern: Crashes occur approximately 15-20 minutes after startup - Application logs show increasing heap usage before termination MEMORY ANALYSIS: Based on the termination pattern and logs: - Initial memory usage: ~80Mi - Pre-crash memory usage: ~250Mi (approaching limit) - Growth pattern suggests memory leak or legitimate increased load - Recommend 512Mi limit to provide headroom SUGGESTED REMEDIATION: 1. Immediate: Increase memory limit to 512Mi 2. Set request to 384Mi (75% of limit) 3. Monitor actual usage with: kubectl top pod api-server-7d5f8b-xk2p9 -n production 4. Investigate memory leak if usage continues growing 5. Consider implementing horizontal pod autoscaling YAML PATCH: spec: template: spec: containers: - name: api-server resources: limits: memory: "512Mi" requests: memory: "384Mi" IMMEDIATE ACTION: Patch deployment with increased memory limits ====================== Detected OOMKilled. Suggest increasing memory limits. REMEDIATION PLAN: - Patch deployment 'api-server' memory limit to 512Mi - Set memory request to 384Mi - Trigger rolling restart Do you want to apply the above remediation? (yes/no): yes Applying remediation... deployment.apps/api-server patched Waiting for new pod to be ready... Pod api-server-7d5f8b-n8x4m is now Running Deployment api-server is now healthy. NEXT STEPS: 1. Monitor pod memory usage: kubectl top pod -n production 2. Check application metrics for memory leak indicators 3. Review application logs for memory-related warnings 4. Consider adding memory usage alerts What Works Well Rapid Triage The tool scans an entire namespace and identifies all unhealthy pods in seconds, providing immediate visibility into cluster health. Context Analysis GPT-4 analyzes pod descriptions, logs, and events together, correlating information that might require multiple mental steps for a human operator. Knowledge Application The model applies Kubernetes best practices and common troubleshooting patterns without requiring the operator to remember specific commands or solutions. Executable Output Rather than just identifying problems, the tool provides specific kubectl commands, YAML patches, and remediation scripts ready to execute. Safe Automation The human approval gate ensures operators review recommendations before applying changes, preventing automated mistakes in production environments. Current Limitations Single LLM Provider The POC only supports OpenAI GPT-4. Adding support for Anthropic Claude, local models via Ollama, or Azure OpenAI would improve flexibility and reduce vendor lock-in. Simple Remediation Logic Current automated fixes are limited: Pod restarts for CrashLoopBackOffMemory limit patches for OOMKilledNo automated fixes for ImagePullBackOff or probe failures More work would require: Image name validation and correctionProbe configuration analysis and fixesNetwork policy adjustmentsRBAC issue resolution Single-Container Assumption The memory patching logic assumes deployments have a single container. Multi-container pods require more analysis to determine which container needs resource adjustments. No Historical Context The agent analyzes each pod independently without considering: Previous diagnostic sessionsRemediation success/failure patternsCluster-wide trendsRelated failures in other namespaces Limited Observability Integration The tool relies solely on kubectl output. 
Integration with monitoring systems would provide:

- Historical resource usage trends
- Performance metrics before failures
- Application-specific telemetry
- Distributed tracing context

CLI-Only Interface

The current implementation is command-line interactive. Production use would benefit from:

- Web dashboard for visualization
- API endpoints for integration
- Slack/Teams notifications
- Incident management system integration

Cost Considerations

Each diagnostic session calls the OpenAI API. For large clusters with many unhealthy pods, costs can accumulate. Implementing caching, local models, or rate limiting would help manage expenses.

Security Concerns

Sending pod logs to external APIs (OpenAI) raises data security issues:

- Logs may contain sensitive information
- API keys, tokens, or credentials might leak
- Compliance requirements may prohibit external data transmission

Production deployments need:

- Log sanitization to remove sensitive data
- Local LLM options for sensitive environments
- Audit trails of what data was sent externally

Future Work

Multi-Provider LLM Support

Add support for alternative models:

Python

class LLMProvider:
    def __init__(self, provider='openai', model='gpt-4'):
        self.provider = provider
        self.model = model

    def analyze(self, context):
        if self.provider == 'openai':
            return self._openai_analyze(context)
        elif self.provider == 'anthropic':
            return self._claude_analyze(context)
        elif self.provider == 'ollama':
            return self._ollama_analyze(context)

Prometheus Integration

Incorporate time-series metrics:

Python

def enhance_context_with_metrics(namespace, pod_name):
    metrics = {
        'cpu_usage': query_prometheus(
            f'rate(container_cpu_usage_seconds_total{{pod="{pod_name}"}}[5m])'
        ),
        'memory_usage': query_prometheus(
            f'container_memory_working_set_bytes{{pod="{pod_name}"}}'
        ),
        'restart_history': query_prometheus(
            f'kube_pod_container_status_restarts_total{{pod="{pod_name}"}}'
        )
    }
    return metrics

This integration would solve the current limitation where OOMKilled remediation uses fixed memory values (400Mi). With Prometheus data, the tool could analyze actual memory usage patterns over time and recommend appropriate limits based on real consumption trends rather than arbitrary values.
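As a rough sketch of what that could look like, the recommended limit could be derived from the observed peak and fed into the existing patch_memory_limit() call in place of the hard-coded 400Mi. The query_prometheus helper is the same assumed function used in the example above (here taken to return a list of numeric samples in bytes); the 24-hour window and 30% headroom factor are arbitrary choices for illustration, not part of the current repo.

Python

import math

def recommend_memory_limit(namespace, pod_name, headroom=1.3):
    # Peak working-set memory over the last 24 hours, in bytes (assumed helper).
    samples = query_prometheus(
        f'max_over_time(container_memory_working_set_bytes'
        f'{{namespace="{namespace}", pod="{pod_name}"}}[24h])'
    )
    peak_bytes = max(samples) if samples else 0
    # Pad the observed peak and round up to the nearest 64Mi so limits stay readable.
    padded_mi = math.ceil((peak_bytes * headroom) / (1024 * 1024) / 64) * 64
    return f"{max(padded_mi, 64)}Mi"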
Feedback Loop Track remediation success to improve future diagnostics: Python class RemediationTracker: def record_outcome(self, pod_name, diagnosis, action, success): """Track which fixes worked""" outcome = { 'pod': pod_name, 'issue_type': diagnosis['type'], 'action_taken': action, 'successful': success, 'timestamp': datetime.now() } self.store_outcome(outcome) def get_success_rate(self, issue_type): """Calculate success rate for specific issue types""" outcomes = self.query_outcomes(issue_type) return sum(o['successful'] for o in outcomes) / len(outcomes) Expanded Remediation Expand automated fixes: Python class AdvancedRemediation: def fix_image_pull_error(self, namespace, pod_name, diagnosis): """Attempt to fix common image pull issues""" # Check if image exists with 'latest' tag # Verify imagePullSecrets are configured # Test registry connectivity # Suggest alternative image sources pass def fix_probe_failure(self, namespace, deployment, diagnosis): """Adjust probe configuration based on actual app behavior""" # Analyze startup time from logs # Recommend appropriate initialDelaySeconds # Suggest probe endpoint alternatives pass Web Dashboard Build a visualization layer: Python // React component for real-time diagnostics function DiagnosticsDashboard() { const [pods, setPods] = useState([]); const [analyses, setAnalyses] = useState({}); useEffect(() => { // Poll for unhealthy pods fetchUnhealthyPods().then(setPods); }, []); return ( <div> <PodList pods={pods} onAnalyze={runDiagnostics} /> <AnalysisPanel analyses={analyses} /> <RemediationQueue onApprove={applyFix} /> </div> ); } Incident Management Integration Connect to existing workflows: Python def create_incident_with_diagnosis(pod_name, diagnosis): """Create PagerDuty incident with analysis""" incident = { 'title': f'Pod Failure: {pod_name}', 'description': diagnosis['root_cause'], 'urgency': determine_urgency(diagnosis), 'body': { 'type': 'incident_body', 'details': format_diagnosis_for_incident(diagnosis) } } pagerduty_client.create_incident(incident) Getting Started Quick Start Shell # Clone and setup git clone https://github.com/opscart/k8s-ai-diagnostics.git cd k8s-ai-diagnostics python3 -m venv venv source venv/bin/activate pip install -r requirements.txt # Set OpenAI API key export OPENAI_API_KEY="your-key" # Deploy test scenarios kubectl create namespace ai-apps sh k8s-manifests/deploy.sh # Run diagnostics python3 k8s_ai_agent.py # Enter namespace: ai-apps Production Considerations Before using in production: Test in non-production environments – Verify remediation logic doesn't cause unintended consequencesImplement log sanitization – Remove sensitive data before sending to OpenAISet up monitoring – Track diagnostic success rates and API costsConfigure rate limiting – Prevent API quota exhaustionDocument approval workflows – Define who can approve which types of remediationEstablish rollback procedures – Know how to revert automated changes Conclusion The k8s-ai-diagnostics project demonstrates that LLMs can automate routine Kubernetes troubleshooting tasks. By combining kubectl's data collection capabilities with GPT-4's analytical reasoning, the tool provides diagnostic insights that previously required experienced SRE intervention. The POC shows particular strength in handling common failure patterns like CrashLoopBackOff and OOMKilled scenarios, where automated remediation can reduce MTTR. The human approval gate maintains safety while allowing operators to move quickly when confident in the recommendations. 
However, the current implementation has clear limitations. Production readiness requires addressing security concerns around data transmission, expanding remediation capabilities beyond simple cases, and integrating with existing observability and incident management infrastructure. The OOMKilled remediation, for example, currently uses fixed memory values rather than analyzing actual usage patterns — a gap that Prometheus integration would fill.

For teams experiencing high volumes of routine pod failures, this approach offers a way to reduce operational toil. The tool handles repetitive diagnostic work, letting engineers focus on complex issues that require human judgment and problem-solving. As observability integration improves and remediation logic matures, LLM-augmented troubleshooting will become more viable for production environments.

Additional Resources

- GitHub repository: opscart/k8s-ai-diagnostics
- Kubernetes troubleshooting: kubernetes.io/docs/tasks/debug
- OpenAI API documentation: platform.openai.com/docs
- kubectl reference: kubernetes.io/docs/reference/kubectl

By Shamsher Khan
Theia Deep Dive, Part 1: From Zero to Your Own IDE

Intro: Why Build Your Own IDE? There are countless reasons to create your own IDE. Perhaps you are creating a sandbox similar to CodeSandbox, but adapted to your own framework. Or an Arduino-style environment where a strictly controlled user interface is required for flashing firmware and interacting with hardware. Maybe your SaaS product needs a built-in cloud editor. Maybe you're creating an educational platform where users should only see part of the code. Or you're releasing your own language or framework, and want the developer tools not to look like an add-on plugin. Or maybe you've decided to create your own Cursor with an AI assistant. Whatever your goal, creating your own IDE is more than possible. In this guide, I'll show you how to do it with Eclipse Theia: a fully modular, open-source platform designed to adapt to your needs. Our Goals I will proceed from the idea that we want to create an online IDE, similar to CodeSandbox. It's great to have a full-fledged IDE with support for dozens of languages, a full-fledged file system, a terminal, and no need to write everything from scratch. At the same time, we don't need all the functionality that any IDE usually offers; for example, functionality such as Debug may be unnecessary. So here's what I'll do: Remove any distracting elements, buttons, and panels, leaving only the essentialsRemove unnecessary menus, buttons, and actionsAdd our own menus and actionsConfigure the default layoutConnect the necessary themes and widgetsMake unique looks and feel with Islands UI So, let's dive in! Meet Theia: The Framework Behind the Magic To bring our idea to life, we will use Eclipse Theia. This is an open-source modular framework for creating IDEs. It looks and works like VS Code and partially uses VS Code components (like Monaco Editor). But it is not a fork of VS Code. It supports VS Code extensions, but is not tied to VS Code either architecturally or visually. Theia was conceived as a universal framework for custom IDEs. It gives you a foundation and a set of separate modules: from a code editor to a terminal, from toolbars to tabs. Everything can be disabled, replaced, or rewritten. And you can make your own modules with their internal widget system, which allows you to write your own components in React (or whatever framework you want, actually). You can build a specialized IDE, such as Arduino IDE or SAP App Studio. Or a minimalistic editor with an AI agent inside, like Cursor. Or something completely non-standard, such as an editor where you can't write code, but you can build diagrams. It all depends on your requirements. It works both in a browser and as a desktop application (via Electron), but I will focus on the web version. The Electron version works similarly in the aspects that will be discussed in this article. Why Not Use VS Code Web? Although VS Code is open source, it is not a modular framework. It is a monolithic product with a plugin system, rather than a set of components and modules that can be assembled into the specific tool you need. You can add extensions, but you cannot remove basic parts such as panels, commands, tabs, and Shell behavior without hacks and serious code rewriting, which can make it difficult to update the VS Code core in the future. And even though VS Code is an open-source product, Microsoft is not interested in you using it to create competitors. So forget about documentation or official support. 
Also, according to my tests, VS Code consumes about 1.5 times more memory, which can be critical if you run it inside virtual machine instances somewhere in the cloud. You can even run a "browser-only" version of Theia with some kind of virtual file system; in that case, you do not need backend resources at all. Installing Theia and Preparing to Make Changes The first step is to deploy the IDE code itself. We have two options: Clone the full repository. The repository contains a complete build with all plugins. This is the same version you can download and install on your PC from the main page: https://theia-ide.org/, but it's heavy and contains a lot of unnecessary components. My goal is to create a lightweight Cloud IDE build, so this method won't work for me, as it would require more effort to remove unnecessary parts.Use the extension generator. The idea of this generator is to provide a CLI for creating and debugging custom Theia plugins, but when this CLI creates a plugin, it deploys a minimal Theia build that contains nothing except a file tree and code editor. Let's start with this one. To install, run: TypeScript npm install -g yo generator-theia-extension This will install the necessary utilities for generation. To create a sample Theia project (optionally with custom Theia extensions), including a browser and electron app, run: TypeScript mkdir ~/theia && cd ~/theia yo theia-extension Select “Hello World” and specify the name of your extension, for example custom-ui. This will create a minimal possible IDE configuration for browser and electron, an example plugin, and will immediately install the dependencies. Prepare for Customization Before we start customizing our IDE, we'll make a few changes to the build and launch process. First, we'll replace the outdated lerna with turborepo and remove scripts related to the electron version since we won't need it: TypeScript npm remove lerna npm install turbo -D Turborepo is necessary in this project because it contains multiple packages and custom plugins that we'll be writing, each of which needs to be built separately. With turborepo, we can build all our dependencies with a single command. I'll also delete the electron-app folder since I don't need it and remove all electron version-related scripts so they don't get in the way, and I'll change the port to :4000 to avoid conflicts with other applications. 
Open /package.json and replace lerna with turborepo and remove unnecessary scripts: ~/theia/package.json TypeScript { "private": true, "engines": { "node": ">=20", "npm": ">=10" }, "scripts": { "build": "turbo run build", "start": "turbo start", "watch": "turbo run watch", "postinstall": "theia check:theia-version", "clean": "turbo run clean && rimraf node_modules", "download:plugins": "theia download:plugins --rate-limit=15 --parallel=false", }, "workspaces": [ "custom-ui", "browser-app" ], "devDependencies": { "turbo": "^2.5.4", "typescript": "^5.8.3", "rimraf": "^6.0.1" }, "packageManager": "[email protected]", "theiaPluginsDir": "plugins", "theiaPlugins": {} } ~/theia/browser-app/package.json TypeScript { "private": true, "name": "browser-app", "version": "0.0.0", "dependencies": { "@theia/core": "1.62.2", "@theia/editor": "1.62.2", "@theia/filesystem": "1.62.2", "@theia/markers": "1.62.2", "@theia/messages": "1.62.2", "@theia/monaco": "1.62.2", "@theia/navigator": "1.62.2", "@theia/preferences": "1.62.2", "@theia/process": "1.62.2", "@theia/terminal": "1.62.2", "@theia/workspace": "1.62.2", "custom-ui": "0.0.0" }, "devDependencies": { "@theia/cli": "1.62.2" }, "scripts": { "build": "npm run rebuild && theia build --mode production", "rebuild": "theia rebuild:browser --cacheRoot ..", "start": "theia start -p 4000 --plugins=local-dir:../plugins", "watch": "npm run rebuild && theia build --watch --mode development", "clean": "theia clean && rimraf node_modules" }, "theia": { "target": "browser" }, "packageManager": "[email protected]" } Now, create a file turbo.json. ~/theia/turbo.json TypeScript { "$schema": "<https://turbo.build/schema.json>", "tasks": { "clean": { "outputs": [], "dependsOn": ["^clean"] }, "build": { "dependsOn": ["^build"], "outputs": ["lib/**"] }, "start": { "dependsOn": ["^build"], "outputs": [] }, "watch": { "cache": false, "dependsOn": [], "outputs": ["lib/**"] } } } I also recommend using vite as the bundler for our custom-ui plugin, as this will solve several problems down the line and allow us to easily add necessary plugins to our plugin, such as Tailwind or preprocessor support if needed. 
First, let's specify paths in tsconfig and update target and lib to more modern versions: ~/theia/custom-ui/tsconfig.json TypeScript { "compilerOptions": { "skipLibCheck": true, "declaration": true, "declarationMap": true, "noImplicitAny": true, "noEmitOnError": false, "noImplicitThis": true, "noUnusedLocals": true, "strictNullChecks": true, "experimentalDecorators": true, "emitDecoratorMetadata": true, "resolveJsonModule": true, "module": "commonjs", "moduleResolution": "node", "target": "ES2022", "jsx": "react", "lib": ["ES2022", "dom"], "sourceMap": true, "rootDir": ".", "outDir": "lib", "paths": { "@/*": ["./src/*"] } }, "include": ["src"], "exclude": ["node_modules", "lib"] } Create vite.config.mjs: ~/theia/custom-ui/vite.config.mjs TypeScript import react from '@vitejs/plugin-react'; import { resolve } from 'path'; import { defineConfig } from 'vite'; import { libInjectCss } from 'vite-plugin-lib-inject-css'; export default defineConfig({ clearScreen: false, plugins: [ react(), libInjectCss(), ], build: { lib: { entry: [ resolve(__dirname, 'src/frontend/index.ts'), ], formats: ['cjs'], fileName: (format, entryName) => `${ entryName }.js`, }, outDir: 'lib', emptyOutDir: true, cssCodeSplit: true, sourcemap: true, target: 'ES2022', rollupOptions: { external: (id) => { // Aliases is not external if (id.startsWith('@/')) { return false; } return !id.startsWith('.') && !id.startsWith('/') && !id.includes('\\0'); }, output: { preserveModules: true, preserveModulesRoot: 'src', entryFileNames: '[name].js', chunkFileNames: '[name].js', format: 'cjs', exports: 'named', }, }, }, resolve: { alias: [{ find: /^@\\/(.*)/, replacement: `${ resolve(__dirname, 'src') }/$1`, }], // preserveSymlinks: true, }, }); Let's specify the scripts and entry points in package.json. ~/theia/custom-ui/package.json TypeScript { ... "scripts": { "clean": "rm -rf lib node_modules", "build": "vite build && npm run dts", "watch": "vite build --watch --clearScreen false", "dev": "npm run watch", "dts": "tsc --emitDeclarationOnly --declaration --outDir lib" }, "theiaExtensions": [{ "frontend": "lib/frontend/index.js" }] } Let's create the main file for our plugin, for now, without any logic. ~/theia/custom-ui/src/frontend/index.ts TypeScript import { ContainerModule } from '@theia/core/shared/inversify'; export default new ContainerModule((bind, unbind, isBound, rebind) => { // TODO Implement any logic here }); Install Vite: TypeScript cd ~/**theia/custom-ui npm i -D vite @vitejs/plugin-react vite-plugin-lib-inject-css Check that everything runs smoothly: TypeScript cd ~/theia npm run build npm run start Open http://127.0.0.1:4000 and check your working IDE: Adding Functionality Currently, our IDE has literally nothing except a file tree and an editor without syntax highlighting support, so let's add the missing plugins. VS Code Plugins VS Code plugins (pre-installed by you or installed by users from the marketplace). These are all the plugins from the VS Code application store, as well as some internal plugins, such as syntax highlighting support for different languages. 
To install such a plugin, you can use the Extensions section or declare it in package.json → theiaPlugins Let's start by installing basic VS Code plugins for syntax highlighting, as well as the material icons theme: ./package.json TypeScript "theiaPlugins": { "eclipse-theia.builtin-extension-pack": "<https://open-vsx.org/api/eclipse-theia/builtin-extension-pack/1.95.3/file/eclipse-theia.builtin-extension-pack-1.95.3.vsix>", "zhuangtongfa.material-theme": "<https://open-vsx.org/api/zhuangtongfa/material-theme/3.19.0/file/zhuangtongfa.material-theme-3.19.0.vsix>", "PKief.material-icon-theme": "<https://open-vsx.org/api/PKief/material-icon-theme/5.23.0/file/PKief.material-icon-theme-5.23.0.vsix>" } eclipse-theia.builtin-extension-pack includes all plugins contained in the standard VS Code build, but you can install only the ones you need: TypeScript "theiaPlugins": { "vscode.javascript": "<https://open-vsx.org/api/vscode/javascript/1.95.3/file/vscode.javascript-1.95.3.vsix>", "vscode.typescript": "<https://open-vsx.org/api/vscode/typescript/1.95.3/file/vscode.typescript-1.95.3.vsix>", "vscode.typescript-language-features": "<https://open-vsx.org/api/vscode/typescript-language-features/1.95.3/file/vscode.typescript-language-features-1.95.3.vsix>", "vscode.json": "<https://open-vsx.org/api/vscode/json/1.95.3/file/vscode.json-1.95.3.vsix>", "vscode.css": "<https://open-vsx.org/api/vscode/css/1.95.3/file/vscode.css-1.95.3.vsix>", "vscode.html": "<https://open-vsx.org/api/vscode/html/1.95.3/file/vscode.html-1.95.3.vsix>", "vscode.markdown": "<https://open-vsx.org/api/vscode/markdown/1.95.3/file/vscode.markdown-1.95.3.vsix>", "zhuangtongfa.material-theme": "<https://open-vsx.org/api/zhuangtongfa/material-theme/3.19.0/file/zhuangtongfa.material-theme-3.19.0.vsix>", "PKief.material-icon-theme": "<https://open-vsx.org/api/PKief/material-icon-theme/5.23.0/file/PKief.material-icon-theme-5.23.0.vsix>" } zhuangtongfa.material-theme — one dark theme, one of the most popular themes for VS CodePKief.material-icon-theme — material style icons in the file tree and tabs Then execute: TypeScript npm run download:plugins This will download all these plugins to the /plugins folder. Core Plugins These are plugins that extend Theia's functionality. These plugins add core IDE functionality and can completely change behavior. To install such a plugin, simply run npm install for the desired package. The list of packages can be found in the Theia repository. I will install the most essential ones for me, which are project search and VS Code plugin support: TypeScript cd **~/**theia/browser-app npm i @theia/plugin @theia/plugin-ext @theia/plugin-ext-vscode @theia/search-in-workspace Please note that some plugins may depend on others, so when you install @theia/plugin-ext-vscode, almost all the main modules will be included in your build, even if you don't need them. Then run npm run build && npm run start to check that everything is working. Rewiring Theia: UI and Behavior Customization To understand how to customize Theia, we need to understand how its code works internally. Any customizations are done through system plugins, whose code can replace any system object or class. This is possible thanks to InversifyJS, which Theia uses under the hood. It is a simple but incredibly powerful Dependency Injection container that allows you to easily replace some dependencies with others. 
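For example, a frontend ContainerModule in custom-ui can rebind a framework service to a subclass of your own. The CustomApplicationShell below is only an illustration of the pattern (bind your class, then rebind Theia's identifier to it); nothing in this guide requires overriding ApplicationShell specifically.

TypeScript

import { ContainerModule, injectable } from '@theia/core/shared/inversify';
import { ApplicationShell } from '@theia/core/lib/browser/shell/application-shell';

// Illustrative subclass: override ApplicationShell methods here to change
// shell behavior (left empty in this sketch).
@injectable()
class CustomApplicationShell extends ApplicationShell {
}

export default new ContainerModule((bind, unbind, isBound, rebind) => {
    bind(CustomApplicationShell).toSelf().inSingletonScope();
    // Anything that asks the container for ApplicationShell now gets our subclass.
    rebind(ApplicationShell).toService(CustomApplicationShell);
});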
This way, as a developer of the custom-ui plugin, you can easily override the logic of system modules, such as @theia/terminal on the front end or @theia/file-service on the back end. Now we are ready to start customizing our build. My plan includes: Specify default user settingsCreate a splash screenRemove unnecessary interface elements, functionality, actions, and menusAdd custom commands and menu itemsChange the appearance of some interface elementsChange some behavior logic (for example, prohibit DnD elements on panels) Default IDE Settings Theia uses the same configuration format as VS Code and is configured via the settings.json file. To set the default settings, we first need to collect them. To do this, open your IDE, configure it in any convenient way (via the interface or directly via the JSON config), then press cmd + shift + p to open the Command Palette and enter “Open Settings JSON.” Copy the contents of this file and paste them into browser-app/package.json in the theia section: ./browser-app/package.json TypeScript "theia": { "target": "browser", "frontend": { "config": { "applicationName": "Flexbe IDE", "reloadOnReconnect": false, "preferences": { "editor.indentSize": "tabSize", "editor.tabSize": 4, "workbench.tree.indent": 13, "workbench.colorTheme": "One Dark Pro Night Flat", "workbench.iconTheme": "material-icon-theme" } } }, } Creating a Splash Screen This screen will be displayed during the loading process of Theia instead of the standard loader. To create it, specify the template in the settings of Theia → generator. ./browser-app/package.json TypeScript "theia": { ... "generator": { "config": { "preloadTemplate": "./resources/preload.html" } } } After that, we will correct the webpack configuration so that static files are copied to the frontend assets folder during compilation, and so that we can see the compilation progress: ./browser-app/webpack.config.js TypeScript const CopyWebpackPlugin = require('copy-webpack-plugin'); const path = require('path'); const webpack = require('webpack'); const configs = require('./gen-webpack.config.js'); const nodeConfig = require('./gen-webpack.node.config.js'); // Add copy plugin and progress plugin to the first config configs[0].plugins.push( new CopyWebpackPlugin({ patterns: [ { from: path.resolve(__dirname, './resources'), to: path.resolve(__dirname, './lib/frontend/resources'), }, ], }), new webpack.ProgressPlugin((percentage, message, ...args) => { const cleanMessage = `${ (percentage * 100).toFixed(1) }% ${ message } ${ args.join(' ') }`.trim(); console.warn(cleanMessage); }) ); module.exports = [ ...configs, nodeConfig.config, ]; Create a resources folder with a preload.html file: ./browser-app/resources/preload.html TypeScript <style> html, body { background-color: #23272e; } .theia-preload { position: absolute; top: 0; left: 0; right: 0; bottom: 0; background-image: none; } .theia-preload::after { /* remove default loading animation */ content: none; } .spinner-container { display: flex; flex-direction: center; align-self: center; justify-content: center; height: 100vh; width: 100vw; background-color: #23272e; } .custom-spinner { display: flex; flex-direction: center; align-self: center; justify-content: center; animation: fadeInOut 1.65s ease-in-out infinite; } @keyframes fadeInOut { 0% { opacity: 0.4; } 50% { opacity: 1; } 100% { opacity: 0.4; } } </style> <div class="spinner-container"> <div class="custom-spinner"> <svg width="150" height="30" viewBox="0 0 368 74" fill="none" xmlns="<http://www.w3.org/2000/svg>"> 
      <!-- Logo artwork: two <path> elements whose long coordinate data is not reproduced here; substitute your own SVG -->
      <path d="…" fill="white"/>
      <path d="…" fill="#7666F6"/>
    </svg>
  </div>
</div>
<script>
  if (document.head) {
    let link = document.createElement('link');
    link.rel = 'icon';
    link.href = '/favicon.ico';
    document.head.appendChild(link);
  }
</script>

Let's set up dependencies and check that everything works:

TypeScript

npm i copy-webpack-plugin -D
npm run build
npm run start

NOTE: I recommend running npm run start in a separate terminal. Instead of npm run build, you can use npm run watch to speed up rebuilds while making changes.

Conclusion

In this part, we’ve built the foundation of a custom IDE with Theia: added VS Code extensions for syntax and themes, pulled in core Theia plugins, wired up a custom Vite-based UI, and even introduced our own splash screen. At this stage, you already have a functional browser-based IDE that can be launched and used for real work. But this is just the groundwork. The real transformation begins when we start stripping away the unnecessary parts, reshaping the interface, and building a UX tailored to our product. In the next part, I’ll show how to go deeper — rewiring tabs, sidebars, and layouts to turn Theia into something uniquely yours.

By Maksim Kachurin
Transforming Your Node.js REST API into an AI-Ready MCP Server

The evolution of large language models (LLMs) and agentic AI requires a fundamental shift in how applications expose their capabilities. Traditional REST APIs are designed for software-to-software communication, requiring developers to read documentation and write custom integration code. The Model Context Protocol (MCP) is an open standard designed to solve this by creating a unified, machine-readable interface that AI agents can dynamically discover and interact with. This article provides a comprehensive guide on converting an existing Node.js REST API into an MCP server using the official TypeScript SDK, focusing on the architectural changes and crucial use cases that this conversion unlocks. The Paradigm Shift: From REST to MCP REST APIs are typically designed with human developers in mind, optimizing for resource management (CRUD operations) via HTTP verbs, path variables, and specific request/response formats. The MCP model, in contrast, is AI-first: AspectREST API (Traditional)MCP Server (AI-FIRST) Primary Consumer Human Developers, Client Applications AI Agents, LLMs, AI-Powered IDEs Interface HTTP verbs, Path, Query Params, Custom Body Standardized JSON-RPC messages (Tools, Resources, Prompts) Discovery Manual via OpenAPI/Swagger Documentation Dynamic via the list_tools() or list_resources() protocol Functionality Granular, atomic endpoints (GET /users/{id}) High-level, semantic actions (manage_user_profile) The conversion is not a direct port; it's an abstraction. You wrap your existing Node.js business logic with an MCP layer that translates the standardized MCP calls into the REST requests your API understands. Step 1: Set Up the Node.js MCP Environment The official Model Context Protocol TypeScript SDK is the core tool for this conversion. 1. Initialize the Project and Install Dependencies Assuming a basic Node.js (v18+) project, you'll need the MCP SDK, a utility for request validation (like Zod), and an HTTP client (like axios or node-fetch) to interact with your existing REST API. Shell npm init -y npm install @modelcontextprotocol/sdk zod node-fetch npm install -D typescript @types/node ts-node 2. Instantiate the MCP Server Create a file (e.g., mcp-server.ts) to set up the server instance and a transport layer, such as StdioServerTransport for local testing or StreamableHttpServerTransport for remote deployment. TypeScript import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js"; import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js"; import { z } from "zod"; // Instantiate the core MCP server const server = new McpServer({ name: "MyNodeAPIServer", version: "1.0.0", capabilities: { tools: {}, resources: {}, prompts: {} }, }); // The transport handles communication with the LLM client const transport = new StdioServerTransport(); async function startServer() { // [Tool and Resource registration will go here] await server.connect(transport); console.log("MCP Server is running on standard I/O..."); } startServer().catch(console.error); Step 2: Curate and Define MCP Tools This is the most critical step. Instead of blindly exposing every REST endpoint, you must curate the functionality into high-level, agent-friendly Tools and Resources. 1. Designing LLM-Friendly Tools LLMs perform better with semantic, intention-based tools rather than granular, low-level API calls. 
Bad (REST-centric): get_user_by_id, update_user_name, update_user_emailGood (MCP-centric): manage_user_profile(userId, newName, newEmail) The MCP tool handler should orchestrate the necessary multiple REST calls to fulfill the single, high-level action. 2. Implementing the Tool Handler Each tool requires a descriptive name, a thorough natural language description for the LLM, and structured input/output schemas using Zod. TypeScript // Define the schema for the tool's input arguments const UpdateUserSchema = z.object({ userId: z.string().describe("The unique ID of the user to update."), newEmail: z.string().email().optional().describe("The user's new email address."), newSubscriptionPlan: z.enum(['basic', 'premium', 'pro']).optional().describe("The new subscription plan to apply."), }); server.registerTool( "manage_subscription", { title: "Manage User Subscription and Profile", description: "Updates a user's email address and/or changes their subscription plan. Requires the user's ID.", argsSchema: UpdateUserSchema, outputSchema: z.object({ status: z.string(), updatedFields: z.array(z.string()), }), }, async (args) => { const { userId, newEmail, newSubscriptionPlan } = args; const updatedFields: string[] = []; // --- REST API CALL Orchestration --- const REST_API_BASE = process.env.REST_API_URL; if (newEmail) { // 1. Call the REST API to update email await fetch(`${REST_API_BASE}/users/${userId}/email`, { method: 'PUT', body: JSON.stringify({ email: newEmail }), headers: { 'Content-Type': 'application/json' }, }); updatedFields.push('email'); } if (newSubscriptionPlan) { // 2. Call the REST API to update subscription await fetch(`${REST_API_BASE}/billing/${userId}/plan`, { method: 'POST', body: JSON.stringify({ plan: newSubscriptionPlan }), headers: { 'Content-Type': 'application/json' }, }); updatedFields.push('subscriptionPlan'); } // Return a structured response for the LLM return { status: "Success", updatedFields: updatedFields.length > 0 ? updatedFields : ["No changes made."], }; } ); 3. Creating Resources for Context For simple GET requests that provide context (read-only data), use ResourceTemplates. These allow the LLM to understand what data is available without necessarily calling a tool. TypeScript server.registerResource( "product_catalog_item", { title: "Product Catalog Item", description: "A single item from the product catalog, including price, stock, and description.", uriTemplate: "api://my-node-api-mcp/products/{productId}", dataSchema: z.object({ id: z.string(), name: z.string(), price: z.number(), description: z.string(), }), }, async (uri) => { // Parse the productId from the URI or argument const productId = uri.split('/').pop(); // Call your REST API: GET /products/{productId} const response = await fetch(`${process.env.REST_API_URL}/products/${productId}`); return await response.json(); } ); Step 3: Implement Security and Error Handling Security is paramount when exposing capabilities to an autonomous agent. 1. Authentication Integration Your MCP server acts as a proxy. Its internal HTTP client must handle authentication for the original REST API. This often involves securely loading API keys or OAuth tokens from environment variables and including them in the Authorization headers of your fetch or axios calls within the tool handlers. 2. Robust Error Responses AI agents rely on structured output to determine the success or failure of an action. Your handler must catch HTTP errors from the REST API and convert them into clear, structured MCP error responses. 
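One hedged sketch of that translation is a small helper the tool handlers above could share; the path handling, return shape, and message wording are illustrative choices for this article's style, not part of the MCP SDK.

TypeScript

// Illustrative helper: wrap REST calls so every failure becomes structured data
// the LLM can reason about, instead of an unhandled exception.
async function getFromRestApi(path: string): Promise<{ status: string; data?: unknown; message?: string }> {
  try {
    const response = await fetch(`${process.env.REST_API_URL}${path}`);
    if (!response.ok) {
      // A 404 becomes a readable, structured error rather than a thrown exception.
      return { status: "Error", message: `REST API returned ${response.status} for ${path}` };
    }
    return { status: "Success", data: await response.json() };
  } catch (err) {
    // Network failures and timeouts are reported the same way.
    return { status: "Error", message: `Could not reach REST API: ${(err as Error).message}` };
  }
}

// Inside a tool handler (hypothetical usage):
// const user = await getFromRestApi(`/users/${userId}`);
// if (user.status === "Error") return { status: user.status, message: user.message };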
- Bad: Throwing a raw HTTP 404 error.
- Good: Returning an MCP output with { status: "Error", message: "User with ID 123 not found in the database." }

Key Use Cases Unlocked by MCP

Converting to MCP is a strategic move that enables new classes of AI-powered applications.

1. AI-Powered Developer Tools (The "Cursor" Use Case)

Many modern AI IDEs and code assistants (like Cursor and GitHub Copilot) use MCP to allow the AI to interact with the local development environment or internal services.

- Scenario: A developer asks, "Run the integration tests for the new user-management module."
- MCP Tool: run_npm_script(scriptName: string)
- Node.js API Logic: The tool executes a shell command like npm run test:user-management, securely, with the user's explicit approval.

2. Intelligent Customer Support Automation

Expose your core CRM, inventory, or billing APIs as MCP tools to an internal AI agent.

- Scenario: A support agent asks the AI, "What is the order status for customer Alice, and can we apply a 10% discount?"
- MCP Tool 1 (Resource): get_customer_order_history(customerId)
- MCP Tool 2 (Tool): apply_discount_to_order(orderId, percentage)
- Benefit: The AI autonomously chains the calls, fetching the data and executing the action without manual steps.

3. Dynamic Workflow and Microservice Orchestration

An MCP server can sit as an abstraction layer over a sprawling microservice architecture, allowing an LLM to orchestrate complex, multi-step workflows with a single semantic command.

- Scenario: The LLM receives the instruction, "Process a new customer onboarding for Jane Doe."
- MCP Tool: onboard_new_customer(name, email)
- Orchestration Logic: The tool's handler internally calls the User Microservice (REST POST), the Billing Service (REST POST), and the Email Service (REST POST), ensuring the entire business process is completed correctly. This makes the LLM integration simple and resilient to backend complexity.

Conclusion: A Future of Standardized AI Integration

Converting your Node.js REST API to support MCP is an investment in future-proofing your application for the age of autonomous AI agents. While simply wrapping every endpoint is a good starting point, the true power of MCP is realized through aggressive curation: designing high-level semantic tools that reflect user intent rather than API structure. This process transforms your API from a static data exchange service into a dynamic, AI-callable skillset, greatly expanding its utility in agentic ecosystems.

By Lakshmi Narayana Rasalay
Converting ActiveMQ to Jakarta (Part III: Final)

Advanced Technical Approach

Some Java frameworks have taken on the complexity of supporting both the javax and jakarta package namespaces simultaneously. This approach makes sense for frameworks and platform services, such as Jetty and ActiveMQ, where the core development team needs to move the code base forward to support newer JDKs while also providing a way for application developers to adopt Jakarta EE gradually. It simplifies support for open-source frameworks: there are fewer releases to manage, and in the event of a security bug, one mainline branch can be released instead of backporting fixes across past versions.

However, supporting both javax and jakarta namespaces simultaneously in a single application is complicated and time-consuming. It also opens additional scenarios that may lead to errors and security gaps for enterprise applications. It limits the ability to set up verification checks and source code scanning that block pre-Jakarta libraries from being used or accidentally pulled in through transitive dependencies, creates ambiguity, and reduces the effectiveness of DevOps teams in providing pre-approved SDKs to enterprise developers. With the pitfalls outweighing the benefits, enterprise projects should not need to support both javax and jakarta namespaces simultaneously in most scenarios.

Special Consideration for Exception Handling for Remote Operations

The one caveat to this best practice for enterprise applications is that there may be a need to map exceptions between the javax and jakarta package namespaces to support clients making remote calls to a service or API. The server side either needs to detect javax clients and translate, or a thin client-side wrapper is needed to handle any jakarta exceptions received from remote services. Apache ActiveMQ handles exception namespace mapping appropriately for all client release streams (starting with v6.1.0, v5.19.0, v5.18.4, v5.17.7, and v5.16.8), so no additional handling is required by applications when using Jakarta Messaging.

Jakarta EE Updates and Nothing Else

A key factor in ActiveMQ's success was that the scope of change was limited to only what was necessary for the upgrade to Jakarta EE. The change of underlying frameworks naturally brought new minimum JDK version requirements, and the Jakarta EE specifications brought forward their own set of changes. No protocol, data format, or configuration changes were made to ActiveMQ, in order to preserve backwards compatibility with javax clients and to support roll-forward and rollback during upgrades. Developers should resist the urge to tackle other refactoring, data model, or business functionality changes when making the upgrade to Jakarta EE. These upgrades should be structured as technical debt-only releases to ensure the best outcomes.

Jakarta Migration Planning Guide

Team Impact: Organizing Changes for Code Review

For an enterprise taking on a similar migration of a large and established code base, I highly recommend following this next piece of advice to lower the time and level of effort: enforce an organizational policy that requires git commits related to package naming to be separated.
There should be two types, clearly labeled in the commit comments:

• Java package import namespace-only changes
• Code changes

Namespace-only changes involve updating a file from "import javax." to "import jakarta." These text changes may live in Java code files, Spring XML, config properties, or other non-Java artifacts used by the application. Code changes are updates required due to fixes, technical debt, Jakarta EE specification changes, or framework API changes (such as Spring or Jetty).

By separating these changes, you will greatly reduce the time required to review and approve them. Java package namespace-only changes will touch hundreds to thousands of files and thousands to tens of thousands of lines; for the most part, they can be approved quickly, without a deep code review. The actual impacting code changes should touch far fewer files and lines. Those reviews require a closer look, and by reducing their scope, you will greatly reduce the time required for code reviews.

Practical Tips for Jakarta Migration

• Drop end-of-life and deprecated modules from your code base.
• Migrate or drop end-of-life and deprecated dependencies.
• Upgrade code to use built-in Java features where commons-* dependencies are no longer needed.
• Upgrade to current non-Jakarta-affecting dependencies you may have been putting off (Log4j v2, JUnit v5, etc.).
• Where possible, release JDK 17 changes first (upgrade the JDK through LTS versions 8 -> 11 -> 17).
• Release a tech-debt update of your product or application. This allows for supporting two modern release streams: non-Jakarta and Jakarta.
• Update frameworks to Jakarta EE versions.
• Break up commits to have import-only changes for faster reviews.
• For complex in-house 'framework'-type components, consider releasing support for both javax and jakarta at the same time.
• Add support for a client-side Jakarta EE module alongside existing modules in the javax release stream.

In Summary

Apache ActiveMQ was successful in its migration to Jakarta EE by tackling necessary technical debt and resisting the urge to incorporate too many changes. The transition went smoothly, and users were able to quickly adopt the ActiveMQ 6.x releases in their Jakarta EE projects. Additionally, since the wire protocol, configuration, and data formats did not change, older javax applications (and non-Java applications) were able to work seamlessly through an upgrade.

This is an exciting time for Java developers as the ecosystem is rapidly adopting awesome new features and great language improvements. I'm interested in your feedback as you tackle Jakarta EE and JDK upgrades for projects of all sizes.

Reference Material

Change type | Estimated level of effort
Namespace change from "import javax…" to "import jakarta…" | Low
Upgrade to JDK 17 | Medium
Update Maven tooling to align with JDK 17 | Medium
Update and refactor code to use updated Jakarta specification APIs | Medium
Update and refactor code to use current dependencies that implement the updated specification APIs | High
Pay down technical debt | High
Update and refactor code to drop any dependencies that are not current with Jakarta, JDK 17, or their transitive dependencies | High
Team impacts: managing change across the enterprise | High

ActiveMQ's Jakarta Migration Metrics

The following statistics are provided as a reference for the level of effort required to migrate a medium-sized, mature Java project to Jakarta EE.
PRs | 1
Commits | 25 (the number of intermediate commits is over 100)
Files changed | 1,425
Lines added | 9,514
Lines removed | 8,091
Modules dropped | 2* (1 is the transition module, which got a relocation)
Dependencies re-homed | 2
Frameworks dropped | 2
Deprecated J2EE specifications dropped | 1
PR work tasks | 28
CI build jobs | 80

Apache ActiveMQ 6.0.0 Jakarta Messaging 3.1.0 Release Summary

• Permanently dropped module: activemq-partition (drops the deprecated Apache ZooKeeper test dependency)
• Jakarta APIs: Jakarta Messaging, Jakarta XML, Jakarta Servlet, Jakarta Transaction
• Upgraded key dependencies: Jetty v11, Spring v6, Java JDK 17+, Maven modules
• Dropped JEE API specs that do not have a Jakarta version: j2ee-management (interfaces re-implemented locally in Apache ActiveMQ)
• Re-homed test dependencies: stompjms Java STOMP client, joram-jms-tests JMS test utilities
• Temporarily dropped dependencies that did not have Jakarta support at the time (note: both have been added back in as of ActiveMQ 6.1.x): Jolokia, Apache Camel

By Matt Pavlovich
