Enhancing Machine Learning Workflows with Large Language Models: A Hybrid Approach


Large Language Models (LLMs) like GPT-4 are often viewed as the foundational layer of an intelligent system upon which other layers can be built. However, they can also add immense value when integrated into existing Machine Learning (ML) workflows to augment their outcomes. By embedding LLMs within traditional ML pipelines, we can leverage their advanced language understanding and contextual awareness to significantly improve performance, flexibility, and accuracy while preserving the reliability and baseline performance of the existing pipeline.

This approach combines the strengths of traditional models and LLMs. It opens up new possibilities for AI applications, making existing systems smarter and more responsive to real-world data nuances. The cherry on top is that integrating LLM intelligence does not require completely re-architecting the existing ML pipeline.

Traditional Machine Learning Workflow

Traditional ML models, particularly binary and multi-class classifiers, form the backbone of numerous ML applications in production today. A binary classifier differentiates between two classes, like distinguishing between spam and non-spam emails. In contrast, a multi-class classifier can identify multiple categories, such as categorizing news articles into sports, politics, or entertainment. These models learn from vast labeled datasets to make their predictions.

Key Concept: Thresholding in Classification

Since these ML models are trained on classification tasks, they output probabilities for the input belonging to each of the possible classes, which collectively add up to 1. For instance, consider a system with an existing AI model designed for property management (PM) that responds to a prospect’s questions about a specific unit based on existing information in the PM system. This AI model provides two outputs. First, it generates an answer to the question posed by the prospect. Second, it delivers a probability score reflecting confidence in its proposed answer. This is a classification score, where the AI model outputs the probabilities of being confident versus non-confident in the answer it has come up with. 

For instance, the model might output a probability of 0.8 corresponding to its confidence, indicating that it is moderately sure of its response. However, perhaps we want to be cautious and send out the AI model’s generated response only if we are highly confident. This ensures our prospects receive answers that are very likely to be accurate. If the model is not confident enough in the answer it has come up with, we might instead want to have a human review it. To address this, we might set a high confidence threshold and only allow the generated message to be sent to a prospect if the probability of being confident is above, say 0.95.

Choosing the threshold value is a critical step in classification tasks. It sets the cutoff point that differentiates between classes and governs the balance between precision (avoiding false positives) and recall (identifying true positives). The relative real-world cost of false positives versus false negatives is highly domain-specific and dictated by business needs, so the threshold must be adjusted to strike the right balance.
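
As a concrete illustration, a confidence threshold often amounts to a few lines of routing logic. The sketch below is illustrative only; the threshold value and function names are placeholders, not our production code.

```python
# Minimal sketch of confidence thresholding (illustrative names, not production code).
CONFIDENCE_THRESHOLD = 0.95  # tuned offline to balance precision and recall

def route_response(confidence: float) -> str:
    """Decide what to do with a generated answer given the model's confidence score."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return "send_to_prospect"      # high confidence: automate
    return "send_to_human_review"      # otherwise: fall back to a human

print(route_response(0.97))  # -> send_to_prospect
print(route_response(0.80))  # -> send_to_human_review
```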

Incorporating LLMs in the Prediction Stage

The threshold for an ML model is a single static value that rarely changes once it is decided upon. This is where LLMs can come in – acting as an extra layer of more intelligent, dynamic, and contextual thresholding. LLMs have an innate ability to understand and interpret complex, nuanced contexts. With their advanced natural language processing capabilities, LLMs can examine the contextual intricacies within the data, leading to more contextually informed and dynamic thresholding decisions.


Tying this back to our AI model example, where a response is produced for the prospect’s message along with confident/non-confident probability scores for that generated message: instead of classifying a fairly confident response with probability 0.94 as not confident (because it falls below a conservative static threshold of 0.95), we could send all responses in a specific confidence range (for example, 0.9 to 0.95) to an LLM and ask whether the message is appropriate to send. This range covers requests for which the model is reasonably confident but not confident enough to surpass the threshold. This hybrid system, sketched in code after the list below, has several advantages:

  • It uses well-trained, pre-existing, and reliable deep-learning models to make classifications that yield accurate results.

  • An ensemble of the existing ML model and LLM uses the general reasoning capabilities of the LLM in conjunction with the task-specific abilities of the existing ML model, avoiding dependence on the exact static threshold value for the outcome.

  • It embeds intelligent verification using LLMs for cases close to the threshold. This could enable better coverage while keeping precision similar.

  • Although an LLM call costs more than inference from an in-house, trained ML model, it's still orders of magnitude cheaper than human verification.
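
Here is a minimal sketch of the routing logic described above. The confidence band and the llm_approves() helper are illustrative assumptions; llm_approves() stands in for a prompt that asks an LLM whether the drafted reply is appropriate to send.

```python
# Sketch of the hybrid routing logic. Band boundaries and helper names are illustrative.
STATIC_THRESHOLD = 0.95
LLM_REVIEW_BAND = (0.90, 0.95)

def llm_approves(draft_reply: str, prospect_message: str) -> bool:
    # Placeholder for an LLM call such as:
    # "Given this prospect question and drafted answer, is the answer appropriate to send?"
    raise NotImplementedError

def route(draft_reply: str, prospect_message: str, confidence: float) -> str:
    if confidence >= STATIC_THRESHOLD:
        return "send"                                # classifier alone is confident enough
    if LLM_REVIEW_BAND[0] <= confidence < LLM_REVIEW_BAND[1]:
        return "send" if llm_approves(draft_reply, prospect_message) else "human_review"
    return "human_review"                            # low confidence: always escalate
```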

In our use case, this approach enabled us to positively classify and automate 75% of cases that fell below the threshold and would otherwise have been classified negatively. We achieved this increase in recall while maintaining a similar precision rate, increasing the volume of positive outcomes without affecting their quality.

At the same time, it is also essential to consider potential trade-offs that this system ends up making:

  • Integrating LLMs into existing pipelines increases the system's complexity, potentially making it more challenging to maintain, debug, and improve the models over time.

  • The use of LLMs, especially in real-time applications, introduces latency: the additional layer of LLM analysis can slow response times, which may be critical in time-sensitive scenarios.

  • There is a dependency on the providers of LLMs for data privacy policies, pricing, security, and performance. Ensuring compliance with data protection regulations becomes extremely important and complicated.

Hybrid ML workflow: flowchart of how an existing machine learning model is augmented with a large language model to ensure quality responses in a dynamic, context-aware system

Integrating LLMs into traditional ML workflows offers a balanced approach, combining existing models' reliability with LLMs' contextual intelligence. However, there is no free lunch. Organizations must weigh the above-discussed challenges against the potential benefits when considering adopting a hybrid system involving LLMs. Regardless, this emerging hybrid system promises to enhance AI applications, making them more adaptable and responsive to real-world complexities.

Authors and Contributors: Aditya Lahiri, Brandon Davis, Michael Leece, Dimitry Fedosov, Christfried Focke, Tony Froccaro

Revolutionizing PropTech with LLMs

Our flagship product, AppFolio Property Manager (APM), is a vertical software solution designed to help at every step when running a property management business. As APM has advanced in complexity over the years, with many new features and functionalities, an opportunity arose to simplify user actions and drive efficiency. Enter AppFolio Realm-X: the first generative AI conversational interface in the PropTech industry to control APM - powered by transformative Large Language Models (LLMs). 

AppFolio Realm-X

Unlike traditional web-based systems of screens, Realm-X acts as a human-like autonomous assistant to the property manager. Realm-X empowers users to engage with our product through natural conversation, eliminating the need for time-consuming research in help articles and flattening the learning curve for new users. Realm-X can answer general product questions, retrieve data from a database, streamline multi-step tasks, and automate repetitive workflows – all in natural language, without an instruction manual. With Realm-X, property managers can focus on scaling their businesses while improving critical metrics, such as occupancy rates, tenant retention, NOI, and stakeholder satisfaction.

Why LLMs?

Distinguished by their unmatched capabilities, LLMs set themselves apart from traditional Machine Learning (ML) approaches, which predominantly rely on pattern matching and exhibit limited potential for generalization. The distinctive features of LLMs include:

  • Reasoning capabilities: Unlike traditional approaches, LLMs can reason, enabling them to comprehend intricate contexts and relationships.

  • Adaptability to new tasks: LLMs can tackle tasks without requiring task-specific training. This flexibility allows for a more streamlined and adaptable approach to handling new and diverse requirements. For example, if we change the definition of an API, we can immediately adapt the model to it. We don’t need to go through a slow and costly cycle of annotation and retraining.

  • Simplifying model management: LLMs eliminate the need to manage numerous task-specific models. This consolidation simplifies the development process, making it easier to maintain and scale.

LLM Pre-training and Fine-tuning

The first step in creating an LLM is the pre-training stage. Here, the model is exposed to diverse and enormous datasets encompassing a significant fraction of the public internet, commonly using a task such as next-word prediction. This allows it to draw on a wealth of background knowledge about language, the world, entities, and their relationships - making it generally capable of navigating complex tasks.

After pre-training, fine-tuning is applied to customize a language model to excel at a specific task or domain. Over the years, we have used this technique extensively (with smaller language models such as RoBERTa) to classify messages within Lisa or Smart Maintenance or extract information from documents using generative AI models like T5. Fine-tuning LLMs makes them practical and also unlocks surprising new emergent abilities - meaning they can accomplish tasks they were never explicitly engineered for.

  1. Instruction following: During pre-training, a model is conditioned to continue the current prompt. Instead, we often want it to follow instructions so that we can steer the model via prompting.

  2. Conversational: Fine-tuning on conversational data from sources such as Reddit imbues the model with the ability to refer back to previous parts of the conversation and to distinguish instructions from the text it generated itself and from other inputs. This allows users to engage in a back-and-forth conversation to iteratively refine instructions or ask clarifying questions while reducing the vulnerability to prompt injections.

  3. Safety: Most of the fine-tuning process concerns preventing harmful outputs. A model adapted in this way will tend to decline harmful requests.

  4. Adaptability to new tasks: Traditional ML techniques require adaptation to task-specific datasets. On the other hand, fine-tuned LLMs can draw on context and general background knowledge to adapt to new tasks through prompting alone, known as zero-shot prompting. Additionally, one can improve accuracy by including examples (known as few-shot prompting) or other helpful context.

  5. Reasoning from context: LLMs are especially suited for reasoning based on information provided together with the instructions. We can inject portions of documents or API specifications and ask the model to use this information to fulfill a query.

  6. Tool usage: Like humans, LLMs exhibit weaknesses in complex arithmetic, precise code execution, or accessing information beyond their training data. However, like a human using a calculator, code interpreter, or search engine, we can instruct an LLM about the tools at its disposal. Tools can dramatically enhance their capabilities and usefulness since many external systems, including proprietary search engines and APIs, can be formulated as tools.
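
As an illustration, a tool is typically described to the model with a name, a natural-language description, and a parameter schema. The tool name and fields below are hypothetical, shown in a generic function-calling style rather than any specific provider's API.

```python
# Illustrative only: how a tool might be described to a function-calling LLM.
# The tool name, parameters, and schema are hypothetical.
unit_search_tool = {
    "name": "search_available_units",
    "description": "Look up available units in the property management system.",
    "parameters": {
        "type": "object",
        "properties": {
            "property_name": {"type": "string", "description": "Name of the property"},
            "bedrooms": {"type": "integer", "description": "Minimum number of bedrooms"},
        },
        "required": ["property_name"],
    },
}
```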

Given a generic model with generic capabilities, the problem remains: how do we adapt it so that a user of our software can use it to run their business?

Breaking Down a User Prompt

Once a user enters a prompt in the chat interface, Realm-X uses an agent paradigm to break it into manageable subtasks handled by more specialized components. To do this, the agent is provided with:

  1. A description of the domain and the task it has to complete, along with general guidelines

  2. Conversation history

  3. Dynamically retrieved examples of how to execute a given task, and

  4. A list of tools

A description of the task and the conversation history defines the agent's environment. It allows users to refine or follow up on previous prompts without repeating themselves. Because we provide the agent with a set of known scenarios, the agent can retrieve the ones most relevant to the current situation – improving reliability without polluting the prompt with irrelevant information. 

The agent must call relevant tools, interpret their output, and decide the next step until the user query is fulfilled. Tools encapsulate specific subtasks like data retrieval, action execution, or even other agents. They are vital because they allow us to focus on a subset of the domain without overloading the primary prompt. Additionally, they allow us to parallelize the development process, separate concerns, and isolate different parts of the system. We allow the agent to select tools and generate their input - and tools can even be dynamically composed – such that the output of a tool can be the input for the next.
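
A minimal sketch of such an agent loop is shown below. The select_action policy, tool signatures, and history format are illustrative assumptions, not the actual Realm-X implementation.

```python
from typing import Callable

# Minimal sketch of an agent loop over tools. select_action stands in for an LLM-backed
# policy that, given the conversation history and the available tool names, either picks
# a tool (with its input) or returns a final answer. All names are illustrative.
def run_agent(
    user_prompt: str,
    tools: dict[str, Callable[..., str]],             # tool name -> callable returning a summary string
    select_action: Callable[[list[str], list[str]], dict],
    history: list[str],
    max_steps: int = 5,
) -> str:
    history.append(f"user: {user_prompt}")
    for _ in range(max_steps):
        action = select_action(history, list(tools))
        if action["type"] == "final_answer":
            return action["text"]
        summary = tools[action["tool"]](**action["input"])    # run the chosen tool
        history.append(f"tool[{action['tool']}]: {summary}")  # feed the result back to the agent
    return "Sorry, I could not complete that request."
```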

Generally, a tool returns: 

  1. Structured data to be rendered in the UI

  2. Machine-readable data to be used for automation tasks

  3. A string that summarizes the result to be interpreted by the main agent (and added to the conversation history)
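
A simple way to picture this return value is a small record with one field per item above; the field names in this sketch are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative shape for a tool's return value, mirroring the three parts listed above.
@dataclass
class ToolResult:
    display_data: dict[str, Any] = field(default_factory=dict)  # 1. structured data rendered in the UI
    machine_data: dict[str, Any] = field(default_factory=dict)  # 2. machine-readable data for automation
    summary: str = ""                                            # 3. string summary for the main agent
```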

Realm-X in Action

Let’s look at an example of tools in action through two requests Realm-X receives via a single, natural language prompt. A more in-depth analysis of tools will be explored in later posts. The prompt asks Realm-X to 1. Gather a list of residents at their property, The Palms, and 2. Invite the residents via text to a community barbecue on Saturday at 11 AM. 

Realm-X returns a draft bulk message

  • A user can manually update the message or refine it with follow-up prompts

  • Recipients are pre-populated

  • While not shown here, placeholders can be added to personalize each message

This allows users to quickly create bulk communication, where the reports and content can be customized to many situations relevant to running their business. We aim to expand the capabilities to cover all actions that can be performed inside APM, from sending renewal offers to scheduling inspections. The next step is combining business intelligence and task execution with automation. Our users can put parts of their business on auto-pilot, allowing them to focus their time and energy on building relationships and improving the resident experience rather than on repetitive busy work.

The Road Ahead

Integration of LLMs represents a shift in how we develop software. For one, we don’t have to build custom forms and reports for every use case but can dynamically compose the building blocks. With some creativity, users can even compose workflows we haven’t considered ourselves. We also learned that LLMs are an impartial check on our data and API models – if a model can’t reason its way through, a non-expert human (developer) will also have a hard time. All future APIs must be modeled with an LLM as a consumer in mind, requiring good parameter names, descriptions, and structures representing the entities in our system.

This post only scratches the surface of what we can do with this exciting technology. We will follow up with more details surrounding the engineering challenges we encountered and provide a more detailed view into the architectural choices, user experience, testing and evaluation, accuracy, and safety.

Stay tuned!

Authors and Contributors: Christfried Focke, Tony Froccaro

Smart Maintenance Conversational AI

If you’re a property manager or a landlord responsible for a portfolio of properties, you know the importance of providing high-quality, prompt maintenance service to your tenants. Maintenance issues - if not dealt with in a timely manner - can affect tenant satisfaction, tenant retention, and ultimately your company’s revenue.

Here we will show how AppFolio’s Smart Maintenance simplifies the maintenance process, helping tenants resolve their maintenance requests conveniently and efficiently. To demonstrate how tenants report maintenance issues using Smart Maintenance, let’s take a look at a hypothetical scenario involving Sandy, a tenant who discovers a clogged drain when doing the dishes after dinner.

Named Entity Recognition (NER) used by Smart Maintenance to identify and categorize key information gathered from the conversation


Step 1: Confirming the Tenant's Identity and Address

The first step of Smart Maintenance’s process is to confirm the tenant's identity and address. This is a critical first step because the tenant who is reporting the issue might not be the person who signed the lease, and the property manager needs to know who to contact and where to send the maintenance technician.

To initiate a conversation with Smart Maintenance, a tenant can send a message to our AI-assisted texting interface, or text their property manager who starts the process. In this scenario, Sandy texts Smart Maintenance about her clogged drain directly and does not provide her full name. Because Sandy did not provide her name and address in the initial message, Smart Maintenance will conduct additional information-gathering and respond with follow-up questions to extract that information.

Sandy then replies with her full name and address, and Smart Maintenance uses an identity parsing model to extract and record her contact information and cross-checks it against the value in our database. Throughout this process, there is a human operator overseeing the conversation who can interject at any time - but our AI is automating the operator’s job to a large extent.
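
A minimal sketch of this identity-confirmation step might look like the following, with hypothetical extract_identity and find_tenant helpers standing in for the identity parsing model and the database lookup.

```python
from typing import Callable, Optional

# Illustrative sketch of the identity-confirmation step; all names are hypothetical.
def confirm_identity(
    message: str,
    extract_identity: Callable[[str], dict],      # identity parsing model
    find_tenant: Callable[..., Optional[dict]],   # lookup against the property management database
) -> Optional[dict]:
    parsed = extract_identity(message)            # e.g. {"name": "Sandy ...", "address": "..."}
    if not parsed.get("name") or not parsed.get("address"):
        return None                               # missing info -> ask a follow-up question
    return find_tenant(name=parsed["name"], address=parsed["address"])  # None -> escalate to the operator
```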

Step 2: Identify the Issue

Once Sandy’s contact information is confirmed, Smart Maintenance’s next task is to identify the issue that she is reporting. There are several types of maintenance issues, and each one requires a different course of action and carries a different level of urgency. For example, a leaking faucet might need a simple fix, while a broken heater during winter might need an emergency replacement.

Instead of simply providing Sandy with a long list of issues to choose from, or leaving the property manager to guess the issue from Sandy's description, Smart Maintenance uses a machine learning-based model to predict the issue based on her description. Whether Sandy describes the clogged drain succinctly ("the kitchen drain is clogged") or in a more roundabout way ("the water is not draining"), Smart Maintenance knows both descriptions refer to the same issue.

We also provide an intuitive user interface for the human operator to review and confirm the issue for an additional layer of confidence. In the rare scenario where the operator finds that the model did not predict the clogged drain accurately, they are shown the model’s top five predictions and can adjust the prediction accordingly.
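
Conceptually, surfacing the top five predictions is just a ranking of the classifier's class probabilities. The sketch below is illustrative; the labels and probabilities are made up.

```python
# Illustrative: turn class probabilities from an issue classifier into a top-5 list for the operator UI.
def top_k_issues(probabilities: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:k]

predicted = {"clogged drain": 0.71, "leaking faucet": 0.12, "garbage disposal": 0.07,
             "low water pressure": 0.05, "water heater": 0.03, "other": 0.02}
print(top_k_issues(predicted))   # operator can confirm the top prediction or pick a different issue
```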

Step 3: Gather the Details

After confirming Sandy’s maintenance issue is a clogged drain, Smart Maintenance needs to gather additional details about the drain. These details are necessary to understand the scope, severity, and urgency of the maintenance issue, and further, are used to provide the maintenance technician with the relevant context.

For each maintenance issue submitted to the system, the tenant will be asked a series of troubleshooting questions. These questions are designed to assist the tenant in resolving their issue without sending a maintenance technician, so that simple issues can be resolved promptly. 

  If a maintenance technician is needed, Smart Maintenance will ask triage questions. Triage questions are meant to determine the urgency of the issue. For Sandy’s clogged drain, the system might ask whether only one drain is clogged, or how severe the blockage is. Additionally, Smart Maintenance inquires as to whether there are any fire, flood, or other hazards associated with the issue.

Smart Maintenance uses another machine learning-based model to turn the tenant’s responses into standardized, structured data that can be easily stored and processed. Smart Maintenance also validates the answers and asks for clarification if they are unclear or incomplete.
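
A simplified sketch of this normalization and validation step is shown below. The field names and rule-based parsing are illustrative; in production a trained model handles this rather than hand-written rules.

```python
# Illustrative sketch of normalizing triage answers into structured fields.
def parse_triage_answer(question_id: str, answer: str) -> dict:
    text = answer.strip().lower()
    if question_id == "hazard_present":
        if text in {"yes", "y"}:
            return {"hazard_present": True}
        if text in {"no", "n"}:
            return {"hazard_present": False}
        return {"needs_clarification": question_id}   # unclear answer -> ask again
    if question_id == "num_drains_affected":
        return {"num_drains_affected": int(text)} if text.isdigit() else {"needs_clarification": question_id}
    return {question_id: text}                        # fall back to storing the raw answer
```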

Steps 4 and 5: Summarize and Close the Issue

After completing all the aforementioned steps, Smart Maintenance uses a summarization model to present the information to a human operator who creates a work order for Sandy’s clogged drain. The summary provided by the issue details model is included in the work order’s description. 

If Sandy is happy with the work order created, our closing model will end the conversation based on her response. Sandy can also text in later if there is new information, or if she simply wants an update. Our closing model will then re-open the conversation and have operators address Sandy’s questions.

Step 6: Submit the Work Order to the Property Manager and Vendor

With the work order creation finalized by the tenant, Smart Maintenance submits the work order to the property manager. The property manager can now make an informed decision on which vendor is best suited to address Sandy’s problem (in this instance, a plumber) and dispatch them to fix the clogged drain.

Once Sandy’s drain is fixed by the plumber, Smart Maintenance continues to optimize the process - handling billing, gathering customer feedback, and even allowing property managers to define rules that will automatically contact vendors based on an issue, the issue’s urgency, and the answers provided by the tenant to the triage questions. 

Overview: How Smart Maintenance Streamlines Operations

Smart Maintenance provides a seamless, user-friendly experience for tenants to report maintenance issues and helps property managers handle those requests more efficiently.

  • Streamlined Issue Reporting: Tenants can simply text or chat to report their maintenance issues, instead of filling out complicated forms or calling an office. They describe their issues in their own words instead of choosing from predefined options while being guided to provide relevant information. This reduces frustration for tenants and encourages them to report issues promptly. 

  • Less Time Investment for Property Management Staff: Smart Maintenance automates repetitive conversations, reduces errors related to data entry, and creates an accurate system of record.

  • More Accurate and Complete Issue Identification: Smart Maintenance understands the tenant's description of the issue and can discern its severity, frequency, and location. This helps avoid miscommunications and ambiguity.

  • Quicker and Easier Work Order Creation: Smart Maintenance summarizes the collected information to generate a concise and clear issue description that can be used to create a work order. Property managers can rely on the work orders being accurately submitted and can set rules on how they’ll be notified and how the work is assigned.

  • More Efficient Communication: Smart Maintenance facilitates communication with tenants and vendors, provides updates, and handles notification and assignment of issues. Smart Maintenance also handles common tenant queries, such as ETA requests, feedback, or follow-ups. This helps improve transparency and accountability, and keeps tenants and vendors satisfied.

Authors and Contributors: Christfried Focke, Ken Graham, Joo Heon Yoon, Shyr-Shea Chang, Tony Froccaro

Building APIs That Delight Customers and Developers

In June of 2022 at the National Apartment Association Conference, AppFolio announced its integration marketplace, AppFolio Stack, to address the increasing demand for integrations with the Property-Tech services our customers use daily to streamline their workflows. AppFolio Stack was carefully designed to address what our primary stakeholders – the customer (an AppFolio Property Management Company) and the partner (a third-party Property-Tech service) – say is their number one technical challenge: integrations. In a survey conducted by AppFolio, we found that integrations often create the very same problems they intend to solve: double data entry, extra workflow steps for onsite teams, and data inaccuracies. Below are the key concepts that guided AppFolio Stack’s design to make sure our integrations are done right.

1. Optimize The Developer Experience From The Get-Go

Every API should be optimized for the developer's experience and ease of use. We use the “Time to Hello World” metric – the time it takes to execute a core functionality of an API (e.g. a user’s first GET request) - to measure our API’s ease of use. Our API documentation’s Getting Started section walks developers through account setup and their first GET request in right around 10 minutes.
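
For illustration, a first "Hello World" request usually boils down to a single authenticated GET. The base URL, endpoint, and header below are placeholders, not the actual Stack API surface.

```python
import requests

# Illustrative "Hello World" request; URL, endpoint, and auth header are placeholders.
BASE_URL = "https://api.example-stack-sandbox.com/v1"
response = requests.get(
    f"{BASE_URL}/properties",
    headers={"Authorization": "Bearer <your-sandbox-token>"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```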

Establishing confidence in our platform's ease of use from the get-go – with code samples, straightforward naming conventions, detailed error messages, and strict adherence to OpenAPI specifications – provides developers with a familiar experience optimized for exchanging data as efficiently as possible.

2. Enable Developers to Simulate Live Integrations 

Before going live with customers, we provide our partner developers with a sandbox environment with sample data that is tailored specifically for their use case. For example, if our partner deals with maintenance - the sample data we provide is robust enough to cover the different business models each company has in place for the maintenance domain.

By performing advanced testing against this sandbox environment with sample data that closely mimics the production environment, we set the right expectations for how partners’ software will perform once they enable a connection with real customer data, and increased load. 

3. Rate Limit to Provide Expected Performance 

To ensure all partners can rely on the availability of our system during periods of heavy load, we configure rate limits on a per-second, per-minute, and per-hour basis to prevent network disruptions that may occur once partner-customer connections go live.

We provide detailed error messages when these limits are exceeded – and through continuous monitoring, we can promptly communicate ways to restructure our partners’ request patterns to improve their querying efficiency.

Lastly, all rate limits are structured on a customer-partner pair basis. This provides partners with stable performance that is not impacted by adding new partners or enabling new customers for a specific partner.
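
On the consumer side, respecting these limits typically means backing off and retrying when the API signals throttling. The sketch below is a generic pattern; the Retry-After header usage is a common convention, not a statement about the Stack API's exact behavior.

```python
import time
import requests

# Illustrative client-side handling of rate limits: back off and retry on HTTP 429.
def get_with_backoff(url: str, headers: dict, max_retries: int = 5) -> requests.Response:
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 429:
            return response
        wait = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)                     # honor the server hint, or fall back to exponential backoff
    raise RuntimeError("rate limit retries exhausted")
```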

4. Practice Quality Assurance in Development 

We implement numerous methods for catching errors before our code goes into production. The first line of defense is unit testing of the source code, along with Selenium tests to catch any regressions in the UI. Through contract testing, leveraging tools such as Postman and Pact, we ensure our integrations continually provide the stable service we promise.

Internally, the Stack API is developed as an orchestration of microservices, and this robust contract testing enables us to independently develop each service with the confidence that we would be notified as soon as a code change violates an existing contract.
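
As a simplified stand-in for what tools like Pact and Postman provide, a consumer-side contract check can be as small as asserting that a response matches an agreed schema. The endpoint and schema below are illustrative.

```python
import requests
from jsonschema import validate

# Simplified consumer contract check; endpoint, token, and schema are illustrative.
WORK_ORDER_SCHEMA = {
    "type": "object",
    "required": ["id", "status", "created_at"],
    "properties": {
        "id": {"type": "string"},
        "status": {"type": "string"},
        "created_at": {"type": "string"},
    },
}

def test_work_order_contract():
    response = requests.get(
        "https://api.example-stack-sandbox.com/v1/work_orders/123",
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
    assert response.status_code == 200
    validate(instance=response.json(), schema=WORK_ORDER_SCHEMA)  # fails if the contract is violated
```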

5. Proactively Monitor and Alert

In an ideal developer environment, users don’t have to report issues or the details of what might have caused them. Tools such as NewRelic, DataDog, and Rollbar alert us as soon as any errors or anomalies occur in our customers’ API usage – and we can proactively inform customers as soon as the errors occur.

6. Allow Time to Introduce Breaking Changes

When developing and supporting a complex API platform, change is inevitable. Planning for change enables us to iterate faster while still being able to revisit previous decisions and add improvements, even when these are potentially breaking changes. Our approach is to support partners when transitioning to the new API version while continuing to support old versions throughout their transition.

7. Maintain Open Lines of Communication With Developers

We are always receptive to feedback and maintain a direct channel of communication with all of our partners using Slack. Whether the input comes from a Slack message or from our internal reporting system, the feedback is shared directly with our team and often results in constructive changes to our API. Improving existing integrations through mutual collaboration is one of our highest priorities.

Stay tuned for future posts diving deeper into some of the topics touched upon above!


Authors and contributors: Nevena Golubovic and Tony Froccaro