Introduction to Large Language Models on Databricks
Ali Ghodsi, the CEO and Co-Founder of Databricks, announced Dolly 2.0, the first open-source LLM (Large Language Model) that can understand and follow instructions. It was trained on a dataset of human-created instructions and can be used commercially.
What is Dolly, and when was it introduced?
Dolly 2.0 is an open-source LLM: a sophisticated AI system that has been fine-tuned on a dataset of human-generated instructions. It is based on the EleutherAI Pythia model family. Dolly 2.0 is special because it is good at understanding and following the instructions given to it by humans, which is a big step forward in how computers understand and use language. Dolly 2.0 was introduced in April 2023. The dataset utilised to train Dolly 2.0, known as databricks-dolly-15k, comprises 15,000 carefully crafted pairs of prompts and responses generated by humans. This dataset was explicitly designed for fine-tuning large language models for better instruction comprehension.
Hugging Face provides a library named Transformers, which Dolly utilises for its execution.
What is Hugging Face?
Hugging Face is a company and open-source community specialising in natural language processing (NLP) and machine learning. They provide various tools and resources to make NLP tasks easier for researchers and developers. One of their popular offerings is the Transformers library, which allows users to work with pre-trained NLP models like BERT and GPT. They also maintain the Datasets library for accessing NLP datasets, a Model Hub for downloading pre-trained models, and an Inference API for deploying models in production. Hugging Face has significantly contributed to advancing NLP and has a strong presence in the NLP community.
What does Hugging Face do?
The Hugging Face Transformers library was developed to make it easier, more flexible, and simpler to use complex models with a similar architecture. It provides a unified API that allows users to seamlessly access, load, train, and save models without any complications. Initially focused on NLP applications, Hugging Face has expanded to include use cases in audio and visual domains as well. This library follows a standard deep learning approach that involves multiple steps, from data acquisition to model fine-tuning, enabling a reusable workflow tailored to specific domains.
Hugging Face Transformers provides a pipeline that handles all the necessary preprocessing and post-processing tasks for input text data to simplify this procedure. These pipelines serve as the fundamental objects in the Transformer library, encapsulating the entire workflow of Hugging Face solutions. They seamlessly connect a model with the required preprocessing and post-processing steps, allowing us to provide input texts without worrying about the underlying complexities.
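As a minimal illustration (this sketch uses a generic sentiment-analysis task and the library’s default model, not anything Dolly-specific), a pipeline can be created and used in a few lines:

```python
from transformers import pipeline

# A pipeline bundles tokenisation, model inference, and post-processing
# behind a single call; the task name selects a sensible default model.
classifier = pipeline("sentiment-analysis")

print(classifier("Dolly makes instruction-following models accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```

The same pattern applies to other tasks such as text generation, which is how Dolly itself is exposed.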
What happened before Dolly?
Prior to the introduction of Dolly, language processing relied on traditional methods such as rule-based systems and statistical approaches. These methods had certain limitations in understanding and generating natural language text.
However, new models like Dolly emerged with advancements in deep learning and neural networks. Before Dolly, other language models like GPT (Generative Pre-trained Transformer) gained attention for their ability to generate coherent text based on large amounts of training data.
While these models showed promise, they had limitations in understanding and following specific instructions. Dolly was explicitly developed to address this limitation. It introduced the concept of instruction-following, where the model could understand and generate text based on detailed instructions provided.
In short, language processing before Dolly relied on traditional methods and general-purpose language models; Dolly brought the capability of instruction-following, which opened new possibilities in natural language processing tasks.
Why is Dolly important?
Dolly 2.0 is significant for several reasons. It represents a remarkable advancement in the realm of language processing technology: its ability to understand and follow instructions accurately demonstrates the progress made in improving communication between humans and machines. Dolly 2.0 is also accessible to a wide range of users without any cost. This inclusivity promotes collaboration and knowledge-sharing and fosters innovation among developers, researchers, and enthusiasts.
Dolly 2.0’s fine-tuning on a dataset of human-generated instructions enhances its ability to comprehend and execute tasks based on human intent. Dolly 2.0’s advancements contribute to the broader field of artificial intelligence, driving progress and inspiring further breakthroughs in language processing technology.
Why is Dolly better or different than the GPT-3.5 version?
Dolly is distinct from ChatGPT, specifically the GPT-3.5 version, in a few key aspects. While ChatGPT was developed by OpenAI, Dolly was developed by Databricks and has been fine-tuned on a dataset of human-generated instructions, which makes it particularly skilled at understanding and following instructions.
Compared to ChatGPT, Dolly’s primary focus is on instruction-following, making it more suitable for tasks that require precise execution of commands or directions. Its training on a specific dataset tailored for instruction tuning further enhances its ability to comprehend and respond to instructions reliably.
On the other hand, GPT-3.5, including ChatGPT, is a more generalised language model that excels in generating creative and coherent text based on given prompts. It can converse, provide explanations, and generate human-like responses across various topics.
Why did Databricks open-source Dolly?
Databricks is well known for delivering open-source AI tools and services to help its customers make full use of their data. Its most important motivation is to promote collaboration and innovation and to encourage the development of new applications in the field of language.
Open-sourcing Dolly enables a broader community of developers, researchers, and enthusiasts to access and contribute to its ongoing improvement. This collaborative approach enhances the potential for groundbreaking advancements and ensures that the technology benefits from diverse perspectives and expertise.
Furthermore, open-sourcing Dolly aligns with Databricks’ commitment to democratising AI and making advanced language processing capabilities accessible to a broader audience.
How to use Dolly on Databricks?
To use Dolly on Databricks, follow these steps (a code sketch follows the list):
- Ensure you have access to Databricks’ machine learning platform.
- Install the dependencies and libraries required for Dolly, such as the Hugging Face Transformers library.
- Load and initialise the Dolly model, such as the “dolly-v2-3b” model, using the appropriate commands.
- If needed, preprocess your input data or instructions to ensure they are in the appropriate format for Dolly.
- Use the Dolly model to generate responses or outputs based on the given instructions.
- Analyse and evaluate the results generated by Dolly to determine their quality and usefulness.
- Iterate and refine your instructions or experiment with different approaches to achieve the desired output from Dolly.
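A minimal sketch of what these steps can look like in a notebook cell, following the usage described on the databricks/dolly-v2-3b model card (the prompt text is just an example):

```python
import torch
from transformers import pipeline

# Load dolly-v2-3b; trust_remote_code is required because the model ships
# its own instruction-following pipeline implementation.
generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Give the model an instruction and inspect the generated response.
result = generate_text("Explain the difference between a lake and a lakehouse.")
print(result[0]["generated_text"])
```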
How does Dolly work with MLflow?
MLflow is a platform designed to help manage and track machine learning projects, while Dolly is an AI language model developed by Databricks. Although MLflow is not directly integrated with Dolly, it can be used in conjunction with Dolly to enhance the workflow and organisation of machine learning experiments.
MLflow provides various capabilities to assist with managing Dolly models and experiments. It lets users track and log experiments, capturing important metadata, parameters, metrics, and results. This tracking functionality helps record the progress and outcomes of different experiments performed with Dolly.
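For instance, a run could be logged as follows; the parameter and metric names here are illustrative choices, not part of any Dolly API:

```python
import mlflow

# Track a hypothetical Dolly prompt-engineering experiment.
with mlflow.start_run(run_name="dolly-instruction-test"):
    mlflow.log_param("model", "databricks/dolly-v2-3b")
    mlflow.log_param("max_new_tokens", 256)
    mlflow.log_param("temperature", 0.7)
    # An example quality score computed offline by the experimenter.
    mlflow.log_metric("instruction_accuracy", 0.82)
```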
MLflow also aids in deploying Dolly models into production environments. It provides tools for packaging the models, their dependencies, and associated resources, simplifying the deployment process and making integrating Dolly models into real-world applications easier.
Furthermore, MLflow supports collaboration and reproducibility by facilitating sharing and reproduction of experiments. It serves as a centralised platform for team members to collaborate, share findings, and reproduce experiments, fostering teamwork and knowledge sharing.
dolly-v2-3b Model Code
Databricks has developed dolly-v2-3b, a powerful causal language model with 2.8 billion parameters. This model is based on EleutherAI’s Pythia-2.8b and has been carefully fine-tuned using a dataset of approximately 15,000 instruction records. Databricks employees created the instructions, and the model is released under a permissive license called CC-BY-SA.
To use this model with the transformers library on a machine with GPUs, we first need to install the accelerate and transformers libraries. Run the following in Databricks (screenshots attached):
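The model card suggests an installation command along these lines (the version pins may differ in your environment):

```python
# Run in a Databricks notebook cell:
%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
```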
The rest of the model code can be accessed through the following link: FaizanZahid13/dolly-v2-3b-Model-Cards (github.com)
Alternatively, if the Dolly model is not accessible or practical to use in your Databricks environment, you can investigate an alternative strategy employing large language model (LLM) capabilities. Databricks supports both training and serving LLMs, so you can easily incorporate them into your workflow.
To implement this strategy, you can use Databricks’ LLM features to train a large language model on a sizeable corpus of text data relevant to your domain, such as product descriptions, customer reviews, or any other text pertaining to products. By training the LLM on this text, you can capture its contextual information and semantic correlations.
SentenceTransformer with MLflow:
SentenceTransformer is a Python library designed explicitly for generating sentence embeddings. Sentence embeddings are dense numerical representations that capture the semantic meaning of sentences. By leveraging LLMs, SentenceTransformer can convert input text into meaningful embeddings, which can be used for tasks like semantic search.
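A minimal sketch of generating and comparing sentence embeddings; the model name is an example choice, and the two product-style sentences are invented:

```python
from sentence_transformers import SentenceTransformer, util

# Load a small general-purpose embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A solid walnut coffee table", "Wooden table for the living room"]
embeddings = model.encode(sentences)

# Semantically similar sentences yield embeddings with high cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))
```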
MLflow and SentenceTransformer are potent tools that can be used with large language models (LLMs) to enhance various natural language processing (NLP) tasks, including semantic search. MLflow offers a complete set of tools and libraries to manage the entire ML development process, including experiment tracking, reproducibility, model packaging, and deployment. With it, data scientists and ML engineers can efficiently collaborate, track and compare experiments, and organise their projects.
MLflow also allows for model versioning, management, and deployment through a model registry, and it supports various deployment options, including serving models via a REST API or integrating them into other applications.
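Recent MLflow versions include a built-in sentence_transformers flavour; a sketch of logging and registering an embedding model might look like this (the registered model name is a hypothetical choice):

```python
import mlflow
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

with mlflow.start_run():
    # Log the model as a run artifact and register a new version of it.
    mlflow.sentence_transformers.log_model(
        model,
        artifact_path="embedding_model",
        registered_model_name="product-search-embedder",
    )
```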
Implementations:
Here we will demonstrate how large language models (LLMs) can be used for product search. Unlike standard keyword-based searches, LLMs can comprehend the similarities and meanings of words and sentences. This makes it possible to identify products that satisfy the user’s intent more precisely. We will use two types of LLMs: one that has already been trained on broad text, and another that has been specifically customised for our product catalogue. We will also demonstrate how to use Databricks to deploy the model. To follow along, you will need a Databricks workspace that supports GPU clusters and model serving.
A Visual Guide to Creating a Databricks Community Edition Account and Logging In to Databricks
We access the Databricks Community Edition on its official website. Click the link to open the Databricks Community page. The following steps show how to create a Databricks Community Edition account.
Step 1:
Go to the Databricks website and search for Databricks Community Edition.
Step 2:
Click Try Databricks, either in the search result or at the top of the Databricks page.
Step 3:
Click TRY DATABRICKS after entering your name, company, email, and title.
Step 4:
In the Choose a cloud provider dialogue, click the Get Started with Community Edition link.
To validate your email address, locate the welcome email and click the link. Create your Databricks password when prompted. You will be directed to the Databricks Community Edition home page after clicking.
Get Started.
Here are the instructions to open a notebook in Databricks.
Step 1:
First, import the notebook link into the Databricks workspace.
Step 2:
Here we import the notebook from the link.
Step 3:
Create a new cluster, or use an existing one if you have already created it.
Step 4:
In this step, we select the cluster type. Since we are working with ML, we choose the ML option.
Connect the notebook to the cluster. Run the notebook’s cells one at a time or all at once. Examine the outcomes and change the code if necessary.
At this link, we have five notebooks that show how to use large language models and vector databases to enable semantic product search. Semantic search is a way of finding products that match the meaning and intent of a user query rather than just its keywords. This can improve the user experience and satisfaction and increase the conversion rate and revenue for online retailers.
The first notebook introduces the concept of semantic search and how large language models can learn the associations between words from a large corpus of documents. It also explains how to access the Wayfair Annotation Dataset (WANDS), which contains over 42,000 products, 480 queries, and 233,000 labels for product relevance.
Users of the solution accelerator can investigate and contrast the performance of an off-the-shelf model and a fine-tuned model trained exclusively on product text. The accelerator assists users in deploying models within the Databricks environment and guarantees smooth integration with semantic search features. In this way, users can improve their search capabilities and effectively extract pertinent insights from product-related data.
To properly utilise LLMs in Databricks, it is essential to have a workspace that supports GPU-based clusters and the Databricks model serving functionality. The availability of GPU clusters, however, is determined by your cloud provider and the quotas allocated to your cloud subscription. It is also significant that the Databricks model serving feature is currently restricted to particular AWS and Azure regions.
Ensure you have access to the required resources in your Databricks environment before continuing. Verify that you can use the model serving capability and have GPU-enabled clusters available for seamless deployment.
Please be aware that because of LLMs’ high computational and memory demands, GPU clusters are instrumental in this context. These clusters give you the ability to develop and improve models.
The second notebook shows how to use an off-the-shelf model like DistilBERT to generate embeddings for the product text and user queries. Embeddings are numerical representations of the meaning and context of a text. The notebook also shows how to store these embeddings in a vector database, such as Chroma, which enables fast and efficient similarity search.
Product text relevant to the search
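A compact sketch of this pattern, using a DistilBERT-based sentence-transformers model as a stand-in for the notebook’s exact choice, an in-memory Chroma instance, and invented product texts:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# DistilBERT-based embedding model (an example stand-in).
model = SentenceTransformer("multi-qa-distilbert-cos-v1")

products = ["Mid-century walnut coffee table", "Blue velvet accent chair"]
product_embeddings = model.encode(products).tolist()

# In-memory Chroma client; production code would use a persistent store.
client = chromadb.Client()
collection = client.create_collection("products")
collection.add(ids=["p1", "p2"], embeddings=product_embeddings, documents=products)

# Embed the user query and retrieve the closest product.
query_embedding = model.encode(["wooden table for the living room"]).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=1)
print(results["documents"])
```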
The third notebook demonstrates how to fine-tune the model using the WANDS dataset to better capture the nuances and specificities of the product domain. Fine-tuning is a process of adjusting the model parameters using additional data to improve its performance on a specific task. Here is the performance score.
Baseline Model Performance Estimation
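A sketch of what such a fine-tuning loop can look like with the classic sentence-transformers training API; the query-product pairs and relevance labels below are invented, whereas the notebook derives real ones from WANDS:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Invented (query, product) pairs with relevance labels in [0, 1].
train_examples = [
    InputExample(texts=["outdoor dining set", "7-piece patio dining table"], label=1.0),
    InputExample(texts=["outdoor dining set", "kids bunk bed"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("multi-qa-distilbert-cos-v1")
train_loss = losses.CosineSimilarityLoss(model)

# Adjust the model so relevant pairs get high cosine similarity.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```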
The fourth notebook explains how to package and persist the model and the vector database so that they can be easily deployed and reused later.
The fifth notebook shows how to deploy the model using Databricks Model Serving, a feature that allows users to create REST APIs for their models with just a few clicks. The notebook also shows how to test the API using curl commands and Postman.
Deploy Model
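As a sketch, the deployed endpoint can also be queried from Python; the workspace URL, endpoint name, and payload schema are placeholders to adapt to your own deployment:

```python
import os
import requests

# Placeholders: substitute your workspace URL, endpoint name, and token.
url = "https://<databricks-instance>/serving-endpoints/<endpoint-name>/invocations"
headers = {
    "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
    "Content-Type": "application/json",
}
payload = {"dataframe_records": [{"query": "wooden table for the living room"}]}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```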
LLM-powered semantic search can completely transform product search by producing extraordinarily accurate and relevant results. Businesses can improve their product search capabilities and better understand customer intent by utilising word associations and fine-tuning models with catalogue-specific data. Integrating embeddings and vector databases further improves search efficiency by enabling quick retrieval of relevant products. Deploying LLM models via REST APIs improves customers’ overall buying experience and opens new opportunities for e-commerce platforms.
Author
Faizan Zahid
Associate Consultant
Ibad Khan
Intern