19 changes: 19 additions & 0 deletions .dockerignore
@@ -0,0 +1,19 @@
# Ignore git objects
.git/
.gitignore
.gitlab-ci.yml
.gitmodules

# Ignore temporary volumes
deploy/compose/volumes

# Ignore files used only for creating a docker image
.dockerignore

# Ignore any virtual environment configuration files
.env*
.venv/
env/
# Ignore python bytecode files
*.pyc
__pycache__/
4 changes: 4 additions & 0 deletions .gitignore
@@ -24,3 +24,7 @@ docs/_*
docs/notebooks
docs/experimental
docs/tools

# Developing examples
RetrievalAugmentedGeneration/examples/simple_rag_api_catalog/
deploy/compose/simple-rag-api-catalog.yaml
14 changes: 14 additions & 0 deletions .pre-commit-config.yaml
@@ -9,3 +9,17 @@ repos:
args:
- --license-filepath
- RetrievalAugmentedGeneration/LICENSE.md
- repo: https://github.com/psf/black
rev: 19.10b0
hooks:
- id: black
args: ["--skip-string-normalization", "--line-length=119"]
additional_dependencies: ['click==8.0.4']
- repo: https://github.com/pycqa/isort
rev: 5.12.0
hooks:
- id: isort
name: isort (python)
args: ["--multi-line=3", "--trailing-comma", "--force-grid-wrap=0", "--use-parenthese", "--line-width=119", "--ws"]


32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,38 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [0.7.0] - 2024-06-18

This release switches all examples to use cloud-hosted, GPU-accelerated LLM and embedding models from the [Nvidia API Catalog](https://build.nvidia.com) by default. It also deprecates support for deploying on-prem models with the NeMo Inference Framework Container and adds support for deploying accelerated generative AI models across the cloud, data center, and workstations using the [latest Nvidia NIM-LLM](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).
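As an illustration of the new default flow, the sketch below calls a cloud-hosted API Catalog model through `langchain-nvidia-ai-endpoints` and then points the same client at a locally deployed NIM endpoint. The model IDs, local URL, and the `NVIDIA_API_KEY` environment variable are assumptions for illustration only; this is not code from the examples themselves.

```python
# Minimal sketch (not the chain server's actual code): query a cloud-hosted
# API Catalog model, then point the same client at a locally deployed NIM.
# Assumes `pip install langchain-nvidia-ai-endpoints` and that NVIDIA_API_KEY
# is set; the model IDs and local URL are illustrative.
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Default path in 0.7.0: a hosted model from https://build.nvidia.com
cloud_llm = ChatNVIDIA(model="meta/llama3-70b-instruct")
print(cloud_llm.invoke("What is retrieval-augmented generation?").content)

# Optional on-prem path: a NIM-LLM container exposing an OpenAI-compatible API
local_llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")
print(local_llm.invoke("Summarize this release in one sentence.").content)
```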

### Added
- Added model [auto download and caching support for `nemo-retriever-embedding-microservice` and `nemo-retriever-reranking-microservice`](./deploy/compose/docker-compose-nim-ms.yaml). Updated steps to deploy the services can be found [here](https://nvidia.github.io/GenerativeAIExamples/latest/nim-llms.html).
- [Multimodal RAG Example enhancements](https://nvidia.github.io/GenerativeAIExamples/latest/multimodal-data.html)
  - Moved to the [PDF Plumber library](https://pypi.org/project/pdfplumber/) for parsing text and images (see the usage sketch after this list).
  - Added `pgvector` vector DB support.
  - Added support for ingesting files with the `.pptx` extension.
  - Improved accuracy of image parsing by using [tesseract-ocr](https://pypi.org/project/tesseract-ocr/).
- Added a [new notebook showcasing a RAG use case with accelerated, NIM-based, on-prem deployed models](./notebooks/08_RAG_Langchain_with_Local_NIM.ipynb).
- Added a [new experimental example](./experimental/rag-developer-chatbot/) showcasing how to create a developer-focused RAG chatbot using RAPIDS cuDF source code and API documentation.
- Added a [new experimental example](./experimental/event-driven-rag-cve-analysis/) demonstrating how NVIDIA Morpheus, NIMs, and RAG pipelines can be integrated to create LLM-based agent pipelines.
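The PDF Plumber switch noted above can be pictured with a minimal sketch of generic `pdfplumber` usage; the input file name is hypothetical and this is not the multimodal example's actual ingestion code.

```python
# Minimal pdfplumber sketch: extract text and count embedded images per page.
# Generic library usage for illustration; not the multimodal example's code.
import pdfplumber

with pdfplumber.open("sample_report.pdf") as pdf:  # hypothetical input file
    for page_number, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""
        print(f"page {page_number}: {len(text)} characters, {len(page.images)} images")
```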

### Changed
- All examples now use llama3 models from the [Nvidia API Catalog](https://build.nvidia.com/search?term=llama3) by default. A summary of the updated examples and the models they use is available [here](https://nvidia.github.io/GenerativeAIExamples/latest/index.html#developer-rag-examples).
- Switched the default embedding model of all examples to the [Snowflake arctic-embed-l model](https://build.nvidia.com/snowflake/arctic-embed-l).
- Added more verbose logs and support for configuring the [log level of the chain server using the LOG_LEVEL environment variable](https://nvidia.github.io/GenerativeAIExamples/latest/configuration.html#chain-server) (a minimal sketch of the pattern follows this list).
- Bumped the versions of the `langchain-nvidia-ai-endpoints` and `sentence-transformers` packages and the `milvus` containers.
- Updated base containers to use the Ubuntu 22.04 image `nvcr.io/nvidia/base/ubuntu:22.04_20240212`.
- Added `llama-index-readers-file` as a dependency to avoid runtime package installation within the chain server.
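The `LOG_LEVEL` support mentioned above follows the common pattern of reading an environment variable at startup; the sketch below illustrates that pattern only and is not the chain server's actual logging setup.

```python
# Minimal sketch of LOG_LEVEL-driven logging, assuming the usual pattern of
# reading an environment variable at startup; not the chain server's real code.
import logging
import os

level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
logging.getLogger("chain_server").info("Log level set to %s", level_name)
```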


### Deprecated
- Deprecated support for on-prem LLM model deployment using the [NeMo Inference Framework Container](https://github.com/NVIDIA/GenerativeAIExamples/blob/v0.6.0/deploy/compose/rag-app-text-chatbot.yaml#L2). Developers can use [Nvidia NIM-LLM to deploy TensorRT-optimized models on-prem and plug them into the existing examples](https://nvidia.github.io/GenerativeAIExamples/latest/nim-llms.html).
- Deprecated [Kubernetes operator support](https://github.com/NVIDIA/GenerativeAIExamples/tree/v0.6.0/deploy/k8s-operator/kube-trailblazer).
- The `nvolveqa_40k` embedding model was deprecated from the [Nvidia API Catalog](https://build.nvidia.com). Updated all [notebooks](./notebooks/) and [experimental artifacts](./experimental/) to use the [Nvidia embed-qa-4 model](https://build.nvidia.com/nvidia/embed-qa-4) instead.
- Removed [notebooks numbered 00-04](https://github.com/NVIDIA/GenerativeAIExamples/tree/v0.6.0/notebooks), which relied on on-prem LLM deployment with the deprecated [NeMo Inference Framework Container](https://github.com/NVIDIA/GenerativeAIExamples/blob/v0.6.0/deploy/compose/rag-app-text-chatbot.yaml#L2).


## [0.6.0] - 2024-05-07

### Added
32 changes: 20 additions & 12 deletions README.md
@@ -8,7 +8,7 @@ State-of-the-art Generative AI examples that are easy to deploy, test, and exten

## NVIDIA NGC

Generative AI Examples can use models and GPUs from the [NVIDIA NGC: AI Development Catalog](https://catalog.ngc.nvidia.com).
Generative AI Examples can use models and GPUs from the [NVIDIA API Catalog](https://catalog.ngc.nvidia.com).

Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:

@@ -27,34 +27,32 @@ The examples demonstrate how to combine NVIDIA GPU acceleration with popular LLM
The examples are easy to deploy with [Docker Compose](https://docs.docker.com/compose/).

Examples support local and remote inference endpoints.
If you have a GPU, you can run inference locally with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
If you have a GPU, you can run inference locally with an [NVIDIA NIM for LLMs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nim/containers/nim_llm).
If you don't have a GPU, you can run inference and embedding remotely with [NVIDIA API Catalog endpoints](https://build.nvidia.com/explore/discover).
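As a rough sketch of the remote path, the snippet below embeds a query with a hosted API Catalog embedding model via `langchain-nvidia-ai-endpoints`; the model ID and the `NVIDIA_API_KEY` environment variable are illustrative assumptions rather than code from these examples.

```python
# Rough sketch of remote embedding against an API Catalog endpoint; assumes
# `pip install langchain-nvidia-ai-endpoints` and that NVIDIA_API_KEY is set.
# The model ID is illustrative, not a requirement of these examples.
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(model="snowflake/arctic-embed-l")
vector = embedder.embed_query("How do I deploy the examples with Docker Compose?")
print(len(vector))  # embedding dimensionality
```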

| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA Endpoints | Triton | Vector Database |
| ---------------------------------- | ---------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | ------- | ---------------- | ------ | ------------------ |
| mixtral_8x7b | ai-embed-qa-4 | LangChain | NVIDIA API Catalog endpoints chat bot [[code](./RetrievalAugmentedGeneration/examples/nvidia_api_catalog/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/api-catalog.html)] | No | No | Yes | Yes | Milvus or pgvector |
| llama-2 | UAE-Large-V1 | LlamaIndex | Canonical QA Chatbot [[code](./RetrievalAugmentedGeneration/examples/developer_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/local-gpu.html)] | [Yes](https://nvidia.github.io/GenerativeAIExamples/latest/multi-gpu.html) | Yes | No | Yes | Milvus or pgvector |
| llama-2 | all-MiniLM-L6-v2 | LlamaIndex | Chat bot, GeForce, Windows [[repo](https://github.com/NVIDIA/trt-llm-rag-windows/tree/release/1.0)] | No | Yes | No | No | FAISS |
| llama-2 | ai-embed-qa-4 | LangChain | Chat bot with query decomposition agent [[code](./RetrievalAugmentedGeneration/examples/query_decomposition_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/query-decomposition.html)] | No | No | Yes | Yes | Milvus or pgvector |
| mixtral_8x7b | ai-embed-qa-4 | LangChain | Minimalistic example: RAG with NVIDIA AI Foundation Models [[code](./examples/5_mins_rag_no_gpu/), [README](./examples/README.md#rag-in-5-minutes-example)] | No | No | Yes | Yes | FAISS |
| mixtral_8x7b<br>Deplot<br>Neva-22b | ai-embed-qa-4 | Custom | Chat bot with multimodal data [[code](./RetrievalAugmentedGeneration/examples/multimodal_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/multimodal-data.html)] | No | No | Yes | No | Milvus or pgvector |
| llama-2 | UAE-Large-V1 | LlamaIndex | Chat bot with quantized LLM model [[docs](https://nvidia.github.io/GenerativeAIExamples/latest/quantized-llm-model.html)] | Yes | Yes | No | Yes | Milvus or pgvector |
| llama3-70b | snowflake-arctic-embed-l | LangChain | NVIDIA API Catalog endpoints chat bot [[code](./RetrievalAugmentedGeneration/examples/nvidia_api_catalog/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/api-catalog.html)] | No | No | Yes | Yes | Milvus or pgvector |
| llama3-8b | snowflake-arctic-embed-l | LlamaIndex | Canonical QA Chatbot [[code](./RetrievalAugmentedGeneration/examples/developer_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/api-catalog.html#using-the-llamaindex-data-framework)] | [Yes](https://nvidia.github.io/GenerativeAIExamples/latest/multi-gpu.html) | Yes | No | Yes | Milvus or pgvector |
| llama3-70b | snowflake-arctic-embed-l | LangChain | Chat bot with query decomposition agent [[code](./RetrievalAugmentedGeneration/examples/query_decomposition_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/query-decomposition.html)] | No | No | Yes | Yes | Milvus or pgvector |
| llama3-70b | ai-embed-qa-4 | LangChain | Minimalistic example: RAG with NVIDIA AI Foundation Models [[code](./examples/5_mins_rag_no_gpu/), [README](./examples/README.md#rag-in-5-minutes-example)] | No | No | Yes | Yes | FAISS |
| llama3-8b<br>Deplot<br>Neva-22b | snowflake-arctic-embed-l | Custom | Chat bot with multimodal data [[code](./RetrievalAugmentedGeneration/examples/multimodal_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/multimodal-data.html)] | No | No | Yes | No | Milvus or pgvector |
| llama3-70b | none | PandasAI | Chat bot with structured data [[code](./RetrievalAugmentedGeneration/examples/structured_data_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/structured-data.html)] | No | No | Yes | No | none |
| llama-2 | ai-embed-qa-4 | LangChain | Chat bot with multi-turn conversation [[code](./RetrievalAugmentedGeneration/examples/multi_turn_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/multi-turn.html)] | No | No | Yes | No | Milvus or pgvector |
| llama3-8b | snowflake-arctic-embed-l | LangChain | Chat bot with multi-turn conversation [[code](./RetrievalAugmentedGeneration/examples/multi_turn_rag/), [docs](https://nvidia.github.io/GenerativeAIExamples/latest/multi-turn.html)] | No | No | Yes | No | Milvus or pgvector |

### Enterprise RAG Examples

The enterprise RAG examples run as microservices distributed across multiple VMs and GPUs.
These examples show how to orchestrate RAG pipelines with [Kubernetes](https://kubernetes.io/) and deploy them with [Helm](https://helm.sh/).

Enterprise RAG examples include a [Kubernetes operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) for LLM lifecycle management.
It is compatible with the [NVIDIA GPU operator](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/gpu-operator) that automates GPU discovery and lifecycle management in a Kubernetes cluster.
It is compatible with the [NVIDIA GPU Operator](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/gpu-operator) that automates GPU discovery and lifecycle management in a Kubernetes cluster.

Enterprise RAG examples also support local and remote inference with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [NVIDIA API Catalog endpoints](https://build.nvidia.com/explore/discover).

| Model | Embedding | Framework | Description | Multi-GPU | Multi-node | TRT-LLM | NVIDIA Endpoints | Triton | Vector Database |
| ------- | ----------- | ---------- | -------------------------------------------------------------------------- | --------- | ---------- | ------- | ---------------- | ------ | --------------- |
| llama-2 | NV-Embed-QA | LlamaIndex | Chat bot, Kubernetes deployment [[README](./docs/developer-llm-operator/)] | No | No | Yes | No | Yes | Milvus |
| llama-3 | nv-embed-qa-4 | LlamaIndex | Chat bot, Kubernetes deployment [[chart](https://registry.ngc.nvidia.com/orgs/ohlfw0olaadg/teams/ea-participants/helm-charts/rag-app-text-chatbot)] | No | No | Yes | No | Yes | Milvus |


### Generative AI Model Examples
@@ -89,6 +87,16 @@ These are open source connectors for NVIDIA-hosted and self-hosted API endpoints
|[NVIDIA Triton Inference Server](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_triton.html) | [LlamaIndex](https://www.llamaindex.ai/) |Yes|Yes|No|Triton inference server provides API access to hosted LLM models over gRPC. |
|[NVIDIA TensorRT-LLM](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_tensorrt.html) | [LlamaIndex](https://www.llamaindex.ai/) |Yes|Yes|No|TensorRT-LLM provides a Python API to build TensorRT engines with state-of-the-art optimizations for LLM inference on NVIDIA GPUs. |


## Related NVIDIA RAG Projects

- [NVIDIA Tokkio LLM-RAG](https://docs.nvidia.com/ace/latest/workflows/tokkio/text/Tokkio_LLM_RAG_Bot.html): Use Tokkio to add avatar animation for RAG responses.

- [RAG on Windows using TensorRT-LLM and LlamaIndex](https://github.com/NVIDIA/ChatRTX): Create RAG chatbots on Windows using TensorRT-LLM.

- [Hybrid RAG Project on AI Workbench](https://github.com/NVIDIA/workbench-example-hybrid-rag): Run an NVIDIA AI Workbench example project for RAG.


## Support, Feedback, and Contributing

We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback.
18 changes: 15 additions & 3 deletions RetrievalAugmentedGeneration/Dockerfile
@@ -1,5 +1,5 @@
ARG BASE_IMAGE_URL=nvcr.io/nvidia/base/ubuntu
ARG BASE_IMAGE_TAG=20.04_x64_2022-09-23
ARG BASE_IMAGE_TAG=22.04_20240212

FROM ${BASE_IMAGE_URL}:${BASE_IMAGE_TAG}

@@ -11,7 +11,7 @@ RUN apt update && \
apt install -y curl software-properties-common libgl1 libglib2.0-0 && \
add-apt-repository ppa:deadsnakes/ppa && \
apt update && apt install -y python3.10 python3.10-dev python3.10-distutils && \
apt-get clean
apt-get clean

# Install pip for python3.10
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
@@ -24,20 +24,32 @@ RUN apt autoremove -y curl software-properties-common
# Install common dependencies for all examples
RUN --mount=type=bind,source=RetrievalAugmentedGeneration/requirements.txt,target=/opt/requirements.txt \
pip3 install --no-cache-dir -r /opt/requirements.txt

# Install any example specific dependency if available
ARG EXAMPLE_NAME
COPY RetrievalAugmentedGeneration/examples/${EXAMPLE_NAME} /opt/RetrievalAugmentedGeneration/example
RUN if [ -f "/opt/RetrievalAugmentedGeneration/example/requirements.txt" ] ; then \
pip3 install --no-cache-dir -r /opt/RetrievalAugmentedGeneration/example/requirements.txt ; else \
echo "Skipping example dependency installation, since requirements.txt was not found" ; \
fi
RUN python3.10 -m nltk.downloader averaged_perceptron_tagger

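# Extra system packages for the multimodal example: LibreOffice for document conversion and Tesseract for OCR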
RUN if [ "${EXAMPLE_NAME}" = "multimodal_rag" ] ; then \
apt update && \
apt install -y libreoffice && \
apt install -y tesseract-ocr ; \
fi
# Copy required common modules for all examples
COPY RetrievalAugmentedGeneration/__init__.py /opt/RetrievalAugmentedGeneration/
COPY RetrievalAugmentedGeneration/common /opt/RetrievalAugmentedGeneration/common
COPY integrations /opt/integrations
COPY tools /opt/tools

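# Prepare a world-writable cache directory for NLTK data and Hugging Face downloads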
RUN mkdir /tmp-data/; mkdir /tmp-data/nltk_data/
RUN chmod 777 -R /tmp-data
RUN chown 1000:1000 -R /tmp-data
ENV NLTK_DATA=/tmp-data/nltk_data/
ENV HF_HOME=/tmp-data

WORKDIR /opt
ENTRYPOINT ["uvicorn", "RetrievalAugmentedGeneration.common.server:app"]