
Redefining Chatbots: GPT-4o's Multimodal Interaction Innovation

Author: ChatGPT sweeper

Original article: Multimodal Chatbot with Text and Audio Using GPT 4o

Introduction

Since OpenAI launched its GPT family of models, the landscape of natural language processing has changed completely and shifted toward what is now called generative AI. At their core, large language models can understand complex human queries and generate relevant answers. The next step in their development is multimodality, i.e., the ability to understand data other than text, including images, audio, and video. A number of multimodal models have recently been released, both open source and closed source, such as Google's Gemini, LLaVA, and GPT-4V. OpenAI has now announced a new multimodal model called GPT-4o (Omni). In this article, we will build a multimodal chatbot using GPT-4o.


Learning Objectives

  • Learn about GPT-4o's capabilities in text, audio, and image generation
  • Learn to create helper functions for processing user input (images and audio) for use by GPT-4o
  • Build a multimodal chatbot that interacts with GPT-4o using Chainlit
  • Implement the ability to process text, image, and audio input in the chatbot
  • Understand the benefits of using a multimodal approach to chatbot interactions
  • Explore potential applications of GPT-4o's multimodal capabilities


What is GPT-4o?

OpenAI's recently announced GPT-4o marks a major leap forward in AI in terms of speed, accuracy, and understanding, with the ability to work with text, audio, and images. This "omni" model can translate between languages, write creative content, analyze or generate speech with different intonations, and even describe real-world scenes or create images from your descriptions. Beyond these capabilities, GPT-4o integrates seamlessly with ChatGPT, allowing real-time conversations in which it can recognize visual information and ask relevant questions. This ability to interact across modalities paves the way for more natural and intuitive interaction with computers, with the potential to help visually impaired users and enable new artistic mediums. As a groundbreaking next-generation model, GPT-4o pushes the boundaries of AI.

Create helper functions

In this section, we'll start coding the multimodal chatbot. The first step is to install the necessary libraries. To do this, we run the following command:

pip install openai chainlit           

Running this command installs the OpenAI library. This gives us access to different OpenAI models, including text-generation models (such as GPT-4o and GPT-3.5), image-generation models (such as DALL·E 3), and speech-to-text models (such as Whisper).
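Note that the OpenAI client reads the API key from the OPENAI_API_KEY environment variable (for example from a .env file). A minimal sanity check, sketched here as an assumption rather than part of the original code, could look like this:

import os

# Hypothetical sanity check: the OpenAI client looks for the API key in the
# OPENAI_API_KEY environment variable, so make sure it is set before starting the app.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running the chatbot")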

We install chainlit to create the user interface. The Chainlit library lets us create fast chatbots entirely in Python, without writing any JavaScript, HTML, or CSS. Before we start building the chatbot, we need to create some helper functions. The first is a function that processes images. We cannot pass images to the model directly; we need to encode them as base64 first. To do this, we use the following code:

import base64
def image2base64(image_path):
    with open(image_path, "rb") as img:
        encoded_string = base64.b64encode(img.read())
    return encoded_string.decode("utf-8")           
  • First, we import the base64 library for encoding. Then we create a function called image2base64() that takes an image path as input.
  • Next, we open the image at the provided path in read-bytes mode and call the base64 library's b64encode() method to encode its contents as base64.
  • The encoded data is in bytes, so we need to convert it to the string format the model expects, so the image can be passed along with the user query.
  • We therefore call the decode() method, which takes the base64-encoded bytes and decodes them as UTF-8, giving us a base64-encoded string.
  • Finally, we return the base64-encoded string, which can be combined with the text sent to the GPT-4o model (see the short usage sketch after this list).
  • At a high level, we convert the image to base64-encoded bytes and then decode those bytes into a base64-encoded string.
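As a quick usage sketch (the file name screenshot.png is a hypothetical placeholder), the returned string is wrapped into a data URL before being sent to GPT-4o, exactly as we do later in the chat handler:

# Hypothetical usage: encode a local image and wrap it in a data URL for GPT-4o.
base64_image = image2base64("screenshot.png")  # "screenshot.png" is a placeholder path
image_url = f"data:image/png;base64,{base64_image}"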

A multimodal chatbot also needs to handle audio input, so we must process audio before sending it to the model. For this, we use OpenAI's speech-to-text model. The code is shown below.

from openai import OpenAI
client = OpenAI()
def audio_process(audio_path):
    audio_file = open(audio_path, "rb")
    transcription = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )
    return transcription.text           
  • We start by importing the OpenAI class from the openai library. Make sure the OpenAI API key is present in the .env file.
  • Next, we instantiate an OpenAI object and store it in the variable client.
  • Now we create a function called audio_process() that expects the path to an audio file.
  • We then open the file in read-bytes mode and store the resulting file object in the variable audio_file.
  • We then call client.audio.transcriptions.create() and pass audio_file to it. We also specify which model to use for speech-to-text conversion; here we use "whisper-1".
  • Calling this function sends the contents of audio_file to the model and stores the corresponding transcript in the transcription variable.
  • Finally, we return the text attribute, which is the actual transcribed text of the audio (see the usage sketch after this list).
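As a small usage sketch (the file name speech.mp3 matches the example we test later and is an assumption here), the helper can be called like this:

# Hypothetical usage: transcribe a local audio file before sending the text to GPT-4o.
transcript = audio_process("speech.mp3")  # "speech.mp3" is a placeholder path
print(transcript)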

We can't predict what type of message a user will send to the model. Sometimes a user sends only plain text, sometimes the message includes an image, and sometimes it includes an audio file. Based on that, we need to adjust the message sent to the OpenAI model. To do this, we write another function that passes the different kinds of user input to the model. The code is as follows:

def append_messages(image_url=None, query=None, audio_transcript=None):
    message_list = []
    if image_url:
        message_list.append({"type": "image_url", "image_url": {"url": image_url}})
    if query and not audio_transcript:
        message_list.append({"type": "text", "text": query})
    if audio_transcript:
        message_list.append({"type": "text", "text": query + "\n" + audio_transcript})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message_list}],
        max_tokens=1024,
    )
    return response.choices[0]           
  • We create a function called append_messages() with three parameters: image_url, the user query, and an audio transcription.
  • We then create an empty list and assign it to the variable message_list.
  • Next, we check whether an image_url is provided. If so, we append a dictionary with the key "type" set to "image_url" and the key "image_url" set to another dictionary containing the URL passed to the function. This is the format in which OpenAI expects images to be sent.
  • We then check whether a query is provided without an audio transcription. If so, we append another dictionary to message_list containing the key-value pairs shown in the code.
  • Finally, we check whether an audio transcription is provided. If so, we combine it with the user query and append it to the message list.
  • We then call client.chat.completions.create(), passing in the model (in this case GPT-4o) along with the message_list we created, and store the result in the response variable.
  • We also set max_tokens to 1024. In addition, we could provide parameters such as temperature and top_p.
  • Finally, we return response.choices[0], which contains the response generated by GPT-4o (Omni). A short sketch of calling this helper directly follows this list.
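A minimal sketch of how this helper can be called directly, outside the Chainlit UI (the query text and image path are illustrative assumptions):

# Hypothetical direct call with a plain-text query (no image or audio).
choice = append_messages(query="Explain what a large language model is in one sentence.")
print(choice.message.content)

# Hypothetical call that also includes an image encoded as a data URL.
image_url = f"data:image/png;base64,{image2base64('screenshot.png')}"
choice = append_messages(image_url=image_url, query="What is shown in this image?")
print(choice.message.content)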

With that, our helper functions are ready. We'll call them later to pass the user query, audio, and image data to the model and get back a response.

Build a chatbot interface

Now we'll build the UI part of the chatbot. This can be done easily with the Chainlit library. The code goes in the same file where the helper functions are defined. Here's the code:

import chainlit as cl
@cl.on_message
async def chat(msg: cl.Message):
    images = [file for file in msg.elements if "image" in file.mime]
    audios = [file for file in msg.elements if "audio" in file.mime]
    if len(images) > 0:
        base64_image = image2base64(images[0].path)
        image_url = f"data:image/png;base64,{base64_image}"
    elif len(audios) > 0:
        text = audio_process(audios[0].path)
    response_msg = cl.Message(content="")
    if len(images) == 0 and len(audios) == 0:
        response = append_messages(query=msg.content)
    elif len(audios) == 0:
        response = append_messages(image_url=image_url, query=msg.content)
    else:
        response = append_messages(query=msg.content, audio_transcript=text)
    response_msg.content = response.message.content
    await response_msg.send()           
  • We start by importing the chainlit library. Then we apply the @cl.on_message decorator, which tells Chainlit to run the following function whenever the user enters a message.
  • The Chainlit library expects an asynchronous function, so we define an async function called chat that accepts a cl.Message object. Whether the user sends text, audio, or images, everything is contained in this cl.Message.
  • msg.elements contains the list of attachments sent along with the user's message, such as images or audio files accompanying a plain-text query.
  • We therefore check whether the user message contains any images or audio files and store them in the images and audios variables.
  • In the if block, we check for the presence of an image; if one is found, we convert it to a base64 string and store it in the image_url variable in the data-URL format expected by the GPT-4o model.
  • We also check whether audio is present; if it is, we process it by calling the audio_process() function, which returns the transcription, and store the result in a variable called text.
  • We then create a message placeholder, response_msg, with empty content.
  • We then check again for images or audio. If neither exists, we send only the user query to the append_messages function; if there is an image or audio, we pass that along to append_messages as well.
  • We store the model's result in the response variable and assign its content to the placeholder message response_msg created earlier.
  • Finally, we call response_msg.send(), which displays the response in the user interface. (An optional greeting hook is sketched after this list.)
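As an optional addition that is not part of the original code, Chainlit also provides an @cl.on_chat_start hook, which can be used to greet the user when a new chat session begins. A small sketch:

# Optional addition: greet the user when a new chat session starts.
@cl.on_chat_start
async def start():
    await cl.Message(
        content="Hi! Send me text, an image, or an audio file and I'll respond using GPT-4o."
    ).send()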

The final code is shown below:

from openai import OpenAI
import base64
import chainlit as cl

client = OpenAI()


def append_messages(image_url=None, query=None, audio_transcript=None):
    # Build the message content list based on which inputs are present,
    # then send it to GPT-4o and return the first choice.
    message_list = []
    if image_url:
        message_list.append({"type": "image_url", "image_url": {"url": image_url}})
    if query and not audio_transcript:
        message_list.append({"type": "text", "text": query})
    if audio_transcript:
        message_list.append({"type": "text", "text": query + "\n" + audio_transcript})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message_list}],
        max_tokens=1024,
    )
    return response.choices[0]


def image2base64(image_path):
    # Encode an image file as a base64 string so it can be embedded in a data URL.
    with open(image_path, "rb") as img:
        encoded_string = base64.b64encode(img.read())
    return encoded_string.decode("utf-8")


def audio_process(audio_path):
    # Transcribe an audio file to text using OpenAI's Whisper model.
    audio_file = open(audio_path, "rb")
    transcription = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )
    return transcription.text


@cl.on_message
async def chat(msg: cl.Message):
    # Separate any image and audio attachments from the incoming message.
    images = [file for file in msg.elements if "image" in file.mime]
    audios = [file for file in msg.elements if "audio" in file.mime]
    if len(images) > 0:
        base64_image = image2base64(images[0].path)
        image_url = f"data:image/png;base64,{base64_image}"
    elif len(audios) > 0:
        text = audio_process(audios[0].path)
    response_msg = cl.Message(content="")
    # Route the query (plus image or audio transcript, if any) to GPT-4o.
    if len(images) == 0 and len(audios) == 0:
        response = append_messages(query=msg.content)
    elif len(audios) == 0:
        response = append_messages(image_url=image_url, query=msg.content)
    else:
        response = append_messages(query=msg.content, audio_transcript=text)
    response_msg.content = response.message.content
    await response_msg.send()

To run this code, type chainlit run app.py in the terminal, assuming the code is stored in a file named app.py. After running this command, the app becomes available on localhost:8000 and the Chainlit chat interface opens in the browser.


Now let's enter a plain text query and see the resulting output.


We can see that GPT-4o successfully generates output for the user query. We also notice that code in the response is syntax-highlighted and can be quickly copied. This is all handled by Chainlit, which manages the underlying HTML, CSS, and JavaScript. Next, let's try uploading an image and asking the model about it.


Here, the model responds well to the image we uploaded. It recognizes the image as an emoji and provides information about what it is and where it can be used. Now let's pass in an audio file and test it.


The speech.mp3 audio file contains information about machine learning, so we asked the model to summarize its content. The model generates a summary of the content in the audio file.

Conclusion

In conclusion, the development of a multimodal chatbot using OpenAI's GPT-4o (Omni) marks a significant step forward in AI technology and opens a new era of interactive experiences. Here, we explore how to seamlessly integrate text, image, and audio input into conversations with chatbots, taking full advantage of GPT-4o's capabilities. This innovative approach enhances user engagement and opens the door to different real-world applications, from helping visually impaired users to creating new artistic mediums. By combining the ability of language understanding with multimodal capabilities, GPT-4o has demonstrated its potential to revolutionize the way people interact with AI systems.

Key Takeaways

  • GPT-4o represents a milestone for artificial intelligence, providing the ability to understand and generate text, audio, and images through a single model
  • Integrating multimodal capabilities into chatbots makes interactions more natural and intuitive, improving user engagement and experience
  • Helper functions play a crucial role in pre-processing inputs, such as encoding images to base64 and transcribing audio files before passing them to the model
  • The append_messages function dynamically adjusts the input format based on the user's query, so it can accept plain text, images, and audio transcriptions
  • The Chainlit library simplifies the development of chatbot interfaces, handling the underlying HTML, CSS, and JavaScript, making UI creation straightforward
