Function-calling and data extraction with LLMs

youngerjesus 2024. 8. 9. 03:13

This post summarizes the course Function-calling and data extraction with LLMs.

The course covers how to use Nexusflow's fine-tuned sLLM to extract structured data from unstructured data.

Topics covered:

What function calling is and how you can use it
Creating prompts with function definitions and using the LLM response to call those functions
Defining multiple functions where the arguments to one function are themselves function calls
Converting an OpenAPI Specification into functions callable by your LLM

What is function calling

Let's start with an example that shows function calling.

Suppose we have the following code. We can have an LLM call this function from a natural-language request.

from matplotlib import pyplot as plt

def plot_some_points(x : list, y : list):
  """
  Plots some points!
  """
  plt.plot(x, y)
  plt.show()

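In other words, given a question such as "Hey can you plot y=10x where x=1, 2, 3 for me?" together with the function definition, the LLM can produce the function call.

The code looks like this:

USER_QUERY = "Hey can you plot y=10x where x=1, 2, 3 for me?"

prompt = \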
f'''
Function:
def plot_some_points(x : list, y : list):
  """
  Plots some points!
  """
  plt.plot(x, y)
  plt.show()

User Query: {USER_QUERY}<human_end>
'''

from utils import query_raven
function_call = query_raven(prompt)

# Prints something like: plot_some_points(x=[1, 2, 3], y=[10, 20, 30])
print (function_call)

# Run the call. exec() executes the string as Python code in the current interpreter.
exec(function_call)
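
Running model output through exec() executes whatever the model wrote. As a minimal sketch of a more defensive alternative (not something the course shows), the call string can be parsed with the standard ast module and dispatched only to functions on an allow-list. The names ALLOWED_FUNCTIONS and run_function_call are hypothetical, and plot_some_points is replaced by a stand-in that returns the points so the example runs without matplotlib.

```python
import ast

def plot_some_points(x: list, y: list):
    """Stand-in for the plotting function: returns the points instead of drawing them."""
    return list(zip(x, y))

# Hypothetical allow-list: only functions named here may be called by the model.
ALLOWED_FUNCTIONS = {"plot_some_points": plot_some_points}

def run_function_call(call_text: str):
    """Parse a single call such as f(a=1, b=[2, 3]) and dispatch it to an allowed function."""
    call = ast.parse(call_text.strip(), mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError("expected a single plain function call")
    if call.func.id not in ALLOWED_FUNCTIONS:
        raise ValueError(f"function not allowed: {call.func.id}")
    # literal_eval accepts only literals (numbers, strings, lists, tuples, dicts, ...),
    # so arbitrary expressions in the arguments are rejected with ValueError.
    args = [ast.literal_eval(arg) for arg in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return ALLOWED_FUNCTIONS[call.func.id](*args, **kwargs)

points = run_function_call("plot_some_points(x=[1, 2, 3], y=[10, 20, 30])")
print(points)
```

The trade-off is that nested calls are rejected too, so this only suits the single-call case shown here.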

OpenAI also provides function calling.

import json
from openai import OpenAI
from dotenv import load_dotenv
import os

_ = load_dotenv()

def query_openai(msg, functions=None):
  load_dotenv()
  GPT_MODEL = "gpt-3.5-turbo"

  openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
  openai_response = openai_client.chat.completions.create(
    model = GPT_MODEL,
    messages = [{'role': 'user', 'content': msg}],
    tools = functions)
  return openai_response

openai_function = {
  "type": "function",
  "function": {
    "name": "draw_clown_face",
    "description": "Draws a customizable, simplified clown face using matplotlib.",
    "parameters": {
      "type": "object",
      "properties": {
        "face_color": {
          "type": "string",
          "description": "Color of the clown's face."
        },
        "eye_color": {
          "type": "string",
          "description": "Color of the clown's eyes."
        },
        "nose_color": {
          "type": "string",
          "description": "Color of the clown's nose."
        }
        }
      }
    }
  }

openai_msg = \
"Hey can you draw a pink clown face with a red nose"

result = query_openai(openai_msg, functions=[openai_function])

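# Prints: Function(arguments='{"face_color":"pink","nose_color":"red"}', name='draw_clown_face')
print (result.choices[0].message.tool_calls[0].function)

tool_name = result.choices[0].message.tool_calls[0].function.name
tool_args = result.choices[0].message.tool_calls[0].function.arguments

# Build the call string from the tool name and its JSON arguments.
function_call = f"{tool_name}(**{tool_args})"

# Prints: draw_clown_face(**{"face_color":"pink","nose_color":"red"})
print (function_call)

# Run the call. draw_clown_face must already be defined or imported (the course provides it in utils).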
exec(function_call)
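
Because the OpenAI response already separates the function name from its JSON arguments, the call can also be made without building a string for exec(). The following is a minimal sketch under assumptions: TOOLS and dispatch_tool_call are hypothetical names, and draw_clown_face is a stand-in with made-up default colors rather than the course's implementation.

```python
import json

def draw_clown_face(face_color="yellow", eye_color="black", nose_color="red"):
    """Stand-in for the course's drawing function: returns the chosen colors."""
    return {"face": face_color, "eyes": eye_color, "nose": nose_color}

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"draw_clown_face": draw_clown_face}

def dispatch_tool_call(name: str, arguments_json: str):
    """Decode the JSON arguments and call the registered function by name."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    # json.loads also handles true/false/null, which are not valid Python literals.
    kwargs = json.loads(arguments_json)
    return TOOLS[name](**kwargs)

outcome = dispatch_tool_call(
    "draw_clown_face", '{"face_color":"pink","nose_color":"red"}'
)
print(outcome)
```

In the example above, tool_name and tool_args would be the two values passed to dispatch_tool_call.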

Function-calling variations and data extraction with LLMs

The previous example covered only a single call. This time we look at a wider range of variations:

Parallel calls
No calls
Multiple functions
Nested functions

First, let's see how a function's docstring and signature make function calling easier.

def afunction(arg1:int = 0, arg2:str = "hello", **kwargs)->int:
    ''' this is a function definition
        arg1 (int): an exemplary yet modest argument
        arg2 (str): another nice argument
        **kwargs : the rest of the rabble 

        returns arg1 incremented by one
    '''
    return arg1 + 1

# Prints: afunction
print(afunction.__name__)

# Prints the docstring of afunction defined above.
print(afunction.__doc__)

import inspect

# Prints: (arg1: int = 0, arg2: str = 'hello', **kwargs) -> int
print(inspect.signature(afunction))

import inspect
def build_raven_prompt(function_list, user_query):
    raven_prompt = ""
    for function in function_list:
        signature = inspect.signature(function)
        docstring = function.__doc__
        prompt = \
f'''
Function:
def {function.__name__}{signature}
    """
    {docstring.strip()}
    """

'''
        raven_prompt += prompt

    raven_prompt += f"User Query: {user_query}<human_end>"
    return raven_prompt


# This makes it easy to generate the function definitions for the prompt.
print( build_raven_prompt([afunction], "a query"))
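
One thing to watch with build_raven_prompt: a function without a docstring has __doc__ set to None, so docstring.strip() would raise. Below is a small variant, offered as a sketch rather than the course's code, that uses inspect.getdoc and a placeholder description. The name build_prompt, the placeholder text, and the two sample functions are assumptions for illustration.

```python
import inspect
import textwrap

def build_prompt(function_list, user_query):
    """Build a Raven-style prompt; tolerate functions that have no docstring."""
    parts = []
    for function in function_list:
        signature = inspect.signature(function)
        # getdoc returns a cleaned docstring, or None when there is none.
        docstring = inspect.getdoc(function) or "No description provided."
        parts.append(
            "\nFunction:\n"
            f"def {function.__name__}{signature}\n"
            '    """\n'
            f"{textwrap.indent(docstring, '    ')}\n"
            '    """\n\n'
        )
    parts.append(f"User Query: {user_query}<human_end>")
    return "".join(parts)

def add(a: int, b: int = 1) -> int:
    """Adds two integers."""
    return a + b

def undocumented(x):
    return x

prompt = build_prompt([add, undocumented], "What is 2 + 3?")
print(prompt)
```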


from utils import draw_clown_face

raven_msg = "Hey, can you build me two clowns. " \
"The first clown should be red faced, with a blue nose " \
"and a mouth from 0 to 180 degrees. The mouth should be black. " \
"The second clown should have a blue face and a green nose " \
"and a red mouth that's 180 to 360 degrees."

raven_prompt = build_raven_prompt([draw_clown_face], raven_msg)

# The prompt is now ready for function calling.
print (raven_prompt)


from utils import query_raven

# All that is left is to query the model.
raven_call = query_raven(raven_prompt)

# draw_clown_face(face_color='red', nose_color='blue', mouth_color='black', mouth_theta=(0, 180)); draw_clown_face(face_color='blue', nose_color='green', mouth_color='red', mouth_theta=(180, 360));
print (raven_call)

# Run the calls.
exec(raven_call)
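
The parallel-call output is simply several calls separated by semicolons, which is valid Python. As a rough sketch (split_calls is a hypothetical helper, not part of the course utilities), the standard ast module can split such a string into individual calls so each one can be inspected before anything is executed. Only keyword arguments are collected here, matching the output shown above.

```python
import ast

def split_calls(call_text: str):
    """Return one (function_name, keyword_arguments) pair per call in call_text."""
    calls = []
    for statement in ast.parse(call_text).body:
        node = statement.value if isinstance(statement, ast.Expr) else None
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected only plain function calls")
        # Keyword arguments must be literals; positional arguments are ignored in this sketch.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls

raven_call = (
    "draw_clown_face(face_color='red', nose_color='blue', "
    "mouth_color='black', mouth_theta=(0, 180)); "
    "draw_clown_face(face_color='blue', nose_color='green', "
    "mouth_color='red', mouth_theta=(180, 360));"
)

calls = split_calls(raven_call)
for name, kwargs in calls:
    print(name, kwargs)
```

An empty string yields an empty list, so the same helper can also tell you when there is nothing to call.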

Next, an example that uses multiple functions:

from utils import draw_clown_face, draw_tie
raven_msg = "Hey draw a tie?"
raven_prompt = build_raven_prompt\
    ([draw_clown_face, draw_tie], raven_msg)

raven_call = query_raven(raven_prompt)

exec(raven_call)

Next, an example with multiple parallel function calls:

raven_msg = "Draw a clown and a tie?"

raven_prompt = build_raven_prompt([draw_tie, draw_clown_face], raven_msg)
raven_call = query_raven(raven_prompt)

print (raven_call)

exec(raven_call)

Interfacing with external tools

An application of function calling: an example that ties it to an external resource, an API call.

import requests
def give_joke(category : str):
    """
    Joke categories. Supports: Any, Misc, Programming, Pun, Spooky, Christmas.
    """

    url = f"https://v2.jokeapi.dev/joke/{category}?safe-mode&type=twopart"
    response = requests.get(url)
    print(response.json()["setup"])
    print(response.json()["delivery"])

USER_QUERY = "Hey! Can you get me a joke for this december?"

from utils import query_raven

raven_functions = \
f'''
def give_joke(category : str):
    """
    Joke categories. Supports: Any, Misc, Programming, Dark, Pun, Spooky, Christmas.
    """

User Query: {USER_QUERY}<human_end>
'''

call = query_raven(raven_functions)

exec(call)
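
The model chooses the category argument, so it may return a value the API does not support. A minimal sketch of checking the argument before building the request URL follows. SUPPORTED_CATEGORIES mirrors the list in the give_joke docstring, and joke_url is a hypothetical helper that makes no network request.

```python
# Categories copied from the give_joke docstring above.
SUPPORTED_CATEGORIES = ("Any", "Misc", "Programming", "Pun", "Spooky", "Christmas")

def joke_url(category: str) -> str:
    """Return the JokeAPI URL for a supported category, matched case-insensitively."""
    by_lowercase = {name.lower(): name for name in SUPPORTED_CATEGORIES}
    canonical = by_lowercase.get(category.strip().lower())
    if canonical is None:
        raise ValueError(f"unsupported joke category: {category!r}")
    return f"https://v2.jokeapi.dev/joke/{canonical}?safe-mode&type=twopart"

print(joke_url("christmas"))
```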

Next, combining OpenAPI with function calling:

import yaml
import json

# Read the content of the file
with open('openapi.yml', 'r') as file:
    file_content = file.read()
file_content = file_content.replace("int\n", "number\n")
file_content = file_content.replace("float\n", "number\n")
data = yaml.safe_load(file_content)

data["servers"] = [{"url":"https://api.open-meteo.com"}]

with open('openapi.json', 'w') as file:
    json.dump(data, file)

!openapi-python-generator openapi.json ./api_specification_main/

from api_specification_main.services.WeatherForecastAPIs_service\
    import get_v1forecast

user_query = "Hey how is the current weather and windspeed in New York?"

import inspect
signature = inspect.signature(get_v1forecast)
docstring = \
'''
Requires the latitude and longitude.
Set current_weather to True to get the weather.
Set hourly or daily based on preference.
'''

raven_prompt = \
f'''
Function:
{get_v1forecast.__name__}{signature}
"""{docstring}"""

User Query: {user_query}<human_end>'''

print (raven_prompt)

from utils import query_raven
call = query_raven(raven_prompt)
print (call)

eval(call)
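
The type normalization at the start of this example replaces every "int\n" and "float\n" in the file, which also hits ordinary text that happens to end a line with those letters. A minimal sketch of a narrower rewrite with a regular expression is shown below. It assumes the types appear on lines of the form "type: int" or "type: float", and normalize_types and the sample text are made up for illustration.

```python
import re

# Match only whole lines of the form "<indent>type: int" or "<indent>type: float".
TYPE_LINE = re.compile(r"^([ \t]*type:[ \t]*)(?:int|float)[ \t]*$", re.MULTILINE)

def normalize_types(spec_text: str) -> str:
    """Rewrite int/float type lines to number and leave all other text untouched."""
    return TYPE_LINE.sub(r"\1number", spec_text)

sample = (
    "latitude:\n"
    "  type: float\n"
    "  description: a point\n"
    "hours:\n"
    "  type: int\n"
)

# The plain replace also rewrites the description line ("a point" ends in "int").
naive = sample.replace("int\n", "number\n").replace("float\n", "number\n")
print(naive)

# The regular expression touches only the two type lines.
print(normalize_types(sample))
```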

Structured Extraction

Function calling applied to extracting addresses from text. In other words, if natural-language text contains data you want, you can pull it out.

text = \
"""
John Doe lives at 123 Elm Street, Springfield. Next to him is Jane Smith, residing at 456 Oak Avenue, Lakeview. Not far away, we find Dr. Emily Ryan at 789 Pine Road, Westwood. Meanwhile, in a different part of town, Mr. Alan Turing can be found at 101 Binary Blvd, Computerville. Nearby, Ms. Olivia Newton stays at 202 Music Lane, Harmony. Also, Prof. Charles Xavier is located at 505 Mutant Circle, X-Town.
"""


raven_prompt = \
f'''
Function:
def address_name_pairs(names : list[str], addresses : list[str]):
"""
Give names and associated addresses.
"""

{text}<human_end>
'''

from utils import query_raven

def address_name_pairs(names : list[str], addresses : list[str]):
  """
  Give names and associated addresses.
  """
  for name, addr in zip(names, addresses):
    print (name, ": ", addr)

result = query_raven(raven_prompt)

eval(result)

Output:

John Doe :  123 Elm Street, Springfield
Jane Smith :  456 Oak Avenue, Lakeview
Dr. Emily Ryan :  789 Pine Road, Westwood
Mr. Alan Turing :  101 Binary Blvd, Computerville
Ms. Olivia Newton :  202 Music Lane, Harmony
Prof. Charles Xavier :  505 Mutant Circle, X-Town
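
Printing is fine for a demo, but the extracted values are more useful as data. Here is a minimal sketch of a variant that returns a dictionary and refuses mismatched lists, since zip would otherwise drop the extra items silently. The model_output string is a shortened, hypothetical response, and restricting the names visible to eval() is only a light precaution, not a sandbox.

```python
def address_name_pairs(names: list, addresses: list) -> dict:
    """Return a name -> address mapping; fail if the two lists differ in length."""
    if len(names) != len(addresses):
        raise ValueError("names and addresses must have the same length")
    return dict(zip(names, addresses))

# Hypothetical, shortened model output in the same shape as the example above.
model_output = (
    "address_name_pairs("
    "names=['John Doe', 'Jane Smith'], "
    "addresses=['123 Elm Street, Springfield', '456 Oak Avenue, Lakeview'])"
)

# Expose only the one function the model is expected to call.
pairs = eval(model_output, {"__builtins__": {}}, {"address_name_pairs": address_name_pairs})
print(pairs)
```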

You can also declare a data class and extract data that fits that class.

unbalanced_text = \
"""
Dr. Susan Hill has a practice at 120 Green Road, Evergreen City, and also consults at 450 Riverdale Drive, Brookside. Mark Twain, the renowned author, once lived at 300 Maple Street, Springfield, but now resides at 200 Writers Block, Literaryville. The famous artist, Emily Carter, showcases her work at 789 Artisan Alley, Paintown, and has a studio at 101 Palette Place, Creativeland. Meanwhile, the tech innovator, John Tech, has his main office at 555 Silicon Street, Techville, and a secondary office at 777 Data Drive, Computown, but he lives at 123 Digital Domain, Innovatown.
"""
print (unbalanced_text)

raven_prompt = \
f'''

@dataclass
class Record:
    name : str
    addresses : List[str]

Function:
def insert_into_database(names : List[Record]):
"""
Inserts the records into the database. 
"""

{unbalanced_text}<human_end>

'''

result = query_raven(raven_prompt)
print (result)

Output:

insert_into_database(names=[Record(name='Dr. Susan Hill', addresses=['120 Green Road', '450 Riverdale Drive']), Record(name='Mark Twain', addresses=['300 Maple Street', '200 Writers Block']), Record(name='Emily Carter', addresses=['789 Artisan Alley', '101 Palette Place']), Record(name='John Tech', addresses=['555 Silicon Street', '777 Data Drive', '123 Digital Domain'])])
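
The notes only print this output. To actually evaluate it, Record and insert_into_database have to exist on the Python side. The sketch below assumes a stand-in insert_into_database that returns plain dictionaries instead of writing to a database, and it uses a shortened copy of the output above.

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Record:
    name: str
    addresses: List[str]

def insert_into_database(names: List[Record]) -> list:
    """Stand-in for a real insert: converts each record to a plain dict."""
    return [asdict(record) for record in names]

# Shortened copy of the model output shown above.
model_output = (
    "insert_into_database(names=["
    "Record(name='Dr. Susan Hill', addresses=['120 Green Road', '450 Riverdale Drive']), "
    "Record(name='John Tech', addresses=['555 Silicon Street', '777 Data Drive', "
    "'123 Digital Domain'])])"
)

# Expose only the names the prompt declared.
namespace = {"Record": Record, "insert_into_database": insert_into_database}
rows = eval(model_output, {"__builtins__": {}}, namespace)
for row in rows:
    print(row["name"], len(row["addresses"]))
```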

Next, extracting JSON data from a nested structure:

def city_info(city_name : str, location : dict):
  """
  Gets the city info
  """
  return locals()

def construct_location_dict(country : str, continent : dict):
  """
  Provides the location dictionary
  """
  return locals()

def construct_continent_dict(simple_name : str, other_name : str):
  """
  Provides the continent dict
  """
  return locals()

# {'city_name': 'London', 'location': {}} 
print (city_info("London", {}))


raven_prompt = \
'''
Function:
def city_info(city_name : str, location : dict):
"""
Gets the city info
"""

Function:
def construct_location_dict(country : str, continent : dict):
"""
Provides the location dictionary
"""

def construct_continent_dict(simple_name : str, other_name : str):
"""
Provides the continent dict
"""

User Query: {question}<human_end>
'''

question = "I want the city info for London, "\
"which is in the United Kingdom, which is in Europe or Afro-Eur-Asia."

output = query_raven(raven_prompt.format(question = question))

# eval() runs the nested calls and yields the nested dictionary.
json0 = eval(output) 
print (json0)

Output:

{'city_name': 'London', 'location': {'country': 'United Kingdom', 'continent': {'simple_name': 'Europe', 'other_name': 'Afro-Eur-Asia'}}}
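
The nested dictionary falls out of ordinary Python evaluation order: arguments are evaluated before the function that receives them, so the innermost call runs first. The sketch below writes out the nested call directly. Its exact shape is an assumption, since the notes show only the final dictionary, and call_order is added purely to make the order visible.

```python
import json

call_order = []

def construct_continent_dict(simple_name: str, other_name: str) -> dict:
    call_order.append("construct_continent_dict")
    return {"simple_name": simple_name, "other_name": other_name}

def construct_location_dict(country: str, continent: dict) -> dict:
    call_order.append("construct_location_dict")
    return {"country": country, "continent": continent}

def city_info(city_name: str, location: dict) -> dict:
    call_order.append("city_info")
    return {"city_name": city_name, "location": location}

# Assumed shape of the model's nested call for the London question.
result = city_info(
    city_name="London",
    location=construct_location_dict(
        country="United Kingdom",
        continent=construct_continent_dict(
            simple_name="Europe", other_name="Afro-Eur-Asia"
        ),
    ),
)

print(call_order)
print(json.dumps(result, indent=2))
```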

Applications

This time we use function calling to build the input for other external tools: a search engine and a database.

Search engine + function calling example:

from dotenv import load_dotenv
_ = load_dotenv()

import os

def do_web_search(full_user_prompt : str, num_results : int = 5):
    API_URL = f'{os.getenv("DLAI_TAVILY_BASE_URL", "https://api.tavily.com")}/search'
    payload = \
    {
      "api_key": os.environ["TAVILY_API_KEY"],
      "query": full_user_prompt,
      "search_depth": "basic",
      "include_answer": False,
      "include_images": False,
      "include_raw_content": False,
      "max_results": num_results,
      "include_domains": [],
      "exclude_domains": []
    }
    import requests
    response = requests.post(API_URL, json=payload)
    response = response.json()
    all_results = "\n\n".join(item["content"] for item in response["results"])
    return all_results

function_calling_prompt = \
"""
Function:
def do_web_search(full_user_prompt : str, num_results : int = 5):
    '''
    Searches the web for the user question.
    '''

Example:
User Query: What is the oldest capital in the world?
Call: do_web_search(full_user_prompt="oldest capital")

User Query: {query}<human_end>
"""
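# Assumes `question` already holds the user's question (the one that produced the output below is not shown here).
fc_result = query_raven(function_calling_prompt.format(query=question))

# Example output: do_web_search(full_user_prompt='R1 thing')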
print (fc_result)

result = eval(fc_result)

full_prompt = \
f"""
<s> [INST]
{result}

Use the information above to answer the following question concisely.

Question:
{question} [/INST]
"""


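# full_prompt is already an f-string, so no further .format() call is needed
# (and .format() would fail if the search results contained braces).
grounded_response = query_raven(full_prompt)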

print (grounded_response)

Database + function calling example: mini version

question = "What is the most expensive item we currently sell?"

from utils import execute_sql, query_raven

schema = \
"""
CREATE TABLE IF NOT EXISTS toys (
    id INTEGER PRIMARY KEY,
    name TEXT,
    price REAL
);
"""

raven_prompt = \
f'''
Function:
def execute_sql(sql_code : str):
  """
  Runs sql code for a company internal database
  """

Schema: {schema}
User Query: {question}
'''

output = query_raven(raven_prompt)
print (f"LLM's function call: {output}")

# LLM's function call: execute_sql(sql_code='SELECT name, price FROM toys ORDER BY price DESC LIMIT 1')
database_result = eval(output)
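
The notes import execute_sql from the course's utils without showing it. As a minimal sketch of what such a function could look like (an assumption, not the course implementation), the version below runs against an in-memory SQLite database and accepts only a single SELECT statement. The sample toys and prices are made up.

```python
import sqlite3

# In-memory database using the schema from the prompt; the rows are sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS toys (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO toys (name, price) VALUES (?, ?)",
    [("Robot", 25.0), ("Kite", 7.5), ("Train Set", 49.99)],
)

def execute_sql(sql_code: str):
    """Run one read-only SELECT statement and return all rows."""
    statement = sql_code.strip().rstrip(";")
    # Crude guard: reject anything that is not a single SELECT statement.
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    return conn.execute(statement).fetchall()

rows = execute_sql("SELECT name, price FROM toys ORDER BY price DESC LIMIT 1")
print(rows)
```

The guard is deliberately strict: it also rejects queries whose string literals contain a semicolon, which seems an acceptable cost for generated SQL.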

SQLite database + function calling example:

import sqlite3
import random

# Internal database name setting
DB_NAME = 'toy_database.db'

# Connect to the database
def connect_db():
    return sqlite3.connect(DB_NAME)

# List all toys
def list_all_toys():
    with connect_db() as conn:
        cursor = conn.execute('SELECT * FROM toys')
        return cursor.fetchall()


# Find toy by name prefix
def find_toy_by_prefix(prefix):
    with connect_db() as conn:
        query = 'SELECT * FROM toys WHERE name LIKE ?'
        cursor = conn.execute(query, (prefix + '%',))
        return cursor.fetchall()

# Find toys in a price range
def find_toys_in_price_range(low_price, high_price):
    with connect_db() as conn:
        query = 'SELECT * FROM toys WHERE price BETWEEN ? AND ?'
        cursor = conn.execute(query, (low_price, high_price))
        return cursor.fetchall()

# Get a random selection of toys
def get_random_toys(count=5):
    with connect_db() as conn:
        cursor = conn.execute('SELECT * FROM toys')
        all_toys = cursor.fetchall()
        return random.sample(all_toys, min(count, len(all_toys)))

# Function to get the most expensive toy
def get_most_expensive_toy(count=1):
    with connect_db() as conn:
        # Bind count as a parameter instead of formatting it into the SQL string.
        cursor = conn.execute('SELECT * FROM toys ORDER BY price DESC LIMIT ?', (count,))
        return cursor.fetchone()

# Function to get the cheapest toy
def get_cheapest_toy(count=1):
    with connect_db() as conn:
        # The original string lacked the f prefix, so {count} was sent to SQLite literally.
        cursor = conn.execute('SELECT * FROM toys ORDER BY price ASC LIMIT ?', (count,))
        return cursor.fetchone()

raven_prompt = \
f'''
Function:
def list_all_toys():
    """
    Retrieves a list of all toys from the database. This function does not take any parameters.
    Returns: A list of tuples, where each tuple represents a toy with all its attributes (id, name, price).
    """

Function:
def find_toy_by_prefix(prefix):
    """
    Searches for and retrieves toys whose names start with a specified prefix.
    Parameters:
    - prefix (str): The prefix to search for in toy names.
    Returns: A list of tuples, where each tuple represents a toy that matches the prefix criteria.
    """

Function:
def find_toys_in_price_range(low_price, high_price):
    """
    Finds and returns toys within a specified price range.
    Parameters:
    - low_price (float): The lower bound of the price range.
    - high_price (float): The upper bound of the price range.
    Returns: A list of tuples, each representing a toy whose price falls within the specified range.
    """

Function:
def get_random_toys():
    """
    Selects and returns a random set of toys from the database, simulating a "featured toys" list.

    Returns: A list of tuples, each representing a randomly selected toy. The number of toys returned is up to the specified count.
    """

Function:
def get_most_expensive_toy(count : int):
    """
    Retrieves the most expensive toy from the database.
    Parameters:
    - count (int): Row limit passed to the query; use 1 for the single most expensive toy.

    Returns: A tuple representing the most expensive toy, including its id, name, and price.
    """

Function:
def get_cheapest_toy(count : int):
    """
    Finds and retrieves the cheapest toy in the database.
    Parameters:
    - count (int): Row limit passed to the query; use 1 for the single cheapest toy.

    Returns: A tuple representing the cheapest toy, including its id, name, and price.
    """

User Query: {question}<human_end>

'''

output = query_raven(raven_prompt)
print (output)

# get_most_expensive_toy(count=1)
results = eval(output)

Conclusion

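Unstructured data can be extracted easily and in a structured form:

But can it also be extracted well from files such as PDFs? That still needs to be tried.
My guess is that it works well on small inputs but not on large ones.

Thoughts from the database example:

If function calling can automatically generate the SQL statements to run against a database, productivity should go up considerably.
You would only need to write function definitions for the application's features and then generate the SQL directly.
The code structure would look like this: the application exposes an API endpoint per feature, while on the server each feature calls a function that writes the SQL statement. That is where the LLM comes in: the server simply runs the SQL statement the LLM produces.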
