Managing vectorization profiles
The FoundationaLLM (FLLM) vectorization pipelines require the following types of profiles:
Data Sources
Data sources are managed with the FoundationaLLM.DataSource
resource provider through the Management API. The structure of a data source profile is the following:
{
"type": "<data_source_type>",
"name": "<name>",
"object_id": "/instances/<instance_id>/providers/FoundationaLLM.DataSource/dataSources/<name>",
"display_name": null,
"description": "<description>",
"<settings>": [
"<value>"
],
"configuration_references": {
"AuthenticationType": "FoundationaLLM:DataSources:datalake01:AuthenticationType",
"ConnectionString": "FoundationaLLM:DataSources:datalake01:ConnectionString"
},
"created_on": "0001-01-01T00:00:00+00:00",
"updated_on": "0001-01-01T00:00:00+00:00",
"created_by": null,
"updated_by": null,
"deleted": false
}
where:
<data_source_type>
is the type of the data source. The supported types areAzureDataLake
,SharePointOnline
,WebSite
andAzureSQLDatabase
.<name>
is the name of the data source.<instance_id>
is the unique identifier of the FLLM instance.<description>
is the description of the Data Source.<settings>
is a JSON object containing the data source settings, the name of the property varies by data source type.<configuration_references>
is a JSON object containing the profile configuration references. The content of this object also varies by data source type.
The reminder of this section describes the configuration parameters for each of the supported Data Source types.
AzureDataLake
"folders": [
"/vectorization-input/sdzwa/journals/2024"
],
"configuration_references": {
"AuthenticationType": "FoundationaLLM:DataSources:datalake01:AuthenticationType",
"ConnectionString": "FoundationaLLM:DataSources:datalake01:ConnectionString",
"APIKey": "FoundationaLLM:DataSources:datalake01:APIKey",
"Endpoint": "FoundationaLLM:DataSources:datalake01:Endpoint"
},
The configuration parameters for AzureDataLake
are the following:
Parameter | Description |
---|---|
folders |
The list of folders in the Azure Data Lake storage account that contain the data to be vectorized. |
configuration_references.AuthenticationType |
The authentication type used to connect to the underlying storage. Can be one of AzureIdentity , AccountKey , or ConnectionString . |
configuration_references.ConnectionString |
The connection string to the Azure Storage account used for the the Azure Data Lake vectorization Data Source. |
SharePointOnline
"site_url": "https://solliance.sharepoint.com/sites/foundationallm01",
"document_libraries": [
"/documents01"
],
"configuration_references": {
"ClientId": "FoundationaLLM:DataSources:sharepointsite01:ClientId",
"TenantId": "FoundationaLLM:DataSources:sharepointsite01:TenantId",
"CertificateName": "FoundationaLLM:DataSources:sharepointsite01:CertificateName",
"KeyVaultURL": "FoundationaLLM:DataSources:sharepointsite01:KeyVaultURL"
},
The configuration parameters for SharePointOnline
are the following:
Parameter | Description |
---|---|
site_url |
The URL of the SharePoint Online site collection. |
document_libraries |
The list of document libraries in the SharePoint Online site collection that contain the data to be vectorized. |
configuration_references.CertificateName |
The name of the X.509 Certificate. The certificate must be valid and be uploaded into an Azure Key Vault certificate store. |
configuration_references.KeyVaultURL |
The URL of the KeyVault where the X.509 Certificate is stored. |
configuration_references.ClientId |
The Application (client) Id of the Microsoft Entra ID App Registration. See Entra ID app registration for SharePoint Online Data Source. |
configuration_references.TenantId |
The unique identifier of the SharePoint Online tenant. |
AzureSQLDatabase
"tables": [
"Table1"
],
"configuration_references": {
"ConnectionString": "FoundationaLLM:DataSources:sqldatabase01:ConnectionString",
"AuthenticationType": "FoundationaLLM:DataSources:sqldatabase01:AuthenticationType"
},
The configuration parameters for AzureSQLDatabase
are the following:
Parameter | Description |
---|---|
tables |
The list of tables in the Azure SQL database that contain the data to be vectorized. |
configuration_references.ConnectionString |
The connection string to the Azure SQL database used for the Azure SQL Database vectorization Data Source. |
configuration_references.AuthenticationType |
The authentication type used to connect to the Azure SQL database. Can be one of AzureIdentity or ConnectionString . |
Managing Data Sources
This section describes how to manage Data Sources using the Management API.
{{baseUrl}}
is the base URL of the Management API. {{instanceId}}
is the unique identifier of the FLLM instance.
Retrieve
HTTP GET {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.DataSource/dataSources
Create or update
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.DataSource/dataSources/<name>
Content-Type: application/json
BODY
<data source>
where <data source>
is a JSON object with the structure described above.
Delete
HTTP DELETE {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.DataSource/dataSources/<name>
Note
The delete operation is a logical delete. To purge a Data Source, call the /purge
endpoint after deleting the Data Source.
Purge
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.DataSource/dataSources/<name>/purge
Content-Type: application/json
BODY
{}
Check Name
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.DataSource/dataSources/checkname
Content-Type: application/json
BODY
{
"name": "<name>"
}
Text partitioning profiles
The structure of a text partitioning profile is the following:
{
"type": "text-partitioning-profile",
"name": "<name>",
"object_id": "/instances/[INSTANCE ID]/providers/FoundationaLLM.Vectorization/textPartitioningProfiles/<name>",
"description": "<description>",
"deleted": false,
"text_splitter": "<text_splitter>",
"settings": {<profile_settings>},
"configuration_references": {<profile_configuration_references>}
}
where:
<name>
is the name of the text partitioning profile.<description>
is the description of the text partitioning profile.<text_splitter>
is the type of the text splitter. The supported types areTextTokenSplitter
.<profile_settings>
is a JSON object containing the profile settings.<profile_configuration_references>
is a JSON object containing the profile configuration references.
The reminder of this section describes the configuration parameters for each of the supported text splitters.
TextTokenSplitter
"settings" : {
"Tokenizer": "MicrosoftBPETokenizer",
"TokenizerEncoder": "cl100k_base",
"ChunkSizeTokens": "2000",
"OverlapSizeTokens": "200"
},
"configuration_references": {}
The configuration parameters for TokenTextSplitter
are the following:
Parameter | Description |
---|---|
settings.Tokenizer |
The tokenizer used to split the text into tokens. Currently, the only supported tokenizer is MicrosoftBPETokenizer . Under the hood, it uses the .NET equivalent of OpenAI's tiktoken. |
settings.TokenizerEncoder |
The encoder used by the tokenizer. Currently, the only supported encoder is cl100k_base . This encoder is the one currently used by Azure OpenAI (and OpenAI) in gpt-3.5-turbo and gpt-4 . |
settings.ChunkSizeTokens |
The maximum number of tokens in each text chunk. |
settings.OverlapSizeTokens |
The maximum number of tokens that overlap between two consecutive chunks. |
Managing text partitioning profiles
This section describes how to manage text partitioning profiles using the Management API.
{{baseUrl}}
is the base URL of the Management API. {{instanceId}}
is the unique identifier of the FLLM instance.
Retrieve
HTTP GET {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textPartitioningProfiles
Create or update
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textPartitioningProfiles/<name>
Content-Type: application/json
BODY
<text partitioning profile>
where <text partitioning profile>
is a JSON object with the structure described above.
Delete
HTTP DELETE {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textPartitioningProfiles/<name>
Note
The delete operation is a logical delete. To purge a Text Partitioning Profile, call the /purge
endpoint after deleting the Text Partitioning Profile.
Purge
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textPartitioningProfiles/<name>/purge
Content-Type: application/json
BODY
{}
Text embedding profiles
The structure of a text embedding profile is the following:
{
"type": "text-embedding-profile",
"name": "<name>",
"object_id": "/instances/[INSTANCE ID]/providers/FoundationaLLM.Vectorization/textEmbeddingProfiles/<name>",
"display_name": null,
"description": "<description>",
"text_embedding": "<text_embedding>",
"settings": {<profile_settings>},
"configuration_references": {<profile_configuration_references>},
"created_on": "0001-01-01T00:00:00+00:00",
"updated_on": "0001-01-01T00:00:00+00:00",
"created_by": null,
"updated_by": null,
"deleted": false
}
where:
<name>
is the name of the text embedding profile.<description>
is the description of the text embedding profile.<text_embedding>
is the type of the text embedder. The supported types areSemanticKernelTextEmbedding
andGatewayTextEmbedding
.SemanticKernelTextEmbedding
supports synchronous and asynchronous vectorization, whileGatewayTextEmbedding
only supports asynchronous vectorization.<profile_settings>
is a JSON object containing the profile settings.<profile_configuration_references>
is a JSON object containing the profile configuration references.
The reminder of this section describes the configuration parameters for each of the supported text embedders.
SemanticKernelTextEmbedding
"settings": {},
"configuration_references": {
"APIKey": "FoundationaLLM:Vectorization:SemanticKernelTextEmbeddingService:APIKey",
"APIVersion": "FoundationaLLM:Vectorization:SemanticKernelTextEmbeddingService:APIVersion",
"AuthenticationType": "FoundationaLLM:Vectorization:SemanticKernelTextEmbeddingService:AuthenticationType",
"DeploymentName": "FoundationaLLM:Vectorization:SemanticKernelTextEmbeddingService:DeploymentName",
"Endpoint": "FoundationaLLM:Vectorization:SemanticKernelTextEmbeddingService:Endpoint"
}
The configuration parameters for SemanticKernelTextEmbedding
are the following:
Parameter | Description |
---|---|
configuration_references.APIKey |
The API key used to connect to the Azure OpenAI service. By default, this maps to the Azure OpenAI service deployed by FLLM. |
configuration_references.APIVersion |
The API version used to connect to the Azure OpenAI service. By default, this value is 2023-05-15 . |
configuration_references.AuthenticationType |
The authentication type used to connect to the Azure OpenAI service. Can be one of AzureIdentity or APIKey . By default, it is set to APIKey . |
configuration_references.DeploymentName |
The name of the Azure OpenAI model deployment. The default value is embeddings . |
configuration_references.Endpoint |
The endpoint of the Azure OpenAI service. By default, this maps to the Azure OpenAI service deployed by FLLM. |
GatewayTextEmbedding
"settings": {
"model_name": "embeddings"
},
"configuration_references": {}
The settings for GatewayTextEmbedding
are the following:
Parameter | Description |
---|---|
settings.model_name |
The name of the embeddings model deployment in Azure OpenAI Service. |
Managing text embedding profiles
This section describes how to manage text embedding profiles using the Management API.
{{baseUrl}}
is the base URL of the Management API. {{instanceId}}
is the unique identifier of the FLLM instance.
Retrieve
HTTP GET {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textEmbeddingProfiles
Create or update
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textEmbeddingProfiles/<name>
Content-Type: application/json
BODY
<text embedding profile>
where <text embedding profile>
is a JSON object with the structure described above.
Delete
HTTP DELETE {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textEmbeddingProfiles/<name>
Note
The delete operation is a logical delete. To purge a Text Embedding Profile, call the /purge
endpoint after deleting the Text Embedding Profile.
Purge
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/textEmbeddingProfiles/<name>/purge
Content-Type: application/json
BODY
{}
Indexing profiles
The structure of an indexing profile is the following:
{
"type": "indexing-profile",
"name": "<name>",
"object_id": "/instances/[INSTANCE ID]/providers/FoundationaLLM.Vectorization/indexingProfiles/<name>",
"display_name": null,
"description": "<description>",
"deleted": false,
"indexer": "<indexer>",
"settings": {<profile_settings>},
"configuration_references": {<profile_configuration_references>},
"created_on": "0001-01-01T00:00:00+00:00",
"updated_on": "0001-01-01T00:00:00+00:00",
"created_by": null,
"updated_by": null,
"deleted": false
}
where:
<name>
is the name of the indexing profile.<description>
is the description of the indexing profile.<indexer>
is the type of the indexer. The supported types areAzureAISearchIndexer
.<profile_settings>
is a JSON object containing the profile settings.<profile_configuration_references>
is a JSON object containing the profile configuration references.
The reminder of this section describes the configuration parameters for each of the supported indexers.
AzureAISearchIndexer
"settings": {
"IndexName": "fllm-default-001",
"TopN": "3",
"Filters": "",
"EmbeddingFieldName": "Embedding",
"TextFieldName": "Text"
},
"configuration_references": {
"APIKey": "FoundationaLLM:Vectorization:AzureAISearchIndexingService:APIKey",
"AuthenticationType": "FoundationaLLM:Vectorization:AzureAISearchIndexingService:AuthenticationType",
"Endpoint": "FoundationaLLM:Vectorization:AzureAISearchIndexingService:Endpoint"
}
The configuration parameters for AzureAISearchIndexer
are the following:
Parameter | Description |
---|---|
settings.IndexName |
The name of the Azure AI Search index. If the index does not exist, it will be created. |
settings.TopN |
The number of embeddings closest to the index query to return. |
settings.Filters |
Optional filters to further refine the index search. |
settings.EmbeddingFieldName |
Field name of the embedding vector in the JSON documents returned by Azure AI Search. |
settings.TextFieldName |
Field name of the text equivalent of the embedding vector in the JSON documents returned by Azure AI Search. |
configuration_references.APIKey |
The API key used to connect to the Azure AI Search service. By default, this maps to the Azure AI Search service deployed by FLLM. |
ConfigurationReference.AuthenticationType |
The authentication type used to connect to the Azure AI Search service. Can be one of AzureIdentity or APIKey . By default, it is set to APIKey . |
configuration_references.Endpoint |
The endpoint of the Azure AI Search service. By default, this maps to the Azure AI Search service deployed by FLLM. |
Managing indexing profiles
This section describes how to manage indexing profiles using the Management API.
{{baseUrl}}
is the base URL of the Management API. {{instanceId}}
is the unique identifier of the FLLM instance.
Retrieve
HTTP GET {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/indexingProfiles
Create or update
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/indexingProfiles/<name>
Content-Type: application/json
BODY
<indexing profile>
where <indexing profile>
is a JSON object with the structure described above.
Delete
HTTP DELETE {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/indexingProfiles/<name>
Note
The delete operation is a logical delete. To purge an Indexing Profile, call the /purge
endpoint after deleting the Indexing Profile.
Purge
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/indexingProfiles/<name>/purge
Content-Type: application/json
BODY
{}
Check Name
HTTP POST {{baseUrl}}/instances/{{instanceId}}/providers/FoundationaLLM.Vectorization/indexingProfiles/checkname
Content-Type: application/json
BODY
{
"name": "<name>"
}
Additional configuration steps
Entra ID app registration for SharePoint Online Data Source
Apps typically access SharePoint Online through certificates: Anyone having the certificate and its private key can use the app with the permissions granted to it.
Create a new App registration in your Microsoft Entra ID tenant. Next, provide a Name for your application and click on Register at the bottom of the blade.
Navigate to the API Permissions blade and click on Add a permission button Here you choose the permissions that you will grant to this application. Select SharePoint from the Microsoft APIs tab, then select Application permissions as the type of permissions required, choose the desired permissions (i.e. Sites.Read.All) and click on Add permissions. Here are the required scopes:
Group.ReadWrite.All
User.ReadWrite.All
Sites.Read.All
ORSites.Selected
Sites.Read.All
will allow the application to read documents and list items in all site collections.Sites.Selected
will allow the application to access only a subset of site collections. The specific site collections and the permissions granted will be configured separately, in SharePoint Online.
The application permission requires admin consent in a tenant before it can be used. In order to do this, click on API permissions in the left menu again. At the bottom you will see a section Grant consent. Click on the Grant admin consent for {{organization}} button and confirm the action by clicking on the Yes button that appears at the top.
To invoke SharePoint Online with an app-only access token, you have to create and configure a self-signed X.509 certificate, which will be used to authenticate your application against Microsoft Entra ID. You can find additional details on how to do this in this document.
Next step is to register the certificate you created to this application. Click on Certificates & secrets blade. Next, click on the Upload certificate button, select the .CER file you generated earlier and click on Add to upload it.
To confirm that the certificate was successfully registered, click on Manifest blade and search for the
keyCredentials
property, which contains your certificate details. It should look like this:"keyCredentials": [ { "customKeyIdentifier": "<$base64CertHash>", "endDate": "yyyy-MM-ddThh:mm:ssZ", "keyId": "<$guid>", "startDate": "yyyy-MM-ddThh:mm:ssZ", "type": "AsymmetricX509Cert", "usage": "Verify", "value": "<$base64Cert>", "displayName": "CN=<$name of your cert>" } ]
Upload and store the certificate in the KeyVault where the FoundationaLLM Vectorization API has permissions to read Secrets. You will need the Certificate Name for the App Configuration settings listed in the table above.
NOTE
Can I use other means besides certificates for realizing app-only access for my Azure AD app?
NO, all other options are blocked by SharePoint Online and will result in an
Access Denied
message.Create a new Data Source using the Management API. Ensure that you set the necessary App Configuration settings appropriately.