generative-ai-cdk-constructs

@cdklabs/generative-ai-cdk-constructs

Interface: WebCrawlerDataSourceProps

Interface to create a new standalone data source object.

Extends

WebCrawlerDataSourceAssociationProps

Properties

chunkingStrategy?

readonly optional chunkingStrategy: ChunkingStrategy

The chunking strategy to use for splitting your documents or content. The chunks are then converted to embeddings and written to the vector index, allowing for similarity search and retrieval of the content.

Default

ChunkingStrategy.DEFAULT

Inherited from

WebCrawlerDataSourceAssociationProps.chunkingStrategy

contextEnrichment?

readonly optional contextEnrichment: ContextEnrichment

The context enrichment configuration to use.

Default

- No context enrichment is used.

Inherited from

WebCrawlerDataSourceAssociationProps.contextEnrichment

crawlingRate?

readonly optional crawlingRate: number

The maximum rate at which pages are crawled, up to 300 pages per minute per host. Higher values decrease sync time but increase the load on the host.

Default

Inherited from

WebCrawlerDataSourceAssociationProps.crawlingRate
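
As a rough illustration of the trade-off between crawl rate and sync time, the sketch below computes a lower bound on crawl duration for a single host. It assumes the crawl rate is the only bottleneck, which is optimistic. `estimateMinCrawlMinutes` is a hypothetical helper and not part of the library, and the lower bound of "greater than zero" on the rate is an assumption.

```typescript
// Hypothetical helper (not part of the library): lower bound on crawl time
// for a single host, assuming the crawl rate is the only bottleneck.
const MAX_CRAWLING_RATE = 300; // pages per minute per host, as documented above

function estimateMinCrawlMinutes(pageCount: number, crawlingRate: number): number {
  // Assumption: any rate above zero and up to the documented cap is acceptable.
  if (!(crawlingRate > 0) || crawlingRate > MAX_CRAWLING_RATE) {
    throw new RangeError(`crawlingRate must be above 0 and at most ${MAX_CRAWLING_RATE}`);
  }
  if (pageCount < 0) {
    throw new RangeError("pageCount must be non-negative");
  }
  return pageCount / crawlingRate;
}

console.log(estimateMinCrawlMinutes(6000, 300)); // 20
console.log(estimateMinCrawlMinutes(6000, 60)); // 100
```

Under these assumptions, dropping the rate from 300 to 60 makes the same 6,000-page crawl take at least five times longer, while putting one fifth of the load on the host.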

crawlingScope?

readonly optional crawlingScope: CrawlingScope

The scope of the crawling.

Default

- CrawlingScope.DEFAULT

Inherited from

WebCrawlerDataSourceAssociationProps.crawlingScope

customTransformation?

readonly optional customTransformation: CustomTransformation

The custom transformation strategy to use.

Default

- No custom transformation is used.

Inherited from

WebCrawlerDataSourceAssociationProps.customTransformation

dataDeletionPolicy?

readonly optional dataDeletionPolicy: DataDeletionPolicy

The data deletion policy to apply to the data source.

Default

- Sets the data deletion policy to the default of the data source type.

Inherited from

WebCrawlerDataSourceAssociationProps.dataDeletionPolicy

dataSourceName?

readonly optional dataSourceName: string

The name of the data source.

Default

- A new name will be generated.

Inherited from

WebCrawlerDataSourceAssociationProps.dataSourceName

description?

readonly optional description: string

A description of the data source.

Default

- No description is provided.

Inherited from

WebCrawlerDataSourceAssociationProps.description

filters?

readonly optional filters: CrawlingFilters

The filters (regular expression patterns) for the crawling. If there’s a conflict, the exclude pattern takes precedence.

Default

None

Inherited from

WebCrawlerDataSourceAssociationProps.filters
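
The precedence rule above can be sketched as a small standalone function. This is only an illustration of "exclude wins" semantics, not the service's actual matching logic. The `FilterSketch` shape and its field names are assumptions made for this example, as is the choice to allow every non-excluded URL when no include patterns are given.

```typescript
// Local stand-in for the filter shape; the field names are assumptions for illustration.
interface FilterSketch {
  includePatterns?: string[];
  excludePatterns?: string[];
}

// Returns true when a URL would be crawled under "exclude wins" semantics.
function isUrlAllowed(url: string, filters: FilterSketch): boolean {
  const matchesAny = (patterns: string[]): boolean =>
    patterns.some((pattern) => new RegExp(pattern).test(url));

  // An exclude match rejects the URL even if an include pattern also matches.
  if (filters.excludePatterns && matchesAny(filters.excludePatterns)) {
    return false;
  }
  // Assumption: with no include patterns, everything not excluded is allowed.
  if (filters.includePatterns && filters.includePatterns.length > 0) {
    return matchesAny(filters.includePatterns);
  }
  return true;
}

const sketchFilters: FilterSketch = {
  includePatterns: [".*/docs/.*"],
  excludePatterns: [".*\\.pdf$"],
};

console.log(isUrlAllowed("https://www.sitename.com/docs/intro.html", sketchFilters)); // true
console.log(isUrlAllowed("https://www.sitename.com/docs/guide.pdf", sketchFilters)); // false
console.log(isUrlAllowed("https://www.sitename.com/blog/post.html", sketchFilters)); // false
```

The second URL matches both an include and an exclude pattern, so it is rejected.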

kmsKey?

readonly optional kmsKey: IKey

The KMS key to use to encrypt the data source.

Default

- Service owned and managed key.

Inherited from

WebCrawlerDataSourceAssociationProps.kmsKey

knowledgeBase

readonly knowledgeBase: IKnowledgeBase

The knowledge base to associate with the data source.

maxPages?

readonly optional maxPages: number

The maximum number of web pages to crawl from your source URLs, up to 25,000 pages. If the number of web pages exceeds this limit, the data source sync fails and no web pages are ingested.

Default

- No limit

Inherited from

WebCrawlerDataSourceAssociationProps.maxPages

parsingStrategy?

readonly optional parsingStrategy: ParsingStrategy

The parsing strategy to use.

Default

- No parsing strategy is used.

Inherited from

WebCrawlerDataSourceAssociationProps.parsingStrategy

sourceUrls

readonly sourceUrls: string[]

The source URLs, in the format https://www.sitename.com. A maximum of 100 URLs is allowed.

Inherited from

WebCrawlerDataSourceAssociationProps.sourceUrls
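
The documented limits (at most 100 source URLs here, and up to 25,000 pages for maxPages) can be checked before deployment with a small guard such as the one sketched below. `checkCrawlerLimits` and `CrawlerLimitsInput` are hypothetical names, not part of the library. The sketch takes the documented https:// format literally, which may be stricter than the service, and the lower bound of 1 for maxPages is an assumption.

```typescript
// Hypothetical pre-deployment check (not part of the library).
const MAX_SOURCE_URLS = 100; // documented maximum number of source URLs
const MAX_PAGES = 25000; // documented upper bound for maxPages

interface CrawlerLimitsInput {
  sourceUrls: string[];
  maxPages?: number;
}

// Returns a list of human-readable problems; an empty list means no issue was found.
function checkCrawlerLimits(input: CrawlerLimitsInput): string[] {
  const problems: string[] = [];
  if (input.sourceUrls.length > MAX_SOURCE_URLS) {
    problems.push(`sourceUrls has ${input.sourceUrls.length} entries; the maximum is ${MAX_SOURCE_URLS}`);
  }
  for (const url of input.sourceUrls) {
    // Assumption: only the documented https://host form is accepted.
    if (!/^https:\/\/[^\s/]+/.test(url)) {
      problems.push(`sourceUrls entry is not in the https://www.sitename.com format: ${url}`);
    }
  }
  // Assumption: maxPages, when set, must be at least 1.
  if (input.maxPages !== undefined && (input.maxPages < 1 || input.maxPages > MAX_PAGES)) {
    problems.push(`maxPages is ${input.maxPages}; expected a value from 1 to ${MAX_PAGES}`);
  }
  return problems;
}

const found = checkCrawlerLimits({
  sourceUrls: ["https://www.sitename.com", "www.sitename.com/docs"],
  maxPages: 30000,
});
console.log(found.length); // 2 (the second URL and maxPages)
```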

userAgent?

readonly optional userAgent: string

The user agent string to use when crawling.

Default

- Default user agent string

Inherited from

WebCrawlerDataSourceAssociationProps.userAgent

userAgentHeader?

readonly optional userAgentHeader: string

The user agent header to use when crawling: a string that identifies the crawler or bot when it accesses a web server. The header value consists of bedrockbot, a UUID, and a user agent suffix for your crawler (if one is provided). By default, it is set to bedrockbot_UUID. You can optionally append a custom suffix to bedrockbot_UUID to allowlist a specific user agent that is permitted to access your source URLs.

Default

- Default user agent header (bedrockbot_UUID)

Inherited from

WebCrawlerDataSourceAssociationProps.userAgentHeader
