Retrieval-Augmented Generation (RAG) is an LLM architecture that improves factual accuracy by combining a language model with an external retrieval system. Instead of relying only on what is stored in its parameters, the model queries a source repository (e.g., documents, databases, or knowledge bases), retrieves relevant passages, and conditions its generation on them. This helps keep responses context-aware and grounded in verifiable material.
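To make the retrieve-then-generate flow concrete, here is a toy sketch. It assumes a simple word-overlap score in place of a real retriever, and it stops at building the prompt rather than calling a model. The function names (tokenize, score, retrieve, buildPrompt) and the sample passages are made up for illustration and do not describe how any particular product works.

```javascript
// Toy retrieve-then-generate pipeline. All names and sample data are hypothetical.

// Split text into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Score a passage by how many distinct query words it contains.
function score(query, passage) {
  const passageWords = new Set(tokenize(passage));
  const queryWords = new Set(tokenize(query));
  let hits = 0;
  for (const word of queryWords) {
    if (passageWords.has(word)) hits += 1;
  }
  return hits;
}

// Retrieval step: return up to k passages with the highest non-zero score.
function retrieve(query, passages, k = 2) {
  return passages
    .map(text => ({ text, score: score(query, text) }))
    .filter(p => p.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(p => p.text);
}

// Conditioning step: put the retrieved passages into the prompt
// that would be sent to the language model.
function buildPrompt(query, passages) {
  const context = retrieve(query, passages)
    .map((text, i) => `[${i + 1}] ${text}`)
    .join("\n");
  return `Answer using only the sources below.\n\nSources:\n${context}\n\nQuestion: ${query}`;
}

// Sample source repository (illustrative only).
const passages = [
  "NotebookLM accepts PDFs and Google Docs as sources.",
  "The Eiffel Tower is in Paris.",
  "A retrieval system returns passages relevant to a query.",
];
console.log(buildPrompt("Which sources does NotebookLM accept?", passages));
```

A production system would typically swap the word-overlap score for embedding similarity or a search index, but the shape stays the same: retrieve first, then generate from what was retrieved.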
Google's NotebookLM is a recent example of a RAG-based tool. It accepts sources such as PDFs, websites, YouTube videos, audio files, Google Docs, and Google Slides.
One way to make use of this is to first scrape all the hyperlinks from a webpage, which provides a structured list of subpages that together represent the broader content of a site. These collected URLs can then be uploaded into NotebookLM to generate summaries and insights across the different sections of the website. Since NotebookLM only allows up to 50 links at a time, larger lists need to be split into smaller batches, but even within this limit it can serve as a practical way to explore and synthesize information from an entire site.
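The batching step can be sketched in a few lines. This assumes the collected links are already in an array; chunkLinks and the example.com URLs are placeholders, and the default of 50 simply mirrors the limit mentioned above.

```javascript
// Split a list of URLs into batches of at most `size` entries.
// `chunkLinks` is a hypothetical helper; 50 is the per-upload limit stated in the text.
function chunkLinks(urls, size = 50) {
  const unique = [...new Set(urls)]; // drop duplicate links first
  const batches = [];
  for (let i = 0; i < unique.length; i += size) {
    batches.push(unique.slice(i, i + size));
  }
  return batches;
}

// Example with 120 placeholder URLs.
const urls = Array.from({ length: 120 }, (_, i) => `https://example.com/page-${i + 1}`);
const batches = chunkLinks(urls);
console.log(batches.map(b => b.length)); // batch sizes: 50, 50, 20
```

Each batch can then be pasted into NotebookLM separately. Removing duplicates before splitting avoids spending part of the limit on repeated links.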
Here is a quick way to extract all hyperlinks from a page:
Open the page in Chrome/Firefox.
Press F12 → go to Console.
Paste this snippet:
// Collect the resolved URL of every <a> element on the page, skipping empty ones
Array.from(document.querySelectorAll("a"))
  .map(a => a.href)
  .filter(href => href)