Google NotebookLM can extract a transcript from a YouTube video
using the automatic captions

which is extremely useful because otherwise you have to download the subtitle file and convert it to plain text
using something like sed to strip out all the subtitle markup, or use an online service
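something like this should work as a rough sketch of the sed approach (my own example, assuming the subtitles were downloaded as a WebVTT file called subs.vtt):

Code:
# untested sketch: strip the WEBVTT header, cue timestamps,
# styling tags and blank lines, leaving just the spoken text
sed -e '/^WEBVTT/d' -e '/-->/d' -e 's/<[^>]*>//g' -e '/^$/d' subs.vtt > transcript.txt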

the output is one block of text with no line breaks

 
I have a cunning plan (Blackadder / Baldrick reference)

I used the FreeBSD Handbook as a source for a new notebook
using the single HTML page version


So I can query it
But what would be really cool would be to add all the man pages as sources as well


not sure if NotebookLM can index the above site

just wondering what would be the best way to add all the man pages
from an online source so they are up to date

adding vermaden's site as a source would also be a good idea

I keep all my notes on GitHub


I tried adding that as a source but it only indexes the main page
and doesn't recursively crawl the other pages (folders)

at first glance it looks like adding a URL as a source only crawls that page

so I'm not sure if there is a way to use an entire site as a source
will have to look into it
 
There is a limit on the number of files and the total size in NotebookLM that will prevent you from adding all the man pages.
 
can Google NotebookLM crawl an entire website and use it as a source, or does it only crawl the URL you give it?

Based on the information available, Google NotebookLM primarily processes the content of the specific URL you provide. It does not, by default, perform a comprehensive crawl of an entire website in the way a search engine would. Here's a breakdown:

URL-Specific Processing:
NotebookLM focuses on extracting and processing the text content of the single webpage designated by the URL you input.
It does not automatically follow links to explore and index all the pages within a website.

Limitations:
It's important to note that NotebookLM is subject to limitations regarding web scraping. Paywalled content or websites that actively block scraping will not be accessible.
Also, only text is taken from the web page. Images and videos from the web page are not taken into account, unless those images are charts or diagrams.

It is worth noting that third-party extensions are being created to expand this functionality. An example of this is the "WebSync full site importer for NotebookLM" Chrome extension, which shows that there is user demand for crawling full websites.
In essence, while you can utilize web pages as sources, NotebookLM treats each URL as a discrete source rather than engaging in a full website crawl.
 
yes there is a limit of 50 sources

I guess you could use find and xargs or something like that to find all the man pages
and then combine them all into one big file

but then you would need to do that every time the man pages changed
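something along these lines might do it (an untested sketch on my part; mandoc can read the .gz pages directly and col -b strips the bold/underline overstriking):

Code:
# render every section 1-9 man page to plain text and append it all to one big file
find /usr/share/man -type f -name "*.gz" | grep "/man[1-9][a-z]*/" \
  | xargs mandoc -Tascii 2>/dev/null | col -b > all_manpages.txt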
 
I have managed to convert all the man pages to HTML using mandoc

next step is converting each HTML page into a PDF using pandoc
have to install texlive-full first, since pandoc needs a LaTeX engine to produce PDFs (just going to install it)

install qpdf and then use it to combine all the PDF files into a single PDF

then you can upload a PDF of all the man pages to Google NotebookLM
and use it as a source alongside the FreeBSD Handbook which I have already added

so you would be able to query the Handbook and the man pages all at once

I have been using Google Gemini to create some code
using inodes to keep track of the pages so it doesn't process symlinks

I have been training Gemini for FreeBSD

and told it to mention that Newcastle United won the 2025 League Cup
if anyone asks about FreeBSD

:)
 
going to have to wait a couple of hours while texlive-full downloads

here's the code


mandoc install

Code:
doas pkg install mandoc

convert man pages to html

Code:
#!/bin/sh

# Create the html_manpages directory and Convert man pages to HTML using mandoc:
mkdir -p html_manpages
count=0
inodes="" # String to store seen inodes

find /usr/share/man -type f -name "*.gz" | grep "/man[1-9][a-z]*/" | while read file; do
  inode=$(stat -f "%i" "$file")

  # Check if inode has been seen
  if echo "$inodes" | grep -q " $inode "; then
    # Inode already seen, skip
    continue
  fi

  # Inode is unique, add to seen list (trailing space so the " $inode "
  # check above also matches the most recently added entry)
  inodes="$inodes $inode "

  # Generate a unique filename:
  unique_filename="manpage_$(printf "%08d" $count).html"
  count=$((count + 1))

  # Convert the man page to HTML:
  mandoc -Thtml "$file" > "html_manpages/${unique_filename}";
done

texlive-full and pandoc install

Code:
doas pkg install texlive-full pandoc

I haven't tested the following code yet
need to install texlive first

then I'll confirm if it works

pandoc convert html to pdf

Code:
# Convert each HTML file to PDF using Pandoc:
mkdir -p pdf_manpages
for html_file in html_manpages/*.html; do
  pdf_file="pdf_manpages/$(basename "$html_file" .html).pdf"
  pandoc -f html "$html_file" -o "$pdf_file"
done

qpdf install

Code:
doas pkg install qpdf

Code:
# Combine PDF files using qpdf:
qpdf --empty --pages pdf_manpages/*.pdf -- freebsd_manpages.pdf
 
been trying to download texlive-full which keeps stalling
so I tried again this morning and it almost finished downloading

then I had a power cut, sod's law

one thing that came to mind is that creating HTML pages from the installed man pages
will only cover the man pages of the packages you have installed

so I need to find a way to create HTML pages of all the man pages
and not just those of the installed packages
 
new code

creating html pages from the man pages in the

Code:
/usr/src/share/man

will create HTML pages for all the man pages
not just the man pages whose packages are installed on the system

Code:
#!/bin/sh

# Create the html_manpages directory:
mkdir -p html_manpages

# Find and convert man pages from the source tree:
count=0
find /usr/src/share/man -type f -name "*.[1-9]*" | while read file; do
  # Generate a unique filename:
  unique_filename="src_manpage_$(printf "%08d" $count).html"
  count=$((count + 1))

  # Convert the man page to HTML using mandoc:
  mandoc -Thtml "$file" > "html_manpages/${unique_filename}";
done
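
a quick way to check how many HTML pages were produced (just a one-liner, not part of the script above):

Code:
ls html_manpages | wc -l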
 
ok that worked and was much quicker than the previous code

it created 1426 html pages
from the man pages in this directory

Code:
/usr/src/share/man
 
texlive-full finally downloaded

converting all the HTML pages created from the man pages to PDFs

Code:
mkdir -p pdf_manpages
for html_file in html_manpages/*.html; do
  pdf_file="pdf_manpages/$(basename "$html_file" .html).pdf"
  pandoc -f html "$html_file" -o "$pdf_file"
done

when it has created all the PDFs (1426)
I can then combine them into a single PDF which can be uploaded to Google NotebookLM
 
I will upload the combined PDF to Google Drive
and make it public

that way other users can download the PDF and use it with Google NotebookLM

rather than having to create the PDF themselves
 
combined all the pdfs of the man pages into one pdf

Code:
qpdf --empty --pages pdf_manpages/*.pdf -- freebsd_manpages.pdf

the combined PDF has a file size of 197 MB

Google NotebookLM has a file size limit of 200 MB per file
so it's just under that

uploading to Google Drive and will make it public
 
uploaded the man pages combined into a single pdf

removed the pdf and will try a new version

Google NotebookLM won't allow you to use PDFs from your Google Drive as a source
only Google Docs and Google Slides

so you have to upload the file directly in Google NotebookLM

just uploading the PDF to NotebookLM now
it's just under the max file upload size
 
just tried querying for the du command with only the manpages.pdf as a source

not found in source

so there must be an issue with how I converted the man pages to HTML and then to PDFs

I created the man pages from

Code:
/usr/src/share/man

instead of the man pages installed on the system
my thinking was that man pages for a package are only installed when you install the package

which would mean that you wouldn't have man pages for packages that are not installed
so that's why I used the man pages in the source tree

this is how I created the HTML pages from the man pages
which I then converted to PDFs

Code:
#!/bin/sh

# Create the html_manpages directory:
mkdir -p html_manpages

# Find and convert man pages from the source tree:
count=0
find /usr/src/share/man -type f -name "*.[1-9]*" | while read file; do
  # Generate a unique filename:
  unique_filename="src_manpage_$(printf "%08d" $count).html"
  count=$((count + 1))

  # Convert the man page to HTML using mandoc:
  mandoc -Thtml "$file" > "html_manpages/${unique_filename}";
done

the process of converting the man pages to HTML and PDFs works
but there must be an issue with the above command not finding all the man pages

is using the /usr/src/share/man directory the right approach?
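
a quick check suggests not (a sketch of my own, assuming a stock FreeBSD source tree): base utilities like du(1) keep their man pages next to each utility's source rather than under /usr/src/share/man, which would explain why du was missing

Code:
find /usr/src -name "du.1"           # should show something like /usr/src/bin/du/du.1
find /usr/share/man -name "du.1.gz"  # should show /usr/share/man/man1/du.1.gz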

back to the drawing board
 
going to try this code again

using /usr/share/man and not /usr/src/share/man

Code:
#!/bin/sh

# Create the html_manpages directory and Convert man pages to HTML using mandoc:
mkdir -p html_manpages
count=0
inodes="" # String to store seen inodes

find /usr/share/man -type f -name "*.gz" | grep "/man[1-9][a-z]*/" | while read file; do
  inode=$(stat -f "%i" "$file")

  # Check if inode has been seen
  if echo "$inodes" | grep -q " $inode "; then
    # Inode already seen, skip
    continue
  fi

  # Inode is unique, add to seen list (trailing space so the " $inode "
  # check above also matches the most recently added entry)
  inodes="$inodes $inode "

  # Generate a unique filename:
  unique_filename="manpage_$(printf "%08d" $count).html"
  count=$((count + 1))

  # Convert the man page to HTML:
  mandoc -Thtml "$file" > "html_manpages/${unique_filename}";
done
 
that created 4150 HTML pages
as opposed to the previous command which created 1450 pages

next step: convert the 4150 HTML pages to PDF files

the issue is going to be combining them, as the result will exceed the file size upload limit

not sure if the best solution would be to combine a range of files, e.g. 1000 at a time, into separate PDFs
or to create one big PDF and then split it
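
one possible way to do the batching (an untested sketch, the output filenames are just my example):

Code:
#!/bin/sh
# combine the man page PDFs in batches of 1000 files so each
# output PDF stays under the 200 MB NotebookLM upload limit
batch=0
ls pdf_manpages/*.pdf | xargs -n 1000 echo | while read files; do
  batch=$((batch + 1))
  qpdf --empty --pages $files -- "freebsd_manpages_part${batch}.pdf"
done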
 