Extract selectable text from a PDF
Returns the text of each page plus all pages joined into a single string. Aimed at Vite and a modern browser.
typescript
import * as pdfjsLib from "pdfjs-dist";
import workerUrl from "pdfjs-dist/build/pdf.worker.min.mjs?url";
pdfjsLib.GlobalWorkerOptions.workerSrc = workerUrl;
export async function extractPdfText(file: File): Promise<{ pages: string[]; text: string }> {
  // Read the whole file into memory; pdf.js accepts a typed array as `data`.
  const data = new Uint8Array(await file.arrayBuffer());
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages: string[] = [];
  // Page numbers are 1-based.
  for (let n = 1; n <= pdf.numPages; n += 1) {
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    const pageText = content.items
      // Marked-content items have no `str`; treat them as empty.
      .map((item) => ("str" in item ? item.str : ""))
      .join(" ")
      // Collapse runs of whitespace into single spaces.
      .replace(/\s+/g, " ")
      .trim();
    pages.push(pageText);
  }
  // Release the document and its worker resources.
  await pdf.destroy();
  return { pages, text: pages.join("\n\n") };
}
Dependencies
pdfjs-dist
Usage notes
- The worker config above is for Vite (import with ?url).
- Generic bundler alternative (no Vite): GlobalWorkerOptions.workerPort = new Worker(new URL('pdfjs-dist/build/pdf.worker.min.mjs', import.meta.url), { type: 'module' }).
- CDN alternative: GlobalWorkerOptions.workerSrc = 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/<VERSION>/pdf.worker.min.mjs'.
- Use result.pages for the per-page text or result.text for the joined text.
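For the CDN alternative, one way to avoid hard-coding <VERSION> is to build the URL from the version string of the installed library. The sketch below is a minimal example under assumptions: cdnWorkerUrl is a hypothetical helper (not part of pdfjs-dist), it assumes the library exposes its version as pdfjsLib.version, and it assumes cdnjs actually hosts the matching release, which is worth checking before relying on it.

```typescript
// Hypothetical helper, not part of pdfjs-dist: builds the cdnjs worker URL
// for a given library version so the library and worker share one version.
function cdnWorkerUrl(version: string): string {
  // Accept only plain dotted versions; reject empty strings, ranges such as
  // "^1.2.3", or anything else that would produce a broken URL.
  if (!/^\d+\.\d+\.\d+$/.test(version)) {
    throw new Error(`Unexpected pdfjs-dist version: "${version}"`);
  }
  return `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${version}/pdf.worker.min.mjs`;
}

// Intended use (assumes pdfjsLib is imported as in the snippet above and
// that pdfjsLib.version holds the installed version string):
// pdfjsLib.GlobalWorkerOptions.workerSrc = cdnWorkerUrl(pdfjsLib.version);
```

Passing the version in as a parameter keeps the helper free of imports, so it can be checked without loading pdf.js.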
Limitations
- It doesn't do OCR: it only extracts text that's already selectable. A scanned PDF (page images only) may return empty strings.
- Order and spacing are approximate (no column reconstruction or layout line breaks).
- The worker version must match the installed pdfjs-dist version.
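The first two limitations can be softened a little on the caller's side. The sketch below is illustrative only: looksScanned and itemsToLines are hypothetical helpers, and PositionedItem is a local stand-in for the subset of a pdf.js text item used here (str plus the transform array, whose last two entries are taken as the x and y origin). It assumes unrotated pages, where y grows upward. It rebuilds line breaks only; it does not reconstruct columns.

```typescript
// Local stand-in for the fields this sketch needs from a pdf.js text item.
interface PositionedItem {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]; x and y are the item's origin
}

// Hypothetical helper: true when every page came back empty, which is a
// hint (not proof) that the PDF is scanned and would need OCR.
function looksScanned(pages: string[]): boolean {
  return pages.length > 0 && pages.every((p) => p.trim() === "");
}

// Hypothetical helper: groups items whose y origins lie within `tolerance`
// units of each other into one line, then orders lines top to bottom and
// items within a line left to right.
function itemsToLines(items: PositionedItem[], tolerance = 2): string {
  const lines: { y: number; items: PositionedItem[] }[] = [];
  for (const item of items) {
    if (item.str.trim() === "") continue; // skip whitespace-only items
    const y = item.transform[5] ?? 0;
    const line = lines.find((l) => Math.abs(l.y - y) <= tolerance);
    if (line) line.items.push(item);
    else lines.push({ y, items: [item] });
  }
  return lines
    .sort((a, b) => b.y - a.y) // larger y is higher on an unrotated page
    .map((l) =>
      l.items
        .sort((a, b) => (a.transform[4] ?? 0) - (b.transform[4] ?? 0))
        .map((i) => i.str.trim())
        .join(" ")
    )
    .join("\n");
}
```

To use it with the function above, one would pass the items of content.items that have a str field to itemsToLines instead of joining them with spaces, and call looksScanned(result.pages) before deciding whether to fall back to OCR.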