problem
problem :: solution :: harvester :: tpl :: parser :: match :: algo :: score :: harvest :: perf :: cache :: ai :: demo :: summary :: github :: questions
How to get product name, price and an image from HTML?
Imagine you have HTML like this … or like this
- CSS Query
- XPath
- RegExp
- AI based
- Visual editors
Probably your code…
solution
DSL: use a simple domain-specific language to describe the data we want to grab.
harvester lib
[Diagram: HTML and a tree-like template (as a string) go into harvester. The template parser turns the template into a JSON tree, the fuzzy tree match compares that tree with the HTML, and the result is the data as a JSON object.]
harvester:template
- spaces: 2 spaces = 1 level
- tags: div, *
- types: int, float, with, func, str, empty,…
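The indentation rule above (2 spaces = 1 level) can be illustrated with a small parser that turns a template string into a JSON tree. This is only a sketch: the line syntax `tag{name:type}`, the default type `str`, and the function name `parseTemplate` are assumptions made for the example and may differ from what the harvester library really accepts.

```javascript
// Sketch only. Hypothetical line syntax: "<indent><tag>{<name>:<type>}",
// where the {...} part is optional and 2 spaces of indent mean 1 level.
function parseTemplate(tpl) {
  const root = { tag: "root", children: [] };
  const stack = [root]; // stack[level] is the current parent for that level
  for (const line of tpl.split("\n")) {
    if (!line.trim()) continue; // skip empty lines
    const indent = line.length - line.trimStart().length;
    if (indent % 2 !== 0) throw new Error(`Odd indentation: "${line}"`);
    const level = indent / 2;
    if (level > stack.length - 1) throw new Error(`Indentation jumps too far: "${line}"`);
    const match = /^([\w*]+)(?:\{(\w+)(?::(\w+))?\})?$/.exec(line.trim());
    if (!match) throw new Error(`Cannot parse: "${line}"`);
    const [, tag, name, type] = match;
    const node = { tag, children: [] };
    if (name) {
      node.name = name;
      node.type = type || "str"; // assumed default type
    }
    stack.length = level + 1; // forget deeper levels from earlier lines
    stack[level].children.push(node);
    stack.push(node);
  }
  return root.children;
}

const tree = parseTemplate("div\n  h1{title}\n  span{price:float}");
console.log(JSON.stringify(tree, null, 2));
```

The stack keeps one open parent per level, so a line at level N always attaches to the most recent node at level N - 1.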
harvester:parser
[Diagram: the template is passed to the parser.]
harvester:fuzzy tree match
PROBLEMS
- recursion hell
- combinations
- deep DOM
- search time
- similar tags
- exponential complexity
- partial equality
- hard to debug
harvester:match algo
tpl is searched in a DOM.

tpl: k -> (l, m)
DOM: 1 -> (2, 3); 2 -> (4, 5); 3 -> (6, 7, 8); 4 -> (9, a); 5 -> (b, c); 7 -> (d, e); 8 -> (f, g)

klm
1. 123, 245, 367, 378, 368,…
2. 145, 167, 178, 168
3. 19a, 1bc, 1de, 1fg
4. 245, 367, 378, 368
5. 29a, 2bc, 3de, 3fg
6. 49a, 5bc, 7de, 8fg

kl,km
1. 12, 13
2. 14, 15, 16, 17, 18
3. 19, 1a, 1b, 1c, 1d,…
4. 24, 25, 36, 37, 38
5. 29, 2a, 2b, 2c, 3d, 3e,…
6. 49, 4a, 5b, 5c, 7d, 7e,…

k,l,m
1. 1
2. 2, 3
3. 4, 5, 6, 7, 8
4. 9, a, b, c, d, e, f, g

lm
1. 23
2. 45, 67, 78, 68
3. 9a, bc, de, fg
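To make the candidate lists more concrete, here is a rough sketch that enumerates only the exact candidates for the template k -> (l, m): a DOM node with two of its direct children, in document order. The DOM shape is read off the slide; the function name `exactCandidates` is hypothetical, and the fuzzy cases (candidates that skip levels, such as 145 or 19a) are deliberately left out.

```javascript
// DOM from the slide as an adjacency list: node -> children, left to right.
const dom = {
  "1": ["2", "3"],
  "2": ["4", "5"],
  "3": ["6", "7", "8"],
  "4": ["9", "a"],
  "5": ["b", "c"],
  "7": ["d", "e"],
  "8": ["f", "g"],
};

// Sketch: collect every (parent, child, later child) triple in pre-order.
// Covers exact matches only, not the fuzzy ones that skip levels.
function exactCandidates(tree, root) {
  const found = [];
  const visit = (node) => {
    const kids = tree[node] || [];
    for (let i = 0; i < kids.length; i++) {
      for (let j = i + 1; j < kids.length; j++) {
        found.push(node + kids[i] + kids[j]);
      }
    }
    kids.forEach(visit); // repeat for every node below
  };
  visit(root);
  return found;
}

console.log(exactCandidates(dom, "1").join(", "));
```

Even this restricted version visits every node, which hints at why the number of combinations grows so quickly once fuzzy matches are allowed.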
[Figure: steps 1, 2 and 3, each one a recursion, FOR EVERY NODE!]
harvester:score
nodeScore:
- if tag: +1
- if text & textTag: +6
- if text & textType: +12
- if text & !textType: -12
- if attr & textAttr: +6

maxScore = nodeScore + inScore + maxLevel - level

Score is an indicator of how close the branches of the target tree are to each other. Our goal is to maximize the score.

[Figure: a tree with per-node scores. Child nodes: 1+12+7, 1+12+7, 1+6+7, 1+12+7. Root node: 1+74+8. maxScore: 83 points.]
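The numbers on this slide add up: three children at 1+12+7 = 20 and one at 1+6+7 = 14 give an inScore of 74, and the root adds 1+74+8 = 83. The sketch below reproduces that arithmetic. The weights come from the slide, but the node shape (`tagOk`, `text`, `attrOk`), the reading of each condition, and the values maxLevel = 8 with the root at level 0 are assumptions chosen so the example matches the figure.

```javascript
// Weights copied from the slide; how they combine is an assumption.
const WEIGHTS = { tag: 1, textTag: 6, textType: 12, textTypeMiss: -12, attr: 6 };

// Hypothetical node shape:
// { tagOk: boolean, text: "type" | "tag" | "miss" | null, attrOk: boolean, children: [] }
function nodeScore(node) {
  let score = 0;
  if (node.tagOk) score += WEIGHTS.tag;
  if (node.text === "type") score += WEIGHTS.textType;          // text matches the expected type
  else if (node.text === "tag") score += WEIGHTS.textTag;       // text found, weaker match
  else if (node.text === "miss") score += WEIGHTS.textTypeMiss; // text has the wrong type
  if (node.attrOk) score += WEIGHTS.attr;
  return score;
}

// score = nodeScore + inScore + maxLevel - level, where inScore is the sum of child scores.
function totalScore(node, level, maxLevel) {
  const inScore = (node.children || []).reduce(
    (sum, child) => sum + totalScore(child, level + 1, maxLevel),
    0
  );
  return nodeScore(node) + inScore + maxLevel - level;
}

const typed = () => ({ tagOk: true, text: "type", children: [] }); // 1 + 12 + 7
const plain = () => ({ tagOk: true, text: "tag", children: [] });  // 1 + 6 + 7
const root = { tagOk: true, text: null, children: [typed(), typed(), plain(), typed()] };

console.log(totalScore(root, 0, 8)); // 83
```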
harvester:harvest()
[Figure labels: maxScore, found score]
harvester:perf
More than ~130,000x speed increase after refactoring:
- go outside the DOM root
- do not go deeper than the tree size
- copy() function optimizations
- cache the score and compare it with maxScore
- subsets() function returns nodes from max length to min
- compare the current subset's score with maxScore
- cache: tagName, textContent, parentNode, firstElementChild, nextElementSibling
- level = round(level * 1.618033) for every deeper/upper node
- many small optimizations…
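Two of the items above, returning subsets from the longest to the shortest and comparing each subset's score with maxScore, can be sketched as follows. Only the name `subsets()` comes from the slide; its signature, the helper `combinations()`, and `bestSubset()` are hypothetical and are not taken from the library.

```javascript
// Yields order-preserving subsets of `items`, largest first.
function* subsets(items) {
  for (let size = items.length; size >= 1; size--) {
    yield* combinations(items, size, 0, []);
  }
}

// Helper: all combinations of `size` items starting at index `start`.
function* combinations(items, size, start, picked) {
  if (picked.length === size) {
    yield picked.slice();
    return;
  }
  const lastStart = items.length - (size - picked.length);
  for (let i = start; i <= lastStart; i++) {
    picked.push(items[i]);
    yield* combinations(items, size, i + 1, picked);
    picked.pop();
  }
}

// Stops as soon as a subset reaches maxScore, since nothing can score higher.
function bestSubset(items, scoreOf, maxScore) {
  let best = null;
  let bestScore = -Infinity;
  for (const subset of subsets(items)) {
    const score = scoreOf(subset);
    if (score > bestScore) {
      best = subset;
      bestScore = score;
    }
    if (bestScore >= maxScore) break;
  }
  return { best, bestScore };
}

console.log(Array.from(subsets(["a", "b", "c"])).map((s) => s.join("")));
```

Trying the largest subsets first means a full match, if one exists, is found before any of the smaller and lower-scoring ones are examined.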
harvester:perf:cache
Getting DOM elements and attributes is slow, so we put them in a cache.
The cache is used for: score, tagName, textContent, parentNode, firstElementChild, nextElementSibling
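One common way to build such a cache in JavaScript is a `WeakMap` keyed by node, so that entries disappear together with the nodes. The sketch below shows the idea on a plain object standing in for a DOM element. The helper `cached()` is hypothetical, and the slides do not say which data structure the library really uses.

```javascript
// Per-node cache: node -> Map(property name -> value).
const nodeCache = new WeakMap();

// Reads a value through `read` once per (node, key) and reuses it afterwards.
function cached(node, key, read) {
  let entry = nodeCache.get(node);
  if (!entry) {
    entry = new Map();
    nodeCache.set(node, entry);
  }
  if (!entry.has(key)) entry.set(key, read(node));
  return entry.get(key);
}

// A fake node that counts how often its "slow" property is read.
let reads = 0;
const fakeNode = {
  get tagName() {
    reads++;
    return "DIV";
  },
};

cached(fakeNode, "tagName", (n) => n.tagName);
cached(fakeNode, "tagName", (n) => n.tagName);
console.log(reads); // 1
```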
- all children's score: prevents deep search
- parent nodes cache: search lower than parent
harvester:ai
In most cases, the AI chose the wrong algorithm and the wrong optimization approaches. When asked to write basic code for fuzzy tree comparison, it wrote a simple brute-force approach, which is the slowest one. But the AI also generated many good ideas…

AI assistants: ChatGPT, Gemini, Copilot, Grok
demo:perf
on rozetka.com.ua
demo:puppeteer:news
on pravda.com.ua
demo:puppeteer:rozetka
on rozetka.com.ua
demo:puppeteer:amazon
on amazon.com
summary
API
harvest()        // works in the browser & keeps the core logic
harvestPage()    // harvests 1 record for Puppeteer | Playwright
harvestPageAll() // harvests many records for Puppeteer | Playwright
advantages
1. declarative (readability)
2. finds nodes without ids
3. tiny size, only 791 lines
4. robust to DOM changes
5. gets data from a fuzzy DOM
6. pleasantly fast
7. gets all data in one call
8. supports text data types
9. works with Puppeteer/Playwright
plans
1. add more types
2. promotion plans
3. add cheerio support
4. add playwright support
5. add puppeteer support
6. harvestPageAll(), harvestPage(),
github
github.com/tmptrash/harvester
npmjs.com/package/js-harvester
questions?

"The Art of Web Scraping: Tree Algorithms and JS Magic", Dmytro Tarasenko
