Memory not released after parse on Linux/Mac

Bug #2114135 reported by csm10495
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: scoder

Bug Description

Python Info:

Memory does not appear to be freed after the ElementTree object goes out of scope, even after forcing a garbage collection via `gc.collect()`. Interestingly, I don't see the leak on Windows. The Python versions differ slightly between my machines, but the lxml version is the same (though the versions of the underlying libxml2/libxslt differ).

My test script uses a sample dump file from Wikimedia. It can be downloaded directly from https://dumps.wikimedia.org/wikidatawiki/20250601/wikidatawiki-20250601-pages-articles-multistream6.xml-p5969005p6052571.bz2 (extract it from the bz2 and use it as wikidatawiki-20250601-pages-articles-multistream6.xml-p5969005p6052571). I believe this is easily reproducible with any large XML file, though.
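For anyone without a `bzip2` command-line tool at hand, the extraction step can be sketched with the Python standard library alone. The helper name `extract_bz2` is made up for this illustration; only the dump file name comes from the report.

```python
import bz2
import shutil
from pathlib import Path


def extract_bz2(archive: Path) -> Path:
    """Decompress `archive` next to itself and return the extracted file's path."""
    if archive.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {archive.name!r}")
    target = archive.with_suffix("")  # strips only the trailing ".bz2"
    # Stream in chunks so the large XML file never has to fit in memory at once.
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling `extract_bz2(Path("wikidatawiki-20250601-pages-articles-multistream6.xml-p5969005p6052571.bz2"))` should leave a file with the name the test script expects in the same directory.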

Here is my test script (lxmltest.py). Put it next to wikidatawiki-20250601-pages-articles-multistream6.xml-p5969005p6052571. Install dependencies via `pip install lxml psutil pympler -U`:

```
import gc
import sys
import time
from threading import Lock, Thread

import psutil
from lxml import etree
from lxml.etree import parse
from pympler import tracker

PRINT_LOCK = Lock()
FILE_STR = 'wikidatawiki-20250601-pages-articles-multistream6.xml-p5969005p6052571'

def parse_once():
    # The tree is never bound to a name, so it can be freed as soon as print() returns.
    print(parse(FILE_STR))

def start_memory_usage_thread():
    def memory_usage():
        proc = psutil.Process()
        while True:
            with PRINT_LOCK:
                print(f"Memory usage: {proc.memory_info().rss / 1024 / 1024 / 1024} GB")
            time.sleep(1)

    thread = Thread(target=memory_usage)
    thread.daemon = True
    thread.start()

if __name__ == '__main__':
    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

    start_memory_usage_thread()
    mem_info = tracker.SummaryTracker()
    for i in range(5):
        with PRINT_LOCK:
            print(f"Start: {i}")
            mem_info.print_diff()

        parse_once()

        with PRINT_LOCK:
            print(f"End: {i} (pre-collect)")
            mem_info.print_diff()
            print("|-----------------------------------------------------------------------------------------|")
            print("|-----------------------------------------------------------------------------------------|")

        gc.collect()

        with PRINT_LOCK:
            print(f"End: {i} (post-collect)")
            mem_info.print_diff()
            print("|-----------------------------------------------------------------------------------------|")
            print("|-----------------------------------------------------------------------------------------|")

```

Here is the output on Windows:

```
C:\Users\csm10495\Desktop>python lxmltest.py
Python : sys.version_info(major=3, minor=11, micro=0, releaselevel='final', serial=0)
lxml.etree : (5, 4, 0, 0)
libxml used : (2, 11, 9)
libxml compiled : (2, 11, 9)
libxslt used : (1, 1, 39)
libxslt compiled : (1, 1, 39)
Memory usage: 0.034893035888671875 GB
Start: 0
                       types | # objects | total size
============================ | =========== | ============
                        list | 4049 | 348.55 KB
                         str | 4046 | 281.91 KB
                         int | 896 | 24.50 KB
                        code | 1 | 368 B
                        dict | 2 | 288 B
       function (store_info) | 1 | 152 B
                        cell | 2 | 80 B
    functools._lru_list_elem | 1 | 56 B
  builtin_function_or_method | -1 | -72 B
                       tuple | -2 | -192 B
Memory usage: 0.15768051147460938 GB
Memory usage: 0.3460693359375 GB
Memory usage: 0.5390472412109375 GB
Memory usage: 0.7269554138183594 GB
Memory usage: 0.9176521301269531 GB
Memory usage: 1.0989875793457031 GB
Memory usage: 1.2846717834472656 GB
Memory usage: 1.4683494567871094 GB
Memory usage: 1.6554908752441406 GB
<lxml.etree._ElementTree object at 0x000002EAFF2C5480>
End: 0 (pre-collect)
                      types | # objects | total size
=========================== | =========== | ============
                       list | 4 | 248 B
  lxml.etree._ParserContext | 1 | 120 B
       lxml.etree._ErrorLog | 1 | 80 B
                        str | 1 | 70 B
      lxml.etree._TempStore | 1 | 48 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 0 (post-collect)
  types | # objects | total size
======= | =========== | ============
   list | 1 | 80 B
    str | 1 | 74 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 1
                       types | # objects | total size
============================ | =========== | ============
  builtin_function_or_method | 2 | 144 B
Memory usage: 0.04637908935546875 GB
Memory usage: 0.22696685791015625 GB
Memory usage: 0.4188079833984375 GB
Memory usage: 0.6088142395019531 GB
Memory usage: 0.7968864440917969 GB
Memory usage: 0.9865493774414062 GB
Memory usage: 1.1704559326171875 GB
Memory usage: 1.3574256896972656 GB
Memory usage: 1.5435066223144531 GB
<lxml.etree._ElementTree object at 0x000002EAFF2A3BC0>
Memory usage: 0.04497528076171875 GB
End: 1 (pre-collect)
                       types | # objects | total size
============================ | =========== | ============
  builtin_function_or_method | -2 | -144 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 1 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 2
  types | # objects | total size
======= | =========== | ============
Memory usage: 0.1110076904296875 GB
Memory usage: 0.29840850830078125 GB
Memory usage: 0.4913749694824219 GB
Memory usage: 0.6786384582519531 GB
Memory usage: 0.8669891357421875 GB
Memory usage: 1.0520248413085938 GB
Memory usage: 1.2321434020996094 GB
Memory usage: 1.4098663330078125 GB
Memory usage: 1.5959892272949219 GB
<lxml.etree._ElementTree object at 0x000002EAF3E03780>
Memory usage: 0.046215057373046875 GB
End: 2 (pre-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 2 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 3
  types | # objects | total size
======= | =========== | ============
Memory usage: 0.11426544189453125 GB
Memory usage: 0.2979087829589844 GB
Memory usage: 0.4909019470214844 GB
Memory usage: 0.6770286560058594 GB
Memory usage: 0.8621673583984375 GB
Memory usage: 1.0484466552734375 GB
Memory usage: 1.2336997985839844 GB
Memory usage: 1.4193458557128906 GB
Memory usage: 1.6043853759765625 GB
<lxml.etree._ElementTree object at 0x000002EAF38BCD40>
Memory usage: 0.04677581787109375 GB
End: 3 (pre-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 3 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 4
  types | # objects | total size
======= | =========== | ============
Memory usage: 0.11605453491210938 GB
Memory usage: 0.2964820861816406 GB
Memory usage: 0.4882354736328125 GB
Memory usage: 0.6756134033203125 GB
Memory usage: 0.8640823364257812 GB
Memory usage: 1.0492401123046875 GB
Memory usage: 1.2366714477539062 GB
Memory usage: 1.4215164184570312 GB
Memory usage: 1.606719970703125 GB
<lxml.etree._ElementTree object at 0x000002EAF1744780>
Memory usage: 0.04656219482421875 GB
End: 4 (pre-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 4 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
```

Notice how on Windows the memory usage follows a repeating rise-and-fall pattern: it climbs to about 1.6 GB while the document loads, then drops back to roughly 0.05 GB once the tree goes out of scope and is garbage collected. This behavior makes sense and seems fine.

Here is the output on Linux (via WSL, though it reproduced similarly in Docker). It was much slower, though that was likely due to WSL rather than lxml:

```
csm10495@csm10495-desk:/mnt/c/Users/csm10495/Desktop $ python3 lxmltest.py
Python : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree : (5, 4, 0, 0)
libxml used : (2, 13, 8)
libxml compiled : (2, 13, 8)
libxslt used : (1, 1, 43)
libxslt compiled : (1, 1, 43)
Memory usage: 0.016704559326171875 GB
Start: 0
                       types | # objects | total size
============================ | =========== | ============
                        list | 1551 | 133.81 KB
                         str | 1548 | 106.03 KB
                         int | 353 | 9.65 KB
                        dict | 2 | 168 B
       function (store_info) | 1 | 136 B
                        cell | 2 | 80 B
  builtin_function_or_method | 1 | 72 B
                     weakref | 1 | 72 B
                        code | 0 | 70 B
                      method | 1 | 64 B
                       float | 2 | 48 B
                   traceback | -1 | -56 B
                  _io.FileIO | -1 | -64 B
              AttributeError | -1 | -80 B
                       tuple | -2 | -136 B
Memory usage: 0.041484832763671875 GB
Memory usage: 0.07395553588867188 GB
Memory usage: 0.10471725463867188 GB
Memory usage: 0.13499069213867188 GB
Memory usage: 0.16550827026367188 GB
Memory usage: 0.19480514526367188 GB
Memory usage: 0.22434616088867188 GB
Memory usage: 0.2546195983886719 GB
Memory usage: 0.2846488952636719 GB
Memory usage: 0.3134574890136719 GB
Memory usage: 0.3434867858886719 GB
Memory usage: 0.3732719421386719 GB
Memory usage: 0.4040336608886719 GB
Memory usage: 0.4328422546386719 GB
Memory usage: 0.4623832702636719 GB
Memory usage: 0.4921684265136719 GB
Memory usage: 0.5212211608886719 GB
Memory usage: 0.5505180358886719 GB
Memory usage: 0.5798149108886719 GB
Memory usage: 0.6088676452636719 GB
Memory usage: 0.6393852233886719 GB
Memory usage: 0.6684379577636719 GB
Memory usage: 0.6967582702636719 GB
Memory usage: 0.7267875671386719 GB
Memory usage: 0.7568168640136719 GB
Memory usage: 0.7863578796386719 GB
Memory usage: 0.8161430358886719 GB
Memory usage: 0.8459281921386719 GB
Memory usage: 0.8742485046386719 GB
Memory usage: 0.9023246765136719 GB
Memory usage: 0.9291801452636719 GB
Memory usage: 0.9560356140136719 GB
Memory usage: 0.9846000671386719 GB
Memory usage: 1.0129203796386719 GB
Memory usage: 1.0427055358886719 GB
Memory usage: 1.0720024108886719 GB
Memory usage: 1.1003227233886719 GB
Memory usage: 1.1279106140136719 GB
Memory usage: 1.1564750671386719 GB
Memory usage: 1.1847953796386719 GB
Memory usage: 1.2133598327636719 GB
Memory usage: 1.2424125671386719 GB
Memory usage: 1.2712211608886719 GB
Memory usage: 1.3000297546386719 GB
Memory usage: 1.3288383483886719 GB
Memory usage: 1.3571586608886719 GB
Memory usage: 1.3852348327636719 GB
Memory usage: 1.4142875671386719 GB
Memory usage: 1.4428520202636719 GB
Memory usage: 1.4719047546386719 GB
Memory usage: 1.5009574890136719 GB
Memory usage: 1.5287895202636719 GB
Memory usage: 1.5563774108886719 GB
Memory usage: 1.5837211608886719 GB
Memory usage: 1.6144828796386719 GB
<lxml.etree._ElementTree object at 0x790d321762c0>
End: 0 (pre-collect)
                      types | # objects | total size
=========================== | =========== | ============
  lxml.etree._ParserContext | 1 | 120 B
       lxml.etree._ErrorLog | 1 | 80 B
                       code | 0 | 70 B
      lxml.etree._TempStore | 1 | 48 B
                       list | 1 | 8 B
                        str | -2 | -125 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 0 (post-collect)
  types | # objects | total size
======= | =========== | ============
   list | 1 | 80 B
    str | 1 | 74 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 1
  types | # objects | total size
======= | =========== | ============
Memory usage: 1.6229896545410156 GB
Memory usage: 1.6261634826660156 GB
Memory usage: 1.6293373107910156 GB
Memory usage: 1.6329994201660156 GB
Memory usage: 1.6369056701660156 GB
Memory usage: 1.6403236389160156 GB
Memory usage: 1.6432533264160156 GB
Memory usage: 1.6474037170410156 GB
Memory usage: 1.6505775451660156 GB
Memory usage: 1.6539955139160156 GB
Memory usage: 1.6579017639160156 GB
Memory usage: 1.6618080139160156 GB
Memory usage: 1.6652259826660156 GB
Memory usage: 1.6686439514160156 GB
Memory usage: 1.6720619201660156 GB
Memory usage: 1.6754798889160156 GB
Memory usage: 1.6791419982910156 GB
Memory usage: 1.6823158264160156 GB
Memory usage: 1.6862220764160156 GB
Memory usage: 1.6901283264160156 GB
Memory usage: 1.6940345764160156 GB
Memory usage: 1.6974525451660156 GB
Memory usage: 1.7008705139160156 GB
Memory usage: 1.7045326232910156 GB
Memory usage: 1.7086830139160156 GB
Memory usage: 1.7123451232910156 GB
Memory usage: 1.7155189514160156 GB
Memory usage: 1.7191810607910156 GB
Memory usage: 1.7228431701660156 GB
Memory usage: 1.7265052795410156 GB
Memory usage: 1.7304115295410156 GB
Memory usage: 1.7338294982910156 GB
Memory usage: 1.7377357482910156 GB
Memory usage: 1.7413978576660156 GB
Memory usage: 1.7453041076660156 GB
Memory usage: 1.7492103576660156 GB
Memory usage: 1.7528724670410156 GB
Memory usage: 1.7565345764160156 GB
Memory usage: 1.7597084045410156 GB
Memory usage: 1.7638587951660156 GB
Memory usage: 1.7680091857910156 GB
Memory usage: 1.7716712951660156 GB
Memory usage: 1.7750892639160156 GB
Memory usage: 1.7785072326660156 GB
Memory usage: 1.7821693420410156 GB
Memory usage: 1.7860755920410156 GB
Memory usage: 1.7897377014160156 GB
Memory usage: 1.7931556701660156 GB
Memory usage: 1.7965736389160156 GB
Memory usage: 1.7999916076660156 GB
Memory usage: 1.8036537170410156 GB
<lxml.etree._ElementTree object at 0x790d32173780>
End: 1 (pre-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 1 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 2
  types | # objects | total size
======= | =========== | ============
Memory usage: 1.8046035766601562 GB
Memory usage: 1.8046035766601562 GB
Memory usage: 1.8048477172851562 GB
Memory usage: 1.8050918579101562 GB
Memory usage: 1.8055801391601562 GB
Memory usage: 1.8060684204101562 GB
Memory usage: 1.8065567016601562 GB
Memory usage: 1.8072891235351562 GB
Memory usage: 1.8075332641601562 GB
Memory usage: 1.8077774047851562 GB
Memory usage: 1.8085098266601562 GB
Memory usage: 1.8089981079101562 GB
Memory usage: 1.8092422485351562 GB
Memory usage: 1.8094863891601562 GB
Memory usage: 1.8099746704101562 GB
Memory usage: 1.8107070922851562 GB
Memory usage: 1.8111953735351562 GB
Memory usage: 1.8114395141601562 GB
Memory usage: 1.8119277954101562 GB
Memory usage: 1.8124160766601562 GB
Memory usage: 1.8126602172851562 GB
Memory usage: 1.8129043579101562 GB
Memory usage: 1.8133926391601562 GB
Memory usage: 1.8136367797851562 GB
Memory usage: 1.8141250610351562 GB
Memory usage: 1.8143692016601562 GB
Memory usage: 1.8146133422851562 GB
Memory usage: 1.8151016235351562 GB
Memory usage: 1.8155899047851562 GB
Memory usage: 1.8158340454101562 GB
Memory usage: 1.8165664672851562 GB
Memory usage: 1.8168106079101562 GB
Memory usage: 1.8175430297851562 GB
Memory usage: 1.8180313110351562 GB
Memory usage: 1.8185195922851562 GB
Memory usage: 1.8190078735351562 GB
Memory usage: 1.8192520141601562 GB
Memory usage: 1.8197402954101562 GB
Memory usage: 1.8202285766601562 GB
Memory usage: 1.8204727172851562 GB
Memory usage: 1.8209609985351562 GB
Memory usage: 1.8214492797851562 GB
Memory usage: 1.8216934204101562 GB
Memory usage: 1.8219375610351562 GB
Memory usage: 1.8224258422851562 GB
Memory usage: 1.8229141235351562 GB
Memory usage: 1.8231582641601562 GB
Memory usage: 1.8236465454101562 GB
Memory usage: 1.8238906860351562 GB
Memory usage: 1.8243789672851562 GB
Memory usage: 1.8243789672851562 GB
Memory usage: 1.8253555297851562 GB
<lxml.etree._ElementTree object at 0x790d32161100>
End: 2 (pre-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 2 (post-collect)
  types | # objects | total size
======= | =========== | ============
   code | 0 | 112 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 3
  types | # objects | total size
======= | =========== | ============
Memory usage: 1.8261222839355469 GB
Memory usage: 1.8261222839355469 GB
Memory usage: 1.8261222839355469 GB
Memory usage: 1.8263664245605469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8266105651855469 GB
Memory usage: 1.8268547058105469 GB
Memory usage: 1.8268547058105469 GB
Memory usage: 1.8268547058105469 GB
Memory usage: 1.8268547058105469 GB
Memory usage: 1.8268547058105469 GB
Memory usage: 1.8268547058105469 GB
Memory usage: 1.8268547058105469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8273429870605469 GB
Memory usage: 1.8275871276855469 GB
Memory usage: 1.8275871276855469 GB
Memory usage: 1.8278312683105469 GB
Memory usage: 1.8278312683105469 GB
Memory usage: 1.8278312683105469 GB
Memory usage: 1.8278312683105469 GB
Memory usage: 1.8278312683105469 GB
Memory usage: 1.8280754089355469 GB
Memory usage: 1.8280754089355469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8283195495605469 GB
Memory usage: 1.8285636901855469 GB
Memory usage: 1.8285636901855469 GB
Memory usage: 1.8285636901855469 GB
<lxml.etree._ElementTree object at 0x790d3212e200>
End: 3 (pre-collect)
  types | # objects | total size
======= | =========== | ============
   code | 0 | 70 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Memory usage: 1.8325309753417969 GB
End: 3 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 4
  types | # objects | total size
======= | =========== | ============
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8325424194335938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
Memory usage: 1.8327865600585938 GB
<lxml.etree._ElementTree object at 0x790d3211d200>
End: 4 (pre-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 4 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
```

On Linux, the memory usage doesn't go down between cycles, not even after a forced garbage collection via `gc.collect()`: it stays at about 1.6 GB after the first parse and creeps up to about 1.83 GB by the fifth.

It seems like memory is leaked on Linux for every file parsed via lxml. I suspect that with several different XML files the memory usage would keep going up. It almost looks as if lxml caches parts of the file, which would explain why the usage eventually levels out when the same file is parsed over and over.
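To put numbers on the levelling-off, the last RSS reading printed before each `<lxml.etree._ElementTree ...>` line in the Linux log above can be compared run by run. The values below are copied from that log; the variable names exist only for this sketch. Note that the script labels its readings "GB" but divides by 1024 three times, so multiplying by 1024 gives MiB.

```python
# Last "Memory usage" reading before each parse finished, copied from the Linux log.
final_rss = [
    1.6144828796386719,  # run 0
    1.8036537170410156,  # run 1
    1.8253555297851562,  # run 2
    1.8285636901855469,  # run 3
    1.8327865600585938,  # run 4
]

# Growth of the high-water mark from one run to the next, converted to MiB.
growth_mib = [(after - before) * 1024 for before, after in zip(final_rss, final_rss[1:])]

for run, delta in enumerate(growth_mib, start=1):
    print(f"run {run}: +{delta:.1f} MiB over the previous run")
```

The increments shrink from roughly 190 MiB after the first repeat to about 22 MiB and then to a few MiB per run, so the growth flattens rather than scaling with the number of parses.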

csm10495 (csm10495) wrote :

On Mac:

```
(venv:v) cmachalo@cmachalo-mn2:/tmp/v $ python lxmltest.py
Python : sys.version_info(major=3, minor=13, micro=2, releaselevel='final', serial=0)
lxml.etree : (5, 4, 0, 0)
libxml used : (2, 13, 8)
libxml compiled : (2, 13, 8)
libxslt used : (1, 1, 43)
libxslt compiled : (1, 1, 43)
Memory usage: 0.0308380126953125 GB
Start: 0
                       types | # objects | total size
============================ | =========== | =============
                        list | 1793 | 155.81 KB
                         str | 1764 | 107.73 KB
                         int | 360 | 9.84 KB
                        dict | 2 | 288 B
       function (store_info) | 1 | 160 B
                        code | 0 | 112 B
                        cell | 2 | 80 B
           _sre.SRE_Template | 1 | 72 B
  builtin_function_or_method | 1 | 72 B
    functools._lru_list_elem | 1 | 56 B
                       tuple | -81 | -5736 B
Memory usage: 0.2293853759765625 GB
Memory usage: 0.473175048828125 GB
Memory usage: 0.7103729248046875 GB
Memory usage: 0.9484405517578125 GB
Memory usage: 1.175750732421875 GB
Memory usage: 1.39697265625 GB
Memory usage: 1.6256103515625 GB
<lxml.etree._ElementTree object at 0x101dd4280>
End: 0 (pre-collect)
                      types | # objects | total size
=========================== | =========== | ============
                       list | 5 | 328 B
  lxml.etree._ParserContext | 1 | 120 B
                        str | 2 | 120 B
       lxml.etree._ErrorLog | 1 | 80 B
      lxml.etree._TempStore | 1 | 48 B
                      tuple | -3 | -208 B
                       code | -1 | -280 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 0 (post-collect)
  types | # objects | total size
======= | =========== | ============
   list | 1 | 80 B
    str | 1 | 66 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Memory usage: 0.8919830322265625 GB
Start: 1
  types | # objects | total size
======= | =========== | ============
Memory usage: 0.9423065185546875 GB
Memory usage: 1.128265380859375 GB
Memory usage: 1.2782135009765625 GB
Memory usage: 1.4556732177734375 GB
Memory usage: 1.6227569580078125 GB
Memory usage: 1.7854156494140625 GB
Memory usage: 1.961761474609375 GB
<lxml.etree._ElementTree object at 0x101dfbec0>
End: 1 (pre-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|----------------------------------------------------------------------------------...
```

summary: - Memory not released after parse on Linux
+ Memory not released after parse on Linux/Mac
scoder (scoder) wrote :

Thanks for the report. I can reproduce this, but my guess is that it's due to memory fragmentation rather than actual leaking. You can see that from the fact that the numbers stop growing after a few parser runs, suggesting that later parser runs fit into the (very large) holes of the existing process memory.
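One way to probe the fragmentation hypothesis on a glibc-based Linux system is to ask the allocator to return free heap pages to the OS and then watch whether RSS drops. The sketch below is an assumption-laden diagnostic, not something proposed in this thread: `trim_glibc_heap` is a hypothetical helper, and `malloc_trim` is a glibc extension that does not exist on macOS, Windows, or musl-based systems.

```python
import ctypes
import gc


def trim_glibc_heap():
    """Ask glibc to hand free heap pages back to the OS.

    Returns True/False as reported by malloc_trim(), or None when the C
    library is not glibc (macOS, Windows, musl-based Linux).
    """
    gc.collect()  # drop unreachable Python objects first
    try:
        libc = ctypes.CDLL("libc.so.6")
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return bool(malloc_trim(0))
```

Calling this after `gc.collect()` in the report's loop would give one more data point: a large drop in the printed RSS suggests memory that was already free but still held by the allocator, while no drop is inconclusive on its own.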

lxml keeps names from the XML documents that it has seen before in a global hash table and reuses that in later parser runs. That speeds up parsing, but it can result in memory fragmentation if the hash table grows substantially during the first parse and needs to reallocate memory at the end of the currently occupied document memory.

These hash tables are thread-local, meaning that you can control their reuse by controlling the runtime of the current thread. If you parse in a separate thread, that thread will clean up its hash table as soon as it ends.
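The thread-based workaround described above could look roughly like this. It is a minimal sketch: `run_in_short_lived_thread` and `count_nodes` are hypothetical names, and returning a plain Python value rather than lxml objects is a cautious assumption so that nothing from the worker's parse outlives the thread.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_short_lived_thread(func, *args, **kwargs):
    """Run func(*args, **kwargs) in a worker thread that ends before this returns."""
    # Leaving the with-block shuts the pool down and joins its single worker thread.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(func, *args, **kwargs).result()


def count_nodes(path):
    """Example workload: parse `path` and return a plain int, not lxml objects."""
    from lxml import etree  # imported lazily so the helper above has no lxml dependency

    tree = etree.parse(path)
    return sum(1 for _ in tree.iter())  # counts every node that iter() yields
```

In the test script, `parse_once()` could then call `run_in_short_lived_thread(count_nodes, FILE_STR)` instead of parsing on the main thread. Whether RSS actually drops afterwards still depends on the allocator handing the freed pages back.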

I'll consider making the hash table parser-local rather than thread-local for lxml 6.0. The default parser already gets replicated per thread, so if you don't configure the parser, you'd get more or less the same behaviour as before. But parsers are much easier to create and discard than threads, which should simplify the hash table cleanup for users.

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Medium
milestone: none → 6.0
status: New → Confirmed
scoder (scoder) wrote :

Postponing to the next major release. This is too large a change for the "almost ready" 6.0 at this point, so I'll push that out of the door first and then see what I can do about this issue.

Changed in lxml:
milestone: 6.0 → 7.0
scoder (scoder) wrote :

Here is a branch, please give it a try:
https://github.com/lxml/lxml/pull/466

Note that the per-thread behaviour does not change if you use the default parser. But the new implementation allows you to control the lifetime of the names dict by creating and destroying parsers, per thread, per document type, time based, memory based, however you want.
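As an illustration of controlling that lifetime, a small wrapper could replace the parser after a fixed number of uses. This is only a sketch under the assumption that the branch behaves as described above; on released lxml versions the names dict is still thread-local, so discarding a parser would not have this effect. `RotatingParser` and `max_uses` are hypothetical names, not lxml API.

```python
class RotatingParser:
    """Hand out a parser and replace it with a fresh one after `max_uses` uses."""

    def __init__(self, factory, max_uses=100):
        if max_uses < 1:
            raise ValueError("max_uses must be at least 1")
        self._factory = factory  # any zero-argument callable, e.g. lxml.etree.XMLParser
        self._max_uses = max_uses
        self._parser = None
        self._uses = 0

    def get(self):
        # Drop the old parser once its budget is spent, so that anything tied
        # to its lifetime can be released along with it.
        if self._parser is None or self._uses >= self._max_uses:
            self._parser = self._factory()
            self._uses = 0
        self._uses += 1
        return self._parser
```

With lxml this might be used as `parsers = RotatingParser(etree.XMLParser, max_uses=1)` followed by `etree.parse(FILE_STR, parser=parsers.get())`. A time-based or memory-based policy would only change the condition inside `get()`.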

scoder (scoder) wrote :

The previous automatic thread-safety probably suffered from this change (so don't be too hard on it just yet). That will need fixing. But I'm planning to add modification locking anyway in order to support free-threading Python, and that will probably resolve both issues in one go.

csm10495 (csm10495) wrote :

I gave the latest master a try (ff9035f510c6a90df1d0fc50390500a54cbf70b5). The behavior seems more or less the same. I changed `parse_once()` to use a fresh `XMLParser`, since I expected memory usage to go down once that parser goes out of scope:

```
from lxml.etree import XMLParser  # in addition to the script's existing imports

def parse_once():
    print(parse(FILE_STR, parser=XMLParser()))
```

Though we seem to hit more or less the same issue (on Linux):

```
(venv:venv) csm10495@csm10495-desk:/mnt/c/Users/csm10495/Desktop 1 $ python3 lxmltest.py
Python : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree : (7, 0, 0, -200)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
Memory usage: 0.019012451171875 GB
Start: 0
                       types | # objects | total size
============================ | =========== | ============
                        list | 1531 | 132.25 KB
                         str | 1528 | 104.66 KB
                         int | 349 | 9.54 KB
                        dict | 2 | 168 B
       function (store_info) | 1 | 136 B
                        cell | 2 | 80 B
  builtin_function_or_method | 1 | 72 B
                     weakref | 1 | 72 B
                        code | 0 | 70 B
                      method | 1 | 64 B
                       float | 2 | 48 B
                   traceback | -1 | -56 B
                  _io.FileIO | -1 | -64 B
              AttributeError | -1 | -80 B
                       tuple | -2 | -136 B
Memory usage: 0.09310531616210938 GB
Memory usage: 0.17831039428710938 GB
Memory usage: 0.2642478942871094 GB
Memory usage: 0.3477439880371094 GB
Memory usage: 0.4351463317871094 GB
Memory usage: 0.5183982849121094 GB
Memory usage: 0.5970115661621094 GB
Memory usage: 0.6751365661621094 GB
Memory usage: 0.7574119567871094 GB
Memory usage: 0.8406639099121094 GB
Memory usage: 0.9231834411621094 GB
Memory usage: 1.0047264099121094 GB
Memory usage: 1.0867576599121094 GB
Memory usage: 1.1675682067871094 GB
Memory usage: 1.2483787536621094 GB
Memory usage: 1.3299217224121094 GB
Memory usage: 1.4092674255371094 GB
Memory usage: 1.4915428161621094 GB
Memory usage: 1.5713768005371094 GB
<lxml.etree._ElementTree object at 0x78ee063ba200>
End: 0 (pre-collect)
  types | # objects | total size
======= | =========== | ============
   code | 0 | 70 B
    str | -2 | -125 B
   list | -2 | -160 B
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
End: 0 (post-collect)
  types | # objects | total size
======= | =========== | ============
|-----------------------------------------------------------------------------------------|
|-----------------------------------------------------------------------------------------|
Start: 1
  types | ...
```
