fix youtube music metadata extraction

fixed the metadata extraction regex's catastrophic backtracking, made it faster on all inputs, and added proper support for artists using the middle dot character.

and now, a rant about properly checking your work and learning how to do shit before you publish changes:

simulated atomic groups did not make the regex faster - you added a newline. simulated atomic groups are always (guaranteed!) slower than normal groups, and removing them from the old regex makes that regex faster: https://regex101.com/r/8Ssf2h/3
this is fairly obvious to anyone who has actually learned how regexes are matched.

the fix is to add a delimiter to the start of the expression: https://regex101.com/r/XqqucW/1
without (?:\n|^), the regex attempts to find a match starting at every possible title character (which is virtually every location), and it then tries to extend each attempt until it can't. for the string "hello", it would have to check "hello", "ello", "llo", "lo", and "o". every one of those attempts scans forward and then backtracks when it fails, which causes quadratic performance in the number of input characters. again, this is fairly obvious to anyone who has actually learned how regexes are matched.

i really hope the next person to "improve" this actually takes the time to review their changes before pushing them.
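
The "hello" example can be turned into a rough cost model. This is a minimal sketch, not how the `re` module is implemented: the hypothetical helper `scan_cost` counts only the characters that `[^\n·]+` would consume at each start position the pattern is allowed to try, and ignores the give-back steps that follow a failed attempt.

```python
def scan_cost(text: str, anchored: bool) -> int:
    """Count characters consumed by `[^\\n·]+` over every attempted start.

    Simplified model (an assumption for illustration): only the forward
    scan of the first token is counted, not the backtracking afterwards.
    """
    cost = 0
    for start in range(len(text)):
        # With (?:\n|^) in front, a track may only begin at the start of
        # the string or directly after a newline; other starts fail at once.
        if anchored and start > 0 and text[start - 1] != "\n":
            continue
        end = start
        while end < len(text) and text[end] not in "\n·":
            end += 1
        cost += end - start
    return cost


if __name__ == "__main__":
    for n in (10, 100, 1000):
        line = "a" * n  # one long line with no "·" in it
        print(n, scan_cost(line, anchored=False), scan_cost(line, anchored=True))
```

In this model the unanchored cost for a single line of n characters is n(n+1)/2, while the anchored cost is n.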
1 month ago · 1b4d0401e4
parent 71f30921a2
commit 1b4d0401e4
1 changed file with 7 additions and 12 deletions
--- a/yt_dlp/extractor/youtube/_video.py
+++ b/yt_dlp/extractor/youtube/_video.py
@@ -4177,20 +4177,15 @@ class YoutubeIE(YoutubeBaseInfoExtractor):

        # Youtube Music Auto-generated description
        if (video_description or '').strip().endswith('\nAuto-generated by YouTube.'):
-            # XXX: Causes catastrophic backtracking if description has "·"
-            # E.g. https://www.youtube.com/watch?v=DoPaAxMQoiI
-            # Simulating atomic groups:  (?P<a>[^xy]+)x  =>  (?=(?P<a>[^xy]+))(?P=a)x
-            # reduces it, but does not fully fix it. https://regex101.com/r/8Ssf2h/2
+            # Before you change this, learn how regexes work. The last guy didn't.
            mobj = re.search(
                r'''(?xs)
-                    (?=(?P<track>[^\n·]+))(?P=track)·
-                    (?=(?P<artist>[^\n]+))(?P=artist)\n+
-                    (?=(?P<album>[^\n]+))(?P=album)\n
-                    (?:.+?℗\s*(?P<release_year>\d{4})(?!\d))?
-                    (?:.+?Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?
-                    (.+?\nArtist\s*:\s*
-                        (?=(?P<clean_artist>[^\n]+))(?P=clean_artist)\n
-                    )?.+\nAuto-generated\ by\ YouTube\.\s*$
+                    (?:\n|^)(?P<track>[^\n·]+)\ ·\ (?P<artist>[^\n]+)\n+
+                    (?P<album>[^\n]+)\n+
+                    (?:℗\s*(?P<release_year>\d{4})[^\n]+\n+)?
+                    (?:Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?.+?
+                    (\nArtist\s*:\s*(?P<clean_artist>[^\n]+)\n)?
+                    .+Auto-generated\ by\ YouTube\.\s*$
                ''', video_description)
            if mobj:
                release_year = mobj.group('release_year')
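
As a quick sanity check, the new pattern can be run outside the extractor. The sketch below copies the pattern from the "+" side of the diff; the description text is made up for illustration (it is not taken from a real video), and the names `PATTERN`, `SAMPLE`, and `fields` are hypothetical.

```python
import re

# Pattern copied from the "+" lines of the diff.
PATTERN = re.compile(r'''(?xs)
    (?:\n|^)(?P<track>[^\n·]+)\ ·\ (?P<artist>[^\n]+)\n+
    (?P<album>[^\n]+)\n+
    (?:℗\s*(?P<release_year>\d{4})[^\n]+\n+)?
    (?:Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?.+?
    (\nArtist\s*:\s*(?P<clean_artist>[^\n]+)\n)?
    .+Auto-generated\ by\ YouTube\.\s*$
''')

# Invented sample description; the artist line contains a second "·".
SAMPLE = '\n\n'.join([
    'Provided to YouTube by Example Records',
    'Song Title · Artist A · Artist B',
    'Album Name',
    '℗ 2020 Example Records',
    'Released on: 2020-01-31',
    'Artist: Artist A',
    'Auto-generated by YouTube.',
])

mobj = PATTERN.search(SAMPLE)
fields = mobj.groupdict() if mobj else {}

if __name__ == '__main__':
    for name, value in fields.items():
        print(f'{name}: {value}')
```

Because `track` is `[^\n·]+`, it stops at the first middle dot, and everything after the first ` · ` on that line ends up in `artist`.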