hub / github.com/Python-Markdown/markdown / HTMLExtractor

Class HTMLExtractor

markdown/htmlparser.py:120–422

Extract raw HTML from text. The raw HTML is stored in the [`htmlStash`][markdown.util.HtmlStash] of the [`Markdown`][markdown.Markdown] instance passed to `md` and the remaining text is stored in `cleandoc` as a list of strings.
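The stash-and-placeholder idea described above can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: `TinyStash` and `TinyExtractor` are hypothetical names, the placeholder format is invented for the example, and the sketch ignores the `inraw` and `intail` state that the real class tracks. It assumes well-formed input with no character references.

```python
from html.parser import HTMLParser


class TinyStash:
    """Hypothetical stand-in for a stash: keeps raw HTML, hands back a placeholder."""

    def __init__(self):
        self.blocks: list[str] = []

    def store(self, html: str) -> str:
        self.blocks.append(html)
        # Invented placeholder format, used only in this sketch.
        return f"{{stash:{len(self.blocks) - 1}}}"


class TinyExtractor(HTMLParser):
    """Simplified illustration: stashes each top-level element, keeps other text."""

    def __init__(self, stash: TinyStash):
        super().__init__(convert_charrefs=False)
        self.stash = stash
        self.stack: list[str] = []     # tags that are currently open
        self._cache: list[str] = []    # raw HTML collected for the open element
        self.cleandoc: list[str] = []  # remaining text, with placeholders

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self._cache.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._cache.append(f"</{tag}>")
        if self.stack:
            self.stack.pop()
        if not self.stack:
            # Outermost element closed: stash it and leave a placeholder behind.
            self.cleandoc.append(self.stash.store("".join(self._cache)))
            self._cache = []

    def handle_data(self, data):
        # Text inside an open element is raw HTML; text outside is document text.
        (self._cache if self.stack else self.cleandoc).append(data)


stash = TinyStash()
parser = TinyExtractor(stash)
parser.feed("<div>raw <b>x</b></div>\n\nSome *text*.\n")
parser.close()

print(stash.blocks)               # ['<div>raw <b>x</b></div>']
print("".join(parser.cleandoc))   # placeholder, blank line, then the Markdown text
```

Keeping the remaining text as a list of strings and joining it once at the end avoids repeated string concatenation while the parser emits many small chunks.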


class HTMLExtractor(htmlparser.HTMLParser):
    """
    Extract raw HTML from text.

    The raw HTML is stored in the [`htmlStash`][markdown.util.HtmlStash] of the
    [`Markdown`][markdown.Markdown] instance passed to `md` and the remaining text
    is stored in `cleandoc` as a list of strings.
    """

    def __init__(self, md: Markdown, *args, **kwargs):
        if 'convert_charrefs' not in kwargs:
            kwargs['convert_charrefs'] = False

        # Block tags that should contain no content (self closing)
        self.empty_tags = set(['hr'])

        self.lineno_start_cache = [0]

        # This calls self.reset
        super().__init__(*args, **kwargs)
        self.md = md

    def reset(self):
        """Reset this instance. Loses all unprocessed data."""
        self.inraw = False
        self.intail = False
        self.stack: list[str] = []  # When `inraw==True`, stack contains a list of tags
        self._cache: list[str] = []
        self.cleandoc: list[str] = []
        self.lineno_start_cache = [0]

        super().reset()

    def close(self):
        """Handle any buffered data."""
        super().close()
        if len(self.rawdata):
            # Temp fix for https://bugs.python.org/issue41989
            # TODO: remove this when the bug is fixed in all supported Python versions.
            if self.convert_charrefs and not self.cdata_elem:  # pragma: no cover
                self.handle_data(htmlparser.unescape(self.rawdata))
            else:
                self.handle_data(self.rawdata)
        # Handle any unclosed tags.
        if len(self._cache):
            self.cleandoc.append(self.md.htmlStash.store(''.join(self._cache)))
            self._cache = []

    @property
    def line_offset(self) -> int:
        """Returns char index in `self.rawdata` for the start of the current line. """
        for ii in range(len(self.lineno_start_cache)-1, self.lineno-1):
            last_line_start_pos = self.lineno_start_cache[ii]
            lf_pos = self.rawdata.find('\n', last_line_start_pos)
            if lf_pos == -1:
                # No more newlines found. Use end of raw data as start of line beyond end.
                lf_pos = len(self.rawdata)
            self.lineno_start_cache.append(lf_pos+1)
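The lazy line-start cache that `line_offset` builds can be tried out in isolation. The sketch below mirrors the loop from the listing in a standalone class; `LineStartCache` and `offset` are hypothetical names, and the final `return` is an assumption, since the listing ends before the property returns a value.

```python
class LineStartCache:
    """Lazily records the char index at which each line of `text` starts."""

    def __init__(self, text: str):
        self.text = text
        self.starts = [0]  # line 1 always starts at index 0

    def offset(self, lineno: int) -> int:
        """Return the start index of 1-based line `lineno`, extending the cache as needed."""
        for ii in range(len(self.starts) - 1, lineno - 1):
            lf_pos = self.text.find('\n', self.starts[ii])
            if lf_pos == -1:
                # No further newline: use end of text as the start of the line beyond the end.
                lf_pos = len(self.text)
            self.starts.append(lf_pos + 1)
        # Assumed return value: the cached start of the requested line.
        return self.starts[lineno - 1]


cache = LineStartCache("ab\ncde\n\nf")
print(cache.offset(3))  # 7
print(cache.starts)     # [0, 3, 7]
print(cache.offset(1))  # 0
```

Because each lookup resumes from the last cached entry, asking for increasing line numbers scans the text only once overall. Asking for an earlier line is a plain list lookup.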

Callers 1

runMethod · 0.85

Calls

no outgoing calls

Tested by

no test coverage detected
