hub / github.com/python/cpython / _guess_delimiter

Method _guess_delimiter

Lib/csv.py:347–449

The delimiter /should/ occur the same number of times on each row. However, due to malformed data, it may not. We don't want an all or nothing approach, so we allow for small variations in this number.
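To make the idea of "small variations" concrete, the following is a minimal sketch of the per-character frequency table the summary describes. The sample rows and the name `meta_frequency` are assumptions made for illustration and do not come from `Lib/csv.py`.

```python
from collections import Counter, defaultdict

# Hypothetical sample: the last row is malformed (one comma is missing).
rows = ["a,b,c", "1,2,3", "4,5,6", "7,8"]

# {char -> {count_per_line -> number_of_lines_with_that_count}}
meta_frequency = defaultdict(Counter)
for row in rows:
    for char, count in Counter(row).items():
        meta_frequency[char][count] += 1

# The comma occurs twice on three rows and once on one row.
comma_table = dict(meta_frequency[","])
print(comma_table)  # {2: 3, 1: 1}
```

In this sample the comma's most common per-line count is 2, so a single short row does not disqualify it as a candidate.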

(self, data, delimiters)


def _guess_delimiter(self, data, delimiters):
    """
    The delimiter /should/ occur the same number of times on
    each row. However, due to malformed data, it may not. We don't want
    an all or nothing approach, so we allow for small variations in this
    number.
      1) build a table of the frequency of each character on every line.
      2) build a table of frequencies of this frequency (meta-frequency?),
         e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
         7 times in 2 rows'
      3) use the mode of the meta-frequency to determine the /expected/
         frequency for that character
      4) find out how often the character actually meets that goal
      5) the character that best meets its goal is the delimiter
    For performance reasons, the data is evaluated in chunks, so it can
    try and evaluate the smallest portion of the data possible, evaluating
    additional chunks as necessary.
    """
    from collections import Counter, defaultdict

    data = list(filter(None, data.split('\n')))

    # build frequency tables
    chunkLength = min(10, len(data))
    iteration = 0
    num_lines = 0
    # {char -> {count_per_line -> num_lines_with_that_count}}
    char_frequency = defaultdict(Counter)
    modes = {}
    delims = {}
    start, end = 0, chunkLength
    while start < len(data):
        iteration += 1
        for line in data[start:end]:
            num_lines += 1
            for char, count in Counter(line).items():
                if char.isascii():
                    char_frequency[char][count] += 1

        for char, counts in char_frequency.items():
            items = list(counts.items())
            missed_lines = num_lines - sum(counts.values())
            if missed_lines:
                # Store the number of lines 'char' was missing from.
                items.append((0, missed_lines))
            if len(items) == 1 and items[0][0] == 0:
                continue
            # get the mode of the frequencies
            if len(items) > 1:
                modes[char] = max(items, key=lambda x: x[1])
                # adjust the mode - subtract the sum of all
                # other frequencies
                items.remove(modes[char])
                modes[char] = (modes[char][0], modes[char][1]
                               - sum(item[1] for item in items))
            else:
                modes[char] = items[0]
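The mode adjustment at the end of the listing can be tried in isolation. The sketch below restates that step as a standalone function. The helper name `adjusted_mode` and its return convention are assumptions for illustration, not part of the `csv` module.

```python
def adjusted_mode(counts, num_lines):
    """Hypothetical helper that mirrors the mode adjustment in the listing.

    counts maps count_per_line -> number of lines with that count.
    Returns (expected_count, score), or None if the character never appears.
    """
    items = list(counts.items())
    missed_lines = num_lines - sum(counts.values())
    if missed_lines:
        # Lines on which the character did not appear at all.
        items.append((0, missed_lines))
    if len(items) == 1 and items[0][0] == 0:
        return None
    if len(items) == 1:
        return items[0]
    # The mode is the per-line count seen on the most lines.
    mode = max(items, key=lambda item: item[1])
    items.remove(mode)
    # Penalize the mode by the number of lines that disagree with it.
    return (mode[0], mode[1] - sum(n for _, n in items))


print(adjusted_mode({2: 4}, 4))        # (2, 4)
print(adjusted_mode({2: 3, 1: 1}, 4))  # (2, 2)
print(adjusted_mode({1: 1}, 4))        # (0, 2)
```

A character with the same count on every line keeps its full score, one deviating line costs a point, and a character that is absent from most lines ends up with an expected count of 0.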

Callers 1

sniff · Method · 0.95
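Since `_guess_delimiter` is private, it is normally reached through the public `csv.Sniffer.sniff` method. A minimal usage sketch follows; the sample text is an assumption chosen so that no quote characters are present.

```python
import csv

# Hypothetical unquoted sample: every row contains exactly two commas.
sample = "1,2,3\n4,5,6\n7,8,9\n10,11,12\n"

dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))  # ','

# The optional delimiters argument restricts the candidate characters.
restricted = csv.Sniffer().sniff(sample, delimiters=",;")
print(repr(restricted.delimiter))  # ','
```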

Calls 13

Counter · Class · 0.90
list · Class · 0.85
defaultdict · Class · 0.85
isascii · Method · 0.80
filter · Function · 0.70
split · Method · 0.45
items · Method · 0.45
values · Method · 0.45
append · Method · 0.45
remove · Method · 0.45
keys · Method · 0.45
count · Method · 0.45

Tested by

no test coverage detected
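One way to exercise this method indirectly is a small `unittest` case that goes through `csv.Sniffer.sniff`. The sketch below is an assumption about how such a check could look; the class name and the samples are hypothetical and are not taken from the CPython test suite.

```python
import csv
import unittest


class SniffDelimiterSketch(unittest.TestCase):
    """Hypothetical test case; the samples contain no quote characters."""

    def test_consistent_delimiters(self):
        for delimiter in (",", ";", "|", "\t"):
            with self.subTest(delimiter=delimiter):
                rows = [["a", "b", "c"], ["1", "2", "3"], ["4", "5", "6"]]
                sample = "\n".join(delimiter.join(row) for row in rows) + "\n"
                dialect = csv.Sniffer().sniff(sample)
                self.assertEqual(dialect.delimiter, delimiter)


if __name__ == "__main__":
    unittest.main()
```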