python中对列表快速去重(翻译)

在 python 中假如你有一个像这样的列表:

['a','b','a']
# or like this:
[1,2,2,2,3,4,5,6,6,6,6]

并且你想去除所有的重复项,得到下面这样的结果:

['a','b']
# or
[1,2,3,4,5,6]

怎么做?…最快的方法?我写了几个可供选择的方法,并且为了找出最快的方法对每种方法都做了基准测试。(测试时我没有关注内存的使用)最慢的函数比最快的慢了78倍。

然而,在不同的函数间有一个很重要的区别。有些去重后保持顺序,有些不是。例如,一个保持顺序的函数,去除重复元素后,新的元素顺序保证和他原来一样,例如,uniqify([1,2,2,3])==[1,2,3]

函数如下:

def f1(seq):
  # not order preserving
  set = {}
  map(set.__setitem__, seq, [])
  return set.keys()

def f2(seq):
  # order preserving
  checked = []
  for e in seq:
      if e not in checked:
          checked.append(e)
  return checked

def f3(seq):
  # Not order preserving
  keys = {}
  for e in seq:
      keys[e] = 1
  return keys.keys()

def f4(seq):
  # order preserving
  noDupes = []
  [noDupes.append(i) for i in seq if not noDupes.count(i)]
  return noDupes

def f5(seq, idfun=None):
  # order preserving
  if idfun is None:
      def idfun(x): return x
  seen = {}
  result = []
  for item in seq:
      marker = idfun(item)
      # in old Python versions:
      # if seen.has_key(marker)
      # but in new ones:
      if marker in seen: continue
      seen[marker] = 1
      result.append(item)
  return result

def f6(seq):
  # Not order preserving
  set = Set(seq)
  return list(set)

这是你正在等待的(如果你还在看的话),结果如下 :

* f2 13.24
* f4 11.73
* f5 0.37
f1 0.18
f3 0.17
f6 0.19

(* 保持顺序)

显然 f5 是最好的解决方法。它不仅最快,而且具有保持顺序的特性和支持传递一个转换函数使它可以像下面这样:

>>> a=list('ABeeE')
>>> f5(a)
['A','B','e','E']
>>> f5(a, lambda x: x.lower())
['A','B','e']

下载脚本

更新

根据评论内容,我又添加了更多的函数来进行基准测试。一些函数不支持对象列表去重,因此不能被哈希,除非传递给他一个特别的哈希方法(这句一直不理解是什么意思,请参考原文,如有更好译法请留言)。查看所有函数

下面是新的结果:

* f5 10.1
* f5b 9.99
* f8 6.49
* f10 6.57
* f11 6.6
f1 4.28
f3 3.55
f6 4.03
f7 2.59
f9 2.58

(f2 and f4) 在本次测试数据下很慢。

更多精彩内容可以参见原文及原文留言区,胡阳大神的另一篇博文写的也很好,请参见如下链接地址。

参考文档:

原文:http://www.peterbe.com/plog/uniqifiers-benchmark

胡阳大神关于list去重的一篇博文:http://www.the5fire.com/python-remove-duplicates-in-list.html


附:

A:

关于上面那句拗口的python元素hashable,可以参见glossary,解释如下:

hashable
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.

Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.

All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is their id().

下面是一个可以hashable和不可以hashable的例子,看下就理解了:

In [80]: l1 = [1,2,3]
In [81]: set(l1)
Out[81]: {1, 2, 3}
In [82]: l2 = [[1],[2]]
In [83]: set(l2)

TypeError                                 Traceback (most recent call last)
<ipython-input-83-c0515a45cc87> in <module>()
----> 1 set(l2)
TypeError: unhashable type: 'list'
In [84]:

B:

uniqifiers_benchmark.py 源码

from random import shuffle, randint
import re
from sets import Set

def f1(seq): # Raymond Hettinger
    # not order preserving
    set = {}
    map(set.__setitem__, seq, [])
    return set.keys()


def f2(seq):   # *********
    # order preserving
    checked = []
    for e in seq:
        if e not in checked:
            checked.append(e)
    return checked

def f3(seq):
    # Not order preserving
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()

def f4(seq): # ********** order preserving
    noDupes = []
    [noDupes.append(i) for i in seq if not noDupes.count(i)]
    return noDupes

def f5(seq, idfun=None): # Alex Martelli ******* order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result


def f5b(seq, idfun=None): # Alex Martelli ******* order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker not in seen:
            seen[marker] = 1
            result.append(item)

    return result



def f6(seq):
    # Not order preserving
    return list(Set(seq))

def f7(seq):
    # Not order preserving
    return list(set(seq))

def f8(seq): # Dave Kirby
    # Order preserving
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def f9(seq):
    # Not order preserving
    return {}.fromkeys(seq).keys()

def f10(seq, idfun=None): # Andrew Dalke
    # Order preserving
    return list(_f10(seq, idfun))

def _f10(seq, idfun=None):
    seen = set()
    if idfun is None:
        for x in seq:
            if x in seen:
                continue
            seen.add(x)
            yield x
    else:
        for x in seq:
            x = idfun(x)
            if x in seen:
                continue
            seen.add(x)
            yield x


def f11(seq): # f10 but simpler
    # Order preserving
    return list(_f10(seq))

def _f11(seq):
    seen = set()
    for x in seq:
        if x in seen:
            continue
        seen.add(x)
        yield x

import time

def timing(f, n, a):
    print f.__name__,
    r = range(n)
    t1 = time.clock()
    for i in r:
        f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a)
    t2 = time.clock()
    print round(t2-t1, 3)




def getRandomString(length=10, loweronly=1, numbersonly=0,
                    lettersonly=0):
    """ return a very random string """
    _letters = 'abcdefghijklmnopqrstuvwxyz'
    if numbersonly:
        l = list('0123456789')
    elif lettersonly:
        l = list(_letters + _letters.upper())
    else:
        lowercase = _letters+'0123456789'*2
        l = list(lowercase + lowercase.upper())
    shuffle(l)
    s = ''.join(l)
    if len(s) < length:
        s = s + getRandomString(loweronly=1)
    s = s[:length]
    if loweronly:
        return s.lower()
    else:
        return s

testdata = {}
for i in range(35):
    k = getRandomString(5, lettersonly=1)
    v = getRandomString(100 )
    testdata[k] = v

testdata = [int(x) for x in list('21354612')]
testdata += list('abcceeaa5efm')
class X:
    def __init__(self, n):
        self.foo = n
    def __repr__(self):
        return "<foo %r>"%self.foo
    def __cmp__(self, e):
        return cmp(self.foo, e.foo)

testdata = []
for i in range(10000):
    testdata.append(getRandomString(3, loweronly=True))
#testdata = ['f','g','c','d','b','a','a']


order_preserving = f2, f4, f5, f5b, f8, f10, f11
order_preserving = f5, f5b, f8, f10, f11

not_order_preserving = f1, f3, f6, f7, f9
testfuncs = order_preserving + not_order_preserving


for f in testfuncs:
    if f in order_preserving:
        print "*",
    timing(f, 100, testdata)