score:53

Accepted answer

Some tips:

  1. Do not index your collection before inserting; every insert also has to update the index, which is extra overhead. Insert everything first, then create the index.

  2. Instead of "save", use MongoDB's batch insert ("batchinsert"), which can insert many records in one operation. Aim for around 5000 documents per batch and you will see a remarkable performance gain (see the sketch after this list).

    See method #2 of insert here; it takes an array of documents to insert instead of a single document. Also see the discussion in this thread.

    And if you want to benchmark more -

  3. This is just a guess, but try using a capped collection of a predefined large size to store all your data. A capped collection without an index has very good insert performance.
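As a rough illustration of tips 1 and 2, here is a minimal sketch using the MongoDB .NET driver (the answer itself names no driver, so the InsertMany call, the connection string, and the database/collection/field names are all assumptions for the example):

    using System.Collections.Generic;
    using MongoDB.Bson;
    using MongoDB.Driver;

    static class BulkLoader
    {
        public static void Load(IEnumerable<BsonDocument> documents)
        {
            var collection = new MongoClient("mongodb://localhost:27017")
                .GetDatabase("test")
                .GetCollection<BsonDocument>("items");

            // Tip 2: insert in batches of ~5000 documents instead of one save per document.
            var batch = new List<BsonDocument>(5000);
            foreach (var doc in documents)
            {
                batch.Add(doc);
                if (batch.Count == 5000)
                {
                    collection.InsertMany(batch);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                collection.InsertMany(batch);

            // Tip 1: create the secondary index only after all the data is in.
            collection.Indexes.CreateOne(
                new CreateIndexModel<BsonDocument>(
                    Builders<BsonDocument>.IndexKeys.Ascending("someField")));
        }
    }

For tip 3, the same driver can create a capped collection up front via CreateCollection with CreateCollectionOptions { Capped = true, MaxSize = ... }, but whether that helps your workload is something you'd have to benchmark.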

score:0

Another alternative is to try TokuMX. It uses Fractal Tree indexes, which means that it does not slow down over time as the database gets bigger.

TokuMX is going to be included as a custom storage engine in an upcoming version of MongoDB.

The current version of TokuMX runs under Linux. I was up and running on Windows quite quickly using Vagrant.

score:4

What I did in my project was add a bit of multithreading (the project is in C#, but I hope the code is self-explanatory). After experimenting with the number of threads, it turned out that setting the thread count to the number of cores gives slightly better performance (10-20%), though I suppose this boost is hardware specific. Here is the code:

    public virtual void SaveBatch(IEnumerable<object> entities)
    {
        if (entities == null)
            throw new ArgumentNullException("entities");

        _repository.SaveBatch(entities);
    }


    public void ParallelSaveBatch(IEnumerable<IEnumerable<object>> batchPortions)
    {
        if (batchPortions == null)
            throw new ArgumentNullException("batchPortions");
        var po = new ParallelOptions
                 {
                     MaxDegreeOfParallelism = Environment.ProcessorCount
                 };
        Parallel.ForEach(batchPortions, po, SaveBatch);
    }
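For completeness, the caller then needs to split the flat document stream into portions, one portion per batch insert. The chunking helper below is not part of the original answer, just a sketch of one way to feed ParallelSaveBatch:

    // Hypothetical helper: split a flat sequence into portions of `size` items,
    // so each portion becomes one batch insert running on its own worker.
    private static IEnumerable<IEnumerable<object>> Chunk(IEnumerable<object> source, int size)
    {
        var portion = new List<object>(size);
        foreach (var item in source)
        {
            portion.Add(item);
            if (portion.Count == size)
            {
                yield return portion;
                portion = new List<object>(size);
            }
        }
        if (portion.Count > 0)
            yield return portion;
    }

    // Usage: roughly 5000 documents per portion, as suggested in the accepted answer.
    // ParallelSaveBatch(Chunk(allEntities, 5000));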

score:6

I've had the same thing. As far as I can tell, it comes down to the randomness of the index values. Whenever a new document is inserted, it obviously also needs to update all the underlying indexes. Because you're inserting random, as opposed to sequential, values into these indexes, you're constantly accessing the entire index to find where to place the new value.

This is all fine to begin with when all the indexes are sitting happily in memory, but as soon as they grow too large you need to start hitting the disk to do index inserts, then the disk starts thrashing and write performance dies.

As you're loading the data, try comparing db.collection.totalIndexSize() with the available memory, and you'll probably see this happen.

Your best bet is to create the indexes after you've loaded the data. However, this still doesn't solve the problem when it's the required _id index that contains the random value (GUID, hash, etc.); in that case your best approach might be to think about sharding or getting more RAM.
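db.collection.totalIndexSize() is a mongo shell helper; if you'd rather watch it from C# while the load runs, the same number is available from the collStats command (a sketch, with a made-up collection name):

    using System;
    using MongoDB.Bson;
    using MongoDB.Driver;

    // Ask the server for collection statistics and read totalIndexSize (in bytes).
    var db = new MongoClient("mongodb://localhost:27017").GetDatabase("test");
    var stats = db.RunCommand<BsonDocument>(
        new BsonDocumentCommand<BsonDocument>(new BsonDocument("collStats", "items")));
    Console.WriteLine("totalIndexSize: {0} bytes", stats["totalIndexSize"].ToInt64());

Once that number approaches the RAM you can give MongoDB, the thrashing described above is likely what you're seeing.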

