Sequence allocation size question
Hi, I have a question on what my allocation size should be for my call to batch insert records into a table. I'm using sequence generator and I have a batch process that has to insert or update 50,000 records. I'm noticing the performance for this process isn't great and I read online I might need to adjust allocation size (currently have it as 1). So if I'm batch inserting in batches of 500 records, does that mean my allocation size should be 500?
33 Replies
⌛
This post has been reserved for your question.
Hey @Kale Vivi! Please use /close or the Close Post button above when your problem is solved. Please remember to follow the help guidelines. This post will be automatically marked as dormant after 300 minutes of inactivity.
TIP: Narrow down your issue to simple and precise questions to maximize the chance that others will reply in here.
💤
Post marked as dormant
This post has been inactive for over 300 minutes, thus, it has been archived.
If your question was not answered yet, feel free to re-open this post or create a new one.
In case your post is not getting any attention, you can try to use /help ping.
Warning: abusing this will result in moderative actions taken against you.
Hi @Kale Vivi ,
If you're batch inserting 50,000 records in chunks of 500 and you're using a sequence generator with allocationSize set to 1, that’s almost certainly the root cause of your performance issues. Setting allocationSize to 1 forces the database to fetch a new sequence value for every single record, which is incredibly inefficient for batch operations.
You should align your allocationSize with your batch size. In this case, setting allocationSize to 500 will allow the sequence generator to allocate 500 IDs in a single call. This drastically reduces the number of round trips to the database for sequence generation. Here’s how it works:
- When allocationSize is 500, the sequence generator pre-fetches 500 values in one call and assigns them to your records as needed.
- For a batch of 500 inserts, this means the database only hits the sequence once per batch, instead of 500 times.
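In JPA this is the allocationSize attribute on @SequenceGenerator (paired with GenerationType.SEQUENCE). To make the round-trip savings concrete, here is a plain-Java sketch of the pre-fetching idea (a simulation, not the real Hibernate implementation): the allocator asks the "database" for a whole block of ids at once and hands them out locally until the block runs dry.

```java
import java.util.concurrent.atomic.AtomicLong;

// Plain-Java simulation of what allocationSize does conceptually: instead of
// one sequence call per record, the allocator fetches a block of ids up front
// and hands them out locally until the block is exhausted.
public class BlockIdAllocator {
    private final int allocationSize;
    private final AtomicLong dbRoundTrips = new AtomicLong(); // simulated sequence calls
    private long next;    // next id to hand out
    private long ceiling; // exclusive upper bound of the current block

    public BlockIdAllocator(int allocationSize) {
        this.allocationSize = allocationSize;
    }

    // Simulates "SELECT nextval(seq)": one round trip returns a whole block.
    private void fetchBlockFromDatabase() {
        dbRoundTrips.incrementAndGet();
        next = ceiling;
        ceiling = next + allocationSize;
    }

    public synchronized long nextId() {
        if (next >= ceiling) {
            fetchBlockFromDatabase();
        }
        return next++;
    }

    public long roundTrips() {
        return dbRoundTrips.get();
    }

    public static void main(String[] args) {
        BlockIdAllocator one = new BlockIdAllocator(1);
        BlockIdAllocator pooled = new BlockIdAllocator(500);
        for (int i = 0; i < 50_000; i++) {
            one.nextId();
            pooled.nextId();
        }
        System.out.println("allocationSize=1   -> " + one.roundTrips() + " round trips");   // 50000
        System.out.println("allocationSize=500 -> " + pooled.roundTrips() + " round trips"); // 100
    }
}
```

For 50,000 ids, allocationSize=1 costs 50,000 sequence round trips while allocationSize=500 costs 100, which is the whole effect you are after. One caveat: the allocationSize on the entity must match the INCREMENT BY of the database sequence itself, or you can get duplicate-key errors.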
Thanks, I adjusted first to 500 and saw a good improvement. Trying 1000 and 1000 batch size now to see if it's even better
If you are finished with your post, please close it.
If you are not, please ignore this message.
Note that you will not be able to send further messages here after this post has been closed, but you will be able to create new posts.
My other concern is I'm also going one by one in the list to convert the domain object to entity before persistence and I'm worried about that also slowing performance
Use Java streams instead. You can use parallelStream there and it will speed up the processing a lot.
Ok, let me take a look there.
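As a sketch of that suggestion, the domain-to-entity mapping can be fanned out across cores with parallelStream. PersonDto and PersonEntity here are hypothetical stand-ins for the real classes, since they were not shared at this point:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the parallelStream suggestion: map domain objects to entities in
// parallel instead of a sequential for-each loop. PersonDto/PersonEntity are
// hypothetical stand-ins for the real domain and entity classes.
public class ParallelMapDemo {

    record PersonDto(String idNumber) {}
    record PersonEntity(String idNumber) {}

    static List<PersonEntity> toEntities(List<PersonDto> dtos) {
        return dtos.parallelStream()               // fan the mapping out across cores
                .map(d -> new PersonEntity(d.idNumber()))
                .collect(Collectors.toList());     // collect preserves input order
    }

    public static void main(String[] args) {
        List<PersonDto> dtos = List.of(new PersonDto("A1"), new PersonDto("A2"));
        System.out.println(toEntities(dtos).size() + " entities mapped"); // 2 entities mapped
    }
}
```

Worth noting: parallel streams only help when the per-element mapping is CPU-bound; doing repository/database calls inside a parallel stream is usually counterproductive, so keep the parallelism to the pure object mapping.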
Here's kind of what I have so far and I'm feeling like I'm definitely going to need to refactor this
@PersistenceContext
private final EntityManager entityManager;

private final RepoOne repoOne;
private final RepoTwo repoTwo;
private final RepoThree repoThree;
private final RepoFour repoFour;
private final RepoFive repoFive;
private final RepoSix repoSix;
private final RepoSeven repoSeven;
private final RepoEight repoEight;
private final RepoNine repoNine;

private static final int BATCH_SIZE = 1000;

@Override
@Transactional
public void saveAll(Set<Person> persons) {
    Iterable<List<Person>> peopleBatches = ListUtils.partition(persons.stream().toList(), BATCH_SIZE);
    List<Person> newPeople = new ArrayList<>();
    List<Person> existingPeople = new ArrayList<>();
    AtomicInteger indexPersonBatches = new AtomicInteger(1);
    for (List<Person> batch : peopleBatches) {
        for (Person person : batch) {
            if (existsBy(person.getIdNumber())) {
                existingPeople.add(person);
            } else {
                newPeople.add(person);
            }
        }
        updatePeople(existingPeople);
        savePeople(newPeople);
        existingPeople.clear();
        newPeople.clear();
    }
    entityManager.flush();
    entityManager.clear();
}

private void savePeople(List<Person> newPeople) {
    for (Person person : newPeople) {
        repoOne.save(One.fromPerson(person));
        repoTwo.save(Two.fromPerson(person));
        repoThree.save(Three.fromPerson(person));
        repoFour.save(Four.fromPerson(person));
        repoFive.saveAll(Five.fromPerson(person));
        repoSix.saveAll(Six.fromPerson(person));
        repoSeven.saveAll(Seven.fromPerson(person));
        repoEight.saveAll(Eight.fromPerson(person));
        repoNine.saveAll(Nine.fromPerson(person));
    }
}

public boolean existsBy(String number) {
    return repoOne.findByIdNumber(number) != null;
}
I think the issue is savePeople above, where I go one by one. In each repo's fromPerson, that's where I take the domain object and map it to an entity field by field. Should I just use parallel streams there? It's crazy that I have to break the one object above into smaller objects to persist across multiple tables too, which I also feel impacts performance.
OK, I see.
What I can suggest here:
1. Instead of checking existsBy one by one, do it for the whole batch of ids you already have from the split list. It reduces DB interaction from N calls to M calls, where N is the number of persons and M is the number of batches after splitting.
2. Improve the exists/not-exists logic by splitting into the two lists from the result you get after point 1.
3. With this code I suggest using saveAll instead of single save calls.
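The batched existence check in point 1 can be sketched like this. The repository method name is an assumption; with Spring Data JPA you would declare something like `List<One> findAllByIdNumberIn(Collection<String> idNumbers)`, which generates a single IN (...) query per batch:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of the suggestion: one IN-clause lookup per batch instead of one
// existsBy call per person, then a local split into existing vs. new.
public class BatchExistsDemo {

    // Stand-in for the database: id numbers that already exist.
    static Set<String> existingInDb = new HashSet<>(Set.of("A1", "A3"));
    static int dbCalls = 0;

    // Simulates the single IN-clause query for a whole batch.
    static Set<String> findExistingIdNumbers(Collection<String> idNumbers) {
        dbCalls++;
        return idNumbers.stream().filter(existingInDb::contains).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> batch = List.of("A1", "A2", "A3", "A4");

        // One DB call for the whole batch...
        Set<String> existing = findExistingIdNumbers(batch);

        // ...then split locally into the two lists.
        Map<Boolean, List<String>> split = batch.stream()
                .collect(Collectors.partitioningBy(existing::contains));
        List<String> existingPeople = split.get(true);
        List<String> newPeople = split.get(false);

        System.out.println("db calls: " + dbCalls);        // 1 per batch, not per person
        System.out.println("existing: " + existingPeople); // [A1, A3]
        System.out.println("new: " + newPeople);           // [A2, A4]
    }
}
```

For a 1000-person batch this turns 1000 existsBy queries into one query plus an in-memory partition.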
Let me know if this helped to speed up processing the way you need it.
Thanks a lot Puggy! I got distracted by writing some tests but will take a look at the above in a minute and give it a try. Thank you.
Just adding here in case the above wasn't clear for the savePeople().
I have a list of Person objects where Person has single-value fields, and nested under Person are lists of other objects (e.g. for each Person object there's a List&lt;Six&gt; six). That's why I had to use saveAll for the later repos but not the first several, where there's only one field/value for that person, versus the others where a person can have multiple records (e.g. Six here). Sorry for the names (just masked them, they're bad).
So you can move the repos that use saveAll outside of the for loop, and use my suggestion for them. For the other repos you can keep using your for loop 🙂
Not sure I follow the last part here.
Let me share the builders to confirm understanding:
One entity:
public static One fromPerson(Person person) {
    return One.builder()
            .idNumber(person.getIdNumber())
            .build();
}

Six entity:
public static List<Six> fromPerson(Person person) {
    return person.getSix().stream()
            .map(s -> Six.builder()
                    .idNumber(person.getIdNumber())
                    .someField(s.getSomeField())
                    .build())
            .collect(Collectors.toList());
}
Should I just do saveAll for all of them and drop the loop, or keep a loop for entities like Six here?
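One way to move saveAll outside the per-person loop, as suggested above: flatten every person's Six children into one list for the whole batch, then persist with a single repoSix.saveAll call. This is a sketch with simplified stand-in records, not the real domain classes:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: collect all Six entities for a whole batch first, then issue one
// saveAll call instead of one per person. Person/Six are simplified stand-ins.
public class CollectThenSaveAll {

    record Six(String idNumber, String someField) {}
    record Person(String idNumber, List<String> sixFields) {}

    // Mirrors the Six.fromPerson builder shown above.
    static List<Six> toSixEntities(Person p) {
        return p.sixFields().stream()
                .map(f -> new Six(p.idNumber(), f))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> batch = List.of(
                new Person("A1", List.of("x", "y")),
                new Person("A2", List.of("z")));

        // One flat list for the whole batch -> one repository call.
        List<Six> allSix = batch.stream()
                .flatMap(p -> toSixEntities(p).stream())
                .collect(Collectors.toList());

        // repoSix.saveAll(allSix); // single saveAll per batch
        System.out.println(allSix.size() + " Six rows in one saveAll"); // 3 Six rows in one saveAll
    }
}
```

The single-value entities (One through Four) can be handled the same way: map the batch to a List of each entity type, then one saveAll per repository per batch.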
Thanks again!
Thank you for sharing more code.
Suggestion
Thank you, giving it a try now. You're right the for each loop code was slow
Great! 🫶
Hey Puggy, while it is a dramatic improvement over the previous implementation, it is still slow (at least when running the job locally). Would other approaches maybe include leveraging @Async with a thread executor, or parallel streams? Does that seem like the right way forward?
Hey @Kale Vivi ,
Could you please share time expectations and what is the execution time right now?
It's still taking hours, and an exception is thrown with a DB I/O error. DB timeouts, I guess, before the process can finish.
I just want it done in 30 to 45 minutes if possible?
If you want to insert 50k records, it should take seconds to process.
Try giving the process method an @Async annotation; make sure it's called from another class.
And you should probably not use exists-by-id, because that generates a race condition; instead you could mark the column with @Column(unique = true).
And to avoid having six repositories, you could use one with a single table for all the objects if they share a lot of similar fields.
Also, from what I was testing, a 10k batch size is pretty good for performance.
Hey, sorry to respond late. I was able to resolve it by adding the Hibernate batch properties and the @Async annotation earlier. That seemed to work wonders for perf. Thanks all!
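For reference, the Hibernate batching properties mentioned here usually look like the following in a Spring Boot application.properties (the property names are the standard Hibernate settings; the values are just a starting point to tune for your workload):

```properties
# Batch JDBC statements instead of sending them one at a time.
spring.jpa.properties.hibernate.jdbc.batch_size=1000
# Group inserts/updates by entity type so batches aren't broken up.
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
# Allow batching of updates on entities that carry a @Version column.
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true
```

Note that JDBC batching silently does nothing for entities using GenerationType.IDENTITY; it works with the sequence generator discussed earlier in this thread.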
One thing I've been thinking about though are different approaches to handling batch updates of large data sizes.
There are things like @Async, CompletableFuture, Spring Batch, parallel streams, ... What are some considerations I should think about in understanding when to use what?
So if you don't want to block other processes in the application, you should use the @Async annotation for sure.
When I was doing file processing, it was taking about 0.6 seconds per 10k objects saved to the DB, but I was using the @Async annotation with a normal loop to generate the objects from the lines, and yes, also with Spring Batch.
You would need to test whether parallel streams will be faster in your case, but in mine there wasn't really any advantage, and in some tests they seemed to be slower than just a normal stream/loop.
As for CompletableFuture, you can use it when you need info from the thread about whether its job has finished.
So in the case of @Async vs CompletableFuture: if I need info back, I go with CF; otherwise, just use @Async?
@Async just launches the method asynchronously in a separate thread, and CompletableFuture launches it asynchronously while returning something.
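The distinction can be sketched in plain java.util.concurrent, without Spring: fire-and-forget submission (which is what an @Async void method behaves like) versus a CompletableFuture, where the caller can get the result back and react to completion.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Fire-and-forget vs. CompletableFuture, using a plain executor.
// processBatch is a hypothetical stand-in for the real persistence work.
public class AsyncVsFuture {

    static int processBatch(int size) {
        // Pretend this persists `size` records and reports how many succeeded.
        return size;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Fire-and-forget: the caller never learns the outcome.
        pool.execute(() -> processBatch(500));

        // CompletableFuture: the caller can join on the result, chain
        // thenApply/thenAccept callbacks, or handle exceptions.
        CompletableFuture<Integer> saved =
                CompletableFuture.supplyAsync(() -> processBatch(500), pool);
        System.out.println("saved " + saved.join() + " records"); // saved 500 records

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

In Spring the same split shows up as an @Async method returning void versus one returning CompletableFuture&lt;T&gt;, which Spring completes for you when the method finishes.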
You can read more about it here:
Baeldung: Guide To CompletableFuture - Quick and practical guide to Java 8's CompletableFuture.
Baeldung: How To Do @Async in Spring - How to enable and use @Async in Spring, from the very simple config and basic usage to the more complex executors and exception handling strategies.