Sequence allocation size question
Hi, I have a question on what my allocation size should be for my call to batch insert records into a table. I'm using sequence generator and I have a batch process that has to insert or update 50,000 records. I'm noticing the performance for this process isn't great and I read online I might need to adjust allocation size (currently have it as 1). So if I'm batch inserting in batches of 500 records, does that mean my allocation size should be 500?
33 Replies
⌛
This post has been reserved for your question.
Hey @Kale Vivi! Please use /close or the Close Post button above when your problem is solved. Please remember to follow the help guidelines. This post will be automatically marked as dormant after 300 minutes of inactivity.
TIP: Narrow down your issue to simple and precise questions to maximize the chance that others will reply in here.
💤
Post marked as dormant
This post has been inactive for over 300 minutes, thus, it has been archived.
If your question was not answered yet, feel free to re-open this post or create a new one.
In case your post is not getting any attention, you can try to use /help ping.
Warning: abusing this will result in moderative actions taken against you.
Hi @Kale Vivi ,
If you're batch inserting 50,000 records in chunks of 500 and you're using a sequence generator with allocationSize set to 1, that’s almost certainly the root cause of your performance issues. Setting allocationSize to 1 forces the database to fetch a new sequence value for every single record, which is incredibly inefficient for batch operations.
You should align your allocationSize with your batch size. In this case, setting allocationSize to 500 will allow the sequence generator to allocate 500 IDs in a single call. This drastically reduces the number of round trips to the database for sequence generation. Here’s how it works:
- When allocationSize is 500, the sequence generator pre-fetches 500 values in one call and assigns them to your records as needed.
- For a batch of 500 inserts, this means the database only hits the sequence once per batch, instead of 500 times.
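In JPA this is the allocationSize attribute on @SequenceGenerator (paired with GenerationType.SEQUENCE). To make the round-trip savings concrete, here is a plain-Java sketch of the pre-fetching idea (a simulation, not the real Hibernate implementation): the allocator asks the "database" for a whole block of ids at once and hands them out locally until the block runs dry.

```java
import java.util.concurrent.atomic.AtomicLong;

// Plain-Java simulation of what allocationSize does conceptually: instead of
// one sequence call per record, the allocator fetches a block of ids up front
// and hands them out locally until the block is exhausted.
public class BlockIdAllocator {
    private final int allocationSize;
    private final AtomicLong dbRoundTrips = new AtomicLong(); // simulated sequence calls
    private long next;    // next id to hand out
    private long ceiling; // exclusive upper bound of the current block

    public BlockIdAllocator(int allocationSize) {
        this.allocationSize = allocationSize;
    }

    // Simulates "SELECT nextval(seq)": one round trip returns a whole block.
    private void fetchBlockFromDatabase() {
        dbRoundTrips.incrementAndGet();
        next = ceiling;
        ceiling = next + allocationSize;
    }

    public synchronized long nextId() {
        if (next >= ceiling) {
            fetchBlockFromDatabase();
        }
        return next++;
    }

    public long roundTrips() {
        return dbRoundTrips.get();
    }

    public static void main(String[] args) {
        BlockIdAllocator one = new BlockIdAllocator(1);
        BlockIdAllocator pooled = new BlockIdAllocator(500);
        for (int i = 0; i < 50_000; i++) {
            one.nextId();
            pooled.nextId();
        }
        System.out.println("allocationSize=1   -> " + one.roundTrips() + " round trips");   // 50000
        System.out.println("allocationSize=500 -> " + pooled.roundTrips() + " round trips"); // 100
    }
}
```

For 50,000 ids, allocationSize=1 costs 50,000 sequence round trips while allocationSize=500 costs 100, which is the whole effect you are after. One caveat: the allocationSize on the entity must match the INCREMENT BY of the database sequence itself, or you can get duplicate-key errors.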
Thanks, I adjusted first to 500 and saw a good improvement. Trying 1000 and 1000 batch size now to see if it's even better
If you are finished with your post, please close it.
If you are not, please ignore this message.
Note that you will not be able to send further messages here after this post has been closed, but you will be able to create new posts.
My other concern is I'm also going one by one in the list to convert the domain object to entity before persistence and I'm worried about that also slowing performance
Use Java streams instead. You can use parallelStream there and it will speed up the processing a lot.
Ok, let me take a look there.
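As a sketch of that suggestion, the domain-to-entity mapping can be fanned out across cores with parallelStream. PersonDto and PersonEntity here are hypothetical stand-ins for the real classes, since they were not shared at this point:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the parallelStream suggestion: map domain objects to entities in
// parallel instead of a sequential for-each loop. PersonDto/PersonEntity are
// hypothetical stand-ins for the real domain and entity classes.
public class ParallelMapDemo {

    record PersonDto(String idNumber) {}
    record PersonEntity(String idNumber) {}

    static List<PersonEntity> toEntities(List<PersonDto> dtos) {
        return dtos.parallelStream()               // fan the mapping out across cores
                .map(d -> new PersonEntity(d.idNumber()))
                .collect(Collectors.toList());     // collect preserves input order
    }

    public static void main(String[] args) {
        List<PersonDto> dtos = List.of(new PersonDto("A1"), new PersonDto("A2"));
        System.out.println(toEntities(dtos).size() + " entities mapped"); // 2 entities mapped
    }
}
```

Worth noting: parallel streams only help when the per-element mapping is CPU-bound; doing repository/database calls inside a parallel stream is usually counterproductive, so keep the parallelism to the pure object mapping.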
Here's kind of what I have so far and I'm feeling like I'm definitely going to need to refactor this
@PersistenceContext
private final EntityManager entityManager;

private final RepoOne repoOne;
private final RepoTwo repoTwo;
private final RepoThree repoThree;
private final RepoFour repoFour;
private final RepoFive repoFive;
private final RepoSix repoSix;
private final RepoSeven repoSeven;
private final RepoEight repoEight;
private final RepoNine repoNine;

private static final int BATCH_SIZE = 1000;

@Override
@Transactional
public void saveAll(Set<Person> persons) {
    Iterable<List<Person>> peopleBatches = ListUtils.partition(persons.stream().toList(), BATCH_SIZE);
    List<Person> newPeople = new ArrayList<>();
    List<Person> existingPeople = new ArrayList<>();
    AtomicInteger indexPersonBatches = new AtomicInteger(1);
    for (List<Person> batch : peopleBatches) {
        for (Person person : batch) {
            if (existsBy(person.getIdNumber())) {
                existingPeople.add(person);
            } else {
                newPeople.add(person);
            }
        }
        updatePeople(existingPeople);
        savePeople(newPeople);
        existingPeople.clear();
        newPeople.clear();
    }
    entityManager.flush();
    entityManager.clear();
}

private void savePeople(List<Person> newPeople) {
    for (Person person : newPeople) {
        repoOne.save(One.fromPerson(person));
        repoTwo.save(Two.fromPerson(person));
        repoThree.save(Three.fromPerson(person));
        repoFour.save(Four.fromPerson(person));
        repoFive.saveAll(Five.fromPerson(person));
        repoSix.saveAll(Six.fromPerson(person));
        repoSeven.saveAll(Seven.fromPerson(person));
        repoEight.saveAll(Eight.fromPerson(person));
        repoNine.saveAll(Nine.fromPerson(person));
    }
}

public boolean existsBy(String number) {
    return repoOne.findByIdNumber(number) != null;
}
I think the issue is savePeople above, where I go one by one. In each repo's fromPerson, that's where I take the domain object and map it to an entity field by field. Should I just use parallel streams there? It's crazy that I have to break the one object above into smaller objects to persist across multiple tables too, which I also feel impacts performance.
OK, I see.
What I can suggest here:
1. Instead of checking existsBy one by one, do it for the whole batch of ids you already have from the split list. It reduces DB interaction from N calls to M calls, where N is the number of persons and M is the number of batches after splitting.
2. Improve the exists/not-exists logic by splitting into the two lists from the result you get after point 1.
3. With this code I suggest using saveAll instead of single save calls.
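The batched existence check in point 1 can be sketched like this. The repository method name is an assumption; with Spring Data JPA you would declare something like `List<One> findAllByIdNumberIn(Collection<String> idNumbers)`, which generates a single IN (...) query per batch:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of the suggestion: one IN-clause lookup per batch instead of one
// existsBy call per person, then a local split into existing vs. new.
public class BatchExistsDemo {

    // Stand-in for the database: id numbers that already exist.
    static Set<String> existingInDb = new HashSet<>(Set.of("A1", "A3"));
    static int dbCalls = 0;

    // Simulates the single IN-clause query for a whole batch.
    static Set<String> findExistingIdNumbers(Collection<String> idNumbers) {
        dbCalls++;
        return idNumbers.stream().filter(existingInDb::contains).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> batch = List.of("A1", "A2", "A3", "A4");

        // One DB call for the whole batch...
        Set<String> existing = findExistingIdNumbers(batch);

        // ...then split locally into the two lists.
        Map<Boolean, List<String>> split = batch.stream()
                .collect(Collectors.partitioningBy(existing::contains));
        List<String> existingPeople = split.get(true);
        List<String> newPeople = split.get(false);

        System.out.println("db calls: " + dbCalls);        // 1 per batch, not per person
        System.out.println("existing: " + existingPeople); // [A1, A3]
        System.out.println("new: " + newPeople);           // [A2, A4]
    }
}
```

For a 1000-person batch this turns 1000 existsBy queries into one query plus an in-memory partition.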
Let me know if this helped to speed up processing the way you need it.
Thanks a lot Puggy! I got distracted by writing some tests but will take a look at the above in a minute and give it a try. Thank you.
Just adding here in case the above wasn't clear for the savePeople().
I have a list of Person objects where Person has single-value fields, and nested under Person are lists of other objects (e.g. for each Person object there's a List&lt;Six&gt; six). That's why I had to use saveAll for the later repos but not the first several, where there's only one field/value for that person, versus the others where a person can have multiple records (e.g. Six here). Sorry for the names (just masked them, they're bad).
So you can move the repos that use saveAll outside of the for loop, and use my suggestion for them. For the other repos you can keep using your for loop 🙂
Not sure I follow the last part here.
Let me share the builders to confirm understanding:
One entity:
public static One fromPerson(Person person) {
    return One.builder()
            .idNumber(person.getIdNumber())
            .build();
}

Six entity:
public static List<Six> fromPerson(Person person) {
    return person.getSix().stream()
            .map(s -> Six.builder()
                    .idNumber(person.getIdNumber())
                    .someField(s.getSomeField())
                    .build())
            .collect(Collectors.toList());
}
Should I just do saveAll for all of them and drop the loop, or keep a loop for entities like Six here?
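One way to move saveAll outside the per-person loop, as suggested above: flatten every person's Six children into one list for the whole batch, then persist with a single repoSix.saveAll call. This is a sketch with simplified stand-in records, not the real domain classes:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: collect all Six entities for a whole batch first, then issue one
// saveAll call instead of one per person. Person/Six are simplified stand-ins.
public class CollectThenSaveAll {

    record Six(String idNumber, String someField) {}
    record Person(String idNumber, List<String> sixFields) {}

    // Mirrors the Six.fromPerson builder shown above.
    static List<Six> toSixEntities(Person p) {
        return p.sixFields().stream()
                .map(f -> new Six(p.idNumber(), f))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> batch = List.of(
                new Person("A1", List.of("x", "y")),
                new Person("A2", List.of("z")));

        // One flat list for the whole batch -> one repository call.
        List<Six> allSix = batch.stream()
                .flatMap(p -> toSixEntities(p).stream())
                .collect(Collectors.toList());

        // repoSix.saveAll(allSix); // single saveAll per batch
        System.out.println(allSix.size() + " Six rows in one saveAll"); // 3 Six rows in one saveAll
    }
}
```

The single-value entities (One through Four) can be handled the same way: map the batch to a List of each entity type, then one saveAll per repository per batch.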
Thanks again!
Thank you for sharing more code.
Suggestion
Thank you, giving it a try now. You're right the for each loop code was slow
Great! 🫶
Hey Puggy, while it is a dramatic improvement over the previous implementation, it is still slow (at least when running the job locally). Would other approaches maybe include leveraging @Async with a thread executor, or parallel streams? Does that seem like the right way forward?
Hey @Kale Vivi ,
Could you please share time expectations and what is the execution time right now?
It's still taking hours, and an exception is thrown with a DB I/O error. DB timeouts, I guess, before the process can finish.
I just want it done in 30 to 45 minutes if possible?
If you want to insert 50k records, it should take seconds to process.
Try giving the process method an @Async annotation; make sure it's called from another class.
And you should probably not use exists-by-id, because that generates a race condition; instead you could mark the column with @Column(unique = true).
And to avoid having six repositories, you could use one with a single table for all the objects if they share a lot of similar fields.
Also, from what I was testing, a 10k batch size is pretty good for performance.
Hey, sorry to respond late. I was able to resolve it by adding the Hibernate batch properties and the @Async annotation earlier. That seemed to work wonders for perf. Thanks all!
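For reference, the Hibernate batching properties mentioned here usually look like the following in a Spring Boot application.properties (the property names are the standard Hibernate settings; the values are just a starting point to tune for your workload):

```properties
# Batch JDBC statements instead of sending them one at a time.
spring.jpa.properties.hibernate.jdbc.batch_size=1000
# Group inserts/updates by entity type so batches aren't broken up.
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
# Allow batching of updates on entities that carry a @Version column.
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true
```

Note that JDBC batching silently does nothing for entities using GenerationType.IDENTITY; it works with the sequence generator discussed earlier in this thread.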
One thing I've been thinking about though are different approaches to handling batch updates of large data sizes.
There are things like @Async, CompletableFuture, Spring Batch, parallel streams, ... What are some considerations I should think about in understanding when to use what?
So if you don't want to block other processes in the application, you should use the @Async annotation for sure.
When I was doing file processing, it was taking about 0.6 seconds per 10k objects saved to the DB, but I was using the @Async annotation with a normal loop to generate the objects from the lines, and yes, also with Spring Batch.
You would need to test whether parallel streams will be faster in your case, but in mine there wasn't really any advantage, and in some tests they seemed to be slower than just a normal stream/loop.
As for CompletableFuture, you can use it when you need info from the thread about whether its job has finished.
So in the case of @Async vs CompletableFuture: if I need info back, I go with CF; otherwise, just use @Async?
@Async just launches the method asynchronously in a separate thread, and CompletableFuture launches it asynchronously while returning something.
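The distinction can be sketched in plain java.util.concurrent, without Spring: fire-and-forget submission (which is what an @Async void method behaves like) versus a CompletableFuture, where the caller can get the result back and react to completion.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Fire-and-forget vs. CompletableFuture, using a plain executor.
// processBatch is a hypothetical stand-in for the real persistence work.
public class AsyncVsFuture {

    static int processBatch(int size) {
        // Pretend this persists `size` records and reports how many succeeded.
        return size;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Fire-and-forget: the caller never learns the outcome.
        pool.execute(() -> processBatch(500));

        // CompletableFuture: the caller can join on the result, chain
        // thenApply/thenAccept callbacks, or handle exceptions.
        CompletableFuture<Integer> saved =
                CompletableFuture.supplyAsync(() -> processBatch(500), pool);
        System.out.println("saved " + saved.join() + " records"); // saved 500 records

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

In Spring the same split shows up as an @Async method returning void versus one returning CompletableFuture&lt;T&gt;, which Spring completes for you when the method finishes.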
You can read more about it here:
Baeldung: Guide To CompletableFuture - Quick and practical guide to Java 8's CompletableFuture.
Baeldung: How To Do @Async in Spring - How to enable and use @Async in Spring, from the very simple config and basic usage to the more complex executors and exception handling strategies.