asosoman
Modular
Created by asosoman on 2/29/2024 in #questions
Parallelize help (Running time increases a factor of 10 adding a "var += 1").
thanks a lot! 🙂 I have a lot to learn about new ways of doing things 😄
13 replies
@Michael K Sorry, I was out for a couple of days. I still don't understand why it behaves this way; I stripped out everything not needed. From my understanding, both functions should take a similar time, since they only increment one value. As written below, the int version takes 0.0009 s and the tuple version takes 0.68 s. If I increase parallelize to 2 workers (parallelize[read_chunk](2)), they take 0.0009 and 3.36... and with more workers it gets worse and worse 😦 I'm really lost, as I can't understand why updating the Tuple takes that much 😦 But I also don't see a way of using 8 workers and keeping track of things without a tuple, storing each worker's count at its own index. Could you try this simple code and give me your timed results?
from algorithm import parallelize
from time import now

fn read_file_tuple():
    var total: StaticTuple[8, Int64] = StaticTuple[8, Int64](0, 0, 0, 0, 0, 0, 0, 0)

    @parameter
    fn read_chunk(bound_idx: Int):
        var offset: Int = 100_000_000
        while offset != 0:
            total[bound_idx] += 1
            offset -= 1

    parallelize[read_chunk](1)
    print(total[0])
    return

fn read_file_int():
    var total: Int = 0

    @parameter
    fn read_chunk(bound_idx: Int):
        var offset: Int = 100_000_000
        while offset != 0:
            total += 1
            offset -= 1

    parallelize[read_chunk](1)
    print(total)
    return

fn main() raises:
    var start: Int = now()
    read_file_int()
    print("Total time int:", (now() - start) / 1_000_000_000)
    start = now()
    read_file_tuple()
    print("Total time tuple:", (now() - start) / 1_000_000_000)
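A plausible explanation (an assumption, not confirmed in this thread): in the int version the compiler can keep the captured counter in a register for the whole loop, while in the tuple version every iteration does a load and store through the captured tuple in memory. A minimal sketch of the usual workaround, written in plain C with pthreads rather than Mojo for illustration (the names WORKERS, ITERS, and run_workers are invented here), is to let each worker count in a local variable and write its shared slot exactly once at the end:

```c
#include <pthread.h>

/* Illustrative constants: they mirror the 8 workers and 100M-iteration
 * loop of the Mojo snippet, scaled down. */
#define WORKERS 4
#define ITERS 1000000L

static long totals[WORKERS];   /* one slot per worker, like the StaticTuple */

/* Each worker accumulates in a local (register-friendly) variable and
 * performs a single write to its shared slot, instead of a load/store
 * to shared memory on every iteration. */
static void *read_chunk(void *arg) {
    int idx = (int)(long)arg;
    long local = 0;
    for (long offset = ITERS; offset != 0; offset--)
        local++;
    totals[idx] = local;
    return NULL;
}

/* Spawn the workers, wait for them, and sum their slots. */
long run_workers(void) {
    pthread_t t[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&t[i], NULL, read_chunk, (void *)(long)i);
    long total = 0;
    for (int i = 0; i < WORKERS; i++) {
        pthread_join(t[i], NULL);
        total += totals[i];
    }
    return total;
}
```

The same shape should carry over to the Mojo closure: accumulate in a local var inside read_chunk and assign to total[bound_idx] once after the loop.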
I'm using @always_inline, and without the increment of total[bound_idx] it takes less than 1 ms. With total[bound_idx] = total[bound_idx] + 1 in the code, it reaches 3 s 😫
I've increased the tuple to 16 and added num_performance_cores(), which reduced it to 1.5 s, so it goes faster, but 1.5 s for incrementing a counter still seems too much...
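One likely contributor when several workers bump adjacent tuple slots is false sharing: the slots sit on the same cache line, so each core's write invalidates that line for every other core. Widening the tuple to 16 spreads the slots out only by accident. A hedged sketch of the standard fix, again in plain C (the struct name, run_padded, and the assumed 64-byte cache-line size are all illustrative), pads each per-worker counter to its own cache line:

```c
#include <pthread.h>

#define WORKERS 4
#define ITERS 1000000L

/* Pad each per-worker counter to a full 64-byte line (a common x86/ARM
 * cache-line size) so worker i's writes never invalidate worker j's line. */
struct padded { long count; char pad[64 - sizeof(long)]; };
static struct padded slots[WORKERS];

static void *bump(void *arg) {
    struct padded *slot = &slots[(int)(long)arg];
    for (long offset = ITERS; offset != 0; offset--)
        slot->count++;   /* still a memory write, but to a private line */
    return NULL;
}

/* Run the workers and sum the padded slots. */
long run_padded(void) {
    pthread_t t[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&t[i], NULL, bump, (void *)(long)i);
    long total = 0;
    for (int i = 0; i < WORKERS; i++) {
        pthread_join(t[i], NULL);
        total += slots[i].count;
    }
    return total;
}
```

Combining this padding with a per-worker local accumulator removes both the per-iteration memory traffic and the cross-core line bouncing.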
Modular
Created by asosoman on 2/27/2024 in #questions
SIMD Troubles ( SIMD[Bool,32] to Int32? and Getting a bit from every byte from SIMD)
Hi, thanks for your input. It's slower than what I had; I guess that using 32 x int32 doesn't help 😄 In case it works for someone, this is my fastest approach so far, using 32 x uint8. I hope they implement a way to get data out of a SIMD vector as a full int32; that would be helpful, and for sure faster than all the comparisons that have to happen with the select.
data = c.simd_load[32]()
# Lane indices 0..31
var TRUE_CASE = math.iota[DType.uint8, 32]()
# Sentinel 32 splatted into every lane (the original `.cast(32)` looks like a typo for the splat constructor)
var FALSE_CASE = SIMD[DType.uint8, 32](32)
var mask_nl = (data == NEW_LINE)
# Index of the first newline in the chunk, or 32 if there is none
var idx_nl = select[DType.uint8, simd_width_u8](mask_nl, TRUE_CASE, FALSE_CASE).reduce_min()
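For readers following the idiom, the select/iota/reduce_min combination can be modeled lane-by-lane in plain C (the function name first_newline is invented; the Mojo version does all 32 lanes in one instruction sequence): where a byte matches '\n' the lane keeps its index, elsewhere it holds the sentinel 32, and the minimum over lanes is the position of the first newline.

```c
#include <stdint.h>

/* Scalar model of select(mask, iota, 32).reduce_min():
 * returns the index of the first '\n' in a 32-byte chunk, or 32 if absent. */
static int first_newline(const uint8_t chunk[32]) {
    uint8_t min = 32;                           /* FALSE_CASE sentinel */
    for (int i = 0; i < 32; i++) {
        /* TRUE_CASE lane value is the lane index i (iota) */
        uint8_t lane = (chunk[i] == '\n') ? (uint8_t)i : 32;
        if (lane < min)
            min = lane;                         /* reduce_min */
    }
    return min;
}
```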
4 replies