[PATCH v4 00/11] simplify block layer based on immutable biovecs


Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Tue, May 26, 2015 at 9:04 AM, Mike Snitzer <[hidden email]> wrote:

> On Tue, May 26 2015 at 11:02am -0400,
> Ming Lin <[hidden email]> wrote:
>
>> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <[hidden email]> wrote:
>> > On Fri, May 22 2015 at  2:18pm -0400,
>> > Ming Lin <[hidden email]> wrote:
>> >
>> >> From: Kent Overstreet <[hidden email]>
>> >>
>> >> The way the block layer is currently written, it goes to great lengths
>> >> to avoid having to split bios; upper layer code (such as bio_add_page())
>> >> checks what the underlying device can handle and tries to always create
>> >> bios that don't need to be split.
>> >>
>> >> But this approach becomes unwieldy and eventually breaks down with
>> >> stacked devices and devices with dynamic limits, and it adds a lot of
>> >> complexity. If the block layer could split bios as needed, we could
>> >> eliminate a lot of complexity elsewhere - particularly in stacked
>> >> drivers. Code that creates bios can then create whatever size bios are
>> >> convenient, and more importantly stacked drivers don't have to deal with
>> >> both their own bio size limitations and the limitations of the
>> >> (potentially multiple) devices underneath them.  In the future this will
>> >> let us delete merge_bvec_fn and a bunch of other code.
>> >
>> > This series doesn't take any steps to train upper layers
>> > (e.g. filesystems) to size their bios larger (which is defined as
>> > "whatever size bios are convenient" above).
>> >
>> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers
>> > (and direct IO) to build up optimally sized bios.  Without a replacement
>> > (that I can see anyway) how is this patchset making forward progress
>> > (getting Acks, etc)!?
>> >
>> > I like the idea of reduced complexity associated with these late bio
>> > splitting changes I'm just not seeing how this is ready given there are
>> > no upper layer changes that speak to building larger bios..
>> >
>> > What am I missing?
>>
>> See: [PATCH v4 02/11] block: simplify bio_add_page()
>> https://lkml.org/lkml/2015/5/22/754
>>
>> Now bio_add_page() can build larger bios.
>> And blk_queue_split() can split the bios in ->make_request() if needed.
>
> That'll result in quite large bios and always needing splitting.
>
> As Alasdair asked: please provide some performance data that justifies
> these changes.  E.g use a setup like: XFS on a DM striped target.  We
> can iterate on more complex setups once we have established some basic
> tests.

Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
Does it make sense?

                           4.1-rc4        4.1-rc4-patched
                           -------        ---------------
                           (KB/s)         (KB/s)
sequential-read-buf:       150822         151371
sequential-read-direct:    408938         421940
random-read-buf:           3404.9         3389.1
random-read-direct:        4859.8         4843.5
sequential-write-buf:      333455         335776
sequential-write-direct:   44739          43194
random-write-buf:          7272.1         7209.6
random-write-direct:       4333.9         4330.7



root@minggr:~/tmp/test# cat t.job
[global]
size=1G
directory=/mnt/
numjobs=8
group_reporting
runtime=300
time_based
bs=8k
ioengine=libaio
iodepth=64

[sequential-read-buf]
rw=read

[sequential-read-direct]
rw=read
direct=1

[random-read-buf]
rw=randread

[random-read-direct]
rw=randread
direct=1

[sequential-write-buf]
rw=write

[sequential-write-direct]
rw=write
direct=1

[random-write-buf]
rw=randwrite

[random-write-direct]
rw=randwrite
direct=1


root@minggr:~/tmp/test# cat run.sh
#!/bin/bash

jobs="sequential-read-buf sequential-read-direct random-read-buf
random-read-direct"
jobs="$jobs sequential-write-buf sequential-write-direct
random-write-buf random-write-direct"

#each partition is 100G
pvcreate /dev/sdb3 /dev/nvme0n1p1 /dev/sdc6
vgcreate striped_vol_group /dev/sdb3 /dev/nvme0n1p1 /dev/sdc6
lvcreate -i3 -I4 -L250G -nstriped_logical_volume striped_vol_group

for job in $jobs ; do
        umount /mnt > /dev/null 2>&1
        mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
        mount /dev/striped_vol_group/striped_logical_volume /mnt

        fio --output=${job}.log --section=${job} t.job
done

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Alasdair G Kergon
On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
> Does it make sense?

To stripe across devices with different characteristics?

Some suggestions.

Prepare 3 kernels.
  O - Old kernel.
  M - Old kernel with merge_bvec_fn disabled.
  N - New kernel.

You're trying to search for counter-examples to the hypothesis that
"Kernel N always outperforms Kernel O".  Then if you find any, trying
to show either that the performance impediment is small enough that
it doesn't matter or that the cases are sufficiently rare or obscure
that they may be ignored because of the greater benefits of N in much more
common cases.

(1) You're looking to set up configurations where kernel O performs noticeably
better than M.  Then you're comparing the performance of O and N in those
situations.

(2) You're looking at other sensible configurations where O and M have
similar performance, and comparing that with the performance of N.

In each case you find, you expect to be able to vary some parameter (such as
stripe size) to show a progression of the effect.

When running tests you've to take care the system is reset into the same
initial state before each test, so you'll normally also try to include some
baseline test between tests that should give the same results each time
and also take the average of a number of runs (while also reporting some
measure of the variation within each set to make sure that remains low,
typically a low single digit percentage).

Since we're mostly concerned about splitting, you'll want to monitor
iostat to see if that gives you enough information to home in on
suitable configurations for (1).  Failing that, you might need to
instrument the kernels to tell you the sizes of the bios being
created and the amount of splitting actually happening.
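
The counters do not need to be anything elaborate.  Something along the
lines of the sketch below would do (purely illustrative, not from the
posted series; the debugfs directory and counter names are invented):
two globals, bumped from the submission and split paths, exported
read-only through debugfs.

/*
 * Illustrative sketch only: rough counters for how many bios come in
 * and how many get split late, readable under
 * /sys/kernel/debug/bio_split_stats/.  Unsynchronized u64 increments
 * are good enough for this kind of comparison.
 */
#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/types.h>

static u64 nr_bios_seen;   /* nr_bios_seen++ once per bio in generic_make_request() */
static u64 nr_bio_splits;  /* nr_bio_splits++ each time a bio is actually split */

static int __init bio_split_stats_init(void)
{
        struct dentry *dir = debugfs_create_dir("bio_split_stats", NULL);

        debugfs_create_u64("bios_seen", 0444, dir, &nr_bios_seen);
        debugfs_create_u64("bio_splits", 0444, dir, &nr_bio_splits);
        return 0;
}
late_initcall(bio_split_stats_init);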

Striping was mentioned because it forces splitting.  So show the progression
from tiny stripes to huge stripes.  (Ensure all the devices providing the
stripes have identical characteristics, but you can test with slow and
fast underlying devices.)

You may also want to test systems with a restricted amount of available
memory to show how the splitting via worker thread performs.  (Again,
instrument to prove the extent to which the new code is being exercised.)

Alasdair


Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <[hidden email]> wrote:
> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
>> Does it make sense?
>
> To stripe across devices with different characteristics?

I intended to test it on a 2-socket server with 10 NVMe drives.
But that server has been busy running other tests.

So I had to run the test on a PC which happens to have 2 SSDs + 1 HDD.

>
> Some suggestions.

Thanks for the great detail.
I'm reading through it to understand.

>
> Prepare 3 kernels.
>   O - Old kernel.
>   M - Old kernel with merge_bvec_fn disabled.
>   N - New kernel.
>
> You're trying to search for counter-examples to the hypothesis that
> "Kernel N always outperforms Kernel O".  Then if you find any, trying
> to show either that the performance impediment is small enough that
> it doesn't matter or that the cases are sufficiently rare or obscure
> that they may be ignored because of the greater benefits of N in much more
> common cases.
>
> (1) You're looking to set up configurations where kernel O performs noticeably
> better than M.  Then you're comparing the performance of O and N in those
> situations.
>
> (2) You're looking at other sensible configurations where O and M have
> similar performance, and comparing that with the performance of N.
>
> In each case you find, you expect to be able to vary some parameter (such as
> stripe size) to show a progression of the effect.
>
> When running tests you've to take care the system is reset into the same
> initial state before each test, so you'll normally also try to include some
> baseline test between tests that should give the same results each time
> and also take the average of a number of runs (while also reporting some
> measure of the variation within each set to make sure that remains low,
> typically a low single digit percentage).
>
> Since we're mostly concerned about splitting, you'll want to monitor
> iostat to see if that gives you enough information to home in on
> suitable configurations for (1).  Failing that, you might need to
> instrument the kernels to tell you the sizes of the bios being
> created and the amount of splitting actually happening.
>
> Striping was mentioned because it forces splitting.  So show the progression
> from tiny stripes to huge stripes.  (Ensure all the devices providing the
> stripes have identical characteristics, but you can test with slow and
> fast underlying devices.)
>
> You may also want to test systems with a restricted amount of available
> memory to show how the splitting via worker thread performs.  (Again,
> instrument to prove the extent to which the new code is being exercised.)
>
> Alasdair
>

Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

Ming Lin-2
On Mon, May 25, 2015 at 6:51 AM, Christoph Hellwig <[hidden email]> wrote:

> On Sun, May 24, 2015 at 12:37:32AM -0700, Ming Lin wrote:
>> > Except for that these changes looks good, and the previous version
>> > passed my tests fine, so with some benchmarks you'ĺl have my ACK.
>>
>> I'll test it on a 2 sockets server with 10 NVMe drives on Monday.
>> I'm going to run fio tests:
>> 1. raw NVMe drives direct IO read/write
>> 2. ext4 read/write
>>
>> Let me know if you have other tests that I can run.
>
> That sounds like a good start, but the most important tests would be
> those that will cause a lot of splits with the new code.
>
> E.g. some old ATA devices using the piix driver, some crappy USB
> device that just allows 64 sector transfers.  Or maybe it's better
> to just simulate the case by dropping max_sectors to ease some pain :)

That 2-socket server has been busy running other things.

I did a quick test with 1 NVMe drive to simulate 64-sector transfers:
echo 32 > /sys/block/nvme0n1/queue/max_sectors_kb

Then I ran a fio 1M block size read, which should cause a lot of splits.
As expected, no obvious difference with and without the patches.

I'll run more tests once the server is free, probably next week.

Or would you think the performance data on a PC with 1 NVMe drive is OK?

>
> The other cases is DM/MD stripes or RAID5/6 with small stripe sizes.

Will also run MD stripes test.

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <[hidden email]> wrote:

> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
>> Does it make sense?
>
> To stripe across devices with different characteristics?
>
> Some suggestions.
>
> Prepare 3 kernels.
>   O - Old kernel.
>   M - Old kernel with merge_bvec_fn disabled.

How to disable it?
Maybe just hack it as below?

/* Never register the callback: q->merge_bvec_fn stays NULL and
 * bio_add_page() then falls back to the queue limits alone. */
void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
{
        //q->merge_bvec_fn = mbfn;
}

>   N - New kernel.

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Mike Snitzer
On Fri, May 29 2015 at  3:05P -0400,
Ming Lin <[hidden email]> wrote:

> On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <[hidden email]> wrote:
> > On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> >> Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
> >> Does it make sense?
> >
> > To stripe across devices with different characteristics?
> >
> > Some suggestions.
> >
> > Prepare 3 kernels.
> >   O - Old kernel.
> >   M - Old kernel with merge_bvec_fn disabled.
>
> How to disable it?
> Maybe just hack it as below?
>
> void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
> {
>         //q->merge_bvec_fn = mbfn;
> }

Right, there isn't an existing way to disable it, you'd need a hack like
that.

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:

> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
> > Does it make sense?
>
> To stripe across devices with different characteristics?
>
> Some suggestions.
>
> Prepare 3 kernels.
>   O - Old kernel.
>   M - Old kernel with merge_bvec_fn disabled.
>   N - New kernel.
>
> You're trying to search for counter-examples to the hypothesis that
> "Kernel N always outperforms Kernel O".  Then if you find any, trying
> to show either that the performance impediment is small enough that
> it doesn't matter or that the cases are sufficiently rare or obscure
> that they may be ignored because of the greater benefits of N in much more
> common cases.
>
> (1) You're looking to set up configurations where kernel O performs noticeably
> better than M.  Then you're comparing the performance of O and N in those
> situations.
>
> (2) You're looking at other sensible configurations where O and M have
> similar performance, and comparing that with the performance of N.

I didn't find case (1).

But the important thing for this series is to simplify the block layer
based on immutable biovecs. I don't expect a performance improvement.

Here are the change statistics.

"68 files changed, 336 insertions(+), 1331 deletions(-)"

I ran the 3 test cases below to make sure they didn't bring any regressions.
Test environment: 2 NVMe drives on a 2-socket server.
Each case ran for 30 minutes.

1) btrfs raid0

mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
mount /dev/nvme0n1 /mnt

Then run 8K read.

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=4
rw=read

[job1]
bs=8K
directory=/mnt
size=1G

2) ext4 on MD raid5

mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt

fio script same as btrfs test

3) xfs on DM striped target

pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
mount /dev/striped_vol_group/striped_logical_volume /mnt

fio script same as btrfs test

------

Results:

        4.1-rc4         4.1-rc4-patched
btrfs   1818.6MB/s      1874.1MB/s
ext4    717307KB/s      714030KB/s
xfs     1396.6MB/s      1398.6MB/s



Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

Ming Lin-2
On Sat, May 23, 2015 at 7:15 AM, Christoph Hellwig <[hidden email]> wrote:

> On Fri, May 22, 2015 at 11:18:32AM -0700, Ming Lin wrote:
>> This will bring not only performance improvements, but also a great amount
>> of reduction in code complexity all over the block layer. Performance gain
>> is possible due to the fact that bio_add_page() does not have to check
>> unnecesary conditions such as queue limits or if biovecs are mergeable.
>> Those will be delegated to the driver level. Kent already said that he
>> actually benchmarked the impact of this with fio on a micron p320h, which
>> showed definitely a positive impact.
>
> We'll need some actual numbers.  I actually like these changes a lot
> and don't even need a performance justification for this fundamentally
> better model, but I'd really prefer to avoid any large scale regressions.
> I don't really expect them, but for code this fundamental we'll just
> need some benchmarks.
>
> Except for that these changes looks good, and the previous version
> passed my tests fine, so with some benchmarks you'ĺl have my ACK.

Can I have your ACK with these numbers?
https://lkml.org/lkml/2015/6/1/38

>
> I'd love to see this go into 4.2, but for that we'll need Jens
> approval and a merge into for-next very soon.

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Sun, May 31, 2015 at 11:02 PM, Ming Lin <[hidden email]> wrote:

> On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
>> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
>> > Does it make sense?
>>
>> To stripe across devices with different characteristics?
>>
>> Some suggestions.
>>
>> Prepare 3 kernels.
>>   O - Old kernel.
>>   M - Old kernel with merge_bvec_fn disabled.
>>   N - New kernel.
>>
>> You're trying to search for counter-examples to the hypothesis that
>> "Kernel N always outperforms Kernel O".  Then if you find any, trying
>> to show either that the performance impediment is small enough that
>> it doesn't matter or that the cases are sufficiently rare or obscure
>> that they may be ignored because of the greater benefits of N in much more
>> common cases.
>>
>> (1) You're looking to set up configurations where kernel O performs noticeably
>> better than M.  Then you're comparing the performance of O and N in those
>> situations.
>>
>> (2) You're looking at other sensible configurations where O and M have
>> similar performance, and comparing that with the performance of N.
>
> I didn't find case (1).
>
> But the important thing for this series is to simplify block layer
> based on immutable biovecs. I don't expect performance improvement.
>
> Here is the changes statistics.
>
> "68 files changed, 336 insertions(+), 1331 deletions(-)"
>
> I run below 3 test cases to make sure it didn't bring any regressions.
> Test environment: 2 NVMe drives on 2 sockets server.
> Each case run for 30 minutes.
>
> 2) btrfs radi0
>
> mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
> mount /dev/nvme0n1 /mnt
>
> Then run 8K read.
>
> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=4
> rw=read
>
> [job1]
> bs=8K
> directory=/mnt
> size=1G
>
> 2) ext4 on MD raid5
>
> mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
> mkfs.ext4 /dev/md0
> mount /dev/md0 /mnt
>
> fio script same as btrfs test
>
> 3) xfs on DM stripped target
>
> pvcreate /dev/nvme0n1 /dev/nvme1n1
> vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
> lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
> mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
> mount /dev/striped_vol_group/striped_logical_volume /mnt
>
> fio script same as btrfs test
>
> ------
>
> Results:
>
>         4.1-rc4         4.1-rc4-patched
> btrfs   1818.6MB/s      1874.1MB/s
> ext4    717307KB/s      714030KB/s
> xfs     1396.6MB/s      1398.6MB/s

Hi Alasdair & Mike,

Are these the numbers you were looking for?
I'd like to address your concerns so we can move forward.

Thanks.

Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

Christoph Hellwig-2
On Sun, May 31, 2015 at 11:15:09PM -0700, Ming Lin wrote:
> > Except for that these changes looks good, and the previous version
> > passed my tests fine, so with some benchmarks you'ĺl have my ACK.
>
> Can I have your ACK with these numbers?
> https://lkml.org/lkml/2015/6/1/38

Looks good to me.  Still like to see consensus from the DM folks.

Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

Jeff Moyer
Christoph Hellwig <[hidden email]> writes:

> On Sun, May 31, 2015 at 11:15:09PM -0700, Ming Lin wrote:
>> > Except for that these changes looks good, and the previous version
>> > passed my tests fine, so with some benchmarks you'ĺl have my ACK.
>>
>> Can I have your ACK with these numbers?
>> https://lkml.org/lkml/2015/6/1/38
>
> Looks good to me.  Still like to see consensus from the DM folks.

Ming, did you look into the increased stack usage reported by Huang
Ying?

-Jeff

Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

Ming Lin-2
On Wed, Jun 3, 2015 at 6:28 AM, Jeff Moyer <[hidden email]> wrote:

> Christoph Hellwig <[hidden email]> writes:
>
>> On Sun, May 31, 2015 at 11:15:09PM -0700, Ming Lin wrote:
>>> > Except for that these changes looks good, and the previous version
>>> > passed my tests fine, so with some benchmarks you'ĺl have my ACK.
>>>
>>> Can I have your ACK with these numbers?
>>> https://lkml.org/lkml/2015/6/1/38
>>
>> Looks good to me.  Still like to see consensus from the DM folks.
>
> Ming, did you look into the increased stack usage reported by Huang
> Ying?

Yes, I'll reply to Ying's email.

>
> -Jeff

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Mike Snitzer
On Tue, Jun 02 2015 at  4:59pm -0400,
Ming Lin <[hidden email]> wrote:

> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <[hidden email]> wrote:
> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> >> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
> >> > Does it make sense?
> >>
> >> To stripe across devices with different characteristics?
> >>
> >> Some suggestions.
> >>
> >> Prepare 3 kernels.
> >>   O - Old kernel.
> >>   M - Old kernel with merge_bvec_fn disabled.
> >>   N - New kernel.
> >>
> >> You're trying to search for counter-examples to the hypothesis that
> >> "Kernel N always outperforms Kernel O".  Then if you find any, trying
> >> to show either that the performance impediment is small enough that
> >> it doesn't matter or that the cases are sufficiently rare or obscure
> >> that they may be ignored because of the greater benefits of N in much more
> >> common cases.
> >>
> >> (1) You're looking to set up configurations where kernel O performs noticeably
> >> better than M.  Then you're comparing the performance of O and N in those
> >> situations.
> >>
> >> (2) You're looking at other sensible configurations where O and M have
> >> similar performance, and comparing that with the performance of N.
> >
> > I didn't find case (1).
> >
> > But the important thing for this series is to simplify block layer
> > based on immutable biovecs. I don't expect performance improvement.

No, simplifying isn't the important thing.  Any change to remove the
merge_bvec callbacks needs to not introduce performance regressions on
enterprise systems with large RAID arrays, etc.

It is fine if there isn't a performance improvement but I really don't
think the limited testing you've done on a relatively small storage
configuration has come even close to showing these changes don't
introduce performance regressions.

> > Here is the changes statistics.
> >
> > "68 files changed, 336 insertions(+), 1331 deletions(-)"
> >
> > I run below 3 test cases to make sure it didn't bring any regressions.
> > Test environment: 2 NVMe drives on 2 sockets server.
> > Each case run for 30 minutes.
> >
> > 2) btrfs radi0
> >
> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
> > mount /dev/nvme0n1 /mnt
> >
> > Then run 8K read.
> >
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1800
> > time_based
> > group_reporting
> > numjobs=4
> > rw=read
> >
> > [job1]
> > bs=8K
> > directory=/mnt
> > size=1G
> >
> > 2) ext4 on MD raid5
> >
> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
> > mkfs.ext4 /dev/md0
> > mount /dev/md0 /mnt
> >
> > fio script same as btrfs test
> >
> > 3) xfs on DM stripped target
> >
> > pvcreate /dev/nvme0n1 /dev/nvme1n1
> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
> > mount /dev/striped_vol_group/striped_logical_volume /mnt
> >
> > fio script same as btrfs test
> >
> > ------
> >
> > Results:
> >
> >         4.1-rc4         4.1-rc4-patched
> > btrfs   1818.6MB/s      1874.1MB/s
> > ext4    717307KB/s      714030KB/s
> > xfs     1396.6MB/s      1398.6MB/s
>
> Hi Alasdair & Mike,
>
> Would you like these numbers?
> I'd like to address your concerns to move forward.

I really don't see that these NVMe results prove much.

We need to test on large HW raid setups like a Netapp filer (or even
local SAS drives connected via some SAS controller).  Like a 8+2 drive
RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
devices is also useful.  It is larger RAID setups that will be more
sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
size boundaries.

There are tradeoffs between creating a really large bio and creating a
properly sized bio from the start.  And yes, to one of neilb's original
points, limits do change and we suck at restacking limits, so what was
once properly sized may no longer be; but that is a relatively rare
occurrence.  Late splitting does do away with the limits stacking
disconnect.  And in general I like the idea of removing all the
merge_bvec code.  I just don't think I can confidently Ack such a
wholesale switch at this point with such limited performance analysis.
If we (the DM/lvm team at Red Hat) are being painted into a corner of
having to provide our own testing that meets our definition of
"thorough" then we'll need time to carry out those tests.  But I'd hate
to hold up everyone because DM is not in agreement on this change...

So taking a step back, why can't we introduce late bio splitting in a
phased approach?

1: introduce late bio splitting to block core BUT still keep established
   merge_bvec infrastructure
2: establish a way for upper layers to skip merge_bvec if they'd like to
   do so (e.g. block-core exposes a 'use_late_bio_splitting' or
   something for userspace or upper layers to set, can also have a
   Kconfig that enables this feature by default; see the sketch below)
3: we gain confidence in late bio-splitting and then carry on with the
   removal of merge_bvec et al (could be incrementally done on a
   per-driver basis, e.g. DM, MD, btrfs, etc, etc).
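
To make step 2 concrete, here is a loose sketch of the kind of opt-in I
mean (the flag name, bit number and driver function are invented for
illustration, and it assumes the series' blk_queue_split()/q->bio_split
are present; nothing below is from the actual patches):

/* include/linux/blkdev.h (sketch) */
#define QUEUE_FLAG_LATE_SPLIT   23      /* any unused bit: rely on late bio splitting */

static inline bool blk_queue_late_split(struct request_queue *q)
{
        return test_bit(QUEUE_FLAG_LATE_SPLIT, &q->queue_flags);
}

/* in a driver's make_request_fn (sketch) */
static void my_make_request(struct request_queue *q, struct bio *bio)
{
        if (blk_queue_late_split(q))
                blk_queue_split(q, &bio, q->bio_split); /* new late-split path */
        /*
         * else: bios were already sized up front via bio_add_page() and
         * merge_bvec_fn, exactly as today.
         */

        /* ... normal submission path continues here ... */
}

That would let individual drivers (or a sysfs knob flipping the flag)
move over one at a time while the merge_bvec code stays in place.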

Mike

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <[hidden email]> wrote:

> On Tue, Jun 02 2015 at  4:59pm -0400,
> Ming Lin <[hidden email]> wrote:
>
>> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <[hidden email]> wrote:
>> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
>> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> >> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
>> >> > Does it make sense?
>> >>
>> >> To stripe across devices with different characteristics?
>> >>
>> >> Some suggestions.
>> >>
>> >> Prepare 3 kernels.
>> >>   O - Old kernel.
>> >>   M - Old kernel with merge_bvec_fn disabled.
>> >>   N - New kernel.
>> >>
>> >> You're trying to search for counter-examples to the hypothesis that
>> >> "Kernel N always outperforms Kernel O".  Then if you find any, trying
>> >> to show either that the performance impediment is small enough that
>> >> it doesn't matter or that the cases are sufficiently rare or obscure
>> >> that they may be ignored because of the greater benefits of N in much more
>> >> common cases.
>> >>
>> >> (1) You're looking to set up configurations where kernel O performs noticeably
>> >> better than M.  Then you're comparing the performance of O and N in those
>> >> situations.
>> >>
>> >> (2) You're looking at other sensible configurations where O and M have
>> >> similar performance, and comparing that with the performance of N.
>> >
>> > I didn't find case (1).
>> >
>> > But the important thing for this series is to simplify block layer
>> > based on immutable biovecs. I don't expect performance improvement.
>
> No simplifying isn't the important thing.  Any change to remove the
> merge_bvec callbacks needs to not introduce performance regressions on
> enterprise systems with large RAID arrays, etc.
>
> It is fine if there isn't a performance improvement but I really don't
> think the limited testing you've done on a relatively small storage
> configuration has come even close to showing these changes don't
> introduce performance regressions.
>
>> > Here is the changes statistics.
>> >
>> > "68 files changed, 336 insertions(+), 1331 deletions(-)"
>> >
>> > I run below 3 test cases to make sure it didn't bring any regressions.
>> > Test environment: 2 NVMe drives on 2 sockets server.
>> > Each case run for 30 minutes.
>> >
>> > 2) btrfs radi0
>> >
>> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
>> > mount /dev/nvme0n1 /mnt
>> >
>> > Then run 8K read.
>> >
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=4
>> > rw=read
>> >
>> > [job1]
>> > bs=8K
>> > directory=/mnt
>> > size=1G
>> >
>> > 2) ext4 on MD raid5
>> >
>> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
>> > mkfs.ext4 /dev/md0
>> > mount /dev/md0 /mnt
>> >
>> > fio script same as btrfs test
>> >
>> > 3) xfs on DM stripped target
>> >
>> > pvcreate /dev/nvme0n1 /dev/nvme1n1
>> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
>> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
>> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
>> > mount /dev/striped_vol_group/striped_logical_volume /mnt
>> >
>> > fio script same as btrfs test
>> >
>> > ------
>> >
>> > Results:
>> >
>> >         4.1-rc4         4.1-rc4-patched
>> > btrfs   1818.6MB/s      1874.1MB/s
>> > ext4    717307KB/s      714030KB/s
>> > xfs     1396.6MB/s      1398.6MB/s
>>
>> Hi Alasdair & Mike,
>>
>> Would you like these numbers?
>> I'd like to address your concerns to move forward.
>
> I really don't see that these NVMe results prove much.
>
> We need to test on large HW raid setups like a Netapp filer (or even
> local SAS drives connected via some SAS controller).  Like a 8+2 drive
> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> devices is also useful.  It is larger RAID setups that will be more
> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> size boundaries.

I'll test it on a large HW RAID setup.

Here is a HW RAID5 setup with 19 278G HDDs on a Dell R730xd (2 sockets/48
logical cpus/264G mem).
http://minggr.net/pub/20150604/hw_raid5.jpg

The stripe size is 64K.

I'm going to test ext4/btrfs/xfs on it,
with "bs" set to 1216k (64K * 19 = 1216k),
running 48 jobs.

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
rw=read

[job1]
bs=1216K
directory=/mnt
size=1G

Or do you have other suggestions of what tests I should run?

Thanks.

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Mike Snitzer
On Thu, Jun 04 2015 at  6:21pm -0400,
Ming Lin <[hidden email]> wrote:

> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <[hidden email]> wrote:
> >
> > We need to test on large HW raid setups like a Netapp filer (or even
> > local SAS drives connected via some SAS controller).  Like a 8+2 drive
> > RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> > devices is also useful.  It is larger RAID setups that will be more
> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> > size boundaries.
>
> I'll test it on large HW raid setup.
>
> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48
> logical cpus/264G mem).
> http://minggr.net/pub/20150604/hw_raid5.jpg
>
> The stripe size is 64K.
>
> I'm going to test ext4/btrfs/xfs on it.
> "bs" set to 1216k(64K * 19 = 1216k)
> and run 48 jobs.

Definitely an odd blocksize (though 1280K full stripe is pretty common
for 10+2 HW RAID6 w/ 128K chunk size).

> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=48
> rw=read
>
> [job1]
> bs=1216K
> directory=/mnt
> size=1G

How does time_based relate to size=1G?  It'll rewrite the same 1 gig
file repeatedly?

> Or do you have other suggestions of what tests I should run?

You're welcome to run this job but I'll also check with others here to
see what fio jobs we used in the recent past when assessing performance
of the dm-crypt parallelization changes.

Also, a lot of care needs to be taken to eliminate jitter in the system
while the test is running.  We got a lot of good insight from Bart Van
Assche on that and put it to practice.  I'll see if we can (re)summarize
that too.

Mike

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Thu, Jun 4, 2015 at 5:06 PM, Mike Snitzer <[hidden email]> wrote:

> On Thu, Jun 04 2015 at  6:21pm -0400,
> Ming Lin <[hidden email]> wrote:
>
>> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <[hidden email]> wrote:
>> >
>> > We need to test on large HW raid setups like a Netapp filer (or even
>> > local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> > RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> > devices is also useful.  It is larger RAID setups that will be more
>> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> > size boundaries.
>>
>> I'll test it on large HW raid setup.
>>
>> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48
>> logical cpus/264G mem).
>> http://minggr.net/pub/20150604/hw_raid5.jpg
>>
>> The stripe size is 64K.
>>
>> I'm going to test ext4/btrfs/xfs on it.
>> "bs" set to 1216k(64K * 19 = 1216k)
>> and run 48 jobs.
>
> Definitely an odd blocksize (though 1280K full stripe is pretty common
> for 10+2 HW RAID6 w/ 128K chunk size).

I can change it to 10 HDDs HW RAID6 w/ 128K chunk size, then use bs=1280K

>
>> [global]
>> ioengine=libaio
>> iodepth=64
>> direct=1
>> runtime=1800
>> time_based
>> group_reporting
>> numjobs=48
>> rw=read
>>
>> [job1]
>> bs=1216K
>> directory=/mnt
>> size=1G
>
> How does time_based relate to size=1G?  It'll rewrite the same 1 gig
> file repeatedly?

The above job file is for reads.
For writes, yes, I think it would rewrite the same file repeatedly.
Does that make sense for a performance test?

>
>> Or do you have other suggestions of what tests I should run?
>
> You're welcome to run this job but I'll also check with others here to
> see what fio jobs we used in the recent past when assessing performance
> of the dm-crypt parallelization changes.

That's very helpful.

>
> Also, a lot of care needs to be taken to eliminate jitter in the system
> while the test is running.  We got a lot of good insight from Bart Van
> Assche on that and put it to practice.  I'll see if we can (re)summarize
> that too.

Very helpful too.

Thanks.

>
> Mike

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> We need to test on large HW raid setups like a Netapp filer (or even
> local SAS drives connected via some SAS controller).  Like a 8+2 drive
> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> devices is also useful.  It is larger RAID setups that will be more
> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> size boundaries.

Here are test results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe targets.
Each case ran for 0.5 hours, so it took 36 hours to finish all the tests on the 4.1-rc4 and 4.1-rc4-patched kernels.

No performance regressions were introduced.

Test server: Dell R730xd (2 sockets/48 logical cpus/264G memory)
HW RAID6/MD RAID6/DM stripe targets were configured with 10 HDDs, each 280G.
Stripe sizes of 64k and 128k were tested.

devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
spare_devs="/dev/sdl /dev/sdm"
stripe_size=64 (or 128)

MD RAID6 was created by:
mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size

DM stripe target was created by:
pvcreate $devs
vgcreate striped_vol_group $devs
lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group

Here is an example of fio script for stripe size 128k:
[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
gtod_reduce=0
norandommap
write_iops_log=fs

[job1]
bs=1280K
directory=/mnt
size=5G
rw=read

All results here: http://minggr.net/pub/20150608/fio_results/

Results summary:

1. HW RAID6: stripe size 64k
                4.1-rc4         4.1-rc4-patched
                -------         ---------------
                (MB/s)          (MB/s)
xfs read:       821.23          812.20  -1.09%
xfs write:      753.16          754.42  +0.16%
ext4 read:      827.80          834.82  +0.84%
ext4 write:     783.08          777.58  -0.70%
btrfs read:     859.26          871.68  +1.44%
btrfs write:    815.63          844.40  +3.52%

2. HW RAID6: stripe size 128k
                4.1-rc4         4.1-rc4-patched
                -------         ---------------
                (MB/s)          (MB/s)
xfs read:       948.27          979.11  +3.25%
xfs write:      820.78          819.94  -0.10%
ext4 read:      978.35          997.92  +2.00%
ext4 write:     853.51          847.97  -0.64%
btrfs read:     1013.1          1015.6  +0.24%
btrfs write:    854.43          850.42  -0.46%

3. MD RAID6: stripe size 64k
                4.1-rc4         4.1-rc4-patched
                -------         ---------------
                (MB/s)          (MB/s)
xfs read:       847.34          869.43  +2.60%
xfs write:      198.67          199.03  +0.18%
ext4 read:      763.89          767.79  +0.51%
ext4 write:     281.44          282.83  +0.49%
btrfs read:     756.02          743.69  -1.63%
btrfs write:    268.37          265.93  -0.90%

4. MD RAID6: stripe size 128k
                4.1-rc4         4.1-rc4-patched
                -------         ---------------
                (MB/s)          (MB/s)
xfs read:       993.04          1014.1  +2.12%
xfs write:      293.06          298.95  +2.00%
ext4 read:      1019.6          1020.9  +0.12%
ext4 write:     371.51          371.47  -0.01%
btrfs read:     1000.4          1020.8  +2.03%
btrfs write:    241.08          246.77  +2.36%

5. DM: stripe size 64k
                4.1-rc4         4.1-rc4-patched
                -------         ---------------
                (MB/s)          (MB/s)
xfs read:       1084.4          1080.1  -0.39%
xfs write:      1071.1          1063.4  -0.71%
ext4 read:      991.54          1003.7  +1.22%
ext4 write:     1069.7          1052.2  -1.63%
btrfs read:     1076.1          1082.1  +0.55%
btrfs write:    968.98          965.07  -0.40%

6. DM: stripe size 128k
                4.1-rc4         4.1-rc4-patched
                -------         ---------------
                (MB/s)          (MB/s)
xfs read:       1020.4          1066.1  +4.47%
xfs write:      1058.2          1066.6  +0.79%
ext4 read:      990.72          988.19  -0.25%
ext4 write:     1050.4          1070.2  +1.88%
btrfs read:     1080.9          1074.7  -0.57%
btrfs write:    975.10          972.76  -0.23%






Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[hidden email]> wrote:

> On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> We need to test on large HW raid setups like a Netapp filer (or even
>> local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> devices is also useful.  It is larger RAID setups that will be more
>> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> size boundaries.
>
> Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
> Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
>
> No performance regressions were introduced.
>
> Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
> HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
> Stripe size 64k and 128k were tested.
>
> devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> spare_devs="/dev/sdl /dev/sdm"
> stripe_size=64 (or 128)
>
> MD RAID6 was created by:
> mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>
> DM stripe target was created by:
> pvcreate $devs
> vgcreate striped_vol_group $devs
> lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> Here is an example of fio script for stripe size 128k:
> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=48
> gtod_reduce=0
> norandommap
> write_iops_log=fs
>
> [job1]
> bs=1280K
> directory=/mnt
> size=5G
> rw=read
>
> All results here: http://minggr.net/pub/20150608/fio_results/
>
> Results summary:
>
> 1. HW RAID6: stripe size 64k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       821.23          812.20  -1.09%
> xfs write:      753.16          754.42  +0.16%
> ext4 read:      827.80          834.82  +0.84%
> ext4 write:     783.08          777.58  -0.70%
> btrfs read:     859.26          871.68  +1.44%
> btrfs write:    815.63          844.40  +3.52%
>
> 2. HW RAID6: stripe size 128k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       948.27          979.11  +3.25%
> xfs write:      820.78          819.94  -0.10%
> ext4 read:      978.35          997.92  +2.00%
> ext4 write:     853.51          847.97  -0.64%
> btrfs read:     1013.1          1015.6  +0.24%
> btrfs write:    854.43          850.42  -0.46%
>
> 3. MD RAID6: stripe size 64k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       847.34          869.43  +2.60%
> xfs write:      198.67          199.03  +0.18%
> ext4 read:      763.89          767.79  +0.51%
> ext4 write:     281.44          282.83  +0.49%
> btrfs read:     756.02          743.69  -1.63%
> btrfs write:    268.37          265.93  -0.90%
>
> 4. MD RAID6: stripe size 128k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       993.04          1014.1  +2.12%
> xfs write:      293.06          298.95  +2.00%
> ext4 read:      1019.6          1020.9  +0.12%
> ext4 write:     371.51          371.47  -0.01%
> btrfs read:     1000.4          1020.8  +2.03%
> btrfs write:    241.08          246.77  +2.36%
>
> 5. DM: stripe size 64k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       1084.4          1080.1  -0.39%
> xfs write:      1071.1          1063.4  -0.71%
> ext4 read:      991.54          1003.7  +1.22%
> ext4 write:     1069.7          1052.2  -1.63%
> btrfs read:     1076.1          1082.1  +0.55%
> btrfs write:    968.98          965.07  -0.40%
>
> 6. DM: stripe size 128k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       1020.4          1066.1  +4.47%
> xfs write:      1058.2          1066.6  +0.79%
> ext4 read:      990.72          988.19  -0.25%
> ext4 write:     1050.4          1070.2  +1.88%
> btrfs read:     1080.9          1074.7  -0.57%
> btrfs write:    975.10          972.76  -0.23%

Hi Mike,

How about these numbers?

I'm also happy to run other fio jobs your team used.

Thanks.

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Mike Snitzer
On Wed, Jun 10 2015 at  5:20pm -0400,
Ming Lin <[hidden email]> wrote:

> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[hidden email]> wrote:
> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> >> We need to test on large HW raid setups like a Netapp filer (or even
> >> local SAS drives connected via some SAS controller).  Like a 8+2 drive
> >> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> >> devices is also useful.  It is larger RAID setups that will be more
> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> >> size boundaries.
> >
> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
> >
> > No performance regressions were introduced.
> >
> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
> > Stripe size 64k and 128k were tested.
> >
> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> > spare_devs="/dev/sdl /dev/sdm"
> > stripe_size=64 (or 128)
> >
> > MD RAID6 was created by:
> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
> >
> > DM stripe target was created by:
> > pvcreate $devs
> > vgcreate striped_vol_group $devs
> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group

DM had a regression relative to merge_bvec that wasn't fixed until
recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
casting bug in dm_merge_bvec()").  It was introduced in 4.1.

So your 4.1-rc4 DM stripe testing may have effectively been with
merge_bvec disabled.

> > Here is an example of fio script for stripe size 128k:
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1800
> > time_based
> > group_reporting
> > numjobs=48
> > gtod_reduce=0
> > norandommap
> > write_iops_log=fs
> >
> > [job1]
> > bs=1280K
> > directory=/mnt
> > size=5G
> > rw=read
> >
> > All results here: http://minggr.net/pub/20150608/fio_results/
> >
> > Results summary:
> >
> > 1. HW RAID6: stripe size 64k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       821.23          812.20  -1.09%
> > xfs write:      753.16          754.42  +0.16%
> > ext4 read:      827.80          834.82  +0.84%
> > ext4 write:     783.08          777.58  -0.70%
> > btrfs read:     859.26          871.68  +1.44%
> > btrfs write:    815.63          844.40  +3.52%
> >
> > 2. HW RAID6: stripe size 128k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       948.27          979.11  +3.25%
> > xfs write:      820.78          819.94  -0.10%
> > ext4 read:      978.35          997.92  +2.00%
> > ext4 write:     853.51          847.97  -0.64%
> > btrfs read:     1013.1          1015.6  +0.24%
> > btrfs write:    854.43          850.42  -0.46%
> >
> > 3. MD RAID6: stripe size 64k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       847.34          869.43  +2.60%
> > xfs write:      198.67          199.03  +0.18%
> > ext4 read:      763.89          767.79  +0.51%
> > ext4 write:     281.44          282.83  +0.49%
> > btrfs read:     756.02          743.69  -1.63%
> > btrfs write:    268.37          265.93  -0.90%
> >
> > 4. MD RAID6: stripe size 128k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       993.04          1014.1  +2.12%
> > xfs write:      293.06          298.95  +2.00%
> > ext4 read:      1019.6          1020.9  +0.12%
> > ext4 write:     371.51          371.47  -0.01%
> > btrfs read:     1000.4          1020.8  +2.03%
> > btrfs write:    241.08          246.77  +2.36%
> >
> > 5. DM: stripe size 64k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       1084.4          1080.1  -0.39%
> > xfs write:      1071.1          1063.4  -0.71%
> > ext4 read:      991.54          1003.7  +1.22%
> > ext4 write:     1069.7          1052.2  -1.63%
> > btrfs read:     1076.1          1082.1  +0.55%
> > btrfs write:    968.98          965.07  -0.40%
> >
> > 6. DM: stripe size 128k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       1020.4          1066.1  +4.47%
> > xfs write:      1058.2          1066.6  +0.79%
> > ext4 read:      990.72          988.19  -0.25%
> > ext4 write:     1050.4          1070.2  +1.88%
> > btrfs read:     1080.9          1074.7  -0.57%
> > btrfs write:    975.10          972.76  -0.23%
>
> Hi Mike,
>
> How about these numbers?

Looks fairly good.  I'm just not sure the workload is going to test the
code paths in question like we'd hope.  I'll have to set aside some time
to think through scenarios to test.

My concern still remains that at some point in the future we'll regret
not having merge_bvec but it'll be too late.  That is just my own FUD at
this point...

> I'm also happy to run other fio jobs your team used.

I've been busy getting DM changes for the 4.2 merge window finalized.
As such I haven't connected with others on the team to discuss this
issue.

I'll see if we can make time in the next 2 days.  But I also have
RHEL-specific kernel deadlines I'm coming up against.

Seems late to be staging this extensive a change for 4.2... are you
pushing for this code to land in the 4.2 merge window?  Or do we have
time to work this further and target the 4.3 merge?

Mike

Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

Ming Lin-2
On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <[hidden email]> wrote:

> On Wed, Jun 10 2015 at  5:20pm -0400,
> Ming Lin <[hidden email]> wrote:
>
>> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[hidden email]> wrote:
>> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> >> We need to test on large HW raid setups like a Netapp filer (or even
>> >> local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> >> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> >> devices is also useful.  It is larger RAID setups that will be more
>> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> >> size boundaries.
>> >
>> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
>> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
>> >
>> > No performance regressions were introduced.
>> >
>> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
>> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
>> > Stripe size 64k and 128k were tested.
>> >
>> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
>> > spare_devs="/dev/sdl /dev/sdm"
>> > stripe_size=64 (or 128)
>> >
>> > MD RAID6 was created by:
>> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>> >
>> > DM stripe target was created by:
>> > pvcreate $devs
>> > vgcreate striped_vol_group $devs
>> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> DM had a regression relative to merge_bvec that wasn't fixed until
> recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
> casting bug in dm_merge_bvec()").  It was introduced in 4.1.
>
> So your 4.1-rc4 DM stripe testing may have effectively been with
> merge_bvec disabled.

I'll rebase it to the latest Linus tree and re-run the DM stripe testing.

>
>> > Here is an example of fio script for stripe size 128k:
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=48
>> > gtod_reduce=0
>> > norandommap
>> > write_iops_log=fs
>> >
>> > [job1]
>> > bs=1280K
>> > directory=/mnt
>> > size=5G
>> > rw=read
>> >
>> > All results here: http://minggr.net/pub/20150608/fio_results/
>> >
>> > Results summary:
>> >
>> > 1. HW RAID6: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       821.23          812.20  -1.09%
>> > xfs write:      753.16          754.42  +0.16%
>> > ext4 read:      827.80          834.82  +0.84%
>> > ext4 write:     783.08          777.58  -0.70%
>> > btrfs read:     859.26          871.68  +1.44%
>> > btrfs write:    815.63          844.40  +3.52%
>> >
>> > 2. HW RAID6: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       948.27          979.11  +3.25%
>> > xfs write:      820.78          819.94  -0.10%
>> > ext4 read:      978.35          997.92  +2.00%
>> > ext4 write:     853.51          847.97  -0.64%
>> > btrfs read:     1013.1          1015.6  +0.24%
>> > btrfs write:    854.43          850.42  -0.46%
>> >
>> > 3. MD RAID6: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       847.34          869.43  +2.60%
>> > xfs write:      198.67          199.03  +0.18%
>> > ext4 read:      763.89          767.79  +0.51%
>> > ext4 write:     281.44          282.83  +0.49%
>> > btrfs read:     756.02          743.69  -1.63%
>> > btrfs write:    268.37          265.93  -0.90%
>> >
>> > 4. MD RAID6: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       993.04          1014.1  +2.12%
>> > xfs write:      293.06          298.95  +2.00%
>> > ext4 read:      1019.6          1020.9  +0.12%
>> > ext4 write:     371.51          371.47  -0.01%
>> > btrfs read:     1000.4          1020.8  +2.03%
>> > btrfs write:    241.08          246.77  +2.36%
>> >
>> > 5. DM: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       1084.4          1080.1  -0.39%
>> > xfs write:      1071.1          1063.4  -0.71%
>> > ext4 read:      991.54          1003.7  +1.22%
>> > ext4 write:     1069.7          1052.2  -1.63%
>> > btrfs read:     1076.1          1082.1  +0.55%
>> > btrfs write:    968.98          965.07  -0.40%
>> >
>> > 6. DM: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       1020.4          1066.1  +4.47%
>> > xfs write:      1058.2          1066.6  +0.79%
>> > ext4 read:      990.72          988.19  -0.25%
>> > ext4 write:     1050.4          1070.2  +1.88%
>> > btrfs read:     1080.9          1074.7  -0.57%
>> > btrfs write:    975.10          972.76  -0.23%
>>
>> Hi Mike,
>>
>> How about these numbers?
>
> Looks fairly good.  I just am not sure the workload is going to test the
> code paths in question like we'd hope.  I'll have to set aside some time

How about adding some counters to record, for example, how many times
->merge_bvec_fn is called in the old kernel and how many times bio splitting
happens in the patched kernel?  Something like the rough sketch below.
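
For the old kernel the counter could sit where bio_add_page() consults
the driver; a rough sketch of the idea (counter name invented, exported
the same way as any other debugfs u64):

/* old kernel, block/bio.c sketch: count how often the hook is consulted */
static u64 nr_merge_bvec_calls;

        /* ... inside __bio_add_page(), where the callback is checked ... */
        if (q->merge_bvec_fn) {
                nr_merge_bvec_calls++;  /* added line */
                /* existing merge_bvec_fn handling continues unchanged */
        }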

> to think through scenarios to test.

Great.

>
> My concern still remains that at some point it the future we'll regret
> not having merge_bvec but it'll be too late.  That is just my own FUD at
> this point...
>
>> I'm also happy to run other fio jobs your team used.
>
> I've been busy getting DM changes for the 4.2 merge window finalized.
> As such I haven't connected with others on the team to discuss this
> issue.
>
> I'll see if we can make time in the next 2 days.  But I also have
> RHEL-specific kernel deadlines I'm coming up against.
>
> Seems late to be staging this extensive a change for 4.2... are you
> pushing for this code to land in the 4.2 merge window?  Or do we have
> time to work this further and target the 4.3 merge?

I'm OK with targeting the 4.3 merge window.
But I hope we can get it into the linux-next tree ASAP for wider testing.

>
> Mike