Tuesday 31 March 2015

Inside Linux cgroups for blkio subsystem

cgroups let us distribute resources among individual tasks or groups of tasks. Each cgroup hierarchy is bound to one or more subsystems (resource controllers such as cpu, memory, blkio), which apply per-cgroup limits for those resources. See [1] [2].

The following steps create a cgroup hierarchy that uses only the blkio subsystem and apply limits through it.

Create the blkio cgroups:
                        mount -t tmpfs cgroup_root /sys/fs/cgroup
                        mkdir /sys/fs/cgroup/blkio
                        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
                        mkdir -p /sys/fs/cgroup/blkio/test1/    # create cgroup test1
                        mkdir -p /sys/fs/cgroup/blkio/test2/    # create cgroup test2

                        echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight    # set weight of cgroup test1
                        echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight     # set weight of cgroup test2

                        sync
                        echo 3 > /proc/sys/vm/drop_caches

                        dd if=/dev/sdbv of=file_1 bs=1M count=512 &
                        echo $! > /sys/fs/cgroup/blkio/test1/tasks    # attach dd process to test1 cgroup
                        cat /sys/fs/cgroup/blkio/test1/tasks

                        dd if=/dev/sdbv of=file_2 bs=1M count=512 &
                        echo $! > /sys/fs/cgroup/blkio/test2/tasks    # attach dd process to test2 cgroup
                        cat /sys/fs/cgroup/blkio/test2/tasks

Here we create two cgroups under the blkio subsystem, assign weights, and attach a “dd” process to each of them. The “test1” cgroup should complete its I/O faster than the “test2” cgroup, because “test2” is assigned the lower weight.
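
As a rough sanity check on what to expect, the ideal split under proportional-weight scheduling is each cgroup's weight divided by the sum of the weights of the competing cgroups. The sketch below computes that split for two weights. The function name expected_share is made up for illustration, and real throughput will also depend on the I/O scheduler, the device, and the workload.

```shell
# expected_share WEIGHT OTHER_WEIGHT
# Prints the ideal percentage share for a cgroup with WEIGHT when it competes
# with exactly one other cgroup of OTHER_WEIGHT (integer, rounded down).
# Hypothetical helper, not part of any cgroup tooling.
expected_share() {
    echo $(( $1 * 100 / ($1 + $2) ))
}
```

With the weights used above, "expected_share 1000 500" prints 66 and "expected_share 500 1000" prints 33, so test1 would be expected to get roughly twice the share of test2 while both dd processes are running.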

A peek into the task_struct of the dd process:
We added a jprobe on the generic_make_request function and printed, for every subsystem, the cgroup that the “dd” process is attached to.

Here is the probe function code:
/* jprobe handler: its signature must match generic_make_request() */
void my_handler(struct bio *bio)
{
    struct task_struct *task = current;
    struct cgroup_subsys_state *css;
    int i;

    /* Only report for processes whose name starts with "dd" */
    if (strncmp(task->comm, "dd", 2) == 0) {
        printk(KERN_INFO "assignment: current process: %s, PID: %d\n",
               task->comm, task->pid);
        for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
            printk(KERN_INFO "cgroup subsys count = %d\n", i);
            /* subsys[] is an array inside css_set; unused slots are NULL */
            css = task->cgroups->subsys[i];
            if (css == NULL || css->cgroup == NULL)
                continue;
            if (css->cgroup->name != NULL)
                printk(KERN_INFO "cgroup->name  = %s\n",
                       css->cgroup->name->name);
            if (css->ss != NULL && css->ss->name != NULL)
                printk(KERN_INFO "cgroup->subsys name  = %s\n",
                       css->ss->name);
        }
    }
    /* A jprobe handler must always end with jprobe_return() */
    jprobe_return();
}


This is the output we get:

2014-12-02T13:29:32.843643+05:30 lnx kernel: [508988.896860] assignment: current process: dd, PID: 29713
2014-12-02T13:29:32.843644+05:30 lnx kernel: [508988.896861] cgroup subsys count = 0
2014-12-02T13:29:32.843645+05:30 lnx kernel: [508988.896863] cgroup->name  = /
2014-12-02T13:29:32.843653+05:30 lnx kernel: [508988.896865] cgroup->subsys name  = cpuset

2014-12-02T13:29:32.843654+05:30 lnx kernel: [508988.896866] cgroup subsys count = 1
2014-12-02T13:29:32.843656+05:30 lnx kernel: [508988.896868] cgroup->name  = /
2014-12-02T13:29:32.843657+05:30 lnx kernel: [508988.896870] cgroup->subsys name  = cpu

2014-12-02T13:29:32.843658+05:30 lnx kernel: [508988.896871] cgroup subsys count = 2
2014-12-02T13:29:32.843659+05:30 lnx kernel: [508988.896873] cgroup->name  = /
2014-12-02T13:29:32.843660+05:30 lnx kernel: [508988.896874] cgroup->subsys name  = cpuacct

2014-12-02T13:29:32.843662+05:30 lnx kernel: [508988.896876] cgroup subsys count = 3
2014-12-02T13:29:32.843663+05:30 lnx kernel: [508988.896878] cgroup->name  = /
2014-12-02T13:29:32.843665+05:30 lnx kernel: [508988.896879] cgroup->subsys name  = memory

2014-12-02T13:29:32.843666+05:30 lnx kernel: [508988.896881] cgroup subsys count = 4
2014-12-02T13:29:32.843667+05:30 lnx kernel: [508988.896882] cgroup->name  = /
2014-12-02T13:29:32.843668+05:30 lnx kernel: [508988.896884] cgroup->subsys name  = devices

2014-12-02T13:29:32.843669+05:30 lnx kernel: [508988.896886] cgroup subsys count = 5
2014-12-02T13:29:32.843671+05:30 lnx kernel: [508988.896887] cgroup->name  = /
2014-12-02T13:29:32.843672+05:30 lnx kernel: [508988.896888] cgroup->subsys name  = freezer

2014-12-02T13:29:32.843681+05:30 lnx kernel: [508988.896890] cgroup subsys count = 6
2014-12-02T13:29:32.843682+05:30 lnx kernel: [508988.896891] cgroup->name  = test1
2014-12-02T13:29:32.843683+05:30 lnx kernel: [508988.896893] cgroup->subsys name  = blkio

2014-12-02T13:29:32.843685+05:30 lnx kernel: [508988.896894] cgroup subsys count = 7
2014-12-02T13:29:32.843686+05:30 lnx kernel: [508988.896896] cgroup->name  = /
2014-12-02T13:29:32.843688+05:30 lnx kernel: [508988.896897] cgroup->subsys name  = perf_event


2014-12-02T13:29:32.843689+05:30 lnx kernel: [508988.896898] cgroup subsys count = 8
2014-12-02T13:29:32.843690+05:30 lnx kernel: [508988.896901] cgroup->name  = /
2014-12-02T13:29:32.843692+05:30 lnx kernel: [508988.896902] cgroup->subsys name  = hugetlb


2014-12-02T13:29:32.843693+05:30 lnx kernel: [508988.896902] cgroup subsys count = 9
2014-12-02T13:29:32.843694+05:30 lnx kernel: [508988.896903] cgroup subsys count = 10

From this example we see that, for the dd process, every subsystem (resource) uses the default root ("/") cgroup except blkio, which uses the test1 cgroup.
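
The same membership can also be read from user space without a probe: /proc/<PID>/cgroup lists one line per mounted hierarchy in the form hierarchy-ID:subsystems:cgroup-path. The helper below is a small sketch (the name cgroup_of is hypothetical) that prints the cgroup path for one subsystem. It takes the file as an argument so it can be pointed at /proc/<PID>/cgroup or at a saved copy, and it assumes the cgroup path itself contains no ':' character.

```shell
# cgroup_of SUBSYSTEM FILE
# FILE has the /proc/<PID>/cgroup layout: hierarchy-ID:subsys[,subsys...]:path
# Prints the path of the first hierarchy that carries SUBSYSTEM, or nothing.
# Assumption: the cgroup path contains no ':' (it would be truncated here).
cgroup_of() {
    awk -F: -v want="$1" '
        {
            n = split($2, names, ",")   # one hierarchy may carry several subsystems
            for (i = 1; i <= n; i++)
                if (names[i] == want) { print $3; exit }
        }' "$2"
}
```

For the dd process above, "cgroup_of blkio /proc/29713/cgroup" would be expected to print /test1, while "cgroup_of memory /proc/29713/cgroup" would be expected to print /.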

Next we look at how cgroup initialization is done and at the code behind each of the steps above.


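Linux cgroups initialization at boot up:

A new filesystem type, "cgroup", is registered with the VFS during Linux startup.
The boot-time call sequence is:
start_kernel -> cgroup_init_early -> cgroup_init_subsys (for subsystems that request early initialization)
start_kernel -> cgroup_init -> cgroup_init_subsys (for the remaining subsystems), followed by filesystem registration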


In cgroup_init_subsys, the top cgroup state for the subsystem is created:
                        /* Create the top cgroup state for this subsystem */
                        list_add(&ss->sibling, &cgroup_dummy_root.subsys_list);

cgroupfs_root is created.

Filesystem registration:
cgroup_init() registers the cgroup filesystem type, which supplies the mount and unmount operations:
                        err = register_filesystem(&cgroup_fs_type);
                       
static struct file_system_type cgroup_fs_type = {
                        .name = "cgroup",
                        .mount = cgroup_mount,
                        .kill_sb = cgroup_kill_sb,
};


CGROUP ACTIONS:
All cgroup actions are performed through filesystem operations (creating or removing a directory, reading or writing the files in it, mounting with mount options).

The mount operations were covered above. The read, write, create, and remove operations are defined in kernel/cgroup.c:

static const struct file_operations cgroup_file_operations = {
                        .read = cgroup_file_read,
                        .write = cgroup_file_write,
                        .llseek = generic_file_llseek,
                        .open = cgroup_file_open,
                        .release = cgroup_file_release,
};

static const struct inode_operations cgroup_file_inode_operations = {
                        .setxattr = cgroup_setxattr,
                        .getxattr = cgroup_getxattr,
                        .listxattr = cgroup_listxattr,
                        .removexattr = cgroup_removexattr,
};

static const struct inode_operations cgroup_dir_inode_operations = {
                        .lookup = simple_lookup,
                        .mkdir = cgroup_mkdir,
                        .rmdir = cgroup_rmdir,
                        .rename = cgroup_rename,
                        .setxattr = cgroup_setxattr,
                        .getxattr = cgroup_getxattr,
                        .listxattr = cgroup_listxattr,
                        .removexattr = cgroup_removexattr,
};


A cgroup hierarchy can be mounted anywhere in the filesystem; systemd uses /sys/fs/cgroup. When mounting, the mount options (-o) specify which subsystems to use.
For example, to create a cgroup with the blkio subsystem, the commands are:
                        mount -t tmpfs cgroup_root /sys/fs/cgroup
                        mkdir /sys/fs/cgroup/blkio
                        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
                        mkdir -p /sys/fs/cgroup/blkio/test1/

mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
This command calls cgroup_mount and creates the following files in the mount directory:

lnx:/sys/fs/cgroup/blkio # ls
blkio.io_merged                   blkio.io_service_time_recursive  blkio.reset_stats                blkio.throttle.write_bps_device   cgroup.event_control
blkio.io_merged_recursive         blkio.io_serviced                blkio.sectors                    blkio.throttle.write_iops_device  cgroup.procs
blkio.io_queued                   blkio.io_serviced_recursive      blkio.sectors_recursive          blkio.time                        cgroup.sane_behavior
blkio.io_queued_recursive         blkio.io_wait_time               blkio.throttle.io_service_bytes  blkio.time_recursive              notify_on_release
blkio.io_service_bytes            blkio.io_wait_time_recursive     blkio.throttle.io_serviced       blkio.weight                      release_agent
blkio.io_service_bytes_recursive  blkio.leaf_weight                blkio.throttle.read_bps_device   blkio.weight_device               tasks
blkio.io_service_time             blkio.leaf_weight_device         blkio.throttle.read_iops_device  cgroup.clone_children
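
Before mounting, it can be useful to check whether a blkio hierarchy is already mounted somewhere, since some distributions mount one at boot. The following sketch scans a file in /proc/mounts format for a cgroup mount whose options include blkio and prints its mount point. The function name blkio_mountpoint is made up for this example.

```shell
# blkio_mountpoint [FILE]
# FILE defaults to /proc/mounts; its fields are:
#   device mountpoint fstype options dump pass
# Prints the first cgroup mount point whose options include "blkio" and
# returns non-zero if there is none. Hypothetical helper name.
blkio_mountpoint() {
    awk '$3 == "cgroup" && ("," $4 ",") ~ /,blkio,/ { print $2; found = 1; exit }
         END { exit !found }' "${1:-/proc/mounts}"
}
```

On the setup described here it should print /sys/fs/cgroup/blkio once the mount command has run.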

Now make a directory in the newly mounted hierarchy:
mkdir -p /sys/fs/cgroup/blkio/test1

cgroup_create is called. Here the new cgroup is created and the blkio subsystem state is initialised:
                        /* allocate the cgroup and its ID, 0 is reserved for the root */
                        cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
                        if (!cgrp)
                                return -ENOMEM;

                        name = cgroup_alloc_name(dentry);

The cgroup's blkio subsystem state is allocated:
                        css = ss->css_alloc(cgroup_css(parent, ss));

                                               
For the blkio controller (blkcg), the css_alloc function called is blkcg_css_alloc.

In this function the blkcg is initialised:

                        blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
                        if (!blkcg)
                                return ERR_PTR(-ENOMEM);

                        blkcg->cfq_weight = CFQ_WEIGHT_DEFAULT;
                        blkcg->cfq_leaf_weight = CFQ_WEIGHT_DEFAULT;
                        blkcg->id = atomic64_inc_return(&id_seq); /* root is 0, start from 1 */
                        spin_lock_init(&blkcg->lock);
                        INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
                        INIT_HLIST_HEAD(&blkcg->blkg_list);
                       
                       
struct blkcg {
                        struct cgroup_subsys_state      css;
                        spinlock_t                      lock;
                        struct radix_tree_root          blkg_tree;
                        struct blkcg_gq                 *blkg_hint;
                        struct hlist_head               blkg_list;
                        /* for policies to test whether associated blkcg has changed */
                        uint64_t                        id;
                        /* TODO: per-policy storage in blkcg */
                        unsigned int                    cfq_weight;     /* belongs to cfq */
                        unsigned int                    cfq_leaf_weight;
};
                                               
init_css is then called to initialise the cgroup_subsys_state from the blkio subsystem and the cgroup:
                        init_css(css, ss, cgrp);

2014-12-08T10:28:48.196121+05:30 lnx kernel: [243681.125109] //init_css Handler hit
2014-12-08T10:28:48.196140+05:30 lnx kernel: [243681.125115] cgrp name = test3
2014-12-08T10:28:48.196144+05:30 lnx kernel: [243681.125117] ss name = blkio

Stack trace from the jprobe on init_css:

2014-12-08T10:28:48.196175+05:30 lnx kernel: [243681.125235]  [<ffffffff810d7d69>] cgroup_mkdir+0x299/0x670
2014-12-08T10:28:48.196177+05:30 lnx kernel: [243681.125246]  [<ffffffff811a9d50>] vfs_mkdir+0xb0/0x160
2014-12-08T10:28:48.196179+05:30 lnx kernel: [243681.125254]  [<ffffffff811af28b>] SyS_mkdirat+0xab/0xe0
2014-12-08T10:28:48.196181+05:30 lnx kernel: [243681.125265]  [<ffffffff81519329>] system_call_fastpath+0x16/0x1b                              

The same mkdir path also generates the directory contents for the subsystem through calls to cgroup_populate_dir and cgroup_addrm_files.

dump_stack example via jprobe :
2014-12-08T09:59:57.364352+05:30 lnx kernel: [241951.139904]  [<ffffffff810d6909>] cgroup_populate_dir+0x69/0x110
2014-12-08T09:59:57.364354+05:30 lnx kernel: [241951.139909]  [<ffffffff810d80ad>] cgroup_mkdir+0x5dd/0x670
2014-12-08T09:59:57.364356+05:30 lnx kernel: [241951.139914]  [<ffffffff811a9d50>] vfs_mkdir+0xb0/0x160
2014-12-08T09:59:57.364357+05:30 lnx kernel: [241951.139919]  [<ffffffff811af28b>] SyS_mkdirat+0xab/0xe0


Changing the cgroup policies/properties:

The cgroup properties can be changed by writing to the files under /sys/fs/cgroup/blkio/<cgroup_name>/<property>.

Example :
                        echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight

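This invokes the cgroup write function, cgroup_file_write, which dispatches to the write handler registered in the file's cftype, for example:
                        if (cft->write)
                                return cft->write(css, cft, file, buf, nbytes, ppos);

For blkio.weight, the handler comes from the "weight" cftype, which registers cfq_set_weight as its write_u64 function:
                        {
                                .name = "weight",
                                .flags = CFTYPE_NOT_ON_ROOT,
                                .read_seq_string = cfq_print_weight,
                                .write_u64 = cfq_set_weight,
                        },

In the function __cfq_set_weight the value is then stored in the blkcg (cfq_weight for blkio.weight, cfq_leaf_weight for blkio.leaf_weight):
                        blkcg->cfq_weight = val;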
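
On the user-space side, a small wrapper can catch bad values before they reach the kernel. The sketch below assumes the CFQ weight range of 10 to 1000 documented for kernels of this generation (worth checking against Documentation/cgroups/blkio-controller.txt for the kernel in use). set_blkio_weight is a hypothetical helper name, and the cgroup directory is passed in so that nothing is hard-coded.

```shell
# set_blkio_weight CGROUP_DIR WEIGHT
# Writes WEIGHT to CGROUP_DIR/blkio.weight after a range check.
# Assumption: valid weights are 10..1000 (CFQ limits on 3.x kernels).
set_blkio_weight() {
    case "$2" in
        ''|*[!0-9]*|?????*)
            # empty, non-numeric, or five or more characters long
            echo "weight must be an integer of at most four digits" >&2
            return 1 ;;
    esac
    if [ "$2" -lt 10 ] || [ "$2" -gt 1000 ]; then
        echo "weight $2 is outside 10..1000" >&2
        return 1
    fi
    echo "$2" > "$1/blkio.weight"
}
```

For example, "set_blkio_weight /sys/fs/cgroup/blkio/test1 1000" performs the same write as the echo command above, while a value such as 5000 is rejected without touching the file.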

                                                                      
Attaching a task to the cgroup:
echo <PID> > /sys/fs/cgroup/blkio/test1/tasks

The write handler is taken from the "tasks" cftype entry:
                        {
                                .name = "tasks",
                                .flags = CFTYPE_INSANE,            /* use "procs" instead */
                                .open = cgroup_tasks_open,
                                .write_u64 = cgroup_tasks_write,
                                .release = cgroup_pidlist_release,
                                .mode = S_IRUGO | S_IWUSR,
                        },
                       
cgroup_tasks_write calls attach_task_by_pid (in cgroup.c), which in turn calls cgroup_attach_task:
                        ret = cgroup_attach_task(cgrp, tsk, threadgroup);


A css_set matching the new membership is found or created and attached to the task_struct of this process:
                        cgroup_task_migrate(tc->cgrp, tc->task, tc->cset);
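
From the shell side, one detail worth keeping in mind is that the tasks file accepts a single PID per write [1], so several PIDs have to be written one after another. A minimal sketch, with attach_pids as a hypothetical helper name:

```shell
# attach_pids CGROUP_DIR PID...
# Writes each PID to CGROUP_DIR/tasks with its own write and stops at the
# first failure. Hypothetical helper; CGROUP_DIR must already exist.
attach_pids() {
    dir=$1
    shift
    [ -e "$dir/tasks" ] || { echo "no tasks file in $dir" >&2; return 1; }
    for pid in "$@"; do
        echo "$pid" >> "$dir/tasks" || return 1
    done
}
```

For example, "attach_pids /sys/fs/cgroup/blkio/test1 "$pid1" "$pid2"" moves both processes into test1, where pid1 and pid2 stand for PIDs collected earlier with $!.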


Association of request_queue and the block cgroup:
When I/O from a cgroup first reaches the block layer for a device, an association is created between the device's request queue and the block cgroup.

This association is a “struct blkcg_gq”, created in the function “blkg_create”. A sample dump_stack from the creation of an association:

2014-12-15T14:02:07.036066+05:30 lnx kernel: [860978.128274] ////blkg_create Handler hit
2014-12-15T14:02:07.036078+05:30 lnx kernel: [860978.128281] CPU: 6 PID: 17627 Comm: dd Tainted: P           OENX 3.12.28-4-default #1
2014-12-15T14:02:07.036083+05:30 lnx kernel: [860978.128286]  ffff8810568495c0 ffffffff8150b1db ffffffff81acd8c0 ffffffffa039f018
2014-12-15T14:02:07.036084+05:30 lnx kernel: [860978.128291]  ffffffff8128a2f5 ffff88103e6a62c0 ffff880855f48078 ffff88103e712880
2014-12-15T14:02:07.036086+05:30 lnx kernel: [860978.128296]  ffff880855f48078 ffffffff812719b8 0000000000000001 ffff881055749808
2014-12-15T14:02:07.036092+05:30 lnx kernel: [860978.128301] Call Trace:
2014-12-15T14:02:07.036094+05:30 lnx kernel: [860978.128314]  [<ffffffff8100467d>] dump_trace+0x7d/0x2d0
2014-12-15T14:02:07.036095+05:30 lnx kernel: [860978.128321]  [<ffffffff81004964>] show_stack_log_lvl+0x94/0x170
2014-12-15T14:02:07.036096+05:30 lnx kernel: [860978.128326]  [<ffffffff81005d91>] show_stack+0x21/0x50
2014-12-15T14:02:07.036098+05:30 lnx kernel: [860978.128332]  [<ffffffff8150b1db>] dump_stack+0x41/0x51
2014-12-15T14:02:07.036099+05:30 lnx kernel: [860978.128337]  [<ffffffffa039f018>] my_handler+0x18/0x20 [probe]
2014-12-15T14:02:07.036100+05:30 lnx kernel: [860978.128347]  [<ffffffff8128a2f5>] blkg_lookup_create+0x45/0xc0
2014-12-15T14:02:07.036102+05:30 lnx kernel: [860978.128352]  [<ffffffff812719b8>] get_request+0x88/0x6f0
2014-12-15T14:02:07.036110+05:30 lnx kernel: [860978.128507]  [<ffffffff8127028f>] __blk_run_queue+0x2f/0x40
2014-12-15T14:02:07.036111+05:30 lnx kernel: [860978.128512]  [<ffffffff81273860>] blk_flush_plug_list+0x1e0/0x240
2014-12-15T14:02:07.036125+05:30 lnx kernel: [860978.128517]  [<ffffffff81273c20>] blk_finish_plug+0x10/0x40
2014-12-15T14:02:07.036127+05:30 lnx kernel: [860978.128522]  [<ffffffff81140f9f>] __do_page_cache_readahead+0x17f/0x1f0
2014-12-15T14:02:07.036128+05:30 lnx kernel: [860978.128528]  [<ffffffff8114115a>] ondemand_readahead+0x14a/0x280
2014-12-15T14:02:07.036130+05:30 lnx kernel: [860978.128534]  [<ffffffff81137129>] generic_file_aio_read+0x459/0x6f0
2014-12-15T14:02:07.036131+05:30 lnx kernel: [860978.128542]  [<ffffffff8119e2cc>] do_sync_read+0x5c/0x90
2014-12-15T14:02:07.036133+05:30 lnx kernel: [860978.128547]  [<ffffffff8119e879>] vfs_read+0x99/0x160
2014-12-15T14:02:07.036134+05:30 lnx kernel: [860978.128552]  [<ffffffff8119f378>] SyS_read+0x48/0xa0
2014-12-15T14:02:07.036136+05:30 lnx kernel: [860978.128557]  [<ffffffff81519329>] system_call_fastpath+0x16/0x1b
2014-12-15T14:02:07.036137+05:30 lnx kernel: [860978.128567]  [<00007fc484734480>] 0x7fc48473447f


In this example the blkcg_gq association is created from the get_request function. The newly created association between the block cgroup and the request queue is added to request_queue -> blkg_list.
The association is also kept in “struct blkcg” -> blkg_list.
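
One way to observe these per-device associations from user space is through the per-cgroup statistics files, which report a group of lines per device in the form major:minor, operation, value. The sketch below (read_bytes_per_device is a name invented here) extracts the Read counters from a file in the blkio.io_service_bytes layout. Each device that appears should correspond to a request queue to which the cgroup has issued I/O.

```shell
# read_bytes_per_device FILE
# FILE has the blkio.io_service_bytes layout, e.g. "8:16 Read 4096".
# Prints "major:minor bytes" for every Read line; the two-field summary
# line "Total N" at the end of the file is skipped.
read_bytes_per_device() {
    awk 'NF == 3 && $2 == "Read" { print $1, $3 }' "$1"
}
```

For example, "read_bytes_per_device /sys/fs/cgroup/blkio/test1/blkio.io_service_bytes" lists the bytes read per device by tasks in test1.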


References:
[1] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
