The Task Parallel Library Sampler – Part 7: Using Parallel.For effectively

Part One: Starting with MVVM
Part Two: The MVVM solution structure and basic framework
Part Three: Base Classes
Part 4: Sampler View, View Model and Model
Part 5: Running and working with the TPL samples
Part 6: Parallel.For Sample

In the last post we discussed where using a Parallel.For isn’t effective. The answer is fairly straightforward: Parallel.For (and by extension Parallel.ForEach) isn’t effective when you can’t give it enough work. Handing work to thread pool threads has its own overhead, and if you can’t give those threads enough work to outweigh it, parallelizing doesn’t make sense. Today we are going to discuss using Parallel.For effectively and what you have to change to convert a for loop into a Parallel.For.

GreyScaleSample.Run()

public override void Run(System.Drawing.Bitmap bmp = null, Action<string> UpdateLog = null)
{
	if(bmp == null)
		throw new InvalidOperationException("Bitmap must be defined.");
	
	Stopwatch s = new Stopwatch();
	s.Start();

	System.Drawing.Imaging.BitmapData bmData = bmp.LockBits(new System.Drawing.Rectangle(0, 0, bmp.Width, bmp.Height), System.Drawing.Imaging.ImageLockMode.ReadWrite, System.Drawing.Imaging.PixelFormat.Format24bppRgb);
	int stride = bmData.Stride;
	System.IntPtr Scan0 = bmData.Scan0;
	unsafe
	{
		byte* p = (byte*)(void*)Scan0;
		byte red, green, blue;

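		// Bytes of padding at the end of each row: the stride is rounded up
		// to a multiple of four, so it can be larger than Width * 3.
		int padding = stride - bmp.Width * 3;

		for (int y = 0; y < bmp.Height; ++y)
		{
			for (int x = 0; x < bmp.Width; ++x)
			{
				blue = p[0];
				green = p[1];
				red = p[2];

				p[0] = p[1] = p[2] = (byte)(.299 * red
					+ .587 * green
					+ .114 * blue);

				p += 3;
			}

			// Skip the row padding so p lands on the start of the next row.
			p += padding;
		}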
	}
	bmp.UnlockBits(bmData);

	s.Stop();
	RunTime = s.Elapsed;
}

In the above sample we iterate over the image, starting at the first row (Scan0, the address of the first pixel, which is cast to a byte pointer named “p” for pixel just for clarity of the code), and then iterate over the columns in that row. In the 24bpp format a bitmap is one long run of bytes in which every three bytes hold the blue, green and red values of a pixel, in that order (the opposite of the RGB order we might expect). The stride is the number of bytes in a row; it is rounded up to a multiple of four, so it matches the width of the bitmap times three only when that product is already a multiple of four. We read the blue, green and red values and set all three bytes to the gray value of the color. We then advance the pointer by 3 (the 3 bytes of one pixel) and move on to the next pixel.
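To make the layout concrete, here is a minimal sketch of the row arithmetic for a 24bpp bitmap. The class and method names (BitmapLayout, Stride24bpp, PixelOffset) are mine, not part of the sampler, and the sketch assumes a top-down bitmap with a positive stride.

```csharp
using System;

static class BitmapLayout
{
    // Bytes per row for a 24bpp bitmap: width * 3, rounded up to a multiple of 4.
    // (Assumption: positive stride, as for a top-down bitmap.)
    public static int Stride24bpp(int width) => (width * 3 + 3) & ~3;

    // Byte offset, relative to Scan0, of the blue byte of pixel (x, y).
    public static int PixelOffset(int x, int y, int stride) => y * stride + x * 3;

    public static void Main()
    {
        int width = 5;
        int stride = Stride24bpp(width);

        // A 5-pixel row uses 15 bytes of pixel data; the rest of the stride is padding.
        Console.WriteLine($"pixel bytes: {width * 3}, stride: {stride}, padding: {stride - width * 3}");
        Console.WriteLine($"offset of pixel (2, 1): {PixelOffset(2, 1, stride)}");
    }
}
```

For a width that is already a multiple of four the padding is zero, which is why code that ignores the stride can appear to work on some images and skew others.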

There is some messy pointer stuff here, but all in all it should be clear what the code is doing.

GreyScaleParallelSample.Run()

public override void Run(System.Drawing.Bitmap bmp = null, Action<string> UpdateLog = null)
{
	if(bmp == null)
		throw new InvalidOperationException("Bitmap must be defined.");
	
	Stopwatch s = new Stopwatch();
	s.Start();

	System.Drawing.Imaging.BitmapData bmData = bmp.LockBits(new System.Drawing.Rectangle(0, 0, bmp.Width, bmp.Height), System.Drawing.Imaging.ImageLockMode.ReadWrite, System.Drawing.Imaging.PixelFormat.Format24bppRgb);
	int stride = bmData.Stride;
	System.IntPtr Scan0 = bmData.Scan0;
	unsafe
	{
		byte* start = (byte*)(void*)Scan0;

		int height = bmp.Height;
		int width = bmp.Width;

		Parallel.For(0, height, y =>
		{
			byte* p = start + (y * stride);
			for (int x = 0; x < width; ++x)
			{
				byte blue = p[0];
				byte green = p[1];
				byte red = p[2];

				p[0] = p[1] = p[2] = (byte)(.299 * red
					+ .587 * green
					+ .114 * blue);

				p += 3;
			}
		});
	}
	bmp.UnlockBits(bmData);

	s.Stop();
	RunTime = s.Elapsed;
}

In the Parallel.For sample things are a bit different and these differences are important.

First off, we have to remember that the iterations of a Parallel.For are spread across thread pool threads and can run at the same time, in any order. As such, there can’t be any variables that are modified and shared between iterations (at least not without something like Interlocked, but that’s a different post). Imagine if the pointer to the pixel were shared between the threads, as it is in the first sample. If the thread pool ran the loop on 10 threads, they would all be reading and advancing that same pointer at once. This is problematic, so the code is changed here to calculate the pointer to the start of the row from y at the beginning of each iteration.

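Second, we move the declarations of the bytes for blue, green and red into the inner loop. This makes the code more evident, and it also keeps them local to each iteration; left declared outside the Parallel.For, they would be captured by the lambda and shared between threads just like the pointer.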
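The same conversion can be tried without pointers or System.Drawing. The following is a sketch under my own assumptions, not the sampler's code: it works on a plain byte array laid out like a 24bpp bitmap, and the names GreyScaleSketch, RunSequential and RunParallel are hypothetical. The sequential version carries one running index across rows, while the parallel version recomputes its index from y so nothing mutable is shared between iterations.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class GreyScaleSketch
{
    // Same luminance weights as the samples in this post.
    static byte ToGrey(byte blue, byte green, byte red) =>
        (byte)(.299 * red + .587 * green + .114 * blue);

    // Plain loop: one running index, advanced pixel by pixel and row by row.
    public static void RunSequential(byte[] pixels, int width, int height, int stride)
    {
        int i = 0;
        for (int y = 0; y < height; ++y)
        {
            for (int x = 0; x < width; ++x, i += 3)
            {
                byte grey = ToGrey(pixels[i], pixels[i + 1], pixels[i + 2]);
                pixels[i] = pixels[i + 1] = pixels[i + 2] = grey;
            }
            i += stride - width * 3; // skip the row padding
        }
    }

    // Parallel loop: each iteration derives its own index from y,
    // so no mutable state is shared between iterations.
    public static void RunParallel(byte[] pixels, int width, int height, int stride)
    {
        Parallel.For(0, height, y =>
        {
            int i = y * stride; // local to this iteration
            for (int x = 0; x < width; ++x, i += 3)
            {
                byte grey = ToGrey(pixels[i], pixels[i + 1], pixels[i + 2]);
                pixels[i] = pixels[i + 1] = pixels[i + 2] = grey;
            }
        });
    }

    public static void Main()
    {
        int width = 5, height = 4, stride = 16; // 15 pixel bytes + 1 padding byte per row
        byte[] a = new byte[stride * height];
        for (int k = 0; k < a.Length; ++k) a[k] = (byte)(k * 37); // arbitrary test data
        byte[] b = (byte[])a.Clone();

        RunSequential(a, width, height, stride);
        RunParallel(b, width, height, stride);

        Console.WriteLine($"results match: {a.SequenceEqual(b)}");
    }
}
```

Each row writes only to its own slice of the array, which is what makes it safe for the rows to run concurrently without any locking.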

GreyScaleDoubleParallelSample.Run()

public override void Run(System.Drawing.Bitmap bmp = null, Action<string> UpdateLog = null)
{
	if(bmp == null)
		throw new InvalidOperationException("Bitmap must be defined.");
	
	Stopwatch s = new Stopwatch();
	s.Start();

	System.Drawing.Imaging.BitmapData bmData = bmp.LockBits(new System.Drawing.Rectangle(0, 0, bmp.Width, bmp.Height), System.Drawing.Imaging.ImageLockMode.ReadWrite, System.Drawing.Imaging.PixelFormat.Format24bppRgb);
	int stride = bmData.Stride;
	System.IntPtr Scan0 = bmData.Scan0;
	unsafe
	{
		byte* start = (byte*)(void*)Scan0;

		int height = bmp.Height;
		int width = bmp.Width;

		Parallel.For(0, height, y =>
		{
			Parallel.For(0, width, x =>
			{
				byte* p = (start + (y * stride)) + (x * 3);
				byte blue = p[0];
				byte green = p[1];
				byte red = p[2];

				p[0] = p[1] = p[2] = (byte)(.299 * red
					+ .587 * green
					+ .114 * blue);
			});
		});
	}
	bmp.UnlockBits(bmData);

	s.Stop();
	RunTime = s.Elapsed;
}

Finally we have a sample that works pretty much like LineParallelSample.Run() (except here we’re setting the pixel to gray instead of black). The code runs a Parallel.For over the rows and then, within each row, a second Parallel.For over the pixels. Again, the pointer to the pixel has to be declared inside the inner Parallel.For, where it is now calculated from both y and x, so that each iteration has its own.

Running the samples you will get results similar to:

Reseting Image
Starting Grey Scale Sample
Completed Grey Scale Sample
Grey Scale Sample ran in 00:00:00.0268376

Reseting Image
Starting Grey Scale Parallel Sample
Completed Grey Scale Parallel Sample
Grey Scale Parallel Sample ran in 00:00:00.0020127

Reseting Image
Starting Grey Scale Double Parallel Sample
Completed Grey Scale Double Parallel Sample
Grey Scale Double Parallel Sample ran in 00:00:00.0037469

These are the results with the included image of my son, which is a small image at 93KB.

I have another image I test against which is ~8MB. This results in:

Reseting Image
Starting Grey Scale Sample
Completed Grey Scale Sample
Grey Scale Sample ran in 00:00:02.2118701

Reseting Image
Starting Grey Scale Parallel Sample
Completed Grey Scale Parallel Sample
Grey Scale Parallel Sample ran in 00:00:00.1626661

Reseting Image
Starting Grey Scale Double Parallel Sample
Completed Grey Scale Double Parallel Sample
Grey Scale Double Parallel Sample ran in 00:00:00.2232706

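You can see by these results that the Parallel.For sample runs nearly 14 times faster than the plain for loop. This is major. Part of that gap is probably not parallelism, though: the plain sample reads bmp.Width and bmp.Height in its loop conditions on every pass, and those property reads are not free, while the parallel samples read them once into locals. Comparing the Parallel.For sample with the Double Parallel.For sample, the extra inner loop is actually detrimental in this case: the double version takes roughly 1.4 times as long on the large image and nearly twice as long on the small one. As with the ParallelLine sample, you don’t get any benefit from adding the internal Parallel.For just to set a pixel. Again, depending on how you use the Parallel.For, you may have an example where you can give the internal loop enough work per iteration that it may be beneficial, just not here.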
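If you want to check the ratios yourself, here is a small sketch that computes them from the elapsed times printed in the logs above. The Speedup class and Ratio method are hypothetical helpers of mine; only the four time strings come from the results.

```csharp
using System;
using System.Globalization;

static class Speedup
{
    // How many times longer the first elapsed time is than the second.
    public static double Ratio(string slower, string faster) =>
        TimeSpan.Parse(slower, CultureInfo.InvariantCulture).TotalSeconds /
        TimeSpan.Parse(faster, CultureInfo.InvariantCulture).TotalSeconds;

    public static void Main()
    {
        // Elapsed times copied from the ~8MB image run above.
        string plain = "00:00:02.2118701";
        string parallel = "00:00:00.1626661";
        string doubleParallel = "00:00:00.2232706";

        Console.WriteLine("for vs Parallel.For: " +
            Ratio(plain, parallel).ToString("F1", CultureInfo.InvariantCulture));
        Console.WriteLine("Double Parallel.For vs Parallel.For: " +
            Ratio(doubleParallel, parallel).ToString("F1", CultureInfo.InvariantCulture));
    }
}
```

Keep in mind these are single runs; timings like this vary from run to run and machine to machine, so treat the ratios as rough.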

Up next I’m going to add two new models showing matrix multiplication. This sample is actually similar to the GreyScale samples here, but I added it to the original source because we do a lot of matrix operations and I wanted to show a clear, real-world example that was directly applicable to the work we do.

Thanks,
Brian
