Saturday, July 12, 2014

To vectorize or not to vectorize - that's the question

Some observations on SIMD performance on A7

I have recently been dabbling a bit with an old passion of mine: computing deep zooms of the Mandelbrot fractal. The core of my (very proprietary) routine consists of this computation:

y=(a.*b >> n) + c

All terms are 2-element vectors with variable precision - from 32 bits and up in multiples of 32 bits, limited only by machine memory. The multi-precision multiplication, addition and shift call for a lot of 32-bit numbers being multiplied to form 64-bit numbers, and the additions are done as 32+32=64 bit to ease the carry computations and keep the code in pure C.
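To make the carry handling concrete, here is a minimal sketch of the idea in plain C. It is not the routine from this post: the names mp_add and mp_mul are made up for illustration, and the sketch assumes the words are stored least significant first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-word numbers (least significant word first).
   Each 32+32 bit sum is formed in 64 bits, so the carry is simply
   the upper half of the sum. Returns the final carry (0 or 1). */
static uint32_t mp_add(uint32_t *y, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        y[i] = (uint32_t)s;
        carry = s >> 32;
    }
    return (uint32_t)carry;
}

/* Schoolbook multiply: y (2n words) = a (n words) * b (n words).
   Every partial product is 32 x 32 -> 64 bit. The sum
   product + y[i+j] + carry can never overflow 64 bits. */
static void mp_mul(uint32_t *y, const uint32_t *a, const uint32_t *b, size_t n)
{
    assert(y != a && y != b);   /* the result must not overlap the inputs */
    for (size_t i = 0; i < 2 * n; i++)
        y[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < n; j++) {
            uint64_t t = (uint64_t)a[i] * b[j] + y[i + j] + carry;
            y[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        y[i + n] = (uint32_t)carry;
    }
}
```

The inner loop of mp_mul is where the 32x32=64 bit multiplies pile up, which is why a widening SIMD multiply looks attractive.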

Intuitively this should map well to the NEON (and new SIMD) instruction set on the A7. Therefore I took a stab at implementing the code using the ARM NEON intrinsics.

When Apple introduced the A7 processor, it meant that pure assembly NEON code could no longer be reused, because the NEON instructions no longer exist in that form in ARM64 mode. Apple therefore recommends using intrinsics, since the intrinsics found in arm_neon.h work both on the A7 and on the previous 32-bit processors. Looking at the generated assembly, the mapping from intrinsics to machine code is pretty much 1:1, which removes previous complaints about intrinsics doing weird stuff to your code. This is a good thing, as it allows us to write the code using functions rather than assembly notation, which is quite a bit easier.

To test performance I have implemented a set of macros and inline functions that map the operations I need (multiplication, addition, shift etc.) to either an intrinsic or a pure C function. As an example, the unsigned multiply of two 32-bit numbers into a 64-bit number looks like this for NEON:

#define vmull(_x,_y) vmull_u32(_x,_y)

And like this in pure C:

static inline v64x2_t vmull(v32x2_t x, v32x2_t y){
    v64x2_t v;
    v.r = x.r * (uint64_t)y.r;
    v.i = x.i * (uint64_t)y.i;
    return v;
}
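The post does not show how the scalar vector types are defined, so the following is only a guess at what they could look like. The type names and the field names r and i come from the snippet above; the struct layout and the widening add vaddl are assumptions added for illustration (on NEON the add would presumably map to the vaddl_u32 intrinsic).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed scalar stand-ins for the NEON types uint32x2_t and uint64x2_t. */
typedef struct { uint32_t r, i; } v32x2_t;
typedef struct { uint64_t r, i; } v64x2_t;

static_assert(sizeof(v64x2_t) == 2 * sizeof(uint64_t), "two 64-bit lanes, no padding");

/* Widening multiply as in the post: 32 x 32 -> 64 bit per lane. */
static inline v64x2_t vmull(v32x2_t x, v32x2_t y)
{
    v64x2_t v;
    v.r = x.r * (uint64_t)y.r;
    v.i = x.i * (uint64_t)y.i;
    return v;
}

/* Widening add: 32 + 32 -> 64 bit per lane, so the carry lands in bit 32. */
static inline v64x2_t vaddl(v32x2_t x, v32x2_t y)
{
    v64x2_t v;
    v.r = (uint64_t)x.r + y.r;
    v.i = (uint64_t)x.i + y.i;
    return v;
}
```

With both variants behind the same names, the calling code stays identical and only the header decides whether it runs on the SIMD units or the integer data paths.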

This allows me to compare the performance of the code I write both when executed on the SIMD units (of which there are supposedly two in the A7) and on the normal integer data paths (of which there are supposedly four in the A7). Theoretically the two types of code have the same compute bandwidth for 64-bit numbers but a different number of registers, so it is interesting to see how they compare.

I have done fully unrolled versions of the routine for all bit sizes from 32 to 1024 (1 to 32 words), and I built the code with -O3, allowing the compiler to do all possible optimizations on the standard C code. The code is very light on memory access: it tries to keep all results in registers, reads each operand only once and writes the result to memory only once. This is what I found:


Index 100 is the scalar code, meaning that the NEON code is (sometimes very much) slower up to a precision of 8 words. Looking at the assembly, this coincides with the point where the normal C code runs out of registers and stack spills start to happen. At 32 words the NEON code is 30% faster, but it is not winning by a lot. To me this shows the power of the new ARM64 instruction set: the core is simply able to get a lot of things done without the NEON extensions. It also shows that it is somewhat tricky to apply NEON optimizations to this kind of integer code - whether it pays off depends very much on the workload. In my case I now only run the NEON code in the routines operating on more than 8 words, as it doesn't pay off before that.
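The selection rule from the measurements can be written down in a few lines. This is only a sketch of the decision, with hypothetical names (choose_kernel, NEON_MIN_WORDS); the actual routines are not shown in this post.

```c
#include <assert.h>
#include <stddef.h>

enum kernel { KERNEL_SCALAR, KERNEL_NEON };

/* Break-even point taken from the measurements above:
   NEON only pays off for operands of more than 8 words. */
#define NEON_MIN_WORDS 9

static enum kernel choose_kernel(size_t n_words)
{
    assert(n_words >= 1);   /* precision is at least one 32-bit word */
    return n_words >= NEON_MIN_WORDS ? KERNEL_NEON : KERNEL_SCALAR;
}
```

The threshold is specific to this workload and this processor, so it would have to be measured again on other hardware.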

The case for a desktop Ax processor

When Apple presented the A7 processor, they claimed it was a desktop-class processor. With what I saw for integer performance, I thought it would be fun to compare it to my Ivy Bridge i7 from late 2011 running at 2.2 GHz. I wanted to compare both scalar and SIMD (SSE4) performance, and since my own small vector math library maps nicely to SSE4 as well, it was only a matter of writing another set of macros for SSE. Running the same benchmark, this is what I got:

There are two interesting things here. First, SSE4 consistently improves the performance on the Core i7 for all workloads - sometimes a lot and sometimes only a little. Second, note the very heavy 1024-bit computation workload to the right: here the scalar code is only 20% slower on the A7 - and we are talking about a passively cooled processor running at 1.3 GHz vs a 2.2 GHz processor with all fans on (literally). This is indeed very impressive. But SSE really works as well - it seems that the i7 is doing a much better job at keeping the vector unit busy. Although the A7 is out of order, it seems that the instruction scheduling still has some way to go here.

For fun I tried to account for the difference in clock speed (a very simplistic approach - I know) to see what an imaginary 2.2 GHz A7 would be able to do:

For all the workloads the A7 would perform on par with the i7 when we talk scalar code, and for the very heavy workload it would even perform on par when it comes to SIMD instructions, while the scalar code would run much faster. A 2.2 GHz A7 would probably not be passively cooled, but it would definitely not run as hot as the i7 either. I would be surprised if we don't see an ARM-based MacBook Air at some point.

Thursday, December 5, 2013

Getting in-app purchases from the App receipt

With the app receipt available it is straightforward to figure out which in-app purchases the user has bought. This blog post shows how to get that information as part of the receipt validation. In this installment, too, I refer to the Receipt Validation Programming Guide.

Checking the receipt after an in-app purchase

A new receipt is automatically downloaded once an SKPaymentTransaction changes state to SKPaymentTransactionStatePurchased. Thus, it is only a matter of validating and parsing the app receipt again once that happens. The state change is detected in the delegate method:
-(void)paymentQueue:(SKPaymentQueue*)queue updatedTransactions:(NSArray*)transactions

Getting the in-app part of the receipt

Below is an extension of listing 1-6 in the guide. The extra statements needed to hold the in-app purchase data are the iap and iap_cnt declarations and the case 17 branch:

size_t i;
OCTET_STRING_t *iap[2];
int iap_cnt=0;
for (i = 0; i < payload->list.count; i++) {
    ReceiptAttribute_t *entry;
    entry = payload->list.array[i];
    switch (entry->type) {
        case 2:
            bundle_id = &entry->value;
            break;
        case 3:
            bundle_version = &entry->value;
            break;
        case 4:
            opaque = &entry->value;
            break;
        case 5:
            hash = &entry->value;
            break;
        case 17:
            iap[iap_cnt]=&entry->value;
            iap_cnt++;
            break;
    }
}

The iap[] array should be large enough to hold references to all possible in-app purchases, i.e. the total number of products configured for the app. Each of the OCTET_STRING_t references now holds a new ASN.1 structure with the information on the in-app purchase. It is important that free_struct is not called before these new structures have been parsed. To parse the structures, a function like the following can be used:

static inline void getIAP(OCTET_STRING_t **iap, int iap_cnt){
    Payload_t *payload=NULL;
    asn_dec_rval_t rval;
    int i,n;
    for (n=0; n < iap_cnt; n++){
        OCTET_STRING_t *this_iap = iap[n];
        payload = NULL;
        rval = asn_DEF_Payload.ber_decoder(NULL,&asn_DEF_Payload,(void **)&payload, this_iap->buf,this_iap->size,0);
        if (rval.code != RC_OK){
            NSLog(@"Failed to decode payload at index %d",n);
            asn_DEF_Payload.free_struct(&asn_DEF_Payload,payload,0);
            continue;
        }
        for (i=0; i < payload->list.count; i++){
            ReceiptAttribute_t *entry;
            entry = payload->list.array[i];
            switch (entry->type) {
                case AR_IAP_PRODUCT_ID:{
                    const char *id1 = "com.yourdomain.yourproduct1";
                    const char *id2 = "com.yourdomain.yourproduct2";
                    if (strcmp(id1, (char *)entry->value.buf+2)==0){
                        enableFeature(0);
                    }
                    if (strcmp(id2, (char *)entry->value.buf+2)==0){
                        enableFeature(1);
                    }
                }
                    break;
                default:
                    break;
            }
        }
        asn_DEF_Payload.free_struct(&asn_DEF_Payload,payload,0);
    }
}

It is important that payload is set to NULL for each iteration - otherwise it is not possible to free the payload in each iteration. Note that the OCTET_STRINGs should be parsed from offset 0, while the product IDs are found from offset 2. After this function is called, the full app receipt structure can be freed.
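Since the iap[] array has a fixed size and the loop stores into it without a check, a receipt with more type 17 entries than expected would write past the end of the array. One way to guard against that is a small bounded list, sketched here. The names iap_list_t, iap_list_add and MAX_IAP are hypothetical, and the entries are kept as plain pointers so the sketch does not depend on the asn1c headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound: the number of products configured for the app. */
#define MAX_IAP 16

typedef struct {
    const void *items[MAX_IAP];   /* would hold the OCTET_STRING_t pointers */
    size_t count;
} iap_list_t;

/* Store one entry. Returns 1 if it was stored and 0 if the list is full,
   in which case the entry is dropped instead of written out of bounds. */
static int iap_list_add(iap_list_t *list, const void *entry)
{
    assert(list != NULL);
    if (list->count >= MAX_IAP)
        return 0;
    list->items[list->count++] = entry;
    return 1;
}
```

In the case 17 branch one would then call something like iap_list_add(&list, &entry->value) and treat a return value of 0 as a validation failure.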

Restoring app purchases

Restoring app purchases using the receipt is a matter of re-fetching and re-parsing the receipt. Fetching an updated receipt can be done as shown here.

Sunday, December 1, 2013

Verifying the iOS 7 app receipt

With iOS 7, Apple has introduced the app receipt to allow a developer to check whether the app is authorized to run on the device it is running on, and to track non-consumable in-app purchases. The last part is interesting because it solves a lot of problems in persisting the presence of in-app purchases in a secure way. It also automatically transfers in-app purchases to new devices if the app is downloaded there.

Unfortunately, the handling and validation of the receipt is not very well documented - important bits are missing from the documentation. Please read the Receipt Validation Programming Guide anyway, as this post refers to it a lot. The goal of this blog post is not to provide complete code, but to clear up some of the questions one has after reading the documentation.

Getting the receipt

A note before we start: REMEMBER to log out of the iTunes Store before running this code, and to log in using a test account. You can set up a test account in iTunes Connect. Do this every time you run your app until the app is in the iTunes Store. Also - check each and every step before continuing - it is very hard to debug the code in big-bang fashion.

The receipt can be loaded from the app bundle by the following code:

NSURL *url = [[NSBundle mainBundle] appStoreReceiptURL];
NSData *data = [NSData dataWithContentsOfURL:url];

If data ends up being nil, the receipt is missing. This should never happen in a production environment, but will happen in the sandbox environment used during development.

In a development environment, the receipt can be fetched by code akin to the following:

- (void) getReceipt {
    self.recreq = [[SKReceiptRefreshRequest alloc] init];
    self.recreq.delegate = self;
    [self.recreq start];
}

- (void)requestDidFinish:(SKRequest *)request {
    if ([request isEqual:self.recreq]){
        NSLog(@"Got receipt");
    }
}

- (void)request:(SKRequest *)request didFailWithError:(NSError *)error {
    NSLog(@"Did not get receipt");
}

The class containing this code must conform to SKRequestDelegate.

Getting the Apple root certificate

The documentation in the Receipt Validation Programming Guide is good but incomplete. One crucial part that it leaves out is how to get the Apple root certificate. It turns out that there is no way to get it other than to store it locally in your app bundle. So download the file AppleIncRootCertificate.cer and put it in your app bundle. Then the certificate data can be loaded with the following code:

NSData *cert = [NSData dataWithContentsOfURL:[[NSBundle mainBundle] URLForResource:@"AppleIncRootCertificate" withExtension:@"cer"]];

This fills in an important missing piece of the receipt validation code, namely the loading of the Apple certificate. Note: In general it may not be a good idea to use Objective-C in the code that validates the receipt, as it is much easier to disassemble. If you are paranoid enough, you should stick to Core Foundation. The same code to get the data is shown here using Core Foundation:

CFBundleRef bundle = CFBundleGetMainBundle();
const char *resname = "AppleIncRootCertificate";
const char *resext = "cer";
CFStringRef rname = CFStringCreateWithCString(NULL, resname, kCFStringEncodingASCII);
CFStringRef rext = CFStringCreateWithCString(NULL, resext, kCFStringEncodingASCII);
CFURLRef url = CFBundleCopyResourceURL(bundle, rname, rext, NULL);
unsigned char fname[1024];
CFURLGetFileSystemRepresentation(url, 0, fname, 1024);
uint8_t certdata[2048];
int certlen;
int fid = open((const char*)fname, O_RDONLY);
certlen=read(fid, certdata, 2048);
close(fid);
CFRelease(url);
CFRelease(rext);
CFRelease(rname);

And then thank your favorite celestial being that Objective-C was invented.
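The Core Foundation version above does not check whether open() or read() succeeded, so a missing certificate file would go unnoticed until the verification fails. A slightly more defensive file read could look like the following sketch. The helper name read_file is made up, and only POSIX calls are used.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read at most cap bytes from the file at path into buf.
   Returns the number of bytes read, or -1 if the file cannot be
   opened or a read fails. */
static ssize_t read_file(const char *path, uint8_t *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n < 0) {            /* read error */
            close(fd);
            return -1;
        }
        if (n == 0)             /* end of file */
            break;
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```

With this helper the certificate would be loaded with something like certlen = read_file((const char *)fname, certdata, sizeof certdata), and a negative certlen would be treated as a failed validation.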

OpenSSL

Apple has chosen a mechanism for verifying the signature on the receipt that is not supported by their built-in software. Hence you need to include OpenSSL in your project. I have followed the directions on this site and it worked. It took some time to build though, so please be patient. You need to import libcrypto.a into your project and put the include/include/ directory where you build OpenSSL into your search path. These steps should clear the way for the actual verification.

Step 1 - Verifying the signature on the receipt

The code in listing 1-4 in the guide is incomplete. Two things are missing, namely which headers to include:

#import <openssl/pkcs7.h>
#import <openssl/x509.h>
#import <openssl/bio.h>

and how to initialize the b_p7 and Apple variables (note that the Apple variable is initialized differently here than what Apple suggests):

b_p7 = BIO_new_mem_buf(receiptdata, receiptlen);
...
const unsigned char *cdata = certdata;
Apple = d2i_X509(NULL, &cdata, certlen);

If you have not worked with OpenSSL before, you will be bitten by this one: Remember to start your use of the library with OpenSSL_add_all_algorithms() and end it with EVP_cleanup();

The final step is to get the payload out of the receipt. The payload is found in the b_out BIO. As far as I know there is no way of knowing how big it is, except that it is smaller than the receipt itself. Hence we use the following construct to get the data out:

uint8_t *pld = malloc(receiptlen);
int pld_sz = BIO_read(b_out, pld, receiptlen);

Step 2 - Verifying Bundle identifier and version

The next step involves parsing the ASN.1 data found in the payload. I have used the asn1c tool described in the manual. Two things need to be done after you add the asn1 files to your project:
  • Add the directory with the files to the include path of your project, otherwise they won't build
  • Remove the converter-sample.c file from the project
After that, it is only a matter of using the code in listing 1-6. The manual states that the identifier and version should be checked against a hardcoded version. To get to the actual string, use the buf field of the OCTET_STRING_t. The first two bytes contain type and length, so in order to do the comparison correctly do the following:

const char *ref = "1.1";
const char *tst = (const char *)bundle_version->buf+2;
if (strcmp(ref,tst)){FAIL};

Since the name and version are hardcoded, remember to get them right.
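The fixed offset of 2 works as long as the string is shorter than 128 bytes, because only then does the length fit in a single byte. A comparison that also checks the length byte, and does not rely on the buffer being zero-terminated, could look like this sketch. The helper name der_string_equals is made up, and strings with a multi-byte length are simply rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare a DER-encoded string (1 type byte, 1 length byte, then the
   characters) with a C string. Returns 1 on a match and 0 otherwise.
   Encodings with a multi-byte length (strings of 128 bytes or more)
   are not handled in this sketch and count as a mismatch. */
static int der_string_equals(const uint8_t *buf, size_t size, const char *expected)
{
    assert(buf != NULL && expected != NULL);
    if (size < 2)
        return 0;
    if (buf[1] & 0x80)              /* multi-byte length: not handled */
        return 0;
    size_t len = buf[1];
    if (len + 2 > size)             /* length byte points past the buffer */
        return 0;
    return strlen(expected) == len && memcmp(buf + 2, expected, len) == 0;
}
```

For the version check this would be called as der_string_equals(bundle_version->buf, bundle_version->size, "1.1"), with a return value of 0 meaning FAIL.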

Step 3 - Verifying hash

Again the code in listing 1-7 is pretty good. The only thing missing is how to get the GUID on iOS. This is fortunately quite easy:

NSUUID *vendorID = [[UIDevice currentDevice] identifierForVendor];
uuid_t vendorIDBytes;
[vendorID getUUIDBytes:vendorIDBytes];

uuid_t is an array of 16 bytes, so that makes it easy to complete listing 1-7.

When doing the comparison, there is no offset of the value, so the correct comparison is

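if (memcmp(hash->buf,digest,20)){FAIL}

The third argument is the length of the digest: the hash in the receipt is a SHA-1 hash, which is 20 bytes.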
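Since memcmp reads a fixed number of bytes, it is worth checking the size of the hash attribute before comparing. The following is a small sketch of that check; the helper name hash_matches is made up, and the digest length of 20 bytes assumes SHA-1 as used in listing 1-7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA1_LEN 20   /* assumed digest size: SHA-1 */

/* Returns 1 if buf holds exactly SHA1_LEN bytes equal to digest, else 0.
   No offset is applied: the hash attribute holds the raw digest bytes. */
static int hash_matches(const uint8_t *buf, size_t size, const uint8_t digest[SHA1_LEN])
{
    assert(buf != NULL && digest != NULL);
    return size == SHA1_LEN && memcmp(buf, digest, SHA1_LEN) == 0;
}
```

In the validation code this would be called as hash_matches(hash->buf, hash->size, digest), and a return value of 0 means the receipt does not belong to this device.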

Final notes

One thing missing is the clean-up. If you just use the code as listed in the guide, the asn1 structures will leak. At some point you need to free the payload, and that is done like this:

asn_DEF_Payload.free_struct(&asn_DEF_Payload, payload,0);

In the next installment I will go through how to check the status of the in-app purchases using the receipt.

Sunday, August 12, 2012

iOS framebuffer

In many places on the web there are questions about how to get hold of the frame buffer on an iOS device. The answer is always the same: "There is no frame buffer". Usually the question reflects the wish to do fast blitting of bitmaps generated on the fly to the screen. That variant of the question is usually answered with a "Quartz is too slow for that" followed by some muttering about OpenGL, textures and quads while people wander off. I have not seen any working code coming from these answers.

While it is true that there is no direct access to the frame buffer on an iOS device, one can get pretty darn close without resorting to OpenGL, and with fairly good results (60 fps can be upheld provided your generator code is fast enough). Here is how:

QuartzCore has the ability to generate a bitmap object backed by user-accessible memory. So for off-screen purposes there is a way to generate a "frame buffer" in the form of a bitmap object. The creation function is documented here, and the following is an example of how to create the bitmap given a CGRect with width and height:

CGColorSpaceRef csp = CGColorSpaceCreateDeviceRGB();
CGContextRef _context = CGBitmapContextCreate(NULL, (size_t)frame.size.width, (size_t)frame.size.height, 8, 4*(size_t)frame.size.width, csp, kCGImageAlphaNoneSkipFirst);
CGColorSpaceRelease(csp);

NULL means that Quartz will allocate the memory. This frees you from memory management and also allows Quartz to optimize the location. The last argument means that the bitmap is stored as ARGB in memory (byte 0 is alpha, 1 is red, 2 is green and 3 is blue, and so on for the next pixel) and that the alpha component is ignored. This is done to improve performance, but it means that the image will always be opaque. Using the following code

uint32_t *framebuffer = CGBitmapContextGetData(_context);

you get access to the pixels of the bitmap (4 bytes each, one byte per component). The frame buffer is an array of width*height elements in which the rows follow each other, i.e. the index into the array is row*width+column for a given pixel at location (row,column). The array starts in the upper-left corner.
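As an illustration of the indexing, here is a minimal sketch of a function that writes one pixel. The name put_pixel and its parameters are made up for this example. It assumes the layout described above: rows are exactly width pixels apart (bytes per row is 4*width), and the bytes of each pixel are ordered as skipped alpha, red, green, blue. The components are written byte by byte so the result does not depend on how the processor orders the bytes within a uint32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write one opaque pixel at (row, column) into a width x height buffer. */
static void put_pixel(uint32_t *framebuffer, size_t width, size_t height,
                      size_t row, size_t column,
                      uint8_t r, uint8_t g, uint8_t b)
{
    assert(framebuffer != NULL);
    assert(row < height && column < width);
    uint8_t *p = (uint8_t *)&framebuffer[row * width + column];
    p[0] = 0;   /* alpha byte, ignored by Quartz for this pixel format */
    p[1] = r;
    p[2] = g;
    p[3] = b;
}
```

A generator would call this (or an inlined equivalent) for every pixel it wants to change and then hand the bitmap to the screen in one go.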

The bitmap context is the first element of the frame buffer emulation. This can be converted to an image by:

CGImageRef img = CGBitmapContextCreateImage(_context);

The documentation will tell you that you can draw this image using CGContextDrawImage() in a UIView drawRect: method. This is where Quartz gets its reputation for being slow from - because that is not a very fast operation.

The key to a fast frame buffer-like object on iOS is to use a CALayer. This is the basic building block in Quartz on iOS, and is nothing more than a cached CGImageRef that the graphics subsystem can render (you can probably already see where this is heading). The documentation is not very clear on this, but it is possible (and very easy) to access a CALayer's backing store. It is done using the contents property. To assign a new CGImage to a CALayer you simply do this:

CGImageRef img = CGBitmapContextCreateImage(_context);
layer.contents = (__bridge id)img;
CGImageRelease(img);

The first line obtains a CGImage from the bitmap as described above. The second line assigns this to the CALayer contents property. This property is of type id, but any CFType (which CGImage is) can be cast to id. The __bridge cast does not transfer ownership of the image: the layer retains it on assignment, and the reference we got from CGBitmapContextCreateImage() is still ours, hence the image must be released again (last line).

That is pretty much it. The key to keeping the speed up is to not allocate a new bitmap over and over, but only obtain a new img every time the bitmap is to be updated on-screen and change the backing store in between. CGBitmapContextCreateImage() is a low-overhead operation whereas CGBitmapContextCreate() is not. The contents property can be updated from anywhere in your program and the layer will update on-screen accordingly the next time the screen is rendered.

Last part of this is to add the CALayer that you use for your bitmap to some view using [view.layer addSublayer:layer]

All this can be wrapped up in a small CALayer subclass.

#import <QuartzCore/QuartzCore.h>
#import <stdint.h>

@interface RSFrameBufferLayer : CALayer

// Class method to create a new layer with an underlying
// bitmap. Both will have the size set by the frame
+ (RSFrameBufferLayer *)layerWithFrame:(CGRect)frame;

// Same as above
- (id)initWithFrame:(CGRect)frame;

// Draw bitmap to screen
- (void)blit;

// Get the underlying context to use for higher-level
// drawing operations in Quartz
@property(readonly) CGContextRef context;

// Get the raw "frame buffer"
@property(readonly) uint32_t *framebuffer;

@end

The implementation is very simple:

#import "RSFrameBufferLayer.h"

@implementation RSFrameBufferLayer

@synthesize context = _context;

+ (RSFrameBufferLayer *)layerWithFrame:(CGRect)frame {
    return [[[RSFrameBufferLayer alloc] initWithFrame:frame] autorelease];
}

- (id)initWithFrame:(CGRect)frame {
    if (self=[super init]){
        self.opaque = YES;
        self.frame=frame;
    }
    return self;
}

- (void)dealloc {
    CGContextRelease(_context);
    [super dealloc];
}

- (void)blit {
    CGImageRef img = CGBitmapContextCreateImage(_context);
    self.contents = (__bridge id)img;
    CGImageRelease(img);
}

-(void)setFrame:(CGRect)frame {
    CGRect oldframe = self.frame;
    [super setFrame:frame];
    // Only recreate the bitmap when the size actually changes
    if (frame.size.width != oldframe.size.width || frame.size.height != oldframe.size.height){
        if (_context){
            CGContextRelease(_context);
        }
        CGColorSpaceRef csp = CGColorSpaceCreateDeviceRGB();
        _context = CGBitmapContextCreate(NULL, (size_t)frame.size.width, (size_t)frame.size.height, 8, 4*(size_t)frame.size.width, csp, kCGImageAlphaNoneSkipFirst);
        CGColorSpaceRelease(csp);
    }
}

-(uint32_t *)framebuffer {
    return CGBitmapContextGetData(_context);
}

@end