Parsing Complex Documents From S3 In PowerShell

In this article, we'll demonstrate how to process complex data types stored on S3 in PowerShell. In our case, these are Bitrise configuration files, but it could be any stored information.

Every app on Bitrise is described by a specific YAML configuration: bitrise.yml. YAML's purpose is to be easily readable. For a developer, it is not that important to save the configuration file each time a build runs, but for us it holds valuable information for quality assurance and for monitoring the performance of steps and stacks.

Structure of bitrise.yml

The schema of the YAML is something like below:
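
A simplified, illustrative sketch (the workflow names and step versions are made up, and a real configuration carries more attributes):

    # Illustrative bitrise.yml skeleton only - not a real app's configuration
    format_version: 1.3.1
    default_step_lib_source: https://github.com/bitrise-io/bitrise-steplib.git
    workflows:
      _setup:
        steps:
        - activate-ssh-key@3.1.1: {}
        - git-clone@3.4.2: {}
      analyze:
        before_run:
        - _setup
        after_run:
        - _report
        steps:
        - script@1.1.5:
            title: Run static analysis
      _report:
        steps:
        - script@1.1.5:
            title: Upload results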

As you can see, Bitrise YAMLs are designed for easy editing, not for easy processing. This puts a couple of barriers in front of data crunching:

  • Workflow and step names are used as keys, so each workflow name must be captured before we can refer to its content. A more data-friendly representation would be a key-value pair for each attribute, naming the attribute and defining its value:
    workflow: build-alpha.
  • On the other hand, some attributes are named with a generic key, like steps or after_run.
  • Only one workflow is triggered, but we should also take the workflows listed in before_run and after_run into consideration if we want to understand which steps were actually run.

As a result of these barriers, we have a nested document where the nested objects are sometimes lists, sometimes hash tables, and in other cases custom objects.

AWS Tools for PowerShell

When a build runs, the complete related YAML file is saved to S3 as a snapshot of the application settings at the time of the build. Fortunately, AWS has a comprehensive package for working with its cloud services from PowerShell.

If you are a Mac or Linux user, you can only use the cross-platform PowerShell Core, but even if you are new to PS on Windows, I encourage you to use this probably more future-proof version of the shell. The corresponding package from AWS is AWSPowerShell.NetCore; to install it, use:
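
    # Install the cross-platform AWS module from the PowerShell Gallery
    Install-Module -Name AWSPowerShell.NetCore -Scope CurrentUser

The -Scope CurrentUser switch avoids the need for an elevated session; drop it if you prefer a machine-wide install.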

To access AWS services we have to set our credentials.
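
A sketch with placeholder keys (the profile name is my own choice, and the key values are obviously not real):

    # Store a named credential profile for later use
    Set-AWSCredential -AccessKey 'AKIA_EXAMPLE' -SecretKey 'EXAMPLE_SECRET' -StoreAs 'bitrise-reports'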

Optionally, you can set the new profile as default by naming it default.

Solution

I was working on some historical reports, so I needed to download and parse a huge number of files. It seemed to be a good idea to process the monthly data in parallel.

Parallel processing

Fortunately, it is not really complicated in PowerShell to start several tasks at once. Since we would like to perform the same task with different inputs, all we need is a single script block and a parameter for each month/job.
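
A sketch of such a script block; the variable name and the placeholder body are mine, the real downloading logic comes later:

    # One script block, parameterized by the month it should process
    $monthlyJob = {
        param($month)
        Write-Output "Processing $month"
    }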

Looping over the array of months, we just use the piped object, $_, as the argument of each job, which the script block then receives as its parameter.
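
Assuming the months are simple strings (the values below are made up), starting and collecting the jobs could look like this:

    $months = '2018-06', '2018-07', '2018-08'

    # Start one background job per month, passing $_ as the script block's argument
    $jobs = $months | ForEach-Object {
        Start-Job -ScriptBlock $monthlyJob -ArgumentList $_
    }

    # Wait for all jobs to finish and read their output
    $jobs | Wait-Job | Receive-Job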

Downloading

To access files on S3 we can use - no surprise - the Read-S3Object cmdlet. If you plan to access specific files rather than complete folders, it requires (see the sketch after this list):

  • the bucket
  • the key of the object, a kind of path to the specific YAML,
  • the file to write the S3 object to, and
  • the user profile, in case it is not the default we set.
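
A sketch with a made-up bucket name, key layout and local path; only the parameter names come from the cmdlet:

    # Download one bitrise.yml snapshot from S3 to a local file
    Read-S3Object -BucketName 'my-bitrise-backups' `
                  -Key 'builds/2018-08/testRepo/testBuild.yml' `
                  -File 'C:\data\2018-08\testRepo testBuild.yml' `
                  -ProfileName 'bitrise-reports'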

From our website, we have lists of repositories and builds for each month, which are used to construct the S3 object keys.

Downloading in parallel

Now let's combine what we have so far. Each background job runs in its own session, so importing the package and setting the credentials has to go into the script block.
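
A sketch of the combined script block; the bucket name, key layout, local paths and the per-month build list file are assumptions for illustration:

    $downloadMonth = {
        param($month)

        # Each job runs in a fresh session: load the module and credentials here
        Import-Module AWSPowerShell.NetCore
        Set-AWSCredential -ProfileName 'bitrise-reports'

        # Hypothetical list of per-build identifiers for this month
        $builds = Get-Content "C:\data\$month\builds.txt"

        foreach ($build in $builds) {
            Read-S3Object -BucketName 'my-bitrise-backups' `
                          -Key "builds/$month/$build.yml" `
                          -File "C:\data\$month\$build.yml"
        }
    }

    $months | ForEach-Object { Start-Job -ScriptBlock $downloadMonth -ArgumentList $_ }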

Parsing

By running the code above, we collect all the builds we are interested in into per-month folders. We also added metadata to the file names, so let's start with that.
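
The file names follow a "repo build workflow date time" pattern (see the example file name at the end of the article), so a sketch of recovering the metadata could be:

    # e.g. 'testRepo testBuild analyze 2018-08-01 17-24-34.958901.yml'
    $file  = Get-Item 'C:\data\2018-08\testRepo testBuild analyze 2018-08-01 17-24-34.958901.yml'
    $parts = $file.BaseName -split ' '

    $meta = @{
        repo     = $parts[0]
        build    = $parts[1]
        workflow = $parts[2]
        started  = "$($parts[3]) $($parts[4])"
    }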

We can use the powershell-yaml package to deal with YAML. It works just like it would with JSON: read the content and pass it to the ConvertFrom-Yaml cmdlet:
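
A minimal sketch, reusing the $file object from above:

    # powershell-yaml provides ConvertFrom-Yaml / ConvertTo-Yaml
    Install-Module -Name powershell-yaml -Scope CurrentUser
    Import-Module powershell-yaml

    $yaml = Get-Content -Path $file.FullName -Raw | ConvertFrom-Yaml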

As mentioned earlier, the nested objects read from the YAML come in various types. The workflows node is not an array (as each workflow's name is used as a key) but a hash table. The GetEnumerator() method can unwrap it into an array of objects, so we can loop over the elements.

Attributes of a workflow are in the value of the workflow object.
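
For example (the attribute names steps and after_run come from the schema shown earlier):

    foreach ($workflow in $yaml.workflows.GetEnumerator()) {
        # The workflow name is the key; everything else sits in the value
        $workflowName = $workflow.Key
        $afterRun     = $workflow.Value.after_run
        $steps        = $workflow.Value.steps
    }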

The same exercise we did with the workflows should be repeated on the steps.
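
Each item in the steps list is again a one-element hash table whose key is the step id and whose value holds the step's attributes, so the same GetEnumerator() trick applies (a sketch):

    foreach ($step in $workflow.Value.steps) {
        foreach ($entry in $step.GetEnumerator()) {
            $stepName       = $entry.Key    # e.g. script@1.1.5
            $stepAttributes = $entry.Value  # title, inputs, ...
        }
    }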

Collecting all this information into a collection of hash tables:
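
A sketch that pulls the pieces above together into one record per workflow step; it reuses $meta and $yaml from the earlier sketches, and the property names are my own choice:

    $records = @()

    foreach ($workflow in $yaml.workflows.GetEnumerator()) {
        foreach ($step in $workflow.Value.steps) {
            foreach ($entry in $step.GetEnumerator()) {
                $records += @{
                    repo     = $meta.repo
                    build    = $meta.build
                    workflow = $workflow.Key
                    step     = $entry.Key
                }
            }
        }
    }

    # Save as JSON for further processing
    $records | ConvertTo-Json -Depth 5 | Set-Content "C:\data\parsed\$($file.BaseName).json"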

This collection, saved as JSON, can be an input to any document database or to Neo4j.

Example

You can find examples in the Bitrise CLI Tutorial. This file is based on the Complex Workflow lesson.

testRepo testBuild analyze 2018-08-01 17-24-34.958901.yml
