Stuff The Internet Says On Scalability For August 16th, 2019

Wake up! It’s HighScalability time:

Do you like this sort of Stuff? I’d love your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 53 mostly 5 star reviews (124 on Goodreads). They’ll learn a lot and likely add you to their will.

Number Stuff:

  • $1 million: Apple finally using their wealth to improve security through bigger bug bounties.
  • $4B: Alibaba cloud service yearly run rate, growth of 66%. Says they’ll overtake Amazon in 4 years. 
  • 200 billion: Pinterest pins pinned across more than 4 billion boards by 300 million users.
  • 21: technology startups took in mega-rounds of $100 million or more. 
  • 3%: of users pass their queries through resolvers that actively work to minimize the extent of leakage of superfluous information in DNS queries.
  • < 50%: Google searches result in a click. SEO dies under walled garden shade.
  • 4 million: DDoS attacks in the last 6 months, frequency grew by 39 percent in the first half of 2019. IoT devices are under attack within minutes. Rapid weaponization of vulnerable services continued. 
  • 200: distributed microservices in S3, up from 8 when it started 13 years ago.
  • 50%: cumulative improvement to’s feed page load time.
  • $318 million: Fortnite monthly revenue, likely had more than six consecutive months with at least one million concurrent active users.
  • $18,000: in fines because you just had to have the license plate NULL. 
  • $6.1 billion: Uber created Dutch weapon to avoid paying taxes.
  • 14.5%: drop in 1H19 global semiconductor sales.
  • 13%: fall in ad revenue for newspapers. 

Quotable Stuff:

  • Donald Hoffman: That is what evolution has done. It has endowed us with senses that hide the truth and display the simple icons we need to survive long enough to raise offspring. Space, as you perceive it when you look around, is just your desktop—a 3D desktop. Apples, snakes, and other physical objects are simply icons in your 3D desktop. These icons are useful, in part, because they hide the complex truth about objective reality.
  • rule11: First lesson of security: there is (almost) always a back door.
  • Paul Ormerod: A key discovery in the maths of how things spread across networks is that in any networked system, any shock, no matter how small, has the potential to create a cascade across the system as a whole. Watts coined the phrase “robust yet fragile” to describe this phenomenon. Most of the time, a network is robust when it is given a small shock. But a shock of the same size can, from time to time, percolate through the system. I collaborated with Colbaugh on this seeming paradox. We showed that it is in fact an inherent property of networked systems. Increasing the number of connections causes an improvement in the performance of the system, yet at the same time, it makes it more vulnerable to catastrophic failures on a system-wide scale.
  • @jeremiahg: InfoSec is ~$127B industry, yet there’s no price tags on any vendor website. For some reason it’s easier to find out what a private plane costs than a ‘next-gen’ security product. Oh yah, and let’s not forget the lack of warranties.
  • Hall’s Law:  the maximum complexity of artifacts that can be manufactured at scales limited only by resource availability doubles every 10 years. 
  • YouTube~ “Our responsibility was never to the creators or to the users,” one former moderator told the Post. “It was to the advertisers.”
  • reaperducer: It’s for this reason that I’ve stopped embedding micro data in the HTML I write. Micro data only serves Google. Not my clients. Not my sites. Just Google. Every month or so I get an e-mail from a Google bot warning me that my site’s micro data is incomplete. Tough. If Google wants to use my content, then Google can pay me. If Google wants to go back to being a search engine instead of a content thief and aggregator, then I’m on board.
  • Maxime Puteaux: The small satellite launch market has grown to account for “69% of the satellites launched last year in number of satellites but only 4% of the total mass launched (i.e 372 tons). … The smallsat market experienced a 23% compound annual growth rate (CAGR) from 2009 to 2018” with even greater growth expected in the future, dominated by the launch needs of constellations.
  • @Electric_Genie: San Diego has a huge, machine-intelligence-powered smart streetlight network that monitors traffic to time traffic signals. Now, they’ve added ability to detect pedestrians and cyclists
  • Simon Wardley: How to create a map? Well, I start off with a systems diagram, I give it an anchor at the top. In this case, I put customer and then I describe position through a value chain. A customer wants online photo storage, which needs website, which needs platform, which needs computer, which needs power, and of course, the stuff at the bottom is less visible to the customer than the stuff at the top.
  • Charity Majors: When we blew up the monolith into many services, we lost the ability to step through our code with a debugger: it now hops the network.  Our tools are still coming to grips with this seismic shift.
  • Livia Gershon: According to McLaren, from 1884 to 1895, the Matrimonial Herald and Fashionable Marriage Gazette promised to provide “HIGH CLASS MATCHES” to U.K. men and women looking for wives and husbands. Prospective spouses could place ads in the paper or work directly with staff of the associated World’s Great Marriage Association to privately make a connection.
  • @KarlBode: There is absolutely ZERO technical justification for bandwidth caps and overage fees on cable networks. Zero. It’s a glorified price hike on captive US customers who already pay more for bandwidth than most developed nations due to limited competition.
  • Fowler: That’s the other piece of app trackers, is that they do a whole bunch of bad things for our phone. Over the course of a week, I found 5,400 different trackers activated on my iPhone. Yours might be different. I may have more apps than you. But that’s still quite a lot. If you multiplied that out by an entire month, it would have taken up 1.5 gigabytes of data just going to trackers from my phone. To put that in some context, the basic data plan from AT&T is only 3 gigabytes.
  • Kate Green: Starshot is straightforward, at least in theory. First, build an enormous array of moderately powerful lasers. Yoke them together—what’s called “phase lock”—to create a single beam with up to 100 gigawatts of power. Direct the beam onto highly reflective light sails attached to spacecraft weighing less than a gram and already in orbit. Turn the beam on for a few minutes, and the photon pressure blasts the spacecraft to relativistic speeds.
  • Markham Heid: Beeman says activities that are too demanding of our brain or attention — checking email, reading the news, watching TV, listening to podcasts, texting a friend, etc. — tend to stifle the kind of background thinking or mind-wandering that leads to creative inspiration. 
  • @ben11kehoe: Aurora never downsizes storage. Continue to pay at the highest roll you’ve ever made.
  • John Allspaw: Resilience is not preventative design, it is not fault-tolerance, it is not redundancy. If you want to say fault-tolerance, just say fault-tolerance. If you want to say redundancy, just say redundancy. You don’t have to say resilience just because, you can, and you absolutely are able to. I wish you wouldn’t, but you absolutely can, and that’ll be fine as well.
  • Matthew Ball: But, again, lucrative “free-to-play” games have been around for more than a decade. In fact, it turns out the most effective way to generate billions of dollars is to not require a player spend a single one (all of the aforementioned billion-dollar titles are also free-to-play). 
  • TrailofBits: Smart contract vulnerabilities are more like vulnerabilities in other systems than the literature would suggest. A large portion (about 78%) of the most important flaws (those with severe consequences that are also easy to exploit) could probably be detected using automated static or dynamic analysis tools.
  • @sfiscience: 1/2 “Once you induce [auto safety] regulatory protection, there is a decline in the number of highway deaths. And then in 3-4 years, it goes right up to where it was before the safety regulation is imposed.”  2/2 There’s a kind of “risk homeostasis” with regulation: as people feel safer, they take more risks (eg, seatbelts led to faster driving and more pedestrian deaths). One exception:  @NASCAR deaths went UP with safety innovations. “People are not dumb, but they’re not rational-expectations-efficient either.”  
  • Michael F. Cohen: It may be hard to believe, but only a few years ago we debated when the first computer graphics would appear in a movie such that you could not tell if what you were looking at was real or CG. Of course, now this question seems silly, as almost everything we see in action movies is CG and you have no chance of knowing what is real or not.
  • Dropbox: Much like our data storage goals, the actual cost savings of switching to SMR (Shingled Magnetic Recording) have met our expectations. We’re able to store roughly 10 to 20 percent more data on an SMR drive than on a PMR drive of the same capacity at little to no cost difference. But we also found that moving to the high-capacity SMR drives we’re using now has resulted in more than 20 percent savings overall compared to the last generation storage design.
  • Riot Games: The patch size was 68 MB for RADS and 83 MB for the new patcher. Despite the larger download size, the average player was able to update the game in less than 40 seconds, compared to over 8 minutes with the old patcher.
  • @grossdm: For a decade, VCs have been subsidizing the below-market provision of services to urban-dwellers: transport, food delivery, office space. Now the baton is being passed to public shareholders, who will likely have less patience. 20 years ago, public investors very quickly walked away from the below-market provision of e-commerce and delivery services  — i.e. Webvan. 
  • Julia Grace: Looking back, I should have done a lot more reorgs [at Slack] and I should’ve broken up a lot more parts of the organization so that they could have more specialization, but instead, it was working so we kept it all together.
  • Thomas Claburn: “No iCloud subscriber bargained for or agreed to have Apple turn his or her data – whether encrypted or not – to others for storage,” the complaint says. “…The subscribers bargained for, agreed, and paid to have Apple – an entity they trusted – store their data. Instead, without their knowledge or consent, these iCloud subscribers had their data turned over by Apple to third-parties for these third-parties to store the data in a manner completely unknown to the subscribers.”
  • @glitchx86: Some merit to TM: it solves the problem of the correctness of lock-based concurrent programs. TM hides all the complexity of verifying deadlock-free software .. and it isn’t an easy task 
  • @narayanarjun: We were experiencing 40ms latency spikes on queries at @MaterializeInc and @nikhilbenesch tracked it down to TCP NODELAY, and his PR just cracks me up. The canonical cite is a hacker news comment ((link:…) signed by John Nagle himself, and I can’t even.
  • Donald Hoffman: Perhaps the universe itself is a massive social network of conscious agents that experience, decide, and act. If so, consciousness does not arise from matter; this is a big claim that we will explore in detail. Instead, matter and spacetime arise from consciousness—as a perceptual interface.
  • MacCárthaigh: From the very beginning at AWS, we were building for internet scale. AWS came out of [Amazon.com] and had to support [Amazon.com] as an early customer, which is audacious and ambitious. They’re a pretty tough customer, as you can imagine, one of the busiest websites on Earth. At internet scale, it’s almost all uncoordinated. If you think about CDNs, they’re just distributed caches, and everything’s eventually consistent, and that’s handling the vast majority of things.
  • Jack Clark: Being able to measure all the ways in which AI systems fail is a superpower, because such measurements can highlight the ways existing systems break and point researchers towards problems that can be worked on.
  • Google: We investigated the remote attack surface of the iPhone, and reviewed SMS, MMS, VVM, Email and iMessage. Several tools which can be used to further test these attack surfaces were released. We reported a total of 10 vulnerabilities, all of which have since been fixed. The majority of vulnerabilities occurred in iMessage due to its broad and difficult to enumerate attack surface. Most of this attack surface is not part of normal use, and does not have any benefit to users. Visual Voicemail also had a large and unintuitive attack surface that likely led to a single serious vulnerability being reported in it.  Overall, the number and severity of the remote vulnerabilities we found was substantial. Reducing the remote attack surface of the iPhone would likely improve its security.
  • sleepydog: I work in GCP support. I think you would be surprised. Of course Linux is more common, but we still support a lot of customers who use Windows Server, SQL Server, and .NET for production.
  • Laurence Tratt: performance nondeterminism increasingly worries me, because even a cursory glance at computing history over the last 20 years suggests that both hardware (mostly for performance) and software (mostly for security) will gradually increase the levels of performance nondeterminism over time. In other words, using the minimum time of a benchmark is likely to become more inaccurate and misleading in the future…
  • Geoff Tate: A year ago, if you talked to 10 automotive customers, they all had the same plan. Everyone was going straight to fully autonomous, 7nm, and they needed boatloads of inference throughput. They wanted to license IP that they would integrate into a full ADAS chip they would design themselves. They didn’t want to buy chips. That story has backpedaled big time. Now they’re probably going to buy off-the-shelf silicon, stitch it together to do what they want, and they’re going to take baby steps rather than go to Level 5 right away.
  • Ann Steffora Mutschler: In discussions with one of the Tier 0.5 suppliers about whether sensor fusion is the way to go or if it makes better sense to do more of the computation at the sensor itself, one CTO remarked that certain types of sensor data are better handled centrally, while other types of sensor data are better handled at the edge of the car, namely the sensor, Fritz said.
  • Dai Zovi: A software engineering team would write security features, then actively go to the security team to talk about it and for advice. We want to develop generative cultures, where risk is shared. It’s everyone’s concern. If you build security responsibility into every team, you can scale much more powerfully than if security is only the security staff’s responsibility.
  • Nitasha Tiku: But that didn’t mean things would go back to normal at Google. Over the past three years, the structures that once allowed executives and internal activists to hash out tensions had badly eroded. In their place was a new machinery that the company’s activists on the left had built up, one that skillfully leveraged media attention and drew on traditional organizing tactics. Dissent was no longer a family affair. And on the right, meanwhile, the pipeline of leaks running through Google’s walls was still going as strong as ever.
  • Graham Allan: There’s another bottleneck that SoC designers are starting to struggle with, and it’s not just about bandwidth. It’s bandwidth per millimeter of die edge. So if you have a bandwidth budget that you need for your SoC, a very easy exercise is to look at all the major technologies you can find. If you have HBM2E, you can get on the order of 60+ gigabytes per second per millimeter of die edge. You can only get about a sixth of that for GDDR6. And I can only get about a tenth of that with LPDDR5.
  • Brian Bailey: If the industry is willing to give von Neumann the boot, it should perhaps go the whole way and stop considering memory to be something shared between instructions and data and start thinking about it as an accelerator. Viewed that way, it no longer has to be compared against logic or memory, but should be judged on its own merits. If it accelerates the task and uses less power, then it is a purely economic decision if the area used is worth it, which is the same as every other accelerator.
  • Barbara Tversky: This brings us to our First Law of Cognition: There are no benefits without costs. Searching through many possibilities to find the best can be time consuming and exhausting. Typically, we simply don’t have enough time or energy to search and consider all the possibilities. The evidence on action is sufficient to declare the Second Law of Cognition: Action molds perception. There are those who go farther and declare that perception is for action. Yes, perception serves action, but perception serves so much more. 
  • Jez Humble: testing is for known knowns, monitoring is for known unknowns, observability is for unknown unknowns
  • @briankrebs: Being in infosec for so long takes its toll. I’ve come to the conclusion that if you give a data point to a company, they will eventually sell it, leak it, lose it or get hacked and relieved of it. There really don’t seem to be any exceptions, and it gets depressing
  • Brendon Foye: The hyperscale giant today released a new co-branding guide (pdf), instructing partners in the AWS Partner Network (APN) how to position their marketing material when going to market with AWS. Among the guidelines, AWS said it won’t approve the use of terms like “multi-cloud,” “cross cloud,” “any cloud,” “every cloud,” “or any other language that implies designing or supporting more than one cloud provider.”
  • Newley Purnell: Startup says it uses artificial-intelligence technology to largely automate the development of mobile apps, but several current and former employees say the company exaggerates its AI capabilities to attract customers and investors.
  • George Dyson: If you look at the most interesting computation being done on the Internet, most of it now is analog computing, analog in the sense of computing with continuous functions rather than discrete strings of code. The meaning is not in the sequence of bits; the meaning is just relative. Von Neumann very clearly said that relative frequency was how the brain does its computing. It’s pulse frequency coded, not digitally coded. There is no digital code.
  • Brendon Dixon: Because they’ve chosen to not deeply learn their deep learning systems—continuing to believe in the “magic”—the limitations of the systems elude them. Failures “are seen as merely the result of too little training data rather than existential limitations of their correlative approach” (Leetaru). This widespread lack of understanding leads to misuse and abuse of what can be, in the right venue, a useful technology.
  • Ewan Valentine: I could be completely wrong on this, but over the years, I’ve found that OO is great for mapping concepts, domain models together, and holding state. Therefore I tend to use classes to give a name to a concept and map data to it. For example, entities, repositories, and services, things which deal with data and state, I tend to create classes for. Whereas deliveries and use cases, I tend to treat functionally. The way this ends up looking, I have functions, which have instances of classes, injected through a higher-order function. The functional code then interacts with the various objects and classes passed into it, in a functional manner. I may fetch a list of items from a repository class, map through them, filter them, and pass the results into another class which will store them somewhere, or put them in a bucket.
  • Timothy Morgan: But what we do know is that the [Cray] machine will weigh in at around 30 megawatts of power consumption, which means it will have more than 10X the sustained performance of the current Sierra system on DOE applications and around 4X the performance per watt. This is a lot better energy efficiency than many might have been expecting – a few years back there was talk of exascale systems requiring as much as 80 megawatts of juice, which would have been very rough to pay for at $1 per watt per year. With those power consumption numbers, it would have cost $500 million to build El Capitan but it would have cost around $400 million to power it for five years; at 30 megawatts, you are in the range of $150 million, which is a hell of a lot more feasible even if it is an absolutely huge electric bill by any measure.
  • Timothy Prickett Morgan: All of us armchair architecture quarterbacks have been thinking the CPU of the future looks like a GPU card, with some sort of high bandwidth memory that’s really close. 
  • Garrett Heinlen (Netflix): I believe GraphQL also goes a step further beyond REST and it helps an entire organization of teams communicate in a much more efficient way. It really does change the paradigm of how we build systems and interact with other teams, and that’s where the power truly lies. Instead of the back end dictating, “Here are the APIs you receive and here’s the shape in the format you’re going to get,” they express what’s possible to access. The clients have all the power, pulling in just the data they need. The schema is the API contract between all teams and it’s a living evolving source of truth for your organization. Gone are the days of people throwing code over the wall and saying, “Good luck, it’s done.” Instead, GraphQL promotes more of a uniform working experience amongst front end and back end, and I would go further to say even product and designer could have been involved in this process as well to understand the business domain that you’re all working within.

Useful Stuff:

  • Fun thread. @jessfraz: Tell me about the weirdest bug you had that caused a datacenter outage, can be anywhere in the stack including human error. @dormando: one day all the sun servers fired temp alarms and shut off. thought AC had died or there was a fire. Turns out cleaners had wedged the DC door open, causing a rapid humidity shift, tricking the sensors. @ewindisch: connection pool leak in a distributed message queue I wrote caused the cascade failure of a datacenter’s network switches. This brought offline a large independent cloud provider around 2013. @davidbrunelle: Unexpected network latency caused TCP sockets to stay open indefinitely on a fleet of servers running an application. This eventually led to PAT exhaustion causing ~50% of outbound calls from the datacenter to fail causing a DC-wide brownout.
  • What happens when you go from LAMP to serverless: case study of 90% of the requests are below 100ms. $17.37/month. Generally low effort migration.
  • By continuously monitoring increases in spend, we end up building scalable, secure and resilient Lambda based solutions while maintaining maximum cost-effectiveness. How We Reduced Lambda Functions Costs by Thousands of Dollars: In the last 7 months, we started using Lambda based functions heavily in production. It allowed us to scale quickly and brought agility to our development activities…We were serving +80M Lambda invocations per day across multiple AWS regions with an unpleasant surprise in the form of a significant bill…once we started running heavy workloads in production, the cost became significant and we spent thousands of dollars daily…to reduce AWS Lambda costs, we monitored Lambda memory usage and execution time based on logs stored in CloudWatch…we created dynamic visualizations on Grafana based on metrics available in the timeseries database and we were able to monitor in near real-time Lambda runtime usage…we gained insights into the right sizing of each Lambda function deployed in our AWS account and we avoided excessive over-allocation of memory. Hence, significantly reduced the Lambda’s cost…To gather more insights and uncover hidden costs, we had to identify the most expensive functions. That’s where Lambda Tags come into play. We leveraged those metadata to break down the cost per Stack…By reducing the invocation frequency (control concurrency with SQS), we reduced the cost up to 99%…we’re evaluating alternative services like Spot Instances & Batch Jobs to run heavy non-critical workloads considering the hidden costs of Serverless…we were using SNS and we had issues with handling errors and Lambda timeout, so we changed our architecture to use SQS instead and we configured a dead letter queue to reduce the number of times the same message can be handled by the Lambda function (avoid recursion). Hence, reducing the number of invocations.
  • Six Shades of Coupling: Content Coupling, Common Coupling, External Coupling, Control Coupling, Stamp Coupling and Data Coupling. 
  • When does redundancy actually help availability?: The complexity added by introducing redundancy mustn’t cost more availability than it adds. The system must be able to run in degraded mode. The system must reliably detect which of the redundant components are healthy and which are unhealthy. The system must be able to return to fully redundant mode.
  • AI Algorithms Need FDA-Style Drug Trials. The problem with this idea is molecules do not change whereas software continuously changes and learning software by definition changes reactively. No static process like a one and done drug trial will yield meaningful results. We need a different approach that considers the unique role software plays in systems. Certainly vendors can’t be trusted. Any AI will tell you that. Perhaps create a set of test courses that platforms can be continuously tested and fuzzed against?
  • AWS Lambda is not ready to replace conventional EC2. Why we didn’t brew our Chai on AWS Lambda: Chai Point, India’s largest organized Chai retailer, with over 150+ stores and over 1000+ boxC (IoT Enabled Chai and Coffee vending machines) are designed for corporate which serves approximately 250k cups of chai per day from all the channels…Most of the Chai Point’s stores and boxC machines typically run between 7 AM to 9 PM…[Lambda cold start is] one of the most critical and deciding factors for us to move back the Shark infrastructure to EC2…AWS Lambda has a limit of 50 MB as the maximum deployment package…it takes a delay of 1–2 minutes for logs to appear in the CloudWatch which makes it difficult for immediate debugging in a test environment…when it comes to deploying it in enterprise solutions where there are inter-services dependencies I think there is still time especially for languages like Java. 
  • Facebook Performance @Scale 2019 recap videos are now available. 
  • Sharing is caring until it becomes overbearing. Dropbox no longer shares code between platforms. Their policy now is to use the native language on each platform. It is simply easier and quicker to write code twice. And you don’t have to train people on using a custom stack. The tools are native. So when people move on you have not lost critical expertise. The one codebase to rule them all dream dies hard. No doubt it will be back in short order, filtered through some other promising stack.
  • Everyone these days wants your functions. Oracle Functions Now Generally Available. It’s built on the Apache 2.0 licensed Fn Project. Didn’t see much in the way of reviews or on costs.
  • On LeanXcale database. Interview with Patrick Valduriez and Ricardo Jimenez-Peris: There is a class of new NewSQL databases in the market, called Hybrid Transaction and Analytics Processing (HTAP). NewSQL is a recent class of DBMS that seeks to combine the scalability of NoSQL systems with the strong consistency and usability of RDBMSs. LeanXcale’s architecture is based on three layers that scale out independently, 1) KiVi, the storage layer that is a relational key-value data store, 2) the distributed transactional manager that provides ultra-scalable transactions, and 3) the distributed query engine that enables scaling out both OLTP and OLAP workloads. The storage layer is a proprietary relational key-value data store, called KiVi, which we have developed. Unlike traditional key-value data stores, KiVi is not schemaless, but relational. Thus, KiVi tables have a relational schema, but can also have a part that is schemaless. The relational part enabled us to enrich KiVi with predicate filtering, aggregation, grouping, and sorting. As a result, we can push down all algebraic operators below a join to KiVi and execute them in parallel, thus saving the movement of a very large fraction of rows between the storage layer and the query engine layer.
  • Apollo Day New York City 2019 Recap
    • During his keynote, DeBergalis announced one of Apollo’s most anticipated innovations, Federation, which utilizes the idea of a new layer in the data stack to directly meet developers’ needs for a more scalable, reliable, and structured solution to a centralized data graph.
    • Federation paired with existing features of Apollo’s platform like schema change validation listing creates a flow where teams can independently push updates to product microservices. This triggers re-computation of the whole graph, which is validated and then pushed into the gateway. Once completed, all applications contain changes in the part of the graph that is available to them. These events happen independently, so there is a way to operate, which allows each team to be responsible solely for its piece.
    • Another key concept that DeBergalis detailed was the idea that a “three-legged” stack is emerging in front-end development. The “legs” of this new “stool” that form the basis of this stack are React, Apollo, and Typescript. React provides developers with a system for managing user components, Apollo provides developers a system for managing data, and Typescript provides a foundation underneath that provides static typing end-to-end through the stack.
  • Lesson: sticker shock—in Google Cloud everything costs more than you think it will, but it’s still worth it. Etsy’s Big Data Cloud Migration. Etsy generates a terabyte of data a day, they run hundreds of Hadoop workflows and thousands of jobs daily. Started out on prem. They migrated to the cloud over a year and a half ago, driven by needing both the machine and people resources required to keep up with machine learning and data processing tasks. Moving into the cloud decoupled systems so groups can operate independently. With their on prem system they didn’t worry about optimization, but on the cloud you must because the cloud will do whatever you tell it to do—at a price. In the cloud there’s a business case for making things more efficient. They rearchitected as they moved over. Managed services were a huge win. As they grew bigger they simply didn’t have the resources and the expertise to run all the needed infrastructure. That’s now Google’s job. This allowed having more generalized teams. It would be impossible for their team of 4 to manage all the things they use in GCP. Specialization is not required to run things. If you need it you just turn it on. That includes services like BigTable, k8s, Cloud Pub/Sub, Cloud Dataflow, and AI. It allows Etsy to punch above their weight class. They have a high level of support, with Google employees embedded on their team. Etsy didn’t lift and shift, they remade the platform as they moved over. If they had to do it over again they might have tried for a middle road, changing things before the migration.
  • Facebook Systems @Scale 2019 recap videos are now available.
  • The human skills we need in an unpredictable world. Efficiency and robustness trade off against each other. The more efficient something is, the less slack there is to handle the unexpected. When you target efficiency you may be making yourself more vulnerable to shocks.
  • The lesson is, you can’t wait around for Netflix or anyone else to promote your show. It’s up to you to create the buzz. How a Norwegian Viking Comedy Producer Hacked Netflix’s Algorithm: The key to landing on Netflix’s radar, he knew, would be to hack its recommendation engine: get enough people interested in the show early…Three weeks before launch, he set up a campaign on Facebook, paying for targeted posts and Facebook promotions. The posts were fairly simple — most included one of six short (20- to 25-second) clips of the show and a link, either to the show’s webpage or to media coverage. They used so-called A/B testing — showing two versions of a campaign to different audiences and selecting the most successful — to fine-tune. The U.S. campaign didn’t cost much — $18,500, which Tangen and his production partners put up themselves — and it was extremely precise. In just 28 days, the Norsemen campaign reached 5.5 million Facebook users, generating 2 million video views and some 6,000 followers for the show. Netflix noticed. “Three weeks after we launched, Netflix called me: ‘You need to come to L.A., your show is exploding,'” Tangen recalls. Tangen invested a further $15,000 to promote the show on Facebook worldwide, using what he had learned during the initial U.S. campaign.
  • How did NASA Steer the Saturn V? Triply redundant in logic. Doubly redundant in memory. The two memory results are compared to make sure they give the same answer. If the same numbers aren’t returned, a subroutine is called to determine which number makes the most sense at that point in the flight. Across all Saturn flights there were fewer than 10 miscompares. More components mean less reliability. Never had a catastrophic failure. Biggest problem is vibration.
  • Interesting idea: instead of interviews, use how well a candidate performs on training software to determine how well the candidate knows a set of skills. The Cloudcast – Cloud Computing. Role of generalists is gone. Pick a problem people are struggling with, become an expert at solving that problem, and market yourself as the person who has the skill of solving it.
  • The end state for any application is to write its own scheduler.  Making faster: Part 1. Use preload tags to start fetching resources as soon as possible. You can even preload GraphQL requests to get a head start on those long queries. Preloads have a higher network priority. Preload tag for all script resources and to place them in the order that they would be needed. Load in new batches before the user hits the end of their current feed. A prioritized task abstraction that handles queueing of asynchronous work (in this case, a prefetch for the next batch of feed posts).  If the user scrolls close enough to the end of the current feed, we increase the priority of this prefetch task to ‘high’ by cancelling the pending idle callback and thus firing off the prefetch immediately. Once the JSON data for the next batch of posts arrives, we queue a sequential background prefetch of all the images in that preloaded batch of posts. We prefetch these sequentially in the order the posts are displayed in the feed rather than in parallel, so that we can prioritize the download and display of images in posts closest to the user’s viewport. Also Preemption in Nomad — a greedy algorithm that scales
  • Native lazy loading has arrived! Adding the loading attribute to the images decreased the load time on a fast network connection by ~50% — it went from ~1 second to < 0.5 seconds, as well as saving up to 40 requests to the server 🎊. All of those performance enhancements just from adding one attribute to a bunch of images!
  • Maybe it should just be simpler to create APIs? Simple Two-way Messaging using the Amazon SQS Temporary Queue Client. Seems a lot of people use queues for front-end to back-end communication because it’s simpler to set up and easier to secure than creating an HTTP endpoint. So AWS came up with a virtual queue that lets you multiplex many virtual queues over a physical queue. No extra cost. It’s all done on the client. A clever tag based heartbeat mechanism is used to garbage collect queues.
  • Monolith to Microservices to Serverless — Our journey: A large part of our technology stack at that time comprised of a Spring based application and a MySQL database running on VMs in a data centre…The application was working for our thousands of customers, day in, day out, with little to no downtime. But it couldn’t be denied that new features were becoming difficult to build and the underlying infrastructure was beginning to struggle to scale as we continued to grow as a business…We needed a drastic rethink of our infrastructure and that came in the shape of Docker containers and Kubernetes…We took a long hard look at our codebase and with the ‘independent loosely coupled services’ mantra at the forefront of our minds we were quickly able to break off large parts of the monolith into smaller much more manageable services. New functionality was designed and built in the same way and we were quickly up to a 2 node K8s cluster with over 35 running pods….Fast forward to Today and we have now been using AWS for well over 2 years, we have migrated the core parts of our reporting suite into the cloud and where appropriate all new functionality is built using serverless AWS services. Our ‘serverless first’ ethos allows us to build highly performant and highly scaleable systems that are quick to provision and easy to manage. 
  • This is Crypto 101. Security Now 727 BlackHat & DefCon. Steve Gibson details how electronic hotel locks can protect themselves against replay attacks: All that’s needed to prevent this is for the door, when challenged to unlock, to provide a nonce for the phone to sign and return. The door contains a software ratchet. This is a counter which feeds a secretly-keyed AES symmetric cipher. Each door lock is configured with its own secret key which is never exposed. The AES cipher which encrypts a counter, produces a public elliptic key which is used to verify signatures. So the door lock first checks the key that it is currently valid for and has been using. If that fails, it checks ahead to the next public key to see whether that one can verify the returned signature. If not, it ignores the request. But if the next key does successfully verify the request’s signature it makes the next key permanent, ratcheting forward and forgetting the previous no-longer-valid key. This means that the door locks do not need to communicate with the hotel. Each door lock is able to operate autonomously with its own secret key which determines the sequence of its public keys. The hotel system knows each room’s secret key so it’s able to issue the proper private signing key to each guest for the proper room. If that system is designed correctly, no one with a copy of the Mobile Key software, and the ability to eavesdrop on the conversation, is able to gain any advantage from doing so.
  • Trip report: Summer ISO C++ standards meeting (Cologne). Reddit trip report. C++20 is now feature complete. Added: modules, coroutines, concepts including in the standard library via ranges, <=> spaceship including in the standard library, broad use of normal C++ for direct compile-time programming, calendars and time zones, text formatting, span, and lots more. Contracts were pulled from C++20 and deferred to a later standard.
  • Ingesting data at “Bharat” Scale: Initially, we considered Redis for our failover store, but with serving an average ingestion rate of 250K events per second, we would end up needing large Redis clusters just to support minutes worth of panic of our message bus. Finally, we decided to use a failover log producer that writes logs locally to disk. This periodically rotates & uploads to S3…We’ve seen outages, where our origin crashes & as it tries to recover, it is inundated with client retries & pending requests in the surge queue. That’s a recipe for cascading failure…We want to continue to serve the requests we can sustain, for anything over that, sorry, no entry. So we added a rate-limit to each of our API servers. We arrived at this configuration after a series of simulations & load-tests, to truly understand at what RPS our boxes will not sustain the load. We use nginx to control the number of requests per second using a leaky bucket algorithm. The target tracking scaling trigger is 3/4th of the rate-limit, to allow for the room to scale; but there are still occasions where large surges are too quick for target-tracking scaling to react.
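The rate-limiting idea in the last item can be sketched in a few lines. This is a minimal, single-process illustration of a leaky bucket used as a meter, not nginx's implementation; the class name, the `burst` parameter, and the injectable `clock` are assumptions made for the example. The `scaling_target` helper simply encodes the 3/4 ratio the post mentions.

```python
import time
from typing import Callable


class LeakyBucket:
    """Leaky bucket as a meter: each request adds to the bucket, which drains at a fixed rate."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec   # drain rate, i.e. sustained requests per second
        self.capacity = burst      # hypothetical burst allowance
        self.level = 0.0           # current fill level
        self.clock = clock         # injectable so the sketch can be tested without sleeping
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Drain the bucket for the time elapsed since the previous call.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False  # bucket is full: shed this request


def scaling_target(rate_limit: float) -> float:
    """Autoscaling trigger at 3/4 of the rate limit, leaving room to scale before requests are shed."""
    return 0.75 * rate_limit
```

Requests beyond what the bucket can absorb are rejected immediately rather than queued, which is what keeps retries from piling up behind a recovering origin.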

Soft Stuff:

  • jedisct1/libsodium: Sodium is a new, easy-to-use software library for encryption, decryption, signatures, password hashing and more. It is a portable, cross-compilable, installable, packageable fork of NaCl, with a compatible API, and an extended API to improve usability even further. Its goal is to provide all of the core operations needed to build higher-level cryptographic tools.
  • amejiarosario/dsa.js-data-structures-algorithms-javascript: In this repository, you can find the implementation of algorithms and data structures in JavaScript. This material can be used as a reference manual for developers, or you can refresh specific topics before an interview. Also, you can find ideas to solve problems more efficiently.
  • linkedin/brooklin: Brooklin is a distributed system intended for streaming data between various heterogeneous source and destination systems with high reliability and throughput at scale. Designed for multitenancy, Brooklin can simultaneously power hundreds of data pipelines across different systems and can easily be extended to support new sources and destinations.
  • gojekfarm/hospital: Hospital is an autonomous healing system for any System. Any failure or faults occurred in the system will be resolved automatically according to given run-book by the Hospital without manual intervention.
  • BlazingDB/pyBlazing: BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
  • serverless/components: Forget infrastructure — Serverless Components enables you to deploy entire serverless use-cases, like a blog, a user registration system, a payment system or an entire application — without managing complex cloud infrastructure configurations.

Pub Stuff:

  • Zooming in on Wide-area Latencies to a Global Cloud Provider: The network communications between the cloud and the client have become the weak link for global cloud services that aim to provide low latency services to their clients. In this paper, we first characterize WAN latency from the viewpoint of a large cloud provider Azure, whose network edges serve hundreds of billions of TCP connections a day across hundreds of locations worldwide. 
  • What is Applied Category Theory? Two themes that appear over and over (and over and over and over) in applied category theory are functorial semantics and compositionality. 
  • ML can never be fair. On Fairness and Calibration: In this paper, we investigate the tension between minimizing error disparity across different population groups while maintaining calibrated probability estimates. We show that calibration is compatible only with a single error constraint (i.e. equal false-negatives rates across groups), and show that any algorithm that satisfies this relaxation is no better than randomizing a percentage of predictions for an existing classifier. These unsettling findings, which extend and generalize existing results, are empirically confirmed on several datasets.

from High Scalability

Things to Consider When You Build REST APIs with Amazon API Gateway

A few weeks ago, we kicked off this series with a discussion on REST vs GraphQL APIs. This post will dive deeper into the things an API architect or developer should consider when building REST APIs with Amazon API Gateway.

Request Rate (a.k.a. “TPS”)

Request rate is the first thing you should consider when designing REST APIs. By default, API Gateway allows for up to 10,000 requests per second. You should use the built in Amazon CloudWatch metrics to review how your API is being used. The Count metric in particular can help you review the total number of API requests in a given period.

It’s important to understand the actual request rate that your architecture is capable of supporting. For example, consider this architecture:


This API accepts GET requests to retrieve a user’s cart by using a Lambda function to perform SQL queries against a relational database managed in RDS. If you receive a large burst of traffic, both API Gateway and Lambda will scale in response to the traffic. However, relational databases typically have limited memory and CPU capacity, and the database will quickly exhaust its total number of connections.
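A rough way to see how quickly connections run out is Little's law: concurrency is approximately the arrival rate multiplied by the average request duration. The sketch below assumes each concurrent Lambda execution holds one database connection (no pooling or proxy), and the numbers in the usage note are purely illustrative.

```python
import math


def estimated_connections(requests_per_second: float, avg_duration_s: float) -> int:
    """Little's law: concurrent executions ~= arrival rate x average duration.

    Assumes one database connection per concurrent execution.
    """
    return math.ceil(requests_per_second * avg_duration_s)


def max_sustainable_rps(max_connections: int, avg_duration_s: float) -> float:
    """Request rate at which the database connection limit is reached."""
    return max_connections / avg_duration_s
```

For example, 2,000 requests per second at 250 ms each implies about 500 concurrent connections, while a database capped at 100 connections would top out near 400 requests per second under the same assumption.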

As an API architect, you should design your APIs to protect your downstream applications. You can start by defining API Keys and requiring your clients to deliver a key with incoming requests. This lets you track each application or client that is consuming your API. It also lets you create Usage Plans and throttle your clients according to the plan you define. For example, if you know your architecture is capable of sustaining 200 requests per second, you should define a Usage Plan that sets a rate of 200 RPS and optionally configure a quota to allow a certain number of requests by day, week, or month.
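As a sketch of what such a plan could look like, the helper below builds the keyword arguments for the boto3 API Gateway `create_usage_plan` call. The plan name, API ID, stage, burst limit, and daily quota are hypothetical values chosen for illustration; only the 200 RPS rate comes from the example above. The call itself is left as a comment so the snippet runs without AWS credentials.

```python
from typing import Any, Dict


def usage_plan_request(name: str, api_id: str, stage: str,
                       rate_rps: float, burst: int, daily_quota: int) -> Dict[str, Any]:
    """Build keyword arguments for apigateway.create_usage_plan."""
    return {
        "name": name,
        "apiStages": [{"apiId": api_id, "stage": stage}],
        # Steady-state rate and burst size applied to clients on this plan.
        "throttle": {"rateLimit": float(rate_rps), "burstLimit": int(burst)},
        # Optional quota; period may be DAY, WEEK, or MONTH.
        "quota": {"limit": int(daily_quota), "period": "DAY"},
    }


# Hypothetical identifiers and limits, for illustration only.
request = usage_plan_request("cart-basic", "abc123", "PROD",
                             rate_rps=200, burst=400, daily_quota=1_000_000)

# With credentials configured, the plan could then be created with:
#   import boto3
#   boto3.client("apigateway").create_usage_plan(**request)
```

The plan only takes effect for a client once that client's API key has been associated with it.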

Additionally, API Gateway lets you define throttling settings for the whole stage or per method. If you know that a GET operation is less resource intensive than a POST operation you can override the stage settings and set different throttling settings for each resource.

Integrations and Design patterns

The example above describes a synchronous, tightly coupled architecture where the request must wait for a response from the backend integration (RDS in this case). This results in system scaling characteristics that are the lowest common denominator of all components. Instead, you should look for opportunities to design an asynchronous, loosely coupled architecture. A decoupled architecture separates the data ingestion from the data processing and allows you to scale each system separately. Consider this new architecture:


This architecture enables ingestion of orders directly into a highly scalable and durable data store such as Amazon Simple Queue Service (SQS).  Your backend can process these orders at any speed that is suitable for your business requirements and system ability.  Most importantly,  the health of the backend processing system does not impact your ability to continue accepting orders.
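The decoupling can be illustrated with an in-memory stand-in for the queue. This is only a sketch: `OrderQueue` is a hypothetical class that imitates a durable queue such as SQS inside one process, and the function names and the 202 response are assumptions for the example.

```python
from collections import deque
from typing import Deque, Dict, List


class OrderQueue:
    """In-memory stand-in for a durable queue such as SQS (illustration only)."""

    def __init__(self) -> None:
        self._messages: Deque[Dict] = deque()

    def send(self, order: Dict) -> None:
        self._messages.append(order)

    def receive(self, max_messages: int) -> List[Dict]:
        batch: List[Dict] = []
        while self._messages and len(batch) < max_messages:
            batch.append(self._messages.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._messages)


def accept_order(queue: OrderQueue, order: Dict) -> Dict:
    """Ingestion path: enqueue and acknowledge without waiting for the backend."""
    queue.send(order)
    return {"statusCode": 202, "body": "order accepted"}


def process_orders(queue: OrderQueue, capacity: int) -> List[str]:
    """Processing path: the backend drains the queue at whatever pace it can sustain."""
    return [order["orderId"] for order in queue.receive(capacity)]
```

Calling `process_orders` with a capacity of zero models a backend that is down: nothing is processed, yet `accept_order` keeps succeeding and the backlog waits in the queue.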


Security

Security with API Gateway falls into three major buckets, and I’ll outline them below. Remember, you should enable all three options to combine multiple layers of security.

Option 1 (Application Firewall)

You can enable AWS Web Application Firewall (WAF) for your entire API. WAF will inspect all incoming requests and block requests that fail your inspection rules. For example, WAF can inspect requests for SQL Injection, Cross Site Scripting, or whitelisted IP addresses.

Option 2 (Resource Policy)

You can apply a Resource Policy that protects your entire API. This is an IAM policy that is applied to your API and you can use this to white/black list client IP ranges or allow AWS accounts and AWS principals to access your API.

Option 3 (AuthZ)

  1. IAM: This AuthZ option requires clients to sign requests with the AWS Signature Version 4 signing process. The associated IAM role or user must have permissions to perform the execute-api:Invoke action against the API.
  2. Cognito: This AuthZ option requires clients to log in to Cognito and then pass the returned ID or access token (a JWT) in the Authorization header.
  3. Lambda Auth: This AuthZ option is the most flexible and lets you execute a Lambda function to perform any custom auth strategy needed. A common use case for this is OpenID Connect.
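As a minimal sketch of the Lambda Auth option, the function below handles a token-based authorizer event and returns an IAM policy allowing or denying execute-api:Invoke on the requested method. The `VALID_TOKENS` lookup is a hypothetical placeholder; a real authorizer would validate a JWT or call an identity provider instead.

```python
from typing import Any, Dict

# Hypothetical token store, for illustration only.
VALID_TOKENS = {"allow-me": "user-123"}


def build_policy(principal_id: str, effect: str, method_arn: str) -> Dict[str, Any]:
    """Authorizer response: a principal plus a policy for execute-api:Invoke."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": method_arn,
            }],
        },
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Token authorizer: API Gateway passes the token and the ARN of the called method."""
    token = event.get("authorizationToken", "")
    principal = VALID_TOKENS.get(token)
    if principal is None:
        return build_policy("anonymous", "Deny", event["methodArn"])
    return build_policy(principal, "Allow", event["methodArn"])
```

Because the function can run any logic before returning the policy, the same shape works for OpenID Connect or other custom strategies.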

A Couple of Tips

Tip #1: Use Stage variables to avoid hard coding your backend Lambda and HTTP integrations. For example, you probably have multiple stages such as “QA” and “PROD” or “V1” and “V2.” You can define the same variable in each stage and specify different values. For example, you might have an API that executes a Lambda function. In each stage, define the same variable called functionArn. You can reference this variable as your Lambda ARN during your integration configuration using this notation: ${stageVariables.functionArn}. API Gateway will inject the corresponding value for the stage dynamically at runtime, allowing you to execute different Lambda functions by stage.
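To make the substitution concrete, here is a small local simulation of how one template resolves differently per stage. It is not how API Gateway is implemented; the `resolve` helper and the ARNs are hypothetical and exist only to show the effect of the ${stageVariables.functionArn} notation.

```python
import re
from typing import Dict

# Matches references such as ${stageVariables.functionArn}.
_STAGE_VAR = re.compile(r"\$\{stageVariables\.([A-Za-z0-9_]+)\}")


def resolve(template: str, stage_variables: Dict[str, str]) -> str:
    """Replace each stage variable reference with that stage's value."""
    return _STAGE_VAR.sub(lambda m: stage_variables[m.group(1)], template)


# Hypothetical per-stage values for the same variable name.
STAGES = {
    "QA": {"functionArn": "arn:aws:lambda:us-east-1:123456789012:function:cart-qa"},
    "PROD": {"functionArn": "arn:aws:lambda:us-east-1:123456789012:function:cart-prod"},
}

integration = "${stageVariables.functionArn}"
qa_target = resolve(integration, STAGES["QA"])
prod_target = resolve(integration, STAGES["PROD"])
```

The integration is configured once, and the stage a request arrives on decides which function it reaches.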

Tip #2: Use Path and Query variables to inject dynamic values into your HTTP integrations. For example, your cart API may define a userId Path variable that is used to lookup a user’s cart: /cart/profile/{userId}. You can inject this variable directly into your backend HTTP integration URL settings like this:{userId}


This post covered strategies you should use to ensure your REST API architectures are scalable and easy to maintain. I hope you’ve enjoyed this post. Our next post will cover GraphQL API architectures with AWS AppSync.

About the Author

George Mao is a Specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. George is responsible for helping customers design and operate Serverless applications using services like Lambda, API Gateway, Cognito, and DynamoDB. He is a regular speaker at AWS Summits, re:Invent, and various tech events. George is a software engineer and enjoys contributing to open source projects, delivering technical presentations at technology events, and working with customers to design their applications in the Cloud. George holds a Bachelor of Computer Science and Masters of IT from Virginia Tech.

from AWS Architecture Blog

Architecture Monthly Magazine for July: Machine Learning

Every month, AWS publishes the AWS Architecture Monthly Magazine (available for free on Kindle and Flipboard) that curates some of the best technical and video content from around AWS.

In the June edition, we offered several pieces of content related to Internet of Things (IoT). This month we’re talking about artificial intelligence (AI), namely machine learning.

Machine Learning: Let’s Get it Started

Alan Turing, the British mathematician whose life and work were documented in the movie The Imitation Game, was a pioneer of theoretical computer science and AI. He was the first to put forth the idea that machines can think.

Jump ahead 80 years to this month when researchers asked four-time World Poker Tour title holder Darren Elias to play Texas Hold’em with Pluribus, a poker-playing bot (actually, five of these bots were at the table). Pluribus learns by playing against itself over and over and remembering which strategies worked best. The bot became a world-class poker player in a matter of days. Read about it in the journal Science.

If AI is making a machine more human, AI’s subset, machine learning, involves the techniques that allow these machines to make sense of the data we feed them. Machine learning is mimicking how humans learn, and Pluribus is actually learning from itself.

From self-driving cars, medical diagnostics, and facial recognition to our helpful (and sometimes nosy) pals Siri, Alexa, and Cortana, all these smart machines are constantly improving from the moment we unbox them. We humans are teaching the machines to think like us.

For July’s magazine, we assembled architectural best practices about machine learning from all over AWS, and we’ve made sure that a broad audience can appreciate it.

  • Interview: Mahendra Bairagi, Solutions Architect, Artificial Intelligence
  • Training: Getting in the Voice Mindset
  • Quick Start: Predictive Data Science with Amazon SageMaker and a Data Lake on AWS
  • Blog post: Amazon SageMaker Neo Helps Detect Objects and Classify Images on Edge Devices
  • Solution: Fraud Detection Using Machine Learning
  • Video: Uses Deep Learning to Analyze CT Scans and Save Lives
  • Whitepaper: Power Machine Learning at Scale

We hope you find this edition of Architecture Monthly useful, and we’d like your feedback. Please give us a star rating and your comments on Amazon. You can also reach out to [email protected] anytime. Check back in a month to discover what the August magazine will offer.

from AWS Architecture Blog

Stuff The Internet Says On Scalability For August 2nd, 2019

Wake up! It’s HighScalability time—once again:

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 52 mostly 5 star reviews (121 on Goodreads). They’ll learn a lot and hold you in even greater awe.

Number Stuff:

  • $9.6B: games investment in last 18 months, equal to the previous five years combined.
  • $3 million: won by a teenager in the Fortnite World Cup.  
  • 100,000: issues in Facebook’s codebase fixed from bugs found by static analysis. 
  • 106 million: Capital One IDs stolen by a former Amazon employee. (complaint)
  • 2 billion: IoT devices at risk because of 11 VxWorks zero day vulnerabilities.
  • 2.1 billion: parking spots in the US, taking 30% of city real estate, totaling 34 billion square meters, the size of West Virginia, valued at 60 trillion dollars.
  • 2.1 billion: people use Facebook, Instagram, WhatsApp, or Messenger every day on average. 
  • 100: words per minute from Facebook’s machine-learning algorithms capable of turning brain activity into speech. 
  • 51%: Facebook and Google’s ownership of the global digital ad market space on the internet.
  • 56.9%: Raleigh, NC was the top U.S. city for tech job growth.
  • 20-30: daily CPAN (Perl) uploads. 700-800 for Python.
  • 476 miles: LoRaWAN (Low Power, Wide Area (LPWA)) distance world record broken using 25mW transmission power.
  • 74%: Skyscanner savings using spot instances and containers on the Kubernetes cluster.
  • 49%: say convenience is more important than price when selecting a provider.
  • 30%: Airbnb app users prefer a non-default font size.
  • 150,000: number of databases migrated to AWS using the AWS Database Migration Service.
  • 1 billion: Google photos users, @MikeElgan: same size as Instagram but far larger than Twitter, Snapchat or Pinterest
  • 300M: Pinterest monthly active users, with revenue of $261 million, up 64% year-over-year, on losses of $26 million for the second quarter of 2019.
  • 7%: of all dating app messages were rated as false.
  • $100 million: Goldman Sachs spend to improve stock trades from hundreds of milliseconds down to 100 microseconds while handling more simultaneous trades. The article mentions using microservices and event sourcing, but it’s not clear how that’s related.

Quotable Stuff:

  • Josh Frydenberg, Australian Treasurer: Make no mistake, these companies are among the most powerful and valuable in the world. They need to be held to account and their activities need to be more transparent.
  • Neil Gershenfeld: Fabrication merges with communication and computation. Most fundamentally, it leads to things like morphogenesis and self-reproducing an assembler. Most practically, it leads to almost anybody can make almost anything, which is one of the most disruptive things I know happening right now. Think about this range I talked about as for computing the thousand, million, billion, trillion now happening for the physical world, it’s all here today but coming out on many different link scales.
  • Alan Kay: Marvin and Seymour could see that most interesting systems were crossconnected in ways that allowed parts to be interdependent on each other—not hierarchical—and that the parts of the systems needed to be processes rather than just “things”
  • Lawrence Abrams: Now that ransomware developers know that they can earn monstrous payouts from local cities and insurance policies, we see a new government agency, school district, or large company getting hit with a ransomware attack every day.
  • @tmclaughbos: A lot of serverless adoption will fail because organizations will push developers to assume more responsibility down the stack instead of forcing them to move up the stack closer to the business.
  • Lightstep: Google Cloud Functions’ reusable connection insertion makes the requests more than 4 times faster [than S3] both in region and cross region.
  • HENRY A. KISSINGER, ERIC SCHMIDT, DANIEL HUTTENLOCHER: The evolution of the arms-control regime taught us that grand strategy requires an understanding of the capabilities and military deployments of potential adversaries. But if more and more intelligence becomes opaque, how will policy makers understand the views and abilities of their adversaries and perhaps even allies? Will many different internets emerge or, in the end, only one? What will be the implications for cooperation? For confrontation? As AI becomes ubiquitous, new concepts for its security need to emerge. The three of us differ in the extent to which we are optimists about AI. But we agree that it is changing human knowledge, perception, and reality, and in so doing changing the course of human history. We seek to understand it and its consequences, and encourage others across disciplines to do the same.
  • minesafetydisclosures: Visa’s business is all about scale. That’s because the company’s fixed costs are high, but the cost of processing a transaction is essentially zero. Said more simply, it takes a big upfront investment in computers, servers, personnel, marketing, and legal fees to run Visa. But those costs don’t increase as volume increases; i.e., they’re “fixed”. So as Visa processes more transactions through their network, profit swells. As a result, the company’s operating margin has increased from 40% to 65%. And the total expense per transaction has dropped from a dime to a nickel; of which only half of a penny goes to the processing cost. Both trends are likely to continue.
  • noobiemcfoob: Summarizing my views: MQTT seems as opaque as WebSockets without the benefits of being built on a very common protocol (HTTP) and being used in industries beyond just IoT. The main benefits proponents of MQTT argue for (low bandwidth, small libraries) don’t seem particularly true in comparison to HTTP and WebSockets.
  • erincandescent: It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I’d likely implement the architecture to benefit from the existing tooling.
  • Director Jon Favreau~  the plan was to create a virtual Serengeti in the Unity game engine, then apply live action filmmaking techniques to create the film — the “Lion King” team described this as a “virtual production process.”
  • Alex Heath: In confidential research Mr. Cunningham prepared for Facebook CEO Mark Zuckerberg, parts of which were obtained by The Information, he warned that if enough users started posting on Instagram or WhatsApp instead of Facebook, the blue app could enter a self-sustaining decline in usage that would be difficult to undo. Although such “tipping points” are difficult to predict, he wrote, they should be Facebook’s biggest concern. 
  • jitbit: Well, to be embarrassingly honest… We suck at pricing. We were offering “unlimited” plans to everyone until recently. And the “impressive names” like you mention, well, they mostly pay us around $250 a month – which used to be our “Enterprise” pricing plan with unlimited everything (users, storage, agents etc.) So I guess the real answer is – we suck at positioning and we suck at marketing. As the result – profits were REALLY low (Lesson learned – don’t compete on pricing). P.S. Couple of years ago I met Thomas from “FE International” at some conference, really experienced guy, who told me “dude, this is crazy, dump the unlimited plan like right now” so we did. So I guess technically we can afford a PaaS now…
  • 1e-9: The markets are kind of like a massive, distributed, realtime, ensemble, recursive predictor that performs much better than any one of its individual component algorithms could. The reason why shaving a few milliseconds (or even microseconds) can be beneficial is because the price discovery feedback loops get faster, which allows the system to determine a giant pricing vector that is more self-consistent, stable, and beneficial to the economy. It’s similar to how increasing the sample rate of a feedback control system improves performance and stability. Providers of such benefits to the markets get rewarded through profit.
  • @QuinnyPig: There’s something else afoot too. I fix cloud bills. If I offer $10k to save $100k people sign off. If I offer $10 million to save $100 million people laugh me out of the room. Large numbers are psychologically scary.
  • mrjn:  Is it worth paying $20K for any DB or DB support? If it would save you 1/10th of an engineer per year, it becomes immediately worth. That means, can you avoid 5 weeks of one SWE by using a DB designed to better suit your dataset? If the answer is yes (and most cases it is), then absolutely that price is worth. See my blog post about how much money it must be costing big companies building their graph layers. Second part is, is Dgraph worth paying for compared to Neo or others? Note that the price is for our enterprise features and support. Not for using the DB itself. Many companies run a 6-node or a 12-node distributed/replicated Dgraph cluster and we only learn that much later when they’re close to pushing it into production and need support. They don’t need to pay for it, the distributed/replicated/transactional architecture of Dgraph is all open source. How much would it cost if one were to run a distributed/replicated setup of another graph DB? Is it even possible, can it execute and perform well? And, when you add support to that, what’s the cost?
  • @codemouse: It’s halfway to 2020. At this point, if any of your strategy is continued investment into your data centers you’re doing it wrong. Yes migration may take years, but you’re not going to be doing #cloud or #ops better than @awscloud
  • hermitdev: Not Citibank, but previously worked for a financial firm that sold a copy of its back office fund administration stack. Large, on site deployment. It would take a month or two to make a simple DNS change so they could locate the services running on their internal network. The client was a US depository trust with trillions on deposit. No, I won’t name any names. But getting our software installed and deployed was as much fun as extracting a tooth with a dull wood chisel and a mallet.
  • Insikt Group: Approximately 50% of all activity concerning ransomware on underground forums are either requests for any generic ransomware or sales posts for generic ransomware from lower-level vendors. We believe this reflects a growing number of low-level actors developing and sharing generic ransomware on underground forums.
  • Facebook: For classes of bugs intended for all or a wide variety of engineers on a given platform, we have gravitated toward a “diff time” deployment, where analyzers participate as bots in code review, making automatic comments when an engineer submits a code modification. Later, we recount a striking situation where the diff time deployment saw a 70% fix rate, where a more traditional “offline” or “batch” deployment (where bug lists are presented to engineers, outside their workflow) saw a 0% fix rate.
  • Andy Rachleff: Venture capitalists know that the thing that causes their companies to go out of business is lack of a market, not poor execution. So it’s a fool’s errand to back a company that proposes to do a ride-hailing service or renting a room or something as crazy as that. Again–how would you know if it’s going to work? So the venture industry outsourced that market risk to the angel community. The angel community thinks they won it away from the venture community, but nothing could be further from the truth, because it’s a sucker bet. It’s a horrible risk/reward. The venture capitalists said, “Okay, let the angels invest at a $5 million valuation and take all of that market risk. We’ll invest at a $50 million valuation. We have to pay up if it works.” Now they hope the company will be worth $5 billion to make the same return as they would have in the old model. Interestingly, there now are as many companies worth $5 billion today as there were companies worth $500 million 20 years ago, which is why the returns of the premier venture capital firms have stayed the same or even gone up.
  • imagetic: I dealt with a lot of high traffic live streaming video on Facebook for several years. We saw interaction rates decline almost 20x in a 3 year period but views kept increasing. Things just didn’t add up when the dust settled and we’d look at the stats. I wouldn’t be the least bit surprised if every stat FB has fed me was blown extremely out of proportion.
  • prism1234: If you are designing a small embedded system, and not a high performance general computing device, then you already know what operations your software will need and can pick what extensions your core will have. So not including a multiply by default doesn’t matter in this case, and may be preferred if your use case doesn’t involve a multiply. That’s a large use case for risc-v, as this is where the cost of an arm license actually becomes an issue. They don’t need to compete with a cell phone or laptop level cpu to still be a good choice for lots of devices.
  • oppositelock: You don’t have time to implement everything yourself, so you delegate. Some people now have credentials to the production systems, and to ease their own debugging, or deployment, spin up little helper bastion instances, so they don’t have to use 2FA each time to use SSH or don’t have to deal with limited-time SSH cert authorities, or whatever. They roll out your fairly secure design, and forget about the little bastion they’ve left hanging around, open to with the default SSH private key every dev checks into git. So, any former employee can get into the bastion.
  • Lyft: Our tech stack comprises Apache Hive, Presto, an internal machine learning (ML) platform, Airflow, and third-party APIs.
  • Casey Rosenthal: It turns out that redundancy is often orthogonal to robustness, and in many cases it is absolutely a contributing factor to catastrophic failure. The problem is, you can’t really tell which of those it is until after an incident definitively proves it’s the latter.
  • Colm MacCárthaigh: There are two complementary tools in the chest that we all have these days, that really help combat Open Loops. The first is Chaos Engineering. If you actually deliberately go break things a lot, that tends to find a lot of Open Loops and make it obvious that they have to be fixed.
  • @eeyitemi: I’m gonna constantly remind myself of this everyday. “You can outsource the work, but you can’t outsource the risk.” @Viss 2019
  • Ben Grossman~ this could lead to a situation where filmmaking is less about traditional “filmmaking or storytelling,” and more about “world-building”: “You create a world where characters have personalities and they have motivations to do different things and then essentially, you can throw them all out there like a simulation and then you can put real people in there and see what happens.”
  • cheeze: I’m a professional dev and we own a decent amount of perl. That codebase is by far the most difficult to work in out of anything we own. New hires have trouble with it (nobody learns perl these days). Lots of it is next to unreadable.
  • Annie Lowrey: All that capital from institutional investors, sovereign wealth funds, and the like has enabled start-ups to remain private for far longer than they previously did, raising bigger and bigger rounds. (Hence the rise of the “unicorn,” a term coined by the investor Aileen Lee to describe start-ups worth more than $1 billion, of which there are now 376.) Such financial resources “never existed at scale before” in Silicon Valley, says Steve Blank, a founder and investor. “Investors said this: ‘If we could pull back our start-ups from the public market and let them appreciate longer privately, we, the investors, could take that appreciation rather than give it to the public market.’ That’s it.”
  • alexis_fr: I wonder if the human life calculation worked well this time. As far as I see, Boeing lost more than the sum of the human lives; they also lost reputation for everything new they’ve designed in the last 7 years being corrupted, and they also engulfed the reputation of FAA with them, whose agents would fit the definition of “corrupted” by any people’s definition (I know, they are not, they just used agents of Boeing to inspect Boeing because they were understaffed), and the FAA showed the last step of failure by not admitting that the plane had to be stopped until a few days after the European agencies. In other words, even in financial terms, it cost more than damages. It may have cost the entire company. They “De Havilland”’ed their company. Ever heard of De Havilland? No? That’s probably to do with their 4 successive disintegrating planes that “CEOs have complete trust in.” It just died, as a name. The risk is high.
  • Neil Gershenfeld: computer science was one of the worst things ever to happen to computers or science, why I believe that, and what that leads me to. I believe that because it’s fundamentally unphysical. It’s based on maintaining a fiction that digital isn’t physical and happens in a disconnected virtual world.
  • @benedictevans: Netflix and Sky both realised that a new technology meant you could pay vastly more for content than anyone expected, and take it to market in a new way. The new tech (satellite, broadband) is a crowbar for breaking into TV. But the questions that matter are all TV questions
  • @iamdevloper: Therapist: And what do we do when we feel like this? Me: buy a domain name for the side project idea we’ve had for 15 seconds. Therapist: No
  • @dvassallo: Step 1: Forget that all these things exist: Microservices, Lambda, API Gateway, Containers, Kubernetes, Docker. Anything whose main value proposition is about “ability to scale” will likely trade off your “ability to be agile & survive”. That’s rarely a good trade off. 4/25 Start with a t3.nano EC2 instance, and do all your testing & staging on it. It only costs $3.80/mo. Then before you launch, use something bigger for prod, maybe an m5.large (2 vCPU & 8 GB mem). It’s $70/mo and can easily serve 1 million page views per day.
  • PeteSearch: I believe we have an opportunity right now to engineer-in privacy at a hardware level, and set the technical expectation that we should design our systems to be resistant to abuse from the very start. The nice thing about bundling the audio and image sensors together with the machine learning logic into a single component is that we have the ability to constrain the interface. If we truly do just have a single pin output that indicates if a person is present, and there’s no other way (like Bluetooth or WiFi) to smuggle information out, then we should be able to offer strong promises that it’s just not possible to leak pictures. The same for speech interfaces, if it’s only able to recognize certain commands then we should be able to guarantee that it can’t be used to record conversations.
  • Murat: As I have mentioned in the previous blog post, MAD questions, Cosmos DB has operationalized a fault-masking streamlined version of replication via nested replica-sets deployed in fan-out topology. Rather than doing offline updates from a log, Cosmos DB updates database at the replicas online, in place, to provide strong consistent and bounded-staleness consistency reads among other read levels. On the other hand, Cosmos DB also maintains a change log by way of a witness replica, which serves several useful purposes, including fault-tolerance, remote storage, and snapshots for analytic workload.
  • grauenwolf: That’s where I get so frustrated. Far too often I hear “premature optimization” as a justification for inefficient code when doing it the right way would actually require the same or less effort and be more readable.
  • Murat: Leader – I tell you Paxos joke, if you accept me as leader. Quorum – Ok comrade. Leader – Here is joke! (*Transmits joke*) Quorum – Oookay… Leader – (*Laughs* hahaha). Now you laugh!! Quorum – Hahaha, hahaha.
  • Manmax75: The amount of stories I’ve heard from SysAdmins who jokingly try to access a former employers network with their old credentials only to be shocked they still have admin access is a scary and boggling thought.
  • @dougtoppin: Fargate brings significant opportunity for cost savings and to get the maximum benefit the minimal possible number of tasks must be running to handle your capacity needs. This means quickly detecting request traffic, responding just as quickly and then scaling back down.
  • @evolvable: At a startup bank we got management pushback when revealing we planned to start testing in production – concerns around regulation and employees accessing prod. We changed the name to “Production Verification”. The discussion changed to why we hadn’t been doing it until now. 
  • @QuinnyPig: I’m saying it a bit louder every time: @awscloud’s data transfer pricing is predatory garbage. I have made hundreds of thousands of consulting dollars straightening these messes out. It’s unconscionable. I don’t want to have to do this for a living. To be very clear, it’s not that the data transfer pricing is too expensive, it’s that it’s freaking inscrutable to understand. If I can cut someone’s bill significantly with a trivial routing change, that’s not the customer’s fault.
  • @PPathole: Alternative Big O notations: O(1) = O(yeah) O(log n) = O(nice) O(nlogn) = O(k-ish) O(n) = O(ok) O(n²) = O(my) O(2ⁿ) = O(no) O(n^n) = O(f*ck) O(n!) = O(mg!)
  • Brewster Kahle: There’s only a few hackers I’ve known like Richard Stallman, he’d write flawless code at typing speed. He worked himself to the bone trying to keep up with really smart former colleagues who had been poached from MIT. Carpal tunnel, sleeping under the desk, really trying hard for a few years and it was killing him. So he basically says I give up, we’re going to lose the Lisp machine. It was going into this company that was flying high, it was going to own the world, and he said it was going to die, and with it the Lisp machine. He said all that work is going to be lost, we need a way to deal with the violence of forking. And he came up with the GNU public license. The GPL is a really elegant hack in the classic sense of a hack. His idea of the GPL was to allow people to use code but to let people put it back into things. Share and share alike.

Useful Stuff:

  • It’s probably not a good idea to start a Facebook poll on the advisability of your pending nuptials a day before the wedding. But it is very funny and disturbingly plausible. Made Public. Another funny/sad one is using a ML bot to “deal with” phone scams. The sad part will be when both sides are just AIs trying to socially engineer each other and half the world’s resources become dedicated to yet another form of digital masturbation. Perhaps we should just stop the MADness?
  • Urgent/11: 11 Zero Day Vulnerabilities Impacting VxWorks, the Most Widely Used Real-Time Operating System (RTOS). I read this with special interest because I’ve used VxWorks on several projects. Not once do I ever remember anyone saying “I wonder if the TCP/IP stack has security vulnerabilities?” We looked at licensing costs, board support packages, device driver support, tool chain support, ISR service latencies, priority inversion handling, task switch determinacy, etc. Why did we never think of these kinds of potential vulnerabilities? One reason is social proof. Surely all these other companies use VxWorks, it must be good, right? Another reason is VxWorks is often used within a secure perimeter. None of the network interfaces are supposed to be exposed to the internet, so remote code execution is not part of your threat model. But in reality you have no idea if a customer will expose a device to the internet. And you have no idea if later product enhancements will place the device on the internet. Since it seems all network devices expand until they become a router, this seems a likely path to Armageddon. At that point nobody is going to requalify their entire toolchain. That just wouldn’t be done in practice. VxWorks is dangerous because everything is compiled into a single image that boots and runs, much like a unikernel. At least when I used it that was the case. VxWorks is basically just a library you link into your application that provides OS functionality. You write the boot code, device drivers, and other code to make your application work. So if there’s a remote code execution bug it has access to everything. And a lot of these images are built into ROM, so they aren’t upgradeable. And even if the images are upgradeable in EEPROM or flash, how many people will actually do that? Unless you pay a lot of money you do not get the source to VxWorks. You just get libraries and header files. So you have no idea what’s going on in the network stack. 
I’m surprised VxWorks never tested their stack against a fuzzing kind of attack. That’s a great way to find bugs in protocols. Though nobody can define simplicity, many of the bugs were in the handling of the little used TCP Urgent Pointer feature. Anyone surprised that code around this is broken? Who uses it? It shouldn’t be in the stack at all. Simple to say, harder to do.
  • JuliaCon 2019 videos are now available. You might like Keynote: Professor Steven G. Johnson and The Unreasonable Effectiveness of Multiple Dispatch
  • CERN is Migrating to open-source technologies. Microsoft wants too much for their licenses so CERN is giving MS the finger.
  • Memory and Compute with Onur Mutlu:
    • The main problem is that DRAM latency is hardly improving at all. From 1999 to 2017, DRAM capacity has increased by 128x, bandwidth by 20x, but latency only by 1.3x! This means that more and more effort has to be spent tolerating memory latency.  But what could be done to actually improve memory latency?
    • You could “easily” get a 30% latency improvement by having DRAM chips provide a bit more precise information to the memory controller about actual latencies and current temperatures.
    • Another concept to truly break the memory barrier is to move the compute to the memory. Basically, why not put the compute operations in memory?  One way is to use something like High-Bandwidth Memory (HBM) and shorten the distance to memory by stacking logic and memory.
    • Another rather cool (but also somewhat limited) approach is to actually use the DRAM cells themselves as a compute engine. It turns out that you can do copy, clear, and even logic ops on memory rows by using the existing way that DRAMs are built and adding a tiny amount of extra logic.
  • Want to make something in hardware? Like Pebble, Dropcam, or Ring. Who you gonna call? Dragon Innovation. Listen to how on the AMP Hour podcast episode #451 – An Interview with Scott Miller
    • Typical customers build between 5k and 1 million units, but will talk with you at 100 units. Customers usually start small. They’ve built a big toolbox for IoT, so they don’t need to reinvent the wheel every time; they have designs for sensing, processing, electronics on the edge, radios, and all the different security layers. They can deploy quickly with little customization.
    • Dragon is moving into doing the design, manufacturing, packaging, issuing all POs, and installation support. They call this Product as a Service (PaaS): a full end-to-end provider. Say you have a sensor to determine when avocados are ripe; you would pay per sensor per month, or maybe per avocado, instead of a one time sale. They are seeing more non-traditional companies get into the IoT space with different revenue models, so Dragon has an opportunity to innovate on their business model. 
    • Consumer is dying and industrial is growing. A trend they are seeing in the US is a constriction of business to consumer startups in the hardware space, but an expansion of industrial IoT. There have been a bunch of high profile bankruptcies in the consumer space (Anki, Jibo).
    • Europe is growing. Overall huge growth in industrial startups across Europe. Huge number of capable factories in the EU. They get feet on the ground to find and qualify factories. They have over 2000 factories in their database. 75% in China, increasingly more in the EU and the US. 
    • Factories are going global. Seeing a lot of companies driven out of China by the 25% tariffs, moving into Asian pacific countries like Taiwan, Singapore, Vietnam, Indonesia, Malaysia. Coming up quickly, but not up to China’s level yet. Dragon will include RFQs on a global basis, including factories from the US, China, EU, Indonesia, Vietnam, to see what the landed cost is as a function of geography. 
    • Factories are different in different countries. In China factories are vertically integrated. Mold making, injection molding, final assembly and test and packaging, all under one roof. Which is very convenient. In the US and Europe factories are more horizontal. It takes a lot more effort to put together your supply chain.  As an example of the degree they were vertically integrated this factory in China would make their own paint and cardboard. 
    • Automation is huge in China. Chinese labor rates are on average 5 to 6 dollars an hour, depending on region, factory, training. Focus is on automation. One factory they worked with had 100,000 workers now they have 30,000 because of automation.
    • Automation is different in China. Automation in China is bottom-up. They’ll build a simple robot that attaches to a soldering iron and will solder the leads. In the US it is top-down: build a huge fully functioning worker that can do anything instead of a task specific robot. China is really good at building stuff so they build task specific robots to make their processes more efficient. Since products are always changing this allows them to stay nimble. 
    • Also, Strange Parts; Design for Manufacturing Course; How I Made My Own iPhone – in China.
  • BigQuery best practices: Controlling costs: Query only the columns that you need; Don’t run queries to explore or preview table data; Before running queries, preview them to estimate costs; Using the query validator; Use the maximum bytes billed setting to limit query costs; Do not use a LIMIT clause as a method of cost control; Create a dashboard to view your billing data so you can make adjustments to your BigQuery usage. Also consider streaming your audit logs to BigQuery so you can analyze usage patterns; Partition your tables by date; If possible, materialize your query results in stages; If you are writing large query results to a destination table, use the default table expiration time to remove the data when it’s no longer needed; Use streaming inserts only if your data must be immediately available.
  • Boeing has changed a lot over the years. Once upon a time I worked on a project with Boeing and the people were excellent. This is something I heard: “The changes can be attributed to the influence of the McDonnell family who maintain extremely high influence through their stock shares resulting from the merger. It has been gradually getting better recently but still a problem for those inside who understand the real potential impact.”
  • Maybe we are all just random matrices? What Is Universality? It turns out there are deep patterns in complex correlated systems that lie somewhere between randomness and order. They arise from components that interact and repel one another. Do such patterns exist in software systems? Also, Bubble Experiment Finds Universal Laws
  • PID Loops and the Art of Keeping Systems Stable
    • I see a lot of places where control theory is directly applicable but rarely applied. Auto-scaling and placement are really obvious examples, we’re going to walk through some, but another is fairness algorithms. A really common fairness algorithm is how TCP achieves fairness. You’ve got all these network users and you want to give them all a fair slice. Turns out that a PID loop is what’s happening. In system stability, how do we absorb errors, recover from those errors? 
    • Something we do in CloudFront is we run a control system. We’re constantly measuring the utilization of each site and depending on that utilization, we figure out what’s our error, how far are we from optimized? We change the mass or radius of effect of each site, so that at our really busy time of day, really close to peak, it’s servicing everybody in that city, everybody directly around it drawing those in, but that at our quieter time of day can extend a little further and go out. It’s a big system of dynamic springs all interconnected, all with PID loops. It’s amazing how optimal a system like that can be, and how applying a system like that has increased our effectiveness as a CDN provider. 
    • A surprising number of control systems are just like this, they’re just Open Loops. I can’t count the number of customers I’ve gone through control systems with and they told me, “We have this system that pushes out some states, some configuration and sometimes it doesn’t do it.” I find that scary, because what it’s saying is nothing’s actually monitoring the system. Nothing’s really checking that everything is as it should be. My classic favorite example of this as an Open Loop process, is certificate rotation. I happened to work on TLS a lot, it’s something I spent a lot of my time on. Not a week goes by without some major website having a certificate outage.
    • We have two observability systems at AWS, CloudWatch, and X-Ray. One of the things I didn’t appreciate until I joined AWS – I was a bit going on like Charlie and the chocolate factory, and seeing the insides. I expected to see all sorts of cool algorithms and all sorts of fancy techniques and things that I just never imagined. It was a little bit of that, there was some of that once I got inside working, but mostly what I found was really mundane, people were just doing a lot of things at scale that I didn’t realize. One of those things was just the sheer volume of monitoring. The number of metrics we keep on, every single host, every single system, I still find staggering.
    • Exponential Back-off is a really strong example. Exponential Back-off is basically an integral, an error happens and we retry, a second later if that fails, then we wait. Rate limiters are like derivatives, they’re just rate estimators and what’s going on and deciding what’s to let in and what to let out. We’ve built both of these into the AWS SDKs. We’ve got other back pressure strategies too, we’ve got systems where servers can tell clients, “Back off, please, I’m a little busy right now,” all those things working together. If I look at system design and it doesn’t have any of this, if it doesn’t have exponential back-off, if it doesn’t have rate-limiters in some place, if it’s not able to fight some power-law that I think might arise due to errors propagating, that tells me I need to be a bit more worried and start digging deeper.
    • I like to watch out for edge triggering in systems, it tends to be an anti-pattern. One reason is because edge triggering seems to imply a modal behavior. You cross the line, you kick into a new mode, that mode is probably rarely tested and it’s now being kicked into at a time of high stress, that’s really dangerous. Your system has to be idempotent, if you’re going to build an idempotent system, you might as well make a level-triggered system in the first place, because generally, the only benefit of building an edge-triggered system is it doesn’t have to be idempotent.
    • There is definitely tension between stability and optimality, and in general, the more finely-tuned you want to make a system to achieve absolute optimality, the more risk you run of being able to drive it into an unstable state. There are people who do entire PhDs on nothing else than finding that balance for one system. Oil refineries are a good example, where the oil industry will pay people a lot of money just to optimize that, even very slightly. Computer Science, in my opinion, and distributed systems, are nowhere near that level of advanced control theory practice yet. We have a long way to go. We’re still down at the baby steps of, “We’ll at least measure it.”
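To make the closed-loop idea from the talk concrete, here is a minimal sketch of a textbook discrete PID controller in Python. It is not the controller CloudFront or any AWS service uses. The `PID` class, the `simulate` helper, the gains, and the toy "plant" (a measurement that moves halfway toward the control output on every tick) are all assumptions chosen only so the loop is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative, not a production design)."""
    kp: float
    ki: float
    kd: float
    setpoint: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, measured: float, dt: float = 1.0) -> float:
        error = self.setpoint - measured      # P: how far off we are right now
        self.integral += error * dt           # I: accumulated past error
        if self.prev_error is None:
            derivative = 0.0                  # no history on the first sample
        else:
            derivative = (error - self.prev_error) / dt  # D: current trend
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 200, target: float = 0.6) -> float:
    """Run the loop against a hypothetical first-order plant and return the final reading."""
    controller = PID(kp=0.5, ki=0.2, kd=0.1, setpoint=target)  # arbitrary demo gains
    measured = 0.0
    for _ in range(steps):
        control = controller.update(measured)   # measure first, then act: the loop is closed
        measured += 0.5 * (control - measured)  # toy plant: lags halfway toward the output
    return measured
```

Calling `simulate()` settles very close to the 0.6 target with these gains. The point of the sketch is the shape of the loop: every tick re-measures the system and corrects against the desired level, which is the opposite of the "push it out and hope" Open Loop the talk warns about. More aggressive gains would get there faster but risk oscillation, which is the stability versus optimality tension described above.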
  • Re:Inforce 2019 videos are now available.
  • Top Seven Myths of Robust Systems: The number one myth we hear out in the field is that if a system is unreliable, we can fix that with redundancy; rather than trying to simplify or remove complexity, learn to live with it. Ride complexity like a wave. Navigate the complexity; The adaptive capacity to improvise well in the face of a potential system failure comes from frequent exposure to risk; Both sides — the procedure-makers and the procedure-not-followers — have the best of intentions, and yet neither is likely to believe that about the other; Unfortunately it turns out catastrophic failures in particular tend to be a unique confluence of contributing factors and circumstances, so protecting yourself from prior outages, while it shouldn’t hurt, also doesn’t help very much; Best practices aren’t really a knowable thing; Don’t blame individuals. That’s the easy way out, but it doesn’t fix the system. Change the system instead. 
  • They grow up so slow. What’s new in JavaScript: Google I/O 2019 Summary
  • From a rough calculation we saw about 40% decrease in the amount of CPU resources used. Overall, we saw latency stabilize for both avg and max p99. Max p99 latency also decreased a bit. Safely Rewriting Mixpanel’s Highest Throughput Service in Golang. Mixpanel moved from Python to Go for their data collection API. They had already migrated the Python API to use the Google Load Balancer to route messages to a Kubernetes pod on Google Cloud where an Envoy container load-balanced between eight Python API containers. The Python API containers then submitted the data to Google Pubsub queue via a pubsub sidecar container that had a kestrel interface. To enable testing against live traffic, we created a dedicated setup. The setup was a separate kubernetes pod running in the same namespace and cluster as the API deployments. The pod ran an open source API correctness tool, Diffy, along with copies of the old and new API services. Diffy is a service that accepts HTTP requests, and forwards them to two copies of an existing HTTP service and one copy of a candidate HTTP service. One huge improvement is we only need to run a single API container per pod. 
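Why two copies of the existing service? A rough sketch of the comparison idea, in plain Python rather than Diffy's own code: fields that differ between the two old copies are treated as noise (request IDs, timestamps), and only differences that show up in the candidate beyond that noise are reported. The function names, the dictionary-shaped responses, and the `old_api` / `new_api` stand-ins are hypothetical.

```python
import itertools
from typing import Any, Callable, Dict, Set

Response = Dict[str, Any]
Service = Callable[[Dict[str, Any]], Response]

_MISSING = object()  # distinguishes "key absent" from "value is None"


def differing_fields(a: Response, b: Response) -> Set[str]:
    """Top-level keys whose values differ between two responses."""
    return {k for k in a.keys() | b.keys() if a.get(k, _MISSING) != b.get(k, _MISSING)}


def regressions(request: Dict[str, Any], primary: Service,
                secondary: Service, candidate: Service) -> Set[str]:
    """Fields where the candidate disagrees with the old service, minus known noise."""
    expected = primary(request)
    noise = differing_fields(expected, secondary(request))  # old vs old: nondeterminism
    raw = differing_fields(expected, candidate(request))    # old vs new: noise + real diffs
    return raw - noise


# Hypothetical stand-ins for the old and new API containers.
_ids = itertools.count()


def old_api(request: Dict[str, Any]) -> Response:
    return {"status": 1, "accepted": len(request["events"]), "request_id": next(_ids)}


def new_api(request: Dict[str, Any]) -> Response:
    # Deliberate bug for the demo: one event is dropped.
    return {"status": 1, "accepted": len(request["events"]) - 1, "request_id": next(_ids)}
```

With a two-event request, `regressions(req, old_api, old_api, new_api)` flags only `accepted`; the ever-changing `request_id` is filtered out because the two old copies already disagree on it.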
  • Satisfactory: Network Optimizations: It would be a big gain to stop replicating the inventory when it’s not viewed, which is essentially what we did, but the method of doing so was a bit complicated and required a lot of rework…Doing this also helps to reduce CPU time, as an inventory is a big state to compare, and look for changes in. If we can reduce that to a maximum of 4x the number of players it is a huge gain, compared to the hundreds, if not thousands, that would otherwise be present in a big base…There is, of course, a trade-off. As I mentioned there is a chance the inventory is not there when you first open to view it, as it has yet to arrive over the network…In this case the old system actually varied in size but landed around 48 bytes per delta, compared to the new system of just 3 bytes…On top of this, we also reduced how often a conveyor tries to send an update to just 3 times a second compared to the previous of over 20…the accuracy of item placements on the conveyors took a small hit, but we have added complicated systems in order to compensate for that…we’ve noticed that the biggest issue for running smooth multiplayer in large factories is not the network traffic anymore, it’s rather the general performance of the PC acting as a server.
  • MariaDB vs MySQL Differences: MariaDB is fully GPL licensed while MySQL takes a dual-license approach. Each handles thread pools in a different way. MariaDB supports a lot of different storage engines. In many scenarios, MariaDB offers improved performance.
  • Our pySpark pipeline churns through tens of billions of rows on a daily basis. Calculating 30 billion speed estimates a week with Apache Spark: Probes generated from the traces are matched against the entire world’s road network. At the end of the matching process we are able to assign each trace an average speed, a 5 minute time bucket and a road segment. Matches on the same road that fall within the same 5 minute time bucket are aggregated to create a speed histogram. Finally, we estimate a speed for each aggregated histogram which represents our prediction of what a driver will experience on a road at a given time of the week…On a weekly basis, we match on average 2.2 billion traces to 2.3 billion roads to produce 5.4 billion matches. From the matches, we build 51 billion speed histograms to finally produce 30 billion speed estimates…The first thing we spent time on was designing the pipeline and schemas of all the different datasets it would produce. In our pipeline, each pySpark application produces a dataset persisted in a hive table readily available for a downstream application to use…Instead of having one pySpark application execute all the steps (map matching, aggregation, speed estimation, etc.) we isolated each step to its own application…We favored normalizing our tables as much as possible and getting to the final traffic profiles dataset through relationships between relevant tables…Partitioning makes querying part of the data faster and easier. We partition all the resulting datasets by both a temporal and spatial dimension. 
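The aggregation step described above (group matches by road segment and 5 minute time-of-week bucket, then reduce each group to one speed) can be sketched in plain Python. This is not the authors' pySpark code: the tuple layout, the function names, the use of a raw list of speeds as the "histogram", and the median as the estimator are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

BUCKET_SECONDS = 5 * 60
WEEK_SECONDS = 7 * 24 * 60 * 60

Match = Tuple[str, int, float]  # (road_segment_id, epoch_seconds, speed_kmh), hypothetical
Key = Tuple[str, int]           # (road_segment_id, bucket index within the week)


def week_bucket(epoch_seconds: int) -> int:
    """Index of the 5 minute bucket within a repeating week (0..2015).

    Bucket 0 starts wherever the epoch week starts (Thursday 00:00 UTC for Unix time).
    """
    return (epoch_seconds % WEEK_SECONDS) // BUCKET_SECONDS


def build_histograms(matches: Iterable[Match]) -> Dict[Key, List[float]]:
    """Collect the observed speeds for every (segment, bucket) pair."""
    histograms: Dict[Key, List[float]] = defaultdict(list)
    for segment, timestamp, speed in matches:
        histograms[(segment, week_bucket(timestamp))].append(speed)
    return dict(histograms)


def estimate_speeds(histograms: Dict[Key, List[float]]) -> Dict[Key, float]:
    """Reduce each group to a single estimate; the median is an assumed choice."""
    return {key: median(speeds) for key, speeds in histograms.items()}
```

Because the bucket index wraps every week, observations from different weeks land in the same group, which is what lets the estimate describe "a given time of the week". The same key also suggests why partitioning by a temporal and a spatial dimension is a natural fit for the resulting tables.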
  • Do not read this unless you can become comfortable with the feeling that everything you’ve done in your life is trivial and vainglorious. Morphogenesis for the Design of Design
    • One of my students built and runs all the computers Facebook runs on, one of my students used to run all the computers Twitter runs on—this is because I taught them to not believe in computer science. In other words, their job is to take billions of dollars, hundreds of megawatts, and tons of mass, and make information while also not believing that the digital is abstracted from the physical. Some of the other things that have come out from this lineage were the first quantum computations, or microfluidic computing, or part of creating some of the first minimal cells.
    • The Turing machine was never meant to be an architecture. In fact, I’d argue it has a very fundamental mistake, which is that the head is distinct from the tape. And the notion that the head is distinct from the tape—meaning, persistence of tape is different from interaction—has persisted. The computer in front of Rod Brooks here is spending about half of its work just shuttling from the tape to the head and back again.
    • There’s a whole parallel history of computing, from Maxwell to Boltzmann to Szilard to Landauer to Bennett, where you represent computation with physical resources. You don’t pretend digital is separate from physical. Computation has physical resources. It has all sorts of opportunities, and getting that wrong leads to a number of false dichotomies that I want to talk through now. One false dichotomy is that in computer science you’re taught many different models of computation and adherence, and there’s a whole taxonomy of them. In physics there’s only one model of computation: A patch of space occupies space, it takes time to transit, it stores state, and states interact—that’s what the universe does. Anything other than that model of computation is physics and you need epicycles to maintain the fiction, and in many ways that fiction is now breaking.
    • We did a study for DARPA of what would happen if you rewrote from scratch a computer software and hardware so that you represented space and time physically.
    • One of the places that I’ve been involved in pushing that is in exascale high-performance computing architecture, really just a fundamental do-over to make software look like hardware and not to be in an abstracted world.
    • Digital isn’t ones and zeroes. One of the hearts of what Shannon did is threshold theorems. A threshold theorem says I can talk to you as a wave form or as a symbol. If I talk to you as a symbol, if the noise is above a threshold, you’re guaranteed to decode it wrong; if the noise is below a threshold, for a linear increase in the physical resources representing the symbol there’s an exponential reduction in the fidelity to decode it. That exponential scaling means unreliable devices can operate reliably. The real meaning of digital is that scaling property. But the scaling property isn’t one and zero; it’s the states in the system. 
    • If you mix chemicals and make a chemical reaction, a yield of a part per 100 is good. When the ribosome (the molecular assembler that makes your proteins) elongates, it makes an error of one in 10^4. When DNA replicates, it adds one extra error-correction step, and that makes an error rate of 10^-8, and that’s exactly the scaling of threshold theorem. The exponential complexity that makes you possible is by error detection and correction in your construction. It’s everything Shannon and von Neumann taught us about codes and reconstruction, but it’s now doing it in physical systems.
    • One of the projects I’m working on in my lab that I’m most excited about is making an assembler that can assemble assemblers from the parts that it’s assembling—a self-reproducing machine. What it’s based on is us. 
    • If you look at scaling coding construction by assembly, ribosomes are slow (they run at one hertz, one amino acid a second), but a cell can have a million, and you can have a trillion cells. As you were sitting here listening, you’re placing 10^18 parts a second, and it’s because you can ring up this capacity of assembling assemblers. The heart of the project is the exponential scaling of self-reproducing assemblers.
    • As we work on the self-reproducing assembler, and writing software that looks like hardware that respects geometry, they meet in morphogenesis. This is the thing I’m most excited about right now: the design of design. Your genome doesn’t store anywhere that you have five fingers. It stores a developmental program, and when you run it, you get five fingers. It’s one of the oldest parts of the genome. Hox genes are an example. It’s essentially the only part of the genome where the spatial order matters. It gets read off as a program, and the program never represents the physical thing it’s constructing. The morphogenes are a program that specifies morphogens that do things like climb gradients and symmetry break; it never represents the thing it’s constructing, but the morphogens then following the morphogenes give rise to you.
    • What’s going on in morphogenesis, in part, is compression. A billion bases can specify a trillion cells, but the more interesting thing that’s going on is almost anything you perturb in the genome is either inconsequential or fatal. The morphogenes are a curated search space where rearranging them is interesting—you go from gills to wings to flippers. The heart of success in machine learning, however you represent it, is function representation. The real progress in machine learning is learning representation. 
    • We’re at an interesting point now where it makes as much sense to take seriously that scaling as it did to take Moore’s law scaling in 1965 when he made his first graph. We started doing these FAB labs just as outreach for NSF, and then they went viral, and they let ordinary people go from consumers to producers. It’s leading to very fundamental things about what is work, what is money, what is an economy, what is consumption.
    • Looking at exactly this question of how a code and a gene give rise to form. Turing and von Neumann both completely understood that the interesting place in computation is how computation becomes physical, how it becomes embodied and how you represent it. That’s where they both ended their life. That’s neglected in the canon of computing.
    • If I’m doing morphogenesis with a self-reproducing system, I don’t want to then just paste in some lines of code. The computation is part of the construction of the object. I need to represent the computation in the construction, so it forces you to be able to overlay geometry with construction.
    • Why align computer science and physical science? There are at least five reasons for me. Only lightly is it philosophical. It’s the cracks in the matrix. The matrix is cracking. 1) The fact that whoever has their laptop open is spending about half of its resources shuttling information from memory transistors to processor transistors even though the memory transistors have the same computational power as the processor transistors is a bad legacy of the EDVAC. It’s a bit annoying for the computer, but when you get to things like an exascale supercomputer, it breaks. You just can’t maintain the fiction as you push the scaling. In very large-scale computing, the cost of maintaining the fiction so the programmers can pretend it’s not true is getting so painful that you need to redo it. In fact, if you look down in the trenches, things like emerging ways to do very large-scale GPU programming are beginning to inch in that direction. So, it’s breaking in performance.
    •  What’s interesting is a lot of the things that are hard—for example, in parallelization and synchronization—come for free. By representing time and space explicitly, you don’t need to do the annoying things like thread synchronization and all the stuff that goes into parallel programming.
    • Communication degraded with distance. Along came Shannon. We now have the Internet. Computation degraded with time. The last great analog computer work was Vannevar Bush’s differential analyzer. One of the students working on it was Shannon. He was so annoyed that he invented our modern digital notions in his Master’s thesis to get over the experience of working on the differential analyzer.
    • When you merge communication with computation with fabrication, it’s not that there’s a duopoly of communication and computation and then over here is manufacturing; they all belong together. The heart of how we work is this trinity of communication plus computation and fabrication, and for me the real point is merging them.
    • I almost took over running research at Intel. It ended up being a bad idea on both sides, but when I was talking to them about it, I was warned off. It was like the godfather: “You can do that other stuff, but don’t you dare mess with the mainline architecture.” We weren’t allowed to even think about that. In defense of them, it’s billions and billions of dollars investment. It was a good multi-decade reign. They just weren’t able to do it. 
    • Again, the embodiment of everything we’re talking about, for me, is the morphogenes—the way evolution searches for design by coding for construction. And they’re the oldest part of the genome. They were invented a very long time ago and nobody has messed with them since.
    • Get over digital and physical are separate; they can be united. Get over analog as separate from digital; there’s a really profound place in between. We’re at the beginning of fifty years of Moore’s law but for the physical world. We didn’t talk much about it, but it has the biggest impact of anything I know if anybody can make anything.
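
The scaling figures quoted in the excerpt above can be checked with a few lines of arithmetic. This is only a sketch: the rates and counts are the round numbers from the quote, and the error model (each extra proofreading pass independently misses an error with the same probability as the base step) is an assumption made here for illustration, not something the source states.

```python
from fractions import Fraction

# Round numbers quoted in the excerpt above.
RIBOSOME_RATE_HZ = 1          # one amino acid per second per ribosome
RIBOSOMES_PER_CELL = 10**6    # "a cell can have a million"
CELLS_PER_BODY = 10**12       # "you can have a trillion cells"


def parts_per_second(rate_hz, per_cell, cells):
    """Total placement rate when every assembler works in parallel."""
    return rate_hz * per_cell * cells


def error_after_checks(base_error, checks):
    """Error rate after `checks` extra proofreading passes.

    Assumption (not from the source): every pass fails independently
    with the same probability as the base step, so rates multiply.
    """
    return base_error ** (checks + 1)


total = parts_per_second(RIBOSOME_RATE_HZ, RIBOSOMES_PER_CELL, CELLS_PER_BODY)
ribosome_error = Fraction(1, 10**4)
dna_error = error_after_checks(ribosome_error, checks=1)

print(total == 10**18)                   # slow parts, massive parallelism
print(dna_error == Fraction(1, 10**8))   # one extra check squares the rate
```

Under that independence assumption, a linear increase in effort (one more checking step) buys an exponential drop in error, which is the shape of the threshold scaling the quote describes.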

Soft Stuff:

  • paypal/hera (article): Hera multiplexes connections for MySQL and Oracle databases. It supports sharding the databases for horizontal scaling. It is a data access gateway that PayPal uses to scale database access for hundreds of billions of SQL queries per day. Additionally, HERA improves database availability through sophisticated protection mechanisms and provides application resiliency through transparent traffic failover. HERA is now available outside of PayPal as an Apache 2-licensed project.
  • zerotier/lf: a fully decentralized fully replicated key/value store. LF is built on a directed acyclic graph (DAG) data model that makes synchronization easy and allows many different security and conflict resolution strategies to be used. One way to think of LF’s DAG is as a gigantic conflict-free replicated data type (CRDT). Proof of work is used to rate limit writes to the shared data store on public networks and as one thing that can be taken into consideration for conflict resolution. 
  • pahud/fargate-fast-autoscaling: This reference architecture demonstrates how to build AWS Fargate workload that can detect the spiky traffic in less than 10 seconds followed by an immediate horizontal autoscaling.
  • ailidani/paxi: Paxi is the framework that implements WPaxos and other Paxos protocol variants. Paxi provides most of the elements that any Paxos implementation or replication protocol needs, including network communication, state machine of a key-value store, client API and multiple types of quorum systems.

Pub Stuff:

from High Scalability

How to Architect APIs for Scale and Security


We hope you’ve enjoyed reading our posts on best practices for your serverless applications. This series of posts will focus on best practices and concepts you should be familiar with when you architect APIs for your applications. We’ll kick this first post off with a comparison between REST and GraphQL API architectures.


Developers have been creating RESTful APIs for a long time, typically using HTTP methods such as GET, POST, and DELETE to perform operations against the API. Amazon API Gateway is designed to make it easy for developers to create APIs at any scale without managing any servers. API Gateway handles all of the heavy lifting, including traffic management, security, monitoring, and version/environment management.

GraphQL APIs are relatively new, with a primary design goal of allowing clients to define the structure of the data that they require. AWS AppSync allows you to create flexible APIs that access and combine multiple data sources.


REST APIs

Architecting a REST API is structured around creating combinations of resources and methods. Resources are paths that are present in the request URL, and methods are HTTP actions that you take against the resource. For example, you may define a resource called “cart”. The cart resource can respond to HTTP POSTs for adding items to a shopping cart or HTTP GETs for retrieving the items in your cart. With API Gateway, you would implement the API like this:

Behind the scenes, you can integrate with nearly any backend to provide the compute logic, data persistence, or business work flows.  For example, you can configure an AWS Lambda function to perform the addition of an item to a shopping cart (HTTP POST).  You can also use API Gateway to directly interact with AWS services like Amazon DynamoDB.  An example is using API Gateway to retrieve items in a cart from DynamoDB (HTTP GET).
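
As a rough illustration of the Lambda side of such an integration, the sketch below handles both methods on the cart resource. It assumes the Lambda proxy integration event shape for REST APIs (`httpMethod`, `queryStringParameters`, `body`); the in-memory `CARTS` dictionary, the `item` field in the request body, and the `default` cart ID are hypothetical stand-ins, not part of the original post.

```python
import json

# Hypothetical in-memory store standing in for a real table (e.g. DynamoDB).
CARTS = {}


def _response(status, payload):
    """Build a proxy-integration style response with a JSON body."""
    return {"statusCode": status, "body": json.dumps(payload)}


def handler(event, context):
    """Handle POST /cart (add an item) and GET /cart (list items)."""
    method = event.get("httpMethod")
    params = event.get("queryStringParameters") or {}
    cart_id = params.get("cartId", "default")

    if method == "POST":
        body = json.loads(event.get("body") or "{}")
        item = body.get("item")
        if item is None:
            return _response(400, {"error": "missing item"})
        CARTS.setdefault(cart_id, []).append(item)
        return _response(201, {"cartId": cart_id, "items": CARTS[cart_id]})

    if method == "GET":
        return _response(200, {"cartId": cart_id, "items": CARTS.get(cart_id, [])})

    # Any other verb is not defined on this resource.
    return _response(405, {"error": "method not allowed"})
```

A production function would persist to a durable store rather than module-level state, since Lambda execution environments are not guaranteed to be reused between invocations.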

RESTful APIs tend to use path and query parameters to inject dynamic values into APIs. For example, if you want to retrieve a specific cart with an ID of abcd123, you could design the API to accept a query or path parameter that specifies the cartId:

/cart?cartId=abcd123 or /cart/abcd123
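
To make the two URL shapes concrete, here is a small sketch that pulls the cart ID out of either form using only the Python standard library. The function name `extract_cart_id` is hypothetical, and API Gateway performs this mapping for you from the resource definition; the code only illustrates the difference between the two styles.

```python
from urllib.parse import urlparse, parse_qs


def extract_cart_id(url):
    """Return the cart ID from /cart?cartId=<id> or /cart/<id>, else None."""
    parsed = urlparse(url)

    # Query-parameter style: /cart?cartId=abcd123
    query = parse_qs(parsed.query)
    if "cartId" in query:
        return query["cartId"][0]

    # Path-parameter style: /cart/abcd123
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) == 2 and segments[0] == "cart":
        return segments[1]

    return None


print(extract_cart_id("/cart?cartId=abcd123"))
print(extract_cart_id("/cart/abcd123"))
```

Note that the path style treats any second segment as an ID, so a real router would need to match fixed sub-resources before falling back to the parameterized path.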

Finally, when you need to add functionality to your API, the typical approach would be to add additional resources.  For example, to add a checkout function, you could add a resource called /cart/checkout.

GraphQL APIs

Architecting GraphQL APIs is not structured around resources and HTTP verbs. Instead, you define your data types and configure where the operations will retrieve data through a resolver. An operation is either a query or a mutation. Queries simply retrieve data, while mutations are used when you want to modify data. If we use the same example from above, you could define a cart data type as follows:

type Cart {
  cartId: ID!
  customerId: String
  name: String
  email: String
  items: [String]
}


Next, you configure the fields in the Cart to map to specific data sources. AppSync is then responsible for executing resolvers to obtain appropriate information. Your client will send an HTTP POST to the AppSync endpoint with the exact shape of the data they require. AppSync is responsible for executing all configured resolvers to obtain the requested data and return a single response to the client.
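
The idea of per-field resolvers can be sketched in a few lines. This is a toy model, not how AppSync is implemented or configured: the `CUSTOMERS` and `CART_ITEMS` dictionaries, the sample values, and the `get_cart` function are all hypothetical, and real resolvers would call out to data sources such as DynamoDB or Lambda.

```python
# Hypothetical data sources keyed by cart ID.
CUSTOMERS = {
    "abcd123": {"customerId": "c-1", "name": "Ana", "email": "ana@example.com"},
}
CART_ITEMS = {"abcd123": ["book", "pen"]}


def customer_field(field):
    """Build a resolver that reads one field from the customer source."""
    return lambda cart_id: CUSTOMERS[cart_id][field]


# One resolver per field of the Cart type.
RESOLVERS = {
    "cartId": lambda cart_id: cart_id,
    "customerId": customer_field("customerId"),
    "name": customer_field("name"),
    "email": customer_field("email"),
    "items": lambda cart_id: CART_ITEMS[cart_id],
}


def get_cart(cart_id, requested_fields):
    """Run only the resolvers for the fields the client asked for."""
    return {field: RESOLVERS[field](cart_id) for field in requested_fields}


print(get_cart("abcd123", ["name", "items"]))
```

The point of the sketch is that the items source is never touched unless the client asks for `items`, and the client still receives one combined response.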


With GraphQL, the client can change their query to specify the exact data that is needed. Consider two getCart queries that ask for different sets of information. The first query asks for all of the static customer information (customerId, name, email) and a list of items in the cart. The second query asks only for the customer’s static information. Based on the incoming query, AppSync will execute the correct resolver(s) to obtain the data. The client submits the payload via an HTTP POST to the same endpoint in both cases. The payload of the POST body is the only thing that changes.
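
The two queries described above might look like the following. This is a sketch: the field names come from the Cart type, while the `cartId` argument to getCart and the sample ID are assumptions. The JSON body with a top-level `query` key is the usual convention for GraphQL over HTTP; no request is sent here.

```python
import json

# First query: customer details plus the list of items in the cart.
FULL_CART_QUERY = """
query {
  getCart(cartId: "abcd123") {
    customerId
    name
    email
    items
  }
}
"""

# Second query: only the customer's static information.
CUSTOMER_ONLY_QUERY = """
query {
  getCart(cartId: "abcd123") {
    customerId
    name
    email
  }
}
"""


def build_post_body(query):
    """Wrap a GraphQL document in the JSON body sent via HTTP POST."""
    return json.dumps({"query": query})


for document in (FULL_CART_QUERY, CUSTOMER_ONLY_QUERY):
    print(build_post_body(document))
```

Both bodies would go to the same endpoint with the same HTTP method; only the document inside the `query` field differs.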

As we saw above, a REST-based implementation would require the API to define multiple HTTP resources and methods or path/query parameters to accomplish this.

AppSync also provides other powerful features that are not possible with REST APIs such as real-time data synchronization and multiple methods of authentication at the field and operation level.


As you can see, these are two different approaches to architecting your API. In our next few posts, we’ll cover specific features and architecture details you should be aware of when choosing between API Gateway (REST) and AppSync (GraphQL) APIs. In the meantime, you can read more about working with API Gateway and AppSync.

About the Author

George Mao is a Specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. George is responsible for helping customers design and operate Serverless applications using services like Lambda, API Gateway, Cognito, and DynamoDB. He is a regular speaker at AWS Summits, re:Invent, and various tech events. George is a software engineer and enjoys contributing to open source projects, delivering technical presentations at technology events, and working with customers to design their applications in the Cloud. George holds a Bachelor of Computer Science and Masters of IT from Virginia Tech.

from AWS Architecture Blog

Stuff The Internet Says On Scalability For July 26th, 2019


Wake up! It’s HighScalability time—once again:

The Apollo 11 guidance computer repeatedly crashed on descent. On Earth, computer scientists had just 13 hours to debug the problem. They did. It was CPU overload because of a wrong setting. Some things never change!

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 52 mostly 5 star reviews (120 on Goodreads). They’ll learn a lot and hold you in even greater awe.

Number Stuff:

  • $11 million: Google fine for discriminating against not young people. 
  • 55,000+: human-labeled 3D annotated frames, a drivable surface map, and an underlying HD spatial semantic map in Lyft’s Level 5 Dataset.
  • 645 million: LinkedIn members with 4.5 trillion daily messages pumping through Kafka.
  • 49%: drop in Facebook’s net income. A fine result.
  • 50 ms: repeatedly randomizing elements of the code that attackers need access to in order to compromise the hardware. 
  • 7.5 terabytes: hacked Russian data.
  • 5%: increase in Tinder shares after bypassing Google’s 30% app store tax.
  • $21.7 billion: Apple’s profit from other people’s apps.
  • 5 billion: records in 102 datasets in the UCR STAR spatio-temporal index.
  • 200x: speediest quantum operation yet. 
  • 45%: US Fortune 500 companies founded by immigrants and their kids.
  • 70%: hard drive failures caused by media damage, including full head crashes. 
  • 12: lectures on everything Buckminster Fuller knew.
  • 600,000: satellite images taken in a single day used to create a picture of Earth
  • 149: hours between Airbus reboots needed to mask software problems. 

Quotable Stuff:

  • @mdowd: Now that Alan Turing is on the 50 pound note, they should rename ATMs to “Turing Machines”
  • @hacks4pancakes: Another real estate person: “I tried using a password manager, but then my credit card was stolen and my annual payment for it failed – and they cut off access to all my passwords during a meeting.”
  • @juliaferraioli: “Our programs were more complex because Go was so simple” — @_rsc on reshaping #Golang at #gophercon
  • Jason Shen: A YC founder once said to me that he found little correlation between the success of a YC company and how hard their founders worked. That is to say, among a group of smart, ambitious entrepreneurs who were all already working pretty hard, the factors that made the biggest difference were things like timing, strategy, and relationships. Which is why Reddit cofounder-turned-venture capitalist Alexis Ohanian now warns against the “utter bullshit” of this so-called hustle porn mentality.
  • Dale Markowitz: A successful developer today needs to be as good at writing code as she is at extinguishing that code when it explodes.
  • @allspaw: I’m with @snowded on this. Taleb’s creation of ‘antifragile’ is what the Resilience Engineering community has referred to as resilience all along.
  • Timothy Lee: Heath also said that the interviewer assumed that the word “byte” meant eight bits. In his view, this also revealed age bias. Modern computer systems use 8-bit bytes, but older computer systems could have byte sizes ranging from six to 40 bits.
  • General Patton: If everybody is thinking alike, somebody isn’t thinking
  • panpanna: Architecturally, microkernels and unikernels are direct opposites. Unikernels strive to minimize communication complexity (and size and footprint but that is not relevant to this discussion) by putting everything in the same address space. This gives them many advantages among which performance is often mentioned but ease of development is IMHO equally important. However, the two are not mutually exclusive. Unikernels often run on top of microkernels or hypervisors.
  • Wayne Ma: The team ultimately put a stop to most [Apple] device leaks—and discovered some audacious attempts, such as some factory workers who tried to build a tunnel to transport components to the outside without security spotting them.
  • Dr. Steven Gundry: [We have,] I think, uploaded most of our information processing to our bacterial cloud that lives in us, on us, around us, because they have far more genes than we do. They reproduce virtually instantaneously, and so they can do fantastic information processing. Many of us think that perhaps lifeforms on Earth, particularly animal lifeforms, exist as a home for bacteria to prosper on Earth.
  • @UdiDahan: I’ve been gradually coming around to the belief that any “good” code base lives long enough for the environment around it to change in such a way that its architecture is no longer suitable, making it then “bad”. This would probably be as equally true of FP as of OOP.
  • @DmitryOpines: No one “runs” a crypto firm Holger, we are merely the mortal agents through whose minor works the dream of disaggregated ledger currency manifests on this most unworthy of Prime Material Planes.
  • Philip Ball: One of the most remarkable ideas in this theoretical framework is that the definite properties of objects that we associate with classical physics — position and speed, say — are selected from a menu of quantum possibilities in a process loosely analogous to natural selection in evolution: The properties that survive are in some sense the “fittest.” As in natural selection, the survivors are those that make the most copies of themselves. This means that many independent observers can make measurements of a quantum system and agree on the outcome — a hallmark of classical behavior.
  • @mikko: Rarely is anyone thanked for the work they did to prevent the disaster that didn’t happen.
  • David Rosenthal: Back in 1992 Robert Putnam et al published Making democracy work: civic traditions in modern Italy, contrasting the social structures of Northern and Southern Italy. For historical reasons, the North has a high-trust structure whereas the South has a low-trust structure. The low-trust environment in the South had led to the rise of the Mafia and persistent poor economic performance. Subsequent effects include the rise of Silvio Berlusconi. Now, in The Internet Has Made Dupes-And Cynics-Of Us All, Zeynep Tufekci applies the same analysis to the Web.
  • Diego Basch: So here is an obvious corollary. Like I just mentioned, if you have an idea for an interesting gadget you will move to a place like Shenzhen or Hong Kong or Taipei. You will build a prototype, prove your concept, work with a manufacturer to iterate your design until it’s mature enough to mass-produce. Either you will bootstrap the business or you will partner with someone local to fund it, because VCs won’t give you the time of the day. Now, let’s say hardware is not your cup of tea and you want to build services. Why be in Silicon Valley at all?
  • Buckminster Fuller~ You derive data by segregating; You derive principles by integrating; Without data, you cannot calculate; Without calculations, you cannot generalize; Without generalizations, you cannot design; Without designs, you cannot discover; Without discoveries, you cannot derive new data…Segregation and integration are not opposed: they are complementary and interdependent. Striving to be a specialist OR a generalist is counterproductive; the aim is to be COMPREHENSIVE!
  • @jessfraz: I see a lot of debates about “open core”. To me the premise behind it is “we will open this part of our software but you gotta take care of supporting it yourself.” Then they charge for support. Except the problem was some other people *cough* clouds *cough* beat them to it.
  • @greglinden: Tech companies consistently get this wrong, thinking this is a simple black-and-white ML classification problem, spam or not spam, false or not false. Disinformation exploits that by being just ambiguous enough to not get classified as false. It’s harder than that.
  • Brent Ozar: The ultimate, #1, primary, existential, responsibility of a DBA – for which all other responsibilities pale in comparison – is to implement database backup and restore processing adequate to support the business’s acceptable level of data loss.
  • Alex Hern: A dataset with 15 demographic attributes, for instance, “would render 99.98% of people in Massachusetts unique”. And for smaller populations, it gets easier: if town-level location data is included, for instance, “it would not take much to reidentify people living in Harwich Port, Massachusetts, a city of fewer than 2,000 inhabitants”.
  • Memory Guy: Our forecasts find that 3D XPoint Memory’s sub-DRAM prices will drive that technology’s revenues to over $16 billion by 2029, while stand-alone MRAM and STT-RAM revenues will approach $4 billion — over one hundred seventy times MRAM’s 2018 revenues.  Meanwhile, ReRAM and MRAM will compete to replace the bulk of embedded NOR and SRAM in SoCs, to drive even greater revenue growth. This transition will boost capital spending, increasing the spend for MRAM alone by thirty times to $854 million in 2029.
  • @unclebobmartin: John told me he considered FP a failure because, to paraphrase him, FP made it simple to do hard things but almost impossible to do simple things.
  • Dr. Neil J. Gunther: All performance is nonlinear.
  • @mathiasverraes: I wish we’d stop debating OOP vs FP, and started debating individual paradigms. Immutability, encapsulation, global state, single assignment, actor model, pure functions, IO in the type system, inheritance, composition… all of these are perfectly possible in either OOP or FP.
  • Troy Hunt: “1- All those servers were compromised. They were either running standalone VPSs or cpanel installations. 2- Most of them were running WordPress or Drupal (I think only 2 were not running any of the two). 3- They all had a malicious cron.php running”
  • Gartner: AWS makes frequent proclamations about the number of price reductions it has made. Customers interpret these proclamations as being applicable to the company’s services broadly, but this is not the case. For instance, the default and most frequently provisioned storage for AWS’s compute service has not experienced a price reduction since 2014, despite falling prices in the market for the raw components.
  • mcguire: Speaking as someone who has done a fair number of rewrites as well as watching rewrites fail, conventional wisdom is somewhat wrong. 1. Do a rewrite. Don’t try to add features, just replace the existing functionality. Avoid a moving target. 2. Rewrite the same project. Don’t redesign the database schema at the same time you are rewriting. Try to keep the friction down to a manageable level. 3. Incremental rewrites are best. Pick part of the project, rewrite and release that, then get feedback while you work on rewriting the next chunk.
  • Atlassian: Isolating context/state management association to a single point is very helpful. This was reinforced at Re:Invent 2018 where a remarkably high amount of sessions had a slide of “then we have service X which manages tenant → state routing and mapping”.
  • taxicabjesus: I have a ~77 year old friend who was recently telling me about going to Buckminster Fuller’s lectures at his engineering university, circa 1968. He quoted Mr. Fuller as saying something like, “entropy takes things apart, life puts them back together.”
  • Daniel Abadi: PA/EC systems sound good in theory, but are not particularly useful in practice. Our one example of a real PA/EC system — Hazelcast — has spent the past 1.5 years introducing features that are designed for alternative PACELC configurations — specifically PC/EC and PA/EL configurations. PC/EC and PA/EL configurations are a more natural cognitive fit for an application developer. Either the developer can be certain that the underlying system guarantees consistency in all cases (the PC/EC configuration) in which case the application code can be significantly simplified, or the system makes no guarantees about consistency at all (the PA/EL configuration) but promises high availability and low latency. CRDTs and globally unique IDs can still provide limited correctness guarantees despite the lack of consistency guarantees in PA/EL configurations.
  • Simone de Beauvoir: Then why “swindled”? When one has an existentialist view of the world, like mine, the paradox of human life is precisely that one tries to be and, in the long run, merely exists. It’s because of this discrepancy that when you’ve laid your stake on being—and, in a way you always do when you make plans, even if you actually know that you can’t succeed in being—when you turn around and look back on your life, you see that you’ve simply existed. In other words, life isn’t behind you like a solid thing, like the life of a god (as it is conceived, that is, as something impossible). Your life is simply a human life.
  • SkyPuncher: I think Netflix is the perfect example of where being data driven completely fails. If you listen to podcasts with important Netflix people everything you hear is about how they experiment and use data to decide what to do. Every decision is based on some data point.  At the end of the day, they just continue to add features that create short term payoffs and long term failures. Pennywise and pound foolish.
  • Frank Wilczek: I don’t think a singularity is imminent, although there has been quite a bit of talk about it. I don’t think the prospect of artificial intelligence outstripping human intelligence is imminent because the engineering substrate just isn’t there, and I don’t see the immediate prospects of getting there. I haven’t said much about quantum computing, other people will, but if you’re waiting for quantum computing to create a singularity, you’re misguided. That crossover, fortunately, will take decades, if not centuries.

Useful Stuff:

  • What an incredible series. Apollo 11: 13 Minutes to the Moon, Ep. 05, “The fourth astronaut,” tells the story of how the Apollo computer system was made. The Apollo guidance computer weighed 30 kilos, was as big as a couple of shoe boxes, was built by a team at MIT, and was the world’s first digital, portable, general-purpose computer. It was the first software system where people’s lives depended on it. It was the first fly-by-wire system. Contract 1, the first contract of the Apollo program, was for the navigation computer. The MIT group used inertial navigation, first pioneered in Polaris missiles. The idea is that if you know where you start, your direction, and your acceleration, then you always know where you are and where you are going. Until this time flight craft were moved by manually pushing levers and flipping switches. Apollo couldn’t afford the weight of these pulley-based systems. They chose, for the first time ever (1964), to make a computer to control the flight of the spacecraft. They had to figure out everything. How would the computer communicate to all the different subsystems? Command a valve to open? Turn on an engine? Turn off an engine? Redirect an engine? Apollo is the moment when people stopped bragging about how big their computer was and started bragging about how small it was. Digital computers were the size of buildings at the time. At the time nobody trusted computers because they would only work a few hours or days at a time. They needed a computer to work for a couple of weeks. They risked everything on a brand new technology called integrated circuits that were only in the labs. They made the very first computer with ICs. But they got permission to do it. A huge risk betting everything, but there was no alternative. There was no other way to build a computer with enough compute power. Use of ICs to build digital computers is one of the lasting legacies of Apollo.
Apollo bought 60% of the total chip output at the time, a huge boost to a fledgling computer industry. But the hardware needed software. Software was not even in the original 10 page contract. In 1967 they were afraid they wouldn’t meet the end of the decade deadline because software is so complicated to build. And nothing has changed since. Margaret Hamilton joined the project in 1964. There were no rules for software at the time. There was no field of software development. You could get hired just for knowing a computer language. So again, not much has changed. Nobody knew what software was. You couldn’t describe what you did to family. Very unregimented, very free environment. Don Eyles wrote the landing software on the AGC (Apollo Guidance Computer). The AGC was one square foot in size, weighed 70 pounds, drew 55 watts, and had 76kb of memory in the form of 16 bit words; only 4k was RAM, the rest was hard wired memory. Once written, a program was printed out on paper and then converted by keypunch operators to punch cards that could be read directly into mainframe computers, which translated them onto the AGC. Over 100 people worked on it at the end. All the cards had to be integrated together and submitted in one big run that executed overnight. Then the simulation would be run the next day to see if the code was OK. This was your debug cycle. The keypunch operators had to go around at night and beat up on the prima donna programmers, who always wanted more time or to do something over, to submit their jobs. Again, not much has changed. The keypunch operators would go back to the programmers when they noticed syntax errors. If the code wasn’t right the program wouldn’t go. It used core rope memory. Software was woven into the cores. If a wire went through one of the donut shaped cores of magnetic material that was a 1; if it went around a core that represented a 0.
Software was hardware that was literally sewn into the structure of the computer, manually, by textile workers, by hand. Rope memory was proven tech at the time. It was bulletproof. There was no equivalent bulletproof approach to software, which is why Hamilton invented software engineering. There were no tools for finding errors at the time. They tried to find a way to build software so errors would not happen. Wrong time, wrong priority, wrong data, interface errors were the big source of errors. Nobody knew how to control a system with software. They came up with a verb and noun system that looked like a big calculator with buttons. The buttons had to be big and clear so they could be punched with gloves and seen through a visor. Verb, what do you want to do? Noun, what do you want to do it to? It used a simple little keyboard. There were three digital readouts, no text, it was all just numbers, three sets of numbers. To initiate the lunar landing program you would press noun, 63, enter. To start the program in 15 seconds you enter verb, 15, enter. A clock would start counting down. At zero it would start program 63 which initiated a large braking burn to slow you down so you start dropping down to the surface of the moon. The astronauts didn’t fly, they controlled programs. They ended 200 meters from where they intended to land. Flying manually would have taken a lot more fuel. The computer was always on and in operation. It was a balance of control, a partnership. The intention at first was to create a fully automated system with two buttons: go to the moon; go home. They ended up with 500 buttons. Again, things don’t change.
  • Why Some Platforms Thrive and Others Don’t: Some digital networks are fragmented into local clusters of users. In Uber’s network, riders and drivers interact with network members outside their home cities only occasionally. But other digital networks are global; on Airbnb, visitors regularly connect with hosts around the world. Platforms on global networks are much less vulnerable to challenges, because it’s difficult for new rivals to enter a market on a global scale…As for Didi and Uber, our analysis doesn’t hold out much hope. Their networks consist of many highly local clusters. They both face rampant multi-homing, which may worsen as more rivals enter the markets. Network-bridging opportunities—their best hope—so far have had only limited success. They’ve been able to establish bridges just with other highly competitive businesses, like food delivery and snack vending. (In 2018 Uber struck a deal to place Cargo’s snack vending machines in its vehicles, for instance.) And the inevitable rise of self-driving taxis will probably make it challenging for Didi and Uber to sustain their market capitalization. Network properties are trumping platform scale.
  • James Hamilton: Where Aurora took a different approach from that of common commercial and open source database management systems is in implementing log-only storage. Looking at contemporary database transaction systems, just about every system only does synchronous writes with an active transaction waiting when committing log records. The new or updated database pages might not be written for tens of seconds or even minutes after the transaction has committed. This has the wonderful characteristic that the only writes that block a transaction are sequential rather than random. This is generally a useful characteristic and is particularly important when logging to spinning media but it also supports an important optimization when operating under high load. If the log is completing an I/O while a new transaction is being committed, then the commit is deferred until the previous log I/O has completed and the next log I/O might carry out tens of completed transactions that had been waiting during the previous I/O. The busier the log gets, the more transactions that get committed in a single write. When the system is lightly loaded each log I/O commits a single transaction as quickly as possible. When the system is under heavy load, each commit takes out tens of transaction changes at a slight delay but at much higher I/O efficiency. Aurora takes a bit more radical approach where it simply only writes log records out and never writes out data pages synchronously or otherwise. Even more interesting, the log is remote and stored with 6-way redundancy using a 4/6 write quorum and a 3/6 read quorum. Further improving the durability of the transaction log, the log writes are done across 3 different Availability Zones (each are different data centers).  In this approach Aurora can continue to read without problem if an entire data center goes down and, at the same time, another storage server fails. 
  • Videos from DSConf 2019 are now available
  • Given Microsoft’s purchase of LinkedIn three years ago, that LinkedIn is moving to the Azure cloud should not be a big surprise. Have to wonder if there will be an Azure tax? Moving over from your own datacenters will certainly chew up a lot of cycles that could have gone into product.
  • Best name ever: A grimoire of functions
  • ML helping programmers is becoming a thing. A GPT-2 model trained on ~2 million files from GitHub. Autocompletion with deep learning: TabNine is an autocompleter that helps you write code faster. We’re adding a deep learning model which significantly improves suggestion quality. 
  • It’s hard to get a grasp on how EventBridge will change architectures. This article on using it as a new webhook is at least concrete. Amazon EventBridge: The biggest thing since AWS Lambda itself. Though with webhooks I just enter a url in a field on a form and I start receiving events. This works for PayPal, Slack, chatbots, etc. What’s the EventBridge equivalent? The whole how to hook things up isn’t clear at all. Also, Why Amazon EventBridge will change the way you build serverless applications
  • Tired of pumping all your data into a lake? Mesh it. The eternal cycle of centralizing, distributing, and then centralizing continues. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh: “In order to decentralize the monolithic data platform, we need to reverse how we think about data, its locality and ownership. Instead of flowing the data from domains into a centrally owned data lake or platform, domains need to host and serve their domain datasets in an easily consumable way.” There’s also an interview at Straining Your Data Lake Through A Data Mesh – Episode 90
  • The problem of false knowledge. Exponential Wisdom Episode 74: The Future of Construction. In this podcast there’s a segment that extols the wonders of visiting the Sistine Chapel through VR instead of visiting in-person. Is anyone worried about the problem of false knowledge? If I show you a picture of a chocolate bar do you know what chocolate tastes like? Constricting all experience through only our visual senses is a form of false knowledge. The Sistine Chapel evoked in me a visceral feeling of awe tinged with sadness. Would I feel that through VR? I don’t think so. I walked the streets. Tasted the food. Met the people. Saw the city. All experiences that can’t be shoved through our eyes.
  • Good to see Steve Ballmer hasn’t changed. Players players players. Developers developers developers.
  • Never thought of this downside of open source before. SECURITY NOW 724 HIDE YOUR RDP NOW!. Kazakhstan is telling citizens to install a root cert into their browser so they can perform man-in-the-middle attacks. An interesting question is how browser makers should respond. More interesting is what if Kazakhstan responds by making their own browser based on open source, compromising it, and requiring its use? Black Mirror should get on this. Software around us appears real, but has actually been replaced by pod-progs. Also, Open Source Could Be a Casualty of the Trade War
  • Darkweb Vendors and the Basic Opsec Mistakes They Keep Making. Don’t use email addresses that link to other accounts. Don’t use the same IDs across accounts. Don’t ship from the same area. Don’t do stuff yourself so you can be photographed. Don’t model your product using your own hands. Don’t cause anyone to die. Don’t sell your accounts to others. Don’t believe someone when they offer to launder your money. 
  • Though it’s still Electron. When a rewrite isn’t: rebuilding Slack on the desktop: The first order of business was to create the modern codebase: All UI components had to be built with React; All data access had to assume a lazily loaded and incomplete data model; All code had to be “multi-workspace aware”. The key to our approach ended up being Redux. The key to its success is the incremental release strategy that we adopted early on in the project: as code was modernized and features were rebuilt, we released them to our customers.
  • Re-Architecting the Video Gatekeeper: We [Netflix] decided to employ a total high-density near cache (i.e., Hollow) to eliminate our I/O bottlenecks. For each of our upstream systems, we would create a Hollow dataset which encompasses all of the data necessary for Gatekeeper to perform its evaluation. Each upstream system would now be responsible for keeping its cache updated. With this model, liveness evaluation is conceptually separated from the data retrieval from upstream systems. Instead of reacting to events, Gatekeeper would continuously process liveness for all assets in all videos across all countries in a repeating cycle. The cycle iterates over every video available at Netflix, calculating liveness details for each of them. At the end of each cycle, it produces a complete output (also a Hollow dataset) representing the liveness status details of all videos in all countries.
  • Should you hire someone who has already done the job you need to do? Not necessarily. Business Lessons from How Marvel Makes Movies: Marvel does something that is very counterintuitive. Instead of hiring people that are going to be really good at directing blockbusters, they look for people that have done a really good job with medium-sized budgets, but developing very strong storylines and characters. So, generally speaking, what they do is they looked to other genres like Shakespeare or horror. You can have spy films, comedy films, buddy cop films and what they do is they say, if I brought this director into the Marvel universe, what could they do with our characters? How could they shake up our stories and kind of reinvigorate them and provide new energy and new life?
  • What is a senior engineer? A historian. EliRivers: I work on some software of which the oldest parts of the source code date back to about 2009. Over the years, some very smart (some of them dangerously smart and woefully inexperienced, and clearly – not at all their fault – not properly mentored or supervised) people worked on it and left. What we have now is frequently a mystery. Simple changes are difficult, difficult changes verge on the impossible. Every new feature requires reverse-engineering of the existing code. Sometimes literally 95% of the time is spent reverse-engineering the existing code (no exaggeration – we measured it); changes can take literally 20 times as long as they should while we work out what the existing code does (and also, often not quite the same, what it’s meant to do, which is sometimes simply impossible to ever know). Pieces are gradually being documented as we work out what they do, but layers of cruft from years gone by from people with deadlines to meet and no chance of understanding the existing code sit like landmines and sometimes like unbreakable bonds that can never be undone. In our estimates, every time we have to rely on existing functionality that should be rock solid reliable and completely understood yet that we have not yet had to fully reverse-engineer, we mark it “high risk, add a month”. The time I found that someone had rewritten several pieces of the Qt libraries (without documenting what, or why) was devastating; it took away one of the cornerstones I’d been relying on, the one marked “at least I know I can trust the Qt libraries”. 
It doesn’t matter how smart we are, how skilled a coder we are, how genius our algorithms are; if we write something that can’t be understood by the next person to read it, and isn’t properly documented somewhere in some way that our fifth replacement can find easily five years later – if we write something of which even the purpose, let alone the implementation, will take someone weeks to reverse engineer – we’re writing legacy code on day one and, while we may be skilled programmers, we’re truly Godawful software engineers.
  • You always learn something new when you listen to Martin Thompson. Protocols and Sympathy With Martin Thompson. He goes into the many implications of the Universal Scalability Law which covers what can be split up and shared while considering coherence costs, which is the time it takes parties working together to reach agreement. The mathematics for systems and the mathematics for people are all very similar because it’s just a system. Doubling the size of a system doesn’t mean doubling the amount of work done. You have to ask if the workload is decomposable. The workload needs to decompose and be done in parallel, but not concurrently. Parallelism is doing multiple things at the same time. Concurrency is dealing with multiple things at the same time. Concurrency requires coordination. Adding slack to a system reduces response time because it reduces utilization. If we constantly break teams up and reform them we end up spending more time on achieving coherence. If your team has become more efficient and reaches agreement faster then you can do more things at the same time with less overhead. You get more throughput by maximizing parallelism and minimizing coherency. Slow down and think more. Also, Understanding the LMAX Disruptor
  • Excellent explanation. Distributed Locks are Dead; Long Live Distributed Locks! and Testing the CP Subsystem with Jepsen
  • Atlassian on Our not-so-magic journey scaling low latency, multi-region services on AWS. Do you have something like this: “a context service which needed to be called multiple times per user request, with incredibly low latency, and be globally distributed. Essentially, it would need to service tens of thousands of requests per second and be highly resilient.” They were stuck with a previous sharding solution so couldn’t make a complete break as they moved to AWS. The first cut was CQRS with DynamoDB, which worked well until higher load hit and DynamoDB had latency problems. They used SNS to invalidate node level caches. They replaced ELBs with ALBs which increased reliability but the p99 latency went from 10ms to 20ms. They went with Caffeine instead of Guava for their cache. They added a sidecar as a local proxy for a service. A sidecar is essentially just another containerised application that is run alongside the main application on the EC2 node. The benefit of using sidecars (as opposed to libraries) is that it’s technology agnostic. Latencies fell drastically. 
  • Nike on Moving Faster With AWS by Creating an Event Stream Database: we turned to the Kinesis Data Firehose service…a service called Athena that gives us the ability to perform SQL queries over partitioned data…how does our solution compare to more traditional architectures using RDS or Dynamo? Being able to ingest data and scale automatically via Firehose means our team doesn’t need to write or maintain pre-scaling code…Data storage costs on S3($0.023 per GB-month) are lower when compared to DynamoDB($0.25 per GB-month) and Aurora($0.10 per GB-month)…In a sample test, Athena delivered 5 million records in seconds, which we found difficult to achieve with DynamoDB…One limitation is that Firehose batches out data in windows of either data size or a time limit. This introduces a delay between when the data is ingested to when the data is discoverable by Athena…Queries to Athena are charged by the amount of data scanned, and if we scan the entire event stream frequently, we could rack up serious costs in our AWS bill.
  • It’s not easy to create a broadcast feed. Here’s how Hotstar did it. Building Pubsub for 50M concurrent socket connections. They went through a lot of different options. They ended up using EMQX, client side load balancing, and multiple clusters with bridges connecting them and a reverse bridge. Each subscriber node could support 250k clients. With 200 subscribe nodes, the system can support 50M connections and more. Also, Ingesting data at “Bharat” Scale
  • Making Containers More Isolated: An Overview of Sandboxed Container Technologies: We have looked at several solutions that tackle the current container technology’s weak isolation issue. IBM Nabla is a unikernel-based solution that packages applications into a specialized VM. Google gVisor is a merge of a specialized hypervisor and guest OS kernel that provides a secure interface between the applications and their host. Amazon Firecracker is a specialized hypervisor that provisions each guest OS a minimal set of hardware and kernel resources. OpenStack Kata is a highly optimized VM with built-in container engine that can run on hypervisors. It is difficult to say which one works best as they all have different pros and cons. 
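
The Aurora quorum numbers quoted from James Hamilton above are easy to sanity check with a little arithmetic. This is only a sketch, not how Aurora is implemented: it assumes the six log copies are spread evenly over the three Availability Zones, two per zone (the quote only says 6-way redundancy across 3 AZs), and every function name here is made up for illustration.

```python
# Quorum arithmetic for a 6-copy log with a 4/6 write and 3/6 read quorum.
N_COPIES = 6
WRITE_QUORUM = 4
READ_QUORUM = 3
COPIES_PER_AZ = 2  # assumption: copies are spread evenly over 3 AZs


def quorums_overlap(n, w, r):
    """Any read quorum shares at least one copy with any write quorum
    when w + r > n, so a read always sees the latest committed write."""
    return w + r > n


def writes_overlap(n, w):
    """Any two write quorums share at least one copy when 2 * w > n."""
    return 2 * w > n


def surviving_copies(failed_azs=0, extra_failed_nodes=0):
    """Copies still reachable after losing whole AZs plus single nodes."""
    return N_COPIES - failed_azs * COPIES_PER_AZ - extra_failed_nodes


def can_read(failed_azs=0, extra_failed_nodes=0):
    return surviving_copies(failed_azs, extra_failed_nodes) >= READ_QUORUM


def can_write(failed_azs=0, extra_failed_nodes=0):
    return surviving_copies(failed_azs, extra_failed_nodes) >= WRITE_QUORUM


if __name__ == "__main__":
    print("read/write quorums overlap:", quorums_overlap(N_COPIES, WRITE_QUORUM, READ_QUORUM))
    print("one AZ down, can write:", can_write(failed_azs=1))
    print("one AZ + one node down, can read:", can_read(failed_azs=1, extra_failed_nodes=1))
    print("one AZ + one node down, can write:", can_write(failed_azs=1, extra_failed_nodes=1))
```

Under that two-per-zone assumption, losing a whole zone plus one more storage server leaves three copies, which is exactly the read quorum. That matches the claim that reads continue in that scenario, while writes need a fourth copy to come back.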

Soft Stuff

  • Nodes: In Nodes you write programs by connecting “blocks” of code. Each node – as we refer to them – is a self contained piece of functionality like loading a file, rendering a 3D geometry or tracking the position of the mouse. The source code can be as big or as tiny as you like. We’ve seen some of ours ranging from 5 lines of code to the thousands. Conceptual/functional separation is usually more important.
  • Picat: Picat is a simple, and yet powerful, logic-based multi-paradigm programming language aimed for general-purpose applications. Picat is a rule-based language, in which predicates, functions, and actors are defined with pattern-matching rules. Picat incorporates many declarative language features for better productivity of software development, including explicit non-determinism, explicit unification, functions, list comprehensions, constraints, and tabling. Picat also provides imperative language constructs, such as assignments and loops, for programming everyday things. 
  • When the aliens find the dead husk of our civilization the irony is what will remain of history are clay cuneiform tablets. Something comforting knowing what’s oldest will last longest. Cracking Ancient Codes: Cuneiform Writing.
  • donnaware/AGC: FPGA Based Apollo Guidance Computer. 

Pub Stuff

  • Unikernels: The Next Stage of Linux’s Dominance (overview): In this paper, we posit that an upstreamable unikernel target is achievable from the Linux kernel, and, through an early Linux unikernel prototype, demonstrate that some simple changes can bring dramatic performance advantages. rwmj: The entire point of this paper is not to start over from scratch, but to reuse existing software (Linux and memcached in this case), and fiddle with the linker command line and a little bit of glue to link them into a single binary. If you want to start over from scratch using a safe language then see MirageOS.
  • Linux System Programming. rofo1: The book is solid. I mentally place it up there with “Advanced programming in the UNIX Environment” by Richard Stevens. 
  • Checking-in on network functions: we need better approaches to VERIFY and INTERACT with network functions and packet processing program properties. here, we provide a HYBRID-APPROACH and implementation for GRADUALLY checking and validating the arbitrary logic and side effects by COMBINING design by contract, static assertions and type-checking, and code generation via macros all without PENALIZING programmers at development time
  • Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches: In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. 
  • DistME: A Fast and Elastic Distributed Matrix Computation Engine using GPUs: We implement a fast and elastic matrix computation engine called DistME by integrating CuboidMM with GPU acceleration on top of Apache Spark. Through extensive experiments, we have demonstrated that CuboidMM and DistME significantly outperform the state-of-the-art methods and systems, respectively, in terms of both performance and data size.
  • PARTISAN: Scaling the Distributed Actor Runtime (github, video, twitter): We present the design of an alternative runtime system for improved scalability and reduced latency in actor applications called PARTISAN. PARTISAN provides higher scalability by allowing the application developer to specify the network overlay used at runtime without changing application semantics, thereby specializing the network communication patterns to the application. PARTISAN reduces message latency through a combination of three predominately automatic optimizations: parallelism, named channels, and affinitized scheduling. We implement a prototype of PARTISAN in Erlang and demonstrate that PARTISAN achieves up to an order of magnitude increase in the number of nodes the system can scale to through runtime overlay selection, up to a 38.07x increase in throughput, and up to a 13.5x reduction in latency over Distributed Erlang.
  • BPF Performance Tools (book): This is the official site for the book BPF Performance Tools: Linux System and Application Observability, published by Addison Wesley (2019). This book can help you get the most out of your systems and applications, helping you improve performance, reduce costs, and solve software issues. Here I’ll describe the book, link to related content, and list errata.

from High Scalability

Ten Things Serverless Architects Should Know

Building on the first three parts of the AWS Lambda scaling and best practices series where you learned how to design serverless apps for massive scale, AWS Lambda’s different invocation models, and best practices for developing with AWS Lambda, we now invite you to take your serverless knowledge to the next level by reviewing the following 10 topics to deepen your serverless skills.

1: API and Microservices Design

With the move to microservices-based architectures, decomposing monolithic applications and de-coupling dependencies is more important than ever. Learn more about how to design and deploy your microservices with Amazon API Gateway:

Get hands-on experience building out a serverless API with API Gateway, AWS Lambda, and Amazon DynamoDB powering a serverless web application by completing the self-paced Wild Rydes web application workshop.

Figure 1: WildRydes serverless web application workshop

2: Event-driven Architectures and Asynchronous Messaging Patterns

When building event-driven architectures, whether you’re looking for simple queueing and message buffering or a more intricate event-based choreography pattern, it’s valuable to learn about the mechanisms to enable asynchronous messaging and integration. These are enabled primarily through the use of queues or streams as a message buffer and topics for pub/sub messaging. Understand when to use each and the unique advantages and features of all three:

Get hands-on experience building a real-time data processing application using Amazon Kinesis Data Streams and AWS Lambda by completing the self-paced Wild Rydes data processing workshop.
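
To make the difference between the three mechanisms in this section concrete, here is a minimal in-memory sketch. It is not the API of any AWS service; the class and method names are invented for illustration, and mapping them onto specific managed services is left to the reader. A queue hands each message to exactly one consumer, a topic fans each message out to every subscriber, and a stream is an append-only log that each reader walks at its own pace and can replay.

```python
from collections import deque


class Queue:
    """Message buffer: each message is received by exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Pub/sub: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


class Stream:
    """Append-only log: readers track their own offset and can replay."""

    def __init__(self):
        self._records = []

    def put(self, record):
        self._records.append(record)

    def read(self, offset, limit=10):
        # Returns the batch plus the offset to use for the next read.
        batch = self._records[offset:offset + limit]
        return batch, offset + len(batch)


if __name__ == "__main__":
    stream = Stream()
    for event in ("ride-requested", "ride-started", "ride-ended"):
        stream.put(event)
    first_batch, next_offset = stream.read(0, limit=2)
    print(first_batch, next_offset)
    print(stream.read(0))  # a second reader can replay from the start
```

The practical question is who needs to see each message. One worker: a queue. Every interested party: a topic. Many independent readers that may need to go back in time: a stream.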

3: Workflow Orchestration in a Distributed, Microservices Environment

In distributed microservices architectures, you must design coordinated transactions in different ways than traditional database-based ACID transactions, which are typically implemented using a monolithic relational database. Instead, you must implement coordinated sequenced invocations across services along with rollback and retry mechanisms. For workloads where significant orchestration logic is required and you want to use more of an orchestrator pattern than the event choreography pattern mentioned above, AWS Step Functions enables building complex workflows and distributed transactions through integration with a variety of AWS services, including AWS Lambda. Learn about the options you have to build your business workflows and keep orchestration logic out of your AWS Lambda code:

Get hands-on experience building an image processing workflow using computer vision AI services with Amazon Rekognition and AWS Step Functions to orchestrate all logic and steps with the self-paced Serverless image processing workflow workshop.

Figure 2: Several AWS Lambda functions managed by an AWS Step Functions state machine
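
The "sequenced invocations along with rollback and retry mechanisms" idea from this section can be sketched in a few lines of plain Python. This is not AWS Step Functions and uses none of its APIs; `run_workflow` and the step names are hypothetical. It only illustrates what an orchestrator does: run steps in order, retry a failing step, and undo the completed steps in reverse order when a step fails for good.

```python
def run_workflow(steps, max_attempts=3):
    """Run (name, action, compensate) steps in order.

    Each action is retried up to max_attempts times. If a step still
    fails, the compensations of all previously completed steps run in
    reverse order and the last error is re-raised.
    """
    completed = []
    for name, action, compensate in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                break  # step succeeded, stop retrying
            except Exception:
                if attempt == max_attempts:
                    # Roll back everything that already succeeded.
                    for _, undo in reversed(completed):
                        undo()
                    raise
        completed.append((name, compensate))
    return [name for name, _ in completed]


if __name__ == "__main__":
    log = []

    def charge_card():
        raise RuntimeError("payment declined")

    steps = [
        ("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
        ("charge", charge_card, lambda: log.append("refund")),
    ]
    try:
        run_workflow(steps)
    except RuntimeError as error:
        print("workflow failed:", error)
    print(log)
```

Keeping this logic in an orchestrator rather than inside each function is the point the section makes: the individual steps stay small and unaware of one another.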

4: Lambda Computing Environment and Programming Model

Though AWS Lambda is a service that is quick to get started with, there is value in learning more about the AWS Lambda computing environment and how to take advantage of deeper performance and cost optimization strategies with the AWS Lambda runtime. Take your understanding and skills of AWS Lambda to the next level:

5: Serverless Deployment Automation and CI/CD Patterns

When dealing with a large number of microservices or smaller components, such as AWS Lambda functions all working together as part of a broader application, it’s critical to integrate automation and code management into your application early on to efficiently create, deploy, and version your serverless architectures. AWS offers several first-party deployment tools and frameworks for serverless architectures, including the AWS Serverless Application Model (SAM), the AWS Cloud Development Kit (CDK), AWS Amplify, and AWS Chalice. Additionally, there are several third party deployment tools and frameworks available, such as the Serverless Framework, Claudia.js, Sparta, or Zappa. You can also build your own homegrown framework. The important thing is to ensure your automation strategy works for your use case and team, and supports your planned data source integrations and development workflow. Learn more about the available options:

Learn how to build a full CI/CD pipeline and other DevOps deployment automation with the following workshops:

6: Serverless Identity Management, Authentication, and Authorization

Modern application developers need to plan for and integrate identity management into their applications while implementing robust authentication and authorization functionality. With Amazon Cognito, you can deploy serverless identity management and secure sign-up and sign-in directly into your applications. Beyond authentication, Amazon API Gateway also allows developers to granularly manage authorization logic at the gateway layer and authorize requests directly, using several types of native authorization.

Learn more about the options and benefits of each:

Get hands-on experience working with Amazon Cognito, AWS Identity and Access Management (IAM), and Amazon API Gateway with the Serverless Identity Management, Authentication, and Authorization Workshop.

Figure 3: Serverless Identity Management, Authentication, and Authorization Workshop
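
As one example of authorization at the gateway layer, below is a minimal sketch of a token-based Lambda authorizer, which is one of the authorization types API Gateway supports. The event fields (`authorizationToken`, `methodArn`) and the response shape (`principalId` plus an IAM `policyDocument`) follow the Lambda authorizer contract; everything else is an assumption made for illustration, in particular the hard-coded `DEMO_TOKEN` and the `demo-user` principal. A real authorizer would verify a signed token rather than compare against a constant.

```python
import hmac

# Assumption: a single shared demo token. Real systems should validate a
# signed token (for example a JWT) instead of comparing to a constant.
DEMO_TOKEN = "allow-me"


def build_policy(principal_id, effect, resource):
    """Build the response document a Lambda authorizer returns."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,  # "Allow" or "Deny"
                    "Resource": resource,
                }
            ],
        },
    }


def handler(event, context):
    """Allow the call only when the caller presents the expected token."""
    token = event.get("authorizationToken", "")
    # Constant-time comparison on bytes to avoid leaking token contents.
    matches = hmac.compare_digest(token.encode("utf-8"), DEMO_TOKEN.encode("utf-8"))
    effect = "Allow" if matches else "Deny"
    return build_policy("demo-user", effect, event["methodArn"])
```

Because the decision is made before the request reaches the backend, the business logic behind the API never has to parse or trust raw credentials.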

7: End-to-End Security Techniques

Beyond identity and authentication/authorization, there are many other areas to secure in a serverless application. These include:

  • Input and request validation
  • Dependency and vulnerability management
  • Secure secrets storage and retrieval
  • IAM execution roles and invocation policies
  • Data encryption at-rest/in-transit
  • Metering and throttling access
  • Regulatory compliance concerns

Thankfully, there are AWS offerings and integrations for each of these areas. Learn more about the options and benefits of each:

Get hands-on experience adding end-to-end security with the techniques mentioned above into a serverless application with the Serverless Security Workshop.
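
The first item in the list above, input and request validation, is the easiest one to show in code. The sketch below uses only the Python standard library and is not tied to any AWS feature; the field names (`latitude`, `longitude`), the size cap, and the function name are all hypothetical. The idea is to reject anything unexpected before it reaches business logic: oversized bodies, malformed JSON, unknown fields, wrong types, and out-of-range values.

```python
import json

MAX_BODY_BYTES = 10_000  # assumption: arbitrary cap chosen for this sketch
ALLOWED_FIELDS = {"latitude": (-90.0, 90.0), "longitude": (-180.0, 180.0)}


def validate_ride_request(raw_body):
    """Parse and validate a JSON request body.

    Returns a cleaned dict of floats, or raises ValueError with a short
    reason that is safe to return to the caller.
    """
    if not isinstance(raw_body, str) or len(raw_body.encode("utf-8")) > MAX_BODY_BYTES:
        raise ValueError("body missing or too large")
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")

    unknown = set(data) - set(ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")

    cleaned = {}
    for field, (low, high) in ALLOWED_FIELDS.items():
        value = data.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{field} must be a number")
        # NaN and infinities fail this range check as well.
        if not low <= value <= high:
            raise ValueError(f"{field} out of range")
        cleaned[field] = float(value)
    return cleaned
```

A handler would call `validate_ride_request(event_body)` first and turn any `ValueError` into a 400 response, so invalid input never touches downstream services.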

8: Application Observability with Comprehensive Logging, Metrics, and Tracing

Before taking your application to production, it’s critical that you ensure your application is fully observable, both at a microservice or component level, as well as overall through comprehensive logging, metrics at various granularity, and tracing to understand distributed system performance and end user experiences end-to-end. With many different components making up modern architectures, having centralized visibility into all of your key logs, metrics, and end-to-end traces will make it much easier to monitor and understand your end users’ experiences. Learn more about the options for observability of your AWS serverless application:
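
As a small illustration of the logging part of this, here is a sketch of structured (JSON) logging with a request identifier, using only Python's standard `logging` module. The field names `request_id` and `duration_ms`, the logger name, and the `build_logger` helper are assumptions for the example, not part of any AWS service. Emitting one JSON object per line is what lets a centralized log system filter and correlate entries across components.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
            # Optional fields supplied through logging's `extra` argument.
            "request_id": getattr(record, "request_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(entry)


def build_logger(name="orders", stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("order placed", extra={"request_id": "abc-123", "duration_ms": 42})
```

Passing the same `request_id` through every component that handles a request is a simple way to stitch the logs of one user interaction back together.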

9: Ensuring Your Application is Well-Architected

Adding onto the considerations mentioned above, we suggest architecting your applications more holistically to the AWS Well-Architected framework. This framework includes the five key pillars: security, reliability, performance efficiency, cost optimization, and operational excellence. Additionally, there is a serverless-specific lens to the Well-Architected framework, which more specifically looks at key serverless scenarios/use cases such as RESTful microservices, Alexa skills, mobile backends, stream processing, and web applications, and how they can implement best practices to be Well-Architected. More information:

10: Continuing Your Learning as Serverless Computing Continues to Evolve

As we’ve discussed, there are many opportunities to dive deeper into serverless architectures in a variety of areas. Though the resources shared above should be helpful in familiarizing yourself with key concepts and techniques, there’s nothing better than continued learning from others over time as new advancements come out and patterns evolve.

Finally, we encourage you to check back often as we’ll be continuing further blog post series on serverless architectures, with the next series focusing on API design patterns and best practices.

About the author

Justin Pirtle is a specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. He’s responsible for helping customers design, deploy, and scale serverless applications using services such as AWS Lambda, Amazon API Gateway, Amazon Cognito, and Amazon DynamoDB. He is a regular speaker at AWS conferences, including re:Invent, as well as other AWS events. Justin holds a bachelor’s degree in Management Information Systems from the University of Texas at Austin and a master’s degree in Software Engineering from Seattle University.

from AWS Architecture Blog

Sponsored Post: Educative, PA File Sight, Etleap, PerfOps, InMemory.Net, Triplebyte, Stream, Scalyr

Who’s Hiring? 

  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here! 

Cool Products and Services

  • Grokking the System Design Interview is a popular course on Educative (taken by 20,000+ people) that’s widely considered the best System Design interview resource on the Internet. It goes deep into real-world examples, offering detailed explanations and useful pointers on how to improve your approach. There’s also a no questions asked 30-day return policy. Try a free preview today.
  • PA File Sight – Actively protect servers from ransomware, audit file access to see who is deleting files, reading files or moving files, and detect file copy activity from the server. Historical audit reports and real-time alerts are built-in. Try the 30-day free trial!
  • For heads of IT/Engineering responsible for building an analytics infrastructure, Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike older enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own. Read stories from customers like Okta and PagerDuty, or try Etleap yourself.
  • PerfOps is a data platform that digests real-time performance data for CDN and DNS providers as measured by real users worldwide. Leverage this data across your monitoring efforts and integrate with PerfOps’ other tools such as Alerts, Health Monitors and FlexBalancer – a smart approach to load balancing. FlexBalancer makes it easy to manage traffic between multiple CDN providers, API’s, Databases or any custom endpoint helping you achieve better performance, ensure the availability of services and reduce vendor costs. Creating an account is Free and provides access to the full PerfOps platform.
  • InMemory.Net provides a Dot Net native in memory database for analysing large amounts of data. It runs natively on .Net, and provides a native .Net, COM & ODBC apis for integration. It also has an easy to use language for importing data, and supports standard SQL for querying data. http://InMemory.Net
  • Build, scale and personalize your news feeds and activity streams with Stream. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it’s easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring Devops and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, including apps with 30 million users. With your help we’d like to add a few zeros to that number. Check out the job opening on AngelList.
  • Scalyr is a lightning-fast log management and operational data platform. It’s a tool (actually, multiple tools) that your entire team will love. Get visibility into your production issues without juggling multiple tabs and different services: all of your logs, server metrics and alerts are in your browser and at your fingertips. Loved and used by teams at Codecademy, ReturnPath, Grab, and InsideSales. Learn more today or see why Scalyr is a great alternative to Splunk.
  • Advertise your product or service here!

Fun and Informative Events

  • Advertise your event here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.

PA File Sight monitors file access on a server in real-time.

It can track who is accessing what, and with that information can help detect file copying, detect (and stop) ransomware attacks in real-time, and record the file activity for auditing purposes. The collected audit records include user account, target file, the user’s IP address and more. This solution does NOT require Windows Native Auditing, which means there is no performance impact on the server. Join thousands of other satisfied customers by trying PA File Sight for yourself. No sign up is needed for the 30-day fully functional trial.

Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they’re a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte’s multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.

The Solution to Your Operational Diagnostics Woes

Scalyr gives you instant visibility of your production systems, helping you turn chaotic logs and system metrics into actionable data at interactive speeds. Don’t be limited by the slow and narrow capabilities of traditional log monitoring tools. View and analyze all your logs and system metrics from multiple sources in one place. Get enterprise-grade functionality with sane pricing and insane performance. Learn more today

from High Scalability

Running Red Hat Enterprise Linux as Kubernetes Worker Nodes -XI

Priyanka Sharma

In our previous blogs, we covered the deployment strategies, networking, and logging of the Kubernetes cluster. By default, AWS provides EKS-optimized AMIs for the EKS workers, which use Amazon Linux as the operating system. In this article, we discuss how to configure RHEL workers with an AWS EKS cluster.

  • Red Hat Enterprise Linux 7.6
  • Kubernetes 1.13 on AWS EKS. We have opted for private subnets for the EKS Control Plane. To provision a new EKS cluster, refer to the below command:
aws eks create-cluster --name <CLUSTER_NAME> --role-arn arn:aws:iam::<ACCOUNT>:role/<EKS_SERVICE_ROLE> --resources-vpc-config subnetIds=<PRIV_SUBNETA>,<PRIV_SUBNETB>,<PRIV_SUBNETC>,securityGroupIds=<EKS_SECURITYGROUP_ID>,endpointPublicAccess=false,endpointPrivateAccess=true --region ap-south-1

If running an old version, upgrade to the latest one by using the below command:

aws eks update-cluster-version --name <CLUSTER_NAME> --client-request-token updating-version --kubernetes-version 1.13 --region ap-south-1

Check the status using the command below:

aws eks describe-cluster --name <CLUSTER_NAME> --query cluster.status --region ap-south-1
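While the cluster is being created or upgraded, the status check above can be wrapped in a small polling loop. The sketch below is one way to do that and is not part of the original procedure: the function names `cluster_status` and `wait_for_active`, the retry count, and the delay are assumptions chosen for illustration. The AWS CLI also ships a built-in waiter, `aws eks wait cluster-active`, that covers the common case.

```shell
# Print the bare cluster status (for example CREATING, UPDATING, ACTIVE, FAILED).
cluster_status() {
  aws eks describe-cluster --name "$1" --query cluster.status \
    --output text --region "$2"
}

# Poll until the cluster is ACTIVE.
# Arguments: cluster name, region, max attempts (default 40), delay in seconds (default 30).
# Returns 0 on ACTIVE, 1 on FAILED, 2 if the attempts run out.
wait_for_active() {
  wfa_attempts="${3:-40}"
  wfa_delay="${4:-30}"
  wfa_i=1
  while [ "$wfa_i" -le "$wfa_attempts" ]; do
    wfa_status="$(cluster_status "$1" "$2")"
    echo "attempt $wfa_i: status=$wfa_status"
    case "$wfa_status" in
      ACTIVE) return 0 ;;
      FAILED) return 1 ;;
    esac
    sleep "$wfa_delay"
    wfa_i=$((wfa_i + 1))
  done
  return 2
}

# Usage (placeholder as in the commands above):
# wait_for_active <CLUSTER_NAME> ap-south-1
```

Keeping the status lookup in its own function makes it easy to swap in a different region or profile without touching the loop.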

Update the Kube Config. Ensure you are using the latest version of AWS CLI. In our case, it is 1.16.195.

aws eks --region ap-south-1 update-kubeconfig --name <CLUSTER_NAME>
  • Provision RHEL 7.6 as a standalone EC2 server.
  • Execute a shell script to make it EKS optimized. The script is available in the Git repo.
  • Take an AMI of the RHEL server.
  • Pass the AMI to the CloudFormation template parameters to provision the worker nodes.
  • Create the AWS Auth ConfigMap and pass the ARN of the instance role.
  • Watch the RHEL servers register as workers.
  • Switch to EC2 Console and Provision an EC2 Server with RHEL 7.6 AMI.
  • Install the dependencies using the below commands:
yum install -y git
yum install -y
yum install -y python-pip
pip install --upgrade awscli
pip install --upgrade aws-cfn-bootstrap
mkdir -p /opt/aws/bin
ln -s /usr/bin/cfn-signal /opt/aws/bin/cfn-signal
yum install -y   # can be replaced with the version required by docker
sed -i 's/enforcing/permissive/g' /etc/selinux/config   # if not set to permissive, the docker containers will fail to provision with a Permission Denied error
  • Clone the git repo and Execute
git clone
cd aws-eks-rhel-workers
  • Go to EC2 Console and create an AMI of this server.
  • Provision a CloudFormation stack with the below template provided by AWS:
  • In the parameter “NodeImageId”, input the Image ID of the AMI created in the previous step.
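The SELinux step in the dependency list above replaces every occurrence of the word "enforcing" in /etc/selinux/config, including the ones in comments. A narrower variant, sketched below, rewrites only the `SELINUX=` setting. The function name `set_selinux_permissive` and its optional file argument are assumptions, added so the edit can be tried on a copy of the file first. Changes to this file apply at the next boot; `setenforce 0` switches the running system to permissive mode immediately.

```shell
# Rewrite only the SELINUX= line; comments that mention "enforcing" stay intact.
# Argument: path to the config file (defaults to /etc/selinux/config).
# Prints the resulting SELINUX= line so the change can be confirmed.
set_selinux_permissive() {
  ssp_file="${1:-/etc/selinux/config}"
  sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' "$ssp_file"
  grep '^SELINUX=' "$ssp_file"
}

# Usage on the RHEL server, as root, before creating the AMI:
# set_selinux_permissive
# setenforce 0   # apply to the running system without a reboot
```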

Wait for the bootstrap script to execute inside the worker node. Get the instance role ARN from the CloudFormation stack outputs and provide it as the value of rolearn in the below yaml template.

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: <ARN of instance role (not instance profile)>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

Execute “kubectl apply -f aws-auth.yaml”.

Run “kubectl get nodes”. The RHEL worker node is registered with the EKS Cluster.
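The last few steps can also be scripted. The sketch below is one possible way, under stated assumptions: it assumes the worker stack exposes the role in an output named `NodeInstanceRole` (check your own stack's outputs, since the key depends on the template used), and the function names `node_role_arn` and `render_aws_auth` are made up for illustration.

```shell
# Read the worker instance role ARN from the CloudFormation stack outputs.
# Assumes an output key named NodeInstanceRole; adjust to match your template.
node_role_arn() {
  aws cloudformation describe-stacks --stack-name "$1" --region "$2" \
    --query "Stacks[0].Outputs[?OutputKey=='NodeInstanceRole'].OutputValue" \
    --output text
}

# Print an aws-auth ConfigMap that maps the given role ARN to worker nodes.
render_aws_auth() {
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: $1
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
EOF
}

# Usage (the stack name is a placeholder):
# render_aws_auth "$(node_role_arn <STACK_NAME> ap-south-1)" > aws-auth.yaml
# kubectl apply -f aws-auth.yaml
# kubectl get nodes --watch
```

Rendering the file from a function keeps the indentation of the `mapRoles` block consistent, which matters because the value is itself parsed as YAML.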

And that’s all. At this point, we have RHEL 7.6 worker nodes running in the EKS cluster.


from Powerupcloud Tech Blog – Medium

Intuit: Serving Millions of Global Customers with Amazon Connect

Recently, Bill Schuller, Intuit Contact Center Domain Architect, met with AWS’s Simon Elisha to discuss how Intuit manages its customer contact centers with Amazon Connect.

As a 35-year-old company with an international customer base, Intuit is widely known as the maker of QuickBooks and TurboTax, among other software products. Its 50 million customers can access its global contact centers not just for password resets and feature explanations, but for detailed tax interpretation and advice. As you can imagine, this presents a challenge of scale.

Using Amazon Connect, a self-service, cloud-based contact center service, Intuit has been able to provide a seamless call-in experience to Intuit customers from around the globe. When a customer calls in to Amazon Connect, Intuit is able to do a “data dip” through AWS Lambda out to the company’s CRM system (in this case, Salesforce) in order to get more information about the customer. At this point, Intuit can leverage other services like Amazon Lex for natural language feedback and then get the customer to the right person who can help. When the call is over, instead of having that important recording of the call locked up in a proprietary system, the audio is moved into an S3 bucket, where Intuit can do some post-call processing. It can also be sent out to third parties for analysis, or Intuit can use Amazon Transcribe or Amazon Comprehend to get a transcription or sentiment analysis to understand more about what happened during that particular call.
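As a rough illustration of the post-call step, the sketch below starts an Amazon Transcribe job for a recording that has landed in S3. It is not Intuit's implementation: the function names `job_name_for_key` and `transcribe_recording`, the bucket and key layout, the `en-US` language code, and the `wav` format are all assumptions made for the example.

```shell
# Derive a job name from an S3 key. Transcribe job names may only contain
# letters, digits, '.', '_' and '-', so every other character becomes '_'.
job_name_for_key() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

# Start a transcription job for one recording.
# Arguments: source bucket, key of the recording, bucket for the transcript.
transcribe_recording() {
  aws transcribe start-transcription-job \
    --transcription-job-name "$(job_name_for_key "$2")" \
    --language-code en-US \
    --media-format wav \
    --media "MediaFileUri=s3://$1/$2" \
    --output-bucket-name "$3"
}

# Usage (bucket names and key are hypothetical):
# transcribe_recording my-recordings-bucket "connect/2019/08/16/call-1.wav" my-transcripts-bucket
```

The resulting transcript could then be passed to a sentiment analysis service such as Amazon Comprehend, as the paragraph above describes.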

Watch the video below to understand the reasons why Intuit decided on this set of AWS services (hint: it has to do with the ability to experiment with speed and scale but without the cost overhead).

Check out more videos in the This Is My Architecture series.

About the author

Annik Stahl is a Senior Program Manager in AWS, specializing in blog and magazine content as well as customer ratings and satisfaction. Having been the face of Microsoft Office for 10 years as the Crabby Office Lady columnist, she loves getting to know her customers and wants to hear from you.

from AWS Architecture Blog