Fishing from Java 8 Streams - can you fish from the AWS sea as well?


Fishing from Java 8 Streams?!




I've been using Java 8 for a few years, but evidently I never read the complete list of new features, because it turns out to support Streams: map, filter and reduce operations on Collections, etc.
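For anyone else who missed it, here is a minimal sketch of what map, filter and reduce look like on an ordinary in-memory collection. The class name, method names and sample data are made up for illustration; only the `java.util.stream` calls are the real API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamBasics {
    // filter: keep the even numbers; map: square them; reduce: add them up.
    static int sumOfEvenSquares(List<Integer> numbers) {
        return numbers.stream()
                .filter(n -> n % 2 == 0)
                .map(n -> n * n)
                .reduce(0, Integer::sum);
    }

    // filter + map, then collect the results back into a List.
    static List<String> upperCaseLongerThan(List<String> words, int minLength) {
        return words.stream()
                .filter(w -> w.length() > minLength)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 2*2 + 4*4 + 6*6 = 56
        System.out.println(sumOfEvenSquares(Arrays.asList(1, 2, 3, 4, 5, 6)));
        // [SALMON, HERRING]
        System.out.println(upperCaseLongerThan(
                Arrays.asList("cod", "salmon", "tuna", "herring"), 4));
    }
}
```

This is roughly the kind of thing I had been hand-coding with loops and temporary lists.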

A few articles:


I wish I'd known this a year or two ago, as I essentially hand-coded similar functionality and/or used Apache Spark in places where Streams might have done the trick.

But I also wonder if Java 8 Streams work in any useful way with distributed data processing in AWS? i.e. can you hop in a trawler and fish from the AWS sea using Java 8 Streams?

Maybe. There are some "interesting" (but not exhaustive) use cases:

Maybe searching for specific AWS services + Java 8 Streams would find more...

Or are Java 8 Streams just "broken"?


I.e. Streams are just lazy collections, and they haven't addressed the limitations of the underlying fork/join concurrency model. Oh well.
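To make the "lazy" part concrete, here is a small sketch (class and method names are hypothetical) that counts how many elements a sequential pipeline actually touches. Nothing runs when the pipeline is built; work only happens when a terminal operation such as `findFirst` pulls elements through, and it stops as soon as it has an answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Returns {elements visited before the terminal op, elements visited after it}.
    static int[] visitedBeforeAndAfter(List<Integer> source, int threshold) {
        AtomicInteger visited = new AtomicInteger();

        // Building the pipeline does not process any element yet.
        Stream<Integer> pipeline = source.stream()
                .peek(n -> visited.incrementAndGet())
                .filter(n -> n > threshold);
        int before = visited.get();

        // The terminal operation pulls elements one at a time and
        // short-circuits at the first match.
        pipeline.findFirst();
        int after = visited.get();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = visitedBeforeAndAfter(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 3);
        // Sequential stream: 0 visited before findFirst, 4 after (1, 2, 3, 4).
        System.out.println(counts[0] + " then " + counts[1]);
    }
}
```

Note this only shows the sequential case; with `.parallel()` the number of elements visited is not predictable in the same way.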


This is the gist of the problems with the application of F/J in the implementation of the Streams API:
  1. F/J is good for the parallelization of in-memory, random-access structures: it wants to be able to divide the full problem top-down, by recursively halving it into two subproblems of equal size;
  2. the stream paradigm is primarily about the processing of lazily materialized, sequential data sources, which can only be divided into a sequence of chunks, and the number of chunks is usually not known in advance.
While F/J can be bent somewhat to support sequential chunking, this is perceived by it as "anomalous" and "lopsided", eventually giving rise to insurmountable issues when combined with the unpredictable I/O latency in reading those chunks.
Streams API excels at the parallelization of in-memory structures and is usually helpful with the processing of lazy, I/O-backed streams, but it fails when you try to combine these two features in a single use case.
If you have a loop in your code which introduces a CPU-bound bottleneck, it is fairly likely that this loop is iterating over the contents of some file, network request, or rows of an SQL result set. None of these targets for parallelization get support from the Streams API.
The official position is that this use case is not supported because the Streams API has a different, equally legitimate focus. In the department of lazy parallel streams, this focus amounts to stream sources which are calculated from data existing within working memory, with the additional constraint that these sources must be unordered—that each member can be calculated independently, without the need to first calculate any other. An example of such a stream is a range of integers, but a stream of random numbers from an LCG is already outside of the area being focused on by the API, because these random numbers can only be generated sequentially.
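A rough way to see the difference described above is to look at the `Spliterator` behind each kind of source. The sketch below is my own illustration, not code from the quoted article, and the class and method names are hypothetical. It compares an integer range, which knows its size and splits into equal halves, with `BufferedReader.lines()`, which reports no known size because the lines can only be read one after another.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Spliterator;
import java.util.stream.IntStream;

class SplitDemo {
    // An in-memory integer range knows its exact size, and so do its splits.
    static boolean rangeIsSized() {
        return IntStream.range(0, 1_000_000).spliterator()
                .hasCharacteristics(Spliterator.SIZED | Spliterator.SUBSIZED);
    }

    // A line-by-line reader is a sequential source of unknown length.
    // A StringReader stands in here for a file or network stream.
    static boolean linesAreSized() {
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc\n"));
        return reader.lines().spliterator().hasCharacteristics(Spliterator.SIZED);
    }

    // Split a range once and report the size of each half.
    static long[] splitRangeOnce(int n) {
        Spliterator.OfInt right = IntStream.range(0, n).spliterator();
        Spliterator.OfInt left = right.trySplit();
        return new long[] {left.estimateSize(), right.estimateSize()};
    }

    // The case the Streams API is built for: a parallel computation over a range.
    static long parallelSumOfSquares(int n) {
        return IntStream.range(0, n).parallel().mapToLong(i -> (long) i * i).sum();
    }

    public static void main(String[] args) {
        System.out.println("range sized: " + rangeIsSized());
        System.out.println("lines sized: " + linesAreSized());
        long[] halves = splitRangeOnce(1000);
        System.out.println("range halves: " + halves[0] + " / " + halves[1]);
        System.out.println("parallel sum of squares below 4: " + parallelSumOfSquares(4));
    }
}
```

Calling `.parallel()` on the `lines()` stream is still legal, but the framework then has to carve the sequential source into batches as it reads them, which is the "lopsided" chunking the quoted passage refers to.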
