Fishing from Java 8 Streams - can you fish from the AWS sea as well?

May 19, 2017

Fishing from Java 8 Streams?!

(cool photo from https://www.exclamationlabs.com/blog/refactoring-for-java-8-streams/)

I've been using Java 8 for a few years, but I obviously never read the complete list of new features, because it turns out to support Streams: map, filter and reduce operations on collections, and so on.
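For anyone else who missed it, here is a minimal sketch of what map, filter and reduce look like on an ordinary collection. The class name, method names and the list of fish are made up for illustration; the stream calls themselves are the standard `java.util.stream` API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example class: map, filter and reduce on a plain List.
class StreamBasics {

    // Filter: keep words longer than minLength. Map: upper-case the survivors.
    static List<String> longWordsUpper(List<String> words, int minLength) {
        return words.stream()
                .filter(w -> w.length() > minLength)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // Map each word to its length, then reduce the lengths to a single sum.
    static int totalLength(List<String> words) {
        return words.stream()
                .map(String::length)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<String> haul = Arrays.asList("cod", "herring", "tuna", "mackerel");
        System.out.println(longWordsUpper(haul, 4)); // prints [HERRING, MACKEREL]
        System.out.println(totalLength(haul));       // prints 22
    }
}
```

This is roughly the kind of loop-and-accumulate code I had been writing by hand.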

A few articles:

https://www.exclamationlabs.com/blog/refactoring-for-java-8-streams/

http://www.oracle.com/technetwork/articles/java/ma14-java-se-8-streams-2177646.html

https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html

I wish I'd known this a year or two ago: I essentially hand-coded similar functionality and/or used Apache Spark in places where Streams might have done the trick.

But I also wonder whether Java 8 Streams work in any useful way with distributed data processing in AWS, i.e. can you hop in a trawler and fish from the AWS sea using Java 8 Streams?

Maybe. There are some "interesting" use cases (this list is not exhaustive):

S3 and Java 8 Streams.
Java 8 Streams: 10 Missing features (or how to fix Java 8 streams with AWS and microservices)
Kinesis Java 8 Streams Record De-aggregator (!? Will it make me an espresso as well?)

Maybe searching for specific AWS services + Java 8 Streams would find more...

Or are Java 8 Streams just "broken"?

Why Java 8 Streams are broken.

I.e. the argument is that Streams are just lazy collections, and that they haven't addressed the limitations of the underlying fork/join concurrency model. Oh well.
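To make the "lazy" part concrete, here is a small sketch, assuming nothing beyond the standard library. The class and method names are hypothetical. It builds a pipeline on an infinite source and records which elements actually get pulled through it; because `limit` short-circuits, only as many source elements are produced as the terminal operation needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical example class: intermediate operations run only on demand.
class LazyStreams {

    // Finds the first n even squares; 'visited' records every source element
    // that was actually pulled through the pipeline.
    static List<Integer> firstEvenSquares(int n, List<Integer> visited) {
        return Stream.iterate(1, i -> i + 1)   // infinite sequential source
                .peek(visited::add)            // observe each element as it is pulled
                .map(i -> i * i)
                .filter(sq -> sq % 2 == 0)
                .limit(n)                      // stops the pipeline after n results
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> visited = new ArrayList<>();
        System.out.println(firstEvenSquares(2, visited)); // prints [4, 16]
        System.out.println(visited);                      // prints [1, 2, 3, 4]
    }
}
```

Nothing is computed when the pipeline is assembled; the work only happens when `collect` asks for elements.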

Is this the definitive analysis?

This is the gist of the problems with the application of F/J in the implementation of the Streams API:

F/J is good for the parallelization of in-memory, random-access structures: it wants to be able to divide the full problem top-down, by recursively halving it into two subproblems of equal size;
the stream paradigm is primarily about the processing of lazily materialized, sequential data sources, which can only be divided into a sequence of chunks, and the number of chunks is usually not known in advance.
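The "recursively halving" shape that fork/join wants can be sketched as below. This is my own illustration rather than code from the quoted analysis: the class name and the `THRESHOLD` cutoff are assumptions, and the input is an in-memory array precisely because that is the random-access case where the midpoint is known up front.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: summing an array by splitting it in half, top-down.
class HalvingSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // assumed cutoff for going sequential

    private final long[] data;
    private final int from; // inclusive
    private final int to;   // exclusive

    HalvingSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        // Random access lets us find the midpoint without reading anything.
        int mid = (from + to) >>> 1;
        HalvingSum left = new HalvingSum(data, from, mid);
        HalvingSum right = new HalvingSum(data, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half here, then combine
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new HalvingSum(data, 0, data.length));
    }
}
```

A source that can only be read front to back, such as lines arriving from a file or a socket, has no midpoint to jump to, which is the mismatch the second point describes.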

While F/J can be bent somewhat to support sequential chunking, this is perceived by it as "anomalous" and "lopsided", eventually giving rise to insurmountable issues when combined with the unpredictable I/O latency in reading those chunks.

Streams API excels at the parallelization of in-memory structures and is usually helpful with the processing of lazy, I/O-backed streams, but it fails when you try to combine these two features in a single use case.

If you have a loop in your code which introduces a CPU-bound bottleneck, it is fairly likely that this loop is iterating over the contents of some file, network request, or rows of an SQL result set. None of these targets for parallelization get support from the Streams API.

The official position is that this use case is not supported because the Streams API has a different, equally legitimate focus. In the department of lazy parallel streams, this focus amounts to stream sources which are calculated from data existing within working memory, with the additional constraint that these sources must be unordered—that each member can be calculated independently, without the need to first calculate any other. An example of such a stream is a range of integers, but a stream of random numbers from an LCG is already outside of the area being focused on by the API, because these random numbers can only be generated sequentially.
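The range-versus-LCG contrast can be sketched like this. The class and method names are hypothetical; the LCG step uses the multiplier, addend and 48-bit mask documented for `java.util.Random`, chosen here only as a familiar example. In the range case every element depends only on its own index, so the work splits cleanly; in the LCG case each element needs the previous one, so the source can only be produced in sequence.

```java
import java.util.stream.LongStream;

// Hypothetical example: an independent source versus a sequentially dependent one.
class RangeVsLcg {

    // One LCG step, using the constants documented for java.util.Random.
    static long next(long state) {
        return (state * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
    }

    // Each element is computed from its index alone, so the range can be
    // split into halves and processed in parallel.
    static long sumOfSquares(int n) {
        return LongStream.range(0, n)
                .parallel()
                .map(i -> i * i)
                .sum();
    }

    // Each element is computed from the one before it, so the source
    // has to be generated one step at a time.
    static long[] lcgSequence(long seed, int n) {
        return LongStream.iterate(seed, RangeVsLcg::next)
                .limit(n)
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10)); // prints 285
        for (long value : lcgSequence(1L, 3)) {
            System.out.println(value);
        }
    }
}
```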


A computer scientist learns Amazon Web Services (AWS)