编程语言
272
1、概要
大多数任务都能够通过简单的单进程单线程任务处理好,但是还有一大部分现实诉求无法满足。批量任务存在两种并行模式
- 单进程、多线程
- 多进程
我们也可以细分为
- 多线程Step(单进程) Multi-thread Step
- 并行Step(单进程) Parallel Steps
- 对Step进行远程分块(多进程)Remote Chunking of Step
- 对Step进行分区 Partitioning a Step
今天我们将通过两个例子来解释多线程和并行任务...目前还仅限于单进程模式,后面会继续通过示例的方式说明多线程模式
2、开启并发并行之旅
项目依赖就不多说了,在之前的入门文章中已经说明。但是我们还需要添加如下两个依赖
<!-- https://mvnrepository.com/artifact/com.thoughtworks.xstream/xstream --> <dependency> <groupId>com.thoughtworks.xstream</groupId> <artifactId>xstream</artifactId> <version>1.4.12</version> </dependency> <!-- https://mvnrepository.com/artifact/org.springframework/spring-oxm --> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-oxm</artifactId> </dependency>
2.1 准备脚本
create table TRANSACTION ( ACCOUNT varchar(32) null, AMOUNT decimal null, TIMESTAMP datetime null );
我们创建了一张表,用于储存文件中的数据。
2.2、准备CSV数据
5113971498870901,-546.68,2018-02-08 17:46:12 4041373995909987,-37.06,2018-02-02 21:10:33 3573694401052643,-784.93,2018-02-04 13:01:30 3543961469650122,925.44,2018-02-05 23:41:50 ....
2.3、准备XM文件
<transactions> <transaction> <account>633110684460535475</account> <amount>961.93</amount> <timestamp>2018-02-03 18:30:51</timestamp> </transaction> <transaction> <account>3555221131716404</account> <amount>759.62</amount> <timestamp>2018-02-12 20:02:01</timestamp> </transaction> <transaction> <account>30315923571992</account> <amount>648.92</amount> <timestamp>2018-02-12 23:16:45</timestamp> </transaction> ...... </transactions>
2.4、多线程Step
最简单开启spring batch并发处理能力的办法就是将TaskExecutor添加到Step的配置中,如下
@Configuration public class MultiThreadJobConfiguration extends BaseJobConfiguration { public FlatFileItemReader<Transaction> fileTransactionReader() { Resource resource = new FileSystemResource("csv/bigtransactions.csv"); return new FlatFileItemReaderBuilder<Transaction>() .saveState(false) .resource(resource) .delimited() .names(new String[]{"account", "amount", "timestamp"}) .fieldSetMapper(fieldSet -> { Transaction transaction = new Transaction(); transaction.setAccount(fieldSet.readString("account")); transaction.setAmount(fieldSet.readBigDecimal("amount")); transaction.setTimestamp(fieldSet.readDate("timestamp", "yyyy-MM-dd HH:mm:ss")); return transaction; }) .build(); } @Bean @StepScope public JdbcBatchItemWriter<Transaction> writer(@Qualifier("dataSource") DataSource dataSource) { return new JdbcBatchItemWriterBuilder<Transaction>() .dataSource(dataSource) .beanMapped() .sql("INSERT INTO TRANSACTION (ACCOUNT, AMOUNT, TIMESTAMP) VALUES (:account, :amount, :timestamp)") .build(); } @Bean("multithreadedJob") public Job multithreadedJob() { return this.jobs.get("multithreadedJob") .start(step1()) .build(); } @Bean public Step step1() { ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor(); taskExecutor.setCorePoolSize(4); taskExecutor.setMaxPoolSize(4); taskExecutor.afterPropertiesSet(); return this.steps.get("multithreadedStep") .<Transaction, Transaction>chunk(1000) .reader(fileTransactionReader()) .writer(writer(null)) .taskExecutor(taskExecutor) .build(); } }
以上代码说明,我们分了4个线程,read和writer按照每块1000条数据执行。使用我当前的Intel® Core™ i5-3210M CPU @ 2.50GHz × 4 机器读取60000万条数据并且落地花费时间1分半钟。调整chunk大小,经过测试也会发现对于性能也存在一定的影响,实际生产环境中使用需要调整优化chunk大小。
2.5、并行Step
并行的代码看起来稍微复杂一点,个人理解并行任务和多线程并发任务没有本质区别,只是区别于不同的业务场景,并行任务区别于并发任务关键在于并行任务将一个大任务拆分为多个Flow,一个Flow可以串联多个Flow,一个Flow可以包含多个Step.下面是一个例子,并行读取两个文件,一个csv文件,一个xml文件。
@Configuration public class ParallelJobConfiguration extends BaseJobConfiguration { @Bean @StepScope public FlatFileItemReader<Transaction> fileTransactionReader() { Resource resource = new FileSystemResource("data/csv/bigtransactions.csv"); return new FlatFileItemReaderBuilder<Transaction>() .saveState(false) .resource(resource) .delimited() .names(new String[]{"account", "amount", "timestamp"}) .fieldSetMapper(fieldSet -> { Transaction transaction = new Transaction(); transaction.setAccount(fieldSet.readString("account")); transaction.setAmount(fieldSet.readBigDecimal("amount")); transaction.setTimestamp(fieldSet.readDate("timestamp", "yyyy-MM-dd HH:mm:ss")); return transaction; }) .build(); } @Bean @StepScope public StaxEventItemReader<Transaction> xmlTransactionReader() { Resource resource = new FileSystemResource("data/xml/bigtransactions.xml"); Map<String, Class> map = new HashMap<>(); map.put("transaction", Transaction.class); map.put("account", String.class); map.put("amount", BigDecimal.class); map.put("timestamp", Date.class); XStreamMarshaller marshaller = new XStreamMarshaller(); marshaller.setAliases(map); String[] formats = {"yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd"}; marshaller.setConverters(new DateConverter("yyyy-MM-dd HH:mm:ss", formats)); return new StaxEventItemReaderBuilder<Transaction>() .name("xmlFileTransactionReader") .resource(resource) .addFragmentRootElements("transaction") .unmarshaller(marshaller) .build(); } @Bean @StepScope public JdbcBatchItemWriter<Transaction> jdbcBatchItemWriter(@Qualifier("dataSource") DataSource dataSource) { return new JdbcBatchItemWriterBuilder<Transaction>() .dataSource(dataSource) .beanMapped() .sql("INSERT INTO TRANSACTION (ACCOUNT, AMOUNT, TIMESTAMP) VALUES (:account, :amount, :timestamp)") .build(); } @Bean("parallelJob") public Job parallelStepsJob() { return this.jobs.get("parallelJob") .start(parallelFlow()) .end() .build(); } @Bean public Flow parallelFlow() { return new FlowBuilder<Flow>("parallelFlow") .split(new SimpleAsyncTaskExecutor()) .add(flow1(), flow2()) .build(); } @Bean public Flow flow1() { return new FlowBuilder<Flow>("flow1") .start(step1()) .build(); } @Bean public Flow flow2() { return new FlowBuilder<Flow>("flow2") .start(step2()) .build(); } @Bean("xmlStep") public Step step1() { return this.steps.get("xmlStep") .<Transaction, Transaction>chunk(1000) .reader(xmlTransactionReader()) .writer(jdbcBatchItemWriter(null)) .build(); } @Bean("fileStep") public Step step2() { return this.steps.get("fileStep") .<Transaction, Transaction>chunk(1000) .reader(fileTransactionReader()) .writer(jdbcBatchItemWriter(null)) .build(); }
2.6、运行任务
# 执行多线程任务 curl http://localhost:8080/launchMultiThreadjob # 执行并行任务 curl http://localhost:8080/launchParallelJobjob # 或者通过浏览器打开上面的地址