Nov 28, 2020

Read Spark dataframes using regular expressions

There's a way to load Spark dataframes using path patterns that work much like regular expressions. I'm loading a lot of data from AWS to process in Spark, and specifying a path pattern cuts loading times tremendously. Spark inherits Hadoop's ability to match paths against glob patterns, and it works when reading from S3 too. Link to the documentation is here. Sample usage:
  
spark.read.parquet("s3://bucket/data/year=20{20,19}/month=*/day=[01]*")
The closest alternative would be:
  
spark.read.parquet("s3://bucket/data").where("[insane sql here]")
Time-wise, the first approach is orders of magnitude faster.
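To see which partition directories a pattern like the one above selects, here is a minimal sketch in plain Java. It uses the JDK's own glob matcher, not Hadoop's, so it is only an approximation: the three constructs used here ({a,b} alternatives, * within one path segment, and [01] character sets) should behave the same way in both. The class name GlobDemo and the sample paths are made up for illustration, and the s3://bucket prefix is left out.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

final class GlobDemo {
    // Same pattern as in the post, without the s3://bucket prefix.
    static final String PATTERN = "data/year=20{20,19}/month=*/day=[01]*";

    // True if the given relative path is selected by PATTERN (JDK glob rules).
    static boolean matches(String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + PATTERN);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Hypothetical partition directories.
        String[] candidates = {
            "data/year=2020/month=03/day=15",
            "data/year=2019/month=12/day=25",
            "data/year=2018/month=03/day=15"
        };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + matches(candidate));
        }
    }
}
```

Only the first path is selected: the second fails on day=[01]* and the third on year=20{20,19}.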

May 3, 2017

How to read CSV for Spark 2.1.0 mllib for dummies

Intro

It looks like no one is reading CSV for Spark 2.1.0 anymore. The only reference I could find was https://elbauldelprogramador.com/en/how-to-convert-column-to-vectorudt-densevector-spark/.
So, here come my 5 cents on the issue.

Code

    
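    import org.apache.spark.SparkContext
    import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}
    import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types._

    //master is your Spark master URL, e.g. "local[*]"
    val sc: SparkContext = new SparkContext(master,
       "SuperApp", System.getenv("SPARK_HOME"))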
    val session: SparkSession = SparkSession.builder().getOrCreate()

    //Firstly define schema
    val struct = StructType(
      StructField("price", DoubleType, false) ::
        StructField("_id", StringType, false) ::
        StructField("modelYearId", IntegerType, false) ::
        StructField("zip", IntegerType, false) ::
        StructField("modelYear", IntegerType, false) ::
        StructField("modelId", IntegerType, false) ::
        StructField("makeId", IntegerType, false) ::
        StructField("mileage", DoubleType, false) :: Nil)

    val df: DataFrame = session.sqlContext.read.schema(struct)
        .option("header", "true").csv("cars.csv")
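    var data: DataFrame = df
    //Transform id-like variables into categorical ones;
    //every xxxVec column used by the assembler below is created here
    for (col <- Seq("modelYearId", "zip", "modelId", "makeId")) {
      data = new OneHotEncoder()
        .setInputCol(col)
        .setOutputCol(col + "Vec").transform(data)
    }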

    //Assemble features that matter
    val assembler = new VectorAssembler().
      setInputCols(Array("modelYearIdVec", "zipVec", 
        "modelYear", "modelIdVec", "makeIdVec", "mileage")).
      setOutputCol("features")
    //to verify our schema is as we want
    data.printSchema()

    data = assembler.transform(data)
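For readers new to the "categorical" step: a zip code or a model id is a label, not a magnitude, so it is turned into a vector with a single 1 at the position of the category. Below is a minimal plain-Java sketch of that idea. It is not Spark's implementation: Spark's OneHotEncoder returns sparse vectors and by default drops the last category. The class and method names are made up for illustration.

```java
import java.util.Arrays;

final class OneHotSketch {
    // Returns a dense vector of length numCategories with a single 1.0 at position index.
    static double[] oneHot(int index, int numCategories) {
        if (index < 0 || index >= numCategories) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        double[] vector = new double[numCategories];
        vector[index] = 1.0;
        return vector;
    }

    public static void main(String[] args) {
        // Category 2 out of 4 possible categories.
        System.out.println(Arrays.toString(oneHot(2, 4))); // prints [0.0, 0.0, 1.0, 0.0]
    }
}
```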

 
And then the pretty usual mllib tutorial stuff:
    // Split the data into training and
    // test sets (30% held out for testing).
    val Array(trainingData, testData) = data
        .randomSplit(Array(0.7, 0.3))

    val lr = new LinearRegression()
      .setLabelCol("price")
      .setMaxIter(200)
      .setRegParam(10)
      .setElasticNetParam(0.75)

    // Fit the model
    val lrModel: LinearRegressionModel = lr.fit(trainingData)

    // Print the coefficients and intercept for linear regression
    println(s"Coefficients: ${lrModel.coefficients} " +
        s"Intercept: ${lrModel.intercept}")

    // Summarize the model over the training set 
    //and print out some metrics
    val trainingSummary = lrModel.summary
    println(s"numIterations: ${trainingSummary.totalIterations}")
    println("objectiveHistory: " +
        s"[${trainingSummary.objectiveHistory.mkString(",")}]")
    trainingSummary.residuals.show()
    println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
    println(s"r2: ${trainingSummary.r2}")

May 1, 2017

Cost of regular expressions in java

Recently I came across an interesting case of transforming names to the so-called "slug" format:
Step 1: convert to lowercase
Step 2: replace every non-alphanumeric character with a hyphen (-), from start to end of the string; Unicode characters are not treated as alphanumeric
Step 3: collapse adjacent hyphens (-) into one
Step 4: remove leading and trailing hyphens
First, a quite readable solution:

    public static final String CHARS_REG = "[^a-zA-Z0-9]";
    public static final String DOUBLE_REG = "[-]+";
    public static final String TRIM_REG = "-$|^-";
    public static final String DASH = "-";
    static final Pattern CHARS = Pattern.compile(CHARS_REG);
    static final Pattern DOUBLE = Pattern.compile(DOUBLE_REG);
    static final Pattern TRIM = Pattern.compile(TRIM_REG);

    public static String makeSlugNamePatterny(String name) {
        String it = name.toLowerCase().trim();
        it = CHARS.matcher(it).replaceAll(DASH);
        it = DOUBLE.matcher(it).replaceAll(DASH);
        it = TRIM.matcher(it).replaceAll("");
        return it;
    }
Unfortunately, out-of-the-box Java regexp support cannot trim and lowercase, so we see 5 operations here. I was curious to implement it regex-free and compare performance:
    public static String makeSlugNameOptimized(String name) {
        String result = name + " ";
        StringBuilder temp = new StringBuilder();
        char ch1 = 0;
        char ch2 = toLowCaseAlfanumeric(result.charAt(0));
        for (int i = 0; i < result.length() - 1; i++) {
            ch1 = ch2;
            ch2 = toLowCaseAlfanumeric(result.charAt(i + 1));
            if (ch2 != ch1) {
                temp.append(ch1);
            } else if ('-' != ch1) {
                temp.append(ch1);
            }
        }

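        // Guard on length so that an empty or all-punctuation name does not throw
        while (temp.length() > 0 && '-' == temp.charAt(0)) {
            temp.deleteCharAt(0);
        }
        while (temp.length() > 0 && '-' == temp.charAt(temp.length() - 1)) {
            temp.deleteCharAt(temp.length() - 1);
        }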

        return temp.toString();
    }

    private static final int Aa = 'a' - 'A';

    private static char toLowCaseAlfanumeric(char c) {
        if (c >= 'a' && c <= 'z') {
            return c;
        }
        if (c >= 'A' && c <= 'Z') {
            return (char) (c + Aa);
        }
        if (c >= '0' && c <= '9') {
            return c;
        }
        return '-';
    }
Note that the optimized version doesn't trim or lowercase separately - both are part of the loop.
Running each version 500 times over 500 random strings gave the following: regex version - 2059 milliseconds, optimized version - 277.
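The post does not show the measuring code, so here is a rough sketch of how such a comparison could be set up. Everything in it is an assumption: the class name SlugBenchmark, the seed, the string lengths and the printable-ASCII alphabet. The regex method follows the first listing (with Locale.ROOT added so the result does not depend on the default locale), and the loop method is a compact single-pass variant of the same idea, not the exact method above. It is naive wall-clock timing without JIT warm-up; a harness like JMH would be the better tool for serious numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;
import java.util.function.Function;
import java.util.regex.Pattern;

final class SlugBenchmark {
    static final Pattern CHARS = Pattern.compile("[^a-zA-Z0-9]");
    static final Pattern DOUBLE = Pattern.compile("[-]+");
    static final Pattern TRIM = Pattern.compile("-$|^-");
    static int sink; // accumulates result lengths so the work cannot be optimized away

    // Regex version: lowercase, trim, then three replaceAll passes.
    static String slugRegex(String name) {
        String it = name.toLowerCase(Locale.ROOT).trim();
        it = CHARS.matcher(it).replaceAll("-");
        it = DOUBLE.matcher(it).replaceAll("-");
        return TRIM.matcher(it).replaceAll("");
    }

    // Single pass: lowercases ASCII letters, never emits a leading or repeated hyphen.
    static String slugLoop(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                c = (char) (c + ('a' - 'A'));
            }
            boolean alphanumeric = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
            if (alphanumeric) {
                out.append(c);
            } else if (out.length() > 0 && out.charAt(out.length() - 1) != '-') {
                out.append('-');
            }
        }
        // At most one trailing hyphen can remain; drop it.
        if (out.length() > 0 && out.charAt(out.length() - 1) == '-') {
            out.setLength(out.length() - 1);
        }
        return out.toString();
    }

    // Reproducible random names made of printable ASCII characters.
    static List<String> randomNames(int count, long seed) {
        Random random = new Random(seed);
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int length = 5 + random.nextInt(20);
            StringBuilder sb = new StringBuilder(length);
            for (int j = 0; j < length; j++) {
                sb.append((char) (32 + random.nextInt(95)));
            }
            names.add(sb.toString());
        }
        return names;
    }

    // Applies f to every name, rounds times, and returns elapsed milliseconds.
    static long timeMillis(Function<String, String> f, List<String> names, int rounds) {
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String name : names) {
                sink += f.apply(name).length();
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<String> names = randomNames(500, 42L);
        System.out.println("regex: " + timeMillis(SlugBenchmark::slugRegex, names, 500) + " ms");
        System.out.println("loop:  " + timeMillis(SlugBenchmark::slugLoop, names, 500) + " ms");
    }
}
```

The absolute numbers will differ from machine to machine; only the ratio between the two lines is of interest.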

Conclusion

Roughly a 7x speedup (2059 ms vs 277 ms) - that is the cost of regexps in my case. Readability can be improved by extracting some methods, so it should not be a concern here. Generally speaking, a regexp is either simple, in which case you can work around it like I did, or complex, in which case it is both slow and unreadable.
Basically, see the classic: https://xkcd.com/1171/

Jul 29, 2016

Part 2. Why software outsourcing companies should be worried that business is booming

Continuation of my previous post
This blog post will speculate on Innovation and Automation processes and how they might affect the IT outsourcing business.

Fictional, yet quite probable story of one project in one bank

In a galaxy far, far away, one bank decided to deprecate its legacy system and re-create a project, in order to increase its flexibility and deliver more features that will reliably provide a greater amount of revenue.

Chapter 1. Team & Project

How should we reassign team and project in a safe way? There are several obvious choices:
  1. Same team, in-house - the new system will probably be just the old system rewritten: same technology, same architecture. It might give some benefits, but overall delivery and time will suffer. Additional hires are possible, with one question: what to do with the extra people once the migration is over?
  2. 2 teams, in-house - take the most promising people from the old project, hire some new ones and re-create the project with a new approach.
  3. 2 teams, old project in-house and new one outsourced - promote someone from the old team, so that he transfers his knowledge to some smart people offshore.
  4. Outsource support + develop the new project in-house (or mixed) - leave undesired people coaching how to fix bugs in the old system, take brilliant in-house engineers and mix them with some talented contractors.

Chapter 2. Automation

Every project needs infrastructure. What will be your choice?
  1. Keep the same old reliable manual deployment process - +20% to development time, +1 Ops guy
  2. Have a separate automation engineer - -10% development time, +1 DevOps
  3. Have everyone (or at least 2 people) in the new project team able to support an automated deployment (CD) process - -10% dev time, requires up-front investment, +1 DevOps
Would you outsource automation of the CD process? What if something goes wrong?

Chapter 3. Innovation

What technology and approach to use for the new project?
  1. Same technology, same approach - who needs innovation anyway?
  2. New technology - the learning curve might be steep, risks of instability are high (really?), potential profit is a fast time to market with less maintenance.
Here I want to add some specifics. Let's compare Java and Scala:
  1. Stability - Scala was presented in 2003, so it's pretty stable now. Many fall victim to the familiarity bias.
  2. Quality - less code means fewer bugs
  3. Productivity - once you master it (it's more complex than Java), you can be much faster than before (survey result with numbers)

Chapter 4. Conclusions and alternative endings

Now let's say there are 100 companies in the same market. How many of them can survive with the ch1.3 approach (outsource new stuff)? It is much riskier than ch1.4 and, in a 2-year perspective, quite suicidal for the business (as in Funky Business - outsource all but critical).
Let's say 30 of them will go with the ch1.4 approach. 10 will choose ch2.3 and 5 will go with ch3.2. 3 of them will go bankrupt for various reasons, new technology included.
So 2 out of 100 projects will have a time to market at least twice as short as the market average. Management in these projects will listen to their developers, and developers will listen to product owners. I would definitely buy stocks of these companies.
In a 4-year perspective, which companies/projects will survive and flourish? I think the ones that have chosen ch1.4+ch2.3+ch3.2. Do you think these companies will be outsourcing new projects/new technologies? Quite possible. Will service providers be ready for it?

Jun 10, 2016

Part 1. Why software outsourcing companies should be worried that business is booming

I'll show why booming business might not be such a good thing for IT outsourcing companies.
The trend looks good for outsourcing providers, at least for INFOSYS, EPAM and LUXOFT: quarterly (YoY) growth of 13.30%-32.20%, with steady positive cash flow. But can it be the beginning of the end?

5 bullet points

IT outsourcing growth might be linked with:
This leads to the following risks:

Technology adoption challenge

In a future blog post I'll speculate about the Innovation and Automation points above, in particular how emerging languages can affect IT business.

Summary

The outsourcing boom might be linked with unimportant projects on deprecated technologies at the end of their life-cycle, and can turn into dust pretty quickly.

Links

Deloitte’s 2014 Global Outsourcing and Insourcing Survey
10 IT Outsourcing Trends to Watch in 2014

May 27, 2016

Pass arguments to your main method in gradle bootRun

If you are tired of Spring Boot configuration magic, there is one more trick to confuse you completely.
How to configure Spring to work without knowing profile or environment? Dropwizard style, like this:
java -jar myjar.jar myconfig.yml
No profiles, no wondering how my properties got populated.

Spring Boot's first-class configuration is a Java bean. Unfortunately, due to legacy issues, we need to include an XML config. After some pain, suffering and reading, I found a solution:
@ImportResource("classpath:application-context/applicationContext.xml")
//@Component
public class BatchConfiguration 

In a similar way you can include a properties file.
The next step is ${properties}. In the XML configuration there are some environment-specific properties, like the database URL and so on. Some advise JNDI, but it introduces one more layer of magic - configuration of a web container.
So the question is how to populate properties from an external config. There are a ton of solutions out there, but none worked for me. Maybe because it's Spring Boot+Spring Data+Spring Batch, or maybe because I don't understand this page. Anyway, I found my own way:
    
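public static void main(String[] args) throws IOException {
        if (args.length >= 1) {
            // The first argument is the path to an external properties file
            Properties p = new Properties();
            try (FileReader reader = new FileReader(args[0])) {
                p.load(reader);
            }
            // Expose every entry as a JVM system property so that ${...} placeholders resolve
            p.stringPropertyNames().forEach(key -> System.setProperty(key, p.getProperty(key)));
        }
        SpringApplication.run(Application.class, args);
    }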
Now it works perfectly with the fatJar task, but what about bootRun? bootRun saves a lot of dev time and is very easy to use. As you probably know, BootRun extends Gradle's standard JavaExec task, which has an awesome parameter: args. It goes straight to your main method. So the Gradle bootRun task looks like:
bootRun {
    args = ["cars-etl.properties"]
}
That is all you need to make your solution work! Have fun.
P.S. HowTo pass jvmArgs
P.P.S. HowTo run job from Controller
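
The property-loading part of the main method can be tried on its own, without Spring. The sketch below is only an illustration: the class name ExternalConfig, the temporary file and the db.url key are all made up, and the file is created in main just to keep the example runnable.

```java
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class ExternalConfig {
    // Loads a .properties file and copies every entry into the JVM system properties.
    // Returns the number of entries copied.
    static int loadIntoSystemProperties(String file) throws IOException {
        Properties properties = new Properties();
        try (FileReader reader = new FileReader(file)) {
            properties.load(reader);
        }
        int copied = 0;
        for (String key : properties.stringPropertyNames()) {
            System.setProperty(key, properties.getProperty(key));
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical config file with a single made-up key.
        Path config = Files.createTempFile("cars-etl", ".properties");
        Files.write(config, "db.url=jdbc:h2:mem:demo\n".getBytes(StandardCharsets.UTF_8));

        loadIntoSystemProperties(config.toString());
        System.out.println(System.getProperty("db.url"));
    }
}
```

Because the values end up in system properties, anything that resolves placeholders from system properties can pick them up, which is what the main method above relies on.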

May 25, 2016

MongoBulkItemWriter

Spring Batch is a nice choice for simple ETL jobs, but it doesn't work well with MongoDB, especially when writing to it. The MongoItemWriter provided by Spring Batch doesn't do bulk inserts.
Fortunately for us, bulk inserts are quite easy to implement:
import com.mongodb.BulkWriteOperation;
import com.mongodb.DBObject;
import org.springframework.batch.item.ItemWriter;
import org.springframework.data.mongodb.core.MongoTemplate;

import java.util.List;

public class MongoBulkItemWriter<T> implements ItemWriter<T> {

    private final String collection;
    private final MongoTemplate template;

    public MongoBulkItemWriter(String collection, MongoTemplate mongoTemplate) {
        this.collection = collection;
        this.template = mongoTemplate;
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        // Nothing to send for an empty chunk
        if (items.isEmpty()) {
            return;
        }
        // Queue every item into one unordered bulk operation and execute them together
        BulkWriteOperation bulk = template.getCollection(collection).initializeUnorderedBulkOperation();
        items.forEach(i -> bulk.insert((DBObject) template.getConverter().convertToMongoType(i)));
        bulk.execute();
    }
}
It works much faster, but beware: it does inserts only, so an item with a duplicate id will ruin your batch. A solution might look something like this:
        BulkWriteOperation bulk = template.getCollection(COLLECTION_NAME).initializeUnorderedBulkOperation();
        updates.forEach(u -> {
            bulk.find(new BasicDBObject("id", u.getId())).upsert().update(u.getDbObject());
        });
        bulk.execute();
Upserts are much slower than pure inserts, but still a huge win compared with per-object writes.