Case Sensitive Search Configuration with Presto
When dealing with character data, case sensitivity can be important when searching for specific matches or patterns. But not all databases and query engines behave in the same way. Some are case insensitive by default, some are not. How do we configure things so they behave in the way we want?
Here’s an example of why we might need to take steps for case sensitive search configuration. We’re accessing a MySQL database directly:
mysql> select * from state where name='iowa'; +------+----+--------------+ | name | id | abbreviation | +------+----+--------------+ | Iowa | 19 | IA | +------+----+--------------+ 1 row in set (0.00 sec)
MySQL is case-insensitive by default. Even though the MySQL column contains the capitalized string ‘Iowa’ it still matched the query’s restriction of ‘iowa’. This may be acceptable, but in some use cases it could lead to unexpected results.
Using Presto to access the same MySQL data source things behave differently, and arguably, in a more expected way:
presto:demo> select * from state where name='iowa'; name | id | abbreviation ------+----+-------------- (0 rows) Query 20201120_151345_00001_wjx6r, FINISHED, 1 node Splits: 17 total, 17 done (100.00%) 0:07 [1 rows, 0B] [0 rows/s, 0B/s] presto:demo> select * from state where name='Iowa'; name | id | abbreviation ------+----+-------------- Iowa | 19 | IA (1 row)
Now we only get a match with ‘Iowa’, and not with ‘iowa’. Presto has essentially made this data source (MySQL) case sensitive, even though it is exactly the same database in both the above examples, with default configurations used.
Reconfigure Case Sensitivity
With a RDBMS like MySQL you can configure the collation setting to control if you want case sensitivity or not. You can set the collation at the database creation or table creation level as a part of the CREATE statement. Or you can use ALTER to change the collation of a database, table or individual column. This is described in MySQL’s documentation.
But how do you change Presto to be case-insensitive? Presto does not support collation, because it is not a database and doesn’t store data. And there is no configuration parameter that controls this. To manage case sensitivity in Presto, and mimic collation, we rewrite the query to force case insensitivity explicitly by using:
- simple lower() or upper() functions
select * from state where lower(name)='california'; name ------------ california California CALIFORNIA (3 rows)
This query has matched any upper/lower case combination in the table, mimicking case insensitivity.
Or regular expressions.
select * from state where regexp_like(name, '(?i)california'); name ------------ california California CALIFORNIA (3 rows)
The regular expression syntax (?i) means matches are case insensitive.
When it comes to Case Sensitive Search Configuration you are now an eXpErT.