Presto Delete Statement
Why are Delete query statements important?
DELETE statements are helpful in a wide variety of situations, such as:
- Cleaning up error records
- Removing duplicate records
- Removing unnecessary or unexpected results from result tables
Previously, Treasure Data only supported delete based on the td_partial_delete CLI command. This Presto delete feature drastically improves the flexibility with which you can delete and manage data in their tables.
How do I issue a DELETE query statement?
Presto DELETE query statements can be issued using the the following syntax
DELETE FROM table_name [ WHERE condition ]
|As a protection mechanism, DELETE statements require you to include a WHERE clause. Use of the WHERE clause reduces the occurrence of unintentional DELETE commands.|
Delete all line items shipped by air:
DELETE FROM lineitem WHERE shipmode = 'AIR'
Delete all event logs from a July 1st, 2017 UTC:
DELETE FROM lineitem WHERE TD_TIME_RANGE(time, '2017-07-01','2017-07-02')
Resource Consumption with the Delete Statement
DELETE queries issue jobs and consume resources just like regular presto queries.
|Due to the nature of Treasure Data's environment, some of these queries can be quite resource intensive if not restricted by the partitioning of your table.|
For example, if your table is only partitioned by time (partitioning based on columns other than ‘time’ is coming soon), and you are trying to delete a set of users that were created in June 2018, it is important to include the time restriction even if that’s not a critical parameter for your DELETE statement.
For example, you should:
DELETE FROM logtable WHERE userid IN (1234, 1235, 1236) AND TD_TIME_RANGE(time, '2018-06-01','2018-07-01')
Including the time restriction will significantly improve the speed of your query, and reduce the resources that the query consumes.
No DELETE for Recently Streaming Collected Data
One restriction on this functionality is that data that is recently collected (typically last 2-4 hours) through a streaming ingestion method Can Not be deleted using DELETE statement. This is because streaming data is temporarily stored in our “real-time” data storage.
If you try to delete data in real-time storage, the query fails to run. In these cases, users have the following options: 1. Users can select a range of data that does not include data within real-time storage. 2. The data in real-time storage is propagated to archive storage, and can then be deleted. The propagation typically occurs within 2-6 hours after arrival, though it can be longer at times. 3. Users can add the following “magic comment” within the SQL statement, in order to force the query to run. In this case, the query STILL DOES NOT DELETE data in real-time storage, but does not cause the query to fail.
-- set session force_delete = 'true'
DELETE against Data Tank
DELETE statements can not be issued against data stored in Data Tank. See our documentation regarding querying Data Tank from Presto for more details.
Multiple DELETE jobs on the same table are NOT allowed when there is overwrap on a partition file
Multiple DELETE jobs on the same table are allowed only when there’s no overwrap on a partition file. If the table is partitioned by 1-hour, it can be deleted with a different time-range. But there could be large time range data files so it’s not always guaranteed to be executed. More precisely, multiple jobs cannot remove the same partition file at the same time.
Also, when a DELETE job is running and internal optimization process for partitioning merges (deletes small files and create large one) the data which the DELETE involves, an error occurs:
Metadata transaction conflicted with others.
If this error occurs, retry the DELETE job manually.