The Presto DELETE statement has several known limitations.
No DELETE for Recently Collected Streaming Data
Data that is recently collected (typically last 2-4 hours) through streaming ingestion method can not be deleted using the DELETE statement because streaming data is temporarily stored in the TD “real-time” data storage.
If you try to delete data in real-time storage, the query will fail to run.
To workaround this issue, you can choose one of the following:
- Select a range of data that does not include data within real-time storage.
- Wait for data in real-time storage to be propagated to archive storage, then delete it. This typically occurs within 2-6 hours after arrival, though it can be longer.
- You can add the following magic comment within the SQL statement, to force the query to run.
-- set session force_delete = 'true'
The query will NOT DELETE data in real-time storage, but the query is considered to have run successfully.
To ensure the deletion of records in a table, including those in real-time storage, use the following steps:
- Use DELETE with the magic comment to delete the records in archive storage.
- Use a SELECT(COUNT(1)) ... LIMIT 1 with the same predicates as the DELETE to determine whether there are any matching records remaining. These will be in real-time storage, or they might have been propagated to archive storage after the initial DELETE statement.
- Repeat the preceding steps periodically, until SELECT(COUNT(1))... LIMIT 1 returns a zero result.
Depending on propagation time from real-time storage to archive storage, it may take a while before all matching records are deleted.
Delete your records:
-- set session force_delete = 'true'
DELETE FROM lineitem where ship='AIR';
Then check whether there are remaining records in the table not covered by the delete:
SELECT COUNT(1) as REMAINING_ROWS FROM lineitem where ship='AIR' LIMIT 1;
After this SELECT statement returns 0 for REMAINING_ROWS, you know there are no rows left in realtime storage or archive storage.
DELETE against DataTank
DELETE statements cannot be issued against data stored in Data Tank. Connect directly to the PostgreSQL Data Tank to issue DELETE statements on data stored there. See querying Data Tank from Presto for more details.
Multiple DELETE jobs on the same table are NOT allowed when there is overwrap on a partition file
Multiple DELETE jobs on the same table are allowed only when there’s no overwrap on a partition file. If the table is partitioned by 1-hour, it can be deleted with different time-range. But there could be large time range data files so it’s not always guaranteed to be executed. More precisely, multiple jobs cannot remove the same partition file at the same time.
Also when a DELETE job is running and internal optimization process for partitioning merges (deletes small files and creates large ones) the data which DELETE involves, an error occurs: Metadata transaction conflicted with others.
In this case, retry the DELETE job manually.
Undoing Delete Statements
If you want to rollback an executed delete statement and recover deleted data, you must contact Arm Treasure Data technical support and provide the jobID of the job that contained the executed Presto DELETE statement. In many cases, you can undo a delete statement. However, certain situations prevent the rollback of Presto DELETE statements.
Situations that prevent rollback include:
- When new partitions have been deleted by another DELETE statement.
- When original partitions have been discarded, which can happen after a retention period.
- When a table is dropped and a new table with the same name is created.
- When new partitions have been further modified or replaced, which can happen as a result of internal storage maintenance processes.
DELETE statement Timeout Error
Depending on the number of partitions that contain rows to be deleted, a DELETE operation can take several hours. An operation can time out and the job fail if too many partitions are affected. If the job fails and the output contains the following message:
org.plazmadb.metadb.MetadataSQLException: ERROR: canceling statement due to statement timeout
Then you should rewrite the DELETE statement to affect fewer rows or fewer partitions. The most effective way to do this is to include a WHERE clause that limits the time range for the DELETE with TD_TIME_RANGE or TD_INTERVAL. For example, if you have a year's worth of game history data, instead of
DELETE FROM game_history WHERE player_id=1000;
DELETE FROM game_history WHERE player_id=1000 AND TD_TIME_RANGE(time, '2018-01-01', '2018-02-01','PDT')
and then delete more time ranges until all rows are deleted. TD support can provide a workflow that simplifies automating this process.