Why are Delete query statements important?
DELETE statements are helpful in a wide variety of situations, such as:
- Cleaning up error records
- Removing duplicate records
- Removing unnecessary or unexpected results from result tables
Presto DELETE of individual records from tables offers far greater flexibility than the previous td_partial_delete CLI command. Presto DELETE drastically improves the flexibility with which our users are able to delete and manage data in their tables.
How do I issue a DELETE query statement?
Presto DELETE query statements can be issued using the following syntax
DELETE FROM table_name [ WHERE condition ]
|As a protection mechanism DELETE statements do require you to include a WHERE clause. This is to reduce the occurrence of unintentional DELETE commands.|
Delete all line items shipped by air:
DELETE FROM lineitem WHERE shipmode = 'AIR'
Delete all event logs from a July 1st, 2017 UTC:
DELETE FROM lineitem WHERE TD_TIME_RANGE(time, '2017-07-01','2017-07-02')
Resource Consumption During Delete Statement
DELETE queries issue jobs and consume resources just like regular presto queries.
|Due to the storage and indexing methods used in Treasure Data, a DELETE query, can be quite resource intensive. To prevent a DELETE query from performing a full table scan, you should use a time expression like TD_TIME_RANGE or TD_INTERVAL if the table uses default time-based partitioning, or an equality predicate on all partition keys, if the table uses user-defined partitioning.|
For example, if your table is only partitioned by time (as opposed to tables using user-defined partitioning), and you are trying to delete a set of users that were created in June 2018, it is important to include the time restriction even if that’s not a critical parameter for your DELETE statement.
For example, you should:
DELETE FROM logtable WHERE userid IN (1234, 1235, 1236) AND TD_TIME_RANGE(time, '2017-06-01','2017-07-01')
Including the time restriction will significantly improve the speed of your query, and reduce the resources this query consumes.
Timeout Error During Delete Operations
Depending on the number of partitions that contain rows to be deleted, a DELETE operation can take several hours. An operation can time out and the job fail if too many partitions are affected. If the job fails and the output contains the following message:
org.plazmadb.metadb.MetadataSQLException: ERROR: canceling statement due to statement timeout
Then you should rewrite the DELETE statement to affect fewer rows or fewer partitions. The most effective way to do this is to include a WHERE clause that limits the time range for the DELETE with TD_TIME_RANGE or TD_INTERVAL. For example, if you have a year's worth of game history data, instead of
DELETE FROM game_history WHERE player_id=1000;
DELETE FROM game_history WHERE player_id=1000 AND TD_TIME_RANGE(time, '2018-01-01', '2018-02-01','PDT')
and then delete more time ranges until all rows are deleted. TD support can provide a workflow that simplifies automating this process.
No DELETE for Recently Collected Streaming Data
One restriction on this functionality is that data that is recently collected (typically last 2-4 hours) through streaming ingestion method Can Not be deleted using DELETE statement. This is because streaming data is temporarily stored in our “real-time” data storage.
If you try to delete data in real-time storage, the query will fail to run. In these cases, users have the following options:
- Select a range of data that does not include data within real-time storage.
- Wait for data in real-time storage to be propagated to archive storage, then delete it. This typically occurs within 2-6 hours after arrival, though it can be longer.
- You can add the following “magic comment” within the SQL statement, to force the query to run.
-- set session force_delete = 'true'
In this case the query WILL STILL NOT DELETE data in real-time storage, but the query will be considered to have run successfully.
You can ensure the deletion of records in a table, including those in real-time storage, using the following steps:
- Use DELETE with the magic comment to delete the records in archive storage;
- Use a SELECT(COUNT(1)) ... LIMIT 1 with the same predicates as the DELETE to determine whether there are any matching records remaining. These will be in real-time storage, or they may have been propagated to archive storage after the initial DELETE statement.
- Repeat the preceding steps periodically, until SELECT(COUNT(1))... LIMIT 1 returns a zero result.
Depending on propagation time from real-time storage to archive storage, it may take a while before all matching records are deleted.
First delete your records:
-- set session force_delete = 'true' DELETE FROM lineitem where ship='AIR';
Then check whether there are remaining records in the table not covered by the delete:
-- check for any remaining records SELECT COUNT(1) as REMAINING_ROWS FROM lineitem where ship='AIR' LIMIT 1;
After this SELECT statement returns 0 for REMAINING_ROWS, you know there are no rows left in realtime storage or archive storage.
DELETE against DataTank
DELETE statements cannot be issued against data stored in Data Tank. Connect directly to the PostgreSQL Data Tank to issue DELETE statements on data stored there. See our documentation regarding querying Data Tank from Presto for more details.
Multiple DELETE jobs on the same table are NOT allowed when there is overwrap on a partition file
Multiple DELETE jobs on the same table are allowed only when there’s no overwrap on a partition file. If the table is partitioned by 1-hour, it can be deleted with different time-range. But there could be large time range data files so it’s not always guaranteed to be executed. More precisely, multiple jobs cannot remove the same partition file at the same time.
Also when a DELETE job is running and internal optimization process for partitioning merges (deletes small files and creates large ones) the data which DELETE involves, an error occurs:
Metadata transaction conflicted with others.
In this case, retry the DELETE job manually.
Undoing Delete Statements
If you want to rollback an executed delete statement and recover deleted data, you must contact Arm Treasure Data technical support and provide the jobID of the job that contained the executed Presto delete. In many cases, you can undo a delete statement. However, certain situations prevent the rollback of Presto delete statements.
- When new partitions have been deleted by another DELETE statement.
- When original partitions have been discarded, which can happen after a retention period.
- When a table is dropped and a new table with the same name is created.
- When new partitions have been further modified or replaced (which can happen as a result of internal storage maintenance processes).