https://questdb.io logo
Title
h

Hristo Katsarski

04/21/2023, 6:45 AM
Hello, How can I find duplicate rows in a table? We have a table that has a Symbol type column ("code") and a Timestamp column ("ddate"). I'd like to run a query to see such rows, however "having count()" does not seem supported. Subsequently I'd like to write a Java procedure (via PostgreJDBC) to get rid of duplicate rows.
SELECT code, ddate FROM stock GROUP BY code, ddate HAVING COUNT(ddate) > 1;
b

Bolek Ziobrowski

04/21/2023, 7:17 AM
HAVING clause is not yet supported but you could easily work around that with a subquery :
SELECT code, ddate
FROM
(
  SELECT code, ddate, count(ddate) as cnt 
  FROM stock 
  GROUP BY code, ddate 
) WHERE cnt > 1;
Are you sure you only want to count duplicates for pairs with non-null ddate ? Removing duplicates could either be: • logical (update stock set deleted = true where ...) or • physical - DELETE statement is not supported so you'd need to recreate table or partitions . See https://questdb.io/docs/guides/modifying-data/ .
h

Hristo Katsarski

04/21/2023, 7:33 AM
Hmm, can't make that SQL work...it produces errors. I was thinking of physical deletions (during maintenance hours), logical mark rows as deleted is also a very good option! But still can't make that SQL work, would you be able to check it again?
a

Alex Pelagenko

04/21/2023, 7:59 AM
what’s the error?
m

Miguel Arregui

04/21/2023, 8:04 AM
SELECT code, ddate FROM 
(
  SELECT code, ddate, count(ddate) as cnt 
  FROM p 
  GROUP BY code, ddate 
) WHERE cnt > 1;



create table p(code symbol, ddate timestamp, ts timestamp) timestamp(ts) partition by day;
insert into p values('A', '2023-04-21T10:00:00.000001Z', '2023-04-21T10:00:00.000001Z');
insert into p values('A', '2023-04-21T10:00:01.000001Z', '2023-04-21T10:00:01.000001Z');
insert into p values('A', '2023-04-21T10:00:01.000001Z', '2023-04-21T10:00:01.000001Z');
insert into p values('B', '2023-04-21T10:00:01.000002Z', '2023-04-21T10:00:01.000002Z');
insert into p values('C', '2023-04-21T10:00:02.000002Z', '2023-04-21T10:00:02.000002Z');
insert into p values('A', '2023-04-21T10:00:02.000003Z', '2023-04-21T10:00:03.000002Z');
missing FROM