Thorny SQL Set Operation

Software

UPDATE 2: I posted a full script that contains both working solutions in a related post.

UPDATE: Hans developed a query that works. See the comments. It works even if my group is (Female + French), but my Person is (Female + French + Politician) – e.g., extra categories not part of the group are ignored.

Thanks for all the help!

********************************

I’ve been working for a bit on a (possibly) hard SQL query and thought I would write out my problem here, in the hopes that (a) I’ll solve it, and (b) I’ll make the solution available to others via the magic of Teh Google.

Here’s the problem: I have a concept of “category”. Example categories: Man, Woman, Movie Star, Politician, American, French.

I have a concept of “person”. Persons can belong to one or more categories.

I have a concept of “group”. A Group is merely a listing of one or more categories. For example, “Male Movie Stars” is a Group (Man + Movie Star). “Female French Politicians” is a group (Woman + Politician + French).

What I want to do in my query is this: Given a Group, find all Persons who belong to that Group.

pseudo-code query:

select * from Person p
join PersonCategory pc on (p.personid = pc.personid)
join GroupCategory gc ON (pc.categoryid = gc.categoryid)
where gc.groupid = @groupid

However, normal JOIN syntax using the IN operator does an “OR”-type operation, when what I really want is an “AND”-type operation. Using the “Female French Politicians” example from above, this query will return French OR Woman OR Politician. Not what I want.

Thinking…

Share and Enjoy:
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google Bookmarks
  • Furl
  • Ma.gnolia
  • Reddit
  • TwitThis
No Comments

9 Comments

  1. Tom Music  •  May 28, 2008 @7:41 am

    No one-query solutions come to mind for me. At least nothing that isn’t really a two-query solution mashed together into a single one.

    The way I would do this would be to run one real fast query that gets the number of categories contained in the specific group, and call this $category_count.

    Then I would use a second query that returned person_ids and the number of category_id matches from that specific group for each person_id. At the end of the query I would stick on a HAVING statement to limit the query results to only show people who matched exactly $category_count number of categories.

    I look forward to seeing what other approaches people recommend!

  2. Hans  •  May 28, 2008 @8:23 am

    For the AND, I think you’ll need to self join PersonCategory. Not sure how to do it in a single query though. If you know the category_ids for the group in advance, you could use something like:

    SELECT p.*
    FROM Person p, PersonCategory pc1, PersonCategory pc2
    WHERE pc1.person_id = pc2.person_id
    AND pc1.person_id = p.id
    AND pc1.category_id = @category_id_1
    AND pc2.category_id = @category_id_2;

    For each additional category, you’d have to add another self join. Pretty ugly.

  3. Hans  •  May 28, 2008 @9:00 am

    Something like this might also do the trick:

    select * from Person p
    join PersonCategory pc on (p.personid = pc.personid)
    where pc.personid in (
    select pc.personid from PersonCategory pc
    where pc.categoryid in (
    select cg.categoryid from GroupCategory gc
    where gc.groupid = @groupid
    )
    group by pc.personid
    having count(*) = (
    select count(*) from GroupCategory gc
    where gc.groupid = @groupid
    )
    and pc.categoryid in (
    select cg.categoryid from GroupCategory gc
    where gc.groupid = @groupid
    );

  4. Brian Dorsey  •  May 28, 2008 @9:24 am

    Another approach using ‘group by’ and ‘having’:
    http://pastebin.com/f69dd5434

    The core query is:

    – Female, French, Politicians
    select p.id, p.full_name, count(*) from person p
    join person_category pc on pc.person_id = p.id
    join category c on c.id = pc.category_id
    where c.id in (2,6,4) — Woman, French, Politician
    group by p.id, p.full_name
    having count(*) = 3 — the number of categories
    order by p.id

  5. Anthony Stevens  •  May 28, 2008 @9:28 am

    @Hans: thanks! This works. I had just implemented a bitmasking solution that relied on a scalar function IsMember(personid, groupid), but this looks more direct.

  6. Brian Dorsey  •  May 28, 2008 @9:28 am

    Re-reading Tom’s comment, my solution is the same. (I didn’t get the group info dynamically, but that could be done with sub-selects.)

  7. Anthony Stevens  •  May 28, 2008 @9:39 am

    @tom music, @brian dorsey: Thanks for your help. The problem with the COUNT method is that extra categories on the person that are not part of the group will throw it off. (I think). So if you have

    Person = (Female + French + Politician)

    and

    Group = (Female + Politician)

    Then the COUNT method will exclude that person because 3 != 2.

  8. Tom Music  •  May 28, 2008 @10:17 am

    @Anthony: There shouldn’t be the “3 != 2″ situation (due to extra categories in the Person but not the Group) because the only Categories that are counted are the ones the Person has that are part of the specific Group being queried.

    Since we’d only count when P.CategoryID = G.CategoryID…

    Person = (Female + French + Politician)
    - and -
    Group = (Female + Politician)

    would only result in a count of 2, since there are only two in common. This would then be compared via HAVING to the count of 2 that comes from the outside statement (counting the CategoryIDs in the specific Group), and voila!

  9. Anthony Stevens  •  May 28, 2008 @12:18 pm

    @Tom: you’re absolutely correct. I wrote out the version of the query that you and Brian pointed to and posted a full solution at http://xidey.wordpress.com/2008/05/28/comparing-sql-queries-that-solve-my-thorny-sql-set-operation-problem/

2 Trackbacks

Leave a Reply

Allowed tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">