Performant cartesian product (CROSS JOIN) with pandas

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

edited Dec 10 '18 at 9:04

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 '18 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 '18 at 2:53

add a comment |

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

edited Dec 10 '18 at 9:04

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 '18 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 '18 at 2:53

add a comment |

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

edited Dec 10 '18 at 9:04

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})

right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})



left



  col1  col2

0    A     1

1    B     2

2    C     3



right



  col1  col2

0    X    20

1    Y    30

2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20

A       1      Y      30

A       1      Z      50

B       2      X      20

B       2      Y      30

B       2      Z      50

C       3      X      20

C       3      Y      30

C       3      Z      50

What is the most performant method of computing this result?

python pandas numpy dataframe merge

edited Dec 10 '18 at 9:04

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

edited Dec 10 '18 at 9:04

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

edited Dec 10 '18 at 9:04

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

asked Dec 10 '18 at 3:12

coldspeed

120k19119195

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 '18 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 '18 at 2:53

add a comment |

1

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 '18 at 20:58

1

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 '18 at 2:53

Would you like share your input in Github as well , I think adding cross join in pandas is really good to match all the join function in SQL . github.com/pandas-dev/pandas/issues/5401
– W-B
Dec 11 '18 at 20:58

@W-B Thanks! That thread seems interesting, let me see if there's something I can contribute to it.
– coldspeed
Dec 12 '18 at 2:53

add a comment |

3 Answers
3

active

oldest

votes

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited Dec 12 '18 at 14:58

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

add a comment |

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered Dec 10 '18 at 3:41

W-B

101k73163

add a comment |

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),

       pd.concat([right]*len(left)).reset_index(drop=True) ], 1)



    col1  col2 col1  col2

0     A     1    X    20

1     A     1    Y    30

2     A     1    Z    50

3     B     2    X    20

4     B     2    Y    30

5     B     2    Z    50

6     C     3    X    20

7     C     3    Y    30

8     C     3    Z    50

enter image description here

edited Dec 10 '18 at 14:00

answered Dec 10 '18 at 13:39

Dark

21.1k31946

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53699012%2fperformant-cartesian-product-cross-join-with-pandas%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited Dec 12 '18 at 14:58

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

add a comment |

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited Dec 12 '18 at 14:58

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

add a comment |

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited Dec 12 '18 at 14:58

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

def cartesian_product_basic(left, right):

    return (

       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))



cartesian_product_basic(left, right)



  col1_x  col2_x col1_y  col2_y

0      A       1      X      20

1      A       1      Y      30

2      A       1      Z      50

3      B       2      X      20

4      B       2      Y      30

5      B       2      Z      50

6      C       3      X      20

7      C       3      Y      30

8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

def cartesian_product(*arrays):

    la = len(arrays)

    dtype = np.result_type(*arrays)

    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)

    for i, a in enumerate(np.ix_(*arrays)):

        arr[...,i] = a

    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

def cartesian_product_generalized(left, right):

    la, lb = len(left), len(right)

    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])

    return pd.DataFrame(

        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))



cartesian_product_generalized(left, right)



   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left, right))

True

And, along similar lines,

left2 = left.copy()

left2.index = ['s1', 's2', 's1']



right2 = right.copy()

right2.index = ['x', 'y', 'y']





left2

   col1  col2

s1    A     1

s2    B     2

s1    C     3



right2

  col1  col2

x    X    20

y    Y    30

y    Z    50



np.array_equal(cartesian_product_generalized(left, right),

               cartesian_product_basic(left2, right2))

True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):

    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])

    return pd.DataFrame(

        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))



cartesian_product_multi(*[left, right, left]).head()



   0  1  2   3  4  5

0  A  1  X  20  A  1

1  A  1  X  20  B  2

2  A  1  X  20  C  3

3  A  1  X  20  D  4

4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):

    la, lb = len(left), len(right)

    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])



    return pd.DataFrame(

        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))



np.array_equal(cartesian_product_simplified(left, right),

               cartesian_product_basic(left2, right2))

True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

enter image description here

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Functions from Other Answers

# Wen's answer: https://stackoverflow.com/a/53699198/4909087

# I've put my own spin on this to make it as fast as possible.

def cartesian_product_itertools(left, right):

    return pd.DataFrame([

           [*x, *y] for x, y in itertools.product(

               left.values.tolist(), right.values.tolist())])

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit

import pandas as pd

import matplotlib.pyplot as plt



res = pd.DataFrame(

       index=['cartesian_product_basic', 'cartesian_product_generalized', 

              'cartesian_product_multi', 'cartesian_product_simplified',

              'cartesian_product_itertools'],

       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],

       dtype=float

)



for f in res.index: 

    for c in res.columns:

        # print(f,c)

        if f in {'cartesian_product_itertools'} and c > 600:

            continue

        left2 = pd.concat([left] * c, ignore_index=True)

        right2 = pd.concat([right] * c, ignore_index=True)

        stmt = '{}(left2, right2)'.format(f)

        setp = 'from __main__ import left2, right2, {}'.format(f)

        res.at[f, c] = timeit(stmt, setp, number=5)



ax = res.div(res.min()).T.plot(loglog=True) 

ax.set_xlabel("N"); 

ax.set_ylabel("time (relative)");



plt.show()

edited Dec 12 '18 at 14:58

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

edited Dec 12 '18 at 14:58

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

answered Dec 10 '18 at 3:12

coldspeed

120k19119195

add a comment |

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered Dec 10 '18 at 3:41

W-B

101k73163

add a comment |

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered Dec 10 '18 at 3:41

W-B

101k73163

add a comment |

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered Dec 10 '18 at 3:41

W-B

101k73163

Using itertools product and recreate the value in dataframe

import itertools

l=list(itertools.product(left.values.tolist(),right.values.tolist()))

pd.DataFrame(list(map(lambda x : sum(x,),l)))

   0  1  2   3

0  A  1  X  20

1  A  1  Y  30

2  A  1  Z  50

3  B  2  X  20

4  B  2  Y  30

5  B  2  Z  50

6  C  3  X  20

7  C  3  Y  30

8  C  3  Z  50

answered Dec 10 '18 at 3:41

W-B

101k73163

answered Dec 10 '18 at 3:41

W-B

101k73163

answered Dec 10 '18 at 3:41

W-B

101k73163

answered Dec 10 '18 at 3:41

W-B

101k73163

add a comment |

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),

       pd.concat([right]*len(left)).reset_index(drop=True) ], 1)



    col1  col2 col1  col2

0     A     1    X    20

1     A     1    Y    30

2     A     1    Z    50

3     B     2    X    20

4     B     2    Y    30

5     B     2    Z    50

6     C     3    X    20

7     C     3    Y    30

8     C     3    Z    50

enter image description here

edited Dec 10 '18 at 14:00

answered Dec 10 '18 at 13:39

Dark

21.1k31946

add a comment |

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),

       pd.concat([right]*len(left)).reset_index(drop=True) ], 1)



    col1  col2 col1  col2

0     A     1    X    20

1     A     1    Y    30

2     A     1    Z    50

3     B     2    X    20

4     B     2    Y    30

5     B     2    Z    50

6     C     3    X    20

7     C     3    Y    30

8     C     3    Z    50

enter image description here

edited Dec 10 '18 at 14:00

answered Dec 10 '18 at 13:39

Dark

21.1k31946

add a comment |

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),

       pd.concat([right]*len(left)).reset_index(drop=True) ], 1)



    col1  col2 col1  col2

0     A     1    X    20

1     A     1    Y    30

2     A     1    Z    50

3     B     2    X    20

4     B     2    Y    30

5     B     2    Z    50

6     C     3    X    20

7     C     3    Y    30

8     C     3    Z    50

enter image description here

edited Dec 10 '18 at 14:00

answered Dec 10 '18 at 13:39

Dark

21.1k31946

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),

       pd.concat([right]*len(left)).reset_index(drop=True) ], 1)



    col1  col2 col1  col2

0     A     1    X    20

1     A     1    Y    30

2     A     1    Z    50

3     B     2    X    20

4     B     2    Y    30

5     B     2    Z    50

6     C     3    X    20

7     C     3    Y    30

8     C     3    Z    50

enter image description here

edited Dec 10 '18 at 14:00

answered Dec 10 '18 at 13:39

Dark

21.1k31946

edited Dec 10 '18 at 14:00

answered Dec 10 '18 at 13:39

Dark

21.1k31946

answered Dec 10 '18 at 13:39

Dark

21.1k31946

answered Dec 10 '18 at 13:39

Dark

21.1k31946

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Xrfgtjtk