Get all combinations of elements from two lists? - python

If I have two lists
l1 = ['A', 'B']
l2 = [1, 2]
what is the most elegant way to get a pandas DataFrame that looks like:
+-----+-----+-----+
| | l1 | l2 |
+-----+-----+-----+
| 0 | A | 1 |
+-----+-----+-----+
| 1 | A | 2 |
+-----+-----+-----+
| 2 | B | 1 |
+-----+-----+-----+
| 3 | B | 2 |
+-----+-----+-----+
Note, the first column is the index.

Use product from itertools:
>>> from itertools import product
>>> import pandas as pd
>>> pd.DataFrame(list(product(l1, l2)), columns=['l1', 'l2'])
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2

As an alternative you can use pandas' internal cartesian_product utility (it may be more useful with large NumPy arrays):
In [11]: lp1, lp2 = pd.core.reshape.util.cartesian_product([l1, l2])
In [12]: pd.DataFrame(dict(l1=lp1, l2=lp2))
Out[12]:
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2
This is a little messy to read into a DataFrame with the correct orientation...
Note: in older pandas versions, cartesian_product was located at pd.tools.util.cartesian_product.

You can also use the sklearn library, which uses a NumPy-based approach:
from sklearn.utils.extmath import cartesian
df = pd.DataFrame(cartesian((l1, l2)), columns=['l1', 'l2'])
Note that cartesian returns a single NumPy array, so all values share one dtype; with a string list in the mix, the integers come back as strings here.
For more verbose but possibly more efficient variants see Numpy: cartesian product of x and y array points into single array of 2D points.
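As a quick taste of the NumPy route, a minimal meshgrid sketch (just one of the possible variants):
import numpy as np
import pandas as pd

l1, l2 = ['A', 'B'], [1, 2]

# indexing='ij' makes `a` vary over l1 and `b` vary over l2; ravel flattens the grids
a, b = np.meshgrid(l1, l2, indexing='ij')
df = pd.DataFrame({'l1': a.ravel(), 'l2': b.ravel()})
print(df)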

You can use merge with how='cross' (available since pandas 1.2):
df1 = pd.DataFrame(l1, columns=['l1'])
df2 = pd.DataFrame(l2, columns=['l2'])
df1.merge(df2, how='cross')
Output:
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2
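Another compact option worth knowing is pd.MultiIndex.from_product; a short sketch:
import pandas as pd

l1, l2 = ['A', 'B'], [1, 2]

# Build the cartesian product as a MultiIndex, then flatten it into columns
df = pd.MultiIndex.from_product([l1, l2], names=['l1', 'l2']).to_frame(index=False)
print(df)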

Related

Compare two columns in pandas

I have a dataframe like this
|    A    | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 |   |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 |   |
If the value of B appears among the comma-separated values inside A, then C should be 1; if it is not present in A, it should be 0. The expected output is:
|    A    | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 | 1 |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 | 0 |
I want to do this for every row. How do I write this in Python?
If values in A are strings use:
print (df.A.tolist())
["['1']", "['1,2']", "['2']", "['1,3']"]
df['C'] = [int(str(b) in a.strip("[]'").split(',')) for a, b in zip(df.A, df.B)]
print (df)
A B C
0 ['1'] 1 1
1 ['1,2'] 2 1
2 ['2'] 3 0
3 ['1,3'] 2 0
Or if values are one element lists use:
print (df.A.tolist())
[['1'], ['1,2'], ['2'], ['1,3']]
df['C'] = [int(str(b) in a[0].split(',')) for a, b in zip(df.A, df.B)]
print (df)
         A  B  C
0    ['1']  1  1
1  ['1,2']  2  1
2    ['2']  3  0
3  ['1,3']  2  0
My code:
df = pd.read_clipboard()
df
'''
A B
0 ['1'] 1
1 ['1,2'] 2
2 ['2'] 3
3 ['1,3'] 2
'''
(
    df.assign(A=df.A.str.replace("'", '').map(eval))  # "['1,2']" -> [1, 2]
      .assign(C=lambda d: d.apply(lambda s: s.B in s.A, axis=1))
      .assign(C=lambda d: d.C.astype(int))
)
'''
A B C
0 [1] 1 1
1 [1, 2] 2 1
2 [2] 3 0
3 [1, 3] 2 0
'''
import numpy as np
df['C'] = np.where([str(b) in a for a, b in zip(df.A, df.B)], 1, 0)
Basically you need to convert column B to a string, since column A holds strings, and then look for B's value inside A row by row (isin would only test exact equality against whole cells like "['1']", so a per-row substring test is needed). Note that substring matching can misfire on multi-digit values (B=1 would match '11'), so the split-based approach above is more robust.
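If you would rather not call eval on data, ast.literal_eval from the standard library is the safer way to parse the stringified lists; a sketch under that assumption:
import ast
import pandas as pd

df = pd.DataFrame({'A': ["['1']", "['1,2']", "['2']", "['1,3']"],
                   'B': [1, 2, 3, 2]})

# Parse each cell into its comma-separated tokens, then test membership
tokens = df['A'].map(lambda s: ast.literal_eval(s)[0].split(','))
df['C'] = [int(str(b) in t) for t, b in zip(tokens, df['B'])]
print(df)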

How to select multiple columns in pandas dataframe after doing some operation on one column?

Say I have a dataframe with 7 columns. I'm only interested in columns A and B. Column B contains numerical values.
What I want to do is select only columns A and B, after doing some mathematical operation f on B. The sql equivalent of what I'm saying is:
SELECT A, f(B)
FROM df;
I know that I can select just columns A and B by doing df[['A', 'B']]. Also, I can just add another column f_B saying: df['f_B'] = f(df['B']), and then select df[['A', 'f_B']].
However, is there a way to do it without adding an extra column? What if f is as simple as dividing by 100?
EDIT: I do not want to use pandasql
EDIT2: Sharing sample input and expected output:
Input:
A | B | C | D
--------------
a | 1 | c | d
b | 2 | c | d
c | 3 | c | d
d | 4 | c | d
Expected output (only columns A and B required), assuming f is multiply by 2:
A | B
-----
a | 2
b | 4
c | 6
d | 8
First, take only the columns you need:
df = df[['A', 'B']]      # replace the original df with a smaller one
new_df = df[['A', 'B']]  # or keep the original and bind the selection to a new name
You can simply do:
df.B = df.B / 10
Using lambda:
df.B = df.B.apply(lambda value: value / 10)
For more complicated cases:
def f(value):
    # some logic
    result = value ** 2
    return result
df.B = df.B.apply(f)
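If you want the SQL-style SELECT A, f(B) as a single expression that leaves the original frame untouched, assign reads naturally; a minimal sketch with f as multiply-by-2:
import pandas as pd

df = pd.DataFrame({'A': list('abcd'), 'B': [1, 2, 3, 4],
                   'C': ['c'] * 4, 'D': ['d'] * 4})

# Select A and B, transforming B on the fly; df itself is not modified
out = df[['A', 'B']].assign(B=lambda d: d['B'] * 2)
print(out)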

How to get the aggregate result only on the first occurrence, and 0 to the other, in a computation with groupby and transform?

I would like to group and sum the dataframe without changing the number of rows, writing the aggregated result onto the first occurrence of each group only.
Initial DF:
C1 | Val
a | 1
a | 1
b | 1
c | 1
c | 1
Wanted DF:
C1 | Val
a | 2
a | 0
b | 1
c | 2
c | 0
I tried to apply the following code:
df.groupby(['C1'])['Val'].transform('sum')
which propagates the aggregated result to all rows. However, transform does not seem to have an argument that applies the result only to the first or last occurrence.
Indeed, what I currently get is:
C1 | Val
a | 2
a | 2
b | 1
c | 2
c | 2
Using pandas.DataFrame.groupby:
s = df.groupby('C1')['Val']
v = s.sum().values                   # group sums, ordered by group key
df.loc[:, 'Val'] = 0
df.loc[s.head(1).index, 'Val'] = v   # head(1).index marks each group's first row
print(df)
print(df)
Output:
C1 Val
0 a 2
1 a 0
2 b 1
3 c 2
4 c 0
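For completeness, a one-expression variant built from transform plus duplicated (a sketch, equivalent in spirit to the answer above):
import pandas as pd

df = pd.DataFrame({'C1': list('aabcc'), 'Val': [1, 1, 1, 1, 1]})

# Propagate the group sum, then zero out every row that is not
# the first occurrence of its group
df['Val'] = (df.groupby('C1')['Val'].transform('sum')
               .where(~df['C1'].duplicated(), 0))
print(df)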

Pandas - select column using other column value as column name

I have a dataframe that contains a column, let's call it "names", which holds the names of other columns. I would like to add a new column that, for each row, takes the value from the column whose name is given in "names".
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})
a | b | names |
--- | --- | ---- |
1 | -1 | 'a' |
2 | -2 | 'b' |
3 | -3 | 'a' |
4 | -4 | 'b' |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})
a | b | names | new_col |
--- | --- | ---- | ------ |
1 | -1 | 'a' | 1 |
2 | -2 | 'b' | -2 |
3 | -3 | 'a' | 3 |
4 | -4 | 'b' | -4 |
You can use lookup:
df['new_col'] = df.lookup(df.index, df.names)
df
# a b names new_col
#0 1 -1 a 1
#1 2 -2 b -2
#2 3 -3 a 3
#3 4 -4 b -4
EDIT
lookup has been deprecated; here's the currently recommended replacement:
import numpy as np
idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:
df['new_col'] = df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False).query('names == variable').loc[df.index, 'value']
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Looking up values by index/column labels (archive)
The pd.factorize solution shown in the edit above comes from https://github.com/pandas-dev/pandas/issues/39171#issuecomment-773477244.
With the straightforward and easy solution (lookup) deprecated, another alternative to the pandas-based ones proposed here is to convert df into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)
Here's a short solution using df.melt and df.merge:
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])
Outputs:
key_0 a b names value
0 0 1 -1 a 1
1 1 2 -2 b -2
2 2 3 -3 a 3
3 3 4 -4 b -4
There's a redundant key_0 column (the unnamed index join key) which you need to drop, e.g. with df.drop(columns='key_0').
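To tie the deprecation discussion together, here is the factorize replacement run end to end on the sample frame (a self-contained sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [-1, -2, -3, -4],
                   'names': ['a', 'b', 'a', 'b']})

# factorize maps each name to a column position; fancy indexing then picks
# one value per row from the corresponding column
idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)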

Rolling up data frame along with count of rows in python

I am still learning Python and want to know how to roll up the data frame, counting duplicate rows in a new column called Count.
The data frame structure is as follows
Col1| Value
A | 1
B | 1
A | 1
B | 1
C | 3
C | 3
C | 3
C | 3
My result should be as follows
Col1|Value|Count
A | 1 | 2
B | 1 | 2
C | 3 | 4
>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
Col1 Value Count
0 A 1 2
1 B 1 2
2 C 3 4
Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
Col1 Value Count
0 a 1 2
1 a 2 1
2 b 3 1
3 c 2 1
You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')
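If you are on pandas 1.1 or newer, DataFrame.value_counts counts duplicate rows directly; a short sketch:
import pandas as pd

df = pd.DataFrame({'Col1': list('ABABCCCC'),
                   'Value': [1, 1, 1, 1, 3, 3, 3, 3]})

# value_counts on the whole frame counts unique (Col1, Value) rows
out = df.value_counts().reset_index(name='Count').sort_values('Col1', ignore_index=True)
print(out)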
