Create DataFrame from multiple lists? - python

I have two lists:
list1 = ['a', 'b', 'c']
list2 = [1, 2]
I want my dataframe output to look like:
col1 col2
a 1
a 2
b 1
b 2
c 1
c 2
How can this be done?

Use itertools.product:
import itertools
import pandas as pd

list1 = ['a', 'b', 'c']
list2 = [1, 2]

# product yields every (col1, col2) pairing in order
df = pd.DataFrame(itertools.product(list1, list2), columns=['col1', 'col2'])
print(df)
Output:
col1 col2
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2

If you don't want to explicitly import itertools, pd.MultiIndex has a from_product method that you might piggyback on:
list1 = ['a','b','c']
list2 = [1, 2]
pd.DataFrame(pd.MultiIndex.from_product((list1, list2)).to_list(), columns=['col1', 'col2'])
col1 col2
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c 2
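A closely related variant (a sketch, not from the answers above): MultiIndex.from_product can also build the frame directly via to_frame, naming the columns in the same step.
import pandas as pd

list1 = ['a', 'b', 'c']
list2 = [1, 2]

# name the index levels, then materialise the product as a regular frame
df = pd.MultiIndex.from_product([list1, list2], names=['col1', 'col2']).to_frame(index=False)
print(df)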

Related

Pandas: Split dataframe with duplicate values into dataframe with unique values

I have a dataframe in Pandas with duplicate values in Col1:
Col1
a
a
b
a
a
b
What I want to do is split this df into separate DataFrames, each containing only unique Col1 values.
DF1:
Col1
a
b
DF2:
Col1
a
b
DF3:
Col1
a
DF4:
Col1
a
Any suggestions?
I don't think you can achieve this in a fully vectorized way.
One possibility is to use a custom function to iterate the items and keep track of the unique ones. Then use this to split with groupby:
def cum_uniq(s):
    # running group id: start a new group whenever a value repeats
    i = 0
    seen = set()
    out = []
    for x in s:
        if x in seen:
            # value already seen in the current group: open a new group
            i += 1
            seen = set()
        out.append(i)
        seen.add(x)
    return pd.Series(out, index=s.index)
out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]
output:
[ Col1
0 a,
Col1
1 a
2 b,
Col1
3 a,
Col1
4 a
5 b]
intermediate:
cum_uniq(df['Col1'])
0 0
1 1
2 1
3 2
4 3
5 3
dtype: int64
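For reference, the single-column frame the snippets above operate on can be rebuilt from the question (a minimal sketch):
import pandas as pd

# reconstruction of the question's Col1 data
df = pd.DataFrame({'Col1': ['a', 'a', 'b', 'a', 'a', 'b']})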
If order doesn't matter
Let's add a Col2 to the example:
Col1 Col2
0 a 0
1 a 1
2 b 2
3 a 3
4 a 4
5 b 5
the previous code gives:
[ Col1 Col2
0 a 0,
Col1 Col2
1 a 1
2 b 2,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4
5 b 5]
If order does not matter, you can vectorize it:
out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]
output:
[ Col1 Col2
0 a 0
2 b 2,
Col1 Col2
1 a 1
5 b 5,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4]
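To see why the vectorized version groups the rows this way, here is a small sketch (assuming the Col2 example above) that prints the cumcount labels groupby splits on:
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a', 'b', 'a', 'a', 'b'],
                   'Col2': range(6)})

# cumcount numbers each row within its Col1 group;
# rows sharing a number end up in the same output frame
print(df.groupby('Col1').cumcount().tolist())   # [0, 1, 0, 2, 3, 1]
out = [g for _, g in df.groupby(df.groupby('Col1').cumcount())]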

pandas drop last group element

I have a DataFrame df = pd.DataFrame({'col1': ["a","b","c","d","e", "f","g","h"], 'col2': [1,1,1,2,2,3,3,3]}) that looks like
Input:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
I want to drop the last row of each group based off of grouping "col2", which would look like...
Expected Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
I wrote df.groupby('col2').tail(1), which gets me the rows I want to delete, but when I try df.drop(df.groupby('col2').tail(1)) I get an axis error. What would be a solution to this?
Looks like duplicated would work:
df[df.duplicated('col2', keep='last') |
   (~df.duplicated('col2', keep=False))  # this is to keep all single-row groups
   ]
Or with your approach, you should drop the index:
# this would also drop all single-row groups
df.drop(df.groupby('col2').tail(1).index)
Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
try this:
df.groupby('col2', as_index=False).apply(lambda x: x.iloc[:-1,:]).reset_index(drop=True)
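Another vectorized option, not from the answers above but sketched here for completeness: GroupBy.cumcount(ascending=False) assigns 0 to the last row of each group, so filtering on it drops exactly those rows (single-row groups disappear too, just like the drop-index approach).
# keep every row except the last one of each 'col2' group
out = df[df.groupby('col2').cumcount(ascending=False) > 0]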

Create new Dataframe from matching two dataframe indexes

I'm looking to create a new dataframe from data in two separate dataframes - effectively matching the index of each cell and putting the values into a two-column dataframe. My real datasets have the exact same number of rows and columns, FWIW. Example below:
DF1:
Col1 Col2 Col3
1 2 3
3 8 7
DF2:
Col1 Col2 Col3
A B E
R S W
Desired Dataframe:
Col1 Col2
1 A
2 B
3 E
3 R
8 S
7 W
Thank you for your help!
Here is your code:
import pandas as pd

# ravel row by row (C order) so the result follows the desired row order
df3 = pd.Series(df1.values.ravel())
df4 = pd.Series(df2.values.ravel())
df = pd.concat([df3, df4], axis=1)
Use DataFrame.to_numpy and .flatten:
df = pd.DataFrame(
    {'Col1': df1.to_numpy().flatten(), 'Col2': df2.to_numpy().flatten()})
# print(df)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
You can do it easily like so:
# flatten each frame row by row into plain lists
list1 = df1.values.tolist()
list1 = [item for sublist in list1 for item in sublist]
list2 = df2.values.tolist()
list2 = [item for sublist in list2 for item in sublist]
df = {
    'Col1': list1,
    'Col2': list2
}
df = pd.DataFrame(df)
print(df)
Hope this helps :)
(pd.concat(map(lambda x: x.unstack().sort_index(level=-1), (df1, df2)), axis=1)
   .reset_index(drop=True)
   .rename(columns=['Col1', 'Col2'].__getitem__))
Result:
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
Another way (alternative):
pd.concat((df1.stack(),df2.stack()),axis=1).add_prefix('Col').reset_index(drop=True)
or:
d = {'Col1':df1,'Col2':df2}
pd.concat((v.stack() for k,v in d.items()),axis=1,keys=d.keys()).reset_index(drop=True)
#or pd.concat((d.values()),keys=d.keys()).stack().unstack(0).reset_index(drop=True)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
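For a self-contained run, the two frames from the question can be rebuilt like this (a sketch) and fed to any of the snippets above:
import pandas as pd

df1 = pd.DataFrame({'Col1': [1, 3], 'Col2': [2, 8], 'Col3': [3, 7]})
df2 = pd.DataFrame({'Col1': ['A', 'R'], 'Col2': ['B', 'S'], 'Col3': ['E', 'W']})

# row-wise flattening lines up 1/A, 2/B, 3/E, 3/R, 8/S, 7/W
df = pd.DataFrame({'Col1': df1.to_numpy().flatten(),
                   'Col2': df2.to_numpy().flatten()})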

Create expanded/permuted dataframe from several lists

I'm very open to changing the title of the question if there's a clearer way to ask this.
I want to convert several lists into repeated columns of a dataframe. Somehow, between itertools and np.tile, I wasn't able to get the behavior I wanted.
Input:
list_1 = [1, 2]
list_2 = ['a', 'b']
list_3 = ['A', 'B']
Output:
col1 col2 col3
1 a A
1 a B
1 b A
1 b B
2 a A
2 a B
2 b A
2 b B
itertools.product is I think what you're looking for:
>>> pd.DataFrame(itertools.product(list_1, list_2, list_3))
0 1 2
0 1 a A
1 1 a B
2 1 b A
3 1 b B
4 2 a A
5 2 a B
6 2 b A
7 2 b B
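To also get the col1/col2/col3 headers from the question, the same call just takes a columns argument (a small sketch of the same approach):
import itertools
import pandas as pd

list_1 = [1, 2]
list_2 = ['a', 'b']
list_3 = ['A', 'B']

df = pd.DataFrame(itertools.product(list_1, list_2, list_3),
                  columns=['col1', 'col2', 'col3'])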
Not sure how efficient this would be with very large lists, but it is a possible approach to your problem.
import pandas as pd

list_1 = [1, 2]
list_2 = ['a', 'b']
list_3 = ['A', 'B']

indices = []
values = []
for i in list_1:
    for m in list_2:
        for n in list_3:
            indices.append(i)
            values.append([m, n])

df = pd.DataFrame(data=values, index=indices)
print(df)
Output:
0 1
1 a A
1 a B
1 b A
1 b B
2 a A
2 a B
2 b A
2 b B

Creating new DataFrame from the cartesian product of 2 lists

What I want to achieve is the following in Pandas:
a = [1,2,3,4]
b = ['a', 'b']
Can I create a DataFrame like:
column1 column2
'a' 1
'a' 2
'a' 3
'a' 4
'b' 1
'b' 2
'b' 3
'b' 4
Use itertools.product with the DataFrame constructor:
a = [1, 2, 3, 4]
b = ['a', 'b']
from itertools import product
# pandas 0.24.0+
df = pd.DataFrame(product(b, a), columns=['column1', 'column2'])
# pandas below
# df = pd.DataFrame(list(product(b, a)), columns=['column1', 'column2'])
print (df)
column1 column2
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
I will put another method here, just in case someone prefers it.
Full mockup below:
import pandas as pd
a = [1,2,3,4]
b = ['a', 'b']
df=pd.DataFrame([(y, x) for x in a for y in b], columns=['column1','column2'])
df
result below:
column1 column2
0 a 1
1 b 1
2 a 2
3 b 2
4 a 3
5 b 3
6 a 4
7 b 4
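Not from the answers above, but worth a sketch: since pandas 1.2 a cartesian product can also be written as a cross merge, which avoids itertools entirely (assuming the lists are wrapped in single-column frames first).
import pandas as pd

a = [1, 2, 3, 4]
b = ['a', 'b']

# how='cross' pairs every row of the left frame with every row of the right one
df = pd.DataFrame({'column1': b}).merge(pd.DataFrame({'column2': a}), how='cross')
print(df)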
