Take multiple lists into dataframe - python

How do I take multiple lists and put them as different columns in a Python dataframe? I tried a couple of approaches but ran into trouble.
Attempt 1:
Create three lists, zip them together, and use the result: res = zip(lst1, lst2, lst3)
This yields just one column.
Attempt 2:
percentile_list = pd.DataFrame({'lst1Tite': [lst1],
                                'lst2Tite': [lst2],
                                'lst3Tite': [lst3]},
                               columns=['lst1Tite', 'lst1Tite', 'lst1Tite'])
This yields either one row by three columns (the way above) or, if I transpose it, three rows and one column.
How do I get a 100-row (the length of each independent list) by 3-column (three lists) pandas dataframe?

I think you're almost there; try removing the extra square brackets around the lists. (Also, you don't need to specify the column names when you're creating a dataframe from a dict like this.)
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
    {'lst1Title': lst1,
     'lst2Title': lst2,
     'lst3Title': lst3
    })
percentile_list
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
...
If you need a more performant solution, you can use np.column_stack rather than zip as in your first attempt. This has around a 2x speedup on the example here, though it comes at a bit of a cost in readability, in my opinion:
import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
                               columns=['lst1Title', 'lst2Title', 'lst3Title'])
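If you want to verify the 2x claim on your own machine, here is a minimal timing sketch (the exact numbers will vary with data size, hardware, and library versions):
import timeit

setup = '''
import pandas as pd
import numpy as np
lst1 = list(range(100))
lst2 = list(range(100))
lst3 = list(range(100))
'''

# Time the zip-based and the column_stack-based constructions
print(timeit.timeit("pd.DataFrame(list(zip(lst1, lst2, lst3)))",
                    setup=setup, number=1000))
print(timeit.timeit("pd.DataFrame(np.column_stack([lst1, lst2, lst3]))",
                    setup=setup, number=1000))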

Adding to Aditya Guru's answer here: there is no need to use map. You can do it simply by:
pd.DataFrame(list(zip(lst1, lst2, lst3)))
This will set the column names to 0, 1, 2. To set your own column names, you can pass the columns keyword argument to the call above.
pd.DataFrame(list(zip(lst1, lst2, lst3)),
             columns=['lst1_title', 'lst2_title', 'lst3_title'])

Adding one more scalable solution.
lists = [lst1, lst2, lst3, lst4]
df = pd.concat([pd.Series(x) for x in lists], axis=1)
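One reason this approach scales well: concat aligns the Series on the index, so lists of different lengths are tolerated and shorter columns are padded with NaN. A small sketch with made-up data:
import pandas as pd

# Series of unequal lengths are aligned on the index; shorter ones get NaN
lists = [[1, 2, 3], [4, 5], [6]]
df = pd.concat([pd.Series(x) for x in lists], axis=1)
print(df)
#    0    1    2
# 0  1  4.0  6.0
# 1  2  5.0  NaN
# 2  3  NaN  NaN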

There are several ways to create a dataframe from multiple lists.
list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]
pd.DataFrame({'list1': list1, 'list2': list2, 'list3': list3})
pd.DataFrame(data=list(zip(list1, list2, list3)), columns=['list1', 'list2', 'list3'])

Just adding that, using the first approach, it can be done as:
pd.DataFrame(list(map(list, zip(lst1,lst2,lst3))))

Adding to the above answers: we can also build the dataframe on the fly, one column at a time.
df= pd.DataFrame()
list1 = list(range(10))
list2 = list(range(10,20))
df['list1'] = list1
df['list2'] = list2
print(df)
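One caveat with this pattern: the first assignment fixes the index, so every subsequent list must have the same length, or pandas raises a ValueError. A minimal sketch of the failure mode:
import pandas as pd

df = pd.DataFrame()
df['list1'] = list(range(10))      # first column defines an index of length 10
# df['list2'] = list(range(5))     # would raise ValueError: length 5 does not
#                                  # match length of index 10
df['list2'] = list(range(10, 20))  # same length, so this is fine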
Hope it helps!

#oopsi used pd.concat() but didn't include the column names. You could do the following, which, unlike the first solution in the accepted answer, gives you explicit control over the column order (it avoids dicts, which were unordered before Python 3.7):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
s1 = pd.Series(lst1, name='lst1Title')
s2 = pd.Series(lst2, name='lst2Title')
s3 = pd.Series(lst3, name='lst3Title')
percentile_list = pd.concat([s1,s2,s3], axis=1)
percentile_list
Out[2]:
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
...

You can simply use the following code:
train_data['labels']= train_data[["LABEL1","LABEL1","LABEL2","LABEL3","LABEL4","LABEL5","LABEL6","LABEL7"]].values.tolist()
train_df = pd.DataFrame(train_data, columns=['text','labels'])
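The snippet above assumes a pre-existing train_data frame; here is a minimal, self-contained sketch of the same pattern with hypothetical data and only two label columns:
import pandas as pd

# Hypothetical stand-in for train_data
train_data = pd.DataFrame({
    'text': ['a', 'b'],
    'LABEL1': [0, 1],
    'LABEL2': [1, 0],
})

# Collapse the label columns into a single list-valued column
train_data['labels'] = train_data[['LABEL1', 'LABEL2']].values.tolist()
train_df = pd.DataFrame(train_data, columns=['text', 'labels'])
print(train_df)
#   text  labels
# 0    a  [0, 1]
# 1    b  [1, 0]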

I just did it like this (python 3.9):
import pandas as pd
my_dict = dict(x=x, y=y, z=z)  # Set column ordering here
my_df = pd.DataFrame.from_dict(my_dict)
This seems to be reasonably straightforward (albeit in 2022) unless I am missing something obvious...
In Python 2 one could have used a collections.OrderedDict().
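For completeness, a runnable version of that snippet, with example lists standing in for x, y, and z:
import pandas as pd

x, y, z = [1, 2], [3, 4], [5, 6]  # example data
my_dict = dict(x=x, y=y, z=z)     # insertion order fixes the column order (3.7+)
my_df = pd.DataFrame.from_dict(my_dict)
print(my_df)
#    x  y  z
# 0  1  3  5
# 1  2  4  6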

Related

Python: Replace two for loops with the fastest way to sum the elements

I have a list of 5 elements (in practice it could be 50,000). I want to sum all combinations of pairs from the same list and create a dataframe from the results, so I am writing the following code:
x = list(range(1, 6))
t = []
for i in x:
    for j in x:
        t.append((i, j, i + j))
df = pd.DataFrame(t)
The above code generates the correct results but takes very long to execute when the list has more elements. I'm looking for the fastest way to do the same thing.
Combinations can be obtained through the pandas.merge() method without explicit loops (how='cross' requires pandas >= 1.2):
x = np.arange(1, 5+1)
df = pd.DataFrame(x, columns=['x']).merge(pd.Series(x, name='y'), how='cross')
df['sum'] = df.x.add(df.y)
print(df)
x y sum
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 2 1 3
6 2 2 4
...
Option 2: with itertools.product()
import itertools
num = 5
df = pd.DataFrame(list(itertools.product(range(1, num + 1), range(1, num + 1))))
df['sum'] = df[0].add(df[1])
print(df)
A list comprehension can also make it faster: use t = [(i, j, i+j) for i in x for j in x] in place of the nested for loops, since a traditional for loop is slower than the equivalent list comprehension, and a nested loop even more so. Here's the updated code:
x = list(range(1, 6))
t = [(i, j, i + j) for i in x for j in x]
df = pd.DataFrame(t)
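If raw speed on large inputs is the main concern, a fully vectorised NumPy version (not among the original answers; offered here as a sketch) avoids the Python-level loop entirely by building all pairs with meshgrid:
import numpy as np
import pandas as pd

x = np.arange(1, 5 + 1)
i, j = np.meshgrid(x, x, indexing='ij')  # all (i, j) pairs as 2-D grids
df = pd.DataFrame({'i': i.ravel(),
                   'j': j.ravel(),
                   'sum': (i + j).ravel()})
print(df.head())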

How to find the number of unique values in comma-separated strings stored in a pandas data frame column?

x              Unique_in_x
5,5,6,7,8,6,8  4
5,9,8,0        4
5,9,8,0        4
3,2            2
5,5,6,7,8,6,8  4
Unique_in_x is my expected column. Sometimes the x column might be a string as well.
You can use a list comprehension with a set:
df['Unique_in_x'] = [len(set(x.split(','))) for x in df['x']]
Or use split with nunique:
df['Unique_in_x'] = df['x'].str.split(',', expand=True).nunique(axis=1)
Output:
x Unique_in_x
0 5,5,6,7,8,6,8 4
1 5,9,8,0 4
2 5,9,8,0 4
3 3,2 2
4 5,5,6,7,8,6,8 4
You can find the unique values with np.unique() and then just take the length:
import pandas as pd
import numpy as np
df['Unique_in_x'] = df['x'].apply(lambda x: len(np.unique(x.split(','))))

Pandas Vectorization with Function on Parts of Column

So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, as this concept is going to be expanded to a dataset with hundreds of thousands of entries, so speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
import numpy as np
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the expression df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
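For completeness, the same conditional update can be written with the pandas-native Series.where, which keeps each value where the condition holds and substitutes the second argument elsewhere; a small sketch:
# Keep df1[2] where it is <= 3; otherwise use df1[2] + 5
df1[3] = df1[2].where(df1[2] <= 3, df1[2] + 5)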

Drop few rows of a pandas dataframe using lambda

I'm currently facing a problem with method chaining when manipulating data frames in pandas. Here is the structure of my data:
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame(
    {'Frenquency': lst1,
     'lst2Tite': lst2,
     'lst3Tite': lst3
    })
The task is to get the entries (rows) whose frequency is less than 6, but it needs to be done with method chaining.
I know it is easy the traditional way; I could just do
df[df["Frenquency"]<6]
to get the answer.
However, the question is about how to do it with method chaining. I tried something like
df.drop(lambda x: x.index if x["Frequency"] < 6 else null)
but it raised the error "[<function <lambda> at 0x7faf529d3510>] not contained in axis".
Could anyone shed some light on this issue?
This is an old question, but I will answer it for future reference since there is no accepted answer.
df[df.apply(lambda x: True if x.Frenquency < 6 else False, axis=1)]
Explanation: the lambda checks each row's frequency and returns True or False accordingly; that Boolean series is then used by df to index only the True rows. Note that the column name Frenquency is a typo, but I kept it as-is since the question was written that way.
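Note that the lambda can be shortened, since the comparison already yields a boolean:
# Equivalent, without the redundant True/False branches
df[df.apply(lambda x: x.Frenquency < 6, axis=1)]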
Or drop the unwanted rows by selecting their labels from the index:
df.drop(df.index[df['Frenquency'] >= 6])
Or use inplace=True:
df.drop(df.index[df['Frenquency'] >= 6], inplace=True)
For this sort of selection, you can maintain a fluent interface and use method-chaining by using the query method:
>>> df.query('Frenquency < 6')
Frenquency lst2Tite lst3Tite
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
>>>
So something like:
df.rename(<something>).query('Frenquency < 6').assign(<something>)
Or more concretely:
>>> (df.rename(columns={'Frenquency':'F'})
... .query('F < 6')
... .assign(FF=lambda x: x.F**2))
F lst2Tite lst3Tite FF
0 0 0 0 0
1 1 1 1 1
2 2 2 2 4
3 3 3 3 9
4 4 4 4 16
5 5 5 5 25
I feel this post did not have an answer that addressed the spirit of the question. The most chain-friendly way is (probably) to use pandas' .loc with a callable:
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame({"Frequency": lst1, "lst2Tite": lst2, "lst3Tite": lst3})
df.loc[lambda _df: _df["Frequency"] < 6]
Simple!
Would this satisfy your needs?
df.mask(df['Frequency'] >= 6).dropna()
(Note that mask fills the dropped rows with NaN before dropna removes them, so the surviving columns come back as floats; convert back with astype if the dtype matters.)

Python [[0]] meaning

I am running a Python script (a Kaggle script). It works in a 3.4.5 virtualenv but not in 3.5.2, and I am not sure why. I am also not familiar with the [[0]] syntax. Below is the snippet.
import pandas as pd
data = pd.read_csv(r'path\train.csv')
labels_flat = data[[0]].values.ravel()
It should produce a list of values from the csv's first column.
In 3.5.2 I get this error:
KeyError: '[0] not in index'
I tried to replicate the value with
labels_flat = []
lf = data.values.tolist()
for row in lf:
    labels_flat.append(row[0])
But I don't think it is the same thing.
I don't think the problem is with the syntax; your DataFrame just does not contain the column label you are looking for.
For me this works:
In [1]: data = pd.DataFrame({0:[1,2,3], 1:[4,5,6], 2:[7,8,9]})
In [2]: data[[0]]
Out[2]:
0
0 1
1 2
2 3
I think what confuses you about the [[0]] syntax is that square brackets are used in Python for two completely different things, and the [[0]] statement uses both:
A. [] is used to create a list. In the above example, [0] creates a list with the single element 0.
B. [] is also used to access an element of a list (or dict, ...). So data[0] returns the element of data stored under the key 0.
The next confusing thing is that while usual Python lists are indexed by position (e.g. data[4] is the element at index 4), pandas DataFrames can be indexed by lists. This is syntactic sugar for accessing multiple columns of the dataframe at once.
So in my example from above, to get column 0 and 1 you can do:
In [3]: data[[0, 1]]
Out[3]:
0 1
0 1 4
1 2 5
2 3 6
Here the inner [0, 1] creates a list with the elements 0 and 1. The outer [] retrieves the columns of the dataframe, using the inner list as the index.
For more readability, look at this; it's exactly the same:
In [4]: l = [0, 1]
In [5]: data[l]
Out[5]:
0 1
0 1 4
1 2 5
2 3 6
If you only want the first column (column 0) you get this:
In [6]: data[[0]]
Out[6]:
0
0 1
1 2
2 3
Which is exactly what you were looking for.
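As a practical fix for the original error: pd.read_csv normally produces string column labels from the CSV's header row, so the literal label 0 is not among the columns. If the intent is "the first column by position" regardless of its label, iloc is the robust spelling (a sketch, reusing the path from the question):
import pandas as pd

data = pd.read_csv(r'path\train.csv')           # path kept from the question
labels_flat = data.iloc[:, [0]].values.ravel()  # first column by position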
