Read lists into columns of pandas DataFrame - python

I want to load lists into columns of a pandas DataFrame but cannot seem to do this simply. This is an example of what I want using transpose() but I would think that is unnecessary:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: x = np.linspace(0,np.pi,10)
In [4]: y = np.sin(x)
In [5]: data = pd.DataFrame(data=[x,y]).transpose()
In [6]: data.columns = ['x', 'sin(x)']
In [7]: data
Out[7]:
x sin(x)
0 0.000000 0.000000e+00
1 0.349066 3.420201e-01
2 0.698132 6.427876e-01
3 1.047198 8.660254e-01
4 1.396263 9.848078e-01
5 1.745329 9.848078e-01
6 2.094395 8.660254e-01
7 2.443461 6.427876e-01
8 2.792527 3.420201e-01
9 3.141593 1.224647e-16
[10 rows x 2 columns]
Is there a way to directly load each list into a column to eliminate the transpose and insert the column labels when creating the DataFrame?

Someone just recommended creating a dictionary from the data then loading that into the DataFrame like this:
In [8]: data = pd.DataFrame({'x': x, 'sin(x)': y})
In [9]: data
Out[9]:
x sin(x)
0 0.000000 0.000000e+00
1 0.349066 3.420201e-01
2 0.698132 6.427876e-01
3 1.047198 8.660254e-01
4 1.396263 9.848078e-01
5 1.745329 9.848078e-01
6 2.094395 8.660254e-01
7 2.443461 6.427876e-01
8 2.792527 3.420201e-01
9 3.141593 1.224647e-16
[10 rows x 2 columns]
Note that a dictionary was not guaranteed to preserve key order before Python 3.7. If you care about the column order, pass a list of the keys in the order you want (you can also use this list to include only some of the dict entries):
data = pd.DataFrame({'x': x, 'sin(x)': y}, columns=['x', 'sin(x)'])
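A minimal sketch of how the columns argument behaves with a dict: it both orders and filters the keys. The extra 'cos(x)' entry is an assumption added only to show the filtering; it is not part of the original example.

```python
import numpy as np
import pandas as pd

x = np.linspace(0, np.pi, 10)
# 'cos(x)' is an illustrative extra entry, not from the original question
arrays = {'x': x, 'sin(x)': np.sin(x), 'cos(x)': np.cos(x)}

# `columns` selects which dict keys become columns, and in what order
subset = pd.DataFrame(arrays, columns=['sin(x)', 'x'])
print(list(subset.columns))
```

Keys left out of the list ('cos(x)' here) are simply dropped from the result.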

Here's another 1-line solution preserving the specified order, without typing x and sin(x) twice:
data = pd.concat([pd.Series(x,name='x'),pd.Series(y,name='sin(x)')], axis=1)

If you don't care about the column names, you can use this:
pd.DataFrame(zip(*[x,y]))
Run-time-wise, it is as fast as the dict option, and both are much faster than using transpose.

Related

How to find the number of unique values in comma-separated strings stored in a pandas DataFrame column?

x Unique_in_x
5,5,6,7,8,6,8 4
5,9,8,0 4
5,9,8,0 4
3,2 2
5,5,6,7,8,6,8 4
Unique_in_x is my expected column. Sometimes the x column might also contain strings.
You can use a list comprehension with a set
df['Unique_in_x'] = [len(set(x.split(','))) for x in df['x']]
Or using a split and nunique:
df['Unique_in_x'] = df['x'].str.split(',', expand=True).nunique(1)
Output:
x Unique_in_x
0 5,5,6,7,8,6,8 4
1 5,9,8,0 4
2 5,9,8,0 4
3 3,2 2
4 5,5,6,7,8,6,8 4
You can find the unique values of each list with np.unique() and then just take the length:
import pandas as pd
import numpy as np
df['Unique_in_x'] = df['x'].apply(lambda x: len(np.unique(x.split(','))))
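Since the question mentions that x may also hold strings, here is a hedged sketch of a small helper that strips whitespace and ignores empty tokens before counting. The helper name count_unique and the last sample row are my own assumptions, not from the question.

```python
import pandas as pd

# The last row is a made-up string example with stray spaces
df = pd.DataFrame({'x': ['5,5,6,7,8,6,8', '5,9,8,0', '3,2', 'a, b,a']})

def count_unique(cell):
    """Hypothetical helper: count distinct comma-separated tokens in one cell."""
    tokens = {t.strip() for t in str(cell).split(',')}
    tokens.discard('')  # ignore empty tokens such as a trailing comma
    return len(tokens)

df['Unique_in_x'] = df['x'].map(count_unique)
print(df)
```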

Pandas groupby aggregation with percentages

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"val": np.random.randint(1, 10, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
val cat
0 3 Z
1 3 X
2 7 Y
3 2 Z
4 4 Y
5 7 X
6 2 X
7 1 X
8 2 X
9 1 Y
I want to know the percentage each category X, Y, and Z has of the entire val column sum. I can aggregate df like this:
total_sum = df.val.sum()
#32
s = df.groupby("cat").val.sum().div(total_sum)*100
#this is the desired result in % of total val
cat
X 46.875 #15/32
Y 37.500 #12/32
Z 15.625 #5/32
Name: val, dtype: float64
However, I find it rather surprising that pandas seemingly does not have a percentage/frequency function, something like df.groupby("cat").val.freq(), alongside df.groupby("cat").val.sum() or df.groupby("cat").val.mean(). I assumed this is a common operation, and Series.value_counts implements it with normalize=True, but for groupby aggregation I cannot find anything similar. Am I missing something here, or is there indeed no out-of-the-box function?
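One way to package the computation from the question is a small helper that divides the group sums by their own total. This is only a sketch: pct_of_total is a hypothetical name, and the data is typed in from the table above rather than regenerated from the random seed.

```python
import pandas as pd

# Values copied from the table in the question
df = pd.DataFrame({
    "val": [3, 3, 7, 2, 4, 7, 2, 1, 2, 1],
    "cat": ["Z", "X", "Y", "Z", "Y", "X", "X", "X", "X", "Y"],
})

def pct_of_total(frame, by, value):
    """Hypothetical helper: each group's share of `value`, in percent."""
    sums = frame.groupby(by)[value].sum()
    return sums / sums.sum() * 100

pct = pct_of_total(df, "cat", "val")
print(pct)
```

Because the group sums add up to the column total, dividing by sums.sum() gives the same result as dividing by df.val.sum().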

Extract data-set with a given range from a larger dataset

I have a 2D data set of (X, Y) values, like this:
X Y
99.96 2
99.76 4
100.15 6
100.28 0
100.66 11
101.17 14
102.36 4
I wish to extract a part of above 2D data-set such that 100.00 <= X <= 100.99 and its corresponding Y-values.
So the output generated would be as such:
X Y
100.15 6
100.28 0
100.66 11
Can anybody please tell me how to do this in Python?
You can create a DataFrame from your data using pandas and filter it with between.
You can use pd.read_csv, pd.read_excel, pd.DataFrame.from_dict, etc. to load your source data easily.
import pandas as pd
# example pd read csv
# df = pd.read_csv('somefile.csv', header=0)
df = pd.DataFrame([[1,2],[3,4],[5,6],[2,3],[4,5]], columns=['a','b'])
print(df[df['a'].between(2,4)])
# a b
#1 3 4
#3 2 3
#4 4 5
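Applied to the data from the question, the same between filter might look like the sketch below. The column names X and Y come from the question; the Y value for X = 100.28 is taken as 0, matching the data used in the other answers.

```python
import pandas as pd

df = pd.DataFrame({
    'X': [99.96, 99.76, 100.15, 100.28, 100.66, 101.17, 102.36],
    'Y': [2, 4, 6, 0, 11, 14, 4],  # Y for 100.28 assumed to be 0
})

# between() is inclusive on both ends by default
subset = df[df['X'].between(100.00, 100.99)]
print(subset)
```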
Maybe just a simple loop, without any third-party package?
If you need to save the result, just replace the print call with result.append((x, y)).
data = [[99.96, 2],
        [97, 4],
        [100.15, 6],
        [100.28, 0],
        [101.17, 14],
        [102.36, 11]]
for x, y in data:
    # print(x, y)
    if 100.00 <= x <= 100.99:
        print(x, y)
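If you want to keep the matching rows rather than print them, a list comprehension is a compact variant of the loop above. This is a sketch using the same sample data; the name result is just illustrative.

```python
data = [[99.96, 2], [97, 4], [100.15, 6],
        [100.28, 0], [101.17, 14], [102.36, 11]]

# Keep only the (x, y) pairs whose x lies in the closed range
result = [(x, y) for x, y in data if 100.00 <= x <= 100.99]
print(result)
```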
If the given data is a numpy.ndarray, then we can use np.where like this:
import numpy as np
# Original data
data = np.array([[99.96,2],[99.76,4],[100.15,6],[100.28,0],[100.66,11],[101.17,14],[102.36,4]])
print("\n","Original data=\n",data)
# Extracted data: rows whose first column lies in [100.00, 100.99]
data_extracted = data[np.where((data[:,0] >= 100.00) & (data[:,0] <= 100.99))]
print("\n","Extracted data=\n",data_extracted)

Pandas dataframe: creating a new column that is a custom function using 2 other columns

Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function like this:
def someThingSpecial(x, y):
    # z = do something special with x, y
    return z
I now want to create a new column in dfX that holds the computed z value.
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
Which returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x,y):
    return x + y
df.apply(lambda x: someThingSpecial(x.A, x.B), 1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
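To compare the options side by side, here is a small sketch. It assumes the same stand-in function (x + y) used above; for a function built only from arithmetic like this, calling it directly on the columns also works and avoids the per-row loop entirely.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 6, 9]})

def someThingSpecial(x, y):
    return x + y  # stand-in for the real computation

# Row-wise apply: calls the function once per row
via_apply = df.apply(lambda row: someThingSpecial(row['A'], row['B']), axis=1)

# np.vectorize: still loops internally, returns a NumPy array
via_vectorize = np.vectorize(someThingSpecial)(df['A'], df['B'])

# Direct call on the columns: only valid if the function body
# uses operations that work element-wise on Series
direct = someThingSpecial(df['A'], df['B'])

df['C'] = direct
print(df)
```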
If your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
to store it:
dfX['C'] = dfX['A'].apply(your_func)

Take multiple lists into dataframe

How do I take multiple lists and put them as different columns in a python dataframe? I tried this solution but had some trouble.
Attempt 1:
Have three lists, and zip them together and use that res = zip(lst1,lst2,lst3)
Yields just one column
Attempt 2:
percentile_list = pd.DataFrame({'lst1Tite' : [lst1],
'lst2Tite' : [lst2],
'lst3Tite' : [lst3] },
columns=['lst1Tite','lst1Tite', 'lst1Tite'])
yields either one row by 3 columns (the way above) or if I transpose it is 3 rows and 1 column
How do I get a 100 row (length of each independent list) by 3 column (three lists) pandas dataframe?
I think you're almost there, try removing the extra square brackets around the lst's (Also you don't need to specify the column names when you're creating a dataframe from a dict like this):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
{'lst1Title': lst1,
'lst2Title': lst2,
'lst3Title': lst3
})
percentile_list
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
...
If you need a more performant solution, you can use np.column_stack rather than zip as in your first attempt. This gives around a 2x speedup on the example here, but comes at a bit of a cost in readability, in my opinion:
import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=['lst1Title', 'lst2Title', 'lst3Title'])
Adding to Aditya Guru's answer here. There is no need of using map. You can do it simply by:
pd.DataFrame(list(zip(lst1, lst2, lst3)))
This will set the column's names as 0,1,2. To set your own column names, you can pass the keyword argument columns to the method above.
pd.DataFrame(list(zip(lst1, lst2, lst3)),
columns=['lst1_title','lst2_title', 'lst3_title'])
Adding one more scalable solution.
lists = [lst1, lst2, lst3, lst4]
df = pd.concat([pd.Series(x) for x in lists], axis=1)
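The concat approach can also name the columns in the same call by passing keys. A minimal sketch, with short made-up lists and titles standing in for the real data:

```python
import pandas as pd

# Illustrative data; the lists and titles are placeholders
lst1, lst2, lst3 = [1, 2, 3], [4, 5, 6], [7, 8, 9]
names = ['lst1Title', 'lst2Title', 'lst3Title']

# With axis=1, `keys` supplies the column labels, in list order
df = pd.concat([pd.Series(v) for v in (lst1, lst2, lst3)], axis=1, keys=names)
print(df)
```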
There are several ways to create a dataframe from multiple lists.
list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]
pd.DataFrame({'list1':list1, 'list2':list2, 'list3':list3})
pd.DataFrame(data=zip(list1,list2,list3),columns=['list1','list2','list3'])
Just adding that using the first approach it can be done as -
pd.DataFrame(list(map(list, zip(lst1,lst2,lst3))))
Adding to above answers, we can create on the fly
df= pd.DataFrame()
list1 = list(range(10))
list2 = list(range(10,20))
df['list1'] = list1
df['list2'] = list2
print(df)
hope it helps !
@oopsi used pd.concat() but didn't include the column names. You could do the following, which, unlike the first solution in the accepted answer, gives you explicit control over the column order (it avoids dicts, which were unordered before Python 3.7):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
s1=pd.Series(lst1,name='lst1Title')
s2=pd.Series(lst2,name='lst2Title')
s3=pd.Series(lst3 ,name='lst3Title')
percentile_list = pd.concat([s1,s2,s3], axis=1)
percentile_list
Out[2]:
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
...
You can simply use the following code:
train_data['labels']= train_data[["LABEL1","LABEL1","LABEL2","LABEL3","LABEL4","LABEL5","LABEL6","LABEL7"]].values.tolist()
train_df = pd.DataFrame(train_data, columns=['text','labels'])
I just did it like this (python 3.9):
import pandas as pd
my_dict=dict(x=x, y=y, z=z) # Set column ordering here
my_df=pd.DataFrame.from_dict(my_dict)
This seems to be reasonably straightforward (albeit in 2022) unless I am missing something obvious...
In python 2 one could've used a collections.OrderedDict().
