How to Convert Pandas Dataframe to Single List - python

Suppose I have a dataframe:
   col1  col2  col3
0     1     5     2
1     7    13
2     9     1
3           7
How do I convert to a single list such as:
[1, 7, 9, 5, 13, 1, 7, 2]
I have tried:
df.values.tolist()
However this returns a list of lists rather than a single list:
[[1.0, 5.0, 2.0], [7.0, 13.0, nan], [9.0, 1.0, nan], [nan, 7.0, nan]]
Note the dataframe will contain an unknown number of columns. The order of the values is not important so long as the list contains all values in the dataframe.
I imagine I could write a function to unpack the values, however I'm wondering if there is a simple built-in way of converting a dataframe to a series/list?

Following your current approach, you can flatten your array before converting it to a list. If you need to drop nan values, you can do that after flattening as well:
import numpy as np
arr = df.to_numpy().flatten()
list(arr[~np.isnan(arr)])
Note that recent versions of pandas recommend to_numpy over .values.
An alternate, perhaps cleaner, approach is to 'stack' your dataframe:
df.stack().tolist()
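Putting both approaches together on the question's frame (a sketch; the dataframe is reconstructed here with NaN for the blank cells):

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's dataframe (NaN for the blank cells)
df = pd.DataFrame({
    "col1": [1, 7, 9, np.nan],
    "col2": [5, 13, 1, 7],
    "col3": [2, np.nan, np.nan, np.nan],
})

# Approach 1: flatten the underlying array, then drop NaN
arr = df.to_numpy().flatten()
flat = list(arr[~np.isnan(arr)])

# Approach 2: stack() drops NaN automatically
stacked = df.stack().tolist()

print(sorted(flat))  # [1.0, 1.0, 2.0, 5.0, 7.0, 7.0, 9.0, 13.0]
```

Both produce the same multiset of values; only the order differs.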

You can use DataFrame.stack:
In [12]: df = pd.DataFrame({"col1":[np.nan,3,4,np.nan], "col2":['test',np.nan,45,3]})
In [13]: df.stack().tolist()
Out[13]: ['test', 3.0, 4.0, 45, 3]

For an ordered list (as per the problem statement), assuming your data contains only integer values:
First collect all items in the dataframe, then remove the nan values from the list.
items = [item for sublist in [df[col].tolist() for col in df.columns] for item in sublist]
items = [int(x) for x in items if str(x) != 'nan']
For an unordered list, again assuming only integer values:
items = [int(x) for x in sum(df.values.tolist(), []) if str(x) != 'nan']
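A quick check of the ordered variant on the question's data (a sketch; the frame is reconstructed with NaN for the blank cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 7, 9, np.nan],
    "col2": [5, 13, 1, 7],
    "col3": [2, np.nan, np.nan, np.nan],
})

# Column by column, then filter out nan and cast to int
items = [item for col in df.columns for item in df[col].tolist()]
items = [int(x) for x in items if str(x) != 'nan']
print(items)  # [1, 7, 9, 5, 13, 1, 7, 2]
```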

Adding list to cell in dataframe based on 2 conditions to delete elements in each list

I have two columns in a data frame, the first one with a list of numbers in each cell and the second one with a list of letters in each cell.
I want to create two more columns considering the following conditions:
When a value in the list in column "A" is < 1, that value stays in the list and the other ones are deleted; the same condition applies to the letter in column "B" at the same index as the number in column "A".
Output:
I wasn't able to do this within the dataframe, so I tried to create lists of lists and then add them as columns. This works fine if I use only a list, but for the columns it is not working.
I would like some advice for this.
big_a = []
big_b = []
new_list_a = []
new_list_b = []
for a, b in zip(x['COLUMN_A'], x['COLUMN_B']):
    if a < 1:
        new_list_a = []
        new_list_b = []
        new_list_a.append(a)
        new_list_b.append(b)
        big_a.append(new_list_a)
        big_b.append(new_list_b)
This gives me the following error:
TypeError: '<' not supported between instances of 'list' and 'int'
import numpy as np
import pandas as pd

def process(row):
    np_A = np.array(row.COLUMN_A)
    np_B = np.array(row.COLUMN_B)
    return np_A[np_A < 1], np_B[np_A < 1]

df[["NEW_A", "NEW_B"]] = df.apply(lambda row: pd.Series(process(row)), axis=1)
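A runnable sketch of this answer on hypothetical sample data (the column names follow the question; the values here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the question's columns
df = pd.DataFrame({
    "COLUMN_A": [[0.99, 1.0, 1.0], [1.0, 0.25, 0.87]],
    "COLUMN_B": [["a", "b", "c"], ["a", "b", "c"]],
})

def process(row):
    np_A = np.array(row.COLUMN_A)
    np_B = np.array(row.COLUMN_B)
    mask = np_A < 1          # boolean mask of values to keep
    return np_A[mask], np_B[mask]

df[["NEW_A", "NEW_B"]] = df.apply(lambda row: pd.Series(process(row)), axis=1)
print(df["NEW_B"].tolist())
```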
import numpy as np
import pandas as pd
# Create the dataframe
df = pd.DataFrame({
    'A': [[0.99, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 0.25, 0.87]],
    'B': [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]
})
# Convert the lists to numpy ndarrays
df = df.applymap(np.asarray)
# Explode the dataframe
df = df.reset_index().apply(pd.Series.explode).set_index(['index', 'B'])
# Filter for rows whose value for column 'A' is less than 1
df = df[df < 1].dropna().reset_index().groupby('index').agg(list)
The initial DataFrame is
A B
0 [0.99, 1.0, 1.0] [a, b, c]
1 [1.0, 1.0, 1.0] [a, b, c]
2 [1.0, 0.25, 0.87] [a, b, c]
The final DataFrame will look like:
B A
index
0 [a] [0.99]
2 [b, c] [0.25, 0.87]
Notes:
Find more details about pandas explode here.

Populate Pandas Series with list

I would like to populate a pd.Series() with a list.
I tried doing the following:
series = pd.Series(index=['a','b','c','d'])
series['a'] = 2
series['b'] = [2,3]
This is the error that I get. How can I populate the list in the pd.Series?
File "C:\Users\Sergej Shteriev\Anaconda3\lib\site-packages\pandas\core\internals.py", line 940, in setitem
values[indexer] = value
ValueError: setting an array element with a sequence.
This is because the initial dtype is assumed to be float (as the series is filled with NaNs).
series.dtype
# dtype('float64')
Since lists are only supported by object type columns, you'd need to cast before assigning.
series = series.astype(object)
series['b'] = [2, 3]
series
a 2 # this is still a float
b [2, 3]
c NaN
d NaN
dtype: object
series.tolist()
# [2.0, [2, 3], nan, nan]
A better suggestion is to declare series as an object at the start if that's what you intend stuffing into it.
series = pd.Series(index=['a','b','c','d'], dtype=object)
series['a'] = 2
series['b'] = [2, 3]
series
a 2
b [2, 3]
c NaN
d NaN
dtype: object
series.tolist()
# [2, [2, 3], nan, nan]
Of course, for performance reasons, I don't recommend this. You're better off using plain Python lists -- they're usually faster than object Series.

Remove 'nan' from Dictionary of list

My data contain columns with empty rows that are read by pandas as nan.
I want to create a dictionary of list from this data. However, some list contains nan and I want to remove it.
If I use dropna() in data.dropna().to_dict(orient='list'), this will remove all the rows that contain at least one nan, therefore I lose data.
Col1 Col2 Col3
a    x    r
b    y    v
c         x
          z
data = pd.read_csv(sys.argv[2], sep = ',')
dict = data.to_dict(orient='list')
Current output:
dict = {Col1: ['a','b','c',nan], Col2: ['x', 'y',nan,nan], Col3: ['r', 'v', 'x', 'z']}
Desired output:
dict = {Col1: ['a','b','c'], Col2: ['x', 'y'], Col3: ['r', 'v', 'x', 'z']}
My goal: get the dictionary of a list, with nan remove from the list.
Not sure exactly the format you're expecting, but you can use list comprehension and itertuples to do this.
First create some data.
import pandas as pd
import numpy as np
data = pd.DataFrame.from_dict({'Col1': (1, 2, 3), 'Col2': (4, 5, 6), 'Col3': (7, 8, np.nan)})
print(data)
Giving a data frame of:
Col1 Col2 Col3
0 1 4 7.0
1 2 5 8.0
2 3 6 NaN
And then we create the dictionary using the iterator.
dict_1 = {x[0]: [y for y in x[1:] if not pd.isna(y)] for x in data.itertuples(index=True)}
print(dict_1)
>>>{0: [1, 4, 7.0], 1: [2, 5, 8.0], 2: [3, 6]}
To do the same for the columns is even easier:
dict_2 = {data[column].name: [y for y in data[column] if not pd.isna(y)] for column in data}
print(dict_2)
>>>{'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7.0, 8.0]}
I am not sure if I understand your question correctly, but if what you want is to replace the nan with a value so as not to lose your data, then what you are looking for is the pandas.DataFrame.fillna function. You mentioned the original value is an empty row, so data.fillna('') fills it with an empty string.
EDIT: After providing the desired output, the answer to your question changes a bit. What you'll need to do is to use dict comprehension with list comprehension to build said dictionary, looping by column and filtering nan. I see that Andrew already provided the code to do this in his answer so have a look there.
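The per-column filtering both answers describe can also be written with Series.dropna. A minimal sketch, recreating the sample data with NaN for the empty cells:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Col1": ["a", "b", "c", np.nan],
    "Col2": ["x", "y", np.nan, np.nan],
    "Col3": ["r", "v", "x", "z"],
})

# Drop NaN per column, then build the dict of lists
result = {col: data[col].dropna().tolist() for col in data.columns}
print(result)  # {'Col1': ['a', 'b', 'c'], 'Col2': ['x', 'y'], 'Col3': ['r', 'v', 'x', 'z']}
```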

Largest (n) numbers with Index and Column name in Pandas DataFrame

I wish to find the largest 5 numbers in a DataFrame and store the index name and column name for these 5 values.
I am trying to use the nlargest() and idxmax methods but failing to achieve what I want. My code is as below:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = DataFrame({'a': [1, 10, 8, 11, -1],'b': [1.0, 2.0, 6, 3.0, 4.0],'c': [1.0, 2.0, 6, 3.0, 4.0]})
Can you kindly let me know how I can achieve this. Thank you.
Use stack and nlargest:
max_vals = df.stack().nlargest(5)
This will give you a Series with a multiindex, where the first level is the original DataFrame's index, and the second level is the column name for the given value. Here's what max_vals looks like:
3 a 11.0
1 a 10.0
2 a 8.0
b 6.0
c 6.0
To explicitly get the index and column names, use get_level_values on the index of max_vals:
max_idx = max_vals.index.get_level_values(0)
max_cols = max_vals.index.get_level_values(1)
The result of max_idx:
Int64Index([3, 1, 2, 2, 2], dtype='int64')
The result of max_cols:
Index(['a', 'a', 'a', 'b', 'c'], dtype='object')
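To collect the five locations as (index, column, value) triples in one pass, one possible sketch using Series.items on the stacked result:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
                   'b': [1.0, 2.0, 6, 3.0, 4.0],
                   'c': [1.0, 2.0, 6, 3.0, 4.0]})

max_vals = df.stack().nlargest(5)
# Each MultiIndex entry is an (index, column) pair
located = [(idx, col, val) for (idx, col), val in max_vals.items()]
print(located)
```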

average over multiple entries in a sorted list

I have a sorted 2-dimensional list in which in the first column a specific value can occur multiple times, but with different corresponding values in the second column.
Example:
1 10
2 20
3 30
3 35
4 40
5 45
5 50
5 55
6 60
I'd like to average over those multiple entries, so that my final list looks like
1 10
2 20
3 32.5
4 40
5 50
6 60
One problem is that you don't know how many times a value occurs. My code so far looks like:
for i in range(len(list)):
    print i
    if i+1 < len(list):
        if list[i][0] == list[i+1][0]:
            j = 0
            sum = 0
            while list[i][0] == list[i+j][0]: # this while loop is there to account for the unknown number of multiple values
                sum += list[i+j][1]
                j += 1
            avg = sum / j
            #print avg
            #i += j  # here I try to skip the next j steps in the for loop, but it doesn't work
            #final[i].append(i)
            #final[i].append(avg)  # How do I append a tuple [i, avg] to the final list?
        else:
            final.append(list[i])
    else:
        final.append(list[i])
print final
My questions are:
1. How do I properly account for the multiple entries and avoid counting them twice with the for loop?
2. How do I append a tuple [i, avg] to the final list?
The following code uses groupby from itertools:
lst = [[1, 10],
       [2, 20],
       [3, 30],
       [3, 35],
       [4, 40],
       [5, 45],
       [5, 50],
       [5, 55],
       [6, 60],
       ]

from itertools import groupby

avglst = []
for grpname, grpvalues in groupby(lst, lambda itm: itm[0]):
    values = [itm[1] for itm in grpvalues]
    avgval = float(sum(values)) / len(values)
    avglst.append([grpname, avgval])

print(avglst)
When run:
$ python avglist.py
[[1, 10.0], [2, 20.0], [3, 32.5], [4, 40.0], [5, 50.0], [6, 60.0]]
it provides the result you asked for.
Explanation:
groupby takes an iterable (the list) and a function that calculates a so-called key, i.e. a value used for forming groups. In our case we group by the first element of each list item.
Note that groupby starts a new group each time the key value changes, so be sure your input list is sorted, otherwise you will get more groups than you expect.
groupby yields tuples (grpname, grpvalues), where grpname is the key value for a given group and grpvalues is an iterator over all items in that group. Be careful: it is not a list; to get a list from it, something (like a call to list(grpvalues)) must iterate over the values. In our case we iterate using a list comprehension, picking only the 2nd item of each list element.
While iterators, generators and similar constructs in Python might seem too complex at first, they serve excellently the moment one has to process very large lists and iterables. In such cases, Python iterators hold only the current item in memory, so one can manage really huge or even endless iterables.
You can use a dictionary to count how many times each value in the left column occurs, and a separate dictionary to map each left entry to the sum of its associated elements. Then, with one final for loop, divide the sum by the count.
from collections import defaultdict

someList = [(1,10), (2,20), (3,30), (4,40), (5,45), (5,50), (5,55)]
count_dict = defaultdict(lambda: 0)
sum_dict = defaultdict(lambda: 0.0)

for left_val, right_val in someList:
    count_dict[left_val] += 1
    sum_dict[left_val] += right_val

for left_val in sorted(count_dict):
    print left_val, sum_dict[left_val]/count_dict[left_val]
Output
1 10.0
2 20.0
3 30.0
4 40.0
5 50.0
First we need to group the columns together. We'll do this with a dictionary where the key is the left column and the value is a list of the values for that key. Then, we can do a simple calculation to get the averages.
from collections import defaultdict
data = [
(1, 10),
(2, 20),
(3, 30),
(3, 35),
(4, 40),
(5, 45),
(5, 50),
(5, 55),
(6, 60)
]
# Organize the data into a dict
d = defaultdict(list)
for key, value in data:
    d[key].append(value)

# Calculate the averages
averages = dict()
for key in d:
    averages[key] = sum(d[key]) / float(len(d[key]))
# Use the averages
print(averages)
Output:
{1: 10.0, 2: 20.0, 3: 32.5, 4: 40.0, 5: 50.0, 6: 60.0}
Here's how you can do it with a combination of Counter and OrderedDict:
from __future__ import division # Python 2
from collections import Counter, OrderedDict
counts, sums = OrderedDict(), Counter()
for left, right in [(1,10), (2,20), (3,30), (4,40), (5,45), (5,50), (5,55)]:
    counts[left] = counts.get(left, 0) + 1
    sums[left] += right
result = [(key, sums[key]/counts[key]) for key in counts]
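On Python 3.7+, plain dicts preserve insertion order, so the same idea works with defaultdict alone (a sketch of that variant, not from the answer above):

```python
from collections import defaultdict

counts = defaultdict(int)
sums = defaultdict(float)
for left, right in [(1, 10), (2, 20), (3, 30), (4, 40), (5, 45), (5, 50), (5, 55)]:
    counts[left] += 1
    sums[left] += right

# dicts keep insertion order on Python 3.7+
result = [(key, sums[key] / counts[key]) for key in counts]
print(result)  # [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0)]
```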
