I am running a Python script (a Kaggle script). It works in a Python 3.4.5 virtualenv, but not in 3.5.2.
I am not sure why, and I am not familiar with the [[0]] syntax. Below is the snippet.
import pandas as pd
data = pd.read_csv(r'path\train.csv')
labels_flat = data[[0]].values.ravel()
It should produce a list of values from the CSV's first column.
In 3.5.2 I get this error:
KeyError: '[0] not in index'
I tried to replicate the value with
labels_flat = []
lf = data.values.tolist()
for row in lf:
    labels_flat.append(row[0])
But I don't think it is the same thing.
I don't think the problem is with the syntax; your DataFrame just does not contain the index you are looking for.
For me this works:
In [1]: data = pd.DataFrame({0:[1,2,3], 1:[4,5,6], 2:[7,8,9]})
In [2]: data[[0]]
Out[2]:
0
0 1
1 2
2 3
I think what confuses you about the [[0]] syntax is that square brackets are used in Python for two completely different things, and the [[0]] statement uses both:
A. [] is used to create a list. In the example above, [0] creates a list with the single element 0.
B. [] is also used to access an element of a list (or dict, ...). So data[0] returns the 0th element of data.
The next confusing thing is that while ordinary Python lists are indexed by numbers (e.g. data[4] is the 4th element of data), pandas DataFrames can be indexed by lists. This is syntactic sugar for easily accessing multiple columns of the dataframe at once.
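To see the two bracket roles side by side in plain Python (a tiny illustration, independent of pandas):
inner = [0]           # A: the brackets build a one-element list
values = [10, 20, 30]
values[0]             # B: the brackets index into values; returns 10
values[inner[0]]      # both at once: also returns 10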
So in my example from above, to get column 0 and 1 you can do:
In [3]: data[[0, 1]]
Out[3]:
0 1
0 1 4
1 2 5
2 3 6
Here the inner [0, 1] creates a list with the elements 0 and 1. The outer [ ] retrieves the columns of the dataframe, using the inner list as the index.
For readability, look at this; it is exactly the same:
In [4]: l = [0, 1]
In [5]: data[l]
Out[5]:
0 1
0 1 4
1 2 5
2 3 6
If you only want the first column (column 0), you get this:
In [6]: data[[0]]
Out[6]:
0
0 1
1 2
2 3
Which is exactly what you were looking for.
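As a side note on the original error: if train.csv was read with a header row, the column labels are strings, so the integer 0 is not among them and data[[0]] raises exactly that KeyError. A position-based sketch that sidesteps the labels entirely (assuming you simply want the first column, whatever it is named):
import pandas as pd

data = pd.read_csv(r'path\train.csv')
labels_flat = data.iloc[:, 0].values.ravel()  # select the first column by position, not by label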
My code works when the list I am passing contains only integers; using strings instead leads to the error in the title.
Here is my code:
def get_support(self, data, itemset):
    return data[itemset].all(axis='columns').sum()
    # I also tried: return data.loc[:, itemset].all(axis='columns').sum()
    # This function returns the number of True values (from .all()) for a
    # given column or set of columns.
A sample of data where this code works:
0 1 2
0 0 0 1
1 1 1 1
2 1 1 0
3 0 1 0
4 1 1 1
5 1 1 0
Running get_support(df, [0]) returns 4, and running get_support(df, [0, 2]) returns 2.
However, once columns are labeled, the code no longer works and outputs the error. I've checked the .csv file, and it's completely clean, with no spaces or extra stuff.
A sample of data that will cause an error in my code:
Red Yellow Blue
0 0 0 1
1 1 1 1
2 1 1 0
3 0 1 0
4 1 1 1
5 1 1 0
Where exactly am I wrong?
Edit: Thank you very much to @osint_alex. The error is gone now, but there is unfortunately a new problem:
print(get_support(temp_df, ['A']))
print(get_support(temp_df, ['A', 'B']))
print(get_support(temp_df, ['A', 'B', 'C']))
Running this block of code outputs the same value for each call: 9835, which is the number of rows in the dataset.
I have tried commenting out the other lines, and I still get 9835. However, after checking the .csv file, I should only get 516 for A (I am unable to test the others).
As of now I am still trying to solve it on my own, but the numbers are so far off that I do not even know where to begin.
I think what is going wrong is that you are using itemset to select the columns of interest in the dataframe. So if you want itemset to denote the indices of the columns - which is what I assume you want to do if you always pass in numbers - then the get_support function needs to change to reflect this.
Essentially if you pass in numbers and the columns are labelled with strings, you'll get a key error because those numbers aren't the column names.
Here is a suggested revision:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [3, 4, 0]})

def get_support(data, itemset):
    # translate the positional indices in itemset into column labels first
    cols = [x for x in data.columns if list(data.columns).index(x) in itemset]
    return data[cols].all(axis='columns').sum()
print(get_support(df, [0, 1]))
Out:
2
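To check the revision against the labelled sample from the question (a quick sketch; df2 is just a hypothetical name):
df2 = pd.DataFrame({'Red':    [0, 1, 1, 0, 1, 1],
                    'Yellow': [0, 1, 1, 1, 1, 1],
                    'Blue':   [1, 1, 0, 0, 1, 0]})
print(get_support(df2, [0]))     # 4: rows where Red is truthy
print(get_support(df2, [0, 2]))  # 2: rows where both Red and Blue are truthy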
You're going wrong here:
data[itemset].all(axis = 'columns').sum()
You can't sum() a string. You could run it through a data cleaning function first to make sure the list only has integers or floats.
I am slicing a pandas dataframe and I seem to be getting unexpected slices using .loc, at least compared to numpy and ordinary Python slicing. See the example below.
>>> import pandas as pd
>>> a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
>>> a
0 1 2
0 0 1 2
1 3 4 5
2 4 5 6
3 9 10 11
4 34 2 1
>>> a.loc[1:3, :]
0 1 2
1 3 4 5
2 4 5 6
3 9 10 11
>>> a.values[1:3, :]
array([[3, 4, 5],
[4, 5, 6]])
Interestingly, this only happens with .loc, not .iloc.
>>> a.iloc[1:3, :]
0 1 2
1 3 4 5
2 4 5 6
Thus, .loc appears to be inclusive of the terminating index, but numpy and .iloc are not.
Judging by the comments, it seems this is not a bug and we are well warned. But why is this the case?
Remember .loc is primarily label based indexing. The decision to include the stop endpoint becomes far more obvious when working with a non-RangeIndex:
df = pd.DataFrame([1,2,3,4], index=list('achz'))
# 0
#a 1
#c 2
#h 3
#z 4
If I want to select all rows between 'a' and 'h' (inclusive), I only know about 'a' and 'h'. To be consistent with other Python slicing, you would also need to know which index follows 'h'; in this case it is 'z', but it could have been anything.
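For example, continuing with the frame above, the stop label is kept:
df.loc['a':'h']
#    0
# a  1
# c  2
# h  3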
There's also a section of the documentation, somewhat hidden away, that explains this design choice: Endpoints are inclusive.
In addition to the point in the docs, pandas slice indexing with .loc is not cell-index based. It is in fact value-based indexing (the pandas docs call it "label based", but for numerical data I prefer the term "value based"), whereas .iloc uses traditional numpy-style cell indexing.
Furthermore, value based indexing is right-inclusive, whereas cell indexing is not. Just try the following:
a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
a.index = [0, 1, 2, 3.1, 4] # add a float index
# value based slicing: the following will output all values up to and including the slice value
a.loc[1:3.1]
# Out:
# 0 1 2
# 1.0 3 4 5
# 2.0 4 5 6
# 3.1 9 10 11
# index based slicing: will raise an error, since only integers are allowed
a.iloc[1:3.1]
# Out: TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Float64Index'> with these indexers [3.1] of <class 'float'>
To give an explicit answer to your question why it is right-inclusive:
When using values/labels as indices it is, at least in my opinion, intuitive that the last index is included. As far as I know, this is a deliberate design decision about how the function is meant to work.
Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series, storing only the first element of each run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen
collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
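For the example above, collapse(example_series) returns [1, 2, 3].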
You could do the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x-x.shift(1)
y[0] = 1
result = x[y!=0]
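For the example series, result keeps the first element of each run, with its original index:
0    1
3    2
6    3
9    1
dtype: int64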
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
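Note that the first element is always kept: s.shift() places NaN in the first position, and since NaN never compares equal to anything, s != s.shift() is True there.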
Say I have a list (or numpy array or pandas series) as below
l = [1,2,6,6,4,2,4]
I want to return a list of each value's ordinal: 1 --> 1 (smallest), 2 --> 2, 4 --> 3, 6 --> 4, so that
to_ordinal(l) == [1, 2, 4, 4, 3, 2, 3]
I also want it to work for lists of strings as input.
I can try
s = numpy.unique(l)
and then loop over each element in l and find its index in s. I just wonder whether there is a direct method.
In pandas you can call rank and pass method='dense':
In [18]:
l = [1,2,6,6,4,2,4]
s = pd.Series(l)
s.rank(method='dense')
Out[18]:
0 1
1 2
2 4
3 4
4 3
5 2
6 3
dtype: float64
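If you want the plain integer list from the question, you can cast the ranks:
s.rank(method='dense').astype(int).tolist()  # [1, 2, 4, 4, 3, 2, 3]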
This also works for strings:
In [19]:
l = ['aaa','abc','aab','aba']
s = pd.Series(l)
s
Out[19]:
0 aaa
1 abc
2 aab
3 aba
dtype: object
In [20]:
s.rank(method='dense')
Out[20]:
0 1
1 4
2 2
3 3
dtype: float64
I don't think that there is a "direct method" for this.¹ The most straightforward way I can think of is to sort a set of the elements:
sorted_unique = sorted(set(l))
Then make a dictionary mapping each value to its ordinal:
ordinal_map = {val: i for i, val in enumerate(sorted_unique, 1)}
Now one more pass over the data and we can get your list:
ordinals = [ordinal_map[val] for val in l]
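For the list from the question, l = [1, 2, 6, 6, 4, 2, 4], this gives ordinal_map == {1: 1, 2: 2, 4: 3, 6: 4} and ordinals == [1, 2, 4, 4, 3, 2, 3].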
Note that this is roughly an O(N log N) algorithm (due to the sort) - and the more non-unique elements you have, the closer it gets to O(N).
¹ Certainly not in vanilla Python, and I don't know of anything in numpy. I'm less familiar with pandas, so I can't speak to that.
How do I take multiple lists and put them as different columns in a python dataframe? I tried this solution but had some trouble.
Attempt 1:
I have three lists, zip them together, and use the result: res = zip(lst1, lst2, lst3)
This yields just one column.
Attempt 2:
percentile_list = pd.DataFrame({'lst1Tite' : [lst1],
                                'lst2Tite' : [lst2],
                                'lst3Tite' : [lst3] },
                               columns=['lst1Tite', 'lst2Tite', 'lst3Tite'])
This yields either one row by three columns (the way above) or, if I transpose it, three rows and one column.
How do I get a 100 row (length of each independent list) by 3 column (three lists) pandas dataframe?
I think you're almost there; try removing the extra square brackets around the lsts. (Also, you don't need to specify the column names when you're creating a dataframe from a dict like this.)
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
{'lst1Title': lst1,
'lst2Title': lst2,
'lst3Title': lst3
})
percentile_list
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
...
If you need a more performant solution, you can use np.column_stack rather than zip as in your first attempt. This has around a 2x speedup on the example here, but comes at a bit of a cost in readability, in my opinion:
import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=['lst1Title', 'lst2Title', 'lst3Title'])
Adding to Aditya Guru's answer here: there is no need to use map. You can do it simply by:
pd.DataFrame(list(zip(lst1, lst2, lst3)))
This will set the column names to 0, 1, 2. To set your own column names, you can pass the keyword argument columns to the method above:
pd.DataFrame(list(zip(lst1, lst2, lst3)),
columns=['lst1_title','lst2_title', 'lst3_title'])
Adding one more scalable solution.
lists = [lst1, lst2, lst3, lst4]
df = pd.concat([pd.Series(x) for x in lists], axis=1)
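A side benefit of this approach: pd.concat aligns the Series on their indices, so lists of unequal length are padded with NaN instead of raising an error, unlike the dict-of-lists constructor.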
There are several ways to create a dataframe from multiple lists.
list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]
pd.DataFrame({'list1': list1, 'list2': list2, 'list3': list3})
pd.DataFrame(data=zip(list1, list2, list3), columns=['list1', 'list2', 'list3'])
Just adding that, using the first approach, it can be done as:
pd.DataFrame(list(map(list, zip(lst1,lst2,lst3))))
Adding to the above answers: we can also create the dataframe on the fly.
df= pd.DataFrame()
list1 = list(range(10))
list2 = list(range(10,20))
df['list1'] = list1
df['list2'] = list2
print(df)
Hope it helps!
@oopsi used pd.concat() but didn't include the column names. You could do the following, which, unlike the first solution in the accepted answer, gives you control over the column order (it avoids dicts, which were unordered before Python 3.7):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
s1 = pd.Series(lst1, name='lst1Title')
s2 = pd.Series(lst2, name='lst2Title')
s3 = pd.Series(lst3, name='lst3Title')
percentile_list = pd.concat([s1,s2,s3], axis=1)
percentile_list
Out[2]:
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
...
You can simply use the following code:
train_data['labels'] = train_data[["LABEL1","LABEL2","LABEL3","LABEL4","LABEL5","LABEL6","LABEL7"]].values.tolist()
train_df = pd.DataFrame(train_data, columns=['text','labels'])
I just did it like this (Python 3.9):
import pandas as pd
my_dict=dict(x=x, y=y, z=z) # Set column ordering here
my_df=pd.DataFrame.from_dict(my_dict)
This seems to be reasonably straightforward (albeit in 2022), unless I am missing something obvious...
In Python 2 one could have used collections.OrderedDict().
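(This works because, since Python 3.7, regular dicts are guaranteed to preserve insertion order, so the column order follows the order of the keyword arguments.)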