A straightforword method to select columns by position - python

I am trying to clarify how can I manage pandas methods to call columns and rows in a Dataframe. An example will clarify my issue
dic = {'a': [1, 5, 2, 7], 'b': [6, 8, 4, 2], 'c': [5, 3, 2, 7]}
df = pd.DataFrame(dic, index = ['e', 'f', 'g', 'h'] )
than
df =
a b c
e 1 6 5
f 5 8 3
g 2 4 2
h 7 2 7
Now if I want to select column 'a' I just have to type
df['a']
while if I want to select row 'e' I have to use the ".loc" method
df.loc['e']
If I don't know the name of the row, but just it's position ( 0 in this case) than I can use the "iloc" method
df.iloc[0]
What looks like it is missing is a method for calling columns by position and not by name, something that is the "equivalent for columns of the 'iloc' method for rows". The only way I can find to do this is
df[df.keys()[0]]
is there something like
df.ilocColumn[0]
?

You can add : because first argument is position of selected indexes and second position of columns in function iloc:
And : means all indexes in DataFrame:
print (df.iloc[:,0])
e 1
f 5
g 2
h 7
Name: a, dtype: int64
If need select first index and first column value:
print (df.iloc[0,0])
1
Solution with ix work nice if need select index by name and column by position:
print (df.ix['e',0])
1

Related

Find value smaller but closest to current value

I have a very large pandas dataframe that contains two columns, column A and column B. For each value in column A, I would like to find the largest value in column B that is less than the corresponding value in column A. Note that each value in column B can be mapped to many values in column A.
Here's an example with a smaller dataset. Let's say I have the following dataframe:
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
I would like to find some third column -- say c -- so that
c = [nil, 2, 5, 1, 2, 2].
Note that each entry in c is strictly less than the corresponding value in c.
Upon researching, I think that I want something similar to pandas.merge_asof, but I can't quite get the query correct. In particular, I'm struggling because I only have one dataframe and not two. Perhaps I can form a second dataframe from my current one to get what I need, but I can't quite get it right. Any help is appreciated.
Yes, it is doable using pandas.merge_asof. Explanation as comments in the code -
import pandas as pd
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
# merge_asof requires the keys to be sorted
adf = df[['a']].sort_values(by='a')
bdf = df[['b']].sort_values(by='b')
# your example wants 'strictly less' so we also add 'allow_exact_matches=False'
cdf_ordered = pd.merge_asof(adf, bdf, left_on='a', right_on='b', allow_exact_matches=False, direction='backward')
# rename the dataframe |a|b| -> |a|c|
cdf_ordered = cdf_ordered.rename(columns={'b': 'c'})
# since c is based on sorted a, we merge with original dataframe column a
new_df = pd.merge(df, cdf_ordered, on='a')
print(new_df)
"""
a b c
0 1 5 NaN
1 5 2 2.0
2 7 7 5.0
3 2 5 1.0
4 3 1 2.0
5 4 9 2.0
"""

Set pandas dataframe cell to list [duplicate]

I have a list 'abc' and a dataframe 'df':
abc = ['foo', 'bar']
df =
A B
0 12 NaN
1 23 NaN
I want to insert the list into cell 1B, so I want this result:
A B
0 12 NaN
1 23 ['foo', 'bar']
Ho can I do that?
1) If I use this:
df.ix[1,'B'] = abc
I get the following error message:
ValueError: Must have equal len keys and value when setting with an iterable
because it tries to insert the list (that has two elements) into a row / column but not into a cell.
2) If I use this:
df.ix[1,'B'] = [abc]
then it inserts a list that has only one element that is the 'abc' list ( [['foo', 'bar']] ).
3) If I use this:
df.ix[1,'B'] = ', '.join(abc)
then it inserts a string: ( foo, bar ) but not a list.
4) If I use this:
df.ix[1,'B'] = [', '.join(abc)]
then it inserts a list but it has only one element ( ['foo, bar'] ) but not two as I want ( ['foo', 'bar'] ).
Thanks for help!
EDIT
My new dataframe and the old list:
abc = ['foo', 'bar']
df2 =
A B C
0 12 NaN 'bla'
1 23 NaN 'bla bla'
Another dataframe:
df3 =
A B C D
0 12 NaN 'bla' ['item1', 'item2']
1 23 NaN 'bla bla' [11, 12, 13]
I want insert the 'abc' list into df2.loc[1,'B'] and/or df3.loc[1,'B'].
If the dataframe has columns only with integer values and/or NaN values and/or list values then inserting a list into a cell works perfectly. If the dataframe has columns only with string values and/or NaN values and/or list values then inserting a list into a cell works perfectly. But if the dataframe has columns with integer and string values and other columns then the error message appears if I use this: df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.
Another dataframe:
df4 =
A B
0 'bla' NaN
1 'bla bla' NaN
These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.
Since set_value has been deprecated since version 0.21.0, you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df.at[1, 'B'] = ['m', 'n']
df =
A B
0 1 x
1 2 [m, n]
2 3 z
You also need to make sure the column you are inserting into has dtype=object. For example
>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A int64
B int64
dtype: object
>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence
>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
A B
0 1 1
1 2 [1, 2, 3]
2 3 3
Pandas >= 0.21
set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.
Setting Cell Values with at/iat
# Setup
>>> df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
>>> df
A B
0 12 [a, b]
1 23 [c, d]
>>> df.dtypes
A int64
B object
dtype: object
If you want to set a value in second row of the "B" column to some new list, use DataFrame.at:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
You can also set by integer position using DataFrame.iat
>>> df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
What if I get ValueError: setting an array element with a sequence?
I'll try to reproduce this with:
>>> df
A B
0 12 NaN
1 23 NaN
>>> df.dtypes
A int64
B float64
dtype: object
>>> df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.
This is because of a your object is of float64 dtype, whereas lists are objects, so there's a mismatch there. What you would have to do in this situation is to convert the column to object first.
>>> df['B'] = df['B'].astype(object)
>>> df.dtypes
A int64
B object
dtype: object
Then, it works:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 NaN
1 23 [m, n]
Possible, But Hacky
Even more wacky, I've found that you can hack through DataFrame.loc to achieve something similar if you pass nested lists.
>>> df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
>>> df
A B
0 12 [a, b]
1 23 [m, n, o, p]
You can read more about why this works here.
df3.set_value(1, 'B', abc) works for any dataframe. Take care of the data type of column 'B'. For example, a list can not be inserted into a float column, at that case df['B'] = df['B'].astype(object) can help.
Quick work around
Simply enclose the list within a new list, as done for col2 in the data frame below. The reason it works is that python takes the outer list (of lists) and converts it into a column as if it were containing normal scalar items, which is lists in our case and not normal scalars.
mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data
col1 col2
0 1 [1, 4]
1 2 [2, 5]
2 3 [3, 6]
Also getting
ValueError: Must have equal len keys and value when setting with an iterable,
using .at rather than .loc did not make any difference in my case, but enforcing the datatype of the dataframe column did the trick:
df['B'] = df['B'].astype(object)
Then I could set lists, numpy array and all sorts of things as single cell values in my dataframes.
As mentionned in this post pandas: how to store a list in a dataframe?; the dtypes in the dataframe may influence the results, as well as calling a dataframe or not to be assigned to.
I've got a solution that's pretty simple to implement.
Make a temporary class just to wrap the list object and later call the value from the class.
Here's a practical example:
Let's say you want to insert list object into the dataframe.
df = pd.DataFrame([
{'a': 1},
{'a': 2},
{'a': 3},
])
df.loc[:, 'b'] = [
[1,2,4,2,],
[1,2,],
[4,5,6]
] # This works. Because the list has the same length as the rows of the dataframe
df.loc[:, 'c'] = [1,2,4,5,3] # This does not work.
>>> ValueError: Must have equal len keys and value when setting with an iterable
## To force pandas to have list as value in each cell, wrap the list with a temporary class.
class Fake(object):
def __init__(self, li_obj):
self.obj = li_obj
df.loc[:, 'c'] = Fake([1,2,5,3,5,7,]) # This works.
df.c = df.c.apply(lambda x: x.obj) # Now extract the value from the class. This works.
Creating a fake class to do this might look like a hassle but it can have some practical applications. For an example you can use this with apply when the return value is list.
Pandas would normally refuse to insert list into a cell but if you use this method, you can force the insert.
I prefer .at and .loc. It is important to note, that the target column needs a dtype (object), which can handle the list.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A': [0, 1, 2, 3],
'B': np.array([np.nan]*3 + [[3, 33]], dtype=object),
})
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
df.at[0, 'B'] = [0, 100] # at assigns single elemnt
df.loc[1, 'B'] = [[ [1, 11] ]] # loc expects 2d input
print('df modified:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
A B
0 0 NaN
1 1 NaN
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
df modified:
A B
0 0 [0, 100]
1 1 [[1, 11]]
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
first set the cell to blank. next use at to assign the abc list to the cell at 1, 'B'
abc = ['foo', 'bar']
df =pd.DataFrame({'A':[12,23],'B':[np.nan,np.nan]})
df.loc[1,'B']=''
df.at[1,'B']=abc
print(df)

Manipulating DataFrames in a function

I need to make a function that can act on any dataframe and perform an action on it.
To clarify, for example let's say I have this sample dataframe here:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
Which looks like this.
a b c
0 1 2 3
1 4 5 6
2 7 8 9
I have created a function that does something of this sort:
def ColDrop(df, collist):
> df=df.drop(columns = collist)
> return df
(assume > as indent)
What I'd like is for it to accept a list as the 'collist' variable and drop all of those from the dataframe stated as 'df', so...
col = ['a', 'b']
ColDrop(df, col)
Would look like...
c
0 3
1 6
2 9
However, it doesn't seem to work. Similarly I want to remove values from any dataframe based on its row, for example...
def rowvaluedrop(df, column, pattern):
> filter = df[column].str.contains(pattern)
> df = df[~filter]
> return df
rowvaluedrop(df, a, 4)
Would look like...
a b c
0 1 2 3
2 7 8 9
(i realise this second example may not work since the values are integers rather than strings, but i hope that my point gets across regardless.)
Thanks in advance.
You need to store the returning dataframe back to df implicitly
df = rowvaluedrop(df, a, 4)

how to add columns label on a Pandas DataFrame

I can't understand how can I add column names on a pandas dataframe, an easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
now if I type df than I get
a b c
0 4 4 5
1 1 2 7
2 3 1 9
3 1 4 1
say now that I generate another dataframe just by summing up the columns on the previous one
a = df.sum()
if I type 'a' than I get
a 9
b 11
c 22
That looks like a dataframe without with index and without names on the only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy because he didn't provide me any message of errors. But still if I type 'a' I can't see the columns name anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. If you want to create a DataFrame out of your sum you can change a = df.sum() by:
a = pandas.DataFrame(df.sum(), columns = ['whatever_name_you_want'])

Merging and sum up several value-counts series in Pandas

I usually use value_counts() to get the number of occurrences of a value. However, I deal now with large database-tables (cannot load it fully into RAM) and query the data in fractions of 1 month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I just have found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print pd.concat([df1,df2]).groupby(['userId'], sort=False).sum()
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two Dataframes have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the the two sets of values_counts(). The fill_value argument will handle any NaN values that would arise, in this example, the 'd' that appears in df1, but not df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is know as "Split-Apply-Combine". It is done in 1 line and 3-4 clicks, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace 3x label with the name of the column whose values you are counting (case-sensitive)
3️⃣ print df.head() to check it's worked correctly

Categories

Resources