What is the difference between dataframe.series and dataframe['series']?

I just tried sorting my dataframe and used the following function:
df[df.count >= df.count.quantile(.95)]
It returned the error:
AttributeError: 'function' object has no attribute 'quantile'
But bracketing the series works fine:
df[df['count'] >= df['count'].quantile(.95)]
This isn't the first time I've gotten different results based on this distinction, but usually the two behave the same, and I always thought they were identical ways of referring to the same object.
Why does this happen?

Because count is one of the DataFrame's built-in methods, df.count resolves to the method rather than to the column named count; that is, dot access gives built-in attributes and methods priority over columns:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [2, 3, 4],
    'count': [4, 5, 6]
})
df.count()
#A 3
#B 3
#count 3
#dtype: int64
df.count
#<bound method DataFrame.count of A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6>
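Bracket indexing, or the equivalent df.loc[:, 'count'], always resolves to the column, which is why the original filter works in that form:
df['count'].quantile(.95)   # the column's quantile, not the method
df.loc[:, 'count']          # another spelling that always means the column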
Another distinction between dot and bracket access is that you cannot use dot notation to create a new column: if the column doesn't exist, df.column = ... won't work, and you have to use brackets instead, as in df['column'] = .... Using the dummy DataFrame above:
# original data frame
df
# A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6
Using dot notation to create a new column doesn't work; C is set as an attribute on the object instead of as a column:
df.C = 2
df
# A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6
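You can confirm that C ended up as a plain attribute on the object rather than as a column:
df.C
# 2
'C' in df.columns
# False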
Bracket notation, by contrast, is the standard way to add a new column to the DataFrame:
df['C'] = 2
df
# A B count C
#0 1 2 4 2
#1 2 3 5 2
#2 3 4 6 2
If a column already exists, it's valid to modify it with dot notation, provided the DataFrame doesn't have a built-in attribute or method of the same name (as is the case with count above):
df.B = 3
df
# A B count C
#0 1 3 4 2
#1 2 3 5 2
#2 3 3 6 2

Related

How to delete different column names with duplicated values?

Given this DF:
a b c d
1 2 1 4
4 3 4 2
foo bar foo yes
What is the best way to delete columns with identical values but different names in a large pandas DataFrame? For example:
a b d
1 2 4
4 3 2
foo bar yes
Column c was removed from the above dataframe because a and c were the same column under different names. So far I have tried
df = df.iloc[:, ~df.columns.duplicated()]
However, it's not clear to me how to compare the actual row values inside the DataFrame.
Use transpose, as below:
df.T.drop_duplicates().T
I tried a straightforward approach: loop through the column names and compare each column with the rest, using np.all for an exact match. This approach took only 336 ms.
import numpy as np

repeated_columns = []
for i, column in enumerate(df.columns):
    r_columns = df.columns[i+1:]
    for r_c in r_columns:
        if np.all(df[column] == df[r_c]):
            repeated_columns.append(r_c)
new_columns = [x for x in df.columns if x not in repeated_columns]
df[new_columns]
It will give you following output
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
Mark duplicated rows of the transposed frame and keep the rest:
df.loc[:,~df.T.duplicated()]
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
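For a 20-million-row frame, transposing just to find duplicates can be expensive. Below is a minimal sketch that compares columns without a transpose; drop_duplicate_columns is a hypothetical helper, not a pandas API, and it assumes hashable values and no NaNs:
import pandas as pd

def drop_duplicate_columns(df):
    # Keep the first column seen for each distinct tuple of values
    seen = set()
    keep = []
    for col in df.columns:
        key = tuple(df[col])  # column contents as a hashable fingerprint
        if key not in seen:
            seen.add(key)
            keep.append(col)
    return df[keep]

df = pd.DataFrame({'a': [1, 4, 'foo'], 'b': [2, 3, 'bar'],
                   'c': [1, 4, 'foo'], 'd': [4, 2, 'yes']})
print(drop_duplicate_columns(df))
#      a    b    d
# 0    1    2    4
# 1    4    3    2
# 2  foo  bar  yes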

How to find the values of a column such that no value in another column is greater than 3

I want to find the values of one column such that no value in another column is greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of column 'a' for which all the values of 'c' are greater than 3.
I think groupby is the correct way to do it. My below code comes closer to it.
df.groupby('a')['c'].max()>3
a
1     True
2    False
3     True
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1,3]
Is there a better, more efficient way to get this on a very large DataFrame (more than 30 million rows)?
From your code I see that you actually want to output the group keys (df grouped by a) for which no value in the c column, within the current group, is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
a b c
0 1 4 4
1 2 5 1
2 3 6 5
3 1 4 4
4 2 5 2
5 3 6 5
6 1 4 4
7 2 5 2
8 3 6 3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
a b c
0 1 4 4
3 1 4 4
6 1 4 4
Group: 2
a b c
1 2 5 1
4 2 5 2
7 2 5 2
Group: 3
a b c
2 3 6 5
5 3 6 5
8 3 6 3
Now take a look at each group. Only group 2 meets the condition that every element in the c column is less than 3. So what you actually need is a groupby followed by a filter, passing through only the groups that meet the above condition.
To get full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
a b c
1 2 5 1
4 2 5 2
7 2 5 2
But you want only the values of the a column, without repetitions, so extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code, df.groupby('a')['c'].max() > 3, is wrong, as it marks with True the groups whose max is greater than 3 (instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
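On 30+ million rows, a vectorized alternative is to broadcast each group's max back onto the rows with transform and then take the unique keys in one pass (a sketch, keeping the same strict "<" as above):
mask = df.groupby('a')['c'].transform('max') < 3
df.loc[mask, 'a'].unique().tolist()
# [2]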
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds particular group key to the result.
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})

# Function that returns the first value greater than 3, if any
def grt(x):
    for i in x:
        if i > 3:
            return i

# Group by column a and aggregate column c with grt
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)

Returning dataframe of multiple rows/columns per one row of input

I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name']==x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers, like the below. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done with apply alone; you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
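As an aside, without the .apply() constraint the same result falls out of the "in" logic the question mentions, e.g. a plain isin filter:
df2[df2['Name'].isin(df1['Name'])]
#   Name  Value
# 0    A      1
# 1    A      2
# 2    A      3
# 3    B      4
# 4    B      5
# 5    B      6
# 6    B      7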

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows so that any Type whose groupby(['Type']) count is less than 3 is padded up to 3? df is the input and df1 is my desired outcome; note that row 3 of df is duplicated twice at the end. This is only an example: the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I don't know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: the sort=False argument to append is available in pandas >= 0.23.0; remove it if you're on a lower version.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index, as in:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
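Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea has to be written with pd.concat. A sketch of the equivalent for the single-Val case:
counts = df['Type'].value_counts()
repeat_map = 3 - counts[counts < 3]
n = df['Type'].map(repeat_map).fillna(0).astype(int)
extra = df.loc[df.index.repeat(n)]           # the rows to duplicate
df = pd.concat([df, extra], ignore_index=True)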

How do you concatenate two single rows in pandas?

I am trying to select single rows from a bunch of dataframes and build a new dataframe by concatenating them together.
Here is a simple example
x=pd.DataFrame([[1,2,3],[1,2,3]],columns=["A","B","C"])
A B C
0 1 2 3
1 1 2 3
a=x.loc[0,:]
A 1
B 2
C 3
Name: 0, dtype: int64
b=x.loc[1,:]
A 1
B 2
C 3
Name: 1, dtype: int64
c=pd.concat([a,b])
I end up with this:
A 1
B 2
C 3
A 1
B 2
C 3
Name: 0, dtype: int64
Whereas I would expect the original dataframe:
A B C
0 1 2 3
1 1 2 3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to stack two Series vertically as rows, one option is a concat (along the columns) followed by a transpose.
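For example:
pd.concat([a, b], axis=1).T
#    A  B  C
# 0  1  2  3
# 1  1  2  3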
Another option is np.vstack:
import numpy as np
pd.DataFrame(np.vstack([a, b]), columns=a.index)
A B C
0 1 2 3
1 1 2 3
Since you are slicing by index, I'd use .iloc, and note the difference between [[]] and [], which return a DataFrame and a Series, respectively.*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
# A B C
#0 1 2 3
#1 1 2 3
To still use .loc, you'd do something like
a = x.loc[[0,]]
b = x.loc[[1,]]
*There's a small caveat that if index 0 is duplicated in x then x.loc[0,:] will return a DataFrame and not a Series.
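To see that caveat (a small example with a duplicated index):
y = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'], index=[0, 0])
y.loc[0, :]   # a 2-row DataFrame, not a Series
#    A  B  C
# 0  1  2  3
# 0  4  5  6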
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1,2,3],[1,2,3]], columns=["A","B","C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
# A B C
# 0 1 2 3
# 1 1 2 3
