I just wanted to ask the community and see if there is a more efficient to do this.
I have several rows in a data frame and I am using .loc to filter values in row A for I can perform calculations on row B.
I can easily do something like...
filter_1 = df.loc['Condition'] = 1
And then perform the mathematical calculation on row B that I need.
But there are many conditions I must go through so I was wondering if I could possibly make a list of the conditions and then iterate them through the .loc function in less lines of code?
Would something like this work where I create a list, then iterate the conditions through a loop?
Thank you!
This example gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I manipulate the iteration for it shows the results for the unique values in row 'a'?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a,b):
list_1.append([i,j])
df1 = pd.DataFrame(list_1, columns= col)
for i in a:
aa = df1[df1['a'].isin([i])]
aa1 = aa['b'].mean()
print (aa1)
Solution using set
set_a = set(a)
for i in set_a:
aa = df[df['a'].isin([i])]
aa1 = aa['b'].mean()
print (aa1)
Solution using pandas mean function
Is this what you are looking for?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
df = pd.DataFrame({'a':a,'b':b})
print (df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
a b
0 1 5
1 2 1
2 1 3
3 2 5
4 1 7
5 2 20
6 1 9
7 2 5
8 1 8
9 2 4
The mean value of df['a'] is:
b
a
1 6.4
2 7.0
Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
Related
Imagine that I have a Dataframe and the columns are [A,B,C]. There are some different values for each of these columns. And I want to produce one more column D which can be received with the following function:
def produce_column(i):
# Extract current row by index
raw = df.loc[i]
# Extract previous 3 values for the same sub-df which are before i
df_same = df[
(df['A'] == raw.A)
& (df['B'] == raw.B)
].loc[:i].tail(3)
# Check that we have enough values
if df_same.shape[0] != 3:
return False
# Doesn't matter which function is in use, I just need to apply it on the column / columns
diffs = df_same['C'].map(lambda x: x <= 10 and x > 0)
return all(diffs)
df['D'] = df.index.map(lambda x: produce_column(x))
So on each step, I need to get the Dataframe, which have the same set of properties as a row and perform some operations on columns of this Dataframe. I have a few hundred thousands of rows, so this code takes a lot of time to be executed. I think that a good idea is to vectorize the operation, but I don't know how to do that. Maybe there's another way to perform this?
Thanks in advance!
UPD Here's an example
df = pd.DataFrame([(1,2,3), (4,5,6), (7,8,9)], columns=['A','B','C'])
A B C
0 1 2 3
1 4 5 6
2 7 8 9
df['D'] = df.index.map(lambda x: produce_column(x))
A B C D
0 1 2 3 True
1 4 5 6 True
2 7 8 9 False
This question already has answers here:
How to apply a function to two columns of Pandas dataframe
(15 answers)
Closed 4 years ago.
I need to make a column in my pandas dataframe that relies on other items in that same row. For example, here's my dataframe.
df = pd.DataFrame(
[['a',],['a',1],['a',1],['a',2],['b',2],['b',2],['c',3]],
columns=['letter','number']
)
letters numbers
0 a 1
1 a 1
2 a 1
3 a 2
4 b 2
5 b 2
6 c 3
I need a third column, that is 1 if 'a' and 2 are present in the row, and 0 otherwise. So it would be [`0,0,0,1,0,0,0]`
How can I use Pandas `apply` or `map` to do this? Iterating over the rows is my first thought, but this seems like a clumsy way of doing it.
You can use apply with axis=1. Suppose you wanted to call your new column c:
df['c'] = df.apply(
lambda row: (row['letter'] == 'a') and (row['number'] == 2),
axis=1
).astype(int)
print(df)
# letter number c
#0 a NaN 0
#1 a 1.0 0
#2 a 1.0 0
#3 a 2.0 1
#4 b 2.0 0
#5 b 2.0 0
#6 c 3.0 0
But apply is slow and should be avoided if possible. In this case, it would be much better to boolean logic operations, which are vectorized.
df['c'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)
This has the same result as using apply above.
You can try to use pd.Series.where()/np.where(). If you only are interested in the int represantation of the boolean values, you can pick the other solution. If you want more freedom for the if/else value you can use np.where()
import pandas as pd
import numpy as np
# create example
values = ['a', 'b', 'c']
df = pd.DataFrame()
df['letter'] = np.random.choice(values, size=10)
df['number'] = np.random.randint(1,3, size=10)
# condition
df['result'] = np.where((df['letter'] == 'a') & (df['number'] == 2), 1, 0)
Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), 1, inplace=True)
Out[48]:
a b
0 1 2
1 4 3
2 8 9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
a b
0 1 2
1 4 3
2 8 9
PS please be aware that the most idiomatic Pandas way to do that was already proposed by #Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Where the first positional argument is items=[]
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
there are multiple solution .
df = df[['a','b']] #1
df = df[list('ab')] #2
df = df.loc[:,df.columns.isin(['a','b'])] #3
df = pd.DataFrame(data=df.eval('a,b').T,columns=['a','b']) #4 PS:I do not recommend this method , but still a way to achieve this
Hey what you are looking for is:
df = df[["a","b"]]
You will recive a dataframe which only contains the columns a and b
If you only want to keep more columns than you're dropping put a "~" before the .isin statement to select every column except the ones you want:
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to drop, let's say 20 or 30, you can use lists as well. Make sure that you also specify the axis value.
drop_list = ["a","b"]
df = df.drop(df.columns.difference(drop_list), axis=1)
I have a dataframe as show below
>> df
A 1
B 2
A 5
B 6
A 7
B 8
How do I reformat it to make it
A 1 5 7
B 2 6 8
Thanks
Given a data frame like this
df = pd.DataFrame(dict(one=list('ABABAB'), two=range(6)))
you can do
df.groupby('one').two.apply(lambda s: s.reset_index(drop=True)).unstack()
# 0 1 2
# one
# A 0 2 4
# B 1 3 5
or (slightly slower, and giving a slightly different result)
df.groupby('one').apply(lambda d: d.two.reset_index(drop=True))
# two 0 1 2
# one
# A 0 2 4
# B 1 3 5
The first approach works with a DataFrameGroupBy, the second uses a SeriesGroupBy.
You can grab the series and use np.reshape to keep the correct dimensions.
The order = 'F' makes it scroll through columns (such as Fortran), order = 'C' scrolls through rows like C
Then it gets into a dataframe
df = pd.DataFrame(data=np.arange(10))
data = df['a'].values.reshape((2, 5), order='F')
df = pd.DataFrame(data=data, index=['a', 'b'])
how did you generate this data frame. I think it should have been generated using dictionary and then generate dataframe using that dict.
d = {'A': [1,5,7], 'B':[2,6,8]}
df = pandas.DataFrame(data=d, index=['p1','p2','p3'])
and then you can use df.T to transpose your dataframe if you need to.
I attempting to add a Series to an empty DataFrame and can not find an answer
either in the Doc's or other questions. Since you can append two DataFrames by row
or by column it would seem there must be an "axis marker" missing from a Series. Can
anyone explain why this does not work?.
import Pandas as pd
df1 = pd.DataFrame()
s1 = pd.Series(['a',5,6])
df1 = pd.concat([df1,s1],axis = 1)
#go run some process return s2, s3, sn ...
s2 = pd.Series(['b',8,9])
df1 = pd.concat([df1,s2],axis = 1)
s3 = pd.Series(['c',10,11])
df1 = pd.concat([df1,s3],axis = 1)
If my example above is some how misleading perhaps using the example from the docs will help.
Quoting: Appending rows to a DataFrame.
While not especially efficient (since a new object must be created), you can append a
single row to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above. End Quote.
The example from the docs appends "S", which is a row from a DataFrame, "S1" is a Series
and attempting to append "S1" produces an error. My question is WHY will appending "S1 not work? The assumption behind the question is that a DataFrame must code or contain axes information for two axes, where a Series must contain only information for one axes.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.xs(3); #third row of DataFrame
s1 = pd.Series([np.random.randn(4)]); #new Series of equal len
df= df.append(s, ignore_index=True)
Result
0 1
0 a b
1 5 8
2 6 9
Desired
0 1 2
0 a 5 6
1 b 8 9
You were close, just transposed the result from concat
In [14]: s1
Out[14]:
0 a
1 5
2 6
dtype: object
In [15]: s2
Out[15]:
0 b
1 8
2 9
dtype: object
In [16]: pd.concat([s1, s2], axis=1).T
Out[16]:
0 1 2
0 a 5 6
1 b 8 9
[2 rows x 3 columns]
You also don't need to create the empty DataFrame.
The best way is to use DataFrame to construct a DF from a sequence of Series, rather than using concat:
import pandas as pd
s1 = pd.Series(['a',5,6])
s2 = pd.Series(['b',8,9])
pd.DataFrame([s1, s2])
Output:
In [4]: pd.DataFrame([s1, s2])
Out[4]:
0 1 2
0 a 5 6
1 b 8 9
A method of accomplishing the same objective as appending a Series to a DataFrame
is to just convert the data to an array of lists and append the array(s) to the DataFrame.
data as an array of lists
def get_example(idx):
list1 = (idx+1,idx+2 ,chr(idx + 97))
data = [list1]
return(data)
df1 = pd.DataFrame()
for idx in range(4):
data = get_example(idx)
df1= df1.append(data, ignore_index = True)