I want something like this:
df.groupby("A")["B"].diff()
But instead of diff(), I want to be able to compute whether two consecutive rows are different or identical, and return 1 if the current row differs from the previous one and 0 if it is identical.
Moreover, I really would like to use a custom function instead of diff(), so that I can do general pairwise row operations.
I tried using .rolling(2) and .apply() in different places, but I just cannot get it to work.
Edit:
Each row in the dataset is a packet.
The first row in the dataset is the first recorded packet, and the last row is the last recorded packet, i.e., they are ordered by time.
One of the features (columns) is called "ID", and several packets can share the same ID.
Another column is called "data"; its values are 64-bit binary strings, i.e., 001011010011001.....10010 (length 64).
I want to create two new features (columns):
Compare the "data" field of the current packet with the "data" field of the previous packet with the same ID, and compute:
Whether they are different (1 or 0)
How different (a figure between 0 and 1)
Hi, I think it is best if you forgo the groupby and use shift instead:
equal_index = (df == df.shift(1))[X].all(axis=1)
where X is a list of the columns you want to be identical. Then you can create your own grouper with
my_grouper = (~equal_index).cumsum()
and use it together with agg to aggregate with whatever function f you wish:
df.groupby(my_grouper).agg({'B':f})
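A minimal runnable sketch of this idea, with a made-up toy frame (columns "A" and "B" as in the question) and "count" standing in for the custom function f:

import pandas as pd

# Toy data: runs of identical rows should end up in the same group.
df = pd.DataFrame({"A": ["x", "x", "x", "y", "y"],
                   "B": [1, 1, 2, 3, 3]})

X = ["A", "B"]                                   # columns that must be identical
equal_index = (df == df.shift(1))[X].all(axis=1)
my_grouper = (~equal_index).cumsum()             # new group whenever a row differs

f = "count"                                      # any aggregation function works here
print(df.groupby(my_grouper).agg({"B": f}))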
Use DataFrameGroupBy.shift and compare for inequality with Series.ne:
df["dc"] = df.groupby("ID")["data"].shift().ne(df['data']).astype(int)
EDIT: for correlation between 2 Series use:
df["dc"] = df['data'].corr(df.groupby("ID")["data"].shift())
OK, I solved it myself with:
import pandas as pd

def create_dc(df: pd.DataFrame):
    # 1 where "data" differs from the previous packet with the same ID, else 0
    dc = df.groupby("ID")["data"].apply(lambda x: x != x.shift(1)).astype(int)
    dc.fillna(1, inplace=True)
    df["dc"] = dc
This does what I want.
Thank you @Arnau for inspiring me to use .shift()!
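For the second new feature ("how different"), one option is the fraction of differing bit positions between the current and previous "data" string with the same ID. A rough sketch, assuming df is the packet DataFrame with "ID" and "data" columns; the helper bit_diff and the column name "dc_frac" are made up here:

def bit_diff(curr, prev):
    # Fraction of positions at which the two 64-bit strings differ (0.0 to 1.0).
    if not isinstance(prev, str):   # no previous packet with the same ID
        return 1.0
    return sum(a != b for a, b in zip(curr, prev)) / len(curr)

prev_data = df.groupby("ID")["data"].shift()
df["dc_frac"] = [bit_diff(c, p) for c, p in zip(df["data"], prev_data)]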
Related
I want to find the min value of every row of a dataframe, restricted to only a few columns.
For example: consider a dataframe of size 10*100. I want the min over the middle 5 columns, so the restricted frame is of size 10*5.
I know how to find the min using df.min(axis=0), but I don't know how to restrict the number of columns. Thanks for the help.
I use the pandas library.
You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
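Put together, a minimal sketch, assuming df is the 10*100 frame from the question so n_columns can be read from df.shape[1]:

# Number of columns, then the integer bounds of the middle 5.
n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)
end = start + 5

# Minimum of the selected slice (axis=0 gives one value per selected column).
print(df.iloc[:, start:end].min(axis=0))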
Following pciunkiewicz's logic:
First you should select the columns that you want. You can use the indexers .loc[..] or .iloc[..].
With the first one you can use the names of the columns. When it takes 2 arguments, the first one is the row index and the second is the columns.
df.loc[[rows], [columns]] # The filter values should be inside the brackets.
df.loc[:, [columns]] # This will consider all rows.
You can also use .iloc. In this case, you have to use integers to locate the data, so you don't need to know the names of the columns, only their positions.
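For example, restricting the minimum to a few columns (the column names below are made up for illustration; axis=0 mirrors the df.min(axis=0) from the question):

# By name: select a few columns with .loc, then take the minimum of each.
df.loc[:, ["col_b", "col_c", "col_d"]].min(axis=0)

# By position: the same idea with .iloc and integer locations.
df.iloc[:, 1:4].min(axis=0)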
I have a six-column matrix. I want to find the row(s) where BOTH columns of interest match the query.
I've been trying to use numpy.where, but I can't specify it to match just two columns.
import numpy as np

#Example of the array
x = np.array([[860259, 860328, 861277, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 871151, 871173]])
print(x)
#Match first column of interest
A = np.where(x[:,2] == 861301)
#Match second column on interest
B = np.where(x[:,3] == 861393)
#rows in both A and B
np.intersect1d(A, B)
#This approach works, but is not column specific for the intersect, leaving me with extra rows I don't want.
#This is the only way I can get Numpy to match the two columns, but
#when I query I will not actually know the values of columns 0,1,4,5.
#So this approach will not work.
#Specify what row should exactly look like
np.where(all([860259, 860328, 861277, 861393, 865534, 865716]))
#I want something like this:
#Where * could be any number. But I think that this approach may be
#inefficient. It would be best to just match column 2 and 3.
np.where(all([*, *, 861277, 861393, *, *]))
I'm looking for an efficient answer, because I am looking through a 150GB HDF5 file.
Thanks for your help!
If I understand you correctly,
you can use a little more advanced slicing, like this:
np.where(np.all(x[:,2:4] == [861277, 861393], axis=1))
This will give you only the rows where these 2 columns are equal to [861277, 861393].
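A usage sketch on the array x from the question (the 861301/861393 values are taken from the question's own query):

# Boolean mask: True where column 2 equals 861301 AND column 3 equals 861393.
mask = np.all(x[:, 2:4] == [861301, 861393], axis=1)

print(np.where(mask)[0])   # row indices of the matches -> [1 2]
print(x[mask])             # the matching rows themselves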
I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now, I'm trying to do a conditional mean, i.e. the mean only for a select group within the data set. (I want the ideologies broken down by whom people voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion that a conditional mean just isn't a thing (although hopefully I'm wrong?), so I started writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count

print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.
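A minimal sketch of what this returns, using a made-up data2 with just the two columns from the question:

import pandas as pd

data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Clinton"],
    "socialIdeology2": [2.0, 4.0, 3.0, 5.0, 4.0],
})

# One mean per vote choice, returned as a Series indexed by voteChoice.
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
print(means)
# voteChoice
# Clinton    3.0
# Trump      4.5
# Name: socialIdeology2, dtype: float64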
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
See the pandas documentation for more info on indexing DataFrames using .loc, and its groupby user guide for more info on groupby.
Background
I deal with a csv datasheet that prints out columns of numbers. I am working on a program that will take the first column, ask the user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtract that number from the first column. I have been successful in that regard. Now, I need to find the row index of the "zero" time point. I use min to find that index and then read the corresponding value off the following column, A1.1. I need to find the reading at time 0 so I can normalize A1.1 to it, so that on a graph the reading at the 0 time point is 1 in column A1.1 (and eventually in all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1'] = df['A1'] - time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line will correctly identify a series that I can pull off of for all my other columns. Next r1 correctly identifies the proper A1.1 value and this value is a float when I use type(r1).
However when I divide df[' A1.1']/r1 it yields only one correct value and that value is where r1/r1 = 1. All other values come out NaN.
My Questions:
How do I divide a column by a float? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. 'A2/r2', 'A3/r3', etc.)?
Do I need to use inplace=True anywhere to make the operations stick prior to resaving the data, or is that only for adding/deleting rows?
Example
A dataframe that looks like this (screenshot): http://i.imgur.com/ObUzY7p.png
Zero time sets properly (image not shown).
After dividing the column (screenshot): http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1'] = df['A1.1'] / df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work was because r1 is a series. Try r1? instead of type(r1) and pandas will tell you that r1 is a series, not an individual float number.
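If you want to keep the r1 approach from your own code, one sketch (not part of the original answer) is to pull the scalar out of that one-row Series with .iloc[0], so the division broadcasts as expected:

# zero_location_series holds exactly one row, so extract the scalar value.
r1 = zero_location_series[' A1.1'].iloc[0]
df[' A1.1'] = df[' A1.1'] / r1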
To do it for every column in one go, you can iterate over the columns, like this:
for c in df:
    df[c] = df[c] / df[c].min()
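If, as in your original code, each reading column should instead be divided by its value at the zero-time row (rather than by its own minimum), here is a sketch along those lines; the reading_cols names are assumptions, swap in your 16 actual column names:

# Index of the zero-time row, found from the adjusted time column as in the question.
zero_idx = df['A1'].idxmin()

reading_cols = [' A1.1', ' A2.1']   # assumed names; list your 16 reading columns here
for c in reading_cols:
    df[c] = df[c] / df.loc[zero_idx, c]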
If you want to divide every value in the column by r1, apply is a good fit. For example:
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5])

# apply an anonymous function to the first column ([0]): divide every
# value in the column by 3
df = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df["A1.1"] = df["A1.1"].apply(lambda x: x / r1)
This really only answers part 2 of your question. apply is probably your best bet for running a function on multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float: is it possible the values in your columns are anything other than floats or integers?
TL;DR - I want to mimic the behaviour of functions such as DataFrameGroupBy.std()
I have a DataFrame which I group.
I want to take one row to represent each group, and then add extra statistics regarding these groups to the resulting DataFrame (such as the mean and std of these groups)
Here's an example of what I mean:
df = pandas.DataFrame({"Amount": [numpy.nan,0,numpy.nan,0,0,100,200,50,0,numpy.nan,numpy.nan,100,200,100,0],
"Id": [0,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
"Date": pandas.to_datetime(["2011-11-02","NA","2011-11-03","2011-11-04",
"2011-11-05","NA","2011-11-04","2011-11-04",
"2011-11-06","2011-11-06","2011-11-06","2011-11-06",
"2011-11-08","2011-11-08","2011-11-08"],errors='coerce')})
g = df.groupby("Id")
f = g.first()
f["std"] = g.Amount.std()
Now, this works - but let's say I want a special std, which ignores 0, and regards each unique value only once:
def get_unique_std(group):
    vals = group.unique()
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0
If I use
f["std"] = g.Amount.transform(get_unique_std)
I only get zeros... (Also for any other function such as max etc.)
But if I do it like this:
std = g.Amount.transform(get_unique_std)
I get the correct result, only not grouped anymore... I guess I could calculate all of these as columns of the original DataFrame (in this case df) before I take the representative row of each group:
df["std"] = g.Amount.transform(get_unique_std)
# regroup again the modified df
g = df.groupby("Id")
f = g.first()
But that would just be a waste of memory, since many rows belonging to the same group would get the same value, and I'd also have to group df twice - once for calculating these statistics, and a second time to get the representative row...
So, as stated in the beginning, I wonder how I can mimic the behaviour of DataFrameGroupBy.std().
I think you may be looking for DataFrameGroupBy.agg()
You can pass your custom function like this and get a grouped result:
g.Amount.agg(get_unique_std)
You can also pass named aggregations and get each name as a column:
g.Amount.agg(my_std=get_unique_std, numpy_std=numpy.std)
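To attach this to the one-row-per-group frame from your own example, a short sketch (reusing g, f and get_unique_std from above):

f = g.first()
f["my_std"] = g.Amount.agg(get_unique_std)   # aligned with f on the group key "Id"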