Related
So basically I am stuck on a very simple thing. For some reason when I execute this code:
import pandas as pd
x = pd.read_csv('titanic.csv')
v = x.dropna(axis=0,how="any")
z = v[["Survived"]]
y = z.where(z == 1)
print (y)
It still prints values with NaN, even though I have already done dropna on the whole file and it works. I just want to print rows with value 1. I have tried many variations and I cant seem to fix it. Any ideas?
Output
Part of the file I am interested in
try:
y = z.where(z == 1).dropna(subset=['Survived'])
SAMPLE DATA:
PassengerId Survived pClass
1 1 3
2 1 4
3 0 2
4 1 9
5 0 6
6 0 0
import pandas as pd
import numpy as np
columns = ['PassengerId','Survived', 'pClass']
PassengerIdList = [1,2,3,4,5,6]
SurvivedList = [1,1,0,1,0,0]
pClassList = [3,4,2,9,6,0]
newList = list(zip(PassengerIdList,SurvivedList,pClassList))
data = np.array(newList)
# print(data)
df = pd.DataFrame(data, columns=columns)
filtered_df = df.loc[df['Survived'] == 1]
print(filtered_df)
OUTPUT:
PassengerId Survived pClass
1 1 3
2 1 4
4 1 9
pyFiddle
Am guessing you have empty rows in your datasets, try using the:
x.fillna(-99999, inplace=True)
that should solve the problem or better still, post what your output looks like and we can know what to do.
You can do this also
y = z.loc[z['Survived'] == 1]
You could use loc, and just locate every row which fulfills your criteria.
survivors = df.loc[df['Survived'] == 1]
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
You can do this without using a loop by using ge which means greater than or equal to and cast the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
it wouldn't work, firstly you're using Data not data, even with that fixed you'd be comparing a scalar against an array so this would raise a warning as it's ambiguous to compare a scalar with an array, thirdly you're assigning the entire column so overwriting the column.
You need to access the index label which your loop didn't you can use iteritems to do this:
In [125]:
for idx, x in df["X"].iteritems():
if x >= df['Y'].loc[idx]:
df.loc[idx, 'Z'] = 1
else:
df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary as there is a vectorised method here
Firstly, your code is just fine. You simply capitalized your dataframe name as 'Data' instead of making it 'data'.
However, for efficient code, EdChum has a great answer above. Or another method similar to the for loop in efficiency but easier code to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
This is a weird one: I have 3 dataframes, "prov_data" with contains a provider id and counts on regions and categories (ie. how many times that provider interacted with those regions and categories).
prov_data = DataFrame({'aprov_id':[1122,3344,5566,7788],'prov_region_1':[0,0,4,0],'prov_region_2':[2,0,0,0],
'prov_region_3':[0,1,0,1],'prov_cat_1':[0,2,0,0],'prov_cat_2':[1,0,3,0],'prov_cat_3':[0,0,0,4],
'prov_cat_4':[0,3,0,0]})
"tender_data" which contains the same but for tenders.
tender_data = DataFrame({'atender_id':['AA12','BB33','CC45'],
'ten_region_1':[0,0,1,],'ten_region_2':[0,1,0],
'ten_region_3':[1,1,0],'ten_cat_1':[1,0,0],
'ten_cat_2':[0,1,0],'ten_cat_3':[0,1,0],
'ten_cat_4':[0,0,1]})
And finally a "no_match" DF wich contains forbidden matches between provider and tender.
no_match = DataFrame({ 'prov_id':[1122,3344,5566],
'tender_id':['AA12','BB33','CC45']})
I need to do the following: create a new df that will append the rows of the prov_data & tender_data DataFrames if they (1) match one or more categories (ie the same category is > 0) AND (2) match one or more regions AND (3) are not on the no_match list.
So that would give me this DF:
df = DataFrame({'aprov_id':[1122,3344,7788],'prov_region_1':[0,0,0],'prov_region_2':[2,0,0],
'prov_region_3':[0,1,1],'prov_cat_1':[0,2,0],'prov_cat_2':[1,0,0],'prov_cat_3':[0,0,4],
'prov_cat_4':[0,3,0], 'atender_id':['BB33','AA12','BB33'],
'ten_region_1':[0,0,0],'ten_region_2':[1,0,1],
'ten_region_3':[1,1,1],'ten_cat_1':[0,1,0],
'ten_cat_2':[1,0,1],'ten_cat_3':[1,0,1],
'ten_cat_4':[0,0,0]})
code
# the first columns of each dataframe are the ids
# i'm going to use them several times
tid = tender_data.values[:, 0]
pid = prov_data.values[:, 0]
# first columns [1, 2, 3, 4] are cat columns
# we could have used filter, but this is good
# for this example
pc = prov_data.values[:, 1:5]
tc = tender_data.values[:, 1:5]
# columns [5, 6, 7] are rgn columns
pr = prov_data.values[:, 5:]
tr = tender_data.values[:, 5:]
# I want to mave this an m x n array, where
# m = number of rows in prov df and n = rows in tender
nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
nm = nm.reindex_axis(tid, 1).reindex_axis(pid, 0)
nm = ~nm.fillna(0).astype(bool).values * 1
# the dot products of the cat arrays gets a handy
# array where there are > 1 co-positive values
# this combined with the a no_match construct
a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, pid, tid)
a = a.mask(~a).stack().index
fp = a.get_level_values(0)
ft = a.get_level_values(1)
pd.concat([
prov_data.set_index('aprov_id').loc[fp].reset_index(),
tender_data.set_index('atender_id').loc[ft].reset_index()
], axis=1)
index prov_cat_1 prov_cat_2 prov_cat_3 prov_cat_4 prov_region_1 \
0 1122 0 1 0 0 0
1 3344 2 0 0 3 0
2 7788 0 0 4 0 0
prov_region_2 prov_region_3 atender_id ten_cat_1 ten_cat_2 ten_cat_3 \
0 2 0 BB33 0 1 1
1 0 1 AA12 1 0 0
2 0 1 BB33 0 1 1
ten_cat_4 ten_region_1 ten_region_2 ten_region_3
0 0 0 1 1
1 0 0 0 1
2 0 0 1 1
explanation
use dot products to determine matches
many other things I'll try to explain more later
Straightforward solution that uses only "standard" pandas techniques.
prov_data['tkey'] = 1
tender_data['tkey'] = 1
df1 = pd.merge(prov_data,tender_data,how='outer',on='tkey')
df1 = pd.merge(df1,no_match,how='outer',left_on = 'aprov_id', right_on = 'prov_id')
df1['dropData'] = df1.apply(lambda x: True if x['tender_id'] == x['atender_id'] else False, axis=1)
df1['dropData'] = df1.apply(lambda x: (x['dropData'] == True) or not(
((x['prov_cat_1'] > 0 and x['ten_cat_1'] > 0) or
(x['prov_cat_2'] > 0 and x['ten_cat_2'] > 0) or
(x['prov_cat_3'] > 0 and x['ten_cat_3'] > 0) or
(x['prov_cat_4'] > 0 and x['ten_cat_4'] > 0)) and(
(x['prov_region_1'] > 0 and x['ten_region_1'] > 0) or
(x['prov_region_2'] > 0 and x['ten_region_2'] > 0) or
(x['prov_region_3'] > 0 and x['ten_region_3'] > 0))),axis=1)
df1 = df1[~df1.dropData]
df1 = df1[[u'aprov_id', u'atender_id', u'prov_cat_1', u'prov_cat_2', u'prov_cat_3',
u'prov_cat_4', u'prov_region_1', u'prov_region_2', u'prov_region_3',
u'ten_cat_1', u'ten_cat_2', u'ten_cat_3', u'ten_cat_4', u'ten_region_1',
u'ten_region_2', u'ten_region_3']].reset_index(drop=True)
print df1.equals(df)
First we do a full cross product of both dataframes and merge that with the no_match dataframe, then add a boolean column to mark all rows to be dropped.
The boolean column is assigned by two boolean lambda functions with all the necessary conditions, then we just take all rows where that column is False.
This solution isn't very ressource-friendly due to the merge operations, so if your data is very large it may be disadvantageous.
So I have two pandas dataframes, A and B.
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
B is 1024 rows x 10 columns, and is a full iteration of 0's and 1's, hence having 1024 rows.
I am trying to find which rows in A, at a particular 10 columns of A, correspond with a given row in B. I need the whole row to match up, rather than element by element.
For example, I would want
A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)==(1,0,1,0,1,0,0,1,0,0)).all(axis=1)]
To return something that rows (3,5,8,11,15) in A match up with that (1,0,1,0,1,0,0,1,0,0) row of B at those particular columns (1,2,3,4,5,6,7,8,9,10)
And I want to do this over every row in B.
The best way I could figure out to do this was:
import numpy as np
for i in B:
B_array = np.array(i)
Matching_Rows = A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == B_array).all(axis=1)]
Matching_Rows_Index = Matching_Rows.index
This isn't terrible for one instance, but I use it in a while loop that runs around 20,000 times; therefore, it slows it down quite a bit.
I have been messing around with DataFrame.apply to no avail. Could map work better?
I was just hoping someone saw something obviously more efficient as I am fairly new to python.
Thanks and best regards!
We can abuse the fact that both dataframes have binary values 0 or 1 by collapsing the relevant columns from A and all columns from B into 1D arrays each, when considering each row as a sequence of binary numbers that could be converted to decimal number equivalents. This should reduce the problem set considerably, which would help with performance. Now, after getting those 1D arrays, we can use np.in1d to look for matches from B in A and finally np.where on it to get the matching indices.
Thus, we would have an implementation like so -
# Setup 1D arrays corresponding to selected cols from A and entire B
S = 2**np.arange(10)
A_ID = np.dot(A[range(1,11)],S)
B_ID = np.dot(B,S)
# Look for matches that exist from B_ID in A_ID, whose indices
# would be desired row indices that have matched from B
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
Sample run -
In [157]: # Setup dataframes A and B with rows 0, 4 in A having matches from B
...: A_arr = np.random.randint(0,2,(10,14))
...: B_arr = np.random.randint(0,2,(7,10))
...:
...: B_arr[2] = A_arr[4,1:11]
...: B_arr[4] = A_arr[4,1:11]
...: B_arr[5] = A_arr[0,1:11]
...:
...: A = pd.DataFrame(A_arr)
...: B = pd.DataFrame(B_arr)
...:
In [158]: S = 2**np.arange(10)
...: A_ID = np.dot(A[range(1,11)],S)
...: B_ID = np.dot(B,S)
...: out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
...:
In [159]: out_row_idx
Out[159]: array([0, 4])
You can use merge with reset_index - output are indexes of B which are equal in A in custom columns:
A = pd.DataFrame({'A':[1,0,1,1],
'B':[0,0,1,1],
'C':[1,0,1,1],
'D':[1,1,1,0],
'E':[1,1,0,1]})
print (A)
A B C D E
0 1 0 1 1 1
1 0 0 0 1 1
2 1 1 1 1 0
3 1 1 1 0 1
B = pd.DataFrame({'0':[1,0,1],
'1':[1,0,1],
'2':[1,0,0]})
print (B)
0 1 2
0 1 1 1
1 0 0 0
2 1 1 0
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A')))
index_B 0 1 2 index_A A B C D E
0 0 1 1 1 2 1 1 1 1 0
1 0 1 1 1 3 1 1 1 0 1
2 1 0 0 0 1 0 0 0 1 1
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A'))[['index_B','index_A']])
index_B index_A
0 0 2
1 0 3
2 1 1
You can do it in pandas by using loc or ix and telling it to find the rows where the ten columns are all equal. Like this:
A.loc[(A[1]==B[1]) & (A[2]==B[2]) & (A[3]==B[3]) & A[4]==B[4]) & (A[5]==B[5]) & (A[6]==B[6]) & (A[7]==B[7]) & (A[8]==B[8]) & (A[9]==B[9]) & (A[10]==B[10])]
This is quite ugly in my opinion but it will work and gets rid of the loop so it should be significantly faster. I wouldn't be surprised if someone could come up with a more elegant way of coding the same operation.
In this special case, your rows of 10 zeros and ones can be interpreted as 10 digit binaries. If B is in order, then it can be interpreted as a range from 0 to 1023. In this case, all we need to do is take A's rows in 10 column chunks and calculate what its binary equivalent is.
I'll start by defining a range of powers of two so I can do matrix multiplication with it.
twos = pd.Series(np.power(2, np.arange(10)))
Next, I'll relabel A's columns into a MultiIndex and stack to get my chunks of 10.
A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A.columns = pd.MultiIndex.from_tuples(zip((A.columns / 10).tolist(), (A.columns % 10).tolist()))
A_ = A.stack(0)
A_.head()
Finally, I'll multiply A_ with twos to get integer representation of each row and unstack.
A_.dot(twos).unstack()
This is now a 1000 x 50 DataFrame where each cell represents which of B's rows we matched for that particular 10 column chunk for that particular row of A. There isn't even a need for B.
I have the following dataframe:
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
The table goes on and on like that. The first column is a timestamp which is in milliseconds. S_time1 and End_time_1 are the duration where a particular sign (number) appear. For example, if we take the 5th row, S_time1 is 2526631, End_time_1 is 2520631, and the corresponding sign_1 is 10, which means from 2526631 to 2520631 the sign 10 will be displayed. And the same thing goes to S_time2 and End_time_2. The corresponding values in sign_2 will appear in the duration from S_time2 to End_time_2.
I want to resample the index column (Timestamp) in 100-millisecond bin time and check in which bin times the signs belong. For instance, between each start time and end time there is 2000 milliseconds difference. So the corresponding sign number will appear repeatedly in around 20 consecutive bin times because each bin time is 100 millisecond. So I need to have two columns only: one with the bin times and the second with the signs. Looks like the following table: (I am just making up the bin time just for example)
Bin_time signs
...100 0
...200 0
...300 10
...400 10
...500 10
...600 10
The sign 10 will be for the duration of the corresponding S_time1 to End_time_1. Then the next sign which is 80 continues for the duration of S_time2 to End_time_2. I am not sure if this can be done in pandas or not. But I really need help either in pandas or other methods.
Thanks for your help and suggestion in advance.
Input:
print df
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
2 approaches:
In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop
In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop
def m(df):
#resample column Timestamp by 100ms, convert bak to integers
df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
df['i'] = 1
df = df.set_index('Timestamp')
df1 = df[[]].resample('100ms', how='first').reset_index()
df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
#felper column i for merging
df1['i'] = 1
#print df1
out = df1.merge(df,on='i', how='left')
out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]
out1 = out1.rename(columns={'Sign_1':'Bin_time'})
out2 = out2.rename(columns={'Sign_2':'Bin_time'})
df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
df1 = df1.set_index('Timestamp')
df = df.set_index('Timestamp')
df = df.reindex(df1.index).reset_index()
#print df.head(10)
def s(df):
#resample column Timestamp by 100ms, convert bak to integers
df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
df = df.set_index('Timestamp')
out = df[[]].resample('100ms', how='first')
out = out.reset_index()
out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
#print out.head(10)
#search start end
def search(x):
mask1 = (df.S_time1<=x['Timestamp']) & (df.End_Time_1>=x['Timestamp'])
#if at least one True return first value of series
if mask1.any():
return df.loc[mask1].Sign_1[0]
#check second start and end time
else:
mask2 = (df.S_time2<=x['Timestamp']) & (df.End_time_2>=x['Timestamp'])
if mask2.any():
#if at least one True return first value
return df.loc[mask2].Sign_2[0]
else:
#if all False return NaN
return np.nan
out['Bin_time'] = out.apply(search, axis=1)
#print out.head(10)