In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non-missing columns in the ith row?
For example, in the 0th row I'd like to return back columns
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing it's somewhere along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
Edit:
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan (this assumes every column is numeric; for mixed dtypes, df.isna() is the safer check)
# set row number
row_number = 0
# get dataframe
df.loc[row_number, ~np.isnan(df.values)[row_number]]
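For comparison, a simpler route (a sketch, not part of the answer above) is to index the row Series with its own notna() mask, which avoids np.where and the NumPy detour entirely:

```python
import pandas as pd

d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)

# Boolean indexing on the row keeps only the non-missing entries
row = df.iloc[0]
non_missing = row[row.notna()]
print(non_missing.index.tolist())  # ['col1', 'col3']
```

The result is still a Series labeled by the surviving column names, so you keep both the values and the column identities in one step.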
Related
I have a data frame with the following columns:
d = {'find_no': [1, 2, 3], 'zip_code': [32351, 19207, 8723]}
df = pd.DataFrame(data=d)
When there are 5 digits in the zip_code column, I want to return True. When there are not 5 digits, I want to return the "find_no".
Sample output would have the results in an added column to the dataframe, corresponding to the row it's referencing.
You could try np.where:
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, True, df['find_no'])
The only downside with this approach is that NumPy will convert your True values to 1's, which could be confusing. An approach that keeps the values you want is to do
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, 'True', df['find_no'].astype(str))
The downside here being that you lose the meaning of those values by casting them to strings. I guess it all depends on what you're hoping to accomplish.
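To make the coercion concrete, here is what the first np.where version produces on the sample data: the True for the two 5-digit rows comes back as the integer 1, which is indistinguishable from a find_no of 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'find_no': [1, 2, 3], 'zip_code': [32351, 19207, 8723]})

# True is coerced to match the integer dtype of find_no
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, True, df['find_no'])
print(df['result'].tolist())  # [1, 1, 3]
```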
I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index, and as you can see below it is stored in a FrozenList. I am only interested in the index values that contain a long number. The word "vehicle", which also appears as an index level, can be removed, as it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A' : [5]*5, 'B' : [6]*5})
df = df.set_index('A',append=True)
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
to drop duplicates from an index level use the .duplicated() method.
df[~df.index.get_level_values(1).duplicated(keep='first')]
B
A
0 5 6
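Tying this back to the original question, here is a sketch of the serial-number check using get_level_values. The data here is hypothetical (small ids standing in for the long UUIDs, 'vehicle' as the repeated second level), shaped only to mirror the question:

```python
import pandas as pd

# Hypothetical frame shaped like the question: level 0 is an id,
# level 1 is the repeated word 'vehicle'
df = pd.DataFrame(
    {'SERIAL_NUMBER': ['S1', 'S1', 'S2']},
    index=pd.MultiIndex.from_tuples(
        [('id-a', 'vehicle'), ('id-a', 'vehicle'), ('id-b', 'vehicle')]
    ),
)

# Loop over the unique level-0 values and collect each id's serial numbers
serials_per_id = {
    idx: df.loc[idx, 'SERIAL_NUMBER'].unique().tolist()
    for idx in df.index.get_level_values(0).unique()
}
print(serials_per_id)  # {'id-a': ['S1'], 'id-b': ['S2']}
```

Each id maps to exactly one serial number here, which is the uniqueness property the question wants to confirm.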
I have this DataFrame in Python,
and I want to multiply every row in the first DataFrame by a single row from the DataFrame below, as a vector.
Some things I have tried from googling: df.mul, df.apply. But they seem to multiply the two frames together normally instead of performing the vectorized operation.
Example data:
df = pd.DataFrame({'x':[1,2,3], 'y':[1,2,3]})
v1 = pd.DataFrame({'x':[2], 'y':[3]})
Multiply DataFrame with row:
df.multiply(np.array(v1), axis='columns')
If the use case needs accurate matching of columns
Example:
df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])
You need to convert the single-row df (coeffs_df) to a Series first, then perform the multiply
df.multiply(coeffs_df.iloc[0], axis='columns')
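The difference between the two approaches shows up clearly on this example: the np.array version multiplies by position and ignores the swapped column order, while the Series version aligns on the column labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])  # columns deliberately swapped

# Positional: the array carries no labels, so x is multiplied by 10 and y by 9
positional = df.multiply(np.array(coeffs_df), axis='columns')

# Label-aligned: the Series carries column labels, so x is multiplied by 9 and y by 10
aligned = df.multiply(coeffs_df.iloc[0], axis='columns')

print(positional.values.tolist())  # [[10, 18], [30, 36]]
print(aligned.values.tolist())     # [[9, 20], [27, 40]]
```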
I need to create a duplicate for each row in a dataframe, apply some basic operations to the duplicate row and then combine these dupped rows along with the originals back into a dataframe.
I'm trying to use apply for it and the print shows that it's working correctly but when I return these 2 rows from the function and the dataframe is assembled I get an error message "cannot copy sequence with size 7 to array axis with dimension 2". It is as if it's trying to fit these 2 new rows back into the original 1 row slot. Any insight on how I can achieve it within apply (and not by iterating over every row in a loop)?
def f(x):
    x_cpy = x.copy()
    x_cpy['A'] = x['B']
    print(pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True))
    #return pd.concat([x, x_cpy], axis=1).T.reset_index(drop=True)

hld_pos.apply(f, axis=1)
The apply function of pandas operates along an axis. With axis=1, it operates along every row. To do something like what you're trying to do, think of how you would construct a new row from your existing row. Something like this should work:
import pandas as pd

my_df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6]})

def f(row):
    """Return a new row with the items of the old row squared"""
    return pd.Series({'a': row['a'] ** 2, 'b': row['b'] ** 2})

new_df = my_df.apply(f, axis=1)
combined = pd.concat([my_df, new_df], axis=0)
I would like to know how to add a new row efficiently to the dataframe.
Assuming I have an empty dataframe
"A" "B"
columns = ['A','B']
user_list = pd.DataFrame(columns=columns)
I want to add one row like {A=3, B=4} to the dataframe, how to do that in most efficient way?
import numpy as np

columns = ['A', 'B']
user_list = pd.DataFrame(np.zeros((1000, 2)) + np.nan, columns=columns)
user_list.iloc[0] = [3, 4]
user_list.iloc[1] = [4, 5]
Pandas doesn't have built-in resizing, but it will ignore nan's pretty well. You'll have to manage your own resizing, though :/
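Another common pattern, sketched here as an alternative rather than a replacement for the pre-allocation above, is to accumulate rows in a plain Python list and build the DataFrame once at the end, which avoids growing the frame row by row:

```python
import pandas as pd

# Collect rows in a plain list of dicts, then construct the frame once
rows = []
rows.append({'A': 3, 'B': 4})
rows.append({'A': 4, 'B': 5})

user_list = pd.DataFrame(rows, columns=['A', 'B'])
print(user_list.values.tolist())  # [[3, 4], [4, 5]]
```

Appending to a list is cheap, and the single DataFrame construction at the end does one allocation instead of one per row.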