I want to get all rows in a DataFrame where the value in any column is shorter than 2 characters.
For example:
df = pd.DataFrame({"col1":["a","ab",""],"col2":["bc","abc", "a"]})
  col1 col2
0    a   bc
1   ab  abc
2          a
How to get this output:
  col1 col2
0    a   bc
2          a
Let's try stack to reshape, then compute the lengths with str.len and build a boolean mask with lt + any:
df[df.stack().str.len().lt(2).any(level=0)]
  col1 col2
0    a   bc
2          a
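Note: newer pandas versions removed the level argument of any(), so the same idea can be written with a groupby on the row level instead (an untested sketch):
df[df.stack().str.len().lt(2).groupby(level=0).any()]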
You can use the str.len() method of a pandas Series:
for col in df.columns:
    df[col] = df[col][df[col].str.len() < 3]
df = df.dropna()
A list comprehension could help here:
df.loc[[not any(len(word) > 2 for word in entry)
        for entry in df.to_numpy()]]
  col1 col2
0    a   bc
2          a
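The same row mask can also be built without reshaping at all; this is just a sketch that assumes every cell is a string, as in the sample df:
mask = df.apply(lambda s: s.str.len().lt(2)).any(axis=1)
df[mask]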
I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop which appears to work, however I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
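Since apply with axis=1 still loops row by row in Python, a fully vectorized sketch (assuming rating_df's columns line up with fixed_vector and its rows are in the same order as business_df) is a single matrix-vector product:
business_df['dot_prod'] = rating_df.to_numpy() @ np.asarray(fixed_vector)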
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
I have a pandas dataframe that looks like this:
col1 col2 col3
0 A,B,C 0|0 1|1
1 D,E,F 2|2 3|3
2 G,H,I 4|4 0|0
My goal is to apply a function on col2 through the last column of the dataframe that splits the corresponding string in col1, using the comma as the delimiter, and uses the first number as the index to get the corresponding list element. For numbers that are greater than the length of the list, I'd like to replace with the 0th element of the list.
Expected output:
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
In reality, my dataframe has thousands of columns with millions of entries that need this replacement, so I need a method that doesn't refer to 'col2' and 'col3' explicitly (and preferably one that is computationally efficient).
You can use this code to create the original dataframe:
df = pd.DataFrame(
    {
        'col1': ['A,B,C', 'D,E,F', 'G,H,I'],
        'col2': ['0|0', '2|2', '4|4'],
        'col3': ['1|1', '3|3', '0|0']
    }
)
Taking into account that you could have a lot of columns and the length of the arrays in col1 could vary, you can use the following generalization, which only loops through the columns:
for col in df.columns[1:]:
    df[col] = (df['col1']+','+df[col].str.split('|').str[0]).str.split(',') \
                  .apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
which outputs for your example:
>>> print(df)
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
Explanation:
First you take the index (as a string) from colX and append it to the string in col1, so that you get something like 'A,B,C,0'; splitting on the comma then gives a list whose last element is the index you need (['A', 'B', 'C', '0']):
(df['col1']+','+df[col].str.split('|').str[0]).str.split(',')
Then you apply a function that returns the i-th element, where i is the last element of the list; if i is out of range for the letters (not smaller than their count), it returns the first element of the list instead.
(df['col1']+','+df[col].str.split('|').str[0]).str.split(',') \
    .apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
Last but not least, you just put it in a loop over your desired columns.
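For illustration, the intermediate Series built for col2 from the sample df looks like this (each list carries the letters plus the requested index as its last element):
>>> (df['col1'] + ',' + df['col2'].str.split('|').str[0]).str.split(',')
0    [A, B, C, 0]
1    [D, E, F, 2]
2    [G, H, I, 4]
dtype: object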
I would first reduce your x|x format (the same number repeated around a pipe) to a single number:
df['col2'] = df['col2'].str.split('|', expand=True).iloc[:, 0]
df['col3'] = df['col3'].str.split('|', expand=True).iloc[:, 0]
Then split the letter mappings while keeping them aligned by row.
ndf = pd.concat([df, df['col1'].str.split(',', expand=True)], axis=1)
After that, map them back by row while making sure to prevent overflows:
def bad_mapping(row, c):
    value = int(row[c])
    if value <= 2:  # adjust if needed
        return row[value]
    else:
        return row[0]
for c in ['col2', 'col3']:
    ndf['mapped_' + c] = ndf.apply(lambda r: bad_mapping(r, c), axis=1)
Output looks like:
    col1 col2 col3  0  1  2 mapped_col2 mapped_col3
0  A,B,C    0    1  A  B  C           A           B
1  D,E,F    2    3  D  E  F           F           D
2  G,H,I    4    0  G  H  I           G           G
Drop columns with df.drop(columns=['your', 'columns', 'here'], inplace=True) as needed.
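If apply per row turns out to be too slow on millions of rows, one possible sketch is a plain zip over pre-split columns; it assumes the original df from the question, i.e. col1 holds the comma-separated letters and every later column holds 'i|i'-style strings:
letters = df['col1'].str.split(',')                        # list of letters per row
for col in df.columns[1:]:
    idx = df[col].str.split('|').str[0].astype(int)        # requested index per row
    df[col] = [lst[i] if i < len(lst) else lst[0]          # fall back to the first letter
               for lst, i in zip(letters, idx)]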
My data looks like this: (I have 28 columns)
col1 col2 col3 col4 col5
AA   0    0    B    0
0    CC   0    D    0
0    0    E    F    G
I am trying to merge these columns to get an output like this:
col1 col2 col3 col4 col5 col6
AA   0    0    B    0    AA;B
0    CC   0    D    0    CC;D
0    0    E    F    G    E;F;G
I want to merge only the non-numeric characters into the new column.
I tried like this:
cols=['col1','col2', 'col3', 'col4', 'col5']
df2["col6"] = df2[cols].apply(lambda x: ';'.join(x.dropna()), axis=1)
But it doesn't take out the zeros. I am aware it is a small change but couldn't figure it out.
Thanks
Try it via the where() method and apply() method:
df2["col6"]=df2.where((df2!='0')&(df2!=0)).apply(lambda x: ';'.join(x.dropna()), axis=1)
If there are numbers other than 0 (including 0), then use:
df2["col6"]=(df2.where(df2.apply(lambda x:x.str.isalpha(),1))
.apply(lambda x: ';'.join(x.dropna()), axis=1))
With your shown samples, please try the following. This tries to fix the OP's attempt; the main change is using the condition x[x != 0] to build the boolean mask inside the OP's join call.
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x!=0]), axis=1)
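For reference, a self-contained sketch of that fix; it assumes the zeros are stored as strings, in which case the comparison has to be against '0' rather than 0:
import pandas as pd

df2 = pd.DataFrame({'col1': ['AA', '0', '0'],
                    'col2': ['0', 'CC', '0'],
                    'col3': ['0', '0', 'E'],
                    'col4': ['B', 'D', 'F'],
                    'col5': ['0', '0', 'G']})
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x != '0']), axis=1)
print(df2['col6'].tolist())   # ['AA;B', 'CC;D', 'E;F;G']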
I have many columns in a dataframe, and I want to fill one column by manipulating two other columns in the same dataframe.
col1  col2  col3  col4
nan   1     2     4
2     2     2     3
3     nan   1     2
I want to fill the value in col1, col2, or col3 wherever a NaN exists, based on the values of the other two of col1, col2, and col3.
I have code as follows:
indices_of_nan_cell = [(index, col1, col2, col3)
                       for index, (col1, col2, col3) in enumerate(zip(col1, col2, col3))
                       if str(col1) == 'nan' or str(col2) == 'nan' or str(col3) == 'nan']
for nan_values in indices_of_nan_cell:
    if np.isnan(nan_values[1]) or nan_values[1] == 'nan':
        read4['col1'][nan_values[0]] = float(nan_values[2]) * float(nan_values[3])
    if np.isnan(nan_values[2]) or nan_values[2] == 'nan':
        read4['col2'][nan_values[0]] = float(nan_values[1]) / float(nan_values[3])
    if np.isnan(nan_values[3]) or nan_values[3] == 'nan':
        read4['col3'][nan_values[0]] = float(nan_values[1]) * float(nan_values[2])
It's working fine for me, but it takes too much time because my dataframe has thousands of rows. Is there a more efficient way to do this?
I believe you need fillna to replace the NaNs, together with mul and div and their fill_value parameter to handle NaNs inside the multiplication and division:
df['col1'] = df['col1'].fillna(df['col2'].mul(df['col3'], fill_value=1))
df['col2'] = df['col2'].fillna(df['col1'].div(df['col3'], fill_value=1))
df['col3'] = df['col3'].fillna(df['col1'].mul(df['col2'], fill_value=1))
print (df)
col1 col2 col3 col4
0 2.0 1.0 2 4
1 2.0 2.0 2 3
2 3.0 3.0 1 2
Another approach works only with the rows that contain NaNs:
m1 = df['col1'].isna()
m2 = df['col2'].isna()
m3 = df['col3'].isna()
# older versions of pandas
#m1 = df['col1'].isnull()
#m2 = df['col2'].isnull()
#m3 = df['col3'].isnull()
df.loc[m1, 'col1'] = df.loc[m1, 'col2'].mul(df.loc[m1, 'col3'], fill_value=1)
df.loc[m2, 'col2'] = df.loc[m2, 'col1'].div(df.loc[m2, 'col3'], fill_value=1)
df.loc[m3, 'col3'] = df.loc[m3, 'col1'].mul(df.loc[m3, 'col2'], fill_value=1)
Explanation:
Filter each column with isna to get 3 separate boolean masks.
For each mask, first filter the rows, e.g. df.loc[m1, 'col2'], and multiply or divide.
Last, assign back; only the NaNs are replaced, because the left-hand side is filtered again by df.loc[m1, 'col1'].
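For completeness, a self-contained run of the fillna variant on the sample data (values taken from the question, so treat it as a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 3],
                   'col2': [1, 2, np.nan],
                   'col3': [2, 2, 1],
                   'col4': [4, 3, 2]})
df['col1'] = df['col1'].fillna(df['col2'].mul(df['col3'], fill_value=1))
df['col2'] = df['col2'].fillna(df['col1'].div(df['col3'], fill_value=1))
df['col3'] = df['col3'].fillna(df['col1'].mul(df['col2'], fill_value=1))
print(df)   # col1 becomes 2.0 in row 0, col2 becomes 3.0 in row 2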
I am looking to find the unique values for each column in my dataframe (values that occur only once across the whole dataframe).
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas? Thank you!
You can use stack to get a Series, then drop_duplicates with keep=False to remove all duplicated values, remove the first index level with reset_index, and last reindex:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely only if there is at most one unique value per column.
I'll try to create a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
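The same idea (keep values that occur exactly once in the whole frame) can also be written with value_counts, a sketch that avoids drop_duplicates:
stacked = df.stack()
once = stacked[stacked.map(stacked.value_counts()) == 1]
print(once.groupby(level=1).unique().reindex(df.columns))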
I don't believe this is exactly what you want, but as useful information - you can find unique values for a DataFrame using numpy's .unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']