How to add a prefix to rows of a column if a condition is met - python

I have a data frame with certain columns and rows, and I need to add a prefix to the values of one of its columns if they meet a certain condition:
df = pd.DataFrame({'col':['a',0,2,3,5],'col2':['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
And I need to add a prefix to df.col2 based on the values in the Samples dataframe. I tried it with np.where as follows:
df['col2'] = np.where(df.col2.isin(samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
Which throws an error:
TypeError: can only perform ops with scalar values
It doesn't return what I am asking for and just throws this error.
In the end the data frame should look like:
>>> df.head()
col  col2
a    Yes_PFD_1
0    no_PFD_2
2    no_PFD_3
3    no_PFD_4
5    Yes_PFD_5

Your code worked fine for me once I fixed the capitalization of 'samples':
import pandas as pd
import numpy as np
df = pd.DataFrame({'col':['a',0,2,3,5],'col2': ['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
df['col2'] = np.where(df.col2.isin(Samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
df['col2']
Output:
0 YesPFD_1
1 Non_PFD_2
2 Non_PFD_3
3 Non_PFD_4
4 YesPFD_5
Name: col2, dtype: object
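If you want the exact Yes_/no_ prefixes shown in the desired output, the same pattern works with the strings adjusted; a minimal sketch, assuming the df and Samples defined above:
df['col2'] = np.where(df.col2.isin(Samples.Sam), 'Yes_' + df.col2, 'no_' + df.col2)
# 0    Yes_PFD_1
# 1     no_PFD_2
# 2     no_PFD_3
# 3     no_PFD_4
# 4    Yes_PFD_5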

Replacing all instances of standalone "." in a pandas dataframe

Beginner question incoming.
I have a dataframe derived from an excel file with a column that I will call "input".
In this column are floats (e.g. 7.4, 8.1, 2.2,...). However, there are also some wrong values such as strings (which are easy to filter out) and, what I find difficult, single instances of "." or "..".
I would like to clean the column to generate only numeric float values.
I have used this approach for other columns, but cannot do so here because if I get rid of the "." instances, my floats will be messed up:
for col in [col for col in new_df.columns if col.startswith("input")]:
    new_df[col] = new_df[col].str.replace(r',| |\-|\^|\+|#|j|0|.', '', regex=True)
    new_df[col] = pd.to_numeric(new_df[col], errors='raise')
I have also tried the following, but it then replaces every value in the column with None:
for index, row in new_df.iterrows():
    col_input = row['input']
    if re.match(r'^-?\d+(?:.\d+)$', str(col_input)) is None:
        new_df["input"] = None
How do I get rid of the dots?
Thanks!
You can simply use pandas.to_numeric and pass errors='coerce', without the loop:
from io import StringIO
import pandas as pd
s = """input
7.4
8.1
2.2
foo
foo.bar
baz/foo"""
df = pd.read_csv(StringIO(s))
df['input'] = pd.to_numeric(df['input'], errors='coerce')
# Output:
print(df)
input
0 7.4
1 8.1
2 2.2
3 NaN
4 NaN
5 NaN
df.dropna(inplace=True)
print(df)
input
0 7.4
1 8.1
2 2.2
If you need to clean up multiple mixed columns, use :
cols = ['input', ...] # put here the name of the columns concerned
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=cols, inplace=True)
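Applied to the question's own setup, the loop-free version might look like this; a sketch, assuming new_df is the dataframe loaded from the excel file and that its relevant columns start with "input":
input_cols = [c for c in new_df.columns if c.startswith("input")]
new_df[input_cols] = new_df[input_cols].apply(pd.to_numeric, errors='coerce')
# drop rows where "." or ".." (or any other non-numeric value) became NaN
new_df = new_df.dropna(subset=input_cols)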

Python, Pandas: Using isin() like functionality but do not ignore duplicates in input list

I am trying to filter an input dataframe (df_in) against a list of indices. The indices list contains duplicates and I want my output df_out to contain all occurrences of a particular index. As expected, isin() gives me only a single entry for every index.
How can I avoid ignoring duplicates and get output similar to df_out_desired?
import pandas as pd
import numpy as np
df_in = pd.DataFrame(index=np.arange(4), data={'A':[1,2,3,4],'B':[10,20,30,40]})
indices_needed_list = pd.Series([1,2,3,3,3])
# In the output df, I do not particularly care about the 'index' from the df_in
df_out = df_in[df_in.index.isin(indices_needed_list)].reset_index()
# With isin, as expected, I only get a single entry for each occurrence of an index in indices_needed_list
# What I am trying to get is an output df that has as many rows and occurrences of the df_in index as in indices_needed_list
temp = df_out[df_out['index'] == 3]
# This is what I would like to try and get
df_out_desired = pd.concat([df_out, df_out[df_out['index']==3], df_out[df_out['index']==3]])
Thanks!
Check reindex:
df_out_desired = df_in.reindex(indices_needed_list)
df_out_desired
Out[177]:
   A   B
1  2  20
2  3  30
3  4  40
3  4  40
3  4  40
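If you also want the flat 0..n index with the original index kept as a column, as in df_out above, you can chain reset_index; a small sketch on the same data:
df_out_desired = df_in.reindex(indices_needed_list).reset_index()
# one row per entry in indices_needed_list, with the original df_in index kept in the 'index' column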

Remove Column with Duplicate Values in Pandas

I have a database, a sample of which is shown below.
The data frame is generated when I load the data in Python with the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
Output:
Is there any way we can avoid reading the duplicate columns in Pandas, or remove the duplicate columns after reading?
Please note: the column names are different once the data is read into Pandas, so a command like df=df.loc[:,~df.columns.duplicated()] won't work.
The actual database is very big and has many duplicate columns containing dates only.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts an integer list.
So you can try:
# work out required columns
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))
# use column integer list
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
One way to do it could be to read only the first row and create a mask using drop_duplicates(). We pass this to usecols without needing to specify the indices beforehand, so it should be fairly failsafe.
m = pd.read_csv(pd.compat.StringIO(data),nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(pd.compat.StringIO(data), usecols=m)
Full example:
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
m = pd.read_csv(pd.compat.StringIO(data),nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(pd.compat.StringIO(data), usecols=m)
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
Another way to do it would be to remove all columns with a dot in their name (pandas appends .1, .2, ... to duplicated column names when reading). This should work in most cases, as a dot is rarely used in column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
df = pd.read_csv(pd.compat.StringIO(data))
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
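A side note on the snippets above: pd.compat.StringIO has been removed in newer pandas releases, so on a recent version the same example can be written with io.StringIO instead; a sketch under that assumption:
from io import StringIO
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

# build the mask of non-duplicated columns from the header row, then reread with usecols
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)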

If a cell contains more than one string, put each one into a new row in Pandas

So I'm working with Pandas and I have multiple words (i.e. strings) in one cell, and I need to put every word into a new row while keeping the coordinated data. I've found a method which could help me, but it works with numbers, not strings.
So what method do I need to use?
Simple example of my table:
id  name      method
1   adenosis  mammography, mri
And I need it to be:
id  name      method
1   adenosis  mammography
              mri
Thanks!
UPDATE:
That's what I'm trying to do, according to #jezrael's proposal:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("./dev/eyetoai/google_form_pure.xlsx")
xl.sheet_names
df = xl.parse("Form Responses 1")
df.groupby(['Name of condition','Condition description','Relevant Modality','Type of finding Mammography', 'Type of finding MRI', 'Type of finding US']).mean()
splitted = df['Relevant Modality'].str.split(',')
l = splitted.str.len()
df = pd.DataFrame({col: np.repeat(df[col], l) for col in ['Name of condition','Condition description']})
df['Relevant Modality'] = np.concatenate(splitted)
But I have this type of error:
TypeError: repeat() takes exactly 2 arguments (3 given)
You can use read_excel + split + stack + drop + join + reset_index:
#define columns which need to be split by ',' and then flattened
cols = ['Condition description','Relevant Modality']
#read excel file into dataframe
df = pd.read_excel('Untitled 1.xlsx')
#print (df)
df1 = pd.DataFrame({col: df[col].str.split(',', expand=True).stack() for col in cols})
print (df1)
Condition description Relevant Modality
0 0 Fibroadenomas are the most common cause of a b... Mammography
1 NaN US
2 NaN MRI
1 0 Papillomas are benign neoplasms Mammography
1 arising in a duct US
2 either centrally or peripherally within the b... MRI
3 leading to a nipple discharge. As they are of... NaN
4 the discharge may be bloodstained. NaN
2 0 OK Mammography
3 0 breast cancer Mammography
1 NaN US
4 0 breast inflammation Mammography
1 NaN US
#remove original columns
df = df.drop(cols, axis=1)
#create Multiindex in original df for align rows
df.index = [df.index, [0]* len(df.index)]
#join original to flattened columns, remove Multiindex
df = df1.join(df).reset_index(drop=True)
#print (df)
The previous answer is correct; I think you should use the id as the reference.
An easier way could possibly be to just parse the method string into a list:
method_list = method.split(',')
method_list = np.asarray(method_list)
If you have any trouble with indexing when initializing your DataFrame, just set the index:
pd.DataFrame(data, index=[0, 0])
df.set_index('id')
Passing the list as the value for your method key will automatically create a copy of both the index 'id' and 'name':
id  method       name
1   mammography  adenosis
1   mri          adenosis
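For completeness, on newer pandas versions (0.25+) Series.explode offers a shorter route to the same reshaping; a minimal sketch on the small example from the question, with the column names assumed as shown there:
import pandas as pd

df = pd.DataFrame({'id': [1], 'name': ['adenosis'], 'method': ['mammography, mri']})
# split the comma-separated cell into a list, then give each list element its own row
out = df.assign(method=df['method'].str.split(', ')).explode('method').reset_index(drop=True)
print(out)
#    id      name       method
# 0   1  adenosis  mammography
# 1   1  adenosis          mri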
I hope this helps, all the best

Data selection using pandas

I have a file where the separator(delimiter) is ';' . I read that file into a pandas dataframe df. Now, I want to select some rows from df using a criteria from column c in df. The format of data in column c is as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', then I want to select those rows from df, where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know, I can write a for loop and iter over each object in the dataframe but I was wondering if I could do something in a more compact and efficient way.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
dat=df.ix[:,c].str.split('|')
But, I do not know how to index 'dat'. 'dat' is a Pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But, it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank you in advance
Here is an approach:
Data
df = pd.DataFrame({'c':['history|science','science|chemistry','geography|science','biology|IT'],'col2':range(4)})
Out[433]:
                   c  col2
0    history|science     0
1  science|chemistry     1
2  geography|science     2
3         biology|IT     3
lst = ['geography', 'biology','IT']
Resolution
You can use a list comprehension:
df.loc[pd.Series([not x.split('|')[0] in lst for x in df.c.tolist()])]
Out[444]:
                   c  col2
0    history|science     0
1  science|chemistry     1
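The same filter can also be written with the vectorized .str accessor instead of a Python-level comprehension; a sketch on the same df and lst as above:
df[~df['c'].str.split('|').str[0].isin(lst)]
# keeps only the rows whose first '|'-separated word is not in lst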
