How to add a prefix to rows of a column if a condition is met - python

I have a data frame with certain columns and rows, and I need to add a prefix to the values of one of its columns if they meet a certain condition:
df = pd.DataFrame({'col':['a',0,2,3,5],'col2':['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
And I need to add a prefix to df.col2 based on the values in the Samples dataframe. I tried it with np.where as follows:
df['col2'] = np.where(df.col2.isin(samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
Which throws an error:
TypeError: can only perform ops with scalar values
It doesn't return what I am asking for and just throws this error.
In the end the data frame should look like:
>>> df.head()
col  col2
a    Yes_PFD_1
0    no_PFD_2
2    no_PFD_3
3    no_PFD_4
5    Yes_PFD_5

Your code worked fine for me once I fixed the capitalization of 'samples':
import pandas as pd
import numpy as np
df = pd.DataFrame({'col':['a',0,2,3,5],'col2': ['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
df['col2'] = np.where(df.col2.isin(Samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
df['col2']
Output:
0 YesPFD_1
1 Non_PFD_2
2 Non_PFD_3
3 Non_PFD_4
4 YesPFD_5
Name: col2, dtype: object
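If you want the exact Yes_/no_ prefixes shown in the desired output, the same pattern works with the strings adjusted; a minimal sketch, assuming the df and Samples defined above:
df['col2'] = np.where(df.col2.isin(Samples.Sam), 'Yes_' + df.col2, 'no_' + df.col2)
# 0    Yes_PFD_1
# 1     no_PFD_2
# 2     no_PFD_3
# 3     no_PFD_4
# 4    Yes_PFD_5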

Replacing all instances of standalone "." in a pandas dataframe

Beginner question incoming.
I have a dataframe derived from an excel file with a column that I will call "input".
In this column are floats (e.g. 7.4, 8.1, 2.2,...). However, there are also some wrong values such as strings (which are easy to filter out) and, what I find difficult, single instances of "." or "..".
I would like to clean the column to generate only numeric float values.
I have used this approach for other columns, but cannot do so here because if I get rid of the "." instances, my floats will be messed up:
for col in [col for col in new_df.columns if col.startswith("input")]:
    new_df[col] = new_df[col].str.replace(r',| |\-|\^|\+|#|j|0|.', '', regex=True)
    new_df[col] = pd.to_numeric(new_df[col], errors='raise')
I have also tried the following, but it then replaces every value in the column with None:
for index, row in new_df.iterrows():
    col_input = row['input']
    if re.match(r'^-?\d+(?:.\d+)$', str(col_input)) is None:
        new_df["input"] = None
How do I get rid of the dots?
Thanks!
You can simply use pandas.to_numeric and pass errors='coerce', without the loop:
from io import StringIO
import pandas as pd
s = """input
7.4
8.1
2.2
foo
foo.bar
baz/foo"""
df = pd.read_csv(StringIO(s))
df['input'] = pd.to_numeric(df['input'], errors='coerce')
# Output:
print(df)
input
0 7.4
1 8.1
2 2.2
3 NaN
4 NaN
5 NaN
df.dropna(inplace=True)
print(df)
input
0 7.4
1 8.1
2 2.2
If you need to clean up multiple mixed columns, use :
cols = ['input', ...] # put here the name of the columns concerned
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=cols, inplace=True)
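Applied to the question's own setup, the loop-free version might look like this; a sketch, assuming new_df is the dataframe loaded from the excel file and that its relevant columns start with "input":
input_cols = [c for c in new_df.columns if c.startswith("input")]
new_df[input_cols] = new_df[input_cols].apply(pd.to_numeric, errors='coerce')
# drop rows where "." or ".." (or any other non-numeric value) became NaN
new_df = new_df.dropna(subset=input_cols)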

Python, Pandas: Using isin() like functionality but do not ignore duplicates in input list

I am trying to filter an input dataframe (df_in) against a list of indices. The indices list contains duplicates and I want my output df_out to contain all occurrences of a particular index. As expected, isin() gives me only a single entry for every index.
How can I avoid ignoring duplicates and get output similar to df_out_desired?
import pandas as pd
import numpy as np
df_in = pd.DataFrame(index=np.arange(4), data={'A':[1,2,3,4],'B':[10,20,30,40]})
indices_needed_list = pd.Series([1,2,3,3,3])
# In the output df, I do not particularly care about the 'index' from the df_in
df_out = df_in[df_in.index.isin(indices_needed_list)].reset_index()
# With isin, as expected, I only get a single entry for each occurrence of an index in indices_needed_list
# What I am trying to get is an output df that has as many rows and occurrences of the df_in index as in indices_needed_list
temp = df_out[df_out['index'] == 3]
# This is what I would like to try and get
df_out_desired = pd.concat([df_out, df_out[df_out['index']==3], df_out[df_out['index']==3]])
Thanks!
Check reindex:
df_out_desired = df_in.reindex(indices_needed_list)
df_out_desired
Out[177]:
   A   B
1  2  20
2  3  30
3  4  40
3  4  40
3  4  40
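If you also want the flat 0..n index with the original index kept as a column, as in df_out above, you can chain reset_index; a small sketch on the same data:
df_out_desired = df_in.reindex(indices_needed_list).reset_index()
# one row per entry in indices_needed_list, with the original df_in index kept in the 'index' column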

Remove Column with Duplicate Values in Pandas

I have a database, a sample of which is shown below.
The data frame is generated when I load the data in Python with the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
Output:
Is there any way we can avoid reading the duplicate columns in Pandas, or remove the duplicate columns after reading?
Please note: the column names are different once the data is read into Pandas, so a command like df=df.loc[:,~df.columns.duplicated()] won't work.
The actual database is very big and has many duplicate columns containing dates only.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts an integer list.
So you can try:
# work out required columns
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))
# use column integer list
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
One way to do it could be to read only the first row and create a mask using drop_duplicates(). We pass this to usecols without needing to specify the indices beforehand, so it should be fairly failsafe.
m = pd.read_csv(pd.compat.StringIO(data),nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(pd.compat.StringIO(data), usecols=m)
Full example:
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
m = pd.read_csv(pd.compat.StringIO(data),nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(pd.compat.StringIO(data), usecols=m)
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
Another way to do it would be to remove all columns with a dot in their name (pandas appends .1, .2, ... to duplicated column names when reading). This should work in most cases, as a dot is rarely used in column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
import pandas as pd
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
df = pd.read_csv(pd.compat.StringIO(data))
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
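A side note on the snippets above: pd.compat.StringIO has been removed in newer pandas releases, so on a recent version the same example can be written with io.StringIO instead; a sketch under that assumption:
from io import StringIO
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

# build the mask of non-duplicated columns from the header row, then reread with usecols
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)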

If a cell contains more than one string, put each one into a new row in Pandas

So I'm working with Pandas and I have multiple words (i.e. strings) in one cell, and I need to put every word into a new row while keeping the coordinated data. I've found a method which could help me, but it works with numbers, not strings.
So what method do I need to use?
Simple example of my table:
id  name      method
1   adenosis  mammography, mri
And I need it to be:
id  name      method
1   adenosis  mammography
              mri
Thanks!
UPDATE:
That's what I'm trying to do, according to #jezrael's proposal:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("./dev/eyetoai/google_form_pure.xlsx")
xl.sheet_names
df = xl.parse("Form Responses 1")
df.groupby(['Name of condition','Condition description','Relevant Modality','Type of finding Mammography', 'Type of finding MRI', 'Type of finding US']).mean()
splitted = df['Relevant Modality'].str.split(',')
l = splitted.str.len()
df = pd.DataFrame({col: np.repeat(df[col], l) for col in ['Name of condition','Condition description']})
df['Relevant Modality'] = np.concatenate(splitted)
But I have this type of error:
TypeError: repeat() takes exactly 2 arguments (3 given)
You can use read_excel + split + stack + drop + join + reset_index:
#define columns which need to be split by ',' and then flattened
cols = ['Condition description','Relevant Modality']
#read excel file into dataframe
df = pd.read_excel('Untitled 1.xlsx')
#print (df)
df1 = pd.DataFrame({col: df[col].str.split(',', expand=True).stack() for col in cols})
print (df1)
Condition description Relevant Modality
0 0 Fibroadenomas are the most common cause of a b... Mammography
1 NaN US
2 NaN MRI
1 0 Papillomas are benign neoplasms Mammography
1 arising in a duct US
2 either centrally or peripherally within the b... MRI
3 leading to a nipple discharge. As they are of... NaN
4 the discharge may be bloodstained. NaN
2 0 OK Mammography
3 0 breast cancer Mammography
1 NaN US
4 0 breast inflammation Mammography
1 NaN US
#remove original columns
df = df.drop(cols, axis=1)
#create Multiindex in original df for align rows
df.index = [df.index, [0]* len(df.index)]
#join original to flattened columns, remove Multiindex
df = df1.join(df).reset_index(drop=True)
#print (df)
The previous answer is correct; I think you should use the id as the reference.
An easier way could possibly be to just parse the method string into a list:
method_list = method.split(',')
method_list = np.asarray(method_list)
If you have any trouble with indexing when initializing your DataFrame, just set the index:
pd.DataFrame(data, index=[0, 0])
df.set_index('id')
Passing the list as the value for your method key will automatically create a copy of both the index 'id' and 'name':
id  method       name
1   mammography  adenosis
1   mri          adenosis
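For completeness, on newer pandas versions (0.25+) Series.explode offers a shorter route to the same reshaping; a minimal sketch on the small example from the question, with the column names assumed as shown there:
import pandas as pd

df = pd.DataFrame({'id': [1], 'name': ['adenosis'], 'method': ['mammography, mri']})
# split the comma-separated cell into a list, then give each list element its own row
out = df.assign(method=df['method'].str.split(', ')).explode('method').reset_index(drop=True)
print(out)
#    id      name       method
# 0   1  adenosis  mammography
# 1   1  adenosis          mri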
I hope this helps, all the best

Data selection using pandas

I have a file where the separator(delimiter) is ';' . I read that file into a pandas dataframe df. Now, I want to select some rows from df using a criteria from column c in df. The format of data in column c is as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', then I want to select those rows from df, where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know, I can write a for loop and iter over each object in the dataframe but I was wondering if I could do something in a more compact and efficient way.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
dat=df.ix[:,c].str.split('|')
But, I do not know how to index 'dat'. 'dat' is a Pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But, it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank you in advance
Here is an approach:
Data
df = pd.DataFrame({'c':['history|science','science|chemistry','geography|science','biology|IT'],'col2':range(4)})
Out[433]:
                   c  col2
0    history|science     0
1  science|chemistry     1
2  geography|science     2
3         biology|IT     3
lst = ['geography', 'biology','IT']
Resolution
You can use a list comprehension:
df.loc[pd.Series([not x.split('|')[0] in lst for x in df.c.tolist()])]
Out[444]:
                   c  col2
0    history|science     0
1  science|chemistry     1
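The same filter can also be written with the vectorized .str accessor instead of a Python-level comprehension; a sketch on the same df and lst as above:
df[~df['c'].str.split('|').str[0].isin(lst)]
# keeps only the rows whose first '|'-separated word is not in lst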
