Append dataframe to a specified existing column - python

I'm writing code that needs to append new data to a specific column of a dataframe.
Here is an extract of the code below.
Descriptive summary of the code:
I have two variables (i and j) that I want to copy to a pandas dataframe.
I started by creating an empty dataframe with column names (4 columns in total).
Once the variables (i and j) are calculated in the for loops, I want to copy them to the dataframe into their respective columns (i_column and j_column, respectively).
I am getting an error on the commented-out code lines (df = df.append...).
import pandas as pd

df = pd.DataFrame(columns=['i_column', 'j_column', 'type', 'Location'])
for i in range(1, 10):
    i = 3 + i
    print(i)
    #df = df.append([i], column=['i_column'])
for j in range(5, 12):
    j = j + 5
    print(j)
    #df = df.append([j], column=['j_column'])
print(df)
Currently I'm getting this error:
TypeError: append() got an unexpected keyword argument 'column'
Instead, I want to append the dataframe with the i values in i_column and the j values in j_column. Please advise the correct line of code for this.

Maybe you want something like this?
import numpy as np
import pandas as pd

columns = ['col_i', 'col_j', 'col_notused', 'col_alsonotused']
df = pd.DataFrame(columns=columns)
vals_i = [1, 2, 3]
vals_j = [2, 3, 1]
for index, (i, j) in enumerate(zip(vals_i, vals_j)):
    df_temp = pd.DataFrame(columns=columns)
    df_temp.loc[index] = (i, j, np.nan, np.nan)
    df = df.append(df_temp)
print(df)
Output:
   col_i  col_j  col_notused  col_alsonotused
0    1.0    2.0          NaN              NaN
1    2.0    3.0          NaN              NaN
2    3.0    1.0          NaN              NaN
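Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas, a minimal sketch of the same idea is to collect the rows first and build the frame once (same toy vals_i and vals_j as above):
import numpy as np
import pandas as pd

columns = ['col_i', 'col_j', 'col_notused', 'col_alsonotused']
vals_i = [1, 2, 3]
vals_j = [2, 3, 1]
#build the rows in a plain list, then construct the DataFrame in one call;
#this also avoids the quadratic cost of growing a frame row by row
rows = [(i, j, np.nan, np.nan) for i, j in zip(vals_i, vals_j)]
df = pd.DataFrame(rows, columns=columns)
print(df)
If you want to keep the loop structure instead, pd.concat([df, df_temp]) is the drop-in replacement for df.append(df_temp).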

Related

Replacing all instances of standalone "." in a pandas dataframe

Beginner question incoming.
I have a dataframe derived from an excel file with a column that I will call "input".
In this column are floats (e.g. 7.4, 8.1, 2.2,...). However, there are also some wrong values such as strings (which are easy to filter out) and, what I find difficult, single instances of "." or "..".
I would like to clean the column to generate only numeric float values.
I have used this approach for other columns, but cannot do so here because if I get rid of the "." instances, my floats will be messed up:
for col in [col for col in new_df.columns if col.startswith("input")]:
    new_df[col] = new_df[col].str.replace(r',| |\-|\^|\+|#|j|0|.', '', regex=True)
    new_df[col] = pd.to_numeric(new_df[col], errors='raise')
I have also tried the following, but it then replaces every value in the column with None:
for index, row in new_df.iterrows():
    col_input = row['input']
    if re.match(r'^-?\d+(?:.\d+)$', str(col_input)) is None:
        new_df["input"] = None
How do I get rid of the dots?
Thanks!
You can simply use pandas.to_numeric and pass errors='coerce', without the loop:
from io import StringIO
import pandas as pd
s = """input
7.4
8.1
2.2
foo
foo.bar
baz/foo"""
df = pd.read_csv(StringIO(s))
df['input'] = pd.to_numeric(df['input'], errors='coerce')
# Outputs:
print(df)
   input
0    7.4
1    8.1
2    2.2
3    NaN
4    NaN
5    NaN
df.dropna(inplace=True)
print(df)
   input
0    7.4
1    8.1
2    2.2
If you need to clean up multiple mixed columns, use:
cols = ['input', ...] # put here the name of the columns concerned
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=cols, inplace=True)
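Applied to the asker's setup, where the relevant column names start with "input" (new_df here is the hypothetical excel-derived frame from the question), the cleanup collapses to:
input_cols = [c for c in new_df.columns if c.startswith("input")]
new_df[input_cols] = new_df[input_cols].apply(pd.to_numeric, errors='coerce')
new_df.dropna(subset=input_cols, inplace=True)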

Using pandas.DataFrame.at() in a for loop

pandas DataFrame I'm starting with:
pandas DataFrame I'm trying to build:
I'm very new to computer science so I wasn't quite sure how to word my question without providing images. Basically, I want to build a pandas DataFrame with a single row whose columns are named -3 to 3, where each value is the maximum absolute value of the second column of the first DataFrame among the rows whose first-column value matches that column name.
I also have the same data in a list as shown here:
Here is what I've tried but I keep getting an error:
Here's the solution. Looping over the dataframe to get what you want seems like overkill.
import pandas as pd

df = pd.DataFrame([[-1,1],[-2,2],[-2,1],[-2,2],[-1,6],[-1,2],[-1,1],
                   [1,-2],[2,-2],[1,-2],[2,-1],[6,-1],[2,-1],[1,-1]])
y = df.groupby(0)[1].max()  # per-key maximum of column 1
z = df.groupby(0)[1].min()  # per-key minimum of column 1
x = dict()
for i in range(-3, 4):
    try:
        if y[i] < 0:
            # every value for this key is negative, so the
            # minimum is the one with the largest absolute value
            x[i] = z[i]
        else:
            x[i] = y[i]
    except KeyError:
        x[i] = 0  # key does not occur in the data
x = pd.DataFrame(x, index=[0])
which gives the result
   -3 -2 -1  0  1  2  3
0   0  2  6  0 -2 -2  0
This results in a dataframe that also has a column for '0'; that should be easy to drop at any point.
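As a side note, a more vectorized sketch of "the value with the largest absolute value per key" is possible with groupby and idxmax (same toy data; the reindex fills the keys missing from -3..3 with 0):
import pandas as pd

df = pd.DataFrame([[-1,1],[-2,2],[-2,1],[-2,2],[-1,6],[-1,2],[-1,1],
                   [1,-2],[2,-2],[1,-2],[2,-1],[6,-1],[2,-1],[1,-1]])
#row label of the largest absolute value of column 1, per key in column 0
idx = df[1].abs().groupby(df[0]).idxmax()
result = df.loc[idx, 1]
result.index = idx.index
#keep only the keys -3..3, filling absent keys with 0
result = result.reindex(range(-3, 4), fill_value=0)
print(result.to_frame().T)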

If a cell contains more than one string, put each one into a new cell in Pandas

So I'm working with Pandas and I have multiple words (i.e. strings) in one cell, and I need to put every word into a new row while keeping the coordinated data. I've found a method which could help me, but it works with numbers, not strings.
So what method do I need to use?
Simple example of my table:
id  name      method
1   adenosis  mammography, mri
And I need it to be:
id  name      method
1   adenosis  mammography
              mri
Thanks!
UPDATE:
That's what I'm trying to do, following @jezrael's proposal:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("./dev/eyetoai/google_form_pure.xlsx")
xl.sheet_names
df = xl.parse("Form Responses 1")
df.groupby(['Name of condition','Condition description','Relevant Modality','Type of finding Mammography', 'Type of finding MRI', 'Type of finding US']).mean()
splitted = df['Relevant Modality'].str.split(',')
l = splitted.str.len()
df = pd.DataFrame({col: np.repeat(df[col], l) for col in ['Name of condition','Condition description']})
df['Relevant Modality'] = np.concatenate(splitted)
But I have this type of error:
TypeError: repeat() takes exactly 2 arguments (3 given)
You can use read_excel + split + stack + drop + join + reset_index:
#define columns which need split by , and then flatten them
cols = ['Condition description','Relevant Modality']
#read excel file to dataframe
df = pd.read_excel('Untitled 1.xlsx')
#print (df)
df1 = pd.DataFrame({col: df[col].str.split(',', expand=True).stack() for col in cols})
print (df1)
                                 Condition description Relevant Modality
0 0  Fibroadenomas are the most common cause of a b...       Mammography
  1                                                NaN                US
  2                                                NaN               MRI
1 0                    Papillomas are benign neoplasms       Mammography
  1                                  arising in a duct                US
  2   either centrally or peripherally within the b...               MRI
  3   leading to a nipple discharge. As they are of...               NaN
  4                 the discharge may be bloodstained.               NaN
2 0                                                 OK       Mammography
3 0                                      breast cancer       Mammography
  1                                                NaN                US
4 0                                breast inflammation       Mammography
  1                                                NaN                US
#remove original columns
df = df.drop(cols, axis=1)
#create Multiindex in original df for align rows
df.index = [df.index, [0]* len(df.index)]
#join original to flattened columns, remove Multiindex
df = df1.join(df).reset_index(drop=True)
#print (df)
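On pandas 0.25 or newer, the simple id/name/method example from the top of the question can also be flattened with DataFrame.explode; a minimal sketch (toy data reconstructed from the question):
import pandas as pd

df = pd.DataFrame({'id': [1], 'name': ['adenosis'], 'method': ['mammography, mri']})
#split the comma-separated cell into a list, then give each element its own row
df['method'] = df['method'].str.split(', ')
df = df.explode('method').reset_index(drop=True)
print(df)
   id      name       method
0   1  adenosis  mammography
1   1  adenosis          mri
explode repeats the id and name values for every list element, which matches the desired table.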
The previous answer is correct; I think you should use the id as the reference.
An easier way could possibly be to just parse the method string into a list:
method_list = method.split(',')
method_list = np.asarray(method_list)
If you have any trouble with indexing when initializing your DataFrame, just set the index like this:
pd.DataFrame(data, index=[0, 0])
df.set_index('id')
Passing the list as the value for your method key will automatically create a copy of both the index 'id' and 'name':
id  method       name
1   mammography  adenosis
1   mri          adenosis
I hope this helps, all the best
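A minimal end-to-end sketch of this second approach, using a hypothetical data dict for the one-row example above (the list for method is aligned to the duplicated index, while the scalar values are broadcast):
import pandas as pd

data = {'id': 1, 'name': 'adenosis', 'method': ['mammography', 'mri']}
#index=[0, 0] duplicates the row so each method gets its own line
df = pd.DataFrame(data, index=[0, 0]).set_index('id')
print(df)
        name       method
id
1   adenosis  mammography
1   adenosis          mri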

How to add a prefix to rows of a column if conditions are met

I have a data frame with certain columns and rows, and I need to add a prefix to the rows of one of the columns if they meet a certain condition,
df = pd.DataFrame({'col':['a',0,2,3,5],'col2':['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
And I need to add a prefix to df.col2 based on values in the Samples dataframe, which I tried with np.where as follows,
df['col2'] = np.where(df.col2.isin(samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
which throws this error:
TypeError: can only perform ops with scalar values
It doesn't return what I am asking for. In the end the data frame should look like this:
>>> df.head()
col  col2
a    Yes_PFD_1
0    no_PFD_2
2    no_PFD_3
3    no_PFD_4
5    Yes_PFD_5
Your code worked fine for me once I changed the capitalization of 'samples':
import pandas as pd
import numpy as np
df = pd.DataFrame({'col':['a',0,2,3,5],'col2': ['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
df['col2'] = np.where(df.col2.isin(Samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
df['col2']
Output:
0 YesPFD_1
1 Non_PFD_2
2 Non_PFD_3
3 Non_PFD_4
4 YesPFD_5
Name: col2, dtype: object
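Note the output above has YesPFD_1 rather than the Yes_PFD_1 shown in the expected result; adjusting the string literals in the same np.where call gives the exact prefixes asked for:
df['col2'] = np.where(df.col2.isin(Samples.Sam), 'Yes_' + df.col2, 'no_' + df.col2)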

Python pandas dataframe getting NaN instead of values

I have a large dataset where multiple columns had NaN values. I used python pandas to replace the missing values in a few columns by the mean and the rest by the median. I got rid of all the NaN values and wrote the resulting DataFrame to a new file.
Now when I read the new file again it contains NaNs instead of values. I am unable to figure out why this is happening. Below is my code for reference:
df = pd.DataFrame.from_csv('temp_train.csv',header=0)
df.prop_review_score=df.prop_review_score.fillna(0)
mean_score_2 = np.mean(df.prop_location_score2)
df.prop_location_score2 = df.prop_location_score2.fillna(mean_score_2)
median_search_query = np.median(df.srch_query_affinity_score)
df.srch_query_affinity_score = df.srch_query_affinity_score.fillna(median_search_query)
median_orig_distance = np.median(df.orig_destination_distance)
df.orig_destination_distance = df.orig_destination_distance.fillna(median_orig_distance)
df.to_csv('final_train_data.csv')
Now in another script when I type the following I get NaNs in srch_query_affinity_score
df = pd.DataFrame.from_csv('final_train_data.csv',header=0)
print df
I would recommend using pandas.DataFrame.median instead of numpy.median on the dataframe.
A quick test for me shows (when there are NaNs in the data as Woody suggests):
df = pd.DataFrame({'x': [10, np.nan, np.nan, 20]})
df.x.median()   # returns 15.0 (NaN values are skipped)
np.median(df.x) # returns nan
So consider replacing:
median_search_query = np.median(df.srch_query_affinity_score)
with
median_search_query = df.srch_query_affinity_score.median()
To make sure, before you go to csv, do something like:
assert df.srch_query_affinity_score.isnull().sum() == 0
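If you want a broader sanity check before writing the csv, a sketch that asserts none of the filled columns still contain NaNs (column names taken from the question's code):
filled = ['prop_review_score', 'prop_location_score2',
          'srch_query_affinity_score', 'orig_destination_distance']
assert df[filled].isnull().sum().sum() == 0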
