I am writing code that requires appending new data to specific columns of a pandas DataFrame.
Here is an extract of the code below.
Descriptive summary of the code:
I have two variables (i and j) that I want to copy into a pandas DataFrame.
I started by creating an empty DataFrame with column names (4 columns in total).
Once the variables (i and j) are calculated in the for loops, I want to copy them into their respective columns of the DataFrame (i_column and j_column, respectively).
I am getting an error in the commented-out code lines (df = df.append...).
import pandas as pd

df = pd.DataFrame(columns=['i_column', 'j_column', 'type', 'Location'])
for i in range(1, 10):
    i = 3 + i
    print(i)
    #df = df.append([i], column=['i_column'])
for j in range(5, 12):
    j = j + 5
    print(j)
    #df = df.append([j], column=['j_column'])
print(df)
Currently I am getting this error:
TypeError: append() got an unexpected keyword argument 'column'
Instead, I want to append i values to i_column and j values to j_column. Please advise on the correct code for this.
Maybe you want something like this?
import numpy as np
import pandas as pd

columns = ['col_i', 'col_j', 'col_notused', 'col_alsonotused']
df = pd.DataFrame(columns=columns)
vals_i = [1, 2, 3]
vals_j = [2, 3, 1]
for index, (i, j) in enumerate(zip(vals_i, vals_j)):
    df_temp = pd.DataFrame(columns=columns)
    df_temp.loc[index] = (i, j, np.nan, np.nan)
    df = df.append(df_temp)
print(df)
Output:
col_i col_j col_notused col_alsonotused
0 1.0 2.0 NaN NaN
1 2.0 3.0 NaN NaN
2 3.0 1.0 NaN NaN
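Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A sketch of the same loop using pd.concat instead (same made-up column names and values as above):

```python
import numpy as np
import pandas as pd

columns = ['col_i', 'col_j', 'col_notused', 'col_alsonotused']
vals_i = [1, 2, 3]
vals_j = [2, 3, 1]

# Collect one single-row frame per iteration, then concatenate once at the end
rows = []
for index, (i, j) in enumerate(zip(vals_i, vals_j)):
    rows.append(pd.DataFrame([(i, j, np.nan, np.nan)], columns=columns, index=[index]))
df = pd.concat(rows)
print(df)
```

Concatenating once at the end is also much faster than appending inside the loop, since each append copies the whole frame.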
Beginner question incoming.
I have a dataframe derived from an excel file with a column that I will call "input".
In this column are floats (e.g. 7.4, 8.1, 2.2,...). However, there are also some wrong values such as strings (which are easy to filter out) and, what I find difficult, single instances of "." or "..".
I would like to clean the column to generate only numeric float values.
I have used this approach for other columns, but cannot do so here because if I get rid of the "." instances, my floats will be messed up:
for col in [col for col in new_df.columns if col.startswith("input")]:
    new_df[col] = new_df[col].str.replace(r',| |\-|\^|\+|#|j|0|.', '', regex=True)
    new_df[col] = pd.to_numeric(new_df[col], errors='raise')
I have also tried the following, but it then replaces every value in the column with None:
for index, row in new_df.iterrows():
    col_input = row['input']
    if re.match(r'^-?\d+(?:.\d+)$', str(col_input)) is None:
        new_df["input"] = None
How do I get rid of the dots?
Thanks!
You can simply use pandas.to_numeric and pass errors='coerce', without the loop:
from io import StringIO
import pandas as pd
s = """input
7.4
8.1
2.2
foo
foo.bar
baz/foo"""
df = pd.read_csv(StringIO(s))
df['input'] = pd.to_numeric(df['input'], errors='coerce')
# Output:
print(df)
input
0 7.4
1 8.1
2 2.2
3 NaN
4 NaN
5 NaN
df.dropna(inplace=True)
print(df)
input
0 7.4
1 8.1
2 2.2
If you need to clean up multiple mixed columns, use:
cols = ['input', ...] # put here the name of the columns concerned
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=cols, inplace=True)
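As a quick check that the asker's problematic "." and ".." entries are also handled, a small sketch (values made up for illustration):

```python
import pandas as pd

s = pd.Series(['7.4', '.', '..', '8.1', '2.2'])
cleaned = pd.to_numeric(s, errors='coerce')  # '.' and '..' are not valid numbers -> NaN
print(cleaned.dropna().tolist())
```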
pandas DataFrame I'm starting with:
pandas DataFrame I'm trying to build:
I'm very new to computer science, so I wasn't quite sure how to word my question without providing images. Basically, I want to build a one-row pandas DataFrame whose column names run from -3 to 3, where each value is the second-column entry of the first DataFrame with the largest absolute value among the rows whose first-column value matches that column name.
I also have the same data in a list a shown here:
Here is what I've tried but I keep getting an error:
Here's a solution; looping over the DataFrame row by row to get what you want seems like overkill.
import pandas as pd

df = pd.DataFrame([[-1,1],[-2,2],[-2,1],[-2,2],[-1,6],[-1,2],[-1,1],[1,-2],[2,-2],[1,-2],[2,-1],[6,-1],[2,-1],[1,-1]])
y = df.groupby(0)[1].max()  # per-key maximum
z = df.groupby(0)[1].min()  # per-key minimum
x = dict()
for i in range(-3, 4):
    try:
        # if even the maximum is negative, the largest absolute value is the minimum
        if y[i] < 0:
            x[i] = z[i]
        else:
            x[i] = y[i]
    except KeyError:
        x[i] = 0  # key absent from the data
x = pd.DataFrame(x, index=[0])
which gives the result
   -3  -2  -1  0  1  2  3
0   0   2   6  0 -2 -2  0
This results in a DataFrame with a column for 0, which is easy to drop at any point.
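A fully vectorized variant (a sketch, replacing the try/except loop with agg and idxmax on the absolute values):

```python
import pandas as pd

df = pd.DataFrame([[-1,1],[-2,2],[-2,1],[-2,2],[-1,6],[-1,2],[-1,1],
                   [1,-2],[2,-2],[1,-2],[2,-1],[6,-1],[2,-1],[1,-1]])

# For each key in column 0, keep the column-1 value with the largest magnitude,
# then reindex to the full -3..3 range, filling missing keys with 0
signed_max = (df.groupby(0)[1]
                .agg(lambda s: s.loc[s.abs().idxmax()])
                .reindex(range(-3, 4), fill_value=0))
result = signed_max.to_frame().T
print(result)
```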
So I'm working with pandas, and I have multiple words (i.e. strings) in one cell, and I need to put every word into a new row while keeping the coordinated data. I've found a method which could help me, but it works with numbers, not strings.
So what method do I need to use?
Simple example of my table:
id name method
1 adenosis mammography, mri
And I need it to be:
id name method
1 adenosis mammography
mri
Thanks!
UPDATE:
This is what I'm trying to do, following @jezrael's proposal:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("./dev/eyetoai/google_form_pure.xlsx")
xl.sheet_names
df = xl.parse("Form Responses 1")
df.groupby(['Name of condition','Condition description','Relevant Modality','Type of finding Mammography', 'Type of finding MRI', 'Type of finding US']).mean()
splitted = df['Relevant Modality'].str.split(',')
l = splitted.str.len()
df = pd.DataFrame({col: np.repeat(df[col], l) for col in ['Name of condition','Condition description']})
df['Relevant Modality'] = np.concatenate(splitted)
But I have this type of error:
TypeError: repeat() takes exactly 2 arguments (3 given)
You can use read_excel + split + stack + drop + join + reset_index:
#define columns which need split by , and then flatten them
cols = ['Condition description','Relevant Modality']
#read csv to dataframe
df = pd.read_excel('Untitled 1.xlsx')
#print (df)
df1 = pd.DataFrame({col: df[col].str.split(',', expand=True).stack() for col in cols})
print (df1)
Condition description Relevant Modality
0 0 Fibroadenomas are the most common cause of a b... Mammography
1 NaN US
2 NaN MRI
1 0 Papillomas are benign neoplasms Mammography
1 arising in a duct US
2 either centrally or peripherally within the b... MRI
3 leading to a nipple discharge. As they are of... NaN
4 the discharge may be bloodstained. NaN
2 0 OK Mammography
3 0 breast cancer Mammography
1 NaN US
4 0 breast inflammation Mammography
1 NaN US
#remove original columns
df = df.drop(cols, axis=1)
#create Multiindex in original df for align rows
df.index = [df.index, [0]* len(df.index)]
#join original to flattened columns, remove Multiindex
df = df1.join(df).reset_index(drop=True)
#print (df)
The previous answer is correct; I think you should use the id as the reference.
An easier way could possibly be to just parse the method string into a list:
method_list = method.split(',')
method_list = np.asarray(method_list)
If you have any trouble with indexing when initializing your DataFrame, just set the index:
pd.DataFrame(data, index=[0, 0])
df.set_index('id')
Passing the list as a value for your method key will automatically create a copy of both the index 'id' and 'name':
id method name
1 mammography adenosis
1 mri adenosis
I hope this helps, all the best
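In current pandas (0.25+), DataFrame.explode does exactly this; a minimal sketch using the table from the question:

```python
import pandas as pd

df = pd.DataFrame({'id': [1], 'name': ['adenosis'], 'method': ['mammography, mri']})

# Split the comma-separated cell into a list, then expand each element to its own row;
# the other columns are repeated automatically
out = df.assign(method=df['method'].str.split(',')).explode('method')
out['method'] = out['method'].str.strip()
print(out)
```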
I have a DataFrame in which I need to add a prefix to the values of one of the columns when they meet a certain condition:
df = pd.DataFrame({'col':['a',0,2,3,5],'col2':['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
I need to add a prefix to df.col2 based on the values in the Samples DataFrame, and I tried it with np.where as follows:
df['col2'] = np.where(df.col2.isin(samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
which throws this error:
TypeError: can only perform ops with scalar values
It doesn't return what I am asking for.
In the end the DataFrame should look like:
>>>df.head()
col col2
a Yes_PFD_1
0 no_PFD_2
2 no_PFD_3
3 no_PFD_4
5 Yes_PFD_5
Your code worked fine for me once I changed the capitalization of 'samples':
import pandas as pd
import numpy as np
df = pd.DataFrame({'col':['a',0,2,3,5],'col2': ['PFD_1','PFD_2','PFD_3','PFD_4','PFD_5']})
Samples=pd.DataFrame({'Sam':['PFD_1','PFD_5']})
df['col2'] = np.where(df.col2.isin(Samples.Sam),'Yes' + df.col2, 'Non_'+ df.col2)
df['col2']
Output:
0 YesPFD_1
1 Non_PFD_2
2 Non_PFD_3
3 Non_PFD_4
4 YesPFD_5
Name: col2, dtype: object
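Note the output above reads YesPFD_1 rather than the Yes_PFD_1 shown in the question; to get the underscore (and the lowercase no_ prefix the question asks for), adjust the two string literals:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': ['a', 0, 2, 3, 5],
                   'col2': ['PFD_1', 'PFD_2', 'PFD_3', 'PFD_4', 'PFD_5']})
Samples = pd.DataFrame({'Sam': ['PFD_1', 'PFD_5']})

# isin builds a boolean mask; np.where picks the prefixed string element-wise
df['col2'] = np.where(df.col2.isin(Samples.Sam), 'Yes_' + df.col2, 'no_' + df.col2)
print(df['col2'].tolist())
```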
I have a large dataset in which multiple columns have NaN values. I used pandas to replace the missing values in a few columns by the mean and the rest by the median. I got rid of all the NaN values and wrote the resulting DataFrame to a new file.
Now when I read the new file again, it contains NaNs instead of values. I am unable to figure out why this is happening. Below is my code for reference:
df = pd.DataFrame.from_csv('temp_train.csv',header=0)
df.prop_review_score=df.prop_review_score.fillna(0)
mean_score_2 = np.mean(df.prop_location_score2)
df.prop_location_score2 = df.prop_location_score2.fillna(mean_score_2)
median_search_query = np.median(df.srch_query_affinity_score)
df.srch_query_affinity_score = df.srch_query_affinity_score.fillna(median_search_query)
median_orig_distance = np.median(df.orig_destination_distance)
df.orig_destination_distance = df.orig_destination_distance.fillna(median_orig_distance)
df.to_csv('final_train_data.csv')
Now in another script when I type the following I get NaNs in srch_query_affinity_score
df = pd.DataFrame.from_csv('final_train_data.csv',header=0)
print df
I would recommend using pandas.DataFrame.median instead of numpy.median on the DataFrame.
A quick test shows (when there are NaNs in the data, as Woody suggests):
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [10, np.nan, np.nan, 20]})
df.x.median()   # returns 15.0 (NaNs are skipped)
np.median(df.x) # returns nan
So consider replacing:
median_search_query = np.median(df.srch_query_affinity_score)
with
median_search_query = df.srch_query_affinity_score.median()
To make sure before you go to csv do something like:
assert df.srch_query_affinity_score.isnull().sum() == 0
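Putting it together, each column can be filled from its own NaN-skipping median in one step (the column name here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [1.0, np.nan, 3.0, np.nan, 5.0]})

# Series.median skips NaN, so the fill value is well defined (median of [1, 3, 5] is 3.0)
df['score'] = df['score'].fillna(df['score'].median())
print(df['score'].tolist())
```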