I ran into a problem reading my data back: the first column is assigned as the index column even though I pass index_col=None. A similar issue is posted as "pandas read_csv index_col=None not working with delimiters at the end of each line".
raw_data = {'patient': ['spried & roy']*5,
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
patient obs treatment score
0 spried & roy 1 0 strong
1 spried & roy 2 1 weak
2 spried & roy 3 0 normal
3 spried & roy 1 1 weak
4 spried & roy 2 0 strong
Writing df to CSV in tab-separated format:
df.to_csv('xgboost.txt', sep='\t', index=False)
Reading it back again:
read_df=pd.read_table(r'xgboost.txt', header=0,index_col=None, skiprows=0, skipfooter=0, sep="\t",delim_whitespace=True)
read_df
patient obs treatment score
spried & roy 1 0 strong
& roy 2 1 weak
& roy 3 0 normal
& roy 1 1 weak
& roy 2 0 strong
As you can see, the patient column was split into "spried &" and "roy", and "spried &" became the index column even though I explicitly passed index_col=None.
How can I read the patient column back intact and control whether or not an index column is created?
Thanks.
Just remove delim_whitespace=True: it makes pandas split on any whitespace instead of tabs and overrides your sep='\t'. Reading with only the sep='\t' parameter and the file name works:
df.to_csv('xgboost.txt', sep='\t', index=False)
read_df=pd.read_table(r'xgboost.txt', sep="\t")
print (read_df)
patient obs treatment score
0 spried & roy 1 0 strong
1 spried & roy 2 1 weak
2 spried & roy 3 0 normal
3 spried & roy 1 1 weak
4 spried & roy 2 0 strong
Another idea is to write the file with a whitespace separator, so that delim_whitespace=True works nicely:
df.to_csv('xgboost.txt', sep=' ', index=False)
read_df=pd.read_table(r'xgboost.txt', delim_whitespace=True)
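As a side note on the index_col part of the question: the read_csv/read_table documentation describes index_col=False (not None) as a way to stop pandas from using the first column as the index when malformed rows carry more fields than the header. It would not un-split the patient column here, so it is only an aside, not a replacement for dropping delim_whitespace=True:
# Sketch: index_col=False (unlike the default index_col=None) tells pandas
# never to treat the first column as the index, even if data rows have more
# fields than the header row.
read_df = pd.read_table('xgboost.txt', sep='\t', index_col=False)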
I have a data frame containing one row:
df_1D = pd.DataFrame({'Day1':[5],
'Day2':[6],
'Day3':[7],
'ID':['AB12'],
'Country':['US'],
'Destination_A':['Miami'],
'Destination_B':['New York'],
'Destination_C':['Chicago'],
'First_Agent':['Jim'],
'Second_Agent':['Ron'],
'Third_Agent':['Cynthia']},
)
Day1 Day2 Day3 ID ... Destination_C First_Agent Second_Agent Third_Agent
0 5 6 7 AB12 ... Chicago Jim Ron Cynthia
I'm wondering if there's an easy way to transform it into a dataframe with three rows, as shown here:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia
Have you tried to pivot it with the .pivot function? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
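Not .pivot itself, but a pandas-only alternative along those lines could be pd.lreshape, if you are willing to list the column groups by hand (the lists below come from the example frame, so adjust them to your real column names):
import pandas as pd

# pd.lreshape melts each group into one long column and repeats the remaining
# columns (ID, Country) for every resulting row.
out = pd.lreshape(df_1D, {
    'Day': ['Day1', 'Day2', 'Day3'],
    'Destination': ['Destination_A', 'Destination_B', 'Destination_C'],
    'Agent': ['First_Agent', 'Second_Agent', 'Third_Agent'],
})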
One option using reshaping, which only requires knowing the final columns:
# define final columns
cols = ['Day', 'ID', 'Destination', 'Country', 'Agent']
# the part below is automatic
# ------
# extract the keywords
pattern = f"({'|'.join(cols)})"
new = df_1D.columns.str.extract(pattern)[0]
# and reshape
out = (df_1D
       .set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount()]), axis=1)
       .loc[0].unstack(0).ffill()[cols]
      )
Output:
Day ID Destination Country Agent
0 5 AB12 Miami US Jim
1 6 AB12 New York US Ron
2 7 AB12 Chicago US Cynthia
An alternative, defining idx/cols separately:
idx = ['ID', 'Country']
cols = ['Day', 'Destination', 'Agent']
df2 = df_1D.set_index(idx)
pattern = f"({'|'.join(cols)})"
new = df2.columns.str.extract(pattern)[0]
out = (df2
       .set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount().astype(str)],
                                           names=[None, None]),
                 axis=1)
       .stack().reset_index(idx)
      )
columns_day = [col for col in df_1D if col.startswith('Day')]
columns_dest = [col for col in df_1D if col.startswith('Destination')]
columns_agent = [col for col in df_1D if 'Agent' in col]
new_df = pd.DataFrame()
new_df['Day'] = df_1D[columns_day].values.tolist()[0]
new_df['ID'] = list(df_1D['ID']) * len(new_df)
new_df['Country'] = list(df_1D['Country']) * len(new_df)
new_df['Destination'] = df_1D[columns_dest].values.tolist()[0]
new_df['Agent'] = df_1D[columns_agent].values.tolist()[0]
Out:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia
You can use this regardless of how many Destination columns are repeated.
One option is with pivot_longer from pyjanitor, where for this case, you pass a list of regexes to names_pattern, and the new column names to names_to:
# pip install pyjanitor
import janitor
import pandas as pd
(df_1D
 .pivot_longer(
     index=['ID', 'Country'],
     names_to=['Day', 'Destination', 'Agent'],
     names_pattern=['Day', 'Destination', 'Agent'])
)
ID Country Day Destination Agent
0 AB12 US 5 Miami Jim
1 AB12 US 6 New York Ron
2 AB12 US 7 Chicago Cynthia
I don't think there is a way to treat this fully automatically; it requires some manual manipulation. This is the shortest code that comes to my mind. Feel free to comment:
d = df_1D.to_dict('list')  # work with the one-row frame as a plain dict
d1 = {}
for k in ['Day', 'Destination', 'Agent']:
    d1[k] = [d[i][0] for i in d.keys() if k in i]
for k in ['ID', 'Country']:
    d1[k] = d[k] * len(d1['Day'])
d1 = pd.DataFrame(d1)
Output:
   Day Destination    Agent    ID Country
0    5       Miami      Jim  AB12      US
1    6    New York      Ron  AB12      US
2    7     Chicago  Cynthia  AB12      US
Hope this helps.
I have 2 dataframes that I would like to merge on a common column. However, the values in the columns I would like to merge on are not identical strings; rather, a string from one is contained in the other, like so:
import pandas as pd
df1 = pd.DataFrame({'column_a':['John','Michael','Dan','George', 'Adam'], 'column_common':['code','other','ome','no match','word']})
df2 = pd.DataFrame({'column_b':['Smith','Cohen','Moore','K', 'Faber'], 'column_common':['some string','other string','some code','this code','word']})
The outcome I would like from df1.merge(df2, ...) is the following:
column_a | column_b
----------------------
John | Moore <- merged on 'code' contained in 'some code'
Michael | Cohen <- merged on 'other' contained in 'other string'
Dan | Smith <- merged on 'ome' contained in 'some string'
George | n/a
Adam | Faber <- merged on 'word' contained in 'word'
New Answer
Here is one approach based on pandas/numpy.
# For each pattern in df1.column_common, collect the df2 'column_b' values whose
# 'column_common' contains that pattern, then keep the first match per row.
rhs = (df1.column_common
       .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
       .bfill(axis=1)
       .iloc[:, 0])
# Put the matches back next to column_a
(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'column_a', 1: 'column_b'}))
column_a column_b
0 John Moore
1 Michael Cohen
2 Dan Smith
3 George NaN
4 Adam Faber
Old Answer
Here's a solution with inner-join behaviour, i.e. it doesn't keep column_a values that do not match any column_b value. It is slower than the pandas/numpy solution above because it uses two nested iterrows loops to build a Python list.
tups = [(a1, a2) for i, (a1, b1) in df1.iterrows()
for j, (a2, b2) in df2.iterrows()
if b1 in b2]
(pd.DataFrame(tups, columns=['column_a', 'column_b'])
.drop_duplicates('column_a')
.reset_index(drop=True))
column_a column_b
0 John Moore
1 Michael Cohen
2 Dan Smith
3 Adam Faber
My solution involves applying a function to the common column. I can't imagine it holds up well when df2 is large, but perhaps someone more knowledgeable than I am can suggest an improvement.
def strmerge(strcolumn):
    for i in df2['column_common']:
        if strcolumn in i:
            return df2[df2['column_common'] == i]['column_b'].values[0]

df1['column_b'] = df1['column_common'].apply(strmerge)
df1
column_a column_common column_b
0 John code Moore
1 Michael other Cohen
2 Dan ome Smith
3 George no match None
4 Adam word Faber
A simple and readable approach could be to do a cross join and then keep the rows where one column_common value is a substring of the other:
df = df1.merge(df2, how='cross')
# rows with no possible match should end up with a missing column_b
df.loc[df.column_common_x.eq('no match'), 'column_b'] = pd.NA
# keep a row when df1's string is contained in df2's string (or when there is
# no match at all), then take the first hit per column_a
df.loc[df.apply(lambda x: x.column_common_x in x.column_common_y or x.column_common_x == 'no match', axis=1),
       ['column_a', 'column_b']].drop_duplicates(subset=['column_a'])
Output:
column_a   column_b
John       Moore
Michael    Cohen
Dan        Smith
George
Adam       Faber
I'm currently trying to use a number of medical codes to find out whether a person has a certain disease, and I would appreciate some help, as I have searched for a couple of days without finding anything. Assuming I've imported Excel file 1 into df1 and Excel file 2 into df2, how do I use Excel file 2 to identify which diseases the patients in Excel file 1 have, and indicate them with a column header? Below is an example of what the data looks like. I'm currently using pandas in a Jupyter notebook.
Excel file 1:
Patient   Primary Diagnosis   Secondary Diagnosis   Secondary Diagnosis 2   Secondary Diagnosis 3
Alex      50322               50111
John      50331               60874                 50226                   74444
Peter     50226               74444
Peter     50233               88888
Excel File 2:
Primary Diagnosis        Medical Code
Diabetes Type 2          50322
Diabetes Type 2          50331
Diabetes Type 2          50233
Cardiovescular Disease   50226
Hypertension             50111
AIDS                     60874
HIV                      74444
HIV                      88888
Intended output:
Patient   Positive for Diabetes Type 2   Positive for Cardiovascular Disease   Positive for Hypertension   Positive for AIDS   Positive for HIV
Alex      1                              1                                     0                           0                   0
John      1                              1                                     0                           1                   1
Peter     1                              1                                     0                           0                   1
You can use merge and pivot_table:
out = (
    df1.melt('Patient', var_name='Diagnosis', value_name='Medical Code').dropna()
       .merge(df2, on='Medical Code').assign(dummy=1)
       .pivot_table('dummy', 'Patient', 'Primary Diagnosis', fill_value=0)
       .add_prefix('Positive for ').rename_axis(columns=None).reset_index()
)
Output:
Patient   Positive for AIDS   Positive for Cardiovescular Disease   Positive for Diabetes Type 2   Positive for HIV   Positive for Hypertension
Alex      0                   0                                     1                              0                  1
John      1                   1                                     1                              1                  0
Peter     0                   1                                     1                              1                  0
IIUC, you could melt df1, then map the codes from reshaped df2, finally pivot_table on the output:
diseases = df2.set_index('Medical Code')['Primary Diagnosis']
(df1
 .reset_index()
 .melt(id_vars=['index', 'Patient'])
 .assign(disease=lambda d: d['value'].map(diseases),
         value=1,
         )
 .pivot_table(index='Patient', columns='disease', values='value', fill_value=0)
)
output:
disease AIDS Cardiovescular Disease Diabetes Type 2 HIV Hypertension
Patient
Alex 0 0 1 0 1
John 1 1 1 1 0
Peter 0 1 1 1 0
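If you also want the "Positive for" prefix and Patient back as a regular column, as in the intended output, you could chain the same finishing calls as the first answer (this assumes the pivoted result above has been stored in a variable, say out):
out = out.add_prefix('Positive for ').rename_axis(columns=None).reset_index()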
Maybe you could convert your Excel file 2 into some form of key-value mapping, then replace the diagnosis codes in file 1 with the corresponding disease names, and later apply some form of encoding like one-hot or similar to file 1. Not sure if this approach would definitely help, but just sharing my thoughts.
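A minimal sketch of that idea, assuming df1/df2 hold the two Excel files with the column names shown in the question, and using a map plus pd.get_dummies in place of a literal replace:
import pandas as pd

# Build a code -> disease lookup from file 2 (assumes the codes have the same
# dtype in both files)
code_to_disease = dict(zip(df2['Medical Code'], df2['Primary Diagnosis']))

# Replace every diagnosis code in file 1 with its disease name
diag_cols = df1.columns.drop('Patient')
diseases = df1[diag_cols].apply(lambda s: s.map(code_to_disease))

# One-hot encode per row, then combine duplicate patient rows
# (1 = the patient has at least one code for that disease)
dummies = pd.get_dummies(diseases.stack(), dtype=int).groupby(level=0).max()
out = (pd.concat([df1[['Patient']], dummies], axis=1)
         .groupby('Patient', as_index=False).max()
         .rename(columns=lambda c: c if c == 'Patient' else f'Positive for {c}'))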
I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code to replace values:
df = {'Name':['al', 'el', 'naila', 'dori','jlo'],
'living':['Alvando','Georgia GG','Newyork NY','Indiana IN','Florida FL'],
'sample2':['malang','kaltim','ambon','jepara','sragen'],
'output':['KOTA','KAB','WILAYAH','KAB','DAERAH']
}
df = pd.DataFrame(df)
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
But I am actually expecting this output from simpler code that doesn't repeat replace:
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
I've tried using np.where, but it doesn't give the desired result: all results display 0, even for the rows whose value should be 1.
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})
This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
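As for the np.where attempt in the question: it fails because the condition is evaluated against the whole frame and the result is flattened, rather than being computed on the single column. A small sketch of how it could be made to work (an alternative to the chained replace above, not part of it):
import numpy as np

# Compare only the 'output' column and write the 0/1 result back to it
df['output'] = np.where(df['output'].eq('KAB'), 1, 0)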
I want to read a file that has a partial header, i.e. some columns have names and some do not. I want to read the file as it is: keep the names of the columns that already have names and leave the rest as they are. Is there any clean way to do that in pandas?
The short answer to your question is no, since a pandas dataframe cannot keep multiple empty column names. If you import a .csv file with several unnamed columns, you won't get the behavior you expect: pandas fills in the empty column names with Unnamed: 0, Unnamed: 1 and so on (or possibly something else if there is a space in place of the column name in the .csv file).
For example, this .csv file, with the names of columns 0, 3, 4 and 5 removed...
,Doe,120 jefferson st.,,,
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
...will get imported in the following way:
Unnamed: 0 Doe 120 jefferson st. Unnamed: 3 Unnamed: 4 Unnamed: 5
0 Jack McGinnis 220 hobo Av. Phila PA 9119
1 John "Da Man" Repici 120 Jefferson St. Riverside NJ 8075
2 Stephen Tyler 7452 Terrace "At the Plaza" road SomeTown SD 91234
3 NaN Blankman NaN SomeTown SD 298
4 Joan "the bone", Anne Jet 9th, at Terrace plc Desert City CO 123
If, for example, you have missing column names for columns 1 and 2, you will have this structure after reading the file normally with pandas:
df.head()
Unnamed: 0 Unnamed: 1 col3 col4 col5
0 .. ..
1 .. ..
After reading the df, you can rename the unnamed columns as below (note that the result is assigned back and the keys match the Unnamed: 0 / Unnamed: 1 labels shown above):
df = df.rename(columns={'Unnamed: 0': 'Col1', 'Unnamed: 1': 'Col2'})
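If you would rather not hard-code each rename, a small sketch (assuming every auto-generated name should simply become a generic ColN label) could be:
# Replace every "Unnamed: N" placeholder produced by read_csv and keep the
# columns that already had real names.
df.columns = [f'Col{i}' if str(c).startswith('Unnamed') else c
              for i, c in enumerate(df.columns)]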