I'm trying some operations on an Excel file using pandas. I want to extract some columns from an Excel file, add another column to those extracted columns, and then write all the columns to a new Excel file. To do this I have to append the new column to the old columns.
Here is my code:
import pandas as pd

# Read the Excel file (Work.xlsx is the input file)
ex_file = 'Work.xlsx'
data = pd.read_excel(ex_file, 'Data')

# Create a subset by extracting columns D, I, J, AU from the file
# (note: parse_cols was renamed to usecols in newer pandas versions)
data_subset_columns = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")

# Compute a new column 'Percentage'
# ('Num Labels' and 'Num Tracks' are two existing columns in the file)
data['Percentage'] = data['Num Labels'] / data['Num Tracks']
data1 = data['Percentage']
print(data1)

# Here I'm trying to append data['Percentage'] to data_subset_columns
Final_data = data_subset_columns.append(data1)
print(Final_data)
Final_data.to_excel('111.xlsx')
No error is shown, but Final_data does not give the expected results: the data is not appended as a new column.
There is no need to explicitly append columns in pandas. When you calculate a new column, it is included in the dataframe, and when you export to Excel the new column will be included.
Try this, assuming 'Num Labels' and 'Num Tracks' are among the columns "D,I,J,AU" (otherwise add them to the list):
import pandas as pd

ex_file = 'Work.xlsx'
data_subset = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")
data_subset['Percentage'] = data_subset['Num Labels'] / data_subset['Num Tracks']
data_subset.to_excel('111.xlsx')
The append function of a dataframe adds rows, not columns. (That said, it does add columns when the appended rows contain columns not present in the source dataframe.)
DataFrame.append(other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
I think you are looking for something like concat.
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
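Applied back to the question's data, the same idea would look like this (a sketch, assuming data_subset_columns and data1 come from the code above and share the default row index):

import pandas as pd

# axis=1 glues the 'Percentage' Series on as a new column,
# aligned with data_subset_columns on the shared row index
Final_data = pd.concat([data_subset_columns, data1], axis=1)
Final_data.to_excel('111.xlsx')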
Related
Hi, I am working with pandas to manipulate some lab data. I currently have a dataframe with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value(1)) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns, based on matching CAS numbers (i.e. rows where CAS NO(2) equals CAS NO(1))?
I am new to Python and pandas. Thank you for your help.
You can reorder the columns by reassigning the df variable to a slice of itself, indexed with a list whose entries are the column names in question:
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
Better provide the input data in text format so we can copy-paste it. I understand your question like this: you need to sort the last two columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need the duplicated CAS NO(2) column, right?
Split off the last two columns into their own frame, convert that to a dict keyed by CAS number, and use the dict to map the new values:
# Split off the last two columns and index them by CAS number
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the first three columns of the original dataframe
df = df[['Analyte', 'CAS NO(1)', 'Value(1)']]
# Copy CAS NO(1) into CAS NO(2)
df['CAS NO(2)'] = df['CAS NO(1)']
# Create Value(2) on the original dataframe by looking up each CAS number
df['Value(2)'] = df['CAS NO(1)'].map(df_tmp.to_dict()['Value(2)'])
Try the following:
import pandas as pd
import numpy as np

# Create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan] * len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(list(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2)),
                  columns=['CASNo(1)', 'Value(1)', 'CAS NO(2)', 'Value(2)'],
                  index=['Benzene', 'Ethylbenzene', 'Gasoline Range Organics',
                         'Methyl-tert-butyl ether'])
# Split the data into two dataframes
df1 = df[['CASNo(1)', 'Value(1)']]
df2 = df[['CAS NO(2)', 'Value(2)']]

# Merge df2 into df1 on the specified columns;
# reset_index and set_index make sure df_adjusted
# keeps the same index names as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')
But be careful with duplicate CAS numbers in your columns: duplicated merge keys will produce unexpected extra rows.
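If the right-hand frame can contain repeated CAS numbers, one way to guard against that is to de-duplicate the key column before merging (a sketch; drop_duplicates keeps the first occurrence by default):

# keep one row per CAS number so each merge key on the right is unique
df2_unique = df2.dropna().drop_duplicates(subset='CAS NO(2)')
df_adjusted = df1.reset_index().merge(df2_unique,
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')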
AttributeError: 'NoneType' object has no attribute 'transpose'
I have been trying to extract cells as dictionaries (from a pandas dataframe) and join them with the existing data.
For example, I have a CSV file which contains two columns, id and device_type. Each cell in the device_type column contains dictionary data, which I am trying to split out and add back to the original data.
I am trying to do something like the code below.
import json
import pandas as pd

df = pd.read_csv('D:\\1. Work\\csv.csv', header=0)
sf = df.head(12)
# this is the failing line: fillna(..., inplace=True) returns None,
# so there is nothing to call .transpose() on
sf['visitor_home_cbgs'].fillna("{}", inplace=True).transpose()
A sample of the CSV file:
ID,device_type
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5}
f5d639a64a88496099,{}
The desired output looks like this:
id,device_type,ttype,tvalue
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"060379800281",11
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"061110053031",5
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"060372062002",5
f5d639a64a88496099,{},NIL,NIL
Avoid inplace=True:
sf['visitor_home_cbgs'].fillna("{}").transpose()
When you pass inplace=True, fillna modifies the dataframe in place and returns None, which is why the chained .transpose() call fails.
If you want to use inplace=True, split it into two steps:
sf['visitor_home_cbgs'].fillna("{}", inplace=True)
sf.transpose()
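A minimal standalone demonstration of why the chained call fails (a sketch):

import pandas as pd
import numpy as np

s = pd.Series([np.nan, '{"060379800281": 11}'])
print(s.fillna("{}"))                # returns a new, filled Series
print(s.fillna("{}", inplace=True))  # prints None: inplace returns nothing,
                                     # so .transpose() on it raises AttributeError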
To create rows from column values:
One solution is to iterate through the dataframe rows and build a new dataframe with the desired columns and values.
import json
import pandas as pd

def extract_JSON(row):
    # expand one row's JSON dict into one output row per key/value pair
    df2 = pd.DataFrame(columns=['ID', 'device_type', 'ttype', 'tvalue'])
    parsed = json.loads(row['device_type'])
    for key in parsed:
        df2.loc[len(df2)] = [row['ID'], row['device_type'], key, parsed[key]]
    if df2.empty:
        # empty dict: emit a single row with blank ttype/tvalue
        df2.loc[0] = [row['ID'], row['device_type'], '', '']
    return df2

df3 = pd.DataFrame(columns=['ID', 'device_type', 'ttype', 'tvalue'])
for _, row in df.iterrows():
    df3 = df3.append(extract_JSON(row))  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
df3
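An alternative, more vectorized sketch (assuming pandas >= 0.25 for DataFrame.explode; the sample CSV calls the column device_type while the question's code refers to visitor_home_cbgs, so substitute whichever name your file actually uses):

import json
import pandas as pd

# parse each JSON cell; missing cells become empty dicts
parsed = df['device_type'].fillna('{}').apply(json.loads)
# one (key, value) pair per output row; empty dicts map to ('NIL', 'NIL')
pairs = parsed.apply(lambda d: list(d.items()) or [('NIL', 'NIL')])
out = df.join(pairs.rename('pairs')).explode('pairs')
out['ttype'] = [k for k, _ in out['pairs']]
out['tvalue'] = [v for _, v in out['pairs']]
out = out.drop(columns='pairs')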
I'm new to Python and pandas and am working to create a pandas MultiIndex with two independent variables, flow and head, across 27 different design points. The data is currently organized in a single dataframe with a column for each variable and a row for each design point.
Here's how I created the MultiIndex:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]]
dp = df.loc[:, "Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=['DP', 'Flows'])
I then created three columns of data:
df0 = df.loc[:, "Head2D"]
df1 = df.loc[:, "Head2D.1"]
df2 = df.loc[:, "Head2D.2"]
I then want to merge these into a single column of data so that I can use this command:
pc = pd.DataFrame(data, index=index)
Using the three columns, which share the same row indexes (0-27), I want to merge them into a single column with the data interspersed. If I call the columns col1, col2, and col3, and denote the row index in parentheses (so col1(0) means column 1, index 0), I want the data to look like:
col1(0)
col2(0)
col3(0)
col1(1)
col2(1)
col3(1)
col1(2)...
It is a bit confusing, but what I understood is that you are trying to do this:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]]
dp = df.loc[:, "Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=['DP', 'Flows'])
df0 = df.loc[:, "Head2D"]
df1 = df.loc[:, "Head2D.1"]
df2 = df.loc[:, "Head2D.2"]
data = pd.concat([df0, df1, df2])
pc = pd.DataFrame(data=data, index=index)
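Note that pd.concat([df0, df1, df2]) simply stacks the three series end to end. To get the interleaved ordering asked for (col1(0), col2(0), col3(0), col1(1), ...), one option is to concatenate side by side and flatten row-major with stack; a sketch, assuming the three series share the same row index:

# side-by-side concat, then stack() flattens row by row,
# which yields exactly the interleaved order shown in the question
# (note: stack() drops NaNs by default; pass dropna=False to keep them)
data = pd.concat([df0, df1, df2], axis=1).stack()
pc = pd.DataFrame(data.values, index=index)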
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and display that dataframe. However, I don't see the dataframe; I receive Series([], dtype: object) as the output. Below is the code I'm working with.
My document consists of:
product, sub_product, issue, sub_issue, consumer_complaint_narrative,
company_public_response, company, state, zipcode, tags,
consumer_consent_provided, submitted_via, date_sent_to_company,
company_response_to_consumer, timely_response, consumer_disputed?,
complaint_id
I want to extract:
sub_product, issue, sub_issue, consumer_complaint_narrative
import pandas as pd

df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df = df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd

input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1, 2, 3, 4]
df = df[df.columns[cols]]
In cols, specify the numbers of the columns you want to select; dataframe columns are numbered starting from index 0:
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name", "Column Name2"]]
A simple way to achieve this would be as follows (note that .loc slices by column label, so 'B':'F' only applies if your columns are actually labelled that way):
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:, 'B':'F']
Hope that helps.
This worked for me, using positional slicing:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are both column positions in the range. For example, if you want columns 3-5, use:
df1 = df.iloc[:, 3:5]
For the first column, use:
df1 = df.iloc[:, 0]
Though not sure how to select a discontinuous range of columns.
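For a discontinuous selection, iloc also accepts a list of positions (a sketch with hypothetical column positions):

df1 = df.iloc[:, [0, 3, 5]]  # picks the 1st, 4th and 6th columns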
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will output the top 3 rows of columns 2-3 (remember, numbering starts at 0).
I have a pandas dataframe which I created from data stored in an XML file.
Initially, the XML file is opened and parsed:
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary that lists all the data names (which are used as column names) as keys and gives the position of each item in the XML file:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList=[]
dateList.append('date')
headers = dateList+sortedKeys
I then create an empty pandas dataframe with the same number of rows as there are records in trendData and with the column headers set to headers, and then loop through the file, filling the dataframe:
df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for key in Parameters:
        result[key] = b.findtext(Parameters[key])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype of every column is set to object, whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example:
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create the list of columns that should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column cast to another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
Calling data.dtypes should now return the following:
date int64
type object
dtype: object
For multiple columns, use a for loop to run through the int64list you mentioned in your question.
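For example (a sketch, looping over the int64list built in the question; pd.to_numeric converts one column at a time):

for col in int64list:
    df[col] = pd.to_numeric(df[col])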
For multiple columns you can also do it this way:
cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)  # assumes numpy is imported as np