BACKGROUND:
I have two columns: 'address' and 'raw_data'. The dataset looks like this:
This is just a sample I made up; the original dataset is over 6 million rows and in a different language.
Problem:
I need to find all the rows where 'address' and 'raw_data' do not match, meaning some sort of mistake was made when logging the data from 'address' into 'raw_data'.
I'm fairly new to pandas. My plan is to split the 'raw_data' column on commas, then compare the newly produced columns with the original 'address' column (to see whether the 'address' column contains that info; if not, that means there is a mistake).
Like I said, I'm new to pandas and this is what I have so far.
import pandas as pd
columns = ['address', 'raw_data']
df = pd.read_csv('address.csv', usecols=columns)
df = pd.concat([df['address'], df['raw_data'].str.split(',', expand=True)], axis=1)
Now the new columns have info like this: "CITY":"ATLANTA". I want the columns to contain just ATLANTA, without the colons and 'CITY', in order to compare the info with the 'address' column.
How should I go about it?
Also, at this point of my pandas learning experience, I do not yet know how to compare two columns. Could someone help a newbie out please? Thanks a lot!
PS: by comparison of two columns I mean checking whether one column contains the characters of the second column, not whether the two columns are equal. Just want to point that out.
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 2], [3, 6], [1, 1]], columns=["col1", "col2"])
comparison_column = np.where(df["col1"] == df["col2"], True, False)
df["equal"] = comparison_column
   col1  col2  equal
0     2     2   True
1     3     6  False
2     1     1   True
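Note that the question actually asks for a containment check rather than strict equality. A minimal sketch of that variant (sample values made up):

import pandas as pd

df = pd.DataFrame([["ATLANTA", '"CITY":"ATLANTA"'], ["BOSTON", '"CITY":"NYC"']],
                  columns=["address", "raw_data"])
# Row-wise check: is the address string contained in the raw_data string?
df["contains"] = df.apply(lambda row: row["address"] in row["raw_data"], axis=1)
print(df)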
I will use this data:
import numpy as np
import pandas as pd
j = {"address":"foo","b": "bar"}
j2 = {"address":"foo2","b": "bar2"}
values = [["foo", j], ["bar", j2]]
df = pd.DataFrame(data=values, columns=["address", "raw_data"])
df
address raw_data
0 foo {'address': 'foo', 'b': 'bar'}
1 bar {'address': 'foo2', 'b': 'bar2'}
I will separate the raw_data dicts into columns (with .values.tolist()) in another DataFrame (df2):
df2 = pd.DataFrame(df['raw_data'].values.tolist())
df2
address b
0 foo bar
1 foo2 bar2
To compare you use:
df.address == df2.address
0 True
1 False
If you need to save this in the original df, you can add a column:
df["result"] = df.address == df2.address
You don't need to split on commas; just treat each entry as a dict. You can map custom functions to columns with the apply function. In this case you define a function that accesses the keys of the dictionary and extracts the values.
df['address_raw'] = df['raw_data'].apply(lambda x: x['address'])
df['city_raw'] = df['raw_data'].apply(lambda x: x['CITY'])
df['addrline2_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_2'])
df['addrline3_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_3'])
df['utmnorthing_raw'] = df['raw_data'].apply(lambda x: x['UTM_NORTHING'])
These lines will create a column for each field in the dict, and then you can compare them like:
df['address'] == df['address_raw']
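One caveat: if the data comes from a CSV, raw_data will typically be a string rather than a real dict, so it may need parsing first. A minimal sketch assuming the strings are valid JSON (key names borrowed from above):

import json

import pandas as pd

df = pd.DataFrame({"raw_data": ['{"address": "foo", "CITY": "ATLANTA"}']})
# Parse each JSON string into a dict, then extract individual keys
parsed = df["raw_data"].apply(json.loads)
df["city_raw"] = parsed.apply(lambda d: d.get("CITY"))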
Related
I have saved out a data column as follows:
[[A,1], [B,5], [C,18]....]
I was hoping to group A, B, C as shown above into Category and 1, 5, 18 into Values/Series for updating my PowerPoint chart using python-pptx.
Example:
Category  Values
A         1
B         5
Is there any way I can do it? Currently the above example is extracted as a string, so I believe I have to convert it to lists first?
Thanks in advance!
Try to parse your string (a list of lists in string form), then create your dataframe from the real list:
import pandas as pd
import re
s = '[[A,1], [B,5], [C,18]]'
cols = ['Category', 'Values']
data = [row.split(',') for row in re.findall(r'\[([^]]+)\]', s[1:-1])]
df = pd.DataFrame(data, columns=cols)
print(df)
# Output:
Category Values
0 A 1
1 B 5
2 C 18
You should be able to just use pandas.DataFrame and pass in your data, unless I'm misunderstanding the question. Anyway, try:
df = pandas.DataFrame(data=d, columns = ['Category', 'Value'])
where d is your list of tuples.
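For instance, assuming the string has already been parsed into a list of tuples:

import pandas as pd

d = [("A", 1), ("B", 5), ("C", 18)]  # assumed already-parsed data
df = pd.DataFrame(data=d, columns=["Category", "Value"])
print(df)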
from prettytable import PrettyTable

column = [["A", 1], ["B", 5], ["C", 18]]
columnname = []
columnvalue = []
t = PrettyTable(['Category', 'Values'])
for data in column:
    columnname.append(data[0])
    columnvalue.append(data[1])
    t.add_row([data[0], data[1]])
print(t)
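Since the goal was updating a PowerPoint chart, the two lists can then feed python-pptx. A rough sketch, assuming chart is an existing chart object already obtained from your slide:

from pptx.chart.data import CategoryChartData

chart_data = CategoryChartData()
chart_data.categories = columnname              # ['A', 'B', 'C']
chart_data.add_series('Series 1', columnvalue)  # [1, 5, 18]
chart.replace_data(chart_data)  # chart assumed, e.g. slide.shapes[0].chart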
Hi, I am working with pandas to manipulate some lab data. I currently have a DataFrame with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value(1)) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns based on matching CAS numbers (i.e. CAS NO(2) = CAS NO(1))?
I am new to Python and pandas. Thank you for your help.
You can reorder the columns by reassigning the df variable as a slice of itself, indexed on a list whose entries are the column names in question.
colidx = ['Analyte', 'CAS NO(1)', 'Value(1)', 'CAS NO(2)', 'Value(2)']
df = df[colidx]
Better to provide input data in text format so we can copy-paste it. I understand your question like this: you need to sort the two last columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need the duplicated CAS NO(2) column, right?
Split off the two last columns, set CAS NO(2) as the index, convert the result to a dict, and use that dict to map new values.
# Split off the 2 last columns and set CAS NO(2) as the index.
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the 3 first columns of the original dataframe
df = df[['Analyte', 'CASNo(1)', 'Value(1)']]
# Now copy CASNo(1) to CAS NO(2)
df['CAS NO(2)'] = df['CASNo(1)']
# Now create the Value(2) column by looking up each CAS number in the dict
df['Value(2)'] = df['CASNo(1)'].map(df_tmp.to_dict()['Value(2)'])
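As a minimal illustration of the mapping step (sample values invented):

import pandas as pd

s = pd.Series(['71-43-2', '100-41-4'])
lookup = {'100-41-4': '18'}  # CAS number -> Value(2)
print(s.map(lookup))         # unmatched rows become NaN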
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan]*len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2),
                  columns=['CASNo(1)', 'Value(1)', 'CAS NO(2)', 'Value(2)'],
                  index=['Benzene', 'Ethylbenzene', 'Gasoline Range Organics',
                         'Methyl-tert-butyl ether'])
# split the data into two dataframes
df1 = df[['CASNo(1)', 'Value(1)']]
df2 = df[['CAS NO(2)', 'Value(2)']]

# merge df2 into df1 on the specified columns; reset_index and set_index
# ensure df_adjusted keeps the same index labels as df1
df_adjusted = (df1.reset_index()
                  .merge(df2.dropna(),
                         how='left',
                         left_on='CASNo(1)',
                         right_on='CAS NO(2)')
                  .set_index('index'))
But be careful with duplicates in your key columns: a left merge on duplicated keys produces one output row per match, so rows get multiplied rather than an error being raised.
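A tiny demonstration of that pitfall (toy data):

import pandas as pd

left = pd.DataFrame({'key': ['a'], 'v1': [1]})
right = pd.DataFrame({'key': ['a', 'a'], 'v2': [2, 3]})
# One left row matches two right rows -> two output rows
print(left.merge(right, on='key', how='left'))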
Could someone please look at the code below and advise what I have done wrong?
I have 2 pandas DataFrames, df and x1.
Both have the same columns and column names.
I have to execute the below set of code for df.Date_Appointment and x1.Date_Appointment, and similarly for df.Date_Scheduled and x1.Date_Scheduled. As such, I created a list of the columns and a list of the dataframes.
I am trying to write a single piece of code, but obviously I am doing something wrong. Please advise.
import pandas as pd
df = pd.read_csv('file1.csv')
x1 = pd.read_csv('file2.csv')
# x1 is a dataframe created after filtering on one column.
# df and x1 have the same number of columns and column names.
# x1 is a subset of df.
dataframe = ['df', 'x1']
column = ['Date_Appointment', 'Date_Scheduled']

def df_det(dataframe.column):
    for df_det in dataframe.column:
        d_da = df_det.describe()
        mean_da = df_det.value_counts().mean()
        median_da = df_det.value_counts().median()
        mode_da = df_det.value_counts().mode()
        print('Details of all appointments', '\n',
              d_da, '\n',
              'Mean = ', mean_da, '\n',
              'Median = ', median_da, '\n',
              'Mode = ', mode_da, '\n')
Please indicate the steps.
Thank you in advance.
It looks like your function should have two arguments -- dataframes and columns -- both of which are lists, so I made the names plural.
Then you need to loop over each argument. Note that you are also assigning a variable inside the function the same name as the function itself, so I changed the function's name.
dataframes = [dataframe1, dataframe2]
columns = ['Date_Appointment', 'Date_Scheduled']

def summary_stats(dataframes, columns):
    for df in dataframes:
        for col in columns:
            df_det = df.loc[:, col]
            # print summary stats about df_det
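Filling in the print statements from the original attempt, a complete sketch (dataframe names assumed) might look like:

import pandas as pd

def summary_stats(dataframes, columns):
    for df in dataframes:
        for col in columns:
            counts = df[col].value_counts()
            print('Details of', col, '\n',
                  df[col].describe(), '\n',
                  'Mean = ', counts.mean(), '\n',
                  'Median = ', counts.median(), '\n',
                  'Mode = ', counts.mode().tolist(), '\n')

# Hypothetical usage:
# summary_stats([df, x1], ['Date_Appointment', 'Date_Scheduled'])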
So I have a pandas DataFrame, df, with columns that represent a taxonomical classification (i.e. Kingdom, Phylum, Class, etc.). I also have a list of taxonomic labels that corresponds to the order I would like the DataFrame to be sorted by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list corresponds to the DataFrame column df['Class']. I would like to sort all the rows of the whole dataframe based on the order of the list, as df['Class'] is currently in a different order. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
Class Number
0 Gammaproteobacteria 3
1 Bacteroidetes 5
2 Negativicutes 6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
Number
Bacteroidetes 5
Negativicutes 6
Gammaproteobacteria 3
Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list; i.e., if your input data at some point in time does not contain "Negativicutes", this script will fail. One way to get past this is to append the df slices to a list and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class'] == i])
ordered_df = pd.concat(df_list)
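Another option (not from either answer above) is an ordered categorical, which also tolerates classes missing from the data; a sketch assuming df and class_list as defined earlier:

import pandas as pd

# Classes absent from class_list become NaN and sort last
df['Class'] = pd.Categorical(df['Class'], categories=class_list, ordered=True)
df = df.sort_values('Class')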
I have a dataframe which can be generated from the code given below:
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
                   'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'],
                   'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'],
                   'val3': [7, 9, 11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, with values like DC0001, DC0002, etc. Does i always have to be numeric? Instead of reshaping, it adds the stub values as new columns in my dataset and returns zero rows.
This is what my real columns look like.
My real data might contain NAs as well, so do I have to fill them with default values for wide_to_long to work?
Can you please help with what the issue might be? Any other approach that achieves the same result is also welcome.
Try adding an additional argument to the function that allows string suffixes:
pd.wide_to_long(......................., suffix=r'\w+')
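For instance, a minimal sketch with a string subject_ID and a non-numeric suffix (column names invented):

import pandas as pd

df = pd.DataFrame({'subject_ID': ['DC0001', 'DC0002'],
                   'dateA': ['1/1/2020', '2/1/2020'],
                   'valA': [1, 2]})
# i can be a string column; non-numeric suffixes need suffix=r'\w+'
long_df = pd.wide_to_long(df, stubnames=['date', 'val'],
                          i='subject_ID', j='grp', suffix=r'\w+')
print(long_df)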
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix to group by. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re

def change_names(df, regex):
    # Select one of the three column groups
    old_cols = df.filter(regex=regex).columns
    # Create a list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make the new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create a dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename the columns
    df.rename(columns=dd, inplace=True)
    return df

tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'],
                    't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'],
                    't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'],
                    't3val': [7, 9, 11]})

# Change the date columns, then the value columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
This is quite late to answer this question, but putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'],
                    't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'],
                    't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'],
                    't3val': [7, 9, 11]})

# You can use m13op22's solution to rename your columns so the numeric part
# is at the end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                          'h2date': 'hdate2', 't2val': 'tval2',
                          'h3date': 'hdate3', 't3val': 'tval3'})

# Then use the non-numeric portion (in this example 'hdate', 'tval') as
# stubnames. The mistake was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'],
                     i='person_id', j='grp').sort_index(level=0)
print(df)
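For reference, the final print should produce roughly this shape:

                    hdate  tval
person_id grp
1         1    12/31/2007     2
          2    12/31/2017     1
          3    12/31/2027     7
2         1    11/25/2009     4
          2    11/25/2019     3
          3    11/25/2029     9
3         1    10/06/2005     6
          2    10/06/2015     5
          3    10/06/2025    11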