I have a tricky problem selecting columns in a dataframe: several of its columns share the same name, "PTime".
This is my dataframe:
PTime first_column PTime third_column PTime fourth_column
0 4 first_value 1 first_value 6 first_value
1 4 second_value 2 second_value 7 second_value
This is what I want:
PTime first_column PTime fourth_column
0 4 first_value 6 first_value
1 4 second_value 7 second_value
I will select my columns from a list:
My code:
import pandas as pd

# Note: a dict literal keeps only the last value for a repeated key,
# so this builds a single 'PTime' column rather than three.
data = {'PTime': ['1', '1'],
        'first_column': ['first_value', 'second_value'],
        'PTime': ['2', '2'],
        'third_column': ['first_value', 'second_value'],
        'PTime': ['4', '4'],
        'fourth_column': ['first_value', 'second_value'],
        }
list_c = ['PTime','first_column','fourth_column']
df = pd.DataFrame(data)
#df = df[df.columns.intersection(list_c)]
df = df[list_c]
df
So my goal is to select each column named in the list together with the column immediately to its left. If you have any idea how to do that, thank you very much. Regards
I don't know exactly how to get the column to the left of each one in the list,
but I have a trick to get the table you want, as shown:
PTime first_column PTime fourth_column
0 4 first_value 6 first_value
1 4 second_value 7 second_value
What we can do is simply remove the unwanted columns by position.
But because several columns share the same name, dropping by label removes every column with that label, not just one.
So first rename the duplicated columns (for example PTime1, PTime2, PTime3) and then drop by position:
df.drop(df.columns[i], axis=1, inplace=True)
# or
df = df.drop(df.columns[i], axis=1)
To drop several columns, pass a list of positions. In your case, after renaming the columns, it would be:
df.drop(df.columns[[2,3]], axis=1)
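The renaming step described above could be sketched roughly as follows. This is only one possible approach: the numbering scheme (PTime1, PTime2, ...) and the variable names `seen` and `new_names` are assumptions for illustration, and the sample values are copied from the table in the question.

```python
import pandas as pd

# Build a frame that really has three columns named 'PTime'
# (a list of rows plus an explicit column list allows duplicate names).
df = pd.DataFrame(
    [['4', 'first_value', '1', 'first_value', '6', 'first_value'],
     ['4', 'second_value', '2', 'second_value', '7', 'second_value']],
    columns=['PTime', 'first_column', 'PTime',
             'third_column', 'PTime', 'fourth_column'])

# Number every occurrence of a name that appears more than once.
is_dup = df.columns.duplicated(keep=False)
seen = {}          # hypothetical counter per duplicated name
new_names = []
for name, dup in zip(df.columns, is_dup):
    if dup:
        seen[name] = seen.get(name, 0) + 1
        new_names.append(f"{name}{seen[name]}")
    else:
        new_names.append(name)
df.columns = new_names

# With unique names, dropping by position affects exactly those columns.
result = df.drop(columns=df.columns[[2, 3]])
print(result)
```

If the names do not need to be unique afterwards, selecting by position with `df.iloc[:, [0, 1, 4, 5]]` would avoid the renaming altogether.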
In my real dataframe I will not have multiple columns with the same name; all names will be distinct.
With ten or so columns to select, it would be tedious to list every column and its left neighbour by hand, so I look up the positions instead:
import pandas as pd

data = {'PTime1': ['1', '1'],
        'first_column': ['first_value', 'second_value'],
        'PTime2': ['2', '2'],
        'third_column': ['first_value', 'second_value'],
        'PTime3': ['4', '4'],
        'fourth_column': ['first_value', 'second_value'],
        }
list_c = ['first_column','fourth_column'] # columns to select
df = pd.DataFrame(data) # create the dataframe
list_index = [] # positions of the columns to keep
for col in list_c:
    index_no = df.columns.get_loc(col) # position of the listed column
    # note: if index_no is 0, index_no-1 is -1 and selects the last column
    list_index.append(index_no-1) # position of the column to its left
    list_index.append(index_no) # position of the listed column itself
df = df.iloc[:, list_index] # subset the dataframe by position
df
This way I can select each column in my list together with the column to its left.
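The same idea can be written without an explicit loop. The sketch below is only one possible variant: it assumes unique column names, and the error handling for a missing column or a column with no left neighbour is an addition that is not in the original answer.

```python
import numpy as np
import pandas as pd

data = {'PTime1': ['1', '1'],
        'first_column': ['first_value', 'second_value'],
        'PTime2': ['2', '2'],
        'third_column': ['first_value', 'second_value'],
        'PTime3': ['4', '4'],
        'fourth_column': ['first_value', 'second_value']}
df = pd.DataFrame(data)
list_c = ['first_column', 'fourth_column']

# Positions of the listed columns; get_indexer returns -1 for a missing name.
pos = df.columns.get_indexer(list_c)
if (pos < 0).any():
    raise KeyError("some listed columns are not in the dataframe")
if (pos == 0).any():
    raise ValueError("the first column has no left neighbour")

# Interleave each left neighbour with its column: [p0-1, p0, p1-1, p1, ...]
positions = np.column_stack([pos - 1, pos]).ravel()
result = df.iloc[:, positions]
print(result)
```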
Related
I am trying to write a function that finds the columns with "100" in the header and replaces all values above 100 in those columns with NaN:
import pandas as pd
data = {'first_100': ['25', '1568200', '5'],
'second_column': ['first_value', 'second_value', 'third_value'],
'third_100':['89', '9', '589'],
'fourth_column':['first_value', 'second_value', 'third_value'],
}
df = pd.DataFrame(data)
print (df)
so this is the output I am looking for
Use filter to identify the columns with '100', to_numeric to ensure having numeric values, then mask with a boolean array:
cols = df.filter(like='100').columns
df[cols] = df[cols].mask(df[cols].apply(pd.to_numeric, errors='coerce').gt(100))
output:
first_100 second_column third_100 fourth_column
0 25 first_value 89 first_value
1 NaN second_value 9 second_value
2 5 third_value NaN third_value
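Since the question asks for a function, the two lines above could be wrapped as in the following sketch. The function name `mask_above` and its parameters are hypothetical, and working on a copy is a design choice made here rather than something the answer specifies.

```python
import pandas as pd

def mask_above(df, pattern, limit):
    """Return a copy of df where, in columns whose name contains
    `pattern`, values numerically greater than `limit` become NaN."""
    out = df.copy()
    cols = out.filter(like=pattern).columns
    # Convert to numbers only for the comparison; the stored values keep their type.
    numeric = out[cols].apply(pd.to_numeric, errors='coerce')
    out[cols] = out[cols].mask(numeric.gt(limit))
    return out

data = {'first_100': ['25', '1568200', '5'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_column': ['first_value', 'second_value', 'third_value']}
df = pd.DataFrame(data)
result = mask_above(df, '100', 100)
print(result)
```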
The data I have to work with is a bit messy. It has header names inside its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) the column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
0 1 2
0 1 2 3
1 foo bar baz
2 4 5 6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1 foo bar baz
0 1 2 3
2 4 5 6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1 foo bar baz
0 1 2 3
2 4 5 6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
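When the position of the header row is not known in advance, it can be located by one of its values, which is closer to what the question sketches. The following is a minimal illustration that assumes the value 'foo' appears in the first column of the header row; removing the row by position also sidesteps the non-unique index issue described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2, 3), ('foo', 'bar', 'baz'), (4, 5, 6)])

# Find the position of the first row whose first column holds the known header value.
mask = (df[0] == 'foo').to_numpy()
pos = int(np.flatnonzero(mask)[0])

# Promote that row to the header, then remove it by position (not by label).
df.columns = df.iloc[pos].tolist()
df = df.iloc[np.arange(len(df)) != pos].reset_index(drop=True)
print(df)
```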
This works (pandas v'0.19.2'):
df.rename(columns=df.iloc[0])
It would be easier to recreate the data frame.
This would also interpret the columns types from scratch.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
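If the rebuilt columns still come out with object dtype, chaining `infer_objects()` is one way to ask pandas to re-infer them. The sketch below assumes the header sits at position 1, as in the earlier example, and that the index labels are unique; `header_pos` is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame([(1, 2, 3), ('foo', 'bar', 'baz'), (4, 5, 6)])

header_pos = 1                          # assumed position of the header row
headers = df.iloc[header_pos]
body = df.drop(df.index[header_pos])    # assumes unique index labels

# Rebuild the frame from the remaining rows and re-infer the column dtypes.
new_df = pd.DataFrame(body.values, columns=list(headers)).infer_objects()
print(new_df.dtypes)
```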
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace = True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace = True)
You can specify the header row when reading the data with read_csv or read_html via the header parameter, which gives the row number(s) to use as the column names and marks the start of the data. This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2)
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
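One detail worth checking with input like this: the sample has a space after each comma, so the parsed column names may carry a leading space. A possible variant, using `skipinitialspace=True` to strip it, is sketched below; `csv_text` is just a local name used here.

```python
import pandas as pd
from io import StringIO

csv_text = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''

# header=2 uses the third line as column names and skips the lines before it;
# skipinitialspace=True drops the blank that follows each comma.
df = pd.read_csv(StringIO(csv_text), header=2, skipinitialspace=True)
print(list(df.columns))
```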
Keeping it Python simple
The pandas DataFrame constructor takes a columns argument, so why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
['testsala.cxf', '86', '95', '92', '87'],
['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
['630.cxf', '18', '8', '11', '18'],
['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])
Or, if the header is not the first row but, say, the 10th:
columns = table.pop(9)  # the 10th row is at index 9
data = pd.DataFrame(table,columns=columns)
import pandas as pd

df_1 = pd.DataFrame({'budget_id':['1', '2', '3', '4'],
                     'budget_amount':[200, 300, 400, 500]})
df_2 = pd.DataFrame({'budget_id':['1', '2', '3', '4', '5'],
                     'budget_amount':[200, 300, 400, 550, 700]})
# Raises ValueError here because the two frames differ in shape
df_1.compare(df_2, align_axis=0, keep_equal=True).rename(index={'self': 'Prev', 'other': 'New'}, level=1)
Desired output of df.compare():
budget_id budget_amount
4 550
5 700
I have two data frames that I wish to compare using df.compare. They both have the same columns and index labels.
However, I cannot guarantee that they have the same number of rows. This causes issues, as compare expects two DataFrames with the same shape.
I need to know if a new row has been added as part of the compare.
Would the best solution be to append blank rows to either data frame until they are equal? How would you do that?
Is there a more elegant way?
Does merge work for you?
(df_1.merge(df_2, on='budget_id', how='right')
     .query('budget_amount_x != budget_amount_y')
)
Output:
budget_id budget_amount_x budget_amount_y
3 4 500.0 550
4 5 NaN 700
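To tell new rows apart from changed ones, the merge can also carry an indicator column. The following is a sketch under the assumption that budget_id identifies a row; the suffixes `_prev` and `_new` are chosen here for readability and are not part of the original answer.

```python
import pandas as pd

df_1 = pd.DataFrame({'budget_id': ['1', '2', '3', '4'],
                     'budget_amount': [200, 300, 400, 500]})
df_2 = pd.DataFrame({'budget_id': ['1', '2', '3', '4', '5'],
                     'budget_amount': [200, 300, 400, 550, 700]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = df_1.merge(df_2, on='budget_id', how='outer',
                    suffixes=('_prev', '_new'), indicator=True)

# Keep rows that exist on one side only, or whose amount differs.
changed = merged[(merged['_merge'] != 'both')
                 | (merged['budget_amount_prev'] != merged['budget_amount_new'])]
print(changed)
```

Rows marked 'right_only' exist only in df_2 (added), and rows marked 'left_only' exist only in df_1 (removed).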
This is the solution I wrote based on Giovanni Frison's comment.
import pandas as pd

def compare_dataframes(df_1, df_2):
    if df_1.equals(df_2):
        return pd.DataFrame()
    else:
        # Get index labels of rows present in df_2 but not in df_1
        new_row_indexes = df_2.index.difference(df_1.index)
        new_rows = df_2.loc[new_row_indexes].copy()
        # Create a second index level to match the df.compare output
        new_rows[''] = 'New'
        new_rows = new_rows.set_index('', append=True)
        # Drop the new rows from df_2 so both frames have the same shape for df.compare
        df_2 = df_2.drop(new_row_indexes)
        compare_df = df_1.compare(df_2, align_axis=0, keep_equal=True).rename(
            index={'self': 'Prev', 'other': 'New'}, level=1)
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        compare_df = pd.concat([compare_df, new_rows])
        return compare_df
I want to add a list that is longer than the DataFrame as a new column, but I get an error: ValueError: Length of values (4) does not match length of index (3)
import pandas as pd
df = pd.DataFrame({'Data': ['1', '2', '3']})
df['Data2'] =['1', '2', '3', '4']
print(df)
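How can I fix it?
Use DataFrame.reindex to add rows up to the larger of the two lengths (the list or the original DataFrame) before assigning. Taking the maximum keeps all existing rows when the list is not longer than the DataFrame, although a list that is shorter than the DataFrame would still need padding before it can be assigned: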
df = pd.DataFrame({'Data': ['1', '2', '3']})
L = ['1', '2', '3', '4']
df = df.reindex(range(max(len(df), len(L))))
df['Data2'] = L
print (df)
Data Data2
0 1 1
1 2 2
2 3 3
3 NaN 4
If the list is always longer than the DataFrame:
df = df.reindex(range(len(L)))
df['Data2'] = L
You can also use pd.concat here: convert your list to a Series first, then concatenate along the columns.
l = ['1', '2', '3', '4']
pd.concat([df, pd.Series(l, name='Data2')], axis=1)
Data Data2
0 1 1
1 2 2
2 3 3
3 NaN 4
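For lists that may be either longer or shorter than the DataFrame, the two ideas above can be combined. This is a sketch only: the helper name `add_column` is hypothetical, and it assumes the DataFrame has a default 0..n-1 index. Assigning a Series instead of a list lets pandas align on the index and fill the gaps with NaN.

```python
import pandas as pd

def add_column(df, name, values):
    """Return a copy of df with `values` added as column `name`,
    padding whichever side is shorter with NaN (default RangeIndex assumed)."""
    n = max(len(df), len(values))
    out = df.reindex(range(n))
    # A Series is aligned on the index, so missing positions become NaN.
    out[name] = pd.Series(values, index=range(len(values)))
    return out

df = pd.DataFrame({'Data': ['1', '2', '3']})
longer = add_column(df, 'Data2', ['1', '2', '3', '4'])
shorter = add_column(df, 'Data2', ['1'])
print(longer)
print(shorter)
```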