The data I have to work with is a bit messy. It has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
     0    1    2
0    1    2    3
1  foo  bar  baz
2    4    5    6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1  foo  bar  baz
0    1    2    3
2    4    5    6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1  foo  bar  baz
0    1    2    3
2    4    5    6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
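For instance, a small sketch (the frame here is made up for illustration) of how drop-by-label behaves on a non-unique index:

import pandas as pd

# Two rows share the label 0, so drop(0) removes BOTH of them,
# not just the row at positional index 0.
df = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 0])
print(df.drop(0))
#    x
# 1  2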
This works (tested with pandas 0.19.2):
df.rename(columns=df.iloc[0])
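For instance, a one-liner sketch (the chaining here is my own addition, not part of the original answer) that promotes the first row and then removes it from the data:

# Rename the columns from row 0, drop that row, and renumber the rest.
df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)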
It may be easier to recreate the data frame. Note, though, that df.values produces a single object-dtype array, so the column types are not re-inferred automatically (see the follow-up after the code).
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
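Since the frame was rebuilt from an object array, a hedged follow-up (my own addition) to restore the numeric dtypes:

# infer_objects() re-infers each column's dtype from its contents,
# e.g. converting object columns of ints back to int64.
new_df = new_df.infer_objects()
print(new_df.dtypes)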
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace=True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace=True)
You can specify the header row in the read_csv or read_html functions via the header parameter, which represents the "row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In [1]:
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2)
print(df)
Out[1]:
   pears   apples   lemons   plums   other
0     40       50       61      72      85
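The same header argument works with read_html; a minimal sketch, assuming lxml or BeautifulSoup is installed (the HTML snippet is made up for illustration):

html = '''<table>
<tr><td>junk1</td><td>junk2</td></tr>
<tr><td>junk1</td><td>junk2</td></tr>
<tr><td>pears</td><td>apples</td></tr>
<tr><td>40</td><td>50</td></tr>
</table>'''

# read_html returns a list of DataFrames, one per <table> found;
# header=2 promotes the third row to column names, as with read_csv.
df = pd.read_html(StringIO(html), header=2)[0]
print(df)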
Keeping it simple with plain Python
pandas DataFrames have a columns attribute, so why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
['testsala.cxf', '86', '95', '92', '87'],
['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
['630.cxf', '18', '8', '11', '18'],
['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:], columns=table[0])
or, in case the header row is not the first but, for instance, the one at index location 10:
columns = table.pop(10)
data = pd.DataFrame(table, columns=columns)
Related
Let's say I have a DataFrame df with a MultiIndex ['siec', 'geo'] (the two leftmost columns below):

siec  geo  value
a     DE       1
a     FR       2
and a mapping DataFrame mapping_df from geo to id_region with a single index ['geo']:
geo  id_region
DE          10
FR          20
=> How can I join/merge/replace the index column 'geo' of df with the values of the column 'id_region' from mapping_df?
Expected result with new MultiIndex ['siec', 'id_region']:

siec  id_region  value
a     10             1
a     20             2
I tried the following code:
import pandas as pd
df = pd.DataFrame([{'siec': 'a', 'geo': 'DE', 'value': 1}, {'siec': 'a', 'geo': 'FR', 'value': 2}])
df.set_index(['siec', 'geo'], inplace=True)
mapping_df = pd.DataFrame([{'geo': 'DE', 'id_region': 10}, {'geo': 'FR', 'id_region': 20}])
mapping_df.set_index(['geo'], inplace=True)
joined_data = df.join(mapping_df)
merged_data = df.merge(mapping_df, left_index=True, right_index=True)
but it does not do what I want. It adds an additional column and keeps the old index.
siec  geo  value  id_region
a     DE       1         10
a     FR       2         20
=> Is there a convenient method for my use case or would I need to manually correct the index after a joining step?
As a workaround, I could reset_index() the DataFrames, do some joining manipulations, and then reintroduce the multi index.
However, I would like to avoid switching back and forth between the indexed and non-indexed versions of the DataFrames if possible.
Try as follows.
Use MultiIndex.get_level_values to select only level 1 (or: geo) and apply Index.map with mapping_df['id_region'] as mapper.
Wrap the result inside MultiIndex.set_levels to overwrite level 1.
Finally, chain Index.set_names to rename the level (or use MultiIndex.rename).
df.index = df.index.set_levels(
    df.index.get_level_values(1).map(mapping_df['id_region']),
    level=1
).set_names('id_region', level=1)
print(df)
                value
siec id_region
a    10             1
     20             2
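As an aside, a possibly simpler alternative (a sketch, assuming the mapping covers every geo label) is DataFrame.rename with its level argument:

# rename applies the dict-like mapping only to the given index level;
# set_names then renames the level itself.
df2 = df.rename(index=mapping_df['id_region'].to_dict(), level='geo')
df2.index = df2.index.set_names('id_region', level='geo')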
I have trouble with some pandas dataframes.
It's very simple: I have 4 columns, and I want to reshape them into 2.
For practical reasons, I don't want to use the header names; I need to refer to the columns by index instead.
I have:
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
I want as a result :
df_res = pd.DataFrame({'NewName1': [1,2,3,4,5,6],'NewName2': [7,8,9,10,11,12]})
(in fact NewName1 doesn't matter; it can stay a or any other name...)
I tried with for loops, append, concat, but couldn't figure it out...
Any suggestions?
Thanks for your help !
Bina
You can extract the desired columns and create a new pandas.DataFrame like so:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
first_col = np.concatenate((df.a.to_numpy(), df.b.to_numpy()))
second_col = np.concatenate((df.c.to_numpy(), df.d.to_numpy()))
df2 = pd.DataFrame({"NewName1": first_col, "NewName2": second_col})
>>> df2
   NewName1  NewName2
0         1         7
1         2         8
2         3         9
3         4        10
4         5        11
5         6        12
This is probably not the most elegant solution, but I would isolate the two dataframes and then concatenate them. I needed to rename the column axis so that the four columns could be aligned correctly.
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
af = df[['a', 'c']]
bf = df[['b', 'd']]
frames = (
    af.rename({'a': 'NewName1', 'c': 'NewName2'}, axis=1),
    bf.rename({'b': 'NewName1', 'd': 'NewName2'}, axis=1)
)
out = pd.concat(frames)
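One caveat: concat keeps each frame's original 0..2 row labels, so the result has a duplicated index. If you want the clean 0..5 index from the desired output, pass ignore_index:

# ignore_index discards the duplicated 0..2 labels and renumbers 0..5.
out = pd.concat(frames, ignore_index=True)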
[EDIT] Replying to the comment.
I'm not that familiar with indexing, but this might be one solution. You could avoid column names by using .iloc. Replace the af and bf frames above with these lines:
af = df.iloc[:, ::2]
bf = df.iloc[:, 1::2]
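Note that these two slices still carry their original labels ('a'/'c' versus 'b'/'d'), so they will not align under concat as-is; a sketch (my own addition) that relabels them positionally with set_axis:

# set_axis replaces the column labels without ever naming the old ones,
# keeping the whole pipeline free of hard-coded header names.
frames = (
    af.set_axis(['NewName1', 'NewName2'], axis=1),
    bf.set_axis(['NewName1', 'NewName2'], axis=1),
)
out = pd.concat(frames, ignore_index=True)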
df_1 = pd.DataFrame({'budget_id': ['1', '2', '3', '4'],
                     'budget_amount': [200, 300, 400, 500]})
df_2 = pd.DataFrame({'budget_id': ['1', '2', '3', '4', '5'],
                     'budget_amount': [200, 300, 400, 550, 700]})
df_1.compare(df_2, align_axis=0, keep_equal=True).rename(index={'self': 'Prev', 'other': 'New'}, level=1)
Desired output of df.compare():
budget_id  budget_amount
        4            550
        5            700
I have two data frames that I wish to compare using df.compare. They both have the same columns and index labels.
However, I cannot guarantee they have the same number of rows. This causes issues, as compare expects two DataFrames with the same shape.
I need to know if a new row has been added as part of the compare.
Would the best solution be to append blank rows to either data frame until they're equal? How would you do that?
Is there a more elegant way?
Does merge work for you?
(df_1.merge(df_2, on='budget_id', how='right')
     .query('budget_amount_x != budget_amount_y')
)
Output:
  budget_id  budget_amount_x  budget_amount_y
3         4            500.0              550
4         5              NaN              700
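A related sketch, in case you also need to flag which rows are brand new rather than changed: merge's indicator flag labels each row's origin (the column names follow the example above):

# indicator=True adds a '_merge' column; 'right_only' marks rows that
# exist only in df_2, i.e. genuinely new budget ids.
diff = df_1.merge(df_2, on='budget_id', how='right', indicator=True)
new_rows = diff[diff['_merge'] == 'right_only']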
This is the solution I wrote based on Giovanni Frison's comment.
import pandas

def compare_dataframes(df_1, df_2):
    if df_1.equals(df_2):
        return pandas.DataFrame()
    else:
        # Get indexes of rows present in df_2, but not in df_1
        new_row_indexes = df_2.index.difference(df_1.index)
        new_rows = df_2.loc[new_row_indexes].copy()
        # Create second index level to match df.compare output
        new_rows[''] = 'New'
        new_rows = new_rows.set_index('', append=True)
        # Drop new rows from df_2 to create same shape for df.compare
        df_2 = df_2.drop(new_row_indexes)
        compare_df = df_1.compare(df_2, align_axis=0, keep_equal=True).rename(
            index={'self': 'Prev', 'other': 'New'}, level=1)
        # DataFrame.append was removed in pandas 2.0; concat does the same job
        compare_df = pandas.concat([compare_df, new_rows])
        return compare_df
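A usage sketch with the frames from the question, assuming budget_id is set as the index so that index.difference can spot the added row:

import pandas

df_1 = pandas.DataFrame({'budget_id': ['1', '2', '3', '4'],
                         'budget_amount': [200, 300, 400, 500]}).set_index('budget_id')
df_2 = pandas.DataFrame({'budget_id': ['1', '2', '3', '4', '5'],
                         'budget_amount': [200, 300, 400, 550, 700]}).set_index('budget_id')

# Row '4' changed (500 -> 550) and row '5' is new.
print(compare_dataframes(df_1, df_2))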
As the title says, I noticed that pandas' to_csv automatically transforms columns that contain only numerical strings into floats.
I am creating a dataframe in a Jupyter notebook with a column ['A'] full of the value '1'. Hence, I have a dataframe composed of a column of the string '1'.
When I convert my dataframe to a csv file with to_csv, the output csv file has one column full of the integer 1.
You may advise me to reconvert the column to string when it is reloaded in Jupyter; however, that won't work because I don't know beforehand which columns may be affected by this behaviour.
Is there a way to avoid this strange situation?
You can set the quoting parameter in to_csv, take a look at this example:
import csv
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)
df.to_csv('test.csv', sep='\t', quoting=csv.QUOTE_NONNUMERIC)
The created csv file is:
"" 0 1 2
0 "a" "1.2" "4.2"
1 "b" "70" "0.03"
2 "x" "5" "0"
You can also set the quote character with quotechar parameter, e.g. quotechar="'" will produce this output:
''    0      1      2
0    'a'   '1.2'  '4.2'
1    'b'   '70'   '0.03'
2    'x'   '5'    '0'
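When loading the file back, you can pass the same quoting constant to read_csv; with csv.QUOTE_NONNUMERIC the parser treats quoted fields as strings and unquoted ones as numbers (a sketch, worth verifying against your pandas version):

# Quoted fields stay strings; the unquoted index is parsed numerically.
df_back = pd.read_csv('test.csv', sep='\t', index_col=0,
                      quoting=csv.QUOTE_NONNUMERIC)
print(df_back.dtypes)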
One way is to store your types separately and load this with your data:
df = pd.DataFrame({0: ['1', '1', '1'],
1: [2, 3, 4]})
df.dtypes.to_frame('types').to_csv('types.csv')
df.to_csv('file.csv', index=False)
df_types = pd.read_csv('types.csv')['types']
df = pd.read_csv('file.csv', dtype=df_types.to_dict())
print(df.dtypes)
# 0 object
# 1 int64
# dtype: object
You may wish to consider pickle, which guarantees that your dataframe round-trips unchanged:
df.to_pickle('file.pkl')
df = pd.read_pickle('file.pkl')
print(df.dtypes)
# 0 object
# 1 int64
# dtype: object