How to coerce pandas dataframe column to be normal index - python

I create a DataFrame from a dictionary. I want the keys to be used as index and the values as a single column. This is what I managed to do so far:
import pandas as pd
my_counts = {"A": 43, "B": 42}
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
I get the following:
count
letter
A 43
B 42
The problem is I want to concatenate (with pd.concat) this with other dataframes, that have the same index name (letter), and seemingly the same single column (count), but I end up with an
AssertionError: invalid dtype determination in get_concat_dtype.
I discovered that the other dataframes have a different type for their columns: Index(['count'], dtype='object'). The above dataframe has MultiIndex(levels=[['count']], labels=[[0]]).
How can I ensure my dataframe has a normal index?

The MultiIndex column comes from the trailing comma: ("count",) is a one-element tuple, while ("count") is just the string "count". Drop the comma to get a plain Index:
df = pd.DataFrame(pd.Series(my_counts, name=("count")).rename_axis("letter"))
df.columns
Output:
Index(['count'], dtype='object')
Or you can keep the tuple name and flatten the MultiIndex columns afterwards:
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
df.columns = df.columns.map(''.join)
df.columns
Output:
Index(['count'], dtype='object')
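As a minimal sketch of the first option, the snippet below builds the frame with a string name and concatenates it with a second frame. The frame called other is a hypothetical stand-in for the "other dataframes" mentioned in the question, and to_frame() is used here in place of wrapping the Series in pd.DataFrame.

```python
import pandas as pd

my_counts = {"A": 43, "B": 42}

# Parentheses alone do not make a tuple; the trailing comma does.
assert ("count") == "count"
assert isinstance(("count",), tuple)

# A string name gives a plain, single-level column index.
df = pd.Series(my_counts, name="count").rename_axis("letter").to_frame()

# Hypothetical second frame with the same index name and column.
other = pd.DataFrame({"count": [7]}, index=pd.Index(["C"], name="letter"))

combined = pd.concat([df, other])
print(type(df.columns).__name__)
print(combined)
```

Because both frames now carry the same kind of column index, the concatenation should not need any special handling.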

Related

Subset of a Pandas Dataframe consisting of rows with specific column values

I'm having a problem with a single line of my code.
Here is what I'd like to achieve:
reading_now is a string consisting of 3 characters
df2 is a data frame that is a subset of df1
I'd like df2 to consist of the rows from df1 where the first three characters of the value in column "Code" are equal to reading_now
I tried using the following two lines with no success:
df2 = df1.loc[(df1['Code'])[0:3] == reading_now]
df2 = df1[(str(df1.Code)[0:3] == reading_now)]
Looks like you were really close with your 2nd attempt.
You could solve this a couple of different ways.
reading_now = 'AAA'
df1 = pd.DataFrame([{'Code': 'AAA'}, {'Code': 'BBB'}, {'Code': 'CCC'}])
solution:
df2 = df1[df1['Code'].str.startswith(reading_now)]
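or, slicing each string through the .str accessor (a plain df1['Code'][0:3] would select the first three rows, not the first three characters):
df2 = df1[df1['Code'].str[0:3] == reading_now]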
The df2 dataframe will contain the row that starts with the reading_now string.
You could use
df2 = df1[df1['Code'].str[0:3] == reading_now]
For example:
data = ['abcd', 'cbdz', 'abcz', 'bdaz']
df1 = pd.DataFrame(data, columns=['Code'])
df2 = df1[df1['Code'].str[0:3] == 'abc']
df2 will result in a dataframe with 'Code' column containing 'abcd' and 'abcz'
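The following sketch contrasts the two working approaches with plain Series slicing, using the same sample data as the example above. The variable names by_slice, by_prefix and first_rows are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Code": ["abcd", "cbdz", "abcz", "bdaz"]})
reading_now = "abc"

# Element-wise string slicing: compares the first three characters of each value.
by_slice = df1[df1["Code"].str[0:3] == reading_now]

# Prefix test: equivalent here, and independent of the prefix length.
by_prefix = df1[df1["Code"].str.startswith(reading_now)]

# Without .str, [0:3] slices the Series itself, i.e. it picks rows 0 to 2.
first_rows = df1["Code"][0:3]

print(by_slice)
print(first_rows)
```

Both filters select the same rows; the last statement shows why the first attempt in the question did not compare characters at all.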

Pandas create Dataframe with index name as column name

I have an existing dataframe with column names and data. I want to set index.name for the dataframe to the columns' names. I am confused about multi-indexing; how do I do that? I need this because I then pass the dataframe to the to_sql function, which uses the index name as the column name in the table.
Currently for me dataframe.index is RangeIndex(start=0, stop=1669, step=1)
and dataframe.index.name is None
I have done as follows :
dataframe.index.names = dataframe.columns
dataframe = dataframe.rename_axis(dataframe.columns)
It gives me the error "Length of new names must be 1, got 67". 67 is the number of columns I have in the dataframe.
It depends on whether the index is a MultiIndex or not.
For a single-level index, use one of:
df.index.name = 'foo'
df = df.rename_axis('foo')
For a MultiIndex, pass one name per level:
df.index.names = ('foo', 'bar')
df = df.rename_axis(('foo', 'bar'))
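The error in the question arises because a RangeIndex has a single level, so it accepts exactly one name. Below is a rough sketch of naming that level and writing the frame with to_sql. The table name demo, the index name row_id and the in-memory SQLite connection are assumptions made for illustration only.

```python
import sqlite3

import pandas as pd

# Two ordinary columns and the default single-level RangeIndex.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# One level, so exactly one name is allowed.
df = df.rename_axis("row_id")

# Hypothetical in-memory database, used only to show the resulting column names.
conn = sqlite3.connect(":memory:")
df.to_sql("demo", conn, index=True)
stored = pd.read_sql("SELECT * FROM demo", conn)
conn.close()

print(stored.columns.tolist())
```

If an existing column should serve as the index instead, df.set_index with that column name is the usual alternative.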

Appending a column to data frame using Pandas in python

I'm trying some operations on an Excel file using pandas. I want to extract some columns from the file, add another column to those extracted columns, and write all of them to a new Excel file. To do this I have to append the new column to the old columns.
Here is my code-
import pandas as pd
#Reading ExcelFIle
#Work.xlsx is input file
ex_file = 'Work.xlsx'
data = pd.read_excel(ex_file,'Data')
#Create subset of columns by extracting columns D,I,J,AU from the file
data_subset_columns = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")
#Compute new column 'Percentage'
#'Num Labels' and 'Num Tracks' are two different columns in given file
data['Percentage'] = data['Num Labels'] / data['Num Tracks']
data1 = data['Percentage']
print(data1)
#Here I'm trying to append data['Percentage'] to data_subset_columns
Final_data = data_subset_columns.append(data1)
print(Final_data)
Final_data.to_excel('111.xlsx')
No error is shown, but Final_data does not give me the expected result (the column is not appended).
There is no need to explicitly append columns in pandas. When you calculate a new column, it is included in the dataframe. When you export it to excel, the new column will be included.
Try this, assuming 'Num Labels' and 'Num Tracks' are in "D,I,J,AU" [otherwise add them]:
import pandas as pd
ex_file = 'Work.xlsx'
# usecols is the current name of the older parse_cols argument
data_subset = pd.read_excel(ex_file, sheet_name='Data', usecols="D,I,J,AU")
data_subset['Percentage'] = data_subset['Num Labels'] / data_subset['Num Tracks']
data_subset.to_excel('111.xlsx')
The append method of a DataFrame adds rows, not columns. (It does add columns when the appended rows contain columns that are missing from the source DataFrame.) From the documentation:
DataFrame.append(other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
I think you are looking for something like concat, which is also the replacement for DataFrame.append since that method was removed in pandas 2.0.
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george
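Applied to the question, both suggestions can be sketched without an Excel file. The small frame below is made-up sample data; only the column names 'Num Labels' and 'Num Tracks' come from the question, and 'Other' is a hypothetical stand-in for the extracted columns.

```python
import pandas as pd

# Made-up data standing in for the sheet read from Work.xlsx.
data = pd.DataFrame(
    {"Num Labels": [2, 3], "Num Tracks": [4, 6], "Other": ["x", "y"]}
)

# Option 1: assign the computed column directly to the subset.
subset = data[["Other"]].copy()
subset["Percentage"] = data["Num Labels"] / data["Num Tracks"]

# Option 2: compute a named Series and join it column-wise with concat.
percentage = (data["Num Labels"] / data["Num Tracks"]).rename("Percentage")
combined = pd.concat([data[["Other"]], percentage], axis=1)

print(combined)
```

Both options align on the row index, so they give the same frame as long as the subset and the computed column come from the same rows.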

Adding columns to a dataframe where all other columns are periods

I have a timeseries dataframe with a PeriodIndex. I would like to use its values as column names in another dataframe and add other columns that are not Periods. The problem is that when I create the dataframe with only periods as the column index, adding a column whose label is a string raises an error. However, if I create the dataframe with a column index that contains both periods and strings, I am able to add columns with string labels.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5,2))
idx = pd.Index(pd.period_range(2011,2012,freq='A'),name='year')
df = pd.DataFrame(data,columns=idx)
df['age'] = 0
This raises an error.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5,2))
idx = pd.Index(pd.period_range(2011,2012,freq='A'),name='year')
df = pd.DataFrame(columns=idx.tolist()+['age'])
df = df.iloc[:,:-1]
df[:] = data
df['age'] = 0
This does not raise an error and gives my desired outcome, but doing it this way I can't assign the data in a convenient way when I create the dataframe. I would like a more elegant way of achieving the result. I wonder if this is a bug in Pandas?
Not really sure what you are trying to achieve, but here is one way to get what I understood you wanted:
import pandas as pd
idx = pd.Index(pd.period_range(2011,2015,freq='A'),name='year')
df = pd.DataFrame(index=idx)
df1 = pd.DataFrame({'age':['age']})
df1 = df1.set_index('age')
df = df.append(df1,ignore_index=False).T
print(df)
Which gives:
Empty DataFrame
Columns: [2011, 2012, 2013, 2014, 2015, age]
Index: []
And it keeps your years as Periods:
df.columns[0]
Period('2011', 'A-DEC')
The same result most likely can be achieved using .merge.
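Another possible approach, offered as a sketch rather than as the answer's method: build the frame with the data first, then cast the column labels to a generic object Index so that a string label can sit next to the Period labels. This assumes a pandas version that accepts 'Y' as the annual frequency alias, and the seeded random data is only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 2))

# Column labels are Periods; the frame is created with its data in one step.
idx = pd.period_range("2011", "2012", freq="Y", name="year")
df = pd.DataFrame(data, columns=idx)

# Cast the labels to an object-dtype Index: the Periods are kept as Period
# objects, but the Index is no longer restricted to Periods.
df.columns = df.columns.astype(object)

df["age"] = 0
print(df.columns.tolist())
```

The year labels stay Period objects, so selections such as df[pd.Period("2011", freq="Y")] should continue to work.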

Pandas Re-indexing command

Re: "Add missing dates to pandas dataframe", a previously asked question.
import pandas as pd
import numpy as np
idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
df.index = pd.DatetimeIndex(df.index); #question (1)
df = df.reindex(idx, fill_value=np.nan)
print(df)
In the above script what does the command noted as question one do? If you leave this
command out of the script, the df will be re-indexed but the data portion of the
original df will not be retained. As there is no reference to the df data in the
DatetimeIndex command, why is the data from the starting df lost?
Short answer: df.index = pd.DatetimeIndex(df.index); converts the string index of df to a DatetimeIndex.
You have to make the distinction between different types of indexes. In
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
you have an index containing strings. When using
df.index = pd.DatetimeIndex(df.index);
you convert this standard index with strings to an index with datetimes (a DatetimeIndex). So the values of these two types of indexes are completely different.
Now you reindex with
idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)
where idx is also an index with datetimes. When you reindex the original df with a string index, there are no matching index values, so no column values of the original df are retained. When you reindex the second df (after converting the index to a datetime index), there are matching index values, so the column values at those index positions are retained.
See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html
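The difference can be checked directly with a short sketch that reindexes the same frame before and after converting its index. The names lost and kept are illustrative, and pd.to_datetime with an explicit format is used here as one way to perform the conversion.

```python
import pandas as pd

idx = pd.date_range("2013-09-01", "2013-09-30")
df = pd.DataFrame(
    {"Events": [2, 10, 5, 1]},
    index=["09-02-2013", "09-03-2013", "09-06-2013", "09-07-2013"],
)

# String labels never equal Timestamp labels, so nothing lines up.
lost = df.reindex(idx)

# After conversion the labels are Timestamps and four of them match idx.
df.index = pd.to_datetime(df.index, format="%m-%d-%Y")
kept = df.reindex(idx)

print(lost["Events"].notna().sum(), kept["Events"].notna().sum())
```

Reindexing only looks up labels, so the data is carried over exactly where old and new labels compare equal.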
