I have a DataFrame and I need to turn one column into multiple columns, and then create another column that indexes/labels the values of those new columns.
import pandas as pd

df = pd.DataFrame({'state': ['AK', 'AK', 'AK', 'AK', 'AL', 'AL', 'AL', 'AL'],
                   'county': ['Cnty1', 'Cnty1', 'Cnty2', 'Cnty2', 'Cnty3', 'Cnty3', 'Cnty4', 'Cnty4'],
                   'year': ['2000', '2001', '2000', '2001', '2000', '2001', '2000', '2001'],
                   'count1': [5, 7, 4, 8, 9, 1, 0, 1],
                   'count2': [8, 1, 4, 6, 7, 3, 8, 5]})
Using pivot_table() and reset_index() I'm able to move the values of year into columns, but I'm not able to disaggregate them by the count columns.
Using:
pivotDF = pd.pivot_table(df, index=['state', 'county'], columns='year')
pivotDF = pivotDF.reset_index()
Gets me close, but not what I need.
What I need is another column that labels count1 and count2, with the values under the year columns (the stacked layout shown in the answers below).
I realize a DataFrame would have all the values for 'state' and 'county' filled in, which is fine, but I'm outputting this to Excel and need that sparse look, so if there's a way to get this format that would be a bonus.
Many thanks.
You are looking for pivot_table then stack:
s = df.pivot_table(index=['state', 'county'], columns='year',
                   values=['count1', 'count2'], aggfunc='mean').stack(level=0)
s
Out[142]:
year 2000 2001
state county
AK Cnty1 count1 5 7
count2 8 1
Cnty2 count1 4 8
count2 4 6
AL Cnty3 count1 9 1
count2 7 3
Cnty4 count1 0 1
count2 8 5
You've got most of the answer already. Just add a stack with level=0 to stack the count1/count2 level rather than the default innermost level (year).
pd.pivot_table(df, index=['state', 'county'], columns='year', values=['count1', 'count2']) \
.stack(level=0)
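For the Excel "bonus" in the question: to_excel writes a MultiIndex with merged index cells by default (merge_cells=True), which gives the sparse state/county layout rather than repeating every value. A minimal sketch, assuming an Excel writer engine such as openpyxl is installed; the filename is made up:
stacked = pd.pivot_table(df, index=['state', 'county'], columns='year',
                         values=['count1', 'count2']).stack(level=0)
# repeated state/county index labels come out as merged cells in the sheet
stacked.to_excel('pivoted.xlsx')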
Related
Hoping someone can help me here - I believe I am close to the solution.
I have a dataframe on which I am using .count() in order to return a Series of all its column names and each of their respective non-NaN value counts.
Example dataframe:
feature_1  feature_2
1          1
2          NaN
3          2
4          NaN
5          3
The result of .count() here is a Series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe, with the column names "Feature" and "Count". To have the expected output look like this:
Feature     Count
feature_1   5
feature_2   3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However, I'm receiving this error message - "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements" - as though it is not recognising the column names (Feature) as a column with values.
How can I get it to recognise both the Feature and Count columns so I can name them?
Use Series.reset_index instead of Series.to_frame for a two-column DataFrame - the first column comes from the index, the second from the values of the Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print(df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution uses the name parameter together with Series.rename_axis, or DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print(df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column (the original column names become the Series index, which to_frame() turns into the DataFrame index). In order to assign a two-element list to df.columns you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']
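Equivalently, you can name the value column directly in to_frame and rename the index column afterwards; a minimal sketch, with made-up data matching the example above:
import numpy as np
import pandas as pd

# made-up data matching the example dataframe
data = pd.DataFrame({'feature_1': [1, 2, 3, 4, 5],
                     'feature_2': [1, np.nan, 2, np.nan, 3]})

df = data.count().to_frame('Count')  # name the value column up front
df = df.reset_index().rename(columns={'index': 'Feature'})
print(df)
#      Feature  Count
# 0  feature_1      5
# 1  feature_2      3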
This question already has answers here:
Merge two dataframes by index
(7 answers)
Closed 1 year ago.
I am working with the adult dataset, where I split the dataframe in order to label encode the categorical columns. Now I want to append the new dataframe to the original dataframe. What is the simplest way to do this?
Original dataframe:

age  salary
32   3000
25   2300
After label encoding a few columns:

country  gender
1        1
4        2
I want to append the above dataframe to the original, so the final result looks like this:

age  salary  country  gender
32   3000    1        1
25   2300    4        2
Any insights are helpful.
Let's consider two dataframes named df1 and df2. Then:
df1.merge(df2, left_index=True, right_index=True)
Note that merging on the index like this defaults to an inner join, so rows whose index appears in only one frame are dropped.
You can use .join() if the dataframes' rows are matched by index. .join() is a left join by default and joins on the index by default:
df1.join(df2)
In addition to the simple syntax, it has the extra advantage that when you put your master/original dataframe on the left, the left join ensures that the master's index is retained in the result.
Result:
age salary country gender
0 32 3000 1 1
1 25 2300 4 2
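A self-contained sketch of this, rebuilding the example frames by hand:
import pandas as pd

# the two example frames from the question, reconstructed by hand
df1 = pd.DataFrame({'age': [32, 25], 'salary': [3000, 2300]})
df2 = pd.DataFrame({'country': [1, 4], 'gender': [1, 2]})

print(df1.join(df2))
#    age  salary  country  gender
# 0   32    3000        1       1
# 1   25    2300        4       2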
You may find your solution in pandas.concat:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([[32, 3000], [25, 2300]]), columns=['age', 'salary'])
df2 = pd.DataFrame(np.array([[1, 1], [4, 2]]), columns=['country', 'gender'])
pd.concat([df1, df2], axis=1)
   age  salary  country  gender
0   32    3000        1       1
1   25    2300        4       2
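One caveat: concat with axis=1 aligns on the index, so if the two frames ended up with different indexes after the split (for example after filtering rows), the result will contain NaNs; calling reset_index(drop=True) on both frames first avoids that.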
I have a csv whose first two rows look like this:

Year 1   Year 1   Year 1   Year 1
Month 1  Month 2  Month 3  Month 4
I want these first two rows to be merged into one header row, like this:
| Year1-Month1 | Year1-Month2 | etc.
I am reading the csv using a pandas dataframe.
All the answers on Stack Overflow combine two columns, but not two rows. Please help.
First, convert the first 2 rows of the file into a MultiIndex header:
df = pd.read_csv(file, header=[0, 1])
And then join the header levels with '-':
df.columns = df.columns.map('-'.join)
Or use f-strings:
df.columns = [f'{a}-{b}' for a, b in df.columns]
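A minimal end-to-end sketch, with an in-memory CSV standing in for the real file:
import io
import pandas as pd

# in-memory stand-in for the real csv file
csv_text = ("Year 1,Year 1,Year 1,Year 1\n"
            "Month 1,Month 2,Month 3,Month 4\n"
            "10,20,30,40\n")

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])
df.columns = df.columns.map('-'.join)
print(df.columns.tolist())
# ['Year 1-Month 1', 'Year 1-Month 2', 'Year 1-Month 3', 'Year 1-Month 4']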
I am running a for loop over each of the 12 months. For each month I get a bunch of dates, in random order, spread over various years in history, along with the corresponding temperature data on those dates. E.g. if the loop is on January, all the dates and temperatures I get from history are for January only.
I want to start with an empty pandas dataframe with two columns, 'Dates' and 'Temperature'. As the loop progresses I want to add each month's dates and the corresponding data to the 'Temperature' column.
Once my dataframe is ready, I want to use the 'Dates' column as the index to sort the 'Temperature' history, so that I end up with correctly ordered historical dates and their temperatures.
I have thought about using numpy arrays, storing the dates and data in two separate arrays and then sorting the temperatures by the sort order of the dates, but I believe this will be better implemented in pandas.
@Zanam Please refer to this syntax. I think your question is similar to this answer:
from random import randint
import pandas as pd

df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    df.loc[i] = [randint(-1, 1) for n in range(3)]
print(df)
lib qty1 qty2
0 0 0 -1
1 -1 -1 1
2 1 -1 1
3 0 0 0
4 1 -1 -1
[5 rows x 3 columns]
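For the Dates/Temperature case specifically, a faster pattern is to accumulate rows in a plain list and build the dataframe once at the end (growing a dataframe row by row with .loc is slow). A sketch, where get_month_history is a hypothetical stand-in for however you fetch each month's data:
import pandas as pd

rows = []
for month in range(1, 13):
    # get_month_history is hypothetical: it returns parallel lists of
    # date strings and temperatures for the given month
    dates, temps = get_month_history(month)
    rows.extend({'Dates': d, 'Temperature': t} for d, t in zip(dates, temps))

history = pd.DataFrame(rows, columns=['Dates', 'Temperature'])
history['Dates'] = pd.to_datetime(history['Dates'])
history = history.set_index('Dates').sort_index()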
Brief background: I recently started using Pandas to read in a csv file of data. I'm able to create a dataframe from reading the csv, but now I want to do some calculations using only specific columns of the dataset.
Is there a way to create a new dataframe where I only use rows where the relevant columns are not NA or 0? For example imagine an array that looks like:
blah blah1 blah2 blah3
0 1 1 1 1
1 NA NA 1 NA
2 1 1 1 1
So say I want to do things with data under columns "blah1" and "blah2", but I want to only use rows 0 and 2 because 1 has an NA under the column "blah".
Is there a simple way of doing this? Thanks!
Edit (Clarifications):
- I don't know ahead of time that I want to drop row 1, thus I need to be able to check for a NA value (and possibly any other placeholder value beyond just whether it is null).
Yes, you can use dropna:
df = df.dropna(axis=0)
axis=0 drops rows that contain a NaN (axis=1 would drop whole columns instead). Then select the columns you care about:
df = df[["blah1", "blah2"]]
Now df contains only rows 0 and 2 and the cols "blah1" and "blah2".
EDIT 1
To limit the NaN check to specific columns you can use isnull() and build a boolean mask:
mask = df[["blah1", "blah2"]].isnull().any(axis=1)
df = df[~mask]
any(axis=1) flags a row if any of the selected columns is NaN; use all(axis=1) instead if you only want to drop rows where every selected column is NaN.
EDIT 2
To check for a placeholder value rather than NaN, build the mask with a comparison instead (B here stands in for whichever column holds the placeholder):
mask = df.B == 'placeholder'
df = df[~mask]
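Putting the pieces together for the exact condition in the question (keep rows where the relevant columns are neither NaN nor 0), a sketch using the example data:
import numpy as np
import pandas as pd

# the example frame from the question
df = pd.DataFrame({'blah':  [1, np.nan, 1],
                   'blah1': [1, np.nan, 1],
                   'blah2': [1, 1, 1],
                   'blah3': [1, np.nan, 1]})

cols = ['blah', 'blah1', 'blah2']
# flag rows where any relevant column is NaN or 0, then keep the rest
mask = df[cols].isnull().any(axis=1) | (df[cols] == 0).any(axis=1)
print(df[~mask])  # keeps rows 0 and 2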