Pandas Dataframes - Adding Fields Based on Column Titles - python

I have a pandas dataframe with some information in the column titles that I want to add to each row. The dataframe looks like:
print working_df
Retail Sales of Electricity : Arkansas : Industrial : Annual \
Year
0 16709.19272
1 16847.75502
2 16993.92202
3 16774.69902
4 14710.29400
Retail Sales of Electricity : Arizona : Residential : Annual \
Year
0 33138.47860
1 32922.97001
2 33079.07402
3 32448.13802
4 32846.84298
[8 rows x 701 columns]
How can I pull out two variables from the column name (the state, e.g. Arizona, and the sector, e.g. Industrial or Residential) and put them as values in each row in two new columns, respectively?
I would like the result to have fields that look like:
Year State Sector Sales
0 Arizona Residential 33138.47860
1 Arizona Residential 32922.97001
2 Arizona Residential 33079.07402
3 Arizona Residential 32448.13802
4 Arizona Residential 32846.84298
0 Arkansas Industrial 16709.19272
1 Arkansas Industrial 16847.75502
2 Arkansas Industrial 16993.92202
3 Arkansas Industrial 16774.69902
4 Arkansas Industrial 14710.29400

I think I'd do something like
d2 = df.unstack().reset_index()
d2 = d2.rename(columns={0: "Sales"})
parts = d2.pop("level_0").str.split(":")
d2["State"] = [p[1].strip() for p in parts]
d2["Sector"] = [p[2].strip() for p in parts]
which produces
>>> d2
Year Sales State Sector
0 0 16709.19272 Arkansas Industrial
1 1 16847.75502 Arkansas Industrial
2 2 16993.92202 Arkansas Industrial
3 3 16774.69902 Arkansas Industrial
4 4 14710.29400 Arkansas Industrial
5 0 33138.47860 Arizona Residential
6 1 32922.97001 Arizona Residential
7 2 33079.07402 Arizona Residential
8 3 32448.13802 Arizona Residential
9 4 32846.84298 Arizona Residential
[10 rows x 4 columns]
You could be a little fancier and do something with str.extract -- str.extract(r".*?:\s*(?P<State>.*?)\s*:\s*(?P<Sector>.*?)\s*:.*"), maybe -- but I don't think it's really worth it.
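For completeness, here is the answer's approach as a self-contained sketch. The two-column frame below is a toy stand-in for the 701-column original; column names follow the "Series : State : Sector : Frequency" pattern shown above.

```python
import pandas as pd

# Toy frame mimicking the layout above: one column per state/sector series
df = pd.DataFrame({
    "Retail Sales of Electricity : Arkansas : Industrial : Annual": [16709.19272, 16847.75502],
    "Retail Sales of Electricity : Arizona : Residential : Annual": [33138.47860, 32922.97001],
})
df.index.name = "Year"

# Stack the columns into rows, then split each old column name on ':'
d2 = df.unstack().reset_index()
d2 = d2.rename(columns={0: "Sales"})
parts = d2.pop("level_0").str.split(":")
d2["State"] = [p[1].strip() for p in parts]
d2["Sector"] = [p[2].strip() for p in parts]
print(d2)
```

The second field of each split becomes `State` and the third becomes `Sector`, exactly as in the snippet above.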

Related

Python Pivot: Can I get the count of columns per row(id/index) and store it in a new columns?

Hope you can help me with this.
The df looks like this.
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
I want to get the count of columns that have data, per metro and country per region, for each of the rows (id/index), and store that count in a new column.
Regards,
RJ
You may want to try
df['new'] = df.sum(level=0, axis=1)
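Note that the one-liner above sums the values. If you literally want the count of non-empty cells per group of the column MultiIndex, one hedged sketch (the tiny frame and its labels are illustrative, not the asker's data):

```python
import pandas as pd
import numpy as np

# Toy frame with a (region, country) column MultiIndex, mimicking the layout above
cols = pd.MultiIndex.from_tuples(
    [('AMER', 'Brazil'), ('AMER', 'Brazil'), ('AMER', 'Canada')],
    names=['region', 'country'])
df = pd.DataFrame([[2, 1, np.nan], [np.nan, 3, 4]],
                  columns=cols, index=[321321, 23213])

# Count non-empty cells per country for each row: transpose, group the
# column index by its 'country' level, sum the booleans, transpose back
counts = df.notna().T.groupby(level='country').sum().T
print(counts)
```

Grouping on the transpose avoids `groupby(axis=1)`, which is deprecated in recent pandas; the same pattern works with `level='region'` for per-region counts.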

How to filter a dataframe by splitting categories of a column into sets?

I have a dataframe:
Prop_ID Unit_ID Prop_Usage Unit_Usage
1 1 RESIDENTIAL RESIDENTIAL
1 2 RESIDENTIAL COMMERCIAL
1 3 RESIDENTIAL INDUSTRIAL
1 4 RESIDENTIAL RESIDENTIAL
2 1 COMMERCIAL RESIDENTIAL
2 2 COMMERCIAL COMMERCIAL
2 3 COMMERCIAL COMMERCIAL
3 1 INDUSTRIAL INDUSTRIAL
3 2 INDUSTRIAL COMMERCIAL
4 1 RESIDENTIAL - COMMERCIAL RESIDENTIAL
4 2 RESIDENTIAL - COMMERCIAL COMMERCIAL
4 3 RESIDENTIAL - COMMERCIAL INDUSTRIAL
5 1 COMMERCIAL / RESIDENTIAL RESIDENTIAL
5 2 COMMERCIAL / RESIDENTIAL COMMERCIAL
5 3 COMMERCIAL / RESIDENTIAL INDUSTRIAL
5 4 COMMERCIAL / RESIDENTIAL COMMERCIAL
One property may have more than one unit; that is, units are subcategories of properties. I want to filter rows where Prop_Usage does not match Unit_Usage. There is a category in the Prop_Usage column, RESIDENTIAL - COMMERCIAL, for which Unit_Usage can be either RESIDENTIAL or COMMERCIAL; similarly for COMMERCIAL / RESIDENTIAL.
Expected Output:
Prop_ID Unit_ID Prop_Usage Unit_Usage
1 2 RESIDENTIAL COMMERCIAL
1 3 RESIDENTIAL INDUSTRIAL
2 1 COMMERCIAL RESIDENTIAL
3 2 INDUSTRIAL COMMERCIAL
4 3 RESIDENTIAL - COMMERCIAL INDUSTRIAL
5 3 COMMERCIAL / RESIDENTIAL INDUSTRIAL
Use the in operator inside DataFrame.apply:
df = df[~df.apply(lambda x: x['Unit_Usage'] in x['Prop_Usage'], axis=1)]
Or use zip in a list comprehension:
df = df[[a not in b for a, b in zip(df['Unit_Usage'], df['Prop_Usage'])]]
print(df)
Prop_ID Unit_ID Prop_Usage Unit_Usage
1 1 2 RESIDENTIAL COMMERCIAL
2 1 3 RESIDENTIAL INDUSTRIAL
4 2 1 COMMERCIAL RESIDENTIAL
8 3 2 INDUSTRIAL COMMERCIAL
11 4 3 RESIDENTIAL - COMMERCIAL INDUSTRIAL
14 5 3 COMMERCIAL / RESIDENTIAL INDUSTRIAL
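Since the question title asks about splitting categories into sets, here is a hedged alternative that splits Prop_Usage on the '-' and '/' separators into a set of allowed usages and tests exact membership (the four-row frame is a subset of the data above, chosen for illustration):

```python
import pandas as pd
import re

df = pd.DataFrame({
    'Prop_ID': [1, 1, 4, 4],
    'Unit_ID': [1, 2, 1, 3],
    'Prop_Usage': ['RESIDENTIAL', 'RESIDENTIAL',
                   'RESIDENTIAL - COMMERCIAL', 'RESIDENTIAL - COMMERCIAL'],
    'Unit_Usage': ['RESIDENTIAL', 'COMMERCIAL', 'RESIDENTIAL', 'INDUSTRIAL'],
})

# Split each Prop_Usage on '-' or '/' into a set of allowed usages,
# then keep only rows whose Unit_Usage is NOT in that set
allowed = df['Prop_Usage'].apply(lambda s: {p.strip() for p in re.split(r'[-/]', s)})
out = df[[u not in a for u, a in zip(df['Unit_Usage'], allowed)]]
print(out)
```

The substring test in the accepted answer works here because no usage name contains another; the set approach is stricter and would still be correct if, say, one category name were a prefix of another.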

python pandas groupby sort rank/top n

I have a dataframe that is grouped by state and aggregated to total revenue where sector and name are ignored. I would now like to break the underlying dataset out to show state, sector, name and the top 2 by revenue in a certain order(i have a created an index from a previous dataframe that lists states in a certain order). Using the below example, I would like to use my sorted index (Kentucky, California, New York) that lists only the top two results per state (in previously stated order by Revenue):
Dataset:
State Sector Name Revenue
California 1 Tom 10
California 2 Harry 20
California 3 Roger 30
California 2 Jim 40
Kentucky 2 Bob 15
Kentucky 1 Roger 25
Kentucky 3 Jill 45
New York 1 Sally 50
New York 3 Harry 15
End Goal Dataframe:
State Sector Name Revenue
Kentucky 3 Jill 45
Kentucky 1 Roger 25
California 2 Jim 40
California 3 Roger 30
New York 1 Sally 50
New York 3 Harry 15
You could use a groupby in conjunction with apply:
res = df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
Output:
Sector Name Revenue
State State
California California 2 Jim 40
California 3 Roger 30
Kentucky Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York New York 1 Sally 50
New York 3 Harry 15
Then you can drop the first level of the resulting MultiIndex to get the result you're after:
res.index = res.index.droplevel()
Output:
Sector Name Revenue
State
California 2 Jim 40
California 3 Roger 30
Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York 1 Sally 50
New York 3 Harry 15
You can use sort_values, then groupby + head:
df.sort_values('Revenue',ascending=False).groupby('State').head(2)
Out[208]:
State Sector Name Revenue
7 NewYork 1 Sally 50
6 Kentucky 3 Jill 45
3 California 2 Jim 40
2 California 3 Roger 30
5 Kentucky 1 Roger 25
8 NewYork 3 Harry 15
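Neither answer reproduces the custom state order (Kentucky, California, New York) the question asks for. One hedged way is to make State an ordered Categorical with your previously derived order, then sort; the frame below reconstructs the question's dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['California', 'California', 'California', 'California',
              'Kentucky', 'Kentucky', 'Kentucky', 'New York', 'New York'],
    'Sector': [1, 2, 3, 2, 2, 1, 3, 1, 3],
    'Name': ['Tom', 'Harry', 'Roger', 'Jim', 'Bob', 'Roger', 'Jill', 'Sally', 'Harry'],
    'Revenue': [10, 20, 30, 40, 15, 25, 45, 50, 15],
})

order = ['Kentucky', 'California', 'New York']  # your custom state order

# Top 2 per state by revenue, then sort states by the custom order
top2 = df.sort_values('Revenue', ascending=False).groupby('State').head(2).copy()
top2['State'] = pd.Categorical(top2['State'], categories=order, ordered=True)
result = top2.sort_values(['State', 'Revenue'],
                          ascending=[True, False]).reset_index(drop=True)
print(result)
```

Sorting an ordered Categorical uses its category order rather than alphabetical order, which is what produces the Kentucky / California / New York grouping in the end-goal dataframe.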

Iterating through a list of identical elements

I have the following function, which returns the pandas series of States - Associated Counties
def answer():
    census_df.set_index(['STNAME', 'CTYNAME'])
    for name, state, cname in zip(census_df['STNAME'], census_df['STATE'], census_df['CTYNAME']):
        print(name, state, cname)
Alabama 1 Tallapoosa County
Alabama 1 Tuscaloosa County
Alabama 1 Walker County
Alabama 1 Washington County
Alabama 1 Wilcox County
Alabama 1 Winston County
Alaska 2 Alaska
Alaska 2 Aleutians East Borough
Alaska 2 Aleutians West Census Area
Alaska 2 Anchorage Municipality
Alaska 2 Bethel Census Area
Alaska 2 Bristol Bay Borough
Alaska 2 Denali Borough
Alaska 2 Dillingham Census Area
Alaska 2 Fairbanks North Star Borough
I would like to know the state with the most counties in it. I can iterate through each state like this:
counter = 0
counter2 = 0
for name, state, cname in zip(census_df['STNAME'], census_df['STATE'], census_df['CTYNAME']):
    if state == 1:
        counter += 1
    if state == 2:
        counter2 += 1
print(counter)
print(counter2)
and so on. I could range over the state numbers (rng = range(1, 56)) and iterate through it, but creating 56 counters is a nightmare. Is there an easier way of doing so?
Pandas allows us to do such operations without loops/iterating:
In [21]: df.STNAME.value_counts()
Out[21]:
Alaska 9
Alabama 6
Name: STNAME, dtype: int64
In [24]: df.STNAME.value_counts().head(1)
Out[24]:
Alaska 9
Name: STNAME, dtype: int64
or
In [18]: df.groupby('STNAME')['CTYNAME'].count()
Out[18]:
STNAME
Alabama 6
Alaska 9
Name: CTYNAME, dtype: int64
In [19]: df.groupby('STNAME')['CTYNAME'].count().idxmax()
Out[19]: 'Alaska'

pandas: filtering by group size and data value

Having grouped data, I want to drop from the results groups that contain only a single observation with the value below a certain threshold.
Initial data:
df = pd.DataFrame(data={'Province': ['ON','QC','BC','AL','AL','MN','ON'],
                        'City': ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
                        'Sales': [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
Now grouping the data:
df.groupby(['Province', 'City']).sum()
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
MN Winnipeg 3
ON Toronto 13
Windsor 1
QC Montreal 6
Now the part I can't figure out is how to drop provinces with only one city (or generally N observations) whose total sales are less than 10. The expected output should be:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
I.e. MN/Winnipeg and QC/Montreal are gone from the results. Ideally, they won't be completely gone but combined into a new group called 'Other', but this may be material for another question.
you can do it this way:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
Now we will create a mask for those 'provinces with more than one city':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
And cities with total sales greater than or equal to 10 win:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
I wasn't satisfied with any of the answers given, so I kept chipping at this until I figured out the following solution:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x)>1 or x.Sales > 10)
Out[73]:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
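The question also mentions that, ideally, the dropped groups would be combined into a new 'Other' group rather than removed. A hedged sketch building on the mask approach above (relabel instead of filter, then re-aggregate):

```python
import pandas as pd

df = pd.DataFrame({'Province': ['ON','QC','BC','AL','AL','MN','ON'],
                   'City': ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
                   'Sales': [13,6,16,8,4,3,1]})

g = df.groupby(['Province', 'City'], as_index=False)['Sales'].sum()

# Same keep-condition as above: more than one city, or total sales >= 10
keep = (g.groupby('Province')['City'].transform('count') > 1) | (g['Sales'] >= 10)

# Relabel the dropped rows as 'Other' instead of removing them, then re-sum
g.loc[~keep, ['Province', 'City']] = 'Other'
result = g.groupby(['Province', 'City'])['Sales'].sum()
print(result)
```

With this data, MN/Winnipeg (3) and QC/Montreal (6) collapse into a single Other/Other row with Sales 9, while the kept groups are unchanged.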
