using library to replace column values in python - python

I am trying to replace the FIPS code with state abbreviations using us library. This is how I can get the value for each individual state
fips_name = us.states.mapping('fips', 'name')
fips_name['20']
Out[31]: 'Kansas'

Assming that fips_name is a dictionary of fips -> state names, you can use the .map method of a pandas.Series (column):
df["state_names"] = df["fips"].map(fips_name)
Update w/ working example:
import pandas as pd
import us
df = pd.DataFrame({"fips": ["01", "01", "08", "09", "10", "06"]})
fips_to_name = us.states.mapping("fips", "name")
df["states"] = df["fips"].map(fips_to_name)
print(df)
fips states
0 01 Alabama
1 01 Alabama
2 08 Colorado
3 09 Connecticut
4 10 Delaware
5 06 California

Related

DataFrame from variable and filtering data

I have a DataFrame and want to extract 3 columns from it, but one of them is an input from the user. I made a list, but need it to be iterable so I can run a For iteration.
So far I made it through by making a dictionary with 2 of the columns making a list of each and zipping them... but I really need the 3 columns...
My code:
Data=pd.read_csv(----------)
selec=input("What month would you want to show?")
NewData=[(Data['Country']),(Data['City']),(Data[selec].astype('int64')]
#here I try to iterate:
iteration=[i for i in NewData if NewData[i]<=25]
print (iteration)
*TypeError:list indices must be int ot slices, not Series*
My CSV is the following:
I want to be able to choose the month with the variable "selec" and filter the results of the month I've chosen... so the output for selec="Feb" would be:
I tried as well with loc/iloc, but not lucky at all (unhashable type:'list').
See the below example for how you can:
select specific columns from a DataFrame by providing a list of columns between the selection brackets (link to tutorial)
select specific rows from a DataFrame by providing a condition between the selection brackets (link to tutorial)
iterate rows of a DataFrame, although I don't suppose you need it - if you'd like to keep working with the DataFrame after filtering it, it's better to use the method mentioned above (you won't have to put the rows back together, and it will likely be more performant because pandas is optimized for bulk operations)
import pandas as pd
# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])
print("---- Original df:")
print(df)
selec = "Feb" # let's pretend this comes from input()
print("\n---- Just the 3 columns:")
df = df[["Country", "City", selec]] # narrow down the df to just the 3 columns
df[selec] = df[selec].astype("int64") # convert the selec column to proper type
print(df)
print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)
print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
# we could also use row[3] instead of getattr(...)
if getattr(row, selec) <= 25:
print(row)
Output:
---- Original df:
Country City Jan Feb Mar Apr May
0 Spain Madrid 15 16 17 18
1 Spain Galicia 1 2 3 4
2 France Paris 0 2 3 4
3 Algeria Argel 20 28 29 30
---- Just the 3 columns:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
3 Algeria Argel 28
---- Filtered dataframe:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
---- Iterated & filtered dataframe:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)

Delete value Pandas Column

I started to learn about pandas and try to analyze a data
So in my data there is a column country which contain a few country,I only want to take the first value and change it to a new column.
An example First index have Colombia,Mexico,United Stated and I only wanna to take the first one Colombia [0] and delete the other contry[1:x],is this possible?
I try a few like loc,iloc or drop() but I hit a dead end so I asked in here
You can use Series.str.split:
df['country'] = df['country'].str.split(',').str[0]
Consider below df for example:
In [1520]: df = pd.DataFrame({'country':['Colombia, Mexico, US', 'Croatia, Slovenia, Serbia', 'Denmark', 'Denmark, Brazil']})
In [1521]: df
Out[1521]:
country
0 Colombia, Mexico, US
1 Croatia, Slovenia, Serbia
2 Denmark
3 Denmark, Brazil
In [1523]: df['country'] = df['country'].str.split(',').str[0]
In [1524]: df
Out[1524]:
country
0 Colombia
1 Croatia
2 Denmark
3 Denmark
Use .str.split():
df['country'] = df['country'].str.split(',',expand=True)[0]

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation in the new dataframe (sum 20 years, multiply the temperature by a constant or an array, etc...) and use then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure if I understood your logic correct so the math may need some work
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply to each row.

How to remove duplicates rows based on partial strings in Python

If i have a dataframe as follows in which 01 and 02, 03 and 04, 05 and 06 are same cites:
id city
01 New York City
02 New York
03 Tokyo City
04 Tokyo
05 Shanghai City
06 Shanghai
07 Beijing City
08 Paris
09 Berlin
How can I drop duplicates cites and get following dataframe? Thanks.
id city
01 New York
02 Tokyo
03 Shanghai
04 Beijing City
05 Paris
06 Berlin
Replace City part with null string and apply group by keeping the first row
df=pd.DataFrame({'id':[1,2,3,4],'city':['New York City','New York','Tokyo City','Tokyo']})
df looks like this
city id
0 New York City 1
1 New York 2
2 Tokyo City 3
3 Tokyo 4
Apply replace and group by to get first row in each group
df.city=df.city.str.replace('City','').str.strip()
df.groupby('city').first().sort_values('id')
Output:
city id
New York 1
Tokyo 3
Or use drop_duplicates on subset of columns. Thanks #JR ibkr
df.drop_duplicates(subset='city')
This is much easier in pandas now with drop_duplicates and the keep parameter.
# dataset
df = pd.DataFrame({'id':[1,2,3,4],'city':['New York City','New York','Tokyo City','Tokyo']})
# replace values
df.city = df.city.str.replace('City','').str.strip()
# drop duplicate (answer of original question)
df.drop_duplicates(subset=['city'])
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

How to use pandas to pull out the counties with the largest amount of water used in a given year?

I am new to python and pandas and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('info.csv') #reads csv
data['Year'] = pd.to_datetime(['Year'], format='%Y') #converts string to
datetime
data.index = data['Year'] #makes year the index
del data['Year'] #delete the duplicate year column
This is what the data looks like (this is only partial of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple questions:
I am not sure how to pull out only the "IR" in the WUCode Column with pandas and I am not sure how to print out a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string
df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
"Year": np.random.randint(2012, 2016, 1000),
"CountyName": np.random.choice(list(string.ascii_letters), 1000)},
columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
.reset_index()\
.sort_values('Annual', ascending=False)\
.head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.

Categories

Resources