Create a new DataFrame from selected data from another DataFrame - python

I want to create a box plot using pandas. I have data with average temperatures, and I want to select three cities and create three box plots to compare temperatures among them. To achieve this, I created a result DataFrame to store the data; the values for the cities are supposed to be stored in three columns (one column per city).
However, the following code only shows a plot for the first city. The problem is with the DataFrame: a separate query correctly gives a Series of values, but when I insert it into the result DataFrame, a column of NaN values is stored there. What am I missing here?
import pandas
import matplotlib.pyplot as plt
import wget
wget.download("https://raw.githubusercontent.com/pesikj/python-012021/master/zadani/5/temperature.csv")
temperatures = pandas.read_csv("temperature.csv")
helsinki = temperatures[temperatures["City"] == "Helsinki"]["AvgTemperature"]
miami = temperatures[temperatures["City"] == "Miami Beach"]["AvgTemperature"]
tokyo = temperatures[temperatures["City"] == "Tokyo"]["AvgTemperature"]
result = pandas.DataFrame()
result["Helsinki"] = helsinki
result["Miami Beach"] = miami
result["Tokyo"] = tokyo
result.plot(kind="box",whis=[0,100])
plt.show()

Pivot the cities into columns using pivot_table() and select the three cities you want:
result = temperatures.pivot_table(
    index='Day',
    columns='City',
    values='AvgTemperature',
)[['Helsinki', 'Miami Beach', 'Tokyo']]
# City Helsinki Miami Beach Tokyo
# Day
# 1 29.6 74.6 59.1
# 2 29.5 76.8 62.3
# ...
# 29 35.3 77.7 58.4
# 30 35.7 78.0 51.5
result.plot(kind='box', whis=[0,100])
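As for why the original code produced NaN: each filtered Series keeps its row index from temperatures, and the three indexes don't overlap, so assigning them into one DataFrame aligns on index and fills with NaN. A minimal fix for the original approach (a sketch) is to drop those indexes so the Series align by position:
result = pandas.DataFrame({
    "Helsinki": helsinki.reset_index(drop=True),
    "Miami Beach": miami.reset_index(drop=True),
    "Tokyo": tokyo.reset_index(drop=True),
})
result.plot(kind="box", whis=[0, 100])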

Since you're using data-science packages, consider using seaborn, which does the job of filtering/grouping the data for you whenever you call one of its plot functions:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
url = "https://raw.githubusercontent.com/pesikj/python-012021/master/zadani/5/temperature.csv"
temperatures = pd.read_csv(url)
# Filter for cities of interest
cities = ['Helsinki', 'Miami Beach', 'Tokyo']
filtered_temperatures = temperatures.loc[temperatures['City'].isin(cities)]
# Let seaborn do the grouping
sns.violinplot(data=filtered_temperatures, x='City', y='AvgTemperature')
plt.show()
Result: a violin plot comparing AvgTemperature across the three cities.
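If you specifically want box plots rather than violin plots, sns.boxplot takes the same arguments (a sketch):
sns.boxplot(data=filtered_temperatures, x='City', y='AvgTemperature', whis=[0, 100])
plt.show()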

Related

How to construct new DataFrame based on data from for loops?

I have a data set (datacomplete2) with data for each country for two different years. I want to calculate the difference between these years for each country (for the values life, health, and lifegdp) and create a new DataFrame with the results.
The code:
life, health, lifegdp = [], [], []
for i in datacomplete2['Country'].unique():
    life.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'life'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'life'])
    health.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'health'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'health'])
    lifegdp.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'lifegdp'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'lifegdp'])
newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()], columns = ['life', 'health', 'lifegdp', 'country'])
newData
I think the for loop for the calculation is correct; the problem is in creating the new DataFrame. When I run the code, I get an error message: 4 columns passed, passed data had 210 columns.
I have 210 countries, so I assume it is somehow putting these values into the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
Country Code Year life health lifegdp
0 Algeria DZA 2000 70.292000 3.489033 20.146558
1 Algeria DZA 2016 76.078000 6.603844 11.520259
2 Angola AGO 2000 47.113000 1.908599 24.684593
3 Angola AGO 2016 61.547000 2.713149 22.684710
4 Antigua and Barbuda ATG 2000 73.541000 4.480701 16.412834
... ... ... ... ... ... ...
415 Vietnam VNM 2016 76.253000 5.659194 13.474181
416 World OWID_WRL 2000 67.684998 8.617628 7.854249
417 World OWID_WRL 2016 72.035337 9.978453 7.219088
418 Zambia ZMB 2000 44.702000 7.152371 6.249955
419 Zambia ZMB 2016 61.874000 4.477207 13.819775
I started coding about two weeks ago, so I'm very new to this.
Anurag Reddy's answer is a good, concise solution if you know the dates in advance. To present an alternative and slightly more general answer: this problem is a good example use case for pandas.DataFrame.diff.
Note that your example data doesn't actually need sorting, but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort the data if required (by country, then year, so diff() computes 2016 minus 2000 within each country)
df.sort_values(by=["Country", "Year"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
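For the sample rows shown above, the first rows of new_df work out to (a sketch of the expected output, with values computed from the sample data):
#    Country Code    life    health   lifegdp
# 0  Algeria  DZA   5.786  3.114811 -8.626299
# 1   Angola  AGO  14.434  0.804550 -1.999883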
You could do this instead. Note that drop() returns a new DataFrame rather than modifying the original, so the result has to be assigned back:
country_list = df.Country.unique().tolist()
df = df.drop(columns=['Code'])
df_2016 = df.loc[(df['Country'].isin(country_list)) & (df['Year'] == 2016)].reset_index(drop=True)
df_2000 = df.loc[(df['Country'].isin(country_list)) & (df['Year'] == 2000)].reset_index(drop=True)
df_2016 = df_2016.drop(columns=['Year'])
df_2000 = df_2000.drop(columns=['Year'])
result = df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)
Here fill_value=0 means a country that appears in only one of the two years is treated as 0 in the missing year instead of producing NaN.
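Another compact alternative (a sketch, from neither answer above, and needing a reasonably recent pandas): pivot the years into columns and subtract them directly:
wide = df.pivot(index='Country', columns='Year', values=['life', 'health', 'lifegdp'])
diffs = wide.xs(2016, axis=1, level='Year') - wide.xs(2000, axis=1, level='Year')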

Adding information from a smaller table to a large one with Pandas

I would like to add the regional information to the main table, which contains entity and account columns. Each row in the main table should then be duplicated once per region, just like the append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!
Unfortunately, older pandas versions have no built-in method for this; you'll need to build the Cartesian product of those DataFrames yourself (check this detailed explanation of merging DataFrames in pandas, and see the cross-merge note after the output below).
But for your specific problem, try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['Entity', 'Account'])
df1.Entity = ['Entity1', 'Entity1']
df1.Account = ['Sales', 'Cost']
df2 = pd.DataFrame(columns=['Region'])
df2.Region = ['North America', 'Europa', 'Asia']
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
Output:
0 1 2
0 Entity1 Sales North America
1 Entity1 Sales Europa
2 Entity1 Sales Asia
3 Entity1 Cost North America
4 Entity1 Cost Europa
5 Entity1 Cost Asia
as expected.
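On pandas 1.2 or newer, there is now a built-in cross merge that does the same thing and keeps the original column names (a sketch):
resultdf = df1.merge(df2, how='cross')
# columns: Entity, Account, Region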
By the way, please provide the DataFrame as code next time, not as a screenshot or a link. It helps save everyone time (please check how to ask).

Make title of plots in for loop the column name of dataframe

I have a dataframe that looks like this:
points time Antwerp Busan Colombo Dalian Guangzhou Hamburg Hong Kong Jebel Ali/Dubai Kaohsiung ... Qingdao Rotterdam Shanghai Shenzhen Singapore Tanjung Pelepas Tanjung Priok/Jakarta Tianjin Xiamen Yingkou
0 1990-01-01 00:00:00 273.70395 279.31912 298.03195 268.42200 285.93228 271.31534 290.31357 289.83023 292.94135 ... 273.34103 274.18726 279.60450 288.37366 298.10950 298.23816 299.37143 272.06094 285.92570 265.19046
1 1990-01-01 01:00:00 273.72702 279.94266 298.02042 268.18445 286.04940 271.18503 290.59730 289.69333 292.95950 ... 273.52084 274.12128 280.13235 288.59967 298.21176 298.40808 299.59576 272.04776 286.36612 265.10303
I just want to run a bunch of different types of plots using a for loop.
For example:
for i in data.columns[1:]:
    plt.figure()
    plt.plot(data[i] - 273)
This creates a bunch of line plots for all 25 locations that I have. Now, I want the plot title to be the city name, which is the column name. I know how to do this with a dictionary, but I'm unsure whether there is an easier way that doesn't involve converting everything to a dictionary.
Column labels are an attribute of a DataFrame in pandas, so here your loop variable i is in fact the column name, and you can title your plot with plt.title(i). There is no need for a dictionary to convert indexes to names.
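Applied to your loop, a minimal sketch:
for i in data.columns[1:]:
    plt.figure()
    plt.plot(data[i] - 273)  # Kelvin to Celsius
    plt.title(i)             # the loop variable is already the column name
plt.show()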

How to recursively select group of columns from panel data in pandas?

I have cross-sectional data consisting of yearly crime frequencies in the Chicago area and house prices. I want to select a group of columns for each year from the dataset, because I want to use them as features for training a regression model. Is there a quick way to do this? Any ideas?
Example data snippet: the data is available on the cloud for browsing, at the URL used in the code below.
My attempt: here is one example where I select a group of columns as features for training the ML model.
import re
import urllib.request
import pandas as pd
# Download the data from the cloud ...
u = "https://filebin.net/ml0sjn455gr8pvh3/crime_realEstate?t=7dkm15wq"
urllib.request.urlretrieve(u, "crime_realEstate.csv")
# ... or just download it manually first, then read it
crime_realEstate = pd.read_csv('crime_realEstate.csv')
cols_2012 = crime_realEstate.filter(regex='_2012').columns
crime_realEstate['Area_Name'] = crime_realEstate['Area_Name'].apply(lambda x: re.sub(' ', '_', str(x)))
regDF_2012 = crime_realEstate[cols_2012]
regDF_2012 = regDF_2012.assign(community_code=crime_realEstate['community_area'])
regDF_2012.dropna(inplace=True)
X_feats = regDF_2012.drop(['Avg_Price_2012'], axis=1)
y_label = regDF_2012['Avg_Price_2012'].values
Basically, I want to do the same thing for regDF_2013, regDF_2014, and so on in a loop, for easier manipulation and access to the data.
Any ideas on how to make this happen? Thanks!
Melt your DataFrame. This way you get a separate column for each variable, indexed by Area_Name:
import pandas as pd
crime_realEstate = pd.read_csv("Ktest.csv", delimiter="\t", index_col=0)
crime_melted = pd.melt(crime_realEstate, id_vars=['Area_Name', 'community_area'])
crime_melted["crime"] = crime_melted["variable"].apply(lambda x: x[:-5])
crime_melted["year"] = crime_melted["variable"].apply(lambda x: x[-4:])
crime_melted.drop(columns=["variable"], inplace=True)
crime_melted.set_index("Area_Name", inplace=True)
The resulting DataFrame looks like this (example rows):
community_area value crime year
Area_Name
Grand Boulevard 38.0 135.000000 assault 2012
Grand Boulevard 38.0 108.000000 assault 2013
Grand Boulevard 38.0 116.000000 assault 2014
Grand Boulevard 38.0 78.000000 assault 2015
Grand Boulevard 38.0 105.000000 assault 2016
Rows can then be accessed by index using loc:
crime_melted.loc["Grand Boulevard"]
A separate column for every variable is what you need for machine learning :-)
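If you'd rather keep the wide format from your original attempt, you can also wrap the per-year selection in a plain loop. A minimal sketch, assuming the feature columns all end in _<year>, an Avg_Price column exists for each year, and the years present run 2012 through 2016:
regDF_by_year = {}
for year in range(2012, 2017):
    cols = crime_realEstate.filter(regex='_%d$' % year).columns
    regDF_by_year[year] = (crime_realEstate[cols]
                           .assign(community_code=crime_realEstate['community_area'])
                           .dropna())
# e.g. features and labels for 2013:
X_feats = regDF_by_year[2013].drop(['Avg_Price_2013'], axis=1)
y_label = regDF_by_year[2013]['Avg_Price_2013'].values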

I would like help optimising a triple for loop with an if in statement

I am running a triple for loop on a dataframe with almost 70 thousand entries. How do I optimize it?
My ultimate goal is to create a new column that holds the country of a seismic event. I have latitude, longitude, and 'place' (e.g. '17km N of North Nenana, Alaska') columns. I tried to reverse geocode, but with 68,488 entries there is no free service that lets me do that, and as a student I cannot afford one.
So I am using a dataframe with a list of countries and a dataframe with a list of states to compare against USGS['place']'s values. To do that, I ultimately settled on three nested for loops.
As you can imagine, it takes a long time. I was hoping there is a way to speed things up. I am using Python, but I use R as well; the for loops just run better in Python.
I'll take any better options.
import pandas as pd
USGS = pd.DataFrame(data={'latitude': [64.7385, 61.116], 'longitude': [-149.136, -138.655], 'place': ['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country': [None, None]})
states = pd.DataFrame(data={'state': ['AK', 'AL'], 'name': ['Alaska', 'Alabama']})
countries = pd.DataFrame(data={'country': ['Afghanistan', 'Canada']})
for head in states:
    for state in states[head]:
        for p in USGS['place']:
            if state in p:
                USGS['country'] = USGS['country'].map({p: 'United States'})
# I have not finished the code for the countries dataframe
You do have options for geocoding. MapQuest offers 15,000 free calls per month. You can also look at geopy, which is what I use.
import pandas as pd
import geopy
from geopy.geocoders import Nominatim
USGS_df = pd.DataFrame(data = {'latitude':[64.7385, 61.116], 'longitude':[-149.136, -138.655], 'place':['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country':[None, None]})
geopy.geocoders.options.default_user_agent = "locations-application"
geolocator = Nominatim(timeout=10)
for i, row in USGS_df.iterrows():
    try:
        lat = row['latitude']
        lon = row['longitude']
        location = geolocator.reverse('%s, %s' % (lat, lon))
        country = location.raw['address']['country']
        print('Found: ' + location.address)
        USGS_df.loc[i, 'country'] = country
    except:
        print('Location not identified: %s, %s' % (lat, lon))
Input:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska None
1 61.1160 -138.655 74km WNW of Haines Junction, Canada None
Output:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska USA
1 61.1160 -138.655 74km WNW of Haines Junction, Canada Canada
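The reverse-geocoding loop above still makes one network call per row, which is slow for 68,488 rows. If you'd rather stick with your string-matching idea, it can be vectorized with pandas string methods, which removes the triple loop entirely. A minimal sketch, assuming a place counts as a US event whenever it contains a state name from your states table:
us_pattern = '|'.join(states['name'])  # 'Alaska|Alabama|...'
USGS.loc[USGS['place'].str.contains(us_pattern), 'country'] = 'United States'
# Fill the remaining rows from the countries table, e.g. 'Canada'
country_pattern = '(' + '|'.join(countries['country']) + ')'
USGS['country'] = USGS['country'].fillna(
    USGS['place'].str.extract(country_pattern, expand=False))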
