I have a data set (datacomplete2), where I have data for each country for two different years. I want to calculate the difference between these years for each country (for values life, health, and lifegdp) and create a new data frame with the results.
The code:
life, health, lifegdp = [], [], []
for i in datacomplete2['Country'].unique():
    life.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'life'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'life'])
    health.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'health'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'health'])
    lifegdp.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'lifegdp'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'lifegdp'])
newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()], columns = ['life', 'health', 'lifegdp', 'country'])
newData
I think the for loop for the calculation is correct, and the problem is in creating the new DataFrame. When I try to run the code, I get an error message: 4 columns passed, passed data had 210 columns.
I have 210 countries, so I assume it somehow puts these values into the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
Country Code Year life health lifegdp
0 Algeria DZA 2000 70.292000 3.489033 20.146558
1 Algeria DZA 2016 76.078000 6.603844 11.520259
2 Angola AGO 2000 47.113000 1.908599 24.684593
3 Angola AGO 2016 61.547000 2.713149 22.684710
4 Antigua and Barbuda ATG 2000 73.541000 4.480701 16.412834
... ... ... ... ... ... ...
415 Vietnam VNM 2016 76.253000 5.659194 13.474181
416 World OWID_WRL 2000 67.684998 8.617628 7.854249
417 World OWID_WRL 2016 72.035337 9.978453 7.219088
418 Zambia ZMB 2000 44.702000 7.152371 6.249955
419 Zambia ZMB 2016 61.874000 4.477207 13.819775
Quick help would be appreciated! I started coding like two weeks ago, so I'm very novice with this stuff.
Anurag Reddy's answer is a good concise solution if you know the dates in advance. To present an alternative and slightly more general answer - this problem is a good example use case for pandas.DataFrame.diff.
Note you don't actually need to sort your example data, but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort the data if required
df.sort_values(by=["Country"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
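To see the idea on a self-contained toy frame (one row per country per year, with the life values from the question's sample; the other columns work the same way):

```python
import pandas as pd

# Toy version of the question's data: one row per country per year
df = pd.DataFrame({
    'Country': ['Algeria', 'Algeria', 'Angola', 'Angola'],
    'Year': [2000, 2016, 2000, 2016],
    'life': [70.292, 76.078, 47.113, 61.547],
})

# Per-country difference between consecutive rows (2016 minus 2000),
# then drop the NaN rows that diff() leaves at each group's start
new_df = (df.drop(columns=['Year'])
            .groupby('Country')
            .diff()
            .dropna()
            .reset_index(drop=True))
new_df.insert(loc=0, column='Country', value=df['Country'].unique())
```

The result has one row per country holding the 2016-minus-2000 difference.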
You could do this instead (note the drop() calls need to be assigned back, otherwise they are no-ops):
country_list = df.Country.unique().tolist()
df = df.drop(columns=['Code'])
df_2016 = df.loc[(df['Country'].isin(country_list)) & (df['Year'] == 2016)].reset_index(drop=True)
df_2000 = df.loc[(df['Country'].isin(country_list)) & (df['Year'] == 2000)].reset_index(drop=True)
df_2016 = df_2016.drop(columns=['Year'])
df_2000 = df_2000.drop(columns=['Year'])
df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)
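The label alignment that set_index/subtract relies on can be checked on a tiny pair of frames (hypothetical values, with the second frame deliberately shuffled):

```python
import pandas as pd

a = pd.DataFrame({'Country': ['Algeria', 'Angola'], 'life': [76.078, 61.547]})
b = pd.DataFrame({'Country': ['Angola', 'Algeria'], 'life': [47.113, 70.292]})

# Rows are matched by the Country index labels, not by position,
# so the shuffled row order in b does not affect the result
diff = a.set_index('Country').subtract(b.set_index('Country'), fill_value=0)
```

This is why the reset_index/set_index steps above are safe even if the two year-slices come out in different row orders.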
Related
I have a DataFrame with 4 columns. I want to take the log of three of the columns' values and then make a new DataFrame. My problem is that after taking the logs, the results are Series. My question is: how can I create a new DataFrame from these new Series?
Here is my dataset:
year gnp labor capital
0 1955 114043 8310 182113
1 1956 120410 8529 193745
2 1957 129187 8738 205192
3 1958 134705 8952 215130
I got log forms of columns by this code:
ln_gnp = np.log(df.gnp)
ln_labor = np.log(df.labor)
ln_capital = np.log(df.capital)
Now, I want to create a new DataFrame with columns 'year', 'ln_gnp', 'ln_labor', and 'ln_capital'.
I have tried pd.DataFrame('year', 'ln_gnp', ' ln_labor', 'ln_capital')
but it didn't work. I think there is another way to make a new dataframe.
Here is one way to do it.
Simpler approach
# take the log of the three columns with applymap
# (renamed DataFrame.map in pandas 2.1+), then concat with the year column
df2 = pd.concat([df['year'],
                 df[['gnp', 'labor', 'capital']].applymap(np.log)],
                axis=1)
df2
year gnp labor capital
0 1955 11.644331 9.025215 12.112383
1 1956 11.698658 9.051227 12.174298
2 1957 11.769016 9.075437 12.231701
3 1958 11.810842 9.099632 12.278998
If you need to use the Series you already created, then:
# create a dataframe from the series you already created
df2 = pd.DataFrame({'year': df['year'], 'gnp': ln_gnp, 'labor': ln_labor, 'capital': ln_capital})
df2
Considering the following simple dataset:
df = pd.DataFrame({'year': [1, 2, 3, 4, 5],
                   'gnp': [100, 200, 300, 400, 500],
                   'labor': [1000, 2000, 3000, 4000, 5000],
                   'capital': [1e4, 2e4, 3e4, 4e4, 5e4]})
df
Here is one of the solutions:
df['ln_gnp'] = np.log(df['gnp'])
df['ln_labor'] = np.log(df['labor'])
df['ln_capital'] = np.log(df['capital'])
df1=df[['year', 'ln_gnp', 'ln_labor', 'ln_capital']].copy()
df1
Output:
Another possible solution:
df['ln_' + df.columns[1:]] = np.log(df.iloc[:,1:])
Output:
year gnp labor capital ln_gnp ln_labor ln_capital
0 1955 114043 8310 182113 11.644331 9.025215 12.112383
1 1956 120410 8529 193745 11.698658 9.051227 12.174298
2 1957 129187 8738 205192 11.769016 9.075437 12.231701
3 1958 134705 8952 215130 11.810842 9.099632 12.278998
I have a large data set : https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv
I grouped this dataset by the variables and took the average of each country's technology adoption with:
df.groupby(['variable','iso3c'])[['value']].mean()
here is the output
value
variable iso3c
BCG AFG 45.763158
AGO 56.648649
ALB 93.875000
ARE 86.650000
ARG 93.700000
... ...
visitorrooms VNM 46920.636364
YEM 5527.280000
ZAF 48431.850000
ZMB 3518.000000
ZWE 4696.440000
Now, I want to sort within the variables by largest values to smallest. I thought of doing this:
df.groupby(['variable','iso3c'])[['value']].mean().sort_values(['variable','value'])
but this is the output
value
variable iso3c
BCG SWE 1.722500e+01
SOM 3.812500e+01
AFG 4.576316e+01
TCD 4.586111e+01
ETH 5.141026e+01
... ...
visitorrooms ESP 5.755948e+05
JPN 6.531027e+05
DEU 7.400641e+05
ITA 9.286496e+05
USA 3.040499e+06
[16933 rows x 1 columns]
I have no idea what happens to the values here. How do I fix this?
It looks like you just have a large variance in the values, so pandas is displaying them in scientific notation.
Option 1: You can chain your sort_values() with round(x), where x is the number of decimal places you want.
Option 2: Set the pandas display format option to a form you find more comfortable to work with.
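A sketch of both options, assuming two decimal places is the desired precision (display.float_format is the pandas option that controls how floats are printed):

```python
import pandas as pd

# Two values with a large spread, like the BCG vs visitorrooms means
s = pd.Series([45.763158, 3040499.0])

# Option 1: round the stored values themselves
rounded = s.round(2)

# Option 2: only change how floats are *printed*, not the stored values
pd.set_option('display.float_format', '{:.2f}'.format)
```

Option 2 leaves the underlying data untouched, which is usually what you want if the values feed further calculations.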
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv')
# Data Pre-Process
df_v2 = df.groupby(['variable','iso3c'])['value'].mean().reset_index()
df_v2.sort_values(['variable','value'],ascending=[True, False] ,inplace=True)
# Showing Output
df_v2
I would like to add the regional information to the main table that contains entity and account columns. In this way, each row in the main table should be duplicated, just like the append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!
Unfortunately, in older pandas versions no built-in method exists, so you'll need to build the Cartesian product of those DataFrames yourself; check that fancy explanation of merging DataFrames in pandas.
But for your specific problem, try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['Entity', 'Account'])
df1.Entity = ['Entity1', 'Entity1']
df1.Account = ['Sales', 'Cost']
df2 = pd.DataFrame(columns=['Region'])
df2.Region = ['North America', 'Europa', 'Asia']
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
output:
0 1 2
0 Entity1 Sales North America
1 Entity1 Sales Europa
2 Entity1 Sales Asia
3 Entity1 Cost North America
4 Entity1 Cost Europa
5 Entity1 Cost Asia
as expected.
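For reference, pandas 1.2+ has this built in as a cross merge, so the helper function is only needed on older versions:

```python
import pandas as pd

df1 = pd.DataFrame({'Entity': ['Entity1', 'Entity1'],
                    'Account': ['Sales', 'Cost']})
df2 = pd.DataFrame({'Region': ['North America', 'Europa', 'Asia']})

# Built-in Cartesian product (pandas >= 1.2); also keeps the column names
resultdf = df1.merge(df2, how='cross')
```

Unlike the NumPy helper, this preserves the original column names and dtypes.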
Btw, please provide the DataFrame as code next time, not as a screenshot or a link. It helps us save time (please check how to ask).
I have cross-sectional data consisting of yearly crime frequencies in the Chicago area and house prices. I want to select groups of columns from the dataset repeatedly, one group per year, because I want to use them as features for training a regression model. Is there a quick way to do this? Any ideas?
my attempt:
Here is one example where I could select a group of columns as features for training the ML model:
import re
import urllib.request
import pandas as pd
# download data from the cloud
u = "https://filebin.net/ml0sjn455gr8pvh3/crime_realEstate?t=7dkm15wq"
urllib.request.urlretrieve(u, "Ktest.csv")
# or just manually download the data first, then read it
crime_realEstate = pd.read_csv('crime_realEstate.csv')
cols_2012 = crime_realEstate.filter(regex='_2012').columns
crime_realEstate['Area_Name'] = crime_realEstate['Area_Name'].apply(lambda x: re.sub(' ', '_', str(x)))
regDF_2012 = crime_realEstate[cols_2012]
regDF_2012 = regDF_2012.assign(community_code=crime_realEstate['community_area'])
regDF_2012.dropna(inplace=True)
X_feats = regDF_2012.drop(['Avg_Price_2012'], axis=1)
y_label = regDF_2012['Avg_Price_2012'].values
basically, I want to do same things for regDF_2013, regDF_2014 and so on in the loop for better manipulation and easy to access data.
any idea to make this happen? any thoughts? Thanks
Melt your dataframe. This way you get a separate column for each variable, indexed by Area_Name:
import pandas as pd
crime_realEstate = pd.read_csv("Ktest.csv", delimiter="\t", index_col=0)
crime_melted = pd.melt(crime_realEstate, id_vars=['Area_Name', 'community_area'])
crime_melted["crime"] = crime_melted["variable"].apply(lambda x: x[:-5])
crime_melted["year"] = crime_melted["variable"].apply(lambda x: x[-4:])
crime_melted.drop(columns=["variable"], inplace=True)
crime_melted.set_index("Area_Name", inplace=True)
Resulting dataframe is (example rows):
community_area value crime year
Area_Name
Grand Boulevard 38.0 135.000000 assault 2012
Grand Boulevard 38.0 108.000000 assault 2013
Grand Boulevard 38.0 116.000000 assault 2014
Grand Boulevard 38.0 78.000000 assault 2015
Grand Boulevard 38.0 105.000000 assault 2016
Index can be accessed by using loc:
crime_melted.loc["Grand Boulevard"]
A separate column for every variable is what you need for machine learning :-)
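If you specifically want one feature frame per year rather than the long format, the loop from the question can be sketched like this (toy data with the same <feature>_<year> column pattern; all column names here are hypothetical):

```python
import pandas as pd

# Toy stand-in for the crime/real-estate data
df = pd.DataFrame({
    'community_area': [1, 2],
    'assault_2012': [10.0, 20.0], 'Avg_Price_2012': [100.0, 200.0],
    'assault_2013': [12.0, 18.0], 'Avg_Price_2013': [110.0, 190.0],
})

features_by_year = {}
for year in (2012, 2013):
    # select only the columns for this year, then attach the area code
    reg = df.filter(regex=f'_{year}$').copy()
    reg['community_code'] = df['community_area']
    reg = reg.dropna()
    # split into features and label for that year's regression
    X = reg.drop(columns=[f'Avg_Price_{year}'])
    y = reg[f'Avg_Price_{year}'].values
    features_by_year[year] = (X, y)
```

Each dict entry then holds the (X, y) pair for one year's model.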
I have the following two datasets:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values; I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the DataFrames on catcode, state, and year. In an earlier test, a df1.catcode with only one value per cell matched against values in another df.catcode with more than one value per cell, and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode. Additionally, if anyone has done a merge of this sort before, any caveats learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with the regex method, I'm looking at the documentation now.
Cleaning catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
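On pandas 0.25 and newer, Series.explode does the same "ungrouping" in one step, so the apply/stack/droplevel dance can be skipped (sketch on a one-row toy frame):

```python
import pandas as pd

# Toy frame with one multi-code cell, as in the question
df = pd.DataFrame({'catcode': ["E1600',\t'E1620',\t'A4000"],
                   'state': ['AL'], 'year': [2004]})

# Extract the codes, then explode the lists into one row per code;
# the original index values are repeated, ready for a join back to df
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}').explode()
```

Because explode repeats the original index, the same df.join(catcode_fixed) step works unchanged.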