I am working on the Cars.csv DataFrame which can be found here: https://www.kaggle.com/ljanjughazyan/cars1
My goal is to create a new Data Frame with the column names: USA, Europe and Japan and to save the number of cars that are in each category.
for a in list(cars.origin.unique()):
    x = pd.DataFrame({a: [cars.loc[cars["origin"] == a, "origin"].size]})
I tried it with this code, but as a result I obtain a Data Frame with only one column, "Europe". So it kind of works, but I can't figure out why it dismisses the other values. Why doesn't it work, and can it be done using a for-loop?
Thanks in advance!!
I assume "Europe" is the last item in your list, because you are overwriting x in every iteration of your for-loop.
So if you print(x) inside the loop, you should first see a DataFrame with just USA, then just Japan, and then just Europe, which is your final result.
I would suggest collecting the data in a Python dict and creating your DataFrame afterwards.
data = {}
for a in list(cars.origin.unique()):
    data[a] = [cars.loc[cars["origin"] == a, "origin"].size]
x = pd.DataFrame(data)
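For what it's worth, value_counts can produce the same counts without a loop. A minimal sketch on stand-in data (the real cars frame comes from the Kaggle CSV):

```python
import pandas as pd

# toy stand-in for the Kaggle cars data; the 'origin' column name is from the question
cars = pd.DataFrame({'origin': ['USA', 'USA', 'Japan', 'Europe', 'USA', 'Japan']})

# loop version from the answer above
data = {}
for a in list(cars.origin.unique()):
    data[a] = [cars.loc[cars["origin"] == a, "origin"].size]
x = pd.DataFrame(data)

# equivalent one-liner: count each origin, then transpose into one row
y = cars['origin'].value_counts().to_frame().T
```

Both frames end up with one row holding the count per origin; value_counts just skips the explicit loop.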
I am having tremendous difficulty getting my data sorted. I'm at the point where I could have manually created a new .csv file in the time I have spent trying to figure this out, but I need to do this through code. I have a large dataset of baseball salaries by player going back 150 years.
This is what my dataset looks like.
I want to create a new dataframe that adds the individual player salaries for a given team for a given year, organized by team and by year. Using the following technique I have come up with this:
team_salaries_groupby_team = salaries.groupby(['teamID', 'yearID']).agg({'salary': ['sum']})
which outputs this: my output. On screen it looks sort of like what I want, but I want a dataframe with three columns (plus an index on the left). I can't really do the sort of analysis I want to do with this output.
Lastly, I have also tried this method:
new_column = salaries['teamID'] + salaries['yearID'].astype(str)
salaries['teamyear'] = new_column
teamyear = salaries.groupby(['teamyear']).agg({'salary': ['sum']})
print(teamyear)
Another output. It adds the individual player salaries per team for a given year, but now I don't know how to separate the year and put it into its own column. Help please?
You just need to reset_index()
Here is sample code :
# DataFrame.append is deprecated; build the sample rows in one call instead
salaries = pd.DataFrame([
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 5000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 50000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
])
After that, groupby and reset_index:
sample_df = salaries.groupby(['teamID', 'yearID']).salary.sum().reset_index()
Is this what you are looking for?
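On the three-column point: the .agg({'salary': ['sum']}) form from the question produces MultiIndex columns, which is what makes the output awkward to analyze. A hedged sketch on invented rows showing that named aggregation keeps a single flat salary column instead:

```python
import pandas as pd

# minimal invented stand-in for the salaries data
salaries = pd.DataFrame({
    'yearID': [1985, 1985, 2016, 2016],
    'teamID': ['ATL', 'ATL', 'ATL', 'ATL'],
    'salary': [10000, 20000, 100000, 200000],
})

# named aggregation avoids MultiIndex columns entirely:
# the result has plain columns teamID, yearID, salary
flat = (salaries.groupby(['teamID', 'yearID'], as_index=False)
                .agg(salary=('salary', 'sum')))
```

With as_index=False there is no need for a separate reset_index call afterwards.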
I'm very new to coding, so perhaps there is a super simple answer for this, but here it goes:
I have a dataframe of a bunch of stocks. Each stock has a ticker and their names are stored in a column. I've created a list of all the stocks I want in my data frame. I am wondering how I remove the stocks with tickers that do not appear in my list.
from pandas import *
C = DataFrame(["TD","CM","AAPL","GOOG", "GOOS"],columns=["Ticker"])
There are several hundred occurrences of each ticker, and each has an associated price, return, risk free rate, and time. I've created a list of stocks that I want to analyze, based on how many occurrences they have in the dataframe. I have already done this, but the simplified list looks like this:
list = ['GOOG', 'AAPL']
I want to return a dataframe that only has these tickers in it, but also includes all the row data associated with each one. I'm honestly pretty stumped on how to do this, but I'm sure there is a simple answer. Any help would be super appreciated!
You can use isin for this.
tickers = ['GOOG', 'AAPL']
df = C.loc[C['Ticker'].isin(tickers)].reset_index(drop=True)
Output:
#df
Ticker
0 AAPL
1 GOOG
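The same isin mask also generalizes to the opposite task: negating it with ~ keeps only the tickers not in the list. A small self-contained sketch:

```python
import pandas as pd

C = pd.DataFrame(["TD", "CM", "AAPL", "GOOG", "GOOS"], columns=["Ticker"])
tickers = ['GOOG', 'AAPL']

# keep rows whose ticker is in the list
keep = C.loc[C['Ticker'].isin(tickers)].reset_index(drop=True)
# ~ negates the boolean mask: keep rows whose ticker is NOT in the list
drop = C.loc[~C['Ticker'].isin(tickers)].reset_index(drop=True)
```

In the real data, every row (price, return, risk-free rate, time) whose Ticker matches survives the filter, since isin builds a row-wise boolean mask.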
I have a pandas dataframe top3 with data as in the image below.
Using the two columns, STNAME and CENSUS2010POP, I need to find the sum for Wyoming (91738 + 75450 + 46133 = 213321), then the sum for Wisconsin (1825699), West Virginia, and so on, summing up the 3 counties for each state (and I need to sort them in ascending order after that).
I have tried this code to compute the answer:
topres=top3.groupby('STNAME').sum().sort_values(['CENSUS2010POP'], ascending=False)
Maybe you can suggest a more efficient way to do it? Maybe with a lambda expression?
You can use groupby:
df.groupby('STNAME').sum()
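With made-up rows for two of the states, the groupby/sort chain from the question can be checked end to end:

```python
import pandas as pd

# invented stand-in for the top3 frame (up to 3 counties per state)
df = pd.DataFrame({
    'STNAME': ['Wyoming', 'Wyoming', 'Wyoming', 'Wisconsin'],
    'CENSUS2010POP': [91738, 75450, 46133, 1825699],
})

# one total per state, largest first
totals = (df.groupby('STNAME')['CENSUS2010POP']
            .sum()
            .sort_values(ascending=False))
```

The Wyoming total matches the hand calculation in the question (213321), and Wisconsin sorts first.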
Note: I'm starting earlier in the problem, before selecting the top 3 counties per state, rather than jumping straight to their sum.
I found it helpful with this problem to use a list selection.
I created a data frame view of the counties with:
counties_df=census_df[census_df['SUMLEV'] == 50]
and a separate one of the states so I could get at their names.
states_df=census_df[census_df['SUMLEV'] == 40]
Then I was able to create that sum of the populations of the top 3 counties per state, by looping over all states and summing the largest 3.
res = [(x, counties_df[(counties_df['STNAME']==x)].nlargest(3,['CENSUS2010POP'])['CENSUS2010POP'].sum()) for x in states_df['STNAME']]
I converted that result to a data frame
dfObj = pd.DataFrame(res)
named its columns
dfObj.columns = ['STNAME','POP3']
sorted in place
dfObj.sort_values(by=['POP3'], inplace=True, ascending=False)
and returned the first 3
return dfObj['STNAME'].head(3).tolist()
Definitely groupby is a more compact way to do the above, but I found this way helped me break down the steps (and the associated course had not yet dealt with groupby).
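For comparison, here is a hedged sketch of that more compact groupby route, on invented census-style rows (column names follow the question; the data is made up):

```python
import pandas as pd

# invented data: SUMLEV 40 = state summary row, 50 = county row
census_df = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 50, 50, 40, 50, 50],
    'STNAME': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'CENSUS2010POP': [0, 10, 30, 20, 5, 0, 100, 50],
})

counties = census_df[census_df['SUMLEV'] == 50]
# per state: sum of its 3 largest counties; then keep the top 3 states
top3_states = (counties.groupby('STNAME')['CENSUS2010POP']
                       .apply(lambda s: s.nlargest(3).sum())
                       .nlargest(3)
                       .index.tolist())
```

This collapses the list comprehension, the column naming, the sort, and the head(3) into one chain.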
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. new_data.tail() confirms this: although Zimbabwe is listed last, there are 80336 rows, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame, with 76 columns (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[new_data['country'] == country]
    df[country] = pink.trust
Output here
As you can see, the data does not get included for any column after the first. I believe this is because the number of rows of 'trust' data varies by country: while the first column has 1000 rows, some countries have as many as 2500 data points and as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have utilizes this same exact data structure for the template data, so that is why I'm attempting to put it in a dataframe. Yes, I could put it in a dictionary, but I want to put it in a dataframe and can't figure out how.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always the union of the indices of all added columns.
I think in this case pd.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function transfers values from a particular column into column names. You could see the documentation for additional info. I hope that helps.
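A rough sketch of that pivot idea, using a cumcount helper so countries with different numbers of 'trust' values align by position (the data here is invented; shorter columns simply end up padded with NaN):

```python
import pandas as pd

# invented stand-in: unequal numbers of trust values per country
new_data = pd.DataFrame({
    'country': ['Albania', 'Albania', 'Albania', 'Zimbabwe', 'Zimbabwe'],
    'trust': [0.1, 0.2, 0.3, 0.8, 0.9],
})

# number the rows within each country so pivot can align them by position
new_data['row'] = new_data.groupby('country').cumcount()
wide = new_data.pivot(index='row', columns='country', values='trust')
```

Each country becomes a column, and the index extends to the longest country, which sidesteps the varying-row-count problem from the question.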
For a list of daily maximum temperature values from 5 to 27 degrees celsius, I want to calculate the corresponding maximum ozone concentration, from the following pandas DataFrame:
I can do this by using the following code, changing the 5 to 6, 7, etc.:
df_c=df_b[df_b['Tmax']==5]
df_c.O3max.max()
Then I have to copy and paste the output values into an Excel spreadsheet. I'm sure there must be a much more pythonic way of doing this, such as using a list comprehension. Ideally I would like to generate a list of values from the O3max column. Please give me some suggestions.
Use pd.Series.map with another pd.Series:
pd.Series(list_of_temps).map(df_b.set_index('Tmax')['O3max'])
Or, to get a dataframe:
result_df = pd.DataFrame(dict(temps=list_of_temps))
result_df['O3max'] = result_df.temps.map(df_b.set_index('Tmax')['O3max'])
I had another play around and think the following piece of code seems to do the job:
df_c=df_b.groupby(['Tmax'])['O3max'].max()
I would appreciate any thoughts on whether this is correct.
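It is: groupby('Tmax')['O3max'].max() returns exactly one maximum ozone value per temperature. A quick self-contained check with made-up readings:

```python
import pandas as pd

# invented Tmax / O3max readings, several rows per temperature
df_b = pd.DataFrame({
    'Tmax': [5, 5, 6, 6, 7],
    'O3max': [30, 45, 50, 40, 60],
})

# one row per temperature, holding the maximum ozone value for it
df_c = df_b.groupby(['Tmax'])['O3max'].max()
```

The result is a Series indexed by Tmax, so df_c.tolist() gives the list of maxima the question asked for, already ordered by temperature.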