I need to print two specific columns as a DataFrame and not a Series.
When df1 is printed it returns a Series. movie_title and title_year are the columns I need to print. I used df = my_series.to_frame(), but it still returns a Series.
def find_movie_by_year():
    x = input("enter movie year")
    d = df[df["title_year"] == int(x)]
    df1 = d[["movie_title", "title_year"]]
    print('''
''')
    print(df1)
EDIT 1: (screenshots of the desired output format and of the actual output were attached to the original question.)
You can use IPython's display() function instead of simply printing it:
from IPython.display import display  # available automatically inside Jupyter notebooks

def find_movie_by_year():
    x = input("enter movie year")
    d = df[df["title_year"] == int(x)]
    df1 = d[["movie_title", "title_year"]]
    print('''
''')
    # print(df1)
    display(df1)  # renders the DataFrame as a formatted table
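If you are running outside a notebook, where display() falls back to plain text anyway, DataFrame.to_string() prints a similar aligned table; a minimal sketch:

print(df1.to_string(index=False))  # aligned text table, without the index column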
I am trying to display the name of the country with the biggest difference between the amount of gold in "Gold" and the amount of gold in "Gold.1". I am unsure how to display the name of the country (column 1) when calculating this difference.
def answer_two():
    for country in df:
        valx = df["Gold"]
        valy = df["Gold.1"]
        valAns = abs(valx - valy)
        if df.iloc[country] > df.iloc[country-1]:
            ans = valAns
    return ans

print(answer_two())
My plan was to calculate valAns and then maybe get the index of country and then return that name...
Here's an example that should help answer your question. idxmax() returns the index of the greatest value in a Series; in this case that Series is the difference between columns one and two.
import pandas as pd

df = pd.DataFrame()
df['one'] = [1, 2, 3]
df['two'] = [2, 5, 8]
df['diff'] = df.two - df.one           # elementwise difference
max_value_index = df['diff'].idxmax()  # index label of the largest value
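Applied to the question above, a minimal sketch (assuming the country names are the DataFrame's index, which the original post does not show) would be:

def answer_two():
    diff = (df['Gold'] - df['Gold.1']).abs()  # absolute gold difference per country
    return diff.idxmax()                      # index label of the max, i.e. the country name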
I'm trying to add rows and columns to a pandas DataFrame incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across these datastores, I'd like to incrementally update a DataFrame, where in some cases either names or days will be missing.
import random
import pandas as pd

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
      2016_1  2016_2  2016_3
name
Bill    15.0     NaN     NaN
Bill     NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
      2016_1  2016_2  2016_3
name
Bill    15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:

df.groupby('name')[df.columns.values].sum()
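One caveat worth noting: on recent pandas versions, sum() turns a group that is all-NaN in a column into 0.0. Passing min_count=1 keeps those cells as NaN, matching the desired output above:

df.groupby('name').sum(min_count=1)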
Try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function (provided you have at most one value per column and all the others are null) together with a groupby on df.
First, use reset_index to get back your name column.
Then use groupby and apply. Here I propose a custom function which checks that there is at most one value per column, and raises a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all the values of a column are null but one, x.mean() will return that value (in fact almost any aggregator would work here: since there is only one value, that is the one returned).
It would be easier to have the names as columns and the dates as the index instead. Plus, you can work with plain lists inside the loop and create the pd.DataFrame afterwards, e.g.:
import random
import numpy as np
import pandas as pd

year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []
for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a day will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # date label
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))  # set_index returns a new frame; assign it back
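With this layout, one person's history is just a column, so slicing for the heatmap stays simple, e.g.:

print(df['Bill'])  # Bill's values, indexed by date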
You can append the entries whose names do not already exist, and then call update() to fill in the values for the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            # append only the rows whose name is not yet in df...
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            # ...then fill in values for names that already exist
            df.update(new_df)
    # df.set_index('name', inplace=True, drop=True)
    print(df)
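As for efficiency: growing a DataFrame row by row is quadratic, so a sketch that collects the values in a plain dict of dicts and builds the frame once at the end (a rewrite of foo() under the same random-skip assumptions) is usually faster:

import random
import pandas as pd

def foo():
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    rows = {name: {} for name in names}  # name -> {column label: value}
    for day in range(1, 4):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            rows[name]['{}_{}'.format(year, day)] = random.randrange(0, 20)
    df = pd.DataFrame.from_dict(rows, orient='index')  # one row per name, NaN for gaps
    df.index.name = 'name'
    print(df)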
I have a pandas dataframe that I created using groupby, and the result is this:
       loan_type
type
risky      23150
safe       99457
I want to create a column called pct and add it to the dataframe. I did this:
total = loans.sum(numeric_only=True)
loans['pct'] = loans.apply(lambda x: x / total)
And the result was this:
       loan_type  pct
type
risky      23150  NaN
safe       99457  NaN
At this point I'm not sure what I need to do to get that percentage column; see the code below for how I created the whole thing:
import numpy as np
import pandas as pd

bad_loans = np.array(club['bad_loans'])
for index, row in enumerate(bad_loans):
    if row == 0:
        bad_loans[index] = 1
    else:
        bad_loans[index] = -1

loans = pd.DataFrame({'loan_type': bad_loans})
loans['type'] = np.where(loans['loan_type'] == 1, 'safe', 'risky')
loans = np.absolute(loans.groupby(['type']).agg({'loan_type': 'sum'}))
total = loans.sum(numeric_only=True)
loans['pct'] = loans.apply(lambda x: x / total)
The problem is that you are dividing not by a scalar value, but by a one-element Series, and because the indexes don't align you get NaNs.
I think the simplest fix is to convert the Series total to a numpy array:
total = loans.sum(numeric_only=True)
loans['pct'] = loans.loan_type / total.values
print(loans)

       loan_type       pct
type
risky      23150  0.188815
safe       99457  0.811185
Or select the scalar by indexing with [0], so the output is a single number:
total = loans.sum(numeric_only=True)[0]
loans['pct'] = loans.loan_type / total
print(loans)

       loan_type       pct
type
risky      23150  0.188815
safe       99457  0.811185
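Equivalently, assuming loan_type is the only numeric column, you can skip the intermediate total entirely and divide by the column's scalar sum:

loans['pct'] = loans['loan_type'] / loans['loan_type'].sum()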
I am reading an xlsx file, and for every row I want to create columns based on the rows before it.
import pandas as pd
import numpy as np

def get_total(x):
    name = x["NAME"]
    city = x["CITY"]
    index = x.index  # bug: for a row passed in by apply, .index holds the column labels
    records = df[(df.index < index) & (df["NAME"] == name) & (df["CITY"] == city)]
    return records.shape[0]

data_filename = "data.xlsx"
df = pd.read_excel(data_filename, na_values=["", " ", "-"])
df["TOTAL"] = df.apply(lambda x: get_total(x), axis=1)
The get_total function is a simple example of what I want to achieve.
I could use df.reset_index(inplace=True) to get the dataframe's index as a column. I think there must be a better way to get the index of a row.
You can rewrite your function like this:
def get_total(x):
    name = x["NAME"]
    city = x["CITY"]
    index = x.name  # the current row's index label
    records = df.loc[0:index]  # note: .loc slicing is inclusive, so this includes the current row
    return records.loc[(records['NAME'] == name) & (records['CITY'] == city)].shape[0]
The name attribute of the row that apply passes in is the current row's index value.
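If the goal is simply the number of earlier rows with the same NAME/CITY pair, a vectorized alternative (a sketch under that assumption) is groupby().cumcount(), which numbers each row by its prior occurrences within its group:

df["TOTAL"] = df.groupby(["NAME", "CITY"]).cumcount()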