I need to print 2 specific columns as a DataFrame and not a Series.
This question is different from:
When df1 is printed it returns a Series. Movie title and movie year are the columns I need to print. I used the df = my_series.to_frame()
method but it still returns a Series.
def find_movie_by_year():
    x = input("enter movie year")
    d = df[df["title_year"] == int(x)]
    df1 = d[["movie_title", "title_year"]]
    print('''
''')
    print(df1)
EDIT 1:
This is the output format I need:
This is the output format I get from the code:
You can use the display function (available in IPython/Jupyter) instead of simply printing it.
def find_movie_by_year():
    x = input("enter movie year")
    d = df[df["title_year"] == int(x)]
    df1 = d[["movie_title", "title_year"]]
    print('''
''')
    #print(df1)
    display(df1)
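Note that display is only defined automatically inside IPython/Jupyter. If you run this as a plain script, a rough alternative (just a sketch, assuming df is the same DataFrame as in the question) is to format the frame with to_string:

def find_movie_by_year():
    x = input("enter movie year")
    df1 = df.loc[df["title_year"] == int(x), ["movie_title", "title_year"]]
    # to_string keeps the tabular DataFrame layout; index=False hides the row labels
    print(df1.to_string(index=False))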
I am having trouble filtering data based on multiple conditions.
[dataframe image][1]
[1]: https://i.stack.imgur.com/TN9Nd.png
When I filter based on multiple conditions, I get an empty DataFrame.
user_ID_existing = input("Enter User ID:")
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = str(input("Choose from the above category:"))
info = pd.read_excel("Test.xlsx")
data = pd.DataFrame(info)
df = data[((data.ID == user_ID_existing) & (data.Category == user_Category_existing))]
print(df)
If I replace the variables user_ID_existing and user_Category_existing with literal values, I get the matching rows. I even tried with numpy and still only get an empty DataFrame:
filtered_values = np.where((data['ID'] == user_ID_existing) & (data['Category'].str.contains(user_Category_existing)))
print(filtered_values)
print(data.loc[filtered_values])
input always returns a string, but since the ID column read by pandas has a numeric dtype, filtering it against a string gives you an empty DataFrame.
You need to use int to convert the value/ID entered by the user to a number.
Try this:
user_ID_existing = int(input("Enter User ID:"))
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = input("Choose from the above category:")
data = pd.read_excel("Test.xlsx")
df = data[(data["ID"].eq(user_ID_existing))
& (data["Category"].eq(user_Category_existing))].copy()
print(df)
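If you are not sure which side needs converting, it can help to inspect the dtype first. A small sketch (assuming the same data, and using the raw string from input rather than the int() conversion above; user_id_raw is just an illustrative name):

# see what dtype pandas actually assigned to the ID column
print(data["ID"].dtype)

# alternative: compare as strings instead of converting the input,
# which also works when the ID column holds mixed or non-numeric values
user_id_raw = input("Enter User ID:").strip()
mask = data["ID"].astype(str).eq(user_id_raw)
print(data[mask])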
I have a data frame
cat input.csv
dwelling,wall,weather,occ,height,temp
5,2,Ldn,Pen,154.7,23.4
5,4,Ldn,Pen,172.4,28.7
3,4,Ldn,Pen,183.5,21.2
3,4,Ldn,Pen,190.2,30.3
To which I'm trying to apply the following function:
input_df = pd.read_csv('input.csv')

def folder_column(row):
    if row['dwelling'] == 5 and row['wall'] == 2:
        return 'folder1'
    elif row['dwelling'] == 3 and row['wall'] == 4:
        return 'folder2'
    else:
        return 0
I want to run the function on the input dataset and store the output in a separate data frame using something like this:
temp_df = pd.DataFrame()
temp_df = input_df['archetype_folder'] = input_df.apply(folder_column, axis=1)
But when I do this I only get the newly created 'archetype_folder' in the temp_df, when I would like all the original columns from the input_df. Can anyone help? Note that I don't want to add the new column 'archetype_folder' to the original, input_df. I've also tried this:
temp_df = input_df
temp_df['archetype_folder'] = temp_df.apply(folder_column, axis=1)
But when I run the second command both input_df and temp_df end up with the new column?
Any help is appreciated!
Use DataFrame.copy:
temp_df = input_df.copy()
temp_df['archetype_folder'] = temp_df.apply(folder_column, axis=1)
You need to create a copy of the original DataFrame, then assign the return values of your function to it. Consider the following simple example:
import pandas as pd

def is_odd(row):
    return row.value % 2 == 1

df1 = pd.DataFrame({"value": [1, 2, 3], "name": ["uno", "dos", "tres"]})
df2 = df1.copy()
df2["odd"] = df1.apply(is_odd, axis=1)
print(df1)
print("=====")
print(df2)
gives output
   value  name
0      1   uno
1      2   dos
2      3  tres
=====
   value  name    odd
0      1   uno   True
1      2   dos  False
2      3  tres   True
You don't need apply. Use .loc to be more efficient.
temp_df = input_df.copy()
m1 = (input_df['dwelling'] == 5) & (input_df['wall'] == 2)
m2 = (input_df['dwelling'] == 3) & (input_df['wall'] == 4)
temp_df.loc[m1, 'archetype_folder'] = 'folder1'
temp_df.loc[m2, 'archetype_folder'] = 'folder2'
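One difference from the original folder_column function: rows that match neither mask are left as NaN instead of 0. If you also want the 0 default, numpy.select is a compact way to express it (a sketch using the same masks m1 and m2 as above):

import numpy as np

# np.select picks 'folder1'/'folder2' where the masks are True and 0 elsewhere
temp_df['archetype_folder'] = np.select([m1, m2], ['folder1', 'folder2'], default=0)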
Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have scores collected for a match stored in a list, say x = [1,0,4]. I have found where in the data these scores exist using pandas and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easier if you set name as the index column directly when you read in the CSV.
import pandas as pd
from io import StringIO

data = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(data), index_col='name')

x = pd.Series({'match1': 1, 'match2': 0, 'match3': 4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want on its own, because it does an element-wise compare and returns a DataFrame of True/False values. So you need to aggregate those rows with .all(axis=1), which keeps only the rows where every comparison was True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
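Since the question stores the scores as a plain list x = [1, 0, 4], it is worth noting that the same pattern also works without building a Series, as long as the list is in the same order as the columns (a small sketch on the df defined above):

# comparing a DataFrame to a list compares position by position across the
# columns, so this relies on x being ordered as match1, match2, match3
x = [1, 0, 4]
print(df[(df == x).all(axis=1)].index.tolist())   # ['Charlie']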
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np

# Make sample data (matching the CSV in the question)
match1 = np.array([2, 2, 1])
match2 = np.array([4, 3, 0])
match3 = np.array([3, 4, 4])
name = np.array(['Alice', 'Bob', 'Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df

# Example of the list you want to get the data from
x = [1, 0, 4]
#x = [2, 4, 3]

# Should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])]

# Make a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])])

# Loop that prints out the names based on the index of mydf
# (if several names match it prints them all; if only one matches, just that one)
for i in range(0, len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your DataFrame (change the name to match yours), assuming the values [1, 0, 4] are of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If data holds the values as strings (object dtype), then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
I just wrote this function to calculate a person's age based on two columns of a pandas DataFrame. Unfortunately, if I use return, the function returns the same value for all rows, but if I use the print statement it gives me the right values.
Here is the code:
def calc_age(dataset):
    index = dataset.index
    for element in index:
        year_nasc = train['DT_NASCIMENTO_BENEFICIARIO'][element][6:]
        year_insc = train['ANO_CONCESSAO_BOLSA'][element]
        age = int(year_insc) - int(year_nasc)
        print('Age: ', age)
        #return age
The DT_NASCIMENTO_BENEFICIARIO column holds dates like 03-02-1987, and ANO_CONCESSAO_BOLSA holds years like 2009.
What am I doing wrong?!
A function can only return a single value: with return inside the loop it exits on the first iteration, and assigning that single value to a column broadcasts it to every row, which is why you see the same age everywhere. If what you want is to subtract the year of DT_NASCIMENTO_BENEFICIARIO from ANO_CONCESSAO_BOLSA, and df is your DataFrame, you can do it without a loop:
# cast to datetime
df["DT_NASCIMENTO_BENEFICIARIO"] = pd.to_datetime(df["DT_NASCIMENTO_BENEFICIARIO"])
df["age"] = df["ANO_CONCESSAO_BOLSA"] - df["DT_NASCIMENTO_BENEFICIARIO"].dt.year
# print the result, or do something else with it:
print(df["age"])
I am trying to write my own function in Python 3.5, but not having much luck.
I have a data frame that is 17 columns, 1,200 rows (tiny)
One of the columns is called "placement". Within this column, I have text contained in each row. The naming convention is as follows:
Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_
The following code works perfectly and does exactly what I need it to do; I just don't want to do this for every data set I have:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
df_detailed = df.join(df_detailed)
new_columns = [...]  # here I rename the columns labelled 0, 1, 2, etc.
df_detailed.columns = new_columns
df_detailed.head()
What I'm trying to do is build a function that takes any column with _ as the delimiter and splits it across new columns.
I have tried the following (but unfortunately defining my own functions is something I'm horrible at):
def text_to_column(df):
    df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
    headings = df_detailed.columns
    headings.replace(" ", "_")
    df_detailed = df.join(df_detailed)
    df_detailed.columns = headings
    return (df)
and I get the following error "AttributeError: 'RangeIndex' object has no attribute 'replace'"
The end goal here is to write a function where I can pass the column name into the function, it separates the values contained within the column into new columns and then joins this back to my original Data Frame.
If I'm being ridiculous, please let me know. If someone can help me, it would be greatly appreciated.
Thanks,
Adrian
You need the rename function to replace the column names:
headings = df_detailed.columns
headings.replace(" ", "_")
change to:
df_detailed = df_detailed.rename(columns=lambda x: x.replace(" ", "_"))
Or convert the columns to a Series with to_series, because replace does not work on an Index (the column names):
headings.replace(" ", "_")
change to:
headings = headings.to_series().replace(" ", "_")
Also:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
can be changed to:
df_detailed = df['Placement'].str.rstrip('_').str.split('_', expand=True).astype(str)
EDIT:
Sample:
df = pd.DataFrame({'a': [1, 2], 'Placement': ['Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_', 'a_b_c_d_f_g_h_i_']})
print (df)
                                           Placement  a
0  Campaign_Publisher_Site_AdType_AdSize_Device_A...  1
1                                   a_b_c_d_f_g_h_i_  2
# input is a DataFrame and a column name
def text_to_column(df, col):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    # replace column names if necessary
    df_detailed.columns = df_detailed.columns.to_series().replace(" ", "_")
    # remove the original column and join the new df
    df_detailed = df.drop(col, axis=1).join(df_detailed)
    return df_detailed

df = text_to_column(df, 'Placement')
print(df)
   a         0          1     2       3       4       5         6       7
0  1  Campaign  Publisher  Site  AdType  AdSize  Device  Audience  Tactic
1  2         a          b     c       d       f       g         h       i
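Since the end goal in the question was also to rename the split columns, a small extension of the function above (just a sketch; the names list below is only an illustration taken from the naming convention stated in the question):

def text_to_column(df, col, names=None):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    # optionally give the split columns meaningful names instead of 0, 1, 2, ...
    if names is not None:
        df_detailed.columns = names
    return df.drop(col, axis=1).join(df_detailed)

cols = ['Campaign', 'Publisher', 'Site', 'AdType',
        'AdSize', 'Device', 'Audience', 'Tactic']
df = text_to_column(df, 'Placement', names=cols)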