So I have a DataFrame that has around 40 columns. They contain (made-up) scores for a test. The columns are currently named as follows:
Student, Date, Score, Score.1, Score.2, all the way to Score.39.
We were asked to reset the column names so they match the score number (change Score to Score.1, Score.1 to Score.2, Score.2 to Score.3, and so on).
My code looks like this now:
import pandas as pd
prog = pd.read_excel('File.xlsx')
for c in prog.columns:
    prog[c].rename(columns = lambda x : 'Score_' + x)
Unfortunately this does not give the output I want. I was hoping someone could show me how to do this.
Thanks in advance
John Galt came up with the solution in the comments:
cols = df.columns.tolist()
df.columns = cols[:2] + ['Score_%i' % i for i in xrange(1, len(cols[2:]) + 1)]
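For reference, xrange only exists in Python 2; a minimal Python 3 sketch of the same idea (keeping the first two columns, which the question says are Student and Date, and renumbering the rest) could look like this:

import pandas as pd

prog = pd.read_excel('File.xlsx')
cols = prog.columns.tolist()
# keep the first two names, renumber the remaining columns as Score_1 ... Score_N
prog.columns = cols[:2] + ['Score_%i' % i for i in range(1, len(cols[2:]) + 1)]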
Related
I have a CSV file with a huge dataset containing two columns. I want to compare the data of these two columns such that, if a duplicated pair is present, it gets deleted. For example, if my data file looks something like this:
Column A Column B
DIP-1N DIP-1N
DIP-2N DIP-3N
DIP-3N DIP-2N
DIP-4N DIP-5N
Then the first entry gets deleted because I don't want two "DIP-1N"s. Also, the order of occurrence of a pair is not an issue, as long as the entry is unique. For example, here, DIP-2N & DIP-3N and DIP-3N & DIP-2N are paired, but both entries mean the same thing, so I want to keep one entry and delete the rest.
I have written the following code, but I don't know how to compare the entries of both columns simultaneously.
import csv
import pandas as pd
file = pd.read_csv("/home/staph.csv")
for i in range(len(file['Column A'])):
    for j in range(len(file['Column B'])):
        list1 = []
        list2 = []
        list1.append(file[file['Column A'].str.contains('DIP-'+str(i)+'N')])
        list2.append(file[file['Column B'].str.contains('DIP-'+str(i)+'N')])
for ele1,ele2 in list1,list2:
    if(list1[ele1]==list2[ele2]):
        print("Duplicate")
    else:
        print("The 1st element is :", ele1)
        print("The 2nd element is :", ele2)
Seems like something is wrong, as there is no output. The program just ends without any output or error. Any help would be much appreciated in terms of whether my code is wrong or if I can optimize the process in a better way. Thanks :)
It might not be the best way to get what you need, but it works.
# combine the two columns into one string, then sort the pair so that
# "A B" and "B A" become identical
df['temp'] = df['Column A'] + " " + df['Column B']
df['temp'] = df['temp'].str.split(" ")
df['temp'] = df['temp'].apply(lambda list_: " ".join(sorted(list_)))
# drop rows whose (order-insensitive) pair has already been seen
df.drop_duplicates(subset=['temp'], inplace=True)
# drop rows where both columns hold the same value (e.g. DIP-1N / DIP-1N)
df = df[df['Column A'] != df['Column B']]
df.drop('temp', axis=1, inplace=True)
Output:

index  Column A  Column B
1      DIP-2N    DIP-3N
3      DIP-4N    DIP-5N
With some tweaking you could also do it with pandas methods:
# get indices of duplicate-free (except first occurrence) combined sets of col A and B
# (frozenset rather than set, because drop_duplicates needs hashable values)
keep_ind = pd.Series(df[["Column A", "Column B"]].values.tolist()).apply(frozenset).drop_duplicates(keep="first").index
# use these indices to filter the DataFrame
df = df.loc[keep_ind]
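Note that rows where both columns hold the same value (such as the DIP-1N / DIP-1N row) are still unique under this set-based approach, so to drop them as well you would additionally need a filter like df = df[df['Column A'] != df['Column B']] from the first answer.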
So I have a really big dataframe with the following information:
There are 2 columns, "eethuis" and "caternaar", which hold True or False depending on whether the place has one or not. Now I need to find the number of places that have both an eethuis and a caternaar. This means I need to count the rows where both eethuis and caternaar are True, but I can't really find a way, even after searching for some time.
This is what I have. I merged the 2 columns that I need together, but now I still need to select and count the rows where both are True:
In the picture you will not see a row where both are True, but there are some. It's a really long table with over 800 rows.
Would be nice if someone could help me!
If I understand your question correctly, you can use '&'; here is an example on random data:
import pandas as pd
import random
# create random data
df = pd.DataFrame()
df['col1'] = [random.randint(0,1) for x in range(10000)]
df['col2'] = [random.randint(0,1) for x in range(10000)]
df = df.astype(bool)
# filter it:
df1 = df[(df['col1']==True) & (df['col2']==True)]
# check size:
df1.shape
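If only the count is needed, summing the boolean AND of the two columns works as well; a small sketch on the same random data:

both = (df['col1'] & df['col2']).sum()
print(both)  # number of rows where both columns are True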
Thanks to Ezer K I found the solution! Here is the code:
totaal = df_resto[(df_resto['eethuis']==True) & (df_resto['cateraar']==True)]
This is the output:
So you see it works!
And the count is 41!
I am reading two dataframes, looking at one column, and then showing the difference in position between the two dataframes with a -1 or +1, etc.
I have tried the following code, but it only shows 0 in Position Change when there should be a difference between British Airways and Ryanair:
import pandas as pd
import numpy as np

first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
df1['Position Change'] = np.where(df1['airlines'] == df2['airlines'], 0, df1['Position'] - df2['Position'])
I have also tried to do it with the following code, but I just keep getting a ValueError: cannot reindex from a duplicate axis:
df1.set_index('airlines', drop=False) # Set index to cross reference by (icao)
df2.set_index('airlines', drop=False)
df2['Position Change'] = df1[['Position']].sub(df2['Position'], axis=0)
df2 = df2.reset_index(drop=True)
pd.set_option('display.precision', 0)
Base csv looks like this -
and Base2 csv looks like this -
As you can see, British Airways is in position 3 in the Base csv and position 4 in the Base2 csv, but when running the code it just shows 0 and does not do the math between the two dataframes.
Have been stuck on this for days now, would be so grateful for any help.
I would like to offer an easier way based on columns, values and an if-statement.
It is probably a bit slow if you have a big dataframe, but it gives you the information you expect.
first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
I agree that my earlier answer did not match your question.
Now, if I understand correctly, you want to create a new column in the DataFrame that gives you -1 if the values in the same column of the two DataFrames differ and 1 if they match.
It should help:
key = "Name_Of_Column"
new = []
for i in range(0, len(df1)):
    if df1[key][i] != df2[key][i]:
        new.append(-1)
    else:
        new.append(1)
df3 = pd.DataFrame({"Diff": new})     # build the new column as its own DataFrame
df1 = pd.concat([df1, df3], axis=1)   # attach it as a column (append would add rows, not a column)
print(df1)
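For larger frames, a vectorized sketch of the same idea with numpy (assuming both DataFrames share the same index and the column name is the one stored in key) would be:

import numpy as np

df1['Diff'] = np.where(df1[key] != df2[key], -1, 1)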
I am giving you an alternative; I am not sure whether it is appreciated or not, but it is just an idea.
After reading the two CSVs and getting the columns you require, why don't you try joining the two dataframes on the column 'airlines'? It will merge the two dataframes with 'airlines' as the key, as sketched below.
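A minimal sketch of that idea; the suffix names and the direction of the subtraction here are assumptions, not taken from the original post:

merged = df1.merge(df2, on='airlines', suffixes=('_base', '_base2'))
merged['Position Change'] = merged['Position_base'] - merged['Position_base2']
print(merged)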
Could someone please look at the code below and advise what I have done wrong?
I have 2 panda dataframes - df and x1
Both have the same columns and column names
I have to execute the code below for df.Date_Appointment, x1.Date_Appointment, and similarly for df.Date_Scheduled and x1.Date_Scheduled. As such, I created a list for the columns and the dataframes.
I am trying to write a single function, but obviously I am doing something wrong. Please advise.
import pandas as pd
df = pd.read_csv('file1.csv')
x1 = pd.read_csv('file2.csv')
# x1 is a dataframe created after filtering on one column.
# df and x1 have same number of columns and column names
# x1 is a subset of df
dataframe = ['df','x1']
column = ['Date_Appointment', 'Date_Scheduled']
def df_det (dataframe.column):
    (for df_det in dataframe.column :
        d_da = df_det.describe()
        mean_da = df_det.value_counts().mean()
        median_da = df_det.value_counts().median()
        mode_da = df_det.value_counts().mode()
        print('Details of all appointments', '\n',
              d_da, '\n',
              'Mean = ', mean_da, '\n',
              'Median = ', median_da, '\n',
              'Mode = ', mode_da, '\n'))
Please indicate the steps.
Thank you in advance.
It looks like your function should have two arguments -- dataframe and column -- both of which are lists, so I made the names plural.
Then you need to loop over each argument. Note that you are also assigning a dataframe in the function the same name as your function, so I changed the name of the function.
dataframes = [dataframe1, dataframe2]
columns = ['Date_Appointment', 'Date_Scheduled']

def summary_stats(dataframes, columns):
    for df in dataframes:
        for col in columns:
            df_det = df.loc[:, col]
            # print summary stats about df_det
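A minimal sketch of what the body could print, reusing the statistics from the question (the wording of the output is an assumption):

def summary_stats(dataframes, columns):
    for df in dataframes:
        for col in columns:
            df_det = df.loc[:, col]
            print('Details of', col, '\n',
                  df_det.describe(), '\n',
                  'Mean = ', df_det.value_counts().mean(), '\n',
                  'Median = ', df_det.value_counts().median(), '\n',
                  'Mode = ', df_det.value_counts().mode(), '\n')

summary_stats([df, x1], ['Date_Appointment', 'Date_Scheduled'])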
I have a data frame named df1 like this:
as_id  TCGA_AF_2687  TCGA_AF_2689_Norm  TCGA_AF_2690  TCGA_AF_2691_Norm
31     1             5                  9             2
I want to select all the columns which end with "Norm". I have tried the code below:
import os
print os.getcwd()
os.chdir('E:/task')
import pandas as pd
df1 = pd.read_table('haha.txt')
Norms = []
for s in df1.columns:
    if s.endswith('Norm'):
        Norms.append(s)
print Norms
but I only get a list of names. What can I do to select all the columns, including their values, rather than just the column names? I know it may be a silly question, but I am a beginner and really need someone to help. Thank you so much for your kindness and your time.
df1[Norms] will get the actual columns from df1.
As a matter of fact, the whole code can be simplified to:
import os
import pandas as pd
os.chdir('E:/task')
df1 = pd.read_table('haha.txt')
norm_df = df1[[column for column in df1.columns if column.endswith('Norm')]]
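Pandas also has a built-in shortcut for this kind of selection, DataFrame.filter with a regular expression; a small sketch:

norm_df = df1.filter(regex='Norm$')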
One can also use the filter higher-order function:
newdf = df[list(filter(lambda x: x.endswith("Norm"),df.columns))]
print(newdf)
Output:

   TCGA_AF_2689_Norm  TCGA_AF_2691_Norm
0                  5                  2