Python Pandas Replacing column names - python

I am using pandas in Python to process multiple files that use different column names for the same data.
dataset = pd.read_csv('Test.csv', index_col=0)
cols= dataset.columns
I have the different possible column titles in a list.
AddressCol=['sAddress','address','Adrs', 'cAddress']
Is there a way to normalize all the possible column names to "Address" in pandas so I use the script on different files?
Without pandas I would use something like a double for loop over the actual column names and the list of possible names, with an if statement to pick out the matching column.

You can use the rename DataFrame method:
dataset.rename(columns={typo: 'Address' for typo in AddressCol}, inplace=True)
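A minimal sketch of that rename in action, with a small hypothetical frame standing in for Test.csv (the values are made up for illustration); note that rename silently ignores keys that are not present, so the same dict works across all your files:

```python
import pandas as pd

# Hypothetical frame standing in for Test.csv
dataset = pd.DataFrame({'sAddress': ['1 Main St', '2 Oak Ave'],
                        'Zip': [11111, 22222]})

AddressCol = ['sAddress', 'address', 'Adrs', 'cAddress']

# Any alias absent from this particular file is simply ignored
dataset.rename(columns={alias: 'Address' for alias in AddressCol},
               inplace=True)

print(list(dataset.columns))  # ['Address', 'Zip']
```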

Related

Python: Create dataframe with 'uneven' column entries

I am trying to create a dataframe where the column lengths are not equal. How can I do this?
I was trying to use groupby. But I think this will not be the right way.
import pandas as pd
data = {'filename':['file1','file1'], 'variables':['a','b']}
df = pd.DataFrame(data)
grouped = df.groupby('filename')
print(grouped.get_group('file1'))
Above is my sample code. The output of which is:
What can I do to just have one entry of 'file1' under 'filename'?
Eventually I need to write this to a csv file.
Thank you
If you only have one entry in a column the others will be NaN. So you could just filter out the NaNs with boolean indexing, e.g. df = df[df["filename"].notnull()] (note that .at is for scalar access by label and does not accept a boolean mask).
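If the goal is the display the question asks for (each filename printed only once), one way is to blank out the repeated labels before writing to csv. This is a sketch using Series.duplicated, not taken from the answer above:

```python
import pandas as pd

data = {'filename': ['file1', 'file1'], 'variables': ['a', 'b']}
df = pd.DataFrame(data)

# Replace every repeated filename with an empty string,
# so each filename appears only on its first row
df['filename'] = df['filename'].mask(df['filename'].duplicated(), '')

print(df['filename'].tolist())  # ['file1', '']
```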

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the row of the dataframe.
p.s- I am new to python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive, is there any better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the logical operators & (and) and | (or).
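A short sketch of that filtering, including a combined filter with &; the column names and values are assumptions for illustration (remember to parenthesize each condition):

```python
import pandas as pd

# Hypothetical table
df = pd.DataFrame({'Column1': ['a', 'b', 'a'],
                   'Column2': [1, 2, 3]})

df2 = df[df['Column1'] == 'a']                          # single filter
df3 = df[(df['Column1'] == 'a') & (df['Column2'] > 1)]  # AND of two filters

print(len(df2), len(df3))  # 2 1
```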
You can try
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
(For a string column, contains lives under the .str accessor.)
You can use the isin() method of a pd.Series object.
Assuming you have a data frame named df and you want to check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.
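A runnable sketch of the isin() approach; the 'MKT' values and uniqueArray here are made up for illustration. Since the question asks whether the values exist in all rows, note that the same mask answers that directly via .all():

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'MKT': ['US', 'EU', 'APAC'],
                   'sales': [10, 20, 30]})
uniqueArray = ['US', 'EU']

new_df = df[df.MKT.isin(uniqueArray)].copy()
print(new_df['MKT'].tolist())  # ['US', 'EU']

# Does every row's MKT value appear in uniqueArray?
all_rows_match = df.MKT.isin(uniqueArray).all()
print(all_rows_match)  # False, because 'APAC' is not in uniqueArray
```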

python pandas difference between df_train["x"] and df_train[["x"]]

I have the following dataset, which I read from a csv file.
x = [1,2,3,4,5]
With pandas I can access the column as
df_train = pd.read_csv("train.csv")
x = df_train["x"]
and also as
x = df_train[["x"]]
Since both seem to produce the same result, could you please explain the difference between them and when each should be used?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you return a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list" and handing it a list with one column in it. It will filter your df, returning all columns in your list (in this case, a data frame with only 1 column).
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers
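The difference is easy to see by checking the types and shapes; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})

s = df['x']      # single label  -> pandas Series
f = df[['x']]    # list of labels -> pandas DataFrame

print(type(s).__name__)   # Series
print(type(f).__name__)   # DataFrame
print(s.shape, f.shape)   # (5,) (5, 1)
```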

Pandas read_excel, csv; names column names mapper?

Suppose you have a bunch of excel files with ID and company name. You have N number of excel files in a directory and you read them all into a dataframe, however, in each file company name is spelled slightly differently and you end up with a dataframe with N + 1 columns.
is there a way to create a mapping for columns names for example:
col_mappings = {
'company_name': ['name1', 'name2', ..., 'nameN'],
}
So that when you run read_excel you can map all the different possibilities of company name to just one column? Also, could you do this with any type of data file? E.g. read_csv etc.
Are you concatenating the files after you read them one by one? If yes, you can simply change the column name once you read the file. From your question, I assume your dataframe only contains two columns - Id and CompanyName. So, you can simply change it by indexing.
df = pd.read_csv(one_file)
df = df.rename(columns={df.columns[1]: 'company_name'})
then concatenate it to the original dataframe.
Otherwise, simply read with the column names given explicitly,
df = pd.read_csv(one_file, names=['Id','company_name'])
then remove the first row from df, as it contains the original column names (or pass header=0 together with names= so that row is skipped automatically).
It can be performed on both .csv and .xlsx file.
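If you do want a single col_mappings dict covering many variants, one way is to invert it into a rename map and apply it after each read. A sketch with hypothetical variant names:

```python
import pandas as pd

# Hypothetical mapping of canonical name -> known spellings
col_mappings = {
    'company_name': ['name1', 'name2', 'CompanyName'],
}

# Invert it: variant -> canonical name
rename_map = {variant: canonical
              for canonical, variants in col_mappings.items()
              for variant in variants}

# Stand-in for one file's dataframe
df = pd.DataFrame({'Id': [1, 2], 'CompanyName': ['Acme', 'Globex']})
df = df.rename(columns=rename_map)

print(list(df.columns))  # ['Id', 'company_name']
```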

pandas automatically create dataframe from list of series with column names

I have a list of pandas series objects. I have a list of functions that generate them. How do I create a dataframe of the objects with the column names being the names of the functions that created the objects?
So, to create the regular dataframe, I've got:
pandas.concat([list of series objects],axis=1,join='inner')
But I don't currently have a way to insert all the functionA.__name__, functionB.__name__, etc. as column names in the dataframe.
How would I preserve the same conciseness, and set the column names?
IIUC, given your concat dataframe df you can:
df = pandas.concat([list of series objects],axis=1,join='inner')
and then assign the column names as a list of functions names:
df.columns = [functionA.__name__, functionB.__name__, etc.]
Hope that helps.
You can set the column names in a second step:
df = pandas.concat([list of series objects],axis=1,join='inner')
df.columns = [functionA.__name__, functionB.__name__]
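A runnable sketch of both answers, using hypothetical generator functions; keeping the functions in a list lets the column names stay in sync with the series. As a one-step alternative, concat's keys argument sets the column names directly:

```python
import pandas as pd

# Hypothetical functions that each produce a Series
def squares():
    return pd.Series([1, 4, 9])

def cubes():
    return pd.Series([1, 8, 27])

funcs = [squares, cubes]

# Two-step version from the answers above
df = pd.concat([f() for f in funcs], axis=1, join='inner')
df.columns = [f.__name__ for f in funcs]

# One-step alternative: keys= names the columns during concat
df2 = pd.concat([f() for f in funcs], axis=1, join='inner',
                keys=[f.__name__ for f in funcs])

print(list(df.columns))   # ['squares', 'cubes']
print(list(df2.columns))  # ['squares', 'cubes']
```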
