Suppose you have a bunch of Excel files, each with an ID and a company name column. You have N Excel files in a directory and you read them all into one dataframe; however, in each file the company name column is spelled slightly differently, and you end up with a dataframe with N + 1 columns.
Is there a way to create a mapping for column names? For example:
col_mappings = {
    'company_name': ['name1', 'name2', ..., 'nameN'],
}
So that when you run read_excel you can map all the different spellings of the company name to just one column? Also, could you do this with any type of data file, e.g. read_csv etc.?
Are you concatenating the files after you read them one by one? If so, you can simply change the column name as you read each file. From your question, I assume your dataframe only contains two columns, Id and CompanyName, so you can rename the second one by position:
df = pd.read_csv(one_file)
df = df.rename(columns={df.columns[1]: 'company_name'})
then concatenate it to the original dataframe.
Alternatively, read the file with explicit column names:
df = pd.read_csv(one_file, names=['Id','company_name'])
then remove the first row of df, since it still contains the original header (or pass header=0 along with names=, which tells pandas to replace the file's header row). This works for both .csv and .xlsx files.
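To answer the mapping idea directly, here is a minimal sketch, assuming a col_mappings dict like the one above: invert it into a variant-to-canonical lookup and apply it with rename after each read (the file list and name variants below are made up for illustration):

import pandas as pd

# Hypothetical mapping: canonical name -> known spelling variants
col_mappings = {
    'company_name': ['Company Name', 'company', 'COMPANY_NAME'],
}

# Invert to variant -> canonical for use with DataFrame.rename
rename_map = {variant: canonical
              for canonical, variants in col_mappings.items()
              for variant in variants}

frames = []
for path in ['file1.xlsx', 'file2.xlsx']:  # placeholder file list
    df = pd.read_excel(path)
    frames.append(df.rename(columns=rename_map))

combined = pd.concat(frames, ignore_index=True)

Since rename silently ignores labels that are absent, the same rename_map works unchanged with read_csv or any other reader.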
I have created a large dataframe from 19 individual CSV files. All the CSV files have the same data structure and types because they are the same experimental data from multiple runs. After merging all the CSV files into one large dataframe, I want to change the column names. I have 40 columns, and I want to reuse the same name for some of them: columns 2, 5, 8, ... should be named "Counts", columns 3, 6, 9, ... should be named "File name", and so on. Right now all the column names are numbers. How can I change them?
I have tried this code:
newDf.rename(columns = {'0':'Time',tuple(['2','5','8','11','14','17','20','23','26','29','32','35','38','41','44','47','50','53','56']):'File_Name' })
but it didn't work.
My datafile looks like this ...
I'm not sure if I understand correctly: you wish to rename the columns based on their content?
df.columns = [f"FileName_{v[0]}" if df[v[1]].dtype == "O" else f"Count_{v[0]}" for v in enumerate(df.columns)]
What this does is check each column's data type: if it is object, that element is named "FileName"; otherwise it is named "Count" (each suffixed with the column's position).
Then set the first column's name to "Time". Note that df.columns[0] == "Time" is a comparison, not an assignment, and a pandas Index doesn't support item assignment anyway, so use rename:
df = df.rename(columns={df.columns[0]: "Time"})
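As a quick sanity check, here is the renaming applied to a small made-up frame (the column contents are illustrative only):

import pandas as pd

# Toy data: column 0 holds time values, column 1 file names, column 2 counts
df = pd.DataFrame({0: [0.0, 1.0], 1: ["run1.csv", "run2.csv"], 2: [10, 20]})

df.columns = [
    f"FileName_{i}" if df[col].dtype == "O" else f"Count_{i}"
    for i, col in enumerate(df.columns)
]
df = df.rename(columns={df.columns[0]: "Time"})
print(df.columns.tolist())  # ['Time', 'FileName_1', 'Count_2']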
I am reading multiple CSV files from a folder into a dataframe: I loop over all the files in the folder and then concat the individual dataframes to obtain the final dataframe. However, each CSV file has one summary row from which I want to extract the date and add it as a new column for all the rows from that csv/dataframe.
df=pd.read_csv(f,header=None,names=['Inverter',"Day Yield",'month Yield','Year Yield','SpecificYieldDay','SYMth','SYYear','Power'],sep=';', **kwargs)
df['date']=df.loc[[0],['Day Yield']]
df
I expect the 'date' column to be filled with that file's date for all the rows of that particular csv, but it gets filled correctly only for the first row. Refer to the image of the dataframe: I want all the rows of the 'date' column to show 7/25/2019 instead of only the first row.
I have also added an example of one of the csv files I am reading from
csv file
If I understood correctly, the value that you want to add as a new column for all rows is the one in df.loc[[0], ['Day Yield']].
If that is correct, you can do the following:
df = df.assign(date=df.loc[0, 'Day Yield'])
Note that .loc[0, 'Day Yield'] extracts a scalar, which assign broadcasts to every row; .loc[[0], ['Day Yield']] returns a one-cell DataFrame, which is why only the first row was being filled.
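Putting it together for the whole folder, a sketch along these lines should work (the folder path and the assumption that the summary row is row 0 come from the description, not from tested code):

import glob
import pandas as pd

frames = []
for f in glob.glob('data/*.csv'):  # placeholder folder
    df = pd.read_csv(f, header=None, sep=';',
                     names=['Inverter', 'Day Yield', 'month Yield', 'Year Yield',
                            'SpecificYieldDay', 'SYMth', 'SYYear', 'Power'])
    date = df.loc[0, 'Day Yield']  # date sits in the summary row
    df = df.drop(0)                # drop the summary row itself
    df['date'] = date              # broadcast the scalar to every row
    frames.append(df)

final = pd.concat(frames, ignore_index=True)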
I have an Excel sheet that is really poorly formatted. The actual column names I would like to use are spread across two rows; for example, if the correct column name should be Labor Percent, cell A1 would contain Labor and cell A2 would contain Percent.
When I try to load the file, here's what I'm doing:
import os
os.getcwd()
os.chdir(r'xxx')
import pandas as pd
file = 'problem.xls'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df = xl.parse('WEEKLY NUMBERS', skiprows=35)
As you can see in the picture, the remainder of what should be the column name is in the second row. Is there a way to rename the columns by concatenating? Can this somehow be done with the header= argument in the xl.parse bit?
You can rename the columns yourself by setting:
df.columns = ['name1', 'name2', 'name3' ...]
Note that you must specify a name for every column.
Then drop the first row to get rid of the unwanted row of column names.
df = df.drop(0)
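If you would rather build the names by concatenating, as asked, a sketch like this should work, assuming the second half of each header sits in the first data row:

# Combine each parsed header with the first data row, e.g. "Labor" + "Percent"
df.columns = [f"{first} {second}".strip()
              for first, second in zip(df.columns, df.iloc[0].fillna(''))]
df = df.drop(df.index[0])  # the leftover header fragments are no longer needed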
Here's something you can try. Essentially it reads the first two rows in as your header, but treats them as a hierarchical MultiIndex. The second line of code below then flattens that MultiIndex down to single strings. I'm not 100% certain it will work for your data, but it's worth a try; it worked for the small dummy test data I tried it with:
df = pd.read_excel('problem.xlsx', sheet_name='WEEKLY NUMBERS', header=[0, 1])
df.columns = df.columns.map(' '.join)
The second line was taken from this answer about flattening a multi-index.
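For illustration, this is how the flattening behaves on a tiny made-up MultiIndex (the names are invented):

import pandas as pd

cols = pd.MultiIndex.from_tuples([('Labor', 'Percent'), ('Labor', 'Hours')])
df = pd.DataFrame([[0.5, 8], [0.6, 7]], columns=cols)

df.columns = df.columns.map(' '.join)
print(df.columns.tolist())  # ['Labor Percent', 'Labor Hours']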
I am searching a csv file with 242000 rows and want to count the occurrences of each unique identifier in one of the columns. The column name is 'logid' and it has a number of different values, e.g. 1002, 3004, 5003. I want to search the csv file using a pandas dataframe and count the unique identifiers. If possible, I would then like to create a new csv file that stores this information; for example, if I find there are 50 logids of 1004, I would like the csv file to have a column named 1004 with the count of 50 displayed below it. I would do this for all unique identifiers, adding them to the same csv file. I am completely new at this and have done some searching but have no idea where to start.
Thanks!
As you haven't posted your code, I can only answer about the general way it would work:
1. Load the CSV file into a pd.DataFrame using pandas.read_csv.
2. Reduce it to one row per unique value in a separate df1 using pandas.DataFrame.drop_duplicates:
df1 = df.drop_duplicates(keep="first")
This returns a DataFrame that contains only the first occurrence of each duplicated value; e.g. if the value 1000 appears in 5 rows, only the first of those rows is kept and the others are dropped. df1.shape[0] then gives you the number of unique values in your df.
3. If you want to store all rows of df that share each value in separate CSV files, you can do something like this:
df=pd.DataFrame({"A":[0,1,2,3,0,1,2,5,5]}) # This should represent your original data set
print(df)
df1=df.drop_duplicates(subset="A",keep="first") #I assume the column with the duplicate values is columns "A" if you want to check the whole row just omit the subset keyword.
print(df1)
list=[]
for m in df1["A"]:
mask=(df==m)
list.append(df[mask].dropna())
for dfx in range(len(list)):
name="file{0}".format(dfx)
list[dfx].to_csv(r"YOUR PATH\{0}".format(name))
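As a side note, the count-per-identifier table you describe can also be produced directly with value_counts, a shorter alternative to the loop above (the file names here are placeholders):

import pandas as pd

df = pd.read_csv('input.csv')        # placeholder input file
counts = df['logid'].value_counts()  # one count per unique logid

# One column per identifier with its count in the row below, as requested
counts.to_frame().T.to_csv('logid_counts.csv', index=False)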
I am using pandas and Python to process multiple files that use different column names for the same data.
dataset = pd.read_csv('Test.csv', index_col=0)
cols = dataset.columns
I have the different possible column titles in a list.
AddressCol=['sAddress','address','Adrs', 'cAddress']
Is there a way to normalize all the possible column names to "Address" in pandas, so I can use the same script on different files?
Without pandas, I would use something like a double for loop over the actual column names and the possible column names, with an if statement to extract the whole array.
You can use the DataFrame rename method:
dataset.rename(columns={typo: 'Address' for typo in AddressCol}, inplace=True)
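As a quick check that unmatched keys are harmless, here is the same pattern on a toy frame (the column names are made up):

import pandas as pd

AddressCol = ['sAddress', 'address', 'Adrs', 'cAddress']
df = pd.DataFrame({'Adrs': ['1 Main St'], 'id': [7]})

df.rename(columns={typo: 'Address' for typo in AddressCol}, inplace=True)
print(df.columns.tolist())  # ['Address', 'id'], absent variants are ignored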