concatenate pandas dataframes in a loop over files - python

I am trying to write a script that loops over files matching a certain pattern/variable and concatenates the 8th column of each file while keeping the first 4 columns, which are common to all files. The script works if I use the following commands:
import csv
import pandas as pd

# Read the header names so they can be used as the index; all three files share the same header.
reader = csv.reader(open("1isoforms.fpkm_tracking.txt", newline=''), delimiter='\t')
header_row = next(reader)  # gets the header row

df1 = pd.read_csv("1isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #1 with the first 4 columns as index
df2 = pd.read_csv("2isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #2 with the first 4 columns as index
df3 = pd.read_csv("3isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #3 with the first 4 columns as index

# Concatenate the target column of each file and relabel the headers.
result = pd.concat([df1.iloc[:, 4], df2.iloc[:, 4], df3.iloc[:, 4]],
                   keys=["Header1", "Header2", "Header3"], axis=1)
result.to_csv("OutputTest.xls", sep="\t")
While this works, it is NOT practical for me to enter file names one by one, as I sometimes have hundreds of files, so I can't type in a df... call for each. Instead, I was trying to use a for loop to do this, but I couldn't figure it out. Here is what I have so far:
import glob

k = 0
for geneFile in glob.glob("*_tracking*"):
    while k < 3:
        reader = csv.reader(open(geneFile, newline=''), delimiter='\t')
        header_row = next(reader)
        key = str(k)
        key = pd.read_csv(geneFile, index_col=header_row[0:1], sep="\t")
        result = pd.concat([key[:, 5]], axis=1)
        result.to_csv("test2.xls", sep="\t")
However, this is not working.
The issues I am facing are as follows:

1. How can I iterate over the input files and generate a different variable name for each, which I can then pass to the pd.concat function one after the other?
2. How can I use a for loop to generate a variable name that is a combination of "df" and an integer?
3. How can I fix the above script to get my desired output?
A minor issue is the way I am using index_col: is there a way to use column numbers rather than column names? I know it works for index_col=0 or any other single number, but I couldn't use integers to index more than one column.
Note that all files have the exact same structure, and the index columns are the same.
Your feedback is highly appreciated.

Consider using merge with right_index and left_index arguments:
import pandas as pd

numberoffiles = 100

# FIRST IMPORT (CREATE RESULT DATA FRAME)
result = pd.read_csv("1isoforms.fpkm_tracking.txt", sep="\t",
                     index_col=[0, 1, 2, 3], usecols=[0, 1, 2, 3, 7])

# ALL OTHER IMPORTS (MERGE INTO RESULT DATA FRAME, 8TH COLUMN SUFFIXED ITERATIVELY)
for i in range(2, numberoffiles + 1):
    df = pd.read_csv("{}isoforms.fpkm_tracking.txt".format(i), sep="\t",
                     index_col=[0, 1, 2, 3], usecols=[0, 1, 2, 3, 7])
    result = pd.merge(result, df, right_index=True, left_index=True,
                      suffixes=[str(i - 1), str(i)])

result.to_excel("Output.xlsx")
result.to_csv("Output.csv")

Related

How can I clean text in an Excel file with Python?

I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):
I want to clean this text, that is, I want to remove tags like < b r > (written without the spaces). My current approach doesn't seem to work:
import pandas as pd

file_name = r"F:\Project\comments_all_sorted.xlsx"
# No header row and no column for row labels; use only column B (which contains the text).
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B')
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')
What this code does (which I don't want it to do) is add running numbers in the first column (A), overwriting the few numbers that were already there, and add a first row with '1' in its second cell (cell B1):
I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it.
Thanks!
Try this:
df['column_name'] = df['column_name'].str.replace(r'<br>', '')
The index in the output file can be turned off with index=False in the df.to_excel function, i.e.,
clean_df.to_excel('output.xlsx', index=False)
As far as I'm aware, you can't use .replace on an entire dataframe. You need to explicitly call out the column. In this case, I just iterate through all columns in case there is more than the one column.
To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 at the top is the column name; to get rid of that, use header=False.
import pandas as pd

file_name = r"F:\Project\comments_all_sorted.xlsx"
# No header row and no column for row labels; use only column B (which contains the text).
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B')
clean_df = df.copy()
for col in clean_df.columns:
    clean_df[col] = df[col].str.replace('<br>', '')
clean_df.to_excel('output.xlsx', index=False, header=False)
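
As a side note, newer pandas versions can apply the replacement frame-wide when regex=True is passed, so the per-column loop is optional. A minimal sketch, assuming df is loaded as above:

# regex=True makes pandas substitute '<br>' inside cell values
# rather than only matching cells that equal '<br>' exactly.
clean_df = df.replace('<br>', '', regex=True)
clean_df.to_excel('output.xlsx', index=False, header=False)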

How to build a dataframe row by row, where each row comes from a different csv?

I have searched through perhaps a dozen variations of the question "How to build a dataframe row by row", but none of the solutions have worked for me. Thus, though this is a frequently asked question, my case is unique enough to be a valid question. I think the problem might be that I am grabbing each row from a different csv. This code demonstrates that I am successfully making dataframes in the loop:
onlyfiles = list_of_csvs
for idx, f in enumerate(onlyfiles):
    row = pd.read_csv(mypath + f, sep="|").iloc[0:1]
But the rows are individual dataframes and cannot be combined (so far). I have attempted the following:
df = pd.DataFrame()
for idx, f in enumerate(onlyfiles):
    row = pd.read_csv(mypath + f, sep="|").iloc[0:1]
    df.iloc(idx) = row
Which returns
df.loc(idx) = row
^
SyntaxError: can't assign to function call
I think the problem is that each row, or dataframe, has its own headers. I've also tried df.loc(idx) = row[1], but that doesn't work either (where we grab row[:] when idx = 0). Neither iloc(idx) nor loc(idx) works.
In the end, I want one dataframe that has the header (column names) from the first data frame, and then n rows where n is the number of files.
Try pd.concat().
Note that you can read just the first line of each file directly, instead of reading in the whole file and then limiting it to the first row: pass the parameter nrows=1 to pd.read_csv.
onlyfiles = list_of_csvs
df_joint = pd.DataFrame()
for f in onlyfiles:
    df_ = pd.read_csv(mypath + f, sep="|", nrows=1)
    df_joint = pd.concat([df_joint, df_])
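
Concatenating inside the loop copies df_joint on every pass. A variant that collects the one-row frames in a list and concatenates once at the end is usually faster; a minimal sketch, assuming mypath and list_of_csvs as above:

rows = []
for f in list_of_csvs:
    rows.append(pd.read_csv(mypath + f, sep="|", nrows=1))

# One concat at the end; ignore_index gives a clean 0..n-1 row index.
df_joint = pd.concat(rows, ignore_index=True)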

Pandas dataframe read_excel does not consider blank upper left cells as columns?

I'm trying to read an Excel or CSV file into a pandas dataframe. Only the first two columns should be read, and the top row of those two columns should supply the column names. The problem occurs when the first cell of the top row is empty in the Excel file.
              IDs
2/26/2010       2
3/31/2010       4
4/31/2010       2
5/31/2010       2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0, 1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0, 1])

ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, it only reads one column (i.e. the IDs). With read_csv, however, it reads both columns, with the first column being unnamed. I want it to behave that way and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0, 1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, usecols=[0, 1])

df = df.reset_index()  # this will elevate the index to a column called 'index'
This will give consistent output, i.e. first series will have label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()
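
Either way, the renaming step from the question then works unchanged on the two resulting columns (the names below are taken from the question's code):

# Rename the elevated index column and the IDs column.
df.columns = ['ref_date', 'regime_tag']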

Concatenate 2 Rows to be header/column names

I have an Excel sheet that is really poorly formatted. The actual column names I would like to use are spread across two rows; for example, if the correct column name should be Labor Percent, cell A1 would contain Labor and cell A2 would contain Percent.
I try to load the file; here's what I'm doing:

import os
import pandas as pd

os.chdir(r'xxx')

file = 'problem.xls'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df = xl.parse('WEEKLY NUMBERS', skiprows=35)
The remainder of what should be the column name ends up in the second row of the parsed dataframe. Is there a way to rename the columns by concatenating the two rows? Can this somehow be done with the header= argument in the xl.parse call?
You can rename the columns yourself by setting:
df.columns = ['name1', 'name2', 'name3', ...]
Note that you must specify a name for every column.
Then drop the first row to get rid of the unwanted row of column names.
df = df.drop(0)
Here's something you can try. Essentially it reads in the first two rows as your header, but treats them as a hierarchical multi-index. The second line of code below then flattens that multi-index down to a single string per column. I'm not 100% certain it will work for your data, but it is worth a try - it worked for the small dummy test data I tried it with:
df = pd.read_excel('problem.xlsx', sheet_name='WEEKLY NUMBERS', header=[0, 1])
df.columns = df.columns.map(' '.join)
The second line was taken from this answer about flattening a multi-index.
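
To see the flattening step in isolation, here is a small self-contained sketch with hypothetical dummy data standing in for the spreadsheet:

import pandas as pd

# Two header rows, as in the spreadsheet: 'Labor' over 'Percent', and so on.
columns = pd.MultiIndex.from_tuples([('Labor', 'Percent'), ('Overtime', 'Hours')])
df = pd.DataFrame([[0.25, 3], [0.30, 5]], columns=columns)

# Join each (row1, row2) pair into a single flat column name.
df.columns = df.columns.map(' '.join)
print(df.columns.tolist())  # ['Labor Percent', 'Overtime Hours']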

Python / glob.glob - change datatype during import

I'm looping through all the Excel files in a folder and appending them to a dataframe. One column (column C) has an ID number. In some of the sheets the ID is formatted as text, and in others it's formatted as a number. What's the best way to change the data type during or after the import so that it is consistent? I could always change it in each Excel file before importing, but there are 40+ sheets.
for f in glob.glob(path):
    dftemp = pd.read_excel(f, sheet_name=0, skiprows=13)
    dftemp['file_name'] = os.path.basename(f)
    df = df.append(dftemp, ignore_index=True)
Don't append to a dataframe in a loop; every append copies the whole dataframe to a new location in memory, which is very slow. Do one single concat after reading all your dataframes:
dfs = []
for f in glob.glob(path):
    df = pd.read_excel(f, sheet_name=0, skiprows=13)
    df['file_name'] = os.path.basename(f)
    df['c'] = df['c'].astype(str)  # force the ID column to a consistent dtype
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
It sounds like your ID, the c column here, is conceptually a string that sometimes happens to contain only digits, which is why Excel stores it as a number in some sheets. Ideally it should always be handled as a string.
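
Alternatively, the conversion can happen at read time: read_excel accepts a converters mapping from column name to a callable applied while parsing. A minimal sketch, assuming the ID column is named 'c' as above:

# Force the ID column to str while reading, instead of casting afterwards.
df = pd.read_excel(f, sheet_name=0, skiprows=13, converters={'c': str})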
