Combining multiple CSVs by averaging each cell in Python

I have 7 CSV files that each contain the same number of columns and rows. I am trying to merge them into one CSV where each cell is the average of the 7 corresponding cells (e.g. new-csv(C3) = average of C3 across the input CSVs).
Here is an example of what the inputs look like. The output should look identical (6 columns x 15 rows), except that each cell holds the averaged value.
So far I have this code to load the CSV files. I have been reading about turning them into a matrix, but I don't see anything about merging and averaging cell by cell, only by row or column.
import os
import pandas as pd

listdrs = os.listdir(dir_path)
listdrs_path = [dir_path + x for x in listdrs]
failed_list = []
csv_matrix = []
for file_path in listdrs_path:
    tickercsv = file_path.replace(string, '')
    ticker = tickercsv.replace('.csv', '')
    data = pd.read_csv(file_path, index_col=0)
    csv_matrix.append(data)

If you run this in the directory containing all of your CSV files, you can use glob to find them all, then build a generator of DataFrames with pd.read_csv(), passing header=None or not depending on whether the files have column names. Then you can concat them, group by the index, and take the mean.
import pandas as pd
import glob
files = glob.glob('*.csv')
dfs = (pd.read_csv(f, header=None) for f in files)
pd.concat(dfs).groupby(level=0).mean()
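Since the goal is a single output CSV, the averaged frame can then be written back out; a minimal sketch, where 'averaged.csv' is just a placeholder name:
import glob
import pandas as pd

files = glob.glob('*.csv')
dfs = (pd.read_csv(f, header=None) for f in files)
averaged = pd.concat(dfs).groupby(level=0).mean()
averaged.to_csv('averaged.csv', header=False)  # placeholder output name; no header since the inputs had none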

Related

Python - Extract mutual column of multiple csv files into one DataFrame for statistics/plotting

I want to analyze 26 csv files from lab experiments, all in the same format.
However, I just want to extract column #5 of each csv file and put all of those columns into one DataFrame, with each column named after the csv file it came from.
The final df should contain 26 columns and one row as header.
In simpler words: load multiple csv files, extract same column of each, put all extracted columns into a new DataFrame with column names = filename_n, filename_n+1, ...
I found some lines of code to get the DataFrames, but I'm not able to adapt them to the final goal...
import glob
import pandas as pd

path = '*'  # use your path
files = glob.glob(path + "/*.csv")
get_df = lambda f: pd.read_csv(f, header=None)
dodf = {f: get_df(f) for f in files}  # dict mapping filename -> DataFrame
dodf
If someone has an idea, I'd highly appreciate it!
Thanks,
Niels
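Building on the dict-of-frames approach above, here is a minimal sketch of one way to reach the final goal. It assumes the fifth column is positional index 4 and that the files have no header row; adjust both if that is not the case:
import glob
import os
import pandas as pd

files = glob.glob('*.csv')  # adjust the pattern to your path
# read only the fifth column of each file and name the resulting
# Series after the file it came from (extension stripped)
cols = {os.path.splitext(os.path.basename(f))[0]: pd.read_csv(f, header=None, usecols=[4])[4]
        for f in files}
df = pd.DataFrame(cols)  # one column per file, named after the file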

Creating a new dataframe to contain a section of 1 column from multiple csv files in Python

So I am trying to create a new dataframe that includes some data from 300+ csv files.
Each file contains up to 200,000 rows of data, and I am only interested in one of the columns within each file (the same column for each file).
I am trying to combine these columns into 1 dataframe, where column 6 from csv 1 would be in the 1st column of the new dataframe, column 6 from csv 2 would be in the 2nd column of the new dataframe, and so on up until the 315th csv file.
I don't need all 200,000 rows of data to be extracted, but I am unsure of how I would extract just 2,000 rows from the middle section of the data (each file varies in number of rows, so the exact same rows for each file aren't necessary, as long as it is the middle 2000).
Any help in extracting the 2000 rows from each file to populate different columns in the new dataframe would be greatly appreciated.
So far, I have manipulated the data to only contain the relevant column for each file. This displays all the rows of data in the column for each file individually.
I tried to use the iloc function to reduce this to 2000 rows but it did not display any actual data in the output.
I am unsure how I would now combine all of these columns into a single dataframe.
import pandas as pd
import os
import glob

# glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join('filepath/', "*.csv"))

# loop list of csv files
for f in csv_files:
    df = pd.read_csv(f, header=None)
    df.rename(columns={6: 'AE'}, inplace=True)
    new_df = df.filter(['AE'])
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
    print('Content:')
    display(new_df)
    print()
Based on your description, I am inferring that you have a number of different files in csv format, each of which has at least 2000 lines and 6 columns. You want to take the data only from the 6th column of each file and only for the middle 2000 records in each file and to put all of those blocks of 2000 records into a new dataframe, with a column that in some way identifies which file the block came from.
You can read each file using pandas, as you have done, and then you need to use loc, as one of the commenters said, to select the 2000 records you want to keep. If you save each of those blocks of records in a separate dataframe you can then use the pandas concat method to join them all together into different columns of a new dataframe.
Here is some code that I hope will be self-explanatory. I have assumed that you want the 6th column, which is the one with index 5 in pandas because we start counting from 0. I have also used usecols to keep only the 6th column, and I rename the column to an index number based on the order in which the files are being read. You would need to change this for your own choice of column naming.
I choose the middle 2000 records by defining the starting point as record x, say, so that x + 2000 + x = total number of records, therefore x=(total number of records) / 2 - 1000. This might not be exactly how you want to define the middle 2000 records, so you could change this.
df_middles is a list to which we append every new dataframe of the new file's middle 2000 records. We use pd.concat at the end to put all the columns into a new dataframe.
import os
import glob
import pandas as pd

# glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join("filepath/", "*.csv"))

df_middles = []

# loop list of csv files
for idx, f in enumerate(csv_files, 1):
    # only keep the 6th column (index 5)
    df = pd.read_csv(f, header=None, usecols=[5])
    colname = f"column_{idx}"
    df.rename(columns={5: colname}, inplace=True)

    number_of_lines = len(df)
    if number_of_lines < 2000:
        raise IOError(f"Not enough lines in the input file: {f}")
    middle_range_start = int(number_of_lines / 2) - 1000
    middle_range_end = middle_range_start + 1999
    df_middle = df.loc[middle_range_start:middle_range_end].reset_index(drop=True)
    df_middles.append(df_middle)

df_final = pd.concat(df_middles, axis="columns")
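As a side note, because pd.read_csv with header=None leaves a default RangeIndex, the same block can be taken with iloc, whose end point is exclusive (unlike loc's inclusive one); this may be closer to what was originally attempted with iloc:
df_middle = df.iloc[middle_range_start:middle_range_start + 2000].reset_index(drop=True)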

How to read multiple ann files (from brat annotation) within a folder into one pandas dataframe?

I can read one ann file into pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat etc. as required.
I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes there is a subfolder called files relative to where you run it, and that you want to pull in all the .ann files from there (it will not look at anything else). Review and change as required; it's commented line by line.
import pandas as pd
import glob

path = r'./files'  # use your path
all_files = glob.glob(path + "/*.ann")

# create empty list to hold dataframes from files found
dfs = []

# for each file in the path above ending .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temp during the looping) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames with each list item as one .ann file.
# Like [annFile1, annFile2, etc.] - just not those names.

# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required.
    df = pd.concat(dfs, ignore_index=True)

df = df.reset_index(drop=True)
# check what you've got
print(df.head())
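If you also need to know which file each row came from after the concat, one small extension is to tag each frame inside the loop before appending it; a sketch, where 'source_file' is just an illustrative column name:
import glob
import pandas as pd

dfs = []
for file in glob.glob(r'./files' + "/*.ann"):
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    df['source_file'] = file  # illustrative: records which .ann file each row came from
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)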

How can i combine all the data from the different .csv files into a single table?

I have been given 5 CSV files, and now I want to combine all the data from these files into one single table.
I have tried pd.concat and .join from pandas so far, but can only get two files combined. So far I've tried the following:
data = pd.read_csv('data.csv')
data1 = pd.read_csv('data2.csv')
merge = data.join(data1, lsuffix='_NOM', rsuffix='_NIM')
In the end, I want to have all the data side by side in my table.
You can just loop over the directory that contains the .csv files, reading each one. For example:
import glob
import pandas as pd

dfs = []  # list to collect one frame per file
for filename in glob.glob('./<path to your data files>/*.csv'):
    df_temp = pd.read_csv(filename)
    dfs.append(df_temp)
df = pd.concat(dfs, ignore_index=True)  # DataFrame.append is deprecated; concat once instead
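Note that concatenating like this stacks the files one under another. Since the question asks for the data side by side, a sketch using axis=1 may be closer to the goal; the keys= prefixes are only an illustration to keep identically named columns apart:
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in sorted(glob.glob('./<path to your data files>/*.csv'))]
# axis=1 places the frames next to each other; keys= adds a per-file
# prefix level so identically named columns remain distinguishable
side_by_side = pd.concat(frames, axis=1, keys=[f'file{i}' for i, _ in enumerate(frames)])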

Python / glob glob - change datatype during import

I'm looping through all excel files in a folder and appending them to a dataframe. One column (column C) has an ID number. In some of the sheets, the ID is formatted as text and in others it's formatted as a number. What's the best way to change the data type during or after the import so that the datatype is consistent? I could always change them in each excel file before importing but there are 40+ sheets.
for f in glob.glob(path):
    dftemp = pd.read_excel(f, sheet_name=0, skiprows=13)
    dftemp['file_name'] = os.path.basename(f)
    df = df.append(dftemp, ignore_index=True)
Don't append to a dataframe in a loop: every append copies the whole dataframe to a new location in memory, which is very slow. Do one single concat after reading all your dataframes:
dfs = []
for f in glob.glob(path):
    df = pd.read_excel(f, sheet_name=0, skiprows=13)
    df['file_name'] = os.path.basename(f)
    df['c'] = df['c'].astype(str)  # force the ID column to a consistent string dtype
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
It sounds like your ID (the c column here) is really a string that sometimes happens to contain only digits, so it should be treated as a string throughout.
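Alternatively, the conversion can be done at read time via the dtype parameter of pd.read_excel; a sketch, assuming the ID column is literally named 'c' in the sheets:
import glob
import os
import pandas as pd

dfs = []
for f in glob.glob(path):  # path as defined above
    # dtype forces the ID column to be parsed as text in every file
    df = pd.read_excel(f, sheet_name=0, skiprows=13, dtype={'c': str})
    df['file_name'] = os.path.basename(f)
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)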
