I have 10 .txt (csv) files that I want to merge into a single csv file to use later in analysis. When I use DataFrame.append, it always stacks the files below each other.
I use the following code:
import os
import pandas as pd

master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.txt'):
        files = pd.read_csv(file, sep='\t', skiprows=[1])
        master_df = master_df.append(files)
The output is:
(screenshot of the stacked output)
What I need is to place the columns of each file side by side, as follows:
(screenshot of the required side-by-side output)
Could you please help with this?
To merge DataFrames side by side, use pd.concat with axis=1.
frames = []
for file in os.listdir(os.getcwd()):
    if file.endswith('.txt'):
        files = pd.read_csv(file, sep='\t', skiprows=[1])
        frames.append(files)
# axis=0 would give the same stacking behavior as your original approach;
# axis=1 places the DataFrames side by side
master_df = pd.concat(frames, axis=1)
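One caveat: pd.concat with axis=1 aligns rows on the index, so files whose indexes differ will produce NaN-padded rows. A minimal sketch of the usual fix, resetting each index before concatenating:

frames = [df_.reset_index(drop=True) for df_ in frames]
master_df = pd.concat(frames, axis=1)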
I have many .csv files like this (with one column):
(screenshot of one of the single-column csv files)
I'd like to merge them into one .csv file, so that each column will contain one csv file's data. The headings should be like this (when converted to a spreadsheet):
(screenshot of the desired headings: the first number is the number of minutes extracted from the file name, the second is the first word behind "export_" in the file name, and the third is the whole name of the file)
I'd like to work in Python.
Could someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it with more files without writing it all down manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv',
             'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)
print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the single-columned .csv files one beside the other in a single worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path) if file.endswith('.csv')]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(os.path.join(csv_path, file), header=0, names=['Header'])
    # Build one-cell DataFrames for the three heading rows
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col + 1) for col in final.columns]
final.to_csv(os.path.join(csv_path, 'output.csv'), index=False)
final  # display the result in Jupyter
For example, considering three .csv files, running the code above yields:
(screenshot: output in Jupyter)
(screenshot: output in Excel)
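If you ever need the minute number on its own, a regular expression is a common way to pull it out of the file name. A minimal sketch, assuming your names always contain a pattern like "0 min" or "4h" (the pattern below is an assumption about your naming scheme):

import re

m = re.search(r'(\d+)\s*(?:min|h)', file)  # matches "0 min" or "4h"
minutes = m.group(1) if m else ''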
I have a lot of Excel files (200+), all of which have the same format.
The file paths are saved in this list:
dir_list = ['all', 'files']
I want to combine all of them into one single DataFrame.
Below is what I want to select from each Excel file into the new DataFrame:
used_col = ['Dimension', 'Length', 'Customer']
df_x = pd.read_excel(file, sheet_name='Tabelle1', skiprows=3, skipinitialspace=True, usecols=used_col)
How can I do that?
You are close; you need pd.concat to create a single DataFrame from all the files.
tmp = []
used_col = ['Dimension', 'Length', 'Customer']
for file in dir_list:
    df_x = pd.read_excel(file, sheet_name='Tabelle1', skiprows=3, skipinitialspace=True, usecols=used_col)
    tmp.append(df_x)
final_df = pd.concat(tmp)
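If the original row numbers of the individual files do not matter, ignore_index=True gives the combined DataFrame a clean 0..n-1 index. A compact sketch of the same loop as a generator expression, under the same assumptions:

final_df = pd.concat(
    (pd.read_excel(f, sheet_name='Tabelle1', skiprows=3, usecols=used_col) for f in dir_list),
    ignore_index=True,
)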
I am trying to come up with a script that will allow me to read all csv files larger than 62 bytes, print two of their columns into a separate Excel file, and create a list.
The following is one of the csv files:
FileUUID Table RowInJSON JSONVariable Error Notes SQLExecuted
ff3ca629-2e9c-45f7-85f1-a3dfc637dd81 lng02_rpt_b_calvedets 1 Duplicate entry 'ETH0007805440544' for key 'nosameanimalid' INSERT INTO lng02_rpt_b_calvedets(farmermobile,hh_id,rpt_b_calvedets_rowid,damidyesno,damid,calfdam_id,damtagid,calvdatealv,calvtype,calvtypeoth,easecalv,easecalvoth,birthtyp,sex,siretype,aiprov,othaiprov,strawidyesno,strawid) VALUES ('0974502779','1','1','0','ETH0007805440544','ETH0007805470547',NULL,'2017-09-16','1',NULL,'1',NULL,'1','2','1',NULL,NULL,NULL,NULL,NULL,'0',NULL,NULL,NULL,NULL,NULL,NULL,'0',NULL,'Tv',NULL,NULL,'Et','23',NULL,'5',NULL,NULL,NULL,'0','0')
This is my attempt at solving the problem:
import csv
import glob
import os

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    output = infile + '.out'
    with open(infile, 'r', newline='') as source:
        readr = csv.reader(source)
        with open(output, 'w', newline='') as result:  # newline='' avoids blank rows on Windows
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Please help point me in the right direction, or suggest an alternative solution.
pandas does a lot of what you are trying to achieve:
import pandas as pd
# Read a csv file to a dataframe
df = pd.read_csv("<path-to-csv>")
# Filter two columns
columns = ["FileUUID", "Table"]
df = df[columns]
# Combine multiple dataframes
df_combined = pd.concat([df1, df2, df3, ...])
# Output dataframe to excel file
df_combined.to_excel("<output-path>", index=False)
To loop through all csv files larger than 62 bytes, you can use glob.glob() and os.stat():
import os
import glob
import pandas as pd

dataframes = []
for csvfile in glob.glob("<csv-folder-path>/*.csv"):
    if os.stat(csvfile).st_size > 62:  # st_size is the file size in bytes
        dataframes.append(pd.read_csv(csvfile))
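Putting the pieces together, a minimal end-to-end sketch (reusing the imports above; it assumes every file has the FileUUID and Table columns, and that openpyxl is installed for the Excel output):

df_combined = pd.concat(
    [pd.read_csv(f)[["FileUUID", "Table"]]
     for f in glob.glob("<csv-folder-path>/*.csv")
     if os.stat(f).st_size > 62],
    ignore_index=True,
)
df_combined.to_excel("<output-path>", index=False)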
Use the standard csv module. Don't re-invent the wheel.
https://docs.python.org/3/library/csv.html
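For completeness, a minimal standard-library sketch of the same size filter (the column positions are taken from the attempt in the question):

import csv
import glob
import os

rows = []
for infile in glob.glob(os.path.join('csvs/', '*.csv')):
    if os.stat(infile).st_size > 62:  # st_size is in bytes
        with open(infile, newline='') as source:
            for r in csv.reader(source):
                rows.append((r[4], r[2]))  # the two columns of interest, as in the question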
I am trying to get my code to read a folder containing various files.
I was hoping to get Jupyter to read each file within that folder and create separate dataframes by taking the names of the files as the dataframe names.
So far I have the code:
import glob
import pandas as pd

path = r'C:\Users\SemR\Documents\Jupyter\Submissions'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, usecols=['Date', 'Usage'])
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
This code concatenates the data; however, I want a separate dataframe for each file so that I can store the values separately. Is there something I can use instead?
Here are examples of how the CSV files look: (screenshots omitted)
These CSV files are in the same folder so I was hoping that when I run my code, new dataframes would be created with the same name as the CSV file name.
Thank you.
A better approach than using a different variable for each of your dataframes would be to load them into a dictionary.
The basename of each filename could be extracted using a combination of os.path.basename() and os.path.splitext().
For example:
d = {os.path.splitext(os.path.basename(f))[0] : pd.read_csv(f) for f in glob.glob('*test*.csv')}
Also, using the *test* glob pattern avoids the need for an if filter in the comprehension.
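Each dataframe can then be looked up under its file's basename. For example, assuming a hypothetical file my_test_data.csv in the folder:

print(d['my_test_data'].head())            # the dataframe loaded from my_test_data.csv
print(d['my_test_data']['Usage'].mean())   # behaves like any other dataframe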
From the question, what I can suggest is that you already have the different DataFrames stored in the list.
import glob
import pandas as pd

path = r'C:\Users\SemR\Documents\Jupyter\Submissions'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, usecols=['Date', 'Usage'])
    li.append(df)

for dataframe in li:
    # For getting the mean of a specific column
    print(dataframe.loc[:, "Usage"].mean())
I have a for loop that imports all of the Excel files in the directory and merges them together into a single dataframe. However, I want to create a new column in which each row holds the filename of the Excel file it came from.
Here is my import and merge code:
import os
import pandas as pd

path = os.getcwd()
files = os.listdir(path)
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    df = df.append(data)
For example, if the first Excel file is named "file1.xlsx", I want all rows from that file to have the value file1.xlsx in col3 (a new column). If the second Excel file is named "file2.xlsx", I want all rows from that file to have the value file2.xlsx. Note that there is no real pattern to the Excel file names; I just use those names as an example.
Many thanks
Create the new column inside the loop:
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    data['col3'] = f
    df = df.append(data)
Another possible solution, with a list comprehension:
dfs = [pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2']).assign(col3=f)
       for f in files]
df = pd.concat(dfs)
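Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the pd.concat / assign version is the one that will keep working on current pandas.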