This is driving me nuts! I have several DataFrames that I am trying to concatenate with pandas. The index is the filename. When I use df.to_csv on an individual DataFrame I can see the index column (the filename) along with the column of interest. When I concatenate along the filename axis I only get the column of interest and row numbers, no filename.
Here is the code I am using, as is. It works as I expect up until the "all_filenames" line.
for filename in os.listdir(directory):
    if filename.endswith("log.csv"):
        df = pd.read_fwf(filename, skiprows=186, nrows=1, names=["Attribute"])
        df['System_Library_Name'] = [x.split('/')[6] for x in df['Attribute']]
        df2 = pd.concat([df for filename in os.listdir(directory)], keys=[filename])
        df2.to_csv(filename + "log_info.csv", index=filename)
all_filenames = glob.glob(os.path.join(directory, '*log_info.csv'))
cat_log = pd.concat([pd.read_csv(f) for f in all_filenames])
cat_log2 = cat_log[['System_Library_Name']]
cat_log2.to_excel("log.xlsx", index=filename)
I have tried adding keys=filename to the third-to-last line, and giving the index a name with df.index.name=
I have used similar code before and had it work fine; however, here I am only using one column from a larger original input file, if that makes a difference.
Any advice is greatly appreciated!
You can set each filename as the index as you read the file, then concatenate everything in one call:
import glob
import os
import pandas as pd

df = pd.concat(
    # this is just reading one value from each file, yes?
    [pd.read_fwf(filename, skiprows=186, nrows=1, names=["Attribute"])
       .set_index(pd.Index([filename]))
       .applymap(lambda x: x.split('/')[6])
       .rename(columns={'Attribute': 'System_Library_Name'})
     for filename in glob.glob(os.path.join(directory, '*log.csv'))]
)
df.to_excel("log_info.xlsx")  # DataFrame has to_excel, not to_xlsx
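Alternatively, if you want to keep a loop like your original, here is a minimal sketch of the keys= approach you mentioned trying: passing a dict to pd.concat uses its keys (here, the filenames) as the outer level of the resulting index.
import glob
import os
import pandas as pd

frames = {}
for filename in glob.glob(os.path.join(directory, '*log.csv')):
    df = pd.read_fwf(filename, skiprows=186, nrows=1, names=["Attribute"])
    df['System_Library_Name'] = [x.split('/')[6] for x in df['Attribute']]
    frames[filename] = df

# the dict keys become the outer index level, so the filenames survive the concat
cat_log = pd.concat(frames)
cat_log[['System_Library_Name']].to_excel("log.xlsx")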
I have 10 .txt (CSV) files that I want to merge together into a single CSV file to use later in analysis. When I use DataFrame.append, it always stacks the files below each other.
I use the following code:
master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.txt'):
        files = pd.read_csv(file, sep='\t', skiprows=[1])
        master_df = master_df.append(files)
The output is:
[screenshot: current output]
What I need is to insert the columns of each file side by side, as follows:
[screenshot: the required output]
Could you please help with this?
To merge DataFrames side by side, you should use pd.concat.
frames = []
for file in os.listdir(os.getcwd()):
    if file.endswith('.txt'):
        files = pd.read_csv(file, sep='\t', skiprows=[1])
        frames.append(files)

# axis=0 has the same behavior as your original approach
master_df = pd.concat(frames, axis=1)
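One caveat: with axis=1, pd.concat aligns rows by index label, not by position. If your files end up with different indexes (for example because of skiprows), reset them first so the columns line up row by row; a minimal sketch:
# reset each frame's index so rows pair up by position, not by label
frames = [f.reset_index(drop=True) for f in frames]
master_df = pd.concat(frames, axis=1)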
I have many .csv files like this (with one column):
[screenshot: example file]
I'd like to merge them into one .csv file, so that each column contains one CSV file's data. The headings should be like this (when converted to a spreadsheet):
[screenshot: desired headings] (the first number is the number of minutes extracted from the file name, the second is the first word after "export_" in the name, and the third is the whole file name).
I'd like to work in Python.
Can someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it with more files without writing them all down manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv',
             'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)
print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the (single-column) .csv files one beside the other in a single worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path)]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(os.path.join(csv_path, file), header=0, names=['Header'])
    # build the three header rows from the file name
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col + 1) for col in final.columns]
final.to_csv(os.path.join(csv_path, 'output.csv'), index=False)  # os.path.join avoids '\o' escape issues
final
For example, with three .csv files, running the code above yields:
[screenshot: output in Jupyter]
[screenshot: output in Excel]
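For reference, here is what the split logic extracts from the two file names given in the question (a hypothetical interactive session). Note that for the '4h' file the third token is '4h' rather than a number of minutes, so the naming pattern needs to be consistent across files:
>>> file = 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv'
>>> file.split('_')[1]
'Control 37C 0 min'
>>> file.split('_')[1].split()[2]  # the number extracted from the file name
'0'
>>> file.split('_')[1].split()[0]  # the first word after "export_"
'Control'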
I would like to read several SAV files (SPSS) from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
path = r'\C:\abc\path'
all_files = glob.glob(path + "\*.sav")

df_list = []
for filename in all_files:
    df = pd.read_spss(filename, convert_categoricals=False)
    df_list.append(filename)
pd.concat(df_list)
I am getting the error below.
OverflowError: date value out of range
The code below runs fine, but I get the error when I loop through the files and read them.
df = pd.read_spss(all_files[0])
When you append an element to the list, I think you should add the df instead of the filename, like this:
df_list.append(df)
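Putting it together, a sketch of the corrected loop (same options as in your question; the stray leading backslash in your path string also looks like a typo, so it is dropped here):
import glob
import pandas as pd

path = r'C:\abc\path'
all_files = glob.glob(path + r"\*.sav")

df_list = []
for filename in all_files:
    df = pd.read_spss(filename, convert_categoricals=False)
    df_list.append(df)  # append the DataFrame, not the filename

combined = pd.concat(df_list, ignore_index=True)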
I have a strange problem. In my folder I have .dat files with CO2 values from a CO2 sensor in the laboratory: data from experiments 4, 5, 6, 7, and 8, named CO2_4.dat, CO2_5.dat, CO2_6.dat, CO2_7.dat, and CO2_8.dat.
I know how to read them manually. For example, for reading CO2_4 this works:
dfCO2_4_manual = pd.read_csv(r'C:\data\CO2\co2_4.dat', sep=";", encoding='unicode_escape',
                             header=0, skiprows=[0], usecols=[0, 1, 2, 4],
                             names=["ts", "t", "co2_4", "p"])
display(dfCO2_4_manual)
which gives me a DataFrame with the correct values:
[screenshot: one value every minute]
But when I loop over my folder and read them all with this technique (which works for other CSV files from the laboratory), saving the DataFrames in a dictionary:
exp_list = [4, 5, 6, 7, 8]  # list with the number of each experiment
path_CO2 = r'C:\data\CO2'
CO2_files = glob.glob(os.path.join(path_CO2, "*.dat"))

CO2_dict = {}
for f, i in zip(offline_files, exp_list):
    CO2_dict["CO2_{0}".format(i)] = pd.read_csv(f, sep=";", encoding='unicode_escape',
                                                header=0, skiprows=[0], usecols=[0, 1, 2, 4],
                                                names=["ts", "t", "CO2_{0}".format(i), "p"])
display(CO2_dict["CO2_4"])
it gives me a DataFrame with many skipped and completely wrong values.
If I open the CO2_4.dat file in a text editor, it looks like this:
[screenshot: raw contents of CO2_4.dat]
Does someone know what is happening?
It's not clear how to help exactly, given we don't have access to your files. However, is this line
for f, i in zip(offline_files, exp_list):
correct? Where is offline_files defined? It's not in the code you have provided. Also, do you want to analyze each df separately? Is that why you are storing them in a dictionary?
As an alternative you can store each df in a list and concatenate them. You can then group and apply analyses to them that way.
df_hold_list = []
for f, i in zip(CO2_files, exp_list):  # changed file list name; please verify
    df = pd.read_csv(f, sep=";", encoding='unicode_escape', header=0, skiprows=[0],
                     usecols=[0, 1, 2, 4], names=["ts", "t", "CO2_{0}".format(i), "p"])
    df['file'] = 'CO2_{0}'.format(i)  # add a column for sorting/grouping
    df_hold_list.append(df)

df_new = pd.concat(df_hold_list, axis=0)  # check the axis, should be 0 or 1
I can't test the code, but it should work. Let me know if it doesn't.
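With the 'file' column in place you can, for example, check each experiment's row count or compute per-experiment statistics (a hypothetical check, using the column names from your read_csv call):
print(df_new.groupby('file').size())       # rows per experiment
print(df_new.groupby('file')['p'].mean())  # e.g. mean of the "p" column per experiment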
Python newbie, please be gentle. I have data in two "middle sections" of multiple Excel spreadsheets that I would like to isolate into one pandas DataFrame. Below is a link to a data screenshot.
Within each file, my headers are in Row 4 with data in Rows 5-15, Columns B:O. The headers and data then continue with headers on Row 21, data in Rows 22-30, Columns B:L. I would like to move the headers and data from the second set and append them to the end of the first set of data.
This code captures the header from Row 4 and the data in Columns B:O, but it captures all rows under the header, including the second header and the second set of data. How do I move this second set of data and append it after the first set of data?
path = r'C:\Users\sarah\Desktop\Original'
allFiles = glob.glob(path + "/*.xls")

frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, sheetname="Data1", parse_cols="B:O", index_col=None,
                       header=3, skip_rows=3)
    list_.append(df)
frame = pd.concat(list_)
Screenshot of my data
If all of your Excel files have the same number of rows and this is a one-time operation, you could simply hard-code those row numbers in your read_excel calls. If not, it will be a little tricky, but you pretty much follow the same procedure:
for file_ in allFiles:
    top = pd.read_excel(file_, sheetname="Data1", parse_cols="B:O", index_col=None,
                        header=4, skip_rows=3, nrows=14)  # note the nrows kwarg
    bot = pd.read_excel(file_, sheetname="Data1", parse_cols="B:L", index_col=None,
                        header=21, skip_rows=20, nrows=14)
    list_.append(top.join(bot, lsuffix='_t', rsuffix='_b'))
You can do it this way:
df1 = pd.read_excel(file_, sheetname="Data1", parse_cols="B:O", index_col=None,
                    header=3, skip_rows=3)
df2 = pd.read_excel(file_, sheetname="Data1", parse_cols="B:L", index_col=None,
                    header=20, skip_rows=20)
# pay attention to `axis=1`
df = pd.concat([df1, df2], axis=1)
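If you instead want the second block appended below the first, as "append it after the first set of data" suggests, concatenate along the rows instead. A small sketch, assuming the second header row repeats (a subset of) the first block's column names; columns present in only one block are filled with NaN:
# stack the second block underneath the first; ignore_index renumbers the rows
df = pd.concat([df1, df2], axis=0, ignore_index=True)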