Working on my first data project, and I'm new to Stack Overflow. All the other examples I have found use append, but whenever I try append the data comes out organized wrong, since I want to concatenate the dataframes vertically. This is what I have so far:
import pandas as pd
import os
input_file_path = "C:/Users/laura/Downloads/excel files/"
output_file_path = "C:/Users/laura/OneDrive/Desktop/master excel/"
excel_file_list = os.listdir(input_file_path)
df = pd.DataFrame()
for excel_files in excel_file_list:
    if excel_files.endswith('.csv'):
        df1 = pd.read_csv(input_file_path + excel_files)
        df = pd.concat(df1, axis=1, ignore_index=True)
And this is the error I am getting:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Simply do this (file1 and file2 are the paths to the .csv/Excel files):
dataFrame = pd.concat(
    map(pd.read_csv, [file1, file2]), ignore_index=True)
Make sure your paths are something like this:
C:\username\folder\1.csv
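Applied to the folder from the question, a minimal sketch (the input path is copied from the question; using glob to collect the .csv files is my assumption, not part of the original code):

import glob
import pandas as pd

input_file_path = "C:/Users/laura/Downloads/excel files/"
csv_files = glob.glob(input_file_path + "*.csv")
# One read per file, stacked vertically into a single frame
df = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)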
Hi Laura, try this:
df1 = pd.read_csv("Directory/file1.csv", sep=';')
df2 = pd.read_csv("Directory/file2.csv", sep=';')
df = pd.concat([df1, df2])
I usually don't start with
df = pd.DataFrame()
Instead I directly write the line
df = pd.concat([df1, df2])
Be careful, because for some Excel/CSV exports you have to adjust the sep argument of pd.read_csv.
I hope this helps.
Related
I am looking into creating one big dataframe (pandas) from several individual frames. The data is organized in MF4 files, and the number of source files varies for each cycle. The goal is to automate this process.
Creation of the DataFrames:
df = (MDF('File1.mf4')).to_dataframe(channels)
df1 = (MDF('File2.mf4')).to_dataframe(channels)
df2 = (MDF('File3.mf4')).to_dataframe(channels)
These DataFrames are then merged:
df = pd.concat([df, df1, df2], axis=0)
How can I do this without dynamically creating variables for df, df1 etc.? Or is there no other way?
I have all file paths in a list of the form:
Filepath = ['File1.mf4', 'File2.mf4', 'File3.mf4']
Now I am thinking of looping through it and dynamically creating the data frames df, df1, ... df1000. Any advice here?
Edit: here is the full code:
df = (MDF('File1.mf4')).to_dataframe(channels)
df1 = (MDF('File2.mf4')).to_dataframe(channels)
df2 = (MDF('File3.mf4')).to_dataframe(channels)
# The data has some offset:
x = df.index.max()
df1.index += x
x = df1.index.max()
df2.index += x
# With the index corrected, the data can now be merged
df = pd.concat([df, df1, df2], axis=0)
The way I'm interpreting your question is that you have a predefined list of files you want to load. So just:
l = []
for f in [ list ... of ... files ]:
    df = load_file(f)  # however you load it
    l.append(df)
big_df = pd.concat(l)
del l, df, f  # if you want to clean it up
You therefore don't need to manually specify variable names for your data sub-sections. If you also want to do checks or column renaming between the various files, you can also just put that into the for-loop (or alternatively, if you want to simplify to a list comprehension, into the load_file function body).
Try this:
df_list = [(MDF(file)).to_dataframe(channels) for file in Filepath]
df = pd.concat(df_list)
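If the index offset from the question's edit is also needed, it can go inside the same loop. A sketch, assuming MDF, channels and Filepath are defined as in the question:

frames = []
offset = 0
for path in Filepath:
    df = MDF(path).to_dataframe(channels)
    df.index += offset          # continue where the previous file ended
    offset = df.index.max()
    frames.append(df)
df = pd.concat(frames)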
I'm relatively new to Python, but my understanding of Python modules is that any object defined in a module can be exported. For example, if you had:
# my_module.py
obj1 = 4
obj2 = 8
you can import both these objects simply with from my_module import obj1, obj2.
While working with pandas, it is common to have code which looks like this (not actual working code):
# pandas_module.py
import pandas as pd
df = pd.DataFrame(...)
df = df.drop()
df = df[df.col > 0]
where the same object (df) is redefined multiple times. If I want to export df, how should I handle this? My guess is that if I simply do from pandas_module import df from elsewhere, all the pandas code will run first and I will get the final df as expected, but I'm not sure if this is good practice. Maybe it is better to do something like final_df = df.copy() and export final_df instead. That seems more understandable for someone who is not that familiar with Python.
So my question is, what is the proper way to handle this situation of exporting a df which is defined multiple times?
Personally, I usually create a function that returns a DataFrame object, such as:
# pandas_module.py
import pandas as pd
def clean_data():
    df = pd.DataFrame(...)
    df = df.drop()
    df = df[df.col > 0]
    return df
Then you can call the function from your main work flow and get the expected Dataframe:
from pandas_module import clean_data
df = clean_data()
I have the below code:
import pandas as pd
Orders = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Orders')
Returns = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Returns')
Sum_value = pd.DataFrame(Orders['Sales']).sum()
Orders_Year = pd.DatetimeIndex(Orders['Order Date']).year
Orders.merge(Returns, how="inner", on="Order ID")
which gives the output below [output screenshot not included in the post].
My requirement is: I would like to use groupby and see the output as below [screenshot also not included].
Can someone please help me with how to use groupby in the above code? I would like to see everything on a single line by using groupby.
You can do this by selecting the columns and assigning them to a new dataframe:
grouped = pd.DataFrame()
groupby = ['Year', 'Segment', 'Sales']
for i in groupby:
    grouped[i] = Orders[i]
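Since the question's screenshots are missing, here is a hedged sketch of an actual groupby aggregation; it assumes Orders and Returns are loaded as in the question, and that 'Year', 'Segment' and 'Sales' (the columns named above) are what should be aggregated:

merged = Orders.merge(Returns, how="inner", on="Order ID")
merged['Year'] = pd.DatetimeIndex(merged['Order Date']).year
# Total sales per year and segment, one row per group
summary = merged.groupby(['Year', 'Segment'])['Sales'].sum().reset_index()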
I am trying to read CSV files, concatenate them, and output them as one CSV file. I keep getting this error:
TypeError: cannot concatenate object of type '<class 'pandas.io.parsers.TextFileReader'>'; only Series and DataFrame objs are valid
I am not sure how to fix it. I am a beginner, so I would appreciate any help! Thank you! Here is the code I wrote:
import csv
import sys
import pandas as pd

csv.field_size_limit(sys.maxsize)
df1 = pd.read_csv('file1.csv', chunksize=20000)
df2 = pd.read_csv('file2.csv', chunksize=20000)
df3 = pd.read_csv('file3.csv', chunksize=20000)
df4 = pd.read_csv('file4.csv', chunksize=20000)
df5 = pd.read_csv('file5.csv', chunksize=20000)
df6 = pd.read_csv('file6.csv', chunksize=20000)
frames = [df1, df2, df3, df4, df5, df6]
result = pd.concat(frames, ignore_index=True, sort=False)
result.to_csv('new.csv')
If you call read_csv passing the chunksize parameter, it returns a TextFileReader object, which can be used, e.g. in a loop, to read and process consecutive chunks.
An example of how to use "chunked" CSV file reading:
reader = pd.read_csv('input.csv', chunksize=20000)
for chunk in reader:
    # Process the chunk (each chunk is a DataFrame)
    print(chunk.shape)
Or maybe you want to read only the initial 20000 rows from each source file and concatenate them into a new DataFrame? If that is the case, pass nrows=20000 (instead of chunksize) while reading from each file. Then all returned objects will be plain DataFrames and you will be able to concat them.
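Alternatively, to keep the chunked reading and still build one output file, the chunks themselves can be collected and concatenated. A sketch using the file names from the question:

import pandas as pd

files = ['file1.csv', 'file2.csv', 'file3.csv',
         'file4.csv', 'file5.csv', 'file6.csv']
chunks = []
for f in files:
    # Each iteration yields one DataFrame of up to 20000 rows
    for chunk in pd.read_csv(f, chunksize=20000):
        chunks.append(chunk)

result = pd.concat(chunks, ignore_index=True, sort=False)
result.to_csv('new.csv', index=False)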
I have an empty dataframe and want to append additional dataframes to it in a for loop without overwriting the existing ones; the regular append approach keeps overwriting, so the output shows only the last appended dataframe.
Use concat() from the pandas module:
import pandas as pd
df_new = pd.concat([df_empty, df_additional])
Read more about it in the pandas docs.
Regarding the question in the comment:
df = pd.DataFrame(columns=[...])  # the same columns your to-be-appended df has
for i in range(10):
    df_new = function_to_get_df_new()
    df = pd.concat([df, df_new])
Say you have a list of dataframes, list_of_df = [df1, df2, df3],
and an empty dataframe, df = pd.DataFrame().
If you want to append all the dataframes in the list into that empty dataframe df:
for i in list_of_df:
    df = df.append(i)
The loop above will not change df1, df2 or df3, but df will. Note that df.append(df1) alone does not change df; you must assign the result back, as in df = df.append(df1). (Also note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the forward-compatible choice.)
You can't use a set, though:
df_new = pd.concat({df_empty, df_additional})
fails, because pandas.DataFrame objects can't be hashed and a set needs hashable elements.
A tuple, however, works:
df_new = pd.concat((df_empty, df_additional))
and tuples are a little quicker to create than lists.
Update for the for loop:
df = pd.DataFrame(data)
for i in range(your_number):
    df_new = function_to_get_df_new()
    df = pd.concat((df, df_new))  # a tuple; a set literal would fail here, as noted above
The question is already well answered; my five cents is the suggestion to use the ignore_index=True option to get a continuous new index instead of duplicating the old ones.
import pandas as pd
df_to_append = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) # sample
df = pd.DataFrame() # this is a placeholder for the destination
for i in range(3):
    df = df.append(df_to_append, ignore_index=True)
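Since DataFrame.append was removed in pandas 2.0, the same result can be obtained with pd.concat; a sketch equivalent to the loop above:

import pandas as pd

df_to_append = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))  # sample
# Repeating the frame three times mirrors the three loop iterations
df = pd.concat([df_to_append] * 3, ignore_index=True)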
I don't think you need a for loop here; try concat():
import pandas
result = pandas.concat([emptydf,additionaldf])
pandas.concat documentation