I'm trying to merge multiple CSV files within a shared folder. In addition to merging the files, I'd like to add two additional columns that identify the data based on a portfolio code in the file name and a file date.
Currently I have the following code, which successfully merges the files:
import os, glob
import pandas as pd
path = "C:\\mydirectory"
all_files = glob.glob(os.path.join(path, "Mnthly_*.csv"))
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv("merged.csv", index=False)  # index=False keeps the row numbers out of the file
pd.read_csv("merged.csv")
How can I go about adding in some kind of loop or portion within the code to include the additional columns?
Input: 'Mnthly_1_XXX.csv'

| DATE       | Col_1  | Col_2    |
|:-----------|-------:|---------:|
| 9/30/2020  | 410900 | 44991418 |
| 10/31/2020 | 44936  | 48560570 |
Output:
| DATE       | Col_1  | Col_2    | Indicator | Date1     | Date2     |
|:-----------|-------:|---------:|:----------|:----------|:----------|
| 9/30/2020  | 410900 | 44991418 | XXXX      | 10/5/2020 | 10/6/2020 |
| 10/31/2020 | 44936  | 48560570 | XXXX      | 10/5/2020 | 10/6/2020 |
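One way to get there (a sketch, not a confirmed answer): parse the portfolio code out of each file name inside the loop and attach it as a column before concatenating. The `Mnthly_<n>_<code>.csv` naming and the use of the file's modification time for `Date1` are assumptions here; swap in wherever your real file dates come from (`Date2` likewise).

```python
import os
import glob
import pandas as pd

def portfolio_code(filepath):
    """Pull the trailing code out of a name like 'Mnthly_1_XXX.csv' (assumed pattern)."""
    stem = os.path.splitext(os.path.basename(filepath))[0]
    return stem.split("_")[-1]

path = "C:\\mydirectory"
frames = []
for f in glob.glob(os.path.join(path, "Mnthly_*.csv")):
    df = pd.read_csv(f)
    df["Indicator"] = portfolio_code(f)  # portfolio code from the file name
    # Date1 from the file's modification time -- an assumption; replace with
    # wherever your real file date comes from (and likewise for Date2)
    df["Date1"] = pd.Timestamp(os.path.getmtime(f), unit="s").strftime("%m/%d/%Y")
    frames.append(df)

if frames:  # empty when the folder has no matching files
    df_merged = pd.concat(frames, ignore_index=True)
    df_merged.to_csv("merged.csv", index=False)
```

Because the extra columns are added per-file before `pd.concat`, every row keeps the indicator of the file it came from.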
Related
So I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as some IDs for each event, for example.
Basically I want to write something in Python so that whenever there is an ID number (AL.....) it creates a new CSV, titled with that ID number, holding all the data up to the next ID number, so I end up with one CSV per event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is predicated only on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"
l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So the interesting positions are 0 and 10, since that is where the AL* strings sit.
Now, to filter on the AL* rows you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # indices of the rows that start with 'AL'
dfs = np.split(df, idx)  # split the frame at those positions
for out in dfs[1:]:  # dfs[0] is the empty chunk before the first 'AL' row
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # save each chunk
This gives you two CSV files, AL123.csv and AL321.csv, each with its AL* string as the first line.
I have a folder with multiple CSV files. Each CSV file has the same dimensions: they all have 2 columns, and the first column of each is the same. Is there a way to import all the CSVs and concatenate them into one DataFrame, in which the first file provides the first column along with its second column, and the subsequent files just have their second column of values added next to that? The header of the second column in each file is unique, but they all share the same header for the first column.
This would give you a combination of all the files in the path folder.
You can find all material related to merging or combining DataFrames here; check it out for every sort of combination of DataFrames (the CSVs you read in as DataFrames).
import pandas as pd
import os
path='path to folder'
all_files = os.listdir(path)
li = []
for filename in all_files:
    # os.path.join avoids missing-separator bugs from naive path+filename
    df = pd.read_csv(os.path.join(path, filename), index_col='H1')
    li.append(df)
frame = pd.concat(li, axis=1, ignore_index=False)
frame.to_csv(os.path.join(path, 'out.csv'))
print(frame)
The input files look like:
File1
+----+----+
| H1 | H2 |
+----+----+
| 1 | A |
| 2 | B |
| 3 | C |
+----+----+
File2:
+----+----+
| H1 | H2 |
+----+----+
| 1 | D |
| 2 | E |
| 3 | F |
+----+----+
File3:
+----+----+
| H1 | H2 |
+----+----+
| 1 | G |
| 2 | H |
| 3 | I |
+----+----+
Output (saved to out.csv in the same directory):
+----+----+----+----+
| H1 | H2 | H2 | H2 |
+----+----+----+----+
| 1 | A | D | G |
| 2 | B | E | H |
| 3 | C | F | I |
+----+----+----+----+
Here is how I would proceed, assuming that only CSV files are present in the folder.
import os
import pandas as pd
path = "path_of_the_folder"
files = os.listdir(path)
# os.listdir returns bare names, so join them back onto the folder path
dfs = [pd.read_csv(os.path.join(path, file)).set_index('col1') for file in files]
df_final = dfs[0].join(dfs[1:])
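A quick in-memory illustration of that join step (the frame and column names below are made up; what matters is that the second-column headers are unique across files, as the question states, since `join` refuses overlapping column names):

```python
import pandas as pd

# two tiny frames standing in for two files; 'col1' is the shared first column
df1 = pd.DataFrame({"col1": [1, 2, 3], "file1_H2": ["A", "B", "C"]}).set_index("col1")
df2 = pd.DataFrame({"col1": [1, 2, 3], "file2_H2": ["D", "E", "F"]}).set_index("col1")

# join accepts a list of frames and aligns them all on the shared index
combined = df1.join([df2])
print(combined)
```

The result has one row per `col1` value and one data column per file, which is exactly the side-by-side layout the question asks for.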
I have multiple csv files in the following manner. All of the files have the same format.
| | items | per_unit_amount | number of units |
|---:|:--------|------------------:|------------------:|
| 0 | book | 25 | 5 |
| 1 | pencil | 3 | 10 |
First, I want to calculate the total amount of each bill in Python. Once that is calculated, I need to compute the totals for all the CSV files at the same time, i.e. in a multi-threaded manner.
I need to do it using multi-threading.
This would be my way: first merge all the CSV files, then sum each item:
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'C:\\your csv location\\your csv location'
#select all csv file you can have some kind of filter too
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# print out to a new csv file
result.to_csv('all.csv', index=False)
Now you have all.csv, which is the merge of all your CSV files. We can sum any item with the code below:
dff = pd.read_csv('C:\\output folder\\output folder\\all.csv')
table = pd.pivot_table(dff, index=['items', 'per_unit_amount'], aggfunc='sum')  # default aggfunc is mean, so request sums explicitly
print(table)
You can use the pandas library to achieve that. Install pandas via pip install pandas.
The workflow should go like this:
Get a list of the filenames (filepath actually) of the csv files via glob
Iterate the filenames, load the files using pandas and keep them in a list
Concat the list of the dataframes into a big dataframe
Perform your desired calculations
from glob import glob
import pandas as pd
# getting a list of all the csv files' path
filenames = glob('./*.csv')
# list of dataframes
dfs = [pd.read_csv(filename) for filename in filenames]
# concat all dataframes into one dataframe
big_df = pd.concat(dfs, ignore_index=True)
The big_df should look like this. Here, I have used two csv files with two rows of input. So the concatenated dataframe has 4 rows in total.
| | items | per_unit_amount | number of units |
|---:|:--------|------------------:|------------------:|
| 0 | book | 25 | 5 |
| 1 | pencil | 3 | 10 |
| 2 | book | 25 | 5 |
| 3 | pencil | 3 | 10 |
Now let's multiply per_unit_amount with number of units to get unit_total:
big_df['unit_total'] = big_df['per_unit_amount'] * big_df['number of units']
Now the dataframe has an extra column:
| | items | per_unit_amount | number of units | unit_total |
|---:|:--------|------------------:|------------------:|-------------:|
| 0 | book | 25 | 5 | 125 |
| 1 | pencil | 3 | 10 | 30 |
| 2 | book | 25 | 5 | 125 |
| 3 | pencil | 3 | 10 | 30 |
You can calculate the total by summing all the entries in the unit_total column:
total_amount = big_df['unit_total'].sum()
> 310
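Neither answer above actually uses threads, though the question asks for a multi-threaded run. A minimal sketch with `concurrent.futures` (the column names come from the question's table; the function names are made up): give each file to its own thread, total it there, and sum the per-file results.

```python
from concurrent.futures import ThreadPoolExecutor
import glob
import os
import pandas as pd

def bill_total(csv_path):
    """Total amount for a single bill file."""
    df = pd.read_csv(csv_path)
    return (df["per_unit_amount"] * df["number of units"]).sum()

def total_of_all(folder):
    """Read every CSV in `folder` on its own thread and sum the totals."""
    files = glob.glob(os.path.join(folder, "*.csv"))
    # threads overlap the file I/O; the CPU-bound parsing still shares the GIL
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(bill_total, files))
```

With the two sample files from above (each totalling 155), `total_of_all` returns the same 310 as the merged-then-summed approach.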
So I have Excel data like:
+---+--------+----------+----------+----------+----------+---------+
| | A | B | C | D | E | F |
+---+--------+----------+----------+----------+----------+---------+
| 1 | Name | 266 | | | | |
| 2 | A | B | C | D | E | F |
| 3 | 0.1744 | 0.648935 | 0.947621 | 0.121012 | 0.929895 | 0.03959 |
+---+--------+----------+----------+----------+----------+---------+
My main labels are on row 2, but I need to delete the first row. I am using the following pandas code:
import pandas as pd
excel_file = 'Data.xlsx'
c1 = pd.read_excel(excel_file)
How do I make the 2nd row as my main label row?
You can use the skiprows parameter to skip the top row; you can read more about the parameters available to read_excel in the pandas documentation.
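To make that concrete, here is a small self-contained sketch: it first writes a sample sheet shaped like the question's data (so the snippet runs on its own; it assumes an Excel engine such as openpyxl is installed), then reads it back with `header=1`, which is equivalent to `skiprows=1` here.

```python
import pandas as pd

# build a tiny sample sheet shaped like the question's data
pd.DataFrame(
    [["Name", 266, None, None, None, None],
     ["A", "B", "C", "D", "E", "F"],
     [0.1744, 0.648935, 0.947621, 0.121012, 0.929895, 0.03959]]
).to_excel("Data.xlsx", index=False, header=False)

# header=1 (equivalently skiprows=1) makes spreadsheet row 2 the column labels
c1 = pd.read_excel("Data.xlsx", header=1)
```

After this, `c1.columns` is `A` through `F` and the `Name`/`266` row is gone.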
I have a table that needs to be split into multiple files grouped by values in column 1 - serial.
+--------+--------+-------+
| serial | name | price |
+--------+--------+-------+
| 100-a | rdl | 123 |
| 100-b | gm1 | -120 |
| 100-b | gm1 | 123 |
| 180r | xxom | 12 |
| 182d | data11 | 11.50 |
+--------+--------+-------+
the output would be like this:
100-a.xls
100-b.xls
180r.xls etc. etc.
and opening 100-b.xls contains this:
+--------+------+-------+
| serial | name | price |
+--------+------+-------+
| 100-b | gm1 | -120 |
| 100-b | gm1 | 123 |
+--------+------+-------+
I tried using Pandas to define the dataframe by using this code:
import pandas as pd
#from itertools import groupby
df = pd.read_excel('myExcelFile.xlsx')
I was successful in getting the data frame, but I have no idea what to do next. I tried following this similar question on Stack Overflow, but the scenario is a bit different. What is the next step?
This is not a groupby but a filter.
You need to follow 2 steps :
Generate the data that you need in the excel file
Save dataframe as excel.
Something like this should do the trick:
for x in df.serial.unique():  # the list() wrapper is unnecessary
    df[df.serial == x].to_excel("{}.xlsx".format(x))
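For what it's worth, the same split can also be written with `groupby`, which hands you each serial's rows directly instead of re-filtering the frame per value. A sketch (the function name and `out_dir` parameter are mine, and it assumes an Excel engine such as openpyxl is available):

```python
import os
import pandas as pd

def split_by_serial(df, out_dir="."):
    """Write one Excel file per serial value; return the paths written."""
    written = []
    for serial, group in df.groupby("serial"):
        path = os.path.join(out_dir, "{}.xlsx".format(serial))
        group.to_excel(path, index=False)  # one file per group, named by serial
        written.append(path)
    return written
```

Either way, each output file holds exactly the rows sharing one serial, matching the 100-b.xls example above.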