I have multiple csv files in the following format. All of the files share the same columns.
| | items | per_unit_amount | number of units |
|---:|:--------|------------------:|------------------:|
| 0 | book | 25 | 5 |
| 1 | pencil | 3 | 10 |
First, I want to calculate the total bill amount in Python. Then I need to calculate the totals for all of the csv files at the same time, i.e. in a multi-threaded manner.
This would be my approach: first merge all the CSV files, then sum each item.
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'C:\\your csv location\\your csv location'
# select all csv files; you can apply some kind of filter here too
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# write out to a new csv file (index=False keeps the row numbers out of the file)
result.to_csv('all.csv', index=False)
Now you have all.csv, which is the merge of all your CSV files. We can aggregate any item with the code below:
dff = pd.read_csv('C:\\output folder\\output folder\\all.csv')
table = pd.pivot_table(dff, index=['items', 'per_unit_amount'], aggfunc='sum')
print(table)
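If what you need is the total bill amount rather than a per-item table, a minimal variant (assuming the column names from the sample above):

# price times quantity per row, then sum over all rows
dff['row_total'] = dff['per_unit_amount'] * dff['number of units']
total_amount = dff['row_total'].sum()
print(total_amount)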
You can use the pandas library to achieve that. Install pandas via pip install pandas.
The workflow should go like this:
Get a list of the filenames (filepath actually) of the csv files via glob
Iterate the filenames, load the files using pandas and keep them in a list
Concat the list of the dataframes into a big dataframe
Perform your desired calculations
from glob import glob
import pandas as pd
# getting a list of all the csv files' path
filenames = glob('./*.csv')
# list of dataframes
dfs = [pd.read_csv(filename) for filename in filenames]
# concat all dataframes into one dataframe
big_df = pd.concat(dfs, ignore_index=True)
The big_df should look like this. Here, I have used two csv files with two rows each, so the concatenated dataframe has 4 rows in total.
| | items | per_unit_amount | number of units |
|---:|:--------|------------------:|------------------:|
| 0 | book | 25 | 5 |
| 1 | pencil | 3 | 10 |
| 2 | book | 25 | 5 |
| 3 | pencil | 3 | 10 |
Now let's multiply per_unit_amount with number of units to get unit_total:
big_df['unit_total'] = big_df['per_unit_amount'] * big_df['number of units']
Now the dataframe has an extra column:
| | items | per_unit_amount | number of units | unit_total |
|---:|:--------|------------------:|------------------:|-------------:|
| 0 | book | 25 | 5 | 125 |
| 1 | pencil | 3 | 10 | 30 |
| 2 | book | 25 | 5 | 125 |
| 3 | pencil | 3 | 10 | 30 |
You can calculate the total by summing all the entries in the unit_total column:
total_amount = big_df['unit_total'].sum()
> 310
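Since the question asks for a multi-threaded version, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor (assuming every file has the sample's column names): each thread computes one file's bill total, and the results are summed at the end.

from concurrent.futures import ThreadPoolExecutor
from glob import glob
import pandas as pd

def file_total(filename):
    # read one csv and return its bill total
    df = pd.read_csv(filename)
    return (df['per_unit_amount'] * df['number of units']).sum()

filenames = glob('./*.csv')
with ThreadPoolExecutor() as executor:
    # each file is read and totalled in its own thread
    totals = list(executor.map(file_total, filenames))

total_amount = sum(totals)
print(total_amount)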
I have a requirement where I need to pass different dataframes and write the rows of each dataframe to a csv file whose name is the dataframe's name. Example: below is the dataframe.
**Dataframe**
| Students | Mark | Grade |
| -------- | -----|------ |
| A | 90 | a |
| B | 60 | d |
| C | 40 | b |
| D | 45 | b |
| E | 66 | d |
| F | 80 | b |
| G | 70 | c |
A_Grade=df.loc[df['grade']=='a']
B_Grade=df.loc[df['grade']=='b']
C_Grade=df.loc[df['grade']=='c']
D_Grade=df.loc[df['grade']=='d']
E_Grade=df.loc[df['grade']=='e']
F_Grade=df.loc[df['grade']=='f']
Each of these dataframes (A_Grade, B_Grade, C_Grade, etc.) needs to be written to a separate file named A_Grade.csv, B_Grade.csv, C_Grade.csv, and so on.
I wanted to use a for loop and pass the dataframes, rather than writing a separate line per file, since the number of dataframes varies. The code also sends a message using a Telegram bot. The snippet I tried is below, but it didn't work. In short, the main thing is to dynamically create the csv file with the dataframe's name.
for df in (A_Grade, B_Grade, C_Grade):
    if(len(df))
        dataframeitems.to_csv(f'C:\Documents\'+{df}+'{dt.date.today()}.csv', index=False)
        bot.send_message(chat_id=group_id, text='##{dfname.name} ##')
The solution given by #Ynjxsjmh works. Thanks #Ynjxsjmh. But I have another scenario, where a function like the one below has a dataframe passed as an argument, and the result of operating on that dataframe needs to be saved as a csv under the dataframe's name.
def func(dataframe):
    ...
    dataframe2 = some actions and operations on dataframe
    result = dataframe2
    result.to_csv(params.datafilepath + f'ResultFolder\{dataframe}_{dt.date.today()}.csv', index=False)
The file needs to be saved as the name of the dataframe plus .csv. I could get the name using the code below:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name
filename=get_df_name(dataframe)
print(filename)
Your f-string is malformed. You can use Series.unique() to get the unique values in a Series and loop over them:
import datetime as dt

for grade in df['Grade'].unique():
    grade_df = df[df['Grade'].eq(grade)]
    # a raw string stops the backslashes being read as escape sequences
    grade_df.to_csv(rf'C:\Documents\{grade.upper()}_Grade_{dt.date.today()}.csv', index=False)
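As a side note, if the dataframes really are built one by one as in the question, keeping them in a dict keyed by name avoids the globals() lookup entirely. A minimal sketch, reusing the question's A_Grade/B_Grade/C_Grade frames:

import datetime as dt

# map each intended file name to its dataframe
frames = {'A_Grade': A_Grade, 'B_Grade': B_Grade, 'C_Grade': C_Grade}
for name, frame in frames.items():
    if len(frame):
        frame.to_csv(rf'C:\Documents\{name}_{dt.date.today()}.csv', index=False)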
So I currently have a large csv containing data for a number of events.
Column one contains a number of dates as well as some ids for each event, for example.
Basically, I want to write something in Python so that whenever there is an id number (AL.....) it creates a new csv, with the id number as the title, containing all the data before the next id number, so I end up with a csv for each event.
For info, the whole csv contains 8 columns, but the division into individual csvs is predicated only on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new csvs after the id numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"
l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So the interesting positions are 0 and 10, as those hold the AL* strings.
Now, to filter for the AL* rows you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # gets all indices whose value starts with 'AL'
dfs = np.split(df, idx)  # splits the dataframe at those positions
for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # saves the data
This gives you two csv files named AL123.csv and AL321.csv, each with the AL* string as its first line.
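If you would rather not keep the AL* id inside each file, a small variant of the loop above drops that first row before saving:

for out in dfs[1:]:
    name = out.iloc[0, 0]
    # skip the AL* row itself and write only the event's data rows
    out.iloc[1:].to_csv(name + ".csv", index=False, header=False)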
I'm trying to merge multiple csv files within a shared folder. In addition to merging the files, I'd like to add two additional columns that identify the data, based on a portfolio code in the file name and a file date.
Currently I have the following code, which successfully merges the files:
import os, glob
import pandas as pd
path = "C:\\mydirectory"
all_files = glob.glob(os.path.join(path, "Mnthly_*.csv"))
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv( "merged.csv")
pd.read_csv("merged.csv")
How can I go about adding in some kind of loop or portion within the code to include the additional columns?
Input: 'Mnthly_1_XXX'
| DATE       |  Col_1 |    Col_2 |
|:-----------|-------:|---------:|
| 9/30/2020  | 410900 | 44991418 |
| 10/31/2020 |  44936 | 48560570 |
Output:
| DATE       |  Col_1 |    Col_2 | Indicator | Date1     | Date2     |
|:-----------|-------:|---------:|:----------|:----------|:----------|
| 9/30/2020  | 410900 | 44991418 | XXXX      | 10/5/2020 | 10/6/2020 |
| 10/31/2020 |  44936 | 48560570 | XXXX      | 10/5/2020 | 10/6/2020 |
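A minimal sketch of one way to do this, assuming the portfolio code is the third underscore-separated token of a name like Mnthly_1_XXX.csv and the file date comes from the file's modification time (both are assumptions; a second date column could be added the same way once its source is known):

import datetime as dt
import glob
import os
import pandas as pd

path = "C:\\mydirectory"
all_files = glob.glob(os.path.join(path, "Mnthly_*.csv"))

frames = []
for f in all_files:
    df = pd.read_csv(f, sep=',')
    base = os.path.splitext(os.path.basename(f))[0]
    # assumed: the portfolio code is the third underscore-separated token
    df['Indicator'] = base.split('_')[2]
    # assumed: the file date is the file's last-modification date
    df['Date1'] = dt.date.fromtimestamp(os.path.getmtime(f))
    frames.append(df)

df_merged = pd.concat(frames, ignore_index=True)
df_merged.to_csv("merged.csv", index=False)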
I'm really new to coding. I have 2 columns in Excel, one for ingredients and the other for the ratio.
Like this:
ingredients [methanol/ipa, ethanol/methanol, ethylacetate]
spec [90/10, 70/30, 100]
qty [5, 6, 10]
So this data is entered continuously. I want to get the total amount of each ingredient: e.g. from the first row, methanol will be 90% of 5 and ipa will be 10% of 5.
I tried to split the entries on / and use a for loop to iterate:
import pandas as pd
solv={'EA':0,'M':0,'AL':0,'IPA':0}
data_xls1=pd.read_excel(r'C:\Users\IT123\Desktop\Solvent stock.xlsx',Sheet_name='PLANT',index_col=None)
sz=range(len(data_xls1.index))
a=data_xls1.Solvent.str.split('/',0).tolist()
b=data_xls1.Spec.str.split('/',0).tolist()
print(a)
for i in sz:
    print(b[i][0:1])
    print(b[i][1:2])
I want to split the ingredients and spec columns, multiply by qty, and store the results in the solv dictionary.
The error right now is: float object is not subscriptable.
You have already found the key part, namely using the str.split function.
I would suggest that you bring the data to a long format like this:
| | Transaction | ingredients | spec | qty |
|---:|--------------:|:--------------|-------:|------:|
| 0 | 0 | methanol | 90 | 4.5 |
| 1 | 0 | ipa | 10 | 0.5 |
| 2 | 1 | ethanol | 70 | 4.2 |
| 3 | 1 | methanol | 30 | 1.8 |
| 4 | 2 | ethylacetate | 100 | 10 |
The following code produces that result:
import pandas as pd
d = {"ingredients":["methanol/ipa","ethanol/methanol","ethylacetate"],
"spec":["90/10","70/30","100"],
"qty":[5,6,10]
}
df = pd.DataFrame(d)
df.index = df.index.rename("Transaction")  # add a sensible name to the index
# Each line represents a transaction with one or more ingredients.
# The following lines split the columns on the delimiter; stack() moves them to long format.
ingredients = df.ingredients.str.split("/", expand=True).stack()
spec = df.spec.str.split("/", expand=True).stack()
Each of them will look like this:
| (Transaction, n) | spec |
|:-----------------|-----:|
| (0, 0) | 90 |
| (0, 1) | 10 |
| (1, 0) | 70 |
| (1, 1) | 30 |
| (2, 0) | 100 |
Now we just need to put everything together:
df_new = pd.concat([ingredients, spec], axis = "columns")
df_new.columns = ["ingredients", "spec"]
# switch from string to float
df_new.spec = df_new.spec.astype("float")
# multiply by the quantity;
# pandas automatically aligns on Transaction (the index of both frames)
df_new["qty"] = df_new.spec * df.qty / 100
# if you are not comfortable working with a MultiIndex, just run this line:
df_new = df_new.reset_index(level=0, drop=False).reset_index(drop=True)
The good thing about this format is that you can have multi-way splits of your ingredients: str.split will work without a problem, and summing up is straightforward.
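For instance, the total quantity of each ingredient is then a one-line groupby:

totals = df_new.groupby("ingredients")["qty"].sum()
print(totals)
# from the table above: ethanol 4.2, ethylacetate 10.0, ipa 0.5, methanol 6.3 (4.5 + 1.8)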
I should have posted this first, but this is what my input excel sheet looks like.
So I have created a data frame as follows -
| id | Image_name | result | classified |
|:---|:-----------|-------:|-----------:|
| 01 | 1.bmp      |      0 |         10 |
| 02 | 2.bmp      |      1 |         11 |
| 03 | 3.bmp      |      0 |         10 |
| 04 | 4.bmp      |      2 |         12 |
Now, in my directory I have a folder called images, where I have all the .bmp files stored (1.bmp, 2.bmp, 3.bmp, 4.bmp and so on ).
I am trying to write a script that automatically finds the files listed in the "Image_name" column of the data frame and returns their result and classified values respectively.
import pandas as pd
import glob
import os
data = pd.read_csv("filename.csv")
for file in glob.glob("*.bmp"):
    fname = os.path.basename(file)
So this was my initial code. I want to collect all the extracted fnames, then check whether each fname exists in the dataframe and display it with its result and classified columns.
First get all the image names from the folder and store them in a list:
all_files_names=os.listdir("#path to the dir")
df.loc[df['Image_name'].isin(all_files_names)]
Output (assuming all four are there):
id Image_name result classified
0 1 1.bmp 0 10
1 2 2.bmp 1 11
2 3 3.bmp 0 10
3 4 4.bmp 2 12
It seems like you just want to access the rows where Image_name is the same as the file, and get the result and classified columns.
Try this:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""
id | Image_name | result | classified
01 | 1.bmp | 0 | 10
02 | 2.bmp | 1 | 11
03 | 3.bmp | 0 | 10
04 | 4.bmp | 2 | 12
"""), sep=r"\s+\|\s+")
file_example = "2.bmp"
print(df[df['Image_name'] == file_example][["result", "classified"]])
You can use boolean masking for this. You can read more about it at the link below: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
for file_name in df['Image_name']:
    print(df[df['Image_name'] == file_name][['result', 'classified']])
Hope this helped!
In case you need the same algorithm for a lot of images (a few thousand to hundreds of thousands), it is best to make the filter column the index of your DataFrame before calling the .isin() method on it.
image_file_names=os.listdir("#path to the dir")
df = df.set_index(df['Image_name'])
df = df.loc[df.index.isin(image_file_names)]
Hope this helps :))