Passing a dataframe and using its name to create the CSV file - Python

I have a requirement where I need to pass different dataframes, print the rows of each dataframe to a CSV file, and name the file after the dataframe. For example, below is the dataframe:
**Dataframe**
| Students | Mark | Grade |
| -------- | -----|------ |
| A | 90 | a |
| B | 60 | d |
| C | 40 | b |
| D | 45 | b |
| E | 66 | d |
| F | 80 | b |
| G | 70 | c |
A_Grade = df.loc[df['Grade'] == 'a']
B_Grade = df.loc[df['Grade'] == 'b']
C_Grade = df.loc[df['Grade'] == 'c']
D_Grade = df.loc[df['Grade'] == 'd']
E_Grade = df.loc[df['Grade'] == 'e']
F_Grade = df.loc[df['Grade'] == 'f']
Each of these dataframes (A_Grade, B_Grade, C_Grade, etc.) needs to be written to a separate file named A_Grade.csv, B_Grade.csv, C_Grade.csv, and so on.
I wanted to use a for loop and pass the dataframe in, rather than writing a separate line to create each file, because the number of dataframes varies. The code also sends a message using a Telegram bot. The snippet I tried is below, but it didn't work. In short, the main thing is to dynamically create the CSV file with the dataframe's name.
for df in (A_Grade, B_Grade, C_Grade):
    if len(df):
        dataframeitems.to_csv(f'C:\Documents\'+{df}+'{dt.date.today()}.csv', index=False)
        bot.send_message(chat_id=group_id, text='##{dfname.name} ##')
The solution given by @Ynjxsjmh works. Thanks @Ynjxsjmh. But I have another scenario where a function, as below, has a dataframe passed as an argument, and the result of operating on that dataframe needs to be saved as a CSV with the dataframe's name.
def func(dataframe):
    ...
    ...
    ...
    dataframe2 = some actions and operations on dataframe
    result = dataframe2
    result.to_csv(params.datafilepath + f'ResultFolder\{dataframe}_{dt.date.today()}.csv', index=False)
The file needs to be saved under the name of the dataframe, i.e. as <dataframe name>.csv.

I could get the name using the code below:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name

filename = get_df_name(dataframe)
print(filename)
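Note that get_df_name only works when the dataframe is bound to a global variable. A simpler sketch, assuming you are free to change the function signature, is to pass the name in explicitly (the function body here is only a placeholder):

import datetime as dt

def func(dataframe, name):
    # placeholder for the real operations on the dataframe
    result = dataframe
    result.to_csv(params.datafilepath + f'ResultFolder\\{name}_{dt.date.today()}.csv', index=False)

func(A_Grade, 'A_Grade')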

Your f-string is malformed. You can use Series.unique() to get the unique values in a Series:
for grade in df['Grade'].unique():
    grade_df = df[df['Grade'].eq(grade)]
    grade_df.to_csv(f'C:\\Documents\\{grade.upper()}_Grade_{dt.date.today()}.csv', index=False)
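For completeness, the Telegram notification from the question could be folded into the same loop; a sketch, assuming bot and group_id are already set up as in the question:

import datetime as dt

for grade in df['Grade'].unique():
    grade_df = df[df['Grade'].eq(grade)]
    filename = f'C:\\Documents\\{grade.upper()}_Grade_{dt.date.today()}.csv'
    grade_df.to_csv(filename, index=False)
    # let the Telegram group know that this grade's file was written
    bot.send_message(chat_id=group_id, text=f'## {grade.upper()}_Grade ##')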

Related

Splitting a csv into multiple csv's depending on what is in column 1 using python

I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as some IDs for each event.
Basically, I want to write something in Python that, whenever there is an ID number (AL.....), creates a new CSV with that ID number as the title, containing all the data before the next ID number, so I end up with one CSV per event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is predicated only on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np

def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"

l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So the interesting positions are 0 and 10, as that is where the AL* strings are...
Now, to filter on the AL* rows you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # gets all indices where an AL* string is
dfs = np.split(df, idx)  # splits the data at those positions
for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # saves the data
This gives you two csv files named AL123.csv and AL321.csv with the first line being the AL* string.
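An alternative sketch, assuming (as in the generated data) that the file starts with an AL* row, groups on a running count of the ID rows instead of using np.split:

# each AL* row starts a new group; the cumulative sum numbers the groups
groups = df['idx'].str.startswith('AL').cumsum()
for _, part in df.groupby(groups):
    # the first row of each group holds the AL* id, which becomes the file name
    part.to_csv(part.iloc[0, 0] + ".csv", index=False, header=False)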

Python Pandas datasets - Having integers values in new column by making a dictionary

I am trying to output integer values (labels/classes) in a new column, based on the labels in another column of my dataset. I actually did it by creating a new boolean column for each class (with a numerical column heading) and then using those to build the new class column with numerical values. But I was trying to do it with a dictionary, which I think is a cleaner and faster way.
If I run a code like this:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(0, len(item_list)):
    item_type_mapping[item_list[i]] = i
It generates the dictionary, but then if I run:
df['Item_Type']=df['Item_Type'].map(lambda x:item_type_mapping[x])
or
df['New_column']=[item_type_mapping[item] for item in data.Item_Type]
it raises KeyError: None.
Does anybody know why this occurs? I find it strange, since the dictionary has been created and I can see it in my variables.
Thanks
Edit 1
@Fourier: simply put, I have this column:
| Item_type|
| -------- |
| Nino |
| Nino |
| Nino |
| Pasquale |
| Franco |
| Franco |
and then I need the same column or a new one to display:
| Item_type| New_column |
| -------- | ---------- |
| Nino | 1 |
| Nino | 1 |
| Nino | 1 |
| Pasquale | 2 |
| Franco | 3 |
| Franco | 3 |
Your code works for me, but what you're trying to do is already provided by pandas as Categorical data.
df = pd.DataFrame({'Item_Type': list('abca')})
df['New_column'] = df.Item_Type.astype('category').cat.codes
Result:
Item_Type New_column
0 a 0
1 b 1
2 c 2
3 a 0
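If you still want to build the mapping yourself, Series.map also accepts the dictionary directly, which avoids the lambda and the KeyError on unmapped values; a small sketch using the item_type_mapping from the question:

# values missing from the dictionary become NaN instead of raising KeyError
df['New_column'] = df['Item_Type'].map(item_type_mapping)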

process multiple csv file in python

I have multiple csv files in the following manner. All of the files have the same format.
| | items | per_unit_amount | number of units |
|---:|:--------|------------------:|------------------:|
| 0 | book | 25 | 5 |
| 1 | pencil | 3 | 10 |
First, I want to calculate the total bill amount for each file in Python. Once that is calculated, I need to calculate the total bill amount for all the CSV files at the same time, i.e. in a multi-threaded manner.
This would be my way: first merge all the CSV files, then sum each item:
import glob
import os
import pandas as pd

# the path to your csv file directory
mycsvdir = 'C:\\your csv location\\your csv location'

# select all csv files; you can add some kind of filter too
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

# loop through the files and read them in with pandas
dataframes = []  # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)

# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)

# print out to a new csv file
result.to_csv('all.csv')
Now you have an all.csv file that is the merge of your CSV files. We can sum any item with the code below (note that pivot_table averages by default, so aggfunc='sum' is needed):
dff = pd.read_csv('C:\\output folder\\output folder\\all.csv')
table = pd.pivot_table(dff, index=['items', 'per_unit_amount'], aggfunc='sum')
print(table)
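To get the overall bill amount from the merged frame, one option (a sketch assuming the column names shown in the question) is:

# price per unit times quantity, summed over every row of the merged file
total_amount = (dff['per_unit_amount'] * dff['number of units']).sum()
print(total_amount)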
You can use the pandas library to achieve that. Install pandas via pip install pandas.
The workflow should go like this:
1. Get a list of the filenames (file paths, actually) of the CSV files via glob
2. Iterate over the filenames, load each file using pandas, and keep the dataframes in a list
3. Concat the list of dataframes into one big dataframe
4. Perform your desired calculations
from glob import glob
import pandas as pd
# getting a list of all the csv files' path
filenames = glob('./*csv')
# list of dataframes
dfs = [pd.read_csv(filename) for filename in filenames]
# concat all dataframes into one dataframe
big_df = pd.concat(dfs, ignore_index=True)
The big_df should look like this. Here, I have used two csv files with two rows of input. So the concatenated dataframe has 4 rows in total.
| | items | per_unit_amount | number of units |
|---:|:--------|------------------:|------------------:|
| 0 | book | 25 | 5 |
| 1 | pencil | 3 | 10 |
| 2 | book | 25 | 5 |
| 3 | pencil | 3 | 10 |
Now let's multiply per_unit_amount with number of units to get unit_total:
big_df['unit_total'] = big_df['per_unit_amount'] * big_df['number of units']
Now the dataframe has an extra column:
| | items | per_unit_amount | number of units | unit_total |
|---:|:--------|------------------:|------------------:|-------------:|
| 0 | book | 25 | 5 | 125 |
| 1 | pencil | 3 | 10 | 30 |
| 2 | book | 25 | 5 | 125 |
| 3 | pencil | 3 | 10 | 30 |
You can calculate the total by summing all the entries in the unit_total column:
total_amount = big_df['unit_total'].sum()
> 310
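The question also asks for multi-threading, which the snippets above do not cover. A minimal sketch with concurrent.futures, assuming every file has the column names shown above:

from concurrent.futures import ThreadPoolExecutor
from glob import glob
import pandas as pd

def bill_total(filename):
    # read one csv and return its total bill amount
    df = pd.read_csv(filename)
    return (df['per_unit_amount'] * df['number of units']).sum()

filenames = glob('./*.csv')

# compute the per-file totals in a thread pool, then add them up
with ThreadPoolExecutor() as executor:
    per_file_totals = list(executor.map(bill_total, filenames))

total_amount = sum(per_file_totals)
print(total_amount)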

Trouble with schema using pyarrow.parquet ParquetDataset (how to force a particular schema)

Let me explain the context:
Someone gives me multiple parquet files obtained from multiple .csv files. I want to read all these parquet files and build one big dataset. For this I use the pyarrow.parquet package.
So, I have multiple parquet files (we could call them file1.pq, file2.pq, file3.pq). All the files have exactly the same structure: the same column names and the same kind of column content. But sometimes, in one file, the value in every row of a column is NA. In this particular case the call dataset = pq.ParquetDataset(file_list) fails because the physical type changes.
Let's make a visual example:
File1.csv
|       | C1  | C2 |
|-------|-----|----|
| Row 1 | YES | 10 |
| Row 2 | NA  | 15 |
| Row 3 | NO  | 9  |

File2.csv
|       | C1  | C2 |
|-------|-----|----|
| Row 1 | NA  | 10 |
| Row 2 | NA  | 15 |
| Row 3 | NA  | 9  |

File3.csv
|       | C1  | C2 |
|-------|-----|----|
| Row 1 | YES | 10 |
| Row 2 | NA  | 15 |
| Row 3 | NO  | 9  |
After conversion to parquet we have:
pq.ParquetFile("File1.pq").schema[1].physical_type = 'BYTE_ARRAY' --> good !
pq.ParquetFile("File1.pq").schema[2].physical_type = 'DOUBLE' --> good !
pq.ParquetFile("File2.pq").schema[1].physical_type = 'DOUBLE' --> BAD !
pq.ParquetFile("File2.pq").schema[2].physical_type = 'DOUBLE' --> good !
pq.ParquetFile("File3.pq").schema[1].physical_type = 'BYRE_ARRAY' --> good!
pq.ParquetFile("File3.pq").schema[2].physical_type = 'DOUBLE' --> good !
I tried to open each parquet file and modify the column type with something like this:
for i in np.arange(0, len(file_list)):
    if list_have_to_change[i] != []:
        df = pd.read_parquet(file_list[i])
        df[list_have_to_change[i]] = df[list_have_to_change[i]].astype(bytearray)
        df.to_parquet(COPIEPATH + "\\" + ntpath.basename(file_list[i]))
    else:
        shutil.move(file_list[i], COPIEPATH + "\\" + ntpath.basename(file_list[i]))
Where:
file_list contains all the parquet files;
list_have_to_change is a list of lists of the column names whose type has to be changed. In our example it was [[], ['C1'], []].
But after the to_parquet() call the schema still reports
BYTE_ARRAY for file 1;
DOUBLE for file 2;
BYTE_ARRAY for file 3;
So it didn't change anything.
Question: how can I force the schema when I save to a parquet file, or how can I use pq.ParquetDataset(file_list) with inconsistent physical types?
I hope that was clear; thank you in advance for your help.
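One possible fix on the pandas side, sketched under the assumption that pandas >= 1.0 and the pyarrow engine are used (names follow the question's code): astype(bytearray) does not produce a string column, but converting the all-NA columns to pandas' nullable string dtype keeps the NA values and is written to parquet as BYTE_ARRAY.

import ntpath
import shutil
import pandas as pd

for i in range(len(file_list)):
    if list_have_to_change[i]:
        df = pd.read_parquet(file_list[i])
        # nullable string dtype keeps the NA values but maps to BYTE_ARRAY in parquet
        df[list_have_to_change[i]] = df[list_have_to_change[i]].astype("string")
        df.to_parquet(COPIEPATH + "\\" + ntpath.basename(file_list[i]))
    else:
        shutil.move(file_list[i], COPIEPATH + "\\" + ntpath.basename(file_list[i]))

Alternatively, newer pyarrow versions accept an explicit schema when building a dataset, which may also be worth trying, but the pandas-side conversion above is the smaller change to the existing code.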

How to link data records in pandas?

I have a csv file which I can read into a pandas data frame. The data is like:
+--------+---------+------+----------------+
| Name | Address | ID | Linked_To |
+--------+---------+------+----------------+
| Name 1 | ABC | 1233 | 1234;1235 |
| Name 2 | DEF | 1234 | 1233;1236;1237 |
| Name 3 | GHI | 1235 | 1234;1233;2589 |
+--------+---------+------+----------------+
How do I analyse the linkage between the ID and Linked_To columns? For example, should I turn the Linked_To values into lists and do a VLOOKUP-type analysis against the ID column? I know there must be an obvious way to do this, but I am stumped.
Ideally the end result should be a list or dictionary that has the entire attributes of the row, including all of the other records it is linked to.
OR is this a problem where I should be transforming the data into an SQL database?
For both the unique and non-unique cases, a dictionary of the IDs in Linked_To for each ID can be obtained via:
def linked_ids(df):
    # set up the dictionary
    links = {}
    # iterate through the rows
    for row in df.index:
        # separate the semicolon-delimited Linked_To field
        linked_to = df.loc[row, 'Linked_To'].split(";")
        # create the entry for this ID on first sight
        if df.loc[row, 'ID'] not in links:
            links[df.loc[row, 'ID']] = []
        # append each linked id that is not already recorded
        for linked_id in linked_to:
            if linked_id not in links[df.loc[row, 'ID']]:
                links[df.loc[row, 'ID']].append(linked_id)
    return links
If you are working with a pandas dataframe, try this:
df.set_index('ID').Linked_To.str.split(';').to_dict()
Out[142]:
{1233: ['1234', '1235'],
1234: ['1233', '1236', '1237'],
1235: ['1234', '1233', '2589']}
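If the goal is the full attributes of each linked record rather than just the IDs, one sketch (assuming the referenced IDs exist in the ID column and converting the split strings to integers so they match) is:

# one dict of row attributes per ID, plus the parsed link lists
records = df.set_index('ID').to_dict('index')
links = df.set_index('ID')['Linked_To'].str.split(';').to_dict()

# for every ID, collect the full rows of the records it links to
linked_records = {
    i: [records[int(x)] for x in ids if int(x) in records]
    for i, ids in links.items()
}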
