Create new dataframe based on condition - python

I have a series of csv files inside a directory. Each csv file has the following columns:
slotID; NLunUn; NLunTot; MeanBPM
Starting from the values contained in the slotID column, I would like to create data frames that hold the corresponding rows. E.g.
the 1st csv has the following values:
slotID NLunAn NLunTot MeanBPM
7 11 78 129,7
11 6 63 123,3
12 6 33 120,6
13 5 41 124,5
14 4 43 118,9
the 2nd csv has the following values
slotID NMarAn NMarTot MeanBPM
7 10 72 131,2
11 5 48 121,5
12 4 17 120,9
13 4 19 125,6
16 6 45 127,4
I would like to create a dataframe called, for example, dataframe1 that holds the rows for slot 7, another one for slot 11, and so on. Any suggestion is welcome; I've been trying for several days but can't figure it out. This is what I've done so far:
import pandas as pd
#import matplotlib.pyplot as plt
import os
import glob
import numpy as np
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
for f in csv_files:
    dfDay = pd.read_csv(f, encoding="ISO-8859-1", sep=";")
    # dfDay holds the data of the CSV file read in the current iteration

Provided that all the CSV files have the same structure (i.e. the same column names), you could do something like this:
...
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat(
    (pd.read_csv(f, encoding='ISO-8859-1', sep=';') for f in csv_files),
    ignore_index=True
)
slot_dfs = {slot: group for slot, group in df.groupby("slotID")}
# Exporting to csv-files
for n, df_slot in enumerate(slot_dfs.values(), start=1):
    df_slot.to_csv(f"dataframe{n}.csv", index=False)
The dictionary slot_dfs contains the dataframes for each available slot.
If you really want to create variables for the dataframes then you could try
for n, (_, group) in enumerate(df.groupby("slotID"), start=1):
    globals()[f"dataframe{n}"] = group
    # Exporting to csv-file
    group.to_csv(f"dataframe{n}.csv", index=False)
instead of creating the slot_dfs dictionary. After that print(dataframe1) should show the dataframe for the first slot etc.
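To make the mapping from file to slot obvious, a variation is to name each output file after the slot ID itself rather than a running counter. A minimal sketch with made-up frames mirroring the two CSVs above (the data values are illustrative):

```python
import pandas as pd

# Two small frames standing in for two CSV files read with pd.read_csv
day1 = pd.DataFrame({"slotID": [7, 11, 12], "NLunAn": [11, 6, 6],
                     "NLunTot": [78, 63, 33], "MeanBPM": [129.7, 123.3, 120.6]})
day2 = pd.DataFrame({"slotID": [7, 11, 16], "NMarAn": [10, 5, 6],
                     "NMarTot": [72, 48, 45], "MeanBPM": [131.2, 121.5, 127.4]})

# Stack the files; columns present in only one file become NaN in the other's rows
df = pd.concat([day1, day2], ignore_index=True)

# One frame per slot, keyed by the slot ID itself
slot_frames = {slot: group for slot, group in df.groupby("slotID")}

# One CSV per slot, named after the slot (slot_7.csv, slot_11.csv, ...)
for slot, group in slot_frames.items():
    group.to_csv(f"slot_{slot}.csv", index=False)
```

With this naming scheme, slot_frames[7] (and slot_7.csv) directly contain all rows for slot 7, whichever source files they came from.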

Related

Adding a column to dataframe while reading csv files [pandas]

I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
          ignore_index=False, sort=False)
Problem:
I want to add a column that doesn't exist in any of the CSVs, based on the CSV file name, for every file that gets concatenated into the dataframe. Any help will be appreciated.
glob.glob returns plain strings, so you can just add a column to every individual dataframe in a loop.
Assuming you have files df1.csv and df2.csv in your directory:
import glob
import pandas as pd
files = glob.glob('df*csv')
dfs = []
for file in files:
    df = pd.read_csv(file)
    df['filename'] = file
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df
a b filename
0 1 2 df1.csv
1 3 4 df1.csv
2 5 6 df2.csv
3 7 8 df2.csv
I have multiple CSV files in my local directory. Each filename contains some numbers, some of which identify the year the file is for. I need to add a year column to each file as I concatenate it, pulling the year information out of the filename. I'm using a regex to extract the two-digit year and prepending '20', so '20' + '11' = '2011'; then I cast the column's data type to int32.
import glob
import re

import pandas as pd

pd.concat(
    [
        pd.read_csv(f)
        .assign(year='20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year'))
        .astype({'year': 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index=True
)
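One detail worth noting: if the glob pattern includes a directory, the filename column will carry the full path. A small sketch (the paths and frames here are made up, standing in for read_csv results) that keeps only the base name via pathlib:

```python
from pathlib import Path

import pandas as pd

# Hypothetical in-memory frames standing in for pd.read_csv(path)
fake_files = {
    "data/df1.csv": pd.DataFrame({"a": [1, 3], "b": [2, 4]}),
    "data/df2.csv": pd.DataFrame({"a": [5, 7], "b": [6, 8]}),
}

dfs = []
for path, frame in fake_files.items():
    frame = frame.copy()
    frame["filename"] = Path(path).name  # 'df1.csv' rather than 'data/df1.csv'
    dfs.append(frame)

combined = pd.concat(dfs, ignore_index=True)
```

Path(path).stem would go one step further and drop the '.csv' suffix as well.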

When I write a list to Excel, all the values end up in the same cell

I am trying to export a list from some Python code. The output from the code looks like this:
print(mylist)
Allocation
0 55
1 65
2 23
3 23
4 55
5 36
When I write this to Excel it gets messed up and all the numbers end up in the same cell, like this:
55
65
23
23
55
36
I am writing the list to Excel with this command:
df = pd.DataFrame(mylist)
df.to_excel("test.xlsx")
You haven't said what output you expect. If you simply want to write the DataFrame data into an Excel file, then your code will work; just check the path and sheet name.
import pandas as pd
lis = [55,65,23,23,55,36]
df = pd.DataFrame(lis, columns=['alo'])
file_loc = r'C:\Users\uib05928\Desktop\New folder (2)\new.xlsx'
df.to_excel(file_loc)  # to_excel returns None, so there is nothing useful to assign
You have not specified what kind of output you want. If you want each value written in its own row alongside its row number, the index parameter of to_excel controls whether the row index is included; it takes a boolean:
df.to_excel("test.xlsx", index=True)
Documentation of method to_excel
I use the following to save to Excel:
# DF TO EXCEL
from pandas import ExcelWriter

writer = ExcelWriter('PythonExport.xlsx')
yourdf.to_excel(writer, 'Sheet5')
writer.save()
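For the original problem, building the DataFrame from the plain Python list already puts one value per row; a minimal sketch (an Excel engine such as openpyxl is assumed to be installed, hence the guard):

```python
import pandas as pd

mylist = [55, 65, 23, 23, 55, 36]

# One row per list element; naming the column keeps the sheet readable
df = pd.DataFrame({"Allocation": mylist})

try:
    df.to_excel("test.xlsx", index=False)  # requires an Excel engine, e.g. openpyxl
except ImportError:
    pass  # no Excel engine available in this environment
```

If the values still land in one cell, the list is likely a single string (e.g. "55 65 23 ...") rather than a list of numbers, and should be split before building the DataFrame.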

How to merge data frame into one csv file after using glob?

I have tried to work on several csv files using glob, for example:
import glob
import pandas as pd
import numpy as np
import csv
# Read all csv files with same file name in the folder
filenames = sorted(glob.glob('./16_2018-02*.csv'))
for f in filenames:
    df = pd.read_csv(f, names=['Date', 'RSSI', 'Data', 'Code'],
                     index_col=None)
    # Slicing information
    df["ID"] = df["Data"].str.slice(0, 2)
    df["X"] = df["Data"].str.slice(2, 4)
    # Save the output data to csv with a different name
    df.to_csv(f'{f[:-4]}-train.csv', index=False)
At the end of the code, I save each dataframe to a new CSV file with a different name. Now that I have so much CSV data to work with, I want to concatenate the dataframes without first writing each one to its own file. How should I do that?
Original dataset first 5 rows:
Date RSSI Data Code
2018-02-20T00:00:20.886+09:00 -99 1068ffd703d101ec77f425ea98b201 F2D5
2018-02-20T00:00:21.904+09:00 -95 103cffbc032901ee77f49dea98b301 F2D5
2018-02-20T00:00:22.415+09:00 -97 103cffbc032901ee77f49dea98b301 F2D5
2018-02-20T00:00:46.580+09:00 -96 10fdfda803ff01f477f49dfd98cb03 F2D1
2018-02-20T00:00:48.593+09:00 -96 101bfed3037401f577f49dfe98cd03 F2D6
After:
Date RSSI Data Code ID X
2018-02-20T00:00:20.886+09:00 -99 1068ffd703d101ec77f425ea98b201 F2D5 16 0.065384
2018-02-20T00:00:21.904+09:00 -95 103cffbc032901ee77f49dea98b301 F2D5 16 0.065340
2018-02-20T00:00:22.415+09:00 -97 103cffbc032901ee77f49dea98b301 F2D5 16 0.065340
2018-02-20T00:00:46.580+09:00 -96 10fdfda803ff01f477f49dfd98cb03 F2D1 16 0.065021
2018-02-20T00:00:48.593+09:00 -96 101bfed3037401f577f49dfe98cd03 F2D6 16 0.065051
Try the code below to append all the files into one dataframe:
filenames = sorted(glob.glob('./16_2018-02*.csv'))
appended_data = []  # create a list
for f in filenames:
    df = pd.read_csv(f, names=['Date', 'RSSI', 'Data', 'Code'],
                     index_col=None)
    # Slicing information
    df["ID"] = df["Data"].str.slice(0, 2)
    df["X"] = df["Data"].str.slice(2, 4)
    appended_data.append(df)  # append to the list
appended_data = pd.concat(appended_data, axis=1)  # concat them together
# remove axis=1 if you need to append vertically
appended_data is now a dataframe with all files appended together, which you can then export to CSV/Excel.
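The axis argument is the key choice here: for files that share the same columns, the default axis=0 stacks rows (one long table), while axis=1 pastes the frames side by side. A toy illustration with two made-up frames:

```python
import pandas as pd

# Two toy frames standing in for two CSV files with identical columns
a = pd.DataFrame({"Date": ["d1", "d2"], "RSSI": [-99, -95]})
b = pd.DataFrame({"Date": ["d3", "d4"], "RSSI": [-97, -96]})

stacked = pd.concat([a, b], ignore_index=True)  # axis=0 (default): rows stacked
side_by_side = pd.concat([a, b], axis=1)        # axis=1: columns pasted next to each other
```

stacked has shape (4, 2) and side_by_side has shape (2, 4); for combining per-day files of the same format, the stacked form is usually the one wanted.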

Add filename of imported file to dataframe

Still pretty new to Python so please be patient. I have a directory of files, all with a similar naming scheme. The filenames look like this:
yob2004.txt
yob2005.txt
What I am trying to do is open each one of those files and add to a dataframe. Then I want to extract the year from the filename and add that as a new column in the dataframe.
I can get parts of it but not the whole thing.
Here is the year extraction code for the year from the filename.
filenames = glob.glob('names/*.txt')
# Split off the beginning of the file path plus 'yob' and only keep
# everything after that, e.g. 1180.txt
split1 = [i.split('\\yob', 1)[1] for i in filenames]
# Split off the .txt from the strings in the list above
split2 = [i.split('.', 1)[0] for i in split1]
Here is the code to concatenate all of the files together
read_files = glob.glob("names/*.txt")
with open("allnames.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
I'm thinking what I actually need to do is read the first file into a dataframe then extract the year from the filename and write that to a new column in the dataframe. Then move onto the next file. Rinse. Repeat.
Any guidance how to do this?
This should work for your data. Suppose I have two files, yob2004.txt and yob2005.txt:
#yob2004
1,2,3,4
2,3,4,5
5,6,7,8
#yob2005
8,9,10,11
a,b,c,d
f,j,k
i,j,k,l
These files have different data types and different numbers of rows/columns, so most edge cases are covered:
import pandas as pd
from os import walk

f = []
for (dirpath, dirnames, filenames) in walk('/home/dkennetz/yobDf'):
    for x in filenames:
        if x.startswith('yob'):
            f.append(x)
# f = ['yob2005.txt', 'yob2004.txt'] -- a list built from the filenames in the directory
data = pd.DataFrame()  # initialize empty df
for filename in f:
    df = pd.read_csv(filename, names=['col1', 'col2', 'col3', 'col4'])  # read each csv into a df
    df['filename'] = filename  # add a column with the filename
    data = data.append(df)  # append each small df to the big df
data['filename'] = data['filename'].map(lambda x: x.lstrip('yob').rstrip('.txt'))  # strip 'yob' and '.txt', keeping the year
print(data)
output:
col1 col2 col3 col4 filename
0 8 9 10 11 2005
1 a b c d 2005
2 f j k NaN 2005
3 i j k l 2005
0 1 2 3 4 2004
1 2 3 4 5 2004
2 5 6 7 8 2004
The output shows which file each row came from by placing the year in the filename column, with NaNs where the source rows have fewer columns.
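One caveat on the lstrip('yob').rstrip('.txt') step: lstrip/rstrip remove character sets, not literal prefixes or suffixes, so they can over-strip (a filename starting with 'boy', say, would also lose those letters). A regex extract is a safer sketch:

```python
import pandas as pd

data = pd.DataFrame({"filename": ["yob2004.txt", "yob2005.txt", "yob2004.txt"]})

# \d{4} grabs the four-digit year directly, regardless of surrounding characters;
# expand=False makes str.extract return a Series instead of a one-column DataFrame
data["year"] = data["filename"].str.extract(r"(\d{4})", expand=False).astype(int)
```

This also yields a proper integer column in one step instead of leaving the year as a string.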

Python: read through multiple but not all csv files in my folder

I want to read some CSV files from my folder and concatenate them into one big pandas dataframe. All of my CSV filenames end with a number, and I only want to read files whose number falls in 6-10, 16-20, or 26-30. My goal is to read the files iteratively. Attached is my code so far:
import pandas as pd

data_one = pd.read_csv('Datafile-6.csv', header=None)
for i in range(7, 11):
    data99 = pd.read_csv('Datafile-' + i + '*.csv', header=None)  # this line needs work
    data_one = pd.concat([data_one, data99.iloc[:, 1]], axis=1, ignore_index=True)
data_two = pd.read_csv('Datafile-16.csv', header=None)
for j in range(17, 21):
    # Repeat similar process
What should I do about 'data99' such that 'data_one' contains columns from 'Datafile-6' through 'Datafile-10'?
The first five rows of data_one should look like this, after getting data from Datafiles 6-10.
0 1 2 3 4 5
0 -40.0 0.179836 0.179630 0.179397 0.179192 0.179031
1 -39.0 0.183696 0.183441 0.183204 0.182977 0.182795
2 -38.0 0.186720 0.186446 0.186191 0.185949 0.185762
3 -37.0 0.189490 0.189207 0.188935 0.188686 0.188475
4 -36.0 0.192154 0.191851 0.191569 0.191301 0.191086
Column 0 is included in all of the data files, so I'm only concatenating column 1 of all of the subsequent data files.
You need to use the glob module:
import glob
import os

import pandas as pd

path = r'C:\YourFolder'  # path to folder with .csv files
all_files = glob.glob(path + "/*.csv")  # renamed from 'all' to avoid shadowing the builtin
d_frame = pd.DataFrame()
list_ = []
for file_ in all_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    # .any() collapses the boolean Series to a single bool;
    # modify the list with the conditions you need
    if df['YourColumns'].tail(1).isin([6, 7, 8, 9, 10, 16, 17, 18, 19, 20, 26, 27, 28, 29, 30]).any():
        list_.append(df)
d_frame = pd.concat(list_)
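Note that the question filters on the trailing number in the filename rather than on a column value. A sketch with synthetic filenames that builds the wanted ranges explicitly (the filenames, keep, and selected are made-up names for illustration):

```python
import re

# Synthetic filenames; the question wants trailing numbers 6-10, 16-20, 26-30
filenames = [f"Datafile-{i}.csv" for i in range(1, 31)]

wanted = set(range(6, 11)) | set(range(16, 21)) | set(range(26, 31))

def keep(name):
    # Pull the number just before '.csv' and test membership in the wanted set
    m = re.search(r"-(\d+)\.csv$", name)
    return m is not None and int(m.group(1)) in wanted

selected = [f for f in filenames if keep(f)]
# pd.read_csv(f, header=None) can then be applied to each file in selected
```

In practice, filenames would come from glob.glob('Datafile-*.csv') instead of the synthetic list.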
