Adding columns to a dataframe based on file name in Python

I have several txt files with different file names. I would like to do three things:
1.) Load all the data at once
2.) Take parts of each file name and add them to the corresponding dataframe as additional columns
3.) Combine the files into one dataframe
Below is a very manual example that I would like to automate. How can I do that?
The code looks like the following:
import pandas as pd
#load data files
data1 = pd.read_csv('C:/file1_USA_Car_1d.txt')
data2 = pd.read_csv('C:/file2_USA_Car_2d.txt')
data3 = pd.read_csv('C:/file3_USA_Car_1m.txt')
data4 = pd.read_csv('C:/file3_USA_Car_6m.txt')
data5 = pd.read_csv('C:/file3_USA_Car_1Y.txt')
df = pd.DataFrame()
print(df)
df = data1
#--> The input for the column below should be taken from the name of the file
df['country'] = 'USA'
df['Type'] = 'Car'
df['duration'] = '1d'
print(df)

Iterate over your files with glob and do some simple splitting on the filenames.
import glob
import pandas as pd
df_list = []
# Match every file, not only those whose name starts with "file1".
for file in glob.glob('C:/file*_*_*_*.txt'):
    # Tweak this to work for your actual filepaths, if needed.
    country, typ, dur = file.split('.')[0].split('_')[1:]
    df = (pd.read_csv(file)
            .assign(Country=country, Type=typ, duration=dur))
    df_list.append(df)
df = pd.concat(df_list)

One way of doing it would be to do this:
import pandas as pd
frames = []
file_list = ['C:/file1_USA_Car_1d.txt', 'C:/file3_USA_Car_1m.txt', 'etc']
for file_name in file_list:
    tmp = pd.read_csv(file_name)
    tmp['file_name'] = file_name
    frames.append(tmp)
# DataFrame.append was removed in pandas 2.0; concatenate once instead.
all_res = pd.concat(frames, ignore_index=True)

I'd do something like the following:
from pathlib import Path
from operator import itemgetter
import pandas as pd
file_paths = [
    Path(path_str)
    for path_str in (
        'C:/file1_USA_Car_1d.txt', 'C:/file2_USA_Car_2d.txt',
        'C:/file3_USA_Car_1m.txt', 'C:/file3_USA_Car_6m.txt',
        'C:/file3_USA_Car_1Y.txt')
]
def import_csv(csv_path):
    df = pd.read_csv(csv_path)
    # The stem is the file name without directory or extension, e.g. 'file1_USA_Car_1d'.
    df['country'], df['Type'], df['duration'] = itemgetter(1, 2, 3)(csv_path.stem.split('_'))
    return df
dfs = [import_csv(csv_path) for csv_path in file_paths]
This helps encapsulate your desired behavior in a helper function and reduces the things you need to think about.
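Putting the answers above together, here is a minimal end-to-end sketch. It assumes file names follow the `<prefix>_<country>_<type>_<duration>.txt` pattern from the question; the function name `load_with_name_columns` and its `data_dir` and `pattern` parameters are illustrative and do not come from the original posts.

```python
from pathlib import Path

import pandas as pd


def load_with_name_columns(data_dir, pattern="file*_*_*_*.txt"):
    """Read every matching file and tag its rows with parts of the file name."""
    frames = []
    for path in sorted(Path(data_dir).glob(pattern)):
        # Path.stem drops the directory and the extension, so underscores or
        # dots elsewhere in the path cannot disturb the split.
        _, country, typ, duration = path.stem.split("_")
        frames.append(
            pd.read_csv(path).assign(country=country, Type=typ, duration=duration)
        )
    # One concat at the end is cheaper than growing a DataFrame in the loop.
    return pd.concat(frames, ignore_index=True)
```

Note that `pd.concat` raises a `ValueError` when no file matches, so a real script may want to check that `frames` is not empty first.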

Related

How to create variables and read several excel files in a loop with pandas?

L=[('X1',"A"),('X2',"B"),('X3',"C")]
for i in range (len(L)):
    path=os.path.join(L[i][1] + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    ''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
File "<ipython-input-1-6220ffd8958b>", line 6
''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
^
SyntaxError: can't assign to function call
I have a problem with pandas: I want to create a separate dataframe for each of several Excel files, but I don't know how to create the variables dynamically.
I need a result that looks like this:
X1 will have dataframe of A.xlsx
X2 will have dataframe of B.xlsx
.
.
.
Solved:
import pandas as pd
d = {}
for name, value in L:
    path = value + '.xlsx'
    df = pd.read_excel(path, 'Sheet1')
    # Store each dataframe under a key such as 'df-X1'.
    key = 'df-' + str(name)
    d[key] = df
Main approach:
I would approach this by reading everything into one dataframe (loop over the files, then concat):
import os
import pandas as pd
files = [] #generate list for files to go into
path_of_directory = "path/to/folder/"
for dirname, dirnames, filenames in os.walk(path_of_directory):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))
output_data = [] #blank list for building up dfs
for name in files:
    df = pd.read_excel(name)
    df['name'] = os.path.basename(name)
    output_data.append(df)
total = pd.concat(output_data, ignore_index=True, sort=True)
Then:
You can query the combined dataframe with total.loc[total['name'] == 'choice'].
Or (in keeping with your question):
You could then split it into a dictionary of dataframes, based on this column. This is the best approach:
def split_by_column(df, column):
    dictionary = {}
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        key_name = 'df' + str(value)
        # Boolean indexing already returns a new frame, so no deepcopy is needed.
        dictionary[key_name] = df[df[column] == value].reset_index(drop=True)
    return dictionary
The reason for this approach is discussed here:
"Create new dataframe in pandas with dynamic names also add new column", which basically says that naming dataframes dynamically is bad and that this dict approach is best.
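The split into a dictionary can also be written with `groupby`, which filters the frame once instead of once per value. This is a minimal sketch, assuming a combined frame with a `name` column like `total` above; the sample data and the name `frames_by_name` are made up for illustration.

```python
import pandas as pd

# Hypothetical combined frame standing in for `total`.
total = pd.DataFrame(
    {"name": ["A.xlsx", "B.xlsx", "A.xlsx"], "value": [1, 2, 3]}
)

# groupby yields (key, sub-frame) pairs; each sub-frame holds one file's rows.
frames_by_name = {
    name: group.reset_index(drop=True) for name, group in total.groupby("name")
}
```

Each file's rows are then available as `frames_by_name["A.xlsx"]`, and so on.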
This might help.
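import pandas as pd
files_xls = ['all your excel filename goes here']
frames = []
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    frames.append(data)
# DataFrame.append was removed in pandas 2.0; concatenate once instead.
df = pd.concat(frames)
print(df)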

Convert list of multiple strings into a Python data frame

I have a list of string values that I read from a text document with splitlines, which yields something like this:
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
I have tried this
for i in X:
    textnew = i.split("|")
    data[x] = textnew
I want to make a dataframe out of this
Name   Contact  Education
SMITH  12345    Graduate
NITA   11111    Diploma
You can read it directly from your file by specifying a sep argument to pd.read_csv.
df = pd.read_csv("/path/to/file", sep='|')
Or if you wish to convert it from list of string instead:
data = [row.split('|') for row in X]
headers = data.pop(0) # Pop the first element since it's header
df = pd.DataFrame(data, columns=headers)
You almost had it, but don't treat data as a dictionary (assigning by key with data[x] = textnew); append each split row to a list instead:
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
rows = []
for i in X:
    rows.append(i.split("|"))
print(rows)
# [['NAME', 'Contact', 'Education'], ['SMITH', '12345', 'Graduate'], ['NITA', '11111', 'Diploma']]
Depending on what further transformations you need, pandas might be overkill for this kind of task.
Here is a solution for your problem:
import pandas as pd
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
data = []
for i in X:
    data.append(i.split("|"))
# Take the header row out first so the remaining rows are the data.
headers = data.pop(0)
df = pd.DataFrame(data, columns=headers)
In your situation, you can avoid loading the file with readlines and let pandas take care of loading it.
As mentioned above, the solution is a standard read_csv:
import os
import pandas as pd
path = "/tmp"
filepath = "file.xls"
filename = os.path.join(path,filepath)
df = pd.read_csv(filename, sep='|')
print(df.head())
Another approach (for when you have no access to the file, or have to deal with a list of strings) is to wrap the list of strings as a text file and then load it normally using pandas:
import pandas as pd
from io import StringIO
X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]
# Wrap the string list as a file of new line
DATA = StringIO("\n".join(X))
# Load as a pandas dataframe
df = pd.read_csv(DATA, delimiter="|")
The result is a dataframe with the columns NAME, Contact, and Education and one row each for SMITH and NITA.
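One difference between the answers above is worth seeing side by side: splitting the strings by hand keeps every value as text, while `read_csv` infers numeric types. This is a minimal sketch using the list from the question; the names `df_split` and `df_parsed` are illustrative.

```python
from io import StringIO

import pandas as pd

X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]

# Approach 1: split manually; every value stays a string.
header, *rows = (line.split("|") for line in X)
df_split = pd.DataFrame(rows, columns=header)

# Approach 2: let read_csv parse the joined text; numeric columns are inferred.
df_parsed = pd.read_csv(StringIO("\n".join(X)), sep="|")
```

Which one is preferable depends on whether Contact should be treated as a number or as an identifier.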

How do I add a list of values for a parameter to a group of dataframes, where that parameter has a different value for each dataframe?

I have 15 samples that each have a unique value of a parameter L.
Each sample was tested and provided data which I have placed into separate DataFrames in Pandas.
Each of the DataFrames has a different number of rows, and I want to place the corresponding value of L in each row, i.e. create a column for parameter L.
Note that L is constant in its respective DataFrame.
Is there a way to write a loop that will take a value of L from a list containing all of its values, and create a column in its corresponding sample data DataFrame?
I have so far been copying and pasting each line, and then updating the values and DataFrame names manually, but I suppose that this is not the most effective way of using python/pandas!
Most of the code I have used so far has been based on what I have found online, and my actual understanding of it is quite limited but I have tried to comment where possible.
UPDATED based on first suggested answer.
import pandas as pd
from pandas import DataFrame
import numpy as np
from pathlib import Path
from glob import glob
from os.path import join
path = r'file-directory/'
data_files = glob(join(path + '*.txt'))
def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager: # Allows python to access different data folders
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t'
                             )
            # Process the dataframe to remove unwanted rows and columns, and rename the headers
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Append each new dataframe to a new row in the empty dataframe
            df1.append(df)
    # Extract dataframes from list
    data1_0 = df1[0]
    data1_1 = df1[1]
    data1_2 = df1[2]
    data1_3 = df1[3]
    data1_4 = df1[4]
    data1_5 = df1[5]
    data1_6 = df1[6]
    data1_7 = df1[7]
    data1_8 = df1[8]
    data1_9 = df1[9]
    data1_10 = df1[10]
    data1_11 = df1[11]
    data1_12 = df1[12]
    data1_13 = df1[13]
    data1_14 = df1[14]
    # Add in a new column for values of 'L'
    L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
    data1_0['L'] = L[0]
    data1_1['L'] = L[1]
    data1_2['L'] = L[2]
    data1_3['L'] = L[3]
    data1_4['L'] = L[4]
    data1_5['L'] = L[5]
    data1_6['L'] = L[6]
    data1_7['L'] = L[7]
    data1_8['L'] = L[8]
    data1_9['L'] = L[9]
    data1_10['L'] = L[10]
    data1_11['L'] = L[11]
    data1_12['L'] = L[12]
    data1_13['L'] = L[13]
    data1_14['L'] = L[14]
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
The method I am using (copy and paste lines) works so far, it's just that it doesn't seem to be the most efficient use of my time or the tools I have, and I don't really know how to approach this one with my limited experience of python so far.
I also have several other parameters and datasets that I need to do this for, so any help would be greatly appreciated!
You can simply write data1_0['L'] = L[0], and so on for the rest of the DataFrames. Assigning a single value like this fills the whole column with that value automatically, so there is no need to compute a length or index.
Untested code:
import os
import pandas as pd
from pandas import DataFrame
import numpy as np
from pathlib import Path
from glob import glob
from os.path import join
path = r'file-directory/'
data_files = glob(join(path + '*.txt'))
def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager: # Allows python to access different data folders
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t'
                             )
            # Process the dataframe to remove unwanted rows and columns, and rename the headers
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Add file name as identifier
            df['FNAME'] = os.path.basename(file.name)
            # Append each new dataframe to the list
            df1.append(df)
    # Concatenate the results into single dataframe
    data = pd.concat(df1)
    L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
    # Supposing number of files and length of L is the same
    repl_dict = {k:v for k,v in zip([os.path.basename(file.name) for file in files], L)}
    # Add the new column (on `data`, the concatenated frame)
    data['L'] = data.FNAME.map(repl_dict)
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
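The manual `data1_N['L'] = L[N]` lines in the question can also be collapsed into one loop, because assigning a scalar to a column fills every row. This is a minimal sketch, assuming the DataFrames are kept in a list in the same order as the values in `L`; the small frames here are placeholders for the real sample data, and the names `frames` and `combined` are illustrative.

```python
import pandas as pd

# Placeholder frames with different row counts, standing in for the list df1.
frames = [pd.DataFrame({"x": range(n)}) for n in (2, 3, 1)]
L = ["L0", "L1", "L2"]

# zip pairs each frame with its parameter value; the scalar is broadcast
# to every row of that frame.
for frame, l_value in zip(frames, L):
    frame["L"] = l_value

# Optional: stack everything into one frame once each row carries its L.
combined = pd.concat(frames, ignore_index=True)
```

The same loop works for any other per-sample parameter: add another list of the same length and another name in the `zip`.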

Changing Column Heading CSV File

I am currently trying to change the headings of the file I am creating. The code I am using is as follows;
import pandas as pd
import os, sys
import glob
path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_=[]
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, low_memory=False)
    output = (df['logid'].value_counts())
    list_.append(output)
df1 = pd.DataFrame()
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Basically I am looping through a file directory and extracting data from each file. This outputs the following image:
http://imgur.com/a/LE7OS
All I want to do is change the column names from 'logid' to the name of the file currently being read, but I am not sure how to do this. Any help would be great! Thanks.
Instead of appending the raw value counts, wrap them in a dataframe and set its column name to the file name, i.e.
output = pd.DataFrame(df['logid'].value_counts())
output.columns = [os.path.basename(fname).split('.')[0]]
list_.append(output)
The code from the question with these changes applied:
import pandas as pd
import os, sys
import glob
path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_=[]
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    output = pd.DataFrame(df['logid'].value_counts())
    # Use the file name without its extension as the column heading.
    output.columns = [os.path.basename(fname).split('.')[0]]
    list_.append(output)
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Hope it helps
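A variation on the same idea is to keep the counts as a Series and rename it, since `pd.concat(..., axis=1)` uses each Series name as the column heading. This is a minimal sketch with made-up frames in place of the CSV files; the helper `counts_named_after_file` and the file names are hypothetical.

```python
import os

import pandas as pd


def counts_named_after_file(df, fname, column="logid"):
    """Return value counts of `column` as a Series named after the file."""
    stem = os.path.splitext(os.path.basename(fname))[0]
    return df[column].value_counts().rename(stem)


# Made-up frames standing in for two CSV files.
a = pd.DataFrame({"logid": [1, 1, 2]})
b = pd.DataFrame({"logid": [2, 3]})
result = pd.concat(
    [
        counts_named_after_file(a, "dir/first.csv"),
        counts_named_after_file(b, "dir/second.csv"),
    ],
    axis=1,
)
```

Values that do not occur in a given file show up as NaN in that file's column.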

Python / Pandas abbreviating my numbers.

Probably a very easy fix, but my English isn't good enough to search for the right answer.
Python/Pandas is changing the numbers that I'm writing from: 6570631401430749 to something like: 6.17063140131e+15
I'm merging hundreds of csv files, and this one column comes out all wrong. The name of the column is "serialnumber" and it's the 3rd column.
import pandas as pd
import glob
import os
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename)
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)
full_df = pd.concat(df_list)
full_df.to_csv('output.csv',encoding='utf-8-sig')
You can use dtype = object when you read csv if you want to preserve the data in its original form. You can change your code to
import pandas as pd
import glob
import os
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename,dtype=object)
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)
full_df = pd.concat(df_list)
full_df.to_csv('output.csv',encoding='utf-8-sig')
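One common cause of this (an assumption here, since the source files are not shown) is a blank cell in an otherwise integer column: pandas then reads the whole column as float64, and the values are shown in floating-point form. If only the serial numbers need protecting, the dtype can be set for that column alone. This is a minimal sketch with made-up CSV text; the names `inferred` and `as_text` are illustrative.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text with one missing serial number.
csv_text = "id,serialnumber\n1,6570631401430749\n2,\n"

# Default inference: the blank cell forces the column to float64.
inferred = pd.read_csv(StringIO(csv_text))

# Reading just this column as text keeps the digits exactly as written.
as_text = pd.read_csv(StringIO(csv_text), dtype={"serialnumber": str})
```

The other columns keep their inferred types, which `dtype=object` for the whole file would not do.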
