Changing Column Heading CSV File - python

I am currently trying to change the headings of the file I am creating. The code I am using is as follows:

import pandas as pd
import os, sys
import glob

path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, low_memory=False)
    output = df['logid'].value_counts()
    list_.append(output)

df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')

Basically I am looping through a file directory and extracting data from each file. This outputs the following image:
http://imgur.com/a/LE7OS
All I want to do is change the column names from 'logid' to the name of the file currently being processed, but I am not sure how to do this. Any help is great! Thanks.

Instead of appending the raw Series, wrap each result in a DataFrame and set its column name from the file, i.e.

output = pd.DataFrame(df['logid'].value_counts())
output.columns = [os.path.basename(fname).split('.')[0]]
list_.append(output)

With those changes, the code in the question becomes:

import pandas as pd
import os, sys
import glob

path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    output = pd.DataFrame(df['logid'].value_counts())
    output.columns = [os.path.basename(fname).split('.')[0]]
    list_.append(output)

df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')

Hope it helps.
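As a variant (a minimal sketch with synthetic Series standing in for the per-file `value_counts()` results), the columns can also be named in one step via `concat`'s `keys` argument, without wrapping each Series in a DataFrame:

```python
import pandas as pd

# Synthetic stand-ins for df['logid'].value_counts() from two files.
counts_a = pd.Series(["x", "x", "y"]).value_counts()
counts_b = pd.Series(["y", "z"]).value_counts()

# `keys` names one column per input, so no per-frame renaming is needed.
combined = pd.concat([counts_a, counts_b], axis=1, keys=["file_a", "file_b"])
print(list(combined.columns))  # ['file_a', 'file_b']
```

In the loop version, you would collect the file names in a second list and pass it as `keys` at the end.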

Related

Read all the Excel files in a folder, split each file name, and add the split parts into the dataframe

All files follow a naming convention such as NPS_Platform_FirstLabel_Session_Language_Version.xlsx
I want to have additional columns like Platform, FirstLabel, Session, Language, Version; these will be column names, with values determined by the file names. I coded the following; it works, but the values of the added columns come only from the last file. For example, assume the last file name is NPS_MEM_GAIT_Science_EN_10.xlsx. Then all of the added columns' values are MEM, GAIT, Science, etc., not the values from each file's own name.
import glob
import os
import pandas as pd

path = "C:/Users/User/blabla"
all_files = glob.glob(os.path.join(path, "*.xlsx"))  # make list of paths

df = pd.DataFrame()
for f in all_files:
    data = pd.read_excel(f)
    df = df.append(data)
    file_name = os.path.splitext(os.path.basename(f))[0]
    nameList = file_name.rsplit('_')
    df['Platform'] = nameList[1]
    df['First label'] = nameList[2]
    df['Session'] = nameList[3]
    df['Language'] = nameList[4]
    df['Version'] = nameList[5]
df
I started with nameList[1] since I don't want NPS.
Any suggestions or feedback?
I have found a solution; I leave it here since there are more views than I expected.

import glob
import os
import pandas as pd

path = "C:/Users/User/....."
all_files = glob.glob(os.path.join(path, "*.xlsx"))  # make list of paths

df_files = [pd.read_excel(filename) for filename in all_files]
for dataframe, filename in zip(df_files, all_files):
    filename = os.path.splitext(os.path.basename(filename))[0]
    filename = filename.rsplit('_')
    dataframe['Platform'] = filename[1]
    dataframe['First label'] = filename[2]
    dataframe['Session'] = filename[3]
    dataframe['Language'] = filename[4]
    dataframe['Version'] = filename[5]
df = pd.concat(df_files, ignore_index=True)

I think the reason is that I was iterating over the files, not over the dataframes I was trying to build. With this, I can iterate over the dataframes and the file names at the same time. I found this solution at https://jonathansoma.com/lede/foundations-2017/classes/working-with-many-files/class/
Still, if anyone can give an explicit answer about why the first code does not work as I want, it would be great.
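The reason the first version ends up with the last file's labels: `df['Platform'] = nameList[1]` assigns a scalar to the entire accumulated frame, not just to the rows appended in that iteration, so each pass through the loop overwrites the labels written by the previous ones. A minimal sketch of the effect (using `pd.concat` in place of the since-removed `DataFrame.append`, with hypothetical file contents):

```python
import pandas as pd

df = pd.DataFrame()
for label, rows in [("A", 2), ("B", 2)]:
    data = pd.DataFrame({"value": range(rows)})
    df = pd.concat([df, data])  # accumulate first...
    df["Platform"] = label      # ...then stamp the WHOLE frame, clobbering "A"

print(df["Platform"].unique())  # ['B'] - only the last label survives
```

Stamping each per-file frame *before* accumulating it (as the second version does) avoids the overwrite.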

Pandas generating an empty CSV while trying to combine all CSVs into one CSV

I am writing a Python script that will read all the CSV files in the current location and merge them into a single CSV file. Below is my code:

import os
import numpy as np
import pandas as pd
import glob

path = os.getcwd()
extension = 'csv'
os.chdir(path)
tables = glob.glob('*.{}'.format(extension))
data = pd.DataFrame()
for i in tables:
    try:
        df = pd.read_csv(r'' + path + '/' + i + '')
        # Here I want to create an index column with the name of the file and leave that column empty
        df[i] = np.NaN
        df.set_index(i, inplace=True)
        # Below line appends an empty row for easy differentiation
        df.loc[df.iloc[-1].name + 1, :] = np.NaN
        data = data.append(df)
    except Exception as e:
        print(e)
data.to_csv('final_output.csv', index=False, header=None)

If I remove the lines below, then it works:

df[i] = np.NaN
df.set_index(i, inplace=True)

But I want to have the first column name as the name of the file and its values NaN or empty.
I want the output to look something like this:
I tend to avoid the .append method in favor of pandas.concat.
Try this:

import os
from pathlib import Path
import pandas as pd

files = Path(os.getcwd()).glob('*.csv')
df = pd.concat([
    pd.read_csv(f).assign(filename=f.name)
    for f in files
], ignore_index=True)
df.to_csv('alldata.csv', index=False)
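To see what the `.assign(filename=...)` pattern produces, here is a minimal illustration with synthetic frames standing in for the `pd.read_csv` results:

```python
import pandas as pd

# Each frame gets a constant 'filename' column before concatenation,
# so every row records which file it came from.
frames = [
    pd.DataFrame({"a": [1, 2]}).assign(filename="one.csv"),
    pd.DataFrame({"a": [3]}).assign(filename="two.csv"),
]
df = pd.concat(frames, ignore_index=True)
print(df)
```

Filtering the combined frame back down to one file is then a plain boolean mask, e.g. `df[df["filename"] == "one.csv"]`.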

Loop excel files, add one more column and save them in Python

I have a folder D:/test/src which has lots of Excel files. I want to add one more column, date, which is 2019-08-01, to each one and save them into another folder, D:/test/dst.
Here is what I have done. It works, but it is a little bit slow, so if you have quicker or other ideas, feel free to share. Thanks in advance.

import pandas as pd
import os
import glob

src = "D:/test/src/*.xls*"
dst = "D:/test/dst/"

dfs = []
for file in glob.glob(src):
    df = pd.read_excel(file)
    df['date'] = "2019-08-01"
    df["date"] = df["date"].astype(str)
    df.to_excel(os.path.join(dst, os.path.basename(file)), index=False)
    dfs.append(df)
Use threading:

import glob
import os
import threading

import pandas as pd

src = "D:/test/src/*.xls*"
dst = "D:/test/dst/"

def update(excel_file):
    df = pd.read_excel(excel_file)
    df['date'] = "2019-08-01"
    df["date"] = df["date"].astype(str)
    df.to_excel(os.path.join(dst, os.path.basename(excel_file)), index=False)

for file in glob.glob(src):
    threading.Thread(target=update, args=(file,)).start()
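One caution with the loop above is that it starts one thread per file with no upper bound. A `concurrent.futures.ThreadPoolExecutor` caps the number of live workers; a sketch with a hypothetical stand-in for the per-file `update` function:

```python
from concurrent.futures import ThreadPoolExecutor

def update(path):
    # Stands in for: read_excel -> add date column -> to_excel.
    return path.upper()

files = ["a.xlsx", "b.xlsx", "c.xlsx"]

# At most 4 files are processed concurrently; map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(update, files))
print(results)  # ['A.XLSX', 'B.XLSX', 'C.XLSX']
```

The pool also joins all workers when the `with` block exits, which the bare `Thread(...).start()` loop does not do.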

Appending dataframes from json files in a for loop

I am trying to iterate through JSON files in a folder and append them all into one pandas dataframe.
If I say

import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import os

directory_in_str = 'building_data'
directory = os.fsencode(directory_in_str)

df_all = pd.DataFrame()
with open("building_data/rooms.json") as file:
    data = json.load(file)
    df = json_normalize(data['rooms'])
    df_all.append(df, ignore_index=True)

I get a dataframe with the data from the one file. If I turn this thinking into a for loop, I have tried

import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import os

directory_in_str = 'building_data'
directory = os.fsencode(directory_in_str)

df_all = pd.DataFrame()
for filename in os.listdir(directory_in_str):
    with open(directory_in_str + '/' + filename) as file:
        data = json.load(file)
        df = json_normalize(data['rooms'])
        df_all.append(df, ignore_index=True)
print(df_all)

This returns an empty dataframe. Does anyone know why this is happening? If I print df before appending it, it prints the correct values, so I am not sure why it is not appending.
Thank you!
Instead of appending the next DataFrame, I would try to join them like this:

if df_all.empty:
    df_all = df
else:
    df_all = df_all.join(df)

When joining DataFrames, you can specify what they should be joined on - the index or a specific (key) column - as well as how ('left' by default).
Here are the docs for pandas.DataFrame.join.
In these instances I load everything from JSON into a list by appending each file's returned records onto that list. Then I pass the list to pandas.DataFrame.from_records (docs).
In this case the source would become something like...

import pandas as pd
import json
import os

directory_in_str = 'building_data'

json_data = []
for filename in os.listdir(directory_in_str):
    with open(directory_in_str + '/' + filename) as file:
        data = json.load(file)
        json_data.extend(data['rooms'])

df_all = pd.DataFrame.from_records(json_data)
print(df_all)
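As to why the question's loop stays empty: `DataFrame.append` was never in-place; it returned a new frame, and that return value was discarded each iteration, leaving `df_all` untouched. The usual pattern is to collect the per-file frames (or records) in a list and concatenate once at the end. A minimal sketch with hypothetical room records standing in for the parsed JSON:

```python
import pandas as pd

# Hypothetical json_normalize output for two files.
rooms_per_file = [
    [{"room": "101", "floor": 1}],
    [{"room": "202", "floor": 2}],
]

# Collect frames in a list, then concatenate once.
frames = [pd.DataFrame(records) for records in rooms_per_file]
df_all = pd.concat(frames, ignore_index=True)
print(len(df_all))  # 2
```

(In pandas 2.0 `DataFrame.append` was removed entirely, so the concat pattern is now the only option of the two.)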

Python / Pandas abbreviating my numbers

Probably a very easy fix, but my English isn't good enough to search for the right answer.
Python/pandas is changing the numbers that I'm writing from 6570631401430749 to something like 6.17063140131e+15.
I'm merging hundreds of CSV files, and this one column comes out all wrong. The name of the column is "serialnumber" and it's the 3rd column.
import pandas as pd
import glob
import os

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename)
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)
full_df = pd.concat(df_list)
full_df.to_csv('output.csv', encoding='utf-8-sig')
You can use dtype=object when you read the CSV if you want to preserve the data in its original form. You can change your code to:

import pandas as pd
import glob
import os

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    frame = pd.read_csv(filename, dtype=object)
    print(os.path.basename(filename))
    frame['filename'] = os.path.basename(filename)
    df_list.append(frame)
full_df = pd.concat(df_list)
full_df.to_csv('output.csv', encoding='utf-8-sig')
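If only the serial numbers need protecting, `dtype` can also target that single column so the rest of the frame keeps its inferred numeric types (a sketch assuming the column is literally named `serialnumber`):

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for one of the CSV files being merged.
csv_text = "id,serialnumber\n1,6570631401430749\n"

# Only 'serialnumber' is read as a string; 'id' stays numeric.
frame = pd.read_csv(StringIO(csv_text), dtype={"serialnumber": str})
print(frame["serialnumber"][0])  # 6570631401430749
```

The large value survives the round trip exactly because it is never converted to a float.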
