Moving rows of data within pandas dataframe to end of last column - python

Python newbie, please be gentle. I have data in two "middle sections" of multiple Excel spreadsheets that I would like to isolate into one pandas dataframe. Below is a link to a data screenshot.
Within each file, my headers are in Row 4 with data in Rows 5-15, Columns B:O. The headers and data then continue with headers on Row 21, data in Rows 22-30, Columns B:L. I would like to move the headers and data from the second set and append them to the end of the first set of data.
This code captures the header from Row 4 and the data in Columns B:O, but it captures all rows under the header, including the second header and the second set of data. How do I move this second set of data and append it after the first set?
import glob
import pandas as pd

path = r'C:\Users\sarah\Desktop\Original'
allFiles = glob.glob(path + "/*.xls")
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, sheet_name="Data1", usecols="B:O", index_col=None, header=3)
    list_.append(df)
frame = pd.concat(list_)
Screenshot of my data

If all of your Excel files have the same number of rows and this is a one-time operation, you could simply hard-code those numbers in your read_excel calls. If not, it will be a little trickier, but you pretty much follow the same procedure:
for file_ in allFiles:
    top = pd.read_excel(file_, sheet_name="Data1", usecols="B:O", index_col=None,
                        header=3, nrows=11)  # note the nrows kwarg: data is in rows 5-15
    bot = pd.read_excel(file_, sheet_name="Data1", usecols="B:L", index_col=None,
                        header=20, nrows=9)  # header on row 21, data in rows 22-30
    list_.append(top.join(bot, lsuffix='_t', rsuffix='_b'))

You can do it this way:
df1 = pd.read_excel(file_, sheet_name="Data1", usecols="B:O", index_col=None, header=3, nrows=11)
df2 = pd.read_excel(file_, sheet_name="Data1", usecols="B:L", index_col=None, header=20, nrows=9)
# pay attention to axis=1: it concatenates side by side, aligning on the row index
df = pd.concat([df1, df2], axis=1)
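For completeness, a minimal end-to-end sketch across all the files, assuming every workbook shares the layout described above (the sheet name, row offsets, and column ranges all come from the question):
import glob
import pandas as pd

path = r'C:\Users\sarah\Desktop\Original'
frames = []
for file_ in glob.glob(path + "/*.xls"):
    top = pd.read_excel(file_, sheet_name="Data1", usecols="B:O", header=3, nrows=11)
    bot = pd.read_excel(file_, sheet_name="Data1", usecols="B:L", header=20, nrows=9)
    # axis=1 puts the second block after the last column of the first block
    frames.append(pd.concat([top, bot], axis=1))

frame = pd.concat(frames, ignore_index=True)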

Related

How to save each row to csv in dataframe AND name the file based on the first column in each row

I have the following df, with the row 0 being the header:
teacher,grade,subject
black,a,english
grayson,b,math
yodd,a,science
What is the best way to use export_csv in python to save each row to a csv so that the files are named:
black.csv
grayson.csv
yodd.csv
Contents of black.csv will be:
teacher,grade,subject
black,a,english
Thanks in advance!
Updated Code:
df8['CaseNumber'] = df8['CaseNumber'].map(str)
df8.set_index('CaseNumber', inplace=True)
for Casenumber, data in df8.iterrows():
    data.to_csv('c:\\users\\admin\\' + Casenumber + '.csv')
This can be done simply by using pandas:
import pandas as pd

# Preempt the issue of columns being numeric by reading everything as str
df = pd.read_csv('your_data.csv', dtype=str)
df.set_index('teacher', inplace=True)
for teacher, data in df.iterrows():
    # transpose the row back into a one-row frame so the header line is written too
    data.to_frame().T.to_csv(teacher + '.csv', index_label='teacher')
Edits:
df8.set_index('CaseNumber', inplace=True)
for Casenumber, data in df8.iterrows():
    # use raw f-strings to make the Windows path easier to write
    data.to_frame().T.to_csv(rf'c:\users\admin\{Casenumber}.csv', index_label='CaseNumber')
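If the same teacher could ever appear on more than one row, a groupby-based sketch writes one file per distinct value instead of one file per row (same sample data assumed as above):
import pandas as pd

df = pd.read_csv('your_data.csv', dtype=str)
# one CSV per distinct teacher; each file keeps the header row
for teacher, group in df.groupby('teacher'):
    group.to_csv(f'{teacher}.csv', index=False)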

how to loop through a folder of csv files, read the header of each, and output the result to a folder?

I'm a newbie in python and need help with this piece of code. I did a lot of searching to get to this stage but couldn't fix it on my own. Thanks in advance for your help.
What I'm trying to do: I have to compare 100+ csv files in a folder, and not all of them have the same number of columns or the same column names. So I'm trying to use python to read the header of each file and write them all to a csv file in an output folder.
I got to this point, but I'm not sure I'm even on the right path:
import pandas as pd
import glob

path = r'C:\Users\user1\Downloads\2016GAdata' # use your path
all_files = glob.glob(path + "/*.csv")
list1 = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    list1.append(df)
frame = pd.concat(list1, axis=0, ignore_index=True)
print(frame)
thanks for your help!
You can create a dictionary whose keys are the filenames and whose values are each file's column names. Building a dataframe from this dictionary gives you the filenames as the index and the column names as the values.
d = {}
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    d[filename] = df.columns
frame = pd.DataFrame.from_dict(d, orient='index')
           0     1     2       3
file1  Fruit  Date  Name  Number
file2  Fruit  Date  Name    None
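Since only the headers matter here, reading each file in full is wasteful. A sketch that parses just the header row of each file and then writes the summary out (the output filename is only an example):
import glob
import os
import pandas as pd

path = r'C:\Users\user1\Downloads\2016GAdata'
d = {}
for filename in glob.glob(os.path.join(path, "*.csv")):
    # nrows=0 parses the header row only and skips all the data rows
    d[os.path.basename(filename)] = pd.read_csv(filename, nrows=0).columns

frame = pd.DataFrame.from_dict(d, orient='index')
frame.to_csv('headers_summary.csv')  # example output name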

Pandas Concatenate dataframes

This is driving me nuts! I have several DataFrames that I am trying to concatenate with pandas. The index is the filename. When I use df.to_csv for the individual dataframes I can see the index column (filename) along with the column of interest. When I concatenate along the filename axis I only get the column of interest and numbers. No filename.
Here is the code I am using as is. It works as I expect up until the "all_filenames" line.
for filename in os.listdir(directory):
    if filename.endswith("log.csv"):
        df = pd.read_fwf(filename, skiprows=186, nrows=1, names=["Attribute"])
        df['System_Library_Name'] = [x.split('/')[6] for x in df['Attribute']]
        df2 = pd.concat([df for filename in os.listdir(directory)], keys=[filename])
        df2.to_csv(filename + "log_info.csv", index=filename)
all_filenames = glob.glob(os.path.join(directory, '*log_info.csv'))
cat_log = pd.concat([pd.read_csv(f) for f in all_filenames])
cat_log2 = cat_log[['System_Library_Name']]
cat_log2.to_excel("log.xlsx", index=filename)
I have tried adding keys=filename to the 3rd to last line and giving the index a name with df.index.name=
I have used similar code before and had it work fine; however, this is only one column that I am using from a larger original input file, if that makes a difference.
Any advice is greatly appreciated!
import glob
import os
import pandas as pd

df = pd.concat(
    # this is just reading one value from each file, yes?
    [pd.read_fwf(filename, skiprows=186, nrows=1, names=["Attribute"])
       .set_index(pd.Index([filename]))
       .applymap(lambda x: x.split('/')[6])
       .rename(columns={'Attribute': 'System_Library_Name'})
     for filename in glob.glob(os.path.join(directory, '*log.csv'))]
)
df.to_excel("log_info.xlsx")
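Since the goal is to see the filename next to the column of interest, one small addition worth trying (not part of the original answer) is to label the index before writing, so the first column in the output file gets a header:
df.index.name = 'filename'
df.to_excel("log_info.xlsx")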

Problem reading .dat files in pandas in a loop from a folder

I have a strange problem. In my folder I have .dat files with CO2 values from a CO2 sensor in the laboratory: data from experiments 4, 5, 6, 7 and 8, with the names CO2_4.dat, CO2_5.dat, CO2_6.dat, CO2_7.dat and CO2_8.dat.
I know how to read them manually. For example, reading CO2_4 like this works:
dfCO2_4_manual = pd.read_csv(r'C:\data\CO2\co2_4.dat', sep=";", encoding='unicode_escape',
                             header=0, skiprows=[0], usecols=[0, 1, 2, 4],
                             names=["ts", "t", "co2_4", "p"])
display(dfCO2_4_manual)
which gives me a dataframe with the correct values (one value every minute).
But if I want to loop over my folder and read them all with this technique (which works for other CSV files from the laboratory), saving the dataframes in a dictionary:
exp_list = [4, 5, 6, 7, 8]  # list with the number of each experiment
path_CO2 = r'C:\data\CO2'
CO2_files = glob.glob(os.path.join(path_CO2, "*.dat"))
CO2_dict = {}
for f, i in zip(offline_files, exp_list):
    CO2_dict["CO2_{0}".format(i)] = pd.read_csv(f, sep=";", encoding='unicode_escape',
                                                header=0, skiprows=[0], usecols=[0, 1, 2, 4],
                                                names=["ts", "t", "CO2_{0}".format(i), "p"])
display(CO2_dict["CO2_4"])
this gives me a dataframe with many skipped and completely wrong values.
If I open CO2_4.dat with a text editor it looks like this:
Does someone know what is happening?
It's not clear how to help exactly, given we don't have access to your files; however, is this line
for f, i in zip(offline_files, exp_list):
correct? Where is offline_files defined? It's not in the code you have provided. Also, do you want to analyze each df separately? Is that why you are storing them in a dictionary?
As an alternative, you can store each df in a list and concatenate them. You can then group them and apply analyses that way.
df_hold_list = []
for f, i in zip(CO2_files, exp_list):  # changed file list name; please verify
    df = pd.read_csv(f, sep=";", encoding='unicode_escape', header=0, skiprows=[0],
                     usecols=[0, 1, 2, 4], names=["ts", "t", "CO2_{0}".format(i), "p"])
    df['file'] = 'CO2_{0}'.format(i)  # add a column for sorting/grouping
    df_hold_list.append(df)
df_new = pd.concat(df_hold_list, axis=0)  # axis=0 stacks the frames row-wise
I can't test the code, but it should work. Let me know if it doesn't.
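One likely culprit is that glob does not guarantee the files come back in experiment order, so zip can pair, say, CO2_5.dat with experiment 4. A sketch that builds each path directly from the experiment number (assuming the CO2_<n>.dat naming from the question) sidesteps the ordering problem entirely:
import os
import pandas as pd

exp_list = [4, 5, 6, 7, 8]
path_CO2 = r'C:\data\CO2'
CO2_dict = {}
for i in exp_list:
    f = os.path.join(path_CO2, f'CO2_{i}.dat')  # path derived from the experiment number
    CO2_dict[f'CO2_{i}'] = pd.read_csv(f, sep=";", encoding='unicode_escape',
                                       header=0, skiprows=[0], usecols=[0, 1, 2, 4],
                                       names=["ts", "t", f"CO2_{i}", "p"])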

dropping columns in multiple excel spreadsheets

Is there a way in python I can drop columns in multiple excel files? I.e. I have a folder with several xlsx files. Each file has about 5 columns (date, value, latitude, longitude, region). I want to drop all columns except date and value in each excel file.
Let's say you have a folder with multiple excel files:
import pandas as pd
from pathlib import Path

folder = Path('excel_files')
xlsx_only_files = list(folder.rglob('*.xlsx'))

def process_files(xls_file):
    # stem is a pathlib attribute that gives just the filename
    # without the parent or the suffix
    filename = xls_file.stem
    # sheet_name=None reads the data in as a dictionary
    # keyed by sheet name; usecols keeps only the relevant columns
    df = pd.read_excel(xls_file, usecols=['date', 'value'], sheet_name=None)
    df_cleaned = [data.assign(sheetname=sheetname, filename=filename)
                  for sheetname, data in df.items()]
    return df_cleaned

# process_files returns a list per file, so flatten before concatenating
combo = [df for xlsx in xlsx_only_files for df in process_files(xlsx)]
final = pd.concat(combo, ignore_index=True)
Let me know how it goes
I suggest defining the columns you want to keep as a list and then selecting them as a new dataframe:
# after opening the excel file as
df = pd.read_excel(...)
keep_cols = ['date', 'value']
df = df[keep_cols]  # keep only the selected columns; returns a dataframe
df.to_excel(...)
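Putting that together with a folder loop (the folder name and the decision to overwrite each file in place are assumptions; adjust to taste):
from pathlib import Path
import pandas as pd

keep_cols = ['date', 'value']
for xlsx in Path('excel_files').glob('*.xlsx'):  # hypothetical folder name
    df = pd.read_excel(xlsx, usecols=keep_cols)  # read only the columns to keep
    df.to_excel(xlsx, index=False)               # overwrite the original file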
