We have this if/else iteration with the goal of splitting a dataframe into several dataframes. The result of this iteration varies, so we will not know how many dataframes we will get out of one dataframe. We want to save these several dataframes as text (.txt):
txtDf = open('D:/My_directory/df0.txt', 'w')
txtDf.write(df0)
txtDf.close()

txtDf = open('D:/My_directory/df1.txt', 'w')
txtDf.write(df1)
txtDf.close()

txtDf = open('D:/My_directory/df2.txt', 'w')
txtDf.write(df2)
txtDf.close()
And so on ....
But we want to save those dataframes automatically, so that we don't have to write the code above 100 times for 100 split dataframes.
This is an example of our dataframe df:
column_df
237814
1249823
89176812
89634
976234
98634
and we would like to split the dataframe df into several dataframes df0, df1, df2 (note: each of these columns will be its own dataframe, not a column of one dataframe):
column_df0 column_df1 column_df2
237814 89176812 976234
1249823 89634 98634
We tried this code:
import sys
import copy
import numpy as np
import pandas as pd

df = pd.DataFrame(df)
len(df)

if len(df) > 10:
    print('EXCEEEEEEEEEEEEEEEEEEEEDDD!!!')
    sys.exit()
elif len(df) > 2:
    df_dict = {}
    x = 0
    y = 2
    for df_letter in ['A', 'B', 'C', 'D', 'E', 'F']:
        df_name = f'df_{df_letter}'
        df_dict[df_name] = copy.deepcopy(df_letter)
        df_dict[df_name] = pd.DataFrame(df[x:y]).to_string(header=False, index=False, index_names=False).split('\n ')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 2
        y += 2
    df_name
else:
    df

for df_ in df_dict:
    print(df_)
    print(f'length: {len(df_dict[df_])}')
    txtDf = open('D:/My_directory/{df_dict[df_]}.txt', 'w')  # note: not an f-string
    txtDf.write(df)
    txtDf.close()
The problem with this code is that we cannot write the several .txt files automatically; everything else works just fine. Can anybody figure it out?
If each value in df_dict is a list, you can iterate over the dict and save each element as a string:

for key, value in df_dict.items():
    with open(f'D:/My_directory/{key}.txt', 'w') as file:
        file.write('\n'.join(str(v) for v in value))
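For completeness, a minimal end-to-end sketch under the same assumptions (fixed two-row chunks, files named df0.txt, df1.txt, ... in D:/My_directory); the frame here is a stand-in for the real df:

import pandas as pd

# stand-in for the real input frame from the question
df = pd.DataFrame({'column_df': [237814, 1249823, 89176812, 89634, 976234, 98634]})

# two-row chunks: rows 0-1, 2-3, 4-5, ...
chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]

for n, chunk in enumerate(chunks):
    # one file per chunk: df0.txt, df1.txt, df2.txt, ...
    with open(f'D:/My_directory/df{n}.txt', 'w') as fh:
        fh.write(chunk.to_string(header=False, index=False))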
So this is kind of weird, but I'm new to Python and I'm committed to seeing my first project with Python through to the end.

I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # without this, the appended rows are never written to disk
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, in dataframes. These dataframes are stored in a list. I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what are actually good Python practices vs. not.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists I will store looped data into
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read in the files to a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists that store where headers are located in the frame
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if str and matches search params, then do...
                # (insensitive_compare is a helper defined elsewhere in the project)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store the row number where it is in that dataframe
                        column_list.append(name)  # store the column name where it is in that dataframe
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans, True where there's something there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store actual dataframe values into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel like:
So the comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concat'd into a dataframe that is 569x52. Here is the code that I used; I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.
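As a side note, since main_df is already a list of frames, the build-up loop can be collapsed into a single concat call, which avoids re-copying the accumulated frame on every iteration:

# one-pass equivalent of the loop above
to_xlsx_df = pd.concat(main_df)
to_xlsx_df.to_excel(outfile_path)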
The code I am using:
import pandas as pd
from collections import Counter
import xlsxwriter

def list_generator(file, savefile):
    # set writer for output filepath
    writer = pd.ExcelWriter(savefile + '.xlsx', engine='xlsxwriter')
    # set dataframe to file(path)
    df = pd.read_csv(file)
    # set split action
    split = lambda x: pd.Series(str(x).split(','))
    # special characters that are not allowed in Excel sheet names
    specials = ['\\', '?', '/', '*', ':', '[', ']']
    # set columns
    col_list = list(df.columns)
    for j in col_list:
        temp = df[j].apply(split)
        temp_clean = []
        for i, r in temp.iterrows():
            for x in range(len(r)):
                if x in temp_clean:
                    break
                elif (r[x] is None) == True or str(r[x]) == '':
                    break
                else:
                    cleaned = str(r[x])
                    cleaned = cleaned.lstrip()
                    temp_clean.append(cleaned)
                    #temp_clean.append(r[x])
        counted = Counter(temp_clean)
        temp_list = pd.DataFrame(counted.items(), columns=[j, 'count'])
        temp_list = temp_list.dropna()
        for spec in specials:
            if spec in j:
                j = j.replace(spec, '')
        if len(j) > 30:
            j = j[:30]
        temp_list.to_excel(writer, sheet_name=j, index=False)
    writer.save()

list_generator('/content/drive/MyDrive/Maryland/Data/md_res.csv', 'md_res_count')
The files are CSVs downloaded from Airtable. I want to split multi-select columns to get accurate counts of all occurrences, which I get, but I can't understand how I keep getting blank spaces (which I think I figured out?) and also NaN values. The output is an .xlsx file with sheets that look like the sample below.

Also, some of the multi-selects seem to split on the comma separation as well as on strings contained within strings.

Sample sheet cut

Any help would be greatly appreciated! I can elaborate on anything needed.
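One way to see where the NaNs come from is to compare against pandas' own split machinery, which drops missing cells before anything is counted. A minimal sketch (the file path is the one from the question; everything else is standard pandas):

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Maryland/Data/md_res.csv')

for col in df.columns:
    counts = (
        df[col]
        .dropna()        # NaN cells never reach the counter
        .astype(str)
        .str.split(',')  # split each multi-select on commas
        .explode()       # one row per selected value
        .str.strip()     # trim the space left after each comma
        .value_counts()
    )
    print(counts)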
L = [('X1', "A"), ('X2', "B"), ('X3', "C")]
for i in range(len(L)):
    path = os.path.join(L[i][1] + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    ''.join(L[i][0]) = pd.read_excel(xls, 'Sheet1')
File "<ipython-input-1-6220ffd8958b>", line 6
''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
^
SyntaxError: can't assign to function call
I have a problem with pandas: I cannot create several dataframes for several Excel files, because I don't know how to create the variables dynamically.
I'd need a result that looks like this:
X1 will have dataframe of A.xlsx
X2 will have dataframe of B.xlsx
.
.
.
Solved:

d = {}
for i, value in L:
    path = os.path.join(value + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    df = pd.read_excel(xls, 'Sheet1')
    key = 'df-' + str(i)
    d[key] = df
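Each frame is then looked up by key instead of through a dynamically named variable, e.g.:

print(d['df-X1'].head())  # the sheet read from A.xlsx
print(d['df-X2'].head())  # the sheet read from B.xlsx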
Main pull:
I would approach this by reading everything into one dataframe (loop over the files and concat):
import os
import pandas as pd

files = []  # generate list for files to go into
path_of_directory = "path/to/folder/"
for dirname, dirnames, filenames in os.walk(path_of_directory):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

output_data = []  # blank list for building up dfs
for name in files:
    df = pd.read_excel(name)
    df['name'] = os.path.basename(name)
    output_data.append(df)

total = pd.concat(output_data, ignore_index=True, sort=True)
Then:
From there you can interrogate the df using df.loc[df['name'] == 'choice'].
Or (in keeping with your question):
You could then split into a dictionary of dataframes, based on this column. This is the best approach...
import copy

# wrapped in a function here so the trailing 'return dictionary' is valid;
# 'column' is the name of the column to split on (e.g. 'name' from above)
def split_by_column(df, column):
    dictionary = {}
    df[column] = df[column].astype(str)
    col_values = df[column].unique()
    for value in col_values:
        key_name = 'df' + str(value)
        dictionary[key_name] = copy.deepcopy(df)
        dictionary[key_name] = dictionary[key_name][df[column] == value]
        dictionary[key_name].reset_index(inplace=True, drop=True)
    return dictionary
The reason for this approach is discussed here: Create new dataframe in pandas with dynamic names also add new column, which basically says that dynamic naming of dataframes is bad and that this dict approach is best.
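A quick usage sketch, assuming the total frame built above (the keys depend on the actual values in the name column, so the ones below are hypothetical):

frames = split_by_column(total, 'name')  # split the combined frame on the filename column
print(frames.keys())                     # e.g. dict_keys(['dfA.xlsx', 'dfB.xlsx', ...])
print(frames['dfA.xlsx'].head())         # hypothetical key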
This might help.
files_xls = ['all your excel filenames go here']
df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
print(df)
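One caveat: DataFrame.append was removed in pandas 2.0, so on current versions the same loop is written with pd.concat:

import pandas as pd

files_xls = ['all your excel filenames go here']
frames = [pd.read_excel(f, 'Sheet1') for f in files_xls]
df = pd.concat(frames, ignore_index=True)
print(df)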
I have a csv file with 15 columns and around 17000 rows.

My problem is to search a specific column (for example, the column 'name') for an input string; if it matches, print the row [i] that contains the string, the previous row [i-1], and the next row [i+1], in the order i-1, i, i+1. Repeat the process up to the last element of the column (my data file is formatted so that it contains no duplicates).

I used this reference to find the rows, and the program runs well. Below is my Python code:
import pandas as pd

x = input('Please input the name: ')
df = pd.read_csv("input.csv", sep=",")
idx = df[df.name.str.contains(x, na=False)].index.tolist()
for i in idx:
    print(df.iloc[[i-1, i, i+1]])
I would like to ask how to export the filtered data above to a new dataframe and output it to a new csv file. I followed this reference:

df.iloc[[i-1, i, i+1]].to_csv('result.csv', index=True, mode='a')

The output file is OK, but it doesn't include the column names, and I also think it is not so formal and optimal, according to the author of that topic.

Thank you very much.
I think you need min and max to avoid selecting rows that don't exist before the first and after the last matched row. Then, for the new file, first save only the column names, and afterwards, in the loop, save only data with no header:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'name': list('aaabbb')
})
print(df)

# tested matching first row
x = 'a'
# tested matching last row
# x = 'b'

idx = df[df.name.str.contains(x, na=False)].index.tolist()

pd.DataFrame(columns=df.columns).to_csv('result.csv')
for i in idx:
    df1 = df.iloc[[max(0, i-1), i, min(df.index[-1], i+1)]]
    df1.to_csv('result.csv', index=False, mode='a', header=None)
    # if you need index values
    # df1.to_csv('result.csv', mode='a', header=None)
Another solution is to use concat on a list of DataFrames, then save to csv once, with no append mode:
x = 'a'
idx = df[df.name.str.contains(x, na=False)].index.tolist()

dfs = []
for i in idx:
    dfs.append(df.iloc[[max(0, i-1), i, min(df.index[-1], i+1)]])

# list comprehension alternative
# dfs = [df.iloc[[max(0, i-1), i, min(df.index[-1], i+1)]] for i in idx]

pd.concat(dfs).to_csv('result.csv', index=False)
# if you need the index
# pd.concat(dfs).to_csv('result.csv')
I have several pandas DataFrames of the same format, with five columns.

I would like to sum the values of each one of these dataframes using df.sum(). This creates a Series for each DataFrame, indexed by the same five column names.

My problem is how to take these Series and create another DataFrame, with one column being the filename and the other columns being the five sums from df.sum().
import pandas as pd
import glob

batch_of_dataframes = glob.glob("*.txt")
newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
This approach doesn't work, unfortunately: df['filename'] = str(filename) throws a TypeError, and the new dataframe newdf doesn't come out right either.

How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?
Try in this order:

1. Create an empty list, say list_of_series.
2. For every file:
   - load it into a data frame, then save the sum in a series s
   - add an element to s: s['filename'] = your_filename
   - append s to list_of_series
3. Finally, concatenate (and transpose if needed):

final_df = pd.concat(list_of_series, axis = 1).T
Code

Preparation:

import glob
import numpy as np
import pandas as pd

l_df = [pd.DataFrame(np.random.rand(3, 5), columns=list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i) + '.txt', index=False)
Files *.txt are comma separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
And, indeed, the rest is quite similar to what you did (I append the file name to the series, not to the data frames; otherwise it gets concatenated several times by sum()):
files = glob.glob('*.txt')
print(files)
['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']

list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)

final_df = pd.concat(list_of_series, axis = 1).T
print(final_df)
A B C D E filename
0 1.0675 2.20957 1.65058 1.80515 2.22058 3.txt
1 0.642805 1.36248 0.0237625 1.87767 1.63317 0.txt
2 1.68678 1.26363 0.835245 2.05305 1.01829 4.txt
3 1.22748 2.09256 0.785089 1.87852 2.05043 2.txt
4 0.910733 0.815614 1.43272 2.65527 1.11553 1.txt
To answer this specific question:

@ThomasTu How do I go from a list of Series with 'Filename' as a column to a dataframe? I think that's the problem---I don't understand this

It's essentially what you have now, but instead of appending to an empty list, you append to an empty dataframe. Note that DataFrame.append returns a new frame rather than modifying in place, so you do need to reassign newdf on each iteration.
import pandas as pd
import glob

batch_of_dataframes = glob.glob("*.txt")
newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf = newdf.append(df, ignore_index=True)