As the title says, I need to write multiple sheets into a single Excel file with pandas. While this thread and this one both provide solutions, I figured my situation is a bit different. Both of them use something similar to:
writer = pd.ExcelWriter('output.xlsx')
DF1.to_excel(writer,'Sheet1')
DF2.to_excel(writer,'Sheet2')
writer.save()
However, the problem is that I cannot afford to keep multiple dataframes in memory at the same time, since each of them is just too big. My data can be a more complicated version of this:
df = pd.DataFrame(dict(A=list('aabb'), B=range(4), C=range(6, 10)))

   A  B  C
0  a  0  6
1  a  1  7
2  b  2  8
3  b  3  9
I intend to use the items ['a', 'b', 'c'] in grplist to perform some calculation and eventually generate a separate sheet for each case where data['A'] equals 'a' through 'c':
data = pd.read_csv(fileloc)
grplist = [['a','b','c'],['d','e','f']]
for groups, numbers in zip(grplist, range(1, 5)):
    for category in groups:
        clean = data[(data['A'] == category) & (data['B'] == numbers)]['C']
        # --------My calculation to generate a dataframe--------
        my_result_df = pd.DataFrame(my_result)
        writer = pd.ExcelWriter('my_path_of_excel')
        my_result_df.to_excel(writer, 'Group%s_%s' % (numbers, category[:4]))
        writer.save()
        gc.collect()
Sadly, my code does not create multiple sheets as groups and numbers are looped through. I only get the last result in a single sheet in my Excel file. What can I do?
This is my very first post here. I hope I'm following all the rules so this thread can end well. If anything needs to be modified or improved, please kindly let me know. Thanks for your help :)
Consider the df:
df = pd.DataFrame(dict(A=list('aabb'), B=range(4)))
Loop through the groups and print:
for name, group in df.groupby('A'):
    print('{}\n\n{}\n\n'.format(name, group))
a

   A  B
0  a  0
1  a  1


b

   A  B
2  b  2
3  b  3
Then write each group with to_excel:
writer = pd.ExcelWriter('output.xlsx')
for name, group in df.groupby('A'):
    group.to_excel(writer, name)
writer.save()
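If memory is the tight constraint, this same pattern also fixes the original loop: create the writer once, outside the loops, compute each group's result inside, and write it immediately, so only one result frame is alive at a time. A minimal sketch, where compute() is a hypothetical stand-in for the calculation step:

import pandas as pd

data = pd.read_csv(fileloc)
grplist = [['a', 'b', 'c'], ['d', 'e', 'f']]

# one writer for the whole file, so sheets accumulate instead of
# the file being recreated (and overwritten) on every iteration
writer = pd.ExcelWriter('my_path_of_excel')
for numbers, groups in enumerate(grplist, start=1):
    for category in groups:
        clean = data[(data['A'] == category) & (data['B'] == numbers)]['C']
        my_result_df = compute(clean)  # hypothetical calculation step
        my_result_df.to_excel(writer, 'Group%s_%s' % (numbers, category[:4]))
        del my_result_df  # release the result before the next iteration
writer.save()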
I'm a beginner in Python and pandas, trying to figure out how to read from a CSV in a particular way.
My data file:
01 AAA1234 AAA32452 AAA123123 0 -9 C C A A T G A G .......
01 AAA1334 AAA12452 AAA125123 1 -9 C A T G T G T G .......
...
...
...
I have 100,000 columns in this file and I want to merge every two columns into one, but the merging needs to start after the first 6 columns. I would prefer to do this while reading the file, if possible, instead of manipulating this huge data file.
Desired outcome
01 AAA1234 AAA32452 AAA123123 0 -9 CC AA TG AG .......
01 AAA1334 AAA12452 AAA125123 1 -9 CA TG TG TG .......
...
...
...
That would result in a dataframe with half the columns. My data file has no column names; the names reside in a different CSV, but that is another subject.
I'd appreciate a solution, thanks in advance!
Separate the data frame initially; I created one for experimental purposes. Then I defined a function and passed the dataframe that needed manipulation as an argument:
def columns_joiner(data):
    new_data = pd.DataFrame()
    for i in range(0, 11, 2):  # You can change the range to your wish
        # Here, I had only 10 columns to concatenate (therefore the range ends at 11)
        ser = data[i] + data[i + 1]
        new_data = pd.concat([new_data, ser], axis=1)
    return new_data
I don't think this is an efficient solution. But it worked for me.
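For reference, a sketch that keeps the first six columns untouched and merges every following pair with iloc; the whitespace separator and missing header row are assumptions about the file format:

import pandas as pd

# read everything as strings so the pairs concatenate as text (assumption)
df = pd.read_csv('datafile.txt', sep=r'\s+', header=None, dtype=str)

fixed = df.iloc[:, :6]                       # first 6 columns stay as-is
pairs = [df.iloc[:, i] + df.iloc[:, i + 1]   # join each later pair of columns
         for i in range(6, df.shape[1], 2)]
merged = pd.concat([fixed] + pairs, axis=1)
merged.columns = range(merged.shape[1])      # renumber columns 0..n-1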
I'm trying to read a .txt file and output the count of each letter, which works; however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the whole dictionary ends up in a single cell of the export. However, I want each letter on its own row when I open the file; that is, I want it split at the ':' and ',' into separate columns.
I've tried various other answers on here, but most of them end up giving ValueErrors, so maybe I just don't know how to apply them, like the following one:
df[[',']] = df[','].str.split(expand=True)
Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; then the csv looks like you need:
# sample data
freqs = {'a': 5, 'b': 2}

df = (pd.DataFrame.from_dict(freqs, orient='index', columns=['Book 1 Output'])
        .rename_axis('Letter'))
print(df)
        Book 1 Output
Letter
a                   5
b                   2
df.to_csv(r'book_export.csv', sep=',')
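For reference, the resulting book_export.csv then looks like:

Letter,Book 1 Output
a,5
b,2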
An alternative is to use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print(s)

Letter
a    5
b    2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequency dicts, change the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print(df)

        f1  f2
Letter
a        5   9
b        2   3
So my df looks like this:
x y group
0 53 10 csv1
1 53 10 csv1
2 48 9 csv0
3 48 9 csv0
4 48 9 csv0
... ... ... ...
I have some files whose names depend on the group name, and I want to use them in a function besides the x and y values.
What I am doing so far is the following:
dfGrouped = df.groupby('group')  # group the dataframe
df['newcol'] = np.nan  # create new empty col

# use a for loop to load the file depending on the group; note the file is
# very large, that's why I want to load it only once per group
for name, group in dfGrouped:
    file = open(name + '.txt')  # open the file
    df['newcol'] = df[df['group'] == name].apply(lambda row: newValueFromFile(row.x, row.y, file), axis=1)
It seemed to work at first; unfortunately, newcol only holds the values from the last loop iteration and seems to override the values created earlier with NaN. Does anybody have an idea?
Instead of file = open, use
with open('filename.txt', 'a') as file:
and then file.write... in the lambda expression.
The 'a' mode tells it to append the data to the existing content; I guess currently you are overwriting the content of the file.
with open() also takes care of automatically closing the file once you're done with it.
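A minimal sketch of that pattern, assuming the goal is to append per-group rows to each group's file (the written format is a placeholder):

for name, group in df.groupby('group'):
    with open(name + '.txt', 'a') as file:  # 'a' appends instead of truncating
        group.apply(lambda row: file.write('%s,%s\n' % (row.x, row.y)), axis=1)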
I have the following code, which reads a csv file and then analyzes it. One patient has more than one illness, and I need to find how many times each illness is seen across all patients. But this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')

data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()

index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of the dataframes:

raw_data:

Finding Labels    - Patient ID
IllnessA|IllnessB - 1
Illness A         - 2
From what I read, I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are filtering not only the rows which have the disease, but also those which have a specific patient id. If you have a lot of patients, you will need to run this query many times. A simpler way would be to not filter on the patient id and simply count all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case, since you want the number of rows, len is what you are looking for instead of size (size is the number of cells in the dataframe, i.e. rows × columns).
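A quick illustration of the difference:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.size)  # 6 -> number of cells (3 rows * 2 columns)
print(len(df))  # 3 -> number of rows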
Finally, another source of error in your current code was that you were not keeping the count for every patient id: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is usually reminiscent of general-purpose Python, not pandas programming, which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses, lining up Finding Labels with each illness in the same row, to be filtered where the longer string contains the shorter item. Then run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
'Finding Labels': np.core.defchararray.add(
np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
np.random.choice(ills, 25).astype('str')),
np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
})
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
I'm trying to make a simple spreadsheet:
I input some details, and it gets saved into the spreadsheet.
So, I input the details into 2 lists, and 1 normal variable:
Dates = ['01/01/14', '01/02/14', '01/03/14']
Amount = ['1', '2', '3']
Cost = 12 (because it is always the same)
I'm trying to insert these into a spreadsheet like this:
insertThese = []
for i in range(len(Dates)):
    insertThese.extend([Dates[i], Amount[i], Cost])
ws.append(insertThese)
but this adds the 3 things side-by-side like:
A B C D E F G H I
01/01/14 1 12 01/02/14 2 12 01/03/14 3 12
but I want it to add a new row for each entry instead, like:
A B C
01/01/14 1 12
01/02/14 2 12
01/03/14 3 12
I don't understand how to do this without removing my headers at the top of the file.
I tried using iter_rows(), but that removes the header.
So how do I get the details to be added row-by-row?
I'm new to openpyxl, so if anything's obvious - sorry!
You can use zip (or itertools.izip on Python 2) to loop over your lists in parallel. Something like the following:
for d, a in zip(Dates, Amount):
    ws.append([d, a, 12])
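For completeness, a minimal self-contained sketch; the workbook path and header row are assumptions:

from openpyxl import Workbook

Dates = ['01/01/14', '01/02/14', '01/03/14']
Amount = ['1', '2', '3']
Cost = 12

wb = Workbook()
ws = wb.active
ws.append(['Date', 'Amount', 'Cost'])  # assumed header row
for d, a in zip(Dates, Amount):
    ws.append([d, a, Cost])            # each append starts a new row
wb.save('output.xlsx')                 # hypothetical output path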