This is probably an easy fix, but I can't seem to figure it out...
I'm outputting a list to CSV in Python using the following code:
w = csv.writer(file('filename.csv','wb'))
w.writerows(mylist)
One of the list items is a ratio, so it contains values like '23/54', '9/12', etc. Excel is recognizing some of these values (like 9/12) as a date. What's the easiest way to solve this?
Thanks
Because CSV is a text-only format, you cannot tell Excel anything about how to interpret the data, I'm afraid.
You'd have to generate actual Excel files instead (using xlwt, for example; documentation and tutorials are available at http://www.python-excel.org/).
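For this particular case, a minimal xlwt sketch could look like the following (the sheet name and the example ratios are just placeholders):
import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet('ratios')              # sheet name is arbitrary
for r, ratio in enumerate(['23/54', '9/12']):
    ws.write(r, 0, ratio)                # written as a text cell, so Excel won't turn it into a date
wb.save('filename.xls')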
You could do this:
import csv

# somelist contains data like '12/51', '9/43', etc.
mylist = ["'" + val + "'" for val in somelist]
w = csv.writer(open('filename.csv', 'wb'))
for me in mylist:
    w.writerow([me])
This would ensure your data is written as it is to csv.
I'm a doctor trying to learn some code for work, and I was hoping you could help me solve a problem I have with regard to importing multiple images into Python.
I am working in Jupyter Notebook, where I have created a dataframe (named df_1) using pandas. In this dataframe each row represents a patient, and the first column shows the case number for each patient (e.g. 85).
Now, what I want to do is import multiple images (.bmp) from a given folder (the same location as the .ipynb file). There are many images in this folder, and I do not want all of them - only the ones whose filenames correspond to the "case_number" column in my dataframe (e.g. 85.bmp).
I already read this post, but I must admit it was way too complicated for me to understand.
Is there some simple loop (or something else) I could create to import all images with filenames corresponding to the values of the "case number" column in the dataframe?
I was imagining something like the below would be possible, I just do not know how to write it.
for i=[(df_1['case_number'()]
cv2.imread('[i].bmp')
The images don't really need to be included in the dataframe, but I would like to be able to view them in my notebook by using e.g. plt.imshow(85) afterwards.
Here is an image of the head of my dataframe
Thank you for helping!
You can access all of your files using this:
imageList = []   # the images themselves
pathList = []    # the matching file paths, e.g. './85.bmp'
for i in range(len(df_1)):
    path = './' + str(df_1['case_number'][i]) + '.bmp'
    imageList.append(cv2.imread(path))   # store the image, not just its name
    pathList.append(path)

plt.imshow(imageList[x])   # x is the position of the image you want to see
This loops through every item in the case_number column; the ./ means the file sits in the current directory (the same folder as your notebook), and joining the strings builds a readable path such as ./85.bmp, which cv2.imread then opens. The images themselves are appended to imageList so they can later be shown with plt.imshow(), and the matching paths are kept in pathList so you can look an image up by its filename.
If you would like to access the files based on their name, you can use another variable (which could be set as an input) and implement the code below
fileName = input('Enter Your Value: ')
inputFile = pathList.index('./' + fileName + '.bmp')
and from here you can use the same plt.imshow(imageList[x]) call, but with the inputFile variable in place of x.
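If you'd rather look an image up directly by its case number (closer to the plt.imshow(85) idea from your question), a small dict works too; a rough sketch, assuming the .bmp files sit next to the notebook:
import cv2
import matplotlib.pyplot as plt

images = {}
for case in df_1['case_number']:
    # keys are the case numbers from the dataframe, values are the images themselves
    images[case] = cv2.imread('./' + str(case) + '.bmp')

plt.imshow(images[85])   # show the image for case 85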
So, I have successfully built a function where I can gather some data, convert it to a dataframe, and then convert it into an Excel document. The problem I am having is that I don't know how to create a responsive file name or storage location. I have the path hard-coded using pandas like so:
developmentdata = pd.DataFrame(dev_names).to_excel(r'C:\Desktop\Dev\devproj\test.xlsx', header=False, index=False)
and the Excel file is titled 'test.xlsx'. What can I do to fix this?
Python offers many different possibilities for programmatically constructing text strings. A modern approach that I personally would recommend is to use f-strings. There are also many different possibilities for constructing and formatting dates, but I think that Python's datetime module should suit your needs. As an example:
from datetime import datetime
# Variables with identifying information
output_dir = r"C:\Desktop\Dev\devproj"
language = "php"
location = "boston"
# Retrieve current date as a formatted string
current_date = datetime.now().strftime("%Y%m%d")
# Construct output path using f-strings
file_name = f"{location}_{language}_{current_date}.xlsx"
output_path = rf"{output_dir}\{file_name}"
print(output_path)
# Output:
# C:\Desktop\Dev\devproj\boston_php_20190821.xlsx
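You can then pass the constructed path straight to your existing to_excel call (this reuses pd and dev_names from your own snippet):
pd.DataFrame(dev_names).to_excel(output_path, header=False, index=False)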
Note 1: to use f-strings you must have Python 3.6 or higher.
Note 2: for dates in file names I'd always recommend using "year-month-day" formatting (as done in the code example). This has the benefit that when you sort several of these files by name, you automatically get a proper chronological ordering as well.
I'm used to programming in Python. My company now has a Hadoop cluster with Jupyter installed. Until now I have never used Spark / PySpark for anything.
I am able to load files from HDFS as easy as this:
text_file = sc.textFile("/user/myname/student_grades.txt")
And I'm able to write output like this:
text_file.saveAsTextFile("/user/myname/student_grades2.txt")
The thing I'm trying to achieve is to use a simple for loop to read the text files one by one and write their content into one HDFS file. So I tried this:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile("/user/myname/all.txt")
So this works for the first element of the list, but then gives me this error message:
Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
XXXXXXXX/user/myname/all.txt already exists
To avoid confusion I blurred out the IP address with XXXXXXXX.
What is the right way to do this?
I will have tons of datasets (like 'text1', 'text2' ...) and want to apply a Python function to each of them before saving them into HDFS. But I would like to have the results all together in "one" output file.
Thanks a lot!
MG
EDIT:
It seems my final goal was not really clear. I need to apply a function to each text file separately, and then I want to append the output to the existing output directory. Something like this:
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    text_file.saveAsTextFile("/user/myname/all.txt")
I wanted to post this as a comment but could not do so, as I do not have enough reputation.
You have to convert your RDD to a DataFrame and then write it in append mode. To convert an RDD to a DataFrame, please look into this answer:
https://stackoverflow.com/a/39705464/3287419
or this link http://spark.apache.org/docs/latest/sql-programming-guide.html
To save a DataFrame in append mode, the link below may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
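A minimal sketch of that RDD-to-DataFrame plus append-mode idea (the column name line and the output path all_out are my own placeholders, and it assumes a SparkSession or sqlContext is already set up in the notebook):
from pyspark.sql import Row

for i in list:   # 'list' is the file-name list from the question
    rdd = sc.textFile("/user/myname/" + i)
    rdd = really_cool_python_function(rdd)        # the per-file processing from the question
    df = rdd.map(lambda x: Row(line=x)).toDF()    # one string column named 'line'
    # append mode keeps adding to the same output instead of failing because it already exists
    df.write.mode("append").text("/user/myname/all_out")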
Almost the same question is here as well: Spark: Saving RDD in an already existing path in HDFS. But the answer provided is for Scala. I hope something similar can be done in Python as well.
There is yet another (but ugly) approach. Convert your RDD to a string; let the resulting string be resultString. Then use subprocess to append that string to the destination file, i.e.
subprocess.call("echo "+resultString+" | hdfs dfs -appendToFile - <destination>", shell=True)
You can read multiple files and save them with:
textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')
You will get all the part files within the output directory.
If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.
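A rough sketch of that same folder-as-one-input idea, here using Spark's DataFrame text reader rather than Hive itself (the output path is a placeholder, and a SparkSession named spark is assumed to exist):
df = spark.read.text("/user/myname/")                    # every text file in the folder becomes rows of one DataFrame
df.write.mode("overwrite").text("/user/myname/all_out")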
I would try this, it should be fine:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile(f"/user/myname/{i}")
Assuming I have a configuration txt file with this content:
{"Mode":"Classic","Encoding":"UTF-8","Colors":3,"Blue":80,"Red":90,"Green":160,"Shortcuts":[],"protocol":"2.1"}
How can I change a specific value, like "Red":90 to "Red":110, in the file without changing its original format?
I have tried with configparser and configobj, but as they are designed for .INI files I couldn't figure out how to make them work with this custom config file. I also tried splitting the lines and searching for the keywords whose values I wanted to change, but I couldn't save the file the same way it was before. Any ideas how to solve this? (I'm very new to Python)
This looks like JSON, so you could do:
import json

with open("/path/to/jsonfile", "r") as f:
    obj = json.load(f)
obj["Blue"] = 10
with open("/path/to/mynewfile", "w") as f:
    # separators=(',', ':') keeps the compact one-line style of the original file
    json.dump(obj, f, separators=(',', ':'))
But be aware that JSON objects are formally unordered. In current Python versions a dict keeps its insertion order, so json.dump will write the keys back in the order they were read, but other tools are free to reorder them (and normally the order is not needed). JSON lists do keep their order, though.
Here's how you can do it:
import json

with open('config.txt', 'r') as f:
    d = json.loads(f.readline())

d['Red'] = 14
d['Green'] = 15
d['Blue'] = 20

# rebuild the original single-line layout by hand
result = ("{\"Mode\":\"%s\",\"Encoding\":\"%s\",\"Colors\":%s,"
          "\"Blue\":%s,\"Red\":%s,\"Green\":%s,\"Shortcuts\":%s,"
          "\"protocol\":\"%s\"}" % (d['Mode'], d['Encoding'], d['Colors'],
                                    d['Blue'], d['Red'], d['Green'],
                                    d['Shortcuts'], d['protocol']))

with open('config.txt', 'w') as f:
    f.write(result)

print(result)
I've got a really strange problem. I'm trying to read some data from an Excel file, but the nrows property has the wrong value. Although my file has a lot of rows, it just returns 2.
I'm working in PyDev in Eclipse. I don't know what the actual problem is; everything looks fine.
When I try to access the other rows by index manually, I get an IndexError.
I appreciate any help.
If it helps, here's my code:
def get_data_form_excel(address):
    wb = xlrd.open_workbook(address)
    profile_data_list = []
    for s in wb.sheets():
        for row in range(s.nrows):
            if row > 0:
                values = []
                for column in range(s.ncols):
                    values.append(str(s.cell(row, column).value))
                profile_data_list.append(values)
    print str(profile_data_list)
    return profile_data_list
To make sure your file is not corrupt, try with another file; I doubt xlrd is buggy.
Also, I've cleaned up your code to look a bit nicer. For example the if row > 0 check is unneeded because you can just iterate over range(1, sheet.nrows) in the first place.
def get_data_form_excel(address):
    # this returns a generator, not a list; you can iterate over it as normal,
    # but if you need a list, convert the return value to one using list()
    for sheet in xlrd.open_workbook(address).sheets():
        for row in range(1, sheet.nrows):
            yield [str(sheet.cell(row, col).value) for col in range(sheet.ncols)]
or
def get_data_form_excel(address):
    # you can make this function also use a (lazily evaluated) generator instead
    # of a list by changing the brackets to normal parentheses.
    return [
        [str(sheet.cell(row, col).value) for col in range(sheet.ncols)]
        for sheet in xlrd.open_workbook(address).sheets()
        for row in range(1, sheet.nrows)
    ]
After trying some other files I'm sure it's about the file itself, and I think it's related to differences between the Excel 2003 and 2007 formats.
I recently ran into this problem too. I was trying to read an Excel file and the row number given by xlrd.nrows was less than the actual one. As Zeinab Abbasi said, I tried other files and they worked fine.
Finally, I found the difference: there is a VB-script-based button embedded in the failing file, which is used to download and append records to the current sheet.
I then tried to convert the file to .xlsx format, but it asked me to save it in another format with macros enabled, e.g. .xlsm. This time xlrd.nrows gives the correct value.
Is your Excel file using external data? I just had the same problem and found a fix. I was using Excel to get info from a Google Sheet, and I wanted Python to show me that data. The fix for me was going to Data > Connections (in "Get External Data") > Properties and unchecking "Remove data from the external data range before saving the workbook".