I am new to Python, and I am currently writing code to parse an Excel sheet of websites, find the sites that were last modified more than three months ago, and then pull out the names and emails of the contacts at those sites. My problem is that whenever I run the code in my terminal, it only shows me some of the output, so I'd like to export it to a .csv file (or really anything that lets me see all the values), but I'm not sure how.
import pandas as pd

data = pd.read_csv("filename.csv")
data.sort_values("Last change", inplace=True)

# Keep rows whose "Last change" serial date is before the cutoff;
# where() keeps the original shape and fills non-matching rows with NaN
filter1 = data["Last change"] < 44285
data.where(filter1, inplace=True)
print(data)
Note: the 44285 came from converting the dates to integers in Excel so I didn't have to do it in Python. Lazy, I know, but I'm learning.
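For reference, the same cutoff can be computed directly in pandas without pre-converting the dates in Excel. A minimal sketch, assuming the "Last change" column originally held date strings:

import pandas as pd

data = pd.read_csv("filename.csv")

# Parse the column as real datetimes instead of Excel serial numbers
data["Last change"] = pd.to_datetime(data["Last change"])

# Keep the rows last modified more than three months ago
cutoff = pd.Timestamp.today() - pd.DateOffset(months=3)
old_sites = data[data["Last change"] < cutoff]
print(old_sites)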
You can try converting it to a csv.
data.to_csv('data.csv')
Alternatively, if you want to just view more records, for example 50, you could do this:
print(data.head(50))
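If you want the terminal to print every row instead of a truncated preview, you can also raise pandas' display limit; a small sketch:

import pandas as pd

# Tell pandas not to truncate rows when printing a DataFrame
pd.set_option("display.max_rows", None)
print(data)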
If you can share your parser code as well, I think we can save you the hassle of editing the Excel file in between. Or maybe something from here with a few more lines.
To solve your problem, use
data.to_csv("resultfile.csv")
or, if you want an Excel file,
data.to_excel("resultfile.xlsx")
I am using the below code to rename one of the column names in a CSV file.
input_filepath ='/<path_to_file>/2500.csv'
df_csv = spark.read.option('header', True).csv(input_filepath)
df1 = df_csv.withColumnRenamed("xyz", "abc")
df1.printSchema()
So, the above code works fine. However, I also wanted to convert the CSV to Parquet format. If I am correct, the above code makes the changes in memory and not in the actual file. Please correct me if I am wrong.
If the changes are kept in memory, then how can I write the changes to a Parquet file?
For the file format conversion, I will be using the below code snippet:
df_csv.write.mode('overwrite').parquet()
But I am not able to figure out how to use it in this case. Please suggest.
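For reference, a minimal sketch of what the write could look like for the renamed DataFrame df1; the output path is only a placeholder:

# Write the renamed DataFrame out as Parquet (the path is just an example)
output_path = '/<path_to_output>/2500_parquet'
df1.write.mode('overwrite').parquet(output_path)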
Note: I am using a Databricks notebook for all the above steps.
Hi, I am not 100% sure if that will solve your issue, but can you try to save your CSV like this:
df.to_parquet(PATH_WHERE_TO_STORE)
Let me know if that helped.
Edit: My workflow usually goes like this:
1) Export the dataframe as CSV.
2) Check visually if everything is correct.
3) Run this function:
import pandas as pd

def convert_csv_to_parquet(path_to_csv):
    df = pd.read_csv(path_to_csv)
    df.to_parquet(path_to_csv.replace(".csv", ".parq"), index=False)
Which goes in the direction of Rakesh Govindulas' comment.
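For reference, calling it might look like this; the path is only an example, and to_parquet needs a Parquet engine such as pyarrow or fastparquet installed:

# Example call; writes /<path_to_file>/2500.parq next to the CSV
convert_csv_to_parquet('/<path_to_file>/2500.csv')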
So I have a CSV file with a column called reference_id. The values in reference_id are 15 characters long, something like '162473985649957'. When I open the CSV file, Excel has changed the datatype to General and the numbers display as something like '1.62474E+14'. To fix this in Excel, I change the column type to Number and remove the decimals, and it displays the correct value. I should add, it only does this with the CSV file; if I output to xlsx, it works fine. Problem is, the file has to be CSV.
Is there a way to fix this using Python? I'm trying to automate a process. I have tried using the following to convert it to a string. It works in the sense that it converts the column to a string, but it still shows up incorrectly in the CSV file.
df['reference_id'] = df['reference_id'].astype(str)
df.to_csv(r'Prev Day Branch Transaction Mems.csv')
Thanks
"When I open the CSV file, Excel has changed the datatype to General"
This is an Excel problem. You can't fix how Excel decides to interpret your CSV. (You can work around some issues by using the text import format, but that's cumbersome.)
Either use XLS/XLSX files when working with Excel, or use e.g. Gnumeric or something else that doesn't wantonly mangle your data.
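That said, one workaround people sometimes use is to wrap the value in an Excel text formula so Excel keeps it as text when it opens the CSV. A sketch, with an assumed input file name, and with the caveat that other tools reading the CSV will see the formula rather than the bare number:

import pandas as pd

# Read the column as a string so pandas never turns it into a float
df = pd.read_csv('input.csv', dtype={'reference_id': str})

# ="..." makes Excel display the value as text instead of scientific notation
df['reference_id'] = '="' + df['reference_id'] + '"'
df.to_csv('Prev Day Branch Transaction Mems.csv', index=False)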
My issue is as follows.
I've gathered some contact data from SurveyMonkey using the SM API, and I've converted that data into a txt file. When opening the txt file, I see the full data from the survey that I'm trying to convert into CSV; however, when I use the following code:
df = pd.read_csv("my_file.txt",sep =",", encoding = "iso-8859-10")
df.to_csv('my_file.csv')
It creates a CSV file with only two lines of values (and cuts off in the middle of the second line). Similarly, if I try to organize the data within a pandas dataframe, it only registers the first two lines, meaning most of my txt file is not being read or registered.
As I've never run into this problem before and I've been able to convert into CSV without issues, I'm wondering if anyone here has ideas as to what might be causing this issue to occur and how I could go about solving it?
All help is much appreciated.
Edit:
I was able to get the data to display properly in CSV when I converted it directly into CSV from JSON instead of converting it to a txt file first. I was not, however, able to figure out what went wrong in the conversion from txt to CSV, as I tried multiple different encodings but came to the same result.
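For anyone hitting the same thing, a minimal sketch of going straight from the JSON response to CSV with pandas; the file names and the 'data' key are assumptions about the SurveyMonkey payload:

import json
import pandas as pd

with open('survey_response.json', encoding='utf-8') as f:
    payload = json.load(f)

# Flatten the nested JSON records into a flat table
df = pd.json_normalize(payload['data'])
df.to_csv('my_file.csv', index=False)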
I'm a real beginner in Python, and I was asked to use it to retrieve some data.
I managed to get the data, but now I need to save it in an Excel tab or a CSV file that can be used later on.
The data I have were in this format:
2005-02-04T01:00:00+02:00,1836.910000#2005-02-05T01:00:00+02:00
I managed to do this to organize them better:
import dateutil.parser
import numpy as np
import pandas as pd
# Split the '#'-separated records into (date, value) pairs
date_value = np.array([(dateutil.parser.parse(d), float(v))
                       for d, v in [l.split(',') for l in values.text.split('#')]])
ts = pd.Series(date_value[:, 1], index=date_value[:, 0])
print(ts)
and now I have them in this format:
2005-02-04 01:00:00+02:00 1836.91
2005-02-05 01:00:00+02:00 1821.45
And now I can't find a way to store them as an excel or csv file.
If you have any advice...?
Thanks
G.
Since you seem to have quickly figured out how to use pandas to read data, the next step is to use it to write CSV, and that's done with:
pandas.DataFrame.to_csv
Excel is perfectly able to read CSV files, so there is no need to convert to Excel yourself. However, if this is really a project requirement:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html
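A minimal sketch applied to the ts Series from the question; the file names are just examples, and to_excel needs openpyxl or xlsxwriter installed:

# Write the Series straight to CSV
ts.to_csv('values.csv', header=['value'])

# Or, if an Excel file really is required
ts.to_frame(name='value').to_excel('values.xlsx')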
Hi, I'm relatively new to Python and was hoping some of you could provide advice on templating matters.
I've managed to parse an excel file, made a dataframe out of the data (using xl.parse, .loc, str.contains, str.split, sort_index etc. methods) and output it into another excel file like so:
(Screenshot: Excel doc with dataframe)
I'm stuck at formatting - adding borders, bolding certain rows of strings (not necessarily in the same position between 2 different output files), highlighting certain cells with color, etc.
I have a template (a Word doc) which I have to follow, like so: (Screenshot: Format to replicate)
I'm considering two ways to go about this:
1) Replicate the formatting from scratch through python (either as an excel or word doc)
2) Write the raw data from the output excel file to the word doc with the template
It'd be great if someone could enlighten me on which way is more efficient, and what libraries, methods/functions I can look into to get the job done.
Thank you!
I recommend using xlsxwriter. You can add borders with code like this:
import xlsxwriter

workbook = xlsxwriter.Workbook('formatted.xlsx')
worksheet = workbook.add_worksheet()

# A cell format with a thin continuous border on all four sides
border = workbook.add_format({'border': 1})

# Apply it when writing a cell...
worksheet.write('B2', 'some value', border)

# ...or to a whole block of already-written cells, via a conditional
# format that matches every non-error cell in the range
worksheet.conditional_format('B2:F20', {'type': 'no_errors', 'format': border})
You can bold a row this way:
# Bold the last row of the block above (rows are zero-indexed;
# None keeps the default row height)
bold = workbook.add_format({'bold': True})
worksheet.set_row(19, None, bold)
You can set the background color of a cell like this:
format = workbook.add_format()
format.set_pattern(1)  # This is optional when using a solid fill.
format.set_bg_color('green')
worksheet.write('A1', 'Ray', format)

workbook.close()  # Close the workbook so the file is actually written
For writing to Word documents you can use python-docx, with an example of how to do that here: http://pbpython.com/python-word-template.html
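A minimal sketch of filling a Word document with python-docx; the template name, the table layout, and the records list are only assumptions standing in for your dataframe:

from docx import Document

# Hypothetical rows; in practice these would come from the dataframe
records = [("Item A", "100"), ("Item B", "250")]

doc = Document('template.docx')   # open the existing Word template
doc.add_heading('Report', level=1)

table = doc.add_table(rows=1, cols=2)
hdr = table.rows[0].cells
hdr[0].text, hdr[1].text = 'Item', 'Value'

for item, value in records:
    cells = table.add_row().cells
    cells[0].text, cells[1].text = item, value

doc.save('report.docx')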
There are a few good ways to do this. I typically take one of the following two approaches:
1) XlsxWriter: this package has support for changing the formatting of Excel files. So my workflow would be to export to Excel using pandas in Python, then, once the data is in the Excel file, manipulate the formatting with XlsxWriter. Pandas and XlsxWriter play well together, as you can see from this demo; a small sketch of that pattern follows this list.
2) For some workflows I found the amount/type of formatting I wanted to do in Excel was just not reasonable to do with XlsxWriter. In those cases the best bet is to put your data in something that's NOT Excel and then link Excel to it. One easy approach is dumping the data to a CSV and then linking your well-formatted Excel file to the CSV. You could also push the data into a database with pandas and then have the Excel file pull data from the DB.
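A minimal sketch of approach 1), writing with pandas and then formatting through the underlying XlsxWriter objects; the dataframe, file name, and sheet name are just examples:

import pandas as pd

df = pd.DataFrame({"site": ["a.com", "b.com"], "contact": ["x", "y"]})

with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Report", index=False)

    # Grab the XlsxWriter workbook/worksheet to apply formatting
    workbook = writer.book
    worksheet = writer.sheets["Report"]

    bold = workbook.add_format({"bold": True})
    worksheet.set_row(0, None, bold)     # bold the header row
    worksheet.set_column("A:B", 20)      # widen the data columns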