DateTime when saving pandas dataframe to CSV - python

Background: Apparently Google doesn't have a straight answer to a very basic question, so here goes...
I have a pandas df with an Open Date column [Dtype = object] which (when previewing the df) is formatted yyyy-mm-dd, which is the format I want, great! Not so great, however, when I write the df to a .csv, which then defaults the formatting to m/dd/yyyy.
Issue: I have tried just about everything for the .csv to output yyyy-dd-mm to no avail.
What I've tried:
I have tried specifying a date format when writing the .csv
df.to_csv(filename, date_format="%Y%d%d")
I have tried changing the format of the column in question, prior to writing to a .csv
df['Open Date'] = pd.to_datetime(df['Open Date'])
I have also tried converting the column to a string, to try and force the correct output
df['Open Date'] = df['timestamp'].apply(lambda v: str(v))
Despite these attempts, I still get a m/dd/yyyy output.
Help: where am I embarrassingly going wrong here?

Your question contains various breaking typos, which may hint at what is causing the problem in general.
There are a few issues with what you are saying. Consider:
from pandas import DataFrame
from datetime import datetime
# just some example data, including some datetime and string data
data = [
    {'Open date': datetime(2022, 3, 22, 0, 0), 'value': '1'},
    {'Open date': datetime(2022, 3, 22, 0, 1), 'value': '2'},
    {'Open date': datetime(2022, 3, 22, 0, 2), 'value': '3'},
]
df = DataFrame(data)
# note how the 'Open date' column is actually a `datetime64[ns]`
# the 'value' column, however, is the `object` dtype you say you're getting
print(df['Open date'].dtype, df['value'].dtype)
# saving with a silly format, to show it works:
df.to_csv('test.csv', date_format='%Y.%m.%d')
The resulting file:
,Open date,value
0,2022.03.22,1
1,2022.03.22,2
2,2022.03.22,3
I picked a silly format because the default format for me is actually %Y-%m-%d.
The most likely issue is that your 'date' column is actually a string column, but the tools you are using to 'preview' your data are interpreting these strings as dates and actually showing them in some other format.
However, with the limited information you provided, it's guesswork. If you provide some example data that demonstrates the issue, it would be easier to say for sure.
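If that is the case, a minimal sketch of the usual fix, assuming 'Open Date' really is a string column (the output filename here is just a placeholder):
import pandas as pd

# assumption: 'Open Date' currently holds strings; parse it to real datetimes first
df['Open Date'] = pd.to_datetime(df['Open Date'])

# with a datetime64 column, date_format is actually applied on write
df.to_csv('out.csv', index=False, date_format='%Y-%m-%d')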

Related

Detecting Excel column data types in Python Pandas

New to Python and Pandas here. I am trying to read an Excel file off of S3 (using boto3), read the headers (first row of the spreadsheet), and determine what data type each column under those headers holds, if this is possible to do. If it is, I need a map of key-value pairs where each key is the header name and the value is its data type. So, for example, if the file I fetch from S3 has the following data in it:
Date,Name,Balance
02/01/2022,Jerry Jingleheimer,45.07
02/14/2022,Jane Jingleheimer,102.29
Then I would be looking for a map of KV pairs like so:
Key 1: "Date", Value 1: "datetime" (or whatever is the appropriate date type)
Key 2: "Name", Value 2: "string" (or whatever is the appropriate date type)
Key 3: "Balance", Value 3: "numeric" (or whatever is the appropriate date type)
So far I have:
s3Client = boto3.client('s3')
obj = s3Client.get_object(Bucket="some-bucket", Key="some-key")
file_headers = pd.read_excel(io.BytesIO(obj['Body'].read()), engine="openpyxl").columns.tolist()
I'm just not sure about how to go about extracting the data types that Pandas has detected or how to generate the map.
Can anyone point me in the right direction please?
IIUC, you can use dtypes:
>>> df.dtypes.to_dict()
{'Date': dtype('<M8[ns]'), 'Name': dtype('O'), 'Balance': dtype('float64')}
>>> {k: v.name for k, v in df.dtypes.to_dict().items()}
{'Date': 'datetime64[ns]', 'Name': 'object', 'Balance': 'float64'}
I suggest you check this pandas tutorial.
The pandas.read_excel('my_file.xlsx').dtypes should give you the types of the columns.
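Putting it together with the S3 read, a rough sketch (the bucket and key are placeholders, and I'm assuming a boto3 S3 client here):
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket="some-bucket", Key="some-key")
df = pd.read_excel(io.BytesIO(obj['Body'].read()), engine="openpyxl")

# map each header to the dtype pandas inferred for that column
header_types = {col: dtype.name for col, dtype in df.dtypes.items()}
print(header_types)  # e.g. {'Date': 'datetime64[ns]', 'Name': 'object', 'Balance': 'float64'}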

How to give/convert values inside a dictionary in a json file to timestamp values

import requests
r=requests.get('https://www.ercot.com/api/1/services/read/dashboards/todays-outlook.json?_=1645233254068')
loaded=r.json()
I am scraping a dashboard from a website (https://www.ercot.com/gridmktinfo/dashboards/supplyanddemand) and the data there is in JSON format. So now I am fetching the JSON file and storing it. But in that JSON file the timestamp comes through like this:
{'capacity': 51768,
 'demand': 44863,
 'forecast': 1,
 'dstFlag': 0,
 'interval': 0,
 'hourEnding': 20},
{'capacity': 51941,
 'demand': 44902,
 'forecast': 1,
 'dstFlag': 0,
 'interval': 5,
 'hourEnding': 20},
... and so on
So here I am getting what I wanted, like capacity and demand. However, I also need the timestamp, but the time is spread across different keys, interval and hourEnding. I also want to remove dstFlag and forecast from the data, as they are of no use. So is there any way to combine those keys into a timestamp?
And I want to save this JSON data as a CSV at the end.
Seems like you're looking for something like this:
from datetime import datetime

updated = datetime.fromisoformat(loaded["lastUpdated"])

for d in loaded["data"]:
    capacity = d['capacity']
    demand = d['demand']
    is_forecast = d['forecast'] == 1
    minute = d['interval']
    hour = d['hourEnding']
    ts = datetime(updated.year, updated.month, updated.day, hour=hour, minute=minute)
    # TODO: print csv output
Keep in mind that DST does matter for timestamp creation.
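To fill in the TODO, one possible way to write the CSV while dropping dstFlag and forecast (the file and column names are my own choice, and an hourEnding of 24 would need special handling):
import csv
from datetime import datetime

# 'loaded' is the parsed JSON from the question above
updated = datetime.fromisoformat(loaded["lastUpdated"])

with open('ercot_outlook.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'capacity', 'demand'])
    for d in loaded["data"]:
        ts = datetime(updated.year, updated.month, updated.day,
                      hour=d['hourEnding'], minute=d['interval'])
        writer.writerow([ts.isoformat(), d['capacity'], d['demand']])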

Reconstruct / explode list array into multiple rows for output to csv

I have a bunch of tasks to distribute evenly across a date range.
The task lists always contain 5 elements, excluding the final chunk, which will vary between 1 and 5 elements.
The process I've put together outputs the following data structure:
[{'Project': array([['AAC789A'],
        ['ABL001A'],
        ['ABL001D'],
        ['ABL001E'],
        ['ABL001X']], dtype=object), 'end_date': '2020-10-01'},
 {'Project': array([['ACZ885G_MA'],
        ['ACZ885H'],
        ['ACZ885H_MA'],
        ['ACZ885I'],
        ['ACZ885M']], dtype=object), 'end_date': '2020-10-02'},
 {'Project': array([['IGE025C']], dtype=object), 'end_date': '2020-10-03'}]
...but I really need the following format...
Project,end_date
AAC789A,2020-10-01
ABL001A,2020-10-01
ABL001D,2020-10-01
ABL001E,2020-10-01
ABL001X,2020-10-01
ACZ885G_MA,2020-10-02
ACZ885H,2020-10-02
ACZ885H_MA,2020-10-02
ACZ885I,2020-10-02
ACZ885M,2020-10-02
IGE025C,2020-10-03
I've looked at repeating and chaining using itertools, but I don't seem to be getting anywhere with it.
This is my first time working heavily with Python. How would this typically be accomplished in Python?
This is how I'm currently attempting to do this, but I get the error below.
df = pd.concat([pd.Series(row['end_date'], row['Project'].split(','))
                for _, row in df.iterrows()]).reset_index()

AttributeError: 'numpy.ndarray' object has no attribute 'split'
Here is a solution using NumPy's flatten method:
import pandas as pd
import numpy as np

data = [{'Project': np.array([['AAC789A'],
                              ['ABL001A'],
                              ['ABL001D'],
                              ['ABL001E'],
                              ['ABL001X']], dtype=object), 'end_date': '2020-10-01'},
        {'Project': np.array([['ACZ885G_MA'],
                              ['ACZ885H'],
                              ['ACZ885H_MA'],
                              ['ACZ885I'],
                              ['ACZ885M']], dtype=object), 'end_date': '2020-10-02'},
        {'Project': np.array([['IGE025C']], dtype=object), 'end_date': '2020-10-03'}]

clean = lambda di: {'Project': di['Project'].flatten(), 'end_date': di['end_date']}
result = pd.concat([pd.DataFrame(clean(d)) for d in data])
result is a DataFrame that can be exported to CSV. It contains the following:
Project,end_date
AAC789A,2020-10-01
ABL001A,2020-10-01
ABL001D,2020-10-01
ABL001E,2020-10-01
ABL001X,2020-10-01
ACZ885G_MA,2020-10-02
ACZ885H,2020-10-02
ACZ885H_MA,2020-10-02
ACZ885I,2020-10-02
ACZ885M,2020-10-02
IGE025C,2020-10-03
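To write it out to a file without the default integer index, presumably something like:
result.to_csv('projects.csv', index=False)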
I found an answer that met my need. See link below - MaxU's answer served me best.
Using his explode method, I was able to accomplish my goal with one line of code.
df2 = explode(df.assign(var1=df.Project.str.split(',')), 'Project')
Split (explode) pandas dataframe string entry to separate rows
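For reference, newer pandas versions ship a built-in DataFrame.explode that does the same job; a minimal sketch using the data above, after flattening each array down to a plain list:
import pandas as pd

df = pd.DataFrame([
    {'Project': ['AAC789A', 'ABL001A', 'ABL001D', 'ABL001E', 'ABL001X'], 'end_date': '2020-10-01'},
    {'Project': ['ACZ885G_MA', 'ACZ885H', 'ACZ885H_MA', 'ACZ885I', 'ACZ885M'], 'end_date': '2020-10-02'},
    {'Project': ['IGE025C'], 'end_date': '2020-10-03'},
])

# each list element becomes its own row, with end_date repeated alongside it
out = df.explode('Project')
out.to_csv('projects.csv', index=False)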

Flatten list of json objects into table with column for each object in Databricks

I have a json file that looks like this
[
{"id": 1,
"properties":[{"propertyname":"propertyone",
"propertyvalye": 5},
"propertyname":"properttwo",
"propertyvalye": 7}]},
{"id": 2,
"properties":[{"propertyname":"propertyone",
"propertyvalye": 3},
"propertyname":"properttwo",
"propertyvalye": 8}]}]
I was able to load the file in Databricks and parse it, getting a column called properties that contains the array in the data. The next step is to flatten this column and get one column for each object in the array, named after propertyname and holding the corresponding value. Is there any native way of doing this in Databricks?
Most JSON structures I have worked with in the past are of a {name: value} format, which is straightforward to parse, but the format I'm dealing with is giving me some headaches.
Any suggestions? I would prefer to use built-in functionality, but if there is a way of doing it in Python I can also write a UDF.
EDIT: This is the output I am looking for.
Write the sample data to storage:
data = """
{"id": 1, "properties":[{"propertyname":"propertyone","propertyvalue": 5},{"propertyname":"propertytwo","propertyvalue": 7}]},
{"id": 2, "properties":[{"propertyname":"propertyone","propertyvalue": 3},
{"propertyname":"propertytwo","propertyvalue": 8}]}
"""
dbutils.fs.put(inputpath + "/x.json", data, True)
Read the json data:
df = spark.read.format("json").load(inputpath)
Explode the array so that each property becomes its own row:
from pyspark.sql.functions import explode

dfe = df.select("id", explode("properties").alias("p")) \
        .select("id", "p.propertyname", "p.propertyvalue")
Finally, with pivot, you get the key-value pairs as columns:
display (dfe.groupby('id').pivot('propertyname').agg({'propertyvalue': 'first'}))
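With the sample data above, that pivot should come out roughly as:
id  propertyone  propertytwo
1   5            7
2   3            8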
See also the examples in this Notebook for how to implement transformations on complex datatypes.

Manipulate data in dictionary-column from TSV

I have a TSV file where one of the columns is in a dictionary-like format.
Example of the headers and one row (notice the string quotes around the Preferences column):
Name, Age, Preferences
Nick, 18, "[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]"
To read the file into Python:
df = pd.read_csv('search_data_assessment.tsv',delimiter='\t')
To remove the surrounding string quotes from "Preferences", I used ast.literal_eval:
df["Preferences"] = ast.literal_eval(df["Preferences"])
This raises "ValueError: malformed node or string: 0", but it seems to do the trick.
The question: How can I check all rows and look for "FavoriteNumber" in Preferences, and if it == 72, change it to 100 (arbitrary example)?
You can use pd.Series.apply with a custom function. Just note this is bordering on abuse of pandas: it isn't designed to hold lists of dictionaries in a Series, and here you are running a loop in a particularly inefficient way.
from ast import literal_eval

import pandas as pd

df = pd.DataFrame([['Nick', 18, '[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]']],
                  columns=['Name', 'Age', 'Preferences'])

def updater(x):
    if x[0]['FavoriteNumber'] == '72':
        x[0]['FavoriteNumber'] = '100'
    return x

df['Preferences'] = df['Preferences'].apply(literal_eval)
df['Preferences'] = df['Preferences'].apply(updater)
print(df['Preferences'].iloc[0])
[{'Hobby': 'Football', 'Food': 'Pizza', 'FavoriteNumber': '100'}]
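If a row ever holds more than one preference dict (my assumption; the example only has one), a variant of updater that walks the whole list would look like:
def updater_all(prefs):
    # check every dict in the Preferences list, not just the first
    for p in prefs:
        if p.get('FavoriteNumber') == '72':
            p['FavoriteNumber'] = '100'
    return prefs

df['Preferences'] = df['Preferences'].apply(updater_all)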
