I have a CSV full of tweets containing a few headers. Among them, for some unknown reason, the date format changes midway from %Y-%m-%d to %d/%m/%Y as shown in the image below.
This makes it difficult when trying to export it into another program e.g. Matlab. I'm attempting to solve this in Python, but any other solution would be great.
I've attempted multiple solutions from googling around, mainly passing a date format when reading the CSV, datetime.strptime, and others. I'm very new to Python, so I'm sorry if I'm a bit clueless.
I'm looking to standardise all the dates, e.g. changing the %d/%m/%Y entries to the other format, while keeping each row separate.
I'm thinking of following the approach described here, but adding an if statement that checks which format a date is in. How would I go about breaking the date down and changing it?
This might work but I'm too lazy to check it against an image of a CSV file.
import pandas as pd
# Put all the formats into a list
possible_formats = ['%Y-%m-%d', '%d/%m/%Y']
# Read in the data
data = pd.read_csv("data_file.csv")
date_column = "date"
# Parse the dates in each format and stash them in a list
fixed_dates = [pd.to_datetime(data[date_column], errors='coerce', format=fmt) for fmt in possible_formats]
# Anything we could parse goes back into the CSV
data[date_column] = pd.NaT
for fixed in fixed_dates:
    data.loc[~pd.isnull(fixed), date_column] = fixed[~pd.isnull(fixed)]
data.to_csv("new_file.csv")
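For comparison, here is a minimal sketch of the "if it recognises a certain format" idea from the question, using only the standard library's datetime.strptime instead of pandas. The helper name normalise_date and the sample values are made up for illustration; it assumes the two formats listed above are the only ones in the file.

```python
from datetime import datetime

# The two formats mentioned in the question; extend this tuple if more show up.
POSSIBLE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalise_date(text, out_format="%Y-%m-%d"):
    """Return text rewritten in out_format, or None if no known format matches."""
    for fmt in POSSIBLE_FORMATS:
        try:
            # strptime raises ValueError when the text does not match fmt
            return datetime.strptime(text.strip(), fmt).strftime(out_format)
        except ValueError:
            continue
    return None


# Hypothetical sample values standing in for the CSV's date column
rows = ["2019-03-01", "02/03/2019", "not a date"]
normalised = [normalise_date(r) for r in rows]
print(normalised)
```

Values that match neither format come back as None, so they can be reviewed by hand rather than silently guessed.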
My issue is as follows.
I've gathered some contact data from SurveyMonkey using the SM API, and I've converted that data into a txt file. When opening the txt file, I see the full data from the survey that I'm trying to convert into csv, however when I use the following code:
df = pd.read_csv("my_file.txt",sep =",", encoding = "iso-8859-10")
df.to_csv('my_file.csv')
It creates a CSV file with only two lines of values (and cuts off in the middle of the second line). Similarly, if I try to organize the data within a pandas DataFrame, it only registers the first two lines, meaning most of my txt file is not being read.
As I've never run into this problem before and I've been able to convert into CSV without issues, I'm wondering if anyone here has ideas as to what might be causing this issue to occur and how I could go about solving it?
All help is much appreciated.
Edit:
I was able to get the data to display properly in CSV when I converted it directly from JSON instead of converting it to a txt file first. However, I was not able to figure out what went wrong in the conversion from txt to CSV, as I tried multiple different encodings but got the same result.
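A rough sketch of what going straight from JSON to CSV can look like, assuming the API response is a list of (possibly nested) records. The field names below are invented for illustration and are not the real SurveyMonkey schema.

```python
import json

import pandas as pd

# Hypothetical payload; real SurveyMonkey responses are structured differently.
raw_json = """
[
  {"respondent_id": 1, "contact": {"name": "Ann", "email": "ann@example.com"}},
  {"respondent_id": 2, "contact": {"name": "Bob", "email": "bob@example.com"}}
]
"""

records = json.loads(raw_json)

# json_normalize flattens nested dicts into dotted column names
df = pd.json_normalize(records)

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

Skipping the intermediate txt file means commas and line breaks inside field values are quoted by the CSV writer instead of being misread as separators.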
I am attempting to convert MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is through xlwings and pandas.
When I do this (below code) I get a mostly correct .csv file. The only issue is that some columns are appearing as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))
test_dataframe = data.options(pd.DataFrame, header=True).value
test_dataframe.columns = list(schema)
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = Â?, 2 = #, 3 = #, etc.).
Below is the first part of the 'dictionary' as to how they map:
My question is this:
Is there an encoding that I can use to turn these series of symbols into their correct representation? The floats aren't the only column affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
MapInfo .dat files are dBase files underneath (see https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so just use that method.
Then just output the data:
df.to_csv('outpath/filename.csv')
EDIT
If I understand correctly, you are using xlwings to load the .dat file into Excel, and then reading it into a pandas DataFrame to export it to a CSV file.
Somewhere along the way it does seem that some binary data is not interpreted, or is interpreted incorrectly, and is then written as text to your CSV file.
directly read dBase file
My first suggestion would be to try to read the input file directly into Python without the use of an excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. You can parse these in Python using a library like dbfread.
inspect data before writing to csv
Secondly, I would inspect the 'corrupted' columns in python instead of immediately writing them to disk.
Either something is going wrong in the Excel import and the data in these columns gets imported as text instead of some binary number format,
or this data is correctly read into memory as a byte array (instead of a float), and when you write it to CSV it just gets dumped to disk byte by byte instead of being interpreted as a number and written out as its text representation.
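One way to tell those two cases apart is to look at the Python type of every value in a suspicious column before writing anything to disk. This is only a sketch: the DataFrame below is invented stand-in data, not the real cable table.

```python
import pandas as pd

# Invented stand-in for the loaded table: one clean column, one mixed column
df = pd.DataFrame(
    {
        "length": [1.5, 2.0],
        "diameter": [b"\x00\x00\x08@", "abc"],
    }
)

# For every column, count how many values there are of each Python type
type_counts = {
    col: df[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in df.columns
}
print(type_counts)
```

A column reporting str or bytes where you expected float is the one to trace back through the import.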
note
Small remark about your initial question regarding mapping text to numbers:
It probably will not be possible to create a straightforward map of characters to numbers:
These numbers could have any encoding and might not be stored as decimal text values, as you now seem to assume.
These text representations are just a decoding using some character encoding (UTF-8, UTF-16). E.g. in UTF-8 several bytes might map to one character, and the question marks or squares you see might indicate that one or more bytes could not be decoded.
In any case you will lose information if you start from the text; you must start from the binary data to decode.
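To illustrate why the text is a lossy starting point, here is a small standard-library sketch. It assumes, purely for illustration, that a value is stored as a little-endian 8-byte IEEE 754 double; the actual layout of the affected MapInfo columns is not known from the question.

```python
import struct

value = 0.1

# Assumed storage layout for this illustration: little-endian 8-byte double
raw = struct.pack("<d", value)

# Decoding the bytes as text gives the kind of gibberish seen in the CSV;
# bytes that are not valid UTF-8 are replaced by U+FFFD
as_text = raw.decode("utf-8", errors="replace")

# Going back from the text does not reproduce the original bytes
round_tripped = as_text.encode("utf-8")

# Interpreting the original bytes as a number recovers the value exactly
recovered = struct.unpack("<d", raw)[0]

print(repr(as_text), round_tripped == raw, recovered)
```

Once the undecodable bytes have been replaced, no character-to-number dictionary can bring them back, which is why the binary file is the thing to parse.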
I have a .xlsx file that I want to convert to .csv file. I have done a demo file as shown in the screenshot. In the .xlsx file, I have 3 sheets and I want to keep the last sheet only. In addition, I want to preserve my dates in a MM/DD/YYYY format.
I found a few solutions here and there on converting and then dropping sheets, or vice versa. The closest I have come is using the solution from this link:
But it doesn't keep the MM/DD/YYYY date format and instead converts the dates to numbers, e.g. 44079. I tried searching for a way to convert the numbers back to dates but found nothing.
Can anyone help me with this? I can provide more clarification if needed.
I am coding in Python.
I solved my own question by using the answer from "Python using pandas to convert xlsx to csv file. How to delete index column?"
In addition, the date in the converted .csv file came out in a format I didn't want, e.g.
05-09-2020 00:00:00
I used pandas to load the converted CSV file into a DataFrame. From there I used df['date made'] = pd.to_datetime(df['date made']) to convert the date from an object to datetime. After that I used df['date made'] = df['date made'].dt.strftime('%m/%d/%Y') to get
09/05/2020
which is the date format I wanted. I repeated the steps for Date Due as well.
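A minimal sketch of that conversion, assuming the intermediate strings are day-first (as the 05-09-2020 to 09/05/2020 example suggests). Passing an explicit format is my addition: it avoids relying on pandas guessing whether the day or the month comes first. The second sample row is made up.

```python
import pandas as pd

# Sample values in the shape shown above; the second row is invented
df = pd.DataFrame({"date made": ["05-09-2020 00:00:00", "17-11-2020 00:00:00"]})

# Parse as day-month-year (assumption based on the example output)
parsed = pd.to_datetime(df["date made"], format="%d-%m-%Y %H:%M:%S")

# Write back as MM/DD/YYYY text
df["date made"] = parsed.dt.strftime("%m/%d/%Y")
print(df)
```

The same two lines can be repeated for the Date Due column.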
Hope this helps those who are looking to convert .xlsx to .csv and do some formatting on the date.
When I run the following code
import glob,os
import pandas as pd
dirpath = os.getcwd()
inputdirectory = dirpath
for xls_file in glob.glob(os.path.join(inputdirectory,"*.xls*")):
    data_xls = pd.read_excel(xls_file, sheet_name=0, index_col=None)
    csv_file = os.path.splitext(xls_file)[0]+".csv"
    data_xls.to_csv(csv_file, encoding='utf-8', index=False)
It will convert all xls files in the folder into CSV as I want.
HOWEVER, on doing so, any dates such as 20/12/2018 will be converted to 20/12/2018 00:00:00 which is causing major issues with later data processing.
What is going wrong with this?
Nothing is "going wrong" per se. You simply need to provide a custom date_format to df.to_csv:
date_format : string, default None
Format string for datetime objects
In your case that would be
data_xls.to_csv(csv_file, encoding='utf-8', index=False, date_format='%d/%m/%Y')
This will fix the way the raw data is saved to the file. If you open the file in Excel you may still see it using the full format, because Excel tries to infer the cell formats from their content. You will need to right-click the column and select another cell format; there is nothing that pandas or Python can do about that (as long as you are using to_csv and not to_excel).
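A small in-memory check of the date_format argument, using made-up data rather than the real spreadsheet:

```python
import pandas as pd

# Made-up frame with a real datetime column, as read_excel would produce
df = pd.DataFrame(
    {
        "name": ["walla"],
        "date": pd.to_datetime(["2018-12-20"]),
    }
)

# With no path given, to_csv returns the CSV text instead of writing a file
formatted_csv = df.to_csv(index=False, date_format="%d/%m/%Y")
print(formatted_csv)
```

Inspecting the returned text like this shows exactly what will land in the file, independent of how Excel later chooses to display it.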
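If the above answers still don't work, try this:
import pandas as pd
# %Y matches a four-digit year such as 2018 (%y would only match two digits)
xls_data['date'] = pd.to_datetime(xls_data['date'], format="%d/%m/%Y")
# Keep only the date part, dropping the 00:00:00 time
xls_data['date'] = xls_data['date'].dt.date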
The original xls file is actually storing these fields as datetime.
When you open it with Excel, you see it formatted the way Excel thinks you want to see it, based on your settings, OS locale, etc.
When Python reads the file, the date cells become Python datetime objects.
CSV files are basically just text; they cannot hold datetime objects.
When Python needs to write a datetime object to a text file, it writes the full text representation.
So you have two options:
Change the date column in the original file to text type,
or, the better option:
Use Python to iterate over these fields and convert them to the text format you would like to see in the CSV.
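The second option could look roughly like the sketch below, which converts every datetime column to text before writing. The data and the chosen output format are assumptions for illustration.

```python
import pandas as pd

# Made-up frame standing in for what read_excel returns
df = pd.DataFrame(
    {
        "name": ["walla", "cool"],
        "date": pd.to_datetime(["1988-12-10", "1999-12-10"]),
        "count": [1, 2],
    }
)

# Turn each datetime column into plain text in the desired format;
# non-datetime columns are left untouched
for col in df.select_dtypes(include=["datetime"]).columns:
    df[col] = df[col].dt.strftime("%d/%m/%Y")

csv_text = df.to_csv(index=False)
print(csv_text)
```

Because the columns are already strings when to_csv runs, nothing is left for pandas to expand into a full timestamp.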
I just tried to reproduce your issue with no success:
>>> import pandas as pd
>>> xls_data = pd.read_excel('test.xls', sheet_name=0, index_col=None)
>>> xls_data
    name       date
0  walla 1988-12-10
1   cool 1999-12-10
>>> xls_data.to_csv(encoding='utf-8', index=False)
'name,date\nwalla,1988-12-10\ncool,1999-12-10\n'
P.S. Any time you deal with datetime objects, you should test the result to see if anything changes based on your PC's locale settings.
My objective: convert a DataFrame to HTML, which is sent as a daily email.
Current method: converting the df to CSV and then to HTML.
Problem: I created my df with as_index=True, but when I save it to a CSV this formatting is lost:
Example DataFrame:
Now when I save this df using to_csv(), the formatting of the index is lost (meaning ABC is written three times down the index instead of once, as I want it).
I want the CSV to have the same formatting. Is that possible?
Use pandas' DataFrame.to_html() directly instead of going through CSV.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_html.html
Hope it can help you.
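A quick sketch of the difference, using an invented two-level index in place of the DataFrame from the question. to_html() merges repeated outer index labels by default (its sparsify behaviour), while CSV has no notion of merged cells and repeats the label on every row.

```python
import pandas as pd

# Invented data with a two-level index, similar in shape to a groupby result
df = pd.DataFrame(
    {
        "group": ["ABC", "ABC", "ABC", "XYZ"],
        "item": ["a", "b", "c", "d"],
        "value": [1, 2, 3, 4],
    }
).set_index(["group", "item"])

# HTML output: the repeated outer label becomes one cell spanning three rows
html = df.to_html()

# CSV output: the outer label is written on every row
csv_text = df.to_csv()

print(html.count("ABC"), csv_text.count("ABC"))
```

The html string can be placed straight into the body of the email, so the CSV step is not needed.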