unable to read head after converting excel to csv - python

I am trying to read an Excel file, convert it into CSV, and load its head:
import pandas as pd

df = pd.read_excel("final.xlsx", sheet_name="NewCustomerList")
# df = df.to_csv()
print(df.head(3))
Without converting to csv, the results look like this:
Note: The data and information in this document is reflective of a hypothetical situation and client. \
0 first_name
1 Chickie
2 Morly
However, if I uncomment the conversion, I get an error:
'str' object has no attribute 'head'
I am guessing it's because of the first line of the data. How else can I convert this properly and read it?

to_csv() is used to save a table to disk; it has no effect on the in-memory table. Called with a file path it returns None, but called without one, as in your commented line, it returns the CSV content as a string. So you are rebinding your df variable to a plain string, which has no head() method.
If you just want to display the table on screen in a specific format, perhaps take a look at to_string()
If you absolutely MUST have each row of your df as a comma-separated string then try a list comprehension:
my_csv_list = [','.join(map(str, row)) for row in df.itertuples()]
Beware of the CSV format: if any datapoint contains a comma, a naive join like the one above will produce strings that are a nightmare to decode back into a table (a proper CSV writer quotes such values).
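A minimal sketch of the two to_csv() behaviours (the sample frame here is made up):
import pandas as pd

df = pd.DataFrame({"first_name": ["Chickie", "Morly"]})

csv_text = df.to_csv()   # no path: returns the CSV content as a str
df.to_csv("out.csv")     # path given: writes to disk and returns None

print(type(csv_text))    # <class 'str'> -- which is why .head() then fails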

According to the documentation, pandas' to_csv() method returns either None (when a path is given) or a string (when it is not).
If you end up with the string, you then need to turn it back into a DataFrame before you can use its head.
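A minimal sketch of that round trip using io.StringIO (the sample frame here is made up):
import io
import pandas as pd

df = pd.DataFrame({"first_name": ["Chickie", "Morly"]})
csv_text = df.to_csv(index=False)              # the CSV content as a str

df_again = pd.read_csv(io.StringIO(csv_text))  # parse the string back
print(df_again.head(3))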

Related

If there is a regex match, append to list

I am trying to check a column of an Excel file for values in a given format and, if there is a match, append it to a list. Here is my code:
from openpyxl import load_workbook
import re
#Open file and read column with PBSID.
PBSID = []
wb = load_workbook(filename="FILE_PATH", data_only=True)
sheet = wb.active
for col in sheet["E"]:
    if re.search("\d{3}[-\.\s]??\d{5}", str(col)):
        PBSID.append(col.value)
print(PBSID)
Column E of the Excel file contains IDs like 431-00456 that I would like to append to the list named PBSID.
Expected result: the PBSID list populated with the IDs matching the mask XXX-XXXXX.
Actual result: Output is an empty list ("[]").
Am I missing something? (I know there are more elegant ways of doing this, but I am relatively new to Python and very open to criticism.)
Thanks!
Semantically, I think the loop variable should be named for what it actually is:
for cell in sheet["E"]:
since sheet["E"] already refers to column 'E', and iterating over it yields individual Cell objects, not rows.
The main problem is the str(col) call: it stringifies the whole Cell object, producing something like "<Cell 'Sheet1'.E2>" rather than the cell's contents, so your regular expression never finds a match (hence the empty list). It is also safer to write the pattern as a raw string (r"...") so the backslashes are not interpreted as escape sequences.
You already refer to the cell's value in 'PBSID.append(col.value)'; it's best to refer to the same object in the search as well, i.e. str(col.value).
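Putting those fixes together, a corrected sketch of the loop (FILE_PATH remains a placeholder):
import re
from openpyxl import load_workbook

PBSID = []
wb = load_workbook(filename="FILE_PATH", data_only=True)
sheet = wb.active

for cell in sheet["E"]:
    # Search the cell's value, not the Cell object's repr; skip empty cells.
    if cell.value is not None and re.search(r"\d{3}[-\.\s]??\d{5}", str(cell.value)):
        PBSID.append(cell.value)

print(PBSID)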

How can I read a document with pandas (Python) that doesn't look like the average one?

I am trying to get the values from the columns of a file. The doc looks like this:
[image: the data I want to read]
All the examples I have found use pd.read_csv or pd.DataFrame, but their data usually has a clear header and nothing on top of it (my file has about 10 lines at the top that I don't really need for what I am doing).
Also, I think maybe there is something wrong because I tried to run:
data = pd.read_csv('tdump_BIL_100_-96_17010100.txt',header=15)
and I get
[image: pd.read_csv output]
which is just each row in one column, so apparently no separator is being applied, and therefore there is no way of getting the columns I need.
So my question is if there is a way to get the data from this file with pandas and how to get it.
If the preamble has a fixed number of lines, skip those initial rows, indicate that no header is present, and tell pandas that values are separated by whitespace:
df = pd.read_csv(filename, skiprows=15, header=None, sep=r'\s+')
See read_csv for documentation.
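If you know what the columns contain, you can also name them at read time; a sketch with made-up names:
import pandas as pd

# Hypothetical column names -- replace them with what the file actually holds.
df = pd.read_csv('tdump_BIL_100_-96_17010100.txt',
                 skiprows=15, header=None, sep=r'\s+',
                 names=['year', 'month', 'day', 'hour', 'lat', 'lon'])
print(df.head())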

DataFrame is saving brackets while exporting to csv

I have a Pandas DataFrame that looks like this.
[image: DataFrame picture]
I thought of saving a tuple of two values under a column and then retrieving whichever value is needed. But now, for example, if I want the first value of the tuple in the first row of the 'Ref' column, I get "(" instead of "c0_4":
df = pd.read_csv(df_path)
print(df['Ref'][0][0])
The output for this is "(" and not "c0_4".
I don't want to use split() because I want the values to be searchable in the dataframe. For example, I would want to search for "c0_8" under the "Ref" column and get the row.
What other alternatives do I have to save two values in a row under the same column?
The immediate problem is that you're simply accessing character 0 of a string.
A file is character-oriented storage; there is no "data frame" abstraction at that level. Hence we use CSV to hold the columnar data as text, a format that allows easy output and input recovery.
A CSV file consists only of text, with the separator character and newline having special meanings. There is no "tuple" type: your data frame is stored as string data. If you want to recover the original tuple form, you will need to write parsing code to convert the strings back to tuples. Alternatively, you can switch to the "pickle" (.pkl) format for storing your data.
That should be enough leads to allow you to research whatever alternatives you want.
Your data is stored as a string.
To turn it back into a tuple, split every string in your DataFrame and save it back as a tuple, with something like:
for col in df.columns:
    for i in df.index:
        s = df.at[i, col]
        # "(c0_4, c0_8)" -> ("c0_4", " c0_8"): split on the comma, strip the parentheses
        df.at[i, col] = (s.split(",")[0][1:], s.split(",")[1][:-1])
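Alternatively, if the tuples were written with quoted elements (e.g. "('c0_4', 'c0_8')"), the standard library's ast.literal_eval can parse them back directly; a sketch, assuming that format:
import ast
import pandas as pd

df = pd.read_csv(df_path)
# Parses "('c0_4', 'c0_8')" back into a real tuple. This assumes the stored
# strings contain quoted elements; otherwise ast.literal_eval raises an error.
df['Ref'] = df['Ref'].apply(ast.literal_eval)
print(df['Ref'][0][0])  # 'c0_4'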

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also, data['details'] seems to be a Series, although I think it was a dictionary before the conversion to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You will have to adjust the json.loads(x) keys/indexes to extract the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
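For instance, with the nested path from the question (the sample JSON string below is made up):
import json
import pandas as pd

df = pd.DataFrame({'details': ['{"actions": {"affectedDealId": "d123"}}']})

# Walk details -> actions -> affectedDealId for every row.
df['affectedDealId'] = df['details'].apply(
    lambda x: json.loads(x)['actions']['affectedDealId']
)
print(df['affectedDealId'])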
I think it would also work to do something like this.
First, create a new data frame from your JSON column by applying pd.Series:
data_frame_new = df['details'].apply(pd.Series)
Then reassign your data frame by concatenating data_frame_new with the existing one:
df = pd.concat([df, data_frame_new], axis=1)
print(df)
This approach worked for me on a recent project.
Your affectedDealId will become a column of its own, with the data populated.
Hope it helps.
Thanks
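A related option, if the column already holds parsed dictionaries rather than JSON strings, is pd.json_normalize, which flattens nested keys into dotted column names; a sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'details': [{'actions': {'affectedDealId': 'd1'}},
                               {'actions': {'affectedDealId': 'd2'}}]})

flat = pd.json_normalize(df['details'].tolist())
# Nested keys become dotted column names, e.g. 'actions.affectedDealId'.
df['affectedDealId'] = flat['actions.affectedDealId']
print(df['affectedDealId'])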

How to treat date as plain text with pandas?

I use pandas to read a .csv file, then save it as an .xls file. Code as follows:
import pandas as pd
df = pd.read_csv('filename.csv', encoding='GB18030')
print(df)
df.to_excel('filename.xls')
There's a column containing dates like '2020/7/12'; it looks like pandas recognized them as dates and output them as '2020-07-12' automatically. I don't want to format this column, or any other column like it; I'd like to keep all the data exactly as it is, as plain text.
The conversion happens at read_csv(), because print(df) already outputs YYYY-MM-DD, before to_excel().
I used df.info() to check the data type of that column; the data type is object. Then I added the argument dtype=pd.StringDtype() to read_csv(), and it didn't help.
The file contains Chinese characters, so I set encoding to GB18030; I don't know if this matters.
My experience concerning pd.read_csv indicates that:
Only columns convertible to int or float are converted to those types by default.
"Date-like" strings are still read as strings (the column type in the resulting DataFrame is actually object).
If you want read_csv to convert such a column to datetime type, you should pass the parse_dates parameter, specifying a list of columns to be parsed as dates. Since you didn't do it, no source column should be converted to datetime type.
To check this detail, after you read the file, run file.info() and check the type of the column in question.
So if the respective Excel file column is of Date type, then this conversion is probably caused by to_excel.
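To illustrate the difference, a sketch using the DateBougth column from the example further down:
import pandas as pd

df = pd.read_csv('Input.csv')    # DateBougth is read as plain text (dtype object)
df.info()

df2 = pd.read_csv('Input.csv', parse_dates=['DateBougth'])
df2.info()                       # now DateBougth is datetime64[ns]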
And one more remark, concerning variable names:
What you have read using read_csv is a DataFrame, not a file. The actual file is the source object from which you read the content; here you passed only the file name.
So don't use names like file for the resulting DataFrame, as this is misleading. It is much better to use e.g. df.
Edit following a comment as of 05:58Z
To check in full extent what you wrote in your comment, I created
the following CSV file:
DateBougth,Id,Value
2020/7/12,1031,500.15
2020/8/18,1032,700.40
2020/10/16,1033,452.17
I ran df = pd.read_csv('Input.csv') and then print(df), getting:
   DateBougth    Id   Value
0   2020/7/12  1031  500.15
1   2020/8/18  1032  700.40
2  2020/10/16  1033  452.17
So, at the pandas level, no format conversion occurred in the DateBougth column. Both remaining columns contain numeric content, so they were silently converted to int64 and float64, but DateBougth remained as object.
Then I saved this df to an Excel file, running df.to_excel('Output.xls'), and opened it with Excel. The content is:
[image: screenshot of the resulting worksheet]
So no data type conversion took place at the Excel level either.
To see the actual data type of cell B2 (the first DateBougth), I clicked on the cell and pressed Ctrl-1 to display the cell formatting. The format is General (not Date), just as I expected.
Maybe you have some outdated version of software?
I use Python v. 3.8.2 and Pandas v. 1.0.3.
Another detail to check: look at your code after pd.read_csv. Maybe somewhere you put an instruction like df.DateBougth = pd.to_datetime(df.DateBougth) (an explicit type conversion)? Or at least a format conversion. Note that in my environment there was absolutely no change in the format of the DateBougth column.
Problem solved. I double-checked my .csv file by opening it with Notepad: the data is 2020-07-12, which Office displays as 2020/7/12. It turns out that Office reformats dates to yyyy/m/d (based on your region). I'm developing a tool to process and import data to a DB for my company; we used to do this work manually by copy and paste, so no one noticed the issue. Thanks to @Valdi_Bo for his investigation and patience.
