If I use DataFrame.set_index, I get this result:
import pandas as pd
df = pd.DataFrame([['foo',1,3.0],['bar',2,2.9],
['baz',4,2.85],['quux',3,2.82]],
columns=['name','order','gpa'])
df.set_index('name')
Note the unnecessary row... I know it does this because it reserves the upper left cell for the column title, but I don't care about it, and it makes my table look somewhat unprofessional if I use it in a presentation.
If I don't use DataFrame.set_index, the extra row is gone, but I get numeric row indices, which I don't want:
If I use to_html(index=False) then I solve those problems, but the first column isn't bold:
import pandas as pd
from IPython.display import HTML
df = pd.DataFrame([['foo',1,3.0],['bar',2,2.9],
['baz',4,2.85],['quux',3,2.82]],
columns=['name','order','gpa'])
HTML(df.to_html(index=False))
If I want to control styling to make the names boldface, I guess I could use the new Styler API via HTML(df.style.do_something_here().render()) but I can't figure out how to achieve the index=False functionality.
What's a hacker to do? (besides construct the HTML myself)
I poked around in the source for Styler and figured it out; if you set df.index.names = [None] then this suppresses the "extra" row (along with the column header that I don't really care about):
import pandas as pd
df = pd.DataFrame([['foo',1,3.0],['bar',2,2.9],
['baz',4,2.85],['quux',3,2.82]],
columns=['name','order','gpa'])
df = df.set_index('name')
df.index.names = [None]
df
These days pandas actually has a keyword for this:
df.to_html(index_names=False)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_html.html
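For comparison, here is a minimal sketch of the two approaches above side by side, using the same sample data as the question. The variable names (html_default, html_no_names, html_cleared) are made up for illustration, and rename_axis(None) is used here as a non-mutating stand-in for df.index.names = [None].

```python
import pandas as pd

df = pd.DataFrame(
    [["foo", 1, 3.0], ["bar", 2, 2.9], ["baz", 4, 2.85], ["quux", 3, 2.82]],
    columns=["name", "order", "gpa"],
).set_index("name")

# Default rendering: the index name gets its own header cell.
html_default = df.to_html()

# index_names=False drops the index name but keeps the (bold) index values.
html_no_names = df.to_html(index_names=False)

# Clearing the index name on a copy removes the header cell as well.
html_cleared = df.rename_axis(None).to_html()
```

In a notebook you would wrap any of these strings in IPython.display.HTML(...) to render it.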
Related
I am stuck here, and it's a two-part question. Looking at the output of .describe(include = 'all'), not all columns are showing; how do I get all columns to show?
This is a problem I run into all the time with Spyder: how do I get all columns to show in the console? Any help is appreciated.
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
import seaborn as sns
mydata = pd.read_csv(r"E:\ho11.csv")  # raw string so the backslash is not read as an escape
mydata.head()
print(mydata.describe(include="all", exclude = None))
mydata.info()
Solution
You could use either of the following methods:
Method-1:
pd.options.display.max_columns = None
Method-2:
pd.set_option('display.max_columns', None)
# to reset this
pd.reset_option('display.max_columns')
Method-3:
# assuming df is your dataframe
pd.set_option('display.max_columns', df.columns.size)
# to reset this
pd.reset_option('display.max_columns')
Method-4:
# assuming df is your dataframe
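# the short name only works while it matches exactly one option; newer pandas
# versions may raise OptionError here, so 'display.max_columns' is the safer spelling
pd.set_option('max_columns', df.columns.size)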
# to reset this
pd.reset_option('max_columns')
To avoid wrapping the output across multiple lines, do this:
pd.set_option('display.expand_frame_repr', False)
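If you would rather not change the options globally, pd.option_context applies them only inside a with block and restores the previous values afterwards. A small sketch, assuming a made-up wide frame with 30 columns (wide and full_text are illustrative names):

```python
import pandas as pd

# Hypothetical wide frame: one row, 30 columns named col_0 ... col_29.
wide = pd.DataFrame([list(range(30))], columns=[f"col_{i}" for i in range(30)])

# Options set here apply only inside the with block.
with pd.option_context(
    "display.max_columns", None,          # no limit on displayed columns
    "display.expand_frame_repr", False,   # do not wrap across multiple lines
):
    full_text = repr(wide)

print(full_text)
```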
References
I recommend exploring the following resources for more details and examples.
How to show all of columns name on pandas dataframe?
How do I expand the output display to see more columns of a pandas DataFrame?
How to show all columns / rows of a Pandas Dataframe?
Since you are using Spyder the easiest thing to do would be:
myview = mydata.describe()
Then you can inspect 'myview' in the variable explorer.
With pd.set_option, the column names listed in the console were still truncated in the middle with three dots.
To print the full list of a dataframe's column names to the console in Spyder:
list(df.columns)
I have a Pandas dataframe in which one column contains JSON data (the JSON structure is simple: only one level, there is no nested data):
ID,Date,attributes
9001,2020-07-01T00:00:06Z,"{"State":"FL","Source":"Android","Request":"0.001"}"
9002,2020-07-01T00:00:33Z,"{"State":"NY","Source":"Android","Request":"0.001"}"
9003,2020-07-01T00:07:19Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
9004,2020-07-01T00:11:30Z,"{"State":"NY","Source":"windows","Request":"0.001"}"
9005,2020-07-01T00:15:23Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
I would like to normalize the JSON content in the attributes column so the JSON attributes become each a column in the dataframe.
ID,Date,attributes.State, attributes.Source, attributes.Request
9001,2020-07-01T00:00:06Z,FL,Android,0.001
9002,2020-07-01T00:00:33Z,NY,Android,0.001
9003,2020-07-01T00:07:19Z,FL,ios,0.001
9004,2020-07-01T00:11:30Z,NY,windows,0.001
9005,2020-07-01T00:15:23Z,FL,ios,0.001
I have been trying using Pandas json_normalize which requires a dictionary. So, I figure I would convert the attributes column to a dictionary but it does not quite work out as expected for the dictionary has the form:
df.attributes.to_dict()
{0: '{"State":"FL","Source":"Android","Request":"0.001"}',
1: '{"State":"NY","Source":"Android","Request":"0.001"}',
2: '{"State":"FL","Source":"ios","Request":"0.001"}',
3: '{"State":"NY","Source":"windows","Request":"0.001"}',
4: '{"State":"FL","Source":"ios","Request":"0.001"}'}
And the normalization takes the key (0, 1, 2, ...) as the column name instead of the JSON keys.
I have the feeling that I am close but I can't quite work out how to do this exactly. Any idea is welcome.
Thank you!
Normalize expects to work on an object, not a string.
import json
import pandas as pd
df_final = pd.json_normalize(df.attributes.apply(json.loads))
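To get all the way to the layout asked for in the question, the normalized columns can be prefixed and joined back onto the original frame. This is only a sketch: the inline sample frame stands in for the real CSV, and attrs/parsed are illustrative names.

```python
import json
import pandas as pd

# Inline stand-in for the first rows of the question's data.
df = pd.DataFrame({
    "ID": [9001, 9002, 9003],
    "Date": ["2020-07-01T00:00:06Z", "2020-07-01T00:00:33Z", "2020-07-01T00:07:19Z"],
    "attributes": [
        '{"State":"FL","Source":"Android","Request":"0.001"}',
        '{"State":"NY","Source":"Android","Request":"0.001"}',
        '{"State":"FL","Source":"ios","Request":"0.001"}',
    ],
})

# Turn each JSON string into a dict, then normalize the list of dicts.
parsed = df["attributes"].apply(json.loads)
attrs = pd.json_normalize(parsed.tolist()).add_prefix("attributes.")

# Align on the original index before joining, in case it is not 0..n-1.
attrs.index = df.index
df_final = df.drop(columns=["attributes"]).join(attrs)
print(df_final)
```

Note that Request stays a string here because it is quoted in the JSON; convert it with astype(float) if you need numbers.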
You shouldn’t need to convert to a dictionary first.
Try:
import pandas as pd
pd.json_normalize(df['attributes'])
I found a solution, but I am not overly happy with it. I reckon it is very inefficient.
import pandas as pd
import json
# Import full dataframe
df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])
# Create empty dataframe to hold the results of data conversion
df_attributes = pd.DataFrame()
# Loop through the data to fill the dataframe
for index in df.index:
    row_json = json.loads(df.attributes[index])
    normalized_row = pd.json_normalize(row_json)
    # df_attributes = df_attributes.append(normalized_row) (deprecated method) use concat instead
    df_attributes = pd.concat([df_attributes, normalized_row], ignore_index=True)
# Reset the index of the attributes dataframe
df_attributes = df_attributes.reset_index(drop=True)
# Drop the original attributes column
df = df.drop(columns=['attributes'])
# Join the results
df_final = df.join(df_attributes)
# Show results
print(df_final)
print(df_final.info())
This gives me the expected result. However, as I said, there are several inefficiencies in it. For starters, the dataframe is grown inside the for loop. According to the documentation, the best practice is to collect the pieces in a list and combine them once at the end, but I could not figure out how to do that while keeping the shape I wanted. I welcome all criticism and ideas.
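One way the "build a list first" idea could look, as a sketch rather than a definitive answer: parse every row into a plain dict, collect the dicts in a list, and construct the attributes dataframe a single time. The inline frame below is a stand-in for the CSV, and records is an illustrative name.

```python
import json
import pandas as pd

# Inline stand-in for the CSV from the question.
df = pd.DataFrame({
    "ID": [9001, 9002],
    "Date": ["2020-07-01T00:00:06Z", "2020-07-01T00:00:33Z"],
    "attributes": [
        '{"State":"FL","Source":"Android","Request":"0.001"}',
        '{"State":"NY","Source":"Android","Request":"0.001"}',
    ],
})

# Parse every JSON string first, collecting plain dicts in a list.
records = [json.loads(text) for text in df["attributes"]]

# Build the attributes frame once, reusing the original index so join lines up.
df_attributes = pd.DataFrame(records, index=df.index)

df_final = df.drop(columns=["attributes"]).join(df_attributes)
print(df_final)
```

Because the JSON is flat, pd.DataFrame on the list of dicts is enough here; nested JSON would still need json_normalize.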
I am new to plotting charts in python. I've been told to use Pandas for that, using the following command. Right now it is assumed the csv file has headers (time,speed, etc). But how can I change it to when the csv file doesn't have headers? (data starts from row 0)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv("P1541350772737.csv")
#df.head(5)
df.plot(figsize=(15,5), kind='line',x='timestamp', y='speed') # line plot
You can specify x and y by column position; you don't need the column names for that:
Very simple: df.plot(figsize=(15,5), kind='line',x=0, y=1)
This works if the x column is first and the y column is second; columns are numbered from 0.
For example:
The same result with the names of the columns instead of positions:
I may have misinterpreted your question, but I'll do my best.
The problem seems to be that you have to read a csv that has no header, but you want to add one. I would use this code:
cols=['time', 'speed', 'something', 'else']
df = pd.read_csv('useful_data.csv', names=cols, header=None)
For your plot, the code you used should be fine with my correction. I would also suggest looking at matplotlib to build your graph.
You can try
df = pd.read_csv("P1541350772737.csv", header=None)
With the names kwarg you can set arbitrary column headers; this silently implies header=None, i.e. reading data from row 0.
You might also want to check the doc https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
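To see the difference between the variants without touching a real file, here is a small sketch that reads the same headerless text three ways. The data and the column names time and speed are made up for the example.

```python
import io
import pandas as pd

# Made-up headerless data: three rows, two columns.
raw = "1,2\n3,4\n5,6\n"

# Default: the first data row is consumed as the header.
inferred = pd.read_csv(io.StringIO(raw))

# header=None: all rows are data, columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=...: all rows are data, columns get the given labels.
named = pd.read_csv(io.StringIO(raw), names=["time", "speed"])

print(inferred, no_header, named, sep="\n\n")
```

With named, the original df.plot(x='time', y='speed') style call works unchanged; with no_header you would plot with x=0, y=1.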
Pandas is more focused on data structures and data analysis tools, it actually supports plotting by using Matplotlib as backend. If you're interested in building different types of plots in Python you might want to check it out.
Back to Pandas: it assumes that the first row of your csv is a header. However, if your file doesn't have a header, you can pass header=None as a parameter, pd.read_csv("P1541350772737.csv", header=None), and then plot it as you are doing right now.
The full list of commands that you can pass to Pandas for reading a csv can be found at Pandas read_csv documentation, you'll find a lot of useful commands there (such as skipping rows, defining the index column, etc.)
Happy coding!
For most commands you will find help in the respective documentation. Looking at pandas.read_csv you'll find an argument names
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly
pass header=None.
So you will want to give your columns names by which they appear in the dataframe.
As an example: Suppose you have this data file
1, 2
3, 4
5, 6
Then you can do
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("data.txt", names=["A", "B"], header=None)
print(df)
df.plot(x="A", y="B")
plt.show()
which outputs
A B
0 1 2
1 3 4
2 5 6
Please be gentle, I'm a total Python newbie. I'm currently writing a script which is turning out to be very long, and I thought there must be a for-loop method to make this easier. I'm currently going through a CSV, pulling the header titles and placing each one in a str.replace call, manually.
df['Col 1'] = df['Col 1'].str.replace('text','replacement')
I figured it would start like this.. but no idea how to proceed!
Import pandas as pd
df = pd.read_csv('file.csv')
for row in df.columns:
if (df[,:] =...
Sorry I know this probably looks terrible, but this is all I could fathom with my limited knowledge!
Thanks!
jezrael's comment solved it much more elegantly.
But, in case you needed specific code for each column it would go something like this:
import pandas as pd
df = pd.read_csv('file.csv')
for column in df.columns:
    df[column] = df[column].str.replace('text','replacement')
No worries! We've all been there.
Your import statement should be lowercase: import pandas as pd
In your for loop, I think there's a misunderstanding of what you'll be iterating over. The for row in df.columns will iterate over the column names, not the rows.
Is it correct to say that you'd like to convert the column names to strings?
You can do a multiple-column replacement in one shot with replace by passing in a dictionary.
Say you want to replace t1 with r1 in column a; t2 with r2 in column b, you can do
df.replace({"a":{"t1":"r1"}, "b":{"t2":"r2"}})
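One thing worth knowing about the dictionary form: by default it only replaces cells whose whole value matches, whereas str.replace works on substrings. Passing regex=True makes the keys patterns, so partial matches are replaced too. A small sketch with made-up data (whole_cell and substring are illustrative names):

```python
import pandas as pd

# Made-up frame: one cell equals "t1" exactly, another only contains it.
df = pd.DataFrame({"a": ["t1", "t1 extra"], "b": ["t2", "keep"]})
mapping = {"a": {"t1": "r1"}, "b": {"t2": "r2"}}

# Default: only cells that are exactly "t1" / "t2" are replaced.
whole_cell = df.replace(mapping)

# regex=True: keys are treated as patterns, so substrings are replaced too.
substring = df.replace(mapping, regex=True)

print(whole_cell)
print(substring)
```

Both calls return a new dataframe and leave df untouched. With regex=True, remember to escape any regex metacharacters in the search text.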
df = pd.read_csv('file.csv',usecols=['List of column names you want to use from your csv'],
names=['list of names of column you want your pandas df to have'])
You should read the docs and identify the fields that are important in your case.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I read an Excel Sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The first cell's value in each column is used as the column name for the DataFrame. I want to specify my own column names. How do I do this?
This thread is 5 years old and outdated now, but still shows up on the top of the list from a generic search. So I am adding this note. Pandas now (v0.22) has a keyword to specify column names at parsing Excel files. Use:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1", header=None, names=['A', 'B', 'C'])
If header=None is not set, pandas seems to treat the first row as the header and drop it from the data during parsing. If there is indeed a header but you don't want to use it, you have two choices: either (1) use the "names" kwarg only, or (2) use "names" with header=None and skiprows=1. I personally prefer the second option, since it makes clear that the input file is not in the format I want and that I am doing something to work around it.
I think setting them afterwards is the only way in this case, so if you have for example four columns in your DataFrame:
df.columns = ['W','X','Y','Z']
If you know in advance what the headers in the Excel file are, it's probably better to rename them. This would rename W to A, and so on:
df = df.rename(columns={'W': 'A', 'X': 'B'})  # add further old: new pairs as needed
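A quick sketch of how the two options differ, using a made-up one-row frame instead of an Excel file: assigning to df.columns relabels every column by position, while rename maps only the names you list and returns a new dataframe.

```python
import pandas as pd

# Made-up frame standing in for the parsed sheet.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["W", "X", "Y", "Z"])

# Option 1: replace all labels at once; the list length must match the column count.
relabeled = df.copy()
relabeled.columns = ["A", "B", "C", "D"]

# Option 2: rename selected columns by name; unlisted columns keep their labels.
renamed = df.rename(columns={"W": "A", "X": "B"})

print(list(relabeled.columns))
print(list(renamed.columns))
```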
As Ram said, this post comes on the top and may be useful to some....
In pandas 0.24.2 (maybe earlier as well), read_excel itself can ignore the source headers and apply your own column names, and it offers a few other useful controls:
DID = pd.read_excel(file1, sheet_name=0, header=None, usecols=[0, 1, 6], names=['A', 'ID', 'B'], dtype={2:str}, skiprows=10)
# for example....
# usecols => read only specific col indexes
# dtype => specifying the data types
# skiprows => skip number of rows from the top.
Call .parse with the header=None keyword argument:
df = xl.parse("Sheet1", header=None)
In case the excel sheet only contains the data without headers:
df=pd.read_excel("the excel file",header=None,names=["A","B","C"])
In case the excel sheet already contains header names, then use skiprows to skip the line:
df=pd.read_excel("the excel file",header=None,names=["A","B","C"],skiprows=1)