Label encoding in Pandas - python

I am working with a data set that has numerical and categorical values. I have found a solution for the numerical values, so the next step is to label-encode the categorical values. To do that I wrote these lines of code:
import pandas as pd
dataset_categorical = dataset.select_dtypes(include = ['object'])
new_column = dataset_categorical.astype('category')
After executing the last line in Jupyter I don't get an error, but the values are not converted into encoded values.
Also, this line works when I try it with only one column, but it doesn't work with the whole data frame.
Can anybody help me solve this problem?

df1 = {
'Name':['George','Andrea','micheal','maggie','Ravi',
'Xien','Jalpa'],
'Is_Male':[1,0,1,0,1,1,0]}
df1 = pd.DataFrame(df1,columns=['Name','Is_Male'])
Typecast a column to categorical in pandas:
df1['Is_Male'] = df1.Is_Male.astype('category')
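Converting to the category dtype changes only the dtype; the displayed values stay the same. The following is a minimal sketch of getting integer labels for every non-numeric column, assuming the goal is one integer code per distinct value. The sample frame, its column names and the name `encoded` are illustrative and not taken from the question, and `exclude=["number"]` is used here instead of `include=['object']` only so the sketch also picks up string-dtype columns.

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not the asker's dataset.
dataset = pd.DataFrame({
    "Name": ["George", "Andrea", "George", "Ravi"],
    "City": ["Oslo", "Rome", "Rome", "Oslo"],
    "Age": [31, 25, 31, 40],
})

# Select the non-numeric (categorical) columns.
categorical_columns = dataset.select_dtypes(exclude=["number"]).columns

encoded = dataset.copy()
for column in categorical_columns:
    # astype("category") only changes the dtype;
    # .cat.codes returns the integer label of each value.
    encoded[column] = encoded[column].astype("category").cat.codes

print(encoded)
```

The codes follow the sorted order of the categories, so the numbers carry no meaning beyond identifying each distinct value.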

Related

Pandas .sort_values() function returning data frame with scattered values

I'm using pandas to load a short_desc.csv with the following columns: ["report_id", "when","what"]
with
#read csv
shortDesc = pd.read_csv('short_desc.csv')
#get all numerical and nonnull values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
#convert 'when' from UNIX timestamp to datetime
shortDesc['when'] = pd.to_datetime(shortDesc['when'],unit='s')
which results in the following:
I'm trying to remove rows that have duplicate 'report_id's by sorting by
date and getting the newest date where that 'report_id' is present with the following:
shortDesc = shortDesc.sort_values(by='when').drop_duplicates(['report_id'], keep='last')
the problem is that when I use .sort_values() in this particular dataframe the values of 'what' come out scattered across all columns, and the 'report_id' values disappear:
shortDesc = shortDesc.sort_values(by=['when'], inplace=False)
I'm not sure why this is happening in this particular instance since I was able to achieve the correct results by another dataframe with the same shape and using the same code (P.S it's not a mistake, I dropped the 'what' column in the second pic):
similar shape dataframe
desired results example with similar shape DF
I found out that:
#get all numerical and nonnull values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
applied .notnull() to the result of .str.isdigit(), so it only checked whether that result was not null. Both True and False are non-null, which meant non-numeric "report_id" values were never dropped. I changed this to two separate lines
shortDesc = shortDesc[shortDesc['report_id'].notnull()]
shortDesc = shortDesc[shortDesc['report_id'].str.isnumeric()]
which allowed
shortDesc.sort_values(by='when', inplace=True)
to work as intended. I am still confused as to why .sort_values(by="when") was affected by the column "report_id", so if anyone knows, please enlighten me.
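The difference between the chained filter and the two-step filter can be seen on a tiny example. This is a sketch with made-up values (the Series below is an assumption, not the real short_desc.csv data); it does not explain the sort_values behaviour, only the filtering.

```python
import pandas as pd

# Illustrative values: a valid id, a non-numeric string, and a missing value.
report_id = pd.Series(["123", "abc", None], dtype=object)

# .str.isdigit() yields True/False (and a missing value for the null entry);
# .notnull() then maps both True and False to True.
chained_mask = report_id.str.isdigit().notnull()

# Two separate steps: drop nulls first, then keep only all-digit strings.
non_null = report_id[report_id.notnull()]
cleaned = non_null[non_null.str.isdigit()]

print(chained_mask.tolist())
print(cleaned.tolist())
```

With the chained mask, "abc" is kept because the False returned by isdigit() is itself not null.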

How to stop Pandas converting integer to decimal when reading in an .xlsx file?

I have an .xlsx file that I am loading into a dataframe using the pd.read_excel method. However, when I do so, one of my columns appears to change format, with pandas adding a decimal point. Does anyone know why this is happening and how to stop it please?
Example of data in the .xlsx file:
191001
191002
191003
Example of the same data in the dataframe:
191001.0
191002.0
191003.0
The relevant column is using the 'General' format option in Excel.
I tried removing the decimal point with the following method; however I got the error message "pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer".
df.column1 = df.column1.astype(int)
Any help would be appreciated!
Your column most likely contains NaN or infinite values (for example from empty cells), which is also why pandas reads it as float.
You will need to replace them first:
import numpy as np
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(0, inplace = True)
df.column1 = df.column1.astype(int)
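If turning missing cells into 0 is not desirable, one alternative is the nullable integer dtype "Int64" (capital I), which keeps the gaps as missing values. This is a different technique from the fillna approach above, shown as a minimal sketch on made-up data; it assumes the non-missing values are all whole numbers.

```python
import numpy as np
import pandas as pd

# Illustrative column: whole numbers plus one empty cell, read as float64.
df = pd.DataFrame({"column1": [191001.0, 191002.0, np.nan]})

# The nullable "Int64" dtype stores integers and keeps the gap as <NA>
# instead of forcing it to 0.
df["column1"] = df["column1"].astype("Int64")

print(df)
```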

Mean of a Row in Python (Reading CSV files using Pandas)

I am completely new to python.. I would like to ask how can I fix my code?
I can't make it work because, for some reason, it only calculates columns.
import numpy as np
import pandas as pd
rainfall = pd.read_csv('rainfall.csv', low_memory=False, parse_dates=True, header=None)
mean_rainfall = rainfall[0].mean()
print(mean_rainfall)
the picture of my csv
In the pandas DataFrame mean function you can pass a parameter to tell it whether to take the mean of each row or of each column.
Check here: pandas.DataFrame.mean.
The default is axis=0, which calculates the mean of each column. In your code, rainfall[0] also selects column 0 rather than row 0.
Try this for the first row (or rainfall.mean(axis=1) to get the mean of every row):
mean_rainfall = rainfall.iloc[0].mean()
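A small sketch of the difference between a column mean and a row mean. The frame below is made up and stands in for rainfall.csv read with header=None, so its columns are labelled 0, 1, 2.

```python
import pandas as pd

# Illustrative stand-in for the CSV; the values are assumptions.
rainfall = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

column_mean = rainfall[0].mean()          # column 0: (1 + 4) / 2
first_row_mean = rainfall.iloc[0].mean()  # row 0: (1 + 2 + 3) / 3
all_row_means = rainfall.mean(axis=1)     # one mean per row

print(column_mean, first_row_mean)
print(all_row_means.tolist())
```

Note that rainfall.iloc[0] is a Series, so its mean() takes no row/column choice; the axis argument matters only on the whole DataFrame.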

Pandas while importing ignores 0 from the values if it starts with 0 [duplicate]

I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6 numbers coding, among others, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, a simple solution is to use the converters option for a specific column in the read_csv function.
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
project_name project_id
0 Some Project 245
1 Another Project 478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
project_name project_id
0 Some Project 000245
1 Another Project 000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
here is a shorter, robust and fully working solution:
simply define a mapping (dictionary) between variable names and desired data type:
dtype_dic= {'subject_id': str,
'subject_number' : 'float'}
use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype = dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be lost, or you just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the complete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
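You can use converters to zero-pad a number to a fixed width if you know the width. Note that read_csv passes each cell to the converter as a string, so convert it to int before padding (this assumes every cell contains only digits).
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05}"})
This will do the trick. It works for pandas==0.23.0 and also read_excel.
Python3.6 or higher required.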
I don't think you can specify a column type the way you want (unless there have been changes recently, or the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes Mckinney's blog, there might be something for you. It seems to be that there is a new parser from pandas 0.10 coming in November.
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
id A
0 03 5
1 04 6
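The effect of the dtype option can be checked without a file on disk. This is a minimal sketch that feeds the projects.csv content from the example above through io.StringIO; the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = "project_name,project_id\nSome Project,000245\nAnother Project,000478\n"

# Default parsing infers an integer column and drops the leading zeros.
default_read = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the text exactly as written in the file.
string_read = pd.read_csv(io.StringIO(csv_text), dtype={"project_id": str})

print(default_read["project_id"].tolist())
print(string_read["project_id"].tolist())
```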

Plot diagram in Pandas from CSV without headers

I am new to plotting charts in python. I've been told to use Pandas for that, using the following command. Right now it is assumed the csv file has headers (time,speed, etc). But how can I change it to when the csv file doesn't have headers? (data starts from row 0)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv("P1541350772737.csv")
#df.head(5)
df.plot(figsize=(15,5), kind='line',x='timestamp', y='speed') # line plot
You can specify x and y by column position; you don't need the column names for that:
df.plot(figsize=(15,5), kind='line',x=0, y=1)
This works if the x column is first and the y column is second, and so on; columns are numbered from 0.
I may have misinterpreted your question, but I'll do my best.
The problem seems to be that you have to read a csv that has no header, but you want to add one. I would use this code:
cols=['time', 'speed', 'something', 'else']
df = pd.read_csv('useful_data.csv', names=cols, header=None)
For your plot, the code you used should be fine with my correction. I would also suggest to look at matplotlib in order to do your graph.
You can try
df = pd.read_csv("P1541350772737.csv", header=None)
With the names kwarg you can set arbitrary column headers; this silently implies header=None, i.e. reading data from row 0.
You might also want to check the doc https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Pandas is more focused on data structures and data analysis tools, it actually supports plotting by using Matplotlib as backend. If you're interested in building different types of plots in Python you might want to check it out.
Back to Pandas: it assumes that the first row of your csv is a header. However, if your file doesn't have a header, you can pass header=None as a parameter, pd.read_csv("P1541350772737.csv", header=None), and then plot it as you are doing right now.
The full list of commands that you can pass to Pandas for reading a csv can be found at Pandas read_csv documentation, you'll find a lot of useful commands there (such as skipping rows, defining the index column, etc.)
Happy coding!
For most commands you will find help in the respective documentation. Looking at pandas.read_csv you'll find an argument names
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly
pass header=None.
So you will want to give your columns names by which they appear in the dataframe.
As an example: Suppose you have this data file
1, 2
3, 4
5, 6
Then you can do
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("data.txt", names=["A", "B"], header=None)
print(df)
df.plot(x="A", y="B")
plt.show()
which outputs
A B
0 1 2
1 3 4
2 5 6
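To see what header=None and names change, the three ways of reading the same headerless data can be compared side by side. This is a sketch that uses io.StringIO in place of a file; the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = "1,2\n3,4\n5,6\n"

# Default: the first data row is consumed as the header.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is data; columns are labelled 0, 1, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names=[...]: every row is data, with the labels you supply.
named = pd.read_csv(io.StringIO(csv_text), names=["A", "B"])

print(with_default.shape, list(with_default.columns))
print(no_header.shape, list(no_header.columns))
print(named.shape, list(named.columns))
```

Either of the last two frames can then be plotted, with x=0, y=1 for the integer labels or x="A", y="B" for the named ones.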
