Create new column in pandas data frame - python

I'm extremely new to Python and have been searching Google and Stack Overflow to solve this issue, which I am sure is simply a syntax problem.
I have a data frame with several columns.
import pandas as pd
df = pd.read_csv("C:/path/file.csv")
My CSV has 5 columns and ~100k rows.
I simply want a substring consisting of the first two characters of column 5.
I've tried:
df.assign(new = lambda x: x.column5[0:2],)
This creates the new field and populates the first two rows with the complete value in column 5 and gives me NaN for the remainder.
These attempts give me syntax errors:
df['new'] = df['column5'].str[0:2]
df.map(lambda df['column5']: [:2])
I am simply at a loss for how to create a new column from the first two characters of an existing column in a table read in via pandas.
If this were SAS I'd have been done hours ago, but I am trying to make a go of Python, so your help is appreciated.

I guess your column5 column is of int*/float* dtype, so
try to convert it to string first:
df['new'] = df['column5'].astype(str).str[:2]
Alternatively, you can explicitly specify column types when reading the CSV file:
df = pd.read_csv('file_name.csv', ..., dtype={'column5': object})
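For example, a minimal sketch of the cast-then-slice approach, using a made-up column5 of integer codes:
import pandas as pd

# hypothetical stand-in for the CSV data
df = pd.DataFrame({'column5': [12345, 67890, 24680]})

# cast to string, then slice the first two characters
df['new'] = df['column5'].astype(str).str[:2]
print(df['new'].tolist())  # ['12', '67', '24']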

Related

Creating list from imported CSV file with pandas

I am trying to create a list from a CSV. This CSV contains a 2-dimensional table [540 rows and 8 columns], and I would like to create a list that contains the values of a specific column, column 4 to be specific.
I tried list(df.columns.values)[4]; it does give me the name of the column, but I'm trying to get the values from the rows in column 4 and make them a list.
import pandas as pd
import urllib
#This is the empty list
company_name = []
#Uploading CSV file
df = pd.read_csv('Downloads/Dropped_Companies.csv')
#Extracting list of all companies name from column "Name of Stock"
companies_column=list(df.columns.values)[4] #This returns the name of the column.
companies_column = list(df.iloc[:,4].values)
So for this you can just add the following line after the code you've posted:
company_name = df[companies_column].tolist()
This will get the column data in the companies column as a pandas Series (essentially, a Series is just a fancy list) and then convert it to a regular Python list.
Or, if you were to start from scratch, you can also just use these lines:
import pandas as pd
df = pd.read_csv('Downloads/Dropped_Companies.csv')
company_name = df[df.columns[4]].tolist()
Another option: if this is the only thing you need to do with your CSV file, you can also get away with just using the csv library that comes with Python instead of installing pandas, using this approach.
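For instance, a rough sketch of that csv-only route, assuming the file path and the "Name of Stock" header from the question:
import csv

company_name = []
with open('Downloads/Dropped_Companies.csv', newline='') as f:
    for row in csv.DictReader(f):
        company_name.append(row['Name of Stock'])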
If you want to learn more about how to get data out of your pandas DataFrame (the df variable in your code), you might find this blog post helpful.
I think that you can try this for getting all the values of a specific column:
companies_column = df['column_name']
Replace 'column_name' with the name of the column whose values you want to access.

Filter off NaNs in a large DataFrame with headers

I have a large number of time series, with blanks on certain dates for some of them. I read them with xlwings from an Excel sheet:
Y0 = xw.Range('SomeRangeinXLsheet').options(pd.DataFrame, index=True , header=3).value
I'm trying to create a filter to run regressions on those series, so I have to take out the void dates. If I do:
print(Y0.iloc[:, [i]] == Y0.iloc[:, [i]])
I get a proper series of True/False for my column number i, fine (NaN compares unequal to itself, so the blank dates come out False).
I'm then stuck: I can't find a way to filter the whole df with the True/False for that column, or even just to extract that clean series as a pd.Series.
I need them one by one, to adapt my independent variables' dates to those of each of these series separately.
Thank you for your help.
I believe you want to use df.dropna()
I am not sure if I understood your problem, but if you want to check for NULLs in a specific column and drop those rows, you can try this:
import pandas as pd
df = df[pd.notnull(df['column_name'])]
For deleting NaNs, df.dropna() should work, as suggested in the previous answer. If it is not working, you can try replacing NaNs with a placeholder text and then deleting the rows that contain that placeholder:
import numpy as np
df['column_name'] = df['column_name'].replace(np.nan, 'delete-it')
df = df[df['column_name'] != 'delete-it']
Hope this helps!
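As a self-contained sketch of the column-by-column workflow from the question (the small Y0 frame here is made up):
import numpy as np
import pandas as pd

# stand-in for the Y0 frame read from Excel
Y0 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

i = 0
col = Y0.columns[i]
clean_df = Y0.dropna(subset=[col])  # whole frame, keeping rows where column i is not NaN
clean_series = Y0[col].dropna()     # or just that column as a clean pd.Series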

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or, better yet, does anyone have a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
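A quick sketch of the difference, with a made-up df1:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5], 'b': [4, 2], 'c': [3, 3], 'd': [0, 9]})

s = df1.max(axis=1)                        # a Series: it has a name, not a column title
df2 = df1.max(axis=1).to_frame('maximum')  # a one-column DataFrame
print(df2['maximum'].dtype)                # int64
print(df2['maximum'].tolist())             # [4, 9]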

How to make an object into a DataFrame in Python

I have implemented the below part of code :
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a DataFrame consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop which uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them; each iteration checks the correlation with a different variable [i].
Python treats this object as a list, or as a tuple when I change the brackets to round ones. I need this object to be a DataFrame (the next step is to remove NaN values using .dropna, which is a DataFrame method).
How can I fix this issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0,i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
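For instance, a sketch of the loop described in the question (the table variable and the file name are assumed, and the correlation step stands in for the asker's predefined function):
import pandas as pd

table = pd.read_csv('my-data.csv')  # hypothetical dataset from the question

for i in range(1, table.shape[1]):
    sub = table.iloc[:, [0, i]].dropna()  # a real two-column DataFrame, NaNs removed
    print(sub.corr())                     # correlation between column 0 and column i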
What is your goal with the dataframe?
DataFrame is a common term in data analysis using pandas.
Pandas was developed precisely to facilitate such analysis; getting the data from a .csv file into a DataFrame is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you wish:
df.loc[:, ['COLUMN_NAME_1', 'COLUMN_NAME_2']]
Let us know if this is what you are looking for.

Pandas drops the leading 0 from values that start with 0 while importing [duplicate]

I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6-digit numbers encoding, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, there is a simple solution: use the converters option for the column in the read_csv function:
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
      project_name  project_id
0     Some Project         245
1  Another Project         478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
      project_name  project_id
0     Some Project      000245
1  Another Project      000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust and fully working solution:
simply define a mapping (dictionary) between variable names and the desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed (or you might just need to automate your code), you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to pad the numbers to a fixed width if you know the width. Note that read_csv passes each raw field to the converter as a string, so convert it to int before formatting.
For example, if the width is 5, then:
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05d}"})
This will do the trick. It works for pandas==0.23.0 and also for read_excel.
Python 3.6 or higher is required (for f-strings).
I don't think you can specify a column type the way you want (if there haven't been changes recently, and if the 6-digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog; there might be something for you. It seems that there is a new parser coming in pandas 0.10 in November.
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
   id  A
0  03  5
1  04  6
