Pandas: how to change all the values of a column? - python

I have a data frame with a column called "Date" and want all the values from this column to have the same value (the year only). Example:
City    Date
Paris   01/04/2004
Lisbon  01/09/2004
Madrid  2004
Pekin   31/2004
What I want is:
City    Date
Paris   2004
Lisbon  2004
Madrid  2004
Pekin   2004
Here is my code:
fr61_70xls = pd.ExcelFile('AMADEUS FRANCE 1961-1970.xlsx')
# Here we import the individual sheets and clean them
years = ['1961','1962','1963','1964','1965','1966','1967','1968','1969','1970']
fr = {}
header = ['City','Country','NACE','Cons','Last_year','Op_Rev_EUR_Last_avail_yr','BvD_Indep_Indic','GUO_Name','Legal_status','Date_of_incorporation','Legal_status_date']
for year in years:
    # save every sheet in fr['1961'], fr['1962'] and so on
    fr[year] = fr61_70xls.parse(year, header=0, parse_cols=10)
    fr[year].columns = header
    # drop the entire Legal_status_date column (Date_of_incorporation is still
    # needed for the index below)
    fr[year] = fr[year].drop(['Legal_status_date'], axis=1)
    # drop every row where GUO_Name is empty
    fr[year] = fr[year].dropna(axis=0, how='all', subset=['GUO_Name'])
    fr[year] = fr[year].set_index(['GUO_Name','Date_of_incorporation'])
It happens that in my DataFrames, called for example fr['1961'], the values of Date_of_incorporation can be anything (strings, integers, and so on), so maybe it would be best to erase this column completely and then attach another column with only the year to the DataFrames?

As @DSM points out, you can do this more directly using the vectorised string methods:
df['Date'].str[-4:].astype(int)
Or using extract (assuming there is only one set of digits of length 4 somewhere in each string):
df['Date'].str.extract(r'(?P<year>\d{4})').astype(int)
An alternative, slightly more flexible way is to use apply (or equivalently map):
df['Date'] = df['Date'].apply(lambda x: int(str(x)[-4:]))
# converts the last 4 characters of the string to an integer
The lambda function takes each Date value and converts it to a year.
You could (and perhaps should) write this more verbosely as:
def convert_to_year(date_in_some_format):
    date_as_string = str(date_in_some_format)  # cast to string
    year_as_string = date_as_string[-4:]       # last four characters
    return int(year_as_string)

df['Date'] = df['Date'].apply(convert_to_year)
Perhaps 'Year' is a better name for this column...
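For instance, applied to the sample data from the question (a minimal sketch; the frame is rebuilt here just for illustration):

import pandas as pd

df = pd.DataFrame({'City': ['Paris', 'Lisbon', 'Madrid', 'Pekin'],
                   'Date': ['01/04/2004', '01/09/2004', '2004', '31/2004']})

# Pull out the four-digit year and store it under the clearer name
df['Year'] = df['Date'].str.extract(r'(\d{4})', expand=False).astype(int)
print(df[['City', 'Year']])
#      City  Year
# 0   Paris  2004
# 1  Lisbon  2004
# 2  Madrid  2004
# 3   Pekin  2004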

You can do a column transformation by using apply. Define a clean function to remove the dollar signs, commas, and spaces, and convert your data to float.
def clean(x):
    x = x.replace("$", "").replace(",", "").replace(" ", "")
    return float(x)
Next, call it on your column like this.
data['Revenue'] = data['Revenue'].apply(clean)

Or, if you want to use a lambda function inside apply:
data['Revenue'] = data['Revenue'].apply(lambda x: float(x.replace("$", "").replace(",", "").replace(" ", "")))
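A vectorised alternative (a small addition, not part of the original answers) does the same cleanup in one pass, assuming the column holds strings:

# Strip '$', ',' and spaces with a single regex, then convert to float
data['Revenue'] = data['Revenue'].str.replace(r'[$, ]', '', regex=True).astype(float)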

Related

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set (https://github.com/alexpetralia/fama_french):

Use the first column as an index (this contains the year and month of the data as a string).
Create a new column 'Mkt' as 'Mkt-RF' + 'RF'.
Create two new columns in the loaded DataFrame, 'Month' and 'Year', to contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns 'Mean' and 'Standard Deviation' and the full set of years from (b) above.
Write a function which accepts (r_m, s_m), the monthly mean and standard deviation of a return series, and returns a tuple (r_a, s_a), the annualised mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 - 1, and s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and standard deviation of the new 'Mkt' column, storing each in the newly created DataFrame. Note that the values in the input file are % returns, and need to be divided by 100 to return decimals (i.e. the value for August 2022 represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
ff_monthly = pd.read_csv(r"file path")
ff_monthly = pd.read_csv(r"file path", index_col=0)
Mkt = ff_monthly['Mkt-RF'] + ff_monthly['RF']
ff_monthly = ff_monthly.assign(Mkt=Mkt)
df = pd.DataFrame(ff_monthly)
There are a few things to pay attention to.
The Date is the index of your DataFrame. This is treated in a special way compared to the normal columns, which is why df.Date gives an AttributeError: Date is not an attribute, but the index. Instead, try df.index.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and also contains the day, so this cannot work.
In fact, the format you have doesn't follow any standard. The best way to deal with it properly is to parse it into a proper datetime64[ns] type that pandas will understand, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python docs for supported format strings.
If all this works, it should be rather straightforward to create the columns
df['year'] = df.index.year
In fact, this part has been asked before
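Putting the pieces together, a minimal sketch of the remaining steps might look like this (the %y%m%d index format is the guess from above, and the 'Mkt-RF'/'RF' column names follow the question; treat both as assumptions to check against the actual file):

import pandas as pd

ff_monthly = pd.read_csv("ff_monthly.csv", index_col=0)

# Parse the string index into datetimes (format string is an assumption, see above)
ff_monthly.index = pd.to_datetime(ff_monthly.index, format='%y%m%d')

# New columns: market return, plus Month and Year taken from the index
ff_monthly['Mkt'] = ff_monthly['Mkt-RF'] + ff_monthly['RF']
ff_monthly['Month'] = ff_monthly.index.month
ff_monthly['Year'] = ff_monthly.index.year

def annualise(r_m, s_m):
    """Annualise a monthly mean and standard deviation."""
    return (1 + r_m) ** 12 - 1, s_m * 12 ** 0.5

# Annualised statistics per year; inputs are % returns, hence the / 100
rows = {}
for year, grp in ff_monthly.groupby('Year'):
    r_a, s_a = annualise(grp['Mkt'].mean() / 100, grp['Mkt'].std() / 100)
    rows[year] = {'Mean': r_a, 'Standard Deviation': s_a}

result = pd.DataFrame.from_dict(rows, orient='index')
print(result)
result.to_csv('annualised_mkt.csv')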

How to relocate different data that is in a single column to their respective columns?

I have a dataframe whose data are strings, with different pieces of information mixed together in a single column. Like this:
0                       Place: House
1       Date/Time: 01/02/03 at 09:30
2                       Color:Yellow
3                      Place: Street
4    Date/Time: 12/12/13 at 13:21:21
5                          Color:Red
df = pd.DataFrame(['Place: House', 'Date/Time: 01/02/03 at 09:30', 'Color:Yellow',
                   'Place: Street', 'Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
I need the dataframe like this:
Place    Date/Time    Color
House    01/02/03     Yellow
Street   21/12/13     Red
I started by converting the excel file to csv, and then I tried to open it as follows:
df = pd.read_csv(filename, sep=":")
I tried using the ":" to separate the columns, but the time formatting also uses ":", so it didn't work. The time is not important information so I even tried to delete it and keep the date, but I couldn't find a way that wouldn't affect the other information in the column either.
Given the values in your data, you will need to limit the split so it happens just once, which you can do with the n parameter of split. You can expand the split values into two columns and then pivot.
The trick here is to create a grouping by taking df.index // 3, so that every 3 lines forms a new group.
df = pd.DataFrame(['Place: House', 'Date/Time: 01/02/03 at 09:30', 'Color:Yellow',
                   'Place: Street', 'Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
df = df[0].str.split(':', n=1, expand=True)
df['idx'] = df.index // 3
df.pivot(index='idx', columns=0, values=1).reset_index().drop(columns='idx')[['Place', 'Date/Time', 'Color']]
Output
0   Place              Date/Time   Color
0   House      01/02/03 at 09:30   Yellow
1   Street  21/12/13 at 13:21:21   Red
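One small caveat (an addition, not in the original answer): the split leaves a leading space on the values, so a strip pass may be wanted:

# Remove the stray whitespace left behind by the split
out = df.pivot(index='idx', columns=0, values=1).reset_index(drop=True)
out = out[['Place', 'Date/Time', 'Color']].apply(lambda col: col.str.strip())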
Your data is all strings; IMO you are likely to get better performance wrangling it in vanilla Python before bringing it back into Pandas. The only time you are likely to get better performance for strings in Pandas is if you are using the pyarrow string data type.
from collections import defaultdict

out = df.squeeze().tolist()  # this works since it is just one column
frame = defaultdict(list)
for entry in out:
    key, value = entry.split(':', maxsplit=1)
    if key == "Date/Time":
        value = value.split('at')[0]
    value = value.strip()
    key = key.strip()  # not really necessary
    frame[key].append(value)

pd.DataFrame(frame)
    Place Date/Time   Color
0   House  01/02/03  Yellow
1  Street  21/12/13     Red
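For reference (an aside, not in the original answer), the pyarrow-backed string dtype mentioned above can be opted into like this, assuming the pyarrow package is installed:

# Arrow-backed strings; string operations on this dtype are generally faster
s = df[0].astype('string[pyarrow]')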

Search for variable name using iloc function in pandas dataframe

I have a pandas dataframe that consists of 5000 rows with different countries and emission data, and looks like the following:
country   year   emissions
peru      2020   1000
          2019   900
          2018   800
The country label is an index.
e.g. df = emission.loc[['peru']] would give me a new dataframe consisting only of the emission data attached to peru.
My goal is to use a variable name instead of 'peru' and store the country-specific emission data into a new dataframe.
What I am searching for is code that would work the same way as the code below:
country = 'zanzibar'
df = emissions.loc[[{country}]]
From what I can tell, the problem arises with the iloc function, which does not accept variables as input. Is there a way I could circumvent this problem?
In other words, I want to be able to create a new dataframe with country-specific emission data, based on a variable that matches one of the countries in my emission.index(), all without having to change anything but the given variable.
One way could be to iterate through or maybe create a function in some way?
Thank you in advance for any help.
An alternative approach where you don't use a country name for your index:
emissions = pd.DataFrame({'Country': ['Peru', 'Peru', 'Peru', 'Chile', 'Chile', 'Chile'],
                          'Year': [2021, 2020, 2019, 2021, 2020, 2019],
                          'Emissions': [100, 200, 400, 300, 200, 100]})
country = 'Peru'
country = 'Peru'
Then to filter:
df = emissions[emissions.Country == country]
or
df = emissions.loc[emissions.Country == country]
Giving:
  Country  Year  Emissions
0    Peru  2021        100
1    Peru  2020        200
2    Peru  2019        400
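If you filter for several countries, a small helper (hypothetical, not part of the original answer) keeps the boolean-mask pattern in one place:

def emissions_for(df, country):
    """Return only the rows whose Country matches the given name."""
    return df.loc[df['Country'] == country]

peru = emissions_for(emissions, 'Peru')
chile = emissions_for(emissions, 'Chile')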
You should be able to select by a certain string for your index. For example:
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=['Peru', 'Peru', 'zanzibar', 'zanzibar'])
country = 'zanzibar'
df.loc[country]
This will return:
          a
zanzibar  3
zanzibar  4
In your case, dropping the braces (which would create a Python set) and one set of square brackets should work:
country = 'zanzibar'
df = emissions.loc[country]
I don't know if this solution is exactly what your question asks, but in this case I will give a solution for turning a country name into a variable.
However, because a variable name can't contain a space (" ") character, you have to replace the space character with an underscore ("_") character.
(Just in case your 'country' values have some country names using more than one word.)
Example:
United Kingdom to United_Kingdom
by using this code:
df['country'] = df['country'].replace(' ', '_', regex=True)
So after your country names are changed to the new format, you can get all the country names from the dataframe using .unique() and store them in a new variable with this code:
country_name = df['country'].unique()
After running that code, all the unique values in the 'country' column are stored in a variable called 'country_name'.
Next, use a for loop to generate a new variable for each country name:
for i in country_name:
    locals()[i] = df[df['country'] == i]
So, locals() here is used to turn each string into a variable name (because 'country_name' holds the country names as strings), and df[df['country'] == i] subsets the dataframe to the rows where country equals each unique value from 'country_name'.
After that, a new variable exists for each country name in the 'country' column.
Hopefully this can help to solve your problem.
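As an aside (a suggestion beyond the original answer), a plain dictionary avoids mutating locals(), which is not guaranteed to work inside a function:

# One sub-DataFrame per country, keyed by name; no dynamic variable names needed
country_frames = {name: df[df['country'] == name] for name in country_name}
country_frames['United_Kingdom']  # hypothetical key, following the renaming above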

Python - How to make crosstable in pandas from non numeric data?

So, the thing is I need to create a crosstable from string data. I mean like in Excel: if you put some string data into a crosstable, it is automatically transformed into counted values per the other factor. For instance, I have column 'A' which contains application numbers and column 'B' which contains dates. I need to show how many applications were placed on each day. A classic crosstable returns an error.
data.columns = [['applicationnumber', 'date', 'param1', 'param2', 'param3']] #mostly string values
Examples of input data:
applicationnumber = "AAA12345678"
date = 'YYYY-MM-DD'
Is this what you are looking for:
import numpy as np

df = pd.DataFrame([['app1', '01/01/2019'],
                   ['app2', '01/02/2019'],
                   ['app3', '01/02/2019'],
                   ['app4', '01/02/2019'],
                   ['app5', '01/04/2019'],
                   ['app6', '01/04/2019']],
                  columns=['app.no', 'date'])
print(pd.pivot_table(df, values='app.no', index='date', aggfunc=np.size))
Output:
            app.no
date
01/01/2019       1
01/02/2019       3
01/04/2019       2
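The same counts can be had without numpy (a small addition, not in the original answer):

# Count applications per day directly on the date column
print(df['date'].value_counts().sort_index())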

How to access index of string value in a cell of pandas data frame?

I'm working with the Bureau of Labor Statistics data which looks like this:
series_id year period value
CES0000000001 2006 M01 135446.0
series_id[3][4] indicates the supersector. For example, CES10xxxxxx01 would be Mining & Logging. There are 15 supersectors that I'm concerned with, and hence I want to create 15 separate data frames, one for each supersector, to perform time series analysis. So I'm trying to access each value as a list to achieve something like:
# *pseudocode*:
mining_and_logging = df[df.series_id[3]==1 and df.series_id[4]==0]
Can I avoid writing a for loop where I convert each value to a list then access by index and add the row to the new dataframe?
How can I achieve this?
One way to do what you want and recursively store the dataframes through a for loop could be:
First, create an auxiliary column to make your life easier:
df['id'] = df['series_id'].str[3:5]  # extract characters 3 and 4 of every string (counting from zero)
Then, you create an empty dictionary and populate it:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
Now you'll have a dictionary with 15 dataframes inside. For example, if you want to call the dataframe associated with id = 01, you just do:
dict_df['01']
Hope it helps!
Solved it by combining answers from Juan C and G. Anderson.
Select the 3rd and 4th character:
df['id'] = df.series_id.str.slice(start=3, stop=5)
And then the following to create dataframes:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
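The same dictionary can also be built in one step with groupby (an equivalent idiom, not in the original answers):

# Each supersector id maps to its own sub-DataFrame
dict_df = {key: group for key, group in df.groupby('id')}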
