I have tried various methods to add a new column to a pandas DataFrame, but I get the same result every time.
Methods tried:
call_duration is a list with the same number of items as there are rows in the DataFrame.
df['Duration_sec'] = pd.Series(call_duration,index=np.arange(len(df)))
and
df['Duration_sec'] = pd.Series(call_duration,index=df.index)
and
df['Duration_sec'] = np.array(call_duration)
All three gave the same result, shown below.
I don't understand why the new column is printed on a new line, and why there is a \ at the end of the first line.
"The new column is not added to a new line"
The DataFrame is wider than the screen and is therefore continued on the next row. In Python, the \ usually denotes line continuation.
To add a column, simply use df.assign:
df.assign(Duration_sec=call_duration)
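Note that assign returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the column. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'caller': ['a', 'b', 'c']})
call_duration = [10, 20, 30]  # one entry per row

# assign returns a new DataFrame; rebind it to keep the column
df = df.assign(Duration_sec=call_duration)
print(df)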
You can just do
df['Duration_sec'] = call_duration
"\" means the dataframe is wider than your screen and will continue.
I am trying to write some Python logic to fill a CSV file/pandas DataFrame called table under certain conditions, but I can't seem to get it to do what I want.
I have two columns in table: 1. trade_type and 2. execution_venue.
Conditional statement I want to write in Python:
The execution_venue entry will only be filled with either AQXE or AQEU, depending on the trade_type.
When the trade_type is filled with the string DARK, I want the execution_venue to be filled with XUBS (if it was AQXE before) and AQED (if it was AQEU before).
Here is my code to do this:
security_mic = ('AQXE', 'AQEU')
table.loc[table['trade_type'] == 'DARK', 'execution_venue'] = {'AQXE': 'XUBS',
                                                               'AQEU': 'AQED'}.get(security_mic)
When I replace the right-hand side of the assignment with a test string, I get the same error, so I suspect the problem is with the left-hand side, in that it is not accessing the correct place in the DataFrame!
Let's use replace to substitute the old values where trade_type is DARK:
d = {'AQXE': 'XUBS', 'AQEU': 'AQED'}
table.loc[table['trade_type'] == 'DARK', 'execution_venue'] = table['execution_venue'].replace(d)
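A minimal sketch of how this works, with made-up rows (the sample values are assumptions for illustration):

import pandas as pd

table = pd.DataFrame({
    'trade_type': ['DARK', 'LIT', 'DARK'],
    'execution_venue': ['AQXE', 'AQXE', 'AQEU'],
})

d = {'AQXE': 'XUBS', 'AQEU': 'AQED'}
table.loc[table['trade_type'] == 'DARK', 'execution_venue'] = table['execution_venue'].replace(d)
print(table)  # the DARK rows become XUBS/AQED; the LIT row keeps AQXE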
This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While 'amenity_options' is a simple list of specific items (say only four amenities, as in the example below), the df is a large DataFrame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe','bar','cinema','casino'] # a list with multiple options
df = df[df['amenity'] == amenity_options] # this is my attempt to select the first value in the list (e.g. cafe) out of a dataframe that contains such a column.
df.to_excel('{}_amenity.xlsx'.format('amenity')) # wish to save the result (e.g. cafe_amenity) as a separate file.
Desired result: I wish to loop steps one and two for each and every item in the list (e.g. cafe, bar, cinema...), so that I end up with separate Excel files. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you get 4 groups that you can loop over directly.
The key is the group key (cafe, bar, etc.), and g is the sub-DataFrame filtered to that key.
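A minimal sketch of the iteration on made-up rows, printing instead of writing Excel files:

import pandas as pd

df = pd.DataFrame({
    'amenity': ['cafe', 'bar', 'cafe', 'casino'],
    'name': ['A', 'B', 'C', 'D'],
})

for key, g in df.groupby('amenity'):
    print(key)  # 'bar', 'cafe', 'casino' (groups come back sorted by key)
    print(g)    # the sub-DataFrame whose amenity equals key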
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")
I'm trying to convert the DataFrame below into Series:
The columns "Emerging Markets" and "Event Driven" are of interest to me. So, I create a new DataFrame by using the code below:
columns = ['Emerging Markets','Event Driven'] #Indicate which columns I want to use
TargetData = Hedgefunds[columns]
But now I want to create two Series, one for "Emerging Markets" and one for "Event Driven", and I can't figure out how to do it. I used the code below (same logic as above) but it does not work:
Emerging_Markets_Column = ['Emerging Markets']
EM = TargetData['Emerging_Markets-Column']
What would be the best way to go about separating the columns from each other?
Why don't you use the first DataFrame as the reference and try:
EM = Hedgefunds['Emerging Markets']
ED = Hedgefunds['Event Driven']
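Selecting a single column with bracket notation already returns a pandas Series, so no further conversion is needed. A minimal sketch with made-up numbers:

import pandas as pd

Hedgefunds = pd.DataFrame({
    'Emerging Markets': [0.1, 0.2],
    'Event Driven': [0.3, 0.4],
})

EM = Hedgefunds['Emerging Markets']
print(type(EM))  # <class 'pandas.core.series.Series'>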
I have a DataFrame df in a PySpark setting. I want to change a column, say it is called A, whose datatype is "string". I want to change its values according to their lengths. In particular, if a row's value is only one character, we want to concatenate "0" to the end. Otherwise, we keep the default value. The name of the "modified" column must still be A. This is for a Jupyter Notebook using PySpark3.
This is what I have tried so far:
df = df.withColumn("A", when(size(df.col("A")) == 1, concat(df.col("A"), lit("0"))).otherwise(df.col("A")))
I also tried the same code with the df.col calls removed.
When I run this code, the software complains saying that the syntax is invalid, but I don't see the error.
df.withColumn("temp", when(length(df.A) == 1, concat(df.A, lit("0"))).\
otherwise(df.A)).drop("A").withColumnRenamed('temp', 'A')
What I understood from your question is that you were getting one extra column A.
You want that old column A replaced by the new column A, so I created a temp column with your required logic, then dropped column A, then renamed the temp column to A.
Listen here child...
To choose a column from a DataFrame in PySpark, you must not use the .col method, since that belongs to the Scala/Java API. In PySpark, just access the column by name on the DataFrame: df.colName.
To get the length of your string, use the "length" function. The "size" function is for array and map columns.
And for the grand solution... (drums drums drums)
df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
Please!
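For completeness, a minimal end-to-end sketch (the SparkSession setup and the sample values are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, length, concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1",), ("23",), ("4",)], ["A"])

# Pad one-character values with a trailing "0"; leave the rest unchanged
df = df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
df.show()  # "1" -> "10", "23" -> "23", "4" -> "40"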
I am working on Jupyter Notebook. I have multiple data frames in which I am comparing data between them. Before I compare the data between these data frames, I need to clean up some of the strings. I need to remove the double quotes (") AND I need to get rid of the NaN values in the empty cells.
In order to do this for one DataFrame, titled df1970, I used these two lines:
df1970['Title'] = pd.Series(df1970['Title']).str.replace('"', '')
df1970= df1970.replace(np.nan, "", regex=True)
When I refer to df1970 downstream, it gives me the cleaned data frame. However, I have a dataset titled df1966 and I want to remove the double quotes and replace NaN without typing the whole above code again. So I created a function:
def cleanupdataset(df):
    df['Title'] = pd.Series(df['Title']).str.replace('"','')
    df = df.replace(np.nan, "", regex=True)
    return df
Then, when I call:
cleanupdataset(df1966)
...it gives me a nice clean dataset of 1966 that I want to use downstream.
My later functions use USETHISDF as the name of the DataFrame on which to operate. This time around I want to use my nice new clean df1966, so I redefine it:
cleanupdataset(df1966)
USETHISDF = df1966
But when I call it to check that it's cleaned...
USETHISDF
it gives me the non-cleaned version of df1966. What am I doing wrong?
Your function does not change the initial DataFrame in place, but returns a new DataFrame. In order to see the changes, you have to use the return value of your function:
USETHISDF = cleanupdataset(df1966)
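A minimal sketch of the corrected flow on made-up data (the .copy() is an addition so the caller's DataFrame isn't partially mutated by the Title assignment):

import numpy as np
import pandas as pd

def cleanupdataset(df):
    df = df.copy()  # work on a copy; the original stays untouched
    df['Title'] = df['Title'].str.replace('"', '')
    df = df.replace(np.nan, "", regex=True)
    return df

df1966 = pd.DataFrame({'Title': ['"Song A"', np.nan, '"Song B"']})
USETHISDF = cleanupdataset(df1966)
print(USETHISDF)  # quotes stripped, NaN replaced with ''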