How to separate one column from a DataFrame in Python / Jupyter? - python

I'm trying to the convert the below DataFrame into Series:
The columns "Emerging Markets" and "Event Driven" are of interest to me. So, I create a new DataFrame by using the code below:
columns = ['Emerging Markets','Event Driven'] #Indicate which columns I want to use
TargetData = Hedgefunds[columns]
But now I want to create two series, one for "Emerging Markets" and one for "Event Driven" but I'm can't figure out how to do it. I used the code below (same logic as above) but it does not work:
Emerging_Markets_Column = ['Emerging Markets']
EM = TargetData['Emerging_Markets-Column']
What would be the best way to go about separating the columns from each other?

Why dont you use the first dataframe as reference and try .
EM = Hedgefunds['Emerging Markets']
ED = Hedgefunds['Event Driven']

Related

New dataframe in Pandas based on specific values(a lot of them) from existing df

Good evening! I'm using pandas on Jupyter Notebook. I have a huge dataframe representing full history of posts of 26 channels in a messenger. It has a column "dialog_id" which represents in which dialog the message was sent(so, there can be only 26 unique values in the column, but there are more then 700k rows, and the df is sorted itself by time, not id, so it is kinda chaotic). I have to split this dataframe into 2 different(one will contain full history of 13 channels, and the other will contain history for the rest 13 channels). I know ids by which I have to split, they are random as well. For example, one is -1001232032465 and the other is -1001153765346.
The question is, how do I do it most elegantly and adequate?
I know I can do it somehow with df.loc[], but I don't want to put like 13 rows of df.loc[]. I've tried to use logical operators for this, like:
df1.loc[(df["dialog_id"] == '-1001708255880') & (df["dialog_id"] == '-1001645788710' )], but it doesn't work. I suppose I'm using them wrong. I expect a solution with any method creating a new df, with the use of logical operators. In verbal expression, I think it should sound like "put the row in a new df if the dialog_id is x, or dialog_id is y, or dialog_id is z, etc". Please help me!
The easiest way seems to be just setting up a query.
df = pd.DataFrame(dict(col_id=[1,2,3,4,], other=[5,6,7,8,]))
channel_groupA = [1,2]
channel_groupB = [3,4]
df_groupA = df.query(f'col_id == {channel_groupA}')
df_groupB = df.query(f'col_id == {channel_groupB}')

Iterate through list of dataframes, performing calculations on certain columns of each dataframe, resulting in new dataframe of the results

Newbie here. Just as the title says, I have a list of dataframes (each dataframe is a class of students). All dataframes have the same columns. I have made certain columns global.
BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']
for example. These are yes/no or male/female categories, and I have already changed all of the data to be 1's and 0's for these columns. There are several other columns which I want to ignore as I iterate.
I am trying to accept the list of classes (dataframes) into my function and perform calculations on each dataframe using only my BINARY_CATEGORIES list of columns. This is what I've got, but it isn't making it through all of the classes and/or all of the columns.
def bal_bin_cols(classes):
i = 0
c = 0
for x in classes:
total_binary = classes[c][BINARY_CATEGORIES[i]].sum()
print(total_binary)
i+=1
c+=1
Eventually I need a new dataframe from this all of the sums corresponding to the categories and the respective classes. print(total binary) is just a place holder/debugger. I don't have that code yet that will populate the dataframe from the results of the above code, but I'd like it to be the classes as the index and the total calculation as the columns.
I know there's probably a vectorized way to do this, or enum, or groupby, but I will take a fix to my loop. I've been stuck forever. Please help.
Try something like:
Firstly create a dictionary:
d={
'male':1,
'female':0,
'yes':1,
'no':0
}
Finally use replace():
df[BINARY_CATEGORIES]=df[BINARY_CATEGORIES].replace(d.keys(),d.values(),regex=True)

Manipulate string in python (replace string with part of the string itself)

So I am trying to transform the data I have into the form I can work with. I have this column called "season/ teams" that looks smth like "1989-90 Bos"
I would like to transform it into a string like "1990" in python using pandas dataframe. I read some tutorials about pd.replace() but can't seem to find a use for my scenario. How can I solve this? thanks for the help.
FYI, I have 16k lines of data.
A snapshot of the data I am working with:
To change that field from "1989-90 BOS" to "1990" you could do the following:
df['Yr/Team'] = df['Yr/Team'].str[:2] + df['Yr/Team'].str[5:7]
If the structure of your data will always be the same, this is an easy way to do it.
If the data in your Yr/Team column has a standard format you can extract the values you need based on their position.
import pandas as pd
df = pd.DataFrame({'Yr/Team': ['1990-91 team'], 'data': [1]})
df['year'] = df['Yr/Team'].str[0:2] + df['Yr/Team'].str[5:7]
print(df)
Yr/Team data year
0 1990-91 team 1 1991
You can use pd.Series.str.extract to extract a pattern from a column of string. For example, if you want to extract the first year, second year and team in three different columns, you can use this:
df["year"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")
Note the use of named parameters to automatically name the columns
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

Problem while adding new column in Panda dataframe using Python

I have tried various methods to add new column to a Panda dataframe but I get the same result.
Methods tried:
call_duration is a list having same number of items as in the data frame.
df['Duration_sec'] = pd.Series(call_duration,index=np.arange(len(df)))
and
df['Duration_sec'] = pd.Series(call_duration,index=df.index)
and
# df['Duration_sec'] = np.array(call_duration)
All three gave the same result as under-
I don't understand why the new column is added to new line? And why is there a \ at the end of the first line?
"The new column is not added to a new line"
The DataFrame is wider than the screen and hence continued in next row. In python the \ is usually used to denote join
To add a column, Simply use df.assign
df.assign(Duration_sec=call_duration)
You can just do
df['Duration_sec'] = call_duration
"\" means the dataframe is wider than your screen and will continue.

Iterate over multiple dataframes and perform maths functions save output

I have several dataframes on which I an performing the same functions - extracting mean, geomean, median etc etc for a particular column (PurchasePrice), organised by groups within another column (GORegion). At the moment I am just performing this for each dataframe separately as I cannot work out how to do this in a for loop and save separate data series for each function performed on each dataframe.
i.e. I perform median like this:
regmedian15 = pd.Series(nw15.groupby(["GORegion"])['PurchasePrice'].median(), name = "regmedian_nw15")
I want to do this for a list of dataframes [nw15, nw16, nw17], extracting the same variable outputs for each of them.
I have tried things like :
listofnwdfs = [nw15, nw16, nw17]
for df in listofcmldfs:
df+'regmedian' = pd.Series(df.groupby(["GORegion"])
['PurchasePrice'].median(), name = df+'regmedian')
but it says "can't assign to operator"
I think the main point is I can't work out how to create separate output variable names using the names of the dataframes I am inputting into the for loop. I just want a for loop function that produces my median output as a series for each dataframe in the list separately, and I can then do this for means and so on.
Many thanks for your help!
First, df+'regmedian' = ... is not valid Python syntax. You are trying to assign a value to an expression of the form A + B, which is why Python complains that you are trying to re-define the meaning of +.
Also, df+'regmedian' itself seems strange. You are trying to add a DataFrame and a string.
One way to keep track of different statistics for different datafarmes is by using dicts. For example, you can replace
listofnwdfs = [nw15, nw16, nw17]
with
dict_of_nwd_frames = {15: nw15, 16: nw16, 17: nw17}
Say you want to store 'regmedian' data for each frame. You can do this with dicts as well.
data = dict()
for key, df in dict_of_nwd_frames.items():
data[(i, 'regmedian')] = pd.Series(df.groupby(["GORegion"])['PurchasePrice'].median(), name = str(key) + 'regmedian')

Categories

Resources