Pandas Dataframe - Append dataframes (Multiple columns/rows + Multiple columns, single row) - python

More adventures in dataframes :)
So, I've pretty much got all the basics down; however, this one is stumping me. I have two dataframes (screenshots below). The first (techIndicator) has a ton of columns and rows, all filled properly. The second dataframe (social) has multiple columns, but only one row.
I need to add the columns (that part works, as per the screenshots), but I want to duplicate the social dataframe's single row all the way down to fill in the NaNs.
Here is the code that I'm using to concatenate all of the dataframes into a single one (all of them work except for social):
techIndicator = pd.concat([inter_day, macd, rsi, ema, vwap, adx, dmi, social], axis = 1)
techIndicator.sort_index(ascending=False, inplace=True)
techIndicator.dropna()
techIndicator.reset_index(drop=True)
As per the screenshots below, the first three rows should look like:
datetime1 | 1 | 2 | 3 | 4 | ......| 9 | 8| 7 | 6
datetime2 | 2 | 1 | 4 | 3 | ......| 9 | 8| 7 | 6
datetime3 | 3 | 4 | 1 | 2 | ......| 9 | 8| 7 | 6
Instead, the concat above adds the columns but deletes the values (I already checked the data types; they're all float64).
Please help =) My google-fu isn't working for this >.<
With help from Alex below, I was able to solve the many issues that I was having!
dfTemp = pd.concat([inter_day, macd, rsi, ema, vwap, adx, dmi], axis=1)
dfTemp.sort_index(ascending=False, inplace=True)
dfTemp.dropna(inplace=True)
dfTemp.reset_index(inplace=True)
# Stack the single social row once per row of dfTemp
long_social = social
for a in range(dfTemp.shape[0] - 1):
    long_social = pd.concat([long_social, social])
long_social.reset_index(inplace=True)
long_social.drop(columns=['index'], inplace=True)
techIndicator = pd.concat([dfTemp, long_social], axis=1)
techIndicator.rename(columns={'date': 'Date',
                              '1. open': 'Open',
                              '2. high': 'High',
                              '3. low': 'Low',
                              '4. close': 'Close',
                              '5. volume': 'Volume',
                              'DX': 'DMI'}, inplace=True)
techIndicator.dropna(inplace=True)
techIndicator.reset_index(drop=True, inplace=True)
techIndicator.set_index('Date', inplace=True)
techIndicator.sort_index(ascending=False, inplace=True)
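For reference, the row-duplication loop can be avoided entirely. Since social has exactly one row, each of its values can be broadcast straight down a new column, because assigning a scalar to a DataFrame column fills every row. A minimal sketch, assuming the same dataframes as above:
# Sketch: broadcast the single social row instead of looping pd.concat
for col in social.columns:
    dfTemp[col] = social[col].iloc[0]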

So I have a solution for you which does not use concat to add the columns to your main dataframe, but it gets the job done.
The flow of it: since the two dataframes are not the same length, we first make them the same length, and then just loop through to add the columns by naming them.
# First create a copy of your social_df, which we will append to itself
# until it is as long as your main df
long_social = social_df
for a in range(main_df.shape[0] - 1):
    long_social = pd.concat([long_social, social_df])
# Now you have a long_social with the same length as your main df
social_vars = list(social_df.columns)  # get the column names from social_df for naming them as we add to the main df
for var in social_vars:
    # add each column by creating a new column with the desired name and
    # assigning the stacked social values as a list (avoids index alignment issues)
    main_df[var] = list(long_social[var])

Related

Compare a row substring in one dataframe column with row substring of another dataframe column, and remove non-matching ones

I have two dataframes with different row counts.
df1 has the problems and count
problems | count
broken, torn | 10
torn, faded | 15
worn-out, broken | 25
faded | 5
df2 has the order_id and problems
order_id | problems
123 | broken
594 | torn
811 | worn-out, broken
I need to remove all rows from df1 whose problems do not match the individual problems listed in df2, and I want to keep the count column of df1.
The final df1 data frame would look like this:
problems | count
broken | 10
torn | 15
worn-out, broken | 25
Can someone please help?
IIUC, try this if your problems are strings:
df1 = pd.DataFrame({'problems': ['broken, torn', 'torn, faded', 'worn-out, broken', 'faded'],
                    'count': [10, 15, 25, 5]})
df2 = pd.DataFrame({'order_id': [123, 594, 811],
                    'problems': ['broken', 'torn', 'worn-out, broken']})
prob_df2 = df2['problems'].str.split(r',\s?').explode()
df1_prob = df1.assign(prob_exp=df1['problems'].str.split(r',\s?'))
df1_exp = df1_prob.explode('prob_exp')
df1_out = (df1_exp[df1_exp['prob_exp'].isin(prob_df2)]
           .groupby(level=0)
           .agg({'count': 'first', 'prob_exp': ', '.join}))
df1_out
Output:
count prob_exp
0 10 broken, torn
1 15 torn
2 25 worn-out, broken
Details:
- Create a list of problems from df2 using .str.split (with a regex that allows an optional space after the comma) and explode
- Create a new column in df1 with the exploded problems
- Check whether each problem in df1 is in the df2 problem list
- Use groupby to combine the matching problems back together for each original index
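If you only need to drop the non-matching rows and keep df1 otherwise unchanged (without rebuilding the problems strings), a shorter variant is possible; a sketch, assuming the same df1 and df2 as above:
# Keep a df1 row if ANY of its comma-separated problems appears in df2
valid = set(df2['problems'].str.split(r',\s?').explode())
mask = (df1['problems'].str.split(r',\s?')
        .explode()
        .isin(valid)
        .groupby(level=0)  # collapse back to df1's original index
        .any())
df1_filtered = df1[mask]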

add values in Pandas DataFrame

I want to add values to a dataframe, but I want to write clean code (short and fast). I really want to improve my skill at writing clean code.
Suppose that we have a DataFrame and 3 values
df = pd.DataFrame({"Name": [], "ID": [], "LastName": []})
value1 = "ema"
value2 = 23123  # note: 023123 with a leading zero is not a valid Python 3 integer literal
value3 = "Perez"
I can write:
df.append([value1,value2,value3])
but the output is going to put the values into a new column, like:
0 | Name | ID | LastName
ema | nan | nan | nan
023123 | nan | nan| nan
Perez | nan | nan | nan
I want the following output, with the cleanest code:
Name | ID | LastName
ema | 023123 | Perez
Is there a way to do this without appending one by one? (I want the shortest/fastest code.)
You can convert the values to dict then use append
df.append(dict(zip(['Name', 'ID', 'LastName'],[value1,value2,value3])), ignore_index=True)
Name ID LastName
0 ema 23123.0 Perez
Here is the explanation:
First, set your 3 values into a list:
values = [value1, value2, value3]
and make a variable to act as an index marker when looping later:
i = 0
Then use the code below:
for column in df.columns:
    df.loc[0, column] = values[i]
    i += 1
column in df.columns will give you every column name in the DataFrame, and df.loc[0, column] = values[i] will set values[i] at row 0 and the given column.
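A caveat for newer pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so the dict-based df.append(...) answer above no longer runs on current versions (the df.loc loop still works). A sketch of the same single-row insert using pd.concat instead:
import pandas as pd

df = pd.DataFrame({"Name": [], "ID": [], "LastName": []})
row = {"Name": "ema", "ID": 23123, "LastName": "Perez"}
# pd.concat replaces the removed df.append; ignore_index renumbers the rows
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)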

column dates to quarters with pandas

I have a pandas dataframe with these columns (the important part is that I have every month from 1996-04 to 2016-08):
Index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank',
'1996-04', '1996-05', '1996-06', '1996-07',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=251)
I need to group the columns in sets of three to represent financial quarters, e.g.:
| 1998-01 | 1998-02 | 1998-03 |
| 2       | 4       | 7       |
Needs to become
| 1998q1     |
| avg(2,4,7) |
Any hint about the right approach?
First move all non-date columns into the index, convert the remaining columns to quarterly periods, and aggregate column-wise with the mean:
df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'])
df.columns = pd.to_datetime(df.columns).to_period('Q').strftime('%Yq%q')
df = df.groupby(level=0, axis=1).mean().reset_index()
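Note that groupby(..., axis=1) has since been deprecated, so on current pandas the same column-wise quarterly mean can be done through a transpose. A sketch, assuming the same columns as in the question:
df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro',
                   'CountyName', 'SizeRank'])
df.columns = pd.to_datetime(df.columns).to_period('Q').strftime('%Yq%q')
# Transpose, average the rows that share a quarter label, transpose back
df = df.T.groupby(level=0).mean().T.reset_index()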

How to reshape dataframe with multi year data in Python

I believe my question can be solved with a loop, but I haven't been able to create one. I have a data sample which looks like this:
sample data
And I would like to have a dataframe that is organised by year:
result data
I tried the pivot function by creating a year column with df['year'] = df.index.year and then reshaping with pivot, but it only populates the first year's column because of the index.
I have managed to do this type of reshaping manually, but with several years of data it is a time-consuming solution. Here is the example code for the manual approach:
mydata1 = pd.DataFrame()
mydata2 = pd.DataFrame()
mydata3 = pd.DataFrame()
mydata1['1'] = df['data'].iloc[160:664]
mydata2['2'] = df['data'].iloc[2769:3273]
mydata3['3'] = df['data'].iloc[5583:6087]
mydata1.reset_index(drop=True, inplace=True)
mydata2.reset_index(drop=True, inplace=True)
mydata3.reset_index(drop=True, inplace=True)
mydata = pd.concat([mydata1, mydata2, mydata3], axis=1, ignore_index=True)
mydata.columns = ['78', '88', '00', '05']
Welcome to StackOverflow! I think I understood what you were asking for from your question, but please correct me if I'm wrong. Basically, you want to reshape your current pandas.DataFrame using a pivot. I set up a sample dataset and solved the problem in the following way:
import pandas as pd

# test set
df = pd.DataFrame({'Index': ['2.1.2000', '3.1.2000', '3.1.2001', '4.1.2001', '3.1.2002', '4.1.2002'],
                   'Value': [100, 101, 110, 111, 105, 104]})

# create a year column for yourself
# by splitting on '.' and selecting the year element
df['Year'] = df['Index'].str.split('.', expand=True)[2]

# pivot your table
pivot = pd.pivot_table(df, index=df.index, columns='Year', values='Value')

# the pivoted test set has unwanted null values, so apply a function that
# drops the nulls in each column without losing values in other columns
pivot = pivot.apply(lambda x: pd.Series(x.dropna().values))
Result on my end
| Year | 2000 | 2001 | 2002 |
|------|------|------|------|
| 0 | 100 | 110 | 105 |
| 1 | 101 | 111 | 104 |
Hope this solves your problem!
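For what it's worth, if the index is already a DatetimeIndex (as the df.index.year attempt in the question suggests), the string splitting can be skipped, and using each row's position within its year as the pivot index avoids the NaN cleanup step entirely. A sketch under that assumption, reusing the Value column name from the test set above:
df['Year'] = df.index.year
# cumcount() numbers the rows within each year (0, 1, 2, ...), which
# lines the years up side by side without intermediate NaNs
pivot = pd.pivot_table(df, index=df.groupby('Year').cumcount(),
                       columns='Year', values='Value')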

How to add new row in pandas dataframe? [duplicate]

I have an existing dataframe to which I need to add an additional column, which will contain the same value for every row.
Existing df:
Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450
New df:
Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450
I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the 'Name' column and set every row to the same value, in this case 'abc'.
df['Name']='abc' will add the new column and set all rows to that value:
In [79]:
df
Out[79]:
Date, Open, High, Low, Close
0 01-01-2015, 565, 600, 400, 450
In [80]:
df['Name'] = 'abc'
df
Out[80]:
Date, Open, High, Low, Close Name
0 01-01-2015, 565, 600, 400, 450 abc
You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.
df.insert(0, 'Name', 'abc')
Name Date Open High Low Close
0 abc 01-01-2015 565 600 400 450
Summing up what the others have suggested, and adding a third way, you can:
1. assign(**kwargs):
df.assign(Name='abc')
2. access the new column series (it will be created) and set it:
df['Name'] = 'abc'
3. insert(loc, column, value, allow_duplicates=False):
df.insert(0, 'Name', 'abc')
where the argument loc (0 <= loc <= len(columns)) allows you to insert the column where you want.
'loc' gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).
All these methods allow you to add a new column from a Series as well (just substitute the 'abc' default argument above with the series).
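For instance, substituting a Series: the values are aligned on the index, so rows without a matching label get NaN. A small sketch with a hypothetical two-row frame:
import pandas as pd

df = pd.DataFrame({'Open': [565, 570]}, index=[0, 1])
names = pd.Series(['abc'], index=[0])  # only labels row 0
df['Name'] = names                     # row 1 gets NaN via index alignment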
A one-liner works:
df['Name'] = 'abc'
This creates a Name column and sets all rows to the value abc.
I want to draw more attention to a portion of #michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations are not in isolation, but in a chain of operations. And if you want to support a chain of operations, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (but the same principles would apply to say ... financial data).
This code reads like a recipe of steps. Both assignment (with =) and .insert make this much harder:
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])

def clean_alta(df):
    return (df
            .loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
                     'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
            .groupby(pd.Grouper(key='DATE', freq='W'))
            .agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
            .assign(LOCATION='Alta',
                    T_RANGE=lambda w_df: w_df.TMAX - w_df.TMIN)
            )

clean_alta(raw)
Notice the line .assign(LOCATION='Alta', that creates a column with a single value in the middle of the rest of the operations.
One line did the job for me:
df['New Column'] = 'Constant Value'
df['New Column'] = 123
You can simply do the following:
df['New Col'] = pd.Series(["abc" for x in range(len(df.index))])
This single line will work.
df['name'] = 'abc'
The append method has been deprecated since pandas 1.4.0 (and removed in 2.0).
So instead, if you are working on the actual DataFrame object, use simple assignment:
df["column"] = "value"
Or, if you are setting a value on a view or copy of a DataFrame, use concat() or assign().
This way the new Series created has the same index as the original DataFrame, and so will match on the exact rows:
# adds a new column named `client` with value `display_name`
# to the view `where_there_is_one`; `df` remains unchanged
df = pd.DataFrame({"number": [1] * 5 + [0] * 5})
where_there_is_one = df[df["number"] == 1]
where_there_is_one = pd.concat(
    [where_there_is_one,
     pd.Series(["display_name"] * df.shape[0],
               index=df.index,
               name="client")],
    join="inner", axis=1)
# Or use assign
where_there_is_one = where_there_is_one.assign(client="display_name")
Output:
where_there_is_one:
|   | number | client       |
|---|--------|--------------|
| 0 | 1      | display_name |
| 1 | 1      | display_name |
| 2 | 1      | display_name |
| 3 | 1      | display_name |
| 4 | 1      | display_name |
df:
|   | number |
|---|--------|
| 0 | 1      |
| 1 | 1      |
| 2 | 1      |
| 3 | 1      |
| 4 | 1      |
| 5 | 0      |
| 6 | 0      |
| 7 | 0      |
| 8 | 0      |
| 9 | 0      |
OK, all, I have a similar situation here, but with this code: df['Name'] = 'abc'
instead of 'abc', I want to take the name for the new column from somewhere else in the CSV file.
As you can see from the picture, df is not cleaned yet, but I want to create 2 columns with the value "ADI dms rivoli" repeated for every row, and the same for "December 2019". I hope it is clear; it was hard to explain, sorry.
