Concatenate Series using For - python

I'm having some trouble creating a DataFrame from several Series. Is there a way to concatenate them inside a for loop? Every time I try, I only get the last Series in the DataFrame, when what I really want is to concatenate each one as a new column rather than replace the previous one.
suma_queries = list()
for query in queries:
    cur.execute(query)
    schema = lib.get_schema_sql(cursor=cur)
    table = lib.get_table_sql(cur)
    df = pd.DataFrame(data=table, columns=schema)
    suma_queries.append(df.iloc[:, 18].sum())
suma_queries = pd.Series(suma_queries)
concat_df = pd.concat([suma_queries], axis=1)
As you can see, for each Series obtained from the loop I try to concatenate it into a dataframe called concat_df, and so on for the next Series, but in the end I only get the last Series, because each iteration replaces the value.
What I want at the end should be a dataframe like:
Series1 Series2 Series3 … SeriesN
s1_1 s2_1 s3_1 sn_1
s1_2 … … …
s1_3 … … …
… … … …
s1_n s2_n s3_n sn_n
where each column is a series.
Please let me know if there is a way to do it.
Thanks!!

You should reassign the appended result back to the variable inside the for loop:
for query in queries:
    cur.execute(query)
    schema = lib.get_schema_sql(cursor=cur)
    table = lib.get_table_sql(cur)
    df = pd.DataFrame(data=table, columns=schema)
    suma_queries = suma_queries.append(df.iloc[:, 18].sum())
suma_queries = pd.Series(suma_queries)
concat_df = pd.concat([suma_queries], axis=1)
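If the goal is really one column per query (as in the desired output above), another option is to collect the Series in a list inside the loop and call pd.concat only once, after the loop. This is a minimal sketch, assuming the same queries, cur and lib objects from the question; the Series{i + 1} naming is only for illustration:
import pandas as pd

series_list = []
for i, query in enumerate(queries):
    cur.execute(query)
    schema = lib.get_schema_sql(cursor=cur)
    table = lib.get_table_sql(cur)
    df = pd.DataFrame(data=table, columns=schema)
    # keep the whole column (not just its sum) as one named Series per query
    series_list.append(df.iloc[:, 18].rename(f"Series{i + 1}"))

# axis=1 lays the collected Series side by side, one column each
concat_df = pd.concat(series_list, axis=1)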

Related

How to concatenate a series to a pandas dataframe in python?

I would like to iterate through a dataframe's rows and concatenate each row to a different dataframe, basically building up a new dataframe from selected rows.
For example:
# IPCSection and IPCClass DataFrames
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:
            pdList = [finalpatentclasses, pd.DataFrame(secrow), pd.DataFrame(clrow)]
            finalpatentclasses = pd.concat(pdList, axis=0, ignore_index=True)
display(finalpatentclasses)
The output (screenshot not included here) has the section and class values stacked with NaN padding. I want the NaN values to disappear and all the data to sit under the correct columns. I tried axis=1, but that messes up the column names. append does not work either; all values end up placed diagonally in the table, again with NaN values.
Alright, I have figured it out. The idea is that you create a new-row DataFrame, concatenate all the data into a single array, add it to that DataFrame as one row, and then concat it with the final DataFrame.
Here is the code:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        newrow = pd.DataFrame(columns=allcolumns)
        values = np.concatenate((secrow.values, clrow.values), axis=0)
        newrow.loc[len(newrow.index)] = values
        finalpatentclasses = pd.concat([finalpatentclasses, newrow], axis=0)
finalpatentclasses.reset_index(drop=True, inplace=True)
display(finalpatentclasses)
Update: the code below is more efficient:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
newList = []
for secrow in IPCSection.itertuples():
    for clrow in IPCClass.itertuples():
        if secrow[1] in clrow[1]:
            # itertuples() puts the index at position 0, so the data starts at 1
            values = [secrow[1], secrow[2], clrow[1], clrow[2]]
            newList.append(values)
finalpatentclasses = pd.DataFrame(newList, columns=allcolumns)
display(finalpatentclasses)
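As an aside, the NaNs in the original attempt come from pd.DataFrame(secrow) producing a one-column frame, so concatenating along axis=0 stacks the section and class values in different columns. A minimal sketch of an alternative, reusing the names from the question: build one-row frames with .to_frame().T so the values stay under their original column names, and concatenate once at the end:
row_frames = []
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:
            # one row holding the section columns followed by the class columns
            row_frames.append(pd.concat([secrow, clrow]).to_frame().T)
finalpatentclasses = pd.concat(row_frames, axis=0, ignore_index=True)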

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector, Value2_bySector, Value1_byDate, and Value2_byDate.
For example, new columns named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically in Value1_by_Date_lag1, the date "15/03" will contain the value "281.75" which is for the date "14/03" (lag of 1 shift).
Basically in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for "14/03" and "Medical" rows.
I hope the question is clear and gives you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
I found one inefficient solution (slow and memory-intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
    # shift a fresh copy each time so the lags do not compound across iterations
    lagged = temp.copy()
    lagged["date"] = lagged["date"].shift(-1-i)
    df = pd.merge(df,lagged,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
    # again, shift a fresh copy, by 1+i rows within each Sector
    lagged = temp.copy()
    lagged[cols] = lagged.groupby("Sector")[cols].shift(1+i)
    df = pd.merge(df,lagged,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a simpler solution?
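For what it's worth, one way to avoid mutating temp inside the loop is to shift the group-level tables themselves and merge each lag back. This is a sketch only, assuming the dateGroupGroup and sectorGroup frames from above are available, contain one row per date (and per date/Sector), and are sorted by date in ascending order:
by_date = dateGroupGroup.sort_values("date").set_index("date")
by_sector = sectorGroup.sort_values(["Sector", "date"]).set_index(["date", "Sector"])

for lag in range(1, 11):
    date_lag = by_date.shift(lag).add_suffix(f"_by_Date_lag{lag}").reset_index()
    sector_lag = (by_sector.groupby(level="Sector")
                           .shift(lag)
                           .add_suffix(f"_bySector_lag{lag}")
                           .reset_index())
    df = (df.merge(date_lag, on="date", how="left")
            .merge(sector_lag, on=["date", "Sector"], how="left"))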

Reformatting a dataframe to access it for sort after concatenating two series

I've joined or concatenated two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data that would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
0 1
category
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
and I'd like to do a sort, but there are no proper column headings
df_plots = df_plots.sort_values(by=['0?'])
But the dataframe seems to be in two parts. How could I better structure the dataframe to have 'proper' columns such as '0' or 'plot a' rather than being indexable by an integer, which seems to be hard to work with.
category plot a plot b
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once, like:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys() }
df = df.rename(columns=rename_dict)
You can also define the series with the name, so you avoid changing the name afterwards:
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index], name = 'counts_a')
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index], name = 'counts_b')
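Another option along the same lines (a small sketch): give the columns their names at concat time with the keys argument, after which sorting by name works directly:
df_plots = pd.concat([counts_a, counts_b], axis=1, keys=['plot a', 'plot b']).fillna(0)
df_plots = df_plots.sort_values(by='plot a')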

How to return the name of dataframe column in an array

I'm trying to iterate my code, and have created a list holding 4 dataframe columns. Is there a way to return the dataframe column names from the list?
code:
df1['c1'] = df1['c1'].apply(lambda x: 'test')
df1['c2'] = df1['c2'].apply(lambda x: 'test')
df2['c1'] = df2['c1'].apply(lambda x: 'test')
df2['c2'] = df2['c2'].apply(lambda x: 'test')
What I tried:
tables = [df1['c1'], df1['c2'], df2['c1'], df2['c2']]
for i in tables:
    i = i.apply(lambda x: 'test')
Essentially: how do I call the items in the tables list and return the actual value, not the table itself?
Desired:
tables = [df1['c1'], df1['c2'], df2['c1'], df2['c2']]
for i in tables:
    # iterates by calling the name of each dataframe column in tables so it
Essentially: how do I call the items in the tables list and return the dataframe variable name, so it returns
df1['c1'], df1['c2'], ... instead of the actual dataframe.
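A Series only knows its own column name (i.name), not which dataframe variable it came from, so one way is to keep that label yourself. This is a sketch of the idea; the dict of (dataframe, column) pairs and its label strings are just an illustration, not something from the question:
tables = {
    "df1['c1']": (df1, 'c1'),
    "df1['c2']": (df1, 'c2'),
    "df2['c1']": (df2, 'c1'),
    "df2['c2']": (df2, 'c2'),
}
for label, (frame, col) in tables.items():
    frame[col] = frame[col].apply(lambda x: 'test')  # updates the original dataframe
    print(label)  # the "name" you wanted to get back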

Memory problem using python pandas to join stock DataFrames in loop

I am trying to join a lot of dataframes in order to build a correlation matrix in pandas.
So it seems that I have to keep adding columns on the right-hand side, with "Date" as the index.
But when I run this function with just 50 dataframes, it ends with a memory error.
Does anyone know what is happening?
def taking_and_combining_data_from_mysql_to_excel(root):
    saved_path = root + r"\main_df.xlsx"
    main_df = pd.DataFrame()
    mycursor = mydb.cursor(buffered=True)
    for key, value in stock_dic.items():
        mycursor.execute("""SELECT date, Adj_close
                            FROM hk_stock
                            Where date >= '2020-03-13 00:00:00' and stock_number = '{}'""".format(key))
        row_result = mycursor.fetchall()
        df = pd.DataFrame(row_result)
        df.columns = ['Date', value]
        df.set_index('Date', inplace=True)
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how="outer")
    with pd.ExcelWriter(saved_path) as writer:
        main_df.to_excel(writer, sheet_name="raw_data")
        main_df.corr().to_excel(writer, sheet_name="correlation")
    return main_df
Pandas is not designed for such dynamic concatenations. You could just append things into a list, and convert that list into a DataFrame. Like so:
join = []
for key, value in stock_dic.items():
    join.append({'Date': value})
df_join = pd.DataFrame(join)
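Applied to the loop from the question, the same idea would look roughly like this (a sketch reusing the question's mycursor and stock_dic): collect one Series per stock and concatenate them into columns once, instead of calling join fifty times:
series_by_stock = {}
for key, value in stock_dic.items():
    mycursor.execute("""SELECT date, Adj_close
                        FROM hk_stock
                        Where date >= '2020-03-13 00:00:00' and stock_number = '{}'""".format(key))
    rows = mycursor.fetchall()
    df = pd.DataFrame(rows, columns=['Date', value]).set_index('Date')
    series_by_stock[value] = df[value]

# a single concat builds the wide frame; dict keys become the column names
main_df = pd.concat(series_by_stock, axis=1)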
