Python - Generating a new, larger dataset from an existing dataset by looping over rows

I looked at both "How do I generate a new set of values from an existing dataset" and "generate data by using existing dataset as the base dataset"; neither fulfills my needs, so I read a ton of looping answers, but that didn't get me all the way.
I have the traditional Adult dataset. After cleaning it and setting some aside for validation, it looks like this:
Adult dataset - 43958 rows and 12 columns
I want to run a loop that takes each row and adds a new row where age is increased by 1, but keeps all other data equal to that of the row.
I have tried two different ways.
Nr 1:
df1 = newDataFrame
# iterate through each row of the dataframe
for index, row in df1.iterrows():
    new_row = {'age': index + 1, 'workclass': [], 'education': [], 'educational-num': [], 'marital-status': [], 'occupation': [],
               'race': [], 'gender': [], 'capital-gain': [], 'capital-loss': [], 'hours-per-week': [], 'income': []}
    print(new_row)
But that gives me:
{'age': 35596, 'workclass': [], 'education': [], 'educational-num': [], 'marital-status': [], 'occupation': [], 'race': [], 'gender': [], 'capital-gain': [], 'capital-loss': [], 'hours-per-week': [], 'income': []}
I also tried:
df1 = newDataFrame
colums = list(df1)
# iterate through each row of the dataframe
for index, row in df1.iterrows():
    values = [([0] + 1), [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]]
    zipped = zip(colums, values)
    a_dictionary = dict(zipped)
    print(a_dictionary)
But I get this error:
> TypeError: can only concatenate list (not "int") to list
I understand that it is because of the `colums = list(df1)` line, but I don't know how to change it. I tried some append() calls, but that didn't help.
So after two days I turn to you.
The goal is to make the dataset bigger while keeping a strong correlation between values.
Perfect, thanks @gofvonx! I had to make a small change, but this worked:
df1 = newDataFrame
df_new = df1.copy()
df_new.age += 1
df1 = pd.concat([df1, df_new], axis=0, ignore_index=True)  # concat returns a new frame, so assign the result

Your code above has some issues. E.g., new_row is overwritten at each iteration without storing the previous value.
But you do not need to use a loop. You can try:
df_new = df1.copy()
df_new['age'] += 1
pd.concat([df1, df_new], axis=0, ignore_index=True)
Note that ignore_index=True will create a new index 0,...,n-1 (see the documentation here).
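If you ever do need a loop (say, for row-specific logic), a minimal sketch, assuming df1 has an 'age' column as in the question, is to collect the modified rows in a list instead of overwriting new_row on each iteration:
import pandas as pd

# collect a modified copy of each row, then build the enlarged frame in one go
new_rows = []
for _, row in df1.iterrows():
    modified = row.copy()
    modified['age'] += 1
    new_rows.append(modified)

df_combined = pd.concat([df1, pd.DataFrame(new_rows)], ignore_index=True)
Appending to a plain list and concatenating once at the end is much faster than growing a DataFrame inside the loop.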

Related

Dataframe returning empty after assignment of values?

Essentially, I would like to add values to certain columns in an empty DataFrame with defined columns, but when I run the code, I get:
Empty DataFrame
Columns: [AP, AV]
Index: []
Code:
df = pd.DataFrame(columns=['AP', 'AV'])
df['AP'] = propName
df['AV'] = propVal
I think this could be a simple fix, but I've tried some different solutions to no avail. I've tried adding the values to an existing dataframe I have, and it works when I do that, but I would like to have these values in a new, separate structure.
Thank you,
It's the lack of an index.
If you create an empty dataframe with an index:
df = pd.DataFrame(index=[5])
Output
Empty DataFrame
Columns: []
Index: [5]
Then when you set the value, it will be set.
df[5] = 12345
Output
       5
5  12345
You can also create an empty dataframe and, when setting a column, pass the value in a list. The index will be set automatically.
df = pd.DataFrame()
df['qwe'] = [777]
Output
   qwe
0  777
Assign propName and propValue to a dictionary:
props = {}  # named 'props' rather than 'dict' to avoid shadowing the builtin
props[propName] = propValue
Then, push to empty DataFrame, df:
df = pd.DataFrame()
df['AP'] = props.keys()
df['AV'] = props.values()
Probably not the most elegant solution, but works great for me.
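A simpler sketch for the original AP/AV case, assuming propName and propVal are scalars (the values below are hypothetical, for illustration), is to wrap them in lists at construction time so the one-row index is created automatically:
import pandas as pd

propName, propVal = 'height', 42  # hypothetical values for illustration

# wrapping scalars in lists gives the frame a one-row index automatically
df = pd.DataFrame({'AP': [propName], 'AV': [propVal]})
print(df)
#        AP  AV
# 0  height  42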

How to make empty Pandas DataFrame with named columns and then add a row?

Questions:
How do I make an empty Pandas DataFrame and name 2 columns in it?
How do I add a row to the previously created DataFrame?
Can somebody help, please? My code below doesn't work properly; it returns a strange DataFrame.
My code:
data_set = pd.DataFrame(columns=['POST_TEXT', 'TARGET'])
data_set[data_set.shape[0]] = ['Today is a great day!', '1']
data_set = pd.DataFrame(columns=['POST_TEXT', 'TARGET'])
# output
Empty DataFrame
Columns: [POST_TEXT, TARGET]
Index: []
# add row
data_set = data_set.append({"POST_TEXT": 5, "TARGET": 10}, ignore_index=True)
# output
  POST_TEXT TARGET
0         5     10
So to append a row you have to define a dict where the key is the name of the column and the value is the value you want to append.
If you would like to add row and populate only one column:
data_set = data_set.append({"POST_TEXT": 50}, ignore_index=True)
# output
  POST_TEXT TARGET
0      50.0    NaN
Instead of adding the value post-creation, it is possible to add it during creation:
data_set = pd.DataFrame(columns=['POST_TEXT', 'TARGET'], data=[['Today is a great day!', '1']])
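One caveat worth adding to this answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same row-append is usually written with pd.concat, roughly like this:
import pandas as pd

data_set = pd.DataFrame(columns=['POST_TEXT', 'TARGET'])

# pd.concat replacement for the removed DataFrame.append
new_row = pd.DataFrame([{'POST_TEXT': 'Today is a great day!', 'TARGET': '1'}])
data_set = pd.concat([data_set, new_row], ignore_index=True)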

Pandas "A value is trying to be set on a copy of a slice from a DataFrame"

Having a bit of trouble understanding the documentation
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
C:/Users/erasmuss/PycharmProjects/Sarah/farmdata.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Code is basically to re-arrange and clean some data to make analysis easier.
The data is given row-by-row per animal, but has repetitions, blanks, and some other sparse values.
Idea is to basically stack rows into columns and grab the useful data (Weight by date and final BCS) per animal
Initial DF: (a few snippets of the dataframe were shown as images)
Output format: (the output DF/csv was shown as an image)
import pandas as pd
import numpy as np

# Function for cleaning up multiple entries of breeds
def testbreed(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

# Read data
df1 = pd.read_csv("farmdata.csv")
# Drop fully-empty columns
df1.dropna(how='all', axis=1, inplace=True)
# Copy to extract weights in df2
df2 = df1.copy()
df2 = df2.drop(['BCS', 'Breed', 'Age'], axis=1)
# Pivot for ID names in df1
df1 = df1.pivot(index='ID', columns='Date', values=['Breed', 'Weight', 'BCS'])
# Pivot for weights in df2
df2 = df2.pivot(index='ID', columns='Date', values='Weight')
# Split out Breed and BCS into individual dataframes w/ duplicate/missing data for each ID
df3 = df1.copy()
dfbreed = df3[['Breed']]
dfBCS = df3[['BCS']]
# Drop empty BCS columns
df1.dropna(how='all', axis=1, inplace=True)
# Shorten Breed and BCS to a single column by grabbing the first value that is real; see function above
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)
# Populate BCS and Breed into a new DF
df5 = pd.DataFrame(data=None)
df5['Breed'] = dfbreed['x']
df5['BCS'] = dfBCS['x']
# Join weights
df5 = df5.join(df2)
# Write output
df5.to_csv(r'.\out1.csv')
I want to take the BCS and Breed dataframes, whose columns are multi-indexed by Breed or BCS and then by date, take the first non-NaN value across the date columns in each row, and set that into a column named breed.
I had a lot of trouble getting the columns to pick the first valid values in situ on the DF.
I found a workaround in a 2015 answer, which defined the function at the top.
Reading through the docs on setting a value on a copy of a slice makes sense intuitively, but I can't seem to think of a way to make it work as a direct replacement or index-based. Should I be looping through?
Trying the second answer here, I get:
dfbreed.loc[:,'Breed'] = dfbreed['Breed'].apply(testbreed, axis=1)
dfBCS.loc[:, 'BCS'] = dfBCS.apply['BCS'](testbreed, axis=1)
which returns
ValueError: Must have equal len keys and value when setting with an iterable
I'm thinking this has something to do with the MultiIndex; the keys come up as:
MultiIndex([('Breed', '1/28/2021'),
            ('Breed', '2/12/2021'),
            ('Breed', '2/4/2021'),
            ('Breed', '3/18/2021'),
            ('Breed', '7/30/2021')],
           names=[None, 'Date'])
MultiIndex([('BCS', '1/28/2021'),
            ('BCS', '2/12/2021'),
            ('BCS', '2/4/2021'),
            ('BCS', '3/18/2021'),
            ('BCS', '7/30/2021')],
           names=[None, 'Date'])
Sorry for the long question(s?)
Can anyone help me out?
Thanks.
You created dfbreed as:
dfbreed = df3[['Breed']]
So it is a view of the original DataFrame (limited to just this one column). Remember that a view does not have its own data buffer; it is only a tool to "view" a fragment of the original DataFrame and should be treated as read-only. When you attempt to perform dfbreed['x'] = dfbreed.apply(...), you are actually attempting to write through that view, which is what triggers the warning.
To avoid this error, create dfbreed as an "independent" DataFrame:
dfbreed = df3[['Breed']].copy()
Now dfbreed has its own data buffer and you are free to change the data.
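Applied to the question's code, a minimal sketch of the fix (reusing df3 and testbreed from the code above):
# take independent copies so the later column assignments do not touch df3
dfbreed = df3[['Breed']].copy()
dfBCS = df3[['BCS']].copy()

# these assignments now write into dfbreed/dfBCS's own buffers, with no warning
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)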

How can I get data from a dataframe based on matching criteria with pandas?

I have a list called 'common_numbers'. This list has numbers (in str format) that will match with some of the numbers in a data frame that are in the 4th column. So for example:
common_numbers = ['512', '653', '950']
(example row in a data frame) df = expeditious, tree, www.stackflow.com, 512, asn
data frame example:
0 0,1,2,3,4
1 host,ip,FQDN,asn,asnOrgName
2 barracuda,208.92.204.42,barracuda.godsgarden.com,17359,exampleorgName
The commonality in common_numbers and the data frame in this example is 512. Thus, the value I want to retrieve is www.stackflow.com from the data frame.
I tried:
wanted_data = []
if i in common_values:
    print("Match found.. generating fqdn..")
    for i in df_is_not:
        wanted_data.append(df.loc[df[2].isin(common_values)])
print(wanted_fqdn_data)
It returns:
Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
What am I doing wrong? How can I fix this? Thanks so much. With the example I gave above I'm expecting to get:
print(wanted_data)
>>>['www.stackflow.com']
Try this:
df1 = pd.DataFrame(['512', '653', '950'])
df2 = pd.DataFrame([['expeditious', 'tree', 'www.stackflow.com', '512', 'asn'],
                    ['barracuda', '208.92.204.42', 'barracuda.godsgarden.com', '17359', 'exampleorgName']],
                   columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df3 = df2.merge(df1, left_on=['c4'], right_on=[0], how='inner', left_index=False)[['c3']]
df3
The result will be:
                  c3
0  www.stackflow.com
You have the right idea, there just really isn't a need for a loop in this case.
If you ultimately just want to pull out the third column of every row where the fourth column is in a list you have, then you can do the following:
df = df[df[3].isin(common_numbers)]
wanted_data = list(df[2])
Hopefully, this answers your question.
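A self-contained sketch of that isin approach, using hypothetical sample data mirroring the question:
import pandas as pd

common_numbers = ['512', '653', '950']
df = pd.DataFrame([['expeditious', 'tree', 'www.stackflow.com', '512', 'asn'],
                   ['barracuda', '208.92.204.42', 'barracuda.godsgarden.com', '17359', 'exampleorgName']])

# keep the rows whose 4th column (label 3) appears in common_numbers,
# then pull out the 3rd column (label 2)
wanted_data = list(df[df[3].isin(common_numbers)][2])
print(wanted_data)  # ['www.stackflow.com']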

How to Dynamically generate for loop based on columns in dataframe?

I am trying to generate a for loop dynamically based on the number of columns in a dataframe.
For example, if my dataframe has 5 columns, I generate the for loop and assign variables accordingly.
if
df_cols = ['USER_ID', 'BLID', 'PACKAGE_NAME', 'PACKAGE_PRICE', 'ENDED_DATE']
and brics is my dataframe
Then
for index, row in brics.iterrows():
    analytics.track(row['USER_ID'], 'Cancelled Subscription', {
        df_cols[1]: row['BLID'],
        df_cols[2]: row['PACKAGE_NAME'],
        df_cols[3]: row['PACKAGE_PRICE'],
        df_cols[4]: row['ENDED_DATE'],
    })
The df_cols entries and the row[...] lookups should be generated based on the number of columns in the dataframe.
For example, if there are only 2 columns in the data frame, the code should look like this:
if
df_cols = ['USER_ID', 'BLID']
Then
for index, row in brics.iterrows():
    analytics.track(row['USER_ID'], 'Cancelled Subscription', {
        df_cols[1]: row['BLID']
    })
I searched SO for this solution but couldn't find one related to dataframes (though an R version is available). Any pointers will be helpful. Thank you.
df_cols = ['USER_ID', 'BLID', 'PACKAGE_NAME', 'PACKAGE_PRICE', 'ENDED_DATE']
for index, row in brics.iterrows():
    analytics.track(row['USER_ID'], 'Cancelled Subscription', {
        df_cols[i]: row[df_cols[i]] for i in range(1, len(df_cols))
    })
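If the column list should come from the DataFrame itself rather than being hard-coded, a sketch (reusing brics and analytics from the question, and assuming the first column is always the user id):
# derive df_cols from the frame so the payload adapts to however many columns exist
df_cols = list(brics.columns)

for index, row in brics.iterrows():
    analytics.track(row[df_cols[0]], 'Cancelled Subscription', {
        col: row[col] for col in df_cols[1:]
    })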
