Dataframe returning empty after assignment of values? - python

Essentially, I would like to add values to certain columns of an empty DataFrame that has its columns already defined, but when I run the code, I get:
Empty DataFrame
Columns: [AP, AV]
Index: []
Code:
df = pd.DataFrame(columns=['AP', 'AV'])
df['AP'] = propName
df['AV'] = propVal
I think this could be a simple fix, but I've tried some different solutions to no avail. I've tried adding the values to an existing dataframe I have, and it works when I do that, but would like to have these values in a new, separate structure.
Thank you,

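The cause is the lack of an index. Assigning a scalar to a column broadcasts it over the existing rows, and an empty DataFrame has none.
If you create an empty DataFrame with an index: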
df = pd.DataFrame(index = [5])
Output
Empty DataFrame
Columns: []
Index: [5]
Then, when you assign a scalar, it is set for every row of that index.
df[5] = 12345
Output
       5
5  12345
You can also create a completely empty DataFrame and, when setting a column, pass the value inside a list. The index will then be created automatically.
df = pd.DataFrame()
df['qwe'] = [777]
Output
   qwe
0  777
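To make the difference concrete, here is a minimal sketch that contrasts scalar assignment with list assignment on the DataFrame from the question. The values given to propName and propVal are placeholders invented for illustration; the question does not show what they hold.

```python
import pandas as pd

# Placeholder values (assumption): the question does not show them.
propName = "colour"
propVal = "red"

# Scalar assignment broadcasts over the existing index, which has no rows.
empty = pd.DataFrame(columns=["AP", "AV"])
empty["AP"] = propName
empty["AV"] = propVal
print(len(empty))  # the frame still has no rows

# Wrapping each value in a list supplies one row, so an index is created.
df = pd.DataFrame(columns=["AP", "AV"])
df["AP"] = [propName]
df["AV"] = [propVal]
print(df)

# Building the frame in one step from a dict of lists also works.
same = pd.DataFrame({"AP": [propName], "AV": [propVal]})
print(same)
```

If more rows are added later, it is usually simpler to collect them in a list of dicts and build the DataFrame once at the end.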

Assign propName and propVal to a dictionary (avoid calling it dict, which shadows the built-in):
props = {}
props[propName] = propVal
Then push the keys and values into an empty DataFrame, df:
df = pd.DataFrame()
df['AP'] = list(props.keys())
df['AV'] = list(props.values())
Probably not the most elegant solution, but it works great for me.

Related

How to check pandas column names and then append to row data efficiently?

I have a dataframe with several columns, some of which have names that match the keys in a dictionary. I want to append the dictionary value to the non-null values of the column whose name matches the key. Hopefully that isn't too confusing.
example:
realms = {}
realms['email'] = '<email>'
realms['android'] = '<androidID>'
df = pd.DataFrame()
df['email'] = ['foo#gmail.com', '', 'foo#yahoo.com']
df['android'] = [1234567, None, 55533321]
How could I append '<email>' to 'foo#gmail.com' and 'foo#yahoo.com'
without also appending it to the empty string or null value?
I'm trying to do this without using iteritems(), as I have about 200,000 records to apply this logic to.
The expected output would be something like 'foo#gmail.com<email>', '', 'foo#yahoo.com<email>'.
for column in df.columns:
    df[column] = df[column].astype(str) + realms[column]
>>> df
                  email              android
0  foo#gmail.com<email>   1234567<androidID>
1  foo#yahoo.com<email>  55533321<androidID>
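Note that the loop above appends the suffix to every cell, including empty strings and nulls. A possible vectorized variant that skips those cells is sketched below. It assumes the columns are stored with object dtype so that the integers and the missing value can sit side by side; the has_value mask is a name introduced here for illustration.

```python
import pandas as pd

realms = {"email": "<email>", "android": "<androidID>"}

# dtype=object (assumption) keeps the integers as integers next to the missing value.
df = pd.DataFrame(
    {
        "email": ["foo#gmail.com", "", "foo#yahoo.com"],
        "android": [1234567, None, 55533321],
    },
    dtype=object,
)

for column, suffix in realms.items():
    if column not in df.columns:
        continue  # dictionary key with no matching column
    col = df[column]
    # True only where the cell is neither null nor an empty string.
    has_value = col.notna() & (col.astype(str) != "")
    # Overwrite only the cells that hold a value; the others stay untouched.
    df.loc[has_value, column] = col[has_value].astype(str) + suffix

print(df)
```

The loop runs once per column, not once per row, so it should scale to a few hundred thousand records without iterating over them in Python.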

pandas df masking specific row by list

I have a pandas df with 7000 rows and 7 columns, and a list (row_list) that consists of the values I want to filter on.
What I want to do is filter the rows of df whose value in one column appears in that list.
This is what I got when I tried,
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names = 'A')
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
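Replace
boolean_series = df.RightInsoleImage.isin(row_list)
with
boolean_series = df.RightInsoleImage.isin(df1.A)
and let us know the result. (RightInsoleImage is presumably the column shown as D in the question.) Your loop appends a one-element list on every iteration, so row_list is a list of lists and isin never finds a match; passing the column directly compares plain values. If it doesn't work, show a sample of df and df1.A.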
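As a small illustration of that suggestion, the sketch below uses two made-up frames in place of the CSV files (the column names D and A follow the question; the data is invented).

```python
import pandas as pd

# Small stand-ins (assumption) for the two CSV files in the question.
df = pd.DataFrame({"D": ["a", "b", "c", "d"], "E": [1, 2, 3, 4]})
df1 = pd.DataFrame({"A": ["b", "d"]})

# Pass the column itself (or df1["A"].tolist()); no loop is needed.
mask = df["D"].isin(df1["A"])

matching = df[mask]     # rows whose D value appears in df1.A
remaining = df[~mask]   # rows whose D value does not appear in df1.A

print(matching)
print(remaining)
```

Use matching or remaining depending on whether the listed values should be kept or dropped.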
(1) generating separate dfs for each condition, concat, then dedup (slow)
(2) a custom function to annotate with bool column (default as False, then annotated True if condition is fulfilled), then filter based on that column
(3) keep a list of indices of all rows with your row_list values, then filter using iloc based on your indices list
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'], 'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'], 'val3': [7, 9, 11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, with values like DC0001, DC0002, etc. Does i always have to be numeric? On my real data, the call adds the stub names as new columns and returns zero rows.
This is what my real columns look like:
My real data might contain NAs as well. Do I have to fill them with default values for wide_to_long to work?
Can you please help me work out what the issue is? Any other approach that achieves the same result would also be helpful.
Try adding the suffix argument, which lets the function match string suffixes.
pd.wide_to_long(......................., suffix=r'\w+')
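A minimal sketch of that call is shown below. The frame is invented for illustration: it uses string IDs like those described in the question and letter suffixes (dateA, valA, ...) so that the default numeric suffix pattern would not match.

```python
import pandas as pd

# Invented data (assumption): string IDs and non-numeric column suffixes.
wide = pd.DataFrame({
    "subject_ID": ["DC0001", "DC0002"],
    "dateA": ["12/31/2007", "11/25/2009"],
    "valA": [2, 4],
    "dateB": ["12/31/2017", "11/25/2019"],
    "valB": [1, 3],
})

# suffix is a regular expression for whatever follows the stub name.
long_df = pd.wide_to_long(
    wide, stubnames=["date", "val"], i="subject_ID", j="grp", suffix=r"\w+"
).sort_index(level=0)

print(long_df)
```

The i column holds strings here, which suggests that i does not have to be numeric; what matters is that every column name is exactly a stub followed by something the suffix pattern matches.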
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of the column names, or you need to specify a suffix pattern that matches what follows the stub. I think the easiest solution is to create a function that accepts a regex and the dataframe and renames the matching columns.
import pandas as pd
import re

def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex = regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col) # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns = dd, inplace = True)
    return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
# Change val columns
tdf = change_names(tdf, 'val$')
print(tdf)
   person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
This is quite a late answer, but I'm putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22's solution to rename your columns so that the numeric
## part is at the end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                          'h2date': 'hdate2', 't2val': 'tval2',
                          'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion (in this example 'hdate', 'tval') as
## stubnames. The mistake you were making was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)

Iterating over multiIndex dataframe

I have a data frame as shown below
I have a problem iterating over the rows. For every row fetched, I want to return the keys that share a value. For example, in the second row, for the 2016-08-31 00:00:01 entry, df1 and df3 both have the compass value 4.0, so I want to return the keys that have the same compass value, which are df1 and df3 in this case.
I have been iterating over the rows using
for index,row in df.iterrows():
Update
Okay, now that I understand your question better, this should work for you.
First change the shape of your dataframe with
dfs = df.stack().swaplevel(axis=0)
This will make your dataframe look like:
Then you can iterate the rows like before and extract the information you want. I'm just using print statements for everything, but you can put this in some more appropriate data structure.
for index, row in dfs.iterrows():
    # Mark every value in the row that occurs more than once
    dup_filter = row.duplicated(keep=False)
    dfss = row[dup_filter].index.values
    print("Attribute:", index[0])
    print("Index:", index[1])
    print("Matches:", dfss, "\n")
which will print out something like
.....
Attribute: compass
Index: 5
Matches: ['df1' 'df3']
Attribute: gyro
Index: 5
Matches: ['df1' 'df3']
Attribute: accel
Index: 6
Matches: ['df1' 'df3']
....
You could also do it one attribute at a time by
dfs_compass = df.stack().swaplevel(axis=0).loc['compass']
and iterate through the rows with just the index.
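The reshaping and the duplicate check can be tried end to end on a small frame. The sketch below assumes the layout described in the question, i.e. columns with a first level of keys (df1, df2, df3) and a second level of attributes; the numbers are invented, and results are collected in a dictionary (matches, a name introduced here) instead of being printed.

```python
import pandas as pd

# Invented data (assumption): first column level = key, second level = attribute.
columns = pd.MultiIndex.from_product([["df1", "df2", "df3"], ["compass", "accel"]])
df = pd.DataFrame(
    [[4.0, 1.0, 2.0, 1.5, 4.0, 3.0],
     [5.0, 2.0, 5.0, 2.5, 6.0, 2.0]],
    columns=columns,
)

# Move the attribute level into the index, then put it first: (attribute, row).
dfs = df.stack().swaplevel(axis=0)

matches = {}
for index, row in dfs.iterrows():
    # keep=False flags every member of a group of equal values.
    dup_filter = row.duplicated(keep=False)
    matches[index] = list(row[dup_filter].index)

for key, keys_with_same_value in matches.items():
    print(key, keys_with_same_value)
```

Depending on the pandas version, stack() may emit a FutureWarning about its new implementation; the result for this frame should be the same either way.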
Old
If I understand your question correctly, you want to return the indexes of rows that have matching values on the second level of your columns, i.e. ('compass', 'accel', 'gyro'). If so, the following will work.
compass_match_indexes = []
for index, row in df.iterrows():
    match_filter = row[:, 'compass'].duplicated()
    if len(row[:, 'compass'][match_filter]) > 0:
        compass_match_indexes.append(index)
You can then select from your dataframe with that list, e.g. df.loc[compass_match_indexes]
--
Another approach: you could get the transpose of your DataFrame with df.T and then use the duplicated function.

Read in data and set it to the index of a DataFrame with Pandas

I want to iterate through the rows of a DataFrame and assign values to a new DataFrame. I've accomplished that task indirectly like this:
# first I read the data from df1 and assign it to df2 if something happens
counter = 0                                    #line1
for index, row in df1.iterrows():              #line2
    value = row['df1_col']                     #line3
    value2 = row['df1_col2']                   #line4
    try:
        # unzip a file (pseudo code)
        df2.loc[counter, 'df2_col'] = value    #line5
        counter += 1                           #line6
    except Exception:
        print("Error, could not unzip {}")     #line7
# then I set the desired index for df2
df2 = df2.set_index(['df2_col'])               #line8
Is there a way to assign the values to the index of df2 directly in line5? Sorry my original question was unclear. I'm creating an index based on the something happening.
There are a bunch of ways to do this. According to your code, all you've done is create an empty df2 dataframe with an index of values from df1.df1_col. You could do this directly like this:
df2 = pd.DataFrame([], df1.df1_col)
#                  ^   ^
#                  |   |
#                  |   defines the index
#                  specifies no data, yet
If you are concerned about having to filter df1 then you can do:
# cond is some boolean mask representing a condition to filter on.
# I'll make one up for you.
cond = df1.df1_col > 10
df2 = pd.DataFrame([], df1.loc[cond, 'df1_col'])
No need to iterate, you can do:
df2.index = df1['df1_col']
If you really want to iterate, save it to a list and set the index.
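The list-based variant could look like the sketch below. The try_unzip function is a hypothetical stand-in for the unzip step in the question (it simply pretends that one file is corrupt), and the data in df1 is invented.

```python
import pandas as pd

# Invented data (assumption) using the column names from the question.
df1 = pd.DataFrame({"df1_col": ["a.zip", "b.zip", "c.zip"], "df1_col2": [1, 2, 3]})

def try_unzip(name):
    """Hypothetical stand-in for the unzip step; pretends b.zip is corrupt."""
    if name == "b.zip":
        raise ValueError("bad archive")

# Collect the values that should become index labels.
kept = []
for _, row in df1.iterrows():
    try:
        try_unzip(row["df1_col"])
    except ValueError:
        print("Error, could not unzip {}".format(row["df1_col"]))
        continue
    kept.append(row["df1_col"])

# Build df2 once, with the collected values as its index.
df2 = pd.DataFrame(index=pd.Index(kept, name="df2_col"))
print(df2.index)
```

Building the index once at the end avoids growing df2 row by row with .loc, which tends to be slow for large frames.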
