Adding elements to an empty dataframe in pandas - python

I am new to Python and have a basic question. I have an empty dataframe Resulttable with columns A, B, and C, which I want to keep filling with the results of some calculations that I run in a loop, indexed by n. For example, I want to store the value 12 in the nth row of column A, 35 in the nth row of column B, and so on for the whole range of n.
I have tried something like
Resulttable['A'].iloc[n] = 12
Resulttable['B'].iloc[n] = 35
I get the error "single positional indexer is out-of-bounds" for the first value of n, n=0.
How do I resolve this? Thanks!

You can first create an empty pandas dataframe and then append rows one by one as you calculate. Note that range stops one below its end value, so you need range(0, 13) if you want to iterate over 0-12.
import pandas as pd

df = pd.DataFrame([], columns=["A", "B", "C"])
for i in range(0, 13):
    x = i**1
    y = i**2
    z = i**3
    df_tmp = pd.DataFrame([(x, y, z)], columns=["A", "B", "C"])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    df = pd.concat([df, df_tmp])
df = df.reset_index()
This will result in a DataFrame as follows:
df.head()
index A B C
0 0 0 0 0
1 0 1 1 1
2 0 2 4 8
3 0 3 9 27
4 0 4 16 64
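Appending to a DataFrame inside the loop copies the frame on every iteration, so for larger ranges it is usually faster to collect plain rows in a list and build the frame once at the end. A minimal sketch of that pattern, using the same toy calculation:

```python
import pandas as pd

rows = []
for i in range(13):  # 0..12 inclusive
    rows.append({"A": i, "B": i**2, "C": i**3})

# Build the DataFrame in one step instead of appending inside the loop.
df = pd.DataFrame(rows, columns=["A", "B", "C"])
```

This grows only a Python list in the loop, which is cheap, and pays the DataFrame construction cost once.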

There is no way of filling an empty dataframe like that. Since there are no entries in your dataframe something like
Resulttable['A'].iloc[n]
will always result in the IndexError you described.
Instead of trying to fill the dataframe in place, you should store the results from your loop in a list, say result_list. Then you can create a dataframe from the list like this:
Resulttable = pd.DataFrame({"A": result_list})
If you have another list of results you want to store in another column of your dataframe, say result_list2, you can create the dataframe like this:
Resulttable = pd.DataFrame({"A": result_list, "B": result_list2})
If Resulttable has already been created, you can add column B with
Resulttable["B"] = result_list2
I hope this helps.
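Putting the pieces together, an end-to-end sketch of this approach (the calculations inside the loop are just placeholders for whatever you compute per n):

```python
import pandas as pd

result_list = []
result_list2 = []
for n in range(5):
    result_list.append(n * 12)   # placeholder calculation for column A
    result_list2.append(n * 35)  # placeholder calculation for column B

# One DataFrame built at the end; row n holds the results of iteration n.
Resulttable = pd.DataFrame({"A": result_list, "B": result_list2})
```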

Related

Python - list 2 columns from excel file as list but alternate between values in columns

I have an excel file and have 2 columns that I would like to convert to a list. I want to alternate between the columns, so it would be like this:
"F1,O1,F2,O2,F3,O3....."
What's the best way to do this using Python? I have read the columns into a df, but when I ask for a list it just shows the headers.
# read profiles Excel spreadsheet, columns F and O, and drop rows where all values are NaN
import pandas
path=r"Profiles.xlsx"
df = pandas.read_excel(path,usecols="F,O").dropna(how="all")
print(list(df))
This just shows "Header F, Header O"
Probably very simple, but I'm new to Python and learning :)
This can be done with a list comprehension, as mentioned in another answer.
A simple for loop can also do the trick that you can understand easily.
Here is a sample on how that can be done:
import pandas as pd

a = list(range(1, 10, 2))
b = list(range(2, 11, 2))
df = pd.DataFrame(columns=['A', 'B'])
df['A'] = a
df['B'] = b
df
The sample dataframe created would be something like this:
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
This is the for loop you want to run to get the desired result:
res = []
for i, r in df.iterrows():
    res.append(r['A'])
    res.append(r['B'])
print(res)
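For reference, the list-comprehension version mentioned above can interleave the two columns without iterrows, by zipping them and flattening the pairs:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5, 7, 9], "B": [2, 4, 6, 8, 10]})

# zip pairs up the two columns row by row; the comprehension flattens the pairs.
res = [value for pair in zip(df["A"], df["B"]) for value in pair]
print(res)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

This avoids the per-row overhead of iterrows, which matters on larger frames.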

How unique is each row based on 3-4 columns?

I am going to merge two datasets soon by 3 columns.
The hope is that there are no (or few) repeated 3-column groups in the original dataset. I would like to produce something that says approximately how unique each row is, like a frequency plot (which might not work since I have a very large dataset), or maybe a table that displays the average frequency for each 0.5 million rows, or something like that.
Is there a way to determine how unique each row is compared to the other rows?
1 2 3
A 100 B
A 200 B
A 200 B
Like for the above data frame, I would like to say that each row is unique
1 2 3
A 200 B
A 200 B
A 100 B
For this data set, rows 1 and 2 are not unique. I don't want to drop one, but I am hoping to quantify/weight the amount of non-unique rows.
The problem is my dataframe is 14,000,000 lines long, so I need to think of a way I can show how unique each row is on a set this big.
Assuming you are using pandas, here's one possible way:
import pandas as pd

# Setup, which you can probably skip since you already have the data.
cols = ["1", "2", "3"]
rows = [
    ["A", 200, "B"],
    ["A", 200, "B"],
    ["A", 100, "B"],
]
df1 = pd.DataFrame(rows, columns=cols)
# Save the key column names before adding any new columns.
key_columns = df1.columns.values.tolist()
# Add a line column.
df1["line"] = 1
# Set the new column to the cumulative sum of line values within each group.
df1["match_count"] = df1.groupby(key_columns)["line"].cumsum()
# Drop the line column.
df1.drop("line", axis=1, inplace=True)
# Print results.
print(df1)
Output -
1 2 3 match_count
0 A 200 B 1
1 A 200 B 2
2 A 100 B 1
Return only unique rows:
# We only want results where the count is less than 2,
# because we have our key columns saved, we can just return those
# and not worry about 'match_count'
df_unique = df1.loc[df1["match_count"] < 2, key_columns]
print(df_unique)
Output -
1 2 3
0 A 200 B
2 A 100 B
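On 14 million rows, a fully vectorized count may be faster than a per-group cumulative sum. One option, sketched on the same toy data, is groupby(...).transform("size"), which gives every row the total size of its 3-column key group in a single pass:

```python
import pandas as pd

df1 = pd.DataFrame([["A", 200, "B"], ["A", 200, "B"], ["A", 100, "B"]],
                   columns=["1", "2", "3"])
key_columns = ["1", "2", "3"]

# Each row gets the total number of rows sharing its 3-column key.
df1["group_size"] = df1.groupby(key_columns)["1"].transform("size")

# Rows whose key appears exactly once are unique.
df_unique = df1.loc[df1["group_size"] == 1, key_columns]
```

The group_size column also quantifies how non-unique each row is directly, e.g. via df1["group_size"].value_counts() or a histogram.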

Filtering row data based off column values in python pandas

So I am able to get what I want if I only filter by one item, but can't figure out how to filter by two items.
Basically I have a data set with a potential of unlimited rows but have 26 columns. I want to filter row data based on column data on columns A and B but only want the data in C and D to be returned only If A AND B match the values passed into the function. A and B values will be different but specified by being passed into the function.
It seems simple to me but when I try to run the second filter on the first filtered df my returned df is empty.
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4], "B":[7,6,5,4], "C":[9,8,7,6], "D":[0,1,0,1]})
>>> df = df.loc[(df.A>1) & (df.B>4), ["C", "D"]]
>>> print(df)
C D
1 8 1
2 7 0
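Since the question mentions passing the A and B values into a function, the same mask can be wrapped up; the function name and the greater-than comparisons here are assumptions about the real use case:

```python
import pandas as pd

def filter_cd(df, a_value, b_value):
    """Return columns C and D for rows where A > a_value and B > b_value."""
    return df.loc[(df.A > a_value) & (df.B > b_value), ["C", "D"]]

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [7, 6, 5, 4],
                   "C": [9, 8, 7, 6], "D": [0, 1, 0, 1]})
result = filter_cd(df, 1, 4)
```

Combining both conditions in one boolean mask avoids the empty-result pitfall of filtering twice in sequence.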

Add array of new columns to Pandas dataframe

How do I append a list of integers as new columns to each row in a dataframe in Pandas?
I have a dataframe which I need to append a 20 column sequence of integers as new columns. The use case is that I'm translating natural text in a cell of the row into a sequence of vectors for some NLP with Tensorflow.
But to illustrate, I create a simple data frame to append:
df = pd.DataFrame([(1, 2, 3),(11, 12, 13)])
df.head()
Which generates a simple two-row frame with integer columns 0, 1, 2 and rows (1, 2, 3) and (11, 12, 13).
And then, for each row, I need to apply a function that takes the value in column '2' and returns an array of integers that need to be appended as new columns in the data frame - not as an array in a single cell:
def foo(x):
    return [x+1, x+2, x+3]
Ideally, to run a function like:
df[3, 4, 5] = df['2'].applyAsColumns(foo)
The only solution I can think of is to create the data frame with 3 blank columns [3, 4, 5], and then use a for loop to iterate through the rows and fill those columns in.
Is this the best way to do it, or is there any functions built into Pandas that would do this? I've tried checking the documentation, but haven't found anything.
Any help is appreciated!
IIUC,
import pandas as pd

def foo(x):
    return pd.Series([x+1, x+2, x+3])

df = pd.DataFrame([(1, 2, 3), (11, 12, 13)])
df[[3, 4, 5]] = df[2].apply(foo)
df
df
Output:
0 1 2 3 4 5
0 1 2 3 4 5 6
1 11 12 13 14 15 16
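If foo keeps returning a plain list, an equivalent sketch builds the new columns from the stacked results and joins them on, which avoids creating a Series per row:

```python
import pandas as pd

def foo(x):
    return [x + 1, x + 2, x + 3]

df = pd.DataFrame([(1, 2, 3), (11, 12, 13)])

# Expand the per-row lists into a frame with columns 3, 4, 5, aligned on index.
new_cols = pd.DataFrame(df[2].apply(foo).tolist(), index=df.index,
                        columns=[3, 4, 5])
df = df.join(new_cols)
```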

Adding column of random floats to data frame, but with equal values for equal data frame entries

I have a column of integers, some are unique and some are the same. I want to add a column of random floats between 0 and 1 per row, but I want all of the floats to be the same per integer.
The code I'm providing shows a column of ints and a second column of random floats, but I need the floats for the same ints, like 1, 1, and 1, or 6 and 6, to all be the same, while still having whatever the float assigned to that int randomly generated. The ints I'm working with, however, are 8 digits, and the data set I am using is about 500,000 lines, so I am trying to be as efficient as possible.
I've created a working solution that iterates through the data frame after it has been created, but generating the random column and then iterating through to check for matching ints takes a long time. I wasn't sure if there was a more efficient method.
import numpy as np
import pandas as pd
col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
col2 = np.random.uniform(0,1,12)
data = np.array([col1, col2])
df1 = pd.DataFrame(data=data)
df1 = df1.transpose()
Use transform after a groupby:
import numpy as np
import pandas as pd

col1 = [1, 1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 7]
df = pd.DataFrame(col1, columns=['Col1'])
df['Col2'] = df.groupby('Col1')['Col1'].transform(lambda x: np.random.rand())
Result:
Col1 Col2
0 1 0.304472
1 1 0.304472
2 1 0.304472
3 2 0.883114
4 3 0.381417
5 3 0.381417
6 3 0.381417
7 4 0.668433
8 5 0.365895
9 6 0.484803
10 6 0.484803
11 7 0.403913
This takes about 200 ms for 600K rows on my old laptop computer.
This isn't totally iteration-free, but you're still only iterating over groups rather than every single row, so it's a touch better:
col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
col2 = np.random.uniform(0,1,len(set(col1)))
data = np.array([col1])
df1 = pd.DataFrame(data=data)
df1 = df1.transpose()
df2 = df1.groupby(0)
counter = 0
final_df = pd.DataFrame(columns=[0,1])
for key, item in df2:
    temp_df = df2.get_group(key)
    temp_df[1] = [col2[counter]] * df2.get_group(key).shape[0]
    counter += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    final_df = pd.concat([final_df, temp_df])
final_df should be the result you're looking for.
Create a dictionary with random floats for each integer key, and then map Column 2 to the dictionary.
For integers already in Column1, start by making the dictionary:
myInts = df.Column1.unique().tolist()
myFloats = [random.uniform(0, 1) for i in range(len(myInts))]
myDictionary = dict(zip(myInts, myFloats))
This will give you:
{0: 0.7361124230574458,
1: 0.8039650720388128,
2: 0.7474880952026456,
3: 0.06792890878546265,
4: 0.4765215518349696,
5: 0.8058550699163101,
6: 0.8865969467094966,
7: 0.251791893958454,
8: 0.42261798056239686,
9: 0.03972320851777933,
....
}
Then map the dictionary keys to Column 1 so that each identical integer gets the same float. Something like:
df.Column2 = df.Column1.map(myDictionary)
More info on how to map a series to a dictionary is here:
Using if/else in pandas series to create new series based on conditions
In this way you can get the desired results without rearranging your dataframe or iterating through it.
Cheers!
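Assembled into one runnable sketch (column names Column1/Column2 as in the answer above; the fixed seed is only there to make the example reproducible):

```python
import random
import pandas as pd

random.seed(0)  # only for reproducibility of the example

df = pd.DataFrame({"Column1": [1, 1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 7]})

# One random float per distinct integer...
myInts = df.Column1.unique().tolist()
myFloats = [random.uniform(0, 1) for _ in range(len(myInts))]
myDictionary = dict(zip(myInts, myFloats))

# ...mapped back so equal integers share the same float.
df["Column2"] = df.Column1.map(myDictionary)
```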
