How to (elegantly) add single values and rows to a DataFrame? - python

Imagine the following DataFrame.
import pandas as pd
animal_sizes = pd.DataFrame({"Animal": ["Horse", "Mouse"],
"Size": ["Large", "Small"]})
Animal  Size
Horse   Large
Mouse   Small
I want to add another row for "Dog". If I understand correctly, I have to first create another DataFrame and then concatenate the new and the existing DataFrame.
pd.concat([animal_sizes,
pd.DataFrame({"Animal": ["Dog"],
"Size": ["Medium"]})]
)
Animal  Size
Horse   Large
Mouse   Small
Dog     Medium
This doesn't seem terribly elegant. Is there a simpler way? I imagine something like animal_sizes.append_row(["Dog", "Medium"]).
Imagine I only want to add another value to the Animal column. (Perhaps I haven't measured the size yet.) Again, pd.concat with an explicit empty (or NaN) value for the Size column seems awkward:
pd.concat([animal_sizes,
pd.DataFrame({"Animal": ["Crow"], "Size": [""]})]
)
Animal  Size
Horse   Large
Mouse   Small
Crow
Is there a simpler solution? I'm looking for something like animal_sizes["Animal"].append_value("Crow").
I know about DataFrame.append (see this fine answer), but not only is it deprecated, it also expects you to name the column for each new row value. This makes it slightly unwieldy for my taste.
animal_sizes.append({"Animal": "Crow"}, ignore_index=True)
Are there more elegant solutions for this very simple problem?

I recommend defining an appropriate index (animals in this case) and using it to insert new rows by name. Use dictionaries to add incomplete rows.
import pandas as pd
animal_sizes = pd.DataFrame({"Animal": ["Horse", "Mouse"],
"Size": ["Large", "Small"],
"othercol": ["A", "B"]}
).set_index("Animal")
animal_sizes.loc["Dog"] = {"othercol": "C"}
animal_sizes.loc["Elephant"] = ["verylarge", "D"]
animal_sizes.loc["unspecifiedanimal"] = {}
print(animal_sizes)
# result:
                        Size othercol
Animal
Horse                  Large        A
Mouse                  Small        B
Dog                      NaN        C
Elephant           verylarge        D
unspecifiedanimal        NaN      NaN
Adding an existing animal replaces a row. This may or may not be intended behavior. If the goal is to blindly dump rows into the table while accepting duplicates, the best solution might still be concat.
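To make the difference concrete, here is a minimal sketch, assuming a single Size column and made-up values ("Tiny", the extra "Mouse" row). It only illustrates the overwrite-versus-duplicate behavior described above:

```python
import pandas as pd

animal_sizes = pd.DataFrame(
    {"Animal": ["Horse", "Mouse"], "Size": ["Large", "Small"]}
).set_index("Animal")

# Assigning to an existing label overwrites that row in place.
animal_sizes.loc["Mouse"] = ["Tiny"]

# Assigning to a label that does not exist yet adds a new row.
animal_sizes.loc["Dog"] = ["Medium"]

# concat does not look at labels, so a repeated animal is kept twice.
extra = pd.DataFrame({"Size": ["Small"]}, index=pd.Index(["Mouse"], name="Animal"))
with_duplicates = pd.concat([animal_sizes, extra])

print(animal_sizes)
print(with_duplicates)
```

If duplicates are unwanted, the .loc form is the safer default; if every measurement should be kept, concat preserves them all.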

Solution for a default RangeIndex, where new rows are always inserted at the end of the DataFrame:
Use DataFrame.loc with a list whose length matches the number of columns; the new index label is the current number of rows:
animal_sizes.loc[len(animal_sizes)] = ["Dog", "Medium"]
print(animal_sizes)
  Animal    Size
0  Horse   Large
1  Mouse   Small
2    Dog  Medium
If you also need to specify the column names:
animal_sizes.loc[len(animal_sizes)] = {"Animal": "Dog", "Size": "Medium"}
print(animal_sizes)
  Animal    Size
0  Horse   Large
1  Mouse   Small
2    Dog  Medium

You can add a single row to a Pandas DataFrame using the .loc indexing method:
animal_sizes.loc[len(animal_sizes)] = ["Dog", "Medium"]
To add a single value to the Animal column, you can build a one-row DataFrame that holds that value and concatenate it with the existing DataFrame:
animal_sizes['Size'] = animal_sizes['Size'].astype(str)
animal_sizes = pd.concat([animal_sizes, pd.DataFrame({"Animal": ["Crow"], "Size": [""]})], sort=False)
Note that the cast to str is not strictly needed here, because the Size column already holds strings; it only guarantees that the empty string sits alongside string values.
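The answers above can be wrapped into something close to the append_row the question asks for. This is only a sketch: append_row is a hypothetical helper, not a pandas API, and it assumes a default RangeIndex and column names that are valid Python identifiers.

```python
import pandas as pd


def append_row(df, **values):
    """Hypothetical helper (not a pandas API): add one row in place and return df."""
    unknown = set(values) - set(df.columns)
    if unknown:
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    # Columns that were not passed become NaN after reindexing.
    row = pd.Series(values, dtype=object).reindex(df.columns)
    # Assumes a default RangeIndex, so len(df) is the next free label.
    df.loc[len(df)] = row
    return df


animal_sizes = pd.DataFrame({"Animal": ["Horse", "Mouse"],
                             "Size": ["Large", "Small"]})
append_row(animal_sizes, Animal="Dog", Size="Medium")
append_row(animal_sizes, Animal="Crow")  # Size is left as NaN
print(animal_sizes)
```

Leaving the missing value as NaN instead of an empty string keeps it detectable later with isna().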

Related

Keep rows according to condition in Pandas

I am looking for code to find rows that match a condition and keep those rows.
In the image example, I wish to keep all the apples with amt1 >= 5 and amt2 < 5. I also want to keep the bananas with amt1 >= 1 and amt2 < 5 (highlighted red in image). There are many other fruits in the list that I have to filter for (maybe about 10 fruits).
image example
Currently, I am filtering it individually (i.e. creating a dataframe that filters out the red and small apples and another dataframe that filters out the green and big bananas, and using concat to join the dataframes together afterwards). However, this process takes a long time to run because the dataset is huge. I am looking for a faster way (like filtering within the dataframe itself without having to create new dataframes). I also have to use column indexes instead of column names, as the column names change according to the date.
Hopefully what I said makes sense. Would appreciate any help!
I am not quite sure I understand your requirements because I don't understand how the conditions for the rows to keep are formulated.
One thing you can use to combine multiple criteria for selecting data is the query method of the dataframe:
import pandas as pd
df = pd.DataFrame([
['Apple', 5, 1],
['Apple', 4, 2],
['Orange', 3, 3],
['Banana', 2, 4],
['Banana', 1, 5]],
columns=['Fruits', 'Amt1', 'Amt2'])
df.query('(Fruits == "Apple" & (Amt1 >= 5 & Amt2 < 5)) | (Fruits == "Banana" & (Amt1 >= 1 & Amt2 < 5))')
You might use filter combined with itertuples in the following way:
import pandas as pd
df = pd.DataFrame({"x":[1,2,3,4,5],"y":[10,20,30,40,50]})
def keep(row):
    # itertuples() puts the index first, so row[0] is Index, row[1] is x and row[2] is y
    return row[0] >= 2 and row[1] <= 40
df_filtered = pd.DataFrame(filter(keep,df.itertuples())).set_index("Index")
print(df_filtered)
gives output
       x   y
Index
2      3  30
3      4  40
4      5  50
Explanation: keep is a function that should return True for rows to keep and False for rows to discard. .itertuples() provides an iterable of tuples, which are fed to filter, which selects the records where keep evaluates to True; these selected rows are used to create a new DataFrame. After that I set the index so that Index corresponds to the original DataFrame. Depending on your use case you might elect not to set the index.
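Since the question mentions a huge dataset, about 10 fruits, and column positions instead of names, a vectorized mask may be worth trying as well. The sketch below is only one possible layout: the rules dictionary and its thresholds are assumptions made up for illustration, and the columns are addressed by position with iloc.

```python
import pandas as pd

df = pd.DataFrame(
    [["Apple", 5, 1], ["Apple", 4, 2], ["Orange", 3, 3],
     ["Banana", 2, 4], ["Banana", 1, 5]],
    columns=["Fruits", "Amt1", "Amt2"],
)

# Hypothetical rule table: fruit -> (minimum for the 2nd column,
# exclusive maximum for the 3rd column).
rules = {"Apple": (5, 5), "Banana": (1, 5)}

# Address columns by position, because the names change with the date.
fruit = df.iloc[:, 0]
amt1 = df.iloc[:, 1]
amt2 = df.iloc[:, 2]

# Look up each row's thresholds; fruits without a rule get NaN.
min_amt1 = fruit.map({name: limits[0] for name, limits in rules.items()})
max_amt2 = fruit.map({name: limits[1] for name, limits in rules.items()})

# Comparisons against NaN are False, so fruits without a rule are dropped.
mask = (amt1 >= min_amt1) & (amt2 < max_amt2)
kept = df[mask]
print(kept)
```

Adding another fruit then only means adding one entry to rules, and no intermediate dataframes or concat calls are needed.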

DataFrame is empty, expected data in it

I want to find duplicate items across 2 columns in Excel. So for example my Excel file consists of:
list_A list_B
0 ideal ideal
1 brown colour
2 blue blew
3 red red
I checked the pandas documentation and tried the duplicated method, but I simply don't know why it keeps saying "DataFrame is empty". It finds both columns and I guess it iterates over them, but why doesn't it find the values and compare them?
I also tried using iterrows but honestly don't know how to implement it.
When running the code I get this output:
Empty DataFrame
Columns: [list A, list B]
Index: []
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
dfObj = pd.DataFrame(pt)
doubles = dfObj[dfObj.duplicated()]
print(doubles)
The output I'm looking for is:
list_A list_B
0 ideal ideal
3 red red
Final solved code looks like this:
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
doubles = pt[pt['list_A'] == pt['list_B']]
print(doubles)
The term "duplicate" is usually used to mean rows that are exact duplicates of previous rows (see the documentation of pd.DataFrame.duplicated).
What you are looking for is just the rows where these two columns are equal. For that, you want:
doubles = pt[pt['list_A'] == pt['list_B']]
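To see the difference without the Excel file, here is a small sketch that rebuilds the sample data from the question in memory (the DataFrame constructor call stands in for read_excel):

```python
import pandas as pd

pt = pd.DataFrame({"list_A": ["ideal", "brown", "blue", "red"],
                   "list_B": ["ideal", "colour", "blew", "red"]})

# duplicated() compares whole rows against earlier rows; no row repeats here,
# so the selection is empty.
repeated_rows = pt[pt.duplicated()]

# Comparing the two columns element-wise finds rows whose values match.
doubles = pt[pt["list_A"] == pt["list_B"]]

print(repeated_rows)
print(doubles)
```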

PySpark - an efficient way to find DataFrame columns with more than 1 distinct value

I need an efficient way to list and drop unary columns in a Spark DataFrame (I use the PySpark API). I define a unary column as one which has at most one distinct value and for the purpose of the definition, I count null as a value as well. That means that a column with one distinct non-null value in some rows and null in other rows is not a unary column.
Based on the answers to this question I managed to write an efficient way to obtain a list of null columns (which are a subset of my unary columns) and drop them as follows:
counts = df.summary("count").collect()[0].asDict()
null_cols = [c for c in counts.keys() if counts[c] == '0']
df2 = df.drop(*null_cols)
Based on my very limited understanding of the inner workings of Spark this is fast because the method summary manipulates the entire data frame simultaneously (I have roughly 300 columns in my initial DataFrame). Unfortunately, I cannot find a similar way to deal with the second type of unary columns - ones which have no null values but are lit(something).
What I currently have is this (using the df2 I obtain from the code snippet above):
from pyspark.sql import functions as F
prox_counts = (
    df2.agg(*(F.approx_count_distinct(F.col(c)).alias(c) for c in df2.columns))
    .collect()[0]
    .asDict()
)
poss_unarcols = [k for k in prox_counts.keys() if prox_counts[k] < 3]
unar_cols = [c for c in poss_unarcols if df2.select(c).distinct().count() < 2]
Essentially, I first find columns which could be unary in a fast but approximate way and then look at the "candidates" in more detail and more slowly.
What I don't like about it is that a) even with the approximative pre-selection it is still fairly slow, taking over a minute to run even though at this point I only have roughly 70 columns (and about 6 million rows) and b) I use the approx_count_distinct with the magical constant 3 (approx_count_distinct does not count null, hence 3 instead of 2). Since I'm not exactly sure how the approx_count_distinct works internally I am a little worried that 3 is not a particularly good constant since the function might estimate the number of distinct (non-null) values as say 5 when it really is 1 and so maybe a higher constant is needed to guarantee nothing is missing in the candidate list poss_unarcols.
Is there a smarter way to do this, ideally so that I don't even have to drop the null columns separately and can do it all in one fell swoop (although that step is actually quite fast and so not that big an issue)?
I suggest that you have a look at the following function
pyspark.sql.functions.collect_set(col)
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe
It returns all the values in col with duplicate elements eliminated. Then you can check the length of the result (whether it equals one). I am not sure about performance, but I think it will definitely beat distinct().count(). Let's have a look on Monday :)
You can call df.na.fill("some non-existing value").summary() and then drop the relevant columns from the original dataframe.
So far the best solution I found is this (it is faster than the other proposed answers, although not ideal, see below):
from pyspark.sql import functions as F
rows = df.count()
nullcounts = df.summary("count").collect()[0].asDict()
del nullcounts['summary']
nullcounts = {key: (rows-int(value)) for (key, value) in nullcounts.items()}
# a list for columns with just null values
null_cols = []
# a list for columns with no null values
full_cols = []
for key, value in nullcounts.items():
    if value == rows:
        null_cols.append(key)
    elif value == 0:
        full_cols.append(key)
df = df.drop(*null_cols)
# only columns in full_cols can be unary
# all other remaining columns have at least 1 null and 1 non-null value
try:
    unarcounts = (df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in full_cols))
                  .collect()[0]
                  .asDict()
                  )
    unar_cols = [key for key in unarcounts.keys() if unarcounts[key] == 1]
except AssertionError:
    unar_cols = []
df = df.drop(*unar_cols)
This works reasonably fast, mostly because I don't have too many "full columns", i.e. columns which contain no null rows, and I only go through all rows of these, using the fast summary("count") method to classify as many columns as I can.
Going through all rows of a column seems incredibly wasteful to me, since once two distinct values are found, I don't really care what's in the rest of the column. I don't think this can be solved in pySpark though (but I am a beginner), this seems to require a UDF and pySpark UDFs are so slow that it is not likely to be faster than using countDistinct(). Still, as long as there are many columns with no null rows in a dataframe, this method will be pretty slow (and I am not sure how much one can trust approx_count_distinct() to differentiate between one or two distinct values in a column)
As far as I can say it beats the collect_set() approach and filling the null values is actually not necessary as I realized (see the comments in the code).
I tried your solution, and it was too slow in my situation, so I simply grabbed the first row of the data frame and checked for duplicates. This turned out to be far more performant. I'm sure there's a better way, but I don't know what it is!
from pyspark.sql import functions as sqlf
first_row = df.limit(1).collect()[0]
drop_cols = [
    key
    for key, value in df.select(
        [
            sqlf.count(
                sqlf.when(sqlf.col(column) != first_row[column], column)
            ).alias(column)
            for column in df.columns
        ]
    ).collect()[0].asDict().items()
    if value == 0
]
# drop_cols is already a flat list of names, so unpack it directly
df = df.drop(*drop_cols)

How do I get one column of an array into a new Array whilst applying a fancy indexing filter on it?

So basically I have an array that consists of 14 columns and 426 rows; every column represents one property of a dog and every row represents one dog. I want to know the average heart frequency of an ill dog. The 14th column indicates whether the dog is ill or not [0 = healthy, 1 = ill], and the 8th column is the heart frequency. My problem is that I don't know how to get the 8th column out of the whole array and apply the boolean filter to it.
I am pretty new to Python. As I mentioned above I think that I know what I have to do [Use a fancy indexing filter] but I don't know how I can do this. I tried doing it while still being in the original Array but that didn't work out, so I thought I need to get the Infos into another one and use the Boolean filter on that one.
EDIT: Ok, so here is the code that I got right now:
import numpy as np
def average_heart_rate_for_pathologic_group(D):
    a=np.array(D[:, 13]) #gets information, whether the dogs are sick or not
    b=np.array(D[:, 7]) #gets the heartfrequency
    R=(a >= 0) #gets all the values that are from sick dogs
    amhr = np.mean(R) #calculates the average heartfrequency
    return amhr
I think boolean indexing is the way forward.
The shortcut for this works like:
#Your data (as a NumPy array; a plain list of lists does not support this indexing):
data = np.array([[0,1,2,3,4,5,6,7,8...],[..]...])
#This indexing chooses the rows whose 14th column equals 1 (ill dogs) and then takes their
#8th column values (heart frequency). Any analysis can be done after this on the new variable
heart_frequency_ill = data[data[:,13] == 1,7]
Probably you'll have to actually copy the data from the original array into a new one with the selected data.
Could you please share a sample with let's say 3 or 4 rows of your data?
I will give it a try, though.
Let me build data with 4 columns here (but you could use 14 as in your problem)
data = [['c1a','c2a','c3a','c4a'], ['c1b','c2b','c3b','c4b']]
You could use numpy.array to get its nth column.
See how one can get the 2nd column (index 1, because indexing starts at 0):
import numpy as np
a = np.array(data)
a[:,1]
If you want to get the 8th column for all the dogs that are healthy, you can do it as follows:
# we use 7 for the column because the index starts at 0
# we use a boolean filter and fancy indexing to get the rows where the condition is true
# we use np.argwhere to get the indices where the condition is true
A[np.argwhere([A[:,13] == 0])[:,1],7]
If you also want to compute the mean:
A[np.argwhere([A[:,13] == 0])[:,1],7].mean()
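For the original goal (the average heart frequency of ill dogs), a plain boolean mask is probably enough. The sketch below uses a made-up 5-dog array instead of the real 426 rows; only the 8th and 14th columns are filled in, and all values are invented for illustration.

```python
import numpy as np

# Made-up sample: 5 dogs, 14 columns; only columns 8 and 14 carry data here.
D = np.zeros((5, 14))
D[:, 7] = [80, 120, 90, 110, 100]   # heart frequency (8th column, index 7)
D[:, 13] = [0, 1, 0, 1, 1]          # 0 = healthy, 1 = ill (14th column, index 13)

# Boolean mask with one entry per dog: True where the dog is ill.
is_ill = D[:, 13] == 1

# Select the heart frequency of the ill dogs only, then average it.
mean_ill = D[is_ill, 7].mean()
print(mean_ill)
```

Swapping `== 1` for `== 0` gives the same statistic for the healthy group.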

Pandas updating values in a column using a lookup dictionary

I have column in a Pandas dataframe that I want to use to lookup a value of cost in a lookup dictionary.
The idea is that I will update an existing column if the item is there and if not the column will be left blank.
All the methods and solutions I have seen so far seem to create a new column, such as apply and assign methods, but it is important that I preserve the existing data.
Here is my code:
lookupDict = {'Apple': 1, 'Orange': 2,'Kiwi': 3,'Lemon': 8}
df1 = pd.DataFrame({'Fruits':['Apple','Banana','Kiwi','Cheese'],
'Pieces':[6, 3, 5, 7],
'Cost':[88, 55, 65, 55]},)
What I want to achieve is lookup the items in the fruit column and if the item is there I want to update the cost column with the dictionary value multiplied by the number of pieces.
For example for Apple the cost is 1 from the lookup dictionary, and in the dataframe the number of pieces is 6, therefore the cost column will be updated from 88 to (6*1) = 6. The next item is banana which is not in the lookup dictionary, therefore the cost in the original dataframe will be left unchanged. The same logic will be applied to the rest of the items.
The only way I can think of achieving this is to separate the lists from the dataframe, iterate through them and then add them back into the dataframe when I'm finished. I am wondering if it would be possible to act on the values in the dataframe without using separate lists??
From other responses I imagine I have to use the loc indexer, such as in the following (but this is not working, and I don't want to create a new column):
df1.loc[df1.Fruits in lookupDict,'Cost'] = lookupDict[df1.Fruits] * lookupD[df1.Pieces]
I have also tried to map but it overwrites all the content of the existing column:
df1['Cost'] = df1['Fruits'].map(lookupDict)*df1['Pieces']
EDIT*******
I have been able to achieve it with the following using iteration, however I am still curious if there is a cleaner way to achieve this:
#Iteration method
# range replaces Python 2's xrange
for i,x in zip(df1['Fruits'],range(len(df1.index))):
    fruit = (df1.loc[x,'Fruits'])
    if fruit in lookupDict:
        newCost = lookupDict[fruit] * df1.loc[x,'Pieces']
        print(newCost)
        df1.loc[x,'Cost'] = newCost
If I understood correctly:
mask = df1['Fruits'].isin(lookupDict.keys())
df1.loc[mask, 'Cost'] = df1.loc[mask, 'Fruits'].map(lookupDict) * df1.loc[mask, 'Pieces']
Result:
In [29]: df1
Out[29]:
Cost Fruits Pieces
0 6 Apple 6
1 55 Banana 3
2 15 Kiwi 5
3 55 Cheese 7
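A variation that avoids the explicit mask, as a sketch using the same data as the question: map yields NaN for fruits that are not in the dictionary, and fillna puts the existing cost back in those rows. The final astype(int) is an assumption that costs should stay integers, since the intermediate NaN values turn the column into floats.

```python
import pandas as pd

lookupDict = {'Apple': 1, 'Orange': 2, 'Kiwi': 3, 'Lemon': 8}
df1 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Kiwi', 'Cheese'],
                    'Pieces': [6, 3, 5, 7],
                    'Cost': [88, 55, 65, 55]})

# NaN wherever the fruit has no entry in the lookup dictionary.
new_cost = df1['Fruits'].map(lookupDict) * df1['Pieces']

# Keep the old cost where no new one could be computed.
df1['Cost'] = new_cost.fillna(df1['Cost']).astype(int)
print(df1)
```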
