DataFrame is empty, expected data in it - python

I want to find duplicate items across 2 columns in Excel. For example, my Excel file consists of:
  list_A  list_B
0  ideal   ideal
1  brown  colour
2   blue    blew
3    red     red
I checked the pandas documentation and tried the duplicated method, but I simply don't know why it keeps saying "DataFrame is empty". It finds both columns, and I guess it iterates over them, but why doesn't it find the values and compare them?
I also tried using iterrows but honestly don't know how to implement it.
When running the code I get this output:
Empty DataFrame
Columns: [list_A, list_B]
Index: []
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
dfObj = pd.DataFrame(pt)
doubles = dfObj[dfObj.duplicated()]
print(doubles)
The output I'm looking for is:
  list_A list_B
0  ideal  ideal
3    red    red
Final solved code looks like this:
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
doubles = pt[pt['list_A'] == pt['list_B']]
print(doubles)

The term "duplicate" usually means rows that are exact duplicates of previous rows (see the documentation of pd.DataFrame.duplicated).
What you are looking for is just the rows where these two columns are equal. For that, you want:
doubles = pt[pt['list_A'] == pt['list_B']]
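For reference, here is a minimal sketch with the sample data from the question showing both behaviours side by side:
import pandas as pd

pt = pd.DataFrame({'list_A': ['ideal', 'brown', 'blue', 'red'],
                   'list_B': ['ideal', 'colour', 'blew', 'red']})

print(pt.duplicated())                    # all False: no row repeats an earlier row
print(pt[pt['list_A'] == pt['list_B']])   # rows 0 and 3, the desired output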

Related

Keep rows according to condition in Pandas

I am looking for code to find rows that match a condition and keep those rows.
In the image example, I wish to keep all the apples with amt1 >= 5 and amt2 < 5. I also want to keep the bananas with amt1 >= 1 and amt2 < 5 (highlighted red in the image). There are many other fruits in the list that I have to filter for (maybe about 10 fruits).
[image example]
Currently, I am filtering it individually (i.e. creating a dataframe that filters out the red and small apples, another dataframe that filters out the green and big bananas, and using concat to join the dataframes together afterwards). However, this process takes a long time to run because the dataset is huge. I am looking for a faster way, like filtering within the dataframe itself without having to create new dataframes. I also have to use column indices instead of column names, as the column names change according to the date.
Hopefully what I said makes sense. Would appreciate any help!
I am not quite sure I understand your requirements because I don't understand how the conditions for the rows to keep are formulated.
One thing you can use to combine multiple criteria for selecting data is the query method of the dataframe:
import pandas as pd

df = pd.DataFrame([['Apple', 5, 1],
                   ['Apple', 4, 2],
                   ['Orange', 3, 3],
                   ['Banana', 2, 4],
                   ['Banana', 1, 5]],
                  columns=['Fruits', 'Amt1', 'Amt2'])
df.query('(Fruits == "Apple" & (Amt1 >= 5 & Amt2 < 5)) | (Fruits == "Banana" & (Amt1 >= 1 & Amt2 < 5))')
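Since you mention the column names change with the date, one hedged workaround (assuming the column order is stable; the date-stamped headers below are made up) is to rename the columns by position before querying:
import pandas as pd

# Hypothetical date-stamped headers; only the positions are assumed stable
df = pd.DataFrame([['Apple', 5, 1], ['Banana', 1, 4], ['Orange', 3, 6]],
                  columns=['Fruits_2020-01-01', 'Amt1_2020-01-01', 'Amt2_2020-01-01'])
df.columns = ['Fruits', 'Amt1', 'Amt2']  # rename by position
print(df.query('(Fruits == "Apple" & Amt1 >= 5 & Amt2 < 5) | (Fruits == "Banana" & Amt1 >= 1 & Amt2 < 5)'))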
You might use filter combined with itertuples in the following way:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})

def keep(row):
    # row is a namedtuple from itertuples(); row[0] is the index, row[1] is x
    return row[0] >= 2 and row[1] <= 40

df_filtered = pd.DataFrame(filter(keep, df.itertuples())).set_index("Index")
print(df_filtered)
which gives the output:
       x   y
Index
2      3  30
3      4  40
4      5  50
Explanation: keep is a function which should return True for rows to keep and False for rows to discard. .itertuples() provides an iterable of namedtuples (whose first field is the index), which are fed to filter; filter selects the records for which keep evaluates to True, and those selected rows are used to create a new DataFrame. After that is done, I set the index so that it corresponds to the original DataFrame. Depending on your use case you might elect not to set the index.

Comparing two dataframes based off one column, with the equal values in different index positions

So I am having an issue trying to find a solution to a problem I seem to be coming across.
I am trying to compare two dataframes which are quite large, but for the case of my first issue, I have reduced this to a smaller sample size.
At the present time, I would simply like to print out the name of any player that is in both of these dataframes. In the future I will then be looping through the columns to compare the values and record the differences, but that is a future-me problem.
I have noticed that in other examples and solutions being shared, most people have the two values that they want to compare at the same index; however, I am not experienced enough with Pandas commands to know how to adapt these solutions.
import pandas as pd

df1 = pd.read_excel('Example players 2019.xlsx')
df2 = pd.read_excel('Example players 2018.xlsx')

header2019 = df1.iloc[0]
df1 = df1[1:]
df1.columns = header2019

header2018 = df2.iloc[0]
df2 = df2[1:]
df2.columns = header2018

print('df1')
print(df1)
print('df2')
print(df2)

columnLength2019 = df1.shape[1]
columnLength2018 = df2.shape[1]
rowLength2019 = df1.shape[0]
rowLength2018 = df2.shape[0]

for i in range(1, rowLength2019):
    for j in range(1, rowLength2018):
        if df1['Player'] == df2['Player']:
            print(df1['Player'])
[images: Example players 2019 and Example players 2018 spreadsheets]
You might want to merge the two dataframes on the player column, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html.
Example:
import pandas as pd

df_2018 = pd.DataFrame({'player': ['a', 'b', 'c'], 'team': ['x', 'y', 'z']})
df_2019 = pd.DataFrame({'player': ['b', 'c', 'd'], 'team': ['y', 'j', 'k']})
matched = df_2018.merge(df_2019,
                        on='player',
                        how='inner',
                        suffixes=['_2018', '_2019'])
print(matched)
Output:
  player team_2018 team_2019
0      b         y         y
1      c         z         j
To print out the matching players you can then do something like:
for player in matched['player']:
    print(player)
Having both years of data in the same DataFrame should also make it easier to compare them later.
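Continuing the sketch above, that later comparison becomes a simple boolean filter (team is of course a stand-in for your real columns):
# rows where the value changed between the two years
changed = matched[matched['team_2018'] != matched['team_2019']]
print(changed)  # player c: team z in 2018, j in 2019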
You can use isin to check if a value is in a series:
a = df1[df1.player.isin(df2.player)]
for player in a['player']:
    print(player)
Or you can use np.where with isin to check and print in a single line:
np.where(df1.player.isin(df2.player), df1.player + " is present", df1.player + " is NOT present").tolist()
You can use np.where to create a column in your dataframe as well:
df1['present'] = np.where(df1.player.isin(df2.player), "Present", "NOT present")
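For completeness, a minimal self-contained sketch of the isin/np.where approach (with made-up player lists) might look like this:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'player': ['a', 'b', 'c']})
df2 = pd.DataFrame({'player': ['b', 'c', 'd']})

# players from df1 that also appear in df2
print(df1[df1.player.isin(df2.player)])

# annotate every row of df1 instead of filtering it
df1['present'] = np.where(df1.player.isin(df2.player), 'Present', 'NOT present')
print(df1)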

How do I get one column of an array into a new Array whilst applying a fancy indexing filter on it?

So basically I have an array that consists of 14 columns and 426 rows. Every column represents one property of a dog and every row represents one dog. Now I want to know the average heart frequency of an ill dog. The 14th column indicates whether the dog is ill or not [0 = healthy, 1 = ill], and the 8th column is the heart frequency. My problem is that I don't know how to get the 8th column out of the whole array and use the boolean filter on it.
I am pretty new to Python. As I mentioned above, I think that I know what I have to do [use a fancy indexing filter] but I don't know how. I tried doing it while still being in the original array, but that didn't work out, so I thought I needed to get the information into another array and use the boolean filter on that one.
EDIT: Ok, so here is the code that I got right now:
import numpy as np

def average_heart_rate_for_pathologic_group(D):
    a = np.array(D[:, 13])  # gets information whether the dogs are sick or not
    b = np.array(D[:, 7])   # gets the heart frequency
    R = (a >= 0)            # gets all the values that are from sick dogs
    amhr = np.mean(R)       # calculates the average heart frequency
    return amhr
I think boolean indexing is the way forward.
The shortcuts for this work like:
#Your data (needs to be a NumPy array for this indexing):
data = [[0,1,2,3,4,5,6,7,8...],[..]...]
#This indexing chooses the rows whose 14th column (index 13) equals 1 (ill)
#and then takes their 8th-column (index 7) values. Any analysis can be done
#after this on the new variable
heart_frequency_ill = data[data[:, 13] == 1, 7]
Probably you'll have to actually copy the data from the original array into a new one with the selected data.
Could you please share a sample with, let's say, 3 or 4 rows of your data?
I will give it a try, though.
Let me build data with 4 columns here (but you could use 14 as in your problem)
data = [['c1a','c2a','c3a','c4a'], ['c1b','c2b','c3b','c4b']]
You could use numpy.array to get its nth column.
See how one can get the column at index 2:
import numpy as np
a = np.array(data)
a[:,2]
If you want to get the 8th column of all the dogs that are healthy, you can do it as follows:
# we use 7 for the column because indices start at 0
# we use fancy/boolean indexing to get the rows where the condition is true
# we use np.argwhere to get the indices where the condition is true
A[np.argwhere([A[:, 13] == 0])[:, 1], 7]
If you also want to compute the mean:
A[np.argwhere([A[:,13] == 0])[:,1],7].mean()
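Equivalently, a plain boolean mask does the same selection without argwhere; here is a minimal sketch with made-up numbers (assuming column 13 is the illness flag and column 7 the heart frequency, as in the question):
import numpy as np

# two hypothetical dogs: 14 columns, heart frequency at index 7, ill flag at index 13
A = np.array([[0]*7 + [120] + [0]*5 + [1],
              [0]*7 + [80]  + [0]*5 + [0]])
print(A[A[:, 13] == 1, 7].mean())  # 120.0 - mean heart frequency of the ill dogs
print(A[A[:, 13] == 0, 7].mean())  # 80.0  - mean for the healthy dogs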

Pandas - slice sections of dataframe into multiple dataframes

I have a Pandas dataframe with 3000+ rows that looks like this:
       t090:   c0S/m:     pr:      timeJ:  potemp090C:   sal00:  depSM:
407  19.3574  4.16649   1.836  189.617454      19.3571  30.3949   1.824
408  19.3519  4.47521   1.381  189.617512      19.3517  32.9250   1.372
409  19.3712  4.44736   0.710  189.617569      19.3711  32.6810   0.705
410  19.3602  4.26486   0.264  189.617627      19.3602  31.1949   0.262
411  19.3616  3.55025   0.084  189.617685      19.3616  25.4410   0.083
412  19.2559  0.13710   0.071  189.617743      19.2559   0.7783   0.071
413  19.2092  0.03000   0.068  189.617801      19.2092   0.1630   0.068
414  19.4396  0.00522   0.068  189.617859      19.4396   0.0321   0.068
What I want to do is: create individual dataframes from each portion of the dataframe in which the values in column 'c0S/m' exceed 0.1 (eg rows 407-412 in the example above).
So let's say that I have 7 sections in my 3000+ row dataframe in which a series of rows exceed 0.1 in the second column. My if/for/while statement will slice these sections and create 7 separate dataframes.
I tried researching the best I could but could not find a question that would address this problem. Any help is appreciated.
Thank you.
You can try this:
First add a column of 0 or 1 based on whether the value is greater than 1 or not:
import numpy as np
df['splitter'] = np.where(df['c0S/m:'] > 1, 1, 0)
Now group by the cumulative sum of this column's diff:
df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum()).apply(lambda x: [x.index.min(),x.index.max()])
You get the required blocks of indices
splitter
1    [407, 411]
2    [412, 414]
3    [415, 415]
Now you can create dataframes using loc
df.loc[407:411]
Note: I added a line to your sample df using:
df.loc[415] = [19.01, 5.005, 0.09, 189.62, 19.01, 0.026, 0.09]
to be able to test better, hence the split into 3 groups.
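To materialize the blocks as actual dataframes rather than index bounds, you could iterate the same grouping (a sketch reusing the splitter column from above):
groups = (df['splitter'].diff(1) != 0).astype('int').cumsum()
# keep only the blocks whose values exceed the threshold (splitter == 1)
block_dfs = [g for _, g in df.groupby(groups) if g['splitter'].iat[0] == 1]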
Here's another way.
sub_set = df[df['c0S/m:'] > 0.1]
last = None
for i in sub_set.index:
    if last is None:
        start = i
    else:
        if i - last > 1:
            print(start, last)
            start = i
    last = i
I think it works. (Instead of print(start, last) you could insert code to create the slices you wanted of the original data frame.)
Some neat tricks here that do an even better job.

Splitting strings in an array - python

I have a pandas dataframe with an array variable that is currently made up of two-part strings, as in the example below. The first part is a datetime and the second part is a price. Records in the dataframe have price_trend arrays of different lengths.
Id  Name   Color   price_trend
1   apple  red     '1420848000:1.25', '1440201600:1.35', '1443830400:1.52'
2   lemon  yellow  '1403740800:0.32', '1422057600:0.25'
I'd like to split each of the strings in the array into two parts around the colon (:). However, when I run the code below, all the values in price_trend are replaced with NaN:
df['price_trend'] = df['price_trend'].str.split(':')
I would like to keep the array in this dataframe, and not a create a new one.
df['price_trend'].apply(lambda x:[i.split(':') for i in x])
0    [['1420848000', '1.25'], ['1440201600', '1.35'], ['1443830400', '1.52']]
1    [['1403740800', '0.32'], ['1422057600', '0.25']]
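If you later need the parts as typed values, a hedged follow-up (assuming the first part is a Unix timestamp in seconds and the second a float price) could convert each pair:
import pandas as pd

trend = ['1420848000:1.25', '1440201600:1.35']
pairs = [(pd.to_datetime(int(t), unit='s'), float(p))
         for t, p in (s.split(':') for s in trend)]
print(pairs)  # [(Timestamp('2015-01-10 00:00:00'), 1.25), (Timestamp('2015-08-22 00:00:00'), 1.35)]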
I assume the code below should work for you:
>>> df = {}
>>> df['p'] = ['1420848000:1.25', '1440201600:1.35', '1443830400:1.52']
>>> df['p'] = [x.split(':') for x in df['p']]
>>> df
{'p': [['1420848000', '1.25'], ['1440201600', '1.35'], ['1443830400', '1.52']]}
