When I try to run the following code, the error below occurs.
ranges = []
a_values = []
b_values = []
for x in params:
    a = min(fifa[params][x])
    a = a - (a * .25)
    b = max(fifa[params][x])
    b = b + (b * .25)
    ranges.append((a, b))
for x in range(len(fifa['short_name'])):
    if fifa['short_name'][x] == 'Nunez':
        a_values = df.iloc[x].values.tolist()
Error Description
What does it mean? How do I solve this?
Thank you in advance
The problem is on this line:
if fifa['short_name'][x]=='Nunez':
fifa['short_name'] is a Series;
fifa['short_name'][x] tries to index that series with x;
your code doesn't show it, but the stack trace suggests x is some scalar type;
pandas tries to look up x in the index of fifa['short_name'], and it's not there, resulting in the error.
Since the Series will share the index of the dataframe fifa, this means that the index x isn't in the dataframe. And it probably isn't, because you let x range from 0 up to (but not including) len(fifa).
What is the index of your dataframe? You didn't include the definition of params, nor that of fifa, but your problem is most likely in the latter, or you should loop over the dataframe differently, by looping over its index instead of just integers.
However, there are generally more efficient ways in pandas to do what you're trying to do - you should just include some definition of the dataframe so people can show you the correct one.
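As a rough sketch of both fixes, with a made-up fifa frame whose index is deliberately not 0..n-1 (the real frame and params weren't posted):

```python
import pandas as pd

# Hypothetical stand-in for fifa; the non-default index reproduces the KeyError risk.
fifa = pd.DataFrame({"short_name": ["Messi", "Nunez", "Salah"]}, index=[10, 20, 30])

# Loop over the actual index instead of range(len(...)):
for idx in fifa.index:
    if fifa["short_name"][idx] == "Nunez":
        a_values = fifa.loc[idx].values.tolist()

# Or skip the loop entirely with a boolean mask:
nunez_rows = fifa[fifa["short_name"] == "Nunez"]
```

The mask version is the idiomatic one: it never touches integer positions, so it works no matter what the index looks like.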
Related
I'm looking up a certain row in my Pandas DataFrame by using the index - the row information is stored in variable p. As you can see p gives me a normal Pandas DataFrame. Now I want to save just the integer in in_reply_to_status_id as variable y but, in my code below, it gives me an object. Does anyone know if and how it would be possible to just store the integer (1243885949697888263 in this case) as y?
y is a Series; you can do the following to pick its first value (1243885949697888263)
print(y.array[0])
Did you try this?
y = df.at[i, 'in_reply_to_status_id']
That way you don't have to create p.
If you want to do it with p:
y = p.iloc[0].at['in_reply_to_status_id']
Or:
y = p.iat[0,1]
Or:
y = p.at[i, 'in_reply_to_status_id']
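A tiny sketch (with a made-up frame; only the column name in_reply_to_status_id is taken from the question) showing that all of these accessors return a plain scalar rather than an object:

```python
import pandas as pd

# Toy stand-in for the tweets frame; column order matters for .iat below.
df = pd.DataFrame(
    {"id": [111, 222], "in_reply_to_status_id": [1243885949697888263, 555]}
)
i = 0
p = df.loc[[i]]  # one-row DataFrame, like the p in the question

y1 = df.at[i, "in_reply_to_status_id"]
y2 = p.iloc[0].at["in_reply_to_status_id"]
y3 = p.iat[0, 1]  # 1 = column position of in_reply_to_status_id
assert y1 == y2 == y3 == 1243885949697888263
```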
Below is the example to reproduce the error:
import pandas as pd

testx1df = pd.DataFrame()
testx1df['A'] = [100,200,300,400]
testx1df['B'] = [15,60,35,11]
testx1df['C'] = [11,45,22,9]
testx1df['D'] = [5,15,11,3]
testx1df['E'] = [1,6,4,0]
(testx1df[testx1df < 6].apply(lambda x: x.index.get_loc(x.first_valid_index(), method='ffill'), axis=1))
The desired output should be a list or array with the values [3, NaN, 4, 3] - NaN because no value in that row satisfies the criterion.
I checked the pandas references and they say that for cases where you do not have an exact match you can set "method" to 'ffill', 'bfill', or 'nearest' to pick the previous, next, or closest index. Based on this, if I indicated the method as 'ffill' it should give me an index of 4 instead of NaN. However, when I do so it does not work and I get the error shown in the question title. For criteria higher than 6 it works fine, but it doesn't for less than 6, because the second row of the data frame has no value that satisfies the criterion.
Is there a way around this issue? Should it not work for my example (returning the previous index, 3 or 4)?
One solution I thought of is to add a dummy column populated by zeros so that it has a place to "find" an index that satisfies the criterion, but this is a bit crude to me and I think there is a more efficient solution out there.
Please try this:
import numpy as np
ls = list(testx1df[testx1df<6].T.isna().sum())
ls = [np.nan if x==testx1df.shape[1] else x for x in ls]
print(ls)
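Note that, as far as I know, the method= argument of Index.get_loc was deprecated and later removed in pandas 2.0, so a version-independent sketch (same frame as above) is to locate the first surviving column per row yourself:

```python
import numpy as np
import pandas as pd

testx1df = pd.DataFrame({
    "A": [100, 200, 300, 400],
    "B": [15, 60, 35, 11],
    "C": [11, 45, 22, 9],
    "D": [5, 15, 11, 3],
    "E": [1, 6, 4, 0],
})

masked = testx1df[testx1df < 6]  # values >= 6 become NaN

def first_pos(row):
    # Position of the first column whose value survived the mask, NaN if none.
    label = row.first_valid_index()
    return np.nan if label is None else masked.columns.get_loc(label)

result = masked.apply(first_pos, axis=1).tolist()
print(result)  # [3.0, nan, 4.0, 3.0]
```

This keeps the NaN for the row with no match instead of raising, which matches the desired output in the question.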
I wrote a simple python script to concatenate values from the first row of one dataframe with all rows of another dataframe.
Snippet of one of the dataframes (the second one has identical number of columns):
ID Sequence
1 ATGCCCC
2 GCTCCAC
...
My code:
...
def mixer(x):
    for row in df1.iterrows():
        fdf["New_ID"] = df1.loc[x, "First_ID"] + df2["Second_ID"]
        fdf["Sequence"] = df1.loc[x, "Sequence"] + df2["Sequence"]
        print(fdf)

mixer(0)
mixer(1)
mixer(2)
...
Currently my first dataframe has only 8 rows but in the future I may have up to a 1000.
How can I avoid repeatedly calling the function for each value of the argument x (as you can see at the end of the code snippet)?
I tried using "range" and putting row numbers into a list/tuple and passing it through the function but neither worked.
Would be grateful for your help!
Why don't you try:
for x in range(num):
    mixer(x)
or you could try:
num = 0
while num < other_num:
    mixer(num)
    num += 1
But this seems too obvious judging by the complexity of your question, so in case this isn't what you were expecting, could you be more specific?
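If the goal is every combination of rows from the two frames, the per-row function can be dropped entirely. A vectorized sketch (needs pandas 1.2+ for how="cross"; the frame contents here are made up, only the column names are taken from the question):

```python
import pandas as pd

# Made-up stand-ins for df1 and df2.
df1 = pd.DataFrame({"First_ID": ["1", "2"], "Sequence": ["ATGCCCC", "GCTCCAC"]})
df2 = pd.DataFrame({"Second_ID": ["a", "b"], "Sequence": ["TTT", "GGG"]})

# merge(how="cross") pairs every row of df1 with every row of df2,
# so no per-row function calls or loops are needed.
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))
fdf = pd.DataFrame({
    "New_ID": pairs["First_ID"] + pairs["Second_ID"],
    "Sequence": pairs["Sequence_1"] + pairs["Sequence_2"],
})
print(fdf)
```

This scales to the 1000-row case without any change to the code.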
I have a dataframe containing two columns: one filled with a string (irrelevant), and the other one is (a reference to) a dataframe.
Now I want to only keep the rows, where the dataframes in the second column have entries aka len(df.index) > 0 (there should be rows left, I don't care about columns).
I know that sorting out rows like this works perfectly fine for me if I use it in a list comprehension and can do it on every entry by its own, like in the following example:
[do_x for a, inner_df
in zip(outer_df.index, outer_df["inner"])
if len(inner_df.index) > 0]
But if I try using it for conditional indexing to create a shorter version of the dataframe, it will produce the error KeyError: True.
I thought that putting len() around it could be a problem, so I also tried different approaches to check for zero rows. In the following I show 4 examples of how I tried it:
# a) with the length of the index
outer_df = outer_df.loc[len(outer_df["inner"].index) > 0, :]
# b) same, but with lambda just like in the panda docs user guide
# I used it on the other versions too, with no change in result
outer_df = outer_df.loc[lambda df: len(df["inner"]) > 0, :]
# c) switching
outer_df = outer_df.loc[outer_df["inner"].index.size > 0, :]
# d) even "shorter" version
outer_df = outer_df.loc[not outer_df["inner"].empty, :]
So... where is my error and can I even do it with conditional indexing or do I need to find another way?
Edit: Changed and added some sentences above for more clarity plus added all below.
I know that the filtering here works by creating a Series the same length as the dataframe, consisting of "True" and "False" after a comparison, and keeping only the rows that correspond to a "True".
However, I do not see a fundamental difference between my attempt to create such a list and the following examples (Source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/):
# 1. difference: the resulting Series is *not* altered
# it just gets compared directly with here the value 80
# -> I thought this might be the problem, but then there is also #2
df = df[df['Percentage'] > 80]
# or
df = df.loc[df['Percentage'] > 80]
# 2. Here the entry is checked in a similar way to my c and d
options = ['x', 'y']
df = df[df['Stream'].isin(options)]
# or
df = df.loc[df['Stream'].isin(options)]
In both, number two here and my versions c & d, the entry in the cell (string // dataframe) is checked for something (is part of list // is empty).
Not sure if I understand your question or where you are stuck. However, I will just write my comment in this answer so that I can easily edit the post.
First, let's try typing in myvar = df['Percentage'] > 80 and see what myvar is. See if the content of myvar makes sense to you.
There is really only one true rule of .loc[]: it takes a TRUTH TABLE (a boolean Series).
Because the df[stuff] expression so often appears inside .loc[df[stuff] expression], you might get the impression that df[stuff] expression has some special meaning. For example: df[df['Percentage'] > 80] is asking for any Percentage greater than 80, which looks quite intuitive! So... df['Percentage'] > 80 must be a "special syntax"? In reality, df['Percentage'] > 80 isn't anything special; it is just another truth table. The expression inside df[...] will always be a truth table, that's it.
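Applied to the question, that truth table has to be computed per cell, because len() of a whole column is a single number, not a Series. A small sketch with made-up inner frames (only the column name "inner" is taken from the question):

```python
import pandas as pd

# Toy outer frame whose "inner" column holds (references to) DataFrames.
outer_df = pd.DataFrame({"name": ["a", "b", "c"]})
outer_df["inner"] = pd.Series(
    [pd.DataFrame({"x": [1]}), pd.DataFrame(), pd.DataFrame({"x": [1, 2]})],
    dtype=object,
)

# Build a truth table the same length as outer_df, one bool per inner frame:
mask = outer_df["inner"].apply(lambda d: len(d.index) > 0)
filtered = outer_df.loc[mask]
print(filtered["name"].tolist())  # ['a', 'c']
```

The KeyError: True in the question comes from passing a single Python bool to .loc[], which then looks up True as an index label; the per-cell apply above produces a proper boolean Series instead.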
I am pretty new to data science. I am trying to deal with DataFrame data inside a list. I have read almost every post about "string indices must be integers", but it did not help at all.
My DataFrame looks like this:
And the my list look like this
myList -> [0098b710-3259-4794-9075-3c83fc1ba058 1.561642e+09 32.775882 39.897459],
[0098b710-3259-4794-9075-3c83fc1ba057 1.561642e+09 32.775882 39.897459],
and goes on...
This is the Data in case you need to reproduce something guys.
I need to access the list items (dataframes) one by one, then I need to split a dataframe wherever the difference between two consecutive timestamps is greater than 60000.
I wrote this code, but it gives an error whenever I try to access the timestamp. Can you help with the problem?
My code:
import numpy as np

a = []
for i in range(0, len(data_one_user)):
    x = data_one_user[i]
    x['label'] = (x['timestamp'] - x['timestamp'].shift(1))
    x['trip'] = np.where(x['label'] > 60000, True, False)
    x = x.drop('label', axis=1)
    x['trip'] = np.where(x['trip'] == True, a.append(x), a.extend(x))
    #a = a.drop('trip', axis=1)
x = a
Edit: If you wonder the object types
data_one_user -> list
data_one_user[0] = x -> pandas.core.frame.DataFrame
data_one_user[0]['timestamp'] = x['timestamp'] -> pandas.core.series.Series
Edit2: I added the error print out
Edit3: Output of x
I found the problem that causes the error. At the end of the list, labels are repeated.
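For the splitting itself (start a new piece whenever consecutive timestamps differ by more than 60000), a sketch on a made-up frame using diff and cumsum, with no np.where needed (only the column name "timestamp" is taken from the question):

```python
import pandas as pd

# Hypothetical single-user frame with two gaps larger than 60000.
x = pd.DataFrame({
    "timestamp": [0, 1000, 2000, 90000, 91000, 200000],
    "lat": [1, 2, 3, 4, 5, 6],
})

# A gap larger than 60000 starts a new trip; cumsum turns the gap flags
# into a running trip id, and groupby splits the frame on it.
trip_id = (x["timestamp"].diff() > 60000).cumsum()
trips = [g for _, g in x.groupby(trip_id)]
print([len(t) for t in trips])  # [3, 2, 1]
```

Each element of trips is itself a DataFrame, so the result is the list of split pieces the question asks for.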