I have a dataframe where each series is filled with 0s and 1s, as follows:
import pandas as pd

flagdf = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 0, 1], 'b': [0, 0, 1, 0, 1, 0, 1, 0]})
Now, depending on some analysis I have done, I need to change some 0s to 1s. So the final dataframe will be:
final=pd.DataFrame({'a':[1,1,1,0,1,1,1,1], 'b':[1,1,1,0,1,1,1,1]})
The results of the analysis which shows which 0s have to be changed are stored in a second dataframe built with a multi-index:
     first  last
a 1      1     1
  5      5     6
b 0      0     1
  5      5     5
  7      7     7
For each 'a' and 'b' I have the first and the last indexes of the 0s I need to change.
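For reference, continuing from the flagdf snippet above, the ranges frame shown could be built along these lines (a sketch; the actual construction in the analysis step may of course differ):

tochange = pd.DataFrame(
    {'first': [1, 5, 0, 5, 7],
     'last':  [1, 6, 1, 5, 7]},
    index=pd.MultiIndex.from_tuples(
        [('a', 1), ('a', 5), ('b', 0), ('b', 5), ('b', 7)]),
    columns=['first', 'last'],
)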
First question: the second level of the multi-index in this dataframe is equal to the 'first' series. I initially tried to use the index directly, but I found it easier to handle two series rather than an index and a series. Am I missing something?
Here is the code to do the job:
def change_one_value_one_column(flagdf, col_name, event):
    flagdf[col_name].iloc[event] = 1

def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]
    tomod = col_tochange[['first', 'last']].values
    iter_tomod = [xrange(el[0], el[1] + 1) for el in tomod]
    [change_one_value_one_column(flagdf, col_name, event)
     for iterel in iter_tomod for event in iterel]

[change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns]
Second question: I genuinely think that a list comprehension is always good, but in cases like this, when I write a function specifically for use in a list comprehension, I have some doubts. Is it truly the best thing to do?
Third question: I think the code is quite pythonic, but I am not proud of the last list comprehension, which runs over the series of the dataframe: using the apply method would look better to my eyes (but I'm not sure how to do it). Nonetheless, is there any real reason (apart from elegance) why I should work on changing it?
To answer the part about exhausting an iterator, I think you have a few pythonic choices (all of which I prefer over a list comprehension):
# the easiest, and most readable
for col_name in flagdf.columns:
    change_val_column(col_name, tochange, flagdf)

# consume/exhaust an iterator using built-in any (assuming each call returns None)
any(change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns)

# use itertools' consume recipe
consume(change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns)
See the consume recipe from the itertools documentation.
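For reference, the consume recipe from the itertools docs looks like this:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)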
However, when doing this kind of thing in numpy/pandas, you should be asking yourself "can I vectorize / use indexing here?". If you can, your code will usually be both faster and more readable.
I think in this case you'll be able to remove one level of loops by doing something like:
def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]  # Note: you're accessing the index, not a column, here??
    tomod = col_tochange[['first', 'last']].values
    for i, j in tomod:
        flagdf.loc[i:j, col_name] = 1
You may even be able to remove the for loop, but it's not obvious how / what the intention is here...
If I'm staying in Python and iterating over rows, I prefer using zip/izip as a first pass.
from itertools import izip  # Python 2; on Python 3 the built-in zip does the same job

for col, start, end in izip(tochange.index.get_level_values(0),
                            tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1
Simple and fast.
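Putting it together with the flagdf and tochange frames from the question (Python 3 shown, so the built-in zip plays the role of izip), this is roughly:

for col, start, end in zip(tochange.index.get_level_values(0),
                           tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1  # .loc label slicing includes 'end'

print(flagdf)
#    a  b
# 0  1  1
# 1  1  1
# 2  1  1
# 3  0  0
# 4  1  1
# 5  1  1
# 6  1  1
# 7  1  1

which matches the 'final' frame from the question.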
Is there any way to construct a list of dataframes from a one-dimensional dataframe or list? I thought apply would do the trick, but it seems that is not the case. The job can be done easily with a for loop, but I wish to avoid that. More details below.
This is the code I tried, but it doesn't work:
pd.DataFrame([1,2,3,4,5]).apply(lambda x: pd.DataFrame([x]))
This is the code that does the trick, but a for loop is what I wish to avoid at all costs; do run it so you can see what I am actually trying to achieve:
list = [1, 2, 3, 4, 5]
j = []
for i in list:
    i = pd.DataFrame([i])
    j = j + [i]
In the project I am working on, what I want to do is much more complex than turning an element into a 1x1 dataframe: each element would be transformed into a large dataframe, and eventually each of the generated dataframes would be put into a list. The only bottleneck is exactly the issue I described.
Thanks in advance.
You can simplify and speed up your loop by using a list comprehension, which removes some of the overhead of an explicit for loop.
Note: I renamed your list to lst, since list is the name of a built-in type in Python; don't shadow it with a variable name.
dfs = [pd.DataFrame([x]) for x in lst]
Now we can access each dataframe:
print(dfs[0])
   0
0  1

print(dfs[1])
   0
0  2
When I run the following code, it prints 1 twice. However, I expected only one 1 rather than two.
for i in (1, 1):
    print(i)
Output
1
1
You are iterating over a tuple which contains two elements with value 1 so it prints 1 twice. Your code is equivalent to:
list = [1, 1]
for item in list:
    print(item)
If you want to loop over a range of numbers:
for i in range(1, 2):
    print(i)
Or, if you want to print only the unique numbers or values in a list or tuple, convert it to a set; this automatically removes the duplicates:
newList = set(list)
for value in newList:
    print(value)
Sets and tuples are different. I suspect you are confusing them. On a set:
for i in {1, 1}:
    print(i)

1
On a tuple:
for i in (1, 1):
    print(i)

1
1
Think of sets as being like sets in math, and tuples as being more like sequences - you can have redundancies in a sequence, but not in a set.
After reading #KeshavGarg's answer, I suspect you thought that (a,b) in Python would mean stuff in a through b. As you're probably aware by now, this is not the case - you need range to get that. Interestingly (and I admit tangentially), the syntax we're discussing here varies by language. In MATLAB, the range syntax looks a lot more like what I assume you thought the Python range syntax was:
>> for i=1:4
disp(i)
end
There has been some discussion of implementing range literals (à la MATLAB) in Python, which introduces a variety of interesting new problems of its own.
For loops in Python always run over all elements of the iterable (barring break, exceptions, etc.). What probably has you confused is the range syntax: range(1, 1) creates an empty range object, while range(1, 2) creates one with a single element. It is the range function, not the for loop, that is exclusive, in the sense of not including the stop argument in the resulting range object.
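A quick interactive check makes the difference clear:

>>> list(range(1, 1))   # stop is excluded, so this is empty
[]
>>> list(range(1, 2))   # a single element: 1
[1]
>>> list((1, 1))        # a tuple literal keeps both elements
[1, 1]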
I'm trying to get the difference in time between the last two times a person has applied for our service. My solution works, but it's ugly.
Is there a more pythonic way of accomplishing this?
for customer in previous_apps:
    app_times = df.ix[df['customer_id'] == customer, 'hit_datetime']
    days_since_last_app = [(b - a).days for a, b in zip(app_times, app_times[1:])][-1:][0]
    df.ix[df['customer_id'] == customer, 'days_since_last_app'] = days_since_last_app
Having a list comprehension calculate all the differences between application dates, then slicing the result with [-1:] to get a list containing only the last element, and finally extracting that element with [0], is completely unnecessary.
You can just take the last application date, app_times[-1], and the second-to-last, app_times[-2], and compute the difference:
days_since_last_app = (app_times[-1] - app_times[-2]).days
This will fail if there are fewer than 2 entries in the list, so you probably want a special case for that.
(I'm guessing that line evolved into what it is by trying to resolve IndexErrors that were the result of not having previous entries.)
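A minimal sketch of that special case (assuming app_times is already sorted and supports negative indexing, as in the snippet above) might be:

if len(app_times) >= 2:
    days_since_last_app = (app_times[-1] - app_times[-2]).days
else:
    days_since_last_app = None  # or 0 / NaN, whatever downstream code expects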
Start by defining a two-argument function that calculates the time difference for you, e.g. time_diff(a, b). Use it something like this:
df["last_visit"] = df.groupby("customer_id").apply(
lambda x: x.apply(time_diff(*x["hit_datetime"][-2:]))
(Assuming the values in hit_datetime are sorted, which your code implies they are.)
The above "broadcasts" the last_visit values, since multiple records have the same customer_id. If you prefer you can just store the result as a Series with one row per customer:
last_visit = df.groupby("customer_id").apply(
    lambda x: time_diff(*x["hit_datetime"].iloc[-2:]))
I'm not sure I precisely understand how your data is structured, but the following should provide the functionality you require:
df.sort_values(['customer_id', 'hit_datetime'], ascending=True, inplace=True)
df['days_since_last_app'] = (
    df.groupby('customer_id')['hit_datetime']
      .transform(lambda y: y.diff().apply(lambda x: 0 if x != x else x.days))
)
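If it helps to see it run, here is a small example on made-up data (the customer IDs and dates are purely illustrative):

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'hit_datetime': pd.to_datetime(['2016-01-01', '2016-01-05', '2016-01-20',
                                    '2016-02-01', '2016-02-10']),
})

df.sort_values(['customer_id', 'hit_datetime'], ascending=True, inplace=True)
df['days_since_last_app'] = (
    df.groupby('customer_id')['hit_datetime']
      .transform(lambda y: y.diff().apply(lambda x: 0 if x != x else x.days))
)
print(df)
#    customer_id hit_datetime  days_since_last_app
# 0            1   2016-01-01                    0
# 1            1   2016-01-05                    4
# 2            1   2016-01-20                   15
# 3            2   2016-02-01                    0
# 4            2   2016-02-10                    9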
Converting a string to code
Noteworthy points:
I'm new to coding and am testing various things to learn;
i.e. yes, I'm sure there are better ways to achieve what I am trying to do;
I would like to know any alternative / more efficient methods, however;
I would also still like to know how to convert a string to code to achieve my goal with this technique
So far I have looked around the forum and on Google and seen a few topics on this (including ones using eval and exec), none of which I could make work here or which precisely answer the question from my perspective.
The Scenario
I have a dataframe, london, with 23 columns
I want to create a dataframe showing all rows with 'NaN' values
I have tried to use .isnull(), but it appears to only work on a single column at a time
I am trying to achieve my desired result by using | to return any rows in any columns where .isnull() returns True
An example of this working with just two columns is:
london[(london['Events'].isnull() | london['Max Gust SpeedKm/h'].isnull())]
However, I need to achieve this result with all 23 columns, so I have attempted to complete this with some code.
Attempted Solution
Creating a string containing all of the column headers
i.e. london[(london['Column Header'].isnull() followed by | and then the next column
Then using this string within the container shown in the working example above
i.e. london[(string)]
I have managed to create the string I need using the following:
string = []
for i in london.columns.values:
    string.append("london['" + i + "'].isnull()")
    string.append(" | ")
del string[-1]
final_string = "".join(string)
And finally when I try to implement the final step, I cannot work out how to convert this string into usable code.
For example:
now = eval(final_string)
london[now]
Resulting in:
NotImplementedError: 'Call' nodes are not implemented
Thank you in advance.
This is the easiest way to select the rows in your dataframe with NaN values:
df[pd.isnull(df).any(axis=1)]
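For example, on a hypothetical miniature version of london using two of the column names from the question (the real frame has 23 columns):

import numpy as np
import pandas as pd

london = pd.DataFrame({
    'Events': [np.nan, 'Rain', 'Fog'],
    'Max Gust SpeedKm/h': [31.0, np.nan, 24.0],
})

# all rows containing at least one NaN, across every column
print(london[pd.isnull(london).any(axis=1)])
#   Events  Max Gust SpeedKm/h
# 0    NaN                31.0
# 1   Rain                 NaN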
string = []
for i in london.columns.values:
    string.append(london[i].isnull())

london[0 < sum(string)]
Since each isnull() mask contains only 1s and 0s, and you are looking for rows with at least one 1, you can collect the masks in a list and sum them. Wherever the sum is greater than zero the condition is True, otherwise it is False, so you can use that boolean mask to index london.
I have a huge list of tuples from which I want to extract individual columns. I have tried two methods.
Assuming the list is named List and I want to extract the jth column.
The first one is:
column=[item[j] for item in List]
The second one is:
newList=zip(*List)
column=newList[j]
However, both methods are too slow, since the length of the list is about 50,000 and the length of each tuple is about 100. Is there a faster way to extract columns from the list?
This is something numpy does well:
import numpy as np

A = np.array(Lst)  # this step may take a while ... ideally have Lst as an np.array before you get to this point
sliced = A[:, j]   # this should be really quite fast
That said,
newList=zip(*List)
column=newList[j]
takes less than a second for me with 50,000 tuples of length 100 ... so maybe profile your code and make sure the bottleneck is actually where you think it is...
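If you do want to profile, a rough sketch with timeit along these lines (sizes taken from the question; the column index j is arbitrary, and list() is needed around zip on Python 3) could settle it:

import timeit
import numpy as np

lst = [tuple(range(100))] * 50000  # 50,000 tuples of length 100
j = 50

print(timeit.timeit(lambda: [item[j] for item in lst], number=10))
print(timeit.timeit(lambda: list(zip(*lst))[j], number=10))

arr = np.array(lst)  # one-off conversion cost not included in the timing below
print(timeit.timeit(lambda: arr[:, j], number=10))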