Quite pythonic but not convincing as pandas style - python

I have a dataframe where each series is filled with 0s and 1s, as follows:
flagdf = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 0, 1], 'b': [0, 0, 1, 0, 1, 0, 1, 0]})
Now, depending on some analysis I have done, I need to change some 0s to 1s. So the final dataframe will be:
final = pd.DataFrame({'a': [1, 1, 1, 0, 1, 1, 1, 1], 'b': [1, 1, 1, 0, 1, 1, 1, 1]})
The results of the analysis, which show which 0s have to be changed, are stored in a second dataframe built with a multi-index:
     first  last
a 1      1     1
  5      5     6
b 0      0     1
  5      5     5
  7      7     7
For each 'a' and 'b' I have the first and the last indexes of the 0s I need to change.
First question: The second index in the multi-index dataframe is equal to the series 'first'. I was initially trying to use it directly, but I found it easier to handle two series rather than an index and a series. Am I missing something?
Here is the code to do the job:
def change_one_value_one_column(flagdf, col_name, event):
    flagdf[col_name].iloc[event] = 1

def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]
    tomod = col_tochange[['first', 'last']].values
    iter_tomod = [xrange(el[0], el[1] + 1) for el in tomod]
    [change_one_value_one_column(flagdf, col_name, event) for iterel in iter_tomod for event in iterel]

[change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns]
Second question: I genuinely think that a list comprehension is always good, but in cases like this, when I write a function specifically for a list comprehension, I have some doubts. Is it truly the best thing to do?
Third question: I think that the code is quite pythonic, but I am not proud of it because of the last list comprehension, which runs over the series of the dataframe: using the method apply would look better to my eyes (but I'm not sure how to do it). Nonetheless, is there any real reason (apart from elegance) I should work to make the change?

To answer the part about exhausting an iterator, I think you have a few pythonic choices (all of which I prefer over a list comprehension):
# the easiest, and most readable
for col_name in flagdf.columns:
    change_val_column(col_name)

# consume/exhaust an iterator using built-in any (assuming each call returns None)
any(change_val_column(col_name) for col_name in flagdf.columns)

# use itertools' consume recipe
consume(change_val_column(col_name) for col_name in flagdf.columns)
See the consume recipe in the itertools documentation.
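For reference, the consume recipe from the itertools docs looks like this (copied here so the snippet above is self-contained):

from collections import deque
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)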
However, when doing this kind of thing in numpy/pandas, you should be asking yourself "can I vectorize / use indexing here?". If you can, your code will usually be both faster and more readable.
I think in this case you'll be able to remove one level of loops by doing something like:
def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]  # Note: you're accessing index not column here??
    tomod = col_tochange[['first', 'last']].values
    for i, j in tomod:
        flagdf.loc[i:j, col_name] = 1
You may even be able to remove the for loop, but it's not obvious how / what the intention is here...
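If the intention is simply to set every position covered by any (first, last) pair, one loop-free sketch using numpy broadcasting might look like this (an assumption about the intent, not a drop-in replacement):

import numpy as np

idx = np.arange(len(flagdf))
first = tomod[:, 0][:, None]  # shape (n_ranges, 1)
last = tomod[:, 1][:, None]
# a position is flagged if it falls inside any inclusive [first, last] interval
mask = ((idx >= first) & (idx <= last)).any(axis=0)
flagdf.loc[mask, col_name] = 1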

If I'm staying in python and iterating over rows, I prefer using zip/izip as a first pass.
from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

for col, start, end in izip(tochange.index.get_level_values(0),
                            tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1
Simple and fast.
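Putting that together with the sample data from the question (a minimal sketch; zip is used instead of izip so it also runs on Python 3):

import pandas as pd

flagdf = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 0, 1],
                       'b': [0, 0, 1, 0, 1, 0, 1, 0]})

# the multi-index dataframe of (first, last) ranges shown in the question
tochange = pd.DataFrame(
    {'first': [1, 5, 0, 5, 7], 'last': [1, 6, 1, 5, 7]},
    index=pd.MultiIndex.from_tuples(
        [('a', 1), ('a', 5), ('b', 0), ('b', 5), ('b', 7)]))

for col, start, end in zip(tochange.index.get_level_values(0),
                           tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1

print(flagdf)  # matches `final` from the question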

Related

Can I use apply() or anything else on a 1-dim dataframe to construct a list of dataframes?

Is there any way to construct a list of dataframes from a 1-dim dataframe or list? I thought apply would do the trick, but it seems not to be the case. The job can be done easily with a for loop, but I wish to avoid that. More details below.
This is the code I tried, but it doesn't work:
pd.DataFrame([1,2,3,4,5]).apply(lambda x: pd.DataFrame([x]))
This is the code that does the trick, but a for loop is what I wish to avoid at all costs; run it to see what I am actually trying to achieve:
list = [1, 2, 3, 4, 5]
j = []
for i in list:
    i = pd.DataFrame([i])
    j = j + [i]
In the project I work on, I would be doing something much more complex than turning an element into a 1x1 dataframe: each element would be transformed into a huge dataframe, and eventually each of the generated dataframes would be put into a list. The only bottleneck is exactly the issue I described.
Thanks in advance.
You can simplify and speed up your loop by using a list comprehension, which avoids some of the overhead of an explicit for loop.
Note: I renamed your list to lst, since list is a built-in name in Python; don't shadow it with a variable.
dfs = [pd.DataFrame([x]) for x in lst]
Now we can access each dataframe:
print(dfs[0])
   0
0  1

print(dfs[1])
   0
0  2

for loop inclusive or exclusive in python

When I run the following code, it prints 1 twice. However, I expected only one 1 rather than two.
for i in (1, 1):
    print(i)
Output
1
1
You are iterating over a tuple which contains two elements, each with the value 1, so it prints 1 twice. Your code is equivalent to:
lst = [1, 1]
for item in lst:
    print(item)
If you want to loop over a range of numbers:
for i in range(1, 2):
    print(i)
Or, if you want to print only the unique values in a list or tuple, convert it into a set, which automatically removes the duplicates:
newList = set(lst)
for value in newList:
    print(value)
Sets and tuples are different. I suspect you are confusing them. On a set:
for i in {1, 1}:
    print(i)
1
On a tuple:
for i in (1, 1):
    print(i)
1
1
Think of sets as being like sets in math, and tuples as being more like sequences - you can have redundancies in a sequence, but not in a set.
After reading @KeshavGarg's answer, I suspect you thought that (a, b) in Python would mean "everything from a through b". As you're probably aware by now, this is not the case - you need range to get that. Interestingly (and I admit tangentially), the syntax we're discussing here varies by language. In MATLAB, the range syntax looks a lot more like what I assume you thought the Python range syntax was:
>> for i = 1:4
       disp(i)
   end
There has been some discussion of implementing range literals (a la Matlab) in Python. This introduces a variety of interesting new problems, which you can read about in the documentation linked in the previous sentence.
For loops are always inclusive in Python: they always run over all elements of the iterable (barring break, exceptions, and the like). What probably has you confused is the range syntax: range(1, 2) creates a range object containing the single element 1. It is the range function, not the for loop, that is exclusive, in the sense of not including the stop argument in the range object.
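A quick illustration of range's exclusive stop:

print(list(range(1, 1)))  # []           -> empty: the stop value is excluded
print(list(range(1, 2)))  # [1]          -> the single element 1
print(list(range(1, 5)))  # [1, 2, 3, 4] -> 5 itself is not included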

Datediff in same column

I'm trying to get the difference in time between the last two times a person has applied for our service. My solution works, but it's ugly.
Is there a more pythonic way of accomplishing this?
for customer in previous_apps:
    app_times = df.ix[df['customer_id'] == customer, 'hit_datetime']
    days_since_last_app = [(b - a).days for a, b in zip(app_times, app_times[1:])][-1:][0]
    df.ix[df['customer_id'] == customer, 'days_since_last_app'] = days_since_last_app
Having a list comprehension calculate all the differences between application dates, then slicing the result with [-1:] so you have a list containing only the last element, and then extracting it by indexing with [0], is completely unnecessary.
You can just take the last application date, app_times.iloc[-1], and the second-to-last one, app_times.iloc[-2], and take the difference:
days_since_last_app = (app_times.iloc[-1] - app_times.iloc[-2]).days
This will fail if there are fewer than 2 entries, so you probably want a special case for that.
(I'm guessing that line evolved into what it is by trying to resolve IndexErrors that were the result of not having previous entries.)
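For example, a guarded version might look like this (a sketch; what to store when there is no previous application is an assumption here):

if len(app_times) >= 2:
    days_since_last_app = (app_times.iloc[-1] - app_times.iloc[-2]).days
else:
    days_since_last_app = None  # assumed default; np.nan or 0 may fit your data better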
Start by defining a two-argument function that calculates the time difference for you, e.g. time_diff(a, b). Use it something like this:
df["last_visit"] = df.groupby("customer_id").apply(
    lambda x: time_diff(*x["hit_datetime"].iloc[-2:]))
(Assuming the values in hit_datetime are sorted, which your code implies they are.)
The above "broadcasts" the last_visit values, since multiple records have the same customer_id. If you prefer you can just store the result as a Series with one row per customer:
last_visit = df.groupby("customer_id").apply(
    lambda x: time_diff(*x["hit_datetime"].iloc[-2:]))
I'm not sure I precisely understand how your data is structured, but the following should provide the functionality you require:
df.sort_values(['customer_id', 'hit_datetime'], ascending=True, inplace=True)
# x != x is True only for NaT, i.e. each customer's first hit, which has no predecessor
df['days_since_last_app'] = df.groupby('customer_id')['hit_datetime'].transform(
    lambda y: y.diff().apply(lambda x: 0 if x != x else x.days))
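An equivalent, slightly more direct formulation on made-up data (a sketch; it assumes hit_datetime is already a datetime column, and uses .dt.days on the timedeltas instead of the x != x check):

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'hit_datetime': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-20',
                                    '2023-02-01', '2023-02-10'])})

df.sort_values(['customer_id', 'hit_datetime'], inplace=True)
diffs = df.groupby('customer_id')['hit_datetime'].diff()  # NaT for each first hit
df['days_since_last_app'] = diffs.dt.days.fillna(0).astype(int)
print(df)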

How do I convert a string into code in Python?

Converting a string to code
Noteworthy points:
I'm new to coding and am testing various things to learn;
i.e. yes, I'm sure there are better ways to achieve what I am trying to do;
I would like to know any alternative / more efficient methods, however;
I would also still like to know how to convert a string to code to achieve my goal with this technique.
So far I have looked around the forum and on Google, and seen a few topics on this, none of which I could make work here, or which precisely answers the question from my perspective, including using eval and exec.
The Scenario
I have a dataframe: london with 23 columns
I want to create a dataframe showing all rows with 'NaN' values
I have tried to use .isnull(), but it appears to only work on a single column at a time
I am trying to achieve my desired result by using | to return any rows in any columns where .isnull() returns True
An example of this working with just two columns is:
london[(london['Events'].isnull() | london['Max Gust SpeedKm/h'].isnull())]
However, I need to achieve this result with all 23 columns, so I have attempted to complete this with some code.
Attempted Solution
Creating a string containing all of the column headers
i.e. london[(london['Column Header'].isnull() followed by | and then the next column
Then using this string within the container shown in the working example above
i.e. london[(string)]
I have managed to create the string I need using the following:
string = []
for i in london.columns.values:
    string.append("london['" + i + "'].isnull()")
    string.append(" | ")
del string[-1]
final_string = "".join(string)
And finally when I try to implement the final step, I cannot work out how to convert this string into usable code.
For example:
now = eval(final_string)
london[now]
Resulting in:
NotImplementedError: 'Call' nodes are not implemented
Thank you in advance.
This is the easiest way to select the rows in your dataframe with NaN values:
df[pd.isnull(df).any(axis=1)]
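For instance, on a toy frame using two of the question's column names (a sketch, since london itself isn't shown):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Events': ['Rain', np.nan, 'Fog'],
                   'Max Gust SpeedKm/h': [50.0, 39.0, np.nan]})

# keeps only the rows that have a NaN in at least one column
print(df[pd.isnull(df).any(axis=1)])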
string = []
for i in london.columns.values:
    string.append(london[i].isnull())
london[0 < sum(string)]
Since .isnull() gives you only 1s and 0s (True/False), and you are looking for rows with at least one 1, you can just collect the boolean series in a list and sum them. Wherever the sum is greater than zero the row contains at least one NaN, and wherever it is zero it does not, so you can use the resulting mask to index london.
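The same idea can be written directly against the frame, without the explicit loop (a sketch, equivalent to the code above):

london[london.isnull().sum(axis=1) > 0]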

Fast way of slicing columns from tuples

I have a huge list of tuples from which I want to extract individual columns. I have tried two methods.
Assuming the list is named List and I want to extract the jth column.
The first one is:
column = [item[j] for item in List]
The second one is:
newList = zip(*List)
column = newList[j]
However, both methods are too slow, since the list contains about 50,000 tuples and each tuple has about 100 elements. Is there a faster way to extract the columns from the list?
This is something numpy does well:

import numpy as np

A = np.array(Lst)  # this step may take a while ... ideally keep Lst as an np.array before this point
sliced = A[:, j]   # this should be really quite fast (use A[:, [j]] to keep the result 2-D)
That said,

newList = zip(*List)  # Python 2; on Python 3 wrap it: list(zip(*List))
column = newList[j]

takes less than a second for me with a 50k x 100 list of tuples ... so maybe profile your code and make sure the bottleneck is actually where you think it is...
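A quick way to check where the time actually goes (a sketch; the sizes match those in the question):

import timeit
import numpy as np

lst = [tuple(range(100))] * 50000
j = 42

# pure-Python approaches
print(timeit.timeit(lambda: [item[j] for item in lst], number=10))
print(timeit.timeit(lambda: list(zip(*lst))[j], number=10))

# numpy: the conversion cost is paid once, up front
A = np.array(lst)
print(timeit.timeit(lambda: A[:, j], number=10))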
