Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print(out)
y
a (1.3066417476, 0.203717485506)
b (0.0801133382517, 0.936811414675)
c (1.55784329113, 0.132360504653)
d (0.267999459642, 0.790989680709)
dtype: object
What is the correct way to "unpack" this structure so that I get a DataFrame with two columns?
A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:
t, p = zip(*out)
but then t is
(array(1.3066417475999257),
array(0.08011333825171714),
array(1.557843291126335),
array(0.267999459641651))
and one needs to take the extra step of squeezing it.
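For concreteness, the extra squeeze step would be something like this (a sketch, assuming p comes back the same way as t):
t, p = zip(*out)
t = np.squeeze(np.array(t))  # collapse the tuple of 0-d arrays into a flat float array
p = np.squeeze(np.array(p))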
Maybe this is the most straightforward (most pythonic, I guess):
out.apply(pd.Series)
If you want to rename the columns to something more meaningful, then (note that apply(pd.Series) returns a new DataFrame, so assign it first):
out = out.apply(pd.Series)
out.columns = ['Kstats', 'Pvalue']
If you do not want the default name for the index:
out.index.name = None
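With the example data above, this gives roughly (values taken from the printed tuples):
Kstats Pvalue
a 1.306642 0.203717
b 0.080113 0.936811
c 1.557843 0.132361
d 0.267999 0.790990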
maybe:
>>> pd.DataFrame(out.tolist(), columns=['out-1','out-2'], index=out.index)
out-1 out-2
y
a -1.9153853424536496 0.067433
b 1.277561889173181 0.213624
c 0.062021492729736116 0.951059
d 0.3036745009819999 0.763993
[4 rows x 2 columns]
I believe you want this:
df = pd.DataFrame(out.tolist())
df.columns = ['KS-stat', 'P-value']
result:
KS-stat P-value
0 -2.12978778869 0.043643
1 3.50655433879 0.001813
2 -1.2221274198 0.233527
3 -0.977154419818 0.338240
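One caveat: out.tolist() drops the group labels, so if you want to keep the y values as the index, pass it explicitly (a small variation on the above):
df = pd.DataFrame(out.tolist(), columns=['KS-stat', 'P-value'], index=out.index)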
I have met a similar problem. The two ways I found to solve it are exactly the answers of @CT Zhu and @Siraj S.
Here is some supplementary information you might find interesting:
I compared the two ways and found that @CT Zhu's way performs much faster as the size of the input grows.
Example:
# Python 3
import time
from statistics import mean
import pandas as pd

df_a = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# function to test
def func1(x):
    c = str(x)*3
    d = int(x)+100
    return c, d
# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end-start)
print(mean(time_difference))
# 0.14907703161239624
# CT Zhu's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end-start)
print(mean(time_difference))
# 0.0014058423042297363
PS: Please forgive my ugly code.
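As an aside, the standard-library timeit module handles the looping and averaging for you; a minimal sketch of the same comparison (using the names defined above):
import timeit
siraj = timeit.timeit(lambda: df_a['b'].apply(func1).apply(pd.Series), number=100)
ctzhu = timeit.timeit(lambda: pd.DataFrame(df_a['b'].apply(func1).tolist()), number=100)
print(siraj / 100, ctzhu / 100)  # mean seconds per run for each approach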
Not sure if t and r are predefined somewhere, but if not, you can get the two tuples into t and r with:
>>> t, r = zip(*out)
>>> t
(-1.776982300308175, 0.10543682705459552, -1.7206831272759038, 1.0062163376448068)
>>> r
(0.08824925924534484, 0.9169054844258786, 0.09817788453771065, 0.3243492942246433)
Thus, you could do this:
>>> df = pd.DataFrame(columns=['t', 'r'])
>>> df.t, df.r = zip(*out)
>>> df
t r
0 -1.776982 0.088249
1 0.105437 0.916905
2 -1.720683 0.098178
3 1.006216 0.324349
I'm currently trying to learn how to use multiprocessing in Python, and I want to apply it to some code of mine.
I have read other questions on the subject, but the solutions there did not work in my environment (maybe because something has changed in Python 3.10).
My code looks like:
def obtenern2():
    A = []
    for d in days:
        aux = dfhabil[dfhabil["day"] == d]
        n2 = casosn(aux, 2)
        aml = ExportarMODml(n2)
        adl = ExportarMODdl(n2)
        A.append(aml)
        A.append(adl)
    return pd.concat(A)

B = obtenern2()
where "ExportarMODml" or "ExportarMODdl" takes the dataframe "n2" and perform some calculations returning a dataframe (so "A" is actually a list of dataframes).
I think that "ExportarMODml" and "ExportarMODdl" could be process in parallel, but I dont know how to append the resulting dataframes to the same list without causing corruption or something like that.
Here is a pattern that you could probably adapt to your requirements.
We have two functions ExportarMODml and ExportarMODdl. Each function takes a dictionary as its only argument and returns a DataFrame.
These can be executed in parallel and a concatenation of the returned DataFrames can be achieved thus:
from pandas import concat, DataFrame
from concurrent.futures import ProcessPoolExecutor

def ExportarMODml(d):
    return DataFrame(d)

def ExportarMODdl(d):
    return DataFrame(d)

def main():
    d = {'a': [1, 2], 'b': [3, 4]}
    with ProcessPoolExecutor() as ppe:
        futures = [ppe.submit(func, d) for func in (ExportarMODml, ExportarMODdl)]
        df = concat([future.result() for future in futures])
    print(df)

if __name__ == '__main__':
    main()
Output:
a b
0 1 3
1 2 4
0 1 3
1 2 4
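Adapting the same pattern to the loop in your question might look something like this (an untested sketch; it assumes days, dfhabil and casosn are defined as in your code and that the DataFrames involved are picklable):
from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def obtenern2():
    futures = []
    with ProcessPoolExecutor() as ppe:
        for d in days:
            n2 = casosn(dfhabil[dfhabil["day"] == d], 2)
            # submit both exports for this day; the pool runs them in parallel
            futures.append(ppe.submit(ExportarMODml, n2))
            futures.append(ppe.submit(ExportarMODdl, n2))
        # collecting results in submission order keeps the ml/dl pairing per day
        return pd.concat([f.result() for f in futures])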
I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
a b c d
0 app; 1 2 3
1 app; web; 4 5 6
2 web; 7 8 9
3 1 4 5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to do the same for a list, something like:
df.a.str.contains(["app","web"])  # Returns TypeError: unhashable type: 'list'
My end goal is not an absolute match (df.a.isin(["app", "web"])) but rather a 'contains' logic that returns True if those characters are present anywhere in that cell of the data frame.
Note: I can of course use the apply method to build the same logic myself, such as:
elementsToLookFor = ["app","web"]
df[header] = df['a'].apply(lambda element: all(a in element for a in elementsToLookFor))
But I am more interested in the optimal approach, and so prefer a native pandas function, or else the most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
This should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
So many solutions; which one is the most efficient?
The str.contains-based answers are generally the fastest, though str.findall is also very fast on smaller dataframes:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
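For reference, a rough way to time them yourself (a sketch using timeit on the sample frame repeated to a larger size):
import timeit
big_df = pd.concat([df] * 10_000, ignore_index=True)
for func in (replace_dummies_all, findall_map, lower_contains, contains_concat_all, contains):
    print(func.__name__, timeit.timeit(lambda: func(big_df), number=10))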
Try with str.get_dummies
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
Update
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2 (to ensure case insensitivity doesn't matter and that the column dtype is str, without which the logic may fail):
elementList = ['app','web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower()  # ensure case insensitivity and that the dtype of the column is string
result = df[header].str.contains(valueString)
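Applied to the sample frame above with header = 'a', result comes out roughly as:
0 False
1 True
2 False
3 False
Name: a, dtype: bool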
I would like to create dataframes in a loop and then use those dataframes in another loop. I tried the eval() function but it didn't work.
For example :
for i in range(5):
    df_i = df[(df.age == i)]
There I would like to create df_0, df_1, etc. And then concatenate these new dataframes after some calculations:
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, df_i])
You can create a dict of DataFrames x, keyed by i:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})

x = {}
for i in range(5):
    x[i] = df[df['age'] == i]

final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...
This way is weird and not recommended, but it can be done.
Answer
for i in range(5):
    exec(f"df_{i} = df[df['age']=={i}]")

def UDF(dfi):
    # do something in the user-defined function
    return dfi

for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")

final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    exec(f"final_df = pd.concat([final_df, df_{i}])")
Better Way 1
Using a list or a dict to store the dataframes is a better way, since you can access each dataframe by an index or a key.
Since another answer shows the way using a dict (@perl), I will show the way using a list.
def UDF(dfi):
    # do something in the user-defined function
    return dfi

dfs = [df[df['age'] == i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, the groupby function is the 'pandas' way to do what you want (maybe, I guess, since I don't know exactly what you want to do. LOL):
def UDF(dfi):
    # do something in the user-defined function
    return dfi

final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
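For instance, with a hypothetical UDF that just tags each group with its size, a sketch:
def UDF(dfi):
    # hypothetical example: annotate every row with the size of its age group
    return dfi.assign(group_size=len(dfi))

final_df = df.groupby('age').apply(UDF)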
I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
# this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
# this works for my small-scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
# the error being "unterminated character set at position 1343770"
# I went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up, doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258

st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992
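If you need it even faster, a boolean mask avoids both iterrows and drop entirely (a sketch, assuming every tweet cell is a proper list):
screen = set(df2.z)
mask = df.tweet.apply(lambda tags: not screen.intersection(tags))
df = df[mask]  # keep only rows whose tag list shares nothing with screen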
For me, your code works if I make several adjustments.
First, you're missing the last row when using range(df.tweet.size); either increase the bound or (more robust, if you don't have a contiguous index) use df.tweet.index.
Second, you never apply your dropping; use inplace=True for that.
Third, you have #d inside a string: '#c, #d, #e, #f' is not a list, and you have to change it to one so the membership test works.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # if we already dropped it, we no longer check whether this line should be dropped
This will produce the desired result. Be aware that this is potentially not optimal because it misses vectorization.
EDIT:
You can turn each string into a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: lelem.split(","), l))))
This applies a function to each line (assuming each line contains a list with one or more elements): split each element (which should be a string) by comma into a new list, and "flatten" all the lists in one line (if there are multiple) together.
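For example, ['#c, #d, #e, #f'] becomes ['#c', ' #d', ' #e', ' #f']; note that splitting on "," alone keeps the leading spaces, so you may want a variant that strips them (a sketch):
df.tweet = df.tweet.apply(lambda l: [s.strip() for e in l for s in e.split(",")])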
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind, and once it works, try to improve your code (fewer for-iterations, tricks like collecting the indices and then dropping them all at once).
I have a string as follows :
2017-11-27T09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"
2017-11-27T10:00:17.250-06:00,,"[0.13055123127828835, -62.0, -76.0]"
I would like to have following in my data frame:
09:59:57.278 0.2094101093721778 -65.0 -76.0
10:00:17.250 0.13055123127828835 -62.0 -76.0
I tried to strip the first value as:
a = "2017-11-27T09:59:57.278-06:00,,\"[0.2094101093721778, -65.0, -76.0]\""
b = a.strip("2017-11-27T")
I got the following output:
9:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"
I actually wanted 09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"
Your strip removes any combination of the characters provided from both ends of the string, so it also removed the leading 0 of 09. You might want to do one of the following instead:
a = "2017-11-27T09:59:57.278-06:00,,\"[0.2094101093721778, -65.0, -76.0]\""
b = a.replace("2017-11-27T","")
OR
b = ''.join(a.split("2017-11-27T")[1:])
Output (for both)
'09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"'
If you have different dates, though (and hardcoding is usually bad practice anyway), you probably want to parse that segment of the string as a datetime object and format it back into the string:
import datetime
t = a.split(",")
t[0] = datetime.datetime.strftime(datetime.datetime.strptime(t[0][0:-6], "%Y-%m-%dT%H:%M:%S.%f"), "%H:%M:%S.%f")
b = ','.join(t)  # rejoin with commas, since we split on them
The best way, though, if it's intended for your DataFrame, is probably to just let pandas interpret the date. See this link for more details.
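A sketch of that pandas route (the column names here are made up; the raw text is the two lines from the question):
import pandas as pd
from io import StringIO

raw = ('2017-11-27T09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"\n'
       '2017-11-27T10:00:17.250-06:00,,"[0.13055123127828835, -62.0, -76.0]"')
df = pd.read_csv(StringIO(raw), header=None, names=['ts', 'empty', 'values'])
df['ts'] = pd.to_datetime(df['ts']).dt.strftime('%H:%M:%S.%f')  # keep just the time part
parts = df['values'].str.strip('[]').str.split(',', expand=True).astype(float)
out = pd.concat([df['ts'], parts], axis=1)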
You could try this:
import pandas as pd

lin = '2017-11-27T09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"\n 2017-11-27T10:00:17.250-06:00,,"[0.13055123127828835, -62.0, -76.0]"'
chrToReplace = [',,', '[', ']', '"', ',']
y = []
# iterate through the lines
for x in lin.splitlines():
    # replace the delimiter characters with spaces
    for c in chrToReplace:
        if c in x:
            x = x.replace(c, " ")
    x = x.split()
    # build a dict {"V0": ..., "V1": ...} from the fields of this line
    n = 0
    z = {}
    for elm in x:
        z.update({"V" + str(n): elm})
        n += 1
    y.append(z)
df = pd.DataFrame(y)
print(df)
This gives you
V0 V1 V2 V3
0 2017-11-27T09:59:57.278-06:00 0.2094101093721778 -65.0 -76.0
1 2017-11-27T10:00:17.250-06:00 0.13055123127828835 -62.0 -76.0