Is there a way to get a textual representation of a dataframe that I can paste back into the REPL, but that still looks good as a table? NumPy's repr manages this pretty well; I'm talking about something like:
> df
   A  B  C
i
0  3  1  8
1  3  1  6
2  7  4  6
> df.to_python()
DataFrame(
    columns=['i', 'A', 'B', 'C'],
    data=[[0, 3, 1, 8],
          [1, 3, 1, 6],
          [2, 7, 4, 6]]
).set_index('i')
This seems like it would be especially useful for Stack Overflow, but I often find myself needing to share small dataframes and would love it if this were possible.
Edit: I know about to_csv, to_dict, and so on; what I want is a way of exactly reproducing a dataframe that can also be read as a table. It seems this doesn't currently have an answer (although I'd love to see pandas add it), but I think I can make pd.read_clipboard('\s\s+') work for 95% of my usages.
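For reference, a minimal sketch of that clipboard workflow (this assumes the printed table has already been copied to the clipboard; a named index like the one above may still need fixing by hand):

import pandas as pd

# Parse the copied table back into a frame, treating runs of
# two or more spaces as the column delimiter.
df = pd.read_clipboard(sep=r'\s\s+', engine='python')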
StringIO tells Python to treat a string as a file-like object, which allows you to use the read_csv method; example below...
df = """ A B C
i
0 3 1 8
1 3 1 6
2 7 4 6"""#this is equivalent to str(df) or what happens when you use print df
df = pd.read_csv(StringIO.StringIO(df),sep="\s*",engine = 'python')
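In Python 3 the StringIO class lives in the io module; a minimal equivalent sketch:

import pandas as pd
from io import StringIO

text = """ A B C
i
0 3 1 8
1 3 1 6
2 7 4 6"""
df = pd.read_csv(StringIO(text), sep=r"\s+", engine="python")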
df.to_dict() will get you close, although you do lose the index name:
df.to_dict()
Out[5]: {'A': {0: 30, 1: 3, 2: 7}, 'B': {0: 1, 1: 1, 2: 4}, 'C': {0: 8, 1: 6, 2: 6}}

df_copy = pd.DataFrame(df.to_dict())
df_copy
Out[7]:
    A  B  C
0  30  1  8
1   3  1  6
2   7  4  6
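If losing the index name matters, newer pandas versions (1.4+) also offer to_dict(orient='tight') together with from_dict(..., orient='tight'), which round-trips the index as well; a minimal sketch:

d = df.to_dict(orient='tight')                      # keeps index values, index names, and column order
df_copy = pd.DataFrame.from_dict(d, orient='tight')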
For example I have created this data frame:
import pandas as pd
df = pd.DataFrame({'Cycle': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]})
#Maybe something like this: df['Cycle Type'] = df['Cycle'].rolling(2).apply(lambda x: len(set(x)) != len(x),raw= True).replace({0 : False, 1: True})
I want to count the number of rows per cycle and then assign a type to the cycle. If the cycle has fewer than 12 rows or more than 100 rows, mark it as bad, else mark it as good. I was thinking of using something like that lambda function to check whether the value from the row before was the same, but I'm not sure how to add the count feature to give it the parameters I want.
Start by counting the number of rows in each group with pandas.DataFrame.groupby, pandas.DataFrame.transform, and pandas.DataFrame.count:
df["cycle_quality"] = df.groupby("Cycle")["Cycle"].transform("count")
Then apply the quality function to it using pandas.DataFrame.apply:
• If the number of rows is less than 12 or more than 100, define cycle_quality as bad
• Else, cycle_quality should be good
df["cycle_quality"] = df.apply(lambda x: "bad" if x["cycle_quality"] < 12 or x["cycle_quality"] > 100 else "good", axis=1)
[Out]:
    Cycle cycle_quality
0       0          good
1       0          good
2       0          good
3       0          good
4       0          good
..    ...           ...
71      5           bad
72      5           bad
73      5           bad
74      5           bad
75      5           bad
Use groupby and transform to get the size of each cycle, then use between to check whether that size falls between 13 and 100 (both inclusive), and mark True as good and False as bad. As per the requirement, any size less than 12 or greater than 100 is bad, and everything in between is good.
df['Cycle_Type'] = df.groupby('Cycle')['Cycle'].transform('size').between(
    13, 100, inclusive='both').replace({True: 'good', False: 'bad'})
output:
    Cycle Cycle_Type
0       0        bad
1       0        bad
2       0        bad
3       0        bad
4       0        bad
..    ...        ...
71      5        bad
72      5        bad
73      5        bad
74      5        bad
75      5        bad
Edit:
You can change the interval in which you want good or bad as you wish.
If your requirement is that a count of exactly 12 should also be marked as good, then include 12 in the interval like:
df['Cycle_Type'] = df.groupby('Cycle')['Cycle'].transform('size').between(
    12, 100, inclusive='both').replace({True: 'good', False: 'bad'})
Then your output is:
    Cycle Cycle_Type
0       0       good
1       0       good
2       0       good
3       0       good
4       0       good
..    ...        ...
71      5        bad
72      5        bad
73      5        bad
74      5        bad
75      5        bad
Another way to achieve this:
Use pd.Series.value_counts to get a count for all unique values in df['Cycle'].
Next, apply pd.Series.between to obtain a series with booleans.
We then turn this series into 'good'/'bad' with replace, before passing it to pd.Series.map applied to column Cycle.
import pandas as pd
df = pd.DataFrame({'Cycle': [0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5]})
vc = df.Cycle.value_counts()
df['Cycle_Type'] = df['Cycle'].map(
    vc.between(12, 100, inclusive='both').replace({True: 'good', False: 'bad'}))
# printing output per value
print(df.groupby('Cycle', as_index=False).first())
   Cycle Cycle_Type
0      0       good
1      1        bad
2      2       good
3      3       good
4      4       good
5      5        bad
Here is a way using pd.cut(). This could be useful if more categories than good and bad need to be applied.
import numpy as np

(df['Cycle']
 .map(pd.cut(df['Cycle'].value_counts(),
             bins=[0, 12, 100, np.inf],
             right=False,
             labels=['bad', 'good', 'bad'],
             ordered=False)))
or
s = df['Cycle'].diff().ne(0).cumsum()
np.where(s.groupby(s).transform('count').between(12,100),'good','bad')
Output:
0     good
1     good
2     good
3     good
4     good
      ...
71     bad
72     bad
73     bad
74     bad
75     bad
I built a script with Python and I use pandas.
I'm trying to delete rows from a dataframe.
I want to delete rows that contain empty values in two specific columns.
If one of those two columns is filled in but the other is not, the row is kept.
So I have built this code, which works. But I'm a beginner and I am sure that I can simplify my work.
I'm sure I don't need the "for" loop in my function. I think there is a way to do it with the right method. I read the docs online but I found nothing.
I tried my best but I need help.
Also, for some reasons, I don't want to use numpy.
So here is my code:
import pandas as pnd


def drop_empty_line(df):
    a = df[(df["B"].isna()) & (df["C"].isna())].index
    for i in a:
        df = df.drop([i])
    return df


def main():
    df = pnd.DataFrame({
        "A": [5, 0, 4, 6, 5],
        "B": [pnd.NA, 4, pnd.NA, pnd.NA, 5],
        "C": [pnd.NA, pnd.NA, 9, pnd.NA, 8],
        "D": [5, 3, 8, 5, 2],
        "E": [pnd.NA, 4, 2, 0, 3]
    })
    print(drop_empty_line(df))


if __name__ == '__main__':
    main()
You indeed don't need a loop. You don't even need a custom function; there is already dropna:
df = df.dropna(subset=['B', 'C'], how='all')
# or in place:
# df.dropna(subset=['B', 'C'], how='all', inplace=True)
output:
   A     B     C  D  E
1  0     4  <NA>  3  4
2  4  <NA>     9  8  2
4  5     5     8  2  3
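A related sketch: how='all' drops a row only when both B and C are missing; with the default how='any', a row would be dropped as soon as either of the two columns is missing:

# drop rows where either B or C is missing (how='any' is the default)
df_strict = df.dropna(subset=['B', 'C'], how='any')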
The question is: I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique": the first time a given pair of "a" and "b" values appears I give it a value "uniqueN", and every later occurrence of that exact "a", "b" pair gets the same "uniqueN" value.
In this case
"1", "3" (the first row) from "a" and "b" is the first unique pair, so I give that the value "unique1", and the seventh row will also have the same value which is "unique1" as it is also "1", "3".
"2", "2" (the second row) is the next unique "a", "b" pair so I give them "unique2" and the eight row also has "2", "2" so that will also have "unique2".
"3", "1" (third row) is the next unique, so "unique3", no more rows in the df is "3", "1" so that value wont repeat.
and so on
I have working code that uses loops, but this is not the pandas way; can anyone suggest how I can do this using pandas functions?
Expected Output (my code works, but it's not using pandas methods)
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Code
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1

for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Let's use groupby ngroup with sort=False to ensure values are enumerated in order of appearance, add 1 so group numbers start at one, then convert to string with astype so we can add the prefix unique to the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
      .add(1)
      .map('unique{}'.format)
)
df:
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Setup:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
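For comparison, a similar result can be sketched with pandas.factorize, which also numbers values in order of first appearance (the pairs variable below is just an illustrative intermediate):

# build one tuple per row, then number distinct tuples in order of appearance
pairs = df[['a', 'b']].apply(tuple, axis=1)
codes, uniques = pd.factorize(pairs)
df['unique'] = 'unique' + pd.Series(codes + 1, index=df.index).astype(str)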
I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
print(df)

# keep only the first occurrence of each (a, b) pair
df1 = df[~df.duplicated()].copy()  # .copy() avoids a SettingWithCopyWarning
print(df1)

# use the row index of the first occurrence as the group identifier
df1['unique'] = df1.index
print(df1)

# merge the identifiers back onto every row with the same (a, b) pair
df2 = df.merge(df1, how='left')
print(df2)
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function and would like to know if it is possible to get the equivalent of the dictionary method for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
   A  B   C  D
0  1  1  10  X
1  1  1  20  Y
2  1  2  30  X
3  2  2  40  Y
4  2  1  50  Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
   A   C
0  1  30
1  1  20
2  2  50
3  2  40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2)}  # something like this
)
Is it possible to use the dictionary here?
If you want to get a dictionary like {A: [2 top values from C]}, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
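If you then want that result back as a long-format frame rather than a dictionary, one possible sketch (variable names here are illustrative) is to explode the per-group lists:

top2 = df.groupby('A')['C'].apply(lambda s: s.nlargest(2).tolist())
df_top = top2.explode().reset_index(name='C')  # columns: A, C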
I am working in Python 2.7. I have a data frame and I want to get the average of the column called 'c', but only for the rows whose values in another column are equal to some value.
When I execute the code, the mean is unexpected, but when I do the same calculation with the median, the result is correct.
Why is the output of the mean incorrect?
The code is the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
              ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]]),
    columns=['a', 'b', 'c', 'd']
)
df
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()
median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()
The output:
df
Out[1]:
   a  b  c    d
0  A  1  2    3
1  A  4  5  nan
2  A  7  8    9
3  B  3  2  nan
4  B  5  6  nan
5  B  5  6  nan
mean1
Out[2]: 86.0
mean2
Out[3]: 88.66666666666667
median1
Out[4]: 5.0
median2
Out[5]: 6.0
It is obvious that the output of the mean is incorrect.
Thanks.
Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame.
>>> df[df.a == 'B'].c
3    2
4    6
5    6
Name: c, dtype: object
>>> 266 / 3
88.66666666666667
If you look at the dtypes for your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to the declaration of your numpy array. Arrays are not meant to contain heterogeneous types, so the array defaults to dtype object, which is then passed to the DataFrame constructor. You can avoid this behavior by passing the constructor a list instead, which can hold differing dtypes with no issues.
df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
     ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)
df[df.a == 'B'].c.mean()
4.666666666666667
In [17]: df.dtypes
Out[17]:
a     object
b      int64
c      int64
d    float64
dtype: object
I still can't imagine that this behavior is intended, so I believe it's worth opening an issue report on the pandas development page, but in general, you shouldn't be using object dtype Series for numeric calculations.
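If the frame already holds object columns like this, a minimal sketch of an explicit conversion with pd.to_numeric (column names as in the example above):

# convert the numeric columns in place; errors='coerce' turns unparsable values into NaN
df[['b', 'c', 'd']] = df[['b', 'c', 'd']].apply(pd.to_numeric, errors='coerce')
df[df.a == 'B'].c.mean()  # now a proper numeric mean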