I'm new to Python and struggling to manipulate data with the pandas library. I have a pandas DataFrame like this:
Year Value
0 91 1
1 93 4
2 94 7
3 95 10
4 98 13
and I want to fill in the missing years by creating rows with empty values, like this:
Year Value
0 91 1
1 92 0
2 93 4
3 94 7
4 95 10
5 96 0
6 97 0
7 98 13
How do I do that in Python?
(I want to do that so I can plot Values without skipping years.)
I would create a new dataframe that has Year as an index and covers the entire date range you need. Then you can simply assign the values across the two dataframes, and the index will make sure the correct rows are matched (I've had to use fillna to set the missing years to zero; by default they would be set to NaN):
import pandas as pd

df = pd.DataFrame({'Year': [91, 93, 94, 95, 98], 'Value': [1, 4, 7, 10, 13]})
df.index = df.Year
df2 = pd.DataFrame({'Year': range(91, 99), 'Value': 0})
df2.index = df2.Year
df2.Value = df.Value
df2 = df2.fillna(0)
df2
Value Year
Year
91 1 91
92 0 92
93 4 93
94 7 94
95 10 95
96 0 96
97 0 97
98 13 98
Finally, you can use reset_index if you don't want Year as your index:
df2.drop('Year', axis=1).reset_index()
Year Value
0 91 1
1 92 0
2 93 4
3 94 7
4 95 10
5 96 0
6 97 0
7 98 13
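A more concise variant of the same idea, in case it helps: reindex can insert the missing years in one step (a sketch using the same data; fill_value plays the role of fillna):

import pandas as pd

df = pd.DataFrame({'Year': [91, 93, 94, 95, 98], 'Value': [1, 4, 7, 10, 13]})

# set Year as the index, reindex over the full range of years
# (missing years get fill_value), then restore Year as a column
out = (df.set_index('Year')
         .reindex(range(91, 99), fill_value=0)
         .reset_index())
print(out)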
I have two separate data frames named df1 and df2 as shown below:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 51 58 0.879310
1 1 16 20 95 115 0.826087
2 2 9 9 33 42 0.785714
3 2 12 86 51 137 0.372263
4 2 67 41 98 139 0.705036
5 3 8 0 0 0 0.000000
6 4 99 32 26 58 0.448276
7 4 101 100 24 124 0.193548
8 4 115 69 26 95 0.273684
9 5 6 40 57 97 0.587629
10 5 19 53 87 140 0.621429
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 64 71 0.901408
1 1 16 10 90 100 0.900000
2 2 9 79 86 165 0.521212
3 2 12 12 73 85 0.858824
4 2 67 54 96 150 0.640000
5 3 8 0 0 0 0.000000
6 4 99 86 28 114 0.245614
7 4 101 32 25 57 0.438596
8 4 115 97 16 113 0.141593
9 5 6 86 43 129 0.333333
10 5 19 59 27 86 0.313953
I have already found the sum values for df1 and df2 in Alt_Allele_Count and Coverage_Depth, but I need to divide the resulting Alt_Allele_Count and Coverage_Depth with one another to find the total allele frequency (AF). I tried dividing the two variables and got the error message:
TypeError: float() argument must be a string or a number, not 'DataFrame'
when I tried to convert them to floats, and this table when I left it as a DataFrame:
Alt_Allele_Count Coverage_Depth
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
My code so far:
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
The error stems from the difference between a pandas Series and a DataFrame. A Series is a one-dimensional structure like a single column, while a DataFrame is a two-dimensional object like a table. Two Series added together make a new Series of values, but arithmetic between two DataFrames aligns on both the index and the column labels, so dividing single-column DataFrames whose column names differ (Alt_Allele_Count vs Coverage_Depth) gives NaN in every cell, which is exactly the table you got.
Taking slices of a dataframe can either result in a series or dataframe object depending on how you do it:
df['column_name'] -> Series
df[['column_name', 'column_2']] -> Dataframe
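You can check this for yourself with a throwaway frame (a quick sketch):

import pandas as pd

df = pd.DataFrame({'column_name': [1, 2], 'column_2': [3, 4]})

# single brackets give a Series, double brackets a one-column DataFrame
print(type(df['column_name']))    # <class 'pandas.core.series.Series'>
print(type(df[['column_name']]))  # <class 'pandas.core.frame.DataFrame'>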
So in the line:
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
df1[['Ref_Allele_Count']] is a single-column dataframe rather than a series, so its sum is too.
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
Should return the correct result here. Same goes for the rest of the columns you're adding together.
This can be fixed by using one set of brackets [] when referring to a column in a pandas df, rather than two.
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
# note that I changed your double brackets ([["col_name"]]) to single (["col_name"])
# this results in pd.Series objects instead of pd.DataFrame objects
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1['Alt_Allele_Count'] + df2['Alt_Allele_Count'])
print(Alt_Allele_Count)
Coverage_Depth = (df1['Coverage_Depth'] + df2['Coverage_Depth']).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
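If you would rather keep the double-bracket slices, another option (a sketch of that variant, reusing df1 and df2 from above) is to collapse each single-column result back to a Series with .squeeze() before dividing:

# .squeeze() turns a single-column DataFrame into a Series,
# so the division aligns on the index instead of on column names
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']]).squeeze()
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).squeeze().astype(float)

AF = Alt_Allele_Count / Coverage_Depth
print(AF)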
I am new to numpy and need some help in solving my problem.
I read records from a binary file using dtypes, then select 3 columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([(124,90,5),(125,90,5),(126,90,5),(127,90,0),
                            (128,91,5),(129,91,5),(130,91,5),(131,91,0)]),
                  columns=['atype', 'btype', 'ctype'])
which gives
atype btype ctype
0 124 90 5
1 125 90 5
2 126 90 5
3 127 90 0
4 128 91 5
5 129 91 5
6 130 91 5
7 131 91 0
'atype' is of no interest to me for now.
But what I want is the row numbers when
(x,90,5) appears in 2nd and 3rd columns
(x,90,0) appears in 2nd and 3rd columns
when (x,91,5) appears in 2nd and 3rd columns
and (x,91,0) appears in 2nd and 3rd columns
etc
There are 7 values in the 2nd column, 90, 91, 92, 93, 94, 95, 96, and correspondingly there will be values of either 5 or 0 in the 3rd column.
There are a million entries, so is there any way to find these without a for loop?
Using pandas you could try the following.
df[(df['btype'].between(90, 96)) & (df['ctype'].isin([0, 5]))]
Using your example, if some of the values are changed such that df is
atype btype ctype
0 124 90 5
1 125 90 5
2 126 0 5
3 127 90 100
4 128 91 5
5 129 0 5
6 130 91 5
7 131 91 0
then using the solution above, the following is returned.
atype btype ctype
0 124 90 5
1 125 90 5
4 128 91 5
6 130 91 5
7 131 91 0
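If what you actually need is the row numbers for each (btype, ctype) combination rather than the filtered rows themselves, one vectorized option (a sketch on the original data) is groupby(...).groups, which maps every combination to the index labels where it occurs:

# dict-like mapping from each (btype, ctype) pair to its row labels
groups = df.groupby(['btype', 'ctype']).groups
print(groups[(90, 5)])  # rows 0, 1, 2 in the original example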
I have a data frame as shown in the image. What I want to do is take the mean along the column 'trial': that is, for every combination of subject, condition and sample, take the average of the data along the trial column (100 rows).
What I have done in pandas is the following:
sub_erp_pd = pd.DataFrame()
for j in range(1, 4):
    sub_c = subp[subp['condition'] == j]
    for i in range(1, 3073):
        sub_erp_pd = sub_erp_pd.append(sub_c[sub_c['sample'] == i].mean(), ignore_index=True)
But this takes a lot of time, so I am thinking of using dask instead of pandas. But in dask I am having an issue creating an empty data frame, the way we create an empty data frame in pandas and append data to it.
[image of data frame]
EDIT: as suggested by @edesz, I made changes to my approach
%%time
sub_erp = pd.DataFrame()
for subno in progressbar.progressbar(range(1, 82)):
    try:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    except:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    sub_erp = sub_erp.append(sub.groupby(['condition', 'sample'], as_index=False).mean())
Reading a file using pandas takes 13.6 seconds, while reading a file using dask takes 61.3 ms. But in dask, I am having trouble with appending.
NOTE - The original question was titled Create an empty dask dataframe and append values to it.
If I understand correctly, you need to use groupby (read more here) to group the subject, condition and sample columns; this gathers all rows which have the same value in each of these three columns into a single group. Then take the average using .mean(), which gives you the mean within each group.
Generate some dummy data
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                  columns=['trial', 'condition', 'sample'])
df.insert(0, 'subject', [1]*10 + [2]*30 + [5]*60)
print(df.head())
subject trial condition sample
0 1 71 96 34
1 1 2 89 66
2 1 90 90 81
3 1 93 43 18
4 1 29 82 32
Pandas approach
Aggregate and take mean
df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
Dask approach
Step 1. Imports
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
Step 2. Convert Pandas DataFrame to Dask DataFrame, using .from_pandas
ddf = dd.from_pandas(df, npartitions=2)
Step 3. Aggregate and take mean
ddf_grouped = (
    ddf.groupby(['subject', 'condition', 'sample'])['trial']
    .mean()
    .reset_index(drop=False)
)
with ProgressBar():
    df_grouped = ddf_grouped.compute()
[ ] | 0% Completed | 0.0s
[########################################] | 100% Completed | 0.1s
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
IMPORTANT NOTE: this answer does not create an empty Dask DataFrame and append values to it in order to calculate the mean within groupings of subject, condition and sample. Instead, it provides an alternate approach (a group-by) to reach the same end result.
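If you do need append-like behavior in Dask, the usual pattern is to collect the pieces in a plain Python list and concatenate them once at the end with dd.concat, rather than appending inside a loop. A minimal sketch with toy frames (the names and values are placeholders, not your data):

import pandas as pd
import dask.dataframe as dd

# collect per-file results in a list ...
parts = [dd.from_pandas(pd.DataFrame({'x': [i, i + 1]}), npartitions=1)
         for i in range(3)]

# ... and concatenate once instead of appending repeatedly
combined = dd.concat(parts)
print(combined.compute())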
This question already has answers here: Split (explode) pandas dataframe string entry to separate rows, and Separate comma-separated values within individual cells of Pandas Series using regex.
I am looking to convert data frame df1 to df2 using Python. I have a solution that uses loops but I am wondering if there is an easier way to create df2.
df1
Test1 Test2 2014 2015 2016 Present
1 x a 90 85 84 0
2 x a:b 88 79 72 1
3 y a:b:c 75 76 81 0
4 y b 60 62 66 0
5 y c 68 62 66 1
df2
Test1 Test2 2014 2015 2016 Present
1 x a 90 85 84 0
2 x a 88 79 72 1
3 x b 88 79 72 1
4 y a 75 76 81 0
5 y b 75 76 81 0
6 y c 75 76 81 0
7 y b 60 62 66 0
8 y c 68 62 66 1
Here's one way using numpy.repeat and itertools.chain:
import numpy as np
import pandas as pd
from itertools import chain
# split by delimiter and calculate length for each row
split = df['Test2'].str.split(':')
lens = split.map(len)
# repeat non-split columns
cols = ('Test1', '2014', '2015', '2016', 'Present')
d1 = {col: np.repeat(df[col], lens) for col in cols}
# chain split columns
d2 = {'Test2': list(chain.from_iterable(split))}
# combine in a single dataframe
res = pd.DataFrame({**d1, **d2})
print(res)
print(res)
2014 2015 2016 Present Test1 Test2
1 90 85 84 0 x a
2 88 79 72 1 x a
2 88 79 72 1 x b
3 75 76 81 0 y a
3 75 76 81 0 y b
3 75 76 81 0 y c
4 60 62 66 0 y b
5 68 62 66 1 y c
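On pandas 0.25 or newer, the built-in explode covers this pattern in two short steps; a sketch on the same data (frame reconstructed from the question):

import pandas as pd

df = pd.DataFrame({'Test1': ['x', 'x', 'y', 'y', 'y'],
                   'Test2': ['a', 'a:b', 'a:b:c', 'b', 'c'],
                   '2014': [90, 88, 75, 60, 68],
                   '2015': [85, 79, 76, 62, 62],
                   '2016': [84, 72, 81, 66, 66],
                   'Present': [0, 1, 0, 0, 1]},
                  index=range(1, 6))

# split on ':' into lists, then emit one row per list element
res = df.assign(Test2=df['Test2'].str.split(':')).explode('Test2')
print(res)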
This will achieve what you want:
# Converting "Test2" strings into lists of values
df["Test2"] = df["Test2"].apply(lambda x: x.split(":"))
# Creating second dataframe with "Test2" values
test2 = df.apply(lambda x: pd.Series(x['Test2']),axis=1).stack().reset_index(level=1, drop=True)
test2.name = 'Test2'
# Joining both dataframes
df = df.drop('Test2', axis=1).join(test2)
print(df)
Test1 2014 2015 2016 Present Test2
1 x 90 85 84 0 a
2 x 88 79 72 1 a
2 x 88 79 72 1 b
3 y 75 76 81 0 a
3 y 75 76 81 0 b
3 y 75 76 81 0 c
4 y 60 62 66 0 b
5 y 68 62 66 1 c
I am testing merging on the dataframes below, with the following line of code:
merge1 = pd.merge(df1,df2,on='HPI',how='inner')
I expected this output:
However, instead I get:
Moreover, it doesn't matter which option I use for the how parameter ('inner', 'outer', 'left', 'right'): I always get the same output. I am sure that I do not properly understand merging with respect to the how parameter. Could somebody please explain why I get the same output for all options?
The problem is duplicates in the HPI column. One possibility is to create a MultiIndex with set_index and combine with concat:
merge1 = pd.concat([df1.set_index('HPI', append=True),
                    df2.set_index('HPI', append=True)], axis=1).reset_index(level=1)
print (merge1)
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
2001 80 2 50 50 7
2002 85 3 55 52 8
2003 88 2 65 50 9
2004 85 2 55 53 6
Or use reset_index to turn the index into a column and merge on 2 columns:
merge1 = pd.merge(df1.reset_index(),df2.reset_index(),on=['index','HPI'])
print (merge1)
index HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 2001 80 2 50 50 7
1 2002 85 3 55 52 8
2 2003 88 2 65 50 9
3 2004 85 2 55 53 6
A last solution, if the index values can be duplicated too:
df1 = df1.assign(new=df1.groupby('HPI').cumcount())
df2 = df2.assign(new=df2.groupby('HPI').cumcount())
merge1 = pd.merge(df1,df2,on=['new','HPI']).drop('new',axis=1)
print (merge1)
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 85 3 55 52 8
2 88 2 65 50 9
3 85 2 55 53 6
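For context, a minimal sketch of why every how option looks the same (frames reconstructed from the concat output above, since the originals were posted as images):

import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Low_tier_HPI': [50, 52, 50, 53],
                    'Unemployment': [7, 8, 9, 6]},
                   index=[2001, 2002, 2003, 2004])

# HPI == 85 appears twice on each side, so its rows match 2 x 2 = 4 ways;
# and because every key exists in both frames, 'inner', 'outer',
# 'left' and 'right' all return the same 6 rows
print(pd.merge(df1, df2, on='HPI', how='inner'))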
From your result it appears you just need to perform a left merge on 2 columns ('HPI', 'Low_tier_HPI'), instead of just 'HPI'.
merge1 = pd.merge(df1, df2, on=['HPI', 'Low_tier_HPI'], how='left')
This should produce your desired result.
If there are repeated keys in df2, you can drop duplicates via df2.drop_duplicates(subset=['HPI', 'Low_tier_HPI']) first. In your minimal example, this is not necessary.
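Putting the two steps together (a sketch; it assumes, as this answer implies, that both frames carry HPI and Low_tier_HPI columns):

# drop repeated key pairs on the right, then left-merge on both columns
df2_unique = df2.drop_duplicates(subset=['HPI', 'Low_tier_HPI'])
merge1 = pd.merge(df1, df2_unique, on=['HPI', 'Low_tier_HPI'], how='left')
print(merge1)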