I have multiple csv file that looks something like this:
>>> print(df)
x x.1 x.2 x.3 ... Unnamed: 33 Unnamed: 34 Unnamed: 35 Unnamed: 36
0 x x x x ... x x x x
1 x x x x ... x x x x
2 x x x x ... NaN NaN NaN NaN
3 x x x x ... NaN NaN NaN NaN
4 x x x x ... NaN NaN NaN NaN
5 x x x x ... NaN NaN NaN NaN
6 x x x x ... NaN NaN NaN NaN
7 x x x x ... NaN NaN NaN NaN
8 x x x x ... NaN NaN NaN NaN
9 x x x x ... NaN NaN NaN NaN
10 x x x x ... NaN NaN NaN NaN
11 x x x x ... NaN NaN NaN NaN
12 x x x x ... NaN NaN NaN NaN
13 x x x x ... NaN NaN NaN NaN
14 A A A A ... NaN NaN NaN NaN
15 B B B B ... NaN NaN NaN NaN
16 C C C C ... NaN NaN NaN NaN
17 D D D D ... NaN NaN NaN NaN
18 E E E E ... NaN NaN NaN NaN
19 F F F F ... NaN NaN NaN NaN
20 x x x x ... NaN NaN NaN NaN
21 x x x x ... NaN NaN NaN NaN
22 x x x x ... NaN NaN NaN NaN
23 x x x x ... NaN NaN NaN NaN
24 x x x x ... NaN NaN NaN NaN
[25 rows x 37 columns]
There are a lot of different types of data in this csv file, but the only data I require is that labelled A-F. I have a large amount of these csv files, so what I want to do is merge them together but only with the data that I want from them.
I have two approaches, one better than the other.
(1) The data I want pretty much always occurs on row 14-19 and is 4 columns long. So what I was thinking each time I read in one of these csv files I can skip rows above 14 and below 19 however I am unsure how to do this?
Something like this data = pd.read_csv(file,skiprows=[0:14]) however I also want to skip any rows after 19? Is there a way to just load in rows 14-19 with just columns 0-4?
(2) My second idea I am not sure if possible but in case the data doesn't appear on row 14-19 in one file, maybe I can get python to somehow search for the data I want and that will get rid of any errors of taking the wrong rows?
Any help is appreciated, thanks!
pandas has an additional param, nrows, which can be used to read only a specified number of rows
>>> import pandas as pd
>>> df = pd.read_csv(filename, skiprows=list(range(14)), n_rows=6)
>>> df
x x.1 x.2 x.3 ... Unnamed: 33 Unnamed: 34 Unnamed: 35 Unnamed: 36
0 A A A A ... NaN NaN NaN NaN
1 B B B B ... NaN NaN NaN NaN
2 C C C C ... NaN NaN NaN NaN
3 D D D D ... NaN NaN NaN NaN
4 E E E E ... NaN NaN NaN NaN
5 F F F F ... NaN NaN NaN NaN
following your second idea 'in case the data doesn't appear on row 14-19 in one file':
#getting the desired rows
df_desired = data.loc[ (data['x'] == 'A') | (data['x'] == 'B')|(data['x'] == 'C') | (data['x'] == 'E')| (data['x'] == 'F')]
#getting the first 4 columns
df=df.ix[:,[0:4]]
Related
I have written a function random_sample() where the output is a large dataframe with 2 row and 751 rows. Every time I run the function there is a novel data frame see below.
W_332 W_333 W_334 ... W_1066 W_1067 W_1068 W_1069
0 0.098432 0.094451 0.096085 ... 0.090937 0.068576 0.085326 0.095416
1 0.164170 0.197848 0.161228 ... 0.272259 0.283551 0.229989 0.230067
[2 rows x 751 columns]
When I run the following code
for g in range(5):
sample = random_sample()
ref = {g:sample}
empt_dict.append(ref)
df_pd = pd.DataFrame(empt_dict)
df_pd
I get the following table as my output but I can see all of the 5 dataframes of the above [2 rows x 751 columns] repeated with a novel set of numbers but not together
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
How do I save the 5 iterative 2 by 751 from my function? Thank you!
I have a dataframe with a multiindex that looks like this:
In [1]: import pandas as pd
In [2]: idx = pd.MultiIndex.from_product([["a", "b"], ["p", "q"], ["x", "y"]])
In [3]: df = pd.DataFrame(index=idx, columns=list("ABCD"))
In [4]: df
Out[4]:
A B C D
a p x NaN NaN NaN NaN
y NaN NaN NaN NaN
q x NaN NaN NaN NaN
y NaN NaN NaN NaN
b p x NaN NaN NaN NaN
y NaN NaN NaN NaN
q x NaN NaN NaN NaN
y NaN NaN NaN NaN
I would like to broadcast an assignment of the value [[1,2,3,4],[5,6,7,8]] to the location "a" at level 0, and ["x", "y"] at level 2, and all locations for level 1. So the resulting dataframe would like this:
A B C D
a p x 1 2 3 4
y 5 6 7 8
q x 1 2 3 4
y 5 6 7 8
b p x NaN NaN NaN NaN
y NaN NaN NaN NaN
q x NaN NaN NaN NaN
y NaN NaN NaN NaN
Is this possible in a single assignment operation? My best try has been this, but it doesn't work because the shapes aren't compatible:
In [5]: df.loc["a",:,["x","y"]] = [[1,2,3,4], [5,6,7,8]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [5], in <module>
----> 1 df.loc["a",:,["x","y"]] = [[1,2,3,4], [5,6,7,8]]
...
ValueError: could not broadcast input array from shape (2,4) into shape (4,4)
The best I've been able to come up with is doing it in two steps:
In [6]: df.loc["a",:,"x"] = [1,2,3,4]
In [7]: df.loc["a",:,"y"] =[5,6,7,8]
Well after posting this I figured it out on my own. Feeling a bit dumb, but here is a way that works fine:
In [16]: df.loc["a"] = [[1,2,3,4], [5,6,7,8]]*2
To make it a bit more general:
In [20]: expand=len(df.index.levels[-2])
In [21]: df.loc["a"] = [[1,2,3,4], [5,6,7,8]]*expand
I'd still like to find a nicer way, but this is ok.
I have a DataFrame df that looks something like this:
df
a b c
0 0.557894 -0.196294 -0.020490
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
5 -0.337374 NaN -0.771888
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
9 -2.345448 2.443669 -1.409422
I want to select the rows that have a value over some value, which I would normally do using:
new_df = df[df['c'] >= .5]
but that will return:
a b c
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
5 -0.337374 NaN 0.771888
8 0.737413 NaN 0.679575
I want to get those rows, but also keep the rows that have nan values in column 'c'. I haven't been able to find a question asking the same thing, they usually ask for one or the other, but not both. I can hard code the rows that I want to drop since I know the specific values, but I was wondering if there is a better solution. The end result should look something like this:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
Only dropping rows 0,5 and 9 since they are less than .5 in columns 'c'
You should use the | (or) operator.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.557894,1.138774,np.nan,-0.069319,1.040089,-0.337374,-1.813278,np.nan,0.737413,-2.345448],
'b': [-0.196294,-0.699224,2.384483,np.nan,-0.271777,np.nan,-1.564666,np.nan,np.nan,2.443669],
'c': [-0.020490,np.nan,0.554292,1.162941,np.nan,-0.771888,np.nan,np.nan,0.679575,-1.409422]})
df = df[(df['c'] >= .5) | (df['c'].isnull())]
print(df)
Output:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
You should be able to do this by
new_df = df[df['c'] >=5 or df['c'] == 'NaN']
how to fill df with empty rows or create a df with empty rows.
have df :
df = pd.DataFrame(columns=["naming","type"])
how to fill this df with empty rows
Specify index values:
df = pd.DataFrame(columns=["naming","type"], index=range(10))
print (df)
naming type
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
If need empty strings:
df = pd.DataFrame('',columns=["naming","type"], index=range(10))
print (df)
naming type
0
1
2
3
4
5
6
7
8
9
I have the following dataframe:
In [11]: import numpy as np
...: import pandas as pd
...: df = pd.DataFrame(np.random.random(size=(10,10)), index=range(10), columns=range(10))
...: cols = pd.MultiIndex.from_product([['a', 'b', 'c', 'd', 'e'], ['m', 'n']], names=['l1', 'l2'])
...: df.columns = cols
In [12]: df
Out[12]:
l1 a b c d e
l2 m n m n m n m n m n
0 0.257448 0.207198 0.443456 0.553674 0.765539 0.428972 0.587296 0.942761 0.115083 0.073907
1 0.099647 0.702320 0.792053 0.409488 0.112574 0.435044 0.767640 0.946108 0.257002 0.286178
2 0.110061 0.058266 0.350634 0.657057 0.900674 0.882870 0.250355 0.861289 0.041383 0.981890
3 0.408866 0.042692 0.726473 0.482945 0.030925 0.337217 0.377866 0.095778 0.033939 0.550848
4 0.255034 0.455349 0.193223 0.377962 0.445834 0.400846 0.725098 0.567926 0.052293 0.471593
5 0.133966 0.239252 0.479669 0.678660 0.146475 0.042264 0.929615 0.873308 0.603774 0.788071
6 0.068064 0.849320 0.786785 0.767797 0.534253 0.348995 0.267851 0.838200 0.351832 0.566974
7 0.240924 0.089154 0.161263 0.179304 0.077933 0.846366 0.916394 0.771528 0.798970 0.942207
8 0.808719 0.737900 0.300483 0.205682 0.073342 0.081998 0.002116 0.550923 0.460010 0.650109
9 0.413887 0.671698 0.294521 0.833841 0.002094 0.363820 0.148294 0.632994 0.278557 0.340835
And then I want to do the following groupby-apply operation.
In [17]: def func(df):
...: return df.loc[:, df.columns.get_level_values('l2') == 'm']
...:
In [19]: df.groupby(level='l1', axis=1).apply(func)
Out[19]:
l1 a b c d e
l2 m n m n m n m n m n
0 0.257448 NaN 0.443456 NaN 0.765539 NaN 0.587296 NaN 0.115083 NaN
1 0.099647 NaN 0.792053 NaN 0.112574 NaN 0.767640 NaN 0.257002 NaN
2 0.110061 NaN 0.350634 NaN 0.900674 NaN 0.250355 NaN 0.041383 NaN
3 0.408866 NaN 0.726473 NaN 0.030925 NaN 0.377866 NaN 0.033939 NaN
4 0.255034 NaN 0.193223 NaN 0.445834 NaN 0.725098 NaN 0.052293 NaN
5 0.133966 NaN 0.479669 NaN 0.146475 NaN 0.929615 NaN 0.603774 NaN
6 0.068064 NaN 0.786785 NaN 0.534253 NaN 0.267851 NaN 0.351832 NaN
7 0.240924 NaN 0.161263 NaN 0.077933 NaN 0.916394 NaN 0.798970 NaN
8 0.808719 NaN 0.300483 NaN 0.073342 NaN 0.002116 NaN 0.460010 NaN
9 0.413887 NaN 0.294521 NaN 0.002094 NaN 0.148294 NaN 0.278557 NaN
Notice that even if I do not retun any data for columns with l2=='n', the structure of the original dataframe is still preserved and pandas automatically fill in the values with nan.
This is a simplified example, my intention here is not to select out the 'm' columns, this example is just for a illustration of the problem I am facing -- I want to apply some function on some subset of the columns in the dataframe and the result dataframe should only have the columns I care about.
Also I noticed that you cannot rename the column in the apply function. For example if you do:
In [25]: def func(df):
...: df = df.loc[:, df.columns.get_level_values('l2') == 'm']
...: df = df.rename(columns={'m':'p'}, level=1)
...: return df
...:
In [26]: df.groupby(level='l1', axis=1).apply(func)
Out[26]:
l1 a b c d e
l2 m n m n m n m n m n
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Notice the result is full of NaN but the original format of the DF is preserved.
My question is, what should I do so that in the applied function I can manipulate the df so the output of the apply can be different in shape compared to the original df?
Read "What is the difference between pandas agg and apply function?". Depending on your actual use case, you may not need to change the function being passed into .agg or .apply.
I want to apply some function on some subset of the columns in the dataframe
You can shape the DataFrame before grouping, or return only a subset of e.g. columns with the desired aggregation or function application.
# pass an indexed view
grouped0 = df.loc[:, ['a', 'b', 'c'].groupby(level='l1', axis=1)
# perform the .agg or .apply on a subset of e.g. columns
result1 = df.groupby(level='l1', axis=1)['a', 'b', 'c'].agg(np.sum)
Using .agg on your example code:
In [2]: df
Out[2]:
l1 a b ... d e
l2 m n m n ... m n m n
0 0.007932 0.697320 0.181242 0.380013 ... 0.075391 0.820732 0.335901 0.808365
1 0.736584 0.621418 0.736926 0.962414 ... 0.331465 0.711948 0.426704 0.849730
2 0.099217 0.802882 0.082109 0.489288 ... 0.758056 0.627021 0.539329 0.808187
3 0.152319 0.378918 0.205193 0.489060 ... 0.337615 0.475191 0.025432 0.616413
4 0.582070 0.709464 0.739957 0.472041 ... 0.299662 0.151314 0.113506 0.504926
5 0.351747 0.480518 0.424127 0.364428 ... 0.267780 0.092946 0.134434 0.443320
6 0.572375 0.157129 0.582345 0.124572 ... 0.074523 0.421519 0.733218 0.079004
7 0.026940 0.762937 0.108213 0.073087 ... 0.758596 0.559506 0.601568 0.603528
8 0.991940 0.864772 0.759207 0.523460 ... 0.981770 0.332174 0.012079 0.034952
In [4]: df.groupby(level='l1', axis=1).sum()
Out[4]:
l1 a b c d e
0 0.705252 0.561255 0.804299 0.896123 1.144266
1 1.358002 1.699341 1.422559 1.043413 1.276435
2 0.902099 0.571397 0.273161 1.385077 1.347516
3 0.531237 0.694253 0.914989 0.812806 0.641845
4 1.291534 1.211998 1.138044 0.450976 0.618433
5 0.832265 0.788555 1.063437 0.360726 0.577754
6 0.729504 0.706917 1.018795 0.496042 0.812222
7 0.789877 0.181300 0.406009 1.318102 1.205095
8 1.856713 1.282666 1.183835 1.313944 0.047031
9 0.273369 0.391189 0.867865 0.978350 0.654145
In [10]: df.groupby(level='l1', axis=1).agg(lambda x: x[0])
Out[10]:
l1 a b c d e
0 0.007932 0.181242 0.708712 0.075391 0.335901
1 0.736584 0.736926 0.476286 0.331465 0.426704
2 0.099217 0.082109 0.037351 0.758056 0.539329
3 0.152319 0.205193 0.419761 0.337615 0.025432
4 0.582070 0.739957 0.279153 0.299662 0.113506
5 0.351747 0.424127 0.845485 0.267780 0.134434
6 0.572375 0.582345 0.309942 0.074523 0.733218
7 0.026940 0.108213 0.084424 0.758596 0.601568
8 0.991940 0.759207 0.412974 0.981770 0.012079
9 0.045315 0.282569 0.019320 0.638741 0.292028
In [11]: df.groupby(level='l1', axis=1).agg(lambda x: x[1])
Out[11]:
l1 a b c d e
0 0.697320 0.380013 0.095587 0.820732 0.808365
1 0.621418 0.962414 0.946274 0.711948 0.849730
2 0.802882 0.489288 0.235810 0.627021 0.808187
3 0.378918 0.489060 0.495227 0.475191 0.616413
4 0.709464 0.472041 0.858891 0.151314 0.504926
5 0.480518 0.364428 0.217953 0.092946 0.443320
6 0.157129 0.124572 0.708853 0.421519 0.079004
7 0.762937 0.073087 0.321585 0.559506 0.603528
8 0.864772 0.523460 0.770861 0.332174 0.034952
9 0.228054 0.108620 0.848545 0.339609 0.362117
Since you say that your example func is not your use case, please provide an example of your specific use case if the general cases don't fit.