How to automate dataframe parsing on pandas [duplicate]


I have a dataframe df:
a b c
0 0.897134 -0.356157 -0.396212
1 -2.357861 2.066570 -0.512687
2 -0.080665 0.719328 0.604294
3 -0.639392 -0.912989 -1.029892
4 -0.550007 -0.633733 -0.748733
5 -0.712962 -1.612912 -0.248270
6 -0.571474 1.310807 -0.271137
7 -0.228068 0.675771 0.433016
8 0.005606 -0.154633 0.985484
9 0.691329 -0.837302 -0.607225
10 -0.011909 -0.304162 0.422001
11 0.127570 0.956831 1.837523
12 -1.074771 0.379723 -1.889117
13 -1.449475 -0.799574 -0.878192
14 -1.029757 0.551023 2.519929
15 -1.001400 0.838614 -1.006977
16 0.677216 -0.403859 0.451338
17 0.221596 -0.323259 0.324158
18 -0.241935 -2.251687 -0.088494
19 -0.995426 0.665569 -2.228848
20 1.714709 -0.353391 0.671539
21 0.155050 1.136433 -0.005721
22 -0.502412 -0.610901 1.520165
23 -0.853906 0.648321 1.124464
24 1.149151 -0.187300 -0.412946
25 0.329229 -1.690569 -2.746895
26 0.165158 0.173424 0.896344
27 1.157766 0.525674 -1.279618
28 1.729730 -0.798158 0.644869
29 -0.107285 -1.290374 0.544023
that I need to split into multiple dataframes of 10 rows each, writing every small dataframe to a separate file. I decided to create a multilevel dataframe, and for this I first assign an index to every 10 rows of df with this method:
df['split'] = df['split'].apply(lambda x: np.searchsorted(df.iloc[::10], x, side='right')[0])
which throws:
TypeError: 'function' object has no attribute '__getitem__'
So, do you have an idea of how to fix it? Where is my method wrong?
If you have another approach to split my dataframe into multiple dataframes, each containing 10 rows of df, that is also welcome. This approach was just the first I thought of, and I'm not sure it's the best one.

There are many ways to do what you want; your method looks over-complicated. A groupby using a scaled index as the grouping key would work:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(100, 3), columns=list('ABC'))
# Integer division of the row positions: rows 0-9 get key 0, rows 10-19 get key 1, ...
groups = df.groupby(np.arange(len(df.index)) // 10)
for frameno, frame in groups:
    frame.to_csv("%s.csv" % frameno)
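To see what the integer-division key does, here is a minimal sketch on a 30-row frame shaped like the one in the question. The random data, the seed, and the names key, chunks and chunk_sizes are only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed, not from the question
df = pd.DataFrame(rng.standard_normal((30, 3)), columns=list("abc"))

# Row positions 0..29 divided by 10 give the keys 0, 0, ..., 1, 1, ..., 2, 2, ...
key = np.arange(len(df)) // 10

# Collect each group so the chunks can be inspected before writing them out.
chunks = {frameno: frame for frameno, frame in df.groupby(key)}
chunk_sizes = {frameno: len(frame) for frameno, frame in chunks.items()}
print(chunk_sizes)
```

Each value in chunks keeps the original row labels, so the second chunk starts at index 10.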

You can use a dictionary comprehension to save slices of the dataframe in groups of ten rows:
df_dict = {n: df.iloc[n:n+10, :]
           for n in range(0, len(df), 10)}
>>> df_dict.keys()
[0, 10, 20]
>>> df_dict[10]
a b c
10 -0.011909 -0.304162 0.422001
11 0.127570 0.956831 1.837523
12 -1.074771 0.379723 -1.889117
13 -1.449475 -0.799574 -0.878192
14 -1.029757 0.551023 2.519929
15 -1.001400 0.838614 -1.006977
16 0.677216 -0.403859 0.451338
17 0.221596 -0.323259 0.324158
18 -0.241935 -2.251687 -0.088494
19 -0.995426 0.665569 -2.228848

Related

Finding overlap in range based on multiple dataframe column values

I have a TSV that looks as follows:
chr_1 start_1 chr_2 start_2
11 69633786 14 105884873
12 81940993 X 137690551
13 29782093 12 97838049
14 105864244 11 69633799
17 33207000 20 9992701
17 38446991 20 2102271
17 38447482 17 29623333
20 9992701 17 33207000
20 10426599 17 33094167
20 13765533 17 29469669
22 27415959 8 36197094
22 37191634 8 38983042
22 44464751 18 74004141
8 36197054 22 23130534
8 36197054 22 23131537
8 36197054 8 23130539
This will be referred to as transDiffStartEndChr, which is a Dataframe.
I am working on a program that takes this TSV as input and outputs rows that have the same chr_1 and chr_2, and whose start_1 and start_2 are within +/- 1000 of each other.
Ideal output would look like:
chr_1 start_1 chr_2 start_2
8 36197054 8 23130539
8 36197054 22 23131537
Potentially creating groups for every hit based on chr_1 and chr_2.
My current script/thoughts:
transDiffStartEndChr = pd.read_csv('test-input.tsv', sep='\t')
# I will extract rows first by chr_1; in this case I'm doing a test case for 17.
rowsStartChr17 = transDiffStartEndChr[transDiffStartEndChr.apply(extractChr, chr='17', axis=1)]
# I figure I can do something stupid using brute force, but I feel like I'm not tackling this problem correctly
for index, row in rowsStartChr17.iterrows():
    for index2, row2 in rowsStartChr17.iterrows():
        if index == index2:
            continue
        elif row['chr_1'] == row2['chr_1'] and row['chr_2'] == row2['chr_2']:
            if proximityCheck(row['start_1'], row2['start_1']) and proximityCheck(row['start_2'], row2['start_2']):
                print(f'Row: {index} Match: {index2}')
Any thoughts are appreciated.

You can use numpy and pandas to filter out the groups that don't match your requirements.
>>> df.groupby(['chr_1', 'chr_2'])\
      .filter(lambda s: len(np.array(np.where(
          np.tril(
              np.abs(
                  np.subtract.outer(s['start_2'].values,
                                    s['start_2'].values)) < 1500, -1)))\
          .flatten()) > 0)
The logic is to group by chr_1 and chr_2 and perform an outer subtraction between the start_2 values to check whether any pair of rows differs by less than 1500 (the threshold I used).
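The filter in this answer compares only start_2. A minimal sketch that requires both start_1 and start_2 to be close, assuming the same 1500 threshold: has_close_pair and the small inline frame (a few rows copied from the question) are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

def has_close_pair(group, tol=1500):
    """True if two distinct rows of the group are within tol on both start columns."""
    s1 = group["start_1"].to_numpy()
    s2 = group["start_2"].to_numpy()
    close = (np.abs(np.subtract.outer(s1, s1)) < tol) & (
        np.abs(np.subtract.outer(s2, s2)) < tol
    )
    # The strict lower triangle excludes the diagonal, so a row never matches itself.
    return bool(np.tril(close, -1).any())

# chr columns are kept as strings because the real data also contains 'X'.
df = pd.DataFrame({
    "chr_1": ["8", "8", "8", "17", "17"],
    "start_1": [36197054, 36197054, 36197054, 33207000, 38446991],
    "chr_2": ["22", "22", "8", "20", "20"],
    "start_2": [23130534, 23131537, 23130539, 9992701, 2102271],
})

hits = df.groupby(["chr_1", "chr_2"]).filter(has_close_pair)
print(hits)
```

With a tolerance of 1000 the two (8, 22) rows would not match, since their start_2 values differ by 1003.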

Removing NaN values from Pandas series - no prior post answers have worked

I have the following command which returns a Pandas Series as its output:
def run_ttest():
    for key, value in enumerate(data['RegionName']):
        if value in stateslist:
            indexing = data['differ'].iloc[key]
            Townames.append(indexing)
        else:
            indexing = data['differ'].iloc[key]
            Notowns.append(indexing)
    Unitowns['Unitownvalues'] = Townames
    Notunitowns['Notunitownvalues'] = Notowns
    Notunitowns['Notunitownvalues'] = Notunitowns['Notunitownvalues']
    Unitowns['Unitownvalues'] = Unitowns['Unitownvalues']
    return Unitowns['Unitownvalues']
run_ttest()
The output prints the series Unitowns['Unitownvalues']:
0 -32000.000000
1 -16200.000000
2 -12466.666667
3 -14600.000000
4 633.333333
5 -10600.000000
6 -6466.666667
7 800.000000
8 -3066.666667
9 NaN
10 1566.666667
11 10633.333333
12 6466.666667
13 1333.333333
14 -15233.333333
15 -11833.333333
16 -3200.000000
17 -1566.666667
18 -8333.333333
19 5166.666667
20 5033.333333
21 -6166.666667
22 -16366.666667
23 -22266.666667
24 -112766.666667
25 2566.666667
26 3000.000000
27 -5666.666667
28 NaN
Name: Unitownvalues, dtype: float64
I have tried the following:
Notunitowns['Notunitownvalues'] = Notunitowns['Notunitownvalues'].s[~s.isnull()]
Unitowns['Unitownvalues'] = Unitowns['Unitownvalues'].s[~s.isnull()]
Notunitowns['Notunitownvalues'] = Notunitowns['Notunitownvalues'].dropna()
Unitowns['Unitownvalues'] = Unitowns['Unitownvalues'].dropna()
But none of these attempts has been successful.
There was a prior suggestion on a previous post referring to the conversion of the datatype to 'float', but since the type already is 'float64', adding .astype(float) does not solve the issue.
Would anybody be willing to give me a helping hand?

Is Unitowns a dataframe? In that case, I would do:
Unitowns.dropna(subset=['Unitownvalues'])
This will get you a dataframe with the rows dropped where Unitownvalues is NaN. If you just want the Series, Unitowns['Unitownvalues'].dropna() will work, but assigning it right back to the dataframe will not remove anything: pandas aligns the assigned Series on the dataframe's index, so the dropped positions are filled with NaN again (I guess this is the problem you are having).
Edit:
Does the following not work for you? If not, what is your error?
s = run_ttest()
s = s.dropna()
s
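A minimal sketch of the difference between assigning the result of dropna() back to a column and keeping it as a separate object, assuming Unitowns is a DataFrame. The three-row frame and the names frame, nan_after_assign, clean_series and clean_frame are made up for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Unitownvalues": [-32000.0, np.nan, 633.3]})

# Assigning back aligns on the index, so the NaN at index 1 reappears.
frame["Unitownvalues"] = frame["Unitownvalues"].dropna()
nan_after_assign = int(frame["Unitownvalues"].isna().sum())

# Keeping the result separately gives a Series without NaN.
clean_series = frame["Unitownvalues"].dropna()

# Dropping on the frame removes the whole row instead.
clean_frame = frame.dropna(subset=["Unitownvalues"])

print(nan_after_assign, len(clean_series), len(clean_frame))
```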

Filter dataframe based on list with ranges

The title of my question is probably somewhat wrong. Currently I have a list:
a = [11,12,13,14,15,16,17,18,19,20,21,22,25,26,27,28,29,30,31,37,38,39]
and a dataframe df:
colfrom colto
1 99
23 24
25 32
25 40
How can I filter my dataframe so that colfrom is inside the array a or smaller than it, and colto is inside the array or bigger than it? So basically this rule would lead to:
colfrom colto
1 99
25 32
25 40
The only row that gets kicked out is row 2 (or row 1 in Python), as 23 and 24 are not in the array (and are not lower than 11 or higher than 39).

Use:
mask = ((df['colfrom'].isin(a)) | (df['colfrom']<min(a)) & (df['colto'].isin(a)) | (df['colto']>max(a)))
df[mask]
colfrom colto
0 1 99
2 25 32
3 25 40
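In Python, & binds more tightly than |, so a mask that chains four terms with only | and & between them is grouped as A | (B & C) | D. A minimal sketch, using the list and dataframe from the question, that spells out this grouping next to a stricter reading in which each column must pass its own test. The variable names are illustrative, and which reading the asker wants is not fully clear from the question:

```python
import pandas as pd

a = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
     25, 26, 27, 28, 29, 30, 31, 37, 38, 39]
df = pd.DataFrame({"colfrom": [1, 23, 25, 25], "colto": [99, 24, 32, 40]})

from_in = df["colfrom"].isin(a)
from_low = df["colfrom"] < min(a)
to_in = df["colto"].isin(a)
to_high = df["colto"] > max(a)

# Grouping Python applies when the four terms are joined as A | B & C | D.
as_written = from_in | (from_low & to_in) | to_high

# Stricter reading: colfrom must pass AND colto must pass.
both_sides = (from_in | from_low) & (to_in | to_high)

print(df[as_written].index.tolist())
print(df[both_sides].index.tolist())
```

On this data the first form keeps rows 0, 2 and 3, matching the output shown, while the second also drops the (25, 32) row because 32 is neither in a nor above 39.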

Select middle value for nth rows in Python

I am creating a new dataframe which should contain only the middle value (not the median!) of every n rows; however, my code doesn't work!
I've tried several approaches through pandas or simple Python but I always fail.
value date index
14 40 1983-07-15 14
15 86 1983-07-16 15
16 12 1983-07-17 16
17 78 1983-07-18 17
18 69 1983-07-19 18
19 78 1983-07-20 19
20 45 1983-07-21 20
21 47 1983-07-22 21
22 48 1983-07-23 22
23 ..... ......... ..
RSDF5 = RSDF4.groupby(pd.Grouper(freq='15D', key='DATE')).[int(len(RSDF5)//2)].reset_index()
I know that the code is wrong and I am completely out of ideas!
SyntaxError: invalid syntax

A solution based on indexes.
df is your original dataframe and N is the number of rows you want to group (assumed to be an odd number, so there is a unique middle row).
df2 = df.groupby(np.arange(len(df))//N).apply(lambda x : x.iloc[len(x)//2])
Be aware that if the total number of rows is not divisible by N, the last group is shorter (you still get its middle value, though).
If N is an even number, you get the central row closer to the end of the group: for example, if N=6, you get the 4th row of each group of 6 rows.
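The same selection can also be written with plain row positions instead of groupby/apply. This is an alternative sketch, not the code from the answer: the eight sample rows come from the question, while N = 3 and the names middle_positions and middle_rows are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "value": [40, 86, 12, 78, 69, 78, 45, 47],
    "date": pd.date_range("1983-07-15", periods=8, freq="D"),
})
N = 3  # rows per group (illustrative)

# For each group start, step to the middle of the group; the last group
# may be shorter than N, so its length is capped by the rows that remain.
middle_positions = [
    start + min(N, len(df) - start) // 2
    for start in range(0, len(df), N)
]
middle_rows = df.iloc[middle_positions]
print(middle_rows)
```

Because the rows are picked with iloc, the column dtypes and the original index labels are kept.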

How to sort two columns at once using sorted()

I have a .csv file which holds two columns which I want to sort.
I want to sort the first column alphabetically and the second one by highest number to lowest.
I used sortedColumn = sorted(csv_opener,key=operator.itemgetter(0)) to sort the first column alphabetically but I also want to do the same thing for the second column. How would I go about doing that?

You can sort by two aspects by having the key callable return a tuple.
I'm assuming that the second column is a string convertible to an integer:
sortedColumn = sorted(csv_opener, key=lambda row: (row[0], -int(row[1])))
By returning the negated value of row[1] you can sort from highest to lowest, while the main sort is done on row[0] in alphabetical order.
So for the sample rows:
Alpha, 10
Beta, 30
Alpha, 42
Gamma, 81
Beta, 10
the sorted output gives you:
Alpha, 42
Alpha, 10
Beta, 30
Beta, 10
Gamma, 81
sorting first alphabetically on the first column, and then for equal values in the first column, the rows are sorted in descending order on the second column.
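Negating the key only works for numeric values. An alternative sketch that relies on sorted() being stable: sort by the secondary key first, then by the primary key. The inline CSV text stands in for the asker's file and the names raw, rows, by_number and result are illustrative:

```python
import csv
import io
from operator import itemgetter

raw = "Alpha,10\nBeta,30\nAlpha,42\nGamma,81\nBeta,10\n"
rows = list(csv.reader(io.StringIO(raw)))

# Pass 1: secondary key, highest number first.
by_number = sorted(rows, key=lambda row: int(row[1]), reverse=True)

# Pass 2: primary key; rows with equal names keep their pass-1 order.
result = sorted(by_number, key=itemgetter(0))

for row in result:
    print(row)
```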

Martijn Pieters already provided a perfect answer, but I think it is worth checking out Pandas DataFrame for dealing with CSV data in case you have not considered it.
You can use pandas.read_csv() to read the CSV input as a DataFrame and then use DataFrame.sort_values() to sort it any way you want.
To add an example, let's first generate some random sample data
from faker import Factory
from random import randint, choice
import pandas
fake = Factory.create()
names = [fake.name() for i in range(5)]
nums = [randint(1, 50) for i in range(5)]
data = []
for i in range(10):
    data.append((choice(names), choice(nums)))
df = pandas.DataFrame.from_records(data, columns=("Names", "Nums"))
Resulting in, for example
Names Nums
0 Jeffry Wintheiser 25
1 Dr. Corine Sporer PhD 25
2 Jeffry Wintheiser 17
3 Emmett Reilly 17
4 Jeffry Wintheiser 17
5 Emmett Reilly 33
6 Jeffry Wintheiser 33
7 Lilah Purdy 17
8 Emmett Reilly 22
9 Miss Julie Wisoky 25
Then you can use the sort_values as follows
df.sort_values(["Names", "Nums"], ascending=[True, False])
Resulting in
Names Nums
1 Dr. Corine Sporer PhD 25
5 Emmett Reilly 33
8 Emmett Reilly 22
3 Emmett Reilly 17
6 Jeffry Wintheiser 33
0 Jeffry Wintheiser 25
2 Jeffry Wintheiser 17
4 Jeffry Wintheiser 17
7 Lilah Purdy 17
9 Miss Julie Wisoky 25
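For a headerless two-column CSV like the one in the question, the read_csv step could look like the following sketch. The inline text stands in for the file, and the column names "Names" and "Nums" are assumptions carried over from the example above:

```python
import io

import pandas as pd

raw = "Alpha,10\nBeta,30\nAlpha,42\nGamma,81\nBeta,10\n"

# header=None because the file has no header row; names supplies the labels.
df = pd.read_csv(io.StringIO(raw), header=None, names=["Names", "Nums"])

# Names ascending, and within equal names Nums descending.
sorted_df = df.sort_values(["Names", "Nums"], ascending=[True, False])
print(sorted_df)
```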
