python pandas - map using 2 columns as reference

I have 2 txt files I'd like to read into python: 1) A map file, 2) A data file. I'd like to have a lookup table or dictionary read the values from TWO COLUMNS of one, and determine which value to put in the 3rd column using something like the pandas.map function. The real map file is ~700,000 lines, and the real data file is ~10 million lines.
Toy Dataframe (or I could recreate as a dictionary) - Map
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2000 SNPD
Toy Dataframe - Data File
Chr Position
1 1000
1 2000
2 1000
2 2001
Resulting final table:
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2001 NaN
I found several questions about this with only a one-column lookup: Adding a new pandas column with mapped value from a dictionary. But I can't seem to find a way to use two columns. I'm also open to other packages that may handle genomic data.
As a bonus second question, it'd also be nice if there were a way to map the 3rd column if the position is within a certain distance of the mapped value. In other words, row 4 of the resulting table above would map to SNPD, as it's only 1 away. But I'd be happy to just get the solution for the above.

I would do it this way:
Read your map data so that the first two columns become the index:
dfm = pd.read_csv('/path/to/map.csv', delim_whitespace=True, index_col=[0,1])
Change delim_whitespace=True to sep=',' if you have a comma as the delimiter.
Read your data file (setting the same index):
df = pd.read_csv('/path/to/data.csv', delim_whitespace=True, index_col=[0,1])
Join your DFs:
df.join(dfm)
Output:
In [147]: df.join(dfm)
Out[147]:
Name
Chr Position
1 1000 SNPA
2000 SNPB
2 1000 SNPC
2001 NaN
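If you want Chr and Position back as regular columns afterwards, reset the index:
df.join(dfm).reset_index()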
PS: for the bonus question, try something like the following.
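One way to do the approximate matching is pd.merge_asof, which joins each row to the nearest key within a tolerance. A minimal sketch, assuming the map and data frames read above and an integer Position column:

import pandas as pd

# merge_asof needs Chr/Position as regular columns, sorted by the merge key
data = df.reset_index().sort_values('Position')
mapping = dfm.reset_index().sort_values('Position')

# match each data row to the nearest map Position on the same Chr,
# accepting a difference of at most 1; non-matches get NaN
result = pd.merge_asof(data, mapping, on='Position', by='Chr',
                       direction='nearest', tolerance=1)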

Related

How to attach a column containing the number of occurrences of values in other columns to an existing Dataframe?

I have a data frame containing hyponym and hypernym pairs extracted from StackOverflow posts. You can see an excerpt from it in the following:
0 1 2 3 4
linq query asmx web service THH 10 a linq query as an asmx web service
application bolt THH 1 my application is a bolt on data visualization...
area r time THH 1 the area of the square is r times
sql query syntax HTH 3 sql like query syntax
...
7379596 rows × 5 columns
Column 0 and column 1 contain the hyponym and hypernym parts of the phrases contained in column 4. I would like to implement a filter based on statistical features, so I have to count all occurrences of the (0, 1) pairs together, and all occurrences of the hyponym and hypernym parts respectively. Pandas has a method called value_counts(), so the occurrences can be obtained by:
df.value_counts([0])
df.value_counts([1])
df.value_counts([0, 1])
This is nice, but the method returns a Pandas Series which has far fewer records than the original DataFrame; therefore, adding a new column like df[5] = df.value_counts([0, 1]) does not work.
I have found a workaround: I created 3 Pandas Series, one for each occurrence type (pair, hyponym, hypernym), and wrote a small loop to calculate a confidence score for every pair. But as the original dataset is huge (more than 7 million records) this calculation is not efficient (it had not finished after 30 hours). A feasible and hopefully efficient solution would use Pandas applymap() for this purpose, but that requires attaching the occurrence counts as columns of the original DataFrame. So I would like a DataFrame like this one:
0 1 2 3 4 5 6 7
sql query anything anything a phrase 1000 800 500
sql query anything anything anotherphrase 1000 800 500
...
Column 5 is the number of occurrences of the hyponym part (sql), column 6 is the number of occurrences of the hypernym part (query), and column 7 is the number of occurrences of the pair (sql, query). As you can see the pairs are the same but they are extracted from different phrases.
My question is how to do that? How can I attach occurrences as a new column to an existing DataFrame?
Here's a solution on how to map the value counts of the combination of two columns to a new column:
import pandas as pd

# Create an example DataFrame
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})
# Count the paired occurrences in a new column
df["count"] = df.groupby([0, 1])[0].transform('size')
Before editing, I had answered this question with a solution using value_counts and a merge. This original solution is slower and more complicated than the groupby:
# Put the value_counts in a new DataFrame, with the column named "count"
vcdf = df[[0, 1]].value_counts().to_frame("count")
# Merge the df with the vcs
merged = pd.merge(left=df, right=vcdf, left_on=[0, 1], right_index=True)
# Potentially sort index
merged = merged.sort_index()
The resulting DataFrame:
0 1 count
0 a c 1
1 a d 2
2 a d 2
3 b d 1
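For the OP's full use case, the same transform('size') idea can attach all three occurrence counts at once. A sketch using the question's column numbering:

# occurrences of the hyponym (column 0), the hypernym (column 1),
# and the (0, 1) pair, attached as new columns 5, 6 and 7
df[5] = df.groupby(0)[0].transform('size')
df[6] = df.groupby(1)[1].transform('size')
df[7] = df.groupby([0, 1])[0].transform('size')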

Pandas and lists efficiency problem? Script takes too long

I'm kind of new to python and pandas. I have a csv with around 100k rows and only three columns of interest:
idd date prod
1 201601 1000
1 200605 2000
2 200102 1500
2 200903 1200
3 ... ...
I need to group by idd, order by date (year), and then transpose the 'prod' column so the first existing 'prod' value for each idd (sorted by date) ends up in the first column after idd, dropping the date value. In my example it would be this:
idd '1' '2' '3'
1 2000 1000 ...
2 1500 1200 ...
3 ... ... ...
I also filtered for idds which have at least "nrows" reported values, since I am not interested in idds with fewer than a certain number. Since I have read that iterating over the groups made by groupby is not efficient, I made a list of names resulting from groupby and ran the queries against the original dataframe, but it nevertheless takes too long (around 5 minutes) to run. Maybe I am doing something wrong? I tried to use as few objects as possible, loop using iloc and for loops, and use the list of names instead of "get_group", but maybe I am missing something. Here is my code:
nrows = 36
for name in grouped_df.groups.keys():
    for i in range(0, len(origin_df[origin_df.idd == name]['idd'])):
        if len(origin_df[origin_df.idd == name]['idd']) >= nrows:
            aux_df = origin_df[origin_df.idd == name]
            aux_df.sort_values(by=['date'], inplace=True)
            idd = name
            prod = aux_df.iloc[i, 1]
            new_df.loc[idd, i + 1] = prod
            new_df.loc[idd, 'idd'] = idd
This is my first question in this page, so if I made some styling errors please forgive me, and all suggestions are welcome!!! Thanks in advance :)
Try:
df.set_index(['idd', df.groupby('idd').cumcount() + 1])['prod'].unstack()
Output:
1 2
idd
1 1000 2000
2 1500 1200
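A sketch that also covers the sorting by date and the nrows filter from the question, assuming columns idd, date and prod:

nrows = 36
df = df.sort_values(['idd', 'date'])  # order each idd's rows by date
df = df[df.groupby('idd')['idd'].transform('size') >= nrows]  # keep idds with enough rows
result = df.set_index(['idd', df.groupby('idd').cumcount() + 1])['prod'].unstack()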

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby:
moy = base.sort_values('Mean').tail(1)
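Alternatively, assuming base is the Series of row means, idxmax gives the label of the row with the highest mean directly:

base = df.mean(axis=1)              # Series of row means
best_row = df.loc[[base.idxmax()]]  # the row whose mean is highest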
It looks as though your data is a string, or a single column with a space between your two numbers. I suggest splitting the column into two and/or using something like the code below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

Pandas adding rows to df in loop

I'm parsing data in a loop and once it's parsed and structured I would like to then add it to a data frame.
The end format of the data frame I would like is something like the following:
df:
id 2018-01 2018-02 2018-03
234 2 1 3
345 4 5 1
534 5 3 4
234 2 2 3
When I iterate through the data in the loop I have a dictionary with the id, the month and the value for the month, for example:
{'id':234,'2018-01':2}
{'id':534,'2018-01':5}
{'id':534,'2018-03':4}
.
.
.
What is the best way to take an empty data frame and add rows and columns with their values to it in a loop?
Essentially as I iterate it would look something like this
df:
id 2018-01
234 2
then
df:
id 2018-01
234 2
534 5
then
df:
id 2018-01 2018-03
234 2
534 5 4
and so on...
IIUC, you need to convert each dict to a DataFrame first, then append it; since the same 'id' can appear more than once, we group by the index and take the first non-null value in each column:
df = pd.DataFrame()
l = [{'id': 234, '2018-01': 2},
     {'id': 534, '2018-01': 5},
     {'id': 534, '2018-03': 4}]
for x in l:
    # build a one-row frame indexed by id, append it, then merge duplicate
    # ids by keeping the first non-null value in each column
    # (DataFrame.append was removed in pandas 2.0; pd.concat is its replacement)
    df = df.append(pd.Series(x).to_frame().T.set_index('id')).groupby(level=0).first()
    print(df)
2018-01
id
234 2
2018-01
id
234 2
534 5
2018-01 2018-03
id
234 2.0 NaN
534 5.0 4.0
It is not advisable to generate a new data frame at each iteration and append it; this is quite expensive. If your data is not too big and fits into memory, you can make a list of dictionaries first, and then pandas allows you to simply do:
df = pd.DataFrame(your_list_of_dicts)
df = df.set_index('id')
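For the OP's data, where the same id can appear in several dicts, the duplicates can then be collapsed. A sketch with the example records:

import pandas as pd

records = [{'id': 234, '2018-01': 2},
           {'id': 534, '2018-01': 5},
           {'id': 534, '2018-03': 4}]

df = pd.DataFrame(records)
# merge rows that share an id, keeping the first non-null value per column
df = df.groupby('id').first()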
If making a list is too expensive (because you'd like to save memory for the data frame), consider using a generator instead of a list. The basic anatomy of a generator function is this:
def datagen(your_input):
    for item in your_input:
        record = ...  # your code to build a dict from item
        yield record
The generator object data = datagen(your_input) will not store the dicts but yields one dict at each iteration. It can generate items on demand. When you do pd.DataFrame(data), pandas will stream all the data and make a data frame. Generators can be used for data pipelines (like pipes in UNIX) and are very powerful for big data workflows. Be aware, however, that a generator object can be consumed only once; that is, if you run pd.DataFrame(data) again, you will get an empty data frame.
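Usage would look something like this (remember the generator can only be consumed once):

data = datagen(your_input)  # nothing is computed yet
df = pd.DataFrame(data)     # pandas consumes the generator here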
The easiest way I've found in Pandas (although not intuitive) to iteratively append new data rows to a dataframe is using df.loc[ ] to reference the last (nonexistent) row, with len(df) as the index:
df.loc[ len(df) ] = [new, row, of, data]
This will "append" the new data row to the end of the dataframe in-place.
The above example is for an empty Dataframe with exactly 4 columns, such as:
df = pandas.DataFrame( columns=["col1", "col2", "col3", "col4"] )
df.loc[ ] indexing can insert data at any row at all, whether or not it exists yet. It seems it will never give an IndexError, like a numpy array or list would if you tried to assign to a nonexistent row.
For a brand-new, empty DataFrame, len(df) returns 0, and thus references the first, blank row, and then increases by one each time you add a row.
–––––
I do not know the speed/memory efficiency cost of this method, but it works great for my modest datasets (a few thousand rows). At least from a memory perspective, I imagine that a large loop appending data to the target DataFrame directly would use less memory than generating an intermediate list of duplicate data first and then generating a DataFrame from that list. Time "efficiency" could be a different question entirely, one for the other SO gurus to comment on.
–––––
However, for the OP's specific case, where you also requested to combine the columns if the data is for an existing identically-named column, you'd need some logic during your for loop.
Instead I would make the DataFrame "dumb" and just import the data as-is, repeating dates as they come, eg. your post-loop DataFrame would look like this, with simple column names describing the raw data:
df:
id date data
234 2018-01 2
534 2018-01 5
534 2018-03 4
(has two entries for the same date).
Then I would use the DataFrame's databasing functions to organize this data how you like, probably using some combination of df['date'].unique() and df.sort_values(). Will look into that more later.
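For example, a pivot gets you from the "dumb" frame above to the wide layout the OP wants. A sketch using the column names from the table above:

import pandas as pd

df = pd.DataFrame({'id':   [234, 534, 534],
                   'date': ['2018-01', '2018-01', '2018-03'],
                   'data': [2, 5, 4]})

# one row per id, one column per date
wide = df.pivot_table(index='id', columns='date', values='data', aggfunc='first')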

Pandas: Nesting Dataframes

Hello, I want to store a dataframe in another dataframe's cell.
I have data that looks like this:
I have daily data consisting of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both datasets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it seems possible to nest dataframes but I would recommend instead rethinking how you want to structure your data, which depends on your application or the analyses you want to run on it after.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
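If the goal is analysis rather than storage, a flat frame with a MultiIndex is usually easier to work with than nested cells. A sketch using pd.concat with keys:

import numpy as np
import pandas as pd

df1, df2, df3 = (pd.DataFrame(np.random.rand(3, 3)) for _ in range(3))

# one flat frame, with an outer index level identifying the source frame
combined = pd.concat([df1, df2, df3], keys=[1, 2, 3])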
