Pandas and lists efficiency problem? Script takes too long - python

I'm kind of new to python and pandas. I have a csv with around 100k rows, with only three columns of interest:
idd    date    prod
1      201601  1000
1      200605  2000
2      200102  1500
2      200903  1200
3      ......  ......
I need to group by idd, order by date (year) and then transpose the 'prod' column so that the first existing 'prod' value for each idd, sorted by date, ends up in the first column after idd, dropping the date value. In my example it would be this:
idd    '1'   '2'   '3'
1      2000  1000  ...
2      1500  1200  ...
3      ...   ...   ...
I also filtered for idds which have more than "nrows" reported values, since I am not interested in idds that have fewer than a certain number. Since I have read that iterating over the groups produced by groupby is not efficient, I made a list of the group names from groupby and ran the queries against the original dataframe, but it still takes too long (about 5 minutes) to run. Maybe I am doing something wrong? I tried to keep objects to a minimum, loop using iloc and for loops to increase efficiency, and use the list of names instead of "get_group", but maybe I am missing something. Here is my code:
nrows = 36
for name in grouped_df.groups.keys():
    for i in range(0, len(origin_df[origin_df.idd == name]['idd'])):
        if len(origin_df[origin_df.idd == name]['idd']) >= nrows:
            aux_df = origin_df[origin_df.idd == name]
            aux_df.sort_values(by=['date'], inplace=True)
            idd = name
            prod = aux_df.iloc[i, 1]
            new_df.loc[idd, i + 1] = prod
            new_df.loc[idd, 'idd'] = idd
This is my first question on this site, so if I made any styling errors please forgive me, and all suggestions are welcome!!! Thanks in advance :)

Try:
df = df.sort_values('date')
df.set_index(['idd', df.groupby('idd').cumcount() + 1])['prod'].unstack()
Output:
        1     2
idd
1    2000  1000
2    1500  1200
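If you also need the filter from the question (keep only idds with at least nrows reported values), here is a sketch layered on top of the answer above; nrows is taken from the question's code:
nrows = 36  # minimum number of reported values per idd, from the question

df = df.sort_values('date')
df = df[df.groupby('idd')['idd'].transform('size') >= nrows]   # keep only idds with enough rows
result = df.set_index(['idd', df.groupby('idd').cumcount() + 1])['prod'].unstack()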

Related

Group by ids, sort by date and get values as list on big data python

I have big data (30 million rows).
Each table has id, date, value.
I need to go over each id and, for each id, get a list of its values sorted by date, so that the first value in the list corresponds to the oldest date.
Example:
ID DATE VALUE
1 02/03/2020 300
1 04/03/2020 200
2 04/03/2020 456
2 01/03/2020 300
2 05/03/2020 78
Desire table:
ID VALUE_LIST_ORDERED
1 [300,200]
2 [300,456,78]
I can do it with a for loop or with apply, but it's not efficient, and with millions of users it's not feasible.
I thought about using groupby and sorting by date, but I don't know how to build the list that way, and is groupby on a pandas df even the best approach?
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
What you need to do is order your data using pandas.DataFrame.sort_values and then apply the groupby method.
I don't have a huge data set to test this code on, but I believe this would do the trick:
import numpy as np

sorted_df = data.sort_values('DATE')
result = sorted_df.groupby('ID').VALUE.apply(np.array)
And since it's Python you can always put everything in one statement:
print(data.sort_values('DATE').groupby('ID').VALUE.apply(np.array))
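A minimal, self-contained check of that approach on the toy data from the question (apply(list) is used here just so the output prints as plain lists; the answer's np.array works the same way):
import pandas as pd

data = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 2],
    "DATE":  ["02/03/2020", "04/03/2020", "04/03/2020", "01/03/2020", "05/03/2020"],
    "VALUE": [300, 200, 456, 300, 78],
})
data["DATE"] = pd.to_datetime(data["DATE"], format="%d/%m/%Y")  # parse day-first dates

# sort by date, then collect each id's values in that order
result = data.sort_values("DATE").groupby("ID")["VALUE"].apply(list)
print(result)
# ID
# 1        [300, 200]
# 2    [300, 456, 78]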

Pandas: add a new column with one single value at the last row of a dataframe

My request is simple but I am stuck and do not know why... my dataframe looks like this:
price time
0 1 3
1 3 6
2 4 7
What I need to do is to add a new column mkt with only one value equal to 10 at the last row (in my example index = 2). What I have tried:
df.mkt=''
df.mkt.loc[-1] =10
But when I look at my dataframe again, the last row is not updated... I know the answer must be simple, but I am stuck. Any ideas? Thanks!
You can use the at function (create the column first, then set the value on the last row's label):
df['mkt'] = ''
df.at[df.index[-1], 'mkt'] = 10
Or build the whole column at once:
df['mkt'] = [''] * (len(df) - 1) + [10]
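For completeness, a small self-contained sketch using the toy frame from the question (the column name mkt and the value 10 come from the question):
import pandas as pd

df = pd.DataFrame({"price": [1, 3, 4], "time": [3, 6, 7]})

df["mkt"] = ""                               # create the column first
df.loc[df.index[-1], "mkt"] = 10             # set the value on the last row via its label
# purely positional alternative:
# df.iloc[-1, df.columns.get_loc("mkt")] = 10
print(df)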

Pandas - get first n-rows based on percentage

I have a dataframe and I want to pull a certain number of records, but instead of a number I want to pass a percentage value.
For example,
df.head(n=10)
returns the first 10 records from the data set. I want a small change: instead of the first 10 records, I want the first 5% of records from my data set.
How do I do this in pandas?
I'm looking for code like this:
df.head(frac=0.05)
Is there any simple way to get this?
I want to pop first 5% of record
There is no built-in method, but you can do this:
Multiply the total number of rows by your percentage and use the result as the parameter for the head method.
n = 5
df.head(int(len(df) * (n / 100)))
So if your dataframe contains 1000 rows and n = 5, you will get the first 50 rows.
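The same idea wrapped in a small helper, as a sketch (head_pct is just an illustrative name, not a pandas method):
import pandas as pd

def head_pct(df, pct):
    # return the first pct percent of rows, rounded down
    return df.head(int(len(df) * pct / 100))

df = pd.DataFrame({"x": range(1000)})
print(len(head_pct(df, 5)))   # 50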
I've extended Mihai's answer for my usage and it may be useful to people out there.
The purpose is automated top-n records selection for time series sampling, so you're sure you're taking old records for training and recent records for testing.
# having
# import pandas as pd
# df = pd.DataFrame...
def sample_first_prows(data, perc=0.7):
    return data.head(int(len(data) * perc))

train = sample_first_prows(df)
test = df.iloc[len(train):]   # the remaining rows after the training slice
I also had the same problem and #mihai's solution was useful. For my case I rewrote it as:
percentage_to_take = 5/100
rows = int(df.shape[0]*percentage_to_take)
df.head(rows)
I presume for the last percentage of rows df.tail(rows) would work as well, though note that df.head(-rows) is not equivalent (see the sketch below).
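A quick sketch of that difference, reusing the rows variable from above:
df.tail(rows)    # the last 5% of the rows
df.head(-rows)   # everything EXCEPT the last 5% of the rows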
Maybe this will help (taking the first 5% of rows within each group):
tt = tmp.groupby('id').apply(lambda x: x.head(int(len(x)*0.05))).reset_index(drop=True)
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))
print(df)
0 1
0 0.375727 -1.297127
1 -0.676528 0.301175
2 -2.236334 0.154765
3 -0.127439 0.415495
4 1.399427 -1.244539
5 -0.884309 -0.108502
6 -0.884931 2.089305
7 0.075599 0.404521
8 1.836577 -0.762597
9 0.294883 0.540444
#70% of the Dataframe
part_70 = df.sample(frac=0.7, random_state=10)
print(part_70)
0 1
8 1.836577 -0.762597
2 -2.236334 0.154765
5 -0.884309 -0.108502
6 -0.884931 2.089305
3 -0.127439 0.415495
1 -0.676528 0.301175
0 0.375727 -1.297127
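Note that df.sample(frac=...) returns a random subset, not the first rows. If the original order matters, positional slicing is one option; a sketch:
frac = 0.7
first_70 = df.iloc[:int(len(df) * frac)]   # first 70% of rows, original order preserved
print(first_70)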

Identify unique values within pandas dataframe rows that share a common id number

Here is a sample df:
data = {"Brand":{"0":"BrandA","1":"BrandA","2":"BrandB","3":"BrandB","4":"BrandC","5":"BrandC"},"Cost":{"0":18.5,"1":19.5,"2":6,"3":6,"4":17.69,"5":18.19},"IN STOCK":{"0":10,"1":15,"2":5,"3":1,"4":12,"5":12},"Inventory Number":{"0":1,"1":1,"2":2,"3":2,"4":3,"5":3},"Labels":{"0":"Black","1":"Black","2":"White","3":"White","4":"Blue","5":"Blue"},"Maximum Price":{"0":30.0,"1":35.0,"2":50,"3":45.12,"4":76.78,"5":76.78},"Minimum Price":{"0":23.96,"1":25.96,"2":12.12,"3":17.54,"4":33.12,"5":28.29},"Product Name":{"0":"Product A","1":"Product A","2":"ProductB","3":"ProductB","4":"ProductC","5":"ProductC"}}
df = pd.DataFrame(data=data)
My actual data set is much larger, but maintains the same pattern of there being 2 rows that share the same Inventory Number throughout.
My goal is to create a new data frame that contains only the inventory numbers where a cell value is not duplicated across both rows, and for those inventory numbers, only contains the data from the row with the lower index that is different from the other row.
For this example the resulting data frame would need to look like:
data = {"Inventory Number":{"0":1,"1":2,"2":3},"Cost":{"0":18.50,"1":"","2":17.69},"IN STOCK":{"0":10,"1":5,"2":""},"Maximum Price":{"0":30,"1":50,"2":""},"Minimum Price":{"0":23.96,"1":12.12,"2":33.12}}
df = pd.DataFrame(data=data)
The next time this runs, perhaps nothing changed in "Maximum Price", so that column should not be included at all.
I was hoping someone would have a clean solution using groupby, but if not, I imagine the solution would involve dropping all duplicates, then looping through all of the remaining inventory numbers and evaluating each column for duplicates.
icol = 'Inventory Number'
d0 = df.drop_duplicates(keep=False)            # drop rows that are exact duplicates of another row
i = d0.groupby(icol).cumcount()                # 0 for the first row of each inventory number, 1 for the second
d1 = d0.set_index([icol, i]).unstack(icol).T   # index: (column, inventory number), columns: occurrence 0/1
d1[1][d1[1] != d1[0]].unstack(0)               # keep only cells where the second row differs from the first
                   Cost IN STOCK Maximum Price Minimum Price
Inventory Number
1                  19.5       15            35         25.96
2                  None        1         45.12         17.54
3                 18.19     None          None         28.29
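If a column should disappear entirely when it did not change for any inventory number (the follow-up in the question), one possible addition on top of the result above, as a sketch:
diff = d1[1][d1[1] != d1[0]].unstack(0)   # the frame shown above
diff = diff.dropna(axis=1, how='all')     # drop columns where nothing changed for any inventory number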
Try this:
In [68]: cols = ['Cost','IN STOCK','Inventory Number','Maximum Price','Minimum Price']
In [69]: df[cols].drop_duplicates(subset=['Inventory Number'])
Out[69]:
Cost IN STOCK Inventory Number Maximum Price Minimum Price
0 18.5 10 100566 30.0 23.96

python pandas - map using 2 columns as reference

I have 2 txt files I'd like to read into python: 1) A map file, 2) A data file. I'd like to have a lookup table or dictionary read the values from TWO COLUMNS of one, and determine which value to put in the 3rd column using something like the pandas.map function. The real map file is ~700,000 lines, and the real data file is ~10 million lines.
Toy Dataframe (or I could recreate as a dictionary) - Map
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2000 SNPD
Toy Dataframe - Data File
Chr Position
1 1000
1 2000
2 1000
2 2001
Resulting final table:
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2001 NaN
I found several questions about this with only a one-column lookup: Adding a new pandas column with mapped value from a dictionary. But I can't seem to find a way to use 2 columns. I'm also open to other packages that may handle genomic data.
As a bonus second question, it'd also be nice if there was a way to map the 3rd column if it was within a certain amount of the mapped value. In other words, row 4 of the resulting table above would map to SNPD, as it's only 1 away. But I'd be happy to just get the solution for the above.
I would do it this way:
read your map data so that the first two columns become the index:
dfm = pd.read_csv('/path/to/map.csv', delim_whitespace=True, index_col=[0,1])
change delim_whitespace=True to sep=',' if you have , as a delimiter
read your data DF (setting the same index):
df = pd.read_csv('/path/to/data.csv', delim_whitespace=True, index_col=[0,1])
join your DFs:
df.join(dfm)
Output:
In [147]: df.join(dfm)
Out[147]:
Name
Chr Position
1 1000 SNPA
2000 SNPB
2 1000 SNPC
2001 NaN
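If you want Chr and Position back as regular columns, to match the desired final table, you can reset the index afterwards; a small sketch:
result = df.join(dfm).reset_index()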
PS: for the bonus question, one possible approach is sketched below.
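A possible way to handle the "within a certain amount" matching is pandas.merge_asof with a tolerance; this is only a rough sketch on the toy frames from the question, and tolerance=1 is an assumed maximum distance:
import pandas as pd

# toy frames from the question (Chr/Position/Name are the question's column names)
dfm = pd.DataFrame({"Chr": [1, 1, 2, 2],
                    "Position": [1000, 2000, 1000, 2000],
                    "Name": ["SNPA", "SNPB", "SNPC", "SNPD"]})
df = pd.DataFrame({"Chr": [1, 1, 2, 2],
                   "Position": [1000, 2000, 1000, 2001]})

# merge_asof matches each row to the nearest Position within `tolerance`,
# while requiring Chr to match exactly via `by`; both frames must be sorted on the key
result = pd.merge_asof(df.sort_values("Position"),
                       dfm.sort_values("Position"),
                       on="Position", by="Chr",
                       direction="nearest", tolerance=1)
print(result.sort_values(["Chr", "Position"]))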
