I am trying to select a subset of values from a column, but the order comes out wrong after selection.
I tried using pandas isin:
df.mon =[1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,...]
# selecting
results = df[df.mon.isin([10,11,12,1,2,3])]
print(results.mon)
mon = [1,2,3,10,11,12, 1,2,3,10,11,12,...]
Desired result:
mon = [10,11,12,1,2,3,10,11,12,1,2,3,...]
# sorting results in this
mon = [1,1,2,2,3,3,10,10,11,11,12,12], and I don't want that either.
Thanks for the help.
I mostly work with basic Python lists, so I converted the DataFrame columns to lists.
Data
The input is an xlsx document with two columns. "Column" runs 1, 2, .. 12, 1, 2, .. 12 (exactly twice), and "Values" starts at 90 and increases by 10 per row all the way to the second 12.
Process
import pandas as pd
df = pd.read_excel('Book1.xlsx')
arr = df['Column'].tolist()
arr2 = df['Values'].tolist()
monthsofint = [10, 11, 12, 1, 2, 3]
locs = []
dictor = {}
for i in range(len(monthsofint)):
    dictor[monthsofint[i]] = []
for i in range(len(monthsofint)): # !! Assumption !!
    for j in range(len(arr)):
        if monthsofint[i] == arr[j]:
            dictor[monthsofint[i]].append(j)
newlist = []
newlist2 = []
for i in range(len(dictor[monthsofint[0]])):
    for j in range(len(monthsofint)):
        newlist.append(arr[dictor[monthsofint[j]][i]])
        newlist2.append(arr2[dictor[monthsofint[j]][i]])
print(newlist)
print(newlist2)
Output: [10, 11, 12, 1, 2, 3, 10, 11, 12, 1, 2, 3] and [180, 190, 200, 90, 100, 110, 300, 310, 320, 210, 220, 230]
Note on Assumption: The assumption made is that there will always be 12 months for every year in the file.
In your case, you can use Categorical + cumcount:
results = df[df.mon.isin([10, 11, 12, 1, 2, 3])].copy()
results.mon = pd.Categorical(results.mon, [10, 11, 12, 1, 2, 3])
# stable sorts keep the original row order within each month
s = results.sort_values('mon', kind='stable')
s = s.iloc[s.groupby('mon').cumcount().argsort(kind='stable')]
s
Out[172]:
    mon
9    10
10   11
11   12
0     1
1     2
2     3
21   10
22   11
23   12
12    1
13    2
14    3
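For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same Categorical + cumcount idea. It assumes a DataFrame with a single mon column covering two full years; the cycle column name is hypothetical and only used for illustration.

```python
import pandas as pd

# Assumed sample data: months 1..12 repeated for two years.
df = pd.DataFrame({"mon": list(range(1, 13)) * 2})
wanted = [10, 11, 12, 1, 2, 3]

results = df[df["mon"].isin(wanted)].copy()
# Number each occurrence of a month: 0 for the first year, 1 for the second.
results["cycle"] = results.groupby("mon").cumcount()
# An ordered categorical makes sort_values follow the custom month order.
results["mon"] = pd.Categorical(results["mon"], categories=wanted, ordered=True)
ordered = results.sort_values(["cycle", "mon"])

print(ordered["mon"].tolist())
print(ordered.index.tolist())
```

Sorting on the occurrence counter first and the categorical second avoids relying on sort stability, which is why this variant uses one two-key sort instead of a sort followed by argsort.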
I think you can collect the matching rows for each category, then use izip_longest (itertools.zip_longest in Python 3) to interleave those lists.
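A rough sketch of that suggestion using plain lists, assuming Python 3 (where the function is called zip_longest) and the two-year sample data from the question; the variable names are only illustrative.

```python
from itertools import zip_longest

# Assumed sample data: months 1..12 repeated for two years.
mon = list(range(1, 13)) * 2
wanted = [10, 11, 12, 1, 2, 3]

# One list of matching positions per wanted month, kept in the wanted order.
positions = {m: [i for i, v in enumerate(mon) if v == m] for m in wanted}

# zip_longest walks the per-month lists in parallel; None pads the shorter lists.
ordered_idx = [
    i
    for group in zip_longest(*(positions[m] for m in wanted))
    for i in group
    if i is not None
]
ordered = [mon[i] for i in ordered_idx]
print(ordered)
```

Unlike plain zip, zip_longest does not drop the tail when one month occurs more often than another, which matters if the data does not cover whole years.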
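So I found a relatively simple way to do it from another source.
For those who might be interested:
df[(df.index > 4) & (df.mon.isin([10, 11, 12, 1, 2, 3]))]
The index condition skips the leading January to March rows, so the selection starts at month 10 and the original row order already matches the desired sequence.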
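The cut-off of 4 is hard-coded for this particular data. A small sketch with plain lists, assuming the season should start at the first occurrence of month 10 (the names start and selected are hypothetical):

```python
# Assumed sample data: months 1..12 repeated for two years.
mon = list(range(1, 13)) * 2
wanted = {10, 11, 12, 1, 2, 3}

# Find where the first season starts instead of hard-coding an index.
start = mon.index(10)
selected = [m for m in mon[start:] if m in wanted]
print(selected)
```

Note that this drops the first January to March and leaves the last season incomplete, which is a different result from regrouping the months within each year.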
Well, I have a numpy array like that:
a=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
My desired output is:
b=['87654321','161514131211109','2423222120191817']
To get there, I first split "a" into arrays of 8 elements, which gives a list like this:
np.split(a, 3) = [array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int8),
                  array([9, 10, 11, 12, 13, 14, 15, 16], dtype=int8),
                  array([17, 18, 19, 20, 21, 22, 23, 24], dtype=int8)]
Then I need to reverse each array and join its numbers into one string, giving a list of joined numbers.
No need for numpy, though it will work for an array as well. One way:
>>> [''.join(str(c) for c in a[x:x+8][::-1]) for x in range(0, len(a), 8)]
['87654321', '161514131211109', '2423222120191817']
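The same one-liner can be wrapped in a small helper so the chunk size is not hard-coded. This is only a sketch; the function name reversed_chunks is made up for illustration.

```python
def reversed_chunks(values, size):
    """Split values into chunks of `size`, reverse each chunk, and join it into a string."""
    return [
        "".join(str(v) for v in reversed(values[start:start + size]))
        for start in range(0, len(values), size)
    ]


a = list(range(1, 25))
print(reversed_chunks(a, 8))
```

If the length of the input is not a multiple of the chunk size, the last chunk is simply shorter.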
Try this. Reshape your data and convert it to string elements, then loop over the rows and append each reversed, joined row to a new list.
import numpy as np
a=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
lst = list(np.array(a).reshape(3,8).astype("U"))
my_lst = []
for i in lst:
    my_lst.append("".join(i[::-1]))
print(my_lst)
The simplest way is first to reverse the original array (or create a reversed copy), and then to split:
import numpy as np
a = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
acopy = a[::-1]
splitted = np.array_split(acopy, 3)
print(splitted[0]) # [24 23 22 21 20 19 18 17]
print(splitted[1]) # [16 15 14 13 12 11 10 9]
print(splitted[2]) # [8 7 6 5 4 3 2 1]
Now that the chunks are reversed, you can join the elements of each one into a string (note that the chunks come out last-first relative to the desired b):
str1 = ''.join(str(x) for x in splitted[0]) # '2423222120191817'
str2 = ''.join(str(x) for x in splitted[1]) # '161514131211109'
str3 = ''.join(str(x) for x in splitted[2]) # '87654321'
I have an input df:
input_ = pd.DataFrame.from_records(
    [
        [1, 10, 11, 31],
        [2, 20, 12, 21],
        [3, 30, 13, 11],
    ],
    columns=['X_val', 'Y_val1', 'Y_val2', 'Y_val3'])
and I want to concatenate all the Y values into one column while still recording which column each value came from, for plotting and analysis.
I have multiple files with a variable number of Y columns. I ended up concatenating them column by column and extending with repeated values, but I was wondering if there is a better solution, because mine is terribly tedious.
expected_output_ = pd.DataFrame.from_records(
    [
        [1, 10, 'Y_val1'],
        [1, 11, 'Y_val2'],
        [1, 31, 'Y_val3'],
        [2, 20, 'Y_val1'],
        [2, 12, 'Y_val2'],
        [2, 21, 'Y_val3'],
        [3, 30, 'Y_val1'],
        [3, 13, 'Y_val2'],
        [3, 11, 'Y_val3'],
    ],
    columns=['X_val', 'Y_val', 'Y_type'])
You can use pandas.DataFrame.melt :
input_.melt(
    id_vars=['X_val'],
    value_vars=['Y_val1', 'Y_val2', 'Y_val3'],
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Alternatively, as suggested by @Vishnudev, you can use the following variation, which is especially handy for a large number of similarly named Y_val* columns:
input_.melt(
    id_vars=['X_val'],
    value_vars=input_.filter(regex='Y_val').columns,
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Output:
   X_val  Y_type  Y_val
0      1  Y_val1     10
1      1  Y_val2     11
2      1  Y_val3     31
3      2  Y_val1     20
4      2  Y_val2     12
5      2  Y_val3     21
6      3  Y_val1     30
7      3  Y_val2     13
8      3  Y_val3     11
Optionally, you can rearrange the column sequence if you like.
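A short sketch of that last step, assuming the same sample data as in the question. It leaves out value_vars (melt then uses every column that is not an id column) and asks for a stable sort so the Y_val1, Y_val2, Y_val3 order within each X_val does not depend on the sort algorithm; the name long_df is just for illustration.

```python
import pandas as pd

input_ = pd.DataFrame(
    [[1, 10, 11, 31], [2, 20, 12, 21], [3, 30, 13, 11]],
    columns=["X_val", "Y_val1", "Y_val2", "Y_val3"],
)

long_df = (
    input_.melt(id_vars=["X_val"], var_name="Y_type", value_name="Y_val")
    # A stable sort keeps Y_val1, Y_val2, Y_val3 in order within each X_val.
    .sort_values("X_val", kind="stable", ignore_index=True)
)
# Reorder the columns to match the layout asked for in the question.
long_df = long_df[["X_val", "Y_val", "Y_type"]]
print(long_df)
```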
How do I output a list which counts and displays the number of times different values fit into a range?
Based on the below example, the output would be x = [0, 3, 2, 1, 0] as there are 3 Pro scores (11, 24, 44), 2 Champion scores (101, 888), and 1 King score (1234).
- P1 = 11
- P2 = 24
- P3 = 44
- P4 = 101
- P5 = 1234
- P6 = 888
totalsales = [11, 24, 44, 101, 1234, 888]
Here is the ranking corresponding to the sales:
Sales              Ranking
0-10               Noob
11-100             Pro
101-1000           Champion
1001-10000         King
10001-200000       Lord
This is one way, assuming your values are integers and ranges do not overlap.
from collections import Counter
# Ranges go to end + 1
score_ranges = [
    range(0, 11),         # Noob
    range(11, 101),       # Pro
    range(101, 1001),     # Champion
    range(1001, 10001),   # King
    range(10001, 200001)  # Lord
]
total_sales = [11, 24, 44, 101, 1234, 888]
# This counter counts how many values fall into each score range (by index).
# It works by taking the index of the first range containing each value (or -1 if none found).
c = Counter(next((i for i, r in enumerate(score_ranges) if s in r), -1) for s in total_sales)
# This converts the above counter into a list, taking the count for each index.
result = [c[i] for i in range(len(score_ranges))]
print(result)
# [0, 3, 2, 1, 0]
As a general rule, homework should not be posted on Stack Overflow. As such, here is just a pointer on how to solve this; the implementation is up to you.
Iterate over the totalsales list and check whether each number is in range(start, stop). For each match, increment the counter for that category in your result list (although a dict might be a more apt way to store the result).
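For readers who just want to see the shape of that loop, here is a minimal sketch. It assumes the rank boundaries from the table in the question (with the Lord range starting at 10001) and uses a dict keyed by rank name; the names ranks and counts are illustrative.

```python
totalsales = [11, 24, 44, 101, 1234, 888]

# Upper bounds in the table are inclusive, so each range stops one past them.
ranks = {
    "Noob": range(0, 11),
    "Pro": range(11, 101),
    "Champion": range(101, 1001),
    "King": range(1001, 10001),
    "Lord": range(10001, 200001),
}

counts = {name: 0 for name in ranks}
for sale in totalsales:
    for name, bounds in ranks.items():
        if sale in bounds:
            counts[name] += 1
            break  # ranges do not overlap, so stop at the first match

print(counts)
print(list(counts.values()))
```

Because dicts keep insertion order, list(counts.values()) gives the counts in the same order as the ranking table.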
Here is a possible solution that uses no modules such as numpy or collections:
totalsales = [11, 24, 44, 101, 1234, 888]
bins = [10, 100, 1000, 10000, 200000]
output = [0]*len(bins)
for s in totalsales:
    slot = next(i for i, x in enumerate(bins) if s <= x)
    output[slot] += 1
output
>>> [0, 3, 2, 1, 0]
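If importing from the standard library is acceptable, the linear scan over the bins can be swapped for a binary search. This is a sketch of that alternative, not the approach above; it assumes the bins are sorted inclusive upper bounds, with 200000 as the top of the Lord range.

```python
from bisect import bisect_left

totalsales = [11, 24, 44, 101, 1234, 888]
bins = [10, 100, 1000, 10000, 200000]  # inclusive upper bound of each rank

output = [0] * len(bins)
for s in totalsales:
    # bisect_left returns the index of the first bound that is >= s.
    slot = bisect_left(bins, s)
    if slot < len(bins):  # skip values above the last bound
        output[slot] += 1

print(output)
```

The explicit bounds check means a sale above the last bin is skipped rather than raising an error.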
If your sales-to-ranking mapping always follows a logarithmic curve, the desired output can be calculated in linear time using math.log10 with collections.Counter. Use an offset of 0.5 and the abs function to handle sales of 0 and 1:
from collections import Counter
from math import log10
counts = Counter(int(abs(log10(abs(s - .5)))) for s in totalsales)
[counts.get(i, 0) for i in range(5)]
This returns:
[0, 3, 2, 1, 0]
Here, I have used a DataFrame to store the values, then bins and pd.cut to group the values into the right categories, and finally extracted the value counts into a list.
Let me know if it is okay.
import pandas as pd
df = pd.DataFrame([11, 24, 44, 101, 1234, 888], columns=['P'])  # Create dataframe
bins = [0, 10, 100, 1000, 10000, 200000]
labels = ['Noob','Pro', 'Champion', 'King', 'Lord']
df['range'] = pd.cut(df.P, bins, labels = labels)
df
outputs:
      P     range
0    11       Pro
1    24       Pro
2    44       Pro
3   101  Champion
4  1234      King
5   888  Champion
Finally, to get the value counts, use:
my = df['range'].value_counts().sort_index()  # counts the number of occurrences per category
output = list(map(int, my.tolist()))  # we want the output to be a list of integers
output
The result:
[0, 3, 2, 1, 0]
You can use collections.Counter and a dict:
from collections import Counter
totalsales = [11, 24, 44, 101, 1234, 888]
ranking = {
    0: 'noob',
    10: 'pro',
    100: 'champion',
    1000: 'king',
    10000: 'lord'
}
c = Counter()
for sale in totalsales:
    for k in sorted(ranking.keys(), reverse=True):
        if sale > k:
            c[ranking[k]] += 1
            break
Or as a two-liner (credits to @jdehesa for the idea):
thresholds = sorted(ranking.keys(), reverse=True)
c = Counter(next((ranking[t] for t in thresholds if s > t)) for s in totalsales)
I have a large DataFrame with the following columns:
import pandas as pd
f = pd.DataFrame(columns=['month', 'family_id', 'house_value'])
Months go from 0 to 239, family_ids go up to 10900, and house values vary, so the DataFrame has more than two and a half million rows.
I want to keep only the families whose final house value differs from their initial one.
Some sample data would look like this:
f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})
And from that sample, the resulting dataframe would be:
g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})
So I thought of code along these lines:
ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]
Also tried this:
g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]
The two attempts above raise KeyError: False and ValueError, respectively. Any ideas? Thanks.
Use groupby.filter:
(f.sort_values('month')
  .groupby('family_id')
  .filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))
#   family_id  house_value  month
#1          1           10      0
#6          1           11      1
#8          1           11    239
As commented by @Bharath, your approach errors out because a boolean filter expects the boolean Series to have the same length as the original DataFrame, which is not the case in either of your attempts because of the filtering applied before the comparison.
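On a frame with millions of rows, calling a Python lambda once per family can be slow. A possible alternative, sketched here on the sample data from the question, builds a full-length boolean mask with groupby.transform; whether it is actually faster on the real data is an assumption worth measuring.

```python
import pandas as pd

f = pd.DataFrame({
    "month": [0, 0, 0, 0, 0, 1, 1, 239, 239],
    "family_id": [0, 1, 2, 3, 4, 0, 1, 0, 1],
    "house_value": [10, 10, 5, 7, 8, 10, 11, 10, 11],
})

f_sorted = f.sort_values("month", kind="stable")
values = f_sorted.groupby("family_id")["house_value"]
# transform broadcasts each family's first/last value back onto its rows,
# so the mask has the same length and index as f_sorted.
changed = values.transform("first") != values.transform("last")
g = f_sorted[changed]
print(g)
```

Because the mask lines up row for row with the frame, it avoids the length mismatch that broke the original attempts.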
I need to create a set of matrices from the file below, the lines/rows with the same value of Z will go in a matrix together.
Below is a shortened version of my txt file:
X Y Z
-1 10 0
1 20 5
2 15 10
2 50 10
2 90 10
3 15 11
4 50 11
5 90 11
6 13 14
7 50 14
8 70 14
8 95 14
8 75 14
So for example my first matrix will be
[-1, 10, 0],
my second one will be
[1, 20, 5],
my third will be
([2, 15, 10],
[2, 50, 10],
[2, 90, 10]) etc
I've looked at a few questions related to this but nothing seems to be quite right.
I started by making each column an array. I was thinking a for loop might work well. So far I have
f = open("data.txt", "r")
header1 = f.readline()
for line in f:
line = line.strip()
columns = line.split()
x = columns[0]
y = columns[1]
z = columns[2]
i = line in f
z.old = line(i-1,4)
i=1
for line in f:
f.readline(i)
if z(0) == [i,3]:
line(i) = matrix[i,:]
else z(0) != [i,3]:
store line(i) as M
continue
i = i+1
However, I'm getting 'invalid syntax' for the line
else z(0) != line(4):
By this else clause, I mean that if z(0)/(z initial) is not equal to line(4) then this line will then get stored as the first line of the next matrix we will check under this code.
However, I'm not sure how well this would work.
Any help would be greatly appreciated!
The following should work for your data; it assumes the columns in your text file are tab delimited:
import csv
import itertools
import operator
with open('input.txt', newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter='\t')
    headers = next(csv_input)
    row_number = 1
    # groupby merges consecutive rows that share the same first column
    for k, g in itertools.groupby(csv_input, key=operator.itemgetter(0)):
        row = []
        for entry in g:
            entry = [float(e) for e in entry]
            row.append([row_number] + entry)
            row_number += 1
        print(row)
This would print output of the following form (the values are actually printed as floats, e.g. -1.0):
[[1, -1, 10, 0]]
[[2, 1, 20, 5]]
[[3, 2, 15, 10], [4, 2, 50, 10], [5, 2, 90, 10]]
[[6, 3, 15, 11]]
[[7, 4, 50, 11]]
[[8, 5, 90, 11]]
[[9, 6, 13, 14]]
[[10, 7, 50, 14]]
[[11, 8, 70, 14], [12, 8, 95, 14], [13, 8, 75, 14]]
If your CSV file is exactly as you have it shown, i.e. with spaces separating the columns, then you will need to change the csv.reader line as follows:
csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)
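Note that the groupby call above keys on the first column. If the goal is one matrix per distinct Z value, as the question describes, the key would be the third column instead. A minimal sketch under that assumption, using an inline copy of the first few rows of the sample data rather than a file:

```python
from itertools import groupby
from operator import itemgetter

# First rows of the sample from the question (X, Y, Z); in practice these
# lines would be read from the text file.
text = """X Y Z
-1 10 0
1 20 5
2 15 10
2 50 10
2 90 10
3 15 11
4 50 11
5 90 11
"""

lines = text.splitlines()[1:]  # skip the header line
rows = [[int(v) for v in line.split()] for line in lines if line.strip()]

# groupby only merges adjacent rows, so this assumes the file is ordered by Z.
matrices = [list(group) for _, group in groupby(rows, key=itemgetter(2))]
for m in matrices:
    print(m)
```

If the rows are not already ordered by Z, sorting them with the same key before grouping gives one matrix per Z value.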
The following, much simpler code, will do what you want:
import numpy as np
# Load the file using numpy (skip the first row which contains the header)
foo = np.loadtxt("/path/to/your/data-file", skiprows=1)
# Prepend a column with the row number
first_col = np.arange(foo.shape[0]) + 1 # +1 because we don't want to start with 0
bar = np.hstack((first_col[:, None], foo))
You can now access the single lines via bar[0], bar[1], ...