Subset a Python dataframe by conditions

I am trying to select the name rows with count > 250, which are called effective here, and then find the mean of their ratings.
t3=dfnew.groupby('name')['ratings']
t4=t3.count()
t5=t4[t4.values>250]
t6=t3.mean()
t6[(t6.index==t5.index)]
Obviously the problem is in the last line of my code, where I want to match t6's index with t5's index: if they match, keep the row; otherwise leave it out. It is kind of like an inner join in SQL.
What should I do to modify the last line?
Suppose dataframe like this
input:
name ratings
A 1
A 2
:
A 251
B 1
B 2
:
B 230
so the intended result should be 126 ((1+251)/2)
Output
A 126

t3=dfnew.groupby('name')['ratings'].agg(['count','mean'])
t5=t3[t3['count']>250]
t5
It works fine when I aggregate two functions at the same time.
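If you want to keep the original two-step approach, a minimal sketch of the fix (reusing the question's t5 and t6) is to align on the index rather than compare it elementwise, which behaves like the inner join you describe:
# instead of t6[(t6.index == t5.index)], keep only the names that survive the count filter
result = t6[t6.index.isin(t5.index)]
# or, equivalently, look the filtered names up directly
result = t6.loc[t5.index]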

Related

I am trying to apply a defined function to a grouped pandas dataframe and output the results to a CSV

I have a defined function that requires a list, and the function outputs one value for every item in the list. I need to group by industry code (SIC), and apply the function within industry (so only industry 1 firms are grouped together for the defined calculation).
example:
1 50
1 40
2 100
2 110
I do the following code:
dr=pd.read_csv("sample.csv", usecols=columns)
d1=dr.groupby('SIC')['value'].apply(list)
for groups in d1:
    a = my_function(groups)
    b = pd.DataFrame(a)
    b.to_csv('output.csv', index=False)
I expected to get an output file with the function values for all 4 rows (let's say I want the difference between that row and the group average; row 1 should be 50 - avg(50, 40), which equals 5).
Instead, I get a CSV file with only the last group's values. It seems like I should make a new CSV file for each group, but with the apply(list) I can't figure out how to identify each group.
Edit: I modified the functionality as described in the comment below to output only one file.
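A minimal sketch of one way to write a single file (assuming my_function returns one value per item in the list it is given; the frames name is just illustrative) is to collect each group's output and concatenate before writing:
import pandas as pd

dr = pd.read_csv("sample.csv", usecols=columns)
d1 = dr.groupby('SIC')['value'].apply(list)

frames = []
for sic, values in d1.items():   # d1 is indexed by SIC, so each group is identified here
    frames.append(pd.DataFrame({'SIC': sic, 'result': my_function(values)}))

pd.concat(frames).to_csv('output.csv', index=False)   # one file covering all groups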

Ordering by multiple columns including Count in PySpark

I am currently working on understanding PySpark and am running into a problem: I am trying to work out how to order a dataframe by multiple columns when one of them is a count.
As an example, say I have a dataframe (df) with three columns, A, B, and C. I want to group by A and B, and then count these instances. So if there are 10 instances where A=1 and B=1, the table for that row should look like:
A|B|Count
1 1 10
I have determined that I can do this fairly easily by running:
df.groupBy('A', 'B').count()
Then if I want to order this dataframe by count (descending), this is also pretty straightforward:
df.groupBy('A', 'B').count().orderBy(desc("count"))
This next step is where I am having trouble. What if I now also want to order by column C, i.e. order first by count, and then by C? I had thought that the syntax would be something akin to:
df.groupBy('A', 'B').count().orderBy(desc("count"), desc("C"))
But this does not work, presumably because once I run count(), the dataframe is limited to only the columns A, B, and count. Do I need to somehow create a new column in the original dataframe with the count column, and if so, how can I do this?
Is there another simpler way that I am missing to order by both count and C?
For clarity an example dataframe that I would like to end with could appear as:
A|B|Count|C
1 1 10 5
1 2 9 3
1 5 9 1
2 4 8 10
2 7 8 5
Any insights or guidance are greatly appreciated.
Try using a window function. The column 'C' is not in the group by, hence it is not available for ordering/sorting. If you just want the grouped columns (e.g. A, B) and the count column, you can always use a select statement to get just that after the window function.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
windowSpec = Window.partitionBy("A", "B")
df.withColumn('count', F.count('*').over(windowSpec)).select("A", "B", "count", "C") \
  .distinct().orderBy(F.col('count').desc(), F.col('C').desc()).show()
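Alternatively, a rough sketch of the join-back approach the question mentions (assuming the same df with columns A, B and C) is to compute the counts with groupBy and join them back on A and B, so every original column stays available for ordering:
from pyspark.sql.functions import desc

counts = df.groupBy('A', 'B').count()              # one count per (A, B) pair
df.join(counts, on=['A', 'B']).orderBy(desc('count'), desc('C')).show()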

Group by ids, sort by date and get values as list on big data python

I have big data (30 million rows).
The table has id, date, value.
I need to go over each id and, per id, get a list of values sorted by date, so that the first value in the list is the one with the oldest date.
Example:
ID DATE VALUE
1 02/03/2020 300
1 04/03/2020 200
2 04/03/2020 456
2 01/03/2020 300
2 05/03/2020 78
Desired table:
ID VALUE_LIST_ORDERED
1 [300,200]
2 [300,456,78]
I can do it with a for loop or with apply, but it is not efficient, and with millions of users it is not feasible.
I thought about using groupby and sorting by date, but I don't know how to build the list, and whether groupby on a pandas df is the best way to do it.
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
What you need to do is order your data using pandas.DataFrame.sort_values and then apply the groupby method.
I don't have huge data set to test this code on, but I believe this would do the trick:
import numpy as np

# make sure DATE is a real datetime (e.g. via pd.to_datetime) so the sort is chronological
sorted_data = data.sort_values('DATE')
result = sorted_data.groupby('ID').VALUE.apply(np.array)
and since it's Python you can always put everything in one statement:
print(data.sort_values('DATE').groupby('ID').VALUE.apply(np.array))
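As a quick check with the question's sample data (a sketch; pd.to_datetime with dayfirst=True assumes day-first dates, and agg(list) gives plain Python lists as in the desired table):
import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                     'DATE': ['02/03/2020', '04/03/2020', '04/03/2020', '01/03/2020', '05/03/2020'],
                     'VALUE': [300, 200, 456, 300, 78]})
data['DATE'] = pd.to_datetime(data['DATE'], dayfirst=True)
print(data.sort_values('DATE').groupby('ID')['VALUE'].agg(list))
# ID
# 1        [300, 200]
# 2    [300, 456, 78]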

Fetch previous rows based on if condition and Shift function - Python dataframe

I have data as shown below. I would like to select rows based on two conditions.
1) rows that start with digits (1, 2, 3, etc.)
2) the previous row of each record that satisfies the 1st condition
Please find below how the input data looks.
Please find below how I expect the output to be.
I tried using the shift(-1) function, but it seems to be throwing an error. I am sure I messed up the logic/syntax. Please find below the code that I tried:
# I get the index of all records that start with a number
s = df1.loc[df1['VARIABLE'].str.contains('^\d') == True].index
# now I need to get the previous record of each group, but this is incorrect
df1.loc[(df1['VARIABLE'].shift(-1).str.contains('^\d') == False) &
        (df1['VARIABLE'].str.contains('^\d') == True)].index
Use:
df1 = pd.DataFrame({'VARIABLE':['studyid',np.nan,'age_interview','Gender','1.Male',
'2.Female',np.nan, 'dob', 'eth',
'Ethnicity','1.Chinese','2.Indian','3.Malay']})
#first remove missing rows by column VARIABLE
df1 = df1.dropna(subset=['VARIABLE'])
#test for values starting with a number
s = (df1['VARIABLE'].str.contains('^\d'))
#chain shifted values by | for OR
mask = s | s.shift(-1)
#filtering by boolean indexing
df1 = df1[mask]
print (df1)
VARIABLE
3 Gender
4 1.Male
5 2.Female
9 Ethnicity
10 1.Chinese
11 2.Indian
12 3.Malay

sqlite, filter rows with dynamic number of keys, but only if they have the same value in a specific column?

I am brand new to sqlite (and databases in general). I have done a ton of reading both here and elsewhere and am unable to find this specific problem. People tend to want counts, or duplicates. I need to filter.
I have a database with 3 columns (and a few hundred thousand entries)
column1 column2 column3
abc 123 ##$
egf 456 $%#
abc 321 !##
kop 123 &$%
pok 321 ^$#
and so on.
What I am trying to do is this. I need to retrieve all possible combinations of a list. For example
[123, 321]
all possible combos would be
[123],[321],[123,321]
I do not know what the input may be; it can be more than 2 strings, so the combinations list can grow pretty fast. For single entries above, like 123 or 321, it works out of the gate; the thing I am trying to get to work is more than one value in a list.
So I am dynamically generating the select statement
sqlquery = "SELECT fileloc, frequency FROM words WHERE word=?"
while numOfVariables < len(list):
    sqlquery += " or word=?"
    numOfVariables += 1
This generates the query, then I execute it with
cursor.execute(sqlquery,tuple(list))
Which works. It finds me all rows with any of those combinations.
Now I need one more thing, I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
So in the above example it would select rows 1 and 3, because their column2 has the values I am interested in and their column1 is the same. But row 4 would not be selected even though it has a value we want, because its column1 does not match 321's column1. Same thing for row 5: even though it has one of the values we need, its column1 doesn't match 123's.
From the things I've been able to find, people compare against a specific value by using GROUP BY, but in my case I do not know what that value may be. All I care about is whether it is the same between the rows or not.
I am sorry if my explanation is not clear. I have never used SQL before this week, so I don't know all the technical terms.
But basically I need the functionality of (pseudo code):
if (column2 is 123 or 321) and 123.column1 == 321.column1:
    count
else:
    dont count
I have a feeling this can be done by first moving whatever matches 123 or 321 into a new table, then going through that table and only keeping records that have both 123 and 321 with the same column1 value. But I am not sure how to do this or whether it is the proper approach, because this thing is going to scale pretty quickly: if there are 5 inputs, then rows are kept only if there is one row to account for each input and all of their column1 values are the same (so rows would be saved in sets of 5).
Thank you.
(I am using Python 2.7.15)
You wrote:
"I need to retrieve all possible combinations of a list"
"Now I need one more thing, I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
Use self-join for this purpose:
SELECT W1.column2, W2.column2
FROM words W1
JOIN words W2 ON W1.column1 = W2.column1
Correct me if I am missing something in your question, but these three lines should be sufficient.
Python looks irrelevant to your question; it could be solved in pure SQL.
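If the list can keep growing, a rough sketch of a GROUP BY / HAVING variant (using the column1/column2 names from the example rows and assuming they live in the words table the question queries; values and placeholders are just illustrative names, and cursor is the question's cursor) is to keep only the column1 groups that contain every requested value:
values = [123, 321]                                   # the dynamic list of values to match
placeholders = ",".join("?" * len(values))
sqlquery = ("SELECT * FROM words "
            "WHERE column2 IN (%s) "
            "AND column1 IN (SELECT column1 FROM words WHERE column2 IN (%s) "
            "GROUP BY column1 HAVING COUNT(DISTINCT column2) = ?)") % (placeholders, placeholders)
cursor.execute(sqlquery, tuple(values) + tuple(values) + (len(values),))
The inner query finds every column1 whose group contains all of the requested values, and the outer query returns only those matching rows, so it scales with the length of the list without building a temporary table.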
