How to append from a dataframe to a list?

I have a dataframe called pop_emoj that has two columns (one for the emoji, and one for the emoji count) as seen below.
☕ 585
🌭 193
🌮 186
🌯 85
🌰 53
🌶 124
🌽 138
🍄 46
🍅 170
🍆 506
I have sorted the df based on the counts in descending order as seen below.
emoji_updated = pop_emoj.head(105).sort_values(ascending=False)
🍻 1809
🎂 1481
🍔 1382
🍾 1078
🥂 1028
And I'm trying to append the top n emojis to a new list called top_list, but I'm getting stuck. Here is my code so far.
def top_number_of_emojis(n):
    top_list = []
    top_list = emoji_updated[0].tolist()
    return top_list
I want to take all of the first column (the emojis) and append them to the list returned by top_number_of_emojis. The output should look like this:
top_number_of_emojis(1) == ['🍻']
top_number_of_emojis(2) == ['🍻', '🎂']
top_number_of_emojis(3) == ['🍻', '🎂', '🍔']

If you already got the top 5 emojis, you just need to save them to a list.
An option for that is iterrows. Note that iterrows yields (index, row) pairs, so you need to pull the emoji out of each row rather than appending the whole row:
top_list = []
for idx, row in emoji_updated.iterrows():
    top_list.append(row.iloc[0])  # the first column holds the emoji
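A loop isn't strictly needed, though. A minimal sketch that returns the top n directly (this assumes the emojis are in the first column and the frame is already sorted by count, as shown above):

def top_number_of_emojis(n):
    # take the first n rows of the emoji column and convert to a list
    return emoji_updated.iloc[:n, 0].tolist()

top_number_of_emojis(3)  # ['🍻', '🎂', '🍔']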


How to make a function dynamic and accept new parameters in python?

Following up on my previous question: I have a list of records, shown below, taken from this table:
itemImage    name    nameFontSize  nameW  nameH  conutry  countryFont  countryW  countryH  code  codeFontSize  codeW  codeH
sample.jpg   Apple   142           1200   200    US       132          1200      400       1564  82            1300   600
sample2.jpg  Orange  142           1200   200    UK       132          1200      400       1562  82            1300   600
sample3.jpg  Lemon   142           1200   200    FR       132          1200      400       1563  82            1300   600
Right now, I have one function, setText, which takes all the elements of a row from this table.
I only have name, country and code for now, but I will be adding other fields in the future.
I want to make this code more future-proof and dynamic. For example, if I added four new columns to my data following the same pattern, how do I make the Python code adjust to that automatically, instead of me declaring new variables in the code every time?
Basically, I want to send each group of four columns, starting from name, to a function, and continue until no columns are left. Once that's done, go to the next row and continue the loop.
Thanks to @Samwise, who helped me clean up the code a bit.
import os
from PIL import Image, ImageFont, ImageDraw
import pandas as pd

path = './'
files = []
for (dirpath, dirnames, filenames) in os.walk(path):
    files.extend(filenames)

df = pd.read_excel(r'./data.xlsx')
records = list(df.to_records(index=False))

def setText(itemImage, name, nameFontSize, nameW, nameH,
            conutry, countryFontSize, countryW, countryH,
            code, codeFontSize, codeW, codeH):
    font1 = ImageFont.truetype(r'./font.ttf', nameFontSize)
    font2 = ImageFont.truetype(r'./font.ttf', countryFontSize)
    font3 = ImageFont.truetype(r'./font.ttf', codeFontSize)
    file = Image.open(f"./{itemImage}")
    draw = ImageDraw.Draw(file)
    draw.text((nameW, nameH), name, font=font1, fill='#ff0000',
              align="right", anchor="rm")
    draw.text((countryW, countryH), conutry, font=font2, fill='#ff0000',
              align="right", anchor="rm")
    draw.text((codeW, codeH), str(code), font=font3, fill='#ff0000',
              align="right", anchor="rm")
    file.save(f'done {itemImage}')

for i in records:
    setText(*i)
Sounds like df.columns might help. It returns an Index of column labels, which you can iterate through for whatever cols are present:
for col in df.columns:
    ...
The answers in this thread should help dial you in:
How to iterate over columns of pandas dataframe to run regression
It sounds like you also want row-wise results, so you could nest this within df.iterrows or vice versa... though going cell by cell is generally not desirable and could end up being quite slow as your df grows.
So perhaps think about how you could use your function with df.apply().
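To make the row processing adjust to new columns automatically, here is a sketch of one approach (my own, not from the original answer; it assumes the layout shown above, where every field after itemImage comes as a group of four columns: value, fontSize, W, H):

import pandas as pd
from PIL import Image, ImageFont, ImageDraw

def draw_field(draw, value, font_size, w, h):
    # one font per field, sized from the table
    font = ImageFont.truetype('./font.ttf', int(font_size))
    draw.text((w, h), str(value), font=font, fill='#ff0000',
              align='right', anchor='rm')

df = pd.read_excel('./data.xlsx')
for _, row in df.iterrows():
    img = Image.open(f"./{row['itemImage']}")
    draw = ImageDraw.Draw(img)
    cols = df.columns[1:]             # everything after itemImage
    for i in range(0, len(cols), 4):  # walk the columns four at a time
        draw_field(draw, *row[cols[i:i + 4]])
    img.save(f"done {row['itemImage']}")

With this, adding four more columns in the same pattern requires no new variables in the code.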

Python: calculate values in column only in rows with specific value in other column

I am desperately trying to solve this issue:
I have a CSV with information on well-core data with different columns, among them one column with IDs and two with X and Y coordinates. I have now been told by the data supplier that some of the well cores (= rows) have wrong Y coordinates: the value should be, for example, -1400 instead of 1400.
I am now trying to write a script to automatically change the Y-values in the affected rows (by multiplying by -1), but nothing has worked:
ges = pd.read_csv(r"C:\A....csv")
bk = [26740001, 26740002, 26740003]  # list of IDs that should be changed
for x in bk:
    for line in ges:
        np.where(ges.query('ID== {}'.format(x)), ges.Y=ges.Y*-1, ges['Y'])
I have also tried it like this:
for line in ges:
    if ges.ID.values == bk:
        ges.Y = ges.Y*-1
    else:
        pass
or like this:
ges.loc[(ges.ID == bk), 'Y']=*-1
or:
ges.loc[ges['ID'].isin(bk), ges['Y']] = *-1
or:
ges.loc[ges['ID'].isin(bk), ges['Y']] = ges['Y']*-1
I am very grateful for every help!
edit:
I am sorry, this is my first post. To make it clearer, my data looks like the table below, except with all Y-values positive. Now I was informed that the Y-values of IDs 2, 3 and 6 are wrong and should be negative values, so my desired output is the following:
ID X Y other column other column
1 3459 1245 information information
2 4541 -1256 information information
3 2378 -2353 information information
4 6947 874 information information
5 2349 2351 information information
6 2347 -746 information information
I hope it is clear now. Thanks.
Try the following:
ids = [26740001, 26740002, 26740003]
for number_id in ids:
    idx = ges['ID'] == number_id
    ges.loc[idx, 'Y'] *= -1
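The loop can also be collapsed into one vectorized assignment with isin, which is essentially what the isin attempts in the question were reaching for (the column selector just needs to be the label 'Y', not ges['Y']):

ges.loc[ges['ID'].isin(ids), 'Y'] *= -1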

Create Multiple dataframes from a large text file

Using Python, how do I break a text file into dataframes where every 84 rows is a new, separate dataframe? The first column, x_ft, holds the same value for 84 rows and then increments by 5 ft for the next 84 rows. Each identical x_ft value and the corresponding values in the other two columns (depth_ft and vel_ft_s) need to end up in the new dataframe too.
My text file is formatted like this:
x_ft depth_ft vel_ft_s
0 270 3535.755 551.735107
1 270 3534.555 551.735107
2 270 3533.355 551.735107
3 270 3532.155 551.735107
4 270 3530.955 551.735107
.
.
33848 2280 3471.334 1093.897339
33849 2280 3470.134 1102.685547
33850 2280 3468.934 1113.144287
33851 2280 3467.734 1123.937134
I have tried many, many different ways but keep running into errors and would really appreciate some help.
I suggest looking into pandas.read_table, which automatically outputs a DataFrame. Once you have done so, you can isolate the rows of the DataFrame that you are looking to separate (every 84 rows) by doing something like this:
import pandas as pd

# Read the whitespace-delimited text file with pandas
# (the sep value is an assumption based on the sample shown above)
df = pd.read_table("data.txt", sep=r"\s+")

# This gives you a list of all x_ft values in your dataset (270, 275, ..., 2280)
arr = []
for x in range(0, 403):
    val = 270 + 5*x
    arr.append(val)

# This generates a CSV file for every group of rows sharing one x_ft value,
# with its corresponding columns (depth_ft and vel_ft_s)
for x_value in arr:
    tempdf = df[df['x_ft'] == x_value]
    tempdf.to_csv("df" + str(x_value) + ".csv")
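If one file per x_ft value is the goal, groupby avoids building the value list by hand; a minimal sketch under the same assumptions:

for x_value, tempdf in df.groupby('x_ft'):
    tempdf.to_csv(f"df{x_value}.csv")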
You can get indexes to split your data:
rows = 84
datasets = round(len(data)/rows)  # total number of datasets
index_list = []
for index in data.index:
    x = index % rows
    if x == 0:
        index_list.append(index)
print(index_list)
So, split the original dataset by those indexes:
l_mod = index_list + [max(index_list) + rows]  # append the end boundary of the last chunk
dfs_list = [data.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
print(len(dfs_list))
Outputs
print(type(dfs_list[1]))
# pandas.core.frame.DataFrame
print(len(dfs_list[0]))
# 84
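The boundary bookkeeping can also be skipped entirely by slicing in fixed steps; a one-line alternative (iloc is positional, so this works regardless of the index):

dfs_list = [data.iloc[i:i + 84] for i in range(0, len(data), 84)]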

Pandas dataframe query not finding value

I have an issue where I can't understand why the value I'm searching for doesn't appear to exist in the dataframe.
I started by combining a couple of numerical columns in my original dataframe and assigning the result to a new column. Then I extracted a list of unique values from that combined column:
~snip
self.df['Flitch'] = self.df['Blast'].map(
    str) + "-" + self.df['Bench'].map(str)
self.flitches = self.df['Flitch'].unique()
~snip
Now, slightly further in the code, I need to get the earliest date values corresponding to these unique identifiers, so I run a query on the dataframe:
~snip
def get_dates(self):
    '''Extracts mining and loading dates from filtered dataframe'''
    loading_date, mining_date = [], []
    # loop through all unique flitch ids and get their mining
    # and loading dates
    for flitch in self.flitches:
        temp = self.df.query('Activity=="MINE"')
        temp = temp.query(f'Flitch=={flitch}')
        mining = temp['PeriodStartDate'].min()
        mining_date.append(mining)
~snip
...and I get nothing. I can't understand why: I'm comparing data extracted from the column to that same column and not getting any matches.
I've manually checked that the list of unique ids is populated correctly.
I've checked that the dataframe I'm running the query on does indeed have those same flitch ids.
I've manually checked several random values from the self.flitches list and membership comes back as False every time.
Before I combined those two columns and used only 'Blast' as the identifier, everything worked perfectly, but now I'm not sure what is happening.
Here for example I've printed the self.flitches list:
['5252-528' '5251-528' '3030-492' '8235-516' '7252-488' '7251-488'
'2351-588' '5436-588' '1130-624' '5233-468' '1790-516' '6301-552'
'6302-552' '5444-576' '2377-564' '2380-552' '2375-564' '5253-528'
'2040-468' '2378-564' '1132-624' '1131-624' '6314-540' '7254-488'
'7253-488' '8141-480' '7250-488']
And here is data from self.df['Flitch'] column:
173 5252-528
174 5251-528
175 5251-528
176 5251-528
177 5251-528
178 5251-528
180 3030-492
181 3030-492
182 3030-492
183 3030-492
...
It looks like they have to match but they don't...
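One likely explanation (my diagnosis, not confirmed in the thread): in temp.query(f'Flitch=={flitch}'), the value is interpolated without quotes, so a flitch id like 5252-528 is parsed by query as the arithmetic expression 5252 minus 528, and the column is compared against the integer 4724, which never matches the strings in 'Flitch'. Quoting the value inside the expression treats it as a string:

temp = temp.query(f'Flitch == "{flitch}"')
# or, sidestepping query altogether:
temp = temp[temp['Flitch'] == flitch]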

Pandas: Query tool doesn't work if column headers are tuples: TypeError: argument of type 'int' is not iterable

TLDR: The df.query() tool doesn't seem to work if the df's columns are tuples or even tuples converted into strings. How can I work around this to get the slice I'm aiming for?
Long Version: I have a pandas dataframe that looks like this (although there are a lot more columns and rows...):
> dosage_df
Score ("A_dose","Super") ("A_dose","Light") ("B_dose","Regular")
28 1 40 130
11 2 40 130
72 3 40 130
67 1 90 130
74 2 90 130
89 3 90 130
43 1 40 700
61 2 40 700
5 3 40 700
Along with my dataframe, I also have a Python dictionary with the relevant ranges for each feature. The keys are the feature names, and the values are the lists of values each feature can take:
# Original Version
dosage_df.columns = ['First Score', 'Last Score', ("A_dose","Super"), ("A_dose","Light"), ("B_dose","Regular")]
dict_of_dose_ranges = {("A_dose","Super"): [1, 2, 3],
                       ("A_dose","Light"): [40, 70, 90],
                       ("B_dose","Regular"): [130, 200, 500, 700]}
For my purposes, I need to generate a particular combination (say ("A_dose","Super") = 1, ("A_dose","Light") = 90 and ("B_dose","Regular") = 700), take the relevant slice out of my dataframe based on those settings, do the relevant calculations on that smaller subset, and save the results somewhere.
I'm doing this by implementing the following:
from itertools import product

for dosage_comb in product(*dict_of_dose_ranges.values()):
    dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
    query_str = ' & '.join('{} == {}'.format(*x) for x in dosage_items)
    sub_df = dosage_df.query(query_str)
    ...
The problem is that is gets hung up on the query step, as it returns the following error message:
TypeError: argument of type 'int' is not iterable
In this case, the generated query looks like this:
query_str = '("A_dose","Light") == 40 & ("A_dose","Super") == 1 & ("B_dose","Regular") == 130'
Troubleshooting Attempts:
I've confirmed that this solution does work for a dataframe with plain string columns, as found here. In addition, I've also tried "tricking" the tool by converting the columns and the dictionary keys into strings with the following code... but that returned the same error.
# String Version
dosage_df.columns = ['First Score', 'Last Score', '("A_dose","Super")', '("A_dose","Light")', '("B_dose","Regular")']
dict_of_dose_ranges = {
    '("A_dose","Super")': [1, 2, 3],
    '("A_dose","Light")': [40, 70, 90],
    '("B_dose","Regular")': [130, 200, 500, 700]}
Is there an alternate tool in python that can take tuples as inputs or a different way for me to trick it into working?
You can build a list of boolean conditions and logically condense them with np.all instead of using query:
import numpy as np

for dosage_comb in product(*dict_of_dose_ranges.values()):
    dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
    condition = np.all([dosage_df[col] == dose for col, dose in dosage_items], axis=0)
    sub_df = dosage_df[condition]
This method seems to be a bit more flexible than query, but when filtering across many columns I've found that query often performs better. I don't know whether that holds in general, though.
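If query is still preferred, another workaround is to flatten the tuple columns into plain string identifiers first, which query's parser handles without complaint (a sketch; the flattened names are my own invention):

flat = dosage_df.copy()
flat.columns = ['_'.join(c) if isinstance(c, tuple) else c for c in flat.columns]
sub_df = flat.query('A_dose_Super == 1 & A_dose_Light == 40 & B_dose_Regular == 130')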
