I'm still quite new to Python and pandas, and I need some help with the code I'm working with.
I have a dictionary called df that maps each filename (a date such as 02_01_2020, taken from files named date.txt) to that file's content. Here's what it looks like:
{'02_01_2020': 0
0 1 229017 Cust_1 CUR ...
1 2 629324 Cust_2 CUR ...
2 3 863300 Cust_3 CUR ...
3 4 670338 Cust_4 CUR ...
4 5 987039 Cust_5 CUR ...
5 6 485912 Cust_6 CUR ...,'03_01_2020': 0
0 1 122403 Cust_1 CUR ...
1 2 779269 Cust_2 CUR ...
2 3 728965 Cust_3 CUR ...
3 4 527716 Cust_4 CUR ...
4 5 796179 Cust_5 CUR ...
5 6 027872 Cust_6 CUR ...
6 7 449767 Cust_7 CUR ...
7 8 598752 Cust_8 CUR ...
8 9 180422 Cust_9 CUR ..., .... goes until the last file ('31_01_2020')}
As you can see above, each file contains different data. File 02_01_2020.txt has 6 entries, file 03_01_2020.txt has 9 entries, and so on until the last file (31_01_2020.txt).
My goal here is to separate the necessary information into its own columns (customer name, currency, etc.) with the filename inserted into a separate column called Paid_date. I used iterrows() to loop through each DataFrame in this dictionary. Here's the code:
def data_process(df):
    # dataframe that I created outside this function
    global df_data_1
    for key, value in df.items():
        df1 = pd.DataFrame(value)
        df1['Paid_date'] = key.replace('_', '/')
        # df1.insert(1, 'Paid_date', key.replace('_', '/'))  # another attempt to insert the col
        for index, row in df1.iterrows():
            df_Item_Num = row.str.slice(start=0, stop=2)    # entry number
            df_DUMP_1 = row.str.slice(start=0, stop=23)     # not used
            df_NAME = row.str.slice(start=23, stop=40)
            df_CURRENCY = row.str.slice(start=40, stop=54)
            df_AMOUNT = row.str.slice(start=54, stop=66)
            df_DATE = row.str.slice(start=68, stop=86)
            df_DUMP_2 = row.str.slice(start=87, stop=-1)    # not used
            df_ALL_ITEMS = pd.concat([df_Item_Num, df_NAME, df_CURRENCY, df_AMOUNT, df_DATE], ignore_index=True)
            df_data_1 = df_data_1.append(df_ALL_ITEMS, ignore_index=True)
    return df_data_1
When I disabled the column-creation line where the key is assigned, df1['Paid_date'] = key.replace('_', '/'), the result looks like this:
0 1 2 3 4
0 1 Cust_1 CUR Amount Date_Time
1 2 Cust_2 CUR Amount Date_Time
2 3 Cust_3 CUR Amount Date_Time
3 4 Cust_4 CUR Amount Date_Time
4 5 Cust_5 CUR Amount Date_Time
.. .. ... ... ... ...
185 10 Cust_6 CUR Amount Date_Time
186 11 Cust_7 CUR Amount Date_Time
187 12 Cust_8 CUR Amount Date_Time
188 13 Cust_9 CUR Amount Date_Time
189 14 Cust_10 CUR Amount Date_Time
Which is exactly what I need, except the Paid_date column is not included (I need the filename stored in every row that comes from that particular file, e.g. 02_01_2020 printed into 6 rows, 03_01_2020 into 9 rows, etc.). However, when I enabled the column-creation line, it ended up like this:
0 1 2 3 ... 6 7 8 9
0 1 02 Cust_1 ... Amount Date_Time
1 2 02 Cust_2 ... Amount Date_Time
2 3 02 Cust_3 ... Amount Date_Time
3 4 02 Cust_4 ... Amount Date_Time
4 5 02 Cust_5 ... Amount Date_Time
.. .. .. ... .. ... ... .. ... ..
185 10 31 Cust_6 ... Amount Date_Time
186 11 31 Cust_7 ... Amount Date_Time
187 12 31 Cust_8 ... Amount Date_Time
188 13 31 Cust_9 ... Amount Date_Time
189 14 31 Cust_10 ... Amount Date_Time
I have a couple of new empty columns, and apparently the key (filename) is not fully inserted (only the day somehow ends up in the new column; the month and year are missing). What is the most efficient way to fix this? Any help would be greatly appreciated. Thank you
Edit 1
The entries of each txt file I'm working with look something like this:
1 CUST_NAME_1 CURRENCY AMOUNT DATE_TIME
2 CUST_NAME_2 CURRENCY AMOUNT DATE_TIME
3 CUST_NAME_3 CURRENCY AMOUNT DATE_TIME
4 CUST_NAME_4 CURRENCY AMOUNT DATE_TIME
5 CUST_NAME_5 CURRENCY AMOUNT DATE_TIME
Inside the txt file there is a lot of white space separating the fields, as you can see above. My code first loops through the directory on my computer that stores all the files and appends them to two lists. Here's the code:
# set up empty lists & dictionary
filelist = []
filename = []
df = {}

def file_process(mydir):
    for path, dirs, files in os.walk(mydir):
        for file in files:
            if file.endswith('.txt'):
                filelist.append(file)
                filename.append(file[0:10])
    return filelist, filename
The above code returns two lists.
filelist contains each txt file (02_01_2020.txt, 03_01_2020.txt, etc)
filename contains only the name of each file (02_01_2020, 03_01_2020, etc)
Then I wrote the following code to convert those two lists into a single dictionary (I will avoid using df as a dictionary name, thanks for the suggestion).
def dict_process(filelist, filename):
    for key in filename:
        for value in filelist:
            df[key] = pd.read_csv(value, sep="delimiter", skiprows=[0, 1, 2, 3, 4, 5, 6, 7, 8], skipfooter=6, header=None)
            filelist.remove(value)
            break
    return df
And the code above returns the dictionary I showed previously, with the filename as the key and the file content as its value.
What I did (or thought I did) within for index, row in df1.iterrows(): was slice every Series that iterrows() returned, keep only the information I want, and concatenate it into an empty dataframe. Is this efficient, or is there another way?
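For what it's worth, the stray '02'/'31' values are most likely the Paid_date string itself: once df1['Paid_date'] is assigned, every row that iterrows() yields also contains '02/01/2020', and row.str.slice(0, 2) cuts that down to '02'. Slicing the raw text column directly avoids both the extra columns and the per-row loop. A minimal sketch (the column names are my own, and the slice boundaries are copied from the question, so they may need adjusting to the real layout):

```python
import pandas as pd

def data_process(file_dict):
    # Vectorized sketch: slice each file's single raw-text column with the
    # .str accessor instead of iterrows(). Boundaries (0-2, 23-40, ...) are
    # taken from the question and are assumptions about the real files.
    frames = []
    for key, value in file_dict.items():
        text = value.iloc[:, 0]  # the single column of fixed-width lines
        frames.append(pd.DataFrame({
            'Item_Num':  text.str.slice(0, 2).str.strip(),
            'Name':      text.str.slice(23, 40).str.strip(),
            'Currency':  text.str.slice(40, 54).str.strip(),
            'Amount':    text.str.slice(54, 66).str.strip(),
            'Date':      text.str.slice(68, 86).str.strip(),
            'Paid_date': key.replace('_', '/'),  # whole filename, e.g. '02/01/2020'
        }))
    return pd.concat(frames, ignore_index=True)
```

Because Paid_date is assigned as a column here, it never passes through the slicing, so the full '02/01/2020' lands in every row of that file.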
Related
I've got a dataframe with data taken from a database like this:
conn = sqlite3.connect('REDB.db')
dataAvg1 = pd.read_sql_query(
    "SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, HOUSEINFO.RE_POLOHA, HOUSEINFO.RE_DRUH, HOUSEINFO.RE_TYP, HOUSEINFO.RE_UPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, HOUSEINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=HOUSEINFO.INF_ID", conn
)
dataAvg2 = pd.read_sql_query(
    "SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, FLATINFO.RE_DISPOZICE, FLATINFO.RE_DRUH, FLATINFO.RE_PPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, FLATINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=FLATINFO.INF_ID", conn
)
dataAvg3 = pd.read_sql_query(
    "SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, LANDINFO.RE_PLOCHA, LANDINFO.RE_DRUH, LANDINFO.RE_SITE, LANDINFO.RE_KOMUNIKACE FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, LANDINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=LANDINFO.INF_ID", conn
)
conn.close()
df2 = [dataAvg1, dataAvg2, dataAvg3]
dfAvg = pd.concat(df2)
dfAvg = dfAvg.reset_index(drop=True)
The main columns are UNIQUE_RE_NUMBER, RE_PRICE and UPDATE_DATE. I would like to count the frequency of price changes each day, ideally by creating a new column called 'Frequency' with a number for each day. For example:
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
1.1.2021 1 500 2
1.1.2021 2 400 2
2.1.2021 1 500 1
2.1.2021 2 450 1
I hope this example is understandable.
Right now I have something like this:
dfAvg['FREQUENCY'] = dfAvg.groupby('UPDATE_DATE')['UPDATE_DATE'].transform('count')
dfAvg.drop_duplicates(subset=['UPDATE_DATE'], inplace=True)
This code counts every price added that day, so when the price of a real estate listing on 1.1.2021 was 500 and the next day it is also 500, it counts as a "change" in price, even though the price stayed the same, and I don't want to count that. I would like to count only distinct prices for each real estate listing. Is that possible?
Not sure if this is the most efficient way, but maybe it helps:
def ident_deltas(sdf):
    return sdf.assign(
        DELTA=(sdf.RE_PRICE.shift(1) != sdf.RE_PRICE).astype(int)
    )

def sum_deltas(sdf):
    return sdf.assign(FREQUENCY=sdf.DELTA.sum())

df = (
    df.groupby("UNIQUE_RE_NUMBER").apply(ident_deltas)
    .groupby("UPDATE_DAY").apply(sum_deltas)
    .drop(columns="DELTA")
)
Result for
df =
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE
0 2021-01-01 1 500
1 2021-01-01 2 400
2 2021-02-01 1 500
3 2021-02-01 2 450
is
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
0 2021-01-01 1 500 2
1 2021-01-01 2 400 2
2 2021-02-01 1 500 1
3 2021-02-01 2 450 1
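An equivalent, fully vectorized take on the same idea (shown on the same toy frame as above): flag genuine price changes per property with ne/shift, then sum the flags per day with transform, which avoids the two apply passes.

```python
import pandas as pd

# Same toy data as in the example above.
df = pd.DataFrame({
    'UPDATE_DAY': ['2021-01-01', '2021-01-01', '2021-02-01', '2021-02-01'],
    'UNIQUE_RE_NUMBER': [1, 2, 1, 2],
    'RE_PRICE': [500, 400, 500, 450],
})

# True where a property's price differs from its previous value
# (the first observation of each property counts as a change).
changed = df.groupby('UNIQUE_RE_NUMBER')['RE_PRICE'].transform(
    lambda s: s.ne(s.shift())
)

# FREQUENCY = number of genuine price changes on that day.
df['FREQUENCY'] = changed.groupby(df['UPDATE_DAY']).transform('sum').astype(int)
```

On the sample this yields FREQUENCY 2 for 2021-01-01 and 1 for 2021-02-01, matching the result above.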
I have data that outputs into a csv file as:
url date id hits
a 2017-01-01 123 2
a 2017-01-01 123 2
b 2017-01-01 45 25
c 2017-01-01 123 5
d 2017-01-03 678 1
d 2017-01-03 678 7
and so on, where hits is the number of times the id value appears on a given day per url (i.e., the id 123 appears 2 times on 2017-01-01 for url "a").
I need to create another column after hits, called "total_hits", that captures the total number of hits per day for a given url, date and id value. So the output would look like this:
url date id hits total_hits
a 2017-01-01 123 2 4
a 2017-01-01 123 2 4
b 2017-01-01 45 25 25
c 2017-01-01 123 5 5
d 2017-01-03 678 1 8
d 2017-01-03 678 7 8
If there are solutions to this without using pandas or numpy, that would be amazing.
Please help! Thanks in advance.
Simple with a standard Python installation:
read & parse the file using a line-by-line read & split
create a collections.defaultdict(int) to count the occurrences of each url/date/id triplet
add the info in an extra column
write back (I chose csv)
like this:
import collections, csv

d = collections.defaultdict(int)
rows = []
with open("input.csv") as f:
    title = next(f).split()  # skip title
    for line in f:
        toks = line.split()
        d[toks[0], toks[1], toks[2]] += int(toks[3])
        rows.append(toks)

# complete data
for row in rows:
    row.append(d[row[0], row[1], row[2]])
title.append("total_hits")

with open("out.csv", "w", newline="") as f:
    cw = csv.writer(f)
    cw.writerow(title)
    cw.writerows(rows)
here's the output file:
url,date,id,hits,total_hits
a,2017-01-01,123,2,4
a,2017-01-01,123,2,4
b,2017-01-01,45,25,25
c,2017-01-01,123,5,5
d,2017-01-03,678,1,8
d,2017-01-03,678,7,8
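For readers who can use pandas after all, the same total can be broadcast back onto every row with a single groupby/transform; a sketch on the sample data from the question:

```python
import pandas as pd

# The sample rows from the question.
df = pd.DataFrame({
    'url':  ['a', 'a', 'b', 'c', 'd', 'd'],
    'date': ['2017-01-01'] * 4 + ['2017-01-03'] * 2,
    'id':   [123, 123, 45, 123, 678, 678],
    'hits': [2, 2, 25, 5, 1, 7],
})

# Broadcast the per-(url, date, id) sum back onto every row.
df['total_hits'] = df.groupby(['url', 'date', 'id'])['hits'].transform('sum')
```

This reproduces the total_hits column shown in the desired output (4, 4, 25, 5, 8, 8).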
I have a CSV file containing information on medications (name) and the dose some patients (id) take.
The CSV file is structured as follows:
name, id, dose
ator, 034, 20
ator, 034, 30
para, 034, 30
mar, 035, 20
mar, 034, 10
The goal is to parse it into a "long" format with the following columns: "id", "table" (a table name given in the code), "field" (i.e., name or dose), and "value" (the value of, for instance, name or dose). So far I have succeeded in formatting the original CSV structure into this.
But I also want a column, "count", that contains the running count of medications each patient takes.
For instance, the patient with id 034 takes three medications (ator, para, and mar), corresponding to counts of 1, 2, and 3. Thus, the desired output is the following:
id,table,field,count,value
034, meds, name, 1, ator
034, meds, name, 1, ator
034, meds, name, 2, para
035, meds, name, 1, mar
034, meds, name, 3, mar
034, meds, dose, 1, 20
034, meds, dose, 1, 30
034, meds, dose, 2, 30
035, meds, dose, 1, 20
034, meds, dose, 3, 10
Every time a patient (i.e., id) gets a new medication (i.e., name), the "count" should indicate which medication corresponds to, for instance, the dose later in the table.
But I am struggling to produce a count column like that.
I have tried to add a count column to the data frame in my code (please see below) without luck.
Any help creating this column would be great!
import pandas as pd

# load the data into a pandas table:
file = '~/data/meds.csv'
df = pd.read_table(file, delimiter=',')

#### CANNOT GET THIS PART TO WORK: #####
count = []
for index, row in df.iterrows():
    count.append(df[(df['id'] == row['id']) & (df['name'] < row['name'])].shape[0])
df['count'] = count
########################################

# convert data frame into the long format
df = pd.melt(df, id_vars=['id', 'count'], var_name='field', value_name='value')

# change all NaNs to None
df = df.where((pd.notnull(df)), None)

# creating new column with table name
df['table'] = 'meds'

# save to file:
df.to_csv('~/data/meds_out.csv', encoding='utf-8')
Use melt with GroupBy.cumcount for counter column:
df = pd.melt(df, id_vars='id', var_name='field', value_name='value')
#if constant value set this way
df['table'] = 'meds'
df['count'] = df.groupby(['id','field']).cumcount() + 1
#change order of columns if necessary
df = df[['id','table','field','count','value']]
print (df)
id table field count value
0 34 meds name 1 ator
1 34 meds name 2 para
2 35 meds name 1 mar
3 34 meds name 3 mar
4 34 meds dose 1 20
5 34 meds dose 2 30
6 35 meds dose 1 20
7 34 meds dose 3 10
EDIT:
df['count'] = df.groupby('id')['name'].cumcount() + 1
df['count'] = df.groupby('id')['count'].ffill().astype(int)
df = pd.melt(df, id_vars=['id','count'], var_name='field', value_name='value')
print (df)
id count field value
0 34 1 name ator
1 34 2 name ator
2 34 3 name para
3 35 1 name mar
4 34 4 name mar
5 34 1 dose 20
6 34 2 dose 30
7 34 3 dose 30
8 35 1 dose 20
9 34 4 dose 10
I've got order data with SKUs inside and would like to find out how often each SKU has been bought per month over the last 3 years.
for row in df_skus.iterrows():
    df_filtered = df_orders.loc[df_orders['item_sku'] == row[1]['sku']]
    # remove unwanted rows:
    df_filtered = df_filtered[['txn_id', 'date', 'item_sku']].copy()
    # group by year and month:
    df_result = df_filtered['date'].groupby([df_filtered.date.dt.year, df_filtered.date.dt.month]).agg('count')
    print(df_result)
    print(type(df_result))
The (shortened) result looks good so far:
date date
2017 3 1
Name: date, dtype: int64
date date
2017 2 1
3 6
4 1
6 1
Name: date, dtype: int64
Now, I'd like to create a CSV which looks like that:
SKU 2017-01 2017-02 2017-03
17 0 0 1
18 0 1 3
Is it possible to simply 'convert' my data into the desired structure?
I do these kinds of calculations all the time, and this seems to be the fastest way.
import pandas as pd
df_orders = df_orders[df_orders["item_sku"].isin(df_skus["sku"])]
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
monthly_sales = monthly_sales.unstack(0)
monthly_sales.to_csv("my_csv.csv")
the first line filters to the SKUs you want
the second line groups by SKU and month and counts the number of sales
the next line pivots the result from a multi-index Series into the format you want
the last line exports to csv
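One detail worth noting: unstack leaves NaN for months in which a SKU had no sales, while the desired CSV shows 0. Passing fill_value=0, and unstacking the date level instead so SKUs stay as rows, gets closer to the requested layout. A sketch on made-up orders (the dates and SKU numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up orders for two SKUs across two months.
df_orders = pd.DataFrame({
    'item_sku': [17, 18, 18, 18, 18],
    'date': pd.to_datetime(['2017-03-05', '2017-02-10',
                            '2017-03-01', '2017-03-15', '2017-03-20']),
})

monthly = (
    df_orders
    .groupby(['item_sku', pd.Grouper(key='date', freq='M')])
    .size()
    .unstack('date', fill_value=0)   # SKUs as rows, months as columns, 0 instead of NaN
)
monthly.columns = monthly.columns.strftime('%Y-%m')  # e.g. '2017-02', '2017-03'
monthly.to_csv('monthly_sales.csv')
```

This writes one row per SKU with a zero-filled count for every month, matching the "SKU 2017-01 2017-02 ..." shape asked for.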
I am trying to create a CSV download, but the resulting download is in a different format.
def csv_download(request):
    import csv
    import calendar
    from datetime import *
    from dateutil.relativedelta import relativedelta
    now = datetime.today()
    month = datetime.today().month
    d = calendar.mdays[month]
    # create the HttpResponse object with the appropriate CSV header
    response = HttpResponse(mimetype='text/csv')
    response['Content-Disposition'] = 'attachment; filename=somefilename.csv'
    m = Product.objects.filter(product_sellar='jhon')
    writer = csv.writer(response)
    writer.writerow(['S.No'])
    writer.writerow(['product_name'])
    writer.writerow(['product_buyer'])
    for i in xrange(1, d):
        writer.writerow(str(i) + "\t")
    for f in m:
        writer.writerow([f.product_name, f.porudct_buyer])
    return response
Output of the above code:
product_name
1
2
4
5
6
7
8
9
1|10
1|1
1|2
.
.
.
2|7
mgm | x_name
wge | y_name
I am looking for output like this:
s.no porduct_name product_buyser 1 2 3 4 5 6 7 8 9 10 .....27 total
1 mgm x_name 2 3 8 13
2 wge y_name 4 9 13
Can you please help me with the above CSV download?
If possible, can you please tell me how to sum up each individual user's total at the end?
Example:
We have a selling table into which seller info is inserted every day. The table data looks like:
S.no product_name product_seller sold Date
1 paint jhon 5 2011-03-01
2 paint simth 6 2011-03-02
I have created a table that displays the format below, and I am trying to create a CSV download:
s.no prod_name prod_sellar 1-03-2011 2-03-2011 3-03-2011 4-03-2011 total
1 paint john 10 15 0 0 25
2 paint smith 2 6 2 0 10
Please read the csv module documentation, particularly the writer object API.
You'll notice that the csv.writer object takes a list with elements representing their position in your delimited line. So to get the desired output, you would need to pass in a list like so:
writer = csv.writer(response)
writer.writerow(['S.No', 'product_name', 'product_buyer'] + range(1, d) + ['total'])
This will give you your desired header output.
You might want to explore the csv.DictWriter class if you want to populate only some parts of the row. It's much cleaner. This is how you would do it:
writer = csv.DictWriter(response,
    ['S.No', 'product_name', 'product_buyer'] + range(1, d) + ['total'])
Then your write command would follow like so:
for f in m:
    writer.writerow({'product_name': f.product_name, 'product_buyer': f.product_buyer})
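The follow-up about summing each seller's total wasn't covered above. Since DictWriter rows are plain dicts, the total can be computed just before writing each row. A standalone sketch, using io.StringIO in place of the Django response and made-up day counts (the day values 10, 15, ... are purely illustrative):

```python
import csv
import io

# Hypothetical per-day sales for two sellers; day columns 1..4 as in the
# desired layout. Real code would write to the Django response instead.
days = list(range(1, 5))
rows = [
    {'S.No': 1, 'product_name': 'paint', 'product_buyer': 'x_name',
     1: 10, 2: 15, 3: 0, 4: 0},
    {'S.No': 2, 'product_name': 'paint', 'product_buyer': 'y_name',
     1: 2, 2: 6, 3: 2, 4: 0},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, ['S.No', 'product_name', 'product_buyer'] + days + ['total'])
writer.writeheader()
for row in rows:
    row['total'] = sum(row[d] for d in days)  # per-seller total at the end
    writer.writerow(row)

print(buf.getvalue())
```

Each row then ends with that seller's total, giving the "... 27 total" column layout from the question.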