how to optimize this kind of process (lemmy.eco.br)

submitted 11 months ago by driving_crooner@lemmy.eco.br to c/python@programming.dev

Hi. When I'm working with big dataframes, I often need to create new columns based on functions, so I have code like this:

def function(row):
    ...  # compute and return the value for this row

And then I run the function on the df as

df['new column'] = df.apply(function, axis=1)

But I do this with 10 or more columns/functions at a time. I don't think this is efficient, because each time a column is created it has to parse the entire dataframe. Is there a way to create all the columns at the same time while parsing the rows only once?

Thanks for any help.
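
For reference, a minimal sketch of the "parse the rows only once" idea: a single row function that returns every new value as a pd.Series, so one df.apply call produces all the columns. The column names (a, b, total, ratio) and the calculations are hypothetical, not taken from the real data. This only cuts the number of passes over the rows; each row is still handled in Python, so it may not be a large speedup.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def make_columns(row):
    # Compute every new value for this row in one place (placeholder logic).
    return pd.Series({
        "total": row["a"] + row["b"],
        "ratio": row["b"] / row["a"],
    })

# One pass over the rows; because the function returns a Series,
# apply builds a DataFrame with one column per key.
new_cols = df.apply(make_columns, axis=1)

# Attach the new columns to the original frame by index.
df = df.join(new_cols)
```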

[-] misk@sopuli.xyz 5 points 11 months ago* (last edited 11 months ago)

Whatever you do, as long as the data frame fits in memory it should usually be pretty fast. Depending on the functions you're using, applymap on slices of columns might be faster, but code readability will suffer.

How big is your dataset? If it's huge or your needs are complex, you'll get way more performance by switching from Pandas to Polars dataframes rather than trying to optimize Pandas operations.
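
If staying in Pandas, another option (a different technique from applymap, and only usable when the row functions boil down to arithmetic or simple conditions) is to write each new column as a whole-column expression instead of calling a Python function per row. A rough sketch, with hypothetical column names and logic:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Each expression operates on entire columns at once, and assign
# adds all of the new columns in a single call.
df = df.assign(
    total=df["a"] + df["b"],
    ratio=df["b"] / df["a"],
    level=np.where(df["a"] > 1, "high", "low"),  # vectorized if/else
)
```

Whether this applies depends on what the functions actually do; logic that really needs arbitrary Python per row will not translate this directly.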

[-] driving_crooner@lemmy.eco.br 2 points 11 months ago

6M rows (it grows by about 35K rows per month), 6 columns. After the functions it goes to 17 columns, and then finally down to 9, which is where I start processing. Currently pd.read_csv() takes 8 min and creating the columns takes 20 min. I would like to reduce that 20 min step.
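
On the loading side, a small sketch of two pd.read_csv options that sometimes help: usecols to skip columns that are not needed, and dtype to avoid type inference and use smaller types. The file contents and column names here are made up for illustration (an in-memory string stands in for the real file), and the actual gain depends on the data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "id,amount,category,notes\n1,9.5,a,x\n2,3.0,b,y\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "amount", "category"],  # do not load the unused column
    dtype={"id": "int32", "amount": "float32", "category": "category"},
)
```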